My CS/AI Quick Reference Guide

Ryan Sander, Graduate Student @ MIT

M.Eng. Artificial Intelligence, MIT 2021
B.S. EECS, Mathematical Economics, MIT 2020

This guide gives quick access to concepts and practical problem/solution discussions in the fields of:

  1. Computer Science & Programming
  2. Electrical Engineering
  3. Economics
  4. Mathematics
  5. Leadership
  6. Writing

If you would like to learn more about (a growing list of) these concepts in-depth or would like to practice them, please check out my GitHub repository here. This guide is by no means complete and is routinely updated. My hope is that in its current state, this guide provides you with valuable insight into concepts and implementations for your projects in the aforementioned fields.

WARNING: This guide is not intended for commercial use, and should not be redistributed for profit. I have made efforts to make citations where they are due, but a few may still be missing.

Operating Systems:

Python:

  • Native Python:
    • os (operating system)
    • argparse (argument parsing)
    • re (regular expressions)
    • json (create/process json files)
    • tarfile (create/process tarball files)
    • h5py (create/process h5 files)
  • TensorFlow (Deep Learning + Numerical Computing)
    • TensorFlow
    • TensorBoard (plotting)
    • Keras (high-level TensorFlow API)
    • TensorFlow-Agents (Reinforcement Learning in TensorFlow)
  • PyTorch (Deep Learning + Numerical Computing)
    • PyTorch
    • TensorBoard (plotting)
    • Torchvision (Computer Vision with PyTorch)
    • Points3D (coming soon - 3D Computer Vision)
  • Ray/Rllib (Scalable Reinforcement Learning)
  • Mujoco_py (MuJoCo with Python)
  • OpenCV (Computer Vision)
  • NumPy (Numerical Programming Package)
  • Xvfb (Headless rendering)
  • OpenAI Gym (Reinforcement Learning Simulations)
    • car-racing
    • Vizdoom
  • Pyglet (Graphics, animations, and applications)
  • PyQt5 (GUI, Displays)
  • Pandas (Data Analysis, Processing, and Manipulation)
  • SciPy (Scientific Programming Functions)
  • Scikit-Learn (Supervised and unsupervised machine learning)
  • Matplotlib (MATLAB-style plotting functions)
  • Rospy (coming soon)
  • General Programming Practices
  • Miscellaneous
  • Pomegranate
  • Librosa
  • pymed (coming soon)
  • Shrink PDFs with command-line (coming soon)

Performance Engineering:

IDEs:

Interpreters:

Scientific Computing:

  • Matlab
  • Stata
  • R (coming soon)
  • Julia
    • Native Julia
    • Gen (probabilistic programming)

Web Development:

Cloud Computing and Containerization:

  • Cloud Services:
    • Amazon Web Services (AWS)
      • Elastic Compute Cloud (EC2)
      • Boto3 (Python EC2 API)
      • S3
      • awscli (AWS command-line interface)
    • Google Cloud Compute (GC2)
    • Microsoft Azure (coming soon)
    • MIT Satori (coming soon)
    • MIT SuperCloud
  • Docker

Other Tools:

Remote tools (coming soon):

  • Zoom (coming soon)
  • AWW (whiteboard app) (coming soon)
  • Anydesk (coming soon)
  • CoCalc (coming soon)
  • WebEx (coming soon)
  • Skype (coming soon)
  • Google Colab (coming soon)
  • PuTTY (coming soon)

The latter part of this guide serves as a high-level resource for concepts in:

Artificial intelligence:

General AI:

Mathematics:

  • Real Analysis
  • Fourier Analysis
  • Linear Algebra (coming soon)
  • Multivariable Calculus (coming soon)
  • Differential Equations (coming soon)

Computer Science:

Signal Processing:

  • Deterministic and stochastic signal processing
  • Filtering
  • Fourier, Laplace, and Z-Transforms
  • Electromagnetics (coming soon)

Economics:

Research, Management, and Operations:

Other Topics:

Python (and CLI) Packages for Machine Learning

  • Scikit-learn: Package containing many fully-parameterizable machine learning models, such as SVM, Gradient-Boosted Decision Trees, etc.
  • Pandas: Library for manipulating data and doing feature engineering. The main data structure in this library is the dataframe. Especially helpful for:
    • Text data/mixed data types of numerical/non-numerical data
    • Other forms of “rectangular data”.
    • Data missing many values/data requiring extensive pre-processing/cleaning
  • Numpy: Library for manipulating vector/matrix/numerical data. The main data structure in this library is the ndarray (N-dimensional array). Especially helpful for:
    • Linear algebra
    • Processing large volumes of vector data at a time
    • Computer vision/working with images
  • SpaCy: Library for natural language processing, written in Python and Cython.
  • Scipy: Library of scientific computing routines (optimization, integration, statistics, signal processing, sparse linear algebra) that operate on data stored in numpy arrays.
  • OpenCV: Library for image processing and classical (mainly unsupervised) computer vision algorithms. It is written in C++ with Python bindings, which makes it efficient at processing image data. Images are loaded directly as numpy arrays, so this library is very effective for computer vision when used with numpy.
  • Pillow: Library for image processing. Rather than loading images into numpy arrays (as OpenCV does), this package loads them into its own Image objects. Contains a lot of additional functionality for processing and manipulating images via their numerical data directly.
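  • PyTesseract: Python wrapper for the Tesseract Optical Character Recognition (OCR) engine. Quite useful with OpenCV. The Tesseract engine itself must be installed separately, and you may also need to install OpenCV and Pillow.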
  • Mallet: CLI-based (Java) package for topic modeling through Latent Dirichlet Allocation. Quite useful if you’re trying to fit a generative topic model to a corpus (set) of textual documents.
  • Matplotlib: A library for plotting data for visualization. Can create plots for anything from 1D line plotting to 3D surface plotting.
  • boto3: This Python package is the AWS SDK for Python. It provides Python APIs for AWS services such as EC2 and S3. If you ever need to integrate AWS into your machine learning projects through Python, boto3 is probably the best way to do so.
  • Keras: High-level deep learning API for TensorFlow. This library is especially useful if you want to prototype relatively straightforward, low-customization-needed models. If you need higher degrees of customization or as-fast-as-possible runtime, consider using PyTorch or TensorFlow.
  • TensorFlow: A highly-scalable and efficient library for creating fully-customizable machine learning models. If you plan to deploy a deep learning model, we strongly recommend using either this library or PyTorch. If you learn this library, we strongly recommend learning only TensorFlow 2.x, as TensorFlow 1.x is no longer actively developed. Installations after 1.14 (highly recommended) include both CPU and GPU capabilities, and TensorFlow automatically determines which device to use based on the devices available. GPU execution on NVIDIA hardware is built on Compute Unified Device Architecture (CUDA).
  • PyTorch: A highly-scalable and efficient library for creating fully-customizable machine learning models. If you plan to deploy a deep learning model, we strongly recommend using either this library or TensorFlow. This library is typically considered easier to read and debug than TensorFlow, with comparable run-time performance. GPU execution on NVIDIA hardware is built on Compute Unified Device Architecture (CUDA).
  • CUDA (Compute Unified Device Architecture): This is a parallel computing platform and API developed by NVIDIA that both PyTorch and TensorFlow use to run computations on NVIDIA GPUs. It is important to know if you are trying to build complex, real-time models with GPUs. CUDA is programmed primarily in C/C++, but some functionality is available from Python.
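
To make the numpy and pandas descriptions above concrete, here is a minimal sketch of the typical division of labor: numpy for vectorized linear algebra on arrays, pandas for "rectangular" data with mixed types and missing values. The matrix, the column names, and the mean-imputation step are illustrative assumptions, not taken from any particular project.

```python
import numpy as np
import pandas as pd

# numpy: vectorized linear algebra on arrays (no explicit Python loops).
A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
x = np.array([1.0, 2.0])
y = A @ x                             # matrix-vector product
x_recovered = np.linalg.solve(A, y)   # solve A x = y for x

# pandas: rectangular data with mixed types and a missing value.
# The column names and values are made up for illustration.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "score": [1.0, np.nan, 3.0],
})
n_missing = int(df["score"].isna().sum())              # count missing entries
df["score"] = df["score"].fillna(df["score"].mean())   # impute with the column mean

# DataFrame -> numpy array, e.g. before passing features to a model.
features = df[["score"]].to_numpy()
print(y, x_recovered, n_missing, features.shape)
```

Mean imputation is only one simple way to handle missing values; whether it is appropriate depends on the dataset.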

Other Important Tools for Machine Learning

  • Jupyter: Jupyter is a notebook environment for writing and running code in individual blocks (cells). Especially helpful when interfacing with remote virtual machines that have no graphical user interface (GUI) (FREE).
  • Google Colab: This resource looks quite similar to a Jupyter notebook. It runs in the cloud (like AWS or Microsoft Azure), but it’s completely free, and doesn’t require an ssh connection. The best part is that you can even use GPUs for free! This makes the platform useful for training large neural network/computer vision models quickly. This resource is oriented more toward research and development, and is likely not the best platform for deploying your models/applications (FREE).
  • Amazon Web Services Elastic Compute Cloud (AWS EC2): A cloud computing platform developed by Amazon that enables the creation of high-performance virtual machines. This resource is especially helpful for complex machine learning models, such as convolutional neural networks. Can be integrated with other resources, such as AWS Lambda and GitHub (NOT FREE, but with some smart research, prices are relatively reasonable).
  • Amazon Mechanical Turk: Found/have a dataset but don’t have labels for it? Amazon Mechanical Turk is a crowdsourcing platform that pays people to label data, such as images, for you. It’s especially great to use for large datasets (NOT FREE).
  • Google Cloud Compute: A cloud computing platform developed by Google that enables the creation of high-performance virtual machines. User interface may be easier to use than AWS. This resource is especially helpful for complex machine learning models, such as convolutional neural networks (NOT FREE, but with some smart research, prices are relatively reasonable).
  • Microsoft Azure: A cloud computing platform developed by Microsoft. I haven’t used it, but it seems comparable and offers a lot of the same features as AWS and Google Cloud Compute.
  • Pip: This is the standard installer for Python packages. (The name is a recursive acronym: “Pip Installs Packages”.)
  • Anaconda: Anaconda is a Python distribution whose package manager, conda, creates and configures virtual environments on almost any computer. This is very useful for reproducibility, ensuring that different users who build the same environment get the same computing environment (FREE except for enterprise).
  • Docker: Docker is a containerization service that packages your code together with its dependencies into “containers”, making it operable across many different platforms and environments. Like Anaconda, it makes installing, maintaining, and upgrading dependencies far easier. It is especially powerful when used for web application development.
  • GitHub: A hosting platform for Git version control repositories. It keeps a persistent copy of your code in a safe location online and lets multiple users work on the same code projects at the same time (FREE except for enterprise).
  • Bash/Command-Line: The command line is helpful for creating, installing, and managing packages. Though each command line is different (especially between different operating systems), they can all be used to accelerate the machine learning process.
  • Stack Overflow: Do not be afraid to google things! Weird stack trace error that you cannot understand? Search it on Stack Overflow! Don’t know how to do something in a bash script? Again, try Stack Overflow! There is a good chance other people have run into the exact same questions/problems as you, so let’s take advantage of that!
  • Program Creek: Another great source of documentation for code. Has a lot of helpful examples for showing how various functions are used.
  • IDEs (Integrated Development Environments): These are great for coding more efficiently and effectively. IDEs typically provide a more user-friendly development environment, have debugging capabilities built-in, etc. Some recommended IDEs are:
  • Kite: AI-powered auto-complete and docstring writing. Very new but works really well and drastically reduces time spent typing. Configurable with certain (at this point probably most) IDEs (FREE).
  • Python Package Documentation: Many packages we use in this course, such as numpy, scipy, pandas, sklearn, etc. have developed really strong, concise, and user-friendly documentation. If you need to use a function with one of these packages but do not know which function to use, try searching the documentation here! Chances are that a function already exists.
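
As a complement to searching the online documentation, much of the same information can be read straight from the interpreter. The sketch below uses the standard-library `inspect` module to print a function's signature and the first line of its docstring. The helper name `describe` is hypothetical, and `json.dumps` is just a convenient example target.

```python
import inspect
import json


def describe(func):
    """Return the call signature and first docstring line of a callable."""
    signature = str(inspect.signature(func))
    doc = inspect.getdoc(func) or ""
    first_line = doc.splitlines()[0] if doc else ""
    return f"{func.__name__}{signature}", first_line


# Look up a function without leaving the Python session.
header, summary = describe(json.dumps)
print(header)
print(summary)
```

The built-in `help(json.dumps)` shows the full docstring interactively. Note that `inspect.signature` may raise `ValueError` for some functions implemented in C, so treat this as a convenience rather than a replacement for the official documentation.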

Other Resources for Machine Learning: