Python software ecosystem
General coding
In this book, we will use Python 3. For a good introduction, see e.g., the free books A Whirlwind Tour of Python by Jake VanderPlas or Dive Into Python 3 by Mark Pilgrim.
Each chapter is associated with one or more Jupyter notebooks, which mix code and results. We use the Google Colab version of notebooks, which run on the Google Cloud Platform (GCP), so you don't have to install code locally. To avoid installing packages every time you open a colab notebook, you can follow these steps.
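(As a fallback, a common pattern is simply to install whatever is missing in the first cell of the notebook; the package names below are just placeholders.)

```python
# In a Colab cell: install packages that are not already part of the runtime.
# Replace the package names with whatever your notebook actually needs.
!pip install --quiet numpyro pgmpy
```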
When developing larger software projects locally, it is often better to use an IDE (integrated development environment), which keeps the code separate from the results. I like to use Spyder, although many people use PyCharm. For a browser-based IDE, you can use JupyterLab.
Software for data science and ML
We will leverage many standard libraries from Python's "data science stack", listed in the table below. For a good introduction to these, see e.g., the free book Python Data Science Handbook by Jake VanderPlas, or the class Computational Statistics in Python by Cliburn Chan at Duke University. For an excellent book on scikit-learn, see Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow (2nd edition) by Aurélien Géron.
| Name | Functionality |
| ---- | ---- |
| Numpy | Vector and matrix computations |
| Scipy | Various scientific / math / stats / optimization functions |
| Matplotlib, Seaborn | Plotting |
| Pandas, Xarray | Dataframes and named arrays |
| Scikit-learn | Many ML methods (excluding deep learning) |
| Jax | Accelerated version of Numpy with autograd support |
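As a minimal illustration of how these libraries fit together (the data and model below are purely illustrative), we can generate synthetic data with Numpy and fit it with Scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2x + 1 plus a little Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = 2 * X[:, 0] + 1 + 0.1 * rng.normal(size=100)

# Fit a linear model and inspect the learned coefficients.
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # should be close to [2] and 1
```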
Software for deep learning
Deep learning is about composing primitive differentiable functions into a computation graph in order to make more complex functions, and then using automatic differentiation ("autograd") to compute gradients of the output with respect to the model parameters $\theta$, which we can pass to an optimizer, in order to fit the function to data. This is sometimes called "differentiable programming".
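As a tiny sketch of this idea in JAX (the model and loss here are purely illustrative):

```python
import jax
import jax.numpy as jnp

# A simple differentiable function of the parameters theta.
def loss(theta, x, y):
    y_pred = theta[0] * x + theta[1]    # a tiny linear "model"
    return jnp.mean((y_pred - y) ** 2)  # mean squared error

x = jnp.array([0.0, 1.0, 2.0])
y = jnp.array([1.0, 3.0, 5.0])
theta = jnp.array([0.0, 0.0])

# Autograd: the gradient of the loss wrt theta, computed automatically.
grads = jax.grad(loss)(theta, x, y)
print(grads)
```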
DL therefore requires several different libraries, to perform tasks such as the following:

- specify the model
- compute gradients using automatic differentiation
- train the model (pass data to the gradient function, and gradients to the optimizer function)
- serve the model (pass input to a trained model, and pass output to some service)
Training and serving often use hardware accelerators, such as GPUs. (Some libraries also support distributed computation, but we will not need this feature in this book.)
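To make the train/serve steps concrete, here is a minimal sketch in JAX (the model, data, and learning rate are all illustrative, and the "optimizer" is plain SGD):

```python
import jax
import jax.numpy as jnp

def loss(theta, x, y):
    return jnp.mean((theta[0] * x + theta[1] - y) ** 2)

# Training: pass data to the gradient function, and gradients to the optimizer.
# Here the "optimizer" is a plain SGD update; real code would use a library.
@jax.jit
def train_step(theta, x, y, lr=0.1):
    grads = jax.grad(loss)(theta, x, y)
    return theta - lr * grads

x = jnp.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0  # synthetic targets from y = 2x + 1
theta = jnp.zeros(2)
for _ in range(500):
    theta = train_step(theta, x, y)
print(theta)  # should approach [2, 1]

# Serving: pass a new input to the trained model and return its output.
def predict(theta, x_new):
    return theta[0] * x_new + theta[1]
print(predict(theta, 4.0))
```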
There are several popular DL frameworks, which implement the above functionality, some of which are listed below.
Name | More info |
---|---|
Tensorflow2 | tf_intro.ipynb |
JAX | JAX tutorials |
PyTorch | PyTorch website |
MXNet | Dive into deep learning book |
In this book, we mostly use Tensorflow 2 and JAX. However, we also welcome contributions in PyTorch. (More details on the JAX ecosystem can be found here.)
Software for probabilistic modeling
In this book, we focus on probabilistic models, both supervised (conditional) models of the form $p(y|x)$, as well as unsupervised models of the form $p(z, x)$, where $x$ are the features, $y$ are the labels (if present), and $z$ are the latent variables. For simple special cases, such as GMMs and PCA, we can use sklearn. However, to create more complex models, we need more flexible libraries. We list some examples below.
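For instance, here is a minimal sklearn sketch of fitting a GMM (the data is synthetic, and the number of mixture components is chosen by hand):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 1d data drawn from two well-separated clusters.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 0.5, 100)]).reshape(-1, 1)

# Fit a 2-component Gaussian mixture model and inspect the learned means.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_)          # should be close to -2 and 2
print(gmm.predict(X[:5]))  # cluster assignments for the first few points
```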
The term "probabilistic programming language" (PPL) is used to describe systems that allow the creation of "randomly shaped" models, whose structure is determined e.g., by stochastic control flow. The Stan library specifies the model using a domain specific language (DSL); most other libraries specify the model via an API. In this book, we focus on PyMC3 and NumPyro.
Name | Description |
---|---|
Tensorflow probability | PPL built on top of TF2. |
Edward 1 | PPL built on top of TF2 or Numpy. |
Edward 2 | Low-level PPL built on top of TF2. |
Pyro | PPL built on top of PyTorch. |
NumPyro | Similar to Pyro, but built on top of JAX instead of PyTorch. |
PyStan | Python interface to Stan, which uses the BUGS DSL for PGMs. Custom C++ autodiff library. |
PyMC3 | Similar to PyStan, but uses Theano for autodiff. (Future versions will use JAX.) |
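To give a flavor of the PPL style, here is a minimal NumPyro sketch of Bayesian linear regression (the priors, data, and MCMC settings are all illustrative):

```python
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def model(x, y=None):
    # Priors over the regression parameters.
    w = numpyro.sample("w", dist.Normal(0.0, 1.0))
    b = numpyro.sample("b", dist.Normal(0.0, 1.0))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    # Likelihood: observed y is Gaussian around the regression line.
    numpyro.sample("obs", dist.Normal(w * x + b, sigma), obs=y)

# Synthetic data from y = 2x + 0.5 with a little noise.
x = jnp.linspace(0, 1, 50)
y = 2.0 * x + 0.5 + 0.1 * random.normal(random.PRNGKey(1), (50,))

# Posterior inference with the NUTS variant of HMC.
mcmc = MCMC(NUTS(model), num_warmup=500, num_samples=1000)
mcmc.run(random.PRNGKey(0), x, y)
mcmc.print_summary()
```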
There are also libraries for inference in probabilistic models in which all variables are discrete. Such models do not need autograd. We give some examples below.
Name | Description |
---|---|
PgmPy | Discrete PGMs. |
Pomegranate | Discrete PGMs. GPU support. |
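As an illustrative sketch (assuming a recent version of PgmPy, where the model class is called BayesianNetwork; older versions call it BayesianModel), here is exact inference in a tiny discrete PGM:

```python
from pgmpy.models import BayesianNetwork  # BayesianModel in older pgmpy versions
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# A two-node network, Rain -> WetGrass, with hand-picked CPTs.
model = BayesianNetwork([("Rain", "WetGrass")])
cpd_rain = TabularCPD("Rain", 2, [[0.8], [0.2]])  # P(Rain=0)=0.8, P(Rain=1)=0.2
cpd_wet = TabularCPD(
    "WetGrass", 2,
    [[0.9, 0.1],   # P(WetGrass=0 | Rain=0), P(WetGrass=0 | Rain=1)
     [0.1, 0.9]],  # P(WetGrass=1 | Rain=0), P(WetGrass=1 | Rain=1)
    evidence=["Rain"], evidence_card=[2],
)
model.add_cpds(cpd_rain, cpd_wet)

# Exact inference by variable elimination (no autograd needed).
infer = VariableElimination(model)
print(infer.query(["Rain"], evidence={"WetGrass": 1}))
```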