GitHub Repository: probml/pyprobml
Path: blob/master/notebooks/tutorials/software.md

# Python software ecosystem

## General coding

In this book, we will use Python 3. For a good introduction, see e.g., the free books Whirlwind tour of Python by Jake Vanderplas or Dive into Python 3 by Mark Pilgrim.

Each chapter is associated with one or more Jupyter notebooks, which mix code and results. We use the Google Colab version of notebooks, which runs on the Google Cloud Platform (GCP), so you don't have to install code locally. To avoid installing packages every time you open a colab, you can follow these steps.

When developing larger software projects locally, it is often better to use an IDE (integrated development environment), which keeps the code separate from the results. I like to use Spyder, although many people use PyCharm. For a browser-based IDE, you can use JupyterLab.

## Software for data science and ML

We will leverage many standard libraries from Python's "data science stack", listed in the table below. For a good introduction to these, see e.g., the free book Python Data Science Handbook by Jake Vanderplas, or the class Computational Statistics in Python by Cliburn Chan at Duke University. For an excellent book on scikit-learn, see Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow v2 by Aurélien Géron.

| Name | Functionality |
| ---- | ---- |
| Numpy | Vector and matrix computations |
| Scipy | Various scientific / math / stats / optimization functions |
| Matplotlib, Seaborn | Plotting |
| Pandas, Xarray | Dataframes and named arrays |
| Scikit-learn | Many ML methods (excluding deep learning) |
| Jax | Accelerated version of Numpy with autograd support |
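To give a feel for how these libraries fit together, here is a minimal sketch (the data and model choice are illustrative, not from the book): generate a synthetic regression dataset with Numpy and fit it with a standard scikit-learn estimator.

```python
# Illustrative example: Numpy for array computations, scikit-learn for the
# ML estimator. The dataset here is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))              # feature matrix (Numpy)
w_true = np.array([2.0, -1.0])             # "true" weights for the synthetic data
y = X @ w_true + 0.1 * rng.normal(size=100)

model = LinearRegression().fit(X, y)       # standard sklearn fit/predict API
print(model.coef_)                         # recovered weights, close to w_true
```

The same `fit`/`predict` interface is shared by most sklearn estimators, which is what makes the library convenient for the "simple special cases" used later in the book.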

## Software for deep learning

Deep learning is about composing primitive differentiable functions into a computation graph in order to make more complex functions, and then using automatic differentiation ("autograd") to compute gradients of the output with respect to the model parameters, which we can pass to an optimizer to fit the function to data. This is sometimes called "differentiable programming".

DL therefore requires several different libraries to perform tasks such as the following:

  • specify the model

  • compute gradients using automatic differentiation

  • train the model (pass data to the gradient function, and gradients to the optimizer function)

  • serve the model (pass input to a trained model, and pass output to some service)

Training and serving often use hardware accelerators, such as GPUs. (Some libraries also support distributed computation, but we will not need this feature in this book.)
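The train-the-model step above can be sketched in a few lines. This toy version uses plain NumPy with a hand-derived gradient as a stand-in for autograd (in a real DL framework the gradient function would be generated automatically), and plain gradient descent as the "optimizer":

```python
# Toy training loop: loss -> gradient -> optimizer step.
# The gradient is derived by hand here; autograd libraries compute it for you.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                              # synthetic training data

def loss(w):
    return np.mean((X @ w - y) ** 2)        # mean squared error

def grad(w):
    return 2 * X.T @ (X @ w - y) / len(y)   # d(loss)/dw, derived by hand

w = np.zeros(3)
lr = 0.1                                    # learning rate
for _ in range(500):
    w = w - lr * grad(w)                    # pass gradients to the optimizer step

print(w)                                    # converges to w_true
```

Swapping `grad` for `jax.grad(loss)` would give the autograd version of the same loop.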

There are several popular DL frameworks which implement the above functionality, such as TensorFlow 2, PyTorch, and JAX.

In this book, we mostly use TensorFlow 2 and JAX. However, we also welcome contributions in PyTorch. (More details on the JAX ecosystem can be found here.)

## Software for probabilistic modeling

In this book, we focus on probabilistic models, both supervised (conditional) models of the form $p(y|x)$, as well as unsupervised models of the form $p(z,x)$, where $x$ are the features, $y$ are the labels (if present), and $z$ are the latent variables. For simple special cases, such as GMMs and PCA, we can use sklearn. However, to create more complex models, we need more flexible libraries. We list some examples below.
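As an example of one of the simple special cases mentioned above, here is a hedged sketch of fitting a GMM with sklearn on made-up data (two well-separated Gaussian clusters):

```python
# Illustrative GMM fit with sklearn; the data is synthetic.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-5, 1, size=(100, 2)),   # cluster 1
                    rng.normal(+5, 1, size=(100, 2))])  # cluster 2

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
z = gmm.predict(X)            # inferred latent cluster assignment for each point
print(sorted(gmm.means_[:, 0]))  # component means, near -5 and +5
```

For models whose structure cannot be expressed this way, we turn to the PPLs listed below.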

The term "probabilistic programming language" (PPL) is used to describe systems that allow the creation of "randomly shaped" models, whose structure is determined e.g., by stochastic control flow. The Stan library specifies the model using a domain specific language (DSL); most other libraries specify the model via an API. In this book, we focus on PyMC3 and NumPyro.

| Name | Description |
| ---- | ---- |
| TensorFlow Probability | PPL built on top of TF2. |
| Edward 1 | PPL built on top of TF2 or Numpy. |
| Edward 2 | Low-level PPL built on top of TF2. |
| Pyro | PPL built on top of PyTorch. |
| NumPyro | Similar to Pyro, but built on top of JAX instead of PyTorch. |
| PyStan | Python interface to Stan, which specifies models using its own DSL. Custom C++ autodiff library. |
| PyMC3 | Similar to PyStan, but uses Theano for autodiff. (Future versions will use JAX.) |

There are also libraries for inference in probabilistic models in which all variables are discrete. Such models do not need autograd. We give some examples below.

| Name | Description |
| ---- | ---- |
| PgmPy | Discrete PGMs. |
| Pomegranate | Discrete PGMs. GPU support. |
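The kind of inference these libraries automate can be illustrated in plain Python: exact inference by enumeration in a tiny two-node discrete Bayesian network. All the numbers below are made up for illustration; libraries like PgmPy do this for arbitrary discrete graphs.

```python
# Exact inference by enumeration in a toy discrete PGM: Rain -> WetGrass.
# No autograd needed: everything is a finite sum. Numbers are illustrative.
p_rain = {True: 0.2, False: 0.8}             # prior P(Rain)
p_wet_given_rain = {True: 0.9, False: 0.1}   # likelihood P(Wet=True | Rain)

# Posterior P(Rain=True | Wet=True) via Bayes' rule:
# enumerate the joint, then normalize.
joint = {r: p_rain[r] * p_wet_given_rain[r] for r in (True, False)}
p_wet = sum(joint.values())                  # marginal P(Wet=True)
posterior = joint[True] / p_wet
print(round(posterior, 3))                   # 0.18 / 0.26 ≈ 0.692
```

Observing wet grass raises the probability of rain from the prior 0.2 to about 0.69, which is exactly the computation a discrete PGM library performs (efficiently) over much larger graphs.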