\boolfalse {citerequest}\boolfalse {citetracker}\boolfalse {pagetracker}\boolfalse {backtracker}\relax \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.1}{\ignorespaces (a) The pdf's for a $\mathcal {N}(0,1)$, $\mathcal {T}_1(0,1)$ and $\mathrm {Lap}(0,1/\sqrt {2})$. The mean is 0 and the variance is 1 for both the Gaussian and Laplace. The mean and variance of the Student distribution are undefined when $\nu =1$. (b) Log of these pdf's. Note that the Student distribution is not log-concave for any parameter value, unlike the Laplace distribution. Nevertheless, both are unimodal. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#student\_laplace\_pdf\_plot.ipynb}{student\_laplace\_pdf\_plot.ipynb}. \relax }}{8}{figure.caption.8} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.2}{\ignorespaces Illustration of Gaussian (blue), sub-Gaussian (uniform, green) and super-Gaussian (Laplace, red) distributions in 1d and 2d. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#sub\_super\_gauss\_plot.ipynb}{sub\_super\_gauss\_plot.ipynb}.\relax }}{10}{figure.caption.9} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.3}{\ignorespaces (a) The Pareto pdf $\mathrm {Pareto}(x|k,m)$. (b) Same distribution on a log-log plot. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#pareto\_dist\_plot.ipynb}{pareto\_dist\_plot.ipynb}. \relax }}{12}{figure.caption.10} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.4}{\ignorespaces A log-log plot of the frequency vs the rank for the words in H. G. Wells' {\em The Time Machine}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#zipfs\_law\_plot.ipynb}{zipfs\_law\_plot.ipynb}. Adapted from a figure from \citep [Sec 8.3]{dive}. \relax }}{12}{figure.caption.13} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.5}{\ignorespaces Visualization of a 2d Gaussian density in terms of level sets of constant probability density. (a) A full covariance matrix has elliptical contours. (b) A diagonal covariance matrix has axis-aligned elliptical contours. (c) A spherical covariance matrix has circular contours. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gauss\_plot\_2d.ipynb}{gauss\_plot\_2d.ipynb}. \relax }}{14}{figure.caption.14} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.6}{\ignorespaces (a) Cartoon illustration of why the typical set of a Gaussian is not centered at the mode of the distribution. (b) Illustration of the typical set of a Gaussian, which is concentrated in a thin annulus of thickness $\sigma D^{1/4}$ and distance $\sigma D^{1/2}$ from the origin. We also show the image with the highest density (the all-gray image on the left), as well as some high probability samples (the speckle noise images on the right). From Figure 1 of \citep {Nalisnick2019}. Used with kind permission of Eric Nalisnick. \relax }}{15}{figure.caption.15} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.7}{\ignorespaces We observe ${\bm {x}}=(0,-1)$ (red cross) and ${\bm {y}}=(1,0)$ (green cross) and estimate $\mathbb {E}\left [{{\bm {z}}|{\bm {x}},{\bm {y}},{\bm {\theta }}}\right ]$ (black cross). 
(a) Equally reliable sensors, so the posterior mean estimate is in between the two circles. (b) Sensor 2 is more reliable, so the estimate shifts more towards the green circle. (c) Sensor 1 is more reliable in the vertical direction, Sensor 2 is more reliable in the horizontal direction. The estimate is an appropriate combination of the two measurements. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#sensor\_fusion\_2d.ipynb}{sensor\_fusion\_2d.ipynb}. \relax }}{21}{figure.caption.16} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.8}{\ignorespaces Some samples from the Wishart distribution, $\boldsymbol {\Sigma }\sim \mathrm {Wi}(\mathbf {S},\nu )$, where $\mathbf {S}=[ 3.1653, -0.0262; -0.0262, 0.6477]$, and $\nu =3$. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#wishart\_plot.ipynb}{wishart\_plot.ipynb}. \relax }}{24}{figure.caption.17} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.9}{\ignorespaces Marginals of the Wishart distribution in \cref {fig:wishartSamples}. (a) $p(\Sigma _{11})$. (b) $p(\Sigma _{22})$. (c) $p(\rho )$, where $\rho =\frac {\Sigma _{12}}{\sqrt {\Sigma _{11} \Sigma _{22}}}$ is the correlation coefficient. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#wishart\_plot.ipynb}{wishart\_plot.ipynb}. \relax }}{25}{figure.caption.18} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.10}{\ignorespaces (a) The Dirichlet distribution when $K=3$ defines a distribution over the simplex, which can be represented by the triangular surface. Points on this surface satisfy $0 \leq \theta _c \leq 1$ and $\DOTSB \sum@ \slimits@ _{c=1}^3 \theta _c = 1$. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#dirichlet\_3d\_triangle\_plot.ipynb}{dirichlet\_3d\_triangle\_plot.ipynb}. (b) Plot of the Dirichlet density for $\boldsymbol {\alpha }=(20,20,20)$. (c) Plot of the Dirichlet density for $\boldsymbol {\alpha }=(3,3,20)$. (d) Plot of the Dirichlet density for $\boldsymbol {\alpha }=(0.1, 0.1, 0.1)$. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#dirichlet\_3d\_spiky\_plot.ipynb}{dirichlet\_3d\_spiky\_plot.ipynb}. \relax }}{26}{figure.caption.19} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.11}{\ignorespaces Samples from a 5-dimensional symmetric Dirichlet distribution for different parameter values. (a) $\boldsymbol {\alpha }= (0.1,\ldots ,0.1)$. This results in very sparse distributions, with many 0s. (b) $\boldsymbol {\alpha }= (1,\ldots ,1)$. This results in more uniform (and dense) distributions. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#dirichlet\_samples\_plot.ipynb}{dirichlet\_samples\_plot.ipynb}. \relax }}{27}{figure.caption.20} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.12}{\ignorespaces Illustration of injective and surjective functions. \relax }}{42}{figure.caption.21} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.13}{\ignorespaces Example of the transformation of a density under a nonlinear transform. Note how the mode of the transformed distribution is not the transform of the original mode. Adapted from Exercise 1.4 of \citep {BishopBook}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#bayes\_change\_of\_var.ipynb}{bayes\_change\_of\_var.ipynb}. 
\relax }}{43}{figure.caption.22} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.14}{\ignorespaces Illustration of the probability integral transform. Left column: 3 different pdf's for $p(X)$ from which we sample $x_n \sim p(x)$. Middle column: empirical cdf of $y_n=P_X(x_n)$. Right column: empirical pdf of $y_n$ using a kernel density estimate. Adapted from Figure 11.17 of \citep {BMC}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#ecdf\_sample.ipynb}{ecdf\_sample.ipynb}. \relax }}{44}{figure.caption.23} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.15}{\ignorespaces Illustration of the Kolmogorov--Smirnov statistic. The red line is a model CDF, the blue line is an empirical CDF, and the black arrow is the K--S statistic. From \url {https://en.wikipedia.org/wiki/Kolmogorov\_Smirnov_test}. Used with kind permission of Wikipedia author Bscan. \relax }}{45}{figure.caption.24} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.16}{\ignorespaces State transition diagrams for some simple Markov chains. Left: a 2-state chain. Right: a 3-state left-to-right chain. \relax }}{46}{figure.caption.25} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.17}{\ignorespaces Example output from a 10-gram character-level Markov model trained on the King James Bible. The prefix ``christians'' is given to the model. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#ngram\_character\_demo.ipynb}{ngram\_character\_demo.ipynb}. \relax }}{47}{figure.caption.26} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.18}{\ignorespaces (a) {\bf Hinton diagram} showing character bigram counts as estimated from H. G. Wells' book {\em The Time Machine}. Characters are sorted in decreasing unigram frequency; the first one is a space character. The most frequent bigram is 'e-', where - represents space. (b) Same as (a) but each row is normalized across the columns. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#bigram\_hinton\_diagram.ipynb}{bigram\_hinton\_diagram.ipynb}. \relax }}{48}{figure.caption.27} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.19}{\ignorespaces Some Markov chains. (a) A 3-state aperiodic chain. (b) A reducible 4-state chain. \relax }}{50}{figure.caption.28} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.20}{\ignorespaces Samples from two distributions which are (a) different and (b) similar. From a figure from \citep {GrettonNIPS19tutorial}. Used with kind permission of Arthur Gretton. \relax }}{54}{figure.caption.29} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.21}{\ignorespaces The Gaussian $q$ which minimizes $\alpha $-divergence to $p$ (a mixture of two Gaussians), for varying $\alpha $. From Figure 1 of \citep {Minka05}. Used with kind permission of Tom Minka. \relax }}{55}{figure.caption.30} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.22}{\ignorespaces A smooth witness function for comparing two distributions which are (a) different and (b) similar. From a figure from \citep {GrettonNIPS19tutorial}. Used with kind permission of Arthur Gretton. 
\relax }}{56}{figure.caption.31} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.23}{\ignorespaces Effect of bandwidth parameter $\sigma $ on the witness function defined by a Gaussian kernel. From a figure from \citep {GrettonNIPS19tutorial}. Used with kind permission of Dougal Sutherland. \relax }}{58}{figure.caption.32} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {2.24}{\ignorespaces Summary of the two main kinds of divergence measures between two probability distributions $P$ and $Q$. From a figure from \citep {GrettonNIPS19tutorial}. Used with kind permission of Arthur Gretton. \relax }}{59}{figure.caption.33} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {3.1}{\ignorespaces Predictions made by a polynomial regression model fit to a small dataset. (a) Plugin approximation to the predictive density using the MLE. The curves show the posterior mean, $\mathbb {E}\left [{y|{\bm {x}}}\right ]$, and the error bars show the posterior standard deviation, $\mathrm {std}\left [{y|{\bm {x}}}\right ]$, around this mean. (b) Bayesian posterior predictive density, obtained by integrating out the parameters. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#linreg\_post\_pred\_plot.ipynb}{linreg\_post\_pred\_plot.ipynb}. \relax }}{65}{figure.caption.34} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {3.2}{\ignorespaces Two distributions in which the mode (highest point) is untypical of the distribution; the mean (vertical red line) is a better summary. (a) A bimodal distribution. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#bimodal\_dist\_plot.ipynb}{bimodal\_dist\_plot.ipynb}. (b) A skewed $\mathrm {Ga}(1,1)$ distribution. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gamma\_dist\_plot.ipynb}{gamma\_dist\_plot.ipynb}. \relax }}{66}{figure.caption.35} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {3.3}{\ignorespaces A distribution on a discrete space in which the mode (black point L, with probability $p_1$) is untypical of most of the probability mass (gray circles, with probability $p_2<p_1$). The small black circle labeled M (near the top left) is the posterior mean, which is not well defined in a discrete state space. C (the top left vertex) is the centroid estimator, made up of the maximizer of the posterior marginals. See text for details. From Figure 1 of \citep {Carvahlo07}. Used with kind permission of Luis Carvalho. \relax }}{66}{figure.caption.36} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {3.4}{\ignorespaces Inferring the mean of a univariate Gaussian with known $\sigma ^2$. (a) Using a strong prior, $p(\mu ) = \mathcal {N}(\mu |0,1)$. (b) Using a weak prior, $p(\mu ) = \mathcal {N}(\mu |0,5)$. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gauss\_infer\_1d.ipynb}{gauss\_infer\_1d.ipynb}. \relax }}{72}{figure.caption.37} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {3.5}{\ignorespaces Sequential updating of the posterior for $\sigma ^2$ starting from an uninformative prior. The data was generated from a Gaussian with known mean $\mu =5$ and unknown variance $\sigma ^2=10$. 
Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gauss\_seq\_update\_sigma\_1d.ipynb}{gauss\_seq\_update\_sigma\_1d.ipynb}. \relax }}{72}{figure.caption.38} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {3.6}{\ignorespaces The $NI\chi ^2(\mu ,\sigma ^2|m,\kappa ,\nu ,\sigma ^2)$ distribution. $m$ is the prior mean and $\kappa $ is how strongly we believe this; $\sigma ^2$ is the prior variance and $\nu $ is how strongly we believe this. (a) $m=0,\kappa =1,\nu =1,\sigma ^2=1$. Notice that the contour plot (underneath the surface) is shaped like a ``squashed egg''. (b) We increase the strength of our belief in the mean by setting $\kappa =5$, so the distribution for $\mu $ around $m=0$ becomes narrower. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#nix\_plots.ipynb}{nix\_plots.ipynb}. \relax }}{74}{figure.caption.39} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {3.7}{\ignorespaces Illustration of Bayesian inference for a 2d Gaussian random vector ${\bm {z}}$. (a) The data is generated from ${\bm {y}}_n \sim \mathcal {N}({\bm {z}},\boldsymbol {\Sigma }_y)$, where ${\bm {z}}=[0.5, 0.5]^{{\mkern -1.5mu\mathsf {T}}}$ and $\boldsymbol {\Sigma }_y=0.1 ([2, 1; 1, 1])$. We assume the sensor noise covariance $\boldsymbol {\Sigma }_y$ is known but ${\bm {z}}$ is unknown. The black cross represents ${\bm {z}}$. (b) The prior is $p({\bm {z}}) = \mathcal {N}({\bm {z}}|\boldsymbol {0},0.1 \mathbf {I}_2)$. (c) We show the posterior after 10 data points have been observed. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gauss\_infer\_2d.ipynb}{gauss\_infer\_2d.ipynb}. \relax }}{76}{figure.caption.40} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {3.8}{\ignorespaces Graphical models representing different kinds of assumptions about the parameter priors. (a) A semi-conjugate prior for a Gaussian. (b) A conjugate prior for a Gaussian. \relax }}{78}{figure.caption.41} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {3.9}{\ignorespaces Distribution on the correlation coefficient $\rho $ induced by a 2d LKJ distribution with varying parameter. Adapted from Figure 14.3 of \citep {rethinking2}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#lkj\_1d.ipynb}{lkj\_1d.ipynb}. \relax }}{86}{figure.caption.42} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {3.10}{\ignorespaces Illustration of 3 different maximum entropy priors. Adapted from Figure 1.10 of \citep {BMC}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#maxent\_priors.ipynb}{maxent\_priors.ipynb}. \relax }}{87}{figure.caption.43} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {3.11}{\ignorespaces Illustration of the Jeffreys prior for Alice (who uses the rate $\theta $) and Bob (who uses the odds $\phi =\theta /(1-\theta )$). Adapted from Figure 1.9 of \citep {BMC}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#jeffreys\_prior\_binomial.ipynb}{jeffreys\_prior\_binomial.ipynb}. \relax }}{89}{figure.caption.44} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {3.12}{\ignorespaces PGM for a hierarchical binomial model. (a) ``Unrolled'' model. (b) Same model, using plate notation. 
\relax }}{93}{figure.caption.45} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {3.13}{\ignorespaces Data and inferences for the hierarchical binomial model fit using HMC. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#hbayes\_binom\_rats.ipynb}{hbayes\_binom\_rats.ipynb}. \relax }}{93}{figure.caption.46} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {3.14}{\ignorespaces 8-schools dataset. (a) Raw data. Each row plots $y_j \pm \sigma _j$. Vertical line is the pooled estimate. (b) Posterior 95\% credible intervals for $\theta _j$. Vertical line is posterior mean $\mathbb {E}\left [{\mu |{\mathcal {D}}}\right ]$. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#schools8.ipynb}{schools8.ipynb}. \relax }}{95}{figure.caption.47} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {3.15}{\ignorespaces Marginal posterior density $p(\tau |{\mathcal {D}})$ for the 8-schools dataset. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#schools8.ipynb}{schools8.ipynb}. \relax }}{95}{figure.caption.48} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {3.16}{\ignorespaces Posterior $p(\mu _0,\qopname \relax o{log}(\tau )|{\mathcal {D}})$ for the 8 schools model using (a) centered parameterization and (b) non-centered parameterization. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#schools8.ipynb}{schools8.ipynb}. \relax }}{96}{figure.caption.49} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {3.17}{\ignorespaces A hierarchical Gaussian Bayesian model. (a) Centered parameterization. (b) Non-centered parameterization. \relax }}{97}{figure.caption.50} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {3.18}{\ignorespaces Data and inferences for the hierarchical binomial model fit using empirical Bayes. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#eb\_binom.ipynb}{eb\_binom.ipynb}. \relax }}{98}{figure.caption.51} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {3.19}{\ignorespaces A Markov chain in which we put a different Dirichlet prior on every row of the transition matrix $\mathbf {A}$, but the hyperparameters of the Dirichlet are shared. \relax }}{101}{figure.caption.52} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {3.20}{\ignorespaces Schematic of 5-fold cross validation. \relax }}{104}{figure.caption.53} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {3.21}{\ignorespaces (a) Histogram of Newcomb's data. (b) Histograms of data sampled from Gaussian model. (c) Histogram of test statistic on data sampled from the model, which represents $p(\text {test}(\cc@accent {"707E}{{\mathcal {D}}}^s)|{\mathcal {D}})$, where $\text {test}({\mathcal {D}})=\qopname \relax m{min}\{y \in {\mathcal {D}}\}$. The vertical line is the test statistic on the true data, $\text {test}({\mathcal {D}})$. (d) Same as (c) except $\text {test}({\mathcal {D}}) = \mathbb {V}\{ y \in {\mathcal {D}}\}$. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#newcomb\_plugin\_demo.ipynb}{newcomb\_plugin\_demo.ipynb}. \relax }}{110}{figure.caption.54} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {3.22}{\ignorespaces Posterior predictive distribution for divorce rate vs actual divorce rate for 50 US states. 
Both axes are standardized (i.e., z-scores). A few outliers are annotated. Adapted from Figure 5.5 of \citep {rethinking2}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#linreg\_divorce\_ppc.ipynb}{linreg\_divorce\_ppc.ipynb}. \relax }}{111}{figure.caption.55} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.1}{\ignorespaces Illustration of first and second order Markov models. \relax }}{114}{figure.caption.56} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.2}{\ignorespaces The (simplified) student network. ``Diff'' is the difficulty of the class. ``Intel'' is the intelligence of the student. ``Grade'' is the grade of the student in this class. ``SAT'' is the score of the student on the SAT exam. ``Letter'' is whether the teacher writes a good or bad letter of recommendation. The circles (nodes) represent random variables, the edges represent direct probabilistic dependencies. The tables inside each node represent the conditional probability distribution of the node given its parents. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#student\_pgm.ipynb}{student\_pgm.ipynb}. \relax }}{116}{figure.caption.57} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.3}{\ignorespaces (a) Hierarchical latent variable model with 2 layers. (b) Same as (a) but with autoregressive connections within each layer. The observed ${\bm {x}}$ variables are the shaded leaf nodes at the bottom. The unshaded nodes are the hidden ${\bm {z}}$ variables. \relax }}{117}{figure.caption.58} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.4}{\ignorespaces Bayes ball rules. A shaded node is one we condition on. If there is an arrow hitting a bar, it means the ball cannot pass through; otherwise the ball can pass through. \relax }}{120}{figure.caption.59} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.5}{\ignorespaces (a-b) Bayes ball boundary conditions. (c) Example of why we need boundary conditions. $Y'$ is an observed child of $Y$, rendering $Y$ ``effectively observed'', so the ball bounces back up on its way from $X$ to $Z$. \relax }}{120}{figure.caption.60} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.6}{\ignorespaces Samples from a jointly Gaussian DPGM\xspace , $p(x,y,z) = \mathcal {N}(x|-5,1) \mathcal {N}(y|5,1) \mathcal {N}(z|x+y,1)$. (a) Unconditional marginal distributions, $p(x)$, $p(y)$, $p(z)$. (b) Unconditional joint distribution, $p(x,y)$. (c) Conditional marginal distribution, $p(x|z>2.5)$, $p(y|z>2.5)$, $p(z|z>2.5)$. (d) Conditional joint distribution, $p(x,y|z>2.5)$. Adapted from \citep {causalityCloudera}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#berksons\_gaussian.ipynb}{berksons\_gaussian.ipynb}. \relax }}{122}{figure.caption.62} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.7}{\ignorespaces Illustration of belief updating in the ``Student'' PGM. The histograms show the marginal distribution of each node. Nodes with shaded titles are clamped to an observed value. (a) Posterior after conditioning on Grade=C. (b) Posterior after also conditioning on SAT=Good. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#student\_pgm.ipynb}{student\_pgm.ipynb}. 
\relax }}{125}{figure.caption.63} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.8}{\ignorespaces A DPGM\xspace representing the joint distribution $p({\bm {y}}_{1:N},{\bm {x}}_{1:N},{\bm {\theta }}_y,{\bm {\theta }}_x)$. Here ${\bm {\theta }}_x$ and ${\bm {\theta }}_y$ are global parameter nodes that are shared across the examples, whereas ${\bm {x}}_n$ and ${\bm {y}}_n$ are local variables. \relax }}{126}{figure.caption.64} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.9}{\ignorespaces A DPGM\xspace representing the joint distribution $p({\bm {z}}_{1:N},{\bm {x}}_{1:N},{\bm {\theta }}_z,{\bm {\theta }}_x)$. The local variables ${\bm {z}}_n$ are hidden, whereas ${\bm {x}}_n$ are observed. This is typical for learning unsupervised latent variable models. \relax }}{129}{figure.caption.67} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.10}{\ignorespaces Left: data points ${\bm {x}}_n$ are conditionally independent given ${\bm {\theta }}$. Right: Same model, using plate notation. This represents the same model as the one on the left, except the repeated ${\bm {x}}_n$ nodes are inside a box, known as a plate; the number in the lower right hand corner, $N$, specifies the number of repetitions of the ${\bm {x}}_n$ node. \relax }}{131}{figure.caption.68} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.11}{\ignorespaces (a) Factor analysis model illustrated as a DPGM\xspace . We show the components of ${\bm {z}}$ (top row) and ${\bm {x}}$ (bottom row) as individual scalar nodes. (b) Equivalent model, where ${\bm {z}}$ and ${\bm {x}}$ are collapsed to vector-valued nodes, and parameters are added, using plate notation. \relax }}{132}{figure.caption.69} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.12}{\ignorespaces (a) Naive Bayes classifier as a DPGM\xspace . (b) Model augmented with plate notation. \relax }}{133}{figure.caption.70} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.13}{\ignorespaces Tree-augmented naive Bayes classifier for $D=4$ features. The tree topology can change depending on the value of $y$, as illustrated. \relax }}{134}{figure.caption.71} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.14}{\ignorespaces (a) A 2d lattice represented as a DAG. The dotted red node $X_8$ is independent of all other nodes (black) given its Markov blanket, which includes its parents (blue), children (green) and co-parents (orange). (b) The same model represented as a UPGM\xspace . The red node $X_8$ is independent of the other black nodes given its neighbors (blue nodes). \relax }}{135}{figure.caption.72} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.15}{\ignorespaces (a) The two ground states for a small ferromagnetic Ising model where $J=1$. (b) Two different states for a small Ising model which have the same energy. Left: $J=1$, so neighboring pixels have similar values. Right: $J=-1$, so neighboring pixels have different values. From Figures 31.7 and 31.8 of \citep {MacKay03}. \relax }}{137}{figure.caption.73} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.16}{\ignorespaces Samples from an associative Ising model with varying $J>0$. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gibbs\_demo\_ising.ipynb}{gibbs\_demo\_ising.ipynb}. 
\relax }}{138}{figure.caption.74} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.17}{\ignorespaces Visualizing a sample from a 10-state Potts model of size $128 \times 128$ for different association strengths: (a) $J=1.40$, (b) $J=1.43$, (c) $J=1.46$. The critical value is $J_c = \log (1+\sqrt {10}) = 1.426$. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gibbs\_demo\_potts.ipynb}{gibbs\_demo\_potts.ipynb}. \relax }}{139}{figure.caption.75} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.18}{\ignorespaces Examples of how an associative memory can reconstruct images. These are binary images of size $150 \times 150$ pixels. Top: training images. Middle row: partially visible test images. Bottom row: final state estimate. Adapted from Figure 2.1 of \cite {Hertz91}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#hopfield\_demo.ipynb}{hopfield\_demo.ipynb}. \relax }}{141}{figure.caption.76} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.19}{\ignorespaces (a) A general Boltzmann machine, with an arbitrary graph structure. The shaded (visible) nodes are partitioned into input and output, although the model is actually symmetric and defines a joint distribution on all the nodes. (b) A restricted Boltzmann machine with a bipartite structure. Note the lack of intra-layer connections. \relax }}{142}{figure.caption.77} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.20}{\ignorespaces Some reconstructed images generated by a binary RBM fit to MNIST. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#rbm\_contrastive\_divergence.ipynb}{rbm\_contrastive\_divergence.ipynb}. \relax }}{143}{figure.caption.78} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.21}{\ignorespaces (a) Deep Boltzmann machine. (b) Deep belief network. The top two layers define the prior in terms of an RBM. The remaining layers are a directed graphical model that ``decodes'' the prior into observable data. \relax }}{144}{figure.caption.80} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.22}{\ignorespaces (a) A DPGM\xspace . (b) Its moralized version, represented as a UPGM\xspace . \relax }}{149}{figure.caption.81} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.23}{\ignorespaces Relationship between Markov properties of UPGM\xspace 's. \relax }}{150}{figure.caption.82} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.24}{\ignorespaces (a) The ancestral graph induced by the DAG in \cref {fig:graphSeven}(a) wrt $U=\{X_2, X_4, X_5\}$. (b) The moralized version of (a). \relax }}{150}{figure.caption.83} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.25}{\ignorespaces A grid-structured MRF with hidden nodes $z_i$ and local evidence nodes $x_i$. The prior $p({\bm {z}})$ is an undirected Ising model, and the likelihood $p({\bm {x}}|{\bm {z}})=\DOTSB \prod@ \slimits@ _i p(x_i|z_i)$ is a directed fully factored model. \relax }}{151}{figure.caption.84} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.26}{\ignorespaces Example of image denoising using mean field variational inference. We use an Ising prior with $W_{ij}=1$ and a Gaussian noise model with $\sigma =2$. (a) Noisy image. (b) Result of inference. 
Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#ising\_image\_denoise\_demo.ipynb}{ising\_image\_denoise\_demo.ipynb}. \relax }}{152}{figure.caption.85} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.27}{\ignorespaces (a) A small 2d lattice. (b) The representation used by pseudo likelihood. Solid nodes are observed neighbors. Adapted from Figure 2.2 of \citep {CarbonettoMasters}. \relax }}{154}{figure.caption.86} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.28}{\ignorespaces A 1d conditional random field (CRF) for sequence labeling. \relax }}{156}{figure.caption.87} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.29}{\ignorespaces A CRF for joint POS (part of speech) tagging and NP (noun phrase) segmentation. From Figure 4.E.1 of \citep {KollerBook}. Used with kind permission of Daphne Koller. \relax }}{157}{figure.caption.88} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.30}{\ignorespaces A skip-chain CRF for named entity recognition. From Figure 4.E.1 of \citep {KollerBook}. Used with kind permission of Daphne Koller. \relax }}{158}{figure.caption.89} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.31}{\ignorespaces Illustration of a simple parse tree based on a context free grammar in Chomsky normal form. The feature vector $\Psi ({\bm {x}},{\bm {y}})$ counts the number of times each production rule was used, and is used to define the energy of a particular tree structure, $E({\bm {y}}|{\bm {x}}) = -{\bm {w}}^{\mkern -1.5mu\mathsf {T}}\Psi ({\bm {x}},{\bm {y}})$. The probability distribution over trees is given by $p({\bm {y}}|{\bm {x}}) \propto \qopname \relax o{exp}(-E({\bm {y}}|{\bm {x}}))$. From Figure 5.2 of \citep {Altun07}. Used with kind permission of Yasemin Altun. \relax }}{159}{figure.caption.90} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.32}{\ignorespaces A grid-structured CRF with label nodes $y_i$ and local evidence nodes $x_i$. \relax }}{159}{figure.caption.91} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.33}{\ignorespaces A fully connected CRF is added to the output of a CNN, in order to increase the sharpness of the segmentation boundaries. From Figure 3 of \citep {Chen2015iclr}. Used with kind permission of Jay Chen. \relax }}{160}{figure.caption.92} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.34}{\ignorespaces Pictorial structures model for a face and body. Each body part corresponds to a node in the CRF whose state space represents the location of that part. The edges (springs) represent pairwise spatial constraints. The local evidence nodes are not shown. Adapted from a figure by Pedro Felzenszwalb. \relax }}{161}{figure.caption.93} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.35}{\ignorespaces DPGM\xspace 's and UPGM\xspace 's can perfectly represent different sets of distributions. Some distributions can be perfectly represented by either DPGM\xspace 's or UPGM\xspace 's; the corresponding graph must be chordal. \relax }}{164}{figure.caption.94} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.36}{\ignorespaces A UPGM\xspace and two failed attempts to represent it as a DPGM\xspace . From Figure 3.10 of \citep {KollerBook}. Used with kind permission of Daphne Koller. 
\relax }}{164}{figure.caption.95} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.37}{\ignorespaces Left: The full student DPGM\xspace . Right: the equivalent UPGM\xspace . We add moralization arcs D-I, G-J and L-S. Adapted from Figure 9.8 of \citep {KollerBook}. \relax }}{165}{figure.caption.96} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.38}{\ignorespaces (a) An undirected graphical model. (b) A directed equivalent, obtained by adding a dummy observed child node. \relax }}{166}{figure.caption.97} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.39}{\ignorespaces Two discriminative models for sequential data. (a) An undirected model (CRF). (b) A directed model (MEMM). \relax }}{166}{figure.caption.98} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.40}{\ignorespaces A partially directed acyclic graph (PDAG). The chain components are $\{A\}$, $\{B\}$, $\{C,D,E\}$, $\{F,G\}$, $\{H\}$ and $\{I\}$. Adapted from Figure 4.15 of \citep {KollerBook}. \relax }}{168}{figure.caption.99} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.41}{\ignorespaces (a) A DAG with two hidden variables (shaded). (b) The corresponding ADMG. The bidirected edges reflect correlation due to the hidden variable. (c) A Markov equivalent ADMG. From Figure 3 of \citep {Silva09}. Used with kind permission of Ricardo Silva. \relax }}{168}{figure.caption.100} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.42}{\ignorespaces A VAR(2) process represented as a dynamic chain graph. From \citep {Dahlhaus00}. Used with kind permission of Rainer Dahlhaus. \relax }}{169}{figure.caption.101} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.43}{\ignorespaces (a) A bi-directed graph. (b) The equivalent DAG. Here the $z$ nodes are latent confounders. Adapted from Figures 5.12-5.13 of \citep {Choi11thesis}. \relax }}{170}{figure.caption.102} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.44}{\ignorespaces (a) A simple UPGM\xspace . (b) A factor graph representation assuming one potential per maximal clique. (c) Same as (b), but graph is visualized differently. (d) A factor graph representation assuming one potential per edge. \relax }}{171}{figure.caption.103} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.45}{\ignorespaces (a) A simple DPGM\xspace . (b) Its corresponding factor graph. \relax }}{172}{figure.caption.104} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.46}{\ignorespaces A Forney factor graph. (a) Directed version. (b) Hierarchical version. \relax }}{173}{figure.caption.105} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.47}{\ignorespaces An FFG with an equality constraint node. \relax }}{173}{figure.caption.106} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.48}{\ignorespaces RPM for the book review domain. (a) Template for a generic customer $C_i$ and book $B_j$ pair. $R$ is rating, $Q$ is quality, $H$ is honesty, and $K$ is kindness. (b) Unrolled model for 2 books and 2 customers. \relax }}{176}{figure.caption.107} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.49}{\ignorespaces An extension of the book review RPM to handle identity uncertainty about which book a given customer is actually reviewing. The $R(c,b)$ node now depends on all books, since we don't know which one is being referred to. 
We can select one of these parents based on the mapping specified by the user's library, $L(c)$. \relax }}{177}{figure.caption.108} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.50}{\ignorespaces An example of a ground Markov logic network represented as a pairwise MRF for 2 people. Adapted from Figure 2.1 from \citep {Domingos09}. Used with kind permission of Pedro Domingos. \relax }}{178}{figure.caption.110} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.51}{\ignorespaces (a) PGM for modeling relationship between salary, education and work experience. (b) Corresponding SCM. \relax }}{182}{figure.caption.111} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.52}{\ignorespaces An SCM in which we intervene on $Z$. (a) Hard intervention, in which we clamp $Z$ and thus cut its incoming edges (shown as dotted). (b) Soft intervention, in which we change $Z$'s mechanism. The square node is an ``action'' node, using the influence diagram notation from \cref {sec:influenceDiag}. \relax }}{183}{figure.caption.112} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {4.53}{\ignorespaces Illustration of the potential outcomes framework as a SCM. The nodes with dashed edges are unobserved. In this example, for unit $1$, we select action $A_1=0$ and observe $Y_1 = Y_1^0 = y_1$, whereas for unit $2$, we select action $A_2=1$ and observe $Y_2 = Y_2^1= y_2$. \relax }}{186}{figure.caption.114} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {5.1}{\ignorespaces Demonstration of the mode-covering or mode-seeking behavior of KL divergence. The original distribution $q$ is bimodal. When we minimize $D_{\mathbb {KL}}\left ({q} \mathrel {\delimiter "026B30D } {p}\right )$, then $p$ covers the modes of $q$ (orange). When we minimize $D_{\mathbb {KL}}\left ({p} \mathrel {\delimiter "026B30D } {q}\right )$, then $p$ ignores some of the modes of $q$ (green). \relax }}{195}{figure.caption.115} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {5.2}{\ignorespaces (a) Illustration of Bregman divergence. (b) A locally linear approximation to a non-convex function. \relax }}{201}{figure.caption.116} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {5.3}{\ignorespaces Entropy of a Bernoulli random variable as a function of $\theta $. The maximum entropy is $\qopname \relax o{log}_2 2 = 1$. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#bernoulli\_entropy\_fig.ipynb}{bernoulli\_entropy\_fig.ipynb}. \relax }}{203}{figure.caption.117} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {5.4}{\ignorespaces Distribution of adult heights. The continuous entropy of the distribution depends on its units of measurement. If heights are measured in feet, this distribution has a continuous entropy of 0.43 bits. If measured in centimeters it's 5.4 bits. If measured in meters it's -1.3 bits. Data taken from \url {https://ourworldindata.org/human-height}. \relax }}{204}{figure.caption.118} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {5.5}{\ignorespaces The marginal entropy, joint entropy, conditional entropy and mutual information represented as information diagrams. Used with kind permission of Katie Everett. 
\relax }}{207}{figure.caption.119} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {5.6}{\ignorespaces Illustration of multivariate mutual information between three random variables. From \url {https://en.wikipedia.org/wiki/Mutual_information}. Used with kind permission of Wikipedia author PAR. \relax }}{209}{figure.caption.120} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {5.7}{\ignorespaces Subset of size $16242 \times 100$ of the 20-newsgroups data. We only show 1000 rows, for clarity. Each row is a document (represented as a bag-of-words bit vector), each column is a word. The red lines separate the 4 classes, which are (in descending order) comp, rec, sci, talk (these are the titles of USENET groups). We can see that there are subsets of words whose presence or absence is indicative of the class. The data is available from \url {http://cs.nyu.edu/~roweis/data.html}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#newsgroups\_visualize.ipynb}{newsgroups\_visualize.ipynb}. \relax }}{214}{figure.caption.121} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {5.8}{\ignorespaces Part of a relevance network constructed from the 20-newsgroups data shown in Figure\nobreakspace {}\ref {fig:newsgroupsSpyWithLabels}. We show edges whose mutual information is greater than or equal to 20\% of the maximum pairwise MI. For clarity, the graph has been cropped, so we only show a subset of the nodes and edges. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#relevance\_network\_newsgroup\_demo.ipynb}{relevance\_network\_newsgroup\_demo.ipynb}. \relax }}{214}{figure.caption.122} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {5.9}{\ignorespaces Illustration of the rate-distortion tradeoff. See text for details. From Figure 1 of \citep {brokenElbo}. Used with kind permission of Alex Alemi. \relax }}{216}{figure.caption.123} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {5.10}{\ignorespaces (a) A simple error-correcting code DPGM\xspace . $x_i$ are the sent bits, $y_i$ are the received bits. $x_3$ is an even parity check bit computed from $x_1$ and $x_2$. (b) Posterior over codewords given that ${\bm {y}}=(1,0,0)$; the probability of a bit flip is 0.2. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#error\_correcting\_code\_demo.ipynb}{error\_correcting\_code\_demo.ipynb}. \relax }}{219}{figure.caption.124} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {5.11}{\ignorespaces Information diagrams for the information bottleneck. (a) $Z$ can contain any amount of information about $X$ (whether it is useful for predicting $Y$ or not), but it cannot contain information about $Y$ that is not shared with $X$. (b) The optimal representation for $Z$ maximizes $\MI (Z,Y)$ and minimizes $\MI (Z,X)$. Used with kind permission of Katie Everett. \relax }}{220}{figure.caption.125} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {5.12}{\ignorespaces 2d embeddings of MNIST digits created by an MLP classifier. (a) Deterministic model. (b) Stochastic VIB model. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#vib\_demo\_2021.ipynb}{vib\_demo\_2021.ipynb}. Used with kind permission of Alex Alemi. 
\relax }}{223}{figure.caption.126} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {5.13}{\ignorespaces Conditional entropy bottleneck (CEB) chooses a representation $Z$ that maximizes $\MI (Z,Y)$ and minimizes $\MI (X,Z|Y)$. Used with kind permission of Katie Everett. \relax }}{223}{figure.caption.127} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {6.1}{\ignorespaces A circuit for a function $f$ over three primitives, and its decomposition into two circuits without fan-out. Input nodes are drawn in green.\relax }}{232}{figure.caption.135} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {6.2}{\ignorespaces Changing the mean of a Gaussian by a fixed amount (from solid to dotted curve) can have more impact when the (shared) variance is small (as in a) compared to when the variance is large (as in b). Hence the impact (in terms of prediction accuracy) of a change to $\mu $ depends on where the optimizer is in $(\mu ,\sigma )$ space. From Figure 3 of \citep {Honkela2010}, reproduced from \citep {ValpolaPhD}. Used with kind permission of Antti Honkela. \relax }}{236}{figure.caption.138} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {6.3}{\ignorespaces Illustration of the benefits of natural gradient vs steepest descent on a 2d problem. (a) Trajectories of the two methods in parameter space (red = steepest descent, blue = NG). They both start in the bottom right, at $(1,-1)$. (b) Objective vs number of iterations. (c) Gradient field in the ${\bm {\theta }}$ parameter space. (d) Gradient field in the whitened $\boldsymbol {\phi }= \mathbf {F}^{\frac {1}{2}} {\bm {\theta }}$ parameter space used by NG. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#nat\_grad\_demo.ipynb}{nat\_grad\_demo.ipynb}. \relax }}{239}{figure.caption.139} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {6.4}{\ignorespaces Illustration of the Gumbel-Softmax (concrete) distribution with $K=7$ states at different temperatures $\tau $. The top row shows $\mathbb {E}\left [{{\bm {z}}}\right ]$, and the bottom row shows samples ${\bm {z}}\sim \mathrm {GumbelSoftmax}(\boldsymbol {\alpha },\tau )$. The left column shows a discrete (categorical) distribution, which always produces one-hot samples. From Figure 1 of \citep {gumbelSoftmax}. Used with kind permission of Ben Poole. \relax }}{248}{figure.caption.140} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {6.5}{\ignorespaces Illustration of straight-through estimator when applied to a binary threshold function in the middle of an MLP. From \url {https://www.hassanaskary.com/python/pytorch/deep\%20learning/2020/09/19/intuitive-explanation-of-straight-through-estimators.html}. Used with kind permission of Hassan Askary. \relax }}{249}{figure.caption.141} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {6.6}{\ignorespaces Illustration of a bound optimization algorithm. Adapted from Figure 9.14 of \citep {BishopBook}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#em\_log\_likelihood\_max.ipynb}{em\_log\_likelihood\_max.ipynb}. \relax }}{250}{figure.caption.142} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {6.7}{\ignorespaces Illustration of data imputation using a multivariate Gaussian. (a) Scatter plot of true values vs imputed values using true parameters. 
(b) Same as (a), but using parameters estimated with EM. We just show the first four variables, for brevity. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gauss\_imputation\_em\_demo.ipynb}{gauss\_imputation\_em\_demo.ipynb}. \relax }}{256}{figure.caption.143} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {6.8}{\ignorespaces Illustration of possible behaviors of variational EM. (a) The lower bound increases at each iteration, and so does the likelihood. (b) The lower bound increases but the likelihood decreases. In this case, the algorithm is closing the gap between the approximate and true posterior. This can have a regularizing effect. Adapted from Figure 6 of \citep {SaulSigmoid96}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#var\_em\_bound.ipynb}{var\_em\_bound.ipynb}. \relax }}{257}{figure.caption.144} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {6.9}{\ignorespaces Illustration of the robustness obtained by using a Bayesian approach to parameter estimation. (a) When the minimum ${\bm {\theta }}_*$ lies next to a ``wall'', the Bayesian solution shifts away from the boundary to avoid large losses due to perturbations of the parameters. (b) The Bayesian solution prefers flat minima over sharp minima, to avoid large losses due to perturbations of the parameters. From Figure 1 of \citep {BLR}. Used with kind permission of Emtiyaz Khan. \relax }}{260}{figure.caption.145} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {6.10}{\ignorespaces Illustration of sequential Bayesian optimization over three iterations. The rows correspond to a training set of size $t=2,3,4$. The solid black line is the true, but unknown, function $f(x)$. The dotted black line is the posterior mean, $\mu (x)$. The shaded blue intervals are the 95\% credible interval derived from $\mu (x)$ and $\sigma (x)$. The solid black dots correspond to points whose function value has already been computed, i.e., $x_n$ for which $f(x_n)$ is known. The green curve at the bottom is the acquisition function. The red dot is the proposed next point to query, which is the maximum of the acquisition function. From Figure 1 of \citep {Shahriari2016}. Used with kind permission of Nando de Freitas. \relax }}{268}{figure.caption.147} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {6.11}{\ignorespaces The first row shows the objective function, (the Branin function defined on $\mathbb {R}^2$), and its posterior mean and variance using a GP estimate. White dots are the observed data points. The second row shows 3 different acquisition functions (probability of improvement, expected improvement, and upper confidence bound); the white triangles are the maxima of the corresponding acquisition functions. From Figure 6 of \citep {Brochu2010}. Used with kind permission of Nando de Freitas. \relax }}{270}{figure.caption.148} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {6.12}{\ignorespaces Illustration of grid search (left) vs random search (right). From Figure 1 of \citep {Bergstra2012}. Used with kind permission of James Bergstra. \relax }}{276}{figure.caption.150} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {6.13}{\ignorespaces (a) A peaky distribution. (b) Corresponding energy function. 
Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#simulated\_annealing\_2d\_demo.ipynb}{simulated\_annealing\_2d\_demo.ipynb}. \relax }}{277}{figure.caption.151} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {6.14}{\ignorespaces Annealed version of the distribution in \cref {fig:peaks} at different temperatures. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#simulated\_annealing\_2d\_demo.ipynb}{simulated\_annealing\_2d\_demo.ipynb}. \relax }}{278}{figure.caption.152} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {6.15}{\ignorespaces Simulated annealing applied to the distribution in \cref {fig:peaks}. (a) Temperature vs iteration and probability of each visited point vs iteration. (b) Visited samples, superimposed on the target distribution. The big red dot is the highest probability point found. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#simulated\_annealing\_2d\_demo.ipynb}{simulated\_annealing\_2d\_demo.ipynb}. \relax }}{278}{figure.caption.153} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {6.16}{\ignorespaces Illustration of a genetic algorithm applied to the 8-queens problem. (a) Initial population of 4 strings. (b) We rank the members of the population by fitness, and then compute their probability of mating. Here the integer numbers represent the number of nonattacking pairs of queens, so the global maximum has a value of 28. We pick an individual ${\bm {\theta }}$ with probability $p({\bm {\theta }}) = \mathcal {L}({\bm {\theta }})/Z$, where $Z = \DOTSB \sum@ \slimits@ _{{\bm {\theta }}\in \mathcal {P}} \mathcal {L}({\bm {\theta }})$ sums the total fitness of the population. For example, we pick the first individual with probability $24 / 78 = 0.31$, the second with probability $23/78 = 0.29$, etc. In this example, we pick the first individual once, the second twice, the third one once, and the last one does not get to breed. (c) A split point on the ``chromosome'' of each parent is chosen at random. (d) The two parents swap their chromosome halves. (e) We can optionally apply pointwise mutation. From Figure 4.6 of \citep {Russell10}. Used with kind permission of Peter Norvig. \relax }}{280}{figure.caption.154} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {6.17}{\ignorespaces The 8-queens states corresponding to the first two parents in \cref {fig:aima-genetic}(c) and their first child in \cref {fig:aima-genetic}(d). We see that the encoding $32752411$ means that the first queen is in row 3 (counting from the bottom left), the second queen is in row 2, etc. The shaded columns are lost in the crossover, but the unshaded columns are kept. From Figure 4.7 of \citep {Russell10}. Used with kind permission of Peter Norvig. \relax }}{280}{figure.caption.155} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {6.18}{\ignorespaces Illustration of crossover operator in a genetic program. (a-b) the two parents, representing $\qopname \relax o{sin}(x) + (x+y)^2$ and $\qopname \relax o{sin}(x) + \sqrt {x^2+y}$. The red circles denote the two crossover points. (c-d) the two children, representing $\qopname \relax o{sin}(x) + (x^2)^2$ and $\qopname \relax o{sin}(x) + \sqrt {x+y+y}$. 
Adapted from Figure 9.2 of \citep {Mitchell97}. \relax }}{281}{figure.caption.156} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {6.19}{\ignorespaces Illustration of the BOA algorithm (EDA applied to a generative model structured as a Bayes net). Adapted from Figure 3 of \citep {Pelikan2012}. \relax }}{282}{figure.caption.157} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {6.20}{\ignorespaces Illustration of the CMA-ES method applied to a simple 2d function. The dots represent members of the population, and the dashed orange ellipse represents the multivariate Gaussian. From \url {https://en.wikipedia.org/wiki/CMA-ES}. Used with kind permission of Wikipedia author Sentewolf. \relax }}{285}{figure.caption.158} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {6.21}{\ignorespaces \textit {(left)} Matching a family of 5 points to another is equivalent to choosing a permutation of $\{1,\dots ,n\}$. When each pair $({\mathbf {x}}_i,{\mathbf {y}}_j)\in \mathbb {R}^2$ is assigned a cost equal to the distance $\|{\mathbf {x}}_i-{\mathbf {y}}_j\|$, the optimal matching problem involves finding a permutation $\sigma $ that minimizes the total cost $\sum _{i=1}^{5} \|{\mathbf {x}}_i-{\mathbf {y}}_{\sigma _i}\|$. \textit {(middle)} The Kantorovich formulation of optimal transport generalizes optimal matchings, and arises when comparing discrete measures, that is, families of weighted points that do not necessarily share the same size but do share the same total mass. The relevant variable is a matrix $P$ of size $n\times m$, which must satisfy row-sum and column-sum constraints, and which minimizes its dot product with the cost matrix $C=[C_{ij}]$. \textit {(right)} Another direct extension of the matching problem arises when, intuitively, the number $n$ of points becomes so large that the measures being compared are continuous densities. In that setting, and unlike the Kantorovich setting, the goal is to seek a map $T:\mathcal {X}\rightarrow \mathcal {X}$ that associates to any point $x$ in the support of the input measure $\mu $ a point $y=T(x)$ in the support of $\nu $. The push-forward constraint $T_\sharp \mu =\nu $ ensures that $\nu $ is recovered by applying the map $T$ to all points in the support of $\mu $; the optimal map $T^\star $ is the one that minimizes the distance between $x$ and $T(x)$, averaged over $\mu $. \relax }}{286}{figure.caption.159} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {6.22}{\ignorespaces The value relationships between coffee ({\ensuremath c}), lemon ({\ensuremath l}), milk ({\ensuremath m}), and tea ({\ensuremath t}). On the left, we first see a simple square showing the relationships between coffee and tea, and see that they are substitutive (or submodular). In this, and all of the shapes, the vertex label set is indicated in curly braces and the value at that vertex is a blue integer in a box. We next see a three-dimensional cube that adds lemon to the coffee and tea set. We see that tea and lemon are complementary (supermodular) but coffee and lemon are additive (modular, or independent). We next see a four-dimensional hypercube (tesseract) showing all of the value relationships described in the text. 
The four-dimensional hypercube is also shown as a lattice (on the right) showing the same relationships as well as two (red and green, also shown in the tesseract) of the eight three-dimensional cubes contained within. \relax }}{294}{figure.caption.169} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {6.23}{\ignorespaces Far Left: cardinality constrained (to ten) submodular maximization of a facility location function over 1000 points in two dimensions. Similarities are based on a Gaussian kernel $\text {sim}(\ensuremath {a},\ensuremath {v}) = \qopname \relax o{exp}(-d(\ensuremath {a},\ensuremath {v}))$ where $d(\cdot ,\cdot )$ is a distance. Selected points are green stars and the greedy order is also shown next to each selected point. Right three plots: different uniformly-at-random subsets of size ten. \relax }}{301}{figure.caption.170} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {7.1}{\ignorespaces Graphical models with (a) Global hidden variables for representing the Bayesian discriminative model $p({\bm {y}}_{1:N},{\bm {\theta }}_y|{\bm {x}}_{1:N}) = p({\bm {\theta }}_y) \DOTSB \prod@ \slimits@ _{n=1}^N p({\bm {y}}_n|{\bm {x}}_n;{\bm {\theta }}_y)$; (b) Local hidden variables for representing the generative model $p({\bm {x}}_{1:N},{\bm {z}}_{1:N}|{\bm {\theta }}) = \DOTSB \prod@ \slimits@ _{n=1}^N p({\bm {z}}_n|{\bm {\theta }}_z) p({\bm {x}}_n|{\bm {z}}_n,{\bm {\theta }}_x)$; (c) Local and global hidden variables for representing the Bayesian generative model $p({\bm {x}}_{1:N},{\bm {z}}_{1:N},{\bm {\theta }}) = p({\bm {\theta }}_z) p({\bm {\theta }}_x) \DOTSB \prod@ \slimits@ _{n=1}^N p({\bm {z}}_n|{\bm {\theta }}_z) p({\bm {x}}_n|{\bm {z}}_n,{\bm {\theta }}_x)$. Shaded nodes are assumed to be known (observed), unshaded nodes are hidden. \relax }}{316}{figure.caption.171} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {7.2}{\ignorespaces Approximating the posterior of a beta-Bernoulli model. (a) Grid approximation using 20 grid points. (b) Laplace approximation. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#laplace\_approx\_beta\_binom.ipynb}{laplace\_approx\_beta\_binom.ipynb}. \relax }}{319}{figure.caption.172} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {7.3}{\ignorespaces ADVI applied to the beta-Bernoulli model. (a) Approximate vs true posterior. (b) Negative ELBO over time. (c) Variational $\mu $ parameter over time. (d) Variational $\sigma $ parameter over time. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#advi\_beta\_binom.ipynb}{advi\_beta\_binom.ipynb}. \relax }}{321}{figure.caption.173} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {7.4}{\ignorespaces Approximating the posterior of a beta-Bernoulli model using MCMC. (a) Kernel density estimate derived from samples from 4 independent chains. (b) Trace plot of the chains as they generate posterior samples. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#hmc\_beta\_binom.ipynb}{hmc\_beta\_binom.ipynb}. \relax }}{322}{figure.caption.174} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {7.5}{\ignorespaces Different approximations to a bimodal 2d distribution. (a) Local MAP estimate. (b) Parametric Gaussian approximation. (c) Correlated samples from near one mode. (d) Independent samples from the distribution. 
Adapted from Figure 2 of \citep {Papandreou2014}. Used with kind permission of George Papandreou. \relax }}{324}{figure.caption.175} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {7.6}{\ignorespaces Accuracy vs compute time for different inference methods. Code source: \url {https://github.com/tminka/ep-clutter-example}. From Figure 1 of \citep {Minka01b}. Used with kind permission of Tom Minka. \relax }}{325}{figure.caption.176} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {8.1}{\ignorespaces A state-space model represented as a graphical model. ${\bm {z}}_t$ are the hidden variables at time $t$, ${\bm {y}}_t$ are the observations (outputs), and ${\bm {u}}_t$ are the optional inputs. \relax }}{328}{figure.caption.177} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {8.2}{\ignorespaces The main kinds of inference for state-space models. The shaded region is the interval for which we have data. The arrow represents the time step at which we want to perform inference. $t$ is the current time, $T$ is the sequence length, $\ell $ is the lag and $h$ is the prediction horizon. \relax }}{328}{figure.caption.178} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {8.3}{\ignorespaces The state transition matrix $\mathbf {A}$ and observation matrix $\mathbf {B}$ for the casino HMM. Adapted from \citep [p54]{Durbin98}. \relax }}{329}{figure.caption.179} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {8.4}{\ignorespaces Inference in the dishonest casino. Vertical gray bars denote times when the hidden state corresponded to the loaded die. Blue lines represent the posterior probability of being in that state given different subsets of observed data. If we recover the true state exactly, the blue curve will transition at the same time as the gray bars. (a) Filtered estimate. (b) Smoothed estimates. (c) MAP trajectory. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#casino\_hmm.ipynb}{casino\_hmm.ipynb}. \relax }}{330}{figure.caption.180} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {8.5}{\ignorespaces Computing the two-slice joint distribution for an HMM from the forwards messages, backwards messages, and local evidence messages. \relax }}{335}{figure.caption.181} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {8.6}{\ignorespaces The trellis of states vs time for a Markov chain. Adapted from \citep {Rabiner89}. \relax }}{337}{figure.caption.182} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {8.7}{\ignorespaces Illustration of Viterbi decoding in a simple HMM for speech recognition. (a) A 3-state HMM for a single phone. We are visualizing the state transition diagram. We assume the observations have been vector quantized into 7 possible symbols, $C_1,\ldots ,C_7$. Each state $s_1,s_2,s_3$ has a different distribution over these symbols. Adapted from Figure 15.20 of \citep {Russell02}. (b) Illustration of the Viterbi algorithm applied to this model, with data sequence $C_1, C_3, C_4, C_6$. The columns represent time, and the rows represent states. The numbers inside the circles represent the $\delta _t(j)$ value for that state.
An arrow from state $i$ at $t-1$ to state $j$ at $t$ is annotated with two numbers: the first is the probability of the $i \rightarrow j$ transition, and the second is the probability of generating observation ${\bm {y}}_{t}$ from state $j$. The red lines/circles represent the most probable sequence of states. Adapted from Figure 24.27 of \citep {Russell95b}. \relax }}{338}{figure.caption.183} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {8.8}{\ignorespaces Illustration of Kalman filtering and smoothing for a linear dynamical system. (a) Observations (green circles) are generated by an object moving to the right (true location denoted by blue squares). (b) Results of online Kalman filtering. Red cross is the posterior mean, circles are 95\% confidence ellipses derived from the posterior covariance. (c) Same as (b), but using offline Kalman smoothing. The MSE in the trajectory for filtering is 3.13, and for smoothing is 1.71. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#kf\_tracking.ipynb}{kf\_tracking.ipynb}. \relax }}{341}{figure.caption.184} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {8.9}{\ignorespaces (a) Observations and true and estimated state. (b) Marginal distributions for time step $t=20$. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#discretized\_ssm\_student.ipynb}{discretized\_ssm\_student.ipynb}. \relax }}{349}{figure.caption.187} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {8.10}{\ignorespaces Discretized posterior of the latent state at each time step. Red cross is the true latent state. Red circle is observation. (a) Filtering. (b) Smoothing. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#discretized\_ssm\_student.ipynb}{discretized\_ssm\_student.ipynb}. \relax }}{349}{figure.caption.188} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {8.11}{\ignorespaces Illustration of the fact that a broad prior (a) may result in a more complex posterior than a narrow prior (b). Consequently, the EKF approximation may work poorly in situations of high uncertainty. From Figure 3.5 of \citep {Thrun06}. Used with kind permission of Dieter Fox. \relax }}{352}{figure.caption.190} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {8.12}{\ignorespaces Illustration of the fact that if the function is very nonlinear (a) at the current operating point, the posterior will be less well approximated by the EKF than if the function is locally linear (b). From Figure 3.6 of \citep {Thrun06}. Used with kind permission of Dieter Fox. \relax }}{353}{figure.caption.191} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {8.13}{\ignorespaces Illustration of filtering applied to a 2d nonlinear dynamical system. (a) True underlying state and observed data. (b) Extended Kalman filter estimate. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#ekf\_vs\_ukf.ipynb}{ekf\_vs\_ukf.ipynb}. \relax }}{353}{figure.caption.192} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {8.14}{\ignorespaces Illustration of a pendulum swinging. $g$ is the force of gravity, $w(t)$ is a random external force, and $\alpha $ is the angle wrt the vertical. Adapted from Figure 3.10 in \citep {Sarkka13}. 
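The Viterbi recursion summarized in the caption of Figure 8.7, $\delta _t(j) = p(y_t|j) \max _i \delta _{t-1}(i) A_{ij}$ together with backpointers to recover the most probable state sequence, can be written in a few lines. The following Python/NumPy sketch is illustrative only; the function name and array layout are my own choices, not the book's code.
\begin{verbatim}
import numpy as np

def viterbi(pi, A, B, obs):
    """Most probable hidden state path for a discrete HMM.
    pi[j]  : initial state probabilities
    A[i,j] : transition probability i -> j
    B[j,k] : probability of emitting symbol k from state j
    obs    : list of observed symbol indices
    """
    T, K = len(obs), len(pi)
    delta = np.zeros((T, K))            # delta[t, j] = prob. of best path ending in j at t
    back = np.zeros((T, K), dtype=int)  # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A      # scores[i, j] = delta[t-1, i] * A[i, j]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [delta[-1].argmax()]                 # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]
\end{verbatim}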
\relax }}{354}{figure.caption.193} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {8.15}{\ignorespaces Filtering algorithms applied to the pendulum model. (a) EKF. (b) UKF. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#pendulum\_1d.ipynb}{pendulum\_1d.ipynb}. \relax }}{354}{figure.caption.194} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {8.16}{\ignorespaces An example of the unscented transform in two dimensions. From \citep {Wan01unscented}. Used with kind permission of Eric Wan. \relax }}{356}{figure.caption.195} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {8.17}{\ignorespaces Illustration of UKF applied to a 2d nonlinear dynamical system. (a) True underlying state and observed data. (b) UKF estimate. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#ekf\_vs\_ukf.ipynb}{ekf\_vs\_ukf.ipynb}. \relax }}{359}{figure.caption.198} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {8.18}{\ignorespaces Illustration of the predict-update-project cycle of assumed density filtering. $q_t \in \mathcal {Q}$ is a tractable distribution, whereas we may have $p_{t|t-1} \not \in \mathcal {Q}$ and $p_t^* \not \in \mathcal {Q}$. \relax }}{366}{figure.caption.202} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {8.19}{\ignorespaces A taxonomy of filtering algorithms. Adapted from Figure 2 of \citep {Wuthrich2016}. \relax }}{367}{figure.caption.203} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {8.20}{\ignorespaces ADF for a switching linear dynamical system with 2 discrete states. (a) GPB2 method. (b) IMM method. \relax }}{369}{figure.caption.204} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {8.21}{\ignorespaces A dynamic logistic regression model. ${\bm {w}}_t$ are the regression weights at time $t$, and $\eta _t = {\bm {w}}_t^{\mkern -1.5mu\mathsf {T}}{\bm {x}}_t$. Compare to \cref {fig:RLSDGM}. \relax }}{370}{figure.caption.205} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {8.22}{\ignorespaces Bayesian inference applied to a 2d binary logistic regression problem, $p(y=1|{\bm {x}}) = \sigma (w_0 + w_1 x_1 + w_2 x_2)$. We show the training data and the posterior predictive produced by different methods. (a) Offline MCMC approximation. (b) Offline Laplace approximation. (c) Online ADF approximation at the final step of inference. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#adf\_logistic\_regression\_demo.ipynb}{adf\_logistic\_regression\_demo.ipynb}. \relax }}{372}{figure.caption.206} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {8.23}{\ignorespaces Marginal posteriors over time for the ADF method. The horizontal line is the offline MAP estimate. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#adf\_logistic\_regression\_demo.ipynb}{adf\_logistic\_regression\_demo.ipynb}. \relax }}{372}{figure.caption.207} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {9.1}{\ignorespaces An undirected tree and two equivalent directed trees. \relax }}{374}{figure.caption.208} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {9.2}{\ignorespaces Belief propagation on a pairwise, rooted tree. 
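The unscented transform of Figure 8.16 admits a compact implementation: deterministically choose $2n+1$ sigma points from $(\mu ,\Sigma )$, push each through the nonlinearity, and recombine them with fixed weights. Below is a minimal Julier-style sketch in Python/NumPy; the spread parameter kappa and the function name are illustrative assumptions, not the exact UKF variant used in the book's notebooks.
\begin{verbatim}
import numpy as np

def unscented_transform(mu, Sigma, f, kappa=1.0):
    """Approximate the mean and covariance of y = f(x) for x ~ N(mu, Sigma).
    f must map a length-n vector to a 1d output vector."""
    n = len(mu)
    L = np.linalg.cholesky((n + kappa) * Sigma)   # scaled matrix square root
    points = [mu] + [mu + L[:, i] for i in range(n)] + [mu - L[:, i] for i in range(n)]
    weights = np.array([kappa / (n + kappa)] + [0.5 / (n + kappa)] * (2 * n))
    Y = np.array([f(p) for p in points])          # propagate the sigma points
    mu_y = weights @ Y
    Sigma_y = sum(w * np.outer(y - mu_y, y - mu_y) for w, y in zip(weights, Y))
    return mu_y, Sigma_y
\end{verbatim}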
\relax }}{375}{figure.caption.209} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {9.3}{\ignorespaces Illustration of how the top-down message from $s$ to $t$ is computed during BP on a tree. The $u_i$ nodes are the other children of $s$, besides $t$. Square nodes represent clique potentials. \relax }}{376}{figure.caption.210} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {9.4}{\ignorespaces Message passing on a bipartite factor graph. Square nodes represent factors, and circles represent variables. The $y_i$ nodes correspond to the neighbors $x'_i$ of $f$ other than $x$. From Figure 6 of \citep {Kschischang01}. Used with kind permission of Brendan Frey. \relax }}{380}{figure.caption.211} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {9.5}{\ignorespaces Interpolating noisy data using Gaussian belief propagation applied to a 1d MRF. Generated by \href {https://colab.research.google.com/github/probml/pgm-jax/blob/main/gaussian-loopy-bp/gauss-bp-1d-line.ipynb}{gauss-bp-1d-line.ipynb}. \relax }}{381}{figure.caption.212} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {9.6}{\ignorespaces (a) A simple loopy graph. (b) The computation tree, rooted at node 1, after 4 rounds of message passing. Nodes 2 and 3 occur more often in the tree because they have higher degree than nodes 1 and 4. From Figure 8.2 of \citep {Monster}. Used with kind permission of Martin Wainwright. \relax }}{382}{figure.caption.213} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {9.7}{\ignorespaces Illustration of the behavior of loopy belief propagation on an $11 \times 11$ Ising grid with random potentials, $w_{ij} \sim \mathrm {Unif}(-C,C)$, where $C=11$. For larger $C$, inference becomes harder. (a) Percentage of messages that have converged vs time for 3 different update schedules: Dotted = damped synchronous (few nodes converge), dashed = undamped asynchronous (half the nodes converge), solid = damped asynchronous (all nodes converge). (b-f) Marginal beliefs of certain nodes vs time. Solid straight line = truth, dashed = synchronous, solid = damped asynchronous. From Figure 11.C.1 of \citep {KollerBook}. Used with kind permission of Daphne Koller. \relax }}{383}{figure.caption.214} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {9.8}{\ignorespaces (a) Clusters superimposed on a $3 \times 3$ lattice graph. (b) Corresponding hyper-graph. Nodes represent clusters, and edges represent set containment. From Figure 4.5 of \citep {Monster}. Used with kind permission of Martin Wainwright. \relax }}{385}{figure.caption.215} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {9.9}{\ignorespaces (a) A simple factor graph representation of a (2,3) low-density parity check code. Each message bit (hollow round circle) is connected to two parity factors (solid black squares), and each parity factor is connected to three bits. Each parity factor has the form $\psi _{stu}(x_s,x_t,x_u) = \mathbb {I}\left ({x_s \otimes x_t \otimes x_u = 1}\right )$, where $\otimes $ is the xor operator. The local evidence factors for each hidden node are not shown. (b) A larger example of a random LDPC code. We see that this graph is ``locally tree-like'', meaning there are no short cycles; rather, each cycle has length $\sim \qopname \relax o{log}m$, where $m$ is the number of nodes. This gives us a hint as to why loopy BP works so well on such graphs. 
(Note, however, that some error correcting code graphs have short loops, so this is not the full explanation.) From Figure 2.9 of \citep {Monster}. Used with kind permission of Martin Wainwright. \relax }}{386}{figure.caption.216} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {9.10}{\ignorespaces Factor graphs for affinity propagation. Circles are variables, squares are factors. Each $c_i$ node has $N$ possible states. From Figure S2 of \citep {Frey07}. Used with kind permission of Brendan Frey. \relax }}{387}{figure.caption.217} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {9.11}{\ignorespaces Example of affinity propagation. Each point is color coded by how much it wants to be an exemplar (red is the most, green is the least). This can be computed by summing up all the incoming availability messages and the self-similarity term. The darkness of the $i \rightarrow k$ arrow reflects how much point $i$ wants to belong to exemplar $k$. From Figure 1 of \citep {Frey07}. Used with kind permission of Brendan Frey. \relax }}{388}{figure.caption.218} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {9.12}{\ignorespaces Example of the elimination process, in the order $C,D,I,H,G,S,L$. When we eliminate $I$ (figure c), we add a fill-in edge between $G$ and $S$, since they are not connected. Adapted from Figure 9.10 of \citep {KollerBook}. \relax }}{390}{figure.caption.219} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {9.13}{\ignorespaces Encoding a 3-SAT problem on $n$ variables and $m$ clauses as a DGM. The $Q_s$ variables are binary random variables. The $C_t$ variables are deterministic functions of the $Q_s$'s, and compute the truth value of each clause. The $A_t$ nodes are a chain of AND gates, to ensure that the CPT for the final $x$ node has bounded size. The double rings denote nodes with deterministic CPDs. From Figure 9.1 of \citep {KollerBook}. Used with kind permission of Daphne Koller. \relax }}{394}{figure.caption.222} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {9.14}{\ignorespaces Sending multiple messages along a tree. (a) $z_1$ is root. (b) $z_2$ is root. (c) $z_4$ is root. (d) All of the messages needed to compute all singleton marginals. Adapted from Figure 4.3 of \citep {JordanBook}. \relax }}{395}{figure.caption.223} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {10.1}{\ignorespaces Illustration of variational inference. The large oval represents the set of variational distributions $\mathcal {Q} = \{ q({\bm {z}}; \boldsymbol {\psi }): \boldsymbol {\psi }\in \mathbb {R}^K \}$, where $K$ is the number of variational parameters. The true distribution is the point $p({\bm {z}}|{\bm {x}})$, which we assume lies outside the set. Our goal is to find the best approximation to $p$ within our variational family; this is the point $\boldsymbol {\psi }^*$ which is closest in KL divergence. We find this point by starting an optimization procedure from the random initial point $\boldsymbol {\psi }^{\text {init}}$. Adapted from a figure by David Blei. \relax }}{402}{figure.caption.224} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {10.2}{\ignorespaces A grid-structured MRF with hidden nodes $z_i$ and local evidence nodes $x_i$. 
The prior $p({\bm {z}})$ is an undirected Ising model, and the likelihood $p({\bm {x}}|{\bm {z}})=\DOTSB \prod@ \slimits@ _i p(x_i|z_i)$ is a directed fully factored model. \relax }}{405}{figure.caption.225} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {10.3}{\ignorespaces Example of image denoising using mean field (with parallel updates and a damping factor of 0.5). We use an Ising prior with $W_{ij}=1$ and a Gaussian noise model with $\sigma =2$. We show the results after 1, 3 and 15 iterations across the image. Compare to \cref {fig:isingImageDenoiseGibbs}, which shows the results of using Gibbs sampling. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#ising\_image\_denoise\_demo.ipynb}{ising\_image\_denoise\_demo.ipynb}. \relax }}{406}{figure.caption.226} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {10.4}{\ignorespaces Graphical models with (a) Global hidden variable ${\bm {\theta }}_x$ and observed variables ${\bm {x}}_{1:N}$. (b) Local hidden variables ${\bm {z}}_{1:N}$, global hidden variables ${\bm {\theta }}_x, {\bm {\theta }}_z$, and observed variables ${\bm {x}}_{1:N}$. \relax }}{407}{figure.caption.227} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {10.5}{\ignorespaces Factored variational approximation (red) to the Gaussian-Gamma distribution (green). (a) Initial guess. (b) After updating $q(\mu |\boldsymbol {\psi }_{\mu })$. (c) After updating $q(\lambda |\boldsymbol {\psi }_{\lambda })$. (d) At convergence (after 5 iterations). Adapted from Fig. 10.4 of \citep {BishopBook}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#unigauss\_vb\_demo.ipynb}{unigauss\_vb\_demo.ipynb}. \relax }}{410}{figure.caption.228} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {10.6}{\ignorespaces (a) We plot $\qopname \relax o{exp}(\psi (x))$ vs $x$ (green line). We see that it performs a form of shrinkage, so that small values get set to zero. (b) We plot $N_k$ vs time for 4 different states ($z$ values), starting from random initial values. We perform a series of VBEM updates, ignoring the likelihood term. We see that states that initially had higher counts get reinforced, and sparsely populated states get killed off. From \citep {Liang07acl}. Used with kind permission of Percy Liang. \relax }}{414}{figure.caption.229} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {10.7}{\ignorespaces We visualize the posterior mean parameters at various stages of the VBEM algorithm applied to a mixture of Gaussians model on the Old Faithful data. Shading intensity is proportional to the mixing weight. We initialize with K-means and use $\alpha _0=0.001$ as the Dirichlet hyper-parameter. (The red dot on the right panel represents all the unused mixture components, which collapse to the prior at 0.) Adapted from Figure 10.6 of \citep {BishopBook}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#variational\_mixture\_gaussians\_demo.ipynb}{variational\_mixture\_gaussians\_demo.ipynb}. \relax }}{415}{figure.caption.230} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {10.8}{\ignorespaces We visualize the posterior values of $\boldsymbol {\alpha }_k$ for the model in \cref {fig:VBgmmFaithful} after the first and last iteration of the algorithm. We see that unnecessary components get ``killed off''. (Interestingly, the initially large cluster 6 gets ``replaced'' by cluster 5.) 
Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#variational\_mixture\_gaussians\_demo.ipynb}{variational\_mixture\_gaussians\_demo.ipynb}. \relax }}{416}{figure.caption.231} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {10.9}{\ignorespaces Lower bound vs iterations for the VB algorithm in \cref {fig:VBgmmFaithful}. The steep parts of the curve correspond to places where the algorithm figures out that it can increase the bound by ``killing off'' unnecessary mixture components, as described in \cref {sec:VBGMMchooseK}. The plateaus correspond to slowly moving the clusters around. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#variational\_mixture\_gaussians\_demo.ipynb}{variational\_mixture\_gaussians\_demo.ipynb}. \relax }}{416}{figure.caption.232} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {10.10}{\ignorespaces Gaussian variational approximation to a Gaussian posterior for the mean of a 2d Gaussian. TODO: Implement this in JAX, and add rank-1 version. From Figure 4 of \citep {Kucukelbir2016advi}. \relax }}{423}{figure.caption.236} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {10.11}{\ignorespaces ADVI Gaussian approximation applied to a small logistic regression model. TODO: implement this in JAX, and add rank-one. From Figure 5 of \citep {Kucukelbir2016advi}. \relax }}{424}{figure.caption.237} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {10.12}{\ignorespaces ADVI full-rank Gaussian approximation to the LKJ posterior for a 2d covariance matrix. From \url {https://docs.pymc.io/en/v3/pymc-examples/examples/case_studies/LKJ.html}. \relax }}{426}{figure.caption.238} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {10.13}{\ignorespaces Posterior over the mixing weights (histogram) and the means and covariances of each Gaussian mixture component, using $K=10$, when fitting the model to the Old Faithful dataset from \cref {fig:VBgmmFaithful}. (a) MAP approximation. (b-d) 3 samples from the Gaussian approximation. The intensity of the shading is proportional to the mixture weight. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#vb\_gmm.ipynb}{vb\_gmm.ipynb}. \relax }}{427}{figure.caption.239} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {10.14}{\ignorespaces Predictive performance of ADVI diagonal Gaussian approximation applied to an ARD linear regression model compared to MCMC. From Figure 10a of \citep {Kucukelbir2016advi}. \relax }}{428}{figure.caption.240} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {10.15}{\ignorespaces SVI for fitting a mixture of 3 Gaussians in 2d. (a) 3000 training points. (b) Fitted density, plugging in the posterior mean parameters. (c) Kernel density estimate fit to 10,000 samples from $q(\boldsymbol {\mu }_1|\boldsymbol {\psi }_{{\bm {\theta }}})$. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#svi\_gmm\_demo\_2d.ipynb}{svi\_gmm\_demo\_2d.ipynb}. \relax }}{431}{figure.caption.241} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {10.16}{\ignorespaces Using an inverse autoregressive flow network to approximate the posterior of a VAE. From Figure 3.1 of \citep {Kingma2019vae}. Used with kind permission of Durk Kingma. 
\relax }}{435}{figure.caption.243} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {10.17}{\ignorespaces Illustration of the CUBO upper bound being minimized by CHIVI (\cref {sec:CHIVI}) and the ELBO lower bound being maximized by BBVI (\cref {sec:BBVI}) on UCI Ionosphere dataset and a synthetic dataset where we know the true value of $\qopname \relax o{log}p({\bm {x}})$. From Figure 1 of \citep {Dieng2017nips}. Used with kind permission of Adji Dieng. \relax }}{438}{figure.caption.244} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {10.18}{\ignorespaces Illustrating forwards vs reverse KL on a symmetric Gaussian. The blue curves are the contours of the true distribution $p$. The red curves are the contours of a factorized approximation $q$. (a) Minimizing $D_{\mathbb {KL}}\left ({p} \mathrel {\delimiter "026B30D } {q}\right )$. (b) Minimizing $D_{\mathbb {KL}}\left ({q} \mathrel {\delimiter "026B30D } {p}\right )$. Adapted from Figure 10.2 of \citep {BishopBook}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#kl\_pq\_gauss.ipynb}{kl\_pq\_gauss.ipynb}. \relax }}{441}{figure.caption.245} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {10.19}{\ignorespaces Combining a logistic likelihood factor $f_i = p(y_i|{\bm {\theta }})$ with the cavity prior, $q_{-i}^{\text {cavity}}= g_{-i}({\bm {\theta }})$, to get the tilted distribution, $q_{i}^{\text {tilted}}= p(y_i|{\bm {\theta }}) g_{-i}({\bm {\theta }})$. Adapted from Figure 2 of \citep {Gelman2014EP}. \relax }}{444}{figure.caption.246} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {11.1}{\ignorespaces Estimating $\pi $ by Monte Carlo integration using 5000 samples. Blue points are inside the circle, red points are outside. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#mc\_estimate\_pi.ipynb}{mc\_estimate\_pi.ipynb}. \relax }}{448}{figure.caption.247} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {11.2}{\ignorespaces 10 and 100 samples from a Gaussian distribution, $\mathcal {N}(\mu =1.5, \sigma ^2=0.25)$. A dotted red line denotes kernel density estimate derived from the samples. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#mc\_accuracy\_demo.ipynb}{mc\_accuracy\_demo.ipynb}. \relax }}{449}{figure.caption.248} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {11.3}{\ignorespaces Sampling from $\mathcal {N}(3, 1)$ using an inverse CDF. \relax }}{450}{figure.caption.249} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {11.4}{\ignorespaces (a) Schematic illustration of rejection sampling. From Figure 2 of \citep {Andrieu03}. Used with kind permission of Nando de Freitas. (b) Rejection sampling from a $\mathrm {Ga}(\alpha =5.7,\lambda =2)$ distribution (solid blue) using a proposal of the form $M \mathrm {Ga}(k,\lambda -1)$ (dotted red), where $k=\delimiter "4262304 5.7 \delimiter "5263305 =5$. The curves touch at $\alpha -k=0.7$. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#rejection\_sampling\_demo.ipynb}{rejection\_sampling\_demo.ipynb}. \relax }}{452}{figure.caption.250} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {11.5}{\ignorespaces (a) Idea behind adaptive rejection sampling. We place piecewise linear upper (and lower) bounds on the log-concave density. 
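The rejection sampler of Figure 11.4(b) draws proposals from $\mathrm {Ga}(k,\lambda -1)$ and accepts each with probability $p(x)/(M q(x))$; since $p(x)/q(x) \propto x^{\alpha -k} e^{-x}$, the bound is tight at $x=\alpha -k=0.7$, as noted in the caption. A minimal sketch, assuming scipy.stats for the Gamma densities (the helper names are my own, not the notebook's code):
\begin{verbatim}
import numpy as np
from scipy.stats import gamma

# Target Ga(alpha=5.7, rate=2) and proposal Ga(k=5, rate=1), as in Fig. 11.4(b).
alpha, lam, k = 5.7, 2.0, 5
p = lambda x: gamma.pdf(x, a=alpha, scale=1.0 / lam)
q = lambda x: gamma.pdf(x, a=k, scale=1.0 / (lam - 1.0))
M = p(alpha - k) / q(alpha - k)   # p/q is maximized at x = alpha - k = 0.7

def rejection_sample(n_samples, seed=0):
    rng = np.random.default_rng(seed)
    samples = []
    while len(samples) < n_samples:
        x = gamma.rvs(a=k, scale=1.0 / (lam - 1.0), random_state=rng)
        if rng.uniform() <= p(x) / (M * q(x)):    # accept with prob p(x) / (M q(x))
            samples.append(x)
    return np.array(samples)
\end{verbatim}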
Adapted from Figure 1 of \citep {Gilks92}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#ars\_envelope.ipynb}{ars\_envelope.ipynb}. (b-c) Using ARS to sample from a half-Gaussian. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#ars\_demo.ipynb}{ars\_demo.ipynb}. \relax }}{454}{figure.caption.251} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {11.6}{\ignorespaces In importance sampling, we should sample from a distribution that takes into account regions where $\pi ({\bm {x}})$ has high probability {\em and} where $\varphi ({\bm {x}})$ is large. Here the function to be evaluated is an indicator function of a set, corresponding to a set of rare events in the tail of the distribution. From Figure 3 of \citep {Andrieu03}. Used with kind permission of Nando de Freitas. \relax }}{456}{figure.caption.252} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {11.7}{\ignorespaces Illustration of Monte Carlo (MC), Quasi MC (QMC) from a Sobol sequence, and Randomized QMC using a scrambling method. Adapted from Figure 1 of \citep {Owen2020}. Used with kind permission of Art Owen. \relax }}{461}{figure.caption.253} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {12.1}{\ignorespaces An example of the Metropolis Hastings algorithm for sampling from a mixture of two 1D Gaussians ($\boldsymbol {\mu }=(-20,20)$, $\boldsymbol {\pi }=(0.3, 0.7)$, $\boldsymbol {\Sigma }=(100,100)$), using a Gaussian proposal with standard deviation $\tau \in \{1, 8, 500\}$. (a) When $\tau =1$, the chain gets trapped near the starting state and fails to sample from the mode at $\mu =-20$. (b) When $\tau =500$, the chain is very ``sticky'', so its effective sample size is low (as reflected by the rough histogram approximation at the end). (c) Using a standard deviation of $\tau =8$ is just right and leads to a good approximation of the true distribution (shown in red). Compare to \cref {fig:gibbsGmm}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#mcmc\_gmm\_demo.ipynb}{mcmc\_gmm\_demo.ipynb}. \relax }}{467}{figure.caption.255} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {12.2}{\ignorespaces Illustration of checkerboard pattern for a 2d MRF. This allows for parallel updates. \relax }}{471}{figure.caption.256} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {12.3}{\ignorespaces Example of image denoising using Gibbs sampling. We use an Ising prior with $J=1$ and a Gaussian noise model with $\sigma =2$. (a) Sample from the posterior after one sweep over the image. (b) Sample after 5 sweeps. (c) Posterior mean, computed by averaging over 15 sweeps. Compare to \cref {fig:isingImageDenoiseMeanfield} which shows the results of mean field inference. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#ising\_image\_denoise\_demo.ipynb}{ising\_image\_denoise\_demo.ipynb}. \relax }}{472}{figure.caption.257} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {12.4}{\ignorespaces (a) Some samples from a mixture of two 1d Gaussians generated using Gibbs sampling. Color denotes the value of $z$, vertical location denotes the value of $x$. Horizontal axis represents time (sample number). (b) Traceplot of $x$ over time, and the resulting empirical distribution is shown in blue. The true distribution is shown in red. 
Compare to \cref {fig:nandoMH}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#mcmc\_gmm\_demo.ipynb}{mcmc\_gmm\_demo.ipynb}. \relax }}{473}{figure.caption.258} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {12.5}{\ignorespaces Illustration of potentially slow sampling when using Gibbs sampling for a skewed 2D Gaussian. Adapted from Figure 11.11 of \citep {BishopBook}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gibbs\_gauss\_demo.ipynb}{gibbs\_gauss\_demo.ipynb}. \relax }}{475}{figure.caption.259} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {12.6}{\ignorespaces (a) A mixture model represented as an ``unrolled'' DPGM\xspace . (b) After integrating out the continuous latent parameters. \relax }}{476}{figure.caption.260} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {12.7}{\ignorespaces Comparison of collapsed (red) and vanilla (blue) Gibbs sampling for a mixture of $K=4$ two-dimensional Gaussians applied to $N=300$ data points. We plot log probability of the data vs iteration. (a) 20 different random initializations. (b) logprob averaged over 100 different random initializations. Solid line is the median, thick dashed are the 0.25 and 0.75 quantiles, and thin dashed are the 0.05 and 0.95 quantiles. From Figure 2.20 of \citep {SudderthThesis}. Used with kind permission of Erik Sudderth. \relax }}{478}{figure.caption.262} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {12.8}{\ignorespaces Slice sampling. (a) Illustration of one step of the algorithm in 1d. Given a previous sample $x^i$, we sample $u^{i+1}$ uniformly on $[0,f(x^i)]$, where $f=\cc@accent {"707E}{p}$ is the (unnormalized) target density. We then sample $x^{i+1}$ along the slice where $f(x) \geq u^{i+1}$. From Figure 15 of \citep {Andrieu03}. Used with kind permission of Nando de Freitas. (b) Output of slice sampling applied to a 1d distribution. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#slice\_sampling\_demo\_1d.ipynb}{slice\_sampling\_demo\_1d.ipynb}. \relax }}{479}{figure.caption.263} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {12.9}{\ignorespaces Binomial regression for 1d data. Left: Grid approximation to posterior. Right: Slice sampling approximation. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#slice\_sampling\_demo\_2d.ipynb}{slice\_sampling\_demo\_2d.ipynb}. \relax }}{479}{figure.caption.264} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {12.10}{\ignorespaces Illustration of the Swendsen Wang algorithm on a 2d grid. Used with kind permission of Kevin Tang. \relax }}{481}{figure.caption.265} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {12.11}{\ignorespaces (a) A mixture model. (b) After integrating out the discrete latent variables. \relax }}{490}{figure.caption.268} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {12.12}{\ignorespaces Illustration of convergence to the uniform distribution over $\{0,1,\ldots ,20\}$ using a symmetric random walk starting from (left) state 10, and (right) state 17. Adapted from Figures 29.14 and 29.15 of \citep {MacKay03}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#random\_walk\_integers.ipynb}{random\_walk\_integers.ipynb}. 
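As a concrete companion to Figure 12.1, here is a minimal random-walk Metropolis-Hastings sketch in Python/NumPy for the two-component Gaussian mixture target described in that caption. The symmetric Gaussian proposal means the Hastings correction cancels; the function names and the log-space acceptance test are my own illustrative choices, not the notebook's code.
\begin{verbatim}
import numpy as np

def log_target(x):
    # Mixture of two 1d Gaussians: means (-20, 20), weights (0.3, 0.7), variance 100.
    # The common 1/sqrt(2*pi*100) constant cancels in the acceptance ratio.
    return np.log(0.3 * np.exp(-0.5 * (x + 20) ** 2 / 100)
                  + 0.7 * np.exp(-0.5 * (x - 20) ** 2 / 100))

def random_walk_mh(n_iter, tau, x0=0.0, seed=0):
    rng = np.random.default_rng(seed)
    x, chain = x0, []
    for _ in range(n_iter):
        x_prop = x + tau * rng.standard_normal()        # symmetric Gaussian proposal
        if np.log(rng.uniform()) < log_target(x_prop) - log_target(x):
            x = x_prop                                  # accept
        chain.append(x)
    return np.array(chain)

# tau = 1 mixes poorly, tau = 500 is sticky, tau = 8 works well (cf. Fig. 12.1).
samples = random_walk_mh(10_000, tau=8.0)
\end{verbatim}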
\relax }}{491}{figure.caption.269} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {12.13}{\ignorespaces A Markov chain with low conductance. The dotted arcs represent transitions with very low probability. From Figure 12.6 of \citep {KollerBook}. Used with kind permission of Daphne Koller. \relax }}{493}{figure.caption.270} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {12.14}{\ignorespaces Marginals (left) and trace plot (right) for the univariate Gaussian using the diffuse prior. Black vertical lines indicate HMC divergences. Adapted from Figures 9.9--9.10 of \citep {rethinking2}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#mcmc\_traceplots\_unigauss.ipynb}{mcmc\_traceplots\_unigauss.ipynb}. \relax }}{494}{figure.caption.271} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {12.15}{\ignorespaces Marginals (left) and trace plot (right) for the univariate Gaussian using the sensible prior. Adapted from Figures 9.9--9.10 of \citep {rethinking2}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#mcmc\_traceplots\_unigauss.ipynb}{mcmc\_traceplots\_unigauss.ipynb}. \relax }}{494}{figure.caption.272} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {12.16}{\ignorespaces Trace rank plot for the univariate Gaussian using the diffuse prior. Black vertical lines indicate HMC divergences. Adapted from Figures 9.9--9.10 of \citep {rethinking2}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#mcmc\_traceplots\_unigauss.ipynb}{mcmc\_traceplots\_unigauss.ipynb}. \relax }}{495}{figure.caption.273} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {12.17}{\ignorespaces Trace rank plot for the univariate Gaussian using the sensible prior. Adapted from Figures 9.9--9.10 of \citep {rethinking2}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#mcmc\_traceplots\_unigauss.ipynb}{mcmc\_traceplots\_unigauss.ipynb}. \relax }}{495}{figure.caption.274} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {12.18}{\ignorespaces Examples of two distinct challenges that arise when trying to assess if Markov chain samples have converged. In both examples ((b) and (c)) we have taken samples from a bimodal distribution using HMC. However, we have used different values of hyperparameters (step size ($\eta $), inverse mass matrix ($\Sigma $) and number of integration steps ($L$)) for both cases. (a) Bimodal distribution. (b) Each chain looks stationary, but they have not converged to a common distribution. In this case non-split $\cc@accent {"705E}{R}$ can detect non-convergence. HMC hyperparameters - $\eta = 0.05$, $\Sigma = [0.8]$, $L=13$ (c) The two sequences cover a common distribution, but are non-stationary. Here non-split $\cc@accent {"705E}{R}$ is not able to detect non-convergence, while split $\cc@accent {"705E}{R}$ has detected non-convergence in both cases. HMC hyperparameters - $\eta = 0.1$, $\Sigma = [0.1]$, $L=13$. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#rhat\_slow\_mixing\_chains.ipynb}{rhat\_slow\_mixing\_chains.ipynb}. \relax }}{497}{figure.caption.275} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {12.19}{\ignorespaces Autocorrelation functions for various MCMC samplers for the mixture of two 1D Gaussians. (a-c) These are the MH samplers in \cref {fig:nandoMH}. 
(d) This is the Gibbs sampler in \cref {fig:gibbsGmm}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#mcmc\_gmm\_demo.ipynb}{mcmc\_gmm\_demo.ipynb}. \relax }}{498}{figure.caption.276} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {12.20}{\ignorespaces Neal's funnel. (a) Joint density. (b) HMC samples from the centered representation. (c) HMC samples from the non-centered representation. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#neals\_funnel.ipynb}{neals\_funnel.ipynb}. \relax }}{501}{figure.caption.277} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {12.21}{\ignorespaces To compare a 1d model against a 2d model, we first have to map the 1d model to 2d space so the two have a common measure. Note that we assume the ridge has finite support, so it is integrable. From Figure 17 of \citep {Andrieu03}. Used with kind permission of Nando de Freitas. \relax }}{507}{figure.caption.279} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {12.22}{\ignorespaces Fitting an RBF network to some 1d data using RJMCMC. (a) Prediction on test set. (b) Plot of $p(k|{\mathcal {D}})$ vs iteration. (c) Final posterior $p(k|{\mathcal {D}})$. Adapted from Figure 4 of \citep {Andrieu01rbf}. \relax }}{508}{figure.caption.281} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {13.1}{\ignorespaces Illustration of particle filtering (using the dynamical prior as the proposal) applied to a 2d nonlinear dynamical system. (a) True underlying state and observed data. (b) PF estimate of the posterior mean. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#bootstrap\_filter.ipynb}{bootstrap\_filter.ipynb}. \relax }}{512}{figure.caption.282} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {13.2}{\ignorespaces (a) Illustration of weight degeneracy for SIS applied to the model in \cref {eqn:nonmarkov}, with parameters $(\phi ,q,\beta ,r) = (0.9, 10.0, 0.5,1.0)$. We use $T=6$ steps and $n_s=5$ samples. We see that as $t$ increases, almost all the probability mass concentrates on particle 3. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#sis\_vs\_smc.ipynb}{sis\_vs\_smc.ipynb}. Adapted from Figure 2 of \citep {Naesseth2019}. (b) Illustration of the bootstrap particle filtering algorithm. \relax }}{516}{figure.caption.284} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {13.3}{\ignorespaces (a) Illustration of diversity of samples in SMC applied to the model in \cref {eqn:nonmarkov}. (b) Illustration of the path degeneracy problem. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#sis\_vs\_smc.ipynb}{sis\_vs\_smc.ipynb}. Adapted from Figure 3 of \citep {Naesseth2019}. \relax }}{518}{figure.caption.286} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {13.4}{\ignorespaces Illustration of how to sample from the empirical CDF $P(x) = \DOTSB \sum@ \slimits@ _{n=1}^N W^n \mathbb {I}\left ({x \geq n}\right )$ shown in black. The height of step $n$ is $W^n$. If $U^m$ picks step $n$, then we set the ancestor of $m$ to be $n$, i.e., $A^m=n$. In this example, $A^{1:3} = (1,2,2)$. Adapted from Figure 9.3 of \citep {Chopin2020}. 
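The ancestor-sampling step of Figure 13.4 is just inversion of the weight CDF: draw a uniform $U^m$ and record which step of the staircase it falls under. A minimal sketch (the 0-based indices and the function name are my own choices):
\begin{verbatim}
import numpy as np

def multinomial_resample(W, seed=0):
    """Sample ancestor indices A[m] from the empirical CDF of the weights W (shape (N,))."""
    rng = np.random.default_rng(seed)
    cdf = np.cumsum(W)
    cdf[-1] = 1.0                       # guard against floating-point round-off
    U = rng.uniform(size=len(W))
    return np.searchsorted(cdf, U)      # first step of the CDF that each U falls under

W = np.array([0.1, 0.5, 0.2, 0.2])
print(multinomial_resample(W))          # ancestor indices in {0, 1, 2, 3}
\end{verbatim}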
\relax }}{519}{figure.caption.287} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {13.5}{\ignorespaces Effective sample size at each step for the bootstrap particle filter and a guided particle filter for a Gaussian SSM with Poisson likelihood. Adapted from Figure 10.4 of \citep {Chopin2020}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#pf\_guided\_neural\_decoding.ipynb}{pf\_guided\_neural\_decoding.ipynb}. \relax }}{524}{figure.caption.289} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {13.6}{\ignorespaces Illustration of state estimation for a switching linear model. (a) Black dots are observations, hollow circles are the true location, colors represent the discrete state. (b) Estimate from RBPF. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#rbpf\_maneuver.ipynb}{rbpf\_maneuver.ipynb}. (c) Estimate from bootstrap filter. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#bootstrap\_filter\_maneuver.ipynb}{bootstrap\_filter\_maneuver.ipynb}. \relax }}{528}{figure.caption.291} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {13.7}{\ignorespaces Posterior marginals of the location of the object over time, derived from the mixture of Gaussian representation. (a) $x$ location (dimension 0). (b) $y$ location (dimension 2). Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#rbpf\_maneuver\_demo.ipynb}{rbpf\_maneuver\_demo.ipynb}. \relax }}{528}{figure.caption.292} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {13.8}{\ignorespaces (a) Ground truth discrete state vs time, (b) Posterior distribution for the discrete state, derived from the particle representation. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#rbpf\_maneuver\_demo.ipynb}{rbpf\_maneuver\_demo.ipynb}. \relax }}{529}{figure.caption.293} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {13.9}{\ignorespaces Illustration of graphical model underlying SLAM. ${\bm {l}}_t^k$ is the location of landmark $k$ at time $t$, ${\bm {r}}_t$ is the location of the robot at time $t$, and ${\bm {y}}_t$ is the observation vector. In the model on the left, the landmarks are static; on the right, their location can change over time. The robot's observations are based on the distance to the nearest landmarks from the current state, denoted $f({\bm {r}}_t,{\bm {l}}_t^k)$. In this example, the robot gets the following observation trace: ${\bm {y}}_1=[f({\bm {r}}_1,{\bm {l}}_1^1), f({\bm {r}}_1,{\bm {l}}_1^2)]$, ${\bm {y}}_2=f({\bm {r}}_2,{\bm {l}}_2^2)$ up to ${\bm {y}}_T=f({\bm {r}}_T,{\bm {l}}_T^2)$. Thus we see that the number of observations per time step is variable. Adapted from Figure 15.A.3 of \citep {KollerBook}. \relax }}{530}{figure.caption.294} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {13.10}{\ignorespaces Illustration of the SLAM problem. (a) A robot starts at the top left and moves clockwise in a circle back to where it started. We see how the posterior uncertainty about the robot's location increases and then decreases as it returns to a familiar location, closing the loop. If we performed smoothing, this new information would propagate backwards in time to disambiguate the entire trajectory. (b) We show the precision matrix, representing sparse correlations between the landmarks, and between the landmarks and the robot's position (pose). 
This sparse precision matrix can be visualized as a Gaussian graphical model, as shown on the right. From Figure 15.A.3 of \citep {KollerBook}. Used with kind permission of Daphne Koller. \relax }}{530}{figure.caption.295} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {13.11}{\ignorespaces (a) Illustration of a bimodal target distribution. (b) Tempered versions of the target at different inverse temperatures, from $\lambda _T=1$ down to $\lambda _1=0$. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#smc\_tempered\_1d\_bimodal.ipynb}{smc\_tempered\_1d\_bimodal.ipynb}. \relax }}{535}{figure.caption.297} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {13.12}{\ignorespaces Sampling from the bimodal distribution in \cref {fig:bimodalTarget}. (a) HMC. (b) NUTS. (c) Tempered SMC with HMC kernel (single step). (d) Adaptive inverse temperature schedule. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#smc\_tempered\_1d\_bimodal.ipynb}{smc\_tempered\_1d\_bimodal.ipynb}. \relax }}{535}{figure.caption.298} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {13.13}{\ignorespaces Illustration of IBIS applied to 30 samples from $\mathcal {N}(\mu =3.14, \sigma =1)$. (a) Posterior approximation after $t=1$ and $t=29$ observations. (b) Effective sample size over time. The sudden jumps up occur whenever resampling is triggered, which happens when the ESS drops below 500. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#smc\_ibis\_1d.ipynb}{smc\_ibis\_1d.ipynb}. \relax }}{537}{figure.caption.300} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {14.1}{\ignorespaces Reliability diagrams for the ResNet CNN image classifier \citep {He2016eccv} applied to CIFAR-100 dataset. ECE is the expected calibration error, and measures the size of the red gap. Methods from left to right: original probabilities; after temperature scaling; after histogram binning; after isotonic regression. From Figure 4 of \citep {calibrationNN}. Used with kind permission of Chuan Guo. \relax }}{547}{figure.caption.301} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {14.2}{\ignorespaces Visualization of 3 different approaches to calibrating a binary probabilistic classifier. Black crosses are the observed binary labels, red lines are the calibrated outputs. (a) Platt scaling. (b) Histogram binning with 3 bins. The output in each bin is the average of the binary labels in each bin. (c) The scaling-binning calibrator. This first applies Platt scaling, and then computes the average of the scaled points (gray circles) in each bin. From Figure 1 of \citep {Kumar2019nips}. Used with kind permission of Ananya Kumar. \relax }}{549}{figure.caption.302} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {14.3}{\ignorespaces Softmax distribution $\mathcal {S}({\bm {a}}/T)$, where ${\bm {a}}=(3,0,1)$, at temperatures of $T=100$, $T=2$ and $T=1$. When the temperature is high (left), the distribution is uniform, whereas when the temperature is low (right), the distribution is ``spiky'', with most of its mass on the largest element. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#softmax\_plot.ipynb}{softmax\_plot.ipynb}. 
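The tempered softmax $\mathcal {S}({\bm {a}}/T)$ of Figure 14.3 is a one-liner; the sketch below (with the usual max-subtraction for numerical stability, an implementation detail I am assuming rather than quoting from the notebook) reproduces the qualitative behavior described in the caption:
\begin{verbatim}
import numpy as np

def softmax(a, T=1.0):
    z = np.asarray(a, dtype=float) / T
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

a = np.array([3.0, 0.0, 1.0])
for T in [100, 2, 1]:
    print(T, softmax(a, T))         # high T -> nearly uniform; T = 1 -> spiky
\end{verbatim}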
\relax }}{549}{figure.caption.303} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {14.4}{\ignorespaces Prediction set examples on Imagenet. We show three progressively more difficult examples of the class fox squirrel and the prediction sets generated by conformal prediction. From Figure 1 of \citep {Angelopoulos2021}. Used with kind permission of Anastasios Angelopoulos. \relax }}{553}{figure.caption.304} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {14.5}{\ignorespaces (a) Illustration of adaptive prediction sets. From Figure 5 of \citep {Angelopoulos2021}. Used with kind permission of Anastasios Angelopoulos. (b) Illustration of conformalized quantile regression. From Figure 6 of \citep {Angelopoulos2021}. Used with kind permission of Anastasios Angelopoulos. (c) Illustration of pinball loss function. \relax }}{555}{figure.caption.305} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {15.1}{\ignorespaces Linear regression for predicting height given weight, $y \sim \mathcal {N}(\alpha + \beta x, \sigma ^2)$. (a) Prior predictive samples using a Gaussian prior for $\beta $. (b) Prior predictive samples using a Log-Gaussian prior for $\beta $. (c) Posterior predictive samples using the Log-Gaussian prior. The inner shaded band is the 95\% credible interval for $\mu $, representing epistemic uncertainty. The outer shaded band is the 95\% credible interval for the observations $y$, which also adds data uncertainty due to $\sigma $. Adapted from Figures 4.5 and 4.10 of \citep {rethinking2}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#linreg\_height\_weight.ipynb}{linreg\_height\_weight.ipynb}. \relax }}{568}{figure.caption.308} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {15.2}{\ignorespaces Linear regression for predicting height given weight for the full dataset (including children) using polynomial regression. (a) Posterior fit for linear model with log-Gaussian prior for $\beta _1$. (b) Posterior fit for quadratic model with log-Gaussian prior for $\beta _2$. (c) Posterior fit for quadratic model with Gaussian prior for $\beta _2$. Adapted from Figure 4.11 of \citep {rethinking2}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#linreg\_height\_weight.ipynb}{linreg\_height\_weight.ipynb}. \relax }}{570}{figure.caption.309} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {15.3}{\ignorespaces (a) Representing lasso using a Gaussian scale mixture prior. (b) Graphical model for group lasso with 2 groups, the first has size $G_1=2$, the second has size $G_2=3$. \relax }}{571}{figure.caption.310} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {15.4}{\ignorespaces Illustration of why ARD results in sparsity. The vector of inputs ${\bm {x}}$ does not point towards the vector of outputs ${\bm {y}}$, so the feature should be removed. (a) For finite $\alpha $, the probability density is spread in directions away from ${\bm {y}}$. (b) When $\alpha =\infty $, the probability density at ${\bm {y}}$ is maximized. Adapted from Figure 8 of \citep {Tipping01}. \relax }}{574}{figure.caption.311} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {15.5}{\ignorespaces (a) Prior on logistic regression output when using $\mathcal {N}(0,\omega )$ prior for the offset term, for $\omega =10$ or $\omega =1.5$. 
Adapted from Figure 11.3 of \citep {rethinking2}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#logreg\_prior\_offset.ipynb}{logreg\_prior\_offset.ipynb}. (b) Distribution over the fraction of 1s we expect to see when using binary logistic regression applied to random binary feature vectors of increasing dimensionality. We use a $\mathcal {N}(0,1.5)$ prior on the regression coefficients. Adapted from Figure 3 of \citep {Gelman2020workflow}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#logreg\_prior.ipynb}{logreg\_prior.ipynb}. \relax }}{577}{figure.caption.312} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {15.6}{\ignorespaces (a) Illustration of the data and some decision boundaries, where colors correspond to the points in the (b) panel. (b) Log-likelihood for a logistic regression model. The line is drawn from the origin in the direction of the MLE (which is at infinity). The numbers correspond to 4 points in parameter space, corresponding to the lines in (a). (c) Unnormalized log posterior (assuming vague spherical prior). (d) Laplace approximation to posterior. Adapted from a figure by Mark Girolami. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#logreg\_laplace\_demo.ipynb}{logreg\_laplace\_demo.ipynb}. \relax }}{579}{figure.caption.313} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {15.7}{\ignorespaces Posterior predictive distribution for a logistic regression model in 2d. (a): contours of $p(y=1|{\bm {x}},\cc@accent {"705E}{{\bm {w}}}_{map})$. (b): samples from the posterior predictive distribution. (c): Averaging over these samples. (d): moderated output (probit approximation). Adapted from a figure by Mark Girolami. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#logreg\_laplace\_demo.ipynb}{logreg\_laplace\_demo.ipynb}. \relax }}{580}{figure.caption.314} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {15.8}{\ignorespaces Illustration of the posterior over the decision boundary for classifying iris flowers (setosa vs versicolor) using 2 input features. (a) 25 examples per class. Adapted from Figure 4.5 of \citep {Martin2018}. (b) 5 examples of class 0, 45 examples of class 1. Adapted from Figure 4.8 of \citep {Martin2018}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#logreg\_iris\_bayes\_2d.ipynb}{logreg\_iris\_bayes\_2d.ipynb}. \relax }}{581}{figure.caption.315} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {15.9}{\ignorespaces The logistic (sigmoid) function $\sigma (x)$ in solid red, with the Gaussian cdf function $\Phi (\lambda x)$ in dotted blue superimposed. Here $\lambda =\sqrt {\pi /8}$, which was chosen so that the derivatives of the two curves match at $x=0$. Adapted from Figure 4.9 of \citep {BishopBook}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#probit\_plot.ipynb}{probit\_plot.ipynb}. \relax }}{583}{figure.caption.316} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {15.10}{\ignorespaces Fitting a probit regression model in 2d using a quasi-Newton method or EM. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#probit\_reg\_demo.ipynb}{probit\_reg\_demo.ipynb}. 
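The approximation behind Figure 15.9, $\sigma (x) \approx \Phi (\lambda x)$ with $\lambda =\sqrt {\pi /8}$ chosen so that the two slopes agree at $x=0$, is easy to check numerically. A minimal sketch using only the standard library's erf (the helper names are my own, not the notebook's code):
\begin{verbatim}
import numpy as np
from math import erf, sqrt, pi

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gauss_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))   # standard normal CDF Phi(x)

lam = sqrt(pi / 8.0)   # matches the derivative of the sigmoid (1/4) at x = 0
for x in [0.0, 0.5, 1.0, 2.0, 4.0]:
    print(x, sigmoid(x), gauss_cdf(lam * x))
\end{verbatim}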
\relax }}{585}{figure.caption.317} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {15.11}{\ignorespaces A hierarchical Bayesian linear regression model for the radon problem. \relax }}{588}{figure.caption.318} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {15.12}{\ignorespaces Posterior marginals for $\alpha _c$ and $\beta _c$ for each county in the radon model. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#linreg\_hierarchical\_non\_centered.ipynb}{linreg\_hierarchical\_non\_centered.ipynb}. \relax }}{588}{figure.caption.319} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {15.13}{\ignorespaces Predictions from the radon model for 3 different counties in Minnesota. Black dots are observed datapoints. Green represents results of hierarchical (shared) prior, blue represents results of non-hierarchical prior. Thick lines are the result of using the posterior mean, thin lines are the result of using posterior samples. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#linreg\_hierarchical\_non\_centered.ipynb}{linreg\_hierarchical\_non\_centered.ipynb}. \relax }}{589}{figure.caption.320} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {15.14}{\ignorespaces (a) Bivariate posterior $p(\beta _c,\sigma _{\beta }|{\mathcal {D}})$ for the hierarchical radon model for county $c=75$ using centered parameterization. (b) Similar to (a) except we plot $p(\cc@accent {"707E}{\beta }_c,\sigma _{\beta }|{\mathcal {D}})$ for the non-centered parameterization. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#linreg\_hierarchical\_non\_centered.ipynb}{linreg\_hierarchical\_non\_centered.ipynb}. \relax }}{589}{figure.caption.321} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.1}{\ignorespaces An artificial ``neuron'', the most basic building block of a DNN. (a) The output $y$ is a weighted combination of the inputs ${\bm {x}}$, where the weight vector is denoted by ${\bm {w}}$. (b) Alternative depiction of the neuron's behavior. The bias term $b$ can be emulated by defining $w_N=b$ and $X_N=1$. \relax }}{592}{figure.caption.322} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.2}{\ignorespaces (a) Some popular activation functions. ``ReLU'' stands for ``rectified linear unit''. ``GELU'' stands for ``Gaussian error linear unit''. (b) Plot of their gradients. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#activation\_fun\_deriv.ipynb}{activation\_fun\_deriv.ipynb}. \relax }}{593}{figure.caption.324} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.3}{\ignorespaces A 2d convolutional layer with 3 input channels and 2 output channels. The kernel has size $3 \times 3$ and we use stride 1 with 0 padding, so the $6 \times 6$ input gets mapped to the $4 \times 4$ output. \relax }}{594}{figure.caption.325} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.4}{\ignorespaces A residual connection around a convolutional layer. \relax }}{595}{figure.caption.326} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.5}{\ignorespaces Illustration of dropout. (a) A standard neural net with 2 hidden layers. (b) An example of a thinned net produced by applying dropout with $p=0.5$. Units that have been dropped out are marked with an x.
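Dropout, as depicted in Figure 16.5, simply multiplies the activations by a random binary mask during training. A minimal inverted-dropout sketch (illustrative only, not the code behind the figure):
\begin{verbatim}
import numpy as np

def dropout(h, p_drop, rng, training=True):
    """Inverted dropout: zero each unit with probability p_drop and
    rescale the survivors so the expected activation is unchanged."""
    if not training or p_drop == 0.0:
        return h
    keep = rng.random(h.shape) >= p_drop
    return h * keep / (1.0 - p_drop)

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))            # a batch of hidden activations
print(dropout(h, p_drop=0.5, rng=rng))
\end{verbatim}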
From Figure 1 of \citep {Srivastava2014}. Used with kind permission of Geoff Hinton. \relax }}{596}{figure.caption.327} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.6}{\ignorespaces Attention layer. (a) Mapping a single query ${\bm {q}}$ to a single output, given a set of keys and values. From Figure 10.3.1 of \citep {dive}. Used with kind permission of Aston Zhang. (b) Detailed visualization of attention. Here the $n=2$ queries ${\bm {q}}_j$ are derived by embedding input vectors ${\bm {x}}_j$ corresponding to the words (discrete tokens) ``Thinking'' and ``Machines''. We assume there are $n=2$ keys and values, and that the queries, keys and values all have the same size, namely $d_k=64$. (We only show 3 dimensions for brevity.) From \url {https://jalammar.github.io/illustrated-transformer/}. Used with kind permission of Jay Alammar. \relax }}{597}{figure.caption.328} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.7}{\ignorespaces (a) Scaled dot-product attention in matrix form. (b) Multi-head attention. From Figure 2 of \citep {vaswani2017attention}. Used with kind permission of Ashish Vaswani. \relax }}{598}{figure.caption.329} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.8}{\ignorespaces Recurrent layer. \relax }}{599}{figure.caption.330} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.9}{\ignorespaces Explicit vs implicit layers. \relax }}{600}{figure.caption.331} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.10}{\ignorespaces A feedforward neural network with $D$ inputs, $K_1$ hidden units in layer 1, $K_2$ hidden units in layer 2, and $C$ outputs. $w_{jk}^{(l)}$ is the weight of the connection from node $j$ in layer $l-1$ to node $k$ in layer $l$. \relax }}{601}{figure.caption.332} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.11}{\ignorespaces One of the first CNNs ever created, for classifying MNIST images. From Figure 3 of \citep {LeCun89}. Implemented in \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#lecun1989.ipynb}{lecun1989.ipynb}. \relax }}{602}{figure.caption.333} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.12}{\ignorespaces Illustration of a recurrent neural network (RNN). (a) With self-loop. (b) Unrolled in time. \relax }}{602}{figure.caption.334} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.13}{\ignorespaces Visualizing the difference between an RNN and a transformer. From \citep {Joshi2020}. Used with kind permission of Chaitanya Joshi. \relax }}{603}{figure.caption.335} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.14}{\ignorespaces High level structure of the encoder-decoder transformer architecture. From \url {https://jalammar.github.io/illustrated-transformer/}. Used with kind permission of Jay Alammar. \relax }}{603}{figure.caption.336} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.15}{\ignorespaces The encoder block of a transformer for two input tokens. From \url {https://jalammar.github.io/illustrated-transformer/}. Used with kind permission of Jay Alammar. \relax }}{604}{figure.caption.337} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.16}{\ignorespaces A transformer model where we use 2 encoder blocks and 2 decoder blocks. (The second decoder block is not expanded.) We assume there are 2 input and 2 output tokens. 
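The attention computation of Figures 16.6--16.7 reduces to a single matrix expression, $\mathrm {softmax}(\mathbf {Q} \mathbf {K}^T / \sqrt {d_k}) \mathbf {V}$. A minimal NumPy sketch (not the library code used in the figures), assuming the queries, keys and values are stacked row-wise:
\begin{verbatim}
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attn(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 2, 64))   # n = 2 tokens, d_k = 64
print(attention(Q, K, V).shape)         # (2, 64)
\end{verbatim}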
From \url {https://jalammar.github.io/illustrated-transformer/}. Used with kind permission of Jay Alammar. \relax }}{605}{figure.caption.338} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.17}{\ignorespaces Left: Illustration of a $5 \times 5$ image, where each pixel is either off (light yellow) or on (dark yellow). Each non-border pixel has 8 nearest neighbors. We highlight the node at location (2,1), where the top-left is (0,0). Middle: The corresponding adjacency matrix, which is sparse and banded. Right: Visualization of the graph structure. Dark nodes correspond to pixels that are on, light nodes correspond to pixels that are off. Dark edges correspond to the neighbors of the node at (2,1). From \citep {Sanchez-lengeling2021}. Used with kind permission of Benjamin Sanchez-Lengeling. \relax }}{606}{figure.caption.339} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.18}{\ignorespaces A simple graph where each node has 2 types (0=light yellow, 1=dark yellow), each edge has 2 types (1=gray, 2=blue), and the global feature vector is a constant (0=red). We represent the topology using an adjacency list. From \citep {Sanchez-lengeling2021}. Used with kind permission of Benjamin Sanchez-Lengeling. \relax }}{606}{figure.caption.340} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.19}{\ignorespaces A basic GNN layer. We update the embedding vectors $\mathbf {U}$, $\mathbf {V}$ and $\mathbf {E}$ using the global, node and edge functions $f$. From \citep {Sanchez-lengeling2021}. Used with kind permission of Benjamin Sanchez-Lengeling. \relax }}{607}{figure.caption.341} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.20}{\ignorespaces Aggregating edge information into two different nodes. From \citep {Sanchez-lengeling2021}. Used with kind permission of Benjamin Sanchez-Lengeling. \relax }}{607}{figure.caption.342} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.21}{\ignorespaces An end-to-end GNN classifier. From \citep {Sanchez-lengeling2021}. Used with kind permission of Benjamin Sanchez-Lengeling. \relax }}{608}{figure.caption.343} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.22}{\ignorespaces Message passing in one layer of a GNN. First the global node $U_n$ and the local nodes $V_n$ send messages to the edges $E_n$, which get updated to give $E_{n+1}$. Then the nodes get updated to give $V_{n+1}$. Finally the global node gets updated to give $U_{n+1}$. From \citep {Sanchez-lengeling2021}. Used with kind permission of Benjamin Sanchez-Lengeling. \relax }}{609}{figure.caption.344} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.23}{\ignorespaces Left: a multigraph can have different edge types. Right: a hypergraph can have edges which connect multiple nodes. From \citep {Sanchez-lengeling2021}. Used with kind permission of Benjamin Sanchez-Lengeling. \relax }}{609}{figure.caption.345} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.24}{\ignorespaces (a) A graph neural network aggregation block. Here ${\bm {h}}_i^{\ell }$ is the hidden representation for node $i$ in layer $\ell $, and $\mathcal {N}(i)$ are $i$'s neighbors. The output is given by ${\bm {h}}_i^{\ell +1} = \ensuremath {\mathrm {ReLU}}\xspace (\mathbf {U}^{\ell } {\bm {h}}_i^{\ell } + \DOTSB \sum@ \slimits@ _{j \in \mathcal {N}(i)} \mathbf {V}^{\ell } {\bm {h}}^{\ell }_j)$. (b) A transformer encoder block.
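The aggregation update written out in the caption of Figure 16.24(a) can be coded directly. A minimal NumPy sketch, assuming dense weight matrices $\mathbf {U}$, $\mathbf {V}$ and an adjacency-list description of the neighborhoods (not the library code used for the figure):
\begin{verbatim}
import numpy as np

def gnn_layer(H, nbrs, U, V):
    """One aggregation step: h_i' = ReLU(U h_i + sum_{j in N(i)} V h_j)."""
    H_new = np.zeros_like(H)
    for i in range(len(H)):
        msg = np.zeros(U.shape[0])
        for j in nbrs[i]:
            msg += V @ H[j]
        H_new[i] = np.maximum(U @ H[i] + msg, 0.0)   # ReLU
    return H_new

rng = np.random.default_rng(0)
d = 4
H = rng.normal(size=(3, d))                # 3 nodes with d-dim embeddings
nbrs = {0: [1], 1: [0, 2], 2: [1]}         # a path graph 0 - 1 - 2
U, V = rng.normal(size=(d, d)), rng.normal(size=(d, d))
print(gnn_layer(H, nbrs, U, V))
\end{verbatim}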
Here ${\bm {h}}_i^{\ell }$ is the hidden representation for word $i$ in layer $\ell $, and $\mathcal {S}$ is the set of all words in the sentence. The output is given by ${\bm {h}}_i^{\ell +1} = \mathrm {Attn}(\mathbf {Q}^{\ell } {\bm {h}}_i^{\ell }, \{ \mathbf {K}^{\ell } {\bm {h}}_j^{\ell }, \mathbf {V}^{\ell } {\bm {h}}_j^{\ell } \})$. From \citep {Joshi2020}. Used with kind permission of Chaitanya Joshi. \relax }}{610}{figure.caption.346} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {16.25}{\ignorespaces Graph connectivity for different types of transformer. Top left: in a vanilla Transformer, every node is connected to every other node. Top right: in Transformer-XL, nodes are grouped into blocks. Bottom: in BPT, we use a binary partitioning of the graph to create virtual node clusters. From \url {https://graphdeeplearning.github.io/post/transformers-are-gnns/}. Used with kind permission of Chaitanya Joshi. \relax }}{611}{figure.caption.347} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.1}{\ignorespaces The effects of changing the hyperparameters on an MLP with one hidden layer. (a) Random functions sampled from a Gaussian prior with hyperparameters $\alpha _{1}=5$, $\beta _1=1$, $\alpha _2=1$, $\beta _2=1$. (b) Increasing $\alpha _{1}$ by a factor of 5. (c) Increasing $\beta _{1}$ by a factor of 5. (d) Increasing $\alpha _{2}$ by a factor of 5. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#mlp\_priors\_demo.ipynb}{mlp\_priors\_demo.ipynb}. \relax }}{615}{figure.caption.348} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.2}{\ignorespaces Illustration of an MLP with (a) a point estimate for each weight, (b) a marginal distribution for each weight, corresponding to a fully factored posterior approximation. From Figure 1 of \citep {Blundell2015}. Used with kind permission of Charles Blundell. \relax }}{619}{figure.caption.349} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.3}{\ignorespaces Illustration of an MLP fit to the two-moons dataset using HMC. (a) Posterior mean. (b) Posterior standard deviation. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#bnn\_mlp\_2d\_hmc.ipynb}{bnn\_mlp\_2d\_hmc.ipynb}. \relax }}{622}{figure.caption.350} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.4}{\ignorespaces Illustration of stochastic weight averaging (SWA). The three crosses represent different SGD solutions. The star in the middle is the average of these parameter values. From Figure 1 of \citep {Izmailov2018}. Used with kind permission of Andrew Wilson. \relax }}{623}{figure.caption.351} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.5}{\ignorespaces Cartoon illustration of the NLL as it varies across the parameter space. Subspace methods (red) model the local neighborhood around a local mode, whereas ensemble methods (blue) approximate the posterior using a set of distinct modes. From Figure 1 of \citep {deepEnsembleLandscape}. Used with kind permission of Balaji Lakshminarayanan. \relax }}{623}{figure.caption.352} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.6}{\ignorespaces Deep ensemble with random priors. (a) Individual predictions from each member. Blue is the fixed random prior function, orange is the trainable function, green is the combination of the two.
(b) Overall prediction from the ensemble, for increasingly large values of $\beta $. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#randomized\_priors.ipynb}{randomized\_priors.ipynb}. \relax }}{625}{figure.caption.353} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.7}{\ignorespaces Comparison of approximate Bayesian inference methods for a 1d regression problem. (a) The ``true'' predictive distribution obtained by combining $200$ HMC chains. (b) Deep ensembles predictive distribution using $50$ independently trained networks. (c) Predictive distribution for factorized variational inference (VI). (d) Convergence of the predictive distributions for deep ensembles and variational inference as a function of the number of samples in terms of the average Wasserstein distance between the marginals in the range of input positions. The multi-basin deep ensembles approach provides a better approximation of the Bayesian predictive distribution than the conventional single-basin VI approach, which is overconfident between data clusters. The top panels show the Wasserstein distance between the true predictive distribution and the deep ensemble and VI approximations, as a function of inputs $x$. From Figure 4 of \citep {Wilson2020prob}. Used with kind permission of Andrew Wilson. \relax }}{626}{figure.caption.354} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.8}{\ignorespaces Illustration of batch ensemble with 2 ensemble members. From Figure 2 of \citep {batchEnsemble2020}. Used with kind permission of Paul Vicol. \relax }}{627}{figure.caption.355} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.9}{\ignorespaces Flat vs sharp minima. From Figures 1 and 2 of \citep {Hochreiter1997}. Used with kind permission of J\IeC {\"u}rgen Schmidhuber. \relax }}{629}{figure.caption.356} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.10}{\ignorespaces Diversity of high performing functions sampled from the posterior. Top row: we show predictions on the 1d input domain for 4 different functions. We see that they extrapolate in different ways outside of the support of the data. Bottom row: we show a 2d subspace spanning two distinct modes (MAP estimates), and connected by a low-loss curved path computed as in \citep {Garipov2018}. From Figure 8 of \citep {Wilson2020prob}. Used with kind permission of Andrew Wilson. \relax }}{630}{figure.caption.357} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.11}{\ignorespaces Left: Effective dimensionality as a function of model width and depth for a CNN on CIFAR-$100$. Center: Test loss as a function of model width and depth. Right: Train loss as a function of model width and depth. Yellow level curves represent equal parameter counts ($1e5$, $2e5$, $4e5$, $1.6e6$). The green curve separates models with near-zero training loss. Effective dimensionality serves as a good proxy for generalization for models with low train loss. We see wide but shallow models overfit, providing low train loss, but high test loss and high effective dimensionality. For models with the same train loss, lower effective dimensionality can be viewed as a better compression of the data at the same fidelity. Thus depth provides a mechanism for compression, which leads to better generalization. From Figure 2 of \citep {Maddox2020}. Used with kind permission of Andrew Wilson. 
\relax }}{631}{figure.caption.358} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.12}{\ignorespaces Illustration of the behavior of different kinds of model families and the prior distribution they induce over datasets. (a) The purple model is a simple linear model that has small support, and can only represent a few kinds of datasets. The pink model is an unstructured MLP: this has support over a large range of datasets but with a fairly uninformative (broad) prior. Finally, the green model is a CNN; this also has support over a large range of datasets, but its prior mass is concentrated on structured, image-like datasets. (b) The posterior for the green model (CNN) rapidly collapses to the true model. (c) The posterior for the purple model (linear) also rapidly collapses, but to a solution which cannot represent the true model. (d) The posterior for the pink model (MLP) collapses very slowly (as a function of dataset size). From Figure 2 of \citep {Wilson2020prob}. Used with kind permission of Andrew Wilson. \relax }}{632}{figure.caption.359} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.13}{\ignorespaces Illustration of the double descent risk curve for an RFF model on a subset of MNIST. There are $N=10^4$ examples, so interpolation occurs when the x-axis (which is the number of features divided by 1000) equals 10. Top row: test error. Middle row: norm of ${\bm {w}}^*$. Bottom row: train error. From Figure 2 of \citep {Belkin2019PNAS}. Used with kind permission of Mikhail Belkin. \relax }}{633}{figure.caption.360} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.14}{\ignorespaces Two shallow MLPs fit to $N=10$ data points (shown in red) using one layer of $K=40$ or $K=4000$ random \ensuremath {\mathrm {ReLU}}\xspace features, i.e., the model has the form $h(x) = w_0 + \DOTSB \sum@ \slimits@ _{k=1}^K w_k \phi (x; {\bm {v}}_k)$, where $\phi (x;{\bm {v}}) = \qopname \relax m{max}(v_0 + v_1 x, 0)$, ${\bm {v}}_k$ are chosen randomly and ${\bm {w}}$ is optimized. The piecewise linear nature of the fitted function is visually apparent when $K=40$, but the function is smoother with larger $K$. The scaled norm of the parameter vectors, $\frac {||{\bm {w}}||_2}{\sqrt {K}}$, is 695 and 159 for $K=40$ and $K=4000$, reflecting the fact that the latter model is simpler. (Similar results hold when we optimize both $\mathbf {V}$ and ${\bm {w}}$, but it is harder to implement the minimum norm solution in this case.) From Figure 3 of \citep {Belkin2019PNAS}. Used with kind permission of Mikhail Belkin. \relax }}{634}{figure.caption.361} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.15}{\ignorespaces Curves for training risk (dashed line) and test risk (solid line) as a function of the size of the hypothesis space $\mathcal {H}$ for the unknown function $f$. (a) Classical U-shaped curve in the under-parameterized regime. (b) Illustration of the double descent risk curve that arises when we consider models that may be over-parameterized. From Figure 1 of \citep {Belkin2019PNAS}. Used with kind permission of Mikhail Belkin. \relax }}{634}{figure.caption.362} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.16}{\ignorespaces A ResNet-18 with layers of varying width trained on CIFAR-$100$ with $20\%$ label corruption. Standard SGD training leads to significant double descent behavior. SWAG \citep {Maddox2019}, which performs a single basin Bayesian model average, helps mitigate double descent.
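The random-feature model described in the caption of Figure 17.14 takes only a few lines: draw random ReLU features, then solve for the minimum-norm least-squares weights. A minimal sketch on toy data (not the authors' code, so the reported norms will differ):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
N, K = 10, 4000
x = np.sort(rng.uniform(-1, 1, size=N))
y = np.sin(3 * x) + 0.1 * rng.normal(size=N)       # toy targets

v0, v1 = rng.normal(size=K), rng.normal(size=K)    # random ReLU features
Phi = np.maximum(v0 + np.outer(x, v1), 0.0)        # (N, K) design matrix
Phi = np.hstack([np.ones((N, 1)), Phi])            # add the offset w_0
w = np.linalg.pinv(Phi) @ y                        # minimum-norm least squares
print(np.linalg.norm(w[1:]) / np.sqrt(K))          # scaled norm of the weights
\end{verbatim}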
Multi-SWAG \citep {Wilson2020prob}, which performs exhaustive multi-basin marginalization, entirely alleviates double descent. Multi-SWAG also provides a dramatic performance improvement over classical SGD training, even for the accuracy of point predictions. From Figure 8 of \citep {Wilson2020prob}. Used with kind permission of Andrew Wilson. \relax }}{635}{figure.caption.363} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.17}{\ignorespaces A resolution of double descent. We replicate the \emph {double descent} behaviour of deep neural networks using a ResNet-18 on CIFAR-$100$, where the train loss decreases to zero for sufficiently wide models, while the test loss decreases, then increases, and then decreases again. Unlike model width, the \emph {effective dimensionality} computed from the eigenspectrum of the Hessian of the loss on \emph {training data alone} follows the test loss in the overparameterized regime, acting as a much better proxy for generalization than naive parameter counting. From Figure 1 of \citep {Maddox2020}. Used with kind permission of Andrew Wilson. \relax }}{636}{figure.caption.364} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.18}{\ignorespaces \textbf {Bayesian neural networks under covariate shift}. \textbf {(a)}: Performance of a ResNet-20 on the \textit {pixelate} corruption in CIFAR-10-C. For the highest degree of corruption, a Bayesian model average underperforms a MAP solution by 25\% accuracy (44\% vs. 69\%). See \citet {Izmailov2021hmc} for details. \textbf {(b)}: Visualization of the weights in the first layer of a Bayesian fully-connected network on MNIST sampled via HMC. \textbf {(c)}: The corresponding MAP weights. We visualize the weights connecting the input pixels to a neuron in the hidden layer as a $28 \times 28$ image, where each weight is shown in the location of the input pixel it interacts with. This figure is adapted from Figure 1 of \citet {Izmailov2021bma}. \relax }}{637}{figure.caption.365} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.19}{\ignorespaces \textbf {Evaluation on CIFAR-10-C.} Accuracy and log-likelihood of HMC, SGD, deep ensembles, SGLD and MFVI on a distribution shift task, where the CIFAR-10 test set is corrupted in 16 different ways at various intensities on the scale of 1 to 5. We use the ResNet-20-FRN architecture. Boxes capture the quartiles of performance over each corruption, with the whiskers indicating the minimum and maximum. HMC is surprisingly the worst of the considered methods: even a single SGD solution provides better OOD robustness. This figure is adapted from Figure 7 of \citet {Izmailov2021hmc}. \relax }}{637}{figure.caption.366} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.20}{\ignorespaces \textbf {Bayesian inference samples weights along low-variance principal components from the prior, while MAP sets these weights to zero.} \textbf {(a)}: The distribution (mean $\pm $ 2 std) of projections of the weights of the first layer on the directions corresponding to the PCA components of the data for BNN samples and MAP solution using MLP and CNN architectures with different prior scales. In each case, MAP sets the weights along low-variance components to zero, while BNN samples them from the prior.
\textbf {(b)}: Accuracy of BNN and MAP solutions on the MNIST test set with Gaussian noise applied along the 50 highest and 50 lowest variance PCA components of the train data (left and right respectively). MAP is very robust to noise along low-variance PCA directions, while BMA is not; the two methods are similarly robust along the highest-variance PCA components. This figure is adapted from Figure 4 of \citet {Izmailov2021bma}. \relax }}{639}{figure.caption.367} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.21}{\ignorespaces Sequential Bayesian inference for the parameters of an MLP applied to a nonlinear regression problem using the extended Kalman filter. We show results after seeing the first 10, 20, 30 and 40 observations. (For a video of this, see \url {https://bit.ly/3wXnWaM}.) Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#ekf\_mlp.ipynb}{ekf\_mlp.ipynb}. \relax }}{641}{figure.caption.368} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.22}{\ignorespaces (a) Two moons synthetic dataset. (b) Multi-task version, where we rotate the data to create 18 related tasks (groups). Each dataset has 50 training and 50 test points. Here we show the first 4 tasks. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#bnn\_hierarchical.ipynb}{bnn\_hierarchical.ipynb}. \relax }}{645}{figure.caption.369} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.23}{\ignorespaces (a) Illustration of a hierarchical Bayesian discriminative model with $T$ groups or tasks. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#hbayes\_figures2.ipynb}{hbayes\_figures2.ipynb}. (b) Zoom-in on the dependency structure, for the case where $p(y|{\bm {x}},{\bm {\theta }}_t)$ is an MLP with 2 hidden layers, with sizes $L_1$ and $L_2$. The input is $D$-dimensional, and the output is the scalar logit for the binary output. Used with kind permission of Aleyna Kara. \relax }}{646}{figure.caption.370} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.24}{\ignorespaces Results of fitting separate MLPs on each dataset. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#bnn\_hierarchical.ipynb}{bnn\_hierarchical.ipynb}. \relax }}{647}{figure.caption.371} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {17.25}{\ignorespaces Results of fitting hierarchical MLP on all datasets jointly. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#bnn\_hierarchical.ipynb}{bnn\_hierarchical.ipynb}. \relax }}{647}{figure.caption.372} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.1}{\ignorespaces A Gaussian process for 2 training points, ${\bm {x}}_1$ and ${\bm {x}}_2$, and 1 test point, ${\bm {x}}_{*}$, represented as a graphical model for $p({\bm {y}},{\bm {f}}_{X}|\mathbf {X}) = \mathcal {N}({\bm {f}}_{X}|m(\mathbf {X}), \mathcal {K}(\mathbf {X})) \DOTSB \prod@ \slimits@ _i p(y_i|f_i)$. The hidden nodes $f_i=f({\bm {x}}_i)$ represent the value of the function at each of the data points. These hidden nodes are fully interconnected by undirected edges, forming a Gaussian graphical model; the edge strengths represent the covariance terms $\Sigma _{ij}=\mathcal {K}({\bm {x}}_i,{\bm {x}}_j)$.
If the test point ${\bm {x}}_{*}$ is similar to the training points ${\bm {x}}_1$ and ${\bm {x}}_2$, then the value of the hidden function $f_{*}$ will be similar to $f_1$ and $f_2$, and hence the predicted output $y_*$ will be similar to the training values $y_1$ and $y_2$. \relax }}{650}{figure.caption.373} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.2}{\ignorespaces Function samples from a GP with an ARD kernel. (a) $\ell _1=\ell _2=1$. Both dimensions contribute to the response. (b) $\ell _1=1$, $\ell _2=5$. The second dimension is essentially ignored. Adapted from Figure 5.1 of \citep {Rasmussen06}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gpr\_demo\_ard.ipynb}{gpr\_demo\_ard.ipynb}. \relax }}{653}{figure.caption.376} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.3}{\ignorespaces Functions sampled from a GP with a Matern kernel. (a) $\nu =5/2$. (b) $\nu =1/2$. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gp\_kernel\_plot.ipynb}{gp\_kernel\_plot.ipynb}. \relax }}{654}{figure.caption.378} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.4}{\ignorespaces Functions sampled from a GP using various stationary periodic kernels. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gp\_kernel\_plot.ipynb}{gp\_kernel\_plot.ipynb}. \relax }}{655}{figure.caption.380} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.5}{\ignorespaces Examples of 1d structures obtained by multiplying elementary kernels. Top row shows $\mathcal {K}(x,x'=1)$. Bottom row shows some functions sampled from $GP(f|0,\mathcal {K})$. Adapted from Figure 2.2 of \citep {duvenaud-thesis-2014}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#combining\_kernels\_by\_multiplication.ipynb}{combining\_kernels\_by\_multiplication.ipynb}. \relax }}{657}{figure.caption.382} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.6}{\ignorespaces Examples of 1d structures obtained by summing elementary kernels. Top row shows $\mathcal {K}(x,x'=1)$. Bottom row shows some functions sampled from $GP(f|0,\mathcal {K})$. Adapted from Figure 2.2 of \citep {duvenaud-thesis-2014}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#combining\_kernels\_by\_summation.ipynb}{combining\_kernels\_by\_summation.ipynb}. \relax }}{657}{figure.caption.383} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.7}{\ignorespaces Left: some functions sampled from a GP prior with RBF kernel. Middle: some samples from a GP posterior, after conditioning on 5 noise-free observations. Right: some samples from a GP posterior, after conditioning on 5 noisy observations. The shaded area represents $\mathbb {E}\left [{f({\bm {x}})}\right ] \pm 2 \sqrt {\mathbb {V}[f({\bm {x}})]}$. Adapted from Figure 2.2 of \citep {Rasmussen06}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gpr\_demo\_noise\_free.ipynb}{gpr\_demo\_noise\_free.ipynb}. \relax }}{661}{figure.caption.384} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.8}{\ignorespaces Kernel ridge regression (KRR) compared to Gaussian process regression (GPR) using the same kernel. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#krr\_vs\_gpr.ipynb}{krr\_vs\_gpr.ipynb}. 
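The GP posteriors of Figures 18.7--18.8 follow from the standard Gaussian conditioning formulas. A minimal NumPy sketch with an RBF kernel and noisy observations (illustrative only, not the notebook code):
\begin{verbatim}
import numpy as np

def rbf(A, B, ell=1.0, sf=1.0):
    d2 = (A[:, None] - B[None, :]) ** 2          # 1d inputs
    return sf**2 * np.exp(-0.5 * d2 / ell**2)

def gp_posterior(Xtr, ytr, Xte, sigma_y=0.1):
    K = rbf(Xtr, Xtr) + sigma_y**2 * np.eye(len(Xtr))
    Ks, Kss = rbf(Xtr, Xte), rbf(Xte, Xte)
    alpha = np.linalg.solve(K, ytr)
    mu = Ks.T @ alpha                            # posterior mean
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)    # posterior covariance
    return mu, cov

Xtr = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
ytr = np.sin(Xtr)
mu, cov = gp_posterior(Xtr, ytr, np.linspace(-5, 5, 100))
print(mu.shape, np.sqrt(np.diag(cov)).max())
\end{verbatim}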
\relax }}{668}{figure.caption.385} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.9}{\ignorespaces Contours of the posterior predictive probability for a binary classifier generated by a GP with an SE kernel. (a) Manual kernel parameters: short length scale, $\ell =0.5$, variance $3.16^2 \approx 9.98$. (b) Learned kernel parameters: long length scale, $\ell =1.19$, variance $4.79^2 \approx 22.9$. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gpc\_demo\_2d.ipynb}{gpc\_demo\_2d.ipynb}. \relax }}{670}{figure.caption.388} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.10}{\ignorespaces Poisson regression with a GP. (a) Observed data (black dots) and true log rate function (yellow line). (b) Posterior predictive distribution (shading shows 1 and 2 $\sigma $ bands) from MCMC. (c) Posterior predictive distribution from SVI. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gp\_poisson\_1d.ipynb}{gp\_poisson\_1d.ipynb}. \relax }}{671}{figure.caption.389} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.11}{\ignorespaces We show the relative risk of heart disease in Finland using a Poisson GP fit to 911 data points. Left: posterior mean. Right: posterior variance. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gp\_spatial\_demo.ipynb}{gp\_spatial\_demo.ipynb}. \relax }}{672}{figure.caption.390} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.12}{\ignorespaces Illustration of the graphical model for a GP on $n$ observations, ${\bm {f}}_{1:n}$, and one test case, $f_{*}$, with inducing variables ${\bm {u}}$. The thick lines indicate that all variables are fully interconnected. The observations $y_i$ (not shown) are locally connected to each $f_i$. (a) No approximations are made. (b) We assume $f_{*}$ is conditionally independent of ${\bm {f}}_{X}$ given ${\bm {u}}$. From Figure 1 of \citep {Quinonero05}. Used with kind permission of Joaquin Quinonero-Candela. \relax }}{675}{figure.caption.392} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.13}{\ignorespaces Illustration of how to choose inducing points from a discrete input domain (here DNA sequences of length 4) to maximize the log marginal likelihood. From Figure 1 of \citep {Fortuin2018}. Used with kind permission of Vincent Fortuin. \relax }}{678}{figure.caption.393} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.14}{\ignorespaces RMSE on test set as a function of training set size using a GP with Matern $3/2$ kernel with shared lengthscale across all dimensions. Solid lines: exact inference. Dashed blue: SGPR method (closed-form batch solution to the Gaussian variational approximation) of \cref {sec:SGPR} with $M=512$ inducing points. Dashed orange: SVGP method (SGD on Gaussian variational approximation) of \cref {sec:SVIGP} with $M=1024$ inducing points. Number of input dimensions: KEGGU $D=27$, 3DRoad $D=3$, Song $D=90$. From Figure 4 of \citep {Wang2019nips}. Used with kind permission of Andrew Wilson. \relax }}{682}{figure.caption.394} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.15}{\ignorespaces Some 1d GPs with RBF kernels but different hyper-parameters fit to 20 noisy observations. The hyper-parameters $(\ell ,\sigma _f,\sigma _y)$ are as follows: (a) (1,1,0.1); (b) (3.0, 1.16, 0.89). Adapted from Figure 2.5 of \citep {Rasmussen06}.
Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gpr\_demo\_change\_hparams.ipynb}{gpr\_demo\_change\_hparams.ipynb}. \relax }}{684}{figure.caption.395} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.16}{\ignorespaces Illustration of local minima in the marginal likelihood surface. (a) We plot the log marginal likelihood vs $\sigma _y^2$ and $\ell $, for fixed $\sigma _f^2=1$, using the 7 data points shown in panels b and c. (b) The function corresponding to the lower left local minimum, $(\ell ,\sigma ^2_y) \approx (1,0.2)$. This is quite ``wiggly'' and has low noise. (c) The function corresponding to the top right local minimum, $(\ell ,\sigma ^2_y) \approx (10,0.8)$. This is quite smooth and has high noise. The data was generated using $(\ell ,\sigma ^2_y)=(1,0.1)$. Adapted from Figure 5.5 of \citep {Rasmussen06}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gpr\_demo\_marglik.ipynb}{gpr\_demo\_marglik.ipynb}. \relax }}{686}{figure.caption.396} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.17}{\ignorespaces Three different approximations to the posterior over hyper-parameters: grid-based, Monte Carlo, and central composite design. From Figure 3.2 of \citep {Vanhatalo10}. Used with kind permission of Jarno Vanhatalo. \relax }}{687}{figure.caption.397} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.18}{\ignorespaces Difference between estimation and inference for kernel hyper-parameters. (a) Empirical Bayes approach based on optimization. We plot the posterior predicted mean given a plug-in estimate, $\mathbb {E}\left [{f(x)|{\mathcal {D}},\cc@accent {"705E}{{\bm {\theta }}}}\right ]$. (b) Bayesian approach based on HMC. We plot the posterior predicted mean, marginalizing over hyper-parameters, $\mathbb {E}\left [{f(x)|{\mathcal {D}}}\right ]$. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gp\_kernel\_opt.ipynb}{gp\_kernel\_opt.ipynb}. \relax }}{688}{figure.caption.398} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.19}{\ignorespaces Comparison of different additive model classes for a 4d function. Circles represent different interaction terms, ranging from first-order to fourth-order. Color shades represent different weighting terms. Left: hierarchical kernel learning uses a nested hierarchy of terms. Right: additive GPs use a weighted sum of additive kernels of different orders. From Figure 6.2 of \citep {duvenaud-thesis-2014}. Used with kind permission of David Duvenaud. \relax }}{689}{figure.caption.399} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.20}{\ignorespaces Predictive distribution of each term in a GP-GAM model applied to a dataset with 8 continuous inputs and 1 continuous output, representing the strength of some concrete. From Figure 2.7 of \citep {duvenaud-thesis-2014}. Used with kind permission of David Duvenaud. \relax }}{690}{figure.caption.400} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.21}{\ignorespaces Example of a search tree over kernel expressions. From Figure 3.2 of \citep {duvenaud-thesis-2014}. Used with kind permission of David Duvenaud. \relax }}{690}{figure.caption.401} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.22}{\ignorespaces Top row: airline dataset and posterior distribution of the model discovered after a search of depth 10.
Subsequent rows: predictions of the individual components. From Figure 3.5 of \citep {duvenaud-thesis-2014}, based on \citep {Lloyd2014}. Used with kind permission of David Duvenaud. \relax }}{691}{figure.caption.402} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.23}{\ignorespaces Illustration of a GP with a spectral mixture kernel in 1d. (a) Learned vs true kernel. (b) Predictions using learned kernel. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gp\_spectral\_mixture.ipynb}{gp\_spectral\_mixture.ipynb}. \relax }}{692}{figure.caption.403} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.24}{\ignorespaces Extrapolations (point predictions and 95\% credible set) on CO$_2$ and airline datasets using Gaussian processes with Matern, rational quadratic, periodic, RBF (SE), and spectral mixture kernels, each with hyperparameters learned using empirical Bayes. From a figure in \citep {Wilson2014thesis}. \relax }}{693}{figure.caption.404} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.25}{\ignorespaces Deep Kernel Learning: A Gaussian process with a deep kernel maps $D$ dimensional inputs $\bm {x}$ through $L$ parametric hidden layers followed by a hidden layer with an infinite number of basis functions, with base kernel hyperparameters $\bm {\theta }$. Overall, a Gaussian process with a deep kernel produces a probabilistic mapping with an infinite number of adaptive basis functions parametrized by $\bm {\gamma } = \{\bm {w},\bm {\theta }\}$. All parameters $\bm {\gamma }$ are learned through the marginal likelihood of the Gaussian process. From Figure 1 of \citep {Wilson2016aistats}. \relax }}{693}{figure.caption.405} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.26}{\ignorespaces Modeling a discontinuous function with (a) a GP with a ``shallow'' Matern $\frac {3}{2}$ kernel, and (b) a GP with a ``deep'' MLP + Matern kernel. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gp\_deep\_kernel\_learning.ipynb}{gp\_deep\_kernel\_learning.ipynb}. \relax }}{694}{figure.caption.406} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.27}{\ignorespaces {\bf Left}: The learned covariance matrix of a deep kernel with spectral mixture base kernel on a set of test cases for the {\bf Olivetti faces dataset}, where the test samples are ordered according to the {\it orientations} of the input faces. {\bf Middle}: The respective covariance matrix using a deep kernel with RBF base kernel. {\bf Right}: The respective covariance matrix using a standard RBF kernel. From Figure 5 of \citep {Wilson2016aistats}. \relax }}{695}{figure.caption.407} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.28}{\ignorespaces Illustration of the neural net kernel, corresponding to $\mathcal {K}(x,x') = \mathbb {E}\left [{\text {erf}(w_0 +w_1 x) \text {erf}(w_0 + w_1 x')}\right ]$, where $w_0 \sim \mathcal {N}(0,\sigma _0^2)$ and $w_1 \sim \mathcal {N}(0,\sigma _1^2)$. (a) We plot $\mathcal {K}(x,x')$ for $\sigma _1=1$. (b) Same for $\sigma _1=50$. In both cases, $\sigma _0=1$. (c-d) We plot samples from $f \sim \mathrm {GP}(0,\mathcal {K})$. From Figure 2 of \citep {Mohammadi2019}. Used with kind permission of Hossein Mohammadi.
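The neural-net kernel quoted in the caption of Figure 18.28 is easy to approximate by Monte Carlo over the weights. A minimal sketch, assuming independent Gaussian $w_0$ and $w_1$ as stated in the caption (illustrative only):
\begin{verbatim}
import numpy as np
from scipy.special import erf

def nn_kernel_mc(x, xp, sigma0=1.0, sigma1=1.0, n_mc=100_000, seed=0):
    """Monte Carlo estimate of E[ erf(w0 + w1 x) erf(w0 + w1 x') ]."""
    rng = np.random.default_rng(seed)
    w0 = rng.normal(0.0, sigma0, size=n_mc)
    w1 = rng.normal(0.0, sigma1, size=n_mc)
    return np.mean(erf(w0 + w1 * x) * erf(w0 + w1 * xp))

print(nn_kernel_mc(0.5, 0.7, sigma1=1.0))
print(nn_kernel_mc(0.5, 0.7, sigma1=50.0))
\end{verbatim}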
\relax }}{697}{figure.caption.408} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.29}{\ignorespaces Comparison of a parametric fully connected network (FCN) and its GP equivalent. Here there are $N=4$ input examples, each of which is $D_0=3$ dimensional. From Figure 3 of \citep {Novak2019}. Used with kind permission of Roman Novak. \relax }}{698}{figure.caption.409} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.30}{\ignorespaces Graphical model corresponding to a deep GP with 2 layers. The dashed nodes are deterministic functions of their parents, and represent kernel matrices. The shaded nodes are observed, the unshaded nodes are hidden. From Figure 5 of \citep {Pleiss2021}. Used with kind permission of Geoff Pleiss. \relax }}{701}{figure.caption.410} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.31}{\ignorespaces Some data (red points) sampled from a step function fit with (a) a standard GP with RBF kernel and (b) a deep GP with 4 layers of RBF kernels. The solid blue line is the posterior mean. The pink shaded area represents the posterior variance ($\mu (x) \pm 2 \sigma (x)$). The thin blue dots in (b) represent posterior samples. Generated by deepgp\_stepdata. \relax }}{701}{figure.caption.411} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.32}{\ignorespaces Illustration of the functions learned at each layer of the DGP. (a) Input to layer 1. (b) Layer 1 to layer 2. (c) Layer 2 to layer 3. (d) Layer 3 to output. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#deepgp\_stepdata.ipynb}{deepgp\_stepdata.ipynb}. \relax }}{702}{figure.caption.412} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.33}{\ignorespaces (a) Posterior of 2-layer RBF deep GP fit to a noisy step function. Columns represent width of 1, 16, 256 and infinity. (b) Average posterior covariance of the DGP, given by $\mathbb {E}_{{{\bm {f}}_1({\bm {x}}), {\bm {f}}_1({\bm {x}}')|{\bm {y}}}}\left [ {\mathcal {K}_2({\bm {f}}_1({\bm {x}}), {\bm {f}}_1({\bm {x}}'))} \right ]$. As the width increases, the covariance becomes stationary, as shown by the kernel's constant diagonals. From Figure 1 of \citep {Pleiss2021}. Used with kind permission of Geoff Pleiss. \relax }}{704}{figure.caption.413} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {18.34}{\ignorespaces (a) The observed Mauna Loa $\text {CO}_2$ time series. (b) Forecasts from a GP. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gp\_mauna\_loa.ipynb}{gp\_mauna\_loa.ipynb}. \relax }}{706}{figure.caption.414} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.1}{\ignorespaces Effect of Gaussian noise of increasing magnitude on an image classifier. The model is a ResNet-50 CNN trained on ImageNet with added Gaussian noise of standard deviation $\sigma $. From Figure 23 of \citep {Ford2019}. Used with kind permission of Justin Gilmer. \relax }}{710}{figure.caption.415} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.2}{\ignorespaces Illustration of how image classifiers generalize poorly to new environments. (a) In the training data, most cows occur on grassy backgrounds. (b-c) In these test images, the cow occurs ``out of context'', namely on a beach. In (b), the cow is not detected. In (c), it is classified with a generic ``mammal'' label.
Top five labels and their confidences are produced by ClarifAI.com, which is a state-of-the-art commercial vision system. From Figure 1 of \citep {beery2018recognition}. Used with kind permission of Sara Beery. \relax }}{710}{figure.caption.416} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.3}{\ignorespaces Illustration of distribution shift for a 2d binary classification problem. From Figure 1 of \citep {Pesaranghader2018}. Used with kind permission of Ali Pesaranghader. \relax }}{712}{figure.caption.418} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.4}{\ignorespaces (a) Illustration of covariate shift. Light gray represents training distribution, dark gray represents test distribution. We see the test distribution has shifted to the right but the underlying input-output function is constant. (b) Dashed line: fitting a linear model across the full support of $X$. Solid black line: fitting the same model only on parts of input space that have high likelihood under the test distribution. From Figures 1--2 of \citep {Storkey2009}. Used with kind permission of Amos Storkey. \relax }}{712}{figure.caption.419} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.5}{\ignorespaces Causal diagrams for different kinds of dataset selection. (The double-ringed $S$ node is an indicator variable that is set to $S=1$ iff its parents meet some criterion, and only samples where $S=1$ are included.) (a) No selection. (b) Selection on $X$. (c) Selection on $Y$. (d) Selection on $X$ and $Y$, which can cause selection bias. \relax }}{714}{figure.caption.420} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.6}{\ignorespaces Illustration of a two-stage decision problem. First we must decide if the input image is out-of-distribution (OOD) or not. If it is not, we must return the set of class labels that have high probability. From a figure in \citep {Angelopoulos2021}. Used with kind permission of Anastasios Angelopoulos. \relax }}{719}{figure.caption.421} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.7}{\ignorespaces Likelihoods from a Glow normalizing flow model (\cref {sec:glow}) trained on CIFAR10 and evaluated on different test sets. The SVHN street view house numbers dataset has lower visual complexity, and hence higher likelihood. Qualitatively similar results are obtained for other generative models and datasets. From Figure 1 of \citep {Serra2020}. Used with kind permission of Joan Serra. \relax }}{721}{figure.caption.422} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.8}{\ignorespaces Accuracy vs confidence plots for an MLP fit to the MNIST training set, and then evaluated on one batch from the MNIST test set. (a) Plugin approach, computed using SGD. (b) Bayesian approach, computed using 10 samples from SGLD. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#bnn\_mnist\_sgld.ipynb}{bnn\_mnist\_sgld.ipynb}. \relax }}{722}{figure.caption.423} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.9}{\ignorespaces Similar to \cref {fig:accConfMNIST}, except that performance is evaluated on the Fashion MNIST dataset. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#bnn\_mnist\_sgld.ipynb}{bnn\_mnist\_sgld.ipynb}.
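Accuracy-versus-confidence curves such as those in Figures 19.8--19.9 only require thresholding predictions by their maximum predicted probability. A minimal sketch (not the notebook code), where probs (an $N \times C$ array of predictive probabilities) and labels are assumed inputs:
\begin{verbatim}
import numpy as np

def accuracy_vs_confidence(probs, labels, thresholds=np.linspace(0.1, 0.9, 9)):
    """For each threshold t, accuracy on the examples whose maximum
    predicted probability (confidence) is at least t."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    accs = []
    for t in thresholds:
        keep = conf >= t
        accs.append((pred[keep] == labels[keep]).mean() if keep.any() else np.nan)
    return thresholds, np.array(accs)

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=1000)     # stand-in for model outputs
labels = rng.integers(0, 10, size=1000)
print(accuracy_vs_confidence(probs, labels)[1])
\end{verbatim}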
\relax }}{723}{figure.caption.424} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.10}{\ignorespaces Different modeling assumptions for discriminative models of the form $p({\bm {x}}|\boldsymbol {\phi }) p({\bm {y}}|{\bm {x}},{\bm {\theta }})$. (a) Standard assumption that the output distribution ${\bm {\theta }}$ is independent of the input distribution $\boldsymbol {\phi }$. (b) We assume that the output distribution ${\bm {\theta }}$ depends on the input distribution $\boldsymbol {\phi }$ in some way. (c) Same as (b), except we have two input distributions, the training distribution $\boldsymbol {\phi }$ and the shifted test distribution $\cc@accent {"707E}{\boldsymbol {\phi }}$. Adapted from Figures 1--3 of \citep {Zhou2021bayes}. \relax }}{724}{figure.caption.425} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.11}{\ignorespaces Schematic overview of techniques for learning from 1 or more different distributions. Adapted from slide 3 of \citep {Scardapane2021}. \relax }}{725}{figure.caption.426} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.12}{\ignorespaces Illustration of a multi-headed network for multi-task learning. \relax }}{727}{figure.caption.427} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.13}{\ignorespaces Hierarchical Bayesian model for learning from $T$ domains, and then testing on a new distribution. We assume all the parameters of each distribution $\boldsymbol {\phi }_t$ come from a common prior $p(\boldsymbol {\phi }_t|{\bm {\theta }})$, which lets us share information between them. \relax }}{728}{figure.caption.428} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.14}{\ignorespaces Illustration of invariant causal prediction. The hammer symbol represents variables whose distribution is perturbed in the given environment. An invariant predictor must use features $\{X_2,X_4\}$. Considering indirect causes instead of direct ones (e.g. $\{X_2, X_5\}$) or an incomplete set of direct causes (e.g. $\{X_4\}$) may not be sufficient to guarantee invariant prediction. From Figure 1 of \citep {ICP}. Used with kind permission of Jonas Peters. \relax }}{729}{figure.caption.429} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.15}{\ignorespaces Hierarchical Bayesian model for meta-learning. There are $T$ tasks, each of which has a training set ${\mathcal {D}}^t=\{ ({\bm {x}}_n^t,{\bm {y}}_n^t): n=1:N_t\}$ and a test set ${\mathcal {D}}_{\mathrm {test}}^t = \{ (\cc@accent {"707E}{{\bm {x}}}_m^t,\cc@accent {"707E}{{\bm {y}}}_m^t): m=1:M_t\}$. $\boldsymbol {\psi }^t$ are the task-specific parameters, and ${\bm {\theta }}$ are the shared parameters. From Figure 1 of \citep {Gordon2019}. Used with kind permission of Jonathan Gordon. \relax }}{730}{figure.caption.430} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.16}{\ignorespaces An illustration of domain drift. \relax }}{734}{figure.caption.431} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.17}{\ignorespaces An illustration of concept drift. \relax }}{734}{figure.caption.432} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.18}{\ignorespaces An illustration of class incremental learning. Adapted from Figure 1 of \citep {Lesort2021drift}. \relax }}{735}{figure.caption.433} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.19}{\ignorespaces Different kinds of incremental learning.
Adapted from Figure 1 of \citep {Zeno2018}. \relax }}{735}{figure.caption.434} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.20}{\ignorespaces Some failure modes in class incremental learning. We train on task 1 (blue) and evaluate on tasks 1--3 (blue, orange, yellow); we then train on task 2 and evaluate on tasks 1--3; etc. (a) Catastrophic forgetting refers to the phenomenon in which performance on a previous task drops when the model is trained on a new task. (b) Too little plasticity (e.g., due to too much regularization) refers to the phenomenon in which only the first task is learned. Adapted from Figure 2 of \citep {Hadsell2020}. \relax }}{737}{figure.caption.435} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.21}{\ignorespaces What success looks like for class incremental learning. We train on task 1 (blue) and evaluate on tasks 1--3 (blue, orange, yellow); we then train on task 2 and evaluate on tasks 1--3; etc. (a) No forgetting refers to the phenomenon in which performance on previous tasks does not degrade over time. (b) Forward transfer refers to the phenomenon in which training on past tasks improves performance on future tasks beyond what would have been obtained by training from scratch. (c) Backwards transfer refers to the phenomenon in which training on future tasks improves performance on past tasks beyond what would have been obtained by training from scratch. Adapted from Figure 2 of \citep {Hadsell2020}. \relax }}{737}{figure.caption.436} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.22}{\ignorespaces Online learning illustrated as a graphical model. \relax }}{738}{figure.caption.437} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.23}{\ignorespaces Example of an adversarial attack on an image classifier. Left column: the original image, which is correctly classified. Middle column: a small amount of structured noise which is added to the input (the magnitude of the noise is magnified by $10 \times $). Right column: new image, which is confidently misclassified as a ``gibbon'', even though it looks just like the original ``panda'' image. Here $\epsilon =0.007$. From Figure 1 of \citep {Goodfellow2015adversarial}. Used with kind permission of Ian Goodfellow. \relax }}{740}{figure.caption.438} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.24}{\ignorespaces Images that look like random noise but which cause the CNN to confidently predict a specific class. From Figure 1 of \citep {Nguyen2015}. Used with kind permission of Jeff Clune. \relax }}{742}{figure.caption.439} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.25}{\ignorespaces Synthetic images that cause the CNN to confidently predict a specific class. From Figure 1 of \citep {Nguyen2015}. Used with kind permission of Jeff Clune. \relax }}{742}{figure.caption.440} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.26}{\ignorespaces An adversarially modified image to evade spam detectors. The image is constructed from scratch, and does not involve applying a small perturbation to any given image. This is an illustrative example of how large the space of possible adversarial inputs $\Delta $ can be when the attacker has full control over the input. From \citep {biggio2011survey}. Used with kind permission of Battista Biggio.
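Adversarial perturbations of the kind shown in Figure 19.23 are produced by the fast gradient sign method, ${\bm {x}}_{\mathrm {adv}} = {\bm {x}} + \epsilon \, \mathrm {sign}(\nabla _{{\bm {x}}} \mathcal {L})$. A minimal sketch on a linear logistic model, where the gradient is available in closed form (illustrative only, not a reproduction of the figure's attack):
\begin{verbatim}
import numpy as np

def fgsm_linear(x, w, b, y, eps=0.007):
    """FGSM on a binary logistic model p(y=1|x) = sigmoid(w @ x + b).
    The gradient of the NLL wrt x is (p - y) * w."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    grad_x = (p - y) * w
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=784)          # a flattened "image" in [0, 1]
w, b = rng.normal(size=784), 0.0
print(np.abs(fgsm_linear(x, w, b, y=1) - x).max())   # at most eps
\end{verbatim}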
\relax }}{743}{figure.caption.441} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {19.27}{\ignorespaces (a) When the input dimension $n$ is large and the decision boundary is locally linear, even a small error rate in random noise will imply the existence of small adversarial perturbations. Here, $d({\bm {x}}_0, A)$ denotes the distance from a clean input ${\bm {x}}_0$ to an adversarial example (A), while the distance from ${\bm {x}}_0$ to a random sample from $\mathcal {N}(0, \sigma ^2 I)$ (B) will be approximately $\sigma \sqrt {n}$. As $n \rightarrow \infty $, the ratio of $d({\bm {x}}_0, A)$ to $d({\bm {x}}_0, B)$ goes to 0. (b) A 2d slice of the InceptionV3 decision boundary through three points: a clean image (black), an adversarial example (red), and an error in random noise (blue). The adversarial example and the error in noise lie in the same region of the error set, which is misclassified as ``miniature poodle''; this region closely resembles a halfspace, as in (a). Used with kind permission of Justin Gilmer. \relax }}{745}{figure.caption.442} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {20.1}{\ignorespaces Summary of various kinds of deep generative model. Here ${\bm {x}}$ is the observed data, ${\bm {z}}$ is the latent code, and ${\bm {x}}'$ is a sample from the model. AR models do not have a latent code ${\bm {z}}$. For diffusion models and flow models, the size of ${\bm {z}}$ is the same as ${\bm {x}}$. For AR models, $x^d$ is the $d$'th dimension of ${\bm {x}}$. Adapted from Figure 1 of \citep {weng2021diffusion}. \relax }}{750}{figure.caption.444} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {20.2}{\ignorespaces Synthetic faces from a score-based generative model (\cref {sec:scoreGen}). From Figure 12 of \citep {song2020score}. Used with kind permission of Yang Song. \relax }}{751}{figure.caption.445} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {20.3}{\ignorespaces Some $1024 \times 1024$ images generated from text prompts by the Imagen diffusion model (\cref {sec:imagen}). From Figure 1 of \citep {imagen}. Used with kind permission of William Chan. \relax }}{752}{figure.caption.446} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {20.4}{\ignorespaces A nonparametric (Parzen) density estimator in 1d estimated from 6 data points, denoted by x. Top row: uniform kernel. Bottom row: Gaussian kernel. Left column: bandwidth parameter $h=1$. Right column: bandwidth parameter $h=2$. Adapted from \url {http://en.wikipedia.org/wiki/Kernel_density_estimation}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#parzen\_window\_demo2.ipynb}{parzen\_window\_demo2.ipynb}. \relax }}{753}{figure.caption.447} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {20.5}{\ignorespaces Missing data imputation using the mean of each column. \relax }}{754}{figure.caption.448} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {20.6}{\ignorespaces Part of the output of the dynamic topic model when applied to articles from {\em Science}. At the top, we show the top 10 words for the neuroscience topic over time. On the bottom left, we show the probability of three words within this topic over time. On the bottom right, we list paper titles from different years that contained this topic. From Figure 4 of \citep {Blei06dtm}. Used with kind permission of David Blei.
\relax }}{754}{figure.caption.449} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {20.7}{\ignorespaces Interpolation between two images (top row) in the latent space of a VAE. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#vae\_compare\_results.ipynb}{vae\_compare\_results.ipynb}. \relax }}{755}{figure.caption.450} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {20.8}{\ignorespaces Arithmetic in the latent space of a VAE. The first column is an input image, with embedding ${\bm {z}}$. Subsequent columns show the decoding of ${\bm {z}}+ s \boldsymbol {\Delta }$, where $s \in \{-4,-3,-2,-1,0,1,2,3,4\}$ and $\boldsymbol {\Delta }= \overline {{\bm {z}}}^+ - \overline {{\bm {z}}}^-$ is the difference in the average embeddings of images with or without a certain attribute (here, wearing sunglasses). Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#vae\_celebA\_lightning.ipynb}{vae\_celebA\_lightning.ipynb}. \relax }}{756}{figure.caption.451} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {20.9}{\ignorespaces (a) Model samples with good (high) inception score are visually realistic. (b) Model samples with good (low) FID score are visually realistic and diverse. \relax }}{760}{figure.caption.452} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {20.10}{\ignorespaces Illustration of nearest neighbors in feature space: in the top left we have the query sample generated using BigGAN, and the rest of the images are its nearest neighbors from the dataset. The nearest neighbors search is done in the feature space of a pretrained classifier. From Figure 13 of \citep {bigGAN}. Used with kind permission of Andy Brock.\relax }}{762}{figure.caption.453} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.1}{\ignorespaces Schematic illustration of a VAE. From a figure in \citep {Hafner2018tfdistvae}. Used with kind permission of Danijar Hafner. \relax }}{764}{figure.caption.454} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.2}{\ignorespaces Illustration of the reparameterization trick. The objective $f$ depends on the variational parameters $\boldsymbol {\phi }$, the observed data ${\bm {x}}$, and the latent random variable ${\bm {z}}\sim q_{\boldsymbol {\phi }}({\bm {z}}|{\bm {x}})$. On the left, we show the standard form of the computation graph. On the right, we show a reparameterized form, in which we move the stochasticity into the noise source $\boldsymbol {\epsilon }$, and compute ${\bm {z}}$ deterministically, ${\bm {z}}= f(\boldsymbol {\phi },{\bm {x}},\boldsymbol {\epsilon })$. The rest of the graph is deterministic, so we can backpropagate the gradient of the scalar $f$ wrt $\boldsymbol {\phi }$ through ${\bm {z}}$ and into $\boldsymbol {\phi }$. From Figure 2.3 of \citep {Kingma2019vae}. Used with kind permission of Durk Kingma. \relax }}{767}{figure.caption.455} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.3}{\ignorespaces Illustration of image reconstruction. (a) Original images. (b) Variational autoencoder. (c) Deterministic autoencoder. Both models have the same convolutional structure with 256 latent dimensions. Both are trained for 30 epochs with a batch size of 256. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#vae\_compare\_results.ipynb}{vae\_compare\_results.ipynb}. 
\relax }}{771}{figure.caption.457} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.4}{\ignorespaces Illustration of image sampling. (a) Variational autoencoder. (b) Deterministic autoencoder. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#vae\_compare\_results.ipynb}{vae\_compare\_results.ipynb}. \relax }}{772}{figure.caption.458} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.5}{\ignorespaces The maximum likelihood (ML) objective can be viewed as the minimization of $D_{\mathbb {KL}}\left ({p_{{\mathcal {D}}}({\bm {x}})} \mathrel {\delimiter "026B30D } {p_{{\bm {\theta }}}({\bm {x}})}\right )$, while the ELBO objective is the minimization of $D_{\mathbb {KL}}\left ({q_{{\mathcal {D}},\boldsymbol {\phi }}({\bm {x}},{\bm {z}})} \mathrel {\delimiter "026B30D } {p_{{\bm {\theta }}}({\bm {x}},{\bm {z}})}\right )$, which upper bounds $D_{\mathbb {KL}}\left ({q_{{\mathcal {D}}}({\bm {x}})} \mathrel {\delimiter "026B30D } {p_{{\bm {\theta }}}({\bm {x}})}\right )$. From Figure 2.4 of \citep {Kingma2019vae}. Used with kind permission of Durk Kingma. \relax }}{773}{figure.caption.459} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.6}{\ignorespaces Comparison of $\sigma $-VAE (top row), standard VAE (second row), and $\beta $-VAE (remaining rows). The model is a convolutional hierarchical VAE (\cref {sec:HVAE}) with Gaussian prior and Gaussian likelihood. From Figure 2 of \citep {Rybkin2021}. Used with kind permission of Oleh Rybkin. \relax }}{775}{figure.caption.460} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.7}{\ignorespaces Illustration of 2d embedding space for (a) VAE and (b) MMD-VAE fit to MNIST. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#vae\_latent\_space.ipynb}{vae\_latent\_space.ipynb}. \relax }}{779}{figure.caption.461} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.8}{\ignorespaces Illustration of multi-modal VAE. (a) The generative model with $N=2$ modalities. (b) The product of experts (PoE) inference network is derived from $N$ individual Gaussian experts $E_i$. $\mu _0$ and $\sigma _0$ are parameters of the prior. (c) If a modality is missing, we omit its contribution to the posterior. From Figure 1 of \citep {Wu2018vae}. Used with kind permission of Mike Wu. \relax }}{779}{figure.caption.462} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.9}{\ignorespaces Illustration of different VAE variants for handling missing data. From Figure 1 of \citep {Collier2020}. Used with kind permission of Mark Collier. \relax }}{782}{figure.caption.463} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.10}{\ignorespaces Imputing missing pixels given a masked out image using a VAE with an MCAR assumption. From Figure 2 of \citep {Collier2020}. Used with kind permission of Mark Collier. \relax }}{783}{figure.caption.464} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.11}{\ignorespaces Illustration of a VAE with a bidirectional RNN encoder and a unidirectional RNN decoder. The output generator can use a GMM and/or softmax distribution. From Figure 2 of \citep {Ha2018iclr}. Used with kind permission of David Ha. 
\relax }}{785}{figure.caption.465} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.12}{\ignorespaces (a) Samples from the latent space of a VAE text model, as we interpolate between two sentences (on first and last line). Note that the intermediate sentences are grammatical, and semantically related to their neighbors. From Table 8 of \citep {Bowman2016conll}. (b) Same as (a), but now using a deterministic autoencoder (with the same RNN encoder and decoder). From Table 1 of \citep {Bowman2016conll}. Used with kind permission of Sam Bowman. \relax }}{786}{figure.caption.467} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.13}{\ignorespaces Conditional generation of cats from sketch-RNN model. We increase the temperature parameter from left to right. From Figure 5 of \citep {Ha2018iclr}. Used with kind permission of David Ha. \relax }}{787}{figure.caption.469} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.14}{\ignorespaces Application of VAE-RNN to molecule design. (a) The VAE-RNN model is trained on a sequence representation of molecules known as SMILES. We can fit an MLP to map from the latent space to properties of the molecule, such as its ``fitness'' $f({\bm {z}})$. (b) We can perform gradient ascent in $f({\bm {z}})$ space, and then decode the result to a new molecule with high fitness. From Figure 1 of \citep {GomezBombarelli2018}. Used with kind permission of Rafael Gomez-Bombarelli. \relax }}{788}{figure.caption.471} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.15}{\ignorespaces (a) VAE. (b) Skip-VAE. From Figure 1 of \citep {Dieng2019aistats}. Used with kind permission of Adji Dieng. \relax }}{789}{figure.caption.472} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.16}{\ignorespaces Hierarchical VAEs with 3 stochastic layers. Left: Generative model. Right: Inference network. Diamond is a residual network, $\oplus $ is feature combination (e.g., concatenation), and h is a trainable parameter. We first do bottom-up inference, by propagating ${\bm {x}}$ up to ${\bm {z}}_3$ to compute ${\bm {z}}_3^s \sim q_{\boldsymbol {\phi }}({\bm {z}}_3|{\bm {x}})$, and then we perform top-down inference by computing ${\bm {z}}_2^s \sim q_{\boldsymbol {\phi }}({\bm {z}}_2|{\bm {x}},{\bm {z}}_3^s)$ and then ${\bm {z}}_1^s \sim q_{\boldsymbol {\phi }}({\bm {z}}_1|{\bm {x}},{\bm {z}}_{2:3}^s)$. From Figure 2 of \citep {Vahdat2020}. Used with kind permission of Arash Vahdat. \relax }}{791}{figure.caption.473} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.17}{\ignorespaces The top-down encoder used by the hierarchical VAE in \citep {Child2021}. Each convolution is preceded by the GELU nonlinearity. The model uses average pooling and nearest-neighbor upsampling for the pool and unpool layers. The posterior $q_{\boldsymbol {\phi }}$ and prior $p_{{\bm {\theta }}}$ are diagonal Gaussians. From Figure 3 of \citep {Child2021}. Used with kind permission of Rewon Child. \relax }}{793}{figure.caption.474} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.18}{\ignorespaces Reconstruction from the VD-VAE model trained on CIFAR10 using different numbers of latent layers. The latent code is a sample from the posterior for the first $L_1 < L$ layers, and then is sampled from the prior (ignoring higher levels) at a low temperature; this emulates models of different stochastic depth. Top row: input images. 
Subsequent rows: reconstructing down to resolutions 1, 4, 8, 16, 32. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#vdvae\_demo\_cifar.ipynb}{vdvae\_demo\_cifar.ipynb}. \relax }}{794}{figure.caption.475} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.19}{\ignorespaces Samples from a VDVAE model (trained on FFHQ dataset) from different levels of the hierarchy. From Figure 1 of \citep {Child2021}. Used with kind permission of Rewon Child. \relax }}{794}{figure.caption.476} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.20}{\ignorespaces Left: a hierarchical VAE which emulates an autoregressive model using an identity encoder, autoregressive prior, and identity decoder. Right: a hierarchical VAE with a 2 layer hierarchical latent code. The bottom hidden nodes (black) are conditionally independent given the top layer. From Figure 2 of \citep {Child2021}. Used with kind permission of Rewon Child. \relax }}{795}{figure.caption.477} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.21}{\ignorespaces Autoencoder for MNIST using 256 binary latents. Top row: input images. Middle row: reconstruction. Bottom row: latent code, reshaped to a $16 \times 16$ image. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#quantized\_autoencoder\_mnist.ipynb}{quantized\_autoencoder\_mnist.ipynb}. \relax }}{797}{figure.caption.478} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.22}{\ignorespaces VQ-VAE architecture. From Figure 1 of \citep {vqvae}. Used with kind permission of Aaron van den Oord. \relax }}{798}{figure.caption.479} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.23}{\ignorespaces Hierarchical extension of VQ-VAE. (a) Encoder and decoder architecture. (b) Combining a Pixel-CNN prior with the decoder. From Figure 2 of \citep {Razavi2019}. Used with kind permission of Aaron van den Oord. \relax }}{800}{figure.caption.480} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.24}{\ignorespaces Illustration of the Gumbel Softmax trick applied to $K=4$ codebook vectors in $L=2$ dimensions. From \url {https://ml.berkeley.edu/blog/posts/dalle2/}. Used with kind permission of Charlie Snell. \relax }}{800}{figure.caption.481} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.25}{\ignorespaces Illustration of the dVAE model. (a) Encoder. (b) Decoder. From \url {https://ml.berkeley.edu/blog/posts/dalle2/}. Used with kind permission of Charlie Snell. \relax }}{801}{figure.caption.482} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {21.26}{\ignorespaces Illustration of the VQ-GAN. From Figure 2 of \citep {VQGAN}. Used with kind permission of Patrick Esser. \relax }}{802}{figure.caption.483} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {22.1}{\ignorespaces A fully-connected auto-regressive model. \relax }}{808}{figure.caption.485} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {22.2}{\ignorespaces Illustration of the wavenet model using dilated (atrous) convolutions, with dilation factors of 1, 2, 4 and 8. From Figure 3 of \citep {wavenet}. Used with kind permission of Aaron van den Oord. 
\relax }}{809}{figure.caption.486} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {22.3}{\ignorespaces Illustration of causal 2d convolution in the PixelCNN model. The red histogram shows the empirical distribution over discretized values for a single pixel of a single RGB channel. The red and green $5 \times 5$ array shows the binary mask, which selects the top left context, in order to ensure the convolution is causal. The diagrams on the right illustrate how we can avoid blind spots by using a vertical context stack, that contains all previous rows, and a horizontal context stack, that just contains values from the current row. From Figure 1 of \citep {pixelCNN}. Used with kind permission of Aaron van den Oord. \relax }}{810}{figure.caption.487} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {22.4}{\ignorespaces Sample text generated by GPT-2 in response to an input prompt. From \url {https://openai.com/blog/better-language-models/}. \relax }}{811}{figure.caption.488} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {22.5}{\ignorespaces Illustration of few shot learning with GPT-3. The model is asked to create an example sentence using a new word whose meaning is provided in the prompt. Boldface is GPT-3's completions, light gray is human input. From Figure 3.16 of \citep {gpt3}. \relax }}{812}{figure.caption.489} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {22.6}{\ignorespaces A snippet of a piano performance visualized as a pianoroll (left) and encoded as performance events (right, serialized from left to right and then down the rows). There are 128 discrete values each for note-on and note-off, 32 values for velocity, and 100 for time shift, so the input is a sequence of one-hot vectors of length 388. From Figure 7 of \citep {musicTransformer}. Used with kind permission of Anna Huang. \relax }}{812}{figure.caption.490} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {22.7}{\ignorespaces Illustration of attention in the music transformer. Different colored lines correspond to the 6 attention heads. Line thickness corresponds to attention weights. From Figure 8 of \citep {musicTransformer}. Used with kind permission of Anna Huang. \relax }}{813}{figure.caption.491} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {22.8}{\ignorespaces Some images generated by the DALL-E model in response to a text prompt. (a) ``An armchair in the shape of an avocado''. (b) ``An illustration of a baby hedgehog in a christmas sweater walking a dog''. From \url {https://openai.com/blog/dall-e}. Used with kind permission of Aditya Ramesh. \relax }}{813}{figure.caption.492} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {23.1}{\ignorespaces Mapping a 2d standard Gaussian to a more complex distribution using an invertible MLP. Points are color-coded by their initial position along the y-axis. Adapted from {\citep {JangBlog}}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#flow\_2d\_mlp.ipynb}{flow\_2d\_mlp.ipynb}. \relax }}{816}{figure.caption.493} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {23.2}{\ignorespaces Non-linear squared flow (NLSq). Left: an invertible mapping consisting of 4 NLSq layers. Middle: red is the base distribution (Gaussian), blue is the distribution induced by the mapping on the left. 
Right: density of a 5-layer autoregressive flow using NLSq transformations and a Gaussian base density, trained on a mixture of 4 Gaussians. From Figure 5 of \citep {Ziegler2019}. Used with kind permission of Zachary Ziegler. \relax }}{820}{figure.caption.494} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {23.3}{\ignorespaces Illustration of a coupling layer ${\bm {x}}=f({\bm {u}})$. A bijection, with parameters determined by ${\bm {u}}^B$, is applied to ${\bm {u}}^A$ to generate ${\bm {x}}^A$; meanwhile ${\bm {x}}^B={\bm {u}}^B$ is passed through unchanged, so the mapping can be inverted. From Figure 3 of \citep {Kobyzev2019}. Used with kind permission of Ivan Kobyzev. \relax }}{822}{figure.caption.495} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {23.4}{\ignorespaces Images generated by a flow model trained on MNIST for 10k steps. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#flow\_spline\_mnist.ipynb}{flow\_spline\_mnist.ipynb}. \relax }}{824}{figure.caption.496} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {23.5}{\ignorespaces (a) Affine autoregressive flow with one layer. In this figure, ${\bm {u}}$ is the input to the flow (sample from the base distribution) and ${\bm {x}}$ is its output (sample from the transformed distribution). (b) Inverse of the above. From \citep {JangBlog}. Used with kind permission of Eric Jang. \relax }}{826}{figure.caption.497} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {23.6}{\ignorespaces Density estimation with affine autoregressive flows, using a Gaussian base distribution. (a) True density. (b) Estimated density using a single autoregressive layer with ordering $(x_1,x_2)$. On the left (contour plot) we show $p({\bm {x}})$. On the right (green dots) we show samples of ${\bm {u}}={\bm {f}}^{-1}({\bm {x}})$, where ${\bm {x}}$ is sampled from the true density. (c) Same as (b), but using 5 autoregressive layers and reversing the variable ordering after each layer. Adapted from Figure 1 of \citep {MAF}. Used with kind permission of Iain Murray. \relax }}{826}{figure.caption.498} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {23.7}{\ignorespaces Inverse autoregressive flow that uses affine scalar bijections. In this figure, ${\bm {u}}$ is the input to the flow (sample from the base distribution) and ${\bm {x}}$ is its output (sample from the transformed distribution). From \citep {JangBlog}. Used with kind permission of Eric Jang. \relax }}{827}{figure.caption.499} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {23.8}{\ignorespaces (a) Two moons dataset. (b) Samples from a neural spline flow fit to this data. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#two\_moons\_normalizing\_flow.ipynb}{two\_moons\_normalizing\_flow.ipynb}. \relax }}{833}{figure.caption.500} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {24.1}{\ignorespaces Combining two energy functions in 2d by summation, which is equivalent to multiplying the corresponding probability densities. We also illustrate some sampled trajectories towards high probability (low energy) regions. From Figure 14 of \citep {Du2019ebm}. Used with kind permission of Yilun Du. 
\relax }}{838}{figure.caption.501} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {24.2}{\ignorespaces Illustration of contrastive divergence sampling for an RBM. The visible nodes are initialized at an example drawn from the data set. Then we sample a hidden vector, then another visible vector, etc. Eventually (at ``infinity'') we will be producing samples from the joint distribution $p({\bm {x}},{\bm {z}}|{\bm {\theta }})$. \relax }}{842}{figure.caption.502} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {24.3}{\ignorespaces Fitting a score-based generative model to the 2d swiss roll dataset. (a) Training set. (b) Learned score function trained using basic score matching. (c) Superposition of learned score function and empirical density. (d) Langevin sampling applied to the learned model. We show 3 different trajectories, each of length 25. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#score\_matching\_swiss\_roll.ipynb}{score\_matching\_swiss\_roll.ipynb}. \relax }}{849}{figure.caption.505} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {25.1}{\ignorespaces Denoising diffusion probabilistic model. The forward diffusion process, $q({\bm {x}}_t|{\bm {x}}_{t-1})$, implements the (non-learned) inference network; this just adds Gaussian noise at each step. The reverse diffusion process, $p_{{\bm {\theta }}}({\bm {x}}_{t-1}|{\bm {x}}_t)$, implements the decoder; this is a learned Gaussian model. From Figure 1 of \citep {ho2020denoising}. Used with kind permission of Ajay Jain. \relax }}{858}{figure.caption.506} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {25.2}{\ignorespaces Continuous time VDM diffusion model on 2d swiss roll. Top row: samples from the encoder $q$ at 3 different time slices. Middle row: samples from the generator $p_{{\bm {\theta }}}$ at 3 different time slices. Bottom row: Visualization of the predicted error term $\hat {\boldsymbol {\epsilon }}_{{\bm {\theta }}}({\bm {x}}_t;t) = {\bm {x}}_t - \hat {{\bm {x}}}_{{\bm {\theta }}}({\bm {x}}_t;t)$. We see that this resembles the score function, in that it points towards regions of high data density. Adapted from Figure 1 of \citep {Sohl-Dickstein2015}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#vdm\_2d.ipynb}{vdm\_2d.ipynb}. Used with kind permission of Durk Kingma. \relax }}{865}{figure.caption.507} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {25.3}{\ignorespaces Illustration of some image-to-image tasks using the Palette conditional diffusion model. From Figure 1 of \citep {palette}. Used with kind permission of Chitwan Saharia. \relax }}{868}{figure.caption.508} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {26.1}{\ignorespaces Visualizing the difference between prescribed and implicit generative models. Prescribed models provide direct access to the learned density (sometimes unnormalized). Implicit models only provide access to a simulator which can be used to generate samples from an implied density. 
Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#genmo\_types\_implicit\_explicit.ipynb}{genmo\_types\_implicit\_explicit.ipynb} \relax }}{872}{figure.caption.509} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {26.2}{\ignorespaces The aim of implicit generative modelling objectives: to measure distances between distributions only from samples, in order to distinguish between distributions which are further apart (left) compared to those which are closer (right).\relax }}{873}{figure.caption.510} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {26.3}{\ignorespaces Overview of approaches for learning in implicit generative models.\relax }}{873}{figure.caption.511} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {26.4}{\ignorespaces Optimal critics in Integral Probability Metrics (IPMs). Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#ipm\_divergences.ipynb}{ipm\_divergences.ipynb} \relax }}{880}{figure.caption.514} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {26.5}{\ignorespaces The KL divergence cannot provide learning signal for distributions without overlapping support (left), while the smooth approximation given by a learned decision surface like an MLP can (right). Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#ipm\_divergences.ipynb}{ipm\_divergences.ipynb} \relax }}{881}{figure.caption.515} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {26.6}{\ignorespaces Saturating $\log (1-D(G(z)))$ vs non-saturating $-\log D(G(z))$ loss functions. The non-saturating loss provides stronger gradients when the discriminator is easily detecting that generated samples are fake. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gan\_loss\_types.ipynb}{gan\_loss\_types.ipynb} \relax }}{884}{figure.caption.516} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {26.7}{\ignorespaces Illustration of mode collapse and mode hopping in GAN training. (a) The dataset, a mixture of 16 Gaussians in 2 dimensions. (b-f) Samples from the model after various amounts of training. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gan\_mixture\_of\_gaussians.ipynb}{gan\_mixture\_of\_gaussians.ipynb}. \relax }}{886}{figure.caption.519} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {26.8}{\ignorespaces Visualizing divergence using a simple GAN: DiracGAN. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#dirac\_gan.ipynb}{dirac\_gan.ipynb} \relax }}{889}{figure.caption.520} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {26.9}{\ignorespaces Continuous (left) and discrete dynamics (right) take different trajectories in DiracGAN. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#dirac\_gan.ipynb}{dirac\_gan.ipynb}\relax }}{890}{figure.caption.521} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {26.10}{\ignorespaces Learning an implicit posterior using an adversarial approach, as done in BiGAN. From Figure 1 of \citep {donahue2016adversarial}. Used with kind permission of Jeff Donahue. \relax }}{892}{figure.caption.522} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {26.11}{\ignorespaces DCGAN convolutional generator. From Figure 1 of \citep {DCGAN}. 
Used with kind permission of Alec Radford.\relax }}{894}{figure.caption.523} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {26.12}{\ignorespaces Attention queries used by a SAGAN model, showcasing the global span of attention. Each row first shows the input image and a set of color coded query locations in the image. The subsequent images show the attention maps corresponding to each query location in the first image, with the color coded query location shown, and arrows from it highlighting the most attended regions. From Figure 1 of \citep {zhang2019self}. Used with kind permission of Han Zhang. \relax }}{895}{figure.caption.524} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {26.13}{\ignorespaces LapGAN generation algorithm: the generation process starts with a low dimension sample, which gets upscaled and residually added to the output of a generator at a higher resolution. The process gets repeated multiple times. From Figure 1 of \citep {denton2015deep}. Used with kind permission of Emily Denton.\relax }}{895}{figure.caption.525} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {26.14}{\ignorespaces ProgressiveGAN training algorithm. The input to the discriminator at the bottom of the figure is either a generated image, or a real image (denoted as `Reals' in the figure) at the corresponding resolution. From Figure 1 of \citep {Karras2018}. Used with kind permission of Tero Karras.\relax }}{896}{figure.caption.526} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {26.15}{\ignorespaces Increasingly realistic synthetic faces generated by different kinds of GAN, specifically (from left to right): original GAN \citep {GAN}, DCGAN \citep {DCGAN}, CoupledGAN \citep {COGAN}, ProgressiveGAN \citep {Karras2018}, StyleGAN \citep {styleGAN}. Used with kind permission of Ian Goodfellow. An online demo, which randomly generates face images using StyleGAN, can be found at \url {https://thispersondoesnotexist.com}. \relax }}{898}{figure.caption.527} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {26.16}{\ignorespaces Example results on several image-to-image translation problems as generated by the pix2pix conditional GAN. From Figure 1 of \citep {pix2pix}. Used with kind permission of Phillip Isola. \relax }}{899}{figure.caption.528} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {26.17}{\ignorespaces Illustration of the CycleGAN training scheme. (a) Illustration of the 4 functions that are trained. (b) Forward cycle consistency from $\mathcal {X}$ back to $\mathcal {X}$. (c) Backwards cycle consistency from $\mathcal {Y}$ back to $\mathcal {Y}$. From Figure 3 of \citep {cycleGAN}. Used with kind permission of Jun-Yan Zhu. \relax }}{900}{figure.caption.529} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {26.18}{\ignorespaces Some examples of unpaired image-to-image translation generated by the CycleGAN model. From Figure 1 of \citep {cycleGAN}. Used with kind permission of Jun-Yan Zhu. \relax }}{900}{figure.caption.530} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {27.1}{\ignorespaces Vision as inverse graphics. The agent (here represented by a human head) has to infer the scene ${\bm {z}}$ given the image ${\bm {x}}$ using an estimator. From Figure 1 of \citep {Rao1999}. Used with kind permission of Rajesh Rao. 
\relax }}{908}{figure.caption.531} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.1}{\ignorespaces A mixture of 3 Gaussians in 2d. (a) We show the contours of constant probability for each component in the mixture. (b) A surface plot of the overall density. Adapted from Figure 2.23 of \citep {BishopBook}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gmm\_plot\_demo.ipynb}{gmm\_plot\_demo.ipynb}. \relax }}{911}{figure.caption.533} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.2}{\ignorespaces (a) Some data in 2d. (b) A possible clustering using $K=5$ clusters computed using a GMM. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gmm\_2d.ipynb}{gmm\_2d.ipynb}. \relax }}{911}{figure.caption.534} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.3}{\ignorespaces We fit a mixture of 20 Bernoullis to the binarized MNIST digit data. We visualize the estimated cluster means $\hat {\boldsymbol {\mu }}_k$. The numbers on top of each image represent the estimated mixing weights $\hat {\pi }_k$. No labels were used when training the model. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#mix\_bernoulli\_em\_mnist.ipynb}{mix\_bernoulli\_em\_mnist.ipynb}. \relax }}{912}{figure.caption.535} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.4}{\ignorespaces Example of recovering a clean image (right) from a corrupted version (left) using MAP estimation with a GMM patch prior and Gaussian likelihood. First row: image denoising. Second row: image deblurring. Third row: image inpainting. From \citep {Rosenbaum2015} and \citep {Zoran2011}. Used with kind permission of Dan Rosenbaum and Daniel Zoran. \relax }}{914}{figure.caption.536} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.5}{\ignorespaces Illustration of the parameters learned by a GMM applied to image patches. Each of the 3 panels corresponds to a different mixture component $k$. Within each panel, we show the eigenvectors (reshaped as images) of the covariance matrix $\boldsymbol {\Sigma }_k$ in decreasing order of eigenvalue. We see various kinds of patterns, including ones that look like the ones learned from PCA (see \cref {fig:patches}), but also ones that look like edges and texture. From Figure 6 of \citep {Zoran2011}. Used with kind permission of Daniel Zoran. \relax }}{916}{figure.caption.537} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.6}{\ignorespaces Illustration of the FA generative process, where we have $L=1$ latent dimension generating $D=2$ observed dimensions; we assume $\boldsymbol {\Psi }=\sigma ^2 \mathbf {I}$. The latent factor has value $z \in \mathbb {R}$, sampled from $p(z)$; this gets mapped to a 2d offset $\boldsymbol {\delta }= z {\bm {w}}$, where ${\bm {w}}\in \mathbb {R}^2$, which gets added to $\boldsymbol {\mu }$ to define a Gaussian $p({\bm {x}}|z) = \mathcal {N}({\bm {x}}|\boldsymbol {\mu }+ \boldsymbol {\delta },\sigma ^2 \mathbf {I})$. By integrating over $z$, we ``slide'' this circular Gaussian ``spray can'' along the principal component axis ${\bm {w}}$, which induces elliptical Gaussian contours in ${\bm {x}}$ space centered on $\boldsymbol {\mu }$. Adapted from Figure 12.9 of \citep {BishopBook}. 
\relax }}{920}{figure.caption.538} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.7}{\ignorespaces Mixture of factor analyzers as a PGM. \relax }}{926}{figure.caption.539} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.8}{\ignorespaces Mixture of PPCA models fit to a 2d dataset, using $L=1$ latent dimensions. (a) $K=1$ mixture components. (b) $K=10$ mixture components. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#mix\_ppca\_demo.ipynb}{mix\_ppca\_demo.ipynb}. \relax }}{927}{figure.caption.540} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.9}{\ignorespaces Illustration of estimating the effective dimensionalities in a mixture of factor analyzers using variational Bayes EM with an ARD prior. Black are negative values, white are positive, gray is 0. The blank columns have been forced to 0 via the ARD mechanism, reducing the effective dimensionality. The data was generated from 6 clusters with intrinsic dimensionalities of $7,4,3,2,2,1$, which the method has successfully estimated. From Figure 4.4 of \citep {Beal03}. Used with kind permission of Matt Beal. \relax }}{929}{figure.caption.541} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.10}{\ignorespaces We show the estimated number of clusters, and their estimated dimensionalities, as a function of sample size. The ARD algorithm found two different solutions when $N=8$. Note that more clusters, with larger effective dimensionalities, are discovered as the sample size increases. From Table 4.1 of \citep {Beal03}. Used with kind permission of Matt Beal. \relax }}{929}{figure.caption.542} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.11}{\ignorespaces Random samples from the MixFA model fit to CelebA. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#mix\_ppca\_celebA.ipynb}{mix\_ppca\_celebA.ipynb}. Adapted from Figure 4 of \citep {Richardson2018}. Used with kind permission of Yair Weiss. \relax }}{930}{figure.caption.543} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.12}{\ignorespaces (a) Visualization of the parameters learned by the MFA model. The top row shows the mean $\boldsymbol {\mu }_k$ and noise variance $\boldsymbol {\Psi }_k$, reshaped from 12,288-dimensional vectors to $64 \times 64 \times 3$ images, for two mixture components $k$. The next 5 rows show the first 5 (of 10) basis functions (columns of $\mathbf {W}_k$) as images. On row $i$, left column, we show $\boldsymbol {\mu }_k - \mathbf {W}_k[:,i]$; in the middle, we show $0.5 + \mathbf {W}_k[:,i]$, and on the right we show $\boldsymbol {\mu }_k + \mathbf {W}_k[:,i]$. (b) Images generated by computing $\boldsymbol {\mu }_k + z_1 \mathbf {W}_k[:,i] + z_2 \mathbf {W}_k[:,j]$, for some component $k$ and dimensions $i,j$, where $(z_1,z_2)$ are drawn from the grid $[-1:1, -1:1]$, so the central image is just $\boldsymbol {\mu }_k$. From Figure 6 of \citep {Richardson2018}. Used with kind permission of Yair Weiss. \relax }}{931}{figure.caption.544} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.13}{\ignorespaces Samples from the 100 CelebA images with lowest likelihood under the MFA model. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#mix\_ppca\_celebA.ipynb}{mix\_ppca\_celebA.ipynb}. Adapted from Figure 7a of \citep {Richardson2018}. Used with kind permission of Yair Weiss. 
\relax }}{931}{figure.caption.545} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.14}{\ignorespaces Illustration of image imputation using an MFA. Left column shows 4 original images. Subsequent pairs of columns show an occluded input, and a predicted output. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#mix\_ppca\_celebA.ipynb}{mix\_ppca\_celebA.ipynb}. Adapted from Figure 7b of \citep {Richardson2018}. Used with kind permission of Yair Weiss. \relax }}{932}{figure.caption.546} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.15}{\ignorespaces Gaussian latent factor models for paired data. (a) Supervised PCA. (b) Partial least squares. \relax }}{932}{figure.caption.547} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.16}{\ignorespaces Canonical correlation analysis as a PGM. \relax }}{934}{figure.caption.548} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.17}{\ignorespaces Exponential family PCA model as a DPGM\xspace . \relax }}{935}{figure.caption.549} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.18}{\ignorespaces (a) 150 synthetic 16 dimensional bit vectors. (b) The 2d embedding learned by binary PCA, fit using variational EM. We have color coded points by the identity of the true ``prototype'' that generated them. (c) Predicted probability of being on. (d) Thresholded predictions. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#binary\_fa\_demo.ipynb}{binary\_fa\_demo.ipynb}. \relax }}{937}{figure.caption.550} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.19}{\ignorespaces Illustration of a 2d embedding of human motion-capture data using a GP-LVM. We show two poses and their corresponding embeddings. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#gplvm\_mocap.ipynb}{gplvm\_mocap.ipynb}. Used with kind permission of Aditya Ravuri. \relax }}{938}{figure.caption.551} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.20}{\ignorespaces (a) Gaussian-Poisson (GAP) model as a DPGM\xspace . Here $z_{n,k} \in \mathbb {R}^+$ and $x_{n,d} \in \mathbb {Z}_{\geq 0}$. (b) Simplex FA model as a DPGM\xspace . Here ${\bm {z}}_{n} \in \mathbb {S}_K$ and $x_{n,d} \in \ensuremath {\{1,\ldots ,V\}}$. \relax }}{939}{figure.caption.552} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.21}{\ignorespaces Illustrating the difference between Non-negative Matrix Factorization (NMF), Vector Quantization (VQ), and Principal Components Analysis (PCA). Left column: Filters (columns of $\mathbf {W}$) learned from a set of 2429 face images, each of size $19 \times 19$. There are 49 basis functions in total, shown in a $7 \times 7$ montage; each filter is reshaped to a $19 \times 19$ image for display purposes. (For PCA, negative weights are red, positive weights are black.) Middle column: The 49 latent factors ${\bm {z}}$ when the model is applied to the original face image shown at the top. Right column: reconstructed face image. From Figure 1 of \citep {Lee1999nmf}. \relax }}{940}{figure.caption.553} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.22}{\ignorespaces The simplex factor analysis model applied to some roll call data from the US Senate collected in 2003. The senators have been sorted from left to right using the binary PCA method of \citep {DeLeeuw2006}. See text for details. 
From Figures 8--9 of \citep {Buntine06}. Used with kind permission of Wray Buntine. \relax }}{941}{figure.caption.554} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.23}{\ignorespaces Latent Dirichlet Allocation (LDA) as a DPGM\xspace . (a) Unrolled form. (b) Plate form. \relax }}{943}{figure.caption.555} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.24}{\ignorespaces Illustration of latent Dirichlet allocation (LDA). We have color coded certain words by the topic they have been assigned to: yellow represents the genetics cluster, pink represents the evolution cluster, blue represents the data analysis cluster, and green represents the neuroscience cluster. Each topic is in turn defined as a sparse distribution over words. This article is not related to neuroscience, so no words are assigned to the green topic. The overall distribution over topic assignments for this document is shown on the right as a sparse histogram. Adapted from Figure 1 of \citep {Blei2012}. Used with kind permission of David Blei. \relax }}{944}{figure.caption.556} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.25}{\ignorespaces Three topics related to the word {\em play}. From Figure 9 of \citep {Steyvers07}. Used with kind permission of Tom Griffiths. \relax }}{944}{figure.caption.557} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.26}{\ignorespaces Three documents from the TASA corpus containing different senses of the word {\em play}. Grayed out words were ignored by the model, because they correspond to uninteresting stop words (such as ``and'', ``the'', etc.) or very low frequency words. From Figure 10 of \citep {Steyvers07}. Used with kind permission of Tom Griffiths. \relax }}{945}{figure.caption.558} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.27}{\ignorespaces Output of the correlated topic model (with $K=50$ topics) when applied to articles from {\em Science}. Nodes represent topics, with the 5 most probable phrases from each topic shown inside. Font size reflects overall prevalence of the topic. See \url {http://www.cs.cmu.edu/~lemur/science/} for an interactive version of this model with 100 topics. From Figure 2 of \citep {Blei07}. Used with kind permission of David Blei. \relax }}{947}{figure.caption.559} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.28}{\ignorespaces The dynamic topic model as a DPGM\xspace . \relax }}{948}{figure.caption.560} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.29}{\ignorespaces LDA-HMM model as a DPGM\xspace . \relax }}{948}{figure.caption.561} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.30}{\ignorespaces Function and content words in the NIPS corpus, as distinguished by the LDA-HMM model. Graylevel indicates posterior probability of assignment to LDA component, with black being highest. The boxed word appears as a function word in one sentence, and as a content word in another sentence. Asterisked words had low frequency, and were treated as a single word type by the model. From Figure 4 of \citep {Griffiths04ldahmm}. Used with kind permission of Tom Griffiths. \relax }}{950}{figure.caption.563} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.31}{\ignorespaces Illustration of ICA applied to 500 iid samples of a 4d source signal. (a) Latent signals. (b) Observations. (c) PCA estimate. 
(d) ICA estimate. This matches the true sources, up to permutation of the dimension indices. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#ica\_demo.ipynb}{ica\_demo.ipynb}. \relax }}{951}{figure.caption.564} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.32}{\ignorespaces Illustration of ICA and PCA applied to 100 iid samples of a 2d source signal with a uniform distribution. (a) Latent signals. (b) Observations. (c) PCA estimate. (d) ICA estimate. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#ica\_demo\_uniform.ipynb}{ica\_demo\_uniform.ipynb}. \relax }}{952}{figure.caption.565} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {28.33}{\ignorespaces Illustration of the filters learned by various methods when applied to natural image patches. (a) Sparse coding. (b) PCA. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#sparse\_dict\_demo.ipynb}{sparse\_dict\_demo.ipynb}. \relax }}{957}{figure.caption.566} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.1}{\ignorespaces State-space model represented as a graphical model. (a) Generic form, with inputs ${\bm {u}}_t$, hidden state ${\bm {z}}_t$, and observations ${\bm {y}}_t$. We assume the observation likelihood is first-order auto-regressive. (b) Simplified form, with no inputs, and Markovian observations. \relax }}{960}{figure.caption.567} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.2}{\ignorespaces Some samples from an HMM with 5 Bernoulli observables. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#hmm\_bernoulli.ipynb}{hmm\_bernoulli.ipynb}. \relax }}{961}{figure.caption.568} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.3}{\ignorespaces (a) Some 2d data sampled from a 5 state HMM. Each state emits from a 2d Gaussian. (b) The hidden state sequence is shown by the colors. We superimpose the observed 2d timeseries (note that we have shifted the vertical scale so the values don't overlap). Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#hmm\_gaussian\_2d.ipynb}{hmm\_gaussian\_2d.ipynb}. \relax }}{962}{figure.caption.569} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.4}{\ignorespaces Illustration of the observation dynamics for each of the 5 hidden states. The attractor point corresponds to the steady state solution for the corresponding autoregressive process. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#hmm\_ar.ipynb}{hmm\_ar.ipynb}. \relax }}{963}{figure.caption.570} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.5}{\ignorespaces Samples from the 2d AR-HMM. (a) Time series plot of $y_{t,1}$ and $y_{t,2}$. (The latter are shifted up vertically by 4.7.) The background color is the generating state. The dotted lines represent the stationary value for that component of the observation. (b) Scatter plot of observations. Colors denote the generating state. We show the first 12 samples from each state. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#hmm\_ar.ipynb}{hmm\_ar.ipynb}. \relax }}{963}{figure.caption.571} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.6}{\ignorespaces (a) A sample time series dataset of counts. 
(b) A segmentation of this data using a 4 state HMM. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#hmm\_poisson\_changepoint.ipynb}{hmm\_poisson\_changepoint.ipynb}. \relax }}{964}{figure.caption.572} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.7}{\ignorespaces Segmentation of the time series using HMMs with 1--6 states. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#hmm\_poisson\_changepoint.ipynb}{hmm\_poisson\_changepoint.ipynb}. \relax }}{965}{figure.caption.573} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.8}{\ignorespaces Marginal likelihood vs number of states $K$ in the Poisson HMM. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#hmm\_poisson\_changepoint.ipynb}{hmm\_poisson\_changepoint.ipynb}. \relax }}{966}{figure.caption.574} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.9}{\ignorespaces State transition diagram for a profile HMM. From Figure 5.7 of \citep {Durbin98}. Used with kind permission of Richard Durbin. \relax }}{966}{figure.caption.575} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.10}{\ignorespaces Example of multiple sequence alignment. We show the first 90 positions of the acidic ribosomal protein P0 from several organisms. Colors represent functional properties of the corresponding amino acid. Dashes represent insertions or deletions. From \url {https://en.wikipedia.org/wiki/Multiple_sequence_alignment}. Used with kind permission of Wikipedia author Miguel Andrade. \relax }}{967}{figure.caption.576} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.11}{\ignorespaces Illustration of an HMM applied to spelling correction. The top row, labeled ``query'', represents the search query ${\bm {y}}_{1:T}$ typed by the user, namely ``goverment home page of illnoisstate''. The bottom row, labeled ``state path'', represents the most probable assignment to the hidden states, ${\bm {z}}_{1:T}$, namely ``government homepage of illinois state''. (The NULL state is a silent state, that is needed to handle the generation of two tokens from a single hidden state.) The middle row, labeled ``emission'', represents the words emitted by each state, which match the observed data. From Figure 1 of \citep {Cloudspeller}. \relax }}{969}{figure.caption.577} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.12}{\ignorespaces HMM with plate notation. $A$ are the parameters for the state transition matrix $p(z_t|z_{t-1})$ and $B$ are the parameters for the discrete observation model $p(x_t|z_t)$. $T_n$ is the length of the $n$'th sequence. \relax }}{971}{figure.caption.578} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.13}{\ignorespaces Illustration of the casino HMM. (a) True parameters used to generate the data. (b) Estimated parameters using EM. (c) Estimated parameters using SGD. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#hmm\_casino\_training.ipynb}{hmm\_casino\_training.ipynb}. \relax }}{973}{figure.caption.580} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.14}{\ignorespaces Average negative log likelihood per learning step for the casino HMM. (a) EM. (b) SGD with minibatch size $1$. (c) Full batch gradient descent. 
Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#hmm\_casino\_training.ipynb}{hmm\_casino\_training.ipynb}. \relax }}{974}{figure.caption.581} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.15}{\ignorespaces (a) A Markov chain with $n=4$ repeated states and self loops. (b) The resulting distribution over sequence lengths, for $p=0.99$ and various $n$. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#hmm\_self\_loop\_dist.ipynb}{hmm\_self\_loop\_dist.ipynb}. \relax }}{976}{figure.caption.582} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.16}{\ignorespaces Encoding a hidden semi-Markov model as a DPGM\xspace . (a) $d_t$ is a deterministic down counter (duration variable). Each observation is generated independently. (b) Similar to (a), except now we generate the observations within each segment as a block. In this figure, we represent the non-Markovian dependencies between the observations within each segment by using undirected edges. We represent the conditional independence between the observations across different segments by disconnecting ${\bm {y}}_{1:3}$ from ${\bm {y}}_{4:5}$; this can be enforced by ``breaking the link'' whenever $d_t=1$ (representing the end of a segment). \relax }}{977}{figure.caption.583} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.17}{\ignorespaces An example of an HHMM for an ASR system which can recognize 3 words. The top level represents bigram word probabilities. The middle level represents the phonetic spelling of each word. The bottom level represents the subphones of each phone. (It is traditional to represent a phone as a 3 state HMM, representing the beginning, middle and end; these are known as subphones.) Adapted from Figure 7.5 of \citep {Jurafsky00}. \relax }}{978}{figure.caption.584} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.18}{\ignorespaces An HHMM represented as a DPGM\xspace . $Z_t^{\ell }$ is the state at time $t$, level $\ell $; $F_t^{\ell } = 1$ if the HMM at level $\ell $ has finished (entered its exit state), otherwise $F_t^{\ell }=0$. Shaded nodes are observed; the remaining nodes are hidden. We may optionally clamp $F_T^{\ell } = 1$, where $T$ is the length of the observation sequence, to ensure all models have finished by the end of the sequence. From Figure 2 of \citep {Murphy01hhmmnips}. \relax }}{979}{figure.caption.585} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.19}{\ignorespaces (a) A factorial HMM with 3 chains. (b) A coupled HMM with 3 chains. \relax }}{981}{figure.caption.586} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.20}{\ignorespaces The BATnet DBN. The transient nodes are only shown for the second slice, to minimize clutter. The dotted lines are used to group related variables. Used with kind permission of Daphne Koller. \relax }}{982}{figure.caption.587} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.21}{\ignorespaces Illustration of Bayesian online changepoint detection (BOCPD). (a) Hypothetical segmentation of a univariate time series divided by changepoints on the mean into three segments of lengths $g_1=4$, $g_2=6$, and an undetermined length for $g_3$ (since the third segment has not yet ended). From Figure 1 of \citep {Adams2007}. Used with kind permission of Ryan Adams. 
\relax }}{983}{figure.caption.588} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.22}{\ignorespaces Illustration of BOCPD. (a) Synthetic data from a GMM with 4 states. Top row is the data, bottom row is the generating state. (b) Output of algorithm. Top row: Estimated changepoint locations. Bottom row: posterior predicted probability of a changepoint at each step. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#changepoint\_detection.ipynb}{changepoint\_detection.ipynb}. \relax }}{984}{figure.caption.589} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.23}{\ignorespaces Illustration of Kalman filtering and smoothing for a linear dynamical system. (Repeated from \cref {fig:kf}.) (a) Observations (green circles) are generated by an object moving to the right (true location denoted by blue squares). (b) Results of online Kalman filtering. Red cross is the posterior mean, circles are 95\% confidence ellipses derived from the posterior covariance. (c) Same as (b), but using offline Kalman smoothing. The MSE in the trajectory for filtering is 3.13, and for smoothing is 1.71. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#kf\_tracking.ipynb}{kf\_tracking.ipynb}. \relax }}{987}{figure.caption.590} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.24}{\ignorespaces (a) A dynamic generalization of linear regression. (b) Illustration of the recursive least squares algorithm applied to the model $p(y|x,{\bm {w}}) = \mathcal {N}(y|w_0 + w_1 x, \sigma ^2)$. We plot the marginal posterior of $w_0$ and $w_1$ vs number of data points. (Error bars represent $\mathbb {E}\left [{w_j|{\bm {y}}_{1:t},{\bm {x}}_{1:t}}\right ] \pm \sqrt {\mathbb {V}\left [ {w_j|{\bm {y}}_{1:t},{\bm {x}}_{1:t}}\right ]}$.) After seeing all the data, we converge to the offline (batch) Bayes solution, represented by the horizontal lines. (Shading represents the marginal posterior variance.) Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#kf\_linreg.ipynb}{kf\_linreg.ipynb}. \relax }}{988}{figure.caption.591} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.25}{\ignorespaces (a) A switching SSM. Squares represent discrete random variables, circles represent continuous random variables. (b) Illustration of how the number of modes in the belief state of a switching SSM grows exponentially over time. We assume there are two binary states. \relax }}{993}{figure.caption.592} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.26}{\ignorespaces Illustration of Kalman filtering and smoothing for tracking multiple moving objects. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#kf\_parallel.ipynb}{kf\_parallel.ipynb}. \relax }}{994}{figure.caption.593} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.27}{\ignorespaces A model for tracking two objects in the presence of data-association ambiguity. We observe 3, 1 and 2 detections at time steps $t-1$, $t$ and $t+1$. The $m_t$ hidden variable encodes the association between the observations and the hidden causes. \relax }}{994}{figure.caption.594} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.28}{\ignorespaces Illustration of multi-target tracking in 2d over 5 time steps. (a) We observe 2 measurements per time step. (b-c) Possible hypotheses about the underlying object tracks. 
(d) A more complex hypothesis in which the red track stops at step 3, the dashed red track starts at step 4, the dotted blue track has a detection failure at step 3, and one of the measurements at step 3 is a false alarm. Adapted from Figure 15.8 of \citep {aima}. \relax }}{996}{figure.caption.595} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.29}{\ignorespaces Illustration of a bearings-only tracking problem. Adapted from Figure 5.7 of \citep {Sarkka13}. \relax }}{997}{figure.caption.596} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.30}{\ignorespaces Samples from a 2d LDS with 5 Poisson likelihood terms. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#poisson\_lds\_example.ipynb}{poisson\_lds\_example.ipynb}. \relax }}{998}{figure.caption.597} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.31}{\ignorespaces Latent state trajectory (blue lines) and dynamics matrix $\mathbf {A}$ (arrows) for (left) true model and (right) estimated model. The star marks the start of the trajectory. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#poisson\_lds\_example.ipynb}{poisson\_lds\_example.ipynb}. \relax }}{999}{figure.caption.598} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.32}{\ignorespaces (a) A BSTS model with local linear trend and linear regression on inputs. The observed output is $y_t$. The latent state vector is defined by ${\bm {z}}_t=(\mu _t,\delta _t)$. The (static) parameters are ${\bm {\theta }}=(\sigma _y, \sigma _\mu ,\sigma _{\delta },\boldsymbol {\beta })$. The covariates are ${\bm {u}}_t$. (b) Adding a latent seasonal process (with $S=4$ seasons). Parameter nodes are omitted for clarity. \relax }}{1003}{figure.caption.599} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.33}{\ignorespaces (a) $\text {CO}_2$ levels from Mauna Loa. In orange, we show predictions for the most recent 10 years. (b) Underlying components of the STS model which was fit to the data in \cref {fig:maunaLoa}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#sts.ipynb}{sts.ipynb}. \relax }}{1004}{figure.caption.600} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.34}{\ignorespaces (a) Hourly temperature and electricity demand in Victoria, Australia in 2014. (b) Electricity forecasts using an STS. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#sts.ipynb}{sts.ipynb}. \relax }}{1004}{figure.caption.601} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.35}{\ignorespaces Components of the electricity forecasts. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#sts.ipynb}{sts.ipynb}. \relax }}{1005}{figure.caption.602} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.36}{\ignorespaces Anomaly detection in a time series. We plot the observed electricity data in blue and the predictions in orange. In gray, we plot the z-score at time $t$, given by $(y_t - \mu _t)/\sigma _t$, where $p(y_t|{\bm {y}}_{1:t-1},{\bm {u}}_{1:t})=\mathcal {N}(\mu _t,\sigma _t^2)$. Anomalous observations are defined as points where $z_t > 3$ and are marked with red crosses. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#sts.ipynb}{sts.ipynb}.
\relax }}{1006}{figure.caption.603} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.37}{\ignorespaces Visualization of a probabilistic demand forecast for a hypothetical product. Note the sudden spike near the Christmas holiday in December 2013. The black line denotes the actual demand. Green lines denote the model samples in the training range, while the red lines show the actual probabilistic forecast on data unseen by the model. The red bars at the bottom indicate out-of-stock events which can explain the observed zeros. From Figure 1 of \citep {Bose2017vldb}. Used with kind permission of Tim Januschowski. \relax }}{1006}{figure.caption.604} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.38}{\ignorespaces Illustration of a hierarchical state-space model. \relax }}{1007}{figure.caption.605} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.39}{\ignorespaces Twin network state space model for estimating the causal impact of an intervention that occurs just after time step $n=2$. We have $m=4$ actual observations, denoted ${\bm {y}}_{1:4}$. We cut the incoming arcs to ${\bm {z}}_3$ since we assume ${\bm {z}}_{3:T}$ comes from a different distribution, namely the post-intervention distribution. However, in the counterfactual world, shown at the bottom of the figure (with tilde symbols), we assume the distributions are the same as in the past, so information flows along the chain uninterrupted. \relax }}{1008}{figure.caption.606} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.40}{\ignorespaces A graphical model representation of the local level causal impact model. The dotted line represents the time $n$ at which an intervention occurs. Adapted from Figure 2 of \citep {Brodersen2015}. Used with kind permission of Kay Brodersen. \relax }}{1010}{figure.caption.607} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.41}{\ignorespaces (a) Some simulated time series data which we use to estimate the causal impact of some intervention, which occurs at time $n=70$, illustrated by the dotted line. The blue curve shows the observed outcomes $y$; the other curves are covariates (inputs). (b) Output of causal inference. Top row: observed vs predicted outcomes. Middle row: Estimate of causal effect $\tau _t$ at each time step. Bottom row: Cumulative causal effect, $\sigma _t$, up to each time step. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#causal\_impact.ipynb}{causal\_impact.ipynb}. \relax }}{1011}{figure.caption.608} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.42}{\ignorespaces Inference networks for a deep Markov model. (a) Fully factorized causal posterior $q({\bm {z}}_{1:T}|{\bm {x}}_{1:T}) = \prod _t q({\bm {z}}_t|{\bm {x}}_{1:t})$. The past observations ${\bm {x}}_{1:t}$ are stored in the RNN hidden state ${\bm {h}}_t$. (b) Markovian posterior $q({\bm {z}}_{1:T}|{\bm {x}}_{1:T}) = \prod _t q({\bm {z}}_t|{\bm {z}}_{t-1},{\bm {x}}_{t:T})$. The future observations ${\bm {x}}_{t:T}$ are stored in the RNN hidden state ${\bm {h}}_t$. \relax }}{1014}{figure.caption.609} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.43}{\ignorespaces Recurrent state space models.
(a) Prior is first-order Markov, $p({\bm {z}}_t|{\bm {z}}_{t-1})$, but observation model is not Markovian, $p({\bm {x}}_t|{\bm {h}}_t)=p({\bm {x}}_t|{\bm {z}}_{1:t})$, where ${\bm {h}}_t$ summarizes ${\bm {z}}_{1:t}$. (b) Prior model is no longer first-order Markov either, $p({\bm {z}}_t|{\bm {h}}_{t-1})=p({\bm {z}}_t|{\bm {z}}_{1:t-1})$. Diamonds are deterministic nodes, circles are stochastic. \relax }}{1015}{figure.caption.610} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.44}{\ignorespaces Unrolling schemes for SSMs. The label $z_{i|j}$ is shorthand for $p({\bm {z}}_i|{\bm {x}}_{1:j})$. Solid lines denote the generative process, dashed lines the inference process. Arrows pointing at shaded circles represent log-likelihood loss terms. Wavy arrows indicate KL divergence loss terms. (a) Standard 1 step reconstruction of the observations. (b) Observation overshooting tries to predict future observations by unrolling in latent space. (c) Latent overshooting predicts future latent states and penalizes their KL divergence, but does not need to care about future observations. Adapted from Figure 3 of \citep {planet}. \relax }}{1016}{figure.caption.611} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {29.45}{\ignorespaces Variational RNN. (a) Generative model. (b) Inference model. The diamond-shaped nodes are deterministic. \relax }}{1017}{figure.caption.612} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {30.1}{\ignorespaces (a) A directed graph. (b) The same graph, with the nodes partitioned into 3 groups, making the block structure more apparent. \relax }}{1020}{figure.caption.613} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {30.2}{\ignorespaces (a) Adjacency matrix for the graph in Figure\nobreakspace {}\ref {fig:SBMgraph}(a). (b) Rows and columns are permuted to reveal the block structure. We also show how the stochastic block model can generate this graph. From Figure 1 of \citep {Kemp06}. Used with kind permission of Charles Kemp. \relax }}{1020}{figure.caption.614} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {30.3}{\ignorespaces Some examples of graphs generated using the stochastic block model with different kinds of connectivity patterns between the blocks. The abstract graphs (between blocks) represent a ring, a dominance hierarchy, a common-cause structure, and a common-effect structure. From Figure 4 of \citep {Kemp10}. Used with kind permission of Charles Kemp. \relax }}{1021}{figure.caption.615} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {30.4}{\ignorespaces (a) Stochastic block model. (b) Mixed membership stochastic block model. \relax }}{1022}{figure.caption.616} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {30.5}{\ignorespaces (a) Who-likes-whom graph for Sampson's monks. (b) Mixed membership of each monk in one of three groups. From Figures 2-3 of \citep {Airoldi08}. Used with kind permission of Edo Airoldi. \relax }}{1022}{figure.caption.617} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {30.6}{\ignorespaces Illustration of an ontology learned by IRM applied to the Unified Medical Language System. The boxes represent 7 of the 14 concept clusters. Predicates that belong to the same cluster are grouped together, and associated with edges to which they pertain. All links with weight above 0.8 have been included.
From Figure 9 of \citep {Kemp10}. Used with kind permission of Charles Kemp. \relax }}{1024}{figure.caption.618} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {30.7}{\ignorespaces Illustration of IRM applied to some political data containing features and pairwise interactions. Top row (a): the partition of the countries into 5 clusters and the features into 5 clusters. Every second column is labelled with the name of the corresponding feature. Small squares at bottom (b-i): these are 8 of the 18 clusters of interaction types. From Figure 6 of \citep {Kemp06}. Used with kind permission of Charles Kemp. \relax }}{1025}{figure.caption.619} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {30.8}{\ignorespaces A sparse undirected Gaussian graphical model learned using graphical lasso applied to some flow cytometry data (from \citep {Sachs05}), which measures the phosphorylation status of 11 proteins. The sparsity level is controlled by $\lambda $. (a) $\lambda =36$. (b) $\lambda =27$. (c) $\lambda =7$. (d) $\lambda =0$. Adapted from Figure 17.5 of \citep {HastieBook}. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#ggm\_lasso\_demo.ipynb}{ggm\_lasso\_demo.ipynb}. \relax }}{1026}{figure.caption.620} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {31.1}{\ignorespaces Partitions of the unit square. (left) One possible partition into $K=3$ regions, and (center) a refined partition into $K=4$ regions. In both figures, the shading of cell $T_k$ is proportional to $G(T_k)$, resulting from the same realization of a Dirichlet process. (right) An `infinite partition' of the unit square. The Dirichlet process can informally be viewed as an infinite-dimensional Dirichlet distribution defined on this infinite partition. \relax }}{1030}{figure.caption.621} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {31.2}{\ignorespaces Realizations from a Dirichlet process when $\Theta $ is (a) the real line, and (b) the unit square. Also shown are the base measures $\alpha H$. In reality, the number of atoms is infinite for both cases. \relax }}{1032}{figure.caption.622} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {31.3}{\ignorespaces Illustration of the stick breaking construction. (a) We have a unit length stick, which we break at a random point $\beta _1$; the length of the piece we keep is called $\pi _1$; we then recursively break off pieces of the remaining stick, to generate $\pi _2, \pi _3, \ldots $. From Figure 2.22 of \citep {SudderthThesis}. Used with kind permission of Erik Sudderth. (b) Samples of $\pi _k$ from this process for different values of $\alpha $. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#stick\_breaking\_demo.ipynb}{stick\_breaking\_demo.ipynb}. \relax }}{1032}{figure.caption.623} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {31.4}{\ignorespaces Some samples from a Dirichlet process mixture model of 2D Gaussians, with concentration parameter $\alpha =1$ (left column) and $\alpha =2$ (right column). From top to bottom, we show $N=50$, $N=500$ and $N=1000$ samples. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#dp\_mixgauss\_sample.ipynb}{dp\_mixgauss\_sample.ipynb}.
\relax }}{1035}{figure.caption.624} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {31.5}{\ignorespaces Two views of $N$ observations sampled from a DP mixture model. Left: representation where cluster indicators are sampled from the GEM-distributed weights $\pi $. Right: representation where parameters are sampled from the DP-distributed random measure $G$. The rightmost picture illustrates the case where $N=2$, ${\bm {\theta }}$ is real-valued with a Gaussian prior $H(\cdot )$, and $F(x|\theta )$ is a Gaussian with mean $\theta $ and variance $\sigma ^2$. We generate two parameters, ${\overline {\theta }}_1$ and ${\overline {\theta }}_2$, from $G$, one per data point. Finally, we generate two data points, $x_1$ and $x_2$, from $\mathcal {N}(\overline {\theta }_1, \sigma ^2)$ and $\mathcal {N}(\overline {\theta }_2, \sigma ^2)$. From Figure 2.24 of \citep {SudderthThesis}. Used with kind permission of Erik Sudderth. \relax }}{1036}{figure.caption.625} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {31.6}{\ignorespaces 300 data points in 2 dimensions are clustered using a DP mixture fit with collapsed Gibbs sampling. The two columns show different random initializations. The rows represent samples from the posterior over model parameters after $T=2$ (top), $T=10$ (middle) and $T=50$ (bottom) iterations. From Figure 2.26 of \citep {SudderthThesis}. Used with kind permission of Erik Sudderth. \relax }}{1039}{figure.caption.627} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {31.7}{\ignorespaces Comparison of collapsed Gibbs samplers for a DP mixture (dark blue) and a finite mixture (light red) with $K=4$ applied to the $N=300$ data points shown in \cref {fig:DPMclustersErik}. (a) Log probability $\log p(X|\pi ,\theta )$ vs MCMC iteration for 20 different starting values. (b) Median (thick line) and quantiles (dashed lines) over 100 different starting values. (c) Number of components with at least 2\% of the probability mass, vs iteration. (Intensity is proportional to posterior probability.) (d) Posterior over $K$ based on the last 900 samples. From Figure 2.27 of \citep {SudderthThesis}. Used with kind permission of Erik Sudderth. \relax }}{1040}{figure.caption.628} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {31.8}{\ignorespaces (a) A realization of the cluster matrix $Z$ from the Chinese restaurant process (CRP). (b) A realization of the feature matrix $Z$ from the Indian buffet process (IBP). Rows are customers and columns are dishes (for the CRP each table has its own dish). Both produce binary matrices, with the CRP assigning a customer to a single table (cluster), and the IBP assigning a set of dishes (features) to each customer. \relax }}{1047}{figure.caption.630} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {31.9}{\ignorespaces (a) A realization of a Beta process on the real line. The atoms of the random measure are shown in red, while the Beta subordinator (whose value at time $t$ sums all atoms up to $t$) is shown in black. (b) Samples from the Beta process. \relax }}{1048}{figure.caption.631} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {31.10}{\ignorespaces Poisson process construction of a completely random measure $G = \sum _i w_i \delta _{\theta _i}$.
The set of pairs $\{(w_1,\theta _1), (w_2,\theta _2), \dotsc \}$ is a realization from a Poisson process with intensity $\lambda (\theta ,w) = H(\theta ) \gamma (w)$. \relax }}{1052}{figure.caption.633} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {31.11}{\ignorespaces (a) A realization of a Brownian motion. (b) A realization of a Cauchy process (an $\alpha $-stable process with $\alpha =1$). \relax }}{1053}{figure.caption.634} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {31.12}{\ignorespaces A realization of (a) a homogeneous Poisson process, (b) an underdispersed point process (Swedish pine sapling locations\nobreakspace {}\citep {ripley2005spatial}), and (c) an overdispersed point process (California redwood tree locations\nobreakspace {}\citep {ripley1977modelling}). \relax }}{1056}{figure.caption.635} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {31.13}{\ignorespaces (a) A self-exciting Hawkes process. Each event causes a rise in the event rate, which can lead to burstiness. (b) A pair of mutually exciting Hawkes processes; events from each cause a rise in the event rate of the other. \relax }}{1057}{figure.caption.636} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {31.14}{\ignorespaces The determinant of a matrix $V$ is the volume of the parallelogram spanned by the columns of $V$ and the origin. \relax }}{1061}{figure.caption.637} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {33.1}{\ignorespaces The Interpretable Machine Learning ecosystem. While standard machine learning can often abstract away elements of the context and consider only the process of learning models given a data distribution and a loss, interpretable machine learning is inextricably tied to a socio-technical context.\relax }}{1069}{figure.caption.638} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {33.2}{\ignorespaces Models described in the simple example. Both of these models have the same qualitative characteristics, but different explanation methods will describe these models quite differently---and in ways that could potentially confuse someone unfamiliar with how the explanation is computed.\relax }}{1070}{figure.caption.639} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {33.3}{\ignorespaces An example of computational evaluation using (semi-)synthetic datasets from \cite {yang2019benchmarking}: foreground images (e.g. dogs, backpacks) are placed on different backgrounds (e.g. indoors, outdoors) to test what an explanation is looking for.\relax }}{1085}{figure.caption.640} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {33.4}{\ignorespaces Interpretability methods (each row) and their explanations as we randomize layers starting from the logits, and cumulatively to the bottom layer (each column), in the context of an image classification task. The rightmost column shows a completely randomized network. Most methods output similar explanations for the first two columns; one predicts the bird, and the other predicts randomly. This sanity check tests the hypothesis that the explanation should significantly change (quantitatively and qualitatively) when comparing a trained model and an untrained model\nobreakspace {}\cite {adebayo2020sanity}.
\relax }}{1086}{figure.caption.641} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {33.5}{\ignorespaces Examples of computational evaluation with real datasets from \cite {dabkowski2017real,ghorbani2019automatic}. For example, one would expect that adding or deleting patches rated as most relevant for an image classification would have a large effect on the classification compared to patches not rated as important.\relax }}{1088}{figure.caption.642} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {33.6}{\ignorespaces (Potential) perception issues: an explanation from a trained network (left) is visually indistinguishable to humans from one from an untrained network (right)---even if they are not exactly identical.\relax }}{1097}{figure.caption.643} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {34.1}{\ignorespaces Spectrograms for three different spoken sentences. The x-axis shows progression of time and the y-axis shows different frequency bands. The energy of the signal in different bands is shown as intensity in grayscale values with progression of time. (A) and (B) show spectrograms of the same sentence ``How to recognize speech with this new display'' spoken by two different speakers, male and female. Although the frequency characterization is similar, the formant frequencies are much more clearly defined in the speech of the female speaker. (C) shows the spectrogram of the utterance ``How to wreck a nice beach with this nudist play'' spoken by the same female speaker as in (B). (A) and (B) are not identical even though they are composed of the same words. (B) and (C) are similar to each other even though they are not the same sentences. From Figure 1.2 of \citep {Ganapathiraju2007}. Used with kind permission of Madhavi Ganapathiraju. \relax }}{1103}{figure.caption.644} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {34.2}{\ignorespaces Influence diagrams for the oil wildcatter problem. Ovals are random variables (chance nodes), squares are decision (action) nodes, diamonds are utility (value) nodes. (a) Basic model. (b) An extension in which we have an information arc from the Sound chance node to the Drill decision node. (c) An extension in which we get to decide whether to perform a test or not, as well as whether to drill or not. \relax }}{1105}{figure.caption.645} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {34.3}{\ignorespaces Decision tree for the oil wildcatter problem. Black circles are chance variables, black squares are decision nodes, diamonds are the resulting utilities. Green leaf nodes have higher utility than red leaf nodes. \relax }}{1107}{figure.caption.646} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {34.4}{\ignorespaces Total expected profit (a) and error rate (b) as a function of the sample size used for website testing. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#ab\_test\_demo.ipynb}{ab\_test\_demo.ipynb}. \relax }}{1112}{figure.caption.647} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {34.5}{\ignorespaces Illustration of the feedback problem in online advertising and recommendation systems. The click through rate (CTR) model is used to decide what ads to show, which affects what data is collected, which affects how the model learns. From Figures 1--2 of \citep {Du2021ads}. Used with kind permission of Chao Du.
\relax }}{1115}{figure.caption.648} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {34.6}{\ignorespaces Illustration of sequential belief updating for a two-armed beta-Bernoulli bandit. The prior for the reward for action 1 is the (blue) uniform distribution $\mathrm {Beta}(1,1)$; the prior for the reward for action 2 is the (orange) unimodal distribution $\mathrm {Beta}(2,2)$. We update the parameters of the belief state based on the chosen action, and based on whether the observed reward is success (1) or failure (0). \relax }}{1115}{figure.caption.649} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {34.7}{\ignorespaces Illustration of the reward distribution $Q(a)$ for 3 different actions, and the corresponding lower and upper confidence bounds. From \citep {Silver2018L9}. Used with kind permission of David Silver. \relax }}{1118}{figure.caption.650} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {34.8}{\ignorespaces Illustration of Thompson sampling applied to a linear-Gaussian contextual bandit. The context has the form ${\bm {s}}_t =(1,t,t^2)$. (a) True reward for each arm vs time. (b) Cumulative reward per arm vs time. (c) Cumulative regret vs time. Generated by \href {https://github.com/probml/pyprobml/blob/master/notebooks.md\#thompson\_sampling\_linear\_gaussian.ipynb}{thompson\_sampling\_linear\_gaussian.ipynb}. \relax }}{1119}{figure.caption.651} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {34.9}{\ignorespaces Illustration of an MDP as a finite state machine (FSM). The MDP has three discrete states (green circles), two discrete actions (orange circles), and two non-zero rewards (orange arrows). The numbers on the black edges represent state transition probabilities, e.g., $p(s'=s_0|a=a_0,s=s_0)=0.7$; most state transitions are impossible (probability 0), so the graph is sparse. The numbers on the yellow wiggly edges represent expected rewards, e.g., $R(s=s_1, a=a_0, s'=s_0) = +5$; state transitions with zero reward are not annotated. From \url {https://en.wikipedia.org/wiki/Markov_decision_process}. Used with kind permission of Wikipedia author waldoalvarez. \relax }}{1121}{figure.caption.652} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {34.10}{\ignorespaces Illustration of a partially observable Markov decision process (POMDP) with hidden environment state $s_t$ which generates the observation $x_t$, controlled by an agent with internal belief state $b_t$ which generates the action $a_t$. The reward $r_t$ depends on $s_t$ and $a_t$. Nodes in this graph represent random variables (circles) and decision variables (squares). \relax }}{1123}{figure.caption.653} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {34.11}{\ignorespaces Left: illustration of a simple MDP corresponding to a 1d grid world of 3 non-absorbing states and 2 actions. Right: optimal $Q$-functions for different values of $\gamma $. Adapted from Figures 3.1, 3.2, 3.4 of \citep {Graesser2019}. \relax }}{1126}{figure.caption.654} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {34.12}{\ignorespaces Policy iteration vs value iteration represented as backup diagrams. Empty circles represent states, solid (filled) circles represent actions. Adapted from Figure 8.6 of \citep {Suttonv2}.
\relax }}{1128}{figure.caption.655} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {34.13}{\ignorespaces (a) Illustration of uncertainty sampling for a simple random forest classifier that has been fit to 50 random MNIST examples. The entropy sampling heuristic would pick example 2. The margin sampling heuristic would pick example 1. The least confident heuristic would pick example 1. (b) F1 score on the test data vs amount of training data, as selected by 4 data selection methods. From \url {https://patel-zeel.github.io/active-learning-visualization/}. Used with kind permission of Zeel Patel. \relax }}{1130}{figure.caption.656} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {35.1}{\ignorespaces Overview of RL methods. Abbreviations: DQN = Deep Q network (\cref {sec:DQN}); MPC = Model Predictive Control (\cref {sec:MBRL}); HJB = Hamilton Jacobi Bellman equation; TD = temporal difference learning (\cref {sec:TD}). Adapted from a slide by Steve Brunton. \relax }}{1132}{figure.caption.657} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {35.2}{\ignorespaces Backup diagrams of $V(s_t)$ for Monte Carlo, temporal difference, and dynamic programming updates of the state-value function. Used with kind permission of Andy Barto. \relax }}{1137}{figure.caption.660} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {35.3}{\ignorespaces The backup diagram for TD($\lambda $). Standard TD learning corresponds to $\lambda =0$, and standard MC learning corresponds to $\lambda =1$. From Figure 12.1 of \citep {Suttonv2}. Used with kind permission of Richard Sutton. \relax }}{1138}{figure.caption.661} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {35.4}{\ignorespaces Illustration of Q learning for the 1d grid world in \cref {fig:Q1d} using $\epsilon $-greedy exploration. At the end of episode 1, we make a transition from $S_3$ to $S_{T2}$ and get a reward of $r=1$, so we estimate $Q(S_3,\downarrow )=1$. In episode 2, we make a transition from $S_2$ to $S_3$, so $Q(S_2,\downarrow )$ gets incremented by $\gamma Q(S_3,\downarrow )=0.9$. Adapted from Figure 3.3 of \citep {Graesser2019}. \relax }}{1140}{figure.caption.663} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {35.5}{\ignorespaces Comparison of Q-learning and double Q-learning on a simple episodic MDP using $\epsilon $-greedy action selection with $\epsilon =0.1$. The initial state is A, and squares denote absorbing states. The data are averaged over 10,000 runs. From Figure 6.5 of \citep {Suttonv2}. Used with kind permission of Richard Sutton. \relax }}{1141}{figure.caption.664} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {35.6}{\ignorespaces Illustration of heuristic search. In this figure, the subtrees are ordered according to a depth-first search procedure. From Figure 8.9 of \citep {Suttonv2}. Used with kind permission of Richard Sutton. \relax }}{1150}{figure.caption.668} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {35.7}{\ignorespaces (a) A cart-pole system being controlled by a policy learned by PILCO using just 17.5 seconds of real-world interaction. The goal state is marked by the red cross. The initial state is where the cart is stationary on the right edge of the workspace, and the pendulum is horizontal. For a video of the system learning, see \url {https://bit.ly/35fpLmR}.
(b) A low-quality robot arm being controlled by a block-stacking policy learned by PILCO using just 230 seconds of real-world interaction. From Figures 11 and 12 of \citep {pilcoJ}. Used with kind permission of Marc Deisenroth. \relax }}{1152}{figure.caption.669} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {35.8}{\ignorespaces (a) Illustration of an agent interacting with the VizDoom environment. (The yellow blobs represent fireballs being thrown towards the agent by various enemies.) The agent has a world model, composed of a vision system V and a memory RNN M, and has a controller C. (b) Detailed representation of the memory model. Here $h_t$ is the deterministic hidden state of the RNN at time $t$, which is used to predict the next latent of the VAE, $z_{t+1}$, using a mixture density network (MDN). Here $\tau $ is a temperature parameter used to increase the variance of the predictions, to prevent the controller from exploiting model inaccuracies. From Figures 4, 6 of \citep {worldModels}. Used with kind permission of David Ha. \relax }}{1154}{figure.caption.670} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {35.9}{\ignorespaces Illustration of some image-based control problems used in the PlaNet paper. Inputs are $64 \times 64 \times 3$. (a) The cartpole swingup task has a fixed camera so the cart can move out of sight, making this a partially observable problem. (b) The reacher task has a sparse reward. (c) The cheetah running task includes both contacts and a larger number of joints. (d) The finger spinning task includes contacts between the finger and the object. (e) The cup task has a sparse reward that is only given once the ball is caught. (f) The walker task requires balance and predicting difficult interactions with the ground when the robot is lying down. From Figure 1 of \citep {planet}. Used with kind permission of Danijar Hafner. \relax }}{1155}{figure.caption.671} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {35.10}{\ignorespaces (a) ... (b) ... From Figures 11.1 \& 11.2 of \citep {Suttonv2}. Used with kind permission of Richard Sutton. \relax }}{1161}{figure.caption.672} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {35.11}{\ignorespaces A graphical model for optimal control. States and actions are observed, while optimality variables are not. Adapted from Figure 1b of \citep {Levine2018inf}. \relax }}{1163}{figure.caption.673} \defcounter {refsection}{0}\relax \addvspace {10\p@ } \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {36.1}{\ignorespaces Ice cream production is strongly associated with deaths by drowning. Ice cream production data from the US Department of Agriculture National Agricultural Statistics Service. Drowning data from the National Center for Health Statistics at the United States Centers for Disease Control. \relax }}{1170}{figure.caption.674} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {36.2}{\ignorespaces Smoking is strongly associated with lung cancer. Figure by Max Roser, \url {ourworldindata.org/smoking-big-problem-in-brief}.\relax }}{1170}{figure.caption.675} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {36.3}{\ignorespaces (Left) Causal graph illustrating relationships between smoking $A$, cancer $Y$, health consciousness $H$, and genetic cancer predisposition $G$.
(Right) ``Mutilated'' causal graph illustrating relationships under an intervention on smoking $A$.\relax }}{1173}{figure.caption.676} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {36.4}{\ignorespaces A causal DAG illustrating a situation where treatment $A$ and outcome $Y$ are both influenced by observed confounders $X$.\relax }}{1180}{figure.caption.677} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {36.5}{\ignorespaces The M-bias causal graph. Here, $A$ and $Y$ are not confounded. However, conditioning on the covariate $X$ opens a backdoor path, passing through $U_1$ and $U_2$ (because $X$ is a collider). Thus, adjusting for $X$ creates bias. This is true even though $X$ may be a pre-treatment variable.\relax }}{1189}{figure.caption.680} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {36.6}{\ignorespaces Causal graph illustrating the Instrumental Variable setup. The treatment $A$ and outcome $Y$ are both influenced by unobserved confounder $U$. Nevertheless, identification is sometimes possible due to the presence of the instrument $Z$. We also allow for observed covariates $X$ that we may need to adjust for. The dashed arrow between $U$ and $X$ indicates a statistical dependency where we remain agnostic to the particular causal relationship.\relax }}{1193}{figure.caption.681} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {36.7}{\ignorespaces Causal graph assumed for the difference-in-differences setting. Here, the outcome of interest is the difference between the pre- and post-treatment period, $Y_1 - Y_0$. This difference is influenced by the treatment, unobserved factors $U$, and observed covariates $X$. The dashed arrow between $U$ and $A$ indicates a statistical dependency between the variables, but where we remain agnostic to the precise causal mechanism. For example, in the minimum wage example, $U$ might be the average income in the restaurant's neighbourhood, which is dependent on the state, and hence also the treatment.\relax }}{1203}{figure.caption.683} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {36.8}{\ignorespaces Austen plot showing how strong an unobserved confounder would need to be to induce a bias of 2 in an observational study of the effect of combination blood pressure medications on diastolic blood pressure \cite {Dorie:Harada:Carnegie:Hill:2016}. We chose this bias to equal the nominal average treatment effect estimated from the data. We model the outcome with Bayesian Additive Regression Trees and the treatment assignment with logistic regression. The curve shows all values of treatment and outcome influence that would induce a bias of 2. The colored dots show the influence strength of (groups of) observed covariates, given all other covariates. For example, an unobserved confounder with as much influence as the patient's age might induce a bias of about 2. \relax }}{1207}{figure.caption.685} \defcounter {refsection}{0}\relax \contentsline {figure}{\numberline {36.9}{\ignorespaces Causal graph illustrating the frontdoor criterion setup. The effect of the treatment $A$ on outcome $Y$ is entirely mediated by mediator $M$. This allows us to infer the causal effect even if the treatment and outcome are confounded by $U$.\relax }}{1215}{figure.caption.696}