GitHub Repository: probml/pyprobml
Path: blob/master/notebooks/book2/34/bandits.ipynb


Multi-armed bandits

import numpy as np
np.set_printoptions(precision=3)
import pandas as pd

Multi-armed bandits

The classic multi-armed bandit is illustrated below. The agent (the octopus) must choose which "arm" to pull. Each slot machine (aka one-armed bandit) has a different, but unknown, payout rate (reward), denoted $R(a)$.

[Figure: an octopus (the agent) choosing which slot machine (one-armed bandit) to play.]

  • In a MAB problem, the goal is to choose a sequence of actions (arm pulls) that maximizes the sum of expected rewards: $$a_{1:T} = \arg \max_{a_{1:T}} \sum_{t=1}^T E_{a_t \sim \pi}[R(a_t | \theta)]$$ where $T$ is the finite horizon, $\pi()$ is the (stochastic) policy, and $R(a|\theta)$ is the reward function for action $a$ using model parameters $\theta$.

  • In the infinite horizon case, we use a discount factor $\gamma < 1$ to ensure the sum converges: $$J = \sum_{t=1}^\infty \gamma^{t-1} E_{a_t \sim \pi}[R(a_t | \theta)]$$

  • For binary rewards, we can use a Bernoulli bandit, $$p(R|a) = \text{Ber}(R|\mu_a)$$ where $\mu_a = E[R|a] = p(R=1|a)$.

  • For real-valued rewards, we can use a Gaussian bandit, $$p(R|a) = \text{Gauss}(R|\mu_a, \sigma_a^2)$$ (a minimal simulation of both kinds of bandit is sketched below).
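
As a concrete illustration, here is a minimal sketch of these two reward models as tiny environment classes. The arm parameters (the values of $\mu_a$ and $\sigma_a$) are made up for illustration, not taken from the text.

# Minimal sketch of stationary Bernoulli and Gaussian bandit environments.
# The arm parameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

class BernoulliBandit:
    def __init__(self, mus):
        self.mus = np.asarray(mus)           # mu_a = p(R=1 | a)
    def pull(self, a):
        return rng.binomial(1, self.mus[a])  # R ~ Ber(mu_a)

class GaussianBandit:
    def __init__(self, mus, sigmas):
        self.mus = np.asarray(mus)
        self.sigmas = np.asarray(sigmas)
    def pull(self, a):
        return rng.normal(self.mus[a], self.sigmas[a])  # R ~ Gauss(mu_a, sigma_a^2)

bern = BernoulliBandit(mus=[0.2, 0.5, 0.7])
gauss = GaussianBandit(mus=[0.0, 1.0, 2.0], sigmas=[1.0, 1.0, 1.0])
print([bern.pull(a) for a in range(3)])
print(np.round([gauss.pull(a) for a in range(3)], 3))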

Contextual bandits

  • In contextual bandits, the agent is in a random state $s_t$ at each step. States are drawn iid from some unknown distribution $p(s_t)$. The policy should now depend on the input, $a_t \sim \pi(s_t)$.

  • The goal is to maximize $$J = \sum_{t=1}^\infty \gamma^{t-1} E_{s_t \sim p(s),\, a_t \sim \pi(a|s_t)}[R(s_t, a_t | \theta)]$$

  • There are numerous applications, e.g.:
    – Online advertising, where the action specifies which ad to show, and the state specifies user features (e.g., search history) and/or web page features (e.g., content).
    – Recommender systems, where the action specifies which (set of) item(s) to show, and the state specifies user features.
    – Clinical trials, where the action specifies which drug to give, and the state specifies patient features.

  • For binary rewards, we can use a logistic regression model, $$p(R|s,a;\theta) = \text{Ber}(R|\sigma(\theta^T \phi(s,a)))$$ where $\phi(s,a)$ is a feature vector, possibly computed using a deep neural network, and $\sigma(x) = 1/(1+e^{-x})$ is the sigmoid or logistic function.

  • For real-valued rewards, we can replace the sigmoid output with a linear output, and use a Gaussian observation model.

  • A key problem is that the rewards for each (state, action) pair (represented by model parameters $\theta$) are unknown.

  • The agent can infer the parameters in an online (recursive) fashion given the data stream $D_{1:t} = (s_{1:t}, a_{1:t}, r_{1:t})$ using sequential Bayesian updating: $$p(\theta|D_{1:t}) = p(\theta|D_t, D_{1:t-1}) = \frac{p(D_t|\theta, D_{1:t-1})\, p(\theta|D_{1:t-1})}{p(D_t|D_{1:t-1})} \propto p(D_t|\theta)\, p(\theta|D_{1:t-1})$$

  • The likelihood is $p(D_t|\theta) = p(r_t|s_t, a_t, \theta)$ and the prior is the belief state from the previous time step, $p(\theta|D_{1:t-1})$. A concrete linear-Gaussian version of this update is sketched after this list.
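
The following is a minimal sketch of this recursive update for the linear-Gaussian case, i.e. $p(R|s,a;\theta) = \text{Gauss}(R|\theta^T \phi(s,a), \sigma^2)$ with a Gaussian prior on $\theta$, so the posterior stays Gaussian and can be updated in closed form. The feature map, noise level, and random exploration policy below are illustrative assumptions, not part of the text.

# Sketch: sequential Bayesian updating of theta for a linear-Gaussian
# contextual bandit, p(R|s,a;theta) = Gauss(theta^T phi(s,a), sigma^2).
import numpy as np

def phi(s, a, num_actions):
    # Hypothetical feature map: a per-action copy of the state features.
    d = len(s)
    x = np.zeros(d * num_actions)
    x[a * d:(a + 1) * d] = s
    return x

class GaussianBelief:
    # Gaussian belief state p(theta | D_{1:t}) with known noise variance sigma2.
    def __init__(self, dim, prior_var=1.0, sigma2=0.25):
        self.Sigma_inv = np.eye(dim) / prior_var   # posterior precision
        self.b = np.zeros(dim)                     # precision-weighted mean
        self.sigma2 = sigma2
    def update(self, x, r):
        # One recursive step: fold in the new observation (s_t, a_t, r_t).
        self.Sigma_inv += np.outer(x, x) / self.sigma2
        self.b += x * r / self.sigma2
    @property
    def mean(self):
        return np.linalg.solve(self.Sigma_inv, self.b)

rng = np.random.default_rng(0)
num_actions, state_dim = 3, 2
theta_true = rng.normal(size=num_actions * state_dim)   # unknown to the agent
belief = GaussianBelief(dim=num_actions * state_dim)

for t in range(200):
    s = rng.normal(size=state_dim)               # s_t ~ p(s)
    a = rng.integers(num_actions)                # purely random exploration policy
    x = phi(s, a, num_actions)
    r = x @ theta_true + 0.5 * rng.normal()      # r_t ~ p(R | s_t, a_t, theta)
    belief.update(x, r)

print(np.round(belief.mean - theta_true, 3))     # posterior mean approaches theta_true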

Exploration-exploitation tradeoff

  • The agent must try each (state, action) pair (possibly multiple times) in order to collect data, but these experiments incur an opportunity cost. This is called the exploration-exploitation tradeoff.

  • A common heuristic solution is the upper confidence bound method, which selects the action according to $$a_t = \arg \max_a E[R(s_t, a;\theta)] + c\, \text{std}[R(s_t, a;\theta)]$$ (a Beta-Bernoulli implementation of this rule is sketched after the figure below).

  • The constant $c \geq 0$ reflects the amount of optimism about unknown rewards, and hence the amount of exploration.

  • Below we illustrate the distribution over possible rewards for a 3-armed bandit (without contextual features).

[Figure (silver-UCB.png): distributions over the possible reward of each arm of a 3-armed bandit.]
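
Below is a sketch of the UCB rule for a (non-contextual) Bernoulli bandit, where a Beta posterior per arm gives $E[R(a)]$ and $\text{std}[R(a)]$ in closed form. The true arm means and the value of $c$ are illustrative assumptions.

# Sketch: UCB action selection a_t = argmax_a E[R(a)] + c * std[R(a)],
# using a conjugate Beta(alpha_a, beta_a) posterior per arm.
import numpy as np

rng = np.random.default_rng(0)
true_mus = np.array([0.2, 0.5, 0.7])     # unknown to the agent (illustrative)
K, c, T = len(true_mus), 2.0, 1000
alpha, beta = np.ones(K), np.ones(K)     # uniform Beta(1,1) priors

total_reward = 0
for t in range(T):
    mean = alpha / (alpha + beta)                                        # E[R(a)]
    std = np.sqrt(alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1)))  # std[R(a)]
    a = np.argmax(mean + c * std)        # optimism in the face of uncertainty
    r = rng.binomial(1, true_mus[a])
    alpha[a] += r                        # conjugate Beta-Bernoulli update
    beta[a] += 1 - r
    total_reward += r

print("pulls per arm:", (alpha + beta - 2).astype(int))
print("average reward:", total_reward / T)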

Reinforcement learning

  • In reinforcement learning, the setup is similar to a contextual bandit, except now the current state depends on the previously visited states, as well as the actions that the agent has taken.

  • If we assume a Markov model for the environment, we get a Markov decision process or MDP: $$J = \sum_{t=1}^\infty \gamma^{t-1} E_{s_t \sim p(s|s_{t-1}, a_{t-1}; \theta),\, a_t \sim \pi(a|s_{1:t})}[R(s_t, a_t | \theta)]$$

  • Solving for the optimal policy when the parameters $\theta$ of the environment are unknown is challenging, and is beyond the scope of this tutorial. (A Monte Carlo estimate of $J$ for a fixed policy in a small MDP is sketched below.)
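
As a minimal illustration of the MDP objective, the sketch below builds a tiny random MDP and estimates $J$ for a uniformly random policy by Monte Carlo rollouts. The transition probabilities and reward table are made up for illustration.

# Sketch: estimating the discounted return J of a fixed (uniform random)
# policy in a tiny MDP. The transition tensor and rewards are illustrative.
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))  # P[s, a, s']
R = rng.normal(size=(num_states, num_actions))                          # R(s, a)

def rollout(T=200):
    s, G = 0, 0.0
    for t in range(T):
        a = rng.integers(num_actions)            # a_t ~ uniform random policy
        G += gamma ** t * R[s, a]                # accumulate discounted reward
        s = rng.choice(num_states, p=P[s, a])    # s_{t+1} ~ p(s' | s_t, a_t)
    return G

print("estimated J:", round(np.mean([rollout() for _ in range(500)]), 3))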
