Multi-armed bandits
The classic multi-armed bandit is illustrated below. The agent (the octopus) must choose which "arm" to pull. Each slot machine (aka one-armed bandit) has a different, but unknown, payout rate (expected reward), denoted $\mu_a$.
In a MAB problem, the goal is to choose a sequence of actions (arm pulls) that maximizes the sum of expected rewards:

$$J \triangleq \sum_{t=1}^{T} \mathbb{E}_{a_t \sim \pi}\left[ R(a_t; \boldsymbol{\theta}) \right]$$

where $T$ is the finite horizon, $\pi$ is the (stochastic) policy, and $R(a; \boldsymbol{\theta})$ is the reward function for action $a$ using model parameters $\boldsymbol{\theta}$.
In the infinite horizon case, we use a discount factor $0 \le \gamma < 1$ to ensure the sum converges:

$$J \triangleq \sum_{t=1}^{\infty} \gamma^{t-1}\, \mathbb{E}_{a_t \sim \pi}\left[ R(a_t; \boldsymbol{\theta}) \right]$$
For binary rewards, we can use a Bernoulli bandit, where $p(R \mid a, \boldsymbol{\theta}) = \mathrm{Ber}(R \mid \mu_a)$ and $\mu_a \in [0, 1]$ is the (unknown) expected reward for arm $a$.
For real-valued rewards, we can use a Gaussian bandit, $p(R \mid a, \boldsymbol{\theta}) = \mathcal{N}(R \mid \mu_a, \sigma_a^2)$.
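As a concrete illustration, here is a minimal sketch (in Python/NumPy) of sampling rewards from a Bernoulli or Gaussian bandit; the payout rates `mu` and noise scale `sigma` are made-up illustrative values, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([0.2, 0.5, 0.7])   # assumed (unknown to the agent) expected reward per arm
sigma = 1.0                       # reward noise scale for the Gaussian bandit

def bernoulli_reward(action: int) -> int:
    """Draw a binary reward R ~ Ber(mu[action])."""
    return int(rng.random() < mu[action])

def gaussian_reward(action: int) -> float:
    """Draw a real-valued reward R ~ N(mu[action], sigma^2)."""
    return float(rng.normal(mu[action], sigma))

# Pulling arm 2 many times and averaging approximates its payout rate mu[2].
print(np.mean([bernoulli_reward(2) for _ in range(1000)]))
```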
Contextual bandits
In contextual bandits, the agent is in a random state (context) $s_t$ at each step. States are drawn iid from some unknown distribution $p(s)$. The policy should now depend on the inputs, $a_t \sim \pi(a_t \mid s_t)$.
The goal is to maximize

$$J \triangleq \sum_{t=1}^{T} \mathbb{E}_{s_t \sim p(s),\, a_t \sim \pi(\cdot \mid s_t)}\left[ R(s_t, a_t; \boldsymbol{\theta}) \right]$$
There are numerous applications, e.g.:
* Online advertising, where the action specifies which ad to show, and the state specifies user features (e.g., search history) and/or web page features (e.g., content).
* Recommender systems, where the action specifies which (set of) item(s) to show, and the state specifies user features.
* Clinical trials, where the action specifies which drug to give, and the state specifies patient features.
For binary rewards, we can use a logistic regression model, $p(R = 1 \mid s, a, \boldsymbol{\theta}) = \sigma(\boldsymbol{\theta}^\top \boldsymbol{\phi}(s, a))$, where $\boldsymbol{\phi}(s, a)$ is a feature vector, possibly computed using a deep neural network, and $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid or logistic function.
For real-valued rewards, we can replace the sigmoid output with a linear output, and use a Gaussian observation model.
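Here is a minimal sketch of a contextual Bernoulli bandit with a logistic reward model; the block-structured feature map `phi` and the parameter values are illustrative assumptions, not a prescribed choice.

```python
import numpy as np

rng = np.random.default_rng(0)
num_actions, state_dim = 3, 4
theta = rng.normal(size=num_actions * state_dim)   # true (unknown) reward parameters

def phi(state: np.ndarray, action: int) -> np.ndarray:
    """Feature vector: the state copied into the block corresponding to the action."""
    feats = np.zeros(num_actions * state_dim)
    feats[action * state_dim:(action + 1) * state_dim] = state
    return feats

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))

def reward(state: np.ndarray, action: int) -> int:
    """Binary reward R ~ Ber(sigmoid(theta^T phi(s, a)))."""
    p = sigmoid(theta @ phi(state, action))
    return int(rng.random() < p)

state = rng.normal(size=state_dim)   # context drawn iid from some unknown distribution
print([reward(state, a) for a in range(num_actions)])
```

For the real-valued (Gaussian) case, one would replace the Bernoulli draw with a normal draw centered at $\boldsymbol{\theta}^\top \boldsymbol{\phi}(s, a)$.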
A key problem is that the rewards for each (state, action) pair (represented by model parameters $\boldsymbol{\theta}$) are unknown.
The agent can infer the parameters in an online (recursive) fashion given the data stream $\mathcal{D}_{1:t} = \{(s_i, a_i, r_i) : i = 1 : t\}$ using sequential Bayesian updating:

$$p(\boldsymbol{\theta} \mid \mathcal{D}_{1:t}) \propto p(r_t \mid s_t, a_t, \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathcal{D}_{1:t-1})$$
The likelihood is $p(r_t \mid s_t, a_t, \boldsymbol{\theta})$ and the prior is the belief state from the previous time step, $p(\boldsymbol{\theta} \mid \mathcal{D}_{1:t-1})$.
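For intuition, here is a minimal sketch of sequential Bayesian updating for the (non-contextual) Bernoulli bandit, assuming a conjugate Beta prior on each arm's payout rate so the recursive update has closed form; the true payout rates are made-up values used only to simulate rewards.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([0.2, 0.5, 0.7])   # unknown payout rates (assumed for the simulation)
alpha = np.ones(3)                     # Beta prior pseudo-counts of successes per arm
beta = np.ones(3)                      # Beta prior pseudo-counts of failures per arm

for t in range(5000):
    a = rng.integers(3)                    # pull an arm (uniformly at random, for illustration)
    r = int(rng.random() < mu_true[a])     # observe a Bernoulli reward
    alpha[a] += r                          # recursive update: the prior at step t
    beta[a] += 1 - r                       # is the posterior from step t-1

print(alpha / (alpha + beta))              # posterior means approach mu_true
```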
Exploration-exploitation tradeoff
The agent must try each (state, action) pair (possibly multiple times) in order to collect data, but these experiments incur an opportunity cost. This is called the exploration-exploitation tradeoff.
A common heuristic solution is the upper confidence bound (UCB) method, which selects the action according to

$$a_t = \operatorname*{argmax}_a\; \hat{\mu}_t(a) + c\, \hat{\sigma}_t(a)$$

where $\hat{\mu}_t(a)$ and $\hat{\sigma}_t(a)$ are the posterior mean and standard deviation of the reward for action $a$ at time $t$. The constant $c$ reflects the amount of optimism about unknown rewards, and hence the amount of exploration.
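Here is a minimal sketch of UCB action selection, reusing the Beta-Bernoulli posterior from the sketch above; the payout rates and the value of $c$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([0.2, 0.5, 0.7])            # unknown payout rates (assumed here)
alpha, beta, c = np.ones(3), np.ones(3), 2.0   # Beta(1, 1) priors; optimism constant c

for t in range(1000):
    post_mean = alpha / (alpha + beta)
    post_std = np.sqrt(alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1)))
    a = int(np.argmax(post_mean + c * post_std))   # optimistic (UCB) action choice
    r = int(rng.random() < mu_true[a])             # observe reward for the chosen arm
    alpha[a] += r                                   # sequential Bayesian update, as above
    beta[a] += 1 - r

print(alpha / (alpha + beta))   # the best arm is pulled most often, so its estimate is sharpest
```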
Below we illustrate the distribution over possible rewards for a 3-armed bandit (without contextual features).
Reinforcement learning
In reinforcement learning, the setup is similar to a contextual bandit, except now the current state depends on the previously visited states, as well as the actions that the agent has taken.
If we assume a Markov model for the environment, we get a Markov decision process or MDP.
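Concretely, the Markov assumption (stated here in standard form for reference) says that the next state depends only on the current state and action,

$$p(s_{t+1} \mid s_{1:t}, a_{1:t}) = p(s_{t+1} \mid s_t, a_t)$$

so the environment is characterized by a state transition model $p(s' \mid s, a)$ together with the reward function $R(s, a)$.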
Solving for the optimal policy when the parameters of the environment are unknown is challenging. This is beyond the scope of this tutorial.