Copyright 2023 The TF-Agents Authors.
Tutorial on Multi-Armed Bandits in TF-Agents
Setup
If you haven't installed the following dependencies, run:
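In a Colab or Jupyter notebook this is typically a single pip command (shown here as a sketch; adjust to your setup):

```python
!pip install tf-agents
```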
Imports
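The code sketches in the rest of this tutorial build on a common set of imports along the following lines (module paths as in recent TF-Agents releases):

```python
import abc
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

from tf_agents.agents import tf_agent
from tf_agents.bandits.agents import lin_ucb_agent
from tf_agents.bandits.environments import stationary_stochastic_py_environment as sspe
from tf_agents.bandits.metrics import tf_metrics
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments import py_environment
from tf_agents.environments import tf_py_environment
from tf_agents.policies import tf_policy
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.specs import array_spec
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import policy_step
from tf_agents.trajectories import time_step as ts
from tf_agents.trajectories import trajectory
```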
Introduction
The Multi-Armed Bandit problem (MAB) is a special case of Reinforcement Learning: an agent collects rewards in an environment by taking some actions after observing some state of the environment. The main difference between general RL and MAB is that in MAB, we assume that the action taken by the agent does not influence the next state of the environment. Therefore, agents do not model state transitions, credit rewards to past actions, or "plan ahead" to get to reward-rich states.
As in other RL domains, the goal of a MAB agent is to find a policy that collects as much reward as possible. It would be a mistake, however, to always try to exploit the action that promises the highest reward, because then there is a chance that we miss out on better actions if we do not explore enough. This is the main problem to be solved in MAB, often called the exploration-exploitation dilemma.
Bandit environments, policies, and agents for MAB can be found in subdirectories of tf_agents/bandits.
Environments
In TF-Agents, the environment class serves the role of giving information on the current state (this is called the observation or context), receiving an action as input, performing a state transition, and outputting a reward. This class also takes care of resetting when an episode ends, so that a new episode can start. This is realized by calling a reset function when a state is labelled as the last one of the episode.
For more details, see the TF-Agents environments tutorial.
As mentioned above, MAB differs from general RL in that actions do not influence the next observation. Another difference is that in Bandits, there are no "episodes": every time step starts with a new observation, independently of previous time steps.
To make sure observations are independent and to abstract away the concept of RL episodes, we introduce subclasses of PyEnvironment and TFEnvironment: BanditPyEnvironment and BanditTFEnvironment. These classes expose two private member functions that remain to be implemented by the user: _observe and _apply_action.
The _observe function returns an observation. Then, the policy chooses an action based on this observation. The _apply_action function receives that action as an input and returns the corresponding reward. These private member functions are called by the functions reset and step, respectively.
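A minimal sketch of this interim abstract class, using the imports above, could look as follows:

```python
class BanditPyEnvironment(py_environment.PyEnvironment):
  """Interim abstract class for Bandit Python environments."""

  def __init__(self, observation_spec, action_spec):
    self._observation_spec = observation_spec
    self._action_spec = action_spec
    super(BanditPyEnvironment, self).__init__()

  def action_spec(self):
    return self._action_spec

  def observation_spec(self):
    return self._observation_spec

  # These two functions should not be overridden by subclasses.
  def _reset(self):
    """Returns a time step containing an observation."""
    return ts.restart(self._observe(), batch_size=self.batch_size)

  def _step(self, action):
    """Returns a time step containing the reward for the action taken."""
    reward = self._apply_action(action)
    # Every time step is the "last" one: the next step starts a new episode.
    return ts.termination(self._observe(), reward)

  # These two functions are to be implemented in subclasses.
  @abc.abstractmethod
  def _observe(self):
    """Returns an observation."""

  @abc.abstractmethod
  def _apply_action(self, action):
    """Applies `action` to the environment and returns the corresponding reward."""
```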
The above interim abstract class implements PyEnvironment's _reset and _step functions and exposes the abstract functions _observe and _apply_action to be implemented by subclasses.
A Simple Example Environment Class
The following class gives a very simple environment for which the observation is a random integer between -2 and 2, there are 3 possible actions (0, 1, 2), and the reward is the product of the action and the observation.
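As a sketch (the class name SimplePyEnvironment is just an illustrative choice), such an environment could be written as:

```python
class SimplePyEnvironment(BanditPyEnvironment):

  def __init__(self):
    action_spec = array_spec.BoundedArraySpec(
        shape=(), dtype=np.int32, minimum=0, maximum=2, name='action')
    observation_spec = array_spec.BoundedArraySpec(
        shape=(1,), dtype=np.int32, minimum=-2, maximum=2, name='observation')
    super(SimplePyEnvironment, self).__init__(observation_spec, action_spec)

  def _observe(self):
    # A random integer in [-2, 2].
    self._observation = np.random.randint(-2, 3, (1,), dtype='int32')
    return self._observation

  def _apply_action(self, action):
    # The reward is the product of the action and the observation.
    return action * self._observation[0]
```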
Now we can use this environment to get observations, and receive rewards for our actions.
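For example, with the sketch above:

```python
environment = SimplePyEnvironment()
observation = environment.reset().observation
print("observation:", observation)

action = 2
print("action:", action)

reward = environment.step(action).reward
print("reward:", reward)
```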
TF Environments
One can define a bandit environment by subclassing BanditTFEnvironment, or, similarly to RL environments, one can define a BanditPyEnvironment and wrap it with TFPyEnvironment. For the sake of simplicity, we go with the latter option in this tutorial.
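A one-line sketch of wrapping the environment defined above:

```python
tf_environment = tf_py_environment.TFPyEnvironment(SimplePyEnvironment())
```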
Policies
A policy in a bandit problem works the same way as in an RL problem: it provides an action (or a distribution of actions), given an observation as input.
For more details, see the TF-Agents Policy tutorial.
As with environments, there are two ways to construct a policy: one can create a PyPolicy and wrap it with TFPyPolicy, or directly create a TFPolicy. Here we elect to go with the direct method.
Since this example is quite simple, we can define the optimal policy manually. The action only depends on the sign of the observation: 0 when it is negative and 2 when it is positive.
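One way to sketch this as a direct TFPolicy subclass (SignPolicy is an illustrative name; the specs mirror the environment above):

```python
class SignPolicy(tf_policy.TFPolicy):

  def __init__(self):
    observation_spec = tensor_spec.BoundedTensorSpec(
        shape=(1,), dtype=tf.int32, minimum=-2, maximum=2)
    time_step_spec = ts.time_step_spec(observation_spec)
    action_spec = tensor_spec.BoundedTensorSpec(
        shape=(), dtype=tf.int32, minimum=0, maximum=2)
    super(SignPolicy, self).__init__(time_step_spec=time_step_spec,
                                     action_spec=action_spec)

  def _distribution(self, time_step):
    pass

  def _variables(self):
    return ()

  def _action(self, time_step, policy_state, seed):
    # Map a negative observation to action 0, zero to 1, positive to 2.
    observation_sign = tf.cast(tf.sign(time_step.observation[0]), dtype=tf.int32)
    action = observation_sign + 1
    return policy_step.PolicyStep(action, policy_state)
```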
Now we can request an observation from the environment, call the policy to choose an action, then the environment will output the reward:
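For example, with the policy sketched above:

```python
sign_policy = SignPolicy()

current_time_step = tf_environment.reset()
print('Observation:', current_time_step.observation)

action = sign_policy.action(current_time_step).action
print('Action:', action)

reward = tf_environment.step(action).reward
print('Reward:', reward)
```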
The way bandit environments are implemented ensures that every time we take a step, we not only receive the reward for the action we took, but also the next observation.
Agents
Now that we have bandit environments and bandit policies, it is time to also define bandit agents, which take care of changing the policy based on training samples.
The API for bandit agents does not differ from that of RL agents: the agent just needs to implement the _initialize and _train methods, and define a policy and a collect_policy.
A More Complicated Environment
Before we write our bandit agent, we need to have an environment that is a bit harder to figure out. To spice things up just a little bit, the next environment will either always give reward = observation * action or always give reward = -observation * action. This will be decided when the environment is initialized.
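A sketch of such an environment, again building on BanditPyEnvironment (TwoWayPyEnvironment is an illustrative name):

```python
class TwoWayPyEnvironment(BanditPyEnvironment):

  def __init__(self):
    action_spec = array_spec.BoundedArraySpec(
        shape=(), dtype=np.int32, minimum=0, maximum=2, name='action')
    observation_spec = array_spec.BoundedArraySpec(
        shape=(1,), dtype=np.int32, minimum=-2, maximum=2, name='observation')

    # Flip the sign of the reward with probability 1/2, once, at construction.
    self._reward_sign = 2 * np.random.randint(2) - 1
    print("reward sign:", self._reward_sign)

    super(TwoWayPyEnvironment, self).__init__(observation_spec, action_spec)

  def _observe(self):
    self._observation = np.random.randint(-2, 3, (1,), dtype='int32')
    return self._observation

  def _apply_action(self, action):
    return self._reward_sign * action * self._observation[0]


two_way_tf_environment = tf_py_environment.TFPyEnvironment(TwoWayPyEnvironment())
```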
A More Complicated Policy
A more complicated environment calls for a more complicated policy. We need a policy that detects the behavior of the underlying environment. There are three situations that the policy needs to handle:
The agent has not yet detected which version of the environment is running.
The agent detected that the original version of the environment is running.
The agent detected that the flipped version of the environment is running.
We define a tf.Variable named _situation to store this information, encoded as a value in [0, 2], then make the policy behave accordingly.
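One possible sketch of such a policy (TwoWaySignPolicy and the three case functions are illustrative names, not library API):

```python
class TwoWaySignPolicy(tf_policy.TFPolicy):

  def __init__(self, situation):
    observation_spec = tensor_spec.BoundedTensorSpec(
        shape=(1,), dtype=tf.int32, minimum=-2, maximum=2)
    action_spec = tensor_spec.BoundedTensorSpec(
        shape=(), dtype=tf.int32, minimum=0, maximum=2)
    time_step_spec = ts.time_step_spec(observation_spec)
    self._situation = situation
    super(TwoWaySignPolicy, self).__init__(time_step_spec=time_step_spec,
                                           action_spec=action_spec)

  def _distribution(self, time_step):
    pass

  def _variables(self):
    return [self._situation]

  def _action(self, time_step, policy_state, seed):
    sign = tf.cast(tf.sign(time_step.observation[0, 0]), dtype=tf.int32)

    def case_unknown_fn():
      # Choose action 1 so the reward reveals the sign of the environment.
      return tf.constant(1, shape=(1,))

    def case_normal_fn():
      # Original environment: negative observation -> 0, positive -> 2.
      return tf.reshape(sign + 1, shape=(1,))

    def case_flipped_fn():
      # Flipped environment: negative observation -> 2, positive -> 0.
      return tf.reshape(1 - sign, shape=(1,))

    cases = [(tf.equal(self._situation, 0), case_unknown_fn),
             (tf.equal(self._situation, 1), case_normal_fn),
             (tf.equal(self._situation, 2), case_flipped_fn)]
    action = tf.case(cases, exclusive=True)
    return policy_step.PolicyStep(action, policy_state)
```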
The Agent
Now it's time to define the agent that detects the sign of the environment and sets the policy appropriately.
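A sketch of such an agent (SignAgent is an illustrative name; the update rule below is one simple way to infer the sign from a single training sample):

```python
class SignAgent(tf_agent.TFAgent):
  """Keeps a 'situation' variable and updates it from observed rewards."""

  def __init__(self):
    self._situation = tf.Variable(0, dtype=tf.int32)
    policy = TwoWaySignPolicy(self._situation)
    super(SignAgent, self).__init__(
        time_step_spec=policy.time_step_spec,
        action_spec=policy.action_spec,
        policy=policy,
        collect_policy=policy,
        train_sequence_length=None)

  def _initialize(self):
    return tf.compat.v1.variables_initializer(self.variables)

  def _train(self, experience, weights=None):
    observation = tf.cast(experience.observation[0, 0, 0], tf.int32)
    action = tf.cast(experience.action[0, 0], tf.int32)
    reward = tf.cast(experience.reward[0, 0], tf.int32)

    # Only update while the situation is unknown and the sample is informative
    # (a nonzero reward reveals the sign of the environment).
    needs_update = tf.logical_and(tf.equal(self._situation, 0),
                                  tf.not_equal(reward, 0))

    def new_situation_fn():
      # sign(observation * action * reward) is +1 for the original environment
      # and -1 for the flipped one; map that to 1 or 2 respectively.
      return (3 - tf.sign(observation * action * reward)) // 2

    new_situation = tf.cond(needs_update, new_situation_fn,
                            lambda: self._situation)
    self._situation.assign(new_situation)
    return tf_agent.LossInfo(loss=(), extra=())


sign_agent = SignAgent()
```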
In the above code, the agent defines the policy, and the variable situation is shared by the agent and the policy.
Also, the parameter experience of the _train function is a trajectory:
Trajectories
In TF-Agents, trajectories are named tuples that contain samples from previous steps taken. These samples are then used by the agent to train and update the policy. In RL, trajectories must contain information about the current state, the next state, and whether the current episode has ended. Since in the Bandit world we do not need these things, we set up a helper function to create a trajectory:
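A sketch of such a helper, assuming both the batch size and the time dimension are 1 (trajectory_for_bandit is an illustrative name):

```python
def trajectory_for_bandit(initial_step, action_step, final_step):
  # The agent expects trajectories of shape [batch_size, time, ...]; here both
  # batch size and time are 1, so we just add an extra leading dimension.
  return trajectory.Trajectory(
      observation=tf.expand_dims(initial_step.observation, 0),
      action=tf.expand_dims(action_step.action, 0),
      policy_info=action_step.info,
      reward=tf.expand_dims(final_step.reward, 0),
      discount=tf.expand_dims(final_step.discount, 0),
      step_type=tf.expand_dims(initial_step.step_type, 0),
      next_step_type=tf.expand_dims(final_step.step_type, 0))
```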
Training an Agent
Now all the pieces are ready for training our bandit agent.
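Putting the pieces together, a sketch of the training loop:

```python
step = two_way_tf_environment.reset()
for _ in range(10):
  action_step = sign_agent.collect_policy.action(step)
  next_step = two_way_tf_environment.step(action_step.action)
  experience = trajectory_for_bandit(step, action_step, next_step)
  print(experience)
  sign_agent.train(experience)
  step = next_step
```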
From the output one can see that after the second step (unless the observation was 0 in the first step), the policy chooses the action in the right way and thus the reward collected is always non-negative.
A Real Contextual Bandit Example
In the rest of this tutorial, we use the pre-implemented environments and agents of the TF-Agents Bandits library.
Stationary Stochastic Environment with Linear Payoff Functions
The environment used in this example is the StationaryStochasticPyEnvironment. This environment takes as parameter a (usually noisy) function for giving observations (context), and for every arm takes an (also noisy) function that computes the reward based on the given observation. In our example, we sample the context uniformly from a d-dimensional cube, and the reward functions are linear functions of the context, plus some Gaussian noise.
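A sketch of setting up such an environment; the context dimension, arm parameters, and batch size below are arbitrary illustrative choices:

```python
batch_size = 2
arm0_param = [-3, 0, 1, -2]
arm1_param = [1, -2, 3, 0]
arm2_param = [0, 0, 1, 1]


def context_sampling_fn(batch_size):
  """Contexts sampled uniformly from a 4-dimensional cube."""
  def _context_sampling_fn():
    return np.random.randint(-10, 10, [batch_size, 4]).astype(np.float32)
  return _context_sampling_fn


class LinearNormalReward(object):
  """Linear reward function with Gaussian noise."""

  def __init__(self, theta, sigma):
    self.theta = theta
    self.sigma = sigma

  def __call__(self, x):
    mu = np.dot(x, self.theta)
    return np.random.normal(mu, self.sigma)


arm0_reward_fn = LinearNormalReward(arm0_param, 1)
arm1_reward_fn = LinearNormalReward(arm1_param, 1)
arm2_reward_fn = LinearNormalReward(arm2_param, 1)

environment = tf_py_environment.TFPyEnvironment(
    sspe.StationaryStochasticPyEnvironment(
        context_sampling_fn(batch_size),
        [arm0_reward_fn, arm1_reward_fn, arm2_reward_fn],
        batch_size=batch_size))
```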
The LinUCB Agent
The agent below implements the LinUCB algorithm.
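A sketch of constructing the agent, with specs matching the environment above:

```python
observation_spec = tensor_spec.TensorSpec([4], tf.float32)
time_step_spec = ts.time_step_spec(observation_spec)
action_spec = tensor_spec.BoundedTensorSpec(
    dtype=tf.int32, shape=(), minimum=0, maximum=2)

agent = lin_ucb_agent.LinearUCBAgent(time_step_spec=time_step_spec,
                                     action_spec=action_spec)
```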
Regret Metric
The most important metric in bandits is regret, calculated as the difference between the reward collected by the agent and the expected reward of an oracle policy that has access to the reward functions of the environment. The RegretMetric thus needs a baseline_reward_fn function that calculates the best achievable expected reward given an observation. For our example, we need to take the maximum of the no-noise equivalents of the reward functions that we already defined for the environment.
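For the linear reward functions above, the baseline takes the maximum over the arms of the noiseless expected rewards; a sketch (compute_optimal_reward is an illustrative name):

```python
def compute_optimal_reward(observation):
  # Noiseless expected reward of each arm for the given (batched) observation.
  expected_reward_for_arms = [
      tf.linalg.matvec(observation, tf.cast(arm0_param, dtype=tf.float32)),
      tf.linalg.matvec(observation, tf.cast(arm1_param, dtype=tf.float32)),
      tf.linalg.matvec(observation, tf.cast(arm2_param, dtype=tf.float32))]
  # The best achievable expected reward is the maximum over the arms.
  return tf.reduce_max(expected_reward_for_arms, axis=0)


regret_metric = tf_metrics.RegretMetric(compute_optimal_reward)
```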
Training
Now we put together all the components that we introduced above: the environment, the policy, and the agent. We run the policy on the environment and output training data with the help of a driver, and train the agent on the data.
Note that there are two parameters that together specify the number of steps taken. num_iterations specifies how many times we run the trainer loop, while the driver will take steps_per_loop steps per iteration. The main reason behind keeping both of these parameters is that some operations are done per iteration, while some are done by the driver in every step. For example, the agent's train function is only called once per iteration. The trade-off here is that if we train more often, our policy is "fresher"; on the other hand, training in bigger batches might be more time efficient.
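A sketch of this training loop, using a TFUniformReplayBuffer and a DynamicStepDriver and attaching the regret metric as an observer:

```python
num_iterations = 90
steps_per_loop = 1

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.policy.trajectory_spec,
    batch_size=batch_size,
    max_length=steps_per_loop)

# The replay buffer stores the collected trajectories; the regret metric
# observes every step taken by the driver.
observers = [replay_buffer.add_batch, regret_metric]

driver = dynamic_step_driver.DynamicStepDriver(
    env=environment,
    policy=agent.collect_policy,
    num_steps=steps_per_loop * batch_size,
    observers=observers)

regret_values = []

for _ in range(num_iterations):
  driver.run()
  loss_info = agent.train(replay_buffer.gather_all())
  replay_buffer.clear()
  regret_values.append(regret_metric.result())

plt.plot(regret_values)
plt.ylabel('Average Regret')
plt.xlabel('Number of Iterations')
```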
After running the last code snippet, the resulting plot (hopefully) shows that the average regret is going down as the agent is trained and the policy gets better in figuring out what the right action is, given the observation.
What's Next?
To see more working examples, please see the bandits/agents/examples directory, which has ready-to-run examples for different agents and environments.
The TF-Agents library is also capable of handling Multi-Armed Bandits with per-arm features. To that end, we refer the reader to the per-arm bandit tutorial.