APPLICATIONS OF MARKOV DECISION PROCESSES
In this notebook we will take a look at some indicative applications of Markov decision processes. We will cover content from mdp.py, for Chapter 17 (Making Complex Decisions) of Stuart Russell's and Peter Norvig's book Artificial Intelligence: A Modern Approach.
CONTENTS
Simple MDP
State dependent reward function
State and action dependent reward function
State, action and next state dependent reward function
Grid MDP
Pathfinding problem
POMDP
Two state POMDP
SIMPLE MDP
State dependent reward function
Markov Decision Processes are formally described as processes that follow the Markov property, which states that "the future is independent of the past given the present". MDPs formally describe environments for reinforcement learning, and we assume that the environment is fully observable. Let us take a toy example MDP and solve it using the functions in mdp.py. This is a simple example adapted from a similar problem by Dr. David Silver, tweaked to fit the limitations of the current functions.
Let's say you're a student attending lectures in a university. There are three lectures you need to attend on a given day.
Attending the first lecture gives you 4 points of reward. After the first lecture, you have a 0.6 probability to continue into the second one, yielding 6 more points of reward.
But, with a probability of 0.4, you get distracted and start using Facebook instead and get a reward of -1. From then onwards, you really can't let go of Facebook and there's just a 0.1 probability that you will concentrate back on the lecture.
After the second lecture, you have an equal chance of attending the next lecture or just falling asleep. Falling asleep is the terminal state and yields you no reward, but continuing on to the final lecture gives you a big reward of 10 points.
From there on, you have a 40% chance of going to study and reach the terminal state, but a 60% chance of going to the pub with your friends instead. You end up drunk and don't know which lecture to attend, so you go to one of the lectures according to the probabilities given above.
We now have an outline of our stochastic environment and we need to maximize our reward by solving this MDP.
We first have to define our Transition Matrix as a nested dictionary to fit the requirements of the MDP class.
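A sketch of what this dictionary might look like, of the form t[state][action][next_state] = probability. Since the description above does not name explicit actions, each state gets a single illustrative 'go' action; the state names, the lecture Facebook sends you back to and the pub-to-lecture probabilities are likewise assumptions made for illustration.

```python
# t[state][action][next_state] = probability
t = {
    'class1':   {'go': {'class2': 0.6, 'facebook': 0.4}},
    'facebook': {'go': {'facebook': 0.9, 'class1': 0.1}},   # returning to class1 is assumed
    'class2':   {'go': {'class3': 0.5, 'end': 0.5}},
    'class3':   {'go': {'end': 0.4, 'pub': 0.6}},
    'pub':      {'go': {'class1': 0.2, 'class2': 0.4, 'class3': 0.4}},  # assumed split
    'end':      {'go': {'end': 1.0}}    # terminal state, listed so it is registered
}
```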
We now need to define the reward for each state.
This MDP has only one terminal state.
Let's now set the initial state to Class 1.
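Putting the rewards, the terminal state and the initial state together; the pub's reward is not specified above, so 0 is assumed.

```python
rewards = {'class1': 4, 'class2': 6, 'class3': 10,
           'facebook': -1, 'pub': 0, 'end': 0}   # pub reward assumed

terminals = ['end']
init = 'class1'
```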
We will write a CustomMDP class to extend the MDP class for the problem at hand. This class will implement the T method for the transition model. This is the same class as given in mdp.ipynb.
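A sketch of such a class. To stay independent of the exact MDP constructor signature (which has changed across versions of mdp.py), this version sets the attributes that the mdp.py algorithms rely on directly.

```python
from mdp import MDP

class CustomMDP(MDP):
    """An MDP whose transition model is the nested dictionary
    t[state][action][next_state] = probability."""

    def __init__(self, transition_matrix, rewards, terminals, init, gamma=0.9):
        # Set the attributes that value_iteration, best_policy and
        # policy_iteration use, without relying on MDP.__init__.
        self.init = init
        self.actlist = list({a for acts in transition_matrix.values() for a in acts})
        self.terminals = terminals
        self.gamma = gamma
        self.states = set(transition_matrix.keys())
        self.t = transition_matrix
        self.reward = rewards          # used by the inherited R(state)

    def T(self, state, action):
        # Return (probability, next_state) pairs, as the mdp.py algorithms expect.
        if action is None:
            return [(0.0, state)]
        return [(prob, s1) for s1, prob in self.t[state][action].items()]
```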
We now need an instance of this class.
The utility of each state can be found by value_iteration.
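Using the names from the sketches above:

```python
from mdp import value_iteration

school_mdp = CustomMDP(t, rewards, terminals, init, gamma=0.9)
value_iteration(school_mdp)
```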
Now that we can compute the utility values, we can find the best policy. pi stores the best action for each state.
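Again using the sketches above:

```python
from mdp import best_policy

pi = best_policy(school_mdp, value_iteration(school_mdp, epsilon=0.001))
pi
```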
We can confirm that this is the best policy by verifying this result against policy_iteration.
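policy_iteration works on the same object and should agree with pi (up to ties):

```python
from mdp import policy_iteration

policy_iteration(school_mdp)
```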
Everything looks perfect, but let us look at another possibility for an MDP.
Until now we have only dealt with rewards that the agent gets while it is in a particular state. What if we want different rewards for a state depending on the action that the agent takes next? The agent then gets the reward during its transition to the next state.
For the sake of clarity, we will call this the transition reward, and we will call this kind of MDP a dynamic MDP. This is not a conventional term; we just use it to minimize confusion between the two.
This next section deals with how to create and solve a dynamic MDP.
State and action dependent reward function
Let us consider a very similar problem, but this time, we do not have rewards on states, instead, we have rewards on the transitions between states. This state diagram will make it clearer.
This is a very similar scenario to the previous problem, but now we have different rewards for the same state depending on the action taken.
To deal with this, we just need to change the R method of the MDP class, but to prevent confusion, we will write a new, similar class called DMDP.
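A minimal sketch of what such a class might look like, mirroring the structure of MDP but with R taking a state and an action:

```python
class DMDP:
    """Like MDP, but the reward depends on the action taken: R(s, a)."""

    def __init__(self, init, actlist, terminals, gamma=0.9):
        self.init = init
        self.actlist = actlist
        self.terminals = terminals
        self.gamma = gamma
        self.states = set()

    def R(self, state, action):
        """Reward for taking `action` in `state`."""
        raise NotImplementedError

    def T(self, state, action):
        """A list of (probability, next_state) pairs."""
        raise NotImplementedError

    def actions(self, state):
        return [None] if state in self.terminals else self.actlist
```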
The transition model will be the same.
The reward model will be a dictionary very similar to the transition dictionary with a reward for every action for every state.
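In the original state diagram each state offers more than one action, each with its own reward; since the diagram is not reproduced here, the sketch below only shows the required shape, r[state][action] = reward, reusing the single illustrative 'go' action and the reward numbers from before.

```python
# r[state][action] = transition reward (illustrative values)
r = {
    'class1':   {'go': 4},
    'class2':   {'go': 6},
    'class3':   {'go': 10},
    'facebook': {'go': -1},
    'pub':      {'go': 0},   # assumed
    'end':      {'go': 0}
}
```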
The MDP has only one terminal state.
Let's now set the initial state to Class 1.
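As before:

```python
terminals = ['end']
init = 'class1'
```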
We will write a CustomDMDP class to extend the DMDP class for the problem at hand. This class will implement everything that the previous CustomMDP class implements along with a new reward model.
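A sketch of such a class, built on the DMDP sketch above:

```python
class CustomDMDP(DMDP):
    """Transition model t[s][a][s'] = probability,
    reward model r[s][a] = reward."""

    def __init__(self, transition_matrix, reward_matrix, terminals, init, gamma=0.9):
        actlist = list({a for acts in transition_matrix.values() for a in acts})
        DMDP.__init__(self, init, actlist, terminals, gamma=gamma)
        self.t = transition_matrix
        self.reward = reward_matrix
        self.states.update(transition_matrix.keys())

    def T(self, state, action):
        if action is None:
            return [(0.0, state)]
        return [(prob, s1) for s1, prob in self.t[state][action].items()]

    def R(self, state, action):
        # Terminal states (action None) yield no reward.
        if action is None:
            return 0
        return self.reward[state][action]
```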
One thing we haven't considered yet is that the value_iteration algorithm won't work now that the reward model has changed. The new version will nonetheless be quite similar to the one we currently have.
The Bellman update equation is now defined as follows, writing P(s' | s, a) for the transition probability:
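$$U(s) = \max_{a \in A(s)} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, U(s') \Big]$$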
It is not difficult to see that the update equation we have been using until now is just a special case of this more general equation. We also need to maximize over the reward function now, as the reward function is action-dependent as well.
We will use this to write a function to carry out value iteration, very similar to the one we are familiar with.
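A sketch of such a function, patterned on value_iteration from mdp.py:

```python
def value_iteration_dmdp(dmdp, epsilon=0.001):
    """Value iteration for an MDP whose rewards have the form R(s, a)."""
    U1 = {s: 0 for s in dmdp.states}
    R, T, gamma = dmdp.R, dmdp.T, dmdp.gamma
    while True:
        U = U1.copy()
        delta = 0
        for s in dmdp.states:
            # Maximize R(s, a) plus the discounted expected utility of the successors.
            U1[s] = max(R(s, a) + gamma * sum(p * U[s1] for (p, s1) in T(s, a))
                        for a in dmdp.actions(s))
            delta = max(delta, abs(U1[s] - U[s]))
        if delta <= epsilon * (1 - gamma) / gamma:
            return U
```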
We're all set. Let's instantiate our class.
Calculate the utility values by calling value_iteration_dmdp.
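Using the sketches above:

```python
school_dmdp = CustomDMDP(t, r, terminals, init, gamma=0.9)
U_dmdp = value_iteration_dmdp(school_dmdp)
U_dmdp
```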
These are the expected utility values for our new MDP.
As you might have guessed, we cannot use the old best_policy function to find the best policy, so we will write our own. But before that, we need a helper function to calculate the expected utility value given a state and an action.
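A sketch of the helper, analogous to expected_utility in mdp.py but with the transition reward added:

```python
def expected_utility_dmdp(a, s, U, dmdp):
    """Expected utility of taking action a in state s, given utilities U."""
    return dmdp.R(s, a) + dmdp.gamma * sum(p * U[s1] for (p, s1) in dmdp.T(s, a))
```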
Now we write our modified best_policy function.
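A sketch of the modified function:

```python
def best_policy_dmdp(dmdp, U):
    """Map each state to the action with the highest expected utility."""
    return {s: max(dmdp.actions(s), key=lambda a: expected_utility_dmdp(a, s, U, dmdp))
            for s in dmdp.states}
```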
Find the best policy.
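Using the objects defined above:

```python
pi_dmdp = best_policy_dmdp(school_dmdp, U_dmdp)
pi_dmdp
```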
From this we can infer that value_iteration_dmdp tries to minimize the negative reward. Since we don't have state rewards now, the algorithm takes the action that avoids negative rewards, choosing the lesser of two evils when all rewards are negative. You might also want to have state rewards alongside transition rewards. Perhaps you can add that yourself now that the difficult part has been done.
State, action and next-state dependent reward function
For truly stochastic environments, we have noticed that taking an action from a particular state doesn't always do what we want it to. Instead, for every action taken from a particular state, it might be possible to reach a different state each time depending on the transition probabilities. What if we want different rewards for each state, action and next-state triplet? Mathematically, we now want a reward function of the form R(s, a, s') for our MDP. This section shows how we can tweak the MDP class to achieve this.
Let's now take a different problem statement. The one we are working with is a bit too simple. Consider a taxi that serves three adjacent towns A, B, and C. Each time the taxi discharges a passenger, the driver must choose from three possible actions:
Cruise the streets looking for a passenger.
Go to the nearest taxi stand.
Wait for a radio call from the dispatcher with instructions.
These actions are subject to the constraint that the taxi driver cannot take the third action in town B because of distance and poor reception.
Let's model our MDP.
The MDP has three states, namely A, B and C.
It has three actions, namely 1, 2 and 3.
Action sets:
A: {1, 2, 3}
B: {1, 2}
C: {1, 2, 3}
We have the following transition probability matrices:
Action 1: Cruising streets
Action 2: Waiting at the taxi stand
Action 3: Waiting for dispatch
For the sake of readability, we will call the states A, B and C and the actions 'cruise', 'stand' and 'dispatch'. We will now build the transition model as a dictionary using these matrices.
The reward matrices for the problem are as follows:
Action 1: Cruising streets
Action 2: Waiting at the taxi stand
Action 3: Waiting for dispatch
We now build the reward model as a dictionary using these matrices.
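Since the matrices themselves are not reproduced above, the sketch below only shows the shape the two dictionaries take: three levels of nesting, state → action → next state. The numeric values are placeholders to be replaced with the entries of the actual matrices; note that town B has no 'dispatch' action.

```python
# Placeholder values only -- substitute the entries from the matrices above.
# Transition model: t2[state][action][next_state] = probability.
t2 = {
    'A': {'cruise':   {'A': 1/3, 'B': 1/3, 'C': 1/3},
          'stand':    {'A': 1/3, 'B': 1/3, 'C': 1/3},
          'dispatch': {'A': 1/3, 'B': 1/3, 'C': 1/3}},
    'B': {'cruise':   {'A': 1/3, 'B': 1/3, 'C': 1/3},
          'stand':    {'A': 1/3, 'B': 1/3, 'C': 1/3}},    # no 'dispatch' in town B
    'C': {'cruise':   {'A': 1/3, 'B': 1/3, 'C': 1/3},
          'stand':    {'A': 1/3, 'B': 1/3, 'C': 1/3},
          'dispatch': {'A': 1/3, 'B': 1/3, 'C': 1/3}}
}

# Reward model: r2[state][action][next_state] = reward (placeholders).
r2 = {s: {a: {s_: 0 for s_ in t2[s][a]} for a in t2[s]} for s in t2}
```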
The Bellman update equation is now defined as follows:
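$$U(s) = \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a) \Big[ R(s, a, s') + \gamma\, U(s') \Big]$$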
It is not difficult to see that all the update equations we have used till now are just special cases of this more general equation. If we did not have next-state-dependent rewards, the first term inside the summation would sum to exactly R(s, a), the reward for taking a particular action in a state, and we would get the update equation used in the previous problem. If we did not have action-dependent rewards either, the first term inside the summation would sum to R(s), the state reward, and we would get the first update equation used in mdp.ipynb.
For example, suppose a particular state yields the same reward r regardless of the action taken or the state we land in, and suppose an action from that state can lead to four possible next states with transition probabilities 0.1, 0.2, 0.3 and 0.4. The first term inside the summation then evaluates to (0.1 + 0.2 + 0.3 + 0.4)r = r, which is exactly R(s) in the first update equation.
There are many ways to write value iteration for this situation, but we will go with the most intuitive method: one that can be implemented with minor alterations to the existing value_iteration algorithm.
Our DMDP class will be slightly different. More specifically, the R method will take one more index now that we have three levels of nesting in the reward model. We will call the new class DMDP2, as I have run out of creative names.
Only the R method is different from the previous DMDP class.
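A sketch, reusing the DMDP sketch from earlier and changing only the signature of R:

```python
class DMDP2(DMDP):
    """Like DMDP, but the reward also depends on the next state: R(s, a, s')."""

    def R(self, state, action, next_state):
        raise NotImplementedError
```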
As before, our custom class must implement the transition model and the reward model. We call this class CustomDMDP2.
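A sketch of the class. It also overrides actions so that each town only offers the actions present in its row of the transition model (town B has no 'dispatch').

```python
class CustomDMDP2(DMDP2):
    """Transition model t[s][a][s'] = probability,
    reward model r[s][a][s'] = reward."""

    def __init__(self, transition_matrix, reward_matrix, terminals, init, gamma=0.9):
        actlist = list({a for acts in transition_matrix.values() for a in acts})
        DMDP2.__init__(self, init, actlist, terminals, gamma=gamma)
        self.t = transition_matrix
        self.reward = reward_matrix
        self.states.update(transition_matrix.keys())

    def actions(self, state):
        # Only the actions actually available in this state.
        if state in self.terminals:
            return [None]
        return list(self.t[state].keys())

    def T(self, state, action):
        if action is None:
            return [(0.0, state)]
        return [(prob, s1) for s1, prob in self.t[state][action].items()]

    def R(self, state, action, next_state):
        if action is None:
            return 0
        return self.reward[state][action][next_state]
```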
We can finally write value iteration for this problem. The latest update equation will be used.
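A sketch that implements the update equation above:

```python
def value_iteration_dmdp2(dmdp, epsilon=0.001):
    """Value iteration for rewards of the form R(s, a, s')."""
    U1 = {s: 0 for s in dmdp.states}
    R, T, gamma = dmdp.R, dmdp.T, dmdp.gamma
    while True:
        U = U1.copy()
        delta = 0
        for s in dmdp.states:
            # For each action, sum the reward and discounted utility over next states.
            U1[s] = max(sum(p * (R(s, a, s1) + gamma * U[s1]) for (p, s1) in T(s, a))
                        for a in dmdp.actions(s))
            delta = max(delta, abs(U1[s] - U[s]))
        if delta <= epsilon * (1 - gamma) / gamma:
            return U
```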
These algorithms can be made more pythonic by using cleverer list comprehensions. We can also write the variants of value iteration in such a way that all problems are solved using the same base class, regardless of the reward function and the number of arguments it takes. Quite a few things can be done to refactor the code and reduce repetition, but we have done it this way for the sake of clarity. Perhaps you can try this as an exercise.
We now need to define the terminal states and the initial state.
Let's instantiate our class.
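A sketch using the placeholder dictionaries from above; the empty terminal list and the starting town are assumptions, since the taxi keeps operating and no terminal state is described.

```python
terminals2 = []    # assumed: no terminal state
init2 = 'A'        # assumed starting town

taxi_dmdp = CustomDMDP2(t2, r2, terminals2, init2, gamma=0.9)
U_taxi = value_iteration_dmdp2(taxi_dmdp)
U_taxi
```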
These are the expected utility values for the states of our MDP. Let's proceed to write a helper function to find the expected utility and another to find the best policy.
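Sketches of the two functions, analogous to the DMDP versions but passing the next state to R:

```python
def expected_utility_dmdp2(a, s, U, dmdp):
    """Expected utility of action a in state s when rewards are R(s, a, s')."""
    return sum(p * (dmdp.R(s, a, s1) + dmdp.gamma * U[s1]) for (p, s1) in dmdp.T(s, a))

def best_policy_dmdp2(dmdp, U):
    """Map each state to the action with the highest expected utility."""
    return {s: max(dmdp.actions(s), key=lambda a: expected_utility_dmdp2(a, s, U, dmdp))
            for s in dmdp.states}
```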
Find the best policy.
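Using the objects defined above:

```python
best_policy_dmdp2(taxi_dmdp, U_taxi)
```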
We have successfully adapted the existing code to a different scenario yet again. The takeaway from this section is that many reinforcement learning problems can be formulated as MDPs and solved for the best policy using simple yet efficient tools.
GRID MDP
Pathfinding Problem
Markov Decision Processes can be used to find the best path through a maze. Let us consider this simple maze.
This environment can be formulated as a GridMDP.
To make the grid matrix, we will consider the state-reward to be -0.1 for every state.
State (1, 1) will have a reward of -5 to signify that this state should be avoided.
State (9, 9) will have a reward of +5. This will be the terminal state.
The matrix can be generated using the GridMDP editor or we can write it ourselves.
We have only one terminal state, (9, 9). We define our maze environment below.
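A sketch of how the grid can be written by hand. The walls of the maze (None entries in a GridMDP) are omitted here because the maze figure is not reproduced; add them to match your maze.

```python
from mdp import GridMDP

# Every state costs -0.1, state (1, 1) is penalized with -5 and
# state (9, 9) carries the +5 terminal reward.
grid = [[-0.1] * 10 for _ in range(10)]   # grid[y][x], with y counted from the bottom
grid[1][1] = -5
grid[9][9] = +5

# GridMDP expects the top row first, so reverse the row order when passing it in.
maze = GridMDP(list(reversed(grid)), terminals=[(9, 9)])
```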
To solve the maze, we can use the best_policy function along with value_iteration.
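Using the maze object defined above:

```python
from mdp import best_policy, value_iteration

pi = best_policy(maze, value_iteration(maze, epsilon=0.001))
```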
This is the heatmap generated by the GridMDP editor using value_iteration on this environment.
Let's print out the best policy.
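One convenient way to display it is GridMDP's to_arrows method together with the print_table helper from utils.py:

```python
from utils import print_table

print_table(maze.to_arrows(pi))
```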
As you can infer, we can find the path to the terminal state starting from any given state using this policy. All maze problems can be solved by formulating them as MDPs.
POMDP
Two state POMDP
Let's consider a problem where we have two doors, one to our left and one to our right. One of these doors opens to a room with a tiger in it, and the other one opens to an empty hall.
We will call our two states 0 and 1, for left and right respectively.
The possible actions we can take are as follows:
Open-left: Open the left door. Represented by 0.
Open-right: Open the right door. Represented by 1.
Listen: Listen carefully to one side and possibly hear the tiger breathing. Represented by 2.
The possible observations we can get are as follows:
TL: Tiger seems to be at the left door.
TR: Tiger seems to be at the right door.
The reward function is as follows:
We get +10 reward for opening the door to the empty hall and we get -100 reward for opening the other door and setting the tiger free.
Listening costs us -1 reward.
We want to minimize our chances of setting the tiger free.
Our transition probabilities can be defined as:
Action 0 (Open left door)
Action 1 (Open right door)
Action 2 (Listen)
Our observation probabilities can be defined as:
The rewards of this POMDP are defined as:
Based on these matrices, we will initialize our variables.
Let's first define our transition model, followed by the observation model and the reward model. We will also define our states, observations and actions, and we will use gamma = 0.95 for this example.
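The matrices above are not reproduced here, so the values below follow the usual tiger-problem conventions, which are assumptions on our part: opening either door resets the tiger behind a random door, listening leaves the state unchanged, and listening identifies the correct side with an assumed accuracy of 0.85. The rewards follow the description above.

```python
# Transition probabilities P(s' | s, a), one matrix per action (0, 1, 2).
t_prob = [[[0.5, 0.5], [0.5, 0.5]],      # open-left: the problem resets
          [[0.5, 0.5], [0.5, 0.5]],      # open-right: the problem resets
          [[1.0, 0.0], [0.0, 1.0]]]      # listen: the tiger stays where it is

# Observation probabilities P(o | s', a), one matrix per action.
e_prob = [[[0.5, 0.5], [0.5, 0.5]],      # opening a door tells us nothing
          [[0.5, 0.5], [0.5, 0.5]],
          [[0.85, 0.15], [0.15, 0.85]]]  # assumed listening accuracy

# Rewards, one row per action: open-left, open-right, listen.
rewards = [[-100, 10], [10, -100], [-1, -1]]

states = ['0', '1']
observations = ['TL', 'TR']
actions = ['0', '1', '2']
gamma = 0.95
```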
We have all the required variables to instantiate an object of the POMDP class.
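A sketch of the instantiation, assuming the POMDP constructor takes the actions, the transition, evidence (observation) and reward matrices, the states and gamma in that order; check the class definition in mdp.py for the exact signature it expects.

```python
from mdp import POMDP

pomdp = POMDP(actions, t_prob, e_prob, rewards, states, gamma)
```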
We can now find the utility function by running pomdp_value_iteration on our pomdp object.
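The epsilon below is only a suggestion; smaller values give a finer approximation at the cost of more iterations.

```python
from mdp import pomdp_value_iteration

utility = pomdp_value_iteration(pomdp, epsilon=3)
utility
```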
Hence, we get a piecewise-linear and convex utility function consistent with the given POMDP.