Path: blob/master/notebooks/book2/35/supplementary/Tabular_SARSA.ipynb
Tabular Sarsa Algorithm and the Taxi Environment
Authors: Fred Amouzgar [email protected] and Kevin Murphy [email protected]
Tabular methods are suitable for environments with small, discrete state and action spaces, since the state-action value function (Q) can then be represented by a table of values. For environments with large state spaces, we prefer approximation methods such as neural networks. However, the simplicity of implementing tabular methods makes them useful for demonstrating how RL methods work. In this notebook, we train a SARSA agent on OpenAI Gym's Taxi environment (an example originally proposed by Tom Dietterich).
1- Installations
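The original install cell is not shown here; a minimal version might look like the following (the package names are standard, but the exact pinned versions used in the notebook are an assumption):

# Install OpenAI Gym (its toy-text suite includes Taxi-v3) and NumPy.
# Versions are illustrative; the original notebook may pin different ones.
!pip install -q gym numpy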
2- Setting up the Environment
Environment Name: Taxi-v3
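A sketch of the setup cell, assuming the classic Gym API (in newer gym/gymnasium releases, reset() returns an (obs, info) tuple and rendering requires render_mode="ansi"):

import gym

# Create the Taxi-v3 environment and print its initial state;
# this produces the ASCII grid shown below.
env = gym.make("Taxi-v3")
env.reset()
env.render()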
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
Here is a description of the taxi environment from the docstring.
Description:
There are four designated locations in the grid world indicated by R(ed), G(reen), Y(ellow), and B(lue). When the episode starts, the taxi starts off at a random square and the passenger is at a random location. The taxi drives to the passenger's location, picks up the passenger, drives to the passenger's destination (another one of the four specified locations), and then drops off the passenger. Once the passenger is dropped off, the episode ends.
States/Observations:
State space is (taxi_row, taxi_col, passenger_location, destination).
There are 500 discrete states since there are 25 taxi positions, 5 possible locations of the passenger (including the case when the passenger is in the taxi), and 4 destination locations.
Passenger locations:
0: R(ed)
1: G(reen)
2: Y(ellow)
3: B(lue)
4: in taxi
Destinations:
0: R(ed)
1: G(reen)
2: Y(ellow)
3: B(lue)
Actions:
There are 6 discrete deterministic actions:
0: move south
1: move north
2: move east
3: move west
4: pickup passenger
5: drop off passenger
Notice that in this environment the taxi cannot move through walls, so certain actions have no effect in certain states. In the environment's code, every wall hit simply incurs the -1 per-step penalty and the taxi does not move. Racking up these penalties encourages the taxi to learn to drive around walls.
Rewards:
There is a default per-step reward of -1, except for delivering the passenger, which is +20, or executing "pickup" and "drop-off" actions illegally, which is -10.
Rendering:
blue: passenger
magenta: destination
yellow: empty taxi
green: full taxi
other letters (R, G, Y and B): locations for passengers and destinations
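As an illustration of the state encoding above, the underlying TaxiEnv exposes a decode method that maps a flat state index back to its components (this cell is a sketch, not from the original notebook; the unwrapping step depends on the Gym version):

# Decode a flat state index into (taxi_row, taxi_col, passenger_location, destination).
# decode() lives on the underlying TaxiEnv, so unwrap the TimeLimit wrapper first.
taxi_row, taxi_col, pass_loc, dest_idx = env.unwrapped.decode(479)
print(taxi_row, taxi_col, pass_loc, dest_idx)  # -> 4 3 4 3: taxi at (4, 3), passenger in taxi, destination B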
3- Developing the SARSA agent
Here's the full update formula implemented in lines 7 and 8:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$$
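The agent code itself is not reproduced here; below is a minimal sketch of a tabular SARSA agent consistent with this update (the class and method names are illustrative, not necessarily those used in the original notebook):

import numpy as np

class SarsaAgent:
    """Tabular SARSA with an epsilon-greedy behavior policy (illustrative sketch)."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.Q = np.zeros((n_states, n_actions))  # the Q-table
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.n_actions = n_actions

    def act(self, state):
        # Epsilon-greedy action selection.
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.n_actions)
        return int(np.argmax(self.Q[state]))

    def update(self, s, a, r, s_next, a_next):
        # SARSA update: Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]
        td_target = r + self.gamma * self.Q[s_next, a_next]
        self.Q[s, a] += self.alpha * (td_target - self.Q[s, a])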
4- Defining the Training Loop
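A sketch of such a loop, using the agent above and the classic Gym step API that returns (obs, reward, done, info) (newer versions split done into terminated and truncated):

def train(agent, env, n_episodes=1500):
    """Run SARSA training and return the total reward of each episode."""
    returns = []
    for _ in range(n_episodes):
        state = env.reset()
        action = agent.act(state)
        done, total = False, 0
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = agent.act(next_state)
            # On-policy update: uses the action actually taken in the next state.
            agent.update(state, action, reward, next_state, next_action)
            state, action = next_state, next_action
            total += reward
        returns.append(total)
    return returns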
5- Let's train our agent for 1500 episodes (takes ~5 minutes)
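With the hypothetical names from the sketches above, the training call would look something like:

env = gym.make("Taxi-v3")
agent = SarsaAgent(env.observation_space.n, env.action_space.n)
returns = train(agent, env, n_episodes=1500)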
6- Methods for Playing and Rendering the Taxi Environment in the Notebook
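A sketch of a playback helper (the function name and signature are assumptions; clear_output plus sleep gives the frame-by-frame animation, and wait_btw_frames controls the playback speed):

import time
from IPython.display import clear_output

def play(agent, env, n_passengers=3, wait_btw_frames=0.5):
    """Run a greedy rollout for each passenger and animate it in the notebook."""
    for passenger in range(1, n_passengers + 1):
        state, done, t = env.reset(), False, 0
        while not done:
            action = int(np.argmax(agent.Q[state]))  # act greedily at test time
            state, reward, done, _ = env.step(action)
            t += 1
            clear_output(wait=True)
            env.render()
            print(f"Passenger #: {passenger}")
            print(f"Timestep: {t}")
            print(f"State: {state}")
            print(f"Action: {action}")
            print(f"Reward: {reward}")
            time.sleep(wait_btw_frames)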
7- Watch a Trained SARSA Cab Driver
Note: you can change the number of passengers if you want the taxi to deliver more than 3. Change wait_btw_frames if you want to see the game running faster or slower.
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(Dropoff)
Passenger #: 3
-----------
Timestep: 10
State: 479
Action: 5
Reward: 20