GitHub Repository: pytorch/tutorials
Path: blob/main/intermediate_source/reinforcement_q_learning.py
# -*- coding: utf-8 -*-
"""
Reinforcement Learning (DQN) Tutorial
=====================================
**Author**: `Adam Paszke <https://github.com/apaszke>`_
            `Mark Towers <https://github.com/pseudo-rnd-thoughts>`_


This tutorial shows how to use PyTorch to train a Deep Q Learning (DQN) agent
on the CartPole-v1 task from `Gymnasium <https://gymnasium.farama.org>`__.

You might find it helpful to read the original `Deep Q Learning (DQN) <https://arxiv.org/abs/1312.5602>`__ paper.

**Task**

The agent has to decide between two actions - moving the cart left or
right - so that the pole attached to it stays upright. You can find more
information about the environment and other more challenging environments at
`Gymnasium's website <https://gymnasium.farama.org/environments/classic_control/cart_pole/>`__.

.. figure:: /_static/img/cartpole.gif
   :alt: CartPole

   CartPole

As the agent observes the current state of the environment and chooses
an action, the environment *transitions* to a new state, and also
returns a reward that indicates the consequences of the action. In this
task, the reward is +1 for every timestep that the pole stays upright, and
the environment terminates if the pole falls over too far or the cart moves
more than 2.4 units away from center. This means that better-performing
episodes run for a longer duration, accumulating a larger return.

The CartPole task is designed so that the inputs to the agent are 4 real
values representing the environment state (position, velocity, etc.).
We take these 4 inputs without any scaling and pass them through a
small fully-connected network with 2 outputs, one for each action.
The network is trained to predict the expected value for each action,
given the input state. The action with the highest expected value is
then chosen.
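
To get a feel for these inputs and outputs, you can inspect the environment's
observation and action spaces yourself. The following is a minimal sketch
(assuming ``gymnasium`` is installed as described in the next section):

.. code-block:: python

   import gymnasium as gym

   env = gym.make("CartPole-v1")
   print(env.observation_space.shape)  # (4,): cart position, cart velocity, pole angle, pole angular velocity
   print(env.action_space.n)           # 2: push the cart left or right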


**Packages**


First, let's import the needed packages. We need
`gymnasium <https://gymnasium.farama.org/>`__ for the environment,
installed by using `pip`. This is a fork of the original OpenAI
Gym project and has been maintained by the same team since Gym v0.19.
If you are running this in Google Colab, run:

.. code-block:: bash

   %%bash
   pip3 install gymnasium[classic_control]

We'll also use the following from PyTorch:

- neural networks (``torch.nn``)
- optimization (``torch.optim``)
- automatic differentiation (``torch.autograd``)

"""

import gymnasium as gym
import math
import random
import matplotlib
import matplotlib.pyplot as plt
from collections import namedtuple, deque
from itertools import count

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

env = gym.make("CartPole-v1")

# set up matplotlib
is_ipython = 'inline' in matplotlib.get_backend()
if is_ipython:
    from IPython import display

plt.ion()

# if GPU is to be used
device = torch.device(
    "cuda" if torch.cuda.is_available() else
    "mps" if torch.backends.mps.is_available() else
    "cpu"
)


# To ensure reproducibility during training, you can fix the random seeds
# by uncommenting the lines below. This makes the results consistent across
# runs, which is helpful for debugging or comparing different approaches.
#
# That said, allowing randomness can be beneficial in practice, as it lets
# the model explore different training trajectories.


# seed = 42
# random.seed(seed)
# torch.manual_seed(seed)
# env.reset(seed=seed)
# env.action_space.seed(seed)
# env.observation_space.seed(seed)
# if torch.cuda.is_available():
#     torch.cuda.manual_seed(seed)


######################################################################
# Replay Memory
# -------------
#
# We'll be using experience replay memory for training our DQN. It stores
# the transitions that the agent observes, allowing us to reuse this data
# later. By sampling from it randomly, the transitions that build up a
# batch are decorrelated. It has been shown that this greatly stabilizes
# and improves the DQN training procedure.
#
# For this, we're going to need two classes:
#
# - ``Transition`` - a named tuple representing a single transition in
#   our environment. It essentially maps (state, action) pairs
#   to their (next_state, reward) result, with the state being the
#   environment observation described later on.
# - ``ReplayMemory`` - a cyclic buffer of bounded size that holds the
#   transitions observed recently. It also implements a ``.sample()``
#   method for selecting a random batch of transitions for training.
#

Transition = namedtuple('Transition',
                        ('state', 'action', 'next_state', 'reward'))


class ReplayMemory(object):

    def __init__(self, capacity):
        self.memory = deque([], maxlen=capacity)

    def push(self, *args):
        """Save a transition"""
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
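

######################################################################
# As a quick illustration of how the buffer will be used during training,
# here is a minimal sketch; the tensor shapes mirror what the training loop
# below actually stores:
#
# .. code-block:: python
#
#    memory = ReplayMemory(100)
#    state = torch.zeros(1, 4)        # a batched CartPole observation
#    action = torch.tensor([[0]])     # the chosen action
#    next_state = torch.zeros(1, 4)   # the following observation (or None if terminal)
#    reward = torch.tensor([1.0])     # the reward received
#    memory.push(state, action, next_state, reward)
#    batch = memory.sample(1)         # a list of Transition tuples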


######################################################################
# Now, let's define our model. But first, let's quickly recap what a DQN is.
#
# DQN algorithm
# -------------
#
# Our environment is deterministic, so all equations presented here are
# also formulated deterministically for the sake of simplicity. In the
# reinforcement learning literature, they would also contain expectations
# over stochastic transitions in the environment.
#
# Our aim will be to train a policy that tries to maximize the discounted,
# cumulative reward
# :math:`R_{t_0} = \sum_{t=t_0}^{\infty} \gamma^{t - t_0} r_t`, where
# :math:`R_{t_0}` is also known as the *return*. The discount,
# :math:`\gamma`, should be a constant between :math:`0` and :math:`1`
# that ensures the sum converges. A lower :math:`\gamma` makes
# rewards from the uncertain far future less important for our agent
# than the ones in the near future that it can be fairly confident
# about. It also encourages agents to collect reward closer in time
# than equivalent rewards that are temporally far away in the future.
#
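# For a concrete feel of the discount, consider a small worked example: with
# :math:`\gamma = 0.99` and three consecutive rewards of :math:`1`, the return
# from the first step is
# :math:`1 + 0.99 \cdot 1 + 0.99^2 \cdot 1 \approx 2.97`.
#
# .. code-block:: python
#
#    gamma = 0.99
#    rewards = [1.0, 1.0, 1.0]
#    ret = sum(gamma ** t * r for t, r in enumerate(rewards))  # ~2.9701
#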
# The main idea behind Q-learning is that if we had a function
# :math:`Q^*: State \times Action \rightarrow \mathbb{R}`, that could tell
# us what our return would be, if we were to take an action in a given
# state, then we could easily construct a policy that maximizes our
# rewards:
#
# .. math:: \pi^*(s) = \arg\!\max_a \ Q^*(s, a)
#
# However, we don't know everything about the world, so we don't have
# access to :math:`Q^*`. But, since neural networks are universal function
# approximators, we can simply create one and train it to resemble
# :math:`Q^*`.
#
# For our training update rule, we'll use the fact that every :math:`Q`
# function for some policy obeys the Bellman equation:
#
# .. math:: Q^{\pi}(s, a) = r + \gamma Q^{\pi}(s', \pi(s'))
#
# The difference between the two sides of the equality is known as the
# temporal difference error, :math:`\delta`:
#
# .. math:: \delta = Q(s, a) - (r + \gamma \max_{a'} Q(s', a'))
#
# To minimize this error, we will use the `Huber
# loss <https://en.wikipedia.org/wiki/Huber_loss>`__. The Huber loss acts
# like the mean squared error when the error is small, but like the mean
# absolute error when the error is large - this makes it more robust to
# outliers when the estimates of :math:`Q` are very noisy. We calculate
# this over a batch of transitions, :math:`B`, sampled from the replay
# memory:
#
# .. math::
#
#    \mathcal{L} = \frac{1}{|B|}\sum_{(s, a, s', r) \ \in \ B} \mathcal{L}(\delta)
#
# .. math::
#
#    \text{where} \quad \mathcal{L}(\delta) = \begin{cases}
#      \frac{1}{2}{\delta^2}  & \text{for } |\delta| \le 1, \\
#      |\delta| - \frac{1}{2} & \text{otherwise.}
#    \end{cases}
#
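# In the training code below this is handled by ``torch.nn.SmoothL1Loss``,
# which with its default ``beta=1.0`` matches the piecewise definition above.
# As a minimal sketch of the elementwise term:
#
# .. code-block:: python
#
#    def huber(delta):
#        if abs(delta) <= 1.0:
#            return 0.5 * delta ** 2
#        return abs(delta) - 0.5
#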
# Q-network
# ^^^^^^^^^
#
# Our model will be a feed forward neural network that takes in the
# current environment state (the 4 observation values described above). It has two
# outputs, representing :math:`Q(s, \mathrm{left})` and
# :math:`Q(s, \mathrm{right})` (where :math:`s` is the input to the
# network). In effect, the network is trying to predict the *expected return* of
# taking each action given the current input.
#

class DQN(nn.Module):

    def __init__(self, n_observations, n_actions):
        super(DQN, self).__init__()
        self.layer1 = nn.Linear(n_observations, 128)
        self.layer2 = nn.Linear(128, 128)
        self.layer3 = nn.Linear(128, n_actions)

    # Called with either one element to determine next action, or a batch
    # during optimization. Returns tensor([[left0exp,right0exp]...]).
    def forward(self, x):
        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
        return self.layer3(x)


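######################################################################
# A quick sanity check of the architecture: for CartPole the network maps a
# batch of 4-dimensional observations to one Q-value per action. This is a
# minimal sketch, assuming the ``DQN`` class defined above:
#
# .. code-block:: python
#
#    net = DQN(n_observations=4, n_actions=2)
#    q_values = net(torch.randn(1, 4))   # shape: torch.Size([1, 2])
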
######################################################################
# Training
# --------
#
# Hyperparameters and utilities
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# This cell instantiates our model and its optimizer, and defines some
# utilities:
#
# - ``select_action`` - will select an action according to an epsilon
#   greedy policy. Simply put, we'll sometimes use our model for choosing
#   the action, and sometimes we'll just sample one uniformly. The
#   probability of choosing a random action will start at ``EPS_START``
#   and will decay exponentially towards ``EPS_END``. ``EPS_DECAY``
#   controls the rate of the decay.
# - ``plot_durations`` - a helper for plotting the duration of episodes,
#   along with an average over the last 100 episodes (the measure used in
#   the official evaluations). The plot will be underneath the cell
#   containing the main training loop, and will update after every
#   episode.
#

# BATCH_SIZE is the number of transitions sampled from the replay buffer
# GAMMA is the discount factor as mentioned in the previous section
# EPS_START is the starting value of epsilon
# EPS_END is the final value of epsilon
# EPS_DECAY controls the rate of exponential decay of epsilon, higher means a slower decay
# TAU is the update rate of the target network
# LR is the learning rate of the ``AdamW`` optimizer

BATCH_SIZE = 128
GAMMA = 0.99
EPS_START = 0.9
EPS_END = 0.01
EPS_DECAY = 2500
TAU = 0.005
LR = 3e-4


# Get number of actions from gym action space
n_actions = env.action_space.n
# Get the number of state observations
state, info = env.reset()
n_observations = len(state)

policy_net = DQN(n_observations, n_actions).to(device)
target_net = DQN(n_observations, n_actions).to(device)
target_net.load_state_dict(policy_net.state_dict())

optimizer = optim.AdamW(policy_net.parameters(), lr=LR, amsgrad=True)
memory = ReplayMemory(10000)


steps_done = 0


def select_action(state):
    global steps_done
    sample = random.random()
    eps_threshold = EPS_END + (EPS_START - EPS_END) * \
        math.exp(-1. * steps_done / EPS_DECAY)
    steps_done += 1
    if sample > eps_threshold:
        with torch.no_grad():
            # policy_net(state).max(1) returns the largest Q-value of each
            # row along with its index; ``.indices`` therefore picks the
            # greedy action, i.e. the one with the larger expected reward.
            return policy_net(state).max(1).indices.view(1, 1)
    else:
        return torch.tensor([[env.action_space.sample()]], device=device, dtype=torch.long)
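

######################################################################
# To get intuition for how fast exploration decays with the values chosen
# above, you can evaluate the same ``eps_threshold`` formula at a few points.
# This is a minimal sketch using the constants defined earlier:
#
# .. code-block:: python
#
#    for steps in (0, 1000, 5000, 20000):
#        eps = EPS_END + (EPS_START - EPS_END) * math.exp(-steps / EPS_DECAY)
#        print(steps, round(eps, 3))
#    # 0 0.9, 1000 0.607, 5000 0.13, 20000 0.01 (approximately)
#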


episode_durations = []


def plot_durations(show_result=False):
    plt.figure(1)
    durations_t = torch.tensor(episode_durations, dtype=torch.float)
    if show_result:
        plt.title('Result')
    else:
        plt.clf()
        plt.title('Training...')
    plt.xlabel('Episode')
    plt.ylabel('Duration')
    plt.plot(durations_t.numpy())
    # Take 100 episode averages and plot them too
    if len(durations_t) >= 100:
        means = durations_t.unfold(0, 100, 1).mean(1).view(-1)
        means = torch.cat((torch.zeros(99), means))
        plt.plot(means.numpy())

    plt.pause(0.001)  # pause a bit so that plots are updated
    if is_ipython:
        if not show_result:
            display.display(plt.gcf())
            display.clear_output(wait=True)
        else:
            display.display(plt.gcf())


######################################################################
# Training loop
# ^^^^^^^^^^^^^
#
# Finally, the code for training our model.
#
# Here, you can find an ``optimize_model`` function that performs a
# single step of the optimization. It first samples a batch, concatenates
# all the tensors into a single one, computes :math:`Q(s_t, a_t)` and
# :math:`V(s_{t+1}) = \max_a Q(s_{t+1}, a)`, and combines them into our
# loss. By definition we set :math:`V(s) = 0` if :math:`s` is a terminal
# state. We also use a target network to compute :math:`V(s_{t+1})` for
# added stability. The target network is updated at every step with a
# `soft update <https://arxiv.org/pdf/1509.02971.pdf>`__ controlled by
# the hyperparameter ``TAU``, which was previously defined.
#

def optimize_model():
    if len(memory) < BATCH_SIZE:
        return
    transitions = memory.sample(BATCH_SIZE)
    # Transpose the batch (see https://stackoverflow.com/a/19343/3343043 for
    # detailed explanation). This converts batch-array of Transitions
    # to Transition of batch-arrays.
    batch = Transition(*zip(*transitions))

    # Compute a mask of non-final states and concatenate the batch elements
    # (a final state would've been the one after which simulation ended)
    non_final_mask = torch.tensor(tuple(map(lambda s: s is not None,
                                            batch.next_state)), device=device, dtype=torch.bool)
    non_final_next_states = torch.cat([s for s in batch.next_state
                                       if s is not None])
    state_batch = torch.cat(batch.state)
    action_batch = torch.cat(batch.action)
    reward_batch = torch.cat(batch.reward)

    # Compute Q(s_t, a) - the model computes Q(s_t), then we select the
    # columns of actions taken. These are the actions which would've been taken
    # for each batch state according to policy_net
    state_action_values = policy_net(state_batch).gather(1, action_batch)

    # Compute V(s_{t+1}) for all next states.
    # Expected values of actions for non_final_next_states are computed based
    # on the "older" target_net; selecting their best reward with max(1).values
    # This is merged based on the mask, such that we'll have either the expected
    # state value or 0 in case the state was final.
    next_state_values = torch.zeros(BATCH_SIZE, device=device)
    with torch.no_grad():
        next_state_values[non_final_mask] = target_net(non_final_next_states).max(1).values
    # Compute the expected Q values
    expected_state_action_values = (next_state_values * GAMMA) + reward_batch

    # Compute Huber loss
    criterion = nn.SmoothL1Loss()
    loss = criterion(state_action_values, expected_state_action_values.unsqueeze(1))

    # Optimize the model
    optimizer.zero_grad()
    loss.backward()
    # In-place gradient clipping
    torch.nn.utils.clip_grad_value_(policy_net.parameters(), 100)
    optimizer.step()


######################################################################
#
# Below, you can find the main training loop. At the beginning we reset
# the environment and obtain the initial ``state`` Tensor. Then, we sample
# an action, execute it, observe the next state and the reward (always
# 1), and optimize our model once. When the episode ends (our model
# fails), we restart the loop.
#
# Below, ``num_episodes`` is set to 600 if a GPU is available, otherwise 50
# episodes are scheduled so training does not take too long. However, 50
# episodes is insufficient to observe good performance on CartPole.
# You should see the model consistently achieve 500 steps within 600 training
# episodes. Training RL agents can be a noisy process, so restarting training
# can produce better results if convergence is not observed.
#

if torch.cuda.is_available() or torch.backends.mps.is_available():
    num_episodes = 600
else:
    num_episodes = 50

for i_episode in range(num_episodes):
    # Initialize the environment and get its state
    state, info = env.reset()
    state = torch.tensor(state, dtype=torch.float32, device=device).unsqueeze(0)
    for t in count():
        action = select_action(state)
        observation, reward, terminated, truncated, _ = env.step(action.item())
        reward = torch.tensor([reward], device=device)
        done = terminated or truncated

        if terminated:
            next_state = None
        else:
            next_state = torch.tensor(observation, dtype=torch.float32, device=device).unsqueeze(0)

        # Store the transition in memory
        memory.push(state, action, next_state, reward)

        # Move to the next state
        state = next_state

        # Perform one step of the optimization (on the policy network)
        optimize_model()

        # Soft update of the target network's weights
        # θ′ ← τ θ + (1 − τ) θ′
        target_net_state_dict = target_net.state_dict()
        policy_net_state_dict = policy_net.state_dict()
        for key in policy_net_state_dict:
            target_net_state_dict[key] = policy_net_state_dict[key]*TAU + target_net_state_dict[key]*(1-TAU)
        target_net.load_state_dict(target_net_state_dict)

        if done:
            episode_durations.append(t + 1)
            plot_durations()
            break

print('Complete')
plot_durations(show_result=True)
plt.ioff()
plt.show()

######################################################################
# Here is the diagram that illustrates the overall resulting data flow.
#
# .. figure:: /_static/img/reinforcement_learning_diagram.jpg
#
# Actions are chosen either randomly or based on a policy, getting the next
# step sample from the gym environment. We record the results in the
# replay memory and also run an optimization step on every iteration.
# Optimization picks a random batch from the replay memory to do training of the
# new policy. The "older" target_net is also used in optimization to compute the
# expected Q values. A soft update of its weights is performed at every step.
#
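
######################################################################
# Once training has finished, you may want to watch the learned policy act
# greedily (no exploration). The following is a minimal sketch, reusing the
# ``policy_net`` trained above and creating a fresh environment with human
# rendering:
#
# .. code-block:: python
#
#    eval_env = gym.make("CartPole-v1", render_mode="human")
#    obs, info = eval_env.reset()
#    done = False
#    while not done:
#        state = torch.tensor(obs, dtype=torch.float32, device=device).unsqueeze(0)
#        with torch.no_grad():
#            action = policy_net(state).max(1).indices.item()
#        obs, reward, terminated, truncated, _ = eval_env.step(action)
#        done = terminated or truncated
#    eval_env.close()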