Path: blob/main/intermediate_source/reinforcement_q_learning.py
# -*- coding: utf-8 -*-
"""
Reinforcement Learning (DQN) Tutorial
=====================================
**Author**: `Adam Paszke <https://github.com/apaszke>`_
            `Mark Towers <https://github.com/pseudo-rnd-thoughts>`_


This tutorial shows how to use PyTorch to train a Deep Q Learning (DQN) agent
on the CartPole-v1 task from `Gymnasium <https://gymnasium.farama.org>`__.

You might find it helpful to read the original `Deep Q Learning (DQN) <https://arxiv.org/abs/1312.5602>`__ paper.

**Task**

The agent has to decide between two actions - moving the cart left or
right - so that the pole attached to it stays upright. You can find more
information about the environment and other more challenging environments at
`Gymnasium's website <https://gymnasium.farama.org/environments/classic_control/cart_pole/>`__.

.. figure:: /_static/img/cartpole.gif
   :alt: CartPole

   CartPole

As the agent observes the current state of the environment and chooses
an action, the environment *transitions* to a new state, and also
returns a reward that indicates the consequences of the action. In this
task, rewards are +1 for every incremental timestep and the environment
terminates if the pole falls over too far or the cart moves more than 2.4
units away from center. This means better performing scenarios will run
for a longer duration, accumulating a larger return.

The CartPole task is designed so that the inputs to the agent are 4 real
values representing the environment state (position, velocity, etc.).
We take these 4 inputs without any scaling and pass them through a
small fully-connected network with 2 outputs, one for each action.
The network is trained to predict the expected value for each action,
given the input state. The action with the highest expected value is
then chosen.


**Packages**


First, let's import the needed packages. We need
`gymnasium <https://gymnasium.farama.org/>`__ for the environment,
installed by using `pip`. This is a fork of the original OpenAI
Gym project and has been maintained by the same team since Gym v0.19.
If you are running this in Google Colab, run:

.. code-block:: bash

   %%bash
   pip3 install gymnasium[classic_control]

We'll also use the following from PyTorch:

-  neural networks (``torch.nn``)
-  optimization (``torch.optim``)
-  automatic differentiation (``torch.autograd``)

"""

import gymnasium as gym
import math
import random
import matplotlib
import matplotlib.pyplot as plt
from collections import namedtuple, deque
from itertools import count

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

env = gym.make("CartPole-v1")

# set up matplotlib
is_ipython = 'inline' in matplotlib.get_backend()
if is_ipython:
    from IPython import display

plt.ion()

# if GPU is to be used
device = torch.device(
    "cuda" if torch.cuda.is_available() else
    "mps" if torch.backends.mps.is_available() else
    "cpu"
)


# To ensure reproducibility during training, you can fix the random seeds
# by uncommenting the lines below. This makes the results consistent across
# runs, which is helpful for debugging or comparing different approaches.
#
# That said, allowing randomness can be beneficial in practice, as it lets
# the model explore different training trajectories.


# seed = 42
# random.seed(seed)
# torch.manual_seed(seed)
# env.reset(seed=seed)
# env.action_space.seed(seed)
# env.observation_space.seed(seed)
# if torch.cuda.is_available():
#     torch.cuda.manual_seed(seed)
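
######################################################################
# As a quick, illustrative look at the Gymnasium API we rely on below
# (``demo_env`` and the ``demo_*`` variables are throwaway names used only
# for this check, so the ``env`` created above for training stays untouched):
# ``reset()`` returns an initial observation plus an info dict, and ``step()``
# returns the next observation, the reward, and the ``terminated`` and
# ``truncated`` flags.

demo_env = gym.make("CartPole-v1")
demo_obs, demo_info = demo_env.reset()
print(demo_obs.shape)  # (4,): cart position, cart velocity, pole angle, pole angular velocity
demo_obs, demo_reward, demo_terminated, demo_truncated, demo_info = demo_env.step(
    demo_env.action_space.sample())
print(demo_reward, demo_terminated, demo_truncated)
demo_env.close()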

######################################################################
# Replay Memory
# -------------
#
# We'll be using experience replay memory for training our DQN. It stores
# the transitions that the agent observes, allowing us to reuse this data
# later. By sampling from it randomly, the transitions that build up a
# batch are decorrelated. It has been shown that this greatly stabilizes
# and improves the DQN training procedure.
#
# For this, we're going to need two classes:
#
# - ``Transition`` - a named tuple representing a single transition in
#   our environment. It essentially maps (state, action) pairs
#   to their (next_state, reward) result, with the state being the
#   environment observation described above.
# - ``ReplayMemory`` - a cyclic buffer of bounded size that holds the
#   transitions observed recently. It also implements a ``.sample()``
#   method for selecting a random batch of transitions for training.
#

Transition = namedtuple('Transition',
                        ('state', 'action', 'next_state', 'reward'))


class ReplayMemory(object):

    def __init__(self, capacity):
        self.memory = deque([], maxlen=capacity)

    def push(self, *args):
        """Save a transition"""
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
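
######################################################################
# A minimal usage sketch (``demo_memory`` and the placeholder tensors below
# are purely illustrative): once ``capacity`` is reached the oldest
# transitions are evicted, and ``sample`` draws a uniformly random batch.

demo_memory = ReplayMemory(3)
for i in range(5):
    demo_memory.push(torch.tensor([[float(i)]]),      # state
                     torch.tensor([[0]]),             # action
                     torch.tensor([[float(i + 1)]]),  # next_state
                     torch.tensor([1.0]))             # reward
print(len(demo_memory))       # 3 - the two oldest transitions were dropped
print(demo_memory.sample(2))  # a random batch of 2 Transition tuples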

######################################################################
# Now, let's define our model. But first, let's quickly recap what a DQN is.
#
# DQN algorithm
# -------------
#
# Our environment is deterministic, so all equations presented here are
# also formulated deterministically for the sake of simplicity. In the
# reinforcement learning literature, they would also contain expectations
# over stochastic transitions in the environment.
#
# Our aim will be to train a policy that tries to maximize the discounted,
# cumulative reward
# :math:`R_{t_0} = \sum_{t=t_0}^{\infty} \gamma^{t - t_0} r_t`, where
# :math:`R_{t_0}` is also known as the *return*. The discount,
# :math:`\gamma`, should be a constant between :math:`0` and :math:`1`
# that ensures the sum converges. A lower :math:`\gamma` makes
# rewards from the uncertain far future less important for our agent
# than the ones in the near future that it can be fairly confident
# about. It also encourages agents to collect reward closer in time
# than equivalent rewards that are temporally far away in the future.
#
# The main idea behind Q-learning is that if we had a function
# :math:`Q^*: State \times Action \rightarrow \mathbb{R}`, that could tell
# us what our return would be, if we were to take an action in a given
# state, then we could easily construct a policy that maximizes our
# rewards:
#
# .. math:: \pi^*(s) = \arg\!\max_a \ Q^*(s, a)
#
# However, we don't know everything about the world, so we don't have
# access to :math:`Q^*`. But, since neural networks are universal function
# approximators, we can simply create one and train it to resemble
# :math:`Q^*`.
#
# For our training update rule, we'll use the fact that every :math:`Q`
# function for some policy obeys the Bellman equation:
#
# .. math:: Q^{\pi}(s, a) = r + \gamma Q^{\pi}(s', \pi(s'))
#
# The difference between the two sides of the equality is known as the
# temporal difference error, :math:`\delta`:
#
# .. math:: \delta = Q(s, a) - (r + \gamma \max_{a'} Q(s', a'))
#
# To minimize this error, we will use the `Huber
# loss <https://en.wikipedia.org/wiki/Huber_loss>`__. The Huber loss acts
# like the mean squared error when the error is small, but like the mean
# absolute error when the error is large - this makes it more robust to
# outliers when the estimates of :math:`Q` are very noisy. We calculate
# this over a batch of transitions, :math:`B`, sampled from the replay
# memory:
#
# .. math::
#
#    \mathcal{L} = \frac{1}{|B|}\sum_{(s, a, s', r) \ \in \ B} \mathcal{L}(\delta)
#
# .. math::
#
#    \text{where} \quad \mathcal{L}(\delta) = \begin{cases}
#      \frac{1}{2}{\delta^2}  & \text{for } |\delta| \le 1, \\
#      |\delta| - \frac{1}{2} & \text{otherwise.}
#    \end{cases}
#
# Q-network
# ^^^^^^^^^
#
# Our model will be a feed-forward neural network that takes in the
# current environment observation. It has two
# outputs, representing :math:`Q(s, \mathrm{left})` and
# :math:`Q(s, \mathrm{right})` (where :math:`s` is the input to the
# network). In effect, the network is trying to predict the *expected return* of
# taking each action given the current input.
#

class DQN(nn.Module):

    def __init__(self, n_observations, n_actions):
        super(DQN, self).__init__()
        self.layer1 = nn.Linear(n_observations, 128)
        self.layer2 = nn.Linear(128, 128)
        self.layer3 = nn.Linear(128, n_actions)

    # Called with either one element to determine next action, or a batch
    # during optimization. Returns tensor([[left0exp,right0exp]...]).
    def forward(self, x):
        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
        return self.layer3(x)
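
######################################################################
# A quick, illustrative shape check (``demo_net`` and the zero tensor below
# are throwaway placeholders): for CartPole the network maps a batch of
# 4-dimensional observations to 2 Q-values, one per action.

demo_net = DQN(n_observations=4, n_actions=2)
demo_input = torch.zeros(1, 4)      # a single (batched) dummy observation
print(demo_net(demo_input).shape)   # torch.Size([1, 2])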

######################################################################
# Training
# --------
#
# Hyperparameters and utilities
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# This cell instantiates our model and its optimizer, and defines some
# utilities:
#
# - ``select_action`` - will select an action according to an epsilon
#   greedy policy. Simply put, we'll sometimes use our model for choosing
#   the action, and sometimes we'll just sample one uniformly. The
#   probability of choosing a random action will start at ``EPS_START``
#   and will decay exponentially towards ``EPS_END``. ``EPS_DECAY``
#   controls the rate of the decay.
# - ``plot_durations`` - a helper for plotting the duration of episodes,
#   along with an average over the last 100 episodes (the measure used in
#   the official evaluations). The plot will be underneath the cell
#   containing the main training loop, and will update after every
#   episode.
#

# BATCH_SIZE is the number of transitions sampled from the replay buffer
# GAMMA is the discount factor as mentioned in the previous section
# EPS_START is the starting value of epsilon
# EPS_END is the final value of epsilon
# EPS_DECAY controls the rate of exponential decay of epsilon, higher means a slower decay
# TAU is the update rate of the target network
# LR is the learning rate of the ``AdamW`` optimizer

BATCH_SIZE = 128
GAMMA = 0.99
EPS_START = 0.9
EPS_END = 0.01
EPS_DECAY = 2500
TAU = 0.005
LR = 3e-4


# Get the number of actions from the gym action space
n_actions = env.action_space.n
# Get the number of state observations
state, info = env.reset()
n_observations = len(state)

policy_net = DQN(n_observations, n_actions).to(device)
target_net = DQN(n_observations, n_actions).to(device)
target_net.load_state_dict(policy_net.state_dict())

optimizer = optim.AdamW(policy_net.parameters(), lr=LR, amsgrad=True)
memory = ReplayMemory(10000)


steps_done = 0


def select_action(state):
    global steps_done
    sample = random.random()
    eps_threshold = EPS_END + (EPS_START - EPS_END) * \
        math.exp(-1. * steps_done / EPS_DECAY)
    steps_done += 1
    if sample > eps_threshold:
        with torch.no_grad():
            # t.max(1) will return the largest column value of each row.
            # second column on max result is index of where max element was
            # found, so we pick action with the larger expected reward.
            return policy_net(state).max(1).indices.view(1, 1)
    else:
        return torch.tensor([[env.action_space.sample()]], device=device, dtype=torch.long)
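
######################################################################
# To get a feel for this schedule, the values below evaluate the same decay
# formula used in ``select_action`` at a few hypothetical step counts
# (purely illustrative; the global ``steps_done`` counter is not modified).

for demo_step in (0, 1000, 5000, 20000):
    demo_eps = EPS_END + (EPS_START - EPS_END) * math.exp(-1. * demo_step / EPS_DECAY)
    print(f"steps_done={demo_step:>6}: eps_threshold={demo_eps:.3f}")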

episode_durations = []


def plot_durations(show_result=False):
    plt.figure(1)
    durations_t = torch.tensor(episode_durations, dtype=torch.float)
    if show_result:
        plt.title('Result')
    else:
        plt.clf()
        plt.title('Training...')
    plt.xlabel('Episode')
    plt.ylabel('Duration')
    plt.plot(durations_t.numpy())
    # Take 100 episode averages and plot them too
    if len(durations_t) >= 100:
        means = durations_t.unfold(0, 100, 1).mean(1).view(-1)
        means = torch.cat((torch.zeros(99), means))
        plt.plot(means.numpy())

    plt.pause(0.001)  # pause a bit so that plots are updated
    if is_ipython:
        if not show_result:
            display.display(plt.gcf())
            display.clear_output(wait=True)
        else:
            display.display(plt.gcf())


######################################################################
# Training loop
# ^^^^^^^^^^^^^
#
# Finally, the code for training our model.
#
# Here, you can find an ``optimize_model`` function that performs a
# single step of the optimization. It first samples a batch, concatenates
# all the tensors into a single one, computes :math:`Q(s_t, a_t)` and
# :math:`V(s_{t+1}) = \max_a Q(s_{t+1}, a)`, and combines them into our
# loss. By definition we set :math:`V(s) = 0` if :math:`s` is a terminal
# state. We also use a target network to compute :math:`V(s_{t+1})` for
# added stability. The target network is updated at every step with a
# `soft update <https://arxiv.org/pdf/1509.02971.pdf>`__ controlled by
# the hyperparameter ``TAU``, which was previously defined.
#

def optimize_model():
    if len(memory) < BATCH_SIZE:
        return
    transitions = memory.sample(BATCH_SIZE)
    # Transpose the batch (see https://stackoverflow.com/a/19343/3343043 for
    # detailed explanation). This converts batch-array of Transitions
    # to Transition of batch-arrays.
    batch = Transition(*zip(*transitions))

    # Compute a mask of non-final states and concatenate the batch elements
    # (a final state would've been the one after which simulation ended)
    non_final_mask = torch.tensor(tuple(map(lambda s: s is not None,
                                            batch.next_state)), device=device, dtype=torch.bool)
    non_final_next_states = torch.cat([s for s in batch.next_state
                                       if s is not None])
    state_batch = torch.cat(batch.state)
    action_batch = torch.cat(batch.action)
    reward_batch = torch.cat(batch.reward)

    # Compute Q(s_t, a) - the model computes Q(s_t), then we select the
    # columns of actions taken. These are the actions which would've been taken
    # for each batch state according to policy_net
    state_action_values = policy_net(state_batch).gather(1, action_batch)

    # Compute V(s_{t+1}) for all next states.
    # Expected values of actions for non_final_next_states are computed based
    # on the "older" target_net; selecting their best reward with max(1).values
    # This is merged based on the mask, such that we'll have either the expected
    # state value or 0 in case the state was final.
    next_state_values = torch.zeros(BATCH_SIZE, device=device)
    with torch.no_grad():
        next_state_values[non_final_mask] = target_net(non_final_next_states).max(1).values
    # Compute the expected Q values
    expected_state_action_values = (next_state_values * GAMMA) + reward_batch

    # Compute Huber loss
    criterion = nn.SmoothL1Loss()
    loss = criterion(state_action_values, expected_state_action_values.unsqueeze(1))

    # Optimize the model
    optimizer.zero_grad()
    loss.backward()
    # In-place gradient clipping
    torch.nn.utils.clip_grad_value_(policy_net.parameters(), 100)
    optimizer.step()
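
######################################################################
# The ``gather`` call above is the one non-obvious indexing step: given the
# Q-values for every action, it picks out the Q-value of the action that was
# actually taken in each transition. A small standalone illustration with
# made-up numbers:

demo_q_values = torch.tensor([[1.0, 2.0],
                              [3.0, 4.0]])    # Q(s, a) for 2 states, 2 actions
demo_actions = torch.tensor([[1], [0]])       # the action taken in each state
print(demo_q_values.gather(1, demo_actions))  # tensor([[2.], [3.]])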

######################################################################
#
# Below, you can find the main training loop. At the beginning we reset
# the environment and obtain the initial ``state`` Tensor. Then, we sample
# an action, execute it, observe the next state and the reward (always
# 1), and optimize our model once. When the episode ends (our model
# fails), we restart the loop.
#
# Below, ``num_episodes`` is set to 600 if a GPU is available, otherwise 50
# episodes are scheduled so training does not take too long. However, 50
# episodes is insufficient to observe good performance on CartPole.
# You should see the model consistently achieve 500 steps within 600 training
# episodes. Training RL agents can be a noisy process, so restarting training
# can produce better results if convergence is not observed.
#

if torch.cuda.is_available() or torch.backends.mps.is_available():
    num_episodes = 600
else:
    num_episodes = 50

for i_episode in range(num_episodes):
    # Initialize the environment and get its state
    state, info = env.reset()
    state = torch.tensor(state, dtype=torch.float32, device=device).unsqueeze(0)
    for t in count():
        action = select_action(state)
        observation, reward, terminated, truncated, _ = env.step(action.item())
        reward = torch.tensor([reward], device=device)
        done = terminated or truncated

        if terminated:
            next_state = None
        else:
            next_state = torch.tensor(observation, dtype=torch.float32, device=device).unsqueeze(0)

        # Store the transition in memory
        memory.push(state, action, next_state, reward)

        # Move to the next state
        state = next_state

        # Perform one step of the optimization (on the policy network)
        optimize_model()

        # Soft update of the target network's weights
        # θ′ ← τ θ + (1 − τ)θ′
        target_net_state_dict = target_net.state_dict()
        policy_net_state_dict = policy_net.state_dict()
        for key in policy_net_state_dict:
            target_net_state_dict[key] = policy_net_state_dict[key]*TAU + target_net_state_dict[key]*(1-TAU)
        target_net.load_state_dict(target_net_state_dict)

        if done:
            episode_durations.append(t + 1)
            plot_durations()
            break

print('Complete')
plot_durations(show_result=True)
plt.ioff()
plt.show()

######################################################################
# Here is the diagram that illustrates the overall resulting data flow.
#
# .. figure:: /_static/img/reinforcement_learning_diagram.jpg
#
# Actions are chosen either randomly or based on a policy, getting the next
# step sample from the gym environment. We record the results in the
# replay memory and also run an optimization step on every iteration.
# Optimization picks a random batch from the replay memory to do training of the
# new policy. The "older" target_net is also used in optimization to compute the
# expected Q values. A soft update of its weights is performed at every step.
#
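
######################################################################
# A minimal follow-up sketch (``eval_env`` and the ``eval_*`` variables below
# are throwaway names for this check only): once training has finished, the
# learned policy can be run greedily, always picking the action with the
# highest predicted Q-value, to see how long it keeps the pole balanced.

eval_env = gym.make("CartPole-v1")
eval_obs, _ = eval_env.reset()
eval_state = torch.tensor(eval_obs, dtype=torch.float32, device=device).unsqueeze(0)
eval_return = 0.0
with torch.no_grad():
    for _ in count():
        eval_action = policy_net(eval_state).max(1).indices.view(1, 1)
        eval_obs, eval_reward, eval_terminated, eval_truncated, _ = eval_env.step(eval_action.item())
        eval_return += eval_reward
        if eval_terminated or eval_truncated:
            break
        eval_state = torch.tensor(eval_obs, dtype=torch.float32, device=device).unsqueeze(0)
eval_env.close()
print(f'Greedy evaluation return: {eval_return}')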