CoCalc -- profiler

GitHub Repository: pytorch/tutorials
Path: blob/main/recipes_source/recipes/profiler_recipe.py
¹⁶⁹⁴ views
1
"""
2
PyTorch Profiler
3
====================================
4
**Author:** `Shivam Raikundalia <https://github.com/sraikund16>`_
5
"""
6

7
######################################################################
8
# This recipe explains how to use PyTorch profiler and measure the time and
9
# memory consumption of the model's operators.
10
#
11
# Introduction
12
# ------------
13
# PyTorch includes a simple profiler API that is useful when the user needs
14
# to determine the most expensive operators in the model.
15
#
16
# In this recipe, we will use a simple Resnet model to demonstrate how to
17
# use the profiler to analyze model performance.
18
#
19
# Prerequisites
20
# ---------------
21
# - ``torch >= 2.3.0``
22
#
23
# Setup
24
# -----
25
# To install ``torch`` and ``torchvision`` use the following command:
26
#
27
# .. code-block:: sh
28
#
29
#    pip install torch torchvision
30
#
31

32
######################################################################
33
# Steps
34
# -----
35
#
36
# 1. Import all necessary libraries
37
# 2. Instantiate a simple Resnet model
38
# 3. Using profiler to analyze execution time
39
# 4. Using profiler to analyze memory consumption
40
# 5. Using tracing functionality
41
# 6. Examining stack traces
42
# 7. Using profiler to analyze long-running jobs
43
#
44
# 1. Import all necessary libraries
45
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
46
#
47
# In this recipe we will use ``torch``, ``torchvision.models``
48
# and ``profiler`` modules:
49
#
50

51
import torch
52
import torchvision.models as models
53
from torch.profiler import profile, ProfilerActivity, record_function
54

55

56
######################################################################
57
# 2. Instantiate a simple Resnet model
58
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
59
#
60
# Let's create an instance of a Resnet model and prepare an input
61
# for it:
62
#
63

64
model = models.resnet18()
65
inputs = torch.randn(5, 3, 224, 224)
66

67
######################################################################
68
# 3. Using profiler to analyze execution time
69
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
70
#
71
# PyTorch profiler is enabled through the context manager and accepts
72
# a number of parameters, some of the most useful are:
73
#
74
# - ``activities`` - a list of activities to profile:
75
#    - ``ProfilerActivity.CPU`` - PyTorch operators, TorchScript functions and
76
#      user-defined code labels (see ``record_function`` below);
77
#    - ``ProfilerActivity.CUDA`` - on-device CUDA kernels;
78
#    - ``ProfilerActivity.XPU`` - on-device XPU kernels;
79
# - ``record_shapes`` - whether to record shapes of the operator inputs;
80
# - ``profile_memory`` - whether to report amount of memory consumed by
81
#   model's Tensors;
82
#
83
# Note: when using CUDA, profiler also shows the runtime CUDA events
84
# occurring on the host.
85

86
######################################################################
87
# Let's see how we can use profiler to analyze the execution time:
88

89
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
90
    with record_function("model_inference"):
91
        model(inputs)
92

93
######################################################################
94
# Note that we can use ``record_function`` context manager to label
95
# arbitrary code ranges with user provided names
96
# (``model_inference`` is used as a label in the example above).
97
#
98
# Profiler allows one to check which operators were called during the
99
# execution of a code range wrapped with a profiler context manager.
100
# If multiple profiler ranges are active at the same time (e.g. in
101
# parallel PyTorch threads), each profiling context manager tracks only
102
# the operators of its corresponding range.
103
# Profiler also automatically profiles the asynchronous tasks launched
104
# with ``torch.jit._fork`` and (in case of a backward pass)
105
# the backward pass operators launched with ``backward()`` call.
106
#
107
# Let's print out the stats for the execution above:
108

109
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
110

111
######################################################################
112
# The output will look like (omitting some columns):
113
#
114
# .. code-block:: sh
115
#
116
#    ---------------------------------  ------------  ------------  ------------  ------------
117
#                                 Name      Self CPU     CPU total  CPU time avg    # of Calls
118
#    ---------------------------------  ------------  ------------  ------------  ------------
119
#                      model_inference       5.509ms      57.503ms      57.503ms             1
120
#                         aten::conv2d     231.000us      31.931ms       1.597ms            20
121
#                    aten::convolution     250.000us      31.700ms       1.585ms            20
122
#                   aten::_convolution     336.000us      31.450ms       1.573ms            20
123
#             aten::mkldnn_convolution      30.838ms      31.114ms       1.556ms            20
124
#                     aten::batch_norm     211.000us      14.693ms     734.650us            20
125
#         aten::_batch_norm_impl_index     319.000us      14.482ms     724.100us            20
126
#              aten::native_batch_norm       9.229ms      14.109ms     705.450us            20
127
#                           aten::mean     332.000us       2.631ms     125.286us            21
128
#                         aten::select       1.668ms       2.292ms       8.988us           255
129
#    ---------------------------------  ------------  ------------  ------------  ------------
130
#    Self CPU time total: 57.549m
131
#
132

133
######################################################################
134
# Here we see that, as expected, most of the time is spent in convolution (and specifically in ``mkldnn_convolution``
135
# for PyTorch compiled with ``MKL-DNN`` support).
136
# Note the difference between self cpu time and cpu time - operators can call other operators, self cpu time excludes time
137
# spent in children operator calls, while total cpu time includes it. You can choose to sort by the self cpu time by passing
138
# ``sort_by="self_cpu_time_total"`` into the ``table`` call.
139
#
140
# To get a finer granularity of results and include operator input shapes, pass ``group_by_input_shape=True``
141
# (note: this requires running the profiler with ``record_shapes=True``):
142

143
print(
144
    prof.key_averages(group_by_input_shape=True).table(
145
        sort_by="cpu_time_total", row_limit=10
146
    )
147
)
148

149
########################################################################################
150
# The output might look like this (omitting some columns):
151
#
152
# .. code-block:: sh
153
#
154
#    ---------------------------------  ------------  -------------------------------------------
155
#                                 Name     CPU total                                 Input Shapes
156
#    ---------------------------------  ------------  -------------------------------------------
157
#                      model_inference      57.503ms                                           []
158
#                         aten::conv2d       8.008ms      [5,64,56,56], [64,64,3,3], [], ..., []]
159
#                    aten::convolution       7.956ms     [[5,64,56,56], [64,64,3,3], [], ..., []]
160
#                   aten::_convolution       7.909ms     [[5,64,56,56], [64,64,3,3], [], ..., []]
161
#             aten::mkldnn_convolution       7.834ms     [[5,64,56,56], [64,64,3,3], [], ..., []]
162
#                         aten::conv2d       6.332ms    [[5,512,7,7], [512,512,3,3], [], ..., []]
163
#                    aten::convolution       6.303ms    [[5,512,7,7], [512,512,3,3], [], ..., []]
164
#                   aten::_convolution       6.273ms    [[5,512,7,7], [512,512,3,3], [], ..., []]
165
#             aten::mkldnn_convolution       6.233ms    [[5,512,7,7], [512,512,3,3], [], ..., []]
166
#                         aten::conv2d       4.751ms  [[5,256,14,14], [256,256,3,3], [], ..., []]
167
#    ---------------------------------  ------------  -------------------------------------------
168
#    Self CPU time total: 57.549ms
169
#
170

171
######################################################################
172
# Note the occurrence of ``aten::convolution`` twice with different input shapes.
173

174
######################################################################
175
# Profiler can also be used to analyze performance of models executed on GPUs:
176
# Users could switch between cpu, cuda and xpu
177
activities = [ProfilerActivity.CPU]
178
if torch.cuda.is_available():
179
    device = "cuda"
180
    activities += [ProfilerActivity.CUDA]
181
elif torch.xpu.is_available():
182
    device = "xpu"
183
    activities += [ProfilerActivity.XPU]
184
else:
185
    print(
186
        "Neither CUDA nor XPU devices are available to demonstrate profiling on acceleration devices"
187
    )
188
    import sys
189

190
    sys.exit(0)
191

192
sort_by_keyword = device + "_time_total"
193

194
model = models.resnet18().to(device)
195
inputs = torch.randn(5, 3, 224, 224).to(device)
196

197
with profile(activities=activities, record_shapes=True) as prof:
198
    with record_function("model_inference"):
199
        model(inputs)
200

201
print(prof.key_averages().table(sort_by=sort_by_keyword, row_limit=10))
202

203
######################################################################
204
# (Note: the first use of CUDA profiling may bring an extra overhead.)
205

206
######################################################################
207
# The resulting table output (omitting some columns):
208
#
209
# .. code-block:: sh
210
#
211
#    -------------------------------------------------------  ------------  ------------
212
#                                                       Name     Self CUDA    CUDA total
213
#    -------------------------------------------------------  ------------  ------------
214
#                                            model_inference       0.000us      11.666ms
215
#                                               aten::conv2d       0.000us      10.484ms
216
#                                          aten::convolution       0.000us      10.484ms
217
#                                         aten::_convolution       0.000us      10.484ms
218
#                                 aten::_convolution_nogroup       0.000us      10.484ms
219
#                                          aten::thnn_conv2d       0.000us      10.484ms
220
#                                  aten::thnn_conv2d_forward      10.484ms      10.484ms
221
#    void at::native::im2col_kernel<float>(long, float co...       3.844ms       3.844ms
222
#                                          sgemm_32x32x32_NN       3.206ms       3.206ms
223
#                                      sgemm_32x32x32_NN_vec       3.093ms       3.093ms
224
#    -------------------------------------------------------  ------------  ------------
225
#    Self CPU time total: 23.015ms
226
#    Self CUDA time total: 11.666ms
227
#
228

229
######################################################################
230
# (Note: the first use of XPU profiling may bring an extra overhead.)
231

232
######################################################################
233
# The resulting table output (omitting some columns):
234
#
235
# .. code-block:: sh
236
#
237
#    ------------------------------  ------------  ------------  ------------  ------------  ------------
238
#                              Name      Self XPU    Self XPU %     XPU total  XPU time avg    # of Calls
239
#    ------------------------------  ------------  ------------  ------------  ------------  ------------
240
#                   model_inference       0.000us         0.00%       2.567ms       2.567ms             1
241
#                      aten::conv2d       0.000us         0.00%       1.871ms      93.560us            20
242
#                 aten::convolution       0.000us         0.00%       1.871ms      93.560us            20
243
#                aten::_convolution       0.000us         0.00%       1.871ms      93.560us            20
244
#    aten::convolution_overrideable       1.871ms        72.89%       1.871ms      93.560us            20
245
#                          gen_conv       1.484ms        57.82%       1.484ms      74.216us            20
246
#                  aten::batch_norm       0.000us         0.00%     432.640us      21.632us            20
247
#      aten::_batch_norm_impl_index       0.000us         0.00%     432.640us      21.632us            20
248
#           aten::native_batch_norm     432.640us        16.85%     432.640us      21.632us            20
249
#                      conv_reorder     386.880us        15.07%     386.880us       6.448us            60
250
#    ------------------------------  ------------  ------------  ------------  ------------  ------------
251
#    Self CPU time total: 712.486ms
252
#    Self XPU time total: 2.567ms
253
#
254

255
######################################################################
256
# Note the occurrence of on-device kernels in the output (e.g. ``sgemm_32x32x32_NN`` for CUDA or ``gen_conv`` for XPU).
257

258
######################################################################
259
# 4. Using profiler to analyze memory consumption
260
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
261
#
262
# PyTorch profiler can also show the amount of memory (used by the model's tensors)
263
# that was allocated (or released) during the execution of the model's operators.
264
# In the output below, 'self' memory corresponds to the memory allocated (released)
265
# by the operator, excluding the children calls to the other operators.
266
# To enable memory profiling functionality pass ``profile_memory=True``.
267

268
model = models.resnet18()
269
inputs = torch.randn(5, 3, 224, 224)
270

271
with profile(
272
    activities=[ProfilerActivity.CPU], profile_memory=True, record_shapes=True
273
) as prof:
274
    model(inputs)
275

276
print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))
277

278
# (omitting some columns)
279
# ---------------------------------  ------------  ------------  ------------
280
#                              Name       CPU Mem  Self CPU Mem    # of Calls
281
# ---------------------------------  ------------  ------------  ------------
282
#                       aten::empty      94.79 Mb      94.79 Mb           121
283
#     aten::max_pool2d_with_indices      11.48 Mb      11.48 Mb             1
284
#                       aten::addmm      19.53 Kb      19.53 Kb             1
285
#               aten::empty_strided         572 b         572 b            25
286
#                     aten::resize_         240 b         240 b             6
287
#                         aten::abs         480 b         240 b             4
288
#                         aten::add         160 b         160 b            20
289
#               aten::masked_select         120 b         112 b             1
290
#                          aten::ne         122 b          53 b             6
291
#                          aten::eq          60 b          30 b             2
292
# ---------------------------------  ------------  ------------  ------------
293
# Self CPU time total: 53.064ms
294

295
print(prof.key_averages().table(sort_by="cpu_memory_usage", row_limit=10))
296

297
#############################################################################
298
# The output might look like this (omitting some columns):
299
#
300
# .. code-block:: sh
301
#
302
#    ---------------------------------  ------------  ------------  ------------
303
#                                 Name       CPU Mem  Self CPU Mem    # of Calls
304
#    ---------------------------------  ------------  ------------  ------------
305
#                          aten::empty      94.79 Mb      94.79 Mb           121
306
#                     aten::batch_norm      47.41 Mb           0 b            20
307
#         aten::_batch_norm_impl_index      47.41 Mb           0 b            20
308
#              aten::native_batch_norm      47.41 Mb           0 b            20
309
#                         aten::conv2d      47.37 Mb           0 b            20
310
#                    aten::convolution      47.37 Mb           0 b            20
311
#                   aten::_convolution      47.37 Mb           0 b            20
312
#             aten::mkldnn_convolution      47.37 Mb           0 b            20
313
#                     aten::max_pool2d      11.48 Mb           0 b             1
314
#        aten::max_pool2d_with_indices      11.48 Mb      11.48 Mb             1
315
#    ---------------------------------  ------------  ------------  ------------
316
#    Self CPU time total: 53.064ms
317
#
318

319
######################################################################
320
# 5. Using tracing functionality
321
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
322
#
323
# Profiling results can be outputted as a ``.json`` trace file:
324
# Tracing CUDA or XPU kernels
325
# Users could switch between cpu, cuda and xpu
326
activities = [ProfilerActivity.CPU]
327
if torch.cuda.is_available():
328
    device = "cuda"
329
    activities += [ProfilerActivity.CUDA]
330
elif torch.xpu.is_available():
331
    device = "xpu"
332
    activities += [ProfilerActivity.XPU]
333
else:
334
    print(
335
        "Neither CUDA nor XPU devices are available to demonstrate profiling on acceleration devices"
336
    )
337
    import sys
338

339
    sys.exit(0)
340

341
model = models.resnet18().to(device)
342
inputs = torch.randn(5, 3, 224, 224).to(device)
343

344
with profile(activities=activities) as prof:
345
    model(inputs)
346

347
prof.export_chrome_trace("trace.json")
348

349
######################################################################
350
# You can examine the sequence of profiled operators and CUDA/XPU kernels
351
# in Chrome trace viewer (``chrome://tracing``):
352
#
353
# .. image:: ../../_static/img/trace_img.png
354
#    :scale: 25 %
355

356
######################################################################
357
# 6. Examining stack traces
358
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
359
#
360
# Profiler can be used to analyze Python and TorchScript stack traces:
361
sort_by_keyword = "self_" + device + "_time_total"
362

363
with profile(
364
    activities=activities,
365
    with_stack=True,
366
    experimental_config=torch._C._profiler._ExperimentalConfig(verbose=True),
367
) as prof:
368
    model(inputs)
369

370
# Print aggregated stats
371
print(prof.key_averages(group_by_stack_n=5).table(sort_by=sort_by_keyword, row_limit=2))
372

373
#################################################################################
374
# The output might look like this (omitting some columns):
375
#
376
# .. code-block:: sh
377
#
378
#    -------------------------  -----------------------------------------------------------
379
#                         Name  Source Location
380
#    -------------------------  -----------------------------------------------------------
381
#    aten::thnn_conv2d_forward  .../torch/nn/modules/conv.py(439): _conv_forward
382
#                               .../torch/nn/modules/conv.py(443): forward
383
#                               .../torch/nn/modules/module.py(1051): _call_impl
384
#                               .../site-packages/torchvision/models/resnet.py(63): forward
385
#                               .../torch/nn/modules/module.py(1051): _call_impl
386
#    aten::thnn_conv2d_forward  .../torch/nn/modules/conv.py(439): _conv_forward
387
#                               .../torch/nn/modules/conv.py(443): forward
388
#                               .../torch/nn/modules/module.py(1051): _call_impl
389
#                               .../site-packages/torchvision/models/resnet.py(59): forward
390
#                               .../torch/nn/modules/module.py(1051): _call_impl
391
#    -------------------------  -----------------------------------------------------------
392
#    Self CPU time total: 34.016ms
393
#    Self CUDA time total: 11.659ms
394
#
395

396
######################################################################
397
# Note the two convolutions and the two call sites in ``torchvision/models/resnet.py`` script.
398
#
399
# (Warning: stack tracing adds an extra profiling overhead.)
400

401
######################################################################
402
# 7. Using profiler to analyze long-running jobs
403
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
404
#
405
# PyTorch profiler offers an additional API to handle long-running jobs
406
# (such as training loops). Tracing all of the execution can be
407
# slow and result in very large trace files. To avoid this, use optional
408
# arguments:
409
#
410
# - ``schedule`` - specifies a function that takes an integer argument (step number)
411
#   as an input and returns an action for the profiler, the best way to use this parameter
412
#   is to use ``torch.profiler.schedule`` helper function that can generate a schedule for you;
413
# - ``on_trace_ready`` - specifies a function that takes a reference to the profiler as
414
#   an input and is called by the profiler each time the new trace is ready.
415
#
416
# To illustrate how the API works, let's first consider the following example with
417
# ``torch.profiler.schedule`` helper function:
418

419
from torch.profiler import schedule
420

421
my_schedule = schedule(skip_first=10, wait=5, warmup=1, active=3, repeat=2)
422

423
######################################################################
424
# Profiler assumes that the long-running job is composed of steps, numbered
425
# starting from zero. The example above defines the following sequence of actions
426
# for the profiler:
427
#
428
# 1. Parameter ``skip_first`` tells profiler that it should ignore the first 10 steps
429
#    (default value of ``skip_first`` is zero);
430
# 2. After the first ``skip_first`` steps, profiler starts executing profiler cycles;
431
# 3. Each cycle consists of three phases:
432
#
433
#    - idling (``wait=5`` steps), during this phase profiler is not active;
434
#    - warming up (``warmup=1`` steps), during this phase profiler starts tracing, but
435
#      the results are discarded; this phase is used to discard the samples obtained by
436
#      the profiler at the beginning of the trace since they are usually skewed by an extra
437
#      overhead;
438
#    - active tracing (``active=3`` steps), during this phase profiler traces and records data;
439
# 4. An optional ``repeat`` parameter specifies an upper bound on the number of cycles.
440
#    By default (zero value), profiler will execute cycles as long as the job runs.
441

442
######################################################################
443
# Thus, in the example above, profiler will skip the first 15 steps, spend the next step on the warm up,
444
# actively record the next 3 steps, skip another 5 steps, spend the next step on the warm up, actively
445
# record another 3 steps. Since the ``repeat=2`` parameter value is specified, the profiler will stop
446
# the recording after the first two cycles.
447
#
448
# At the end of each cycle profiler calls the specified ``on_trace_ready`` function and passes itself as
449
# an argument. This function is used to process the new trace - either by obtaining the table output or
450
# by saving the output on disk as a trace file.
451
#
452
# To send the signal to the profiler that the next step has started, call ``prof.step()`` function.
453
# The current profiler step is stored in ``prof.step_num``.
454
#
455
# The following example shows how to use all of the concepts above for CUDA and XPU Kernels:
456

457
sort_by_keyword = "self_" + device + "_time_total"
458

459

460
def trace_handler(p):
461
    output = p.key_averages().table(sort_by=sort_by_keyword, row_limit=10)
462
    print(output)
463
    p.export_chrome_trace("/tmp/trace_" + str(p.step_num) + ".json")
464

465

466
with profile(
467
    activities=activities,
468
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=2),
469
    on_trace_ready=trace_handler,
470
) as p:
471
    for idx in range(8):
472
        model(inputs)
473
        p.step()
474

475
######################################################################
476
# Learn More
477
# ----------
478
#
479
# Take a look at the following recipes/tutorials to continue your learning:
480
#
481
# - `PyTorch Benchmark <https://pytorch.org/tutorials/recipes/recipes/benchmark.html>`_
482
# - `Visualizing models, data, and training with TensorBoard <https://pytorch.org/tutorials/intermediate/tensorboard_tutorial.html>`_ tutorial
483
#
484

485
Product

Resources

Company