"""
PyTorch Profiler
====================================
**Author:** `Shivam Raikundalia <https://github.com/sraikund16>`_
"""

######################################################################
# This recipe explains how to use PyTorch profiler and measure the time and
# memory consumption of the model's operators.
#
# Introduction
# ------------
# PyTorch includes a simple profiler API that is useful when the user needs
# to determine the most expensive operators in the model.
#
# In this recipe, we will use a simple Resnet model to demonstrate how to
# use the profiler to analyze model performance.
#
# Prerequisites
# ---------------
# - ``torch >= 2.3.0``
#
# Setup
# -----
# To install ``torch`` and ``torchvision``, use the following command:
#
# .. code-block:: sh
#
#    pip install torch torchvision
#

######################################################################
# Steps
# -----
#
# 1. Import all necessary libraries
# 2. Instantiate a simple Resnet model
# 3. Using profiler to analyze execution time
# 4. Using profiler to analyze memory consumption
# 5. Using tracing functionality
# 6. Examining stack traces
# 7. Using profiler to analyze long-running jobs
#
# 1. Import all necessary libraries
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# In this recipe we will use ``torch``, ``torchvision.models``,
# and ``profiler`` modules:
#

import torch
import torchvision.models as models
from torch.profiler import profile, ProfilerActivity, record_function


######################################################################
# 2. Instantiate a simple Resnet model
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# Let's create an instance of a Resnet model and prepare an input
# for it:
#

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)

######################################################################
# 3. Using profiler to analyze execution time
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# PyTorch profiler is enabled through the context manager and accepts
# a number of parameters; some of the most useful are:
#
# - ``activities`` - a list of activities to profile:
#
#   - ``ProfilerActivity.CPU`` - PyTorch operators, TorchScript functions and
#     user-defined code labels (see ``record_function`` below);
#   - ``ProfilerActivity.CUDA`` - on-device CUDA kernels;
#   - ``ProfilerActivity.XPU`` - on-device XPU kernels;
#
# - ``record_shapes`` - whether to record the shapes of the operator inputs;
# - ``profile_memory`` - whether to report the amount of memory consumed by
#   the model's Tensors;
#
# Note: when using CUDA, profiler also shows the runtime CUDA events
# occurring on the host.

######################################################################
# Let's see how we can use profiler to analyze the execution time:

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("model_inference"):
        model(inputs)

######################################################################
# Note that we can use the ``record_function`` context manager to label
# arbitrary code ranges with user-provided names
# (``model_inference`` is used as a label in the example above).
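#
# Labels can also be nested to split a range into sub-ranges (a minimal
# sketch; the ``preprocess`` and ``forward_pass`` labels and the doubling
# step are purely illustrative):
#
# .. code-block:: python
#
#    with profile(activities=[ProfilerActivity.CPU]) as prof:
#        with record_function("preprocess"):
#            batch = inputs * 2
#        with record_function("forward_pass"):
#            model(batch)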
#
# Profiler allows one to check which operators were called during the
# execution of a code range wrapped with a profiler context manager.
# If multiple profiler ranges are active at the same time (e.g. in
# parallel PyTorch threads), each profiling context manager tracks only
# the operators of its corresponding range.
# Profiler also automatically profiles the asynchronous tasks launched
# with ``torch.jit._fork`` and (in case of a backward pass)
# the backward pass operators launched with a ``backward()`` call.
#
# Let's print out the stats for the execution above:

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

######################################################################
# The output will look like (omitting some columns):
#
# .. code-block:: sh
#
#    ---------------------------------  ------------  ------------  ------------  ------------
#                                 Name      Self CPU     CPU total  CPU time avg    # of Calls
#    ---------------------------------  ------------  ------------  ------------  ------------
#                      model_inference       5.509ms      57.503ms      57.503ms             1
#                         aten::conv2d     231.000us      31.931ms       1.597ms            20
#                    aten::convolution     250.000us      31.700ms       1.585ms            20
#                   aten::_convolution     336.000us      31.450ms       1.573ms            20
#             aten::mkldnn_convolution      30.838ms      31.114ms       1.556ms            20
#                     aten::batch_norm     211.000us      14.693ms     734.650us            20
#         aten::_batch_norm_impl_index     319.000us      14.482ms     724.100us            20
#              aten::native_batch_norm       9.229ms      14.109ms     705.450us            20
#                           aten::mean     332.000us       2.631ms     125.286us            21
#                         aten::select       1.668ms       2.292ms       8.988us           255
#    ---------------------------------  ------------  ------------  ------------  ------------
#    Self CPU time total: 57.549ms
#

######################################################################
# Here we see that, as expected, most of the time is spent in convolution (and specifically in ``mkldnn_convolution``
# for PyTorch compiled with ``MKL-DNN`` support).
# Note the difference between self CPU time and CPU time: operators can call other operators; self CPU time excludes
# time spent in child operator calls, while total CPU time includes it. You can choose to sort by self CPU time by
# passing ``sort_by="self_cpu_time_total"`` into the ``table`` call.
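# For example (a minimal sketch reusing the same ``prof`` object as above):
#
# .. code-block:: python
#
#    print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))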
#
# To get a finer granularity of results and include operator input shapes, pass ``group_by_input_shape=True``
# (note: this requires running the profiler with ``record_shapes=True``):

print(
    prof.key_averages(group_by_input_shape=True).table(
        sort_by="cpu_time_total", row_limit=10
    )
)

########################################################################################
# The output might look like this (omitting some columns):
#
# .. code-block:: sh
#
#    ---------------------------------  ------------  -------------------------------------------
#                                 Name     CPU total                                 Input Shapes
#    ---------------------------------  ------------  -------------------------------------------
#                      model_inference      57.503ms                                           []
#                         aten::conv2d       8.008ms      [[5,64,56,56], [64,64,3,3], [], ..., []]
#                    aten::convolution       7.956ms      [[5,64,56,56], [64,64,3,3], [], ..., []]
#                   aten::_convolution       7.909ms      [[5,64,56,56], [64,64,3,3], [], ..., []]
#             aten::mkldnn_convolution       7.834ms      [[5,64,56,56], [64,64,3,3], [], ..., []]
#                         aten::conv2d       6.332ms    [[5,512,7,7], [512,512,3,3], [], ..., []]
#                    aten::convolution       6.303ms    [[5,512,7,7], [512,512,3,3], [], ..., []]
#                   aten::_convolution       6.273ms    [[5,512,7,7], [512,512,3,3], [], ..., []]
#             aten::mkldnn_convolution       6.233ms    [[5,512,7,7], [512,512,3,3], [], ..., []]
#                         aten::conv2d       4.751ms  [[5,256,14,14], [256,256,3,3], [], ..., []]
#    ---------------------------------  ------------  -------------------------------------------
#    Self CPU time total: 57.549ms
#

######################################################################
# Note the occurrence of ``aten::convolution`` twice with different input shapes.

######################################################################
# Profiler can also be used to analyze the performance of models executed on accelerators.
# Users can switch between CPU, CUDA, and XPU:
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    device = "cuda"
    activities += [ProfilerActivity.CUDA]
elif torch.xpu.is_available():
    device = "xpu"
    activities += [ProfilerActivity.XPU]
else:
    print(
        "Neither CUDA nor XPU devices are available to demonstrate profiling on acceleration devices"
    )
    import sys

    sys.exit(0)

sort_by_keyword = device + "_time_total"

model = models.resnet18().to(device)
inputs = torch.randn(5, 3, 224, 224).to(device)

with profile(activities=activities, record_shapes=True) as prof:
    with record_function("model_inference"):
        model(inputs)

print(prof.key_averages().table(sort_by=sort_by_keyword, row_limit=10))

######################################################################
# (Note: the first use of CUDA profiling may incur extra overhead.)

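######################################################################
# A common way to keep this first-use overhead out of the measurements is to
# run a few warm-up iterations before entering the profiler context (a
# minimal sketch; the iteration count is arbitrary):
#
# .. code-block:: python
#
#    for _ in range(3):  # warm-up runs, executed before profiling starts
#        model(inputs)
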
######################################################################
# The resulting table output (omitting some columns):
#
# .. code-block:: sh
#
#    -------------------------------------------------------  ------------  ------------
#                                                       Name     Self CUDA    CUDA total
#    -------------------------------------------------------  ------------  ------------
#                                            model_inference       0.000us      11.666ms
#                                               aten::conv2d       0.000us      10.484ms
#                                          aten::convolution       0.000us      10.484ms
#                                         aten::_convolution       0.000us      10.484ms
#                                 aten::_convolution_nogroup       0.000us      10.484ms
#                                          aten::thnn_conv2d       0.000us      10.484ms
#                                  aten::thnn_conv2d_forward      10.484ms      10.484ms
#    void at::native::im2col_kernel<float>(long, float co...       3.844ms       3.844ms
#                                          sgemm_32x32x32_NN       3.206ms       3.206ms
#                                      sgemm_32x32x32_NN_vec       3.093ms       3.093ms
#    -------------------------------------------------------  ------------  ------------
#    Self CPU time total: 23.015ms
#    Self CUDA time total: 11.666ms
#

######################################################################
# (Note: the first use of XPU profiling may incur extra overhead.)

######################################################################
# The resulting table output (omitting some columns):
#
# .. code-block:: sh
#
#    ------------------------------  ------------  ------------  ------------  ------------  ------------
#                              Name      Self XPU    Self XPU %     XPU total  XPU time avg    # of Calls
#    ------------------------------  ------------  ------------  ------------  ------------  ------------
#                   model_inference       0.000us         0.00%       2.567ms       2.567ms             1
#                      aten::conv2d       0.000us         0.00%       1.871ms      93.560us            20
#                 aten::convolution       0.000us         0.00%       1.871ms      93.560us            20
#                aten::_convolution       0.000us         0.00%       1.871ms      93.560us            20
#    aten::convolution_overrideable       1.871ms        72.89%       1.871ms      93.560us            20
#                          gen_conv       1.484ms        57.82%       1.484ms      74.216us            20
#                  aten::batch_norm       0.000us         0.00%     432.640us      21.632us            20
#      aten::_batch_norm_impl_index       0.000us         0.00%     432.640us      21.632us            20
#           aten::native_batch_norm     432.640us        16.85%     432.640us      21.632us            20
#                      conv_reorder     386.880us        15.07%     386.880us       6.448us            60
#    ------------------------------  ------------  ------------  ------------  ------------  ------------
#    Self CPU time total: 712.486ms
#    Self XPU time total: 2.567ms
#

######################################################################
# Note the occurrence of on-device kernels in the output (e.g. ``sgemm_32x32x32_NN`` for CUDA or ``gen_conv`` for XPU).

######################################################################
# 4. Using profiler to analyze memory consumption
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# PyTorch profiler can also show the amount of memory (used by the model's tensors)
# that was allocated (or released) during the execution of the model's operators.
# In the output below, 'self' memory corresponds to the memory allocated (released)
# by the operator, excluding children calls to other operators.
# To enable the memory profiling functionality, pass ``profile_memory=True``.

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)

with profile(
    activities=[ProfilerActivity.CPU], profile_memory=True, record_shapes=True
) as prof:
    model(inputs)

print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))

# (omitting some columns)
# ---------------------------------  ------------  ------------  ------------
#                              Name       CPU Mem  Self CPU Mem    # of Calls
# ---------------------------------  ------------  ------------  ------------
#                       aten::empty      94.79 Mb      94.79 Mb           121
#     aten::max_pool2d_with_indices      11.48 Mb      11.48 Mb             1
#                       aten::addmm      19.53 Kb      19.53 Kb             1
#               aten::empty_strided         572 b         572 b            25
#                     aten::resize_         240 b         240 b             6
#                         aten::abs         480 b         240 b             4
#                         aten::add         160 b         160 b            20
#               aten::masked_select         120 b         112 b             1
#                          aten::ne         122 b          53 b             6
#                          aten::eq          60 b          30 b             2
# ---------------------------------  ------------  ------------  ------------
# Self CPU time total: 53.064ms

print(prof.key_averages().table(sort_by="cpu_memory_usage", row_limit=10))

#############################################################################
# The output might look like this (omitting some columns):
#
# .. code-block:: sh
#
#    ---------------------------------  ------------  ------------  ------------
#                                 Name       CPU Mem  Self CPU Mem    # of Calls
#    ---------------------------------  ------------  ------------  ------------
#                          aten::empty      94.79 Mb      94.79 Mb           121
#                     aten::batch_norm      47.41 Mb           0 b            20
#         aten::_batch_norm_impl_index      47.41 Mb           0 b            20
#              aten::native_batch_norm      47.41 Mb           0 b            20
#                         aten::conv2d      47.37 Mb           0 b            20
#                    aten::convolution      47.37 Mb           0 b            20
#                   aten::_convolution      47.37 Mb           0 b            20
#             aten::mkldnn_convolution      47.37 Mb           0 b            20
#                     aten::max_pool2d      11.48 Mb           0 b             1
#        aten::max_pool2d_with_indices      11.48 Mb      11.48 Mb             1
#    ---------------------------------  ------------  ------------  ------------
#    Self CPU time total: 53.064ms
#

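######################################################################
# Recent PyTorch releases can also export a memory timeline from a profile
# (a minimal sketch: ``export_memory_timeline`` assumes ``torch >= 2.1`` and
# a profile recorded with ``profile_memory=True``, ``record_shapes=True``,
# and ``with_stack=True``; the output file name is illustrative):
#
# .. code-block:: python
#
#    prof.export_memory_timeline("memory_timeline.html")
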
######################################################################
# 5. Using tracing functionality
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# Profiling results can be output as a ``.json`` trace file.
# To trace CUDA or XPU kernels, users can switch between CPU, CUDA, and XPU:

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    device = "cuda"
    activities += [ProfilerActivity.CUDA]
elif torch.xpu.is_available():
    device = "xpu"
    activities += [ProfilerActivity.XPU]
else:
    print(
        "Neither CUDA nor XPU devices are available to demonstrate profiling on acceleration devices"
    )
    import sys

    sys.exit(0)

model = models.resnet18().to(device)
inputs = torch.randn(5, 3, 224, 224).to(device)

with profile(activities=activities) as prof:
    model(inputs)

prof.export_chrome_trace("trace.json")

######################################################################
# You can examine the sequence of profiled operators and CUDA/XPU kernels
# in Chrome trace viewer (``chrome://tracing``):
#
# .. image:: ../../_static/img/trace_img.png
#    :scale: 25 %

######################################################################
# 6. Examining stack traces
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# Profiler can be used to analyze Python and TorchScript stack traces:

sort_by_keyword = "self_" + device + "_time_total"

with profile(
    activities=activities,
    with_stack=True,
    experimental_config=torch._C._profiler._ExperimentalConfig(verbose=True),
) as prof:
    model(inputs)

# Print aggregated stats
print(prof.key_averages(group_by_stack_n=5).table(sort_by=sort_by_keyword, row_limit=2))

#################################################################################
# The output might look like this (omitting some columns):
#
# .. code-block:: sh
#
#    -------------------------  -----------------------------------------------------------
#    Name                       Source Location
#    -------------------------  -----------------------------------------------------------
#    aten::thnn_conv2d_forward  .../torch/nn/modules/conv.py(439): _conv_forward
#                               .../torch/nn/modules/conv.py(443): forward
#                               .../torch/nn/modules/module.py(1051): _call_impl
#                               .../site-packages/torchvision/models/resnet.py(63): forward
#                               .../torch/nn/modules/module.py(1051): _call_impl
#    aten::thnn_conv2d_forward  .../torch/nn/modules/conv.py(439): _conv_forward
#                               .../torch/nn/modules/conv.py(443): forward
#                               .../torch/nn/modules/module.py(1051): _call_impl
#                               .../site-packages/torchvision/models/resnet.py(59): forward
#                               .../torch/nn/modules/module.py(1051): _call_impl
#    -------------------------  -----------------------------------------------------------
#    Self CPU time total: 34.016ms
#    Self CUDA time total: 11.659ms
#

######################################################################
# Note the two convolutions and the two call sites in the ``torchvision/models/resnet.py`` script.
#
# (Warning: stack tracing adds an extra profiling overhead.)

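######################################################################
# Stack traces can also be exported for flame graph visualization (a minimal
# sketch: ``prof.export_stacks`` writes one line per stack, which external
# tools such as Brendan Gregg's ``flamegraph.pl`` can render; the file paths
# are illustrative):
#
# .. code-block:: python
#
#    prof.export_stacks("/tmp/profiler_stacks.txt", "self_cpu_time_total")
#
# .. code-block:: sh
#
#    flamegraph.pl --title "CPU time" --countname "us." /tmp/profiler_stacks.txt > perf_viz.svg
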
######################################################################
# 7. Using profiler to analyze long-running jobs
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# PyTorch profiler offers an additional API to handle long-running jobs
# (such as training loops). Tracing all of the execution can be
# slow and result in very large trace files. To avoid this, use the optional
# arguments:
#
# - ``schedule`` - specifies a function that takes an integer argument (step number)
#   as an input and returns an action for the profiler; the best way to use this
#   parameter is the ``torch.profiler.schedule`` helper function, which can generate
#   a schedule for you;
# - ``on_trace_ready`` - specifies a function that takes a reference to the profiler as
#   an input and is called by the profiler each time a new trace is ready.
#
# To illustrate how the API works, let's first consider the following example with
# the ``torch.profiler.schedule`` helper function:

from torch.profiler import schedule

my_schedule = schedule(skip_first=10, wait=5, warmup=1, active=3, repeat=2)

######################################################################
# Profiler assumes that the long-running job is composed of steps, numbered
# starting from zero. The example above defines the following sequence of actions
# for the profiler:
#
# 1. Parameter ``skip_first`` tells profiler that it should ignore the first 10 steps
#    (default value of ``skip_first`` is zero);
# 2. After the first ``skip_first`` steps, profiler starts executing profiler cycles;
# 3. Each cycle consists of three phases:
#
#    - idling (``wait=5`` steps), during this phase profiler is not active;
#    - warming up (``warmup=1`` steps), during this phase profiler starts tracing, but
#      the results are discarded; this phase is used to discard the samples obtained by
#      the profiler at the beginning of the trace since they are usually skewed by an extra
#      overhead;
#    - active tracing (``active=3`` steps), during this phase profiler traces and records data;
# 4. An optional ``repeat`` parameter specifies an upper bound on the number of cycles.
#    By default (zero value), profiler will execute cycles as long as the job runs.

######################################################################
# Thus, in the example above, profiler will skip the first 15 steps, spend the next step warming up,
# actively record the next 3 steps, skip another 5 steps, spend the next step warming up, and actively
# record another 3 steps. Since the ``repeat=2`` parameter value is specified, the profiler will stop
# the recording after the first two cycles.
#
# At the end of each cycle profiler calls the specified ``on_trace_ready`` function and passes itself as
# an argument. This function is used to process the new trace - either by obtaining the table output or
# by saving the output on disk as a trace file.
#
# To send a signal to the profiler that the next step has started, call the ``prof.step()`` function.
# The current profiler step is stored in ``prof.step_num``.
#
# The following example shows how to use all of the concepts above for CUDA and XPU kernels:

sort_by_keyword = "self_" + device + "_time_total"


def trace_handler(p):
    output = p.key_averages().table(sort_by=sort_by_keyword, row_limit=10)
    print(output)
    p.export_chrome_trace("/tmp/trace_" + str(p.step_num) + ".json")


with profile(
    activities=activities,
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=2),
    on_trace_ready=trace_handler,
) as p:
    for idx in range(8):
        model(inputs)
        p.step()

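######################################################################
# Instead of a custom ``trace_handler``, PyTorch also provides a built-in
# handler that writes traces in a layout TensorBoard can load (a minimal
# sketch; the log directory name is illustrative):
#
# .. code-block:: python
#
#    on_trace_ready=torch.profiler.tensorboard_trace_handler("./log/resnet18")
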
######################################################################
# Learn More
# ----------
#
# Take a look at the following recipes/tutorials to continue your learning:
#
# - `PyTorch Benchmark <https://pytorch.org/tutorials/recipes/recipes/benchmark.html>`_
# - `Visualizing models, data, and training with TensorBoard <https://pytorch.org/tutorials/intermediate/tensorboard_tutorial.html>`_ tutorial