Path: blob/main/recipes_source/recipes/profiler_recipe.py
1694 views
"""1PyTorch Profiler2====================================3**Author:** `Shivam Raikundalia <https://github.com/sraikund16>`_4"""56######################################################################7# This recipe explains how to use PyTorch profiler and measure the time and8# memory consumption of the model's operators.9#10# Introduction11# ------------12# PyTorch includes a simple profiler API that is useful when the user needs13# to determine the most expensive operators in the model.14#15# In this recipe, we will use a simple Resnet model to demonstrate how to16# use the profiler to analyze model performance.17#18# Prerequisites19# ---------------20# - ``torch >= 2.3.0``21#22# Setup23# -----24# To install ``torch`` and ``torchvision`` use the following command:25#26# .. code-block:: sh27#28# pip install torch torchvision29#3031######################################################################32# Steps33# -----34#35# 1. Import all necessary libraries36# 2. Instantiate a simple Resnet model37# 3. Using profiler to analyze execution time38# 4. Using profiler to analyze memory consumption39# 5. Using tracing functionality40# 6. Examining stack traces41# 7. Using profiler to analyze long-running jobs42#43# 1. Import all necessary libraries44# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~45#46# In this recipe we will use ``torch``, ``torchvision.models``47# and ``profiler`` modules:48#4950import torch51import torchvision.models as models52from torch.profiler import profile, ProfilerActivity, record_function535455######################################################################56# 2. Instantiate a simple Resnet model57# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~58#59# Let's create an instance of a Resnet model and prepare an input60# for it:61#6263model = models.resnet18()64inputs = torch.randn(5, 3, 224, 224)6566######################################################################67# 3. Using profiler to analyze execution time68# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~69#70# PyTorch profiler is enabled through the context manager and accepts71# a number of parameters, some of the most useful are:72#73# - ``activities`` - a list of activities to profile:74# - ``ProfilerActivity.CPU`` - PyTorch operators, TorchScript functions and75# user-defined code labels (see ``record_function`` below);76# - ``ProfilerActivity.CUDA`` - on-device CUDA kernels;77# - ``ProfilerActivity.XPU`` - on-device XPU kernels;78# - ``record_shapes`` - whether to record shapes of the operator inputs;79# - ``profile_memory`` - whether to report amount of memory consumed by80# model's Tensors;81#82# Note: when using CUDA, profiler also shows the runtime CUDA events83# occurring on the host.8485######################################################################86# Let's see how we can use profiler to analyze the execution time:8788with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:89with record_function("model_inference"):90model(inputs)9192######################################################################93# Note that we can use ``record_function`` context manager to label94# arbitrary code ranges with user provided names95# (``model_inference`` is used as a label in the example above).96#97# Profiler allows one to check which operators were called during the98# execution of a code range wrapped with a profiler context manager.99# If multiple profiler ranges are active at the same time (e.g. in100# parallel PyTorch threads), each profiling context manager tracks only101# the operators of its corresponding range.102# Profiler also automatically profiles the asynchronous tasks launched103# with ``torch.jit._fork`` and (in case of a backward pass)104# the backward pass operators launched with ``backward()`` call.105#106# Let's print out the stats for the execution above:107108print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))109110######################################################################111# The output will look like (omitting some columns):112#113# .. code-block:: sh114#115# --------------------------------- ------------ ------------ ------------ ------------116# Name Self CPU CPU total CPU time avg # of Calls117# --------------------------------- ------------ ------------ ------------ ------------118# model_inference 5.509ms 57.503ms 57.503ms 1119# aten::conv2d 231.000us 31.931ms 1.597ms 20120# aten::convolution 250.000us 31.700ms 1.585ms 20121# aten::_convolution 336.000us 31.450ms 1.573ms 20122# aten::mkldnn_convolution 30.838ms 31.114ms 1.556ms 20123# aten::batch_norm 211.000us 14.693ms 734.650us 20124# aten::_batch_norm_impl_index 319.000us 14.482ms 724.100us 20125# aten::native_batch_norm 9.229ms 14.109ms 705.450us 20126# aten::mean 332.000us 2.631ms 125.286us 21127# aten::select 1.668ms 2.292ms 8.988us 255128# --------------------------------- ------------ ------------ ------------ ------------129# Self CPU time total: 57.549m130#131132######################################################################133# Here we see that, as expected, most of the time is spent in convolution (and specifically in ``mkldnn_convolution``134# for PyTorch compiled with ``MKL-DNN`` support).135# Note the difference between self cpu time and cpu time - operators can call other operators, self cpu time excludes time136# spent in children operator calls, while total cpu time includes it. You can choose to sort by the self cpu time by passing137# ``sort_by="self_cpu_time_total"`` into the ``table`` call.138#139# To get a finer granularity of results and include operator input shapes, pass ``group_by_input_shape=True``140# (note: this requires running the profiler with ``record_shapes=True``):141142print(143prof.key_averages(group_by_input_shape=True).table(144sort_by="cpu_time_total", row_limit=10145)146)147148########################################################################################149# The output might look like this (omitting some columns):150#151# .. code-block:: sh152#153# --------------------------------- ------------ -------------------------------------------154# Name CPU total Input Shapes155# --------------------------------- ------------ -------------------------------------------156# model_inference 57.503ms []157# aten::conv2d 8.008ms [5,64,56,56], [64,64,3,3], [], ..., []]158# aten::convolution 7.956ms [[5,64,56,56], [64,64,3,3], [], ..., []]159# aten::_convolution 7.909ms [[5,64,56,56], [64,64,3,3], [], ..., []]160# aten::mkldnn_convolution 7.834ms [[5,64,56,56], [64,64,3,3], [], ..., []]161# aten::conv2d 6.332ms [[5,512,7,7], [512,512,3,3], [], ..., []]162# aten::convolution 6.303ms [[5,512,7,7], [512,512,3,3], [], ..., []]163# aten::_convolution 6.273ms [[5,512,7,7], [512,512,3,3], [], ..., []]164# aten::mkldnn_convolution 6.233ms [[5,512,7,7], [512,512,3,3], [], ..., []]165# aten::conv2d 4.751ms [[5,256,14,14], [256,256,3,3], [], ..., []]166# --------------------------------- ------------ -------------------------------------------167# Self CPU time total: 57.549ms168#169170######################################################################171# Note the occurrence of ``aten::convolution`` twice with different input shapes.172173######################################################################174# Profiler can also be used to analyze performance of models executed on GPUs:175# Users could switch between cpu, cuda and xpu176activities = [ProfilerActivity.CPU]177if torch.cuda.is_available():178device = "cuda"179activities += [ProfilerActivity.CUDA]180elif torch.xpu.is_available():181device = "xpu"182activities += [ProfilerActivity.XPU]183else:184print(185"Neither CUDA nor XPU devices are available to demonstrate profiling on acceleration devices"186)187import sys188189sys.exit(0)190191sort_by_keyword = device + "_time_total"192193model = models.resnet18().to(device)194inputs = torch.randn(5, 3, 224, 224).to(device)195196with profile(activities=activities, record_shapes=True) as prof:197with record_function("model_inference"):198model(inputs)199200print(prof.key_averages().table(sort_by=sort_by_keyword, row_limit=10))201202######################################################################203# (Note: the first use of CUDA profiling may bring an extra overhead.)204205######################################################################206# The resulting table output (omitting some columns):207#208# .. code-block:: sh209#210# ------------------------------------------------------- ------------ ------------211# Name Self CUDA CUDA total212# ------------------------------------------------------- ------------ ------------213# model_inference 0.000us 11.666ms214# aten::conv2d 0.000us 10.484ms215# aten::convolution 0.000us 10.484ms216# aten::_convolution 0.000us 10.484ms217# aten::_convolution_nogroup 0.000us 10.484ms218# aten::thnn_conv2d 0.000us 10.484ms219# aten::thnn_conv2d_forward 10.484ms 10.484ms220# void at::native::im2col_kernel<float>(long, float co... 3.844ms 3.844ms221# sgemm_32x32x32_NN 3.206ms 3.206ms222# sgemm_32x32x32_NN_vec 3.093ms 3.093ms223# ------------------------------------------------------- ------------ ------------224# Self CPU time total: 23.015ms225# Self CUDA time total: 11.666ms226#227228######################################################################229# (Note: the first use of XPU profiling may bring an extra overhead.)230231######################################################################232# The resulting table output (omitting some columns):233#234# .. code-block:: sh235#236# ------------------------------ ------------ ------------ ------------ ------------ ------------237# Name Self XPU Self XPU % XPU total XPU time avg # of Calls238# ------------------------------ ------------ ------------ ------------ ------------ ------------239# model_inference 0.000us 0.00% 2.567ms 2.567ms 1240# aten::conv2d 0.000us 0.00% 1.871ms 93.560us 20241# aten::convolution 0.000us 0.00% 1.871ms 93.560us 20242# aten::_convolution 0.000us 0.00% 1.871ms 93.560us 20243# aten::convolution_overrideable 1.871ms 72.89% 1.871ms 93.560us 20244# gen_conv 1.484ms 57.82% 1.484ms 74.216us 20245# aten::batch_norm 0.000us 0.00% 432.640us 21.632us 20246# aten::_batch_norm_impl_index 0.000us 0.00% 432.640us 21.632us 20247# aten::native_batch_norm 432.640us 16.85% 432.640us 21.632us 20248# conv_reorder 386.880us 15.07% 386.880us 6.448us 60249# ------------------------------ ------------ ------------ ------------ ------------ ------------250# Self CPU time total: 712.486ms251# Self XPU time total: 2.567ms252#253254######################################################################255# Note the occurrence of on-device kernels in the output (e.g. ``sgemm_32x32x32_NN`` for CUDA or ``gen_conv`` for XPU).256257######################################################################258# 4. Using profiler to analyze memory consumption259# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~260#261# PyTorch profiler can also show the amount of memory (used by the model's tensors)262# that was allocated (or released) during the execution of the model's operators.263# In the output below, 'self' memory corresponds to the memory allocated (released)264# by the operator, excluding the children calls to the other operators.265# To enable memory profiling functionality pass ``profile_memory=True``.266267model = models.resnet18()268inputs = torch.randn(5, 3, 224, 224)269270with profile(271activities=[ProfilerActivity.CPU], profile_memory=True, record_shapes=True272) as prof:273model(inputs)274275print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))276277# (omitting some columns)278# --------------------------------- ------------ ------------ ------------279# Name CPU Mem Self CPU Mem # of Calls280# --------------------------------- ------------ ------------ ------------281# aten::empty 94.79 Mb 94.79 Mb 121282# aten::max_pool2d_with_indices 11.48 Mb 11.48 Mb 1283# aten::addmm 19.53 Kb 19.53 Kb 1284# aten::empty_strided 572 b 572 b 25285# aten::resize_ 240 b 240 b 6286# aten::abs 480 b 240 b 4287# aten::add 160 b 160 b 20288# aten::masked_select 120 b 112 b 1289# aten::ne 122 b 53 b 6290# aten::eq 60 b 30 b 2291# --------------------------------- ------------ ------------ ------------292# Self CPU time total: 53.064ms293294print(prof.key_averages().table(sort_by="cpu_memory_usage", row_limit=10))295296#############################################################################297# The output might look like this (omitting some columns):298#299# .. code-block:: sh300#301# --------------------------------- ------------ ------------ ------------302# Name CPU Mem Self CPU Mem # of Calls303# --------------------------------- ------------ ------------ ------------304# aten::empty 94.79 Mb 94.79 Mb 121305# aten::batch_norm 47.41 Mb 0 b 20306# aten::_batch_norm_impl_index 47.41 Mb 0 b 20307# aten::native_batch_norm 47.41 Mb 0 b 20308# aten::conv2d 47.37 Mb 0 b 20309# aten::convolution 47.37 Mb 0 b 20310# aten::_convolution 47.37 Mb 0 b 20311# aten::mkldnn_convolution 47.37 Mb 0 b 20312# aten::max_pool2d 11.48 Mb 0 b 1313# aten::max_pool2d_with_indices 11.48 Mb 11.48 Mb 1314# --------------------------------- ------------ ------------ ------------315# Self CPU time total: 53.064ms316#317318######################################################################319# 5. Using tracing functionality320# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~321#322# Profiling results can be outputted as a ``.json`` trace file:323# Tracing CUDA or XPU kernels324# Users could switch between cpu, cuda and xpu325activities = [ProfilerActivity.CPU]326if torch.cuda.is_available():327device = "cuda"328activities += [ProfilerActivity.CUDA]329elif torch.xpu.is_available():330device = "xpu"331activities += [ProfilerActivity.XPU]332else:333print(334"Neither CUDA nor XPU devices are available to demonstrate profiling on acceleration devices"335)336import sys337338sys.exit(0)339340model = models.resnet18().to(device)341inputs = torch.randn(5, 3, 224, 224).to(device)342343with profile(activities=activities) as prof:344model(inputs)345346prof.export_chrome_trace("trace.json")347348######################################################################349# You can examine the sequence of profiled operators and CUDA/XPU kernels350# in Chrome trace viewer (``chrome://tracing``):351#352# .. image:: ../../_static/img/trace_img.png353# :scale: 25 %354355######################################################################356# 6. Examining stack traces357# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~358#359# Profiler can be used to analyze Python and TorchScript stack traces:360sort_by_keyword = "self_" + device + "_time_total"361362with profile(363activities=activities,364with_stack=True,365experimental_config=torch._C._profiler._ExperimentalConfig(verbose=True),366) as prof:367model(inputs)368369# Print aggregated stats370print(prof.key_averages(group_by_stack_n=5).table(sort_by=sort_by_keyword, row_limit=2))371372#################################################################################373# The output might look like this (omitting some columns):374#375# .. code-block:: sh376#377# ------------------------- -----------------------------------------------------------378# Name Source Location379# ------------------------- -----------------------------------------------------------380# aten::thnn_conv2d_forward .../torch/nn/modules/conv.py(439): _conv_forward381# .../torch/nn/modules/conv.py(443): forward382# .../torch/nn/modules/module.py(1051): _call_impl383# .../site-packages/torchvision/models/resnet.py(63): forward384# .../torch/nn/modules/module.py(1051): _call_impl385# aten::thnn_conv2d_forward .../torch/nn/modules/conv.py(439): _conv_forward386# .../torch/nn/modules/conv.py(443): forward387# .../torch/nn/modules/module.py(1051): _call_impl388# .../site-packages/torchvision/models/resnet.py(59): forward389# .../torch/nn/modules/module.py(1051): _call_impl390# ------------------------- -----------------------------------------------------------391# Self CPU time total: 34.016ms392# Self CUDA time total: 11.659ms393#394395######################################################################396# Note the two convolutions and the two call sites in ``torchvision/models/resnet.py`` script.397#398# (Warning: stack tracing adds an extra profiling overhead.)399400######################################################################401# 7. Using profiler to analyze long-running jobs402# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~403#404# PyTorch profiler offers an additional API to handle long-running jobs405# (such as training loops). Tracing all of the execution can be406# slow and result in very large trace files. To avoid this, use optional407# arguments:408#409# - ``schedule`` - specifies a function that takes an integer argument (step number)410# as an input and returns an action for the profiler, the best way to use this parameter411# is to use ``torch.profiler.schedule`` helper function that can generate a schedule for you;412# - ``on_trace_ready`` - specifies a function that takes a reference to the profiler as413# an input and is called by the profiler each time the new trace is ready.414#415# To illustrate how the API works, let's first consider the following example with416# ``torch.profiler.schedule`` helper function:417418from torch.profiler import schedule419420my_schedule = schedule(skip_first=10, wait=5, warmup=1, active=3, repeat=2)421422######################################################################423# Profiler assumes that the long-running job is composed of steps, numbered424# starting from zero. The example above defines the following sequence of actions425# for the profiler:426#427# 1. Parameter ``skip_first`` tells profiler that it should ignore the first 10 steps428# (default value of ``skip_first`` is zero);429# 2. After the first ``skip_first`` steps, profiler starts executing profiler cycles;430# 3. Each cycle consists of three phases:431#432# - idling (``wait=5`` steps), during this phase profiler is not active;433# - warming up (``warmup=1`` steps), during this phase profiler starts tracing, but434# the results are discarded; this phase is used to discard the samples obtained by435# the profiler at the beginning of the trace since they are usually skewed by an extra436# overhead;437# - active tracing (``active=3`` steps), during this phase profiler traces and records data;438# 4. An optional ``repeat`` parameter specifies an upper bound on the number of cycles.439# By default (zero value), profiler will execute cycles as long as the job runs.440441######################################################################442# Thus, in the example above, profiler will skip the first 15 steps, spend the next step on the warm up,443# actively record the next 3 steps, skip another 5 steps, spend the next step on the warm up, actively444# record another 3 steps. Since the ``repeat=2`` parameter value is specified, the profiler will stop445# the recording after the first two cycles.446#447# At the end of each cycle profiler calls the specified ``on_trace_ready`` function and passes itself as448# an argument. This function is used to process the new trace - either by obtaining the table output or449# by saving the output on disk as a trace file.450#451# To send the signal to the profiler that the next step has started, call ``prof.step()`` function.452# The current profiler step is stored in ``prof.step_num``.453#454# The following example shows how to use all of the concepts above for CUDA and XPU Kernels:455456sort_by_keyword = "self_" + device + "_time_total"457458459def trace_handler(p):460output = p.key_averages().table(sort_by=sort_by_keyword, row_limit=10)461print(output)462p.export_chrome_trace("/tmp/trace_" + str(p.step_num) + ".json")463464465with profile(466activities=activities,467schedule=torch.profiler.schedule(wait=1, warmup=1, active=2),468on_trace_ready=trace_handler,469) as p:470for idx in range(8):471model(inputs)472p.step()473474######################################################################475# Learn More476# ----------477#478# Take a look at the following recipes/tutorials to continue your learning:479#480# - `PyTorch Benchmark <https://pytorch.org/tutorials/recipes/recipes/benchmark.html>`_481# - `Visualizing models, data, and training with TensorBoard <https://pytorch.org/tutorials/intermediate/tensorboard_tutorial.html>`_ tutorial482#483484485