perf-arm-spe(1)
================

NAME
----
perf-arm-spe - Support for Arm Statistical Profiling Extension within Perf tools

SYNOPSIS
--------
[verse]
'perf record' -e arm_spe//

DESCRIPTION
-----------

The SPE (Statistical Profiling Extension) feature provides accurate attribution of latencies and
events down to individual instructions. Rather than being interrupt-driven, it picks an
instruction to sample and then captures data for it during execution. Data includes execution time
in cycles. For loads and stores it also includes data address, cache miss events, and data origin.

The sampling has 5 stages:

1. Choose an operation
2. Collect data about the operation
3. Optionally discard the record based on a filter
4. Write the record to memory
5. Interrupt when the buffer is full

Choose an operation
~~~~~~~~~~~~~~~~~~~

The operation is chosen from a sample population; for SPE this is an IMPLEMENTATION DEFINED choice
of all architectural instructions or all micro-ops. Sampling happens at a programmable interval. The
architecture provides a mechanism for the SPE driver to infer the minimum interval at which it should
sample. This minimum interval is used by the driver if no interval is specified. A pseudo-random
perturbation is also added to the sampling interval by default.

Collect data about the operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Program counter, PMU events, timings and data addresses related to the operation are recorded.
Sampling ensures there is only one sampled operation in flight.

Optionally discard the record based on a filter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Based on programmable criteria, choose whether to keep the record or discard it. If the record is
discarded then the flow stops here for this sample.

Write the record to memory
~~~~~~~~~~~~~~~~~~~~~~~~~~

The record is appended to a memory buffer.

Interrupt when the buffer is full
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the buffer fills, an interrupt is sent and the driver signals Perf to collect the records.
Perf saves the raw data in the perf.data file.

Opening the file
----------------

Up until this point no decoding of the SPE data was done by either the kernel or Perf. Only when the
recorded file is opened with 'perf report' or 'perf script' does the decoding happen. When decoding
the data, Perf generates "synthetic samples" as if these were generated at the time of the
recording. These samples are the same as if normal sampling was done by Perf without using SPE,
although they may have more attributes associated with them. For example a normal sample may have
just the instruction pointer, but an SPE sample can have data addresses and latency attributes.

Why Sampling?
-------------

- Sampling, rather than tracing, cuts down the profiling problem to something more manageable for
  hardware. Only one sampled operation is in flight at a time.

- Allows precise attribution data, including: Full PC of instruction, data virtual and physical
  addresses.

- Allows correlation between an instruction and events, such as TLB and cache miss. (Data source
  indicates which particular cache was hit, but the meaning is implementation defined because
  different implementations can have different cache configurations.)

However, SPE does not provide any call-graph information, and relies on statistical methods.

Collisions
----------

When an operation is sampled while a previous sampled operation has not finished, a collision
occurs. The new sample is dropped. Collisions affect the integrity of the data, so the sample rate
should be set to avoid collisions.

The 'sample_collision' PMU event can be used to determine the number of lost samples. However, this
count is based on collisions _before_ filtering occurs, so it cannot be used as an exact number of
dropped samples that would have made it through the filter, but it can serve as a rough guide.

The effect of microarchitectural sampling
-----------------------------------------

If an implementation samples micro-operations instead of instructions, the results of sampling must
be weighted accordingly.

For example, if a given instruction A is always converted into two micro-operations, A0 and A1, it
becomes twice as likely to appear in the sample population.

The coarse effect of conversions, and, if applicable, sampling of speculative operations, can be
estimated from the 'sample_pop' and 'inst_retired' PMU events.

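As a rough illustration of that weighting, the sketch below divides each instruction's raw sample
count by the number of micro-operations it is assumed to expand to. The helper name, the input
format and the counts are all hypothetical; real expansion factors are implementation specific and
are not reported by Perf.

```shell
# Hypothetical helper: scale raw SPE sample counts by micro-op expansion.
# Input lines:  <name> <raw_samples> <uops_per_instruction>
# Output lines: <name> <weighted_samples>
spe_weight_samples() {
    awk '{ printf "%s %.1f\n", $1, $2 / $3 }'
}

# Made-up counts: A expands to two micro-ops, B to one.
printf 'A 200 2\nB 100 1\n' | spe_weight_samples
```

With these inputs both instructions end up with the same weighted count, even though A appeared
twice as often in the raw samples.
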
Kernel Requirements
-------------------

The ARM_SPE_PMU config must be set to build as either a module or statically.

Depending on CPU model, the kernel may need to be booted with page table isolation disabled
(kpti=off). If KPTI needs to be disabled, probing the SPE driver will fail with the console message
"profiling buffer inaccessible. Try passing 'kpti=off' on the kernel command line".

For the full criteria that determine whether KPTI needs to be forced off or not, see function
unmap_kernel_at_el0() in the kernel sources. Common cases where it's not required
are on the CPUs in kpti_safe_list, or on Arm v8.5+ where FEAT_E0PD is mandatory.

The SPE interrupt must also be described by the firmware. If the module is loaded and KPTI is
disabled (or isn't required to be disabled) but the SPE PMU still doesn't show in
/sys/bus/event_source/devices/, then it's possible that the SPE interrupt isn't described by
ACPI or DT. In this case no warning will be printed by the driver.

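A quick way to check whether the PMU has been registered is to look in that sysfs directory. The
sketch below is a minimal example; the helper name is hypothetical, and it assumes that SPE
instances appear with names beginning with arm_spe (for example arm_spe_0).

```shell
# List SPE PMU instances registered by the driver, one per line.
# The optional argument overrides the sysfs directory (useful for testing).
spe_pmus() {
    dir=${1:-/sys/bus/event_source/devices}
    for d in "$dir"/arm_spe*; do
        # If the glob did not match, the pattern is left as-is and -e fails.
        [ -e "$d" ] && basename "$d"
    done
    return 0
}

if [ -n "$(spe_pmus)" ]; then
    echo "SPE PMU found: $(spe_pmus | tr '\n' ' ')"
else
    echo "No SPE PMU found; check ARM_SPE_PMU, kpti and the firmware description"
fi
```
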
Capturing SPE with perf command-line tools
------------------------------------------

You can record a session with SPE samples:

  perf record -e arm_spe// -- ./mybench

The sample period is set from the -c option, and because the minimum interval is used by default
it's recommended to set this to a higher value. The value is written to PMSIRR.INTERVAL.

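To get a feel for what a given -c value means, the sketch below estimates how many operations would
be selected for sampling during a run. It is only back-of-the-envelope arithmetic: it assumes one
selection per interval and ignores jitter, filtering and collisions. The helper name, the operation
count and the interval of 4096 are made up for illustration.

```shell
# Hypothetical helper: rough number of operations selected for sampling.
# $1 = operations in the sample population, $2 = sampling interval (-c)
spe_expected_samples() {
    echo $(( $1 / $2 ))
}

# A run of about 1e9 operations sampled every 4096 operations.
spe_expected_samples 1000000000 4096

# The matching record command would be:
#   perf record -e arm_spe// -c 4096 -- ./mybench
```
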
Config parameters
~~~~~~~~~~~~~~~~~

These are placed between the // in the event and comma separated. For example '-e
arm_spe/load_filter=1,min_latency=10/'

  branch_filter=1     - collect branches only (PMSFCR.B)
  event_filter=<mask> - filter on specific events (PMSEVFR) - see bitfield description below
  jitter=1            - use jitter to avoid resonance when sampling (PMSIRR.RND)
  load_filter=1       - collect loads only (PMSFCR.LD)
  min_latency=<n>     - collect only samples with this latency or higher* (PMSLATFR)
  pa_enable=1         - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege
  pct_enable=1        - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
  store_filter=1      - collect stores only (PMSFCR.ST)
  ts_enable=1         - enable timestamping with value of generic timer (PMSCR.TS)
  discard=1           - enable SPE PMU events but don't collect sample data - see 'Discard mode' (PMBLIMITR.FM = DISCARD)

+++*+++ Latency is the total latency from the point at which sampling started on that instruction, rather
than only the execution latency.

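When scripting several runs with different parameters, it can help to build the event string from
its terms. The helper below is a hypothetical convenience, not part of Perf; it only joins the
terms with commas and does not validate them.

```shell
# Hypothetical helper: join key=value config terms into an SPE event string.
spe_event() {
    terms=""
    for term in "$@"; do
        # Append each term, comma separated.
        terms="${terms:+$terms,}$term"
    done
    printf 'arm_spe/%s/\n' "$terms"
}

spe_event load_filter=1 min_latency=10

# Possible use:
#   perf record -e "$(spe_event store_filter=1 ts_enable=1)" -- ./mybench
```

Called without arguments, the helper prints the plain arm_spe// event.
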
Only some events can be filtered on; these include:

  bit 1     - instruction retired (i.e. omit speculative instructions)
  bit 3     - L1D refill
  bit 5     - TLB refill
  bit 7     - mispredict
  bit 11    - misaligned access

So to sample just retired instructions:

  perf record -e arm_spe/event_filter=2/ -- ./mybench

or just mispredicted branches:

  perf record -e arm_spe/event_filter=0x80/ -- ./mybench

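The mask is simply the bitwise OR of the selected bit positions, so it can be computed rather than
worked out by hand. The helper below is a hypothetical sketch. When several bits are combined, it
is assumed here that a record must have every selected event set to pass the filter; check the
PMSEVFR description for your CPU before relying on that.

```shell
# Hypothetical helper: build an event_filter mask from bit numbers.
spe_event_filter() {
    mask=0
    for bit in "$@"; do
        mask=$(( mask | (1 << bit) ))
    done
    printf '0x%x\n' "$mask"
}

spe_event_filter 1      # instruction retired
spe_event_filter 7      # mispredict
spe_event_filter 1 7    # both bits set

# Possible use:
#   perf record -e arm_spe/event_filter=$(spe_event_filter 1 7)/ -- ./mybench
```
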
Viewing the data
~~~~~~~~~~~~~~~~~

By default perf report and perf script will assign samples to separate groups depending on the
attributes/events of the SPE record. Because instructions can have multiple events associated with
them, the samples in these groups are not necessarily unique. For example perf report shows these
groups:

  Available samples
  0 arm_spe//
  0 dummy:u
  21 l1d-miss
  897 l1d-access
  5 llc-miss
  7 llc-access
  2 tlb-miss
  1K tlb-access
  36 branch
  0 remote-access
  900 memory

The arm_spe// and dummy:u events are implementation details and are expected to be empty.

To get a full list of unique samples that are not sorted into groups, set the itrace option to
generate 'instruction' samples. The period option is also taken into account, so set it to 1
instruction unless you want to further downsample the already sampled SPE data:

  perf report --itrace=i1i

Memory access details are also stored on the samples and this can be viewed with:

  perf report --mem-mode

Common errors
~~~~~~~~~~~~~

 - "Cannot find PMU `arm_spe'. Missing kernel support?"

   Module not built or loaded, KPTI not disabled, interrupt not described by firmware,
   or running on a VM. See 'Kernel Requirements' above.

 - "Arm SPE CONTEXT packets not found in the traces."

   Root privilege is required to collect context packets. But these only increase the accuracy of
   assigning PIDs to kernel samples. For userspace sampling this can be ignored.

 - Excessively large perf.data file size

   Increase sampling interval (see above)

PMU events
~~~~~~~~~~

SPE has events that can be counted on core PMUs. These are prefixed with
SAMPLE_, for example SAMPLE_POP, SAMPLE_FEED, SAMPLE_COLLISION and
SAMPLE_FEED_BR.

These events will only count when an SPE event is running on the same core that
the PMU event is opened on, otherwise they read as 0. There are various ways to
ensure that the PMU event and SPE event are scheduled together depending on the
way the event is opened. For example, both events can be opened as per-process
events on the same process, although it's not guaranteed that the PMU event is
enabled first when context switching. For that reason it may be better to open
the PMU event as a systemwide event and then open SPE on the process of interest.

Discard mode
~~~~~~~~~~~~

SPE related (SAMPLE_* etc) core PMU events can be used without the overhead of
collecting sample data if discard mode is supported (optional from Armv8.6).
First run a system wide SPE session (or on the core of interest) using options
to minimize output. Then run perf stat:

  perf record -e arm_spe/discard/ -a -N -B --no-bpf-event -o - > /dev/null &
  perf stat -e SAMPLE_FEED_LD

SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-inject[1]