GitHub Repository: torvalds/linux
Path: blob/master/tools/perf/Documentation/perf-arm-spe.txt
perf-arm-spe(1)
================

NAME
----
perf-arm-spe - Support for Arm Statistical Profiling Extension within Perf tools

SYNOPSIS
--------
[verse]
'perf record' -e arm_spe//

DESCRIPTION
-----------

The SPE (Statistical Profiling Extension) feature provides accurate attribution of latencies and
events down to individual instructions. Rather than being interrupt-driven, it picks an
instruction to sample and then captures data for it during execution. Data includes execution time
in cycles. For loads and stores it also includes data address, cache miss events, and data origin.

The sampling has 5 stages:

  1. Choose an operation
  2. Collect data about the operation
  3. Optionally discard the record based on a filter
  4. Write the record to memory
  5. Interrupt when the buffer is full

Choose an operation
~~~~~~~~~~~~~~~~~~~

The operation is chosen from a sample population. For SPE this is an IMPLEMENTATION DEFINED choice
of all architectural instructions or all micro-ops. Sampling happens at a programmable interval. The
architecture provides a mechanism for the SPE driver to infer the minimum interval at which it should
sample. This minimum interval is used by the driver if no interval is specified. A pseudo-random
perturbation is also added to the sampling interval by default.

Collect data about the operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Program counter, PMU events, timings and data addresses related to the operation are recorded.
Sampling ensures that only one sampled operation is in flight.

Optionally discard the record based on a filter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Based on programmable criteria, choose whether to keep the record or discard it. If the record is
discarded then the flow stops here for this sample.

Write the record to memory
~~~~~~~~~~~~~~~~~~~~~~~~~~

The record is appended to a memory buffer.

Interrupt when the buffer is full
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the buffer fills, an interrupt is sent and the driver signals Perf to collect the records.
Perf saves the raw data in the perf.data file.

Opening the file
----------------

Up until this point no decoding of the SPE data was done by either the kernel or Perf. Only when the
recorded file is opened with 'perf report' or 'perf script' does the decoding happen. When decoding
the data, Perf generates "synthetic samples" as if these were generated at the time of the
recording. These samples are the same as if normal sampling was done by Perf without using SPE,
although they may have more attributes associated with them. For example a normal sample may have
just the instruction pointer, but an SPE sample can have data addresses and latency attributes.

Why Sampling?
-------------

 - Sampling, rather than tracing, cuts down the profiling problem to something more manageable for
 hardware. Only one sampled operation is in flight at a time.

 - Allows precise attribution data, including: Full PC of instruction, data virtual and physical
 addresses.

 - Allows correlation between an instruction and events, such as TLB and cache miss. (Data source
 indicates which particular cache was hit, but the meaning is implementation defined because
 different implementations can have different cache configurations.)

However, SPE does not provide any call-graph information, and relies on statistical methods.

Collisions
----------

When an operation is sampled while a previous sampled operation has not finished, a collision
occurs. The new sample is dropped. Collisions affect the integrity of the data, so the sample rate
should be set to avoid collisions.

The 'sample_collision' PMU event can be used to determine the number of lost samples. However, this
count is based on collisions _before_ filtering occurs, so it cannot be used as an exact number of
dropped samples that would have made it through the filter. It can still serve as a rough guide.
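
As a rough illustration of using the collision count as a guide, the sketch below expresses a
collision count as a percentage of a reference sample count. The helper name 'collision_pct', the
choice of reference count (for instance a SAMPLE_FEED reading) and the numbers are assumptions made
for this example, not something defined by Perf or the architecture.

```shell
# collision_pct COLLISIONS REFERENCE (hypothetical helper)
# Print COLLISIONS as a percentage of REFERENCE with two decimal places,
# using integer arithmetic only. Both counts would come from 'perf stat'.
collision_pct() {
    if [ "$2" -eq 0 ]; then
        echo "reference count is zero" >&2
        return 1
    fi
    bp=$(( $1 * 10000 / $2 ))    # basis points (1/100 of a percent)
    printf '%d.%02d%%\n' $(( bp / 100 )) $(( bp % 100 ))
}

# Made-up counts: 150 collisions against 60000 reference samples.
collision_pct 150 60000
```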

The effect of microarchitectural sampling
-----------------------------------------

If an implementation samples micro-operations instead of instructions, the results of sampling must
be weighted accordingly.

For example, if a given instruction A is always converted into two micro-operations, A0 and A1, it
becomes twice as likely to appear in the sample population.

The coarse effect of conversions, and, if applicable, sampling of speculative operations, can be
estimated from the 'sample_pop' and 'inst_retired' PMU events.
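
One way to read the two counters together is as an average number of sampleable operations per
retired instruction. The sketch below computes that ratio from two counts. The helper name
'ops_per_inst' and the counts are made up for illustration, and treating the ratio as a coarse
weighting factor is an interpretation rather than a documented formula.

```shell
# ops_per_inst SAMPLE_POP INST_RETIRED (hypothetical helper)
# Print SAMPLE_POP / INST_RETIRED with two decimal places, using integer
# arithmetic only. A value near 1.00 suggests little expansion into
# micro-operations or speculation; larger values suggest more.
ops_per_inst() {
    if [ "$2" -eq 0 ]; then
        echo "inst_retired count is zero" >&2
        return 1
    fi
    scaled=$(( $1 * 100 / $2 ))
    printf '%d.%02d\n' $(( scaled / 100 )) $(( scaled % 100 ))
}

# Made-up counts: 3000000 operations in the population, 2000000 retired.
ops_per_inst 3000000 2000000
```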

Kernel Requirements
-------------------

The ARM_SPE_PMU config option must be enabled, built either as a module or statically.

Depending on CPU model, the kernel may need to be booted with page table isolation disabled
(kpti=off). If KPTI needs to be disabled but is still enabled, the driver will fail with the
console message "profiling buffer inaccessible. Try passing 'kpti=off' on the kernel command line".

For the full criteria that determine whether KPTI needs to be forced off or not, see function
unmap_kernel_at_el0() in the kernel sources. Common cases where it's not required
are on the CPUs in kpti_safe_list, or on Arm v8.5+ where FEAT_E0PD is mandatory.

The SPE interrupt must also be described by the firmware. If the module is loaded and KPTI is
disabled (or isn't required to be disabled) but the SPE PMU still doesn't show in
/sys/bus/event_source/devices/, then it's possible that the SPE interrupt isn't described by
ACPI or DT. In this case no warning will be printed by the driver.
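
A quick way to check whether the PMU was registered is to look for it in sysfs. The sketch below
lists matching entries. The helper name 'spe_pmu_list' is hypothetical, and the 'arm_spe' name
prefix (for example arm_spe_0) is an assumption about how the device is named on a given system.

```shell
# spe_pmu_list [DIR] (hypothetical helper)
# Print the names of entries under DIR that start with "arm_spe" and
# return success if at least one exists. DIR defaults to the sysfs
# directory mentioned above.
spe_pmu_list() {
    dir=${1:-/sys/bus/event_source/devices}
    found=1
    for dev in "$dir"/arm_spe*; do
        [ -e "$dev" ] || continue
        basename "$dev"
        found=0
    done
    return "$found"
}

spe_pmu_list || echo "no SPE PMU found, check the requirements above"
```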

Capturing SPE with perf command-line tools
------------------------------------------

You can record a session with SPE samples:

  perf record -e arm_spe// -- ./mybench

The sample period is set from the -c option, and because the minimum interval is used by default
it's recommended to set this to a higher value. The value is written to PMSIRR.INTERVAL.
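
For instance, a command line with an explicit period could be assembled as in the sketch below. The
helper name 'spe_record_cmd' and the period of 10000 are placeholders picked for illustration, so
choose a value that suits the workload. The helper only prints the command and does not run it.

```shell
# spe_record_cmd PERIOD COMMAND... (hypothetical helper)
# Print a 'perf record' command line that samples COMMAND with SPE using
# the given sample period.
spe_record_cmd() {
    period=$1
    shift
    printf 'perf record -e arm_spe// -c %s -- %s\n' "$period" "$*"
}

spe_record_cmd 10000 ./mybench
```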

Config parameters
~~~~~~~~~~~~~~~~~

These are placed between the // in the event and comma separated. For example '-e
arm_spe/load_filter=1,min_latency=10/'

  event_filter=<mask> - logical AND filter on specific events (PMSEVFR) - see bitfield description below
  inv_event_filter=<mask> - logical OR to filter out specific events (PMSNEVFR, FEAT_SPEv1p2) - see bitfield description below
  jitter=1            - use jitter to avoid resonance when sampling (PMSIRR.RND)
  min_latency=<n>     - collect only samples with this latency or higher* (PMSLATFR)
  pa_enable=1         - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege
  pct_enable=1        - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
  ts_enable=1         - enable timestamping with value of generic timer (PMSCR.TS)
  discard=1           - enable SPE PMU events but don't collect sample data - see 'Discard mode' (PMBLIMITR.FM = DISCARD)
  inv_data_src_filter=<mask> - mask to filter from 0-63 possible data sources (PMSDSFR, FEAT_SPE_FDS) - See 'Data source filtering'

+++*+++ Latency is the total latency from the point at which sampling started on that instruction, rather
than only the execution latency.
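
The comma-separated form can be assembled from individual settings, as in the sketch below. The
helper name 'spe_event' is hypothetical and the parameter values are arbitrary. It performs no
validation of parameter names.

```shell
# spe_event KEY=VALUE... (hypothetical helper)
# Join the given settings with commas and place them between the slashes
# of the arm_spe event name.
spe_event() {
    cfg=
    for kv in "$@"; do
        cfg=${cfg:+$cfg,}$kv
    done
    printf 'arm_spe/%s/\n' "$cfg"
}

echo "perf record -e $(spe_event ts_enable=1 load_filter=1 min_latency=10) -- ./mybench"
```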

Only some events can be filtered on using 'event_filter' bits. The overall
filter is the logical AND of these bits. For example, if bits 3 and 5 are set,
only samples that have both 'L1D cache refill' AND 'TLB walk' are recorded. When
FEAT_SPEv1p2 is implemented 'inv_event_filter' can also be used to exclude
events that have any (OR) of the filter's bits set. For example setting bits 3
and 5 in 'inv_event_filter' will exclude any events that are either L1D cache
refill OR TLB walk. If the same bit is set in both filters it's UNPREDICTABLE
whether the sample is included or excluded. Filter bits for both event_filter
and inv_event_filter are:

  bit 1     - Instruction retired (i.e. omit speculative instructions)
  bit 2     - L1D access (FEAT_SPEv1p4)
  bit 3     - L1D refill
  bit 4     - TLB access (FEAT_SPEv1p4)
  bit 5     - TLB refill
  bit 6     - Not taken event (FEAT_SPEv1p2)
  bit 7     - Mispredict
  bit 8     - Last level cache access (FEAT_SPEv1p4)
  bit 9     - Last level cache miss (FEAT_SPEv1p4)
  bit 10    - Remote access (FEAT_SPEv1p4)
  bit 11    - Misaligned access (FEAT_SPEv1p1)
  bit 12-15 - IMPLEMENTATION DEFINED events (when implemented)
  bit 17    - Partial or empty SME or SVE predicate (FEAT_SPEv1p1)
  bit 18    - Empty SME or SVE predicate (FEAT_SPEv1p1)
  bit 19    - L2D access (FEAT_SPEv1p4)
  bit 20    - L2D miss (FEAT_SPEv1p4)
  bit 21    - Cache data modified (FEAT_SPEv1p4)
  bit 22    - Recently fetched (FEAT_SPEv1p4)
  bit 23    - Data snooped (FEAT_SPEv1p4)
  bit 24    - Streaming SVE mode event (when FEAT_SPE_SME is implemented), or
              IMPLEMENTATION DEFINED event 24 (when implemented, only versions
              less than FEAT_SPEv1p4)
  bit 25    - SMCU or external coprocessor operation event when FEAT_SPE_SME is
              implemented, or IMPLEMENTATION DEFINED event 25 (when implemented,
              only versions less than FEAT_SPEv1p4)
  bit 26-31 - IMPLEMENTATION DEFINED events (only versions less than FEAT_SPEv1p4)
  bit 48-63 - IMPLEMENTATION DEFINED events (when implemented)

For IMPLEMENTATION DEFINED bits, refer to the CPU TRM if these bits are
implemented.

The driver will reject events if requested filter bits require unimplemented SPE
versions, but will not reject filter bits for unimplemented IMPDEF bits or when
their related feature is not present (e.g. SME). For example, if FEAT_SPEv1p2 is
not implemented, filtering on "Not taken event" (bit 6) will be rejected.

So to sample just retired instructions:

  perf record -e arm_spe/event_filter=2/ -- ./mybench

or just mispredicted branches:

  perf record -e arm_spe/event_filter=0x80/ -- ./mybench
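
The mask values above follow directly from the bit numbers in the table. The sketch below derives a
mask from a list of bit numbers and checks that 'event_filter' and 'inv_event_filter' share no bits,
since a shared bit makes the result UNPREDICTABLE. The helper names 'spe_event_mask' and
'spe_filters_disjoint' are hypothetical, and the final command is only printed, not run.

```shell
# spe_event_mask BIT... (hypothetical helper)
# OR together the given bit numbers and print the mask in hex. Sketch
# only: limited to bits 0-62 so the value stays positive in signed
# 64-bit shell arithmetic.
spe_event_mask() {
    mask=0
    for bit in "$@"; do
        mask=$(( mask | (1 << bit) ))
    done
    printf '0x%x\n' "$mask"
}

# spe_filters_disjoint EVENT_MASK INV_MASK (hypothetical helper)
# Succeed when no bit is set in both masks.
spe_filters_disjoint() {
    [ $(( $1 & $2 )) -eq 0 ]
}

keep=$(spe_event_mask 1)      # bit 1: retired instructions only
drop=$(spe_event_mask 3 5)    # bits 3 and 5: exclude L1D refill OR TLB refill
if spe_filters_disjoint "$keep" "$drop"; then
    echo "perf record -e arm_spe/event_filter=$keep,inv_event_filter=$drop/ -- ./mybench"
else
    echo "a bit is set in both filters, result would be UNPREDICTABLE" >&2
fi
```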

When set, the following filters can be used to select samples that match any of
the operation types (OR filtering). If only one is set then only samples of that
type are collected:

  branch_filter=1     - Collect branches (PMSFCR.B)
  load_filter=1       - Collect loads (PMSFCR.LD)
  store_filter=1      - Collect stores (PMSFCR.ST)
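
As an illustration of OR filtering, the sketch below builds an event string that collects any of
the listed operation types. The helper name 'spe_op_filters' is hypothetical, and it only accepts
the three types listed above.

```shell
# spe_op_filters TYPE... (hypothetical helper)
# Print an arm_spe event that sets <type>_filter=1 for each TYPE, where
# TYPE is one of branch, load or store.
spe_op_filters() {
    cfg=
    for op in "$@"; do
        case $op in
            branch|load|store) cfg=${cfg:+$cfg,}${op}_filter=1 ;;
            *) echo "unknown operation type: $op" >&2; return 1 ;;
        esac
    done
    printf 'arm_spe/%s/\n' "$cfg"
}

# Collect samples that are loads OR stores.
echo "perf record -e $(spe_op_filters load store) -- ./mybench"
```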

When extended filtering is supported (FEAT_SPE_EFT), SIMD and floating
point operations can also be selected:

  simd_filter=1         - Collect SIMD loads, stores and operations (PMSFCR.SIMD)
  float_filter=1        - Collect floating point loads, stores and operations (PMSFCR.FP)

When extended filtering is supported (FEAT_SPE_EFT), operation type filters can
be changed to AND using _mask fields. For example samples could be selected if
they are store AND SIMD by setting 'store_filter=1,simd_filter=1,
store_filter_mask=1,simd_filter_mask=1'. The new masks are as follows:

  branch_filter_mask=1  - Change branch filter behavior from OR to AND (PMSFCR.Bm)
  load_filter_mask=1    - Change load filter behavior from OR to AND (PMSFCR.LDm)
  store_filter_mask=1   - Change store filter behavior from OR to AND (PMSFCR.STm)
  simd_filter_mask=1    - Change SIMD filter behavior from OR to AND (PMSFCR.SIMDm)
  float_filter_mask=1   - Change floating point filter behavior from OR to AND (PMSFCR.FPm)
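
The store AND SIMD example can be generalised as in the sketch below, which sets both the filter and
its _mask field for every listed operation type. The helper name 'spe_and_filters' is hypothetical
and it does not check that the type names are valid or that FEAT_SPE_EFT is implemented.

```shell
# spe_and_filters TYPE... (hypothetical helper)
# Print an arm_spe event that selects samples matching ALL of the given
# operation types by setting <type>_filter=1 and <type>_filter_mask=1.
spe_and_filters() {
    filters=
    masks=
    for op in "$@"; do
        filters=${filters:+$filters,}${op}_filter=1
        masks=${masks:+$masks,}${op}_filter_mask=1
    done
    printf 'arm_spe/%s,%s/\n' "$filters" "$masks"
}

# Select samples that are store AND SIMD.
echo "perf record -e $(spe_and_filters store simd) -- ./mybench"
```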

Viewing the data
~~~~~~~~~~~~~~~~

By default perf report and perf script will assign samples to separate groups depending on the
attributes/events of the SPE record. Because instructions can have multiple events associated with
them, the samples in these groups are not necessarily unique. For example perf report shows these
groups:

  Available samples
  0 arm_spe//
  0 dummy:u
  21 l1d-miss
  897 l1d-access
  5 llc-miss
  7 llc-access
  2 tlb-miss
  1K tlb-access
  36 branch
  0 remote-access
  900 memory
  1800 instructions

The arm_spe// and dummy:u events are implementation details and are expected to be empty.

The instructions group contains the full list of unique samples that are not
sorted into other groups. To generate only this group use --itrace=i1i.

1i (1 instruction interval) signifies no further downsampling. Rather than an
instruction interval, this generates a sample every n SPE samples. For example
to generate the default set of events for every 100 SPE samples:

  perf report --itrace=bxofmtMai100i

Other period types, for example nanoseconds (ns) are not currently supported.
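
The i<n>i form can be generated for any interval, as in the sketch below. The helper name
'spe_itrace_opt' is hypothetical. It produces only the instructions-group option described above,
with one synthetic sample for every n SPE samples.

```shell
# spe_itrace_opt N (hypothetical helper)
# Print an --itrace option that synthesizes one instructions sample for
# every N SPE samples. N must be a positive integer.
spe_itrace_opt() {
    case $1 in
        ''|*[!0-9]*|0)
            echo "interval must be a positive integer" >&2
            return 1
            ;;
    esac
    printf '%s\n' "--itrace=i${1}i"
}

echo "perf report $(spe_itrace_opt 100)"
```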

Memory access details are also stored on the samples and this can be viewed with:

  perf report --mem-mode

The latency value from the SPE sample is stored in the 'weight' field of the
Perf samples and can be displayed in Perf script and report outputs by enabling
its display from the command line.

Common errors
~~~~~~~~~~~~~

 - "Cannot find PMU `arm_spe'. Missing kernel support?"

   Module not built or loaded, KPTI not disabled, interrupt not described by firmware,
   or running on a VM. See 'Kernel Requirements' above.

 - "Arm SPE CONTEXT packets not found in the traces."

   Root privilege is required to collect context packets, but these only increase the accuracy of
   assigning PIDs to kernel samples. For userspace sampling this can be ignored.

 - Excessively large perf.data file size

   Increase the sampling interval (see above).

PMU events
~~~~~~~~~~

SPE has events that can be counted on core PMUs. These are prefixed with
SAMPLE_, for example SAMPLE_POP, SAMPLE_FEED, SAMPLE_COLLISION and
SAMPLE_FEED_BR.

These events will only count when an SPE event is running on the same core that
the PMU event is opened on, otherwise they read as 0. There are various ways to
ensure that the PMU event and SPE event are scheduled together depending on the
way the event is opened. One option is to open both events as per-process events
on the same process, although it's not guaranteed that the PMU event is enabled
first when context switching. For that reason it may be better to open the PMU
event as a systemwide event and then open SPE on the process of interest.

Discard mode
~~~~~~~~~~~~

SPE related (SAMPLE_* etc) core PMU events can be used without the overhead of
collecting sample data if discard mode is supported (optional from Armv8.6).
First run a system wide SPE session (or on the core of interest) using options
to minimize output. Then run perf stat:

  perf record -e arm_spe/discard/ -a -N -B --no-bpf-event -o - > /dev/null &
  perf stat -e SAMPLE_FEED_LD

Data source filtering
~~~~~~~~~~~~~~~~~~~~~

When FEAT_SPE_FDS is present, 'inv_data_src_filter' can be used as a mask to
filter on a subset (0 - 63) of possible data source IDs. The full range of data
sources is 0 - 65535 although these are unlikely to be used in practice. Data
sources are IMPDEF so refer to the TRM for the mappings. Each bit N of the
filter maps to data source N. The filter is an OR of all the bits, and the value
provided to inv_data_src_filter is inverted before writing to PMSDSFR_EL1 so that
set bits exclude that data source and cleared bits include that data source.
Therefore the default value of 0 is equivalent to no filtering (all data sources
included).

For example, to include only data sources 0 and 3, clear bits 0 and 3
(0xFFFFFFFFFFFFFFF6).
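
The mask for a given set of data sources to keep can be derived as in the sketch below. The helper
name 'spe_inv_data_src_filter' is hypothetical. It works on two 32-bit halves so that the result
stays positive in signed 64-bit shell arithmetic, which is an implementation choice of this sketch
and has nothing to do with how the register is laid out.

```shell
# spe_inv_data_src_filter SRC... (hypothetical helper)
# Print the 64-bit inv_data_src_filter value that includes only the given
# data source IDs (0-63): the bits for those IDs are cleared and all
# other bits are set.
spe_inv_data_src_filter() {
    keep_lo=0
    keep_hi=0
    for src in "$@"; do
        if [ "$src" -lt 0 ] || [ "$src" -gt 63 ]; then
            echo "data source out of range: $src" >&2
            return 1
        elif [ "$src" -lt 32 ]; then
            keep_lo=$(( keep_lo | (1 << src) ))
        else
            keep_hi=$(( keep_hi | (1 << (src - 32)) ))
        fi
    done
    printf '0x%08X%08X\n' \
        $(( ~keep_hi & 0xFFFFFFFF )) $(( ~keep_lo & 0xFFFFFFFF ))
}

# Include only data sources 0 and 3.
spe_inv_data_src_filter 0 3
```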

When 'inv_data_src_filter' is set to 0xFFFFFFFFFFFFFFFF, any samples with any
data source set are excluded.

SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-inject[1]
