perf-arm-spe(1)
================

NAME
----
perf-arm-spe - Support for Arm Statistical Profiling Extension within Perf tools

SYNOPSIS
--------
[verse]
'perf record' -e arm_spe//

DESCRIPTION
-----------

The SPE (Statistical Profiling Extension) feature provides accurate attribution of latencies and
events down to individual instructions. Rather than being interrupt-driven, it picks an
instruction to sample and then captures data for it during execution. Data includes execution time
in cycles. For loads and stores it also includes data address, cache miss events, and data origin.

The sampling has 5 stages:

  1. Choose an operation
  2. Collect data about the operation
  3. Optionally discard the record based on a filter
  4. Write the record to memory
  5. Interrupt when the buffer is full

Choose an operation
~~~~~~~~~~~~~~~~~~~

The operation is chosen from a sample population. For SPE this is an IMPLEMENTATION DEFINED choice
of all architectural instructions or all micro-ops. Sampling happens at a programmable interval. The
architecture provides a mechanism for the SPE driver to infer the minimum interval at which it should
sample. This minimum interval is used by the driver if no interval is specified. A pseudo-random
perturbation is also added to the sampling interval by default.

Collect data about the operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Program counter, PMU events, timings and data addresses related to the operation are recorded.
Sampling ensures there is only one sampled operation in flight.

Optionally discard the record based on a filter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Based on programmable criteria, choose whether to keep the record or discard it.
If the record is discarded then the flow stops here for this sample.

Write the record to memory
~~~~~~~~~~~~~~~~~~~~~~~~~~

The record is appended to a memory buffer.

Interrupt when the buffer is full
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the buffer fills, an interrupt is sent and the driver signals Perf to collect the records.
Perf saves the raw data in the perf.data file.

Opening the file
----------------

Up until this point no decoding of the SPE data was done by either the kernel or Perf. Only when the
recorded file is opened with 'perf report' or 'perf script' does the decoding happen. When decoding
the data, Perf generates "synthetic samples" as if these were generated at the time of the
recording. These samples are the same as if normal sampling was done by Perf without using SPE,
although they may have more attributes associated with them. For example a normal sample may have
just the instruction pointer, but an SPE sample can have data addresses and latency attributes.

Why Sampling?
-------------

 - Sampling, rather than tracing, cuts down the profiling problem to something more manageable for
 hardware. Only one sampled operation is in flight at a time.

 - Allows precise attribution data, including: Full PC of instruction, data virtual and physical
 addresses.

 - Allows correlation between an instruction and events, such as TLB and cache miss. (Data source
 indicates which particular cache was hit, but the meaning is implementation defined because
 different implementations can have different cache configurations.)

However, SPE does not provide any call-graph information, and relies on statistical methods.

Collisions
----------

When an operation is sampled while a previous sampled operation has not finished, a collision
occurs. The new sample is dropped.
Collisions affect the integrity of the data, so the sample rate should be set to avoid collisions.

The 'sample_collision' PMU event can be used to determine the number of lost samples. However, this
count is based on collisions _before_ filtering occurs, so it cannot be used as an exact number of
samples dropped that would have made it through the filter. It can only serve as a rough guide.

The effect of microarchitectural sampling
-----------------------------------------

If an implementation samples micro-operations instead of instructions, the results of sampling must
be weighted accordingly.

For example, if a given instruction A is always converted into two micro-operations, A0 and A1, it
becomes twice as likely to appear in the sample population.

The coarse effect of conversions, and, if applicable, sampling of speculative operations, can be
estimated from the 'sample_pop' and 'inst_retired' PMU events.

Kernel Requirements
-------------------

The ARM_SPE_PMU config must be set to build as either a module or statically.

Depending on CPU model, the kernel may need to be booted with page table isolation disabled
(kpti=off). If KPTI needs to be disabled but is still enabled, SPE initialization will fail with
the console message "profiling buffer inaccessible. Try passing 'kpti=off' on the kernel command
line".

For the full criteria that determine whether KPTI needs to be forced off or not, see function
unmap_kernel_at_el0() in the kernel sources. Common cases where it's not required
are on the CPUs in kpti_safe_list, or on Arm v8.5+ where FEAT_E0PD is mandatory.

The SPE interrupt must also be described by the firmware. If the module is loaded and KPTI is
disabled (or isn't required to be disabled) but the SPE PMU still doesn't show in
/sys/bus/event_source/devices/, then it's possible that the SPE interrupt isn't described by
ACPI or DT.
In this case no warning will be printed by the driver.

Capturing SPE with perf command-line tools
------------------------------------------

You can record a session with SPE samples:

  perf record -e arm_spe// -- ./mybench

The sample period is set from the -c option, and because the minimum interval is used by default
it's recommended to set this to a higher value. The value is written to PMSIRR.INTERVAL.

Config parameters
~~~~~~~~~~~~~~~~~

These are placed between the // in the event and comma separated. For example '-e
arm_spe/load_filter=1,min_latency=10/'

  branch_filter=1     - collect branches only (PMSFCR.B)
  event_filter=<mask> - filter on specific events (PMSEVFR) - see bitfield description below
  jitter=1            - use jitter to avoid resonance when sampling (PMSIRR.RND)
  load_filter=1       - collect loads only (PMSFCR.LD)
  min_latency=<n>     - collect only samples with this latency or higher* (PMSLATFR)
  pa_enable=1         - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege
  pct_enable=1        - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
  store_filter=1      - collect stores only (PMSFCR.ST)
  ts_enable=1         - enable timestamping with value of generic timer (PMSCR.TS)
  discard=1           - enable SPE PMU events but don't collect sample data - see 'Discard mode' (PMBLIMITR.FM = DISCARD)

+++*+++ Latency is the total latency from the point at which sampling started on that instruction, rather
than only the execution latency.

Only some events can be filtered on; these include:

  bit 1     - instruction retired (i.e. omit speculative instructions)
  bit 3     - L1D refill
  bit 5     - TLB refill
  bit 7     - mispredict
  bit 11    - misaligned access

So to sample just retired instructions:

  perf record -e arm_spe/event_filter=2/ -- ./mybench

or just mispredicted branches:

  perf record -e arm_spe/event_filter=0x80/ -- ./mybench

Viewing the data
~~~~~~~~~~~~~~~~

By default perf report and perf script will assign samples to separate groups depending on the
attributes/events of the SPE record. Because instructions can have multiple events associated with
them, the samples in these groups are not necessarily unique. For example perf report shows these
groups:

  Available samples
  0 arm_spe//
  0 dummy:u
  21 l1d-miss
  897 l1d-access
  5 llc-miss
  7 llc-access
  2 tlb-miss
  1K tlb-access
  36 branch
  0 remote-access
  900 memory

The arm_spe// and dummy:u events are implementation details and are expected to be empty.

To get a full list of unique samples that are not sorted into groups, set the itrace option to
generate 'instruction' samples. The period option is also taken into account, so set it to 1
instruction unless you want to further downsample the already sampled SPE data:

  perf report --itrace=i1i

Memory access details are also stored on the samples and this can be viewed with:

  perf report --mem-mode

Common errors
~~~~~~~~~~~~~

 - "Cannot find PMU `arm_spe'. Missing kernel support?"

   Module not built or loaded, KPTI not disabled, interrupt not described by firmware,
   or running on a VM. See 'Kernel Requirements' above.

 - "Arm SPE CONTEXT packets not found in the traces."

   Root privilege is required to collect context packets. But these only increase the accuracy of
   assigning PIDs to kernel samples.
   For userspace sampling this can be ignored.

 - Excessively large perf.data file size

   Increase the sampling interval (see above).

PMU events
~~~~~~~~~~

SPE has events that can be counted on core PMUs. These are prefixed with
SAMPLE_, for example SAMPLE_POP, SAMPLE_FEED, SAMPLE_COLLISION and
SAMPLE_FEED_BR.

These events will only count when an SPE event is running on the same core that
the PMU event is opened on, otherwise they read as 0. There are various ways to
ensure that the PMU event and SPE event are scheduled together depending on the
way the event is opened. One option is to open both events as per-process events
on the same process, although it's not guaranteed that the PMU event is enabled
first when context switching. For that reason it may be better to open the PMU
event as a systemwide event and then open SPE on the process of interest.

Discard mode
~~~~~~~~~~~~

SPE related (SAMPLE_* etc) core PMU events can be used without the overhead of
collecting sample data if discard mode is supported (optional from Armv8.6).
First run a system wide SPE session (or on the core of interest) using options
to minimize output. Then run perf stat:

  perf record -e arm_spe/discard/ -a -N -B --no-bpf-event -o - > /dev/null &
  perf stat -e SAMPLE_FEED_LD

SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-inject[1]
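
Combining event filter bits
---------------------------

As a worked example of the event_filter bitfield described under 'Config parameters', the mask for
several events can be built from the bit positions listed there. This is a minimal sketch and not
part of perf itself. The variable names are hypothetical, and it assumes that a record must have
every selected event set to pass the filter (check the PMSEVFR description in the Arm Architecture
Reference Manual for your implementation). The command is printed rather than run, so the snippet
also works on machines without SPE.

```shell
#!/bin/sh
# Bit positions taken from the event_filter list under 'Config parameters'.
# The variable names are hypothetical and only used for this illustration.
RETIRED_BIT=1      # instruction retired
MISPREDICT_BIT=7   # mispredict

# Set one bit per event of interest and OR them into a single mask.
mask=$(( (1 << RETIRED_BIT) | (1 << MISPREDICT_BIT) ))

# perf accepts the mask in hex, as in the event_filter=0x80 example.
filter=$(printf '0x%x' "$mask")

# Build and print the command instead of running it.
cmd="perf record -e arm_spe/event_filter=${filter}/ -- ./mybench"
echo "$cmd"
```

With bits 1 and 7 set the mask is 0x82, which under the assumption above should keep only samples
that are both retired and mispredicted.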