perf-arm-spe(1)
================

NAME
----
perf-arm-spe - Support for Arm Statistical Profiling Extension within Perf tools

SYNOPSIS
--------
[verse]
'perf record' -e arm_spe//

DESCRIPTION
-----------

The SPE (Statistical Profiling Extension) feature provides accurate attribution of latencies and
events down to individual instructions. Rather than being interrupt-driven, it picks an
instruction to sample and then captures data for it during execution. Data includes execution time
in cycles. For loads and stores it also includes data address, cache miss events, and data origin.

The sampling has 5 stages:

1. Choose an operation
2. Collect data about the operation
3. Optionally discard the record based on a filter
4. Write the record to memory
5. Interrupt when the buffer is full

Choose an operation
~~~~~~~~~~~~~~~~~~~

The operation is chosen from a sample population. For SPE this is an IMPLEMENTATION DEFINED choice
of all architectural instructions or all micro-ops. Sampling happens at a programmable interval. The
architecture provides a mechanism for the SPE driver to infer the minimum interval at which it should
sample. This minimum interval is used by the driver if no interval is specified. A pseudo-random
perturbation is also added to the sampling interval by default.

Collect data about the operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Program counter, PMU events, timings and data addresses related to the operation are recorded.
Sampling ensures there is only one sampled operation in flight.

Optionally discard the record based on a filter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Based on programmable criteria, choose whether to keep the record or discard it.
If the record is discarded then the flow stops here for this sample.

Write the record to memory
~~~~~~~~~~~~~~~~~~~~~~~~~~

The record is appended to a memory buffer.

Interrupt when the buffer is full
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the buffer fills, an interrupt is sent and the driver signals Perf to collect the records.
Perf saves the raw data in the perf.data file.

Opening the file
----------------

Up until this point no decoding of the SPE data was done by either the kernel or Perf. Only when the
recorded file is opened with 'perf report' or 'perf script' does the decoding happen. When decoding
the data, Perf generates "synthetic samples" as if these were generated at the time of the
recording. These samples are the same as if normal sampling was done by Perf without using SPE,
although they may have more attributes associated with them. For example a normal sample may have
just the instruction pointer, but an SPE sample can have data addresses and latency attributes.

Why Sampling?
-------------

 - Sampling, rather than tracing, cuts down the profiling problem to something more manageable for
   hardware. Only one sampled operation is in flight at a time.

 - Allows precise attribution data, including: Full PC of instruction, data virtual and physical
   addresses.

 - Allows correlation between an instruction and events, such as TLB and cache miss. (Data source
   indicates which particular cache was hit, but the meaning is implementation defined because
   different implementations can have different cache configurations.)

However, SPE does not provide any call-graph information, and relies on statistical methods.

Collisions
----------

When an operation is sampled while a previous sampled operation has not finished, a collision
occurs. The new sample is dropped.
Collisions affect the integrity of the data, so the sample rate
should be set to avoid collisions.

The 'sample_collision' PMU event can be used to determine the number of lost samples. However, this
count is based on collisions _before_ filtering occurs, so it cannot be used as an exact number
for samples dropped that would have made it through the filter. It can still serve as a rough
guide.

The effect of microarchitectural sampling
-----------------------------------------

If an implementation samples micro-operations instead of instructions, the results of sampling must
be weighted accordingly.

For example, if a given instruction A is always converted into two micro-operations, A0 and A1, it
becomes twice as likely to appear in the sample population.

The coarse effect of conversions, and, if applicable, sampling of speculative operations, can be
estimated from the 'sample_pop' and 'inst_retired' PMU events.

Kernel Requirements
-------------------

The ARM_SPE_PMU config must be set to build as either a module or statically.

Depending on CPU model, the kernel may need to be booted with page table isolation disabled
(kpti=off). If KPTI needs to be disabled but is still enabled, SPE will fail with a console message
"profiling buffer inaccessible. Try passing 'kpti=off' on the kernel command line".

For the full criteria that determine whether KPTI needs to be forced off or not, see function
unmap_kernel_at_el0() in the kernel sources. Common cases where it's not required
are on the CPUs in kpti_safe_list, or on Arm v8.5+ where FEAT_E0PD is mandatory.

The SPE interrupt must also be described by the firmware. If the module is loaded and KPTI is
disabled (or isn't required to be disabled) but the SPE PMU still doesn't show in
/sys/bus/event_source/devices/, then it's possible that the SPE interrupt isn't described by
ACPI or DT.
In this case no warning will be printed by the driver.

Capturing SPE with perf command-line tools
------------------------------------------

You can record a session with SPE samples:

  perf record -e arm_spe// -- ./mybench

The sample period is set from the -c option, and because the minimum interval is used by default
it's recommended to set this to a higher value. The value is written to PMSIRR.INTERVAL.

Config parameters
~~~~~~~~~~~~~~~~~

These are placed between the // in the event and comma separated. For example '-e
arm_spe/load_filter=1,min_latency=10/'

  event_filter=<mask> - logical AND filter on specific events (PMSEVFR) - see bitfield description below
  inv_event_filter=<mask> - logical OR to filter out specific events (PMSNEVFR, FEAT_SPEv1p2) - see bitfield description below
  jitter=1 - use jitter to avoid resonance when sampling (PMSIRR.RND)
  min_latency=<n> - collect only samples with this latency or higher* (PMSLATFR)
  pa_enable=1 - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege
  pct_enable=1 - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
  ts_enable=1 - enable timestamping with value of generic timer (PMSCR.TS)
  discard=1 - enable SPE PMU events but don't collect sample data - see 'Discard mode' (PMBLIMITR.FM = DISCARD)
  inv_data_src_filter=<mask> - mask to filter from 0-63 possible data sources (PMSDSFR, FEAT_SPE_FDS) - See 'Data source filtering'

+++*+++ Latency is the total latency from the point at which sampling started on that instruction, rather
than only the execution latency.

Only some events can be filtered on using 'event_filter' bits. The overall
filter is the logical AND of these bits, for example if bits 3 and 5 are set
only samples that have both 'L1D cache refill' AND 'TLB walk' are recorded.
When FEAT_SPEv1p2 is implemented 'inv_event_filter' can also be used to exclude
events that have any (OR) of the filter's bits set. For example setting bits 3
and 5 in 'inv_event_filter' will exclude any events that are either L1D cache
refill OR TLB walk. If the same bit is set in both filters it's UNPREDICTABLE
whether the sample is included or excluded. Filter bits for both event_filter
and inv_event_filter are:

  bit 1 - Instruction retired (i.e. omit speculative instructions)
  bit 2 - L1D access (FEAT_SPEv1p4)
  bit 3 - L1D refill
  bit 4 - TLB access (FEAT_SPEv1p4)
  bit 5 - TLB refill
  bit 6 - Not taken event (FEAT_SPEv1p2)
  bit 7 - Mispredict
  bit 8 - Last level cache access (FEAT_SPEv1p4)
  bit 9 - Last level cache miss (FEAT_SPEv1p4)
  bit 10 - Remote access (FEAT_SPEv1p4)
  bit 11 - Misaligned access (FEAT_SPEv1p1)
  bit 12-15 - IMPLEMENTATION DEFINED events (when implemented)
  bit 17 - Partial or empty SME or SVE predicate (FEAT_SPEv1p1)
  bit 18 - Empty SME or SVE predicate (FEAT_SPEv1p1)
  bit 19 - L2D access (FEAT_SPEv1p4)
  bit 20 - L2D miss (FEAT_SPEv1p4)
  bit 21 - Cache data modified (FEAT_SPEv1p4)
  bit 22 - Recently fetched (FEAT_SPEv1p4)
  bit 23 - Data snooped (FEAT_SPEv1p4)
  bit 24 - Streaming SVE mode event (when FEAT_SPE_SME is implemented), or
           IMPLEMENTATION DEFINED event 24 (when implemented, only versions
           less than FEAT_SPEv1p4)
  bit 25 - SMCU or external coprocessor operation event when FEAT_SPE_SME is
           implemented, or IMPLEMENTATION DEFINED event 25 (when implemented,
           only versions less than FEAT_SPEv1p4)
  bit 26-31 - IMPLEMENTATION DEFINED events (only versions less than FEAT_SPEv1p4)
  bit 48-63 - IMPLEMENTATION DEFINED events (when implemented)

For IMPLEMENTATION DEFINED bits, refer to the CPU TRM if these bits are
implemented.

The driver will reject events if requested filter bits require unimplemented SPE
versions, but will not reject filter bits for unimplemented IMPDEF bits or
when their related feature is not present (e.g. SME). For example, if FEAT_SPEv1p2 is
not implemented, filtering on "Not taken event" (bit 6) will be rejected.

So to sample just retired instructions:

  perf record -e arm_spe/event_filter=2/ -- ./mybench

or just mispredicted branches:

  perf record -e arm_spe/event_filter=0x80/ -- ./mybench

When set, the following filters can be used to select samples that match any of
the operation types (OR filtering). If only one is set then only samples of that
type are collected:

  branch_filter=1 - Collect branches (PMSFCR.B)
  load_filter=1 - Collect loads (PMSFCR.LD)
  store_filter=1 - Collect stores (PMSFCR.ST)

When extended filtering is supported (FEAT_SPE_EFT), SIMD and floating
point operations can also be selected:

  simd_filter=1 - Collect SIMD loads, stores and operations (PMSFCR.SIMD)
  float_filter=1 - Collect floating point loads, stores and operations (PMSFCR.FP)

When extended filtering is supported (FEAT_SPE_EFT), operation type filters can
be changed to AND using _mask fields. For example samples could be selected if
they are store AND SIMD by setting 'store_filter=1,simd_filter=1,
store_filter_mask=1,simd_filter_mask=1'. The new masks are as follows:

  branch_filter_mask=1 - Change branch filter behavior from OR to AND (PMSFCR.Bm)
  load_filter_mask=1 - Change load filter behavior from OR to AND (PMSFCR.LDm)
  store_filter_mask=1 - Change store filter behavior from OR to AND (PMSFCR.STm)
  simd_filter_mask=1 - Change SIMD filter behavior from OR to AND (PMSFCR.SIMDm)
  float_filter_mask=1 - Change floating point filter behavior from OR to AND (PMSFCR.FPm)

Viewing the data
~~~~~~~~~~~~~~~~

By default perf report and perf script will assign samples to separate groups depending on the
attributes/events of the SPE record. Because instructions can have multiple events associated with
them, the samples in these groups are not necessarily unique.
For example perf report shows these
groups:

  Available samples
  0 arm_spe//
  0 dummy:u
  21 l1d-miss
  897 l1d-access
  5 llc-miss
  7 llc-access
  2 tlb-miss
  1K tlb-access
  36 branch
  0 remote-access
  900 memory
  1800 instructions

The arm_spe// and dummy:u events are implementation details and are expected to be empty.

The instructions group contains the full list of unique samples that are not
sorted into other groups. To generate only this group use --itrace=i1i.

1i (1 instruction interval) signifies no further downsampling. Rather than an
instruction interval, this generates a sample every n SPE samples. For example
to generate the default set of events for every 100 SPE samples:

  perf report --itrace==bxofmtMai100i

Other period types, for example nanoseconds (ns), are not currently supported.

Memory access details are also stored on the samples and this can be viewed with:

  perf report --mem-mode

The latency value from the SPE sample is stored in the 'weight' field of the
Perf samples and can be displayed in Perf script and report outputs by enabling
its display from the command line.

Common errors
~~~~~~~~~~~~~

 - "Cannot find PMU `arm_spe'. Missing kernel support?"

   Module not built or loaded, KPTI not disabled, interrupt not described by firmware,
   or running on a VM. See 'Kernel Requirements' above.

 - "Arm SPE CONTEXT packets not found in the traces."

   Root privilege is required to collect context packets. However, these only increase the
   accuracy of assigning PIDs to kernel samples. For userspace sampling this can be ignored.

 - Excessively large perf.data file size

   Increase sampling interval (see above)

PMU events
~~~~~~~~~~

SPE has events that can be counted on core PMUs.
These are prefixed with
SAMPLE_, for example SAMPLE_POP, SAMPLE_FEED, SAMPLE_COLLISION and
SAMPLE_FEED_BR.

These events will only count when an SPE event is running on the same core that
the PMU event is opened on, otherwise they read as 0. There are various ways to
ensure that the PMU event and SPE event are scheduled together depending on the
way the event is opened. One way is to open both events as per-process events
on the same process, although it's not guaranteed that the PMU event is enabled
first when context switching. For that reason it may be better to open the PMU
event as a systemwide event and then open SPE on the process of interest.

Discard mode
~~~~~~~~~~~~

SPE related (SAMPLE_* etc) core PMU events can be used without the overhead of
collecting sample data if discard mode is supported (optional from Armv8.6).
First run a system wide SPE session (or on the core of interest) using options
to minimize output. Then run perf stat:

  perf record -e arm_spe/discard/ -a -N -B --no-bpf-event -o - > /dev/null &
  perf stat -e SAMPLE_FEED_LD

Data source filtering
~~~~~~~~~~~~~~~~~~~~~

When FEAT_SPE_FDS is present, 'inv_data_src_filter' can be used as a mask to
filter on a subset (0 - 63) of possible data source IDs. The full range of data
sources is 0 - 65535 although these are unlikely to be used in practice. Data
sources are IMPDEF so refer to the TRM for the mappings. Each bit N of the
filter maps to data source N.
The filter is an OR of all the bits, and the value
provided for inv_data_src_filter is inverted before writing to PMSDSFR_EL1 so that
set bits exclude that data source and cleared bits include that data source.
Therefore the default value of 0 is equivalent to no filtering (all data sources
included).

For example, to include only data sources 0 and 3, clear bits 0 and 3
(0xFFFFFFFFFFFFFFF6).

When 'inv_data_src_filter' is set to 0xFFFFFFFFFFFFFFFF, any samples with any
data source set are excluded.

SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-inject[1]
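
Building an event_filter mask
-----------------------------

As a minimal sketch of how the 'event_filter' and 'inv_event_filter' values described under
'Config parameters' relate to the bit numbers in the filter bit list, the POSIX shell helper
below ORs together one bit per argument and prints the mask in hexadecimal. The function name
spe_event_mask is hypothetical and is not part of perf. It only performs the arithmetic and
does not check whether a bit is implemented on a given CPU.

```shell
#!/bin/sh
# spe_event_mask is a hypothetical helper, not part of perf.
# It ORs together (1 << bit) for every bit number passed as an argument
# and prints the resulting mask in hexadecimal.
# Assumption: bit numbers are below 63, so the result stays within the
# positive range of the shell's signed 64-bit arithmetic.
spe_event_mask() {
    mask=0
    for bit in "$@"; do
        mask=$(( mask | (1 << bit) ))
    done
    printf '0x%x\n' "$mask"
}

# Bit 7 (Mispredict) alone: the value used for mispredicted branches above.
spe_event_mask 7

# Bits 3 (L1D refill) and 5 (TLB refill) combined into one mask.
spe_event_mask 3 5
```

The printed value can then be placed in the event string, for example
'perf record -e arm_spe/event_filter=0x28/ -- ./mybench' for bits 3 and 5.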