Path: blob/master/tools/perf/Documentation/perf-c2c.txt
perf-c2c(1)
===========

NAME
----
perf-c2c - Shared Data C2C/HITM Analyzer.

SYNOPSIS
--------
[verse]
'perf c2c record' [<options>] <command>
'perf c2c record' [<options>] \-- [<record command options>] <command>
'perf c2c report' [<options>]

DESCRIPTION
-----------
C2C stands for Cache To Cache.

The perf c2c tool provides means for Shared Data C2C/HITM analysis. It allows
you to track down cacheline contention.

On Intel, the tool is based on the load latency and precise store facility
events provided by Intel CPUs. On PowerPC, the tool uses random instruction
sampling with a thresholding feature. On AMD, the tool uses the IBS op PMU
(due to hardware limitations, perf c2c is not supported on Zen3 CPUs). On
Arm64 it uses SPE to sample load and store operations, so hardware and kernel
support is required. See linkperf:perf-arm-spe[1] for a setup guide. Due to
the statistical nature of Arm SPE sampling, not every memory operation will
be sampled.

These events provide:
- memory address of the access
- type of the access (load and store details)
- latency (in cycles) of the load access

The c2c tool provides a means to record this data and report back access
details for the cachelines with the highest contention - that is, the highest
number of HITM accesses.

The basic workflow with this tool follows the standard record/report scheme:
use the record command to record the event data and the report command to
display it.


RECORD OPTIONS
--------------
-e::
--event=::
	Select the PMU event. Use 'perf c2c record -e list'
	to list available events.

-v::
--verbose::
	Be more verbose (show counter open errors, etc).

-l::
--ldlat::
	Configure mem-loads latency. Supported on Intel, Arm64 and some AMD
	processors. Ignored on other archs.

	On supported AMD processors:
	- The /sys/bus/event_source/devices/ibs_op/caps/ldlat file contains '1'.
	- Supported latency values are 128 to 2048 (both inclusive).
	- A latency value which is a multiple of 128 incurs slightly less
	  profiling overhead compared to other values.
	- Load latency filtering is disabled by default.

-k::
--all-kernel::
	Configure all used events to run in kernel space.

-u::
--all-user::
	Configure all used events to run in user space.

REPORT OPTIONS
--------------
-k::
--vmlinux=<file>::
	vmlinux pathname

-v::
--verbose::
	Be more verbose (show counter open errors, etc).

-i::
--input::
	Specify the input file to process.

-N::
--node-info::
	Show extra node info in the report (see the NODE INFO section).

-c::
--coalesce::
	Specify the sorting fields for single cacheline display.
	The following fields are available: tid,pid,iaddr,dso
	(see COALESCE).

-g::
--call-graph::
	Set up callchain parameters.
	Please refer to the perf-report man page for details.

--stdio::
	Force the stdio output (see STDIO OUTPUT).

--stats::
	Display only the statistics tables and force stdio mode.

--full-symbols::
	Display the full length of symbols.

--no-source::
	Do not display the Source:Line column.

--show-all::
	Show all captured HITM lines, regardless of the HITM % 0.0005 limit.

-f::
--force::
	Don't do ownership validation.

-d::
--display::
	Switch to a HITM type (rmt, lcl) or the peer snooping type (peer) to
	display and sort on. Total HITMs (tot) is the default, except on
	Arm64, which uses peer mode as the default.

--stitch-lbr::
	Show the callgraph with stitched LBRs, which may produce a more
	complete callgraph. The perf.data file must have been obtained using
	perf c2c record --call-graph lbr.
	Disabled by default. In common cases with call stack overflows,
	it can recreate better call stacks than the default lbr call stack
	output. But this approach is not foolproof. There can be cases
	where it creates incorrect call stacks from incorrect matches.
	The known limitations include exception handling such as
	setjmp/longjmp, whose calls/returns will not match.

--double-cl::
	Group the detection of shared cacheline events into double cacheline
	granularity. Some architectures have an Adjacent Cacheline Prefetch
	feature, which causes cacheline sharing to behave as if the cacheline
	size were doubled.

C2C RECORD
----------
The perf c2c record command sets up options related to HITM cacheline analysis
and then calls the standard perf record command.

The following perf record options are configured by default
(check the perf record man page for details):

  -W,-d,--phys-data,--sample-cpu

Unless specified otherwise with the '-e' option, the following events are
monitored by default on Intel:

  cpu/mem-loads,ldlat=30/P
  cpu/mem-stores/P

the following on AMD:

  ibs_op//

and the following on PowerPC:

  cpu/mem-loads/
  cpu/mem-stores/

The user can pass any 'perf record' option after the '--' mark, e.g. to enable
callchains and system-wide monitoring:

  $ perf c2c record -- -g -a

Please check the RECORD OPTIONS section for c2c record specific options.

C2C REPORT
----------
The perf c2c report command displays the shared data analysis. It comes in two
display modes: stdio and tui (default).

The report command workflow is as follows:
- sort all the data based on the cacheline address
- store access details for each cacheline
- sort all cachelines based on user settings
- display data

In general the perf c2c report output consists of 2 basic views:
  1) most expensive cachelines list
  2) offsets details for each cacheline

For each cacheline in the 1) list we display the following data
(both stdio and TUI modes display the same fields):

  Index
  - zero based index to identify the cacheline

  Cacheline
  - cacheline address (hex number)

  Rmt/Lcl Hitm (Display with HITM types)
  - cacheline percentage of all Remote/Local HITM accesses

  Peer Snoop (Display with peer type)
  - cacheline percentage of all peer accesses

  LLC Load Hitm - Total, LclHitm, RmtHitm (For display with HITM types)
  - count of Total/Local/Remote load HITMs

  Load Peer - Total, Local, Remote (For display with peer type)
  - count of Total/Local/Remote loads from a peer cache or DRAM

  Total records
  - sum of all cacheline accesses

  Total loads
  - sum of all load accesses

  Total stores
  - sum of all store accesses

  Store Reference - L1Hit, L1Miss, N/A
    L1Hit - store accesses that hit L1
    L1Miss - store accesses that missed L1
    N/A - store accesses for which the memory level is not available

  Core Load Hit - FB, L1, L2
  - count of load hits in FB (Fill Buffer), L1 and L2 cache

  LLC Load Hit - LlcHit, LclHitm
  - count of LLC load accesses, includes LLC hits and LLC HITMs

  RMT Load Hit - RmtHit, RmtHitm
  - count of remote load accesses, includes remote hits and remote HITMs;
    on Arm neoverse cores, RmtHit is used to account for remote accesses,
    which include remote DRAM or any upward cache level in the remote node

  Load Dram - Lcl, Rmt
  - count of local and remote DRAM accesses

For each offset in the 2) list we display the following data:

  HITM - Rmt, Lcl (Display with HITM types)
  - % of Remote/Local HITM accesses for the given offset within the cacheline

  Peer Snoop - Rmt, Lcl (Display with peer type)
  - % of Remote/Local peer accesses for the given offset within the cacheline

  Store Refs - L1 Hit, L1 Miss, N/A
  - % of store accesses that hit L1, missed L1, or whose memory level is not
    available (N/A), for the given offset within the cacheline

  Data address - Offset
  - offset address

  Pid
  - pid of the process responsible for the accesses

  Tid
  - tid of the process responsible for the accesses

  Code address
  - code address responsible for the accesses

  cycles - rmt hitm, lcl hitm, load (Display with HITM types)
  - sum of cycles for the given accesses - Remote/Local HITM and generic load

  cycles - rmt peer, lcl peer, load (Display with peer type)
  - sum of cycles for the given accesses - Remote/Local peer load and
    generic load

  cpu cnt
  - number of cpus that participated in the access

  Symbol
  - code symbol related to the 'Code address' value

  Shared Object
  - shared object name related to the 'Code address' value

  Source:Line
  - source information related to the 'Code address' value

  Node
  - nodes participating in the access (see the NODE INFO section)

NODE INFO
---------
The 'Node' field displays the nodes that access the given cacheline
offset. Its output comes in 3 flavors:
- node IDs separated by ','
- node IDs with stats for each ID, in the following format:
    Node{cpus %hitms %stores} (Display with HITM types)
    Node{cpus %peers %stores} (Display with peer type)
- node IDs with the list of affected CPUs in the following format:
    Node{cpu list}

The user can switch between the above flavors with the -N option, or
use the 'n' key to switch interactively in TUI mode.

COALESCE
--------
The user can specify how to sort offsets for a cacheline.

The following fields are available and govern the final
set of output fields for the cacheline offsets output:

  tid   - coalesced by process TIDs
  pid   - coalesced by process PIDs
  iaddr - coalesced by code address, the following fields are displayed:
            Code address, Code symbol, Shared Object, Source line
  dso   - coalesced by shared object

By default the coalescing is set up with 'pid,iaddr'.

STDIO OUTPUT
------------
The stdio output displays data on standard output.

The following tables are displayed:

  Trace Event Information
  - overall statistics of memory accesses

  Global Shared Cache Line Event Information
  - overall statistics on shared cachelines

  Shared Data Cache Line Table
  - list of most expensive cachelines

  Shared Cache Line Distribution Pareto
  - list of all accessed offsets for each cacheline

TUI OUTPUT
----------
The TUI output provides an interactive interface to navigate
through the cacheline list and to display offset details.

For details please refer to the help window by pressing the '?' key.

CREDITS
-------
Although Don Zickus, Dick Fowles and Joe Mario worked together
to get this implemented, we got lots of early help from Arnaldo
Carvalho de Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.

C2C BLOG
--------
Check Joe's blog on the c2c tool for a detailed use case explanation:
https://joemario.github.io/blog/2016/09/01/c2c-blog/

SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-mem[1], linkperf:perf-arm-spe[1]