perf-c2c(1)
===========

NAME
----
perf-c2c - Shared Data C2C/HITM Analyzer.

SYNOPSIS
--------
[verse]
'perf c2c record' [<options>] <command>
'perf c2c record' [<options>] \-- [<record command options>] <command>
'perf c2c report' [<options>]

DESCRIPTION
-----------
C2C stands for Cache To Cache.

The perf c2c tool provides means for Shared Data C2C/HITM analysis. It allows
you to track down cacheline contentions.

On Intel, the tool is based on the load latency and precise store facility
events provided by Intel CPUs. On PowerPC, the tool uses random instruction
sampling with a thresholding feature. On AMD, the tool uses the IBS op PMU
(due to hardware limitations, perf c2c is not supported on Zen3 CPUs). On
Arm64 it uses SPE to sample load and store operations, therefore hardware and
kernel support is required. See linkperf:perf-arm-spe[1] for a setup guide.
Due to the statistical nature of Arm SPE sampling, not every memory operation
will be sampled.

These events provide:
- memory address of the access
- type of the access (load and store details)
- latency (in cycles) of the load access

The c2c tool provides means to record this data and report back access details
for the cachelines with the highest contention, that is, the highest number of
HITM accesses.

The basic workflow with this tool follows the standard record/report scheme:
use the record command to record the event data and the report command to
display it.
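As an illustration of the kind of problem perf c2c is meant to expose, the
following minimal C program (a sketch for this document, not part of perf;
the file and symbol names are arbitrary) creates false sharing: two threads
update adjacent fields that live on the same cacheline, so each update
invalidates the line in the other core's cache:

  /* false_sharing.c - illustrative demo; build: gcc -O2 -pthread false_sharing.c */
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  /* Two counters packed into one cacheline. Each thread touches only its
   * own counter, yet both threads contend on the same line. */
  static struct {
          atomic_long a;        /* written by thread 1 */
          atomic_long b;        /* written by thread 2, same cacheline as 'a' */
  } data;

  static void *bump(void *counter)
  {
          /* Relaxed atomic increments keep the stores in the loop so the
           * compiler cannot hoist them into a register. */
          for (long i = 0; i < 100000000L; i++)
                  atomic_fetch_add_explicit((atomic_long *)counter, 1,
                                            memory_order_relaxed);
          return NULL;
  }

  int main(void)
  {
          pthread_t t1, t2;

          pthread_create(&t1, NULL, bump, &data.a);
          pthread_create(&t2, NULL, bump, &data.b);
          pthread_join(t1, NULL);
          pthread_join(t2, NULL);
          printf("a=%ld b=%ld\n", (long)data.a, (long)data.b);
          return 0;
  }

Recording such a program and reporting on it should rank the cacheline
holding 'data' among the most contended lines; padding the two fields onto
separate cachelines makes the HITM (or peer) traffic disappear.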
RECORD OPTIONS
--------------
-e::
--event=::
        Select the PMU event. Use 'perf c2c record -e list'
        to list available events.

-v::
--verbose::
        Be more verbose (show counter open errors, etc).

-l::
--ldlat::
        Configure mem-loads latency. Supported on Intel, Arm64 and some AMD
        processors. Ignored on other archs.

        On supported AMD processors:
        - /sys/bus/event_source/devices/ibs_op/caps/ldlat file contains '1'.
        - Supported latency values are 128 to 2048 (both inclusive).
        - A latency value that is a multiple of 128 incurs slightly less
          profiling overhead than other values.
        - Load latency filtering is disabled by default.

-k::
--all-kernel::
        Configure all used events to run in kernel space.

-u::
--all-user::
        Configure all used events to run in user space.

REPORT OPTIONS
--------------
-k::
--vmlinux=<file>::
        vmlinux pathname

-v::
--verbose::
        Be more verbose (show counter open errors, etc).

-i::
--input::
        Specify the input file to process.

-N::
--node-info::
        Show extra node info in report (see NODE INFO section).

-c::
--coalesce::
        Specify sorting fields for single cacheline display.
        The following fields are available: tid,pid,iaddr,dso
        (see COALESCE).

-g::
--call-graph::
        Set up callchain parameters.
        Please refer to the perf-report man page for details.

--stdio::
        Force the stdio output (see STDIO OUTPUT).

--stats::
        Display only statistic tables and force stdio mode.

--full-symbols::
        Display full length of symbols.

--no-source::
        Do not display the Source:Line column.

--show-all::
        Show all captured HITM lines, with no regard to the HITM % 0.0005
        limit.

-f::
--force::
        Don't do ownership validation.

-d::
--display::
        Switch to HITM type (rmt, lcl) or peer snooping type (peer) to display
        and sort on. Total HITMs (tot) is the default, except on Arm64, which
        uses peer mode as its default.

--stitch-lbr::
        Show callgraph with stitched LBRs, which may produce a more complete
        call graph. The perf.data file must have been obtained using
        perf c2c record --call-graph lbr.
        Disabled by default. In common cases with call stack overflows,
        it can recreate better call stacks than the default lbr call stack
        output. But this approach is not foolproof. There can be cases
        where it creates incorrect call stacks from incorrect matches.
        Known limitations include exception handling: calls and returns
        around setjmp/longjmp will not match.

--double-cl::
        Group the detection of shared cacheline events into double cacheline
        granularity. Some architectures have an Adjacent Cacheline Prefetch
        feature, which causes cacheline sharing to behave as if the cacheline
        size were doubled.

-M::
--disassembler-style=::
        Set disassembler style for objdump.

--objdump=<path>::
        Path to objdump binary.

C2C RECORD
----------
The perf c2c record command sets up options related to HITM cacheline analysis
and calls the standard perf record command.

The following perf record options are configured by default
(check the perf record man page for details):

  -W,-d,--phys-data,--sample-cpu

Unless specified otherwise with the '-e' option, the following events are
monitored by default on Intel:

  cpu/mem-loads,ldlat=30/P
  cpu/mem-stores/P

the following on AMD:

  ibs_op//

and the following on PowerPC:

  cpu/mem-loads/
  cpu/mem-stores/

The user can pass any 'perf record' option behind the '--' mark, for example
(to enable callchains and system wide monitoring):

  $ perf c2c record -- -g -a

Please check the RECORD OPTIONS section for specific c2c record options.
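A complete session against a workload like the demo program shown earlier
could look as follows (the binary name is the illustrative one from that
sketch; the report options used here are described in the REPORT OPTIONS
section):

  # record with callchains and system wide monitoring
  $ perf c2c record -- -g -a ./false_sharing

  # report the contended cachelines on stdio, with extra node info,
  # coalescing offsets by process PID and code address
  $ perf c2c report -N -c pid,iaddr --stdio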
C2C REPORT
----------
The perf c2c report command displays the shared data analysis. It comes in two
display modes: stdio and tui (default).

The report command workflow is as follows:
- sort all the data based on the cacheline address
- store access details for each cacheline
- sort all cachelines based on user settings
- display data

In general the perf report output consists of 2 basic views:
1) most expensive cachelines list
2) offsets details for each cacheline

For each cacheline in the 1) list we display the following data
(both stdio and TUI modes follow the same fields output):

  Index
  - zero based index to identify the cacheline

  Cacheline
  - cacheline address (hex number)

  Rmt/Lcl Hitm (Display with HITM types)
  - cacheline percentage of all Remote/Local HITM accesses

  Peer Snoop (Display with peer type)
  - cacheline percentage of all peer accesses

  LLC Load Hitm - Total, LclHitm, RmtHitm (For display with HITM types)
  - count of Total/Local/Remote load HITMs

  Load Peer - Total, Local, Remote (For display with peer type)
  - count of Total/Local/Remote loads from peer cache or DRAM

  Total records
  - sum of all cacheline accesses

  Total loads
  - sum of all load accesses

  Total stores
  - sum of all store accesses

  Store Reference - L1Hit, L1Miss, N/A
    L1Hit - store accesses that hit L1
    L1Miss - store accesses that missed L1
    N/A - store accesses for which the memory level is not available

  Core Load Hit - FB, L1, L2
  - count of load hits in FB (Fill Buffer), L1 and L2 cache

  LLC Load Hit - LlcHit, LclHitm
  - count of LLC load accesses, includes LLC hits and LLC HITMs

  RMT Load Hit - RmtHit, RmtHitm
  - count of remote load accesses, includes remote hits and remote HITMs;
    on Arm neoverse cores, RmtHit is used to account remote accesses,
    including remote DRAM or any upward cache level in the remote node

  Load Dram - Lcl, Rmt
  - count of local and remote DRAM accesses

For each offset in the 2) list we display the following data:

  HITM - Rmt, Lcl (Display with HITM types)
  - % of Remote/Local HITM accesses for given offset within cacheline

  Peer Snoop - Rmt, Lcl (Display with peer type)
  - % of Remote/Local peer accesses for given offset within cacheline

  Store Refs - L1 Hit, L1 Miss, N/A
  - % of store accesses that hit L1, missed L1 and N/A (not available)
    memory level for given offset within cacheline

  Data address - Offset
  - offset address

  Pid
  - pid of the process responsible for the accesses

  Tid
  - tid of the process responsible for the accesses

  Code address
  - code address responsible for the accesses

  cycles - rmt hitm, lcl hitm, load (Display with HITM types)
  - sum of cycles for given accesses - Remote/Local HITM and generic load

  cycles - rmt peer, lcl peer, load (Display with peer type)
  - sum of cycles for given accesses - Remote/Local peer load and generic load

  cpu cnt
  - number of cpus that participated in the access

  Symbol
  - code symbol related to the 'Code address' value

  Shared Object
  - shared object name related to the 'Code address' value

  Source:Line
  - source information related to the 'Code address' value

  Node
  - nodes participating in the access (see NODE INFO section)

NODE INFO
---------
The 'Node' field displays the nodes that access the given cacheline
offset. Its output comes in 3 flavors:
- node IDs separated by ','
- node IDs with stats for each ID, in the following format:

    Node{cpus %hitms %stores} (Display with HITM types)
    Node{cpus %peers %stores} (Display with peer type)

- node IDs with the list of affected CPUs, in the following format:

    Node{cpu list}

The user can switch between the above flavors with the -N option or
use the 'n' key to interactively switch in TUI mode.

COALESCE
--------
The user can specify how to sort offsets for a cacheline.

The following fields are available and govern the final
set of output fields for the cacheline offsets output:

  tid   - coalesced by process TIDs
  pid   - coalesced by process PIDs
  iaddr - coalesced by code address; the following fields are displayed:
             Code address, Code symbol, Shared Object, Source line
  dso   - coalesced by shared object

By default the coalescing is set up with 'pid,iaddr'.

STDIO OUTPUT
------------
The stdio output displays data on standard output.

The following tables are displayed:

  Trace Event Information
  - overall statistics of memory accesses

  Global Shared Cache Line Event Information
  - overall statistics on shared cachelines

  Shared Data Cache Line Table
  - list of most expensive cachelines

  Shared Cache Line Distribution Pareto
  - list of all accessed offsets for each cacheline

TUI OUTPUT
----------
The TUI output provides an interactive interface to navigate
through the cachelines list and to display offset details.

For details please refer to the help window by pressing the '?' key.

CREDITS
-------
Although Don Zickus, Dick Fowles and Joe Mario worked together
to get this implemented, we got lots of early help from Arnaldo
Carvalho de Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.

C2C BLOG
--------
Check Joe's blog on the c2c tool for a detailed use-case explanation:
https://joemario.github.io/blog/2016/09/01/c2c-blog/

SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-mem[1], linkperf:perf-arm-spe[1]