perf-amd-ibs(1)
===============

NAME
----
perf-amd-ibs - Support for AMD Instruction-Based Sampling (IBS) with perf tool

SYNOPSIS
--------
[verse]
'perf record' -e ibs_op//
'perf record' -e ibs_fetch//

DESCRIPTION
-----------

Instruction-Based Sampling (IBS) provides precise Instruction Pointer (IP)
profiling support on AMD platforms. IBS has two independent components: IBS
Op and IBS Fetch. IBS Op sampling provides information about instruction
execution (micro-op execution to be precise) with details like d-cache
hit/miss, d-TLB hit/miss, cache miss latency, load/store data source, branch
behavior etc. IBS Fetch sampling provides information about instruction fetch
with details like i-cache hit/miss, i-TLB hit/miss, fetch latency etc. IBS is
per-smt-thread, i.e. each SMT hardware thread contains standalone IBS units.

Both IBS Op and IBS Fetch are exposed as PMUs by Linux and can be used
through the Linux perf utility. The following files will be created at boot
time if IBS is supported by the hardware and kernel.

        /sys/bus/event_source/devices/ibs_op/
        /sys/bus/event_source/devices/ibs_fetch/

IBS Op PMU supports two events: cycles and micro ops. IBS Fetch PMU supports
one event: fetch ops.

IBS PMUs do not have user/kernel filtering capability and thus require
CAP_SYS_ADMIN or CAP_PERFMON privilege.

IBS VS. REGULAR CORE PMU
------------------------

IBS gives samples with precise IP, i.e. the IP recorded with an IBS sample has
no skid, whereas the IP recorded by regular core PMU will have some skid
(sample was generated at IP X but perf would record it at IP X+n). Hence,
regular core PMU might not help for profiling with instruction level
precision. Further, IBS provides additional information about the sample in
question.
On the other hand, regular core PMU has its own advantages like a
plethora of events, counting mode (less interference), up to 6 parallel
counters, event grouping support, filtering capabilities etc.

Three regular core PMU events are internally forwarded to IBS Op PMU when
the precise_ip attribute is set:

        -e cpu-cycles:p becomes -e ibs_op//
        -e r076:p becomes -e ibs_op//
        -e r0C1:p becomes -e ibs_op/cnt_ctl=1/

EXAMPLES
--------

IBS Op PMU
~~~~~~~~~~

System-wide profile, cycles event, sampling period: 100000

        # perf record -e ibs_op// -c 100000 -a

Per-cpu profile (cpu10), cycles event, sampling period: 100000

        # perf record -e ibs_op// -c 100000 -C 10

Per-cpu profile (cpu10), cycles event, sampling freq: 1000

        # perf record -e ibs_op// -F 1000 -C 10

System-wide profile, uOps event, sampling period: 100000

        # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a

Same command, but also capture IBS register raw dump along with perf sample:

        # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a --raw-samples

System-wide profile, uOps event, sampling period: 100000, L3MissOnly (Zen4 onward)

        # perf record -e ibs_op/cnt_ctl=1,l3missonly=1/ -c 100000 -a

System-wide profile, cycles event, sampling period: 100000, LdLat filtering (Zen5
onward)

        # perf record -e ibs_op/ldlat=128/ -c 100000 -a

        Supported load latency threshold values are 128 to 2048 (both inclusive).
        A latency value that is a multiple of 128 incurs slightly less profiling
        overhead than other values.

Per process (upstream v6.2 onward), uOps event, sampling period: 100000,
attaching to an existing process (pid 1234)

        # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -p 1234

Per process (upstream v6.2 onward), uOps event, sampling period: 100000,
profiling a workload launched by perf

        # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -- ls

To analyse recorded profile in aggregate mode

        # perf report
        /* Select a line and press 'a' to drill down at instruction level. */

To go over each sample

        # perf script

Raw dump of IBS registers when profiled with --raw-samples

        # perf report -D
        /* Look for PERF_RECORD_SAMPLE */

        Example register raw dump:

        ibs_op_ctl:     000002c30006186a MaxCnt 100000 L3MissOnly 0 En 1
                Val 1 CntCtl 0=cycles CurCnt 707
        IbsOpRip:       ffffffff8204aea7
        ibs_op_data:    0000010002550001 CompToRetCtr 1 TagToRetCtr 597
                BrnRet 0 RipInvalid 0 BrnFuse 0 Microcode 1
        ibs_op_data2:   0000000000000013 RmtNode 1 DataSrc 3=DRAM
        ibs_op_data3:   0000000031960092 LdOp 0 StOp 1 DcL1TlbMiss 0
                DcL2TlbMiss 0 DcL1TlbHit2M 1 DcL1TlbHit1G 0 DcL2TlbHit2M 0
                DcMiss 1 DcMisAcc 0 DcWcMemAcc 0 DcUcMemAcc 0 DcLockedOp 0
                DcMissNoMabAlloc 0 DcLinAddrValid 1 DcPhyAddrValid 1
                DcL2TlbHit1G 0 L2Miss 1 SwPf 0 OpMemWidth 32 bytes
                OpDcMissOpenMemReqs 12 DcMissLat 0 TlbRefillLat 0
        IbsDCLinAd:     ff110008a5398920
        IbsDCPhysAd:    00000008a5398920

IBS applied in a real world use case

        A ~90% regression was observed in tbench with a specific scheduler
        hint, which was counterintuitive. IBS profiles of the good and bad
        runs, captured using perf, helped identify the exact cause of the
        problem:

        https://lore.kernel.org/r/[email protected]

IBS Fetch PMU
~~~~~~~~~~~~~

Similar commands can be used with Fetch PMU as well.

System-wide profile, fetch ops event, sampling period: 100000

        # perf record -e ibs_fetch// -c 100000 -a

System-wide profile, fetch ops event, sampling period: 100000, Random enable

        # perf record -e ibs_fetch/rand_en=1/ -c 100000 -a

        Random enable adds a small degree of variability to the sample period.
        This helps in cases like long running loops where the PMU is tagging the
        same instruction over and over because of a fixed sample period.

etc.

PERF MEM AND PERF C2C
---------------------

perf mem is a memory access profiler tool and perf c2c is a shared data
cacheline analyser tool.
Both of them internally use IBS Op PMU on AMD.
Below is a simple example of the perf mem tool.

        # perf mem record -c 100000 -- make
        # perf mem report

A normal perf mem report output will provide a detailed memory access profile.
New output fields will show related access info together. For example:

        # perf mem report -F overhead,cache,snoop,comm
        ...
        # Samples: 92K of event 'ibs_op//'
        # Total weight : 531104
        #
        #           ---------- Cache -----------  --- Snoop ----
        # Overhead       L1     L2 L1-buf  Other     HitM  Other  Command
        # ........  ............................  ..............  ..........
        #
            76.07%     5.8%  35.7%   0.0%  34.6%    23.3%  52.8%  cc1
             5.79%     0.2%   0.0%   0.0%   5.6%     0.1%   5.7%  make
             5.78%     0.1%   4.4%   0.0%   1.2%     0.5%   5.3%  gcc
             5.33%     0.3%   3.9%   0.0%   1.1%     0.2%   5.2%  as
             5.00%     0.1%   3.8%   0.0%   1.0%     0.3%   4.7%  sh
             1.56%     0.1%   0.1%   0.0%   1.4%     0.6%   0.9%  ld
             0.28%     0.1%   0.0%   0.0%   0.2%     0.1%   0.2%  pkg-config
             0.09%     0.0%   0.0%   0.0%   0.1%     0.0%   0.1%  git
             0.03%     0.0%   0.0%   0.0%   0.0%     0.0%   0.0%  rm
        ...

Also, it can be aggregated based on various memory access info using the
sort keys. For example:

        # perf mem report -s mem,snoop
        ...
        # Samples: 92K of event 'ibs_op//'
        # Total weight : 531104
        # Sort order : mem,snoop
        #
        # Overhead       Samples  Memory access                            Snoop
        # ........  ............  .......................................  ............
        #
            47.99%          1509  L2 hit                                   N/A
            25.08%           338  core, same node Any cache hit            HitM
            10.24%         54374  N/A                                      N/A
             6.77%         35938  L1 hit                                   N/A
             6.39%           101  core, same node Any cache hit            N/A
             3.50%            69  RAM hit                                  N/A
             0.03%           158  LFB/MAB hit                              N/A
             0.00%             2  Uncached hit                             N/A

Please refer to their man pages for more detail.

SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-mem[1], linkperf:perf-c2c[1]