perf-amd-ibs(1)
===============

NAME
----
perf-amd-ibs - Support for AMD Instruction-Based Sampling (IBS) with perf tool

SYNOPSIS
--------
[verse]
'perf record' -e ibs_op//
'perf record' -e ibs_fetch//

DESCRIPTION
-----------

Instruction-Based Sampling (IBS) provides precise Instruction Pointer (IP)
profiling support on AMD platforms. IBS has two independent components: IBS
Op and IBS Fetch. IBS Op sampling provides information about instruction
execution (micro-op execution, to be precise) with details like d-cache
hit/miss, d-TLB hit/miss, cache miss latency, load/store data source, branch
behavior etc. IBS Fetch sampling provides information about instruction fetch
with details like i-cache hit/miss, i-TLB hit/miss, fetch latency etc. IBS is
per SMT thread, i.e. each SMT hardware thread contains standalone IBS units.
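
Whether the hardware advertises IBS can be checked from the CPU flags before
profiling. A minimal sketch (the "ibs" flag name in /proc/cpuinfo is an
assumption based on common x86 kernels):

	# grep -m1 -wo ibs /proc/cpuinfo
	ibs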

Both IBS Op and IBS Fetch are exposed as PMUs by Linux and can be exploited
using the Linux perf utility. The following files will be created at boot time
if IBS is supported by the hardware and kernel.

	/sys/bus/event_source/devices/ibs_op/
	/sys/bus/event_source/devices/ibs_fetch/

IBS Op PMU supports two events: cycles and micro ops. IBS Fetch PMU supports
one event: fetch ops.
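
The event modifiers accepted inside the // of each PMU (cnt_ctl, l3missonly,
ldlat, rand_en, as used in the examples below) are advertised through sysfs.
A minimal way to inspect them (directory contents vary with kernel version
and hardware generation):

	# ls /sys/bus/event_source/devices/ibs_op/format/
	# ls /sys/bus/event_source/devices/ibs_fetch/format/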

IBS PMUs do not have user/kernel filtering capability and thus require
CAP_SYS_ADMIN or CAP_PERFMON privilege.

IBS VS. REGULAR CORE PMU
------------------------

IBS gives samples with a precise IP, i.e. the IP recorded with an IBS
sample has no skid, whereas the IP recorded by the regular core PMU will
have some skid (the sample was generated at IP X but perf would record it
at IP X+n). Hence, the regular core PMU might not help for profiling with
instruction-level precision. Further, IBS provides additional information
about the sample in question. On the other hand, the regular core PMU has
its own advantages, like a plethora of events, counting mode (less
interference), up to 6 parallel counters, event grouping support,
filtering capabilities etc.
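
As a contrast with IBS sampling, the core PMU's counting mode and event
grouping mentioned above can be exercised with perf stat (a generic sketch,
not IBS-specific; the event list and workload are arbitrary):

	# perf stat -e '{cycles,instructions,branches}' -- make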

Three regular core PMU events are internally forwarded to the IBS Op PMU when
the precise_ip attribute is set (see the example after this list):

	-e cpu-cycles:p becomes -e ibs_op//
	-e r076:p       becomes -e ibs_op//
	-e r0C1:p       becomes -e ibs_op/cnt_ctl=1/
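
For instance, per the mapping above, the following invocation is internally
handled by the IBS Op PMU on AMD hardware and behaves like -e ibs_op//:

	# perf record -e cpu-cycles:p -c 100000 -a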

EXAMPLES
--------

IBS Op PMU
~~~~~~~~~~

System-wide profile, cycles event, sampling period: 100000

	# perf record -e ibs_op// -c 100000 -a

Per-cpu profile (cpu10), cycles event, sampling period: 100000

	# perf record -e ibs_op// -c 100000 -C 10

Per-cpu profile (cpu10), cycles event, sampling freq: 1000

	# perf record -e ibs_op// -F 1000 -C 10

System-wide profile, uOps event, sampling period: 100000

	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a

Same command, but also capture IBS register raw dump along with perf sample:

	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a --raw-samples

System-wide profile, uOps event, sampling period: 100000, L3MissOnly (Zen4 onward)

	# perf record -e ibs_op/cnt_ctl=1,l3missonly=1/ -c 100000 -a

System-wide profile, cycles event, sampling period: 100000, LdLat filtering (Zen5
onward)

	# perf record -e ibs_op/ldlat=128/ -c 100000 -a

Supported load latency threshold values are 128 to 2048 (both inclusive).
A latency value that is a multiple of 128 incurs slightly less profiling
overhead than other values.

Per process (upstream v6.2 onward), uOps event, sampling period: 100000

	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -p 1234

Per process (upstream v6.2 onward), uOps event, sampling period: 100000

	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -- ls

To analyse the recorded profile in aggregate mode

	# perf report
	/* Select a line and press 'a' to drill down at instruction level. */

To go over each sample

	# perf script
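
perf script can also limit the per-sample output to selected fields with -F
(a sketch; any fields supported by perf-script may be chosen):

	# perf script -F comm,pid,ip,sym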

Raw dump of IBS registers when profiled with --raw-samples

	# perf report -D
	/* Look for PERF_RECORD_SAMPLE */

Example register raw dump:

	ibs_op_ctl:   000002c30006186a MaxCnt 100000 L3MissOnly 0 En 1
	              Val 1 CntCtl 0=cycles CurCnt 707
	IbsOpRip:     ffffffff8204aea7
	ibs_op_data:  0000010002550001 CompToRetCtr 1 TagToRetCtr 597
	              BrnRet 0 RipInvalid 0 BrnFuse 0 Microcode 1
	ibs_op_data2: 0000000000000013 RmtNode 1 DataSrc 3=DRAM
	ibs_op_data3: 0000000031960092 LdOp 0 StOp 1 DcL1TlbMiss 0
	              DcL2TlbMiss 0 DcL1TlbHit2M 1 DcL1TlbHit1G 0 DcL2TlbHit2M 0
	              DcMiss 1 DcMisAcc 0 DcWcMemAcc 0 DcUcMemAcc 0 DcLockedOp 0
	              DcMissNoMabAlloc 0 DcLinAddrValid 1 DcPhyAddrValid 1
	              DcL2TlbHit1G 0 L2Miss 1 SwPf 0 OpMemWidth 32 bytes
	              OpDcMissOpenMemReqs 12 DcMissLat 0 TlbRefillLat 0
	IbsDCLinAd:   ff110008a5398920
	IbsDCPhysAd:  00000008a5398920

IBS applied in a real world usecase

~90% regression was observed in tbench with a specific scheduler hint,
which was counter-intuitive. IBS profiles of the good and bad runs,
captured using perf, helped in identifying the exact cause of the problem:

https://lore.kernel.org/r/[email protected]

IBS Fetch PMU
~~~~~~~~~~~~~

Similar commands can be used with the Fetch PMU as well.

System-wide profile, fetch ops event, sampling period: 100000

	# perf record -e ibs_fetch// -c 100000 -a

System-wide profile, fetch ops event, sampling period: 100000, Random enable

	# perf record -e ibs_fetch/rand_en=1/ -c 100000 -a

Random enable adds a small degree of variability to the sample period. This
helps in cases like long-running loops where the PMU would otherwise tag the
same instruction over and over because of the fixed sample period.

etc.

PERF MEM AND PERF C2C
---------------------

perf mem is a memory access profiler tool and perf c2c is a shared data
cacheline analyser tool. Both of them internally use the IBS Op PMU on AMD.
Below is a simple example of the perf mem tool.

	# perf mem record -c 100000 -- make
	# perf mem report

A normal perf mem report output will provide a detailed memory access profile.
New output fields will show related access info together. For example:

	# perf mem report -F overhead,cache,snoop,comm
	...
	# Samples: 92K of event 'ibs_op//'
	# Total weight : 531104
	#
	#            ---------- Cache -----------    --- Snoop ----
	# Overhead      L1      L2  L1-buf   Other     HitM   Other  Command
	# ........  ............................  ..............  ..........
	#
	    76.07%    5.8%   35.7%    0.0%   34.6%    23.3%   52.8%  cc1
	     5.79%    0.2%    0.0%    0.0%    5.6%     0.1%    5.7%  make
	     5.78%    0.1%    4.4%    0.0%    1.2%     0.5%    5.3%  gcc
	     5.33%    0.3%    3.9%    0.0%    1.1%     0.2%    5.2%  as
	     5.00%    0.1%    3.8%    0.0%    1.0%     0.3%    4.7%  sh
	     1.56%    0.1%    0.1%    0.0%    1.4%     0.6%    0.9%  ld
	     0.28%    0.1%    0.0%    0.0%    0.2%     0.1%    0.2%  pkg-config
	     0.09%    0.0%    0.0%    0.0%    0.1%     0.0%    0.1%  git
	     0.03%    0.0%    0.0%    0.0%    0.0%     0.0%    0.0%  rm
	...

Also, it can be aggregated based on various memory access info using the
sort keys. For example:

	# perf mem report -s mem,snoop
	...
	# Samples: 92K of event 'ibs_op//'
	# Total weight : 531104
	# Sort order   : mem,snoop
	#
	# Overhead       Samples  Memory access                  Snoop
	# ........  ............  .............................  ............
	#
	    47.99%          1509  L2 hit                         N/A
	    25.08%           338  core, same node Any cache hit  HitM
	    10.24%         54374  N/A                            N/A
	     6.77%         35938  L1 hit                         N/A
	     6.39%           101  core, same node Any cache hit  N/A
	     3.50%            69  RAM hit                        N/A
	     0.03%           158  LFB/MAB hit                    N/A
	     0.00%             2  Uncached hit                   N/A
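
perf c2c follows the same record/report flow and can be driven in a similar
way; a minimal sketch (recording options beyond the workload are omitted):

	# perf c2c record -- make
	# perf c2c report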

Please refer to their man pages for more detail.

SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-mem[1], linkperf:perf-c2c[1]