This document provides "recipes", that is, litmus tests for commonly
occurring situations, as well as a few that illustrate subtly broken but
attractive nuisances. Many of these recipes include example code from
v5.7 of the Linux kernel.

The first section covers simple special cases, the second section
takes off the training wheels to cover more involved examples,
and the third section provides a few rules of thumb.


Simple special cases
====================

This section presents two simple special cases, the first being where
there is only one CPU or only one memory location is accessed, and the
second being use of that old concurrency workhorse, locking.


Single CPU or single memory location
------------------------------------

If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order. There are (as
usual) some things to be careful of:

1.  Some aspects of the C language are unordered. For example,
    in the expression "f(x) + g(y)", the order in which f and g are
    called is not defined; the object code is allowed to use either
    order or even to interleave the computations.

2.  Compilers are permitted to use the "as-if" rule. That is, a
    compiler can emit whatever code it likes for normal accesses,
    as long as the results of a single-threaded execution appear
    just as if the compiler had followed all the relevant rules.
    To see this, compile with a high level of optimization and run
    the debugger on the resulting binary.

3.  If there is only one variable but multiple CPUs, that variable
    must be properly aligned and all accesses to that variable must
    be full sized. Variables that straddle cachelines or pages void
    your full-ordering warranty, as do undersized accesses that load
    from or store to only part of the variable.

4.  If there are multiple CPUs, accesses to shared variables should
    use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
    tearing, load/store fusing, and invented loads and stores (a short
    sketch follows this list). There are exceptions to this rule,
    including:

    i.   When there is no possibility of a given shared variable
         being updated by some other CPU, for example, while
         holding the update-side lock, reads from that variable
         need not use READ_ONCE().

    ii.  When there is no possibility of a given shared variable
         being either read or updated by other CPUs, for example,
         when running during early boot, reads from that variable
         need not use READ_ONCE() and writes to that variable
         need not use WRITE_ONCE().
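
For concreteness, here is a minimal sketch of rule 4 in action. It is
hypothetical rather than taken from the kernel (shared_flag and do_work()
are made up), and it assumes the two functions may run concurrently on
different CPUs:

        /* Hypothetical sketch, not from the kernel source. */
        int shared_flag;                        /* written and read by multiple CPUs */

        void signal_cpu(void)
        {
                WRITE_ONCE(shared_flag, 1);     /* full-sized store, no tearing or
                                                 * invented intermediate stores */
        }

        void wait_cpu(void)
        {
                /* READ_ONCE() forbids load fusing: a plain "while (!shared_flag)"
                 * could legally be compiled to load shared_flag only once. */
                while (!READ_ONCE(shared_flag))
                        cpu_relax();
                do_work();                      /* hypothetical follow-on work */
        }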


Locking
-------

[!] Note:
        locking.txt expands on this section, providing more detail on
        locklessly accessing lock-protected shared variables.

Locking is well-known and straightforward, at least if you don't think
about it too hard. And the basic rule is indeed quite simple: Any CPU that
has acquired a given lock sees any changes previously seen or made by any
CPU before it released that same lock. Note that this statement is a bit
stronger than "Any CPU holding a given lock sees all changes made by any
CPU during the time that CPU was holding this same lock". For example,
consider the following pair of code fragments:

        /* See MP+polocks.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                spin_lock(&mylock);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(y);
                spin_unlock(&mylock);
                r1 = READ_ONCE(x);
        }

The basic rule guarantees that if CPU0() acquires mylock before CPU1(),
then both r0 and r1 must be set to the value 1. This also has the
consequence that if the final value of r0 is equal to 1, then the final
value of r1 must also be equal to 1. In contrast, the weaker rule would
say nothing about the final value of r1.

The converse to the basic rule also holds, as illustrated by the
following litmus test:

        /* See MP+porevlocks.litmus. */
        void CPU0(void)
        {
                r0 = READ_ONCE(y);
                spin_lock(&mylock);
                r1 = READ_ONCE(x);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                spin_unlock(&mylock);
                WRITE_ONCE(y, 1);
        }

This converse to the basic rule guarantees that if CPU0() acquires
mylock before CPU1(), then both r0 and r1 must be set to the value 0.
This also has the consequence that if the final value of r1 is equal
to 0, then the final value of r0 must also be equal to 0. In contrast,
the weaker rule would say nothing about the final value of r0.

These examples show only a single pair of CPUs, but the effects of the
locking basic rule extend across multiple acquisitions of a given lock
across multiple CPUs.
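
For instance, a third CPU could be added to the MP+polocks example
above. The following sketch is not itself one of the supplied litmus
tests, but it follows directly from the basic rule:

        /* Hedged sketch: a third CPU added to the MP+polocks example.
         * If CPU0(), CPU1(), and CPU2() acquire mylock in that order,
         * then r2 and r3 must both end up 1, just as r0 and r1 must. */
        void CPU2(void)
        {
                spin_lock(&mylock);
                r2 = READ_ONCE(y);
                spin_unlock(&mylock);
                r3 = READ_ONCE(x);
        }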

However, it is not necessarily the case that accesses ordered by
locking will be seen as ordered by CPUs not holding that lock.
Consider this example:

        /* See Z6.0+pooncelock+pooncelock+pombonce.litmus. */
        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(y);
                WRITE_ONCE(z, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

Counter-intuitive though it might be, it is quite possible to have
the final value of r0 be 1, the final value of z be 2, and the final
value of r1 be 0. The reason for this surprising outcome is that
CPU2() never acquired the lock, and thus did not benefit from the
lock's ordering properties.

Ordering can be extended to CPUs not holding the lock by careful use
of smp_mb__after_spinlock():

        /* See Z6.0+pooncelock+poonceLock+pombonce.litmus. */
        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                smp_mb__after_spinlock();
                r0 = READ_ONCE(y);
                WRITE_ONCE(z, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

This addition of smp_mb__after_spinlock() strengthens the lock acquisition
sufficiently to rule out the counter-intuitive outcome.


Taking off the training wheels
==============================

This section looks at more complex examples, including message passing,
load buffering, release-acquire chains, and store buffering.
Many classes of litmus tests have abbreviated names, which may be found
here: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf


Message passing (MP)
--------------------

The MP pattern has one CPU execute a pair of stores to a pair of variables
and another CPU execute a pair of loads from this same pair of variables,
but in the opposite order. The goal is to avoid the counter-intuitive
outcome in which the first load sees the value written by the second store
but the second load does not see the value written by the first store.
In the absence of any ordering, this goal may not be met, as can be seen
in the MP+poonceonces.litmus litmus test. This section therefore looks at
a number of ways of meeting this goal.
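
For reference, the unordered pattern looks roughly like the sketch below
(a paraphrase of what MP+poonceonces.litmus tests rather than a copy of
that file); with nothing stronger than the *_ONCE() accesses, the outcome
r0 == 1 && r1 == 0 is permitted:

        /* Sketch of the unordered MP pattern (cf. MP+poonceonces.litmus). */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                r1 = READ_ONCE(x);
        }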


Release and acquire
~~~~~~~~~~~~~~~~~~~

Use of smp_store_release() and smp_load_acquire() is one way to force
the desired MP ordering. The general approach is shown below:

        /* See MP+pooncerelease+poacquireonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                r1 = READ_ONCE(x);
        }

The smp_store_release() macro orders any prior accesses against the
store, while the smp_load_acquire() macro orders the load against any
subsequent accesses. Therefore, if the final value of r0 is the value 1,
the final value of r1 must also be the value 1.

The init_stack_slab() function in lib/stackdepot.c uses release-acquire
in this way to safely initialize a slab of the stack. Working out
the mutual-exclusion design is left as an exercise for the reader.


Assign and dereference
~~~~~~~~~~~~~~~~~~~~~~

Use of rcu_assign_pointer() and rcu_dereference() is quite similar to the
use of smp_store_release() and smp_load_acquire(), except that both
rcu_assign_pointer() and rcu_dereference() operate on RCU-protected
pointers. The general approach is shown below:

        /* See MP+onceassign+derefonce.litmus. */
        int z;
        int *y = &z;
        int x;

        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                rcu_assign_pointer(y, &x);
        }

        void CPU1(void)
        {
                rcu_read_lock();
                r0 = rcu_dereference(y);
                r1 = READ_ONCE(*r0);
                rcu_read_unlock();
        }

In this example, if the final value of r0 is &x then the final value of
r1 must be 1.

The rcu_assign_pointer() macro has the same ordering properties as does
smp_store_release(), but the rcu_dereference() macro orders the load only
against later accesses that depend on the value loaded. A dependency
is present if the value loaded determines the address of a later access
(address dependency, as shown above), the value written by a later store
(data dependency), or whether or not a later store is executed in the
first place (control dependency). Note that the term "data dependency"
is sometimes casually used to cover both address and data dependencies.
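
To make the three dependency types concrete, here is a hypothetical
sketch; it is not taken from the kernel, and gp, gval, gb, and gc are
made-up globals:

        /* Hypothetical sketch, not from the kernel source. */
        int gval, gb, gc;
        int *gp = &gval;
        int r0;

        void reader(void)
        {
                int *q;

                rcu_read_lock();
                q = rcu_dereference(gp);
                r0 = READ_ONCE(*q);        /* address dependency: the value of q
                                            * determines which location is loaded */
                WRITE_ONCE(gb, r0);        /* data dependency: the loaded value r0
                                            * determines the value being stored */
                if (r0)                    /* control dependency: r0 determines */
                        WRITE_ONCE(gc, 1); /* whether this store executes at all */
                rcu_read_unlock();
        }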

In lib/math/prime_numbers.c, the expand_to_next_prime() function invokes
rcu_assign_pointer(), and the next_prime_number() function invokes
rcu_dereference(). This combination mediates access to a bit vector
that is expanded as additional primes are needed.


Write and read memory barriers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is usually better to use smp_store_release() instead of smp_wmb()
and to use smp_load_acquire() instead of smp_rmb(). However, the older
smp_wmb() and smp_rmb() APIs are still heavily used, so it is important
to understand their use cases. The general approach is shown below:

        /* See MP+fencewmbonceonce+fencermbonceonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_wmb();
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                smp_rmb();
                r1 = READ_ONCE(x);
        }

The smp_wmb() macro orders prior stores against later stores, and the
smp_rmb() macro orders prior loads against later loads. Therefore, if
the final value of r0 is 1, the final value of r1 must also be 1.

The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
the following write-side code fragment:

        log->l_curr_block -= log->l_logBBsize;
        ASSERT(log->l_curr_block >= 0);
        smp_wmb();
        log->l_curr_cycle++;

And the xlog_valid_lsn() function in fs/xfs/xfs_log_priv.h contains
the corresponding read-side code fragment:

        cur_cycle = READ_ONCE(log->l_curr_cycle);
        smp_rmb();
        cur_block = READ_ONCE(log->l_curr_block);

Alternatively, consider the following comment in function
perf_output_put_handle() in kernel/events/ring_buffer.c:

        *   kernel                              user
        *
        *   if (LOAD ->data_tail) {             LOAD ->data_head
        *                      (A)              smp_rmb()       (C)
        *      STORE $data                      LOAD $data
        *      smp_wmb()       (B)              smp_mb()        (D)
        *      STORE ->data_head                STORE ->data_tail
        *   }

The B/C pairing is an example of the MP pattern using smp_wmb() on the
write side and smp_rmb() on the read side.

Of course, given that smp_mb() is strictly stronger than either smp_wmb()
or smp_rmb(), any code fragment that would work with smp_rmb() and
smp_wmb() would also work with smp_mb() replacing either or both of the
weaker barriers.


Load buffering (LB)
-------------------

The LB pattern has one CPU load from one variable and then store to a
second, while another CPU loads from the second variable and then stores
to the first. The goal is to avoid the counter-intuitive situation where
each load reads the value written by the other CPU's store. In the
absence of any ordering it is quite possible that this may happen, as
can be seen in the LB+poonceonces.litmus litmus test.
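
For reference, the unordered pattern looks roughly like the sketch below
(a paraphrase of what LB+poonceonces.litmus tests rather than a copy of
that file); with only the *_ONCE() accesses, the outcome
r0 == 1 && r1 == 1 is permitted:

        /* Sketch of the unordered LB pattern (cf. LB+poonceonces.litmus). */
        void CPU0(void)
        {
                r0 = READ_ONCE(x);
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r1 = READ_ONCE(y);
                WRITE_ONCE(x, 1);
        }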

One way of avoiding the counter-intuitive outcome is through the use of a
control dependency paired with a full memory barrier:

        /* See LB+fencembonceonce+ctrlonceonce.litmus. */
        void CPU0(void)
        {
                r0 = READ_ONCE(x);
                if (r0)
                        WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r1 = READ_ONCE(y);
                smp_mb();
                WRITE_ONCE(x, 1);
        }

This pairing of a control dependency in CPU0() with a full memory
barrier in CPU1() prevents r0 and r1 from both ending up equal to 1.

The A/D pairing from the ring-buffer use case shown earlier also
illustrates LB. Here is a repeat of the comment in
perf_output_put_handle() in kernel/events/ring_buffer.c, showing a
control dependency on the kernel side and a full memory barrier on
the user side:

        *   kernel                              user
        *
        *   if (LOAD ->data_tail) {             LOAD ->data_head
        *                      (A)              smp_rmb()       (C)
        *      STORE $data                      LOAD $data
        *      smp_wmb()       (B)              smp_mb()        (D)
        *      STORE ->data_head                STORE ->data_tail
        *   }
        *
        * Where A pairs with D, and B pairs with C.

The kernel's control dependency between the load from ->data_tail
and the store to data combined with the user's full memory barrier
between the load from data and the store to ->data_tail prevents
the counter-intuitive outcome where the kernel overwrites the data
before the user gets done loading it.


Release-acquire chains
----------------------

Release-acquire chains are a low-overhead, flexible, and easy-to-use
method of maintaining order. However, they do have some limitations that
need to be fully understood. Here is an example that maintains order:

        /* See ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                smp_store_release(&z, 1);
        }

        void CPU2(void)
        {
                r1 = smp_load_acquire(&z);
                r2 = READ_ONCE(x);
        }

In this case, if r0 and r1 both have final values of 1, then r2 must
also have a final value of 1.

The ordering in this example is stronger than it needs to be. For
example, ordering would still be preserved if CPU1()'s smp_load_acquire()
invocation was replaced with READ_ONCE().
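
Concretely, that weaker but still sufficient variant of CPU1() would look
like the following sketch, with CPU0() and CPU2() unchanged:

        /* CPU1() with its acquire load relaxed to READ_ONCE(), as
         * described above; ordering is still preserved. */
        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                smp_store_release(&z, 1);
        }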

It is tempting to assume that CPU0()'s store to x is globally ordered
before CPU1()'s store to z, but this is not the case:

        /* See Z6.0+pooncerelease+poacquirerelease+mbonceonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                smp_store_release(&z, 1);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

One might hope that if the final value of r0 is 1 and the final value
of z is 2, then the final value of r1 must also be 1, but it really is
possible for r1 to have the final value of 0. The reason, of course,
is that in this version, CPU2() is not part of the release-acquire chain.
This situation is accounted for in the rules of thumb below.

Despite this limitation, release-acquire chains are low-overhead as
well as simple and powerful, at least as memory-ordering mechanisms go.


Store buffering
---------------

Store buffering can be thought of as upside-down load buffering, so
that one CPU first stores to one variable and then loads from a second,
while another CPU stores to the second variable and then loads from the
first. Preserving order requires nothing less than full barriers:

        /* See SB+fencembonceonces.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_mb();
                r0 = READ_ONCE(y);
        }

        void CPU1(void)
        {
                WRITE_ONCE(y, 1);
                smp_mb();
                r1 = READ_ONCE(x);
        }

Omitting either smp_mb() will allow both r0 and r1 to have final
values of 0, but providing both full barriers as shown above prevents
this counter-intuitive outcome.
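
For concreteness, here is CPU0() with its smp_mb() omitted, which is the
weakening the previous paragraph warns about; CPU1() is unchanged:

        /* CPU0() from the SB example with its smp_mb() omitted; the
         * outcome r0 == 0 && r1 == 0 is now permitted. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                r0 = READ_ONCE(y);
        }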

This pattern most famously appears as part of Dekker's locking
algorithm, but it has a much more practical use within the Linux kernel,
namely ordering wakeups. The following comment taken from waitqueue_active()
in include/linux/wait.h shows the canonical pattern:

        *      CPU0 - waker                     CPU1 - waiter
        *
        *                                       for (;;) {
        *      @cond = true;                      prepare_to_wait(&wq_head, &wait, state);
        *      smp_mb();                          // smp_mb() from set_current_state()
        *      if (waitqueue_active(wq_head))     if (@cond)
        *        wake_up(wq_head);                  break;
        *                                          schedule();
        *                                       }
        *                                       finish_wait(&wq_head, &wait);

On CPU0, the store is to @cond and the load is in waitqueue_active().
On CPU1, prepare_to_wait() contains both a store to wq_head and a call
to set_current_state(), which contains an smp_mb() barrier; the load is
"if (@cond)". The full barriers prevent the undesirable outcome where
CPU1 puts the waiting task to sleep and CPU0 fails to wake it up.

Note that use of locking can greatly simplify this pattern.


Rules of thumb
==============

There might seem to be no pattern governing what ordering primitives are
needed in which situations, but this is not the case. There is a pattern
based on the relation between the accesses linking successive CPUs in a
given litmus test. There are three types of linkage, illustrated together
in the sketch that follows this list:

1.  Write-to-read, where the next CPU reads the value that the
    previous CPU wrote. The LB litmus-test patterns contain only
    this type of relation. In formal memory-modeling texts, this
    relation is called "reads-from" and is usually abbreviated "rf".

2.  Read-to-write, where the next CPU overwrites the value that the
    previous CPU read. The SB litmus test contains only this type
    of relation. In formal memory-modeling texts, this relation is
    often called "from-reads" and is sometimes abbreviated "fr".

3.  Write-to-write, where the next CPU overwrites the value written
    by the previous CPU. The Z6.0 litmus test pattern contains a
    write-to-write relation between the last access of CPU1() and
    the first access of CPU2(). In formal memory-modeling texts,
    this relation is often called "coherence order" and is sometimes
    abbreviated "co". In the C++ standard, it is instead called
    "modification order" and often abbreviated "mo".

The strength of memory ordering required for a given litmus test to
avoid a counter-intuitive outcome depends on the types of relations
linking the memory accesses for the outcome in question:

o   If all links are write-to-read links, then the weakest
    possible ordering within each CPU suffices. For example, in
    the LB litmus test, a control dependency was enough to do the
    job.

o   If all but one of the links are write-to-read links, then a
    release-acquire chain suffices. Both the MP and the ISA2
    litmus tests illustrate this case.

o   If more than one of the links are something other than
    write-to-read links, then a full memory barrier is required
    between each successive pair of non-write-to-read links. This
    case is illustrated by the Z6.0 litmus tests, both in the
    locking and in the release-acquire sections.

However, if you find yourself having to stretch these rules of thumb
to fit your situation, you should consider creating a litmus test and
running it on the model.