This document provides "recipes", that is, litmus tests for commonly
occurring situations, as well as a few that illustrate subtly broken but
attractive nuisances. Many of these recipes include example code from
v5.7 of the Linux kernel.

The first section covers simple special cases, the second section
takes off the training wheels to cover more involved examples,
and the third section provides a few rules of thumb.


Simple special cases
====================

This section presents two simple special cases, the first being where
there is only one CPU or where only one memory location is accessed, and
the second being use of that old concurrency workhorse, locking.


Single CPU or single memory location
------------------------------------

If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order. There are (as
usual) some things to be careful of:

1.      Some aspects of the C language are unordered. For example,
        in the expression "f(x) + g(y)", the order in which f and g are
        called is not defined; the object code is allowed to use either
        order or even to interleave the computations.

2.      Compilers are permitted to use the "as-if" rule. That is, a
        compiler can emit whatever code it likes for normal accesses,
        as long as the results of a single-threaded execution appear
        just as if the compiler had followed all the relevant rules.
        To see this, compile with a high level of optimization and run
        the debugger on the resulting binary.

3.      If there is only one variable but multiple CPUs, that variable
        must be properly aligned and all accesses to that variable must
        be full sized. Variables that straddle cachelines or pages void
        your full-ordering warranty, as do undersized accesses that load
        from or store to only part of the variable.

4.      If there are multiple CPUs, accesses to shared variables should
        use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
        tearing, load/store fusing, and invented loads and stores.
        There are exceptions to this rule, including:

        i.      When there is no possibility of a given shared variable
                being updated by some other CPU, for example, while
                holding the update-side lock, reads from that variable
                need not use READ_ONCE() (see the sketch following
                this list).

        ii.     When there is no possibility of a given shared variable
                being either read or updated by other CPUs, for example,
                when running during early boot, reads from that variable
                need not use READ_ONCE() and writes to that variable
                need not use WRITE_ONCE().
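
As an illustration of exception (i), here is a sketch in which a counter
is modified only while holding a lock but is also read locklessly. The
lock, variable, and function names are invented for this example:

        /* Hypothetical example: counter changes only while mylock is held. */
        static DEFINE_SPINLOCK(mylock);
        static int counter;

        void increment_counter(void)
        {
                spin_lock(&mylock);
                /*
                 * Exception (i): no other CPU can update counter while
                 * mylock is held, so the read of counter can be a plain
                 * access. The WRITE_ONCE() is still needed because of
                 * the lockless reader below.
                 */
                WRITE_ONCE(counter, counter + 1);
                spin_unlock(&mylock);
        }

        /* May be called without holding mylock. */
        int read_counter(void)
        {
                /* Lockless reads need READ_ONCE() to avoid tearing and fusing. */
                return READ_ONCE(counter);
        }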


Locking
-------

[!] Note:
        locking.txt expands on this section, providing more detail on
        locklessly accessing lock-protected shared variables.

Locking is well-known and straightforward, at least if you don't think
about it too hard. And the basic rule is indeed quite simple: Any CPU that
has acquired a given lock sees any changes previously seen or made by any
CPU before it released that same lock. Note that this statement is a bit
stronger than "Any CPU holding a given lock sees all changes made by any
CPU during the time that CPU was holding this same lock". For example,
consider the following pair of code fragments:

        /* See MP+polocks.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                spin_lock(&mylock);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(y);
                spin_unlock(&mylock);
                r1 = READ_ONCE(x);
        }

The basic rule guarantees that if CPU0() acquires mylock before CPU1(),
then both r0 and r1 must be set to the value 1. This also has the
consequence that if the final value of r0 is equal to 1, then the final
value of r1 must also be equal to 1. In contrast, the weaker rule would
say nothing about the final value of r1.

The converse to the basic rule also holds, as illustrated by the
following litmus test:

        /* See MP+porevlocks.litmus. */
        void CPU0(void)
        {
                r0 = READ_ONCE(y);
                spin_lock(&mylock);
                r1 = READ_ONCE(x);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                spin_unlock(&mylock);
                WRITE_ONCE(y, 1);
        }

This converse to the basic rule guarantees that if CPU0() acquires
mylock before CPU1(), then both r0 and r1 must be set to the value 0.
This also has the consequence that if the final value of r1 is equal
to 0, then the final value of r0 must also be equal to 0. In contrast,
the weaker rule would say nothing about the final value of r0.

These examples show only a single pair of CPUs, but the effects of the
locking basic rule extend across multiple acquisitions of a given lock
across multiple CPUs.

However, it is not necessarily the case that accesses ordered by
locking will be seen as ordered by CPUs not holding that lock.
Consider this example:

        /* See Z6.0+pooncelock+pooncelock+pombonce.litmus. */
        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(y);
                WRITE_ONCE(z, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

Counter-intuitive though it might be, it is quite possible to have
the final value of r0 be 1, the final value of z be 2, and the final
value of r1 be 0. The reason for this surprising outcome is that
CPU2() never acquired the lock, and thus did not benefit from the
lock's ordering properties.

Ordering can be extended to CPUs not holding the lock by careful use
of smp_mb__after_spinlock():

        /* See Z6.0+pooncelock+poonceLock+pombonce.litmus. */
        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                smp_mb__after_spinlock();
                r0 = READ_ONCE(y);
                WRITE_ONCE(z, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

This addition of smp_mb__after_spinlock() strengthens the lock acquisition
sufficiently to rule out the counter-intuitive outcome.


Taking off the training wheels
==============================

This section looks at more complex examples, including message passing,
load buffering, release-acquire chains, and store buffering.
Many classes of litmus tests have abbreviated names, which may be found
here: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf


Message passing (MP)
--------------------

The MP pattern has one CPU execute a pair of stores to a pair of variables
and another CPU execute a pair of loads from this same pair of variables,
but in the opposite order. The goal is to avoid the counter-intuitive
outcome in which the first load sees the value written by the second store
but the second load does not see the value written by the first store.
In the absence of any ordering, this goal may not be met, as can be seen
in the MP+poonceonces.litmus litmus test. This section therefore looks at
a number of ways of meeting this goal.
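
For reference, the unordered version has roughly the following shape
(a sketch in the style of the examples below, not a verbatim copy of the
litmus test); nothing prevents the counter-intuitive outcome in which r0
ends up equal to 1 while r1 ends up equal to 0:

        /* A sketch of the unordered MP pattern; cf. MP+poonceonces.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                r1 = READ_ONCE(x);
        }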


Release and acquire
~~~~~~~~~~~~~~~~~~~

Use of smp_store_release() and smp_load_acquire() is one way to force
the desired MP ordering. The general approach is shown below:

        /* See MP+pooncerelease+poacquireonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                r1 = READ_ONCE(x);
        }

The smp_store_release() macro orders any prior accesses against the
store, while the smp_load_acquire() macro orders the load against any
subsequent accesses. Therefore, if the final value of r0 is the value 1,
the final value of r1 must also be the value 1.

The init_stack_slab() function in lib/stackdepot.c uses release-acquire
in this way to safely initialize a slab of the stack. Working out
the mutual-exclusion design is left as an exercise for the reader.


Assign and dereference
~~~~~~~~~~~~~~~~~~~~~~

Use of rcu_assign_pointer() and rcu_dereference() is quite similar to the
use of smp_store_release() and smp_load_acquire(), except that both
rcu_assign_pointer() and rcu_dereference() operate on RCU-protected
pointers. The general approach is shown below:

        /* See MP+onceassign+derefonce.litmus. */
        int z;
        int *y = &z;
        int x;

        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                rcu_assign_pointer(y, &x);
        }

        void CPU1(void)
        {
                rcu_read_lock();
                r0 = rcu_dereference(y);
                r1 = READ_ONCE(*r0);
                rcu_read_unlock();
        }

In this example, if the final value of r0 is &x then the final value of
r1 must be 1.

The rcu_assign_pointer() macro has the same ordering properties as does
smp_store_release(), but the rcu_dereference() macro orders the load only
against later accesses that depend on the value loaded. A dependency
is present if the value loaded determines the address of a later access
(address dependency, as shown above), the value written by a later store
(data dependency), or whether or not a later store is executed in the
first place (control dependency). Note that the term "data dependency"
is sometimes casually used to cover both address and data dependencies.

In lib/math/prime_numbers.c, the expand_to_next_prime() function invokes
rcu_assign_pointer(), and the next_prime_number() function invokes
rcu_dereference(). This combination mediates access to a bit vector
that is expanded as additional primes are needed.
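
The following greatly simplified sketch shows the shape of that idiom;
the structure and function names are invented for illustration and are
not the actual lib/math/prime_numbers.c code (reclamation of the old bit
vector is also omitted):

        struct bitmap {
                unsigned long max;      /* Largest value covered. */
                unsigned long bits[];   /* One bit per candidate value. */
        };

        static struct bitmap __rcu *published_map;

        /*
         * Publisher: *new must be fully initialized before this call;
         * rcu_assign_pointer() orders that initialization before the
         * publication of the pointer.
         */
        void publish_expanded_map(struct bitmap *new)
        {
                rcu_assign_pointer(published_map, new);
        }

        /* Reader: accesses that depend on p are ordered after the load of p. */
        bool value_is_set(unsigned long n)
        {
                struct bitmap *p;
                bool ret = false;

                rcu_read_lock();
                p = rcu_dereference(published_map);
                if (n <= p->max)
                        ret = test_bit(n, p->bits);
                rcu_read_unlock();
                return ret;
        }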


Write and read memory barriers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is usually better to use smp_store_release() instead of smp_wmb()
and to use smp_load_acquire() instead of smp_rmb(). However, the older
smp_wmb() and smp_rmb() APIs are still heavily used, so it is important
to understand their use cases. The general approach is shown below:

        /* See MP+fencewmbonceonce+fencermbonceonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_wmb();
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                smp_rmb();
                r1 = READ_ONCE(x);
        }

The smp_wmb() macro orders prior stores against later stores, and the
smp_rmb() macro orders prior loads against later loads. Therefore, if
the final value of r0 is 1, the final value of r1 must also be 1.

The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
the following write-side code fragment:

        log->l_curr_block -= log->l_logBBsize;
        ASSERT(log->l_curr_block >= 0);
        smp_wmb();
        log->l_curr_cycle++;

And the xlog_valid_lsn() function in fs/xfs/xfs_log_priv.h contains
the corresponding read-side code fragment:

        cur_cycle = READ_ONCE(log->l_curr_cycle);
        smp_rmb();
        cur_block = READ_ONCE(log->l_curr_block);

Alternatively, consider the following comment in function
perf_output_put_handle() in kernel/events/ring_buffer.c:

        *   kernel                             user
        *
        *   if (LOAD ->data_tail) {            LOAD ->data_head
        *                      (A)             smp_rmb()       (C)
        *      STORE $data                     LOAD $data
        *      smp_wmb()       (B)             smp_mb()        (D)
        *      STORE ->data_head               STORE ->data_tail
        *   }

The B/C pairing is an example of the MP pattern using smp_wmb() on the
write side and smp_rmb() on the read side.

Of course, given that smp_mb() is strictly stronger than either smp_wmb()
or smp_rmb(), any code fragment that would work with smp_rmb() and
smp_wmb() would also work with smp_mb() replacing either or both of the
weaker barriers.


Load buffering (LB)
-------------------

The LB pattern has one CPU load from one variable and then store to a
second, while another CPU loads from the second variable and then stores
to the first. The goal is to avoid the counter-intuitive situation where
each load reads the value written by the other CPU's store. In the
absence of any ordering it is quite possible that this may happen, as
can be seen in the LB+poonceonces.litmus litmus test.
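
For reference, that unordered version has roughly the following shape
(again a sketch, not a verbatim copy of the litmus test); with no
ordering, nothing prevents both r0 and r1 from ending up equal to 1:

        /* A sketch of the unordered LB pattern; cf. LB+poonceonces.litmus. */
        void CPU0(void)
        {
                r0 = READ_ONCE(x);
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r1 = READ_ONCE(y);
                WRITE_ONCE(x, 1);
        }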

One way of avoiding the counter-intuitive outcome is through the use of a
control dependency paired with a full memory barrier:

        /* See LB+fencembonceonce+ctrlonceonce.litmus. */
        void CPU0(void)
        {
                r0 = READ_ONCE(x);
                if (r0)
                        WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r1 = READ_ONCE(y);
                smp_mb();
                WRITE_ONCE(x, 1);
        }

This pairing of a control dependency in CPU0() with a full memory
barrier in CPU1() prevents r0 and r1 from both ending up equal to 1.

The A/D pairing from the ring-buffer use case shown earlier also
illustrates LB. Here is a repeat of the comment in
perf_output_put_handle() in kernel/events/ring_buffer.c, showing a
control dependency on the kernel side and a full memory barrier on
the user side:

        *   kernel                             user
        *
        *   if (LOAD ->data_tail) {            LOAD ->data_head
        *                      (A)             smp_rmb()       (C)
        *      STORE $data                     LOAD $data
        *      smp_wmb()       (B)             smp_mb()        (D)
        *      STORE ->data_head               STORE ->data_tail
        *   }
        *
        * Where A pairs with D, and B pairs with C.

The kernel's control dependency between the load from ->data_tail
and the store to data combined with the user's full memory barrier
between the load from data and the store to ->data_tail prevents
the counter-intuitive outcome where the kernel overwrites the data
before the user gets done loading it.


Release-acquire chains
----------------------

Release-acquire chains are a low-overhead, flexible, and easy-to-use
method of maintaining order. However, they do have some limitations that
need to be fully understood. Here is an example that maintains order:

        /* See ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                smp_store_release(&z, 1);
        }

        void CPU2(void)
        {
                r1 = smp_load_acquire(&z);
                r2 = READ_ONCE(x);
        }

In this case, if r0 and r1 both have final values of 1, then r2 must
also have a final value of 1.

The ordering in this example is stronger than it needs to be. For
example, ordering would still be preserved if CPU1()'s smp_load_acquire()
invocation was replaced with READ_ONCE().

It is tempting to assume that CPU0()'s store to x is globally ordered
before CPU1()'s store to z, but this is not the case:

        /* See Z6.0+pooncerelease+poacquirerelease+mbonceonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                smp_store_release(&z, 1);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

One might hope that if the final value of r0 is 1 and the final value
of z is 2, then the final value of r1 must also be 1, but it really is
possible for r1 to have the final value of 0. The reason, of course,
is that in this version, CPU2() is not part of the release-acquire chain.
This situation is accounted for in the rules of thumb below.

Despite this limitation, release-acquire chains are low-overhead as
well as simple and powerful, at least as memory-ordering mechanisms go.


Store buffering
---------------

Store buffering can be thought of as upside-down load buffering, so
that one CPU first stores to one variable and then loads from a second,
while another CPU stores to the second variable and then loads from the
first. Preserving order requires nothing less than full barriers:

        /* See SB+fencembonceonces.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_mb();
                r0 = READ_ONCE(y);
        }

        void CPU1(void)
        {
                WRITE_ONCE(y, 1);
                smp_mb();
                r1 = READ_ONCE(x);
        }

Omitting either smp_mb() will allow both r0 and r1 to have final
values of 0, but providing both full barriers as shown above prevents
this counter-intuitive outcome.

This pattern most famously appears as part of Dekker's locking
algorithm, but it has a much more practical use within the Linux kernel
of ordering wakeups. The following comment taken from waitqueue_active()
in include/linux/wait.h shows the canonical pattern:

        *      CPU0 - waker                    CPU1 - waiter
        *
        *                                      for (;;) {
        *      @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
        *      smp_mb();                         // smp_mb() from set_current_state()
        *      if (waitqueue_active(wq_head))    if (@cond)
        *        wake_up(wq_head);                 break;
        *                                        schedule();
        *                                      }
        *                                      finish_wait(&wq_head, &wait);

On CPU0, the store is to @cond and the load is in waitqueue_active().
On CPU1, prepare_to_wait() contains both a store to wq_head and a call
to set_current_state(), which contains an smp_mb() barrier; the load is
"if (@cond)". The full barriers prevent the undesirable outcome where
CPU1 puts the waiting task to sleep and CPU0 fails to wake it up.

Note that use of locking can greatly simplify this pattern.
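
To make the two columns of that comment concrete, here is a sketch of
the waker and the waiter written out as functions. The wait-queue and
flag names are invented for this example, and a real waker might obtain
its full barrier from some other operation rather than from an explicit
smp_mb():

        static DECLARE_WAIT_QUEUE_HEAD(my_wq_head);
        static int my_cond;

        void waker(void)
        {
                WRITE_ONCE(my_cond, 1);            /* The store. */
                smp_mb();                          /* Order store before load. */
                if (waitqueue_active(&my_wq_head)) /* The load. */
                        wake_up(&my_wq_head);
        }

        void waiter(void)
        {
                DEFINE_WAIT(wait);

                for (;;) {
                        /*
                         * prepare_to_wait() stores to the wait queue, and its
                         * call to set_current_state() supplies the full barrier.
                         */
                        prepare_to_wait(&my_wq_head, &wait, TASK_UNINTERRUPTIBLE);
                        if (READ_ONCE(my_cond))    /* The load. */
                                break;
                        schedule();
                }
                finish_wait(&my_wq_head, &wait);
        }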


Rules of thumb
==============

There might seem to be no pattern governing what ordering primitives are
needed in which situations, but this is not the case. There is a pattern
based on the relation between the accesses linking successive CPUs in a
given litmus test. There are three types of linkage:

1.      Write-to-read, where the next CPU reads the value that the
        previous CPU wrote. The LB litmus-test patterns contain only
        this type of relation. In formal memory-modeling texts, this
        relation is called "reads-from" and is usually abbreviated "rf".

2.      Read-to-write, where the next CPU overwrites the value that the
        previous CPU read. The SB litmus test contains only this type
        of relation. In formal memory-modeling texts, this relation is
        often called "from-reads" and is sometimes abbreviated "fr".

3.      Write-to-write, where the next CPU overwrites the value written
        by the previous CPU. The Z6.0 litmus test pattern contains a
        write-to-write relation between the last access of CPU1() and
        the first access of CPU2(). In formal memory-modeling texts,
        this relation is often called "coherence order" and is sometimes
        abbreviated "co". In the C++ standard, it is instead called
        "modification order" and often abbreviated "mo".

The strength of memory ordering required for a given litmus test to
avoid a counter-intuitive outcome depends on the types of relations
linking the memory accesses for the outcome in question:

o       If all links are write-to-read links, then the weakest
        possible ordering within each CPU suffices. For example, in
        the LB litmus test, a control dependency was enough to do the
        job.

o       If all but one of the links are write-to-read links, then a
        release-acquire chain suffices. Both the MP and the ISA2
        litmus tests illustrate this case.

o       If more than one of the links are something other than
        write-to-read links, then a full memory barrier is required
        between each successive pair of non-write-to-read links. This
        case is illustrated by the Z6.0 litmus tests, both in the
        locking and in the release-acquire sections.

However, if you find yourself having to stretch these rules of thumb
to fit your situation, you should consider creating a litmus test and
running it on the model.
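
For example, assuming the tools/memory-model directory of a recent
kernel source tree and an installed herd7 tool (see
tools/memory-model/README for setup details), the MP test mentioned
earlier can be checked as follows, with the "Observation" line of the
output reporting whether the exists clause's outcome is Never,
Sometimes, or Always reached:

        $ cd tools/memory-model
        $ herd7 -conf linux-kernel.cfg litmus-tests/MP+poonceonces.litmus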