Explanation of the Linux-Kernel Memory Consistency Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:Author: Alan Stern <[email protected]>
:Created: October 2017

.. Contents

  1. INTRODUCTION
  2. BACKGROUND
  3. A SIMPLE EXAMPLE
  4. A SELECTION OF MEMORY MODELS
  5. ORDERING AND CYCLES
  6. EVENTS
  7. THE PROGRAM ORDER RELATION: po AND po-loc
  8. A WARNING
  9. DEPENDENCY RELATIONS: data, addr, and ctrl
  10. THE READS-FROM RELATION: rf, rfi, and rfe
  11. CACHE COHERENCE AND THE COHERENCE ORDER RELATION: co, coi, and coe
  12. THE FROM-READS RELATION: fr, fri, and fre
  13. AN OPERATIONAL MODEL
  14. PROPAGATION ORDER RELATION: cumul-fence
  15. DERIVATION OF THE LKMM FROM THE OPERATIONAL MODEL
  16. SEQUENTIAL CONSISTENCY PER VARIABLE
  17. ATOMIC UPDATES: rmw
  18. THE PRESERVED PROGRAM ORDER RELATION: ppo
  19. AND THEN THERE WAS ALPHA
  20. THE HAPPENS-BEFORE RELATION: hb
  21. THE PROPAGATES-BEFORE RELATION: pb
  22. RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb
  23. SRCU READ-SIDE CRITICAL SECTIONS
  24. LOCKING
  25. PLAIN ACCESSES AND DATA RACES
  26. ODDS AND ENDS


INTRODUCTION
------------

The Linux-kernel memory consistency model (LKMM) is rather complex and
obscure.  This is particularly evident if you read through the
linux-kernel.bell and linux-kernel.cat files that make up the formal
version of the model; they are extremely terse and their meanings are
far from clear.

This document describes the ideas underlying the LKMM.  It is meant
for people who want to understand how the model was designed.  It does
not go into the details of the code in the .bell and .cat files;
rather, it explains in English what the code expresses symbolically.

Sections 2 (BACKGROUND) through 5 (ORDERING AND CYCLES) are aimed
toward beginners; they explain what memory consistency models are and
the basic notions shared by all such models.  People already familiar
with these concepts can skim or skip over them.  Sections 6 (EVENTS)
through 12 (THE FROM-READS RELATION) describe the fundamental
relations used in many models.  Starting in Section 13 (AN OPERATIONAL
MODEL), the workings of the LKMM itself are covered.

Warning: The code examples in this document are not written in the
proper format for litmus tests.  They don't include a header line, the
initializations are not enclosed in braces, the global variables are
not passed by pointers, and they don't have an "exists" clause at the
end.  Converting them to the right format is left as an exercise for
the reader.


BACKGROUND
----------

A memory consistency model (or just memory model, for short) is
something which predicts, given a piece of computer code running on a
particular kind of system, what values may be obtained by the code's
load instructions.  The LKMM makes these predictions for code running
as part of the Linux kernel.

In practice, people tend to use memory models the other way around.
That is, given a piece of code and a collection of values specified
for the loads, the model will predict whether it is possible for the
code to run in such a way that the loads will indeed obtain the
specified values.  Of course, this is just another way of expressing
the same idea.

For code running on a uniprocessor system, the predictions are easy:
Each load instruction must obtain the value written by the most recent
store instruction accessing the same location (we ignore complicating
factors such as DMA and mixed-size accesses.)
But on multiprocessor
systems, with multiple CPUs making concurrent accesses to shared
memory locations, things aren't so simple.

Different architectures have differing memory models, and the Linux
kernel supports a variety of architectures.  The LKMM has to be fairly
permissive, in the sense that any behavior allowed by one of these
architectures also has to be allowed by the LKMM.


A SIMPLE EXAMPLE
----------------

Here is a simple example to illustrate the basic concepts.  Consider
some code running as part of a device driver for an input device.  The
driver might contain an interrupt handler which collects data from the
device, stores it in a buffer, and sets a flag to indicate the buffer
is full.  Running concurrently on a different CPU might be a part of
the driver code being executed by a process in the midst of a read(2)
system call.  This code tests the flag to see whether the buffer is
ready, and if it is, copies the data back to userspace.  The buffer
and the flag are memory locations shared between the two CPUs.

We can abstract out the important pieces of the driver code as follows
(the reason for using WRITE_ONCE() and READ_ONCE() instead of simple
assignment statements is discussed later):

	int buf = 0, flag = 0;

	P0()
	{
		WRITE_ONCE(buf, 1);
		WRITE_ONCE(flag, 1);
	}

	P1()
	{
		int r1;
		int r2 = 0;

		r1 = READ_ONCE(flag);
		if (r1)
			r2 = READ_ONCE(buf);
	}

Here the P0() function represents the interrupt handler running on one
CPU and P1() represents the read() routine running on another.  The
value 1 stored in buf represents input data collected from the device.
Thus, P0 stores the data in buf and then sets flag.  Meanwhile, P1
reads flag into the private variable r1, and if it is set, reads the
data from buf into a second private variable r2 for copying to
userspace.  (Presumably if flag is not set then the driver will wait a
while and try again.)

This pattern of memory accesses, where one CPU stores values to two
shared memory locations and another CPU loads from those locations in
the opposite order, is widely known as the "Message Passing" or MP
pattern.  It is typical of memory access patterns in the kernel.
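(As an aside, this example shows what the warning in the INTRODUCTION
was about.  Converted to the real litmus-test format, with the
conditional read made unconditional, the MP example would look roughly
like the following sketch, which is modeled on the MP+poonceonces.litmus
test in the kernel's tools/memory-model/litmus-tests/ directory:

	C MP+poonceonces

	{}

	P0(int *buf, int *flag)
	{
		WRITE_ONCE(*buf, 1);
		WRITE_ONCE(*flag, 1);
	}

	P1(int *buf, int *flag)
	{
		int r0;
		int r1;

		r0 = READ_ONCE(*flag);
		r1 = READ_ONCE(*buf);
	}

	exists (1:r0=1 /\ 1:r1=0)

The header line gives the litmus test's name, the braces hold the
initializations (empty here, so every variable starts out 0), the
shared variables are passed to each process by pointer, and the
"exists" clause at the end asks whether the undesirable outcome can
occur.)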
Please note that this example code is a simplified abstraction.  Real
buffers are usually larger than a single integer, real device drivers
usually use sleep and wakeup mechanisms rather than polling for I/O
completion, and real code generally doesn't bother to copy values into
private variables before using them.  All that is beside the point;
the idea here is simply to illustrate the overall pattern of memory
accesses by the CPUs.

A memory model will predict what values P1 might obtain for its loads
from flag and buf, or equivalently, what values r1 and r2 might end up
with after the code has finished running.

Some predictions are trivial.  For instance, no sane memory model would
predict that r1 = 42 or r2 = -7, because neither of those values ever
gets stored in flag or buf.

Some nontrivial predictions are nonetheless quite simple.  For
instance, P1 might run entirely before P0 begins, in which case r1 and
r2 will both be 0 at the end.  Or P0 might run entirely before P1
begins, in which case r1 and r2 will both be 1.

The interesting predictions concern what might happen when the two
routines run concurrently.  One possibility is that P1 runs after P0's
store to buf but before the store to flag.  In this case, r1 and r2
will again both be 0.  (If P1 had been designed to read buf
unconditionally then we would instead have r1 = 0 and r2 = 1.)

However, the most interesting possibility is where r1 = 1 and r2 = 0.
If this were to occur it would mean the driver contains a bug, because
incorrect data would get sent to the user: 0 instead of 1.  As it
happens, the LKMM does predict this outcome can occur, and the example
driver code shown above is indeed buggy.


A SELECTION OF MEMORY MODELS
----------------------------

The first widely cited memory model, and the simplest to understand,
is Sequential Consistency.  According to this model, systems behave as
if each CPU executed its instructions in order but with unspecified
timing.  In other words, the instructions from the various CPUs get
interleaved in a nondeterministic way, always according to some single
global order that agrees with the order of the instructions in the
program source for each CPU.  The model says that the value obtained
by each load is simply the value written by the most recently executed
store to the same memory location, from any CPU.

For the MP example code shown above, Sequential Consistency predicts
that the undesired result r1 = 1, r2 = 0 cannot occur.  The reasoning
goes like this:

	Since r1 = 1, P0 must store 1 to flag before P1 loads 1 from
	it, as loads can obtain values only from earlier stores.

	P1 loads from flag before loading from buf, since CPUs execute
	their instructions in order.

	P1 must load 0 from buf before P0 stores 1 to it; otherwise r2
	would be 1 since a load obtains its value from the most recent
	store to the same address.

	P0 stores 1 to buf before storing 1 to flag, since it executes
	its instructions in order.

	Since an instruction (in this case, P0's store to flag) cannot
	execute before itself, the specified outcome is impossible.

However, real computer hardware almost never follows the Sequential
Consistency memory model; doing so would rule out too many valuable
performance optimizations.  On ARM and PowerPC architectures, for
instance, the MP example code really does sometimes yield r1 = 1 and
r2 = 0.

x86 and SPARC follow yet a different memory model: TSO (Total Store
Ordering).  This model predicts that the undesired outcome for the MP
pattern cannot occur, but in other respects it differs from Sequential
Consistency.  One example is the Store Buffer (SB) pattern, in which
each CPU stores to its own shared location and then loads from the
other CPU's location:

	int x = 0, y = 0;

	P0()
	{
		int r0;

		WRITE_ONCE(x, 1);
		r0 = READ_ONCE(y);
	}

	P1()
	{
		int r1;

		WRITE_ONCE(y, 1);
		r1 = READ_ONCE(x);
	}

Sequential Consistency predicts that the outcome r0 = 0, r1 = 0 is
impossible.  (Exercise: Figure out the reasoning.)  But TSO allows
this outcome to occur, and in fact it does sometimes occur on x86 and
SPARC systems.

The LKMM was inspired by the memory models followed by PowerPC, ARM,
x86, Alpha, and other architectures.  However, it is different in
detail from each of them.


ORDERING AND CYCLES
-------------------

Memory models are all about ordering.  Often this is temporal ordering
(i.e., the order in which certain events occur) but it doesn't have to
be; consider for example the order of instructions in a program's
source code.
We saw above that Sequential Consistency makes an
important assumption that CPUs execute instructions in the same order
as those instructions occur in the code, and there are many other
instances of ordering playing central roles in memory models.

The counterpart to ordering is a cycle.  Ordering rules out cycles:
It's not possible to have X ordered before Y, Y ordered before Z, and
Z ordered before X, because this would mean that X is ordered before
itself.  The analysis of the MP example under Sequential Consistency
involved just such an impossible cycle:

	W: P0 stores 1 to flag   executes before
	X: P1 loads 1 from flag  executes before
	Y: P1 loads 0 from buf   executes before
	Z: P0 stores 1 to buf    executes before
	W: P0 stores 1 to flag.

In short, if a memory model requires certain accesses to be ordered,
and a certain outcome for the loads in a piece of code can happen only
if those accesses would form a cycle, then the memory model predicts
that outcome cannot occur.

The LKMM is defined largely in terms of cycles, as we will see.


EVENTS
------

The LKMM does not work directly with the C statements that make up
kernel source code.  Instead it considers the effects of those
statements in a more abstract form, namely, events.  The model
includes three types of events:

	Read events correspond to loads from shared memory, such as
	calls to READ_ONCE(), smp_load_acquire(), or
	rcu_dereference().

	Write events correspond to stores to shared memory, such as
	calls to WRITE_ONCE(), smp_store_release(), or atomic_set().

	Fence events correspond to memory barriers (also known as
	fences), such as calls to smp_rmb() or rcu_read_lock().

These categories are not exclusive; a read or write event can also be
a fence.  This happens with functions like smp_load_acquire() or
spin_lock().  However, no single event can be both a read and a write.
Atomic read-modify-write accesses, such as atomic_inc() or xchg(),
correspond to a pair of events: a read followed by a write.  (The
write event is omitted for executions where it doesn't occur, such as
a cmpxchg() where the comparison fails.)

Other parts of the code, those which do not involve interaction with
shared memory, do not give rise to events.  Thus, arithmetic and
logical computations, control-flow instructions, or accesses to
private memory or CPU registers are not of central interest to the
memory model.  They only affect the model's predictions indirectly.
For example, an arithmetic computation might determine the value that
gets stored to a shared memory location (or in the case of an array
index, the address where the value gets stored), but the memory model
is concerned only with the store itself -- its value and its address
-- not the computation leading up to it.

Events in the LKMM can be linked by various relations, which we will
describe in the following sections.  The memory model requires certain
of these relations to be orderings, that is, it requires them not to
have any cycles.


THE PROGRAM ORDER RELATION: po AND po-loc
-----------------------------------------

The most important relation between events is program order (po).  You
can think of it as the order in which statements occur in the source
code after branches are taken into account and loops have been
unrolled.  A better description might be the order in which
instructions are presented to a CPU's execution unit.  Thus, we say
that X is po-before Y (written as "X ->po Y" in formulas) if X occurs
before Y in the instruction stream.

This is inherently a single-CPU relation; two instructions executing
on different CPUs are never linked by po.  Also, it is by definition
an ordering so it cannot have any cycles.

po-loc is a sub-relation of po.  It links two memory accesses when the
first comes before the second in program order and they access the
same memory location (the "-loc" suffix).
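For instance, consider this sketch (the labels A, B, and C here are
just names for the events):

	int x, y;

	P0()
	{
		int r1;

		WRITE_ONCE(x, 1);	/* A */
		r1 = READ_ONCE(x);	/* B */
		WRITE_ONCE(y, r1);	/* C */
	}

Here A ->po B ->po C, and A ->po-loc B because both events access x;
but neither A nor B is linked by po-loc to C, which accesses a
different location.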
Although this may seem straightforward, there is one subtle aspect to
program order we need to explain.  The LKMM was inspired by low-level
architectural memory models which describe the behavior of machine
code, and it retains their outlook to a considerable extent.  The
read, write, and fence events used by the model are close in spirit to
individual machine instructions.  Nevertheless, the LKMM describes
kernel code written in C, and the mapping from C to machine code can
be extremely complex.

Optimizing compilers have great freedom in the way they translate
source code to object code.  They are allowed to apply transformations
that add memory accesses, eliminate accesses, combine them, split them
into pieces, or move them around.  The use of READ_ONCE(), WRITE_ONCE(),
or one of the other atomic or synchronization primitives prevents a
large number of compiler optimizations.  In particular, it is guaranteed
that the compiler will not remove such accesses from the generated code
(unless it can prove the accesses will never be executed), it will not
change the order in which they occur in the code (within limits imposed
by the C standard), and it will not introduce extraneous accesses.

The MP and SB examples above used READ_ONCE() and WRITE_ONCE() rather
than ordinary memory accesses.  Thanks to this usage, we can be certain
that in the MP example, the compiler won't reorder P0's write event to
buf and P0's write event to flag, and similarly for the other shared
memory accesses in the examples.

Since private variables are not shared between CPUs, they can be
accessed normally without READ_ONCE() or WRITE_ONCE().  In fact, they
need not even be stored in normal memory at all -- in principle a
private variable could be stored in a CPU register (hence the convention
that these variables have names starting with the letter 'r').


A WARNING
---------

The protections provided by READ_ONCE(), WRITE_ONCE(), and others are
not perfect; and under some circumstances it is possible for the
compiler to undermine the memory model.  Here is an example.  Suppose
both branches of an "if" statement store the same value to the same
location:

	r1 = READ_ONCE(x);
	if (r1) {
		WRITE_ONCE(y, 2);
		...  /* do something */
	} else {
		WRITE_ONCE(y, 2);
		...  /* do something else */
	}

For this code, the LKMM predicts that the load from x will always be
executed before either of the stores to y.  However, a compiler could
lift the stores out of the conditional, transforming the code into
something resembling:

	r1 = READ_ONCE(x);
	WRITE_ONCE(y, 2);
	if (r1) {
		...  /* do something */
	} else {
		...  /* do something else */
	}

Given this version of the code, the LKMM would predict that the load
from x could be executed after the store to y.
Thus, the memory
model's original prediction could be invalidated by the compiler.

Another issue arises from the fact that in C, arguments to many
operators and function calls can be evaluated in any order.  For
example:

	r1 = f(5) + g(6);

The object code might call f(5) either before or after g(6); the
memory model cannot assume there is a fixed program order relation
between them.  (In fact, if the function calls are inlined then the
compiler might even interleave their object code.)


DEPENDENCY RELATIONS: data, addr, and ctrl
------------------------------------------

We say that two events are linked by a dependency relation when the
execution of the second event depends in some way on a value obtained
from memory by the first.  The first event must be a read, and the
value it obtains must somehow affect what the second event does.
There are three kinds of dependencies: data, address (addr), and
control (ctrl).

A read and a write event are linked by a data dependency if the value
obtained by the read affects the value stored by the write.  As a very
simple example:

	int x, y;

	r1 = READ_ONCE(x);
	WRITE_ONCE(y, r1 + 5);

The value stored by the WRITE_ONCE obviously depends on the value
loaded by the READ_ONCE.  Such dependencies can wind through
arbitrarily complicated computations, and a write can depend on the
values of multiple reads.

A read event and another memory access event are linked by an address
dependency if the value obtained by the read affects the location
accessed by the other event.  The second event can be either a read or
a write.  Here's another simple example:

	int a[20];
	int i;

	r1 = READ_ONCE(i);
	r2 = READ_ONCE(a[r1]);

Here the location accessed by the second READ_ONCE() depends on the
index value loaded by the first.  Pointer indirection also gives rise
to address dependencies, since the address of a location accessed
through a pointer will depend on the value read earlier from that
pointer.

Finally, a read event X and a write event Y are linked by a control
dependency if Y syntactically lies within an arm of an if statement and
X affects the evaluation of the if condition via a data or address
dependency (or similarly for a switch statement).  Simple example:

	int x, y;

	r1 = READ_ONCE(x);
	if (r1)
		WRITE_ONCE(y, 1984);

Execution of the WRITE_ONCE() is controlled by a conditional expression
which depends on the value obtained by the READ_ONCE(); hence there is
a control dependency from the load to the store.

It should be pretty obvious that events can only depend on reads that
come earlier in program order.  Symbolically, if we have R ->data X,
R ->addr X, or R ->ctrl X (where R is a read event), then we must also
have R ->po X.  It wouldn't make sense for a computation to depend
somehow on a value that doesn't get loaded from shared memory until
later in the code!

Here's a trick question: When is a dependency not a dependency?  Answer:
When it is purely syntactic rather than semantic.  We say a dependency
between two accesses is purely syntactic if the second access doesn't
actually depend on the result of the first.  Here is a trivial example:

	r1 = READ_ONCE(x);
	WRITE_ONCE(y, r1 * 0);

There appears to be a data dependency from the load of x to the store
of y, since the value to be stored is computed from the value that was
loaded.
But in fact, the value stored does not really depend on
anything since it will always be 0.  Thus the data dependency is only
syntactic (it appears to exist in the code) but not semantic (the
second access will always be the same, regardless of the value of the
first access).  Given code like this, a compiler could simply discard
the value returned by the load from x, which would certainly destroy
any dependency.  (The compiler is not permitted to eliminate entirely
the load generated for a READ_ONCE() -- that's one of the nice
properties of READ_ONCE() -- but it is allowed to ignore the load's
value.)

It's natural to object that no one in their right mind would write
code like the above.  However, macro expansions can easily give rise
to this sort of thing, in ways that often are not apparent to the
programmer.

Another mechanism that can lead to purely syntactic dependencies is
related to the notion of "undefined behavior".  Certain program
behaviors are called "undefined" in the C language specification,
which means that when they occur there are no guarantees at all about
the outcome.  Consider the following example:

	int a[1];
	int i;

	r1 = READ_ONCE(i);
	r2 = READ_ONCE(a[r1]);

Access beyond the end or before the beginning of an array is one kind
of undefined behavior.  Therefore the compiler doesn't have to worry
about what will happen if r1 is nonzero, and it can assume that r1
will always be zero regardless of the value actually loaded from i.
(If the assumption turns out to be wrong the resulting behavior will
be undefined anyway, so the compiler doesn't care!)  Thus the value
from the load can be discarded, breaking the address dependency.

The LKMM is unaware that purely syntactic dependencies are different
from semantic dependencies and therefore mistakenly predicts that the
accesses in the two examples above will be ordered.  This is another
example of how the compiler can undermine the memory model.  Be warned.


THE READS-FROM RELATION: rf, rfi, and rfe
-----------------------------------------

The reads-from relation (rf) links a write event to a read event when
the value loaded by the read is the value that was stored by the
write.  In colloquial terms, the load "reads from" the store.  We
write W ->rf R to indicate that the load R reads from the store W.  We
further distinguish the cases where the load and the store occur on
the same CPU (internal reads-from, or rfi) and where they occur on
different CPUs (external reads-from, or rfe).

For our purposes, a memory location's initial value is treated as
though it had been written there by an imaginary initial store that
executes on a separate CPU before the main program runs.
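To illustrate the difference between rfi and rfe, consider this
sketch:

	int x = 0;

	P0()
	{
		int r1;

		WRITE_ONCE(x, 1);
		r1 = READ_ONCE(x);
	}

	P1()
	{
		int r2;

		r2 = READ_ONCE(x);
	}

If r1 = 1 then P0's store ->rfi P0's load, because the two events
occur on the same CPU.  If r2 = 1 then the same store ->rfe P1's load;
if instead r2 = 0, P1's load reads from the imaginary initial store.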
Usage of the rf relation implicitly assumes that loads will always
read from a single store.  It doesn't apply properly in the presence
of load-tearing, where a load obtains some of its bits from one store
and some of them from another store.  Fortunately, use of READ_ONCE()
and WRITE_ONCE() will prevent load-tearing; it's not possible to have:

	int x = 0;

	P0()
	{
		WRITE_ONCE(x, 0x1234);
	}

	P1()
	{
		int r1;

		r1 = READ_ONCE(x);
	}

and end up with r1 = 0x1200 (partly from x's initial value and partly
from the value stored by P0).

On the other hand, load-tearing is unavoidable when mixed-size
accesses are used.  Consider this example:

	union {
		u32 w;
		u16 h[2];
	} x;

	P0()
	{
		WRITE_ONCE(x.h[0], 0x1234);
		WRITE_ONCE(x.h[1], 0x5678);
	}

	P1()
	{
		int r1;

		r1 = READ_ONCE(x.w);
	}

If r1 = 0x56781234 (little-endian!) at the end, then P1 must have read
from both of P0's stores.  It is possible to handle mixed-size and
unaligned accesses in a memory model, but the LKMM currently does not
attempt to do so.  It requires all accesses to be properly aligned and
of the location's actual size.


CACHE COHERENCE AND THE COHERENCE ORDER RELATION: co, coi, and coe
------------------------------------------------------------------

Cache coherence is a general principle requiring that in a
multi-processor system, the CPUs must share a consistent view of the
memory contents.  Specifically, it requires that for each location in
shared memory, the stores to that location must form a single global
ordering which all the CPUs agree on (the coherence order), and this
ordering must be consistent with the program order for accesses to
that location.

To put it another way, for any variable x, the coherence order (co) of
the stores to x is simply the order in which the stores overwrite one
another.  The imaginary store which establishes x's initial value
comes first in the coherence order; the store which directly
overwrites the initial value comes second; the store which overwrites
that value comes third, and so on.

You can think of the coherence order as being the order in which the
stores reach x's location in memory (or if you prefer a more
hardware-centric view, the order in which the stores get written to
x's cache line).  We write W ->co W' if W comes before W' in the
coherence order, that is, if the value stored by W gets overwritten,
directly or indirectly, by the value stored by W'.

Coherence order is required to be consistent with program order.  This
requirement takes the form of four coherency rules:

	Write-write coherence: If W ->po-loc W' (i.e., W comes before
	W' in program order and they access the same location), where W
	and W' are two stores, then W ->co W'.

	Write-read coherence: If W ->po-loc R, where W is a store and R
	is a load, then R must read from W or from some other store
	which comes after W in the coherence order.

	Read-write coherence: If R ->po-loc W, where R is a load and W
	is a store, then the store which R reads from must come before
	W in the coherence order.

	Read-read coherence: If R ->po-loc R', where R and R' are two
	loads, then either they read from the same store or else the
	store read by R comes before the store read by R' in the
	coherence order.

This is sometimes referred to as sequential consistency per variable,
because it means that the accesses to any single memory location obey
the rules of the Sequential Consistency memory model.  (According to
Wikipedia, sequential consistency per variable and cache coherence
mean the same thing except that cache coherence includes an extra
requirement that every store eventually becomes visible to every CPU.)

Any reasonable memory model will include cache coherence.
Indeed, our
expectation of cache coherence is so deeply ingrained that violations
of its requirements look more like hardware bugs than programming
errors:

	int x;

	P0()
	{
		WRITE_ONCE(x, 17);
		WRITE_ONCE(x, 23);
	}

If the final value stored in x after this code ran was 17, you would
think your computer was broken.  It would be a violation of the
write-write coherence rule: Since the store of 23 comes later in
program order, it must also come later in x's coherence order and
thus must overwrite the store of 17.

	int x = 0;

	P0()
	{
		int r1;

		r1 = READ_ONCE(x);
		WRITE_ONCE(x, 666);
	}

If r1 = 666 at the end, this would violate the read-write coherence
rule: The READ_ONCE() load comes before the WRITE_ONCE() store in
program order, so it must not read from that store but rather from one
coming earlier in the coherence order (in this case, x's initial
value).

	int x = 0;

	P0()
	{
		WRITE_ONCE(x, 5);
	}

	P1()
	{
		int r1, r2;

		r1 = READ_ONCE(x);
		r2 = READ_ONCE(x);
	}

If r1 = 5 (reading from P0's store) and r2 = 0 (reading from the
imaginary store which establishes x's initial value) at the end, this
would violate the read-read coherence rule: The r1 load comes before
the r2 load in program order, so it must not read from a store that
comes later in the coherence order.

(As a minor curiosity, if this code had used normal loads instead of
READ_ONCE() in P1, on Itanium it sometimes could end up with r1 = 5
and r2 = 0!  This results from parallel execution of the operations
encoded in Itanium's Very-Long-Instruction-Word format, and it is yet
another motivation for using READ_ONCE() when accessing shared memory
locations.)

Just like the po relation, co is inherently an ordering -- it is not
possible for a store to directly or indirectly overwrite itself!  And
just like with the rf relation, we distinguish between stores that
occur on the same CPU (internal coherence order, or coi) and stores
that occur on different CPUs (external coherence order, or coe).

On the other hand, stores to different memory locations are never
related by co, just as instructions on different CPUs are never
related by po.  Coherence order is strictly per-location, or if you
prefer, each location has its own independent coherence order.


THE FROM-READS RELATION: fr, fri, and fre
-----------------------------------------

The from-reads relation (fr) can be a little difficult for people to
grok.  It describes the situation where a load reads a value that gets
overwritten by a store.  In other words, we have R ->fr W when the
value that R reads is overwritten (directly or indirectly) by W, or
equivalently, when R reads from a store which comes earlier than W in
the coherence order.

For example:

	int x = 0;

	P0()
	{
		int r1;

		r1 = READ_ONCE(x);
		WRITE_ONCE(x, 2);
	}

The value loaded from x will be 0 (assuming cache coherence!), and it
gets overwritten by the value 2.  Thus there is an fr link from the
READ_ONCE() to the WRITE_ONCE().  If the code contained any later
stores to x, there would also be fr links from the READ_ONCE() to
them.

As with rf, rfi, and rfe, we subdivide the fr relation into fri (when
the load and the store are on the same CPU) and fre (when they are on
different CPUs).

Note that the fr relation is determined entirely by the rf and co
relations; it is not independent.
Given a read event R and a write
event W for the same location, we will have R ->fr W if and only if
the write which R reads from is co-before W.  In symbols,

	(R ->fr W) := (there exists W' with W' ->rf R and W' ->co W).


AN OPERATIONAL MODEL
--------------------

The LKMM is based on various operational memory models, meaning that
the models arise from an abstract view of how a computer system
operates.  Here are the main ideas, as incorporated into the LKMM.

The system as a whole is divided into the CPUs and a memory subsystem.
The CPUs are responsible for executing instructions (not necessarily
in program order), and they communicate with the memory subsystem.
For the most part, executing an instruction requires a CPU to perform
only internal operations.  However, loads, stores, and fences involve
more.

When CPU C executes a store instruction, it tells the memory subsystem
to store a certain value at a certain location.  The memory subsystem
propagates the store to all the other CPUs as well as to RAM.  (As a
special case, we say that the store propagates to its own CPU at the
time it is executed.)  The memory subsystem also determines where the
store falls in the location's coherence order.  In particular, it must
arrange for the store to be co-later than (i.e., to overwrite) any
other store to the same location which has already propagated to CPU C.

When a CPU executes a load instruction R, it first checks to see
whether there are any as-yet unexecuted store instructions, for the
same location, that come before R in program order.  If there are, it
uses the value of the po-latest such store as the value obtained by R,
and we say that the store's value is forwarded to R.  Otherwise, the
CPU asks the memory subsystem for the value to load and we say that R
is satisfied from memory.  The memory subsystem hands back the value
of the co-latest store to the location in question which has already
propagated to that CPU.

(In fact, the picture needs to be a little more complicated than this.
CPUs have local caches, and propagating a store to a CPU really means
propagating it to the CPU's local cache.  A local cache can take some
time to process the stores that it receives, and a store can't be used
to satisfy one of the CPU's loads until it has been processed.  On
most architectures, the local caches process stores in
First-In-First-Out order, and consequently the processing delay
doesn't matter for the memory model.  But on Alpha, the local caches
have a partitioned design that results in non-FIFO behavior.  We will
discuss this in more detail later.)

Note that load instructions may be executed speculatively and may be
restarted under certain circumstances.  The memory model ignores these
premature executions; we simply say that the load executes at the
final time it is forwarded or satisfied.
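Forwarding is easy to see in a single-CPU sketch:

	int x = 0;

	P0()
	{
		int r1;

		WRITE_ONCE(x, 1);
		r1 = READ_ONCE(x);
	}

The load is po-after a store to the same location, so if that store
has not yet executed, the CPU forwards the store's value to the load.
Either way r1 is bound to be 1; the load never has to wait for the
store to make its way through the memory subsystem.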
Executing a fence (or memory barrier) instruction doesn't require a
CPU to do anything special other than informing the memory subsystem
about the fence.  However, fences do constrain the way CPUs and the
memory subsystem handle other instructions, in two respects.

First, a fence forces the CPU to execute various instructions in
program order.  Exactly which instructions are ordered depends on the
type of fence:

	Strong fences, including smp_mb() and synchronize_rcu(), force
	the CPU to execute all po-earlier instructions before any
	po-later instructions;

	smp_rmb() forces the CPU to execute all po-earlier loads
	before any po-later loads;

	smp_wmb() forces the CPU to execute all po-earlier stores
	before any po-later stores;

	Acquire fences, such as smp_load_acquire(), force the CPU to
	execute the load associated with the fence (e.g., the load
	part of an smp_load_acquire()) before any po-later
	instructions;

	Release fences, such as smp_store_release(), force the CPU to
	execute all po-earlier instructions before the store
	associated with the fence (e.g., the store part of an
	smp_store_release()).

Second, some types of fence affect the way the memory subsystem
propagates stores.  When a fence instruction is executed on CPU C:

	For each other CPU C', smp_wmb() forces all po-earlier stores
	on C to propagate to C' before any po-later stores do.

	For each other CPU C', any store which propagates to C before
	a release fence is executed (including all po-earlier
	stores executed on C) is forced to propagate to C' before the
	store associated with the release fence does.

	Any store which propagates to C before a strong fence is
	executed (including all po-earlier stores on C) is forced to
	propagate to all other CPUs before any instructions po-after
	the strong fence are executed on C.

The propagation ordering enforced by release fences and strong fences
affects stores from other CPUs that propagate to CPU C before the
fence is executed, as well as stores that are executed on C before the
fence.  We describe this property by saying that release fences and
strong fences are A-cumulative.  By contrast, smp_wmb() fences are not
A-cumulative; they only affect the propagation of stores that are
executed on C before the fence (i.e., those which precede the fence in
program order).

rcu_read_lock(), rcu_read_unlock(), and synchronize_rcu() fences have
other properties which we discuss later.


PROPAGATION ORDER RELATION: cumul-fence
---------------------------------------

The fences which affect propagation order (i.e., strong, release, and
smp_wmb() fences) are collectively referred to as cumul-fences, even
though smp_wmb() isn't A-cumulative.  The cumul-fence relation is
defined to link memory access events E and F whenever:

	E and F are both stores on the same CPU and an smp_wmb() fence
	event occurs between them in program order; or

	F is a release fence and some X comes before F in program order,
	where either X = E or else E ->rf X; or

	A strong fence event occurs between some X and F in program
	order, where either X = E or else E ->rf X.

The operational model requires that whenever W and W' are both stores
and W ->cumul-fence W', then W must propagate to any given CPU
before W' does.  However, for different CPUs C and C', it does not
require W to propagate to C before W' propagates to C'.
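To see the A-cumulativity of release fences at work, consider this
sketch:

	int x = 0, y = 0;

	P0()
	{
		WRITE_ONCE(x, 1);
	}

	P1()
	{
		int r1;

		r1 = READ_ONCE(x);
		smp_store_release(&y, 1);
	}

	P2()
	{
		int r2, r3;

		r2 = smp_load_acquire(&y);
		r3 = READ_ONCE(x);
	}

If r1 = 1 then P1's load reads from P0's store to x, and the load
comes before the release fence in P1's program order; by the second
clause above, the store to x is therefore cumul-fence-linked to the
store to y and must propagate to each CPU before it does.
Consequently, if r1 = 1 and r2 = 1 then r3 must also be 1 (the acquire
load guarantees that P2 executes its load of x after its load of y),
even though the store to x did not execute on P1's CPU.  An smp_wmb()
in place of the release fence would not create this link, because the
first clause requires E and F to be stores on the same CPU.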
DERIVATION OF THE LKMM FROM THE OPERATIONAL MODEL
-------------------------------------------------

The LKMM is derived from the restrictions imposed by the design
outlined above.  These restrictions involve the necessity of
maintaining cache coherence and the fact that a CPU can't operate on a
value before it knows what that value is, among other things.

The formal version of the LKMM is defined by six requirements, or
axioms:

	Sequential consistency per variable: This requires that the
	system obey the four coherency rules.

	Atomicity: This requires that atomic read-modify-write
	operations really are atomic, that is, no other stores can
	sneak into the middle of such an update.

	Happens-before: This requires that certain instructions are
	executed in a specific order.

	Propagation: This requires that certain stores propagate to
	CPUs and to RAM in a specific order.

	Rcu: This requires that RCU read-side critical sections and
	grace periods obey the rules of RCU, in particular, the
	Grace-Period Guarantee.

	Plain-coherence: This requires that plain memory accesses
	(those not using READ_ONCE(), WRITE_ONCE(), etc.) must obey
	the operational model's rules regarding cache coherence.

The first and second are quite common; they can be found in many
memory models (such as those for C11/C++11).  The "happens-before" and
"propagation" axioms have analogs in other memory models as well.  The
"rcu" and "plain-coherence" axioms are specific to the LKMM.

Each of these axioms is discussed below.


SEQUENTIAL CONSISTENCY PER VARIABLE
-----------------------------------

According to the principle of cache coherence, the stores to any fixed
shared location in memory form a global ordering.  We can imagine
inserting the loads from that location into this ordering, by placing
each load between the store that it reads from and the following
store.  This leaves the relative positions of loads that read from the
same store unspecified; let's say they are inserted in program order,
first for CPU 0, then CPU 1, etc.

You can check that the four coherency rules imply that the rf, co, fr,
and po-loc relations agree with this global ordering; in other words,
whenever we have X ->rf Y or X ->co Y or X ->fr Y or X ->po-loc Y, the
X event comes before the Y event in the global ordering.  The LKMM's
"coherence" axiom expresses this by requiring the union of these
relations not to have any cycles.  This means it must not be possible
to find events

	X0 -> X1 -> X2 -> ... -> Xn -> X0,

where each of the links is either rf, co, fr, or po-loc.  This has to
hold if the accesses to the fixed memory location can be ordered as
cache coherence demands.

Although it is not obvious, it can be shown that the converse is also
true: This LKMM axiom implies that the four coherency rules are
obeyed.


ATOMIC UPDATES: rmw
-------------------

What does it mean to say that a read-modify-write (rmw) update, such
as atomic_inc(&x), is atomic?  It means that the memory location (x in
this case) does not get altered between the read and the write events
making up the atomic operation.  In particular, if two CPUs perform
atomic_inc(&x) concurrently, it must be guaranteed that the final
value of x will be the initial value plus two.  We should never have
the following sequence of events:

	CPU 0 loads x obtaining 13;
					CPU 1 loads x obtaining 13;
	CPU 0 stores 14 to x;
					CPU 1 stores 14 to x;

where the final value of x is wrong (14 rather than 15).

In this example, CPU 0's increment effectively gets lost because it
occurs in between CPU 1's load and store.
To put it another way, the
problem is that the position of CPU 0's store in x's coherence order
is between the store that CPU 1 reads from and the store that CPU 1
performs.

The same analysis applies to all atomic update operations.  Therefore,
to enforce atomicity the LKMM requires that atomic updates follow this
rule: Whenever R and W are the read and write events composing an
atomic read-modify-write and W' is the write event which R reads from,
there must not be any stores coming between W' and W in the coherence
order.  Equivalently,

	(R ->rmw W) implies (there is no X with R ->fr X and X ->co W),

where the rmw relation links the read and write events making up each
atomic update.  This is what the LKMM's "atomic" axiom says.

Atomic rmw updates play one more role in the LKMM: They can form "rmw
sequences".  An rmw sequence is simply a bunch of atomic updates where
each update reads from the previous one.  Written using events, it
looks like this:

	Z0 ->rf Y1 ->rmw Z1 ->rf ... ->rf Yn ->rmw Zn,

where Z0 is some store event and n can be any number (even 0, in the
degenerate case).  We write this relation as: Z0 ->rmw-sequence Zn.
Note that this implies Z0 and Zn are stores to the same variable.

Rmw sequences have a special property in the LKMM: They can extend the
cumul-fence relation.  That is, if we have:

	U ->cumul-fence X -> rmw-sequence Y

then also U ->cumul-fence Y.  Thinking about this in terms of the
operational model, U ->cumul-fence X says that the store U propagates
to each CPU before the store X does.  Then the fact that X and Y are
linked by an rmw sequence means that U also propagates to each CPU
before Y does.  In an analogous way, rmw sequences can also extend
the w-post-bounded relation defined below in the PLAIN ACCESSES AND
DATA RACES section.

(The notion of rmw sequences in the LKMM is similar to, but not quite
the same as, that of release sequences in the C11 memory model.  They
were added to the LKMM to fix an obscure bug; without them, atomic
updates with full-barrier semantics did not always guarantee ordering
at least as strong as atomic updates with release-barrier semantics.)
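As a sketch of how this property gets used, consider the following
fragment, and suppose that P0's increment of x executes first, so that
P1's increment reads from it (an rmw sequence of length 1):

	atomic_t x = ATOMIC_INIT(0);
	int y = 0;

	P0()
	{
		WRITE_ONCE(y, 1);
		smp_wmb();
		atomic_inc(&x);		/* stores 1 to x */
	}

	P1()
	{
		atomic_inc(&x);		/* reads 1, stores 2 */
	}

	P2()
	{
		int r1, r2;

		r1 = atomic_read(&x);
		smp_rmb();
		r2 = READ_ONCE(y);
	}

The smp_wmb() makes the store to y cumul-fence-linked to P0's
increment, and P1's increment extends that link by an rmw sequence.
Hence y = 1 must propagate to each CPU before x = 2 does, and if
r1 = 2 then r2 must be 1 (thanks also to the smp_rmb() ordering P2's
loads).  Without the rmw-sequence extension, the cumul-fence link
would reach only P0's store of 1 to x, and this reasoning could not
rule out r1 = 2 together with r2 = 0.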
THE PRESERVED PROGRAM ORDER RELATION: ppo
-----------------------------------------

There are many situations where a CPU is obliged to execute two
instructions in program order.  We amalgamate them into the ppo (for
"preserved program order") relation, which links the po-earlier
instruction to the po-later instruction and is thus a sub-relation of
po.

The operational model already includes a description of one such
situation: Fences are a source of ppo links.  Suppose X and Y are
memory accesses with X ->po Y; then the CPU must execute X before Y if
any of the following hold:

	A strong (smp_mb() or synchronize_rcu()) fence occurs between
	X and Y;

	X and Y are both stores and an smp_wmb() fence occurs between
	them;

	X and Y are both loads and an smp_rmb() fence occurs between
	them;

	X is also an acquire fence, such as smp_load_acquire();

	Y is also a release fence, such as smp_store_release().

Another possibility, not mentioned earlier but discussed in the next
section, is:

	X and Y are both loads, X ->addr Y (i.e., there is an address
	dependency from X to Y), and X is a READ_ONCE() or an atomic
	access.
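As an example, the buggy MP code from A SIMPLE EXAMPLE can be fixed by
adding fences that create ppo links on both CPUs (a sketch; the full
argument for why this works is given in THE HAPPENS-BEFORE RELATION
section below):

	int buf = 0, flag = 0;

	P0()
	{
		WRITE_ONCE(buf, 1);
		smp_wmb();	/* ppo link between P0's two stores */
		WRITE_ONCE(flag, 1);
	}

	P1()
	{
		int r1;
		int r2 = 0;

		r1 = READ_ONCE(flag);
		smp_rmb();	/* ppo link between P1's two loads */
		if (r1)
			r2 = READ_ONCE(buf);
	}

With the fences in place, the undesirable r1 = 1, r2 = 0 outcome is
forbidden.  Note that the control dependency from the load of flag to
the load of buf would not be enough by itself; as explained below,
CPUs need not respect control dependencies to loads.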
Dependencies can also cause instructions to be executed in program
order.  This is uncontroversial when the second instruction is a
store; either a data, address, or control dependency from a load R to
a store W will force the CPU to execute R before W.  This is very
simply because the CPU cannot tell the memory subsystem about W's
store before it knows what value should be stored (in the case of a
data dependency), what location it should be stored into (in the case
of an address dependency), or whether the store should actually take
place (in the case of a control dependency).

Dependencies to load instructions are more problematic.  To begin with,
there is no such thing as a data dependency to a load.  Next, a CPU
has no reason to respect a control dependency to a load, because it
can always satisfy the second load speculatively before the first, and
then ignore the result if it turns out that the second load shouldn't
be executed after all.  And lastly, the real difficulties begin when
we consider address dependencies to loads.

To be fair about it, all Linux-supported architectures do execute
loads in program order if there is an address dependency between them.
After all, a CPU cannot ask the memory subsystem to load a value from
a particular location before it knows what that location is.  However,
the split-cache design used by Alpha can cause it to behave in a way
that looks as if the loads were executed out of order (see the next
section for more details).  The kernel includes a workaround for this
problem when the loads come from READ_ONCE(), and therefore the LKMM
includes address dependencies to loads in the ppo relation.

On the other hand, dependencies can indirectly affect the ordering of
two loads.  This happens when there is a dependency from a load to a
store and a second, po-later load reads from that store:

	R ->dep W ->rfi R',

where the dep link can be either an address or a data dependency.  In
this situation we know it is possible for the CPU to execute R' before
W, because it can forward the value that W will store to R'.  But it
cannot execute R' before R, because it cannot forward the value before
it knows what that value is, or that W and R' do access the same
location.  However, if there is merely a control dependency between R
and W then the CPU can speculatively forward W to R' before executing
R; if the speculation turns out to be wrong then the CPU merely has to
restart or abandon R'.

(In theory, a CPU might forward a store to a load when it runs across
an address dependency like this:

	r1 = READ_ONCE(ptr);
	WRITE_ONCE(*r1, 17);
	r2 = READ_ONCE(*r1);

because it could tell that the store and the second load access the
same location even before it knows what the location's address is.
However, none of the architectures supported by the Linux kernel do
this.)

Two memory accesses of the same location must always be executed in
program order if the second access is a store.  Thus, if we have

	R ->po-loc W

(the po-loc link says that R comes before W in program order and they
access the same location), the CPU is obliged to execute W after R.
If it executed W first then the memory subsystem would respond to R's
read request with the value stored by W (or an even later store), in
violation of the read-write coherence rule.  Similarly, if we had

	W ->po-loc W'

and the CPU executed W' before W, then the memory subsystem would put
W' before W in the coherence order.  It would effectively cause W to
overwrite W', in violation of the write-write coherence rule.
(Interestingly, an early ARMv8 memory model, now obsolete, proposed
allowing out-of-order writes like this to occur.  The model avoided
violating the write-write coherence rule by requiring the CPU not to
send the W write to the memory subsystem at all!)


AND THEN THERE WAS ALPHA
------------------------

As mentioned above, the Alpha architecture is unique in that it does
not appear to respect address dependencies to loads.  This means that
code such as the following:

	int x = 0;
	int y = -1;
	int *ptr = &y;

	P0()
	{
		WRITE_ONCE(x, 1);
		smp_wmb();
		WRITE_ONCE(ptr, &x);
	}

	P1()
	{
		int *r1;
		int r2;

		r1 = ptr;
		r2 = READ_ONCE(*r1);
	}

can malfunction on Alpha systems (notice that P1 uses an ordinary load
to read ptr instead of READ_ONCE()).  It is quite possible that r1 = &x
and r2 = 0 at the end, in spite of the address dependency.

At first glance this doesn't seem to make sense.  We know that the
smp_wmb() forces P0's store to x to propagate to P1 before the store
to ptr does.  And since P1 can't execute its second load
until it knows what location to load from, i.e., after executing its
first load, the value x = 1 must have propagated to P1 before the
second load executed.  So why doesn't r2 end up equal to 1?

The answer lies in the Alpha's split local caches.  Although the two
stores do reach P1's local cache in the proper order, it can happen
that the first store is processed by a busy part of the cache while
the second store is processed by an idle part.  As a result, the x = 1
value may not become available for P1's CPU to read until after the
ptr = &x value does, leading to the undesirable result above.  The
final effect is that even though the two loads really are executed in
program order, it appears that they aren't.

This could not have happened if the local cache had processed the
incoming stores in FIFO order.
By contrast, other architectures
maintain at least the appearance of FIFO order.

In practice, this difficulty is solved by inserting a special fence
between P1's two loads when the kernel is compiled for the Alpha
architecture.  In fact, as of version 4.15, the kernel automatically
adds this fence after every READ_ONCE() and atomic load on Alpha.  The
effect of the fence is to cause the CPU not to execute any po-later
instructions until after the local cache has finished processing all
the stores it has already received.  Thus, if the code was changed to:

	P1()
	{
		int *r1;
		int r2;

		r1 = READ_ONCE(ptr);
		r2 = READ_ONCE(*r1);
	}

then we would never get r1 = &x and r2 = 0.  By the time P1 executed
its second load, the x = 1 store would already be fully processed by
the local cache and available for satisfying the read request.  Thus
we have yet another reason why shared data should always be read with
READ_ONCE() or another synchronization primitive rather than accessed
directly.

The LKMM requires that smp_rmb(), acquire fences, and strong fences
share this property: They do not allow the CPU to execute any po-later
instructions (or po-later loads in the case of smp_rmb()) until all
outstanding stores have been processed by the local cache.  In the
case of a strong fence, the CPU first has to wait for all of its
po-earlier stores to propagate to every other CPU in the system; then
it has to wait for the local cache to process all the stores received
as of that time -- not just the stores received when the strong fence
began.

And of course, none of this matters for any architecture other than
Alpha.


THE HAPPENS-BEFORE RELATION: hb
-------------------------------

The happens-before relation (hb) links memory accesses that have to
execute in a certain order.  hb includes the ppo relation and two
others, one of which is rfe.

W ->rfe R implies that W and R are on different CPUs.  It also means
that W's store must have propagated to R's CPU before R executed;
otherwise R could not have read the value stored by W.  Therefore W
must have executed before R, and so we have W ->hb R.

The equivalent fact need not hold if W ->rfi R (i.e., W and R are on
the same CPU).  As we have already seen, the operational model allows
W's value to be forwarded to R in such cases, meaning that R may well
execute before W does.

It's important to understand that neither coe nor fre is included in
hb, despite their similarities to rfe.  For example, suppose we have
W ->coe W'.  This means that W and W' are stores to the same location,
they execute on different CPUs, and W comes before W' in the coherence
order (i.e., W' overwrites W).  Nevertheless, it is possible for W' to
execute before W, because the decision as to which store overwrites
the other is made later by the memory subsystem.  When the stores are
nearly simultaneous, either one can come out on top.  Similarly,
R ->fre W means that W overwrites the value which R reads, but it
doesn't mean that W has to execute after R.  All that's necessary is
for the memory subsystem not to propagate W to R's CPU until after R
has executed, which is possible if W executes shortly before R.

The third relation included in hb is like ppo, in that it only links
events that are on the same CPU.
However, it is more difficult to
explain, because it arises only indirectly from the requirement of
cache coherence.  The relation is called prop, and it links two events
on CPU C in situations where a store from some other CPU comes after
the first event in the coherence order and propagates to C before the
second event executes.

This is best explained with some examples.  The simplest case looks
like this:

	int x;

	P0()
	{
		int r1;

		WRITE_ONCE(x, 1);
		r1 = READ_ONCE(x);
	}

	P1()
	{
		WRITE_ONCE(x, 8);
	}

If r1 = 8 at the end then P0's accesses must have executed in program
order.  We can deduce this from the operational model; if P0's load
had executed before its store then the value of the store would have
been forwarded to the load, so r1 would have ended up equal to 1, not
8.  In this case there is a prop link from P0's write event to its read
event, because P1's store came after P0's store in x's coherence
order, and P1's store propagated to P0 before P0's load executed.

An equally simple case involves two loads of the same location that
read from different stores:

	int x = 0;

	P0()
	{
		int r1, r2;

		r1 = READ_ONCE(x);
		r2 = READ_ONCE(x);
	}

	P1()
	{
		WRITE_ONCE(x, 9);
	}

If r1 = 0 and r2 = 9 at the end then P0's accesses must have executed
in program order.  If the second load had executed before the first
then the x = 9 store must have been propagated to P0 before the first
load executed, and so r1 would have been 9 rather than 0.  In this
case there is a prop link from P0's first read event to its second,
because P1's store overwrote the value read by P0's first load, and
P1's store propagated to P0 before P0's second load executed.

Less trivial examples of prop all involve fences.  Unlike the simple
examples above, they can require that some instructions are executed
out of program order.  This next one should look familiar:

	int buf = 0, flag = 0;

	P0()
	{
		WRITE_ONCE(buf, 1);
		smp_wmb();
		WRITE_ONCE(flag, 1);
	}

	P1()
	{
		int r1;
		int r2;

		r1 = READ_ONCE(flag);
		r2 = READ_ONCE(buf);
	}

This is the MP pattern again, with an smp_wmb() fence between the two
stores.  If r1 = 1 and r2 = 0 at the end then there is a prop link
from P1's second load to its first (backwards!).  The reason is
similar to the previous examples: The value P1 loads from buf gets
overwritten by P0's store to buf, the fence guarantees that the store
to buf will propagate to P1 before the store to flag does, and the
store to flag propagates to P1 before P1 reads flag.

The prop link says that in order to obtain the r1 = 1, r2 = 0 result,
P1 must execute its second load before the first.  Indeed, if the load
from flag were executed first, then the buf = 1 store would already
have propagated to P1 by the time P1's load from buf executed, so r2
would have been 1 at the end, not 0.  (The reasoning holds even for
Alpha, although the details are more complicated and we will not go
into them.)

But what if we put an smp_rmb() fence between P1's loads?
The fence would force the two loads to be executed in program order,
and it would generate a cycle in the hb relation: The fence would
create a ppo link (hence an hb link) from the first load to the
second, and the prop relation would give an hb link from the second
load to the first.  Since an instruction can't execute before itself,
we are forced to conclude that if an smp_rmb() fence is added, the
r1 = 1, r2 = 0 outcome is impossible -- as it should be.

The formal definition of the prop relation involves a coe or fre link,
followed by an arbitrary number of cumul-fence links, ending with an
rfe link.  You can concoct more exotic examples, containing more than
one fence, although this quickly leads to diminishing returns in terms
of complexity.  For instance, here's an example containing a coe link
followed by two cumul-fences and an rfe link, utilizing the fact that
release fences are A-cumulative:

	int x, y, z;

	P0()
	{
		int r0;

		WRITE_ONCE(x, 1);
		r0 = READ_ONCE(z);
	}

	P1()
	{
		WRITE_ONCE(x, 2);
		smp_wmb();
		WRITE_ONCE(y, 1);
	}

	P2()
	{
		int r2;

		r2 = READ_ONCE(y);
		smp_store_release(&z, 1);
	}

If x = 2, r0 = 1, and r2 = 1 after this code runs then there is a prop
link from P0's store to its load.  This is because P0's store gets
overwritten by P1's store since x = 2 at the end (a coe link), the
smp_wmb() ensures that P1's store to x propagates to P2 before the
store to y does (the first cumul-fence), the store to y propagates to
P2 before P2's load and store execute, P2's smp_store_release()
guarantees that the stores to x and y both propagate to P0 before the
store to z does (the second cumul-fence), and P0's load executes after
the store to z has propagated to P0 (an rfe link).

In summary, the fact that the hb relation links memory access events
in the order they execute means that it must not have cycles.  This
requirement is the content of the LKMM's "happens-before" axiom.

The LKMM defines yet another relation connected to times of
instruction execution, but it is not included in hb.  It relies on the
particular properties of strong fences, which we cover in the next
section.


THE PROPAGATES-BEFORE RELATION: pb
----------------------------------

The propagates-before (pb) relation capitalizes on the special
features of strong fences.  It links two events E and F whenever some
store is coherence-later than E and propagates to every CPU and to RAM
before F executes.  The formal definition requires that E be linked to
F via a coe or fre link, an arbitrary number of cumul-fences, an
optional rfe link, a strong fence, and an arbitrary number of hb
links.  Let's see how this definition works out.

Consider first the case where E is a store (implying that the sequence
of links begins with coe).  Then there are events W, X, Y, and Z such
that:

	E ->coe W ->cumul-fence* X ->rfe? Y ->strong-fence Z ->hb* F,

where the * suffix indicates an arbitrary number of links of the
specified type, and the ? suffix indicates the link is optional (Y may
be equal to X).  Because of the cumul-fence links, we know that W will
propagate to Y's CPU before X does, hence before Y executes and hence
before the strong fence executes.
Because this fence is strong, we
know that W will propagate to every CPU and to RAM before Z executes.
And because of the hb links, we know that Z will execute before F.
Thus W, which comes later than E in the coherence order, will
propagate to every CPU and to RAM before F executes.

The case where E is a load is exactly the same, except that the first
link in the sequence is fre instead of coe.

The existence of a pb link from E to F implies that E must execute
before F.  To see why, suppose that F executed first.  Then W would
have propagated to E's CPU before E executed.  If E was a store, the
memory subsystem would then be forced to make E come after W in the
coherence order, contradicting the fact that E ->coe W.  If E was a
load, the memory subsystem would then be forced to satisfy E's read
request with the value stored by W or an even later store,
contradicting the fact that E ->fre W.

A good example illustrating how pb works is the SB pattern with strong
fences:

	int x = 0, y = 0;

	P0()
	{
		int r0;

		WRITE_ONCE(x, 1);
		smp_mb();
		r0 = READ_ONCE(y);
	}

	P1()
	{
		int r1;

		WRITE_ONCE(y, 1);
		smp_mb();
		r1 = READ_ONCE(x);
	}

If r0 = 0 at the end then there is a pb link from P0's load to P1's
load: an fre link from P0's load to P1's store (which overwrites the
value read by P0), and a strong fence between P1's store and its load.
In this example, the sequences of cumul-fence and hb links are empty.
Note that this pb link is not included in hb as an instance of prop,
because it does not start and end on the same CPU.

Similarly, if r1 = 0 at the end then there is a pb link from P1's load
to P0's.  This means that if both r0 and r1 were 0 there would be a
cycle in pb, which is not possible since an instruction cannot execute
before itself.  Thus, adding smp_mb() fences to the SB pattern
prevents the r0 = 0, r1 = 0 outcome.

In summary, the fact that the pb relation links events in the order
they execute means that it cannot have cycles.  This requirement is
the content of the LKMM's "propagation" axiom.


RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb
------------------------------------------------------------------------

RCU (Read-Copy-Update) is a powerful synchronization mechanism.  It
rests on two concepts: grace periods and read-side critical sections.

A grace period is the span of time occupied by a call to
synchronize_rcu().  A read-side critical section (or just critical
section, for short) is a region of code delimited by rcu_read_lock()
at the start and rcu_read_unlock() at the end.  Critical sections can
be nested, although we won't make use of this fact.

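For example, nesting simply means that one lock/unlock pair encloses
another (a sketch, not a litmus test):

	rcu_read_lock();	/* outer critical section starts */
	rcu_read_lock();	/* nested critical section starts */
	... some memory accesses ...
	rcu_read_unlock();	/* nested critical section ends */
	rcu_read_unlock();	/* outer critical section ends */
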
As far as memory models are concerned, RCU's main feature is its
Grace-Period Guarantee, which states that a critical section can
never span a full grace period.  In more detail, the Guarantee says:

	For any critical section C and any grace period G, at least
	one of the following statements must hold:

(1)	C ends before G does, and in addition, every store that
	propagates to C's CPU before the end of C must propagate to
	every CPU before G ends.

(2)	G starts before C does, and in addition, every store that
	propagates to G's CPU before the start of G must propagate
	to every CPU before C starts.

In particular, it is not possible for a critical section to both start
before and end after a grace period.

Here is a simple example of RCU in action:

	int x, y;

	P0()
	{
		rcu_read_lock();
		WRITE_ONCE(x, 1);
		WRITE_ONCE(y, 1);
		rcu_read_unlock();
	}

	P1()
	{
		int r1, r2;

		r1 = READ_ONCE(x);
		synchronize_rcu();
		r2 = READ_ONCE(y);
	}

The Grace Period Guarantee tells us that when this code runs, it will
never end with r1 = 1 and r2 = 0.  The reasoning is as follows.
r1 = 1 means that P0's store to x propagated to P1 before P1 called
synchronize_rcu(), so P0's critical section must have started before
P1's grace period, contrary to part (2) of the Guarantee.  On the
other hand, r2 = 0 means that P0's store to y, which occurs before the
end of the critical section, did not propagate to P1 before the end of
the grace period, contrary to part (1).  Together the results violate
the Guarantee.

In the kernel's implementations of RCU, the requirements for stores
to propagate to every CPU are fulfilled by placing strong fences at
suitable places in the RCU-related code.  Thus, if a critical section
starts before a grace period does then the critical section's CPU will
execute an smp_mb() fence after the end of the critical section and
some time before the grace period's synchronize_rcu() call returns.
And if a critical section ends after a grace period does then the
synchronize_rcu() routine will execute an smp_mb() fence at its start
and some time before the critical section's opening rcu_read_lock()
executes.

What exactly do we mean by saying that a critical section "starts
before" or "ends after" a grace period?  Some aspects of the meaning
are pretty obvious, as in the example above, but the details aren't
entirely clear.  The LKMM formalizes this notion by means of the
rcu-link relation.  rcu-link encompasses a very general notion of
"before": If E and F are RCU fence events (i.e., rcu_read_lock(),
rcu_read_unlock(), or synchronize_rcu()) then among other things,
E ->rcu-link F includes cases where E is po-before some memory-access
event X, F is po-after some memory-access event Y, and we have any of
X ->rfe Y, X ->co Y, or X ->fr Y.

The formal definition of the rcu-link relation is more than a little
obscure, and we won't give it here.  It is closely related to the pb
relation, and the details don't matter unless you want to comb through
a somewhat lengthy formal proof.  Pretty much all you need to know
about rcu-link is the information in the preceding paragraph.

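As one concrete instance of the pattern just described (a sketch of
our own devising; the E, X, Y, and F in the comments are the events
named in the preceding paragraph), the following code exhibits
E ->rcu-link F whenever r1 = 1:

	int x;

	P0()
	{
		synchronize_rcu();	/* E */
		WRITE_ONCE(x, 1);	/* X: po-after E */
	}

	P1()
	{
		int r1;

		rcu_read_lock();
		r1 = READ_ONCE(x);	/* Y: reads from X when r1 = 1 */
		rcu_read_unlock();	/* F: po-after Y */
	}

Here E is po-before X, F is po-after Y, and X ->rfe Y, so the grace
period's fence event is rcu-linked to the rcu_read_unlock() event.
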
The LKMM also defines the rcu-gp and rcu-rscsi relations.  They bring
grace periods and read-side critical sections into the picture, in the
following way:

	E ->rcu-gp F means that E and F are in fact the same event,
	and that event is a synchronize_rcu() fence (i.e., a grace
	period).

	E ->rcu-rscsi F means that E and F are the rcu_read_unlock()
	and rcu_read_lock() fence events delimiting some read-side
	critical section.  (The 'i' at the end of the name emphasizes
	that this relation is "inverted": It links the end of the
	critical section to the start.)

If we think of the rcu-link relation as standing for an extended
"before", then X ->rcu-gp Y ->rcu-link Z roughly says that X is a
grace period which ends before Z begins.  (In fact it covers more than
this, because it also includes cases where some store propagates to
Z's CPU before Z begins but doesn't propagate to some other CPU until
after X ends.)  Similarly, X ->rcu-rscsi Y ->rcu-link Z says that X is
the end of a critical section which starts before Z begins.

The LKMM goes on to define the rcu-order relation as a sequence of
rcu-gp and rcu-rscsi links separated by rcu-link links, in which the
number of rcu-gp links is >= the number of rcu-rscsi links.  For
example:

	X ->rcu-gp Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V

would imply that X ->rcu-order V, because this sequence contains two
rcu-gp links and one rcu-rscsi link.  (It also implies that
X ->rcu-order T and Z ->rcu-order V.)  On the other hand:

	X ->rcu-rscsi Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V

does not imply X ->rcu-order V, because the sequence contains only
one rcu-gp link but two rcu-rscsi links.

The rcu-order relation is important because the Grace Period Guarantee
means that rcu-order links act kind of like strong fences.  In
particular, E ->rcu-order F implies not only that E begins before F
ends, but also that any write po-before E will propagate to every CPU
before any instruction po-after F can execute.  (However, it does not
imply that E must execute before F; in fact, each synchronize_rcu()
fence event is linked to itself by rcu-order as a degenerate case.)

To prove this in full generality requires some intellectual effort.
We'll consider just a very simple case:

	G ->rcu-gp W ->rcu-link Z ->rcu-rscsi F.

This formula means that G and W are the same event (a grace period),
and there are events X, Y and a read-side critical section C such
that:

	1. G = W is po-before or equal to X;

	2. X comes "before" Y in some sense (including rfe, co and fr);

	3. Y is po-before Z;

	4. Z is the rcu_read_unlock() event marking the end of C;

	5. F is the rcu_read_lock() event marking the start of C.

From 1 - 4 we deduce that the grace period G ends before the critical
section C.  Then part (2) of the Grace Period Guarantee says not only
that G starts before C does, but also that any write which executes on
G's CPU before G starts must propagate to every CPU before C starts.
In particular, the write propagates to every CPU before F finishes
executing and hence before any instruction po-after F can execute.
This sort of reasoning can be extended to handle all the situations
covered by rcu-order.

The rcu-fence relation is a simple extension of rcu-order.  While
rcu-order only links certain fence events (calls to synchronize_rcu(),
rcu_read_lock(), or rcu_read_unlock()), rcu-fence links any events
that are separated by an rcu-order link.  This is analogous to the way
the strong-fence relation links events that are separated by an
smp_mb() fence event (as mentioned above, rcu-order links act kind of
like strong fences).  Written symbolically, X ->rcu-fence Y means
there are fence events E and F such that:

	X ->po E ->rcu-order F ->po Y.

From the discussion above, we see this implies not only that X
executes before Y, but also (if X is a store) that X propagates to
every CPU before Y executes.  Thus rcu-fence is sort of a
"super-strong" fence: Unlike the original strong fences (smp_mb() and
synchronize_rcu()), rcu-fence is able to link events on different
CPUs.  (Perhaps this fact should lead us to say that rcu-fence isn't
really a fence at all!)

Finally, the LKMM defines the RCU-before (rb) relation in terms of
rcu-fence.  This is done in essentially the same way as the pb
relation was defined in terms of strong-fence.  We will omit the
details; the end result is that E ->rb F implies E must execute
before F, just as E ->pb F does (and for much the same reasons).

Putting this all together, the LKMM expresses the Grace Period
Guarantee by requiring that the rb relation does not contain a cycle.
Equivalently, this "rcu" axiom requires that there are no events E
and F with E ->rcu-link F ->rcu-order E.  Or to put it a third way,
the axiom requires that there are no cycles consisting of rcu-gp and
rcu-rscsi alternating with rcu-link, where the number of rcu-gp links
is >= the number of rcu-rscsi links.

Justifying the axiom isn't easy, but it is in fact a valid
formalization of the Grace Period Guarantee.  We won't attempt to go
through the detailed argument, but the following analysis gives a
taste of what is involved.  Suppose both parts of the Guarantee are
violated: A critical section starts before a grace period, and some
store propagates to the critical section's CPU before the end of the
critical section but doesn't propagate to some other CPU until after
the end of the grace period.

Putting symbols to these ideas, let L and U be the rcu_read_lock() and
rcu_read_unlock() fence events delimiting the critical section in
question, and let S be the synchronize_rcu() fence event for the grace
period.  Saying that the critical section starts before S means there
are events Q and R where Q is po-after L (which marks the start of the
critical section), Q is "before" R in the sense used by the rcu-link
relation, and R is po-before the grace period S.  Thus we have:

	L ->rcu-link S.

Let W be the store mentioned above, let Y come before the end of the
critical section and witness that W propagates to the critical
section's CPU by reading from W, and let Z on some arbitrary CPU be a
witness that W has not propagated to that CPU, where Z happens after
some event X which is po-after S.  Symbolically, this amounts to:

	S ->po X ->hb* Z ->fr W ->rf Y ->po U.

The fr link from Z to W indicates that W has not propagated to Z's CPU
at the time that Z executes.
From this, it can be shown (see the
discussion of the rcu-link relation earlier) that S and U are related
by rcu-link:

	S ->rcu-link U.

Since S is a grace period we have S ->rcu-gp S, and since L and U are
the start and end of the critical section C we have U ->rcu-rscsi L.
From this we obtain:

	S ->rcu-gp S ->rcu-link U ->rcu-rscsi L ->rcu-link S,

a forbidden cycle.  Thus the "rcu" axiom rules out this violation of
the Grace Period Guarantee.

For something a little more down-to-earth, let's see how the axiom
works out in practice.  Consider the RCU code example from above, this
time with statement labels added:

	int x, y;

	P0()
	{
		L: rcu_read_lock();
		X: WRITE_ONCE(x, 1);
		Y: WRITE_ONCE(y, 1);
		U: rcu_read_unlock();
	}

	P1()
	{
		int r1, r2;

		Z: r1 = READ_ONCE(x);
		S: synchronize_rcu();
		W: r2 = READ_ONCE(y);
	}

If r2 = 0 at the end then P0's store at Y overwrites the value that
P1's load at W reads from, so we have W ->fre Y.  Since S ->po W and
also Y ->po U, we get S ->rcu-link U.  In addition, S ->rcu-gp S
because S is a grace period.

If r1 = 1 at the end then P1's load at Z reads from P0's store at X,
so we have X ->rfe Z.  Together with L ->po X and Z ->po S, this
yields L ->rcu-link S.  And since L and U are the start and end of a
critical section, we have U ->rcu-rscsi L.

Then U ->rcu-rscsi L ->rcu-link S ->rcu-gp S ->rcu-link U is a
forbidden cycle, violating the "rcu" axiom.  Hence the outcome is not
allowed by the LKMM, as we would expect.

For contrast, let's see what can happen in a more complicated example:

	int x, y, z;

	P0()
	{
		int r0;

		L0: rcu_read_lock();
		r0 = READ_ONCE(x);
		WRITE_ONCE(y, 1);
		U0: rcu_read_unlock();
	}

	P1()
	{
		int r1;

		r1 = READ_ONCE(y);
		S1: synchronize_rcu();
		WRITE_ONCE(z, 1);
	}

	P2()
	{
		int r2;

		L2: rcu_read_lock();
		r2 = READ_ONCE(z);
		WRITE_ONCE(x, 1);
		U2: rcu_read_unlock();
	}

If r0 = r1 = r2 = 1 at the end, then similar reasoning to before shows
that U0 ->rcu-rscsi L0 ->rcu-link S1 ->rcu-gp S1 ->rcu-link U2
->rcu-rscsi L2 ->rcu-link U0.  However this cycle is not forbidden,
because the sequence of relations contains fewer instances of rcu-gp
(one) than of rcu-rscsi (two).  Consequently the outcome is allowed by
the LKMM.  The following instruction timing diagram shows how it might
actually occur:

	P0			P1			P2
	--------------------	--------------------	--------------------
	rcu_read_lock()
	WRITE_ONCE(y, 1)
				r1 = READ_ONCE(y)
				synchronize_rcu() starts
				.			rcu_read_lock()
				.			WRITE_ONCE(x, 1)
	r0 = READ_ONCE(x)	.
	rcu_read_unlock()	.
				synchronize_rcu() ends
				WRITE_ONCE(z, 1)
							r2 = READ_ONCE(z)
							rcu_read_unlock()

This requires P0 and P2 to execute their loads and stores out of
program order, but of course they are allowed to do so.  And as you
can see, the Grace Period Guarantee is not violated: The critical
section in P0 both starts before P1's grace period does and ends
before it does, and the critical section in P2 both starts after P1's
grace period does and ends after it does.

The LKMM supports SRCU (Sleepable Read-Copy-Update) in addition to
normal RCU.  The ideas involved are much the same as above, with new
relations srcu-gp and srcu-rscsi added to represent SRCU grace periods
and read-side critical sections.
However, there are some significant
differences between RCU read-side critical sections and their SRCU
counterparts, as described in the next section.


SRCU READ-SIDE CRITICAL SECTIONS
--------------------------------

The LKMM uses the srcu-rscsi relation to model SRCU read-side critical
sections.  They differ from RCU read-side critical sections in the
following respects:

1.	Unlike the analogous RCU primitives, synchronize_srcu(),
	srcu_read_lock(), and srcu_read_unlock() take a pointer to a
	struct srcu_struct as an argument.  This structure is called
	an SRCU domain, and calls linked by srcu-rscsi must have the
	same domain.  Read-side critical sections and grace periods
	associated with different domains are independent of one
	another; the SRCU version of the RCU Guarantee applies only
	to pairs of critical sections and grace periods having the
	same domain.

2.	srcu_read_lock() returns a value, called the index, which must
	be passed to the matching srcu_read_unlock() call.  Unlike
	rcu_read_lock() and rcu_read_unlock(), an srcu_read_lock()
	call does not always have to match the next unpaired
	srcu_read_unlock().  In fact, it is possible for two SRCU
	read-side critical sections to overlap partially, as in the
	following example (where s is an srcu_struct and idx1 and idx2
	are integer variables):

		idx1 = srcu_read_lock(&s);	// Start of first RSCS
		idx2 = srcu_read_lock(&s);	// Start of second RSCS
		srcu_read_unlock(&s, idx1);	// End of first RSCS
		srcu_read_unlock(&s, idx2);	// End of second RSCS

	The matching is determined entirely by the domain pointer and
	index value.  By contrast, if the calls had been
	rcu_read_lock() and rcu_read_unlock() then they would have
	created two nested (fully overlapping) read-side critical
	sections: an inner one and an outer one.

3.	The srcu_down_read() and srcu_up_read() primitives work
	exactly like srcu_read_lock() and srcu_read_unlock(), except
	that matching calls don't have to execute within the same
	context.  (The names are meant to be suggestive of operations
	on semaphores.)  Since the matching is determined by the
	domain pointer and index value, these primitives make it
	possible for an SRCU read-side critical section to start on
	one CPU and end on another, so to speak.

In order to account for these properties of SRCU, the LKMM models
srcu_read_lock() as a special type of load event (which is
appropriate, since it takes a memory location as argument and returns
a value, just as a load does) and srcu_read_unlock() as a special type
of store event (again appropriate, since it takes as arguments a
memory location and a value).  These loads and stores are annotated as
belonging to the "srcu-lock" and "srcu-unlock" event classes
respectively.

This approach allows the LKMM to tell whether two events are
associated with the same SRCU domain, simply by checking whether they
access the same memory location (i.e., they are linked by the loc
relation).  It also gives a way to tell which unlock matches a
particular lock, by checking for the presence of a data dependency
from the load (srcu-lock) to the store (srcu-unlock).
For example,
given the situation outlined earlier (with statement labels added):

	A: idx1 = srcu_read_lock(&s);
	B: idx2 = srcu_read_lock(&s);
	C: srcu_read_unlock(&s, idx1);
	D: srcu_read_unlock(&s, idx2);

the LKMM will treat A and B as loads from s yielding values saved in
idx1 and idx2 respectively.  Similarly, it will treat C and D as
though they stored the values from idx1 and idx2 in s.  The end result
is much as if we had written:

	A: idx1 = READ_ONCE(s);
	B: idx2 = READ_ONCE(s);
	C: WRITE_ONCE(s, idx1);
	D: WRITE_ONCE(s, idx2);

except for the presence of the special srcu-lock and srcu-unlock
annotations.  You can see at once that we have A ->data C and
B ->data D.  These dependencies tell the LKMM that C is the
srcu-unlock event matching srcu-lock event A, and D is the
srcu-unlock event matching srcu-lock event B.

This approach is admittedly a hack, and it has the potential to lead
to problems.  For example, in:

	idx1 = srcu_read_lock(&s);
	srcu_read_unlock(&s, idx1);
	idx2 = srcu_read_lock(&s);
	srcu_read_unlock(&s, idx2);

the LKMM will believe that idx2 must have the same value as idx1,
since it reads from the immediately preceding store of idx1 in s.
Fortunately this won't matter, assuming that litmus tests never do
anything with SRCU index values other than pass them to
srcu_read_unlock() or srcu_up_read() calls.

However, sometimes it is necessary to store an index value in a
shared variable temporarily.  In fact, this is the only way for
srcu_down_read() to pass the index it gets to an srcu_up_read() call
on a different CPU.  In more detail, we might have something like:

	struct srcu_struct s;
	int x;

	P0()
	{
		int r0;

		A: r0 = srcu_down_read(&s);
		B: WRITE_ONCE(x, r0);
	}

	P1()
	{
		int r1;

		C: r1 = READ_ONCE(x);
		D: srcu_up_read(&s, r1);
	}

Assuming that P1 executes after P0 and does read the index value
stored in x, we can write this (using brackets to represent event
annotations) as:

	A[srcu-lock] ->data B[once] ->rf C[once] ->data D[srcu-unlock].

The LKMM defines a carry-srcu-data relation to express this pattern;
it permits an arbitrarily long sequence of

	data ; rf

pairs (that is, a data link followed by an rf link) to occur between
an srcu-lock event and the final data dependency leading to the
matching srcu-unlock event.  carry-srcu-data is complicated by the
need to ensure that none of the intermediate store events in this
sequence are instances of srcu-unlock.  This is necessary because in a
pattern like the one above:

	A: idx1 = srcu_read_lock(&s);
	B: srcu_read_unlock(&s, idx1);
	C: idx2 = srcu_read_lock(&s);
	D: srcu_read_unlock(&s, idx2);

the LKMM treats B as a store to the variable s and C as a load from
that variable, creating an undesirable rf link from B to C:

	A ->data B ->rf C ->data D.

This would cause carry-srcu-data to mistakenly extend a data
dependency from A to D, giving the impression that D was the
srcu-unlock event matching A's srcu-lock.  To avoid such problems,
carry-srcu-data does not accept sequences in which the end of any of
the intermediate ->data links (B above) is an srcu-unlock event.


LOCKING
-------

The LKMM includes locking.  In fact, there is special code for locking
in the formal model, added in order to make tools run faster.
However, this special code is intended to be more or less equivalent
to concepts we have already covered.  A spinlock_t variable is treated
the same as an int, and spin_lock(&s) is treated almost the same as:

	while (cmpxchg_acquire(&s, 0, 1) != 0)
		cpu_relax();

This waits until s is equal to 0 and then atomically sets it to 1,
and the read part of the cmpxchg operation acts as an acquire fence.
An alternate way to express the same thing would be:

	r = xchg_acquire(&s, 1);

along with a requirement that at the end, r = 0.  Similarly,
spin_trylock(&s) is treated almost the same as:

	return !cmpxchg_acquire(&s, 0, 1);

which atomically sets s to 1 if it is currently equal to 0 and returns
true if it succeeds (the read part of the cmpxchg operation acts as an
acquire fence only if the operation is successful).  spin_unlock(&s)
is treated almost the same as:

	smp_store_release(&s, 0);

The "almost" qualifiers above need some explanation.  In the LKMM, the
store-release in a spin_unlock() and the load-acquire which forms the
first half of the atomic rmw update in a spin_lock() or a successful
spin_trylock() -- we can call these things lock-releases and
lock-acquires -- have two properties beyond those of ordinary releases
and acquires.

First, when a lock-acquire reads from or is po-after a lock-release,
the LKMM requires that every instruction po-before the lock-release
must execute before any instruction po-after the lock-acquire.  This
would naturally hold if the release and acquire operations were on
different CPUs and accessed the same lock variable, but the LKMM says
it also holds when they are on the same CPU, even if they access
different lock variables.  For example:

	int x, y;
	spinlock_t s, t;

	P0()
	{
		int r1, r2;

		spin_lock(&s);
		r1 = READ_ONCE(x);
		spin_unlock(&s);
		spin_lock(&t);
		r2 = READ_ONCE(y);
		spin_unlock(&t);
	}

	P1()
	{
		WRITE_ONCE(y, 1);
		smp_wmb();
		WRITE_ONCE(x, 1);
	}

Here the second spin_lock() is po-after the first spin_unlock(), and
therefore the load of x must execute before the load of y, even though
the two locking operations use different locks.  Thus we cannot have
r1 = 1 and r2 = 0 at the end (this is an instance of the MP pattern).

This requirement does not apply to ordinary release and acquire
fences, only to lock-related operations.  For instance, suppose P0()
in the example had been written as:

	P0()
	{
		int r1, r2, r3;

		r1 = READ_ONCE(x);
		smp_store_release(&s, 1);
		r3 = smp_load_acquire(&s);
		r2 = READ_ONCE(y);
	}

Then the CPU would be allowed to forward the s = 1 value from the
smp_store_release() to the smp_load_acquire(), executing the
instructions in the following order:

	r3 = smp_load_acquire(&s);	// Obtains r3 = 1
	r2 = READ_ONCE(y);
	r1 = READ_ONCE(x);
	smp_store_release(&s, 1);	// Value is forwarded

and thus it could load y before x, obtaining r2 = 0 and r1 = 1.

Second, when a lock-acquire reads from or is po-after a lock-release,
and some other stores W and W' occur po-before the lock-release and
po-after the lock-acquire respectively, the LKMM requires that W must
propagate to each CPU before W' does.
For example, consider:

	int x, y;
	spinlock_t s;

	P0()
	{
		spin_lock(&s);
		WRITE_ONCE(x, 1);
		spin_unlock(&s);
	}

	P1()
	{
		int r1;

		spin_lock(&s);
		r1 = READ_ONCE(x);
		WRITE_ONCE(y, 1);
		spin_unlock(&s);
	}

	P2()
	{
		int r2, r3;

		r2 = READ_ONCE(y);
		smp_rmb();
		r3 = READ_ONCE(x);
	}

If r1 = 1 at the end then the spin_lock() in P1 must have read from
the spin_unlock() in P0.  Hence the store to x must propagate to P2
before the store to y does, so we cannot have r2 = 1 and r3 = 0.  But
if P1 had used a lock variable different from s, the writes could have
propagated in either order.  (On the other hand, if the code in P0 and
P1 had all executed on a single CPU, as in the example before this
one, then the writes would have propagated in order even if the two
critical sections used different lock variables.)

These two special requirements for lock-release and lock-acquire do
not arise from the operational model.  Nevertheless, kernel developers
have come to expect and rely on them because they do hold on all
architectures supported by the Linux kernel, albeit for various
differing reasons.


PLAIN ACCESSES AND DATA RACES
-----------------------------

In the LKMM, memory accesses such as READ_ONCE(x), atomic_inc(&y),
smp_load_acquire(&z), and so on are collectively referred to as
"marked" accesses, because they are all annotated with special
operations of one kind or another.  Ordinary C-language memory
accesses such as x or y = 0 are simply called "plain" accesses.

Early versions of the LKMM had nothing to say about plain accesses.
The C standard allows compilers to assume that the variables affected
by plain accesses are not concurrently read or written by any other
threads or CPUs.  This leaves compilers free to implement all manner
of transformations or optimizations of code containing plain accesses,
making such code very difficult for a memory model to handle.

Here is just one example of a possible pitfall:

	int a = 6;
	int *x = &a;

	P0()
	{
		int *r1;
		int r2 = 0;

		r1 = x;
		if (r1 != NULL)
			r2 = READ_ONCE(*r1);
	}

	P1()
	{
		WRITE_ONCE(x, NULL);
	}

On the face of it, one would expect that when this code runs, the only
possible final values for r2 are 6 and 0, depending on whether or not
P1's store to x propagates to P0 before P0's load from x executes.
But since P0's load from x is a plain access, the compiler may decide
to carry out the load twice (for the comparison against NULL, then
again for the READ_ONCE()) and eliminate the temporary variable r1.
The object code generated for P0 could therefore end up looking rather
like this:

	P0()
	{
		int r2 = 0;

		if (x != NULL)
			r2 = READ_ONCE(*x);
	}

And now it is obvious that this code runs the risk of dereferencing a
NULL pointer, because P1's store to x might propagate to P0 after the
test against NULL has been made but before the READ_ONCE() executes.
If the original code had said "r1 = READ_ONCE(x)" instead of "r1 = x",
the compiler would not have performed this optimization and there
would be no possibility of a NULL-pointer dereference.

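In other words, the race-free version of P0 would read the pointer
exactly once, as just described (a sketch):

	P0()
	{
		int *r1;
		int r2 = 0;

		r1 = READ_ONCE(x);	/* x is loaded only once */
		if (r1 != NULL)
			r2 = READ_ONCE(*r1);
	}
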
Given the possibility of transformations like this one, the LKMM
doesn't try to predict all possible outcomes of code containing plain
accesses.  It is instead content to determine whether the code
violates the compiler's assumptions, which would render the ultimate
outcome undefined.

In technical terms, the compiler is allowed to assume that when the
program executes, there will not be any data races.  A "data race"
occurs when there are two memory accesses such that:

1.	they access the same location,

2.	at least one of them is a store,

3.	at least one of them is plain,

4.	they occur on different CPUs (or in different threads on the
	same CPU), and

5.	they execute concurrently.

In the literature, two accesses are said to "conflict" if they satisfy
1 and 2 above.  We'll go a little farther and say that two accesses
are "race candidates" if they satisfy 1 - 4.  Thus, whether or not two
race candidates actually do race in a given execution depends on
whether they are concurrent.

The LKMM tries to determine whether a program contains race candidates
which may execute concurrently; if it does then the LKMM says there is
a potential data race and makes no predictions about the program's
outcome.

Determining whether two accesses are race candidates is easy; you can
see that all the concepts involved in the definition above are already
part of the memory model.  The hard part is telling whether they may
execute concurrently.  The LKMM takes a conservative attitude,
assuming that accesses may be concurrent unless it can prove they
are not.

If two memory accesses aren't concurrent then one must execute before
the other.  Therefore the LKMM decides two accesses aren't concurrent
if they can be connected by a sequence of hb, pb, and rb links
(together referred to as xb, for "executes before").  However, there
are two complicating factors.

If X is a load and X executes before a store Y, then indeed there is
no danger of X and Y being concurrent.  After all, Y can't have any
effect on the value obtained by X until the memory subsystem has
propagated Y from its own CPU to X's CPU, which won't happen until
some time after Y executes and thus after X executes.  But if X is a
store, then even if X executes before Y it is still possible that X
will propagate to Y's CPU just as Y is executing.  In such a case X
could very well interfere somehow with Y, and we would have to
consider X and Y to be concurrent.

Therefore when X is a store, for X and Y to be non-concurrent the LKMM
requires not only that X must execute before Y but also that X must
propagate to Y's CPU before Y executes.  (Or vice versa, of course, if
Y executes before X -- then Y must propagate to X's CPU before X
executes if Y is a store.)

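Before turning to how the LKMM formalizes all this, here is a minimal
sketch (ours) of two race candidates that the LKMM cannot prove to be
non-concurrent, and therefore must treat as a potential data race:

	int x;

	P0()
	{
		x = 1;			/* plain store */
	}

	P1()
	{
		int r1;

		r1 = READ_ONCE(x);	/* marked load */
	}

The two accesses satisfy conditions 1 - 4, and no fences or
dependencies connect them by xb links, so they may execute
concurrently.
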
The propagation requirement described above is expressed by the
visibility relation (vis), where X ->vis Y is defined to hold if there
is an intermediate event Z such that:

	X is connected to Z by a possibly empty sequence of
	cumul-fence links followed by an optional rfe link (if none of
	these links are present, X and Z are the same event),

and either:

	Z is connected to Y by a strong-fence link followed by a
	possibly empty sequence of xb links,

or:

	Z is on the same CPU as Y and is connected to Y by a possibly
	empty sequence of xb links (again, if the sequence is empty it
	means Z and Y are the same event).

The motivations behind this definition are straightforward:

	cumul-fence memory barriers force stores that are po-before
	the barrier to propagate to other CPUs before stores that are
	po-after the barrier.

	An rfe link from an event W to an event R says that R reads
	from W, which certainly means that W must have propagated to
	R's CPU before R executed.

	strong-fence memory barriers force stores that are po-before
	the barrier, or that propagate to the barrier's CPU before the
	barrier executes, to propagate to all CPUs before any events
	po-after the barrier can execute.

To see how this works out in practice, consider our old friend, the MP
pattern (with fences and statement labels, but without the conditional
test):

	int buf = 0, flag = 0;

	P0()
	{
		X: WRITE_ONCE(buf, 1);
		smp_wmb();
		W: WRITE_ONCE(flag, 1);
	}

	P1()
	{
		int r1;
		int r2 = 0;

		Z: r1 = READ_ONCE(flag);
		smp_rmb();
		Y: r2 = READ_ONCE(buf);
	}

The smp_wmb() memory barrier gives a cumul-fence link from X to W, and
assuming r1 = 1 at the end, there is an rfe link from W to Z.  This
means that the store to buf must propagate from P0 to P1 before Z
executes.  Next, Z and Y are on the same CPU and the smp_rmb() fence
provides an xb link from Z to Y (i.e., it forces Z to execute before
Y).  Therefore we have X ->vis Y: X must propagate to Y's CPU before Y
executes.

The second complicating factor mentioned above arises from the fact
that when we are considering data races, some of the memory accesses
are plain.  Now, although we have not said so explicitly, up to this
point most of the relations defined by the LKMM (ppo, hb, prop,
cumul-fence, pb, and so on -- including vis) apply only to marked
accesses.

There are good reasons for this restriction.  The compiler is not
allowed to apply fancy transformations to marked accesses, and
consequently each such access in the source code corresponds more or
less directly to a single machine instruction in the object code.  But
plain accesses are a different story; the compiler may combine them,
split them up, duplicate them, eliminate them, invent new ones, and
who knows what else.  Seeing a plain access in the source code tells
you almost nothing about what machine instructions will end up in the
object code.

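As one classic illustration (a hypothetical transformation of our
own, but of a kind compilers really are permitted to apply to plain
accesses), a busy-wait loop written with a plain load:

	while (!flag)
		;

may be compiled as though the load had been hoisted out of the loop:

	tmp = flag;	/* flag is loaded only once */
	while (!tmp)
		;

turning a wait that ought to be finite in the source code into an
infinite loop in the object code.
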
Fortunately, the compiler isn't completely free; it is subject to some
limitations.  For one, it is not allowed to introduce a data race into
the object code if the source code does not already contain a data
race (if it could, memory models would be useless and no multithreaded
code would be safe!).  For another, it cannot move a plain access past
a compiler barrier.

A compiler barrier is a kind of fence, but as the name implies, it
only affects the compiler; it does not necessarily have any effect on
how instructions are executed by the CPU.  In Linux kernel source
code, the barrier() function is a compiler barrier.  It doesn't give
rise directly to any machine instructions in the object code; rather,
it affects how the compiler generates the rest of the object code.
Given source code like this:

	... some memory accesses ...
	barrier();
	... some other memory accesses ...

the barrier() function ensures that the machine instructions
corresponding to the first group of accesses will all end po-before
any machine instructions corresponding to the second group of accesses
-- even if some of the accesses are plain.  (Of course, the CPU may
then execute some of those accesses out of program order, but we
already know how to deal with such issues.)  Without the barrier()
there would be no such guarantee; the two groups of accesses could be
intermingled or even reversed in the object code.

The LKMM doesn't say much about the barrier() function, but it does
require that all fences are also compiler barriers.  In addition, it
requires that the ordering properties of memory barriers such as
smp_rmb() or smp_store_release() apply to plain accesses as well as to
marked accesses.

This is the key to analyzing data races.  Consider the MP pattern
again, now using plain accesses for buf:

	int buf = 0, flag = 0;

	P0()
	{
		U: buf = 1;
		smp_wmb();
		X: WRITE_ONCE(flag, 1);
	}

	P1()
	{
		int r1;
		int r2 = 0;

		Y: r1 = READ_ONCE(flag);
		if (r1) {
			smp_rmb();
			V: r2 = buf;
		}
	}

This program does not contain a data race.  Although the U and V
accesses are race candidates, the LKMM can prove they are not
concurrent as follows:

	The smp_wmb() fence in P0 is both a compiler barrier and a
	cumul-fence.  It guarantees that no matter what hash of
	machine instructions the compiler generates for the plain
	access U, all those instructions will be po-before the fence.
	Consequently U's store to buf, no matter how it is carried out
	at the machine level, must propagate to P1 before X's store to
	flag does.

	X and Y are both marked accesses.  Hence an rfe link from X to
	Y is a valid indicator that X propagated to P1 before Y
	executed, i.e., X ->vis Y.  (And if there is no rfe link then
	r1 will be 0, so V will not be executed and ipso facto won't
	race with U.)

	The smp_rmb() fence in P1 is a compiler barrier as well as a
	fence.  It guarantees that all the machine-level instructions
	corresponding to the access V will be po-after the fence, and
	therefore any loads among those instructions will execute
	after the fence does and hence after Y does.

Thus U's store to buf is forced to propagate to P1 before V's load
executes (assuming V does execute), ruling out the possibility of a
data race between them.

This analysis illustrates how the LKMM deals with plain accesses in
general.  Suppose R is a plain load and we want to show that R
executes before some marked access E.  We can do this by finding a
marked access X such that R and X are ordered by a suitable fence and
X ->xb* E.  If E was also a plain access, we would also look for a
marked access Y such that X ->xb* Y, and Y and E are ordered by a
fence.  We describe this arrangement by saying that R is
"post-bounded" by X and E is "pre-bounded" by Y.

In fact, we go one step further: Since R is a read, we say that R is
"r-post-bounded" by X.  Similarly, E would be "r-pre-bounded" or
"w-pre-bounded" by Y, depending on whether E was a store or a load.
This distinction is needed because some fences affect only loads
(i.e., smp_rmb()) and some affect only stores (smp_wmb()); otherwise
the two types of bounds are the same.  And as a degenerate case, we
say that a marked access pre-bounds and post-bounds itself (e.g., if R
above were a marked load then X could simply be taken to be R itself.)

The need to distinguish between r- and w-bounding raises yet another
issue.  When the source code contains a plain store, the compiler is
allowed to put plain loads of the same location into the object code.
For example, given the source code:

	x = 1;

the compiler is theoretically allowed to generate object code that
looks like:

	if (x != 1)
		x = 1;

thereby adding a load (and possibly replacing the store entirely).
For this reason, whenever the LKMM requires a plain store to be
w-pre-bounded or w-post-bounded by a marked access, it also requires
the store to be r-pre-bounded or r-post-bounded, so as to handle cases
where the compiler adds a load.

(This may be overly cautious.  We don't know of any examples where a
compiler has augmented a store with a load in this fashion, and the
Linux kernel developers would probably fight pretty hard to change a
compiler if it ever did this.  Still, better safe than sorry.)

Incidentally, the other transformation -- augmenting a plain load by
adding in a store to the same location -- is not allowed.  This is
because the compiler cannot know whether any other CPUs might perform
a concurrent load from that location.  Two concurrent loads don't
constitute a race (they can't interfere with each other), but a store
does race with a concurrent load.  Thus adding a store might create a
data race where one was not already present in the source code,
something the compiler is forbidden to do.  Augmenting a store with a
load, on the other hand, is acceptable because doing so won't create a
data race unless one already existed.

The LKMM includes a second way to pre-bound plain accesses, in
addition to fences: an address dependency from a marked load.  That
is, in the sequence:

	p = READ_ONCE(ptr);
	r = *p;

the LKMM says that the marked load of ptr pre-bounds the plain load of
*p; the marked load must execute before any of the machine
instructions corresponding to the plain load.  This is a reasonable
stipulation, since after all, the CPU can't perform the load of *p
until it knows what value p will hold.  Furthermore, without some
assumption like this one, some usages typical of RCU would count as
data races.  For example:

	int a = 1, b;
	int *ptr = &a;

	P0()
	{
		b = 2;
		rcu_assign_pointer(ptr, &b);
	}

	P1()
	{
		int *p;
		int r;

		rcu_read_lock();
		p = rcu_dereference(ptr);
		r = *p;
		rcu_read_unlock();
	}

(In this example the rcu_read_lock() and rcu_read_unlock() calls don't
really do anything, because there aren't any grace periods.
They are
included merely for the sake of good form; typically P0 would call
synchronize_rcu() somewhere after the rcu_assign_pointer().)

rcu_assign_pointer() performs a store-release, so the plain store to b
is definitely w-post-bounded before the store to ptr, and the two
stores will propagate to P1 in that order.  However, rcu_dereference()
is only equivalent to READ_ONCE().  While it is a marked access, it is
not a fence or compiler barrier.  Hence the only guarantee we have
that the load of ptr in P1 is r-pre-bounded before the load of *p
(thus avoiding a race) is the assumption about address dependencies.

This is a situation where the compiler can undermine the memory model,
and a certain amount of care is required when programming constructs
like this one.  In particular, comparisons between the pointer and
other known addresses can cause trouble.  If you have something like:

	p = rcu_dereference(ptr);
	if (p == &x)
		r = *p;

then the compiler just might generate object code resembling:

	p = rcu_dereference(ptr);
	if (p == &x)
		r = x;

or even:

	rtemp = x;
	p = rcu_dereference(ptr);
	if (p == &x)
		r = rtemp;

which would invalidate the memory model's assumption, since the CPU
could now perform the load of x before the load of ptr (there might be
a control dependency but no address dependency at the machine level).

Finally, it turns out there is a situation in which a plain write does
not need to be w-post-bounded: when it is separated from the other
race-candidate access by a fence.  At first glance this may seem
impossible.  After all, to be race candidates the two accesses must
be on different CPUs, and fences don't link events on different CPUs.
Well, normal fences don't -- but rcu-fence can!  Here's an example:

	int x, y;

	P0()
	{
		WRITE_ONCE(x, 1);
		synchronize_rcu();
		y = 3;
	}

	P1()
	{
		rcu_read_lock();
		if (READ_ONCE(x) == 0)
			y = 2;
		rcu_read_unlock();
	}

Do the plain stores to y race?  Clearly not if P1 reads a non-zero
value for x, so let's assume the READ_ONCE(x) does obtain 0.  This
means that the read-side critical section in P1 must finish executing
before the grace period in P0 does, because RCU's Grace-Period
Guarantee says that otherwise P0's store to x would have propagated to
P1 before the critical section started and so would have been visible
to the READ_ONCE().  (Another way of putting it is that the fre link
from the READ_ONCE() to the WRITE_ONCE() gives rise to an rcu-link
between those two events.)

This means there is an rcu-fence link from P1's "y = 2" store to P0's
"y = 3" store, and consequently the first must propagate from P1 to P0
before the second can execute.  Therefore the two stores cannot be
concurrent and there is no race, even though P1's plain store to y
isn't w-post-bounded by any marked accesses.

Putting all this material together yields the following picture.  For
race-candidate stores W and W', where W ->co W', the LKMM says the
stores don't race if W can be linked to W' by a

	w-post-bounded ; vis ; w-pre-bounded

sequence.  If W is plain then they also have to be linked by an

	r-post-bounded ; xb* ; w-pre-bounded

sequence, and if W' is plain then they also have to be linked by a

	w-post-bounded ; vis ; r-pre-bounded

sequence.
For race-candidate load R and store W, the LKMM says the
two accesses don't race if R can be linked to W by an

	r-post-bounded ; xb* ; w-pre-bounded

sequence or if W can be linked to R by a

	w-post-bounded ; vis ; r-pre-bounded

sequence.  For the cases involving a vis link, the LKMM also accepts
sequences in which W is linked to W' or R by a

	strong-fence ; xb* ; {w and/or r}-pre-bounded

sequence with no post-bounding, and in every case the LKMM also allows
the link simply to be a fence with no bounding at all.  If no sequence
of the appropriate sort exists, the LKMM says that the accesses race.

There is one more part of the LKMM related to plain accesses (although
not to data races) we should discuss.  Recall that many relations such
as hb are limited to marked accesses only.  As a result, the
happens-before, propagates-before, and rcu axioms (which state that
various relations must not contain cycles) don't apply to plain
accesses.  Nevertheless, we do want to rule out such cycles, because
they don't make sense even for plain accesses.

To this end, the LKMM imposes three extra restrictions, together
called the "plain-coherence" axiom because of their resemblance to the
rules used by the operational model to ensure cache coherence (that
is, the rules governing the memory subsystem's choice of a store to
satisfy a load request and its determination of where a store will
fall in the coherence order):

	If R and W are race candidates and it is possible to link R to
	W by one of the xb* sequences listed above, then W ->rfe R is
	not allowed (i.e., a load cannot read from a store that it
	executes before, even if one or both is plain).

	If W and R are race candidates and it is possible to link W to
	R by one of the vis sequences listed above, then R ->fre W is
	not allowed (i.e., if a store is visible to a load then the
	load must read from that store or one coherence-after it).

	If W and W' are race candidates and it is possible to link W
	to W' by one of the vis sequences listed above, then W' ->co W
	is not allowed (i.e., if one store is visible to a second then
	the second must come after the first in the coherence order).

This is the extent to which the LKMM deals with plain accesses.
Perhaps it could say more (for example, plain accesses might
contribute to the ppo relation), but at the moment it seems that this
minimal, conservative approach is good enough.


ODDS AND ENDS
-------------

This section covers material that didn't quite fit anywhere in the
earlier sections.

The descriptions in this document don't always match the formal
version of the LKMM exactly.  For example, the actual formal
definition of the prop relation makes the initial coe or fre part
optional, and it doesn't require the events linked by the relation to
be on the same CPU.  These differences are very unimportant; indeed,
instances where the coe/fre part of prop is missing are of no interest
because all the other parts (fences and rfe) are already included in
hb anyway, and where the formal model adds prop into hb, it includes
an explicit requirement that the events being linked are on the same
CPU.

Another minor difference has to do with events that are both memory
accesses and fences, such as those corresponding to smp_load_acquire()
calls.  In the formal model, these events aren't actually both reads
and fences; rather, they are read events with an annotation marking
them as acquires.  (Or write events annotated as releases, in the case
of smp_store_release().)  The final effect is the same.

Although we didn't mention it above, the instruction execution
ordering provided by the smp_rmb() fence doesn't apply to read events
that are part of a non-value-returning atomic update.  For instance,
given:

	atomic_inc(&x);
	smp_rmb();
	r1 = READ_ONCE(y);

it is not guaranteed that the load from y will execute after the
update to x.  This is because the ARMv8 architecture allows
non-value-returning atomic operations effectively to be executed off
the CPU.  Basically, the CPU tells the memory subsystem to increment
x, and then the increment is carried out by the memory hardware with
no further involvement from the CPU.  Since the CPU doesn't ever read
the value of x, there is nothing for the smp_rmb() fence to act on.

The LKMM defines a few extra synchronization operations in terms of
things we have already covered.  In particular, rcu_dereference() is
treated as READ_ONCE() and rcu_assign_pointer() is treated as
smp_store_release() -- which is basically how the Linux kernel treats
them.

Although we said that plain accesses are not linked by the ppo
relation, they do contribute to it indirectly.  Firstly, when there is
an address dependency from a marked load R to a plain store W,
followed by smp_wmb() and then a marked store W', the LKMM creates a
ppo link from R to W'.  The reasoning behind this is perhaps a little
shaky, but essentially it says there is no way to generate object code
for this source code in which W' could execute before R.  Just as with
pre-bounding by address dependencies, it is possible for the compiler
to undermine this relation if sufficient care is not taken.

Secondly, plain accesses can carry dependencies: If a data dependency
links a marked load R to a store W, and the store is read by a load R'
from the same thread, then the data loaded by R' depends on the data
loaded originally by R.  Thus, if R' is linked to any access X by a
dependency, R is also linked to access X by the same dependency, even
if W or R' (or both!) are plain.

There are a few oddball fences which need special treatment:
smp_mb__before_atomic(), smp_mb__after_atomic(), and
smp_mb__after_spinlock().  The LKMM uses fence events with special
annotations for them; they act as strong fences just like smp_mb()
except for the sets of events that they order.  Instead of ordering
all po-earlier events against all po-later events, as smp_mb() does,
they behave as follows:

	smp_mb__before_atomic() orders all po-earlier events against
	po-later atomic updates and the events following them;

	smp_mb__after_atomic() orders po-earlier atomic updates and
	the events preceding them against all po-later events;

	smp_mb__after_spinlock() orders po-earlier lock acquisition
	events and the events preceding them against all po-later
	events.

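By contrast with the smp_rmb() example earlier in this section, the
second of these fences does suffice to order a non-value-returning
atomic update against a later load (a sketch):

	atomic_inc(&x);
	smp_mb__after_atomic();
	r1 = READ_ONCE(y);	/* executes after the update of x */

Here the fence orders the po-earlier atomic update against all
po-later events, so the load from y cannot execute before the
increment of x.
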
Interestingly, RCU and locking each introduce the possibility of
deadlock.  When faced with code sequences such as:

	spin_lock(&s);
	spin_lock(&s);
	spin_unlock(&s);
	spin_unlock(&s);

or:

	rcu_read_lock();
	synchronize_rcu();
	rcu_read_unlock();

what does the LKMM have to say?  Answer: It says there are no allowed
executions at all, which makes sense.  But this can also lead to
misleading results, because if a piece of code has multiple possible
executions, some of which deadlock, the model will report only on the
non-deadlocking executions.  For example:

	int x, y;

	P0()
	{
		int r0;

		WRITE_ONCE(x, 1);
		r0 = READ_ONCE(y);
	}

	P1()
	{
		rcu_read_lock();
		if (READ_ONCE(x) > 0) {
			WRITE_ONCE(y, 36);
			synchronize_rcu();
		}
		rcu_read_unlock();
	}

Is it possible to end up with r0 = 36 at the end?  The LKMM will tell
you it is not, but the model won't mention that this is because P1
will self-deadlock in the executions where it stores 36 in y.