Path: blob/master/tools/memory-model/Documentation/simple.txt
This document provides options for those wishing to keep their
memory-ordering lives simple, as is necessary for those whose domain
is complex. After all, there are bugs other than memory-ordering bugs,
and the time spent gaining memory-ordering knowledge is not available
for gaining domain knowledge. Furthermore, the Linux-kernel memory
model (LKMM) is quite complex, with subtle differences in code often
having dramatic effects on correctness.

The options near the beginning of this list are quite simple. The idea
is not that kernel hackers don't already know about them, but rather
that they might need the occasional reminder.

Please note that this is a generic guide, and that specific subsystems
will often have special requirements or idioms. For example, developers
of MMIO-based device drivers will often need to use mb(), rmb(), and
wmb(), and therefore might find smp_mb(), smp_rmb(), and smp_wmb()
to be more natural than smp_load_acquire() and smp_store_release().
On the other hand, those coming in from other environments will likely
be more familiar with these last two.


Single-threaded code
====================

In single-threaded code, there is no reordering, at least assuming
that your toolchain and hardware are working correctly. In addition,
it is generally a mistake to assume your code will only run in a
single-threaded context, as the kernel can enter the same code path on
multiple CPUs at the same time. One important exception is a function
that makes no external data references.

In the general case, you will need to take explicit steps to ensure that
your code really is executed within a single thread that does not access
shared variables. A simple way to achieve this is to define a global lock
that you acquire at the beginning of your code and release at the end,
taking care to ensure that all references to your code's shared data are
also carried out under that same lock.
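As a rough userspace sketch of this pattern (using a pthreads mutex in
place of the kernel's locking primitives; foo_lock and foo_data are
hypothetical names, not kernel symbols):

```c
#include <pthread.h>

/* Hypothetical shared state: every access to foo_data is carried out
 * under foo_lock, including accesses from any other function. */
static pthread_mutex_t foo_lock = PTHREAD_MUTEX_INITIALIZER;
static long foo_data;

/* Code locking: the entire function body runs under one global lock,
 * so it executes as though it were single-threaded. */
long foo_update(long delta)
{
	long cur;

	pthread_mutex_lock(&foo_lock);
	foo_data += delta;
	cur = foo_data;
	pthread_mutex_unlock(&foo_lock);
	return cur;
}
```

Note that the snapshot of foo_data is taken while the lock is still
held; reading it again after the unlock could see another thread's
update.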
Because only one thread can hold this lock at a given time, your code
will be executed single-threaded. This approach is called "code locking".

Code locking can severely limit both performance and scalability, so it
should be used with caution, and only on code paths that execute rarely.
After all, a huge amount of effort was required to remove the Linux
kernel's old "Big Kernel Lock", so let's please be very careful about
adding new "little kernel locks".

One of the advantages of locking is that, in happy contrast with the
year 1981, almost all kernel developers are very familiar with locking.
The Linux kernel's lockdep (CONFIG_PROVE_LOCKING=y) is very helpful with
the formerly feared deadlock scenarios.

Please use the standard locking primitives provided by the kernel rather
than rolling your own. For one thing, the standard primitives interact
properly with lockdep. For another thing, these primitives have been
tuned to deal better with high contention. And for one final thing, it is
surprisingly hard to correctly code production-quality lock acquisition
and release functions. After all, even simple non-production-quality
locking functions must carefully prevent both the CPU and the compiler
from moving code in either direction across the locking function.

Despite the scalability limitations of single-threaded code, RCU
takes this approach for much of its grace-period processing and also
for early-boot operation. The reason RCU is able to scale despite
single-threaded grace-period processing is its use of batching, where
all updates that accumulated during one grace period are handled by the
next one. In other words, slowing down grace-period processing makes
it more efficient.
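As a toy single-threaded illustration of batching (hypothetical names,
not actual RCU code): updates accumulate cheaply while one batch is
being processed, and the next pass applies all of them at once, so a
longer pass simply means a larger and more efficient batch.

```c
/* Toy batching sketch: pending updates accumulate in an array, and
 * each processing pass applies the entire accumulated batch. */
#define MAX_PENDING 1024

static int pending[MAX_PENDING];
static int npending;		/* updates accumulated so far */
static long total;		/* state that the updates apply to */

/* Record one update; cheap, so updaters are never delayed for long. */
void enqueue_update(int delta)
{
	if (npending < MAX_PENDING)
		pending[npending++] = delta;
}

/* One processing pass: apply the whole accumulated batch at once.
 * Returns the number of updates handled by this pass. */
int process_batch(void)
{
	int i, n = npending;

	for (i = 0; i < n; i++)
		total += pending[i];
	npending = 0;
	return n;
}
```

A real implementation would of course need synchronization between
enqueue_update() and process_batch(); this fragment only shows the
batching shape itself.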
Nor is RCU unique: Similar batching optimizations are used in many
I/O operations.


Packaged code
=============

Even if performance and scalability concerns prevent your code from
being completely single-threaded, it is often possible to use library
functions that handle the concurrency nearly or entirely on their own.
This approach delegates any LKMM worries to the library maintainer.

In the kernel, what is the "library"? Quite a bit. It includes the
contents of the lib/ directory, much of the include/linux/ directory,
and a lot of other heavily used APIs. Particularly heavily used examples
include the list macros (for example, include/linux/{,rcu}list.h),
workqueues, smp_call_function(), and the various hash tables and
search trees.


Data locking
============

With code locking, we use single-threaded code execution to guarantee
serialized access to the data that the code is accessing. However,
we can also achieve this by instead associating the lock with specific
instances of the data structures. This creates a "critical section"
in the code execution that will execute as though it is single-threaded.
By placing all the accesses and modifications to a shared data structure
inside a critical section, we ensure that the execution context that
holds the lock has exclusive access to the shared data.

The poster boy for this approach is the hash table, where placing a lock
in each hash bucket allows operations on different buckets to proceed
concurrently.
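A minimal userspace sketch of per-bucket data locking might look as
follows (hypothetical names, with pthreads standing in for the kernel's
spinlock_t):

```c
#include <pthread.h>

#define NBUCKETS 16

/* Each bucket carries its own lock, so operations on different
 * buckets can run concurrently. */
struct bucket {
	pthread_mutex_t lock;
	long nitems;	/* stand-in for the bucket's chain of elements */
};

static struct bucket table[NBUCKETS];

void table_init(void)
{
	int i;

	for (i = 0; i < NBUCKETS; i++) {
		pthread_mutex_init(&table[i].lock, NULL);
		table[i].nitems = 0;
	}
}

/* Only the one bucket that the key hashes to is locked. */
void table_add(unsigned long key)
{
	struct bucket *b = &table[key % NBUCKETS];

	pthread_mutex_lock(&b->lock);
	b->nitems++;	/* a real table would link in an element here */
	pthread_mutex_unlock(&b->lock);
}

long table_bucket_size(unsigned long key)
{
	struct bucket *b = &table[key % NBUCKETS];
	long n;

	pthread_mutex_lock(&b->lock);
	n = b->nitems;
	pthread_mutex_unlock(&b->lock);
	return n;
}
```

Two threads adding keys that hash to different buckets never contend
with each other, which is precisely the point of the design.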
This works because the buckets do not overlap with each other, so that
an operation on one bucket does not interfere with any other bucket.

As the number of buckets increases, data locking scales naturally.
In particular, if the amount of data increases with the number of CPUs,
increasing the number of buckets as the number of CPUs increases results
in a naturally scalable data structure.


Per-CPU processing
==================

Partitioning processing and data over CPUs allows each CPU to take
a single-threaded approach while providing excellent performance and
scalability. Of course, there is no free lunch: The dark side of this
excellence is substantially increased memory footprint.

In addition, it is sometimes necessary to update some global view of
this processing and data, in which case something like locking must
be used to protect this global view. This is the approach taken
by the percpu_counter infrastructure. In many cases, there are already
generic/library variants of commonly used per-CPU constructs available.
Please use them rather than rolling your own.

RCU uses DEFINE_PER_CPU*() declarations to create a number of per-CPU
data sets. For example, each CPU does private quiescent-state processing
within its instance of the per-CPU rcu_data structure, and then uses data
locking to report quiescent states up the grace-period combining tree.


Packaged primitives: Sequence locking
=====================================

Lockless programming is considered by many to be more difficult than
lock-based programming, but there are a few lockless design patterns that
have been built out into an API.
One of these APIs is sequence locking. Although this API can be used in
extremely complex ways, there are simple and effective ways of using it
that avoid the need to pay attention to memory ordering.

The basic keep-things-simple rule for sequence locking is "do not write
in read-side code". Yes, you can do writes from within sequence-locking
readers, but it won't be so simple. For example, such writes will be
lockless and should be idempotent.

For more sophisticated use cases, LKMM can guide you, including use
cases involving combining sequence locking with other synchronization
primitives. (LKMM does not yet know about sequence locking, so it is
currently necessary to open-code it in your litmus tests.)

Additional information may be found in include/linux/seqlock.h.


Packaged primitives: RCU
========================

Another lockless design pattern that has been baked into an API
is RCU. The Linux kernel makes sophisticated use of RCU, but the
keep-things-simple rules for RCU are "do not write in read-side code",
"do not update anything that is visible to and accessed by readers",
and "protect updates with locking".

These rules are illustrated by the functions foo_update_a() and
foo_get_a() shown in Documentation/RCU/whatisRCU.rst. Additional
RCU usage patterns may be found in Documentation/RCU and in the
source code.


Packaged primitives: Atomic operations
======================================

Back in the day, the Linux kernel had three types of atomic operations:

1.	Initialization and read-out, such as atomic_set() and atomic_read().

2.	Operations that did not return a value and provided no ordering,
	such as atomic_inc() and atomic_dec().

3.	Operations that returned a value and provided full ordering, such as
	atomic_add_return() and atomic_dec_and_test().
	Note that some value-returning operations provide full ordering
	only conditionally. For example, cmpxchg() provides ordering
	only upon success.

More recent kernels have operations that return a value but do not
provide full ordering. These are flagged with either a _relaxed()
suffix (providing no ordering), or an _acquire() or _release() suffix
(providing limited ordering).

Additional information may be found in these files:

	Documentation/atomic_t.txt
	Documentation/atomic_bitops.txt
	Documentation/core-api/refcount-vs-atomic.rst

Reading code using these primitives is often also quite helpful.


Lockless, fully ordered
=======================

When using locking, there often comes a time when it is necessary
to access some variable or another without holding the data lock
that serializes access to that variable.

If you want to keep things simple, use the initialization and read-out
operations from the previous section only when there are no racing
accesses. Otherwise, use only fully ordered operations when accessing
or modifying the variable. This approach guarantees that code prior
to a given access to that variable will be seen by all CPUs as having
happened before any code following any later access to that same variable.

Please note that per-CPU functions are not atomic operations and
hence they do not provide any ordering guarantees at all.

If the lockless accesses are frequently executed reads that are used
only for heuristics, or if they are frequently executed writes that
are used only for statistics, please see the next section.


Lockless statistics and heuristics
==================================

Unordered primitives such as atomic_read(), atomic_set(), READ_ONCE(), and
WRITE_ONCE() can safely be used in some cases.
These primitives provide no ordering, but they do prevent the compiler
from carrying out a number of destructive optimizations (for which
please see the next section). One example use for these primitives is
statistics, such as per-CPU counters exemplified by the rt_cache_stat
structure's routing-cache statistics counters. Another example use
case is heuristics, such as the jiffies_till_first_fqs and
jiffies_till_next_fqs kernel parameters controlling how often RCU
scans for idle CPUs.

But be careful. "Unordered" really does mean "unordered". It is all
too easy to assume ordering, and this assumption must be avoided when
using these primitives.


Don't let the compiler trip you up
==================================

It can be quite tempting to use plain C-language accesses for lockless
loads from and stores to shared variables. Although this is both
possible and quite common in the Linux kernel, it does require a
surprising amount of analysis, care, and knowledge about the compiler.
Yes, some decades ago it was not unfair to consider a C compiler to be
an assembler with added syntax and better portability, but the advent of
sophisticated optimizing compilers means that those days are long gone.
Today's optimizing compilers can profoundly rewrite your code during the
translation process, and have long been ready, willing, and able to do so.

Therefore, if you really need to use C-language assignments instead of
READ_ONCE(), WRITE_ONCE(), and so on, you will need to have a very good
understanding of both the C standard and your compiler.
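For illustration, here is a userspace approximation of READ_ONCE() and
WRITE_ONCE() (a sketch relying on the GCC/Clang __typeof__ extension;
the kernel's real implementations are more elaborate). The volatile
cast forces exactly one full-width access, preventing the compiler
from tearing, fusing, or re-reading the variable:

```c
/* Userspace approximation of the kernel's one-time-access macros. */
#define READ_ONCE(x)		(*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

static int need_to_stop;

/* With a plain "while (!need_to_stop)" loop, the compiler is within
 * its rights to hoist the load out of the loop and spin forever.
 * READ_ONCE() forces a fresh load on every call. */
int should_stop(void)
{
	return READ_ONCE(need_to_stop);
}

void request_stop(void)
{
	WRITE_ONCE(need_to_stop, 1);
}
```

Note that these macros prevent destructive compiler optimizations only;
they provide no CPU-level ordering whatsoever.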
Here are some introductory references and some tooling to start you on
this noble quest:

Who's afraid of a big bad optimizing compiler?
	https://lwn.net/Articles/793253/
Calibrating your fear of big bad optimizing compilers
	https://lwn.net/Articles/799218/
Concurrency bugs should fear the big bad data-race detector (part 1)
	https://lwn.net/Articles/816850/
Concurrency bugs should fear the big bad data-race detector (part 2)
	https://lwn.net/Articles/816854/


More complex use cases
======================

If the alternatives above do not do what you need, please look at the
recipes.txt file to peel off the next layer of the memory-ordering
onion.