Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
torvalds
GitHub Repository: torvalds/linux
Path: blob/master/tools/memory-model/Documentation/simple.txt
26285 views
1
This document provides options for those wishing to keep their
2
memory-ordering lives simple, as is necessary for those whose domain
3
is complex. After all, there are bugs other than memory-ordering bugs,
4
and the time spent gaining memory-ordering knowledge is not available
5
for gaining domain knowledge. Furthermore Linux-kernel memory model
6
(LKMM) is quite complex, with subtle differences in code often having
7
dramatic effects on correctness.
8
9
The options near the beginning of this list are quite simple. The idea
10
is not that kernel hackers don't already know about them, but rather
11
that they might need the occasional reminder.
12
13
Please note that this is a generic guide, and that specific subsystems
14
will often have special requirements or idioms. For example, developers
15
of MMIO-based device drivers will often need to use mb(), rmb(), and
16
wmb(), and therefore might find smp_mb(), smp_rmb(), and smp_wmb()
17
to be more natural than smp_load_acquire() and smp_store_release().
18
On the other hand, those coming in from other environments will likely
19
be more familiar with these last two.
20
21
22
Single-threaded code
23
====================
24
25
In single-threaded code, there is no reordering, at least assuming
26
that your toolchain and hardware are working correctly. In addition,
27
it is generally a mistake to assume your code will only run in a single
28
threaded context as the kernel can enter the same code path on multiple
29
CPUs at the same time. One important exception is a function that makes
30
no external data references.
31
32
In the general case, you will need to take explicit steps to ensure that
33
your code really is executed within a single thread that does not access
34
shared variables. A simple way to achieve this is to define a global lock
35
that you acquire at the beginning of your code and release at the end,
36
taking care to ensure that all references to your code's shared data are
37
also carried out under that same lock. Because only one thread can hold
38
this lock at a given time, your code will be executed single-threaded.
39
This approach is called "code locking".
40
41
Code locking can severely limit both performance and scalability, so it
42
should be used with caution, and only on code paths that execute rarely.
43
After all, a huge amount of effort was required to remove the Linux
44
kernel's old "Big Kernel Lock", so let's please be very careful about
45
adding new "little kernel locks".
46
47
One of the advantages of locking is that, in happy contrast with the
48
year 1981, almost all kernel developers are very familiar with locking.
49
The Linux kernel's lockdep (CONFIG_PROVE_LOCKING=y) is very helpful with
50
the formerly feared deadlock scenarios.
51
52
Please use the standard locking primitives provided by the kernel rather
53
than rolling your own. For one thing, the standard primitives interact
54
properly with lockdep. For another thing, these primitives have been
55
tuned to deal better with high contention. And for one final thing, it is
56
surprisingly hard to correctly code production-quality lock acquisition
57
and release functions. After all, even simple non-production-quality
58
locking functions must carefully prevent both the CPU and the compiler
59
from moving code in either direction across the locking function.
60
61
Despite the scalability limitations of single-threaded code, RCU
62
takes this approach for much of its grace-period processing and also
63
for early-boot operation. The reason RCU is able to scale despite
64
single-threaded grace-period processing is use of batching, where all
65
updates that accumulated during one grace period are handled by the
66
next one. In other words, slowing down grace-period processing makes
67
it more efficient. Nor is RCU unique: Similar batching optimizations
68
are used in many I/O operations.
69
70
71
Packaged code
72
=============
73
74
Even if performance and scalability concerns prevent your code from
75
being completely single-threaded, it is often possible to use library
76
functions that handle the concurrency nearly or entirely on their own.
77
This approach delegates any LKMM worries to the library maintainer.
78
79
In the kernel, what is the "library"? Quite a bit. It includes the
80
contents of the lib/ directory, much of the include/linux/ directory along
81
with a lot of other heavily used APIs. But heavily used examples include
82
the list macros (for example, include/linux/{,rcu}list.h), workqueues,
83
smp_call_function(), and the various hash tables and search trees.
84
85
86
Data locking
87
============
88
89
With code locking, we use single-threaded code execution to guarantee
90
serialized access to the data that the code is accessing. However,
91
we can also achieve this by instead associating the lock with specific
92
instances of the data structures. This creates a "critical section"
93
in the code execution that will execute as though it is single threaded.
94
By placing all the accesses and modifications to a shared data structure
95
inside a critical section, we ensure that the execution context that
96
holds the lock has exclusive access to the shared data.
97
98
The poster boy for this approach is the hash table, where placing a lock
99
in each hash bucket allows operations on different buckets to proceed
100
concurrently. This works because the buckets do not overlap with each
101
other, so that an operation on one bucket does not interfere with any
102
other bucket.
103
104
As the number of buckets increases, data locking scales naturally.
105
In particular, if the amount of data increases with the number of CPUs,
106
increasing the number of buckets as the number of CPUs increase results
107
in a naturally scalable data structure.
108
109
110
Per-CPU processing
111
==================
112
113
Partitioning processing and data over CPUs allows each CPU to take
114
a single-threaded approach while providing excellent performance and
115
scalability. Of course, there is no free lunch: The dark side of this
116
excellence is substantially increased memory footprint.
117
118
In addition, it is sometimes necessary to occasionally update some global
119
view of this processing and data, in which case something like locking
120
must be used to protect this global view. This is the approach taken
121
by the percpu_counter infrastructure. In many cases, there are already
122
generic/library variants of commonly used per-cpu constructs available.
123
Please use them rather than rolling your own.
124
125
RCU uses DEFINE_PER_CPU*() declaration to create a number of per-CPU
126
data sets. For example, each CPU does private quiescent-state processing
127
within its instance of the per-CPU rcu_data structure, and then uses data
128
locking to report quiescent states up the grace-period combining tree.
129
130
131
Packaged primitives: Sequence locking
132
=====================================
133
134
Lockless programming is considered by many to be more difficult than
135
lock-based programming, but there are a few lockless design patterns that
136
have been built out into an API. One of these APIs is sequence locking.
137
Although this API can be used in extremely complex ways, there are simple
138
and effective ways of using it that avoid the need to pay attention to
139
memory ordering.
140
141
The basic keep-things-simple rule for sequence locking is "do not write
142
in read-side code". Yes, you can do writes from within sequence-locking
143
readers, but it won't be so simple. For example, such writes will be
144
lockless and should be idempotent.
145
146
For more sophisticated use cases, LKMM can guide you, including use
147
cases involving combining sequence locking with other synchronization
148
primitives. (LKMM does not yet know about sequence locking, so it is
149
currently necessary to open-code it in your litmus tests.)
150
151
Additional information may be found in include/linux/seqlock.h.
152
153
Packaged primitives: RCU
154
========================
155
156
Another lockless design pattern that has been baked into an API
157
is RCU. The Linux kernel makes sophisticated use of RCU, but the
158
keep-things-simple rules for RCU are "do not write in read-side code"
159
and "do not update anything that is visible to and accessed by readers",
160
and "protect updates with locking".
161
162
These rules are illustrated by the functions foo_update_a() and
163
foo_get_a() shown in Documentation/RCU/whatisRCU.rst. Additional
164
RCU usage patterns maybe found in Documentation/RCU and in the
165
source code.
166
167
168
Packaged primitives: Atomic operations
169
======================================
170
171
Back in the day, the Linux kernel had three types of atomic operations:
172
173
1. Initialization and read-out, such as atomic_set() and atomic_read().
174
175
2. Operations that did not return a value and provided no ordering,
176
such as atomic_inc() and atomic_dec().
177
178
3. Operations that returned a value and provided full ordering, such as
179
atomic_add_return() and atomic_dec_and_test(). Note that some
180
value-returning operations provide full ordering only conditionally.
181
For example, cmpxchg() provides ordering only upon success.
182
183
More recent kernels have operations that return a value but do not
184
provide full ordering. These are flagged with either a _relaxed()
185
suffix (providing no ordering), or an _acquire() or _release() suffix
186
(providing limited ordering).
187
188
Additional information may be found in these files:
189
190
Documentation/atomic_t.txt
191
Documentation/atomic_bitops.txt
192
Documentation/core-api/refcount-vs-atomic.rst
193
194
Reading code using these primitives is often also quite helpful.
195
196
197
Lockless, fully ordered
198
=======================
199
200
When using locking, there often comes a time when it is necessary
201
to access some variable or another without holding the data lock
202
that serializes access to that variable.
203
204
If you want to keep things simple, use the initialization and read-out
205
operations from the previous section only when there are no racing
206
accesses. Otherwise, use only fully ordered operations when accessing
207
or modifying the variable. This approach guarantees that code prior
208
to a given access to that variable will be seen by all CPUs as having
209
happened before any code following any later access to that same variable.
210
211
Please note that per-CPU functions are not atomic operations and
212
hence they do not provide any ordering guarantees at all.
213
214
If the lockless accesses are frequently executed reads that are used
215
only for heuristics, or if they are frequently executed writes that
216
are used only for statistics, please see the next section.
217
218
219
Lockless statistics and heuristics
220
==================================
221
222
Unordered primitives such as atomic_read(), atomic_set(), READ_ONCE(), and
223
WRITE_ONCE() can safely be used in some cases. These primitives provide
224
no ordering, but they do prevent the compiler from carrying out a number
225
of destructive optimizations (for which please see the next section).
226
One example use for these primitives is statistics, such as per-CPU
227
counters exemplified by the rt_cache_stat structure's routing-cache
228
statistics counters. Another example use case is heuristics, such as
229
the jiffies_till_first_fqs and jiffies_till_next_fqs kernel parameters
230
controlling how often RCU scans for idle CPUs.
231
232
But be careful. "Unordered" really does mean "unordered". It is all
233
too easy to assume ordering, and this assumption must be avoided when
234
using these primitives.
235
236
237
Don't let the compiler trip you up
238
==================================
239
240
It can be quite tempting to use plain C-language accesses for lockless
241
loads from and stores to shared variables. Although this is both
242
possible and quite common in the Linux kernel, it does require a
243
surprising amount of analysis, care, and knowledge about the compiler.
244
Yes, some decades ago it was not unfair to consider a C compiler to be
245
an assembler with added syntax and better portability, but the advent of
246
sophisticated optimizing compilers mean that those days are long gone.
247
Today's optimizing compilers can profoundly rewrite your code during the
248
translation process, and have long been ready, willing, and able to do so.
249
250
Therefore, if you really need to use C-language assignments instead of
251
READ_ONCE(), WRITE_ONCE(), and so on, you will need to have a very good
252
understanding of both the C standard and your compiler. Here are some
253
introductory references and some tooling to start you on this noble quest:
254
255
Who's afraid of a big bad optimizing compiler?
256
https://lwn.net/Articles/793253/
257
Calibrating your fear of big bad optimizing compilers
258
https://lwn.net/Articles/799218/
259
Concurrency bugs should fear the big bad data-race detector (part 1)
260
https://lwn.net/Articles/816850/
261
Concurrency bugs should fear the big bad data-race detector (part 2)
262
https://lwn.net/Articles/816854/
263
264
265
More complex use cases
266
======================
267
268
If the alternatives above do not do what you need, please look at the
269
recipes.txt file to peel off the next layer of the memory-ordering
270
onion.
271
272