RCU and Unloadable Modules

[Originally published in LWN Jan. 14, 2007: http://lwn.net/Articles/217484/]

RCU (read-copy update) is a synchronization mechanism that can be thought
of as a replacement for reader-writer locking (among other things), but with
very low-overhead readers that are immune to deadlock, priority inversion,
and unbounded latency. RCU read-side critical sections are delimited
by rcu_read_lock() and rcu_read_unlock(), which, in non-CONFIG_PREEMPT
kernels, generate no code whatsoever.

This means that RCU writers are unaware of the presence of concurrent
readers, so that RCU updates to shared data must be undertaken quite
carefully, leaving an old version of the data structure in place until all
pre-existing readers have finished. These old versions are needed because
such readers might hold a reference to them. RCU updates can therefore be
rather expensive, and RCU is thus best suited for read-mostly situations.

How can an RCU writer possibly determine when all readers are finished,
given that readers might well leave absolutely no trace of their
presence? There is a synchronize_rcu() primitive that blocks until all
pre-existing readers have completed. An updater wishing to delete an
element p from a linked list might do the following, while holding an
appropriate lock, of course:

        list_del_rcu(p);
        synchronize_rcu();
        kfree(p);

But the above code cannot be used in IRQ context -- the call_rcu()
primitive must be used instead. This primitive takes a pointer to an
rcu_head struct placed within the RCU-protected data structure and
another pointer to a function that may be invoked later to free that
structure. Code to delete an element p from the linked list from IRQ
context might then be as follows:

        list_del_rcu(p);
        call_rcu(&p->rcu, p_callback);

Since call_rcu() never blocks, this code can safely be used from within
IRQ context. The function p_callback() might be defined as follows:

        static void p_callback(struct rcu_head *rp)
        {
                struct pstruct *p = container_of(rp, struct pstruct, rcu);

                kfree(p);
        }
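
The pstruct structure itself is not shown above. For concreteness, here
is a minimal sketch of what such a structure might look like; the field
names are illustrative assumptions, not taken from any real kernel code.
The key point is that the rcu_head is embedded in the structure, so that
p_callback() can recover the enclosing pstruct with container_of():

        /* Hypothetical RCU-protected list element. */
        struct pstruct {
                struct list_head list;  /* Linkage into the RCU-protected list. */
                int data;               /* Per-element payload (illustrative). */
                struct rcu_head rcu;    /* Passed to call_rcu(); p_callback()
                                         * maps it back to the pstruct. */
        };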


Unloading Modules That Use call_rcu()

But what if p_callback is defined in an unloadable module?

If we unload the module while some RCU callbacks are pending,
the CPUs executing these callbacks are going to be severely
disappointed when they are later invoked, as fancifully depicted at
http://lwn.net/images/ns/kernel/rcu-drop.jpg.

We could try placing a synchronize_rcu() in the module-exit code path,
but this is not sufficient. Although synchronize_rcu() does wait for a
grace period to elapse, it does not wait for the callbacks to complete.

One might be tempted to try several back-to-back synchronize_rcu()
calls, but this is still not guaranteed to work. If there is a very
heavy RCU-callback load, then some of the callbacks might be deferred
in order to allow other processing to proceed. Such deferral is required
in realtime kernels in order to avoid excessive scheduling latencies.


rcu_barrier()

We instead need the rcu_barrier() primitive. This primitive is similar
to synchronize_rcu(), but instead of waiting solely for a grace
period to elapse, it also waits for all outstanding RCU callbacks to
complete. Pseudo-code using rcu_barrier() is as follows:

1. Prevent any new RCU callbacks from being posted.
2. Execute rcu_barrier().
3. Allow the module to be unloaded.

Quick Quiz #1: Why is there no srcu_barrier()?

The rcutorture module makes use of rcu_barrier() in its exit function
as follows:

 1 static void
 2 rcu_torture_cleanup(void)
 3 {
 4   int i;
 5
 6   fullstop = 1;
 7   if (shuffler_task != NULL) {
 8     VERBOSE_PRINTK_STRING("Stopping rcu_torture_shuffle task");
 9     kthread_stop(shuffler_task);
10   }
11   shuffler_task = NULL;
12
13   if (writer_task != NULL) {
14     VERBOSE_PRINTK_STRING("Stopping rcu_torture_writer task");
15     kthread_stop(writer_task);
16   }
17   writer_task = NULL;
18
19   if (reader_tasks != NULL) {
20     for (i = 0; i < nrealreaders; i++) {
21       if (reader_tasks[i] != NULL) {
22         VERBOSE_PRINTK_STRING(
23           "Stopping rcu_torture_reader task");
24         kthread_stop(reader_tasks[i]);
25       }
26       reader_tasks[i] = NULL;
27     }
28     kfree(reader_tasks);
29     reader_tasks = NULL;
30   }
31   rcu_torture_current = NULL;
32
33   if (fakewriter_tasks != NULL) {
34     for (i = 0; i < nfakewriters; i++) {
35       if (fakewriter_tasks[i] != NULL) {
36         VERBOSE_PRINTK_STRING(
37           "Stopping rcu_torture_fakewriter task");
38         kthread_stop(fakewriter_tasks[i]);
39       }
40       fakewriter_tasks[i] = NULL;
41     }
42     kfree(fakewriter_tasks);
43     fakewriter_tasks = NULL;
44   }
45
46   if (stats_task != NULL) {
47     VERBOSE_PRINTK_STRING("Stopping rcu_torture_stats task");
48     kthread_stop(stats_task);
49   }
50   stats_task = NULL;
51
52   /* Wait for all RCU callbacks to fire. */
53   rcu_barrier();
54
55   rcu_torture_stats_print(); /* -After- the stats thread is stopped! */
56
57   if (cur_ops->cleanup != NULL)
58     cur_ops->cleanup();
59   if (atomic_read(&n_rcu_torture_error))
60     rcu_torture_print_module_parms("End of test: FAILURE");
61   else
62     rcu_torture_print_module_parms("End of test: SUCCESS");
63 }

Line 6 sets a global variable that prevents any RCU callbacks from
re-posting themselves. This will not be necessary in most cases, since
RCU callbacks rarely include calls to call_rcu(). However, the rcutorture
module is an exception to this rule, and therefore needs to set this
global variable.

Lines 7-50 stop all the kernel tasks associated with the rcutorture
module. Therefore, once execution reaches line 53, no more rcutorture
RCU callbacks will be posted. The rcu_barrier() call on line 53 waits
for any pre-existing callbacks to complete.

Then lines 55-62 print status and do operation-specific cleanup, and
then return, permitting the module-unload operation to be completed.

Quick Quiz #2: Is there any other situation where rcu_barrier() might
        be required?

Your module might have additional complications. For example, if your
module invokes call_rcu() from timers, you will need to first cancel all
the timers, and only then invoke rcu_barrier() to wait for any remaining
RCU callbacks to complete.
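
A minimal sketch of this cancel-then-barrier ordering might look as
follows. The names my_timer, my_timer_handler, and my_exit are
hypothetical, invented for illustration:

        /* Hypothetical module whose timer handler posts RCU callbacks. */
        static struct timer_list my_timer;

        static void my_timer_handler(unsigned long data)
        {
                /* ... posts callbacks via call_rcu() ... */
        }

        static void __exit my_exit(void)
        {
                del_timer_sync(&my_timer); /* No new call_rcu() from the timer. */
                rcu_barrier();             /* Wait for already-posted callbacks. */
        }

Note that the order matters: invoking rcu_barrier() first would leave a
window in which the still-live timer could post new callbacks.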
Of course, if your module uses call_rcu_bh(), you will need to invoke
rcu_barrier_bh() before unloading. Similarly, if your module uses
call_rcu_sched(), you will need to invoke rcu_barrier_sched() before
unloading. If your module uses call_rcu(), call_rcu_bh(), -and-
call_rcu_sched(), then you will need to invoke each of rcu_barrier(),
rcu_barrier_bh(), and rcu_barrier_sched().
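
For example, the exit function of a hypothetical module that used all
three flavors might end as follows (a sketch only; each barrier waits
solely for callbacks of its own flavor):

        static void __exit my_exit(void)
        {
                /* ... first stop everything that posts callbacks ... */
                rcu_barrier();       /* Wait for call_rcu() callbacks. */
                rcu_barrier_bh();    /* Wait for call_rcu_bh() callbacks. */
                rcu_barrier_sched(); /* Wait for call_rcu_sched() callbacks. */
        }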


Implementing rcu_barrier()

Dipankar Sarma's implementation of rcu_barrier() makes use of the fact
that RCU callbacks are never reordered once queued on one of the per-CPU
queues. His implementation queues an RCU callback on each of the per-CPU
callback queues, and then waits until they have all started executing, at
which point all earlier RCU callbacks are guaranteed to have completed.

The original code for rcu_barrier() was as follows:

 1 void rcu_barrier(void)
 2 {
 3   BUG_ON(in_interrupt());
 4   /* Take cpucontrol mutex to protect against CPU hotplug */
 5   mutex_lock(&rcu_barrier_mutex);
 6   init_completion(&rcu_barrier_completion);
 7   atomic_set(&rcu_barrier_cpu_count, 0);
 8   on_each_cpu(rcu_barrier_func, NULL, 0, 1);
 9   wait_for_completion(&rcu_barrier_completion);
10   mutex_unlock(&rcu_barrier_mutex);
11 }

Line 3 verifies that the caller is in process context, and lines 5 and 10
use rcu_barrier_mutex to ensure that only one rcu_barrier() is using the
global completion and counters at a time, which are initialized on lines
6 and 7. Line 8 causes each CPU to invoke rcu_barrier_func(), which is
shown below. Note that the final "1" in on_each_cpu()'s argument list
ensures that all the calls to rcu_barrier_func() will have completed
before on_each_cpu() returns. Line 9 then waits for the completion.

This code was rewritten in 2008 to support rcu_barrier_bh() and
rcu_barrier_sched() in addition to the original rcu_barrier().

rcu_barrier_func() runs on each CPU, where it invokes call_rcu()
to post an RCU callback, as follows:

 1 static void rcu_barrier_func(void *notused)
 2 {
 3   int cpu = smp_processor_id();
 4   struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
 5   struct rcu_head *head;
 6
 7   head = &rdp->barrier;
 8   atomic_inc(&rcu_barrier_cpu_count);
 9   call_rcu(head, rcu_barrier_callback);
10 }

Lines 3 and 4 locate RCU's internal per-CPU rcu_data structure,
which contains the struct rcu_head that is needed for the later call to
call_rcu(). Line 7 picks up a pointer to this struct rcu_head, and line
8 increments a global counter. This counter will later be decremented
by the callback. Line 9 then registers the rcu_barrier_callback() on
the current CPU's queue.

The rcu_barrier_callback() function simply atomically decrements the
rcu_barrier_cpu_count variable and finalizes the completion when it
reaches zero, as follows:

 1 static void rcu_barrier_callback(struct rcu_head *notused)
 2 {
 3   if (atomic_dec_and_test(&rcu_barrier_cpu_count))
 4     complete(&rcu_barrier_completion);
 5 }

Quick Quiz #3: What happens if CPU 0's rcu_barrier_func() executes
        immediately (thus incrementing rcu_barrier_cpu_count to the
        value one), but the other CPUs' rcu_barrier_func() invocations
        are delayed for a full grace period? Couldn't this result in
        rcu_barrier() returning prematurely?


rcu_barrier() Summary

The rcu_barrier() primitive has seen relatively little use, since most
code using RCU is in the core kernel rather than in modules. However, if
you are using RCU from an unloadable module, you need to use rcu_barrier()
so that your module may be safely unloaded.


Answers to Quick Quizzes

Quick Quiz #1: Why is there no srcu_barrier()?

Answer: Since there is no call_srcu(), there can be no outstanding SRCU
        callbacks. Therefore, there is no need to wait for them.

Quick Quiz #2: Is there any other situation where rcu_barrier() might
        be required?

Answer: Interestingly enough, rcu_barrier() was not originally
        implemented for module unloading. Nikita Danilov was using
        RCU in a filesystem, which resulted in a similar situation at
        filesystem-unmount time. Dipankar Sarma coded up rcu_barrier()
        in response, so that Nikita could invoke it during the
        filesystem-unmount process.

        Much later, yours truly hit the RCU module-unload problem when
        implementing rcutorture, and found that rcu_barrier() solves
        this problem as well.

Quick Quiz #3: What happens if CPU 0's rcu_barrier_func() executes
        immediately (thus incrementing rcu_barrier_cpu_count to the
        value one), but the other CPUs' rcu_barrier_func() invocations
        are delayed for a full grace period? Couldn't this result in
        rcu_barrier() returning prematurely?

Answer: This cannot happen. The reason is that on_each_cpu() has its last
        argument, the wait flag, set to "1". This flag is passed through
        to smp_call_function() and further to smp_call_function_on_cpu(),
        causing the latter to spin until the cross-CPU invocation of
        rcu_barrier_func() has completed. This by itself would prevent
        a grace period from completing on non-CONFIG_PREEMPT kernels,
        since each CPU must undergo a context switch (or other quiescent
        state) before the grace period can complete. However, this is
        of no use in CONFIG_PREEMPT kernels.

        Therefore, on_each_cpu() disables preemption across its call
        to smp_call_function() and also across the local call to
        rcu_barrier_func(). This prevents the local CPU from context
        switching, again preventing grace periods from completing. This
        means that all CPUs have executed rcu_barrier_func() before
        the first rcu_barrier_callback() can possibly execute, in turn
        preventing rcu_barrier_cpu_count from prematurely reaching zero.

        Currently, -rt implementations of RCU keep but a single global
        queue for RCU callbacks, and thus do not suffer from this
        problem. However, when the -rt RCU eventually does have per-CPU
        callback queues, things will have to change. One simple change
        is to add an rcu_read_lock() before line 8 of rcu_barrier()
        and an rcu_read_unlock() after line 8 of this same function. If
        you can think of a better change, please let me know!
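
        For concreteness, here is a sketch of what that simple change
        might look like, based on the original rcu_barrier() listing
        shown earlier. This is a hypothetical modification for
        illustration, not code that was actually merged:

                void rcu_barrier(void)
                {
                        BUG_ON(in_interrupt());
                        /* Take cpucontrol mutex to protect against CPU hotplug */
                        mutex_lock(&rcu_barrier_mutex);
                        init_completion(&rcu_barrier_completion);
                        atomic_set(&rcu_barrier_cpu_count, 0);
                        /* The read-side critical section prevents a grace
                         * period from completing until every CPU has posted
                         * its barrier callback, so rcu_barrier_cpu_count
                         * cannot prematurely reach zero even with per-CPU
                         * callback queues. */
                        rcu_read_lock();
                        on_each_cpu(rcu_barrier_func, NULL, 0, 1);
                        rcu_read_unlock();
                        wait_for_completion(&rcu_barrier_completion);
                        mutex_unlock(&rcu_barrier_mutex);
                }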