/*
 * Copyright 2011-2015 Samy Al Bahra.
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */

/*
 * The implementation here is inspired by the work described in:
 *   Fraser, K. 2004. Practical Lock-Freedom. PhD Thesis, University
 *   of Cambridge Computing Laboratory.
 */

#include <ck_backoff.h>
#include <ck_cc.h>
#include <ck_epoch.h>
#include <ck_pr.h>
#include <ck_stack.h>
#include <ck_stdbool.h>
#include <ck_string.h>

/*
 * Only three distinct values are used for reclamation, but reclamation occurs
 * at e+2 rather than e+1. Any thread in a "critical section" would have
 * acquired some snapshot (e) of the global epoch value (e_g) and set an active
 * flag. Any hazardous references will only occur after a full memory barrier.
 * For example, assume an initial e_g value of 1, e value of 0 and active value
 * of 0.
 *
 *   ck_epoch_begin(...)
 *     e = e_g
 *     active = 1
 *     memory_barrier();
 *
 * Any serialized reads may observe e = 0 or e = 1 with active = 0, or e = 0 or
 * e = 1 with active = 1. The e_g value can only go from 1 to 2 if every thread
 * has already observed the value of "1" (or the value we are incrementing
 * from). This guarantees us that for any given value e_g, any threads within
 * critical sections (referred to as "active" threads from here on) would have
 * an e value of e_g-1 or e_g. This also means that hazardous references may be
 * shared in both e_g-1 and e_g even if they are logically deleted in e_g.
 *
 * For example, assume all threads have an e value of e_g. Another thread may
 * increment e_g to e_g+1. Older threads may have a reference to an object
 * which is only deleted in e_g+1.
 * It could be that reader threads are
 * executing some hash table look-ups, while some other writer thread (which
 * causes the epoch counter tick) actually deletes the same items that reader
 * threads are looking up (this writer thread having an e value of e_g+1).
 * This is possible if the writer thread re-observes the epoch after the
 * counter tick.
 *
 * Pseudo-code for writer:
 *   ck_epoch_begin()
 *   ht_delete(x)
 *   ck_epoch_end()
 *   ck_epoch_begin()
 *   ht_delete(x)
 *   ck_epoch_end()
 *
 * Pseudo-code for reader:
 *   for (;;) {
 *     x = ht_lookup(x)
 *     ck_pr_inc(&x->value);
 *   }
 *
 * Of course, it is also possible for references logically deleted at e_g-1 to
 * still be accessed at e_g as threads are "active" at the same time
 * (real-world time) mutating shared objects.
 *
 * Now, if the epoch counter is ticked to e_g+1, then no new hazardous
 * references could exist to objects logically deleted at e_g-1. The reason for
 * this is that at e_g+1, all epoch read-side critical sections started at
 * e_g-1 must have been completed. If any epoch read-side critical sections at
 * e_g-1 were still active, then we would never increment to e_g+1 (active != 0
 * ^ e != e_g). Additionally, e_g may still have hazardous references to
 * objects logically deleted at e_g-1, which means objects logically deleted at
 * e_g-1 cannot be deleted at e_g+1 unless all threads have observed e_g+1
 * (since it is valid for active threads to be at e_g and threads at e_g still
 * require safe memory accesses).
 *
 * However, at e_g+2, all active threads must be either at e_g+1 or e_g+2.
 * Though e_g+2 may share hazardous references with e_g+1, and e_g+1 shares
 * hazardous references with e_g, no active threads are at e_g or e_g-1. This
 * means no hazardous references could exist to objects deleted at e_g-1 (at
 * e_g+2).
 *
 * To summarize these important points,
 *   1) Active threads will always have a value of e_g or e_g-1.
 *   2) Items that are logically deleted at e_g or e_g-1 cannot be physically
 *      deleted.
 *   3) Objects logically deleted at e_g-1 can be physically destroyed at e_g+2
 *      or at e_g+1 if no threads are at e_g.
 *
 * Last but not least, if we are at e_g+2, then no active thread is at e_g,
 * which means it is safe to apply modulo-3 arithmetic to the e_g value in
 * order to re-use e_g to represent the e_g+3 state. This means it is
 * sufficient to represent e_g using only the values 0, 1 or 2. Every time a
 * thread re-visits an e_g (which can be determined with a non-empty deferral
 * list) it can assume objects in the e_g deferral list involved at least three
 * e_g transitions and are thus safe for physical deletion.
 *
 * Blocking semantics for epoch reclamation have additional restrictions.
 * Though we only require three deferral lists, reasonable blocking semantics
 * must be able to more gracefully handle bursty write work-loads, which could
 * easily cause e_g wrap-around (and with it easy-to-trigger live-lock
 * situations) if modulo-3 arithmetic is used. The work-around for this is to
 * apply modulo arithmetic not to e_g itself but only to deferral list
 * indexing.
 */
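
/*
 * Illustrative sketch of how the scheme above is typically driven through the
 * public API declared in ck_epoch.h (this is caller-side code, not part of
 * this translation unit). The hash-table calls and the consume() helper are
 * hypothetical; each thread is assumed to own its own ck_epoch_record_t.
 *
 *   Reader, bracketing hazardous accesses with a read-side section:
 *     ck_epoch_section_t section;
 *
 *     ck_epoch_begin(record, &section);
 *     obj = ht_lookup(key);
 *     if (obj != NULL)
 *             consume(obj);
 *     ck_epoch_end(record, &section);
 *
 *   Writer, logically deleting and then waiting for a grace period before
 *   physically destroying:
 *     obj = ht_delete(key);
 *     ck_epoch_synchronize(record);
 *     free(obj);
 */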
#define CK_EPOCH_GRACE 3U

/*
 * CK_EPOCH_LENGTH must be a power-of-2 (because (CK_EPOCH_LENGTH - 1) is used
 * as a mask), and it must be at least 3 (see the comments above).
 */
#if (CK_EPOCH_LENGTH < 3 || (CK_EPOCH_LENGTH & (CK_EPOCH_LENGTH - 1)) != 0)
#error "CK_EPOCH_LENGTH must be a power of 2 and >= 3"
#endif

enum {
	CK_EPOCH_STATE_USED = 0,
	CK_EPOCH_STATE_FREE = 1
};

CK_STACK_CONTAINER(struct ck_epoch_record, record_next,
    ck_epoch_record_container)
CK_STACK_CONTAINER(struct ck_epoch_entry, stack_entry,
    ck_epoch_entry_container)

#define CK_EPOCH_SENSE_MASK	(CK_EPOCH_SENSE - 1)

bool
_ck_epoch_delref(struct ck_epoch_record *record,
    struct ck_epoch_section *section)
{
	struct ck_epoch_ref *current, *other;
	unsigned int i = section->bucket;

	current = &record->local.bucket[i];
	current->count--;

	if (current->count > 0)
		return false;

	/*
	 * If the current bucket no longer has any references, then
	 * determine whether we have already transitioned into a newer
	 * epoch. If so, then make sure to update our shared snapshot
	 * to allow for forward progress.
	 *
	 * If no other active bucket exists, then the record will go
	 * inactive in order to allow for forward progress.
	 */
	other = &record->local.bucket[(i + 1) & CK_EPOCH_SENSE_MASK];
	if (other->count > 0 &&
	    ((int)(current->epoch - other->epoch) < 0)) {
		/*
		 * The other epoch value is actually the newest,
		 * transition to it.
		 */
		ck_pr_store_uint(&record->epoch, other->epoch);
	}

	return true;
}

void
_ck_epoch_addref(struct ck_epoch_record *record,
    struct ck_epoch_section *section)
{
	struct ck_epoch *global = record->global;
	struct ck_epoch_ref *ref;
	unsigned int epoch, i;

	epoch = ck_pr_load_uint(&global->epoch);
	i = epoch & CK_EPOCH_SENSE_MASK;
	ref = &record->local.bucket[i];

	if (ref->count++ == 0) {
#ifndef CK_MD_TSO
		struct ck_epoch_ref *previous;

		/*
		 * The system has already ticked. If another non-zero bucket
		 * exists, make sure to order our observations with respect
		 * to it. Otherwise, it is possible to acquire a reference
		 * from the previous epoch generation.
		 *
		 * On TSO architectures, the monotonicity of the global
		 * counter and load-{store, load} ordering are sufficient to
		 * guarantee this ordering.
		 */
		previous = &record->local.bucket[(i + 1) &
		    CK_EPOCH_SENSE_MASK];
		if (previous->count > 0)
			ck_pr_fence_acqrel();
#endif /* !CK_MD_TSO */

		/*
		 * If this is a new reference into the current
		 * bucket then cache the associated epoch value.
		 */
		ref->epoch = epoch;
	}

	section->bucket = i;
	return;
}
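
/*
 * The two sense buckets manipulated above exist so that read-side sections
 * may nest even across an epoch tick: each nesting level carries its own
 * ck_epoch_section and is accounted against the bucket of the epoch it
 * observed at ck_epoch_begin() time. A rough caller-side sketch (the work
 * performed inside the sections is hypothetical):
 *
 *   ck_epoch_section_t outer, inner;
 *
 *   ck_epoch_begin(record, &outer);   (accounted in the bucket for epoch e)
 *   ...
 *   ck_epoch_begin(record, &inner);   (may land in the bucket for e or e + 1)
 *   ...
 *   ck_epoch_end(record, &inner);
 *   ck_epoch_end(record, &outer);     (record only goes inactive here)
 */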
void
ck_epoch_init(struct ck_epoch *global)
{

	ck_stack_init(&global->records);
	global->epoch = 1;
	global->n_free = 0;
	ck_pr_fence_store();
	return;
}

struct ck_epoch_record *
ck_epoch_recycle(struct ck_epoch *global, void *ct)
{
	struct ck_epoch_record *record;
	ck_stack_entry_t *cursor;
	unsigned int state;

	if (ck_pr_load_uint(&global->n_free) == 0)
		return NULL;

	CK_STACK_FOREACH(&global->records, cursor) {
		record = ck_epoch_record_container(cursor);

		if (ck_pr_load_uint(&record->state) == CK_EPOCH_STATE_FREE) {
			/* Serialize with respect to deferral list clean-up. */
			ck_pr_fence_load();
			state = ck_pr_fas_uint(&record->state,
			    CK_EPOCH_STATE_USED);
			if (state == CK_EPOCH_STATE_FREE) {
				ck_pr_dec_uint(&global->n_free);
				ck_pr_store_ptr(&record->ct, ct);

				/*
				 * The context pointer is ordered by a
				 * subsequent protected section.
				 */
				return record;
			}
		}
	}

	return NULL;
}

void
ck_epoch_register(struct ck_epoch *global, struct ck_epoch_record *record,
    void *ct)
{
	size_t i;

	record->global = global;
	record->state = CK_EPOCH_STATE_USED;
	record->active = 0;
	record->epoch = 0;
	record->n_dispatch = 0;
	record->n_peak = 0;
	record->n_pending = 0;
	record->ct = ct;
	memset(&record->local, 0, sizeof record->local);

	for (i = 0; i < CK_EPOCH_LENGTH; i++)
		ck_stack_init(&record->pending[i]);

	ck_pr_fence_store();
	ck_stack_push_upmc(&global->records, &record->record_next);
	return;
}

void
ck_epoch_unregister(struct ck_epoch_record *record)
{
	struct ck_epoch *global = record->global;
	size_t i;

	record->active = 0;
	record->epoch = 0;
	record->n_dispatch = 0;
	record->n_peak = 0;
	record->n_pending = 0;
	memset(&record->local, 0, sizeof record->local);

	for (i = 0; i < CK_EPOCH_LENGTH; i++)
		ck_stack_init(&record->pending[i]);

	ck_pr_store_ptr(&record->ct, NULL);
	ck_pr_fence_store();
	ck_pr_store_uint(&record->state, CK_EPOCH_STATE_FREE);
	ck_pr_inc_uint(&global->n_free);
	return;
}
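
/*
 * Caller-side sketch of the record life-cycle implemented by the functions
 * above: a thread first attempts to recycle a previously unregistered record
 * and only allocates a fresh one if none is free. The allocator used here is
 * illustrative.
 *
 *   record = ck_epoch_recycle(&epoch, NULL);
 *   if (record == NULL) {
 *           record = malloc(sizeof *record);
 *           if (record == NULL)
 *                   abort();
 *
 *           ck_epoch_register(&epoch, record, NULL);
 *   }
 *   ...
 *   ck_epoch_unregister(record);   (record becomes available for recycling)
 */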
static struct ck_epoch_record *
ck_epoch_scan(struct ck_epoch *global,
    struct ck_epoch_record *cr,
    unsigned int epoch,
    bool *af)
{
	ck_stack_entry_t *cursor;

	if (cr == NULL) {
		cursor = CK_STACK_FIRST(&global->records);
		*af = false;
	} else {
		cursor = &cr->record_next;
		*af = true;
	}

	while (cursor != NULL) {
		unsigned int state, active;

		cr = ck_epoch_record_container(cursor);

		state = ck_pr_load_uint(&cr->state);
		if (state & CK_EPOCH_STATE_FREE) {
			cursor = CK_STACK_NEXT(cursor);
			continue;
		}

		active = ck_pr_load_uint(&cr->active);
		*af |= active;

		if (active != 0 && ck_pr_load_uint(&cr->epoch) != epoch)
			return cr;

		cursor = CK_STACK_NEXT(cursor);
	}

	return NULL;
}

static unsigned int
ck_epoch_dispatch(struct ck_epoch_record *record, unsigned int e,
    ck_stack_t *deferred)
{
	unsigned int epoch = e & (CK_EPOCH_LENGTH - 1);
	ck_stack_entry_t *head, *next, *cursor;
	unsigned int n_pending, n_peak;
	unsigned int i = 0;

	head = ck_stack_batch_pop_upmc(&record->pending[epoch]);
	for (cursor = head; cursor != NULL; cursor = next) {
		struct ck_epoch_entry *entry =
		    ck_epoch_entry_container(cursor);

		next = CK_STACK_NEXT(cursor);
		if (deferred != NULL)
			ck_stack_push_spnc(deferred, &entry->stack_entry);
		else
			entry->function(entry);

		i++;
	}

	n_peak = ck_pr_load_uint(&record->n_peak);
	n_pending = ck_pr_load_uint(&record->n_pending);

	/* We don't require accuracy around peak calculation. */
	if (n_pending > n_peak)
		ck_pr_store_uint(&record->n_peak, n_pending);

	if (i > 0) {
		ck_pr_add_uint(&record->n_dispatch, i);
		ck_pr_sub_uint(&record->n_pending, i);
	}

	return i;
}

/*
 * Reclaim all objects associated with a record.
 */
void
ck_epoch_reclaim(struct ck_epoch_record *record)
{
	unsigned int epoch;

	for (epoch = 0; epoch < CK_EPOCH_LENGTH; epoch++)
		ck_epoch_dispatch(record, epoch, NULL);

	return;
}

CK_CC_FORCE_INLINE static void
epoch_block(struct ck_epoch *global, struct ck_epoch_record *cr,
    ck_epoch_wait_cb_t *cb, void *ct)
{

	if (cb != NULL)
		cb(global, cr, ct);

	return;
}
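
/*
 * The wait callback passed to ck_epoch_synchronize_wait() below is invoked
 * (via epoch_block), possibly repeatedly, for each record observed to still
 * be active in an older epoch, allowing the caller to back off instead of
 * spinning. A hypothetical callback that simply yields the processor might
 * look like the following (caller-side sketch, requires <sched.h>):
 *
 *   static void
 *   yield_cb(ck_epoch_t *global, ck_epoch_record_t *cr, void *ct)
 *   {
 *
 *           (void)global;
 *           (void)cr;
 *           (void)ct;
 *           sched_yield();
 *           return;
 *   }
 *
 *   ck_epoch_synchronize_wait(&epoch, yield_cb, NULL);
 */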
/*
 * This function must not be called within a read section.
 */
void
ck_epoch_synchronize_wait(struct ck_epoch *global,
    ck_epoch_wait_cb_t *cb, void *ct)
{
	struct ck_epoch_record *cr;
	unsigned int delta, epoch, goal, i;
	bool active;

	ck_pr_fence_memory();

	/*
	 * The observation of the global epoch must be ordered with respect to
	 * all prior operations. The re-ordering of loads is permitted given
	 * the monotonicity of the global epoch counter.
	 *
	 * If UINT_MAX concurrent mutations were to occur then it is possible
	 * to encounter an ABA-issue. If this is a concern, consider tuning
	 * write-side concurrency.
	 */
	delta = epoch = ck_pr_load_uint(&global->epoch);
	goal = epoch + CK_EPOCH_GRACE;

	for (i = 0, cr = NULL; i < CK_EPOCH_GRACE - 1; cr = NULL, i++) {
		bool r;

		/*
		 * Determine whether all threads have observed the current
		 * epoch with respect to the updates on invocation.
		 */
		while (cr = ck_epoch_scan(global, cr, delta, &active),
		    cr != NULL) {
			unsigned int e_d;

			ck_pr_stall();

			/*
			 * Another writer may have already observed a grace
			 * period.
			 */
			e_d = ck_pr_load_uint(&global->epoch);
			if (e_d == delta) {
				epoch_block(global, cr, cb, ct);
				continue;
			}

			/*
			 * If the epoch has been updated, we may have already
			 * met our goal.
			 */
			delta = e_d;
			if ((goal > epoch) & (delta >= goal))
				goto leave;

			epoch_block(global, cr, cb, ct);

			/*
			 * If the epoch has been updated, then a grace period
			 * requires that all threads are observed idle at the
			 * same epoch.
			 */
			cr = NULL;
		}

		/*
		 * If we have observed all threads as inactive, then we assume
		 * we are at a grace period.
		 */
		if (active == false)
			break;

		/*
		 * Increment current epoch. CAS semantics are used to eliminate
		 * increment operations for synchronization that occurs for the
		 * same global epoch value snapshot.
		 *
		 * If we can guarantee there will only be one active barrier or
		 * epoch tick at a given time, then it is sufficient to use an
		 * increment operation. In a multi-barrier workload, however,
		 * it is possible to overflow the epoch value if we apply
		 * modulo-3 arithmetic.
		 */
		r = ck_pr_cas_uint_value(&global->epoch, delta, delta + 1,
		    &delta);

		/* Order subsequent thread active checks. */
		ck_pr_fence_atomic_load();

		/*
		 * If CAS has succeeded, then set delta to latest snapshot.
		 * Otherwise, we have just acquired latest snapshot.
		 */
		delta = delta + r;
	}

	/*
	 * A majority of use-cases will not require full barrier semantics.
	 * However, if non-temporal instructions are used, full barrier
	 * semantics are necessary.
	 */
leave:
	ck_pr_fence_memory();
	return;
}

void
ck_epoch_synchronize(struct ck_epoch_record *record)
{

	ck_epoch_synchronize_wait(record->global, NULL, NULL);
	return;
}

void
ck_epoch_barrier(struct ck_epoch_record *record)
{

	ck_epoch_synchronize(record);
	ck_epoch_reclaim(record);
	return;
}

void
ck_epoch_barrier_wait(struct ck_epoch_record *record, ck_epoch_wait_cb_t *cb,
    void *ct)
{

	ck_epoch_synchronize_wait(record->global, cb, ct);
	ck_epoch_reclaim(record);
	return;
}

/*
 * It may be worth it to actually apply these deferral semantics to an epoch
 * that was observed at ck_epoch_call time. The problem is that the latter
 * would require a full fence.
 *
 * ck_epoch_call will dispatch to the latest epoch snapshot that was observed.
 * There are cases where it will fail to reclaim as early as it could. If this
 * becomes a problem, we could actually use a heap for epoch buckets, but that
 * is far from ideal too.
 */
bool
ck_epoch_poll_deferred(struct ck_epoch_record *record, ck_stack_t *deferred)
{
	bool active;
	unsigned int epoch;
	struct ck_epoch_record *cr = NULL;
	struct ck_epoch *global = record->global;
	unsigned int n_dispatch;

	epoch = ck_pr_load_uint(&global->epoch);

	/* Serialize epoch snapshots with respect to global epoch. */
	ck_pr_fence_memory();

	/*
	 * At this point, epoch is the current global epoch value.
	 * There may or may not be active threads which observed epoch - 1
	 * (ck_epoch_scan() will tell us that). However, there should be
	 * no active threads which observed epoch - 2.
	 *
	 * Note that checking epoch - 2 is necessary, as race conditions can
	 * allow another thread to increment the global epoch before this
	 * thread runs.
	 */
	n_dispatch = ck_epoch_dispatch(record, epoch - 2, deferred);

	cr = ck_epoch_scan(global, cr, epoch, &active);
	if (cr != NULL)
		return (n_dispatch > 0);

	/* We are at a grace period if all threads are inactive. */
	if (active == false) {
		record->epoch = epoch;
		for (epoch = 0; epoch < CK_EPOCH_LENGTH; epoch++)
			ck_epoch_dispatch(record, epoch, deferred);

		return true;
	}

	/*
	 * If an active thread exists, rely on epoch observation.
	 *
	 * All the active threads entered the epoch section during
	 * the current epoch. Therefore, we can now run the handlers
	 * for the immediately preceding epoch and attempt to
	 * advance the epoch if it hasn't been already.
	 */
	(void)ck_pr_cas_uint(&global->epoch, epoch, epoch + 1);

	ck_epoch_dispatch(record, epoch - 1, deferred);
	return true;
}

bool
ck_epoch_poll(struct ck_epoch_record *record)
{

	return ck_epoch_poll_deferred(record, NULL);
}
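
/*
 * Deferred-reclamation sketch: rather than running callbacks inline,
 * ck_epoch_poll_deferred() pushes ready entries onto a caller-supplied stack
 * so they can be executed outside of the poll itself. The container macro
 * mirrors the one defined near the top of this file and would be declared by
 * the caller; the surrounding drain loop is illustrative.
 *
 *   CK_STACK_CONTAINER(struct ck_epoch_entry, stack_entry, entry_container)
 *
 *   ck_stack_t ready;
 *   ck_stack_entry_t *cursor, *next;
 *
 *   ck_stack_init(&ready);
 *   if (ck_epoch_poll_deferred(record, &ready) == true) {
 *           for (cursor = CK_STACK_FIRST(&ready); cursor != NULL;
 *               cursor = next) {
 *                   struct ck_epoch_entry *entry = entry_container(cursor);
 *
 *                   next = CK_STACK_NEXT(cursor);
 *                   entry->function(entry);
 *           }
 *   }
 */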