Thursday, September 22, 2016

URCU ReadersVersion in C++

On the previous post we showed a new Userspace RCU (URCU) algorithm that can be implemented with just a atomics and the Java/C11/C++1x memory model.
Here is the implementation in C++:

Not only is this code simple enough to just copy/paste into your codebase, but it has a pretty good throughput and scales well.
This seems to be the best basic implementation of an URCU (with just atomics) in all possible scenarios. There may be other implementations that surpass it, but they would definitely have to use some non-portable trickery, which URCUs usually do, stuff like using POSIX signals, or some CPU special instructions, or some Kernel API.
Anyways, if you're interested in portable low-overhead memory reclamation for C/C++ that is wait-free population oblivious for readers and blocking for updaters, yet capable of aggregating grace periods, then this should do the trick for you!

We did a simple benchmark against two of the URCU implementations in the userspace urcu lib and we'll make the benchmark available in a couple weeks, but in the meantime here are some plots below.
We compared against the Bullet-Proof URCU and the default URCU. More details in this readme

The first plot shows just readers, i.e. calls to rcu_read_lock()/rcu_read_unlock(), where the read-side critical section is just an atomic load on a global variable. This means the read-side critical section is as short as possible, and that if there is any contention is caused by the overhead in rcu_read_lock()/rcu_read_unlock(). As you can see from the plots, all three scale well, but the ReadersVersion is the one that scales better.

The second plot shows just calls to synchronize_rcu() and none of the methods scale (more on this in another post) but at least they don't dropoff as the number of threads increases, and ReadersVersion is about 10x faster than the other two URCUs.

The third plot shows again calls to synchronize_rcu() but this time there are two other thread that are continuously doing rcu_read_lock/rcu_read_unlock. The read-side critical section is very long, it consists of reading an array with 100000 entries, which means the time is dominated by the waiting for the readers and not by the synchronization mechanism inside synchronize_rcu().
Again here, ReadersVersion scales well because it aggregates grace periods, and so does the default urcu but doesn't have as good scalability as that. Bullet-Proof has no mechanism for sharing grace periods, and therefore, has no scalability on this kind of scenario.

Wednesday, September 14, 2016

A simple Userspace RCU in Java

In this post we're going to talk about a simple RCU implementatio, and provide an example in Java, but it's easy to port it to C++.

So, what is RCU again?
Yeah, well, RCU is many things and it has been talked about it detail by one of its original authors here.

... no, I mean, like, what is RCU, like _really_?
In its most basic form, RCU is a concurrency construct to provide on-demand quiescence.

By concurrency construct I mean it's a building block to use in a concurrency setting, the same way that a mutual exclusion lock is a concurrency construct, or a semaphore, or a reader-writer lock, etc. And by on-demand quiescence I mean that you can introduce quiescence anywhere in your application, as needed (more on that later).

RCU's basic API is:
  • rcu_read_lock()
  • rcu_read_unlock()
  • synchronize_rcu()

with the first two being called in a block, i.e. for every rcu_read_lock() there must be a later rcu_read_unlock(). Also, we can not call synchronize_rcu() from within a block of code enclosed by rcu_read_lock() / rcu_read_unlock().

Now, unlike the examples I gave of locks and semaphores, RCU does not provide any kind of mutual exclusivity property, at least not directly, but it does provides two very interesting guarantees:
First, the effects of any operation that happens before synchronize_rcu() in one thread (T1), will for sure be visible in another thread (T2) after rcu_read_lock(), if the call to synchronize_rcu() in T1 returned before the call to rcu_read_lock() in T2;
Second, the effects of any operation that happens after synchronize_rcu() in one thread (T1), will for sure not be visible in another thread (T2) before rcu_read_unlock(), if the call to rcu_read_lock() in T2 started before the call to synchronize_rcu() in T1.

... yes, it's confusing when you read it for the first time, but the idea is pretty simple. Here is a schematic that may help to understand:

T1 did some stuff (stuff before) then it called synchronize_rcu() and then it did some other stuff (stuff after). Time flows from top to bottom, i.e. execution order.
The other threads, T2, T3, T4 and T5 are doing rcu_read_lock()/rcu_read_unlock() and we show which events are seen in each situation.

The trickier one here is T5 for which rcu_read_lock() started during the call of synchronize_rcu() (if there was an absolute clock), but because synchronize_rcu() terminated before the call for rcu_read_unlock(), it means that T1 considered that rcu_read_lock() occurred after and so it did have to wait for the call to rcu_read_unlock(), and therefore, T5 will see the stuff that happened before on T1 and may or may not see that stuff that happened after.

Still confused?
That's ok, let's try another way then: The semantics of rcu_read_lock()/rcu_read_unlock() work just like the semantics of a read-lock in a reader-writer lock (only the semantics are similar, the effects are very different). And the semantics for synchronize_rcu(), well, there is no good analogy but you can think of it as having a lock()/unlock() of a writer lock... it's the closest thing I can come up with.

What does an RCU implementation looks like in Java?

static class RCU {

    final static long NOT_READING = Long.MAX_VALUE;

    final static int MAX_THREADS = 128;

    final AtomicLong reclaimerVersion = new AtomicLong(0);

    final AtomicLongArray readersVersion = new AtomicLongArray(MAX_THREADS);


    public RCU() {

        for (int i=0; i < MAX_THREADS; i++) readersVersion.set(i, NOT_READING);



    public static int getTID() {

        return (int)(Thread.currentThread().getId() % MAX_THREADS);



    public void read_lock(final int tid) {  // rcu_read_lock()

        final long rv = reclaimerVersion.get();

        readersVersion.set(tid, rv);

        final long nrv = reclaimerVersion.get();

        if (rv != nrv) readersVersion.lazySet(tid, nrv);



    public void read_unlock(final int tid) { // rcu_read_unlock()

        readersVersion.set(tid, NOT_READING);



    public void synchronize_rcu() {

        final long waitForVersion = reclaimerVersion.incrementAndGet();

        for (int i=0; i < MAX_THREADS; i++) {

            while (readersVersion.get(i) < waitForVersion) { } // spin




In this algorithm above, each thread calling synchronize_rcu() does a incrementAndGet() on an atomic variable named reclaimersVersion and then spins while waiting for all the ongoing readers to complete by scanning the array of their versions named readersVersion.
Upon calling rcu_read_lock(), a (reader) thread will read the current value in reclaimersVersion and set its uniquely assigned entry in the readersVersion array to the same value, and then re-check that the version has not changed and if so, update the readersVersion with the new entry.
When the read operation is done, it will call rcu_read_unlock() which will set its entry in readersVersion back to NOT_READING, a constant with a special value that means that the reader is not currently active.

It's hard to get any simpler than this  ;)

The advantages of this algorithm are:
- Very light in synchronization on the reader's side: Only two sequentially consistent loads and one seq-cst store on rcu_read_lock() and a release store in rcu_read_unlock();
- Wait-free for readers: No loops or retries on either rcu_read_lock() or rcu_read_unlock();
- Starvation-free for reclaimers: Calls to synchronize_rcu() are free from starvation... non-obvious why, but true;
- Multiple reclaimers can make progress simultaneously: Unlike some of the other Userspace RCU implementations that have a mutual exclusion lock inside synchronize_rcu(), this one does not, which means it scales well for multiple reclaimers;
- Code is short and sweet: There can't be many hidden bugs in something that takes less than 20 lines of code;

The only true subtle detail about this algorithm is in the fact that rcu_read_lock() does a load, a store, a load and then a relaxed store. The first load and store would suffice for the algorithm itself, but the thing about volatile/seq-cst stores is that they allow regular code to be reordered from below to above. This means that we could have code being executed before the readersVersion.set(tid, rv) which would be incorrect from the point of view of the expected semantics of RCU, and therefore, we need some kind of barrier to prevent code from traveling up. The easiest way to accomplish this is to do a volatile/seq-cst load, and since we're doing that, we might as well check if it changed and if it did, then we might as well update it with a lazy (or relaxed) store.
Like I said, it's subtle and not too important to understand how the algorithm works, it's more of an implementation detail.

Why would anyone want use RCU in Java? RCU is for memory reclamation, and Java has a GC, right?
Like I mentioned before, RCU is not (just) about memory reclamation, it's about using quiescence in your application to do whatever you want with quiescence, and as amazing as it may seem, there is a non-negligible amount of things you can do with it!

For example, suppose you have a pool of complex objects that are too costly to create, so you want to re-use by returning to the pool. But this pool is concurrent, so you have to make sure that some thread that is lagging behind isn't holding a reference to the object anymore.
After returning the object to the pool, when is it safe to re-use the object?
The answer is RCU: Protect all accesses to the objects from that pool with an rcu_read_lock()/rcu_read_unlock() and only reuse from the pool after calling synchronize_rcu().
Another example is Relativistic Programming which you can see more of in this paper:

A funny thing about RCU is that one single RCU instance can be used for as many purposes as desired, for example, one RCU instance for all the pools (if there are more than one), or one per pool, as you prefer.

We'll prepare a C++ implementation for a future post.