On the costs of mutual exclusion in C++

Disruptor [1] is a software component that implements lock-free message passing between different threads of control. Disruptor is open source; its way of operation and its advantages over traditional message passing are described in [2], [3] and [4].

The motivation for building the Disruptor was the search for the most efficient ways to pass messages from one thread to another. That, in turn, requires some synchronisation between the threads: the producing thread must be able to inform the consumer that the next message is ready to be processed. In [1], the authors experiment with different ways of synchronising threads in Java. I decided to perform similar tests in C++ to see if there is a substantial language dependency in the results.

The table below shows average time, in milliseconds, to perform 500 million increments of a 64-bit unsigned integer. The results were produced on an Intel Core i3-2310M CPU (2 physical cores with hyperthreading give 4 logical cores) running at 2.10GHz. Of course, the results for other CPUs might differ greatly, so it is the ratio between the execution times that matters, not the times themselves. The last column shows the ratio of each measurement to that of an un-contended read and write.

Method Time, ms Ratio
Atomic read and write, no contention 1,774 1.00
Atomic read and write, contention 1,863 1.05
Atomic fetch_add, no contention 5,539 3.12
Atomic fetch_add, contention 30,264 17.05
Locking a mutex, no contention 17,729 9.99
Locking a mutex, contention 153,566 86.56

In the last four rows, contention means that two threads increment the same variable. For atomic reads and writes, however, an experiment with two threads updating the same variable would make little sense, as no correct real-life program would do that. In this case, contention means that two threads update different variables which are located next to each other in memory, hence they probably will be accessed through the same cache line.

These results are not very different from the results shown in section 3.1 of [1], apart from one. The first two rows are almost identical. This means that the speed of atomic reads and writes does not differ much if the variables written by the two threads are next to each other or not. In other words, the effect of “false sharing” (section 3.4 of [1]) has almost no effect on the results of this experiment. This detail, however, is not relevant to the conclusion of the experiment: traditional ways of synchronisation, such as mutex locking or even atomic fetch_add, are far inferior to the algorithms where each variable is modified by only one thread (although it can be read by many).

The code which I used to run the experiments can be downloaded as timing.tgz. I compiled it with GCC version 4.8.2 and ran it on 64-bit Ubuntu 14.04. I did not try, but it should be possible to compile the code with any C++11-compliant compiler.
The tar includes the following files:

timing.h Utilities to measure execution time.
atomic_update.cpp Measurement of atomic reads and writes.
fetch_add.cpp Measurement of fetch_add.
locked_update.cpp Measurement of locking a mutex.
mean_stddev.cpp Utility to compute the mean and standard deviation.
run.sh Script that compiles the code and runs the measurements.

Almost all code is self-explanatory. Each measurement is run a few times (from 2000 to 200, depending on what kind of measurement it is), and the average running time is computed. Besides, the standard deviation of results is computed. A high standard deviation shows that the conditions were not consistent throughout the experiment, which means that the results might be off. In my case the standard deviation was never greater than 60, which is fairly decent.

The last file on the list, run.sh, is probably the only file that would need modification for other compilers and operating systems.

I am curious what the results might look like for other compilers and OS’es, so if you decide to run the measurements, I would appreciate if you share the results with me. The measurement code is free to use and modify, it is provided “as is”. Keep in mind, though, that it might take a fairly long time to run the measurements, even on modern hardware; I usually do it overnight.

References

[1] http://disruptor.googlecode.com/files/Disruptor-1.0.pdf
[2] http://martinfowler.com/articles/lmax.html
[3] http://mechanitis.blogspot.nl/2011/06/dissecting-disruptor-whats-so-special.html
[4] http://mechanitis.blogspot.nl/2011/07/dissecting-disruptor-why-its-so-fast.html

Advertisements

About Alluve

Alluve develops and markets software solutions for assessing risks on the portfolios of financial derivatives. Our product, Alluve MarketSimulator, uses Monte-Carlo simulation to calculate risk measures, such as value at risk (VaR) and exposure at default (EAD).
This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s