Parallelism in Open Source Projects

Reviewed by Greg Wilson / 2016-06-12
Keywords: Parallelism

Kiefer2015 Marc Kiefer, Daniel Warzel, and Walter F. Tichy: "An empirical study on parallelism in modern open-source projects". Proceedings of the 2nd International Workshop on Software Engineering for Parallel Systems, 10.1145/2837476.2837481.

We present an empirical study of 135 parallel open-source projects in Java, C# and C++ ranging from small (<1000 lines of code) to very large (>2M lines of code) codebases. We examine the projects to find out how language features, synchronization mechanisms, parallel data structures and libraries are used by developers to express parallelism. We also determine which common parallel patterns are used and how the implemented solutions compare to typical textbook advice.

The results show that similar parallel constructs are used equally often across languages, but usage also heavily depends on how easy to use a certain language feature is. Patterns that do not map well to a language are much rarer compared to other languages. Bad practices are prevalent in hobby projects but also occur in larger projects.

I wrote my first parallel program in 1986 on an ICL DAP with 4096 bit-serial processors. I wrote my second in Occam the following spring for a 16-transputer machine, and spent the next ten years thinking, "There has to be a better way." Two decades later, the answer to that is still, "Not yet," but in those years parallelism of various kinds has gone from esoteric to everyday.

But how do we actually use it? And how well? To answer those questions, the authors of this paper explored open source programs written in several different languages. Here are a few of their findings (the figures for Java are omitted, since those tables are larger than the ones for C# and C++ combined):

Use of High-Level Constructs

Language  Feature                         # projects  % of total
C#        Task                            33          73.33%
          ThreadPool                      19          42.22%
          TaskScheduler                   10          22.22%
          Parallel.For                     8          17.78%
          BlockingCollection               6          13.33%
          Parallel.ForEach                 6          13.33%
          TaskFactory                      5          11.11%
          Parallel.Invoke                  1           2.22%
C++       #pragma omp parallel for         6          13.64%
          future/promise                   3           6.82%
          #pragma omp parallel             2           4.55%
          packaged task                    0           0.00%
          shared future                    0           0.00%

Use of Synchronization Primitives

Language  Feature                         # projects  % of total
C#        lock()                          42          93.33%
          ManualResetEvent                22          48.89%
          Monitor                         17          37.78%
          AutoResetEvent                  16          35.56%
          ReaderWriterLockSlim            15          33.33%
          WaitHandle                      13          28.89%
          EventWaitHandle                 10          22.22%
          Mutex                           10          22.22%
          ManualResetEventSlim             8          17.78%
          Barrier                          7          15.56%
          Semaphore                        6          13.33%
          SpinWait                         4           8.89%
          CountdownEvent                   3           6.67%
          ReaderWriterLock                 3           6.67%
          SemaphoreSlim                    3           6.67%
          MethodImplOptions Synchronized   2           4.44%
          Interlocked.MemoryBarrier        1           2.22%
          SpinLock                         0           0.00%
C++       mutex                           39          88.64%
          condition variable              28          63.63%
          Semaphore                       18          40.91%
          CriticalSection                 17          38.64%
          unique lock                     16          36.36%
          lock guard                      12          27.27%
          barrier                          5          11.36%
          #pragma omp critical             3           6.82%

But these tables don't tell the most important parts of the story. For that, we have to look at patterns:

Pattern Java C# C++
Master-worker 29 23 23
Producer-consumer 10 5 5
Pipeline 8 9 8
Parallel Loop 1 8 5
Fork-join 3 1 1
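
Master-worker is the most common pattern in all three languages, so it may help to see what it looks like. The following is a minimal sketch in Java, assuming a fixed pool of workers from java.util.concurrent. It is an illustration only, not code from the paper or from any of the studied projects, and the names MasterWorkerSketch and sumOfSquares are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical class for illustration only; it does not come from the paper.
class MasterWorkerSketch {
    // The master creates one task per number, hands the tasks to a pool of
    // workers, then collects and combines the partial results.
    static long sumOfSquares(List<Integer> numbers, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (int n : numbers) {
                results.add(pool.submit(() -> (long) n * n));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get(); // blocks until that task has finished
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("a worker task failed", e);
        } finally {
            pool.shutdown(); // let the worker threads exit
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(Arrays.asList(1, 2, 3, 4), 2));
    }
}
```

The sketch reuses the library's thread pool instead of creating and managing threads by hand, which is the kind of reuse the authors associate with larger codebases.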

The authors then go further and look at where programmers do things they shouldn't:

  • "Smaller projects tend to reimplement slight variations of existing functionality, while projects with a larger codebase often reuse existing functionality and extend these classes via subclassing."
  • "An exception is the master-worker pattern. It seems that it is such an intuitive pattern, that some developers seem to implement it by accident. Unexperienced developers just create threads and distribute their work to them. In larger projects comments and class naming suggests that they are aware of the master-worker pattern and implement it on purpose…"
  • "Prime examples of bad practice are the synchronized() and lock() features in Java and C#. Both language references strongly advise against locking on publicly accessible types or classes. Nevertheless this is what happens in most instances in both Java and C# regardless of the size of the project."
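
The locking problem in the last quotation can be sketched in Java. This is a minimal illustration under stated assumptions, not code from the paper or from any of the studied projects: the class names ExposedCounter, GuardedCounter, and LockDemo are hypothetical. The first class synchronizes on an object that callers can reach; the second synchronizes on a private object that they cannot.

```java
// Hypothetical classes for illustration only; none of them come from the paper.

// Discouraged: the monitor is the object itself, so any code holding a
// reference to the counter can lock it too and stall or deadlock increment().
class ExposedCounter {
    private int count = 0;

    int increment() {
        synchronized (this) {
            return ++count;
        }
    }
}

// Preferred: the monitor is a private object that outside code cannot reach.
class GuardedCounter {
    private final Object lock = new Object();
    private int count = 0;

    int increment() {
        synchronized (lock) {
            return ++count;
        }
    }
}

class LockDemo {
    public static void main(String[] args) throws InterruptedException {
        GuardedCounter counter = new GuardedCounter();
        Runnable work = () -> {
            for (int i = 0; i < 1000; i++) {
                counter.increment();
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        a.join();
        b.join();
        // Both threads have finished, so this is the 2001st increment.
        System.out.println(counter.increment());
    }
}
```

Both classes give correct counts on their own. The difference only shows up when unrelated code takes the same lock, which is why the mistake is easy to make and hard to notice.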

It's easy to say this means we need better documentation. Easy, but wrong. Just as you can't refactor your way to better security, you can't document your way to better usability. Knowing what people use, how they use it, and how they misuse it could and should inform the design of the next generation of parallel programming tools. And if that's too optimistic, we can at least hope that it will give the builders of the next generation of program checking tools a larger list of things to look for…