Parallelism in Open Source Projects

Reviewed by Greg Wilson / 2016-06-12
Keywords: Parallelism

Kiefer2015 Marc Kiefer, Daniel Warzel, and Walter F. Tichy: "An empirical study on parallelism in modern open-source projects". Proceedings of the 2nd International Workshop on Software Engineering for Parallel Systems, 10.1145/2837476.2837481.

We present an empirical study of 135 parallel open-source projects in Java, C# and C++ ranging from small (<1000 lines of code) to very large (>2M lines of code) codebases. We examine the projects to find out how language features, synchronization mechanisms, parallel data structures and libraries are used by developers to express parallelism. We also determine which common parallel patterns are used and how the implemented solutions compare to typical textbook advice.

The results show that similar parallel constructs are used equally often across languages, but usage also heavily depends on how easy to use a certain language feature is. Patterns that do not map well to a language are much rarer compared to other languages. Bad practices are prevalent in hobby projects but also occur in larger projects.

I wrote my first parallel program in 1986 on an ICL DAP with 4096 bit-serial processors. I wrote my second in Occam the following spring for a 16-transputer machine, and spent the next ten years thinking, "There has to be a better way." Two decades later, the answer to that is still, "Not yet," but in those years parallelism of various kinds has gone from esoteric to everyday.

But how do we actually use it? And how well? To answer those questions, the authors of this paper explored open source programs written in several different languages. Here are a few of their findings (the figures for Java are omitted, since those tables are larger than the ones for C# and C++ combined):

Use of High-Level Constructs				Use of Synchronization Primitives
Language	Feature	# projects	% of total	Language	Feature	# projects	% of total
C#	Task	33	73.33%	C#	lock()	42	93.33%
	ThreadPool	19	42.22%		ManualResetEvent	22	48.89%
	TaskScheduler	10	22.22%		Monitor	17	37.78%
	Parallel.For	8	17.78%		AutoResetEvent	16	35.56%
	BlockingCollection	6	13.33%		ReaderWriterLockSlim	15	33.33%
	Parallel.ForEach	6	13.33%		WaitHandle	13	28.89%
	TaskFactory	5	11.11%		EventWaitHandle	10	22.22%
	Parallel.Invoke	1	2.22%		Mutex	10	22.22%
					ManualResetEventSlim	8	17.78%
					Barrier	7	15.56%
					Semaphore	6	13.33%
					SpinWait	4	8.89%
					CountdownEvent	3	6.67%
					ReaderWriterLock	3	6.67%
					SemaphoreSlim	3	6.67%
					MethodImplOptions Synchronized	2	4.44%
					Interlocked.MemoryBarrier	1	2.22%
					SpinLock	0	0.00%
C++	#pragma omp parallel for	6	13.64%	C++	mutex	39	88.64%
	future/promise	3	6.82%		condition variable	28	63.64%
	#pragma omp parallel	2	4.55%		Semaphore	18	40.91%
	packaged task	0	0.00%		CriticalSection	17	38.64%
	shared future	0	0.00%		unique lock	16	36.36%
					lock guard	12	27.27%
					barrier	5	11.36%
					#pragma omp critical	3	6.82%

But these tables don't tell the most important parts of the story. For that, we have to look at patterns:

Pattern	Java	C#	C++
Master-worker	29	23	23
Producer-consumer	10	5	5
Pipeline	8	9	8
Parallel Loop	1	8	5
Fork-join	3	1	1
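
For readers who have not met the terminology, here is a minimal sketch of master-worker, the most common pattern in the table. It is not taken from the paper or from any of the studied projects: the class name `MasterWorkerSketch`, the method `sumOfSquares`, and the toy workload are all hypothetical, chosen only to show the shape of the pattern using the standard `java.util.concurrent` executor API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration of master-worker; not code from the paper.
class MasterWorkerSketch {
    // The master creates one task per item, hands the tasks to a fixed
    // pool of worker threads, then collects and combines the results.
    static long sumOfSquares(List<Integer> items, int workerCount)
            throws InterruptedException, ExecutionException {
        ExecutorService workers = Executors.newFixedThreadPool(workerCount);
        try {
            List<Future<Long>> pending = new ArrayList<>();
            for (int item : items) {
                final long value = item;
                // Each submitted task runs on whichever worker is free.
                pending.add(workers.submit(() -> value * value));
            }
            long total = 0;
            for (Future<Long> result : pending) {
                total += result.get(); // blocks until that task is done
            }
            return total;
        } finally {
            workers.shutdown(); // let the worker threads exit
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sumOfSquares(List.of(1, 2, 3, 4), 2));
    }
}
```

Using a library thread pool rather than hand-started threads is a design choice made for this sketch; the quoted findings below suggest that many small projects do the latter.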

But the authors go even further and look at where programmers do things they shouldn't:

"Smaller projects tend to reimplement slight variations of existing functionality, while projects with a larger codebase often reuse existing functionality and extend these classes via subclassing."
"An exception is the master-worker pattern. It seems that it is such an intuitive pattern, that some developers seem to implement it by accident. Unexperienced developers just create threads and distribute their work to them. In larger projects comments and class naming suggests that they are aware of the master-worker pattern and implement it on purpose…"
"Prime examples of bad practice are the synchronized() and lock() features in Java and C#. Both language references strongly advise against locking on publicly accessible types or classes. Nevertheless this is what happens in most instances in both Java and C# regardless of the size of the project."
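
To make the last of these concrete, here is a small Java sketch of why locking on a publicly reachable object is risky. Everything in it is hypothetical (the classes `PublicLockCounter` and `PrivateLockCounter` and the helper `finishesWhileHeld` are invented for illustration and do not come from the paper): one counter synchronizes on `this`, the other on a private lock object, and the helper checks whether an operation can complete while unrelated code holds the counter's own monitor.

```java
// Hypothetical illustration; not code from the paper or the studied projects.
class LockingSketch {
    // Discouraged: locks on `this`, which any caller can lock as well.
    static class PublicLockCounter {
        private int count;
        void increment() { synchronized (this) { count++; } }
        int get() { synchronized (this) { return count; } }
    }

    // Preferred: the lock object never escapes the class.
    static class PrivateLockCounter {
        private final Object lock = new Object();
        private int count;
        void increment() { synchronized (lock) { count++; } }
        int get() { synchronized (lock) { return count; } }
    }

    // Runs `action` on another thread while the calling thread holds
    // `monitor`, and reports whether it finished within 500 ms.
    static boolean finishesWhileHeld(Object monitor, Runnable action)
            throws InterruptedException {
        Thread worker = new Thread(action);
        boolean finished;
        synchronized (monitor) {
            worker.start();
            worker.join(500); // waits on the worker, keeps `monitor` held
            finished = !worker.isAlive();
        }
        worker.join(); // monitor released, so the worker can finish now
        return finished;
    }

    public static void main(String[] args) throws Exception {
        PublicLockCounter exposed = new PublicLockCounter();
        PrivateLockCounter hidden = new PrivateLockCounter();
        System.out.println(finishesWhileHeld(exposed, exposed::increment));
        System.out.println(finishesWhileHeld(hidden, hidden::increment));
    }
}
```

The first call should report that the increment stalled, because outside code holding the counter's monitor blocks it; the second should report that it completed, because no outside code can reach the private lock. The 500 ms timeout is an arbitrary choice for the demonstration.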

It's easy to say this means we need better documentation: easy, but wrong. Just as you can't refactor your way to better security, you can't document your way to better usability. Knowing what people use, how they use it, and how they misuse it could and should inform the design of the next generation of parallel programming tools. And if that's too optimistic, we can at least hope that it will give the builders of the next generation of program-checking tools a larger list of things to look for…
