Variability and Reproducibility in Software Engineering: A Study of Four Companies that Developed the Same System

Reviewed by Greg Wilson / 2011-09-22
Keywords: Research Methods

Anda2009 B.C.D. Anda, D.I.K. Sjøberg, and A. Mockus: "Variability and Reproducibility in Software Engineering: A Study of Four Companies that Developed the Same System". IEEE Transactions on Software Engineering, 35(3), 2009, 10.1109/tse.2008.89.

The scientific study of a phenomenon requires it to be reproducible. Mature engineering industries are recognized by projects and products that are, to some extent, reproducible. Yet, reproducibility in software engineering (SE) has not been investigated thoroughly, despite the fact that lack of reproducibility has both practical and scientific consequences. We report a longitudinal multiple-case study of variations and reproducibility in software development, from bidding to deployment, on the basis of the same requirement specification. In a call for tender to 81 companies, 35 responded. Four of them developed the system independently. The firm price, planned schedule, and planned development process, had, respectively, "low", "low", and "medium" reproducibilities. The contractor's costs, actual lead time, and schedule overrun of the projects had, respectively, "medium", "high", and "low" reproducibilities. The quality dimensions of the delivered products, reliability, usability, and maintainability had, respectively, "low", "high", and "low" reproducibilities. Moreover, variability for predictable reasons is also included in the notion of reproducibility. We found that the observed outcome of the four development projects matched our expectations, which were formulated partially on the basis of SE folklore. Nevertheless, achieving more reproducibility in SE remains a great challenge for SE research, education, and industry.

Albert Einstein is often credited with defining insanity as "doing the same thing over and over again and expecting different results." That's also a good definition of science: we repeat our experiments so that we can gather statistics about their outcomes, which in turn give us deeper insight into what the universe is doing. This can be an expensive process: just look at the LHC, the cost of putting a probe into space, or the salaries of professional programmers. As much as they'd like to, most researchers simply can't afford to have several teams develop the same software independently just so that the differences in what they do can be studied.

That's what makes this paper so valuable. As their abstract says, Anda, Sjøberg, and Mockus had four teams build the same software independently and in parallel so that they could look at how much variation there was in what happened. Their results are worth re-summarizing:

  • High reproducibility: actual lead time, usability
  • Medium reproducibility: planned development process, cost
  • Low reproducibility: firm price, planned schedule, schedule overrun, reliability, maintainability

Note that putting something in the "low" category doesn't mean it was uniformly poor; it means there was wide variation, i.e., that the outcome was unpredictable. As the authors say, their results match software engineering folklore, and they are a solid guide to what research should focus on improving.
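
To make the "wide variation" idea concrete, here is a minimal sketch (my illustration, not the authors' analysis, and the numbers are hypothetical) of one common way to quantify spread across a handful of projects: the coefficient of variation, i.e., the standard deviation divided by the mean. A small value means the projects clustered tightly on that measure, roughly what a "high reproducibility" label suggests; a large value means they scattered widely.

    # Illustrative sketch only: these figures are hypothetical, not taken from
    # the paper. Each list holds one measure observed across four independent
    # projects built from the same specification.
    from statistics import mean, stdev

    def coefficient_of_variation(values):
        """Spread relative to the mean: larger means less reproducible."""
        return stdev(values) / mean(values)

    # Hypothetical lead times in weeks: tightly clustered, so the spread is
    # small and we would call this measure highly reproducible.
    lead_times = [11, 12, 12, 13]

    # Hypothetical firm prices in thousands of euros: widely scattered, so we
    # would call this measure's reproducibility low.
    firm_prices = [20, 60, 90, 160]

    print(f"lead time CV:  {coefficient_of_variation(lead_times):.2f}")   # ~0.07
    print(f"firm price CV: {coefficient_of_variation(firm_prices):.2f}")  # ~0.72

Whatever the authors' exact scoring, the intuition is the same: the less a measure varies across the four projects, the more reproducible it is.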