How Reliable is Scientific Software?

Reviewed by Greg Wilson / 2021-09-25
Keywords: Scientific Computing, Software Reliability

I started reading empirical software engineering research because I was working with scientists and became embarrassed by how many of the things I did and taught were based on personal experience and anecdotes. Was agile development really better than designing things up front? Were some programming languages actually better than others? And what exactly does "better" mean in sentences like that? My friends and colleagues in physics, ecology, and public health could cite evidence to back up their professional opinions; all I could do was change the subject.

My interests have shifted over the years, but software engineering and scientific computing have always been near the center, which makes this set of papers a double pleasure to review. The first, Hatton1994, is now a quarter of a century old, but its conclusions are still fresh. The authors fed the same data into nine commercial geophysical software packages and compared the results; they found that, "numerical disagreement grows at around the rate of 1% in average absolute difference per 4000 lines of implemented code, and, even worse, the nature of the disagreement is nonrandom" (i.e., the authors of different packages make similar mistakes).
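To make the failure mode concrete, here is a minimal sketch in C, with synthetic inputs and no connection to the packages in the study: both functions below compute the mean of the same array, but one accumulates in single precision going forward and the other in double precision going backward, and the two answers differ measurably.

```c
#include <stdio.h>

#define N 1000000

/* Implementation A: single-precision accumulator, forward order. */
static float mean_forward_float(const float *x, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += x[i];
    return sum / (float)n;
}

/* Implementation B: double-precision accumulator, reverse order. */
static float mean_backward_double(const float *x, int n) {
    double sum = 0.0;
    for (int i = n - 1; i >= 0; i--)
        sum += x[i];
    return (float)(sum / n);
}

int main(void) {
    static float data[N];
    for (int i = 0; i < N; i++)        /* synthetic values spanning many magnitudes */
        data[i] = (i % 2 == 0) ? 1.0e6f : 1.0e-3f;

    float a = mean_forward_float(data, N);
    float b = mean_backward_double(data, N);
    printf("implementation A: %.9g\n", a);
    printf("implementation B: %.9g\n", b);
    printf("relative difference: %.3g\n", (double)(a - b) / b);
    return 0;
}
```

Neither version is obviously wrong; the disagreement comes entirely from implementation choices that rarely appear in any published specification, which is how independently written packages drift apart.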

Hatton1997 revisited this result while also reporting on a concurrent experiment that analyzed large scientific applications written in C and Fortran. This study found that, "…C and Fortran are riddled with statically detectable inconsistencies independent of the application area. For example, interface inconsistencies occur at the rate of one in every 7 interfaces on average in Fortran, and one in every 37 interfaces in C. They also show that…roughly 30% of the Fortran population and 10% of the C…would be deemed untestable by any standards."
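As a hypothetical illustration of the kind of interface inconsistency being counted (the file and function names are invented), the two C source files below, shown as a single listing, disagree about a routine's parameter types. Under pre-C99 rules the mismatched call is accepted with at most a warning, which is exactly the sort of statically detectable defect that deep-flow analysis reports.

```c
/* scale.c -- defines the routine with double parameters. */
double scale(double x, double factor) {
    return x * factor;
}

/* main.c -- calls it with no shared header or prototype. An implicit
 * declaration is assumed, the two int arguments are passed where
 * doubles are expected, and the printed result is garbage, even though
 * each file compiles on its own under older C dialects. */
#include <stdio.h>

int main(void) {
    double y = scale(10, 3);   /* interface inconsistency: int args, double params */
    printf("%f\n", y);
    return 0;
}
```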

Over twenty years later, the findings of Malik2019 and Schweinsberg2021 are equally sobering. Malik2019 compared five backscatter processing programs and found that they flagged different inputs as invalid, reported different mean backscatter levels, and so on; its authors trace some of these differences in results back to different assumptions about the underlying science and others back to different ways of doing the calculations.
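A deliberately toy example of how one such calculation choice shows up in the numbers (nothing here is code from the compared packages; the amplitude and gain values are invented): both pipelines convert the same raw amplitude to decibels, but pipeline B also removes a fixed receiver gain that pipeline A leaves in.

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double raw_amplitude = 0.02;      /* synthetic sample value */
    double assumed_gain_db = 6.0;     /* hypothetical fixed gain correction */

    double level_a = 20.0 * log10(raw_amplitude);                   /* pipeline A */
    double level_b = 20.0 * log10(raw_amplitude) - assumed_gain_db; /* pipeline B */

    printf("pipeline A: %.1f dB\n", level_a);
    printf("pipeline B: %.1f dB\n", level_b);
    printf("difference: %.1f dB\n", level_a - level_b);
    return 0;
}
```

A single assumption of this kind is already enough to produce an offset of the same order as the >5 dB differences the study reports.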

Rather than using existing software, Schweinsberg2021 had multiple independent teams use the same dataset to test two hypotheses however they wanted. They found that, "Researchers reported radically different analyses and dispersed empirical outcomes, in a number of cases obtaining significant effects in opposite directions for the same research question," and that, "decisions about how to operationalize variables explain variability in outcomes above and beyond statistical choices." In other words, the way researchers translated the question into specific tests of specific variables extracted from the data materially affected their conclusions.
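To make 'operationalize' concrete, here is a synthetic sketch; the data and both measures are invented and are not from the Schweinsberg2021 dataset. "Verbosity" for the same two speakers is computed either as total words spoken or as number of speaking turns, and the two reasonable definitions disagree about who is more verbose, so they feed different numbers into the same statistical test.

```c
#include <stdio.h>

struct speaker {
    const char *name;
    int words;   /* total words spoken in a meeting (synthetic) */
    int turns;   /* number of speaking turns (synthetic) */
};

int main(void) {
    struct speaker s[2] = {
        { "A", 900, 5 },   /* few long contributions */
        { "B", 400, 25 },  /* many short contributions */
    };

    for (int i = 0; i < 2; i++) {
        printf("%s: verbosity-as-words = %d, verbosity-as-turns = %d\n",
               s[i].name, s[i].words, s[i].turns);
    }
    /* Under the first operationalization A is the more verbose speaker;
     * under the second it is B. */
    return 0;
}
```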

One response to studies like these is to say they prove that computational science doesn't meet the standards set for experimental science. However, a lot of experimental work is just as shaky when examined closely, and if mathematicians insisted that every published proof actually had to be rigorous, many fewer theorems would see print. The truth is that we don't know why any of this stuff works as well as it does; if you doubt that, see how much of your faith survives a course on the philosophy of science.

What we do know is that the more open researchers are, the more likely their results are to be correct. Wicherts2011 found a strong correlation between researchers' reluctance to share their data and the number of errors in their analyses. Making the software and data used to produce results freely available doesn't guarantee that those results are correct, but not sharing them ensures that no one else can check.

Hatton1994 L. Hatton and A. Roberts: "How accurate is scientific software?". IEEE Transactions on Software Engineering, 20(10), 1994, 10.1109/32.328993.

This paper describes some results of what, to the authors' knowledge, is the largest N-version programming experiment ever performed. The object of this ongoing four-year study is to attempt to determine just how consistent the results of scientific computation really are, and, from this, to estimate accuracy. The experiment is being carried out in a branch of the earth sciences known as seismic data processing, where 15 or so independently developed large commercial packages that implement mathematical algorithms from the same or similar published specifications in the same programming language (Fortran) have been developed over the last 20 years. The results of processing the same input dataset, using the same user-specified parameters, for nine of these packages are reported in this paper. Finally, feedback of obvious flaws was attempted to reduce the overall disagreement. The results are deeply disturbing. Whereas scientists like to think that their code is accurate to the precision of the arithmetic used, in this study, numerical disagreement grows at around the rate of 1% in average absolute difference per 4000 lines of implemented code, and, even worse, the nature of the disagreement is nonrandom. Furthermore, the seismic data processing industry has better than average quality standards for its software development with both identifiable quality assurance functions and substantial test datasets.

Hatton1997 L. Hatton: "The T-experiments: errors in scientific software". In Ronald F. Boisvert (ed.): Quality of Numerical Software. Springer US, 1997, 978-1-5041-2940-4.

This paper covers two very large experiments carried out concurrently between 1990 and 1994, together known as the T-experiments. Experiment T1 had the objective of measuring the consistency of several million lines of scientific software written in C and Fortran 77 by static deep-flow analysis across many different industries and application areas, and experiment T2 had the objective of measuring the level of dynamic disagreement between independent implementations of the same algorithms acting on the same input data with the same parameters in just one of these industrial application areas. Experiment T1 showed that C and Fortran are riddled with statically detectable inconsistencies independent of the application area. For example, interface inconsistencies occur at the rate of one in every 7 interfaces on average in Fortran, and one in every 37 interfaces in C. They also show that Fortran components are typically 2.5 times bigger than C components, and that roughly 30% of the Fortran population and 10% of the C population would be deemed untestable by any standards. Experiment T2 was even more disturbing. Whereas scientists like to think that their results are accurate to the precision of the arithmetic used, in this study, the degree of agreement gradually degenerated from 6 significant figures to 1 significant figure during the computation. The reasons for this disagreement are laid squarely at the door of software failure, as other possible causes are considered and rejected. Taken with other evidence, these two experiments suggest that the results of scientific calculations involving significant amounts of software should be taken with several large pinches of salt.

Malik2019 Mashkoor Malik, Alexandre C. G. Schimel, Giuseppe Masetti, Marc Roche, Julian Le Deunf, Margaret F.J. Dolan, Jonathan Beaudoin, Jean-Marie Augustin, Travis Hamilton, and Iain Parnum: "Results from the First Phase of the Seafloor Backscatter Processing Software Inter-Comparison Project". Geosciences, 9(12), 2019, 10.3390/geosciences9120516.

Seafloor backscatter mosaics are now routinely produced from multibeam echosounder data and used in a wide range of marine applications. However, large differences (>5 dB) can often be observed between the mosaics produced by different software packages processing the same dataset. The lack of transparency of the processing pipeline and the lack of consistency between software packages raise concerns about the validity of the final results. To recognize the source(s) of inconsistency between software, it is necessary to understand at which stage(s) of the data processing chain the differences become substantial. To this end, willing commercial and academic software developers were invited to generate intermediate processed backscatter results from a common dataset, for cross-comparison. The first phase of the study requested intermediate processed results consisting of two stages of the processing sequence: the one-value-per-beam level obtained after reading the raw data and the level obtained after radiometric corrections but before compensation of the angular dependence. Both of these intermediate results showed large differences between software solutions. This study explores the possible reasons for these differences and highlights the need for collaborative efforts between software developers and their users to improve the consistency and transparency of the backscatter data processing sequence.

Schweinsberg2021 Martin Schweinsberg and 178 others: "Same data, different conclusions: Radical dispersion in empirical results when independent analysts operationalize and test the same hypothesis". Organizational Behavior and Human Decision Processes, 165, 2021, 10.1016/j.obhdp.2021.02.003.

In this crowdsourced initiative, independent analysts used the same dataset to test two hypotheses regarding the effects of scientists' gender and professional status on verbosity during group meetings. Not only the analytic approach but also the operationalizations of key variables were left unconstrained and up to individual analysts. For instance, analysts could choose to operationalize status as job title, institutional ranking, citation counts, or some combination. To maximize transparency regarding the process by which analytic choices are made, the analysts used a platform we developed called DataExplained to justify both preferred and rejected analytic paths in real time. Analyses lacking sufficient detail, reproducible code, or with statistical errors were excluded, resulting in 29 analyses in the final sample. Researchers reported radically different analyses and dispersed empirical outcomes, in a number of cases obtaining significant effects in opposite directions for the same research question. A Boba multiverse analysis demonstrates that decisions about how to operationalize variables explain variability in outcomes above and beyond statistical choices (e.g., covariates). Subjective researcher decisions play a critical role in driving the reported empirical results, underscoring the need for open data, systematic robustness checks, and transparency regarding both analytic paths taken and not taken. Implications for organizations and leaders, whose decision making relies in part on scientific findings, consulting reports, and internal analyses by data scientists, are discussed.