It Will Never Work in Theory

The Confounding Effect of Class Size on the Validity of Object-Oriented Metrics

Posted Jul 7, 2011 by Greg Wilson

| Metrics |

Khaled El Emam, Saida Benlarbi, Nishith Goel, and Shesh N. Rai: "The Confounding Effect of Class Size on the Validity of Object-Oriented Metrics". IEEE Transasctions on Software Engineering, 27(7), July 2001.

Much effort has been devoted to the development and empirical validation of object-oriented metrics. The empirical validations performed thus far would suggest that a core set of validated metrics is close to being identified. However, none of these studies allow for the potentially confounding effect of class size. In this paper, we demonstrate a strong size confounding effect and question the results of previous object-oriented metrics validation studies. We first investigated whether there is a confounding effect of class size in validation studies of object-oriented metrics and show that, based on previous work, there is reason to believe that such an effect exists. We then describe a detailed empirical methodology for identifying those effects. Finally, we perform a study on a large C++ telecommunications framework to examine if size is really a confounder. This study considered the Chidamber and Kemerer metrics and a subset of the Lorenz and Kidd metrics. The dependent variable was the incidence of a fault attributable to a field failure (fault-proneness of a class). Our findings indicate that, before controlling for size, the results are very similar to previous studies: the metrics that are expected to be validated are indeed associated with fault-proneness. After controlling for size, none of the metrics we studied were associated with fault-proneness any more. This demonstrates a strong size confounding effect and casts doubt on the results of previous object-oriented metrics validation studies. It is recommended that previous validation studies be reexamined to determine whether their conclusions would still hold after controlling for size and that future validation studies should always control for size.

We all know that some programs are more complex than others, but can we actually quantify that? Ever since the early 1970s, researchers have invented metrics (such as cyclomatic complexity or coupling and cohesion), then validated them by seeing how well they correlate with things like post-release bug counts. The idea is that if what we mean by "complex" is "hard to understand", complex software should have more bugs than simple software, and a measure that can predict the likely number of bugs in a product before it's released would be a very useful thing.

El Emam and his colleagues repeated some of those experiments using bivariate analysis so that they could allocate a share of the blame to code size (measured by number of lines) and the metric in question. It turned out that code size accounted for all of the significant variation: in other words, the object-oriented metrics they looked at didn't have any actual predictive power once they normalized for the number of lines of code. Herraiz and Hassan's chapter in Making Software, which reports on an even larger study using open source software, reached the same conclusion:

...for non-header files written in C language, all the complexity metrics are highly correlated with lines of code, and therefore the more complex metrics provide no further information that could not be measured simply with lines of code... In our opinion, there is a clear lesson from this study: syntactic complexity metrics cannot capture the whole picture of software complexity. Complexity metrics that are exclusively based on the structure of the program or the properties of the text...do not provide information on the amount of effort that is needed to comprehend a piece of code—or, at least, no more information than lines of code do.

This emphatically doesn't mean that trying to measure software is a waste of time: Weyuker and Ostrand's chapter in that same book shows that it is possible to predict which files are likely to contain the most bugs. What it does mean, though, is that figuring out whether some new measure actually tells us something we didn't already know is harder than it seems.

Comments powered by Disqus