To Do

BibTeX | PDF



AbuHassan2020 Amjad AbuHassan, Mohammad Alshayeb, and Lahouari Ghouti: "Software smell detection techniques: A systematic literature review". Journal of Software: Evolution and Process, 33(3), 2020, 10.1002/smr.2320.

Software smells indicate design or code issues that might degrade the evolution and maintenance of software systems. Detecting and identifying these issues are challenging tasks. This paper explores, identifies, and analyzes the existing software smell detection techniques at design and code levels. We carried out a systematic literature review (SLR) to identify and collect 145 primary studies related to smell detection in software design and code. Based on these studies, we address several questions related to the analysis of the existing smell detection techniques in terms of abstraction level (design or code), targeted smells, used metrics, implementation, and validation. Our analysis identified several detection techniques categories. We observed that 57% of the studies did not use any performance measures, 41% of them omitted details on the targeted programing language, and the detection techniques were not validated in 14% of these studies. With respect to the abstraction level, only 18% of the studies addressed bad smell detection at the design level. This low coverage urges for more focus on bad smell detection at the design level to handle them at early stages. Finally, our SLR brings to the attention of the research community several opportunities for future research.


Balali2018 Sogol Balali, Igor Steinmacher, Umayal Annamalai, Anita Sarma, and Marco Aurelio Gerosa: "Newcomers' Barriers… Is That All? An Analysis of Mentors' and Newcomers' Barriers in OSS Projects". Computer Supported Cooperative Work, 27(3-6), 2018, 10.1007/s10606-018-9310-8.

Newcomers' seamless onboarding is important for open collaboration communities, particularly those that leverage outsiders' contributions to remain sustainable. Nevertheless, previous work shows that OSS newcomers often face several barriers to contribute, which lead them to lose motivation and even give up on contributing. A well-known way to help newcomers overcome initial contribution barriers is mentoring. This strategy has proven effective in offline and online communities, and to some extent has been employed in OSS projects. Studying mentors' perspectives on the barriers that newcomers face play a vital role in improving onboarding processes; yet, OSS mentors face their own barriers, which hinder the effectiveness of the strategy. Since little is known about the barriers mentors face, in this paper, we investigate the barriers that affect mentors and their newcomer mentees. We interviewed mentors from OSS projects and qualitatively analyzed their answers. We found 44 barriers: 19 that affect mentors; and 34 that affect newcomers (9 affect both newcomers and mentors). Interestingly, most of the barriers we identified (66%) have a social nature. Additionally, we identified 10 strategies that mentors indicated to potentially alleviate some of the barriers. Since gender-related challenges emerged in our analysis, we conducted nine follow-up structured interviews to further explore this perspective. The contributions of this paper include: identifying the barriers mentors face; bringing the unique perspective of mentors on barriers faced by newcomers; unveiling strategies that can be used by mentors to support newcomers; and investigating gender-specific challenges in OSS mentorship. Mentors, newcomers, online communities, and educators can leverage this knowledge to foster new contributors to OSS projects.

Bao2021 Lingfeng Bao, Xin Xia, David Lo, and Gail C. Murphy: "A Large Scale Study of Long-Time Contributor Prediction for GitHub Projects". IEEE Transactions on Software Engineering, 47(6), 2021, 10.1109/tse.2019.2918536.

The continuous contributions made by long time contributors (LTCs) are a key factor enabling open source software (OSS) projects to be successful and survival. We study Github as it has a large number of OSS projects and millions of contributors, which enables the study of the transition from newcomers to LTCs. In this paper, we investigate whether we can effectively predict newcomers in OSS projects to be LTCs based on their activity data that is collected from Github. We collect Github data from GHTorrent, a mirror of Github data. We select the most popular 917 projects, which contain 75,046 contributors. We determine a developer as a LTC of a project if the time interval between his/her first and last commit in the project is larger than a certain time $T$. In our experiment, we use three different settings on the time interval: 1, 2, and 3 years. There are 9,238, 3,968, and 1,577 contributors who become LTCs of a project in three settings of time interval, respectively. To build a prediction model, we extract many features from the activities of developers on Github, which group into five dimensions: developer profile, repository profile, developer monthly activity, repository monthly activity, and collaboration network. We apply several classifiers including naive Bayes, SVM, decision tree, kNN and random forest. We find that random forest classifier achieves the best performance with AUCs of more than 0.75 in all three settings of time interval for LTCs. We also investigate the most important features that differentiate newcomers who become LTCs from newcomers who stay in the projects for a short time. We find that the number of followers is the most important feature in all three settings of the time interval studied. We also find that the programming language and the average number of commits contributed by other developers when a newcomer joins a project also belong to the top 10 most important features in all three settings of time interval for LTCs. Finally, we provide several implications for action based on our analysis results to help OSS projects retain newcomers.

Barke2019 Helena Barke and Lutz Prechelt: "Role clarity deficiencies can wreck agile teams". PeerJ Computer Science, 5, 2019, 10.7717/peerj-cs.241.

Background One of the twelve agile principles is to build projects around motivated individuals and trust them to get the job done. Such agile teams must self-organize, but this involves conflict, making self-organization difficult. One area of difficulty is agreeing on everybody's role. Background What dynamics arise in a self-organizing team from the negotiation of everybody's role? Method We conceptualize observations from five agile teams (work observations, interviews) by Charmazian Grounded Theory Methodology. Results We define role as something transient and implicit, not fixed and named. The roles are characterized by the responsibilities and expectations of each team member. Every team member must understand and accept their own roles (Local role clarity) and everbody else's roles (Team-wide role clarity). Role clarity allows a team to work smoothly and effectively and to develop its members' skills fast. Lack of role clarity creates friction that not only hampers the day-to-day work, but also appears to lead to high employee turnover. Agile coaches are critical to create and maintain role clarity. Conclusions Agile teams should pay close attention to the levels of Local role clarity of each member and Team-wide role clarity overall, because role clarity deficits are highly detrimental.

Bi2021 Tingting Bi, Wei Ding, Peng Liang, and Antony Tang: "Architecture information communication in two OSS projects: The why, who, when, and what". Journal of Systems and Software, 181, 2021, 10.1016/j.jss.2021.111035.

Architecture information is vital for Open Source Software (OSS) development, and mailing list is one of the widely used channels for developers to share and communicate architecture information. This work investigates the nature of architecture information communication (i.e., why, who, when, and what) by OSS developers via developer mailing lists. We employed a multiple case study approach to extract and analyze the architecture information communication from the developer mailing lists of two OSS projects, ArgoUML and Hibernate, during their development life-cycle of over 18 years. Our main findings are: (a) architecture negotiation and interpretation are the two main reasons (i.e., why) of architecture communication; (b) the amount of architecture information communicated in developer mailing lists decreases after the first stable release (i.e., when); (c) architecture communications centered around a few core developers (i.e., who); (d) and the most frequently communicated architecture elements (i.e., what) are Architecture Rationale and Architecture Model. There are a few similarities of architecture communication between the two OSS projects. Such similarities point to how OSS developers naturally gravitate towards the four aspects of architecture communication in OSS development.

Blackwell2019 Alan F. Blackwell, Marian Petre, and Luke Church: "Fifty years of the psychology of programming". International Journal of Human-Computer Studies, 131, 2019, 10.1016/j.ijhcs.2019.06.009.

Abstract This paper reflects on the evolution (past, present and future) of the 'psychology of programming' over the 50 year period of this anniversary issue. The International Journal of Human-Computer Studies (IJHCS) has been a key venue for much seminal work in this field, including its first foundations, and we review the changing research concerns seen in publications over these five decades. We relate this thematic evolution to research taking place over the same period within more specialist communities, especially the Psychology of Programming Interest Group (PPIG), the Empirical Studies of Programming series (ESP), and the ongoing community in Visual Languages and Human-Centric Computing (VL/HCC). Many other communities have interacted with psychology of programming, both influenced by research published within the specialist groups, and in turn influencing research priorities. We end with an overview of the core theories that have been developed over this period, as an introductory resource for new researchers, and also with the authors' own analysis of key priorities for future research.

Bogart2021 Chris Bogart, Christian Kästner, James Herbsleb, and Ferdian Thung: "When and How to Make Breaking Changes". ACM Transactions on Software Engineering and Methodology, 30(4), 2021, 10.1145/3447245.

Open source software projects often rely on package management systems that help projects discover, incorporate, and maintain dependencies on other packages, maintained by other people. Such systems save a great deal of effort over ad hoc ways of advertising, packaging, and transmitting useful libraries, but coordination among project teams is still needed when one package makes a breaking change affecting other packages. Ecosystems differ in their approaches to breaking changes, and there is no general theory to explain the relationships between features, behavioral norms, ecosystem outcomes, and motivating values. We address this through two empirical studies. In an interview case study, we contrast Eclipse, NPM, and CRAN, demonstrating that these different norms for coordination of breaking changes shift the costs of using and maintaining the software among stakeholders, appropriate to each ecosystem's mission. In a second study, we combine a survey, repository mining, and document analysis to broaden and systematize these observations across 18 ecosystems. We find that all ecosystems share values such as stability and compatibility, but differ in other values. Ecosystems' practices often support their espoused values, but in surprisingly diverse ways. The data provides counterevidence against easy generalizations about why ecosystem communities do what they do.

Brown2020 Chris Brown and Chris Parnin: "Understanding the impact of GitHub suggested changes on recommendations between developers". Proc. European Software Engineering Conference/International Symposium on the Foundations of Software Engineering (ESEC/FSE), 2020, 10.1145/3368089.3409722.

Recommendations between colleagues are effective for encouraging developers to adopt better practices. Research shows these peer interactions are useful for improving developer behaviors, or the adoption of activities to help software engineers complete programming tasks. However, in-person recommendations between developers in the workplace are declining. One form of online recommendations between developers are pull requests, which allow users to propose code changes and provide feedback on contributions. GitHub, a popular code hosting platform, recently introduced the suggested changes feature, which allows users to recommend improvements for pull requests. To better understand this feature and its impact on recommendations between developers, we report an empirical study of this system, measuring usage, effectiveness, and perception. Our results show that suggested changes support code review activities and significantly impact the timing and communication between developers on pull requests. This work provides insight into the suggested changes feature and implications for improving future systems for automated developer recommendations, such as providing situated, concise, and actionable feedback.

Butler2019 Simon Butler and 8 others: "On Company Contributions to Community Open Source Software Projects". IEEE Transactions on Software Engineering, 2019, 10.1109/tse.2019.2919305.

The majority of contributions to community open source software (OSS) projects are made by practitioners acting on behalf of companies and other organisations. Previous research has addressed the motivations of both individuals and companies to engage with OSS projects. However, limited research has been undertaken that examines and explains the practical mechanisms or work practices used by companies and their developers to pursue their commercial and technical objectives when engaging with OSS projects. This research investigates the variety of work practices used in public communication channels by company contributors to engage with and contribute to eight community OSS projects. Through interviews with contributors to the eight projects we draw on their experiences and insights to explore the motivations to use particular methods of contribution. We find that companies utilise work practices for contributing to community projects which are congruent with the circumstances and their capabilities that support their short- and long-term needs. We also find that companies contribute to community OSS projects in ways that may not always be apparent from public sources, such as employing core project developers, making donations, and joining project steering committees in order to advance strategic interests. The factors influencing contributor work practices can be complex and are often dynamic arising from considerations such as company and project structure, as well as technical concerns and commercial strategies. The business context in which software created by the OSS project is deployed is also found to influence contributor work practices.


Cabral2007 Bruno Cabral and Paulo Marques: "Exception Handling: A Field Study in Java and .NET". Proc. European Conference on Object-Oriented Programming (ECOOP), 2007, 10.1007/978-3-540-73589-2_8.

Most modern programming languages rely on exceptions for dealing with abnormal situations. Although exception handling was a significant improvement over other mechanisms like checking return codes, it is far from perfect. In fact, it can be argued that this mechanism is seriously limited, if not, flawed. This paper aims to contribute to the discussion by providing quantitative measures on how programmers are currently using exception handling. We examined 32 different applications, both for Java and .NET. The major conclusion for this work is that exceptions are not being correctly used as an error recovery mechanism. Exception handlers are not specialized enough for allowing recovery and, typically, programmers just do one of the following actions: logging, user notification and application termination. To our knowledge, this is the most comprehensive study done on exception handling to date, providing a quantitative measure useful for guiding the development of new error handling mechanisms.

Cates2021 Roee Cates, Nadav Yunik, and Dror G. Feitelson: "Does Code Structure Affect Comprehension? On Using and Naming Intermediate Variables". Proc. International Conference on Program Comprehension (ICPC), 2021, 10.1109/icpc52881.2021.00020.

Intermediate variables can be used to break complex expressions into more manageable smaller expressions, which may be easier to understand. But it is unclear when and whether this actually helps. We conducted an experiment in which subjects read 6 mathematical functions and were supposed to give them meaningful names. 113 subjects participated, of which 58% had 3 or more years of programming work experience. Each function had 3 versions: using a compound expression, using intermediate variables with meaningless names, or using intermediate variables with meaningful names. The results were that in only one case there was a significant difference between the two extreme versions, in favor of the one with intermediate variables with meaningful names. This case was the function that was the hardest to understand to begin with. In two additional cases using intermediate variables with meaningless names appears to have caused a slight decrease in understanding. In all other cases the code structure did not make much of a difference. As it is hard to anticipate what others will find difficult to understand, the conclusion is that using intermediate variables is generally desirable. However, this recommendation hinges on giving them good names.

Catolino2019 Gemma Catolino, Fabio Palomba, Damian A. Tamburri, Alexander Serebrenik, and Filomena Ferrucci: "Gender Diversity and Women in Software Teams: How Do They Affect Community Smells?". Proc. International Conference on Software Engineering (ICSE), 2019, 10.1109/icse-seis.2019.00010.

As social as software engineers are, there is a known and established gender imbalance in our community structures, regardless of their open-or closed-source nature. To shed light on the actual benefits of achieving such balance, this empirical study looks into the relations between such balance and the occurrence of community smells, that is, sub-optimal circumstances and patterns across the software organizational structure. Examples of community smells are Organizational Silo effects (overly disconnected sub-groups) or Lone Wolves (defiant community members). Results indicate that the presence of women generally reduces the amount of community smells. We conclude that women are instrumental to reducing community smells in software development teams.

Chatterjee2021 Preetha Chatterjee, Kostadin Damevski, Nicholas A. Kraft, and Lori Pollock: "Automatically Identifying the Quality of Developer Chats for Post Hoc Use". ACM Transactions on Software Engineering and Methodology, 30(4), 2021, 10.1145/3450503.

Software engineers are crowdsourcing answers to their everyday challenges on Q&A forums (e.g., Stack Overflow) and more recently in public chat communities such as Slack, IRC, and Gitter. Many software-related chat conversations contain valuable expert knowledge that is useful for both mining to improve programming support tools and for readers who did not participate in the original chat conversations. However, most chat platforms and communities do not contain built-in quality indicators (e.g., accepted answers, vote counts). Therefore, it is difficult to identify conversations that contain useful information for mining or reading, i.e., conversations of post hoc quality. In this article, we investigate automatically detecting developer conversations of post hoc quality from public chat channels. We first describe an analysis of 400 developer conversations that indicate potential characteristics of post hoc quality, followed by a machine learning-based approach for automatically identifying conversations of post hoc quality. Our evaluation of 2,000 annotated Slack conversations in four programming communities (python, clojure, elm, and racket) indicates that our approach can achieve precision of 0.82, recall of 0.90, F-measure of 0.86, and MCC of 0.57. To our knowledge, this is the first automated technique for detecting developer conversations of post hoc quality.

Chattopadhyay2020 Souti Chattopadhyay, Nicholas Nelson, Audrey Au, Natalia Morales, Christopher Sanchez, Rahul Pandita, and Anita Sarma: "A tale from the trenches: cognitive biases and software development". Proc. International Conference on Software Engineering (ICSE), 2020, 10.1145/3377811.3380330.

Cognitive biases are hard-wired behaviors that influence developer actions and can set them on an incorrect course of action, necessitating backtracking. While researchers have found that cognitive biases occur in development tasks in controlled lab studies, we still don't know how these biases affect developers' everyday behavior. Without such an understanding, development tools and practices remain inadequate. To close this gap, we conducted a 2-part field study to examine the extent to which cognitive biases occur, the consequences of these biases on developer behavior, and the practices and tools that developers use to deal with these biases. About 70% of observed actions that were reversed were associated with at least one cognitive bias. Further, even though developers recognized that biases frequently occur, they routinely are forced to deal with such issues with ad hoc processes and sub-optimal tool support. As one participant (IP12) lamented: There is no salvation!

Cogo2021 Filipe R. Cogo, Gustavo A. Oliva, Cor-Paul Bezemer, and Ahmed E. Hassan: "An empirical study of same-day releases of popular packages in the npm ecosystem". Empirical Software Engineering, 26(5), 2021, 10.1007/s10664-021-09980-6.

Within a software ecosystem, client packages can reuse provider packages as third-party libraries. The reuse relation between client and provider packages is called a dependency. When a client package depends on the code of a provider package, every change that is introduced in a release of the provider has the potential to impact the client package. Since a large number of dependencies exist within a software ecosystem, releases of a popular provider package can impact a large number of clients. Occasionally, multiple releases of a popular package need to be published on the same day, leading to a scenario in which the time available to revise, test, build, and document the release is restricted compared to releases published within a regular schedule. In this paper, our objective is to study the same-day releases that are published by popular packages in the npm ecosystem. We design an exploratory study to characterize the type of changes that are introduced in same-day releases, the prevalence of same-day releases in the npm ecosystem, and the adoption of same-day releases by client packages. A preliminary manual analysis of the existing release notes suggests that same-day releases introduce non-trivial changes (e.g., bug fixes). We then focus on three RQs. First, we study how often same-day releases are published. We found that the median proportion of regularly scheduled releases that are interrupted by a same-day release (per popular package) is 22%, suggesting the importance of having timely and systematic procedures to cope with same-day releases. Second, we study the performed code changes in same-day releases. We observe that 32% of the same-day releases have larger changes compared with their prior release, thus showing that some same-day releases can undergo significant maintenance activity despite their time-constrained nature. In our third RQ, we study how client packages react to same-day releases of their providers. We observe the vast majority of client packages that adopt the release preceding the same-day release would also adopt the latter without having to change their versioning statement (implicit updates). We also note that explicit adoptions of same-day releases (i.e., adoptions that require a change to the versioning statement of the provider in question) is significantly faster than the explicit adoption of regular releases. Based on our findings, we argue that (i) third-party tools that support the automation of dependency management (e.g., Dependabot) should consider explicitly flagging same-day releases, (ii) popular packages should strive for optimized release pipelines that can properly handle same-day releases, and (iii) future research should design scalable, ecosystem-ready tools that support provider packages in assessing the impact of their code changes on client packages.


Danilova2021 Anastasia Danilova, Alena Naiakshina, Stefan Horstmann, and Matthew Smith: "Do you Really Code? Designing and Evaluating Screening Questions for Online Surveys with Programmers". Proc. International Conference on Software Engineering (ICSE), 2021, 10.1109/icse43902.2021.00057.

Recruiting professional programmers in sufficient numbers for research studies can be challenging because they often cannot spare the time, or due to their geographical distribution and potentially the cost involved. Online platforms such as Clickworker or Qualtrics do provide options to recruit participants with programming skill; however, misunderstandings and fraud can be an issue. This can result in participants without programming skill taking part in studies and surveys. If these participants are not detected, they can cause detrimental noise in the survey data. In this paper, we develop screener questions that are easy and quick to answer for people with programming skill but difficult to answer correctly for those without. In order to evaluate our questionnaire for efficacy and efficiency, we recruited several batches of participants with and without programming skill and tested the questions. In our batch 42% of Clickworkers stating that they have programming skill did not meet our criteria and we would recommend filtering these from studies. We also evaluated the questions in an adversarial setting. We conclude with a set of recommended questions which researchers can use to recruit participants with programming skill from online platforms.

Decan2021 Alexandre Decan and Tom Mens: "What Do Package Dependencies Tell Us About Semantic Versioning?". IEEE Transactions on Software Engineering, 47(6), 2021, 10.1109/tse.2019.2918315.

The semantic versioning (semver) policy is commonly accepted by open source package management systems to inform whether new releases of software packages introduce possibly backward incompatible changes. Maintainers depending on such packages can use this information to avoid or reduce the risk of breaking changes in their own packages by specifying version constraints on their dependencies. Depending on the amount of control a package maintainer desires to have over her package dependencies, these constraints can range from very permissive to very restrictive. This article empirically compares semver compliance of four software packaging ecosystems (Cargo, npm, Packagist and Rubygems), and studies how this compliance evolves over time. We explore to what extent ecosystem-specific characteristics or policies influence the degree of compliance. We also propose an evaluation based on the ``wisdom of the crowds'' principle to help package maintainers decide which type of version constraints they should impose on their dependencies.

DeOliveiraNeto2019 Francisco Gomes de Oliveira Neto, Richard Torkar, Robert Feldt, Lucas Gren, Carlo A. Furia, and Ziwei Huang: "Evolution of statistical analysis in empirical software engineering research: Current state and steps forward". Journal of Systems and Software, 156, 2019, 10.1016/j.jss.2019.07.002.

Software engineering research is evolving and papers are increasingly based on empirical data from a multitude of sources, using statistical tests to determine if and to what degree empirical evidence supports their hypotheses. To investigate the practices and trends of statistical analysis in empirical software engineering (ESE), this paper presents a review of a large pool of papers from top-ranked software engineering journals. First, we manually reviewed 161 papers and in the second phase of our method, we conducted a more extensive semi-automatic classification of papers spanning the years 2001--2015 and 5,196 papers. Results from both review steps was used to: i) identify and analyze the predominant practices in ESE (e.g., using t-test or ANOVA), as well as relevant trends in usage of specific statistical methods (e.g., nonparametric tests and effect size measures) and, ii) develop a conceptual model for a statistical analysis workflow with suggestions on how to apply different statistical methods as well as guidelines to avoid pitfalls. Lastly, we confirm existing claims that current ESE practices lack a standard to report practical significance of results. We illustrate how practical significance can be discussed in terms of both the statistical analysis and in the practitioner's context.

Devanbu2016 Prem Devanbu, Thomas Zimmermann, and Christian Bird: "Belief & evidence in empirical software engineering". Proc. International Conference on Software Engineering (ICSE), 2016, 10.1145/2884781.2884812.

Empirical software engineering has produced a steady stream of evidence-based results concerning the factors that affect important outcomes such as cost, quality, and interval. However, programmers often also have strongly-held a priori opinions about these issues. These opinions are important, since developers are highly trained professionals whose beliefs would doubtless affect their practice. As in evidence-based medicine, disseminating empirical findings to developers is a key step in ensuring that the findings impact practice. In this paper, we describe a case study, on the prior beliefs of developers at Microsoft, and the relationship of these beliefs to actual empirical data on the projects in which these developers work. Our findings are that a) programmers do indeed have very strong beliefs on certain topics b) their beliefs are primarily formed based on personal experience, rather than on findings in empirical research and c) beliefs can vary with each project, but do not necessarily correspond with actual evidence in that project. Our findings suggest that more effort should be taken to disseminate empirical findings to developers and that more in-depth study the interplay of belief and evidence in software practice is needed.

Dias2021 Edson Dias, Paulo Meirelles, Fernando Castor, Igor Steinmacher, Igor Wiese, and Gustavo Pinto: "What Makes a Great Maintainer of Open Source Projects?". Proc. International Conference on Software Engineering (ICSE), 2021, 10.1109/icse43902.2021.00093.

Although Open Source Software (OSS) maintainers devote a significant proportion of their work to coding tasks, great maintainers must excel in many other activities beyond coding. Maintainers should care about fostering a community, helping new members to find their place, while also saying ``no'' to patches that although are well-coded and well-tested, do not contribute to the goal of the project. To perform all these activities masterfully, maintainers should exercise attributes that software engineers (working on closed source projects) do not always need to master. This paper aims to uncover, relate, and prioritize the unique attributes that great OSS maintainers might have. To achieve this goal, we conducted 33 semi-structured interviews with well-experienced maintainers that are the gatekeepers of notable projects such as the Linux Kernel, the Debian operating system, and the GitLab coding platform. After we analyzed the interviews and curated a list of attributes, we created a conceptual framework to explain how these attributes are connected. We then conducted a rating survey with 90 OSS contributors. We noted that ``technical excellence'' and ``communication'' are the most recurring attributes. When grouped, these attributes fit into four broad categories: management, social, technical, and personality. While we noted that ``sustain a long term vision of the project'' and being ``extremely careful'' seem to form the basis of our framework, we noted through our survey that the communication attribute was perceived as the most essential one.

Ding2021 Zhen Yu Ding and Claire Le Goues: "An Empirical Study of OSS-Fuzz Bugs". Proc. International Conference on Mining Software Repositories (MSR), 2021, 10.1109/msr52588.2021.00026.

Continuous fuzzing is an increasingly popular technique for automated quality and security assurance. Google maintains OSS-Fuzz: a continuous fuzzing service for open source software. We conduct the first empirical study of OSS-Fuzz, analyzing 23,907 bugs found in 316 projects. We examine the characteristics of fuzzer-found faults, the lifecycles of such faults, and the evolution of fuzzing campaigns over time. We find that OSS-Fuzz is often effective at quickly finding bugs, and developers are often quick to patch them. However, flaky bugs, timeouts, and out of memory errors are problematic, people rarely file CVEs for security vulnerabilities, and fuzzing campaigns often exhibit punctuated equilibria, where developers might be surprised by large spikes in bugs found. Our findings have implications on future fuzzing research and practice.

Dyba2006 Tore Dybå, Vigdis By Kampenes, and Dag I.K. Sjøberg: "A systematic review of statistical power in software engineering experiments". Information and Software Technology, 48(8), 2006, 10.1016/j.infsof.2005.08.009.

Statistical power is an inherent part of empirical studies that employ significance testing and is essential for the planning of studies, for the interpretation of study results, and for the validity of study conclusions. This paper reports a quantitative assessment of the statistical power of empirical software engineering research based on the 103 papers on controlled experiments (of a total of 5,453 papers) published in nine major software engineering journals and three conference proceedings in the decade 1993--2002. The results show that the statistical power of software engineering experiments falls substantially below accepted norms as well as the levels found in the related discipline of information systems research. Given this study's findings, additional attention must be directed to the adequacy of sample sizes and research designs to ensure acceptable levels of statistical power. Furthermore, the current reporting of significance tests should be enhanced by also reporting effect sizes and confidence intervals.



Fagerholm2017 Fabian Fagerholm, Marco Kuhrmann, and Jürgen Münch: "Guidelines for using empirical studies in software engineering education". PeerJ Computer Science, 3, 2017, 10.7717/peerj-cs.131.

Software engineering education is supposed to provide students with industry-relevant knowledge and skills. Educators must address issues beyond exercises and theories that can be directly rehearsed in small settings. A way to experience such efects and to increase the relevance of software engineering education is to apply empirical studies in teaching. In our article, we show how diferent types of empirical studies can be used for educational purposes in software engineering. We give examples illustrating how to utilize empirical studies, discuss challenges, and derive an initial guideline that supports teachers to include empirical studies in software engineering courses. This summary refers to the paper Guidelines for Using Empirical Studies in Software Engineering Education [FKM17]. This paper was published in the PeerJ Computer Science journal.

Farzat2021 Fabio de A. Farzat, Marcio de O. Barros, and Guilherme H. Travassos: "Evolving JavaScript Code to Reduce Load Time". IEEE Transactions on Software Engineering, 47(8), 2021, 10.1109/tse.2019.2928293.

JavaScript is one of the most used programming languages for front-end development of Web applications. The increase in complexity of front-end features brings concerns about performance, especially the load and execution time of JavaScript code. In this paper, we propose an evolutionary program improvement technique to reduce the size of JavaScript programs and, therefore, the time required to load and execute them in Web applications. To guide the development of this technique, we performed an experimental study to characterize the patches applied to JavaScript programs to reduce their size while keeping the functionality required to pass all test cases in their test suites. We applied this technique to 19 JavaScript programs varying from 92 to 15,602 LOC and observed reductions from 0.2 to 73.8 percent of the original code, as well as a relationship between the quality of a program's test suite and the ability to reduce the size of its source code.

Feal2020 'Alvaro Feal, Paolo Calciati, Narseo Vallina-Rodriguez, Carmela Troncoso, and Alessandra Gorla: "Angel or Devil? A Privacy Study of Mobile Parental Control Apps". Proceedings on Privacy Enhancing Technologies, 2020(2), 2020, 10.2478/popets-2020-0029.

Android parental control applications are used by parents to monitor and limit their children's mobile behaviour (e.g., mobile apps usage, web browsing, calling, and texting). In order to offer this service, parental control apps require privileged access to sys-tem resources and access to sensitive data. This may significantly reduce the dangers associated with kids' online activities, but it raises important privacy con-cerns. These concerns have so far been overlooked by organizations providing recommendations regarding the use of parental control applications to the public. We conduct the first in-depth study of the Android parental control app's ecosystem from a privacy and regulatory point of view. We exhaustively study 46 apps from 43 developers which have a combined 20M installs in the Google Play Store. Using a combination of static and dynamic analysis we find that: these apps are on average more permissions-hungry than the top 150 apps in the Google Play Store, and tend to request more dangerous permissions with new releases; 11% of the apps transmit personal data in the clear; 34% of the apps gather and send personal information without appropriate consent; and 72% of the apps share data with third parties (including online advertising and analytics services) without mentioning their presence in their privacy policies. In summary, parental control applications lack transparency and lack compliance with reg ulatory requirements. This holds even for those applications recommended by European and other national security centers.

Ferreira2021 Gabriel Ferreira, Limin Jia, Joshua Sunshine, and Christian Kästner: "Containing Malicious Package Updates in npm with a Lightweight Permission System". Proc. International Conference on Software Engineering (ICSE), 2021, 10.1109/ICSE43902.2021.00121.

The large amount of third-party packages available in fast-moving software ecosystems, such as Node.js/npm, enables attackers to compromise applications by pushing malicious updates to their package dependencies. Studying the npm repository, we observed that many packages in the npm repository that are used in Node.js applications perform only simple computations and do not need access to filesystem or network APIs. This offers the opportunity to enforce least-privilege design per package, protecting applications and package dependencies from malicious updates. We propose a lightweight permission system that protects Node.js applications by enforcing package permissions at runtime. We discuss the design space of solutions and show that our system makes a large number of packages much harder to be exploited, almost for free.

Flint2021 Samuel W. Flint, Jigyasa Chauhan, and Robert Dyer: "Escaping the Time Pit: Pitfalls and Guidelines for Using Time-Based Git Data". Proc. International Conference on Mining Software Repositories (MSR), 2021, 10.1109/msr52588.2021.00022.

Many software engineering research papers rely on time-based data (e.g., commit timestamps, issue report creation/update/close dates, release dates). Like most real-world data however, time-based data is often dirty. To date, there are no studies that quantify how frequently such data is used by the software engineering research community, or investigate sources of and quantify how often such data is dirty. Depending on the research task and method used, including such dirty data could affect the research results. This paper presents the first survey of papers that utilize time-based data, published in the Mining Software Repositories (MSR) conference series. Out of the 690 technical track and data papers published in MSR 2004--2020, we saw at least 35% of papers utilized time-based data. We then used the Boa and Software Heritage infrastructures to help identify and quantify several sources of dirty commit timestamp data. Finally we provide guidelines/best practices for researchers utilizing time-based data from Git repositories.

Foundjem2021 Armstrong Foundjem and Bram Adams: "Release synchronization in software ecosystems". Empirical Software Engineering, 26(3), 2021, 10.1007/s10664-020-09929-1.

Software ecosystems bring value by integrating software projects related to a given domain, such as Linux distributions integrating upstream open-source projects or the Android ecosystem for mobile Apps. Since each project within an ecosystem may potentially have its release cycle and roadmap, this creates an enormous burden for users who must expend the effort to identify and install compatible project releases from the ecosystem manually. Thus, many ecosystems, such as the Linux distributions, take it upon them to release a polished, well-integrated product to the end-user. However, the body of knowledge lacks empirical evidence about the coordination and synchronization efforts needed at the ecosystem level to ensure such federated releases. This paper empirically studies the strategies used to synchronize releases of ecosystem projects in the context of the OpenStack ecosystem, in which a central release team manages the six-month release cycle of the overall OpenStack ecosystem product. We use qualitative analysis on the release team's IRC-meeting logs that comprise two OpenStack releases (one-year long). Thus, we identified, cataloged, and documented ten major release synchronization activities, which we further validated through interviews with eight active OpenStack senior practitioners (members of either the release team or project teams). Our results suggest that even though an ecosystem's power lies in the interaction of inter-dependent projects, release synchronization remains a challenge for both the release team and the project teams. Moreover, we found evidence (and reasons) of multiple release strategies co-existing within a complex ecosystem.

Fritz2010 Thomas Fritz, Jingwen Ou, Gail C. Murphy, and Emerson Murphy-Hill: "A degree-of-knowledge model to capture source code familiarity". Proc. International Conference on Software Engineering (ICSE), 2010, 10.1145/1806799.1806856.

The size and high rate of change of source code comprising a software system make it difficult for software developers to keep up with who on the team knows about particular parts of the code. Existing approaches to this problem are based solely on authorship of code. In this paper, we present data from two professional software development teams to show that both authorship and interaction information about how a developer interacts with the code are important in characterizing a developer's knowledge of code. We introduce the degree-of-knowledge model that computes automatically a real value for each source code element based on both authorship and interaction information. We show that the degree-of-knowledge model can provide better results than an existing expertise finding approach and also report on case studies of the use of the model to support knowledge transfer and to identify changes of interest.

Fritzsch2021 Jonas Fritzsch, Marvin Wyrich, Justus Bogner, and Stefan Wagner: "Résumé-Driven Development: A Definition and Empirical Characterization". Proc. International Conference on Software Engineering (ICSE), 2021, 10.1109/icse-seis52602.2021.00011.

Technologies play an important role in the hiring process for software professionals. Within this process, several studies revealed misconceptions and bad practices which lead to suboptimal recruitment experiences. In the same context, grey literature anecdotally coined the term Résumé-Driven Development (RDD), a phenomenon describing the overemphasis of trending technologies in both job offerings and resumes as an interaction between employers and applicants. While RDD has been sporadically mentioned in books and online discussions, there are so far no scientific studies on the topic, despite its potential negative consequences. We therefore empirically investigated this phenomenon by surveying 591 software professionals in both hiring (130) and technical (558) roles and identified RDD facets in substantial parts of our sample: 60% of our hiring professionals agreed that trends influence their job offerings, while 82% of our software professionals believed that using trending technologies in their daily work makes them more attractive for prospective employers. Grounded in the survey results, we conceptualize a theory to frame and explain Résumé-Driven Development. Finally, we discuss influencing factors and consequences and propose a definition of the term. Our contribution provides a foundation for future research and raises awareness for a potentially systemic trend that may broadly affect the software industry.

Fu2016 Wei Fu, Tim Menzies, and Xipeng Shen: "Tuning for software analytics: Is it really necessary?". Information and Software Technology, 76, 2016, 10.1016/j.infsof.2016.04.017.

Context: Data miners have been widely used in software engineering to, say, generate defect predictors from static code measures. Such static code defect predictors perform well compared to manual methods, and they are easy to use and useful to use. But one of the ``black arts'' of data mining is setting the tunings that control the miner.Objective: We seek simple, automatic, and very effective method for finding those tunings.Method: For each experiment with different data sets (from open source JAVA systems), we ran differential evolution as an optimizer to explore the tuning space (as a first step) then tested the tunings using hold-out data.Results: Contrary to our prior expectations, we found these tunings were remarkably simple: it only required tens, not thousands, of attempts to obtain very good results. For example, when learning software defect predictors, this method can quickly find tunings that alter detection precision from 0% to 60%. Conclusion: Since (1) the improvements are so large, and (2) the tuning is so simple, we need to change standard methods in software analytics. At least for defect prediction, it is no longer enough to just run a data miner and present the result without conducting a tuning optimization study. The implication for other kinds of analytics is now an open and pressing issue.

Furia2019 Carlo Alberto Furia, Robert Feldt, and Richard Torkar: "Bayesian Data Analysis in Empirical Software Engineering Research". IEEE Transactions on Software Engineering, 2019, 10.1109/tse.2019.2935974.

Statistics comes in two main flavors: frequentist and Bayesian. For historical and technical reasons, frequentist statistics have traditionally dominated empirical data analysis, and certainly remain prevalent in empirical software engineering. This situation is unfortunate because frequentist statistics suffer from a number of shortcomings—such as lack of flexibility and results that are unintuitive and hard to interpret—that curtail their effectiveness when dealing with the heterogeneous data that is increasingly available for empirical analysis of software engineering practice. In this paper, we pinpoint these shortcomings, and present Bayesian data analysis techniques that work better on the same data—as they can provide clearer results that are simultaneously robust and nuanced. After a short, high-level introduction to the basic tools of Bayesian statistics, our presentation targets the reanalysis of two empirical studies targeting data about the effectiveness of automatically generated tests and the performance of programming languages. By contrasting the original frequentist analysis to our new Bayesian analysis, we demonstrate concrete advantages of using Bayesian techniques, and we advocate a prominent role for them in empirical software engineering research and practice.


Garcia2021 Boni García, Mario Munoz-Organero, Carlos Alario-Hoyos, and Carlos Delgado Kloos: "Automated driver management for Selenium WebDriver". Empirical Software Engineering, 26(5), 2021, 10.1007/s10664-021-09975-3.

Selenium WebDriver is a framework used to control web browsers automatically. It provides a cross-browser Application Programming Interface (API) for different languages (e.g., Java, Python, or JavaScript) that allows automatic navigation, user impersonation, and verification of web applications. Internally, Selenium WebDriver makes use of the native automation support of each browser. Hence, a platform-dependent binary file (the so-called driver) must be placed between the Selenium WebDriver script and the browser to support this native communication. The management (i.e., download, setup, and maintenance) of these drivers is cumbersome for practitioners. This paper provides a complete methodology to automate this management process. Particularly, we present WebDriverManager, the reference tool implementing this methodology. WebDriverManager provides different execution methods: as a Java dependency, as a Command-Line Interface (CLI) tool, as a server, as a Docker container, and as a Java agent. To provide empirical validation of the proposed approach, we surveyed the WebDriverManager users. The aim of this study is twofold. First, we assessed the extent to which WebDriverManager is adopted and used. Second, we evaluated the WebDriverManager API following Clarke's usability dimensions. A total of 148 participants worldwide completed this survey in 2020. The results show a remarkable assessment of the automation capabilities and API usability of WebDriverManager by Java users, but a scarce adoption for other languages.

Gerosa2021 Marco Gerosa, Igor Wiese, Bianca Trinkenreich, Georg Link, Gregorio Robles, Christoph Treude, Igor Steinmacher, and Anita Sarma: "The Shifting Sands of Motivation: Revisiting What Drives Contributors in Open Source". Proc. International Conference on Software Engineering (ICSE), 2021, 10.1109/icse43902.2021.00098.

Open Source Software (OSS) has changed drastically over the last decade, with OSS projects now producing a large ecosystem of popular products, involving industry participation, and providing professional career opportunities. But our field's understanding of what motivates people to contribute to OSS is still fundamentally grounded in studies from the early 2000s. With the changed landscape of OSS, it is very likely that motivations to join OSS have also evolved. Through a survey of 242 OSS contributors, we investigate shifts in motivation from three perspectives: (1) the impact of the new OSS landscape, (2) the impact of individuals' personal growth as they become part of OSS communities, and (3) the impact of differences in individuals' demographics. Our results show that some motivations related to social aspects and reputation increased in frequency and that some intrinsic and internalized motivations, such as learning and intellectual stimulation, are still highly relevant. We also found that contributing to OSS often transforms extrinsic motivations to intrinsic, and that while experienced contributors often shift toward altruism, novices often shift toward career, fun, kinship, and learning. OSS projects can leverage our results to revisit current strategies to attract and retain contributors, and researchers and tool builders can better support the design of new studies and tools to engage and support OSS development.

Glanz2020 Leonid Glanz, Patrick Müller, Lars Baumgärtner, Michael Reif, Sven Amann, Pauline Anthonysamy, and Mira Mezini: "Hidden in Plain Sight: Obfuscated Strings Threatening Your Privacy". Proc. Asia Conference on Computer and Communications Security (ACCCS), 2020, 10.1145/3320269.3384745.

String obfuscation is an established technique used by proprietary, closed-source applications to protect intellectual property. Furthermore, it is also frequently used to hide spyware or malware in applications. In both cases, the techniques range from bit-manipulation over XOR operations to AES encryption. However, string obfuscation techniques/tools suffer from one shared weakness: They generally have to embed the necessary logic to deobfuscate strings into the app code. In this paper, we show that most of the string obfuscation techniques found in malicious and benign applications for Android can easily be broken in an automated fashion. We developed StringHound, an open-source tool that uses novel techniques that identify obfuscated strings and reconstruct the originals using slicing. We evaluated StringHound on both benign and malicious Android apps. In summary, we deobfuscate almost 30 times more obfuscated strings than other string deobfuscation tools. Additionally, we analyzed 100,000 Google Play Store apps and found multiple obfuscated strings that hide vulnerable cryptographic usages, insecure internet accesses, API keys, hard-coded passwords, and exploitation of privileges without the awareness of the developer. Furthermore, our analysis reveals that not only malware uses string obfuscation but also benign apps make extensive use of string obfuscation.

Golubev2021 Yaroslav Golubev, Zarina Kurbatova, Eman Abdullah AlOmar, Timofey Bryksin, and Mohamed Wiem Mkaouer: "One thousand and one stories: a large-scale survey of software refactoring". Proc. European Software Engineering Conference/International Symposium on the Foundations of Software Engineering (ESEC/FSE), 2021, 10.1145/3468264.3473924.

Despite the availability of refactoring as a feature in popular IDEs, recent studies revealed that developers are reluctant to use them, and still prefer the manual refactoring of their code. At JetBrains, our goal is to fully support refactoring features in IntelliJ-based IDEs and improve their adoption in practice. Therefore, we start by raising the following main questions. How exactly do people refactor code? What refactorings are the most popular? Why do some developers tend not to use convenient IDE refactoring tools? In this paper, we investigate the raised questions through the design and implementation of a survey targeting 1,183 users of IntelliJ-based IDEs. Our quantitative and qualitative analysis of the survey results shows that almost two-thirds of developers spend more than one hour in a single session refactoring their code; that refactoring types vary greatly in popularity; and that a lot of developers would like to know more about IDE refactoring features but lack the means to do so. These results serve us internally to support the next generation of refactoring features, as well as can help our research community to establish new directions in the refactoring usability research.

Gujral2021 Harshit Gujral, Sangeeta Lal, and Heng Li: "An exploratory semantic analysis of logging questions". Journal of Software: Evolution and Process, 33(7), 2021, 10.1002/smr.2361.

Logging is an integral part of software development. Software practitioners often face issues in software logging, and they post these issues on Q&A websites to take suggestions from the experts. In this study, we perform a three-level empirical analysis of logging questions posted on six popular technical Q&A websites, namely, Stack Overflow (SO), Serverfault (SF), Superuser (SU), Database Administrators (DB), Software Engineering (SE), and Android Enthusiasts (AE). The findings show that logging issues are prevalent across various domains, for example, database, networks, and mobile computing, and software practitioners from different domains face different logging issues. The semantic analysis of logging questions using Latent Dirichlet Allocation (LDA) reveals trends of several existing and new logging topics, such as logging conversion pattern, Android device logging, and database logging. In addition, we observe specific logging topics for each website: DB (log shipping and log file growing/shrinking), SU (event log and syslog configuration), SF (log analysis and syslog configuration), AE (app install and usage tracking), SE (client server logging and exception logging), and SO (log file creation/deletion, Android emulator logging, and logger class of Log4j). We obtain an increasing trend of logging topics on the SO, SU, and DB websites whereas a decreasing trend of logging topics on the SF website.


Hazoom2021 Moshe Hazoom, Vibhor Malik, and Ben Bogin: "Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data". Proc. Workshop on Natural Language Processing for Programming (NLP4Prog), 2021, 10.18653/v1/2021.nlp4prog-1.9.

Most available semantic parsing datasets, comprising of pairs of natural utterances and logical forms, were collected solely for the purpose of training and evaluation of natural language understanding systems. As a result, they do not contain any of the richness and variety of natural-occurring utterances, where humans ask about data they need or are curious about. In this work, we release SEDE, a dataset with 12,023 pairs of utterances and SQL queries collected from real usage on the Stack Exchange website. We show that these pairs contain a variety of real-world challenges which were rarely reflected so far in any other semantic parsing dataset, propose an evaluation metric based on comparison of partial query clauses that is more suitable for real-world queries, and conduct experiments with strong baselines, showing a large gap between the performance on SEDE compared to other common datasets.

Hoda2021 Rashina Hoda: "Socio-Technical Grounded Theory for Software Engineering". IEEE Transactions on Software Engineering, 2021, 10.1109/tse.2021.3106280.

Grounded Theory (GT), a sociological research method designed to study social phenomena, is increasingly being used to investigate the human and social aspects of software engineering (SE). However, being written by and for sociologists, GT is often challenging for a majority of SE researchers to understand and apply. Additionally, SE researchers attempting ad hoc adaptations of traditional GT guidelines for modern socio-technical (ST) contexts often struggle in the absence of clear and relevant guidelines to do so, resulting in poor quality studies. To overcome these research community challenges and leverage modern research opportunities, this paper presents Socio-Technical Grounded Theory (STGT) designed to ease application and achieve quality outcomes. It defines what exactly is meant by an ST research context and presents the STGT guidelines that expand GT's philosophical foundations, provide increased clarity and flexibility in its methodological steps and procedures, define possible scope and contexts of application, encourage frequent reporting of a variety of interim, preliminary, and mature outcomes, and introduce nuanced evaluation guidelines for different outcomes. It is hoped that the SE research community and related ST disciplines such as computer science, data science, artificial intelligence, information systems, human computer/robot/AI interaction, human-centered emerging technologies (and increasingly other disciplines being transformed by rapid digitalisation and AI-based augmentation), will benefit from applying STGT to conduct quality research studies and systematically produce rich findings and mature theories with confidence.

Holmes2020 Josie Holmes, Iftekhar Ahmed, Caius Brindescu, Rahul Gopinath, He Zhang, and Alex Groce: "Using Relative Lines of Code to Guide Automated Test Generation for Python". ACM Transactions on Software Engineering and Methodology, 29(4), 2020, 10.1145/3408896.

Raw lines of code (LOC) is a metric that does not, at first glance, seem extremely useful for automated test generation. It is both highly language-dependent and not extremely meaningful, semantically, within a language: one coder can produce the same effect with many fewer lines than another. However, relative LOC, between components of the same project, turns out to be a highly useful metric for automated testing. In this article, we make use of a heuristic based on LOC counts for tested functions to dramatically improve the effectiveness of automated test generation. This approach is particularly valuable in languages where collecting code coverage data to guide testing has a very high overhead. We apply the heuristic to property-based Python testing using the TSTL (Template Scripting Testing Language) tool. In our experiments, the simple LOC heuristic can improve branch and statement coverage by large margins (often more than 20%, up to 40% or more) and improve fault detection by an even larger margin (usually more than 75% and up to 400% or more). The LOC heuristic is also easy to combine with other approaches and is comparable to, and possibly more effective than, two well-established approaches for guiding random testing.

Hoyos2021 Juan Hoyos, Rabe Abdalkareem, Suhaib Mujahid, Emad Shihab, and Albeiro Espinosa Bedoya: "On the Removal of Feature Toggles: A Study of Python Projects and Practitioners Motivations". Empirical Software Engineering, 26(2), 2021, 10.1007/s10664-020-09902-y.

Feature Toggling is a technique to control the execution of features in a software project. For example, practitioners using feature toggles can experiment with new features in a production environment by exposing them to a subset of users. Some of these toggles require additional maintainability efforts and are expected to be removed, whereas others are meant to remain for a long time. However, to date, very little is known about the removal of feature toggles, which is why we focus on this topic in our paper. We conduct an empirical study that focuses on the removal of feature toggles. We use source code analysis techniques to analyze 12 Python open source projects and surveyed 61 software practitioners to provide deeper insights on the topic. Our study shows that 75% of the toggle components in the studied Python projects are removed within 49 weeks after introduction. However, eventually practitioners remove feature toggles to follow the life cycle of a feature when it becomes stable in production. We also find that not all long-term feature toggles are designed to live that long and not all feature toggles are removed from the source code, opening the possibilities to unwanted risks. Our study broadens the understanding of feature toggles by identifying reasons for their survival in practice and aims to help practitioners make better decisions regarding the way they manage and remove feature toggles.

Huang2020 Yu Huang, Kevin Leach, Zohreh Sharafi, Nicholas McKay, Tyler Santander, and Westley Weimer: "Biases and differences in code review using medical imaging and eye-tracking: genders, humans, and machines". Proc. European Software Engineering Conference/International Symposium on the Foundations of Software Engineering (ESEC/FSE), 2020, 10.1145/3368089.3409681.

Code review is a critical step in modern software quality assurance, yet it is vulnerable to human biases. Previous studies have clarified the extent of the problem, particularly regarding biases against the authors of code,but no consensus understanding has emerged. Advances in medical imaging are increasingly applied to software engineering, supporting grounded neurobiological explorations of computing activities, including the review, reading, and writing of source code. In this paper, we present the results of a controlled experiment using both medical imaging and also eye tracking to investigate the neurological correlates of biases and differences between genders of humans and machines (e.g., automated program repair tools) in code review. We find that men and women conduct code reviews differently, in ways that are measurable and supported by behavioral, eye-tracking and medical imaging data. We also find biases in how humans review code as a function of its apparent author, when controlling for code quality. In addition to advancing our fundamental understanding of how cognitive biases relate to the code review process, the results may inform subsequent training and tool design to reduce bias.

Huijgens2020 Hennie Huijgens, Ayushi Rastogi, Ernst Mulders, Georgios Gousios, and Arie van Deursen: "Questions for data scientists in software engineering: a replication". Proc. European Software Engineering Conference/International Symposium on the Foundations of Software Engineering (ESEC/FSE), 2020, 10.1145/3368089.3409717.

In 2014, a Microsoft study investigated the sort of questions that data science applied to software engineering should answer. This resulted in 145 questions that developers considered relevant for data scientists to answer, thus providing a research agenda to the community. Fast forward to five years, no further studies investigated whether the questions from the software engineers at Microsoft hold for other software companies, including software-intensive companies with different primary focus (to which we refer as software-defined enterprises). Furthermore, it is not evident that the problems identified five years ago are still applicable, given the technological advances in software engineering. This paper presents a study at ING, a software-defined enterprise in banking in which over 15,000 IT staff provides in-house software solutions. This paper presents a comprehensive guide of questions for data scientists selected from the previous study at Microsoft along with our current work at ING. We replicated the original Microsoft study at ING, looking for questions that impact both software companies and software-defined enterprises and continue to impact software engineering. We also add new questions that emerged from differences in the context of the two companies and the five years gap in between. Our results show that software engineering questions for data scientists in the software-defined enterprise are largely similar to the software company, albeit with exceptions. We hope that the software engineering research community builds on the new list of questions to create a useful body of knowledge.


Imam2021 Ahmed Imam and Tapajit Dey: "Tracking Hackathon Code Creation and Reuse". Proc. International Conference on Mining Software Repositories (MSR), 2021, 10.1109/msr52588.2021.00085.

Background: Hackathons have become popular events for teams to collaborate on projects and develop software prototypes. Most existing research focuses on activities during an event with limited attention to the evolution of the code brought to or created during a hackathon. Aim: We aim to understand the evolution of hackathon-related code, specifically, how much hackathon teams rely on pre-existing code or how much new code they develop during a hackathon. Moreover, we aim to understand if and where that code gets reused. Method: We collected information about 22,183 hackathon projects from Devpost—a hackathon database—and obtained related code (blobs), authors, and project characteristics from the World of Code. We investigated if code blobs in hackathon projects were created before, during, or after an event by identifying the original blob creation date and author, and also checked if the original author was a hackathon project member. We tracked code reuse by first identifying all commits containing blobs created during an event before determining all projects that contain those commits. Result: While only approximately 9.14% of the code blobs are created during hackathons, this amount is still significant considering time and member constraints of such events. Approximately a third of these code blobs get reused in other projects. Conclusion: Our study demonstrates to what extent pre-existing code is used and new code is created during a hackathon and how much of it is reused elsewhere afterwards. Our findings help to better understand code reuse as a phenomenon and the role of hackathons in this context and can serve as a starting point for further studies in this area.


Jalote2021 Pankaj Jalote and Damodaram Kamma: "Studying Task Processes for Improving Programmer Productivity". IEEE Transactions on Software Engineering, 47(4), 2021, 10.1109/tse.2019.2904230.

Productivity of a software development organization can be enhanced by improving the software process, using better tools/technology, and enhancing the productivity of programmers. This work focuses on improving programmer productivity by studying the process used by a programmer for executing an assigned task, which we call the task process. We propose a general framework for studying the impact of task processes on programmer productivity and also the impact of transferring task processes of high-productivity programmers to average-productivity peers. We applied the framework to a few live projects in Robert Bosch Engineering and Business Solutions Limited, a CMMI Level 5 company. In each project, we identified two groups of programmers: high-productivity and average-productivity programmers. We requested each programmer to video capture their computer screen while executing his/her assigned tasks. We then analyzed these task videos to extract the task processes and then used them to identify the differences between the task processes used by the two groups. Some key differences were found between the task processes, which could account for the difference in productivities of the two groups. Similarities between the task processes were also analyzed quantitatively by modeling each task process as a Markov chain. We found that programmers from the same group used similar task processes, but the task processes of the two groups differed considerably. The task processes of high-productivity programmers were transferred to the average-productivity programmers by training them on the key steps missing in their process but commonly present in the work of their high-productivity peers. A substantial productivity gain was found in the average-productivity programmers as a result of this transfer. The study shows that task processes of programmers impact their productivity, and it is possible to improve the productivity of average-productivity programmers by transferring task processes from high-productivity programmers to them.

Jin2021 Xianhao Jin and Francisco Servant: "What Helped, and what did not? An Evaluation of the Strategies to Improve Continuous Integration". Proc. International Conference on Software Engineering (ICSE), 2021, 10.1109/icse43902.2021.00031.

Continuous integration (CI) is a widely used practice in modern software engineering. Unfortunately, it is also an expensive practice - Google and Mozilla estimate their CI systems in millions of dollars. There are a number of techniques and tools designed to or having the potential to save the cost of CI or expand its benefit - reducing time to feedback. However, their benefits in some dimensions may also result in drawbacks in others. They may also be beneficial in other scenarios where they are not designed to help. In this paper, we perform the first exhaustive comparison of techniques to improve CI, evaluating 14 variants of 10 techniques using selection and prioritization strategies on build and test granularity. We evaluate their strengths and weaknesses with 10 different cost and time-tofeedback saving metrics on 100 real-world projects. We analyze the results of all techniques to understand the design decisions that helped different dimensions of benefit. We also synthesized those results to lay out a series of recommendations for the development of future research techniques to advance this area.

Johnson2019 John Johnson, Sergio Lubo, Nishitha Yedla, Jairo Aponte, and Bonita Sharif: "An Empirical Study Assessing Source Code Readability in Comprehension". Proc. International Conference on Software Maintenance and Evolution (ICSME), 2019, 10.1109/icsme.2019.00085.

Software developers spend a significant amount of time reading source code. If code is not written with readability in mind, it impacts the time required to maintain it. In order to alleviate the time taken to read and understand code, it is important to consider how readable the code is. The general consensus is that source code should be written to minimize the time it takes for others to read and understand it. In this paper, we conduct a controlled experiment to assess two code readability rules: nesting and looping. We test 32 Java methods in four categories: ones that follow/do not follow the readability rule and that are correct/incorrect. The study was conducted online with 275 participants. The results indicate that minimizing nesting decreases the time a developer spends reading and understanding source code, increases confidence about the developer's understanding of the code, and also suggests that it improves their ability to find bugs. The results also show that avoiding the do-while statement had no significant impact on level of understanding, time spent reading and understanding, confidence in understanding, or ease of finding bugs. It was also found that the better knowledge of English a participant had, the more their readability and comprehension confidence ratings were affected by the minimize nesting rule. We discuss the implications of these findings for code readability and comprehension.

Johnson2021 Brittany Johnson, Thomas Zimmermann, and Christian Bird: "The Effect of Work Environments on Productivity and Satisfaction of Software Engineers". IEEE Transactions on Software Engineering, 47(4), 2021, 10.1109/tse.2019.2903053.

The physical work environment of software engineers can have various effects on their satisfaction and the ability to get the work done. To better understand the factors of the environment that affect productivity and satisfaction of software engineers, we explored different work environments at Microsoft. We used a mixed-methods, multiple stage research design with a total of 1,159 participants: two surveys with 297 and 843 responses respectively and interviews with 19 employees. We found several factors that were considered as important for work environments: personalization, social norms and signals, room composition and atmosphere, work-related environment affordances, work area and furniture, and productivity strategies. We built statistical models for satisfaction with the work environment and perceived productivity of software engineers and compared them to models for employees in the Program Management, IT Operations, Marketing, and Business Program & Operations disciplines. In the satisfaction models, the ability to work privately with no interruptions and the ability to communicate with the team and leads were important factors among all disciplines. In the productivity models, the overall satisfaction with the work environment and the ability to work privately with no interruptions were important factors among all disciplines. For software engineers, another important factor for perceived productivity was the ability to communicate with the team and leads. We found that private offices were linked to higher perceived productivity across all disciplines.

Jolak2020 Rodi Jolak and 9 others: "Software engineering whispers: The effect of textual vs. graphical software design descriptions on software design communication". Empirical Software Engineering, 25(6), 2020, 10.1007/s10664-020-09835-6.

Software engineering is a social and collaborative activity. Communicating and sharing knowledge between software developers requires much effort. Hence, the quality of communication plays an important role in influencing project success. To better understand the effect of communication on project success, more in-depth empirical studies investigating this phenomenon are needed. We investigate the effect of using a graphical versus textual design description on co-located software design communication. Therefore, we conducted a family of experiments involving a mix of 240 software engineering students from four universities. We examined how different design representations (i.e., graphical vs. textual) affect the ability to Explain, Understand, Recall, and Actively Communicate knowledge. We found that the graphical design description is better than the textual in promoting Active Discussion between developers and improving the Recall of design details. Furthermore, compared to its unaltered version, a well-organized and motivated textual design description—that is used for the same amount of time—enhances the recall of design details and increases the amount of active discussions at the cost of reducing the perceived quality of explaining.

Jones2020 Derek M. Jones: Evidence-based Software Engineering: based on the publicly available data. Knowledge Software, Ltd., 2020, 9781838291303.

This book discusses what is currently known about software engineering, based on an analysis of all the publicly available data. This aim is not as ambitious as it sounds, because there is not a great deal of data publicly available. The intent is to provide material that is useful to professional developers working in industry; until recently researchers in software engineering have been more interested in vanity work, promoted by ego and bluster. The material is organized in two parts, the first covering software engineering and the second the statistics likely to be needed for the analysis of software engineering data.


Kamienski2021 Arthur V. Kamienski, Luisa Palechor, Cor-Paul Bezemer, and Abram Hindle: "PySStuBs: Characterizing Single-Statement Bugs in Popular Open-Source Python Projects". Proc. International Conference on Mining Software Repositories (MSR), 2021, 10.1109/msr52588.2021.00066.

Single-statement bugs (SStuBs) can have a severe impact on developer productivity. Despite usually being simple and not offering much of a challenge to fix, these bugs may still disturb a developer's workflow and waste precious development time. However, few studies have paid attention to these simple bugs, focusing instead on bugs of any size and complexity. In this study, we explore the occurrence of SStuBs in some of the most popular open-source Python projects on GitHub, while also characterizing their patterns and distribution. We further compare these bugs to SStuBs found in a previous study on Java Maven projects. We find that these Python projects have different SStuB patterns than the ones in Java Maven projects and identify 7 new SStuB patterns. Our results may help uncover the importance of understanding these bugs for the Python programming language, and how developers can handle them more effectively.

Kim2021 Dong Jae Kim, Tse-Hsun Chen, and Jinqiu Yang: "The secret life of test smells—an empirical study on test smell evolution and maintenance". Empirical Software Engineering, 26(5), 2021, 10.1007/s10664-021-09969-1.

In recent years, researchers and practitioners have been studying the impact of test smells in test maintenance. However, there is still limited empirical evidence on why developers remove test smells in software maintenance and the mechanism employed for addressing test smells. In this paper, we conduct an empirical study on 12 real-world open-source systems to study the evolution and maintenance of test smells and how test smells are related to software quality. Results show that: 1) Although the number of test smell instances increases, test smell density decreases as systems evolve. 2) However, our qualitative analysis on those removed test smells reveals that most test smell removal (83%) is a by-product of feature maintenance activities. 45% of the removed test smells relocate to other test cases due to refactoring, while developers deliberately address the only 17% of test smells, consisting of largely Exception Catch/Throw and Sleepy Test. 3) Our statistical model shows that test smell metrics can provide additional explanatory power on post-release defects over traditional baseline metrics (an average of 8.25% increase in AUC). However, most types of test smells have a minimal effect on post-release defects. Our study provides insight into developers' perception of test smells and current practices. Future studies on test smells may consider focusing on the specific types of test smells that may have a higher correlation with defect-proneness when helping developers with test code maintenance.

Klotins2021 Eriks Klotins, Michael Unterkalmsteiner, Panagiota Chatzipetrou, Tony Gorschek, Rafael Prikladnicki, Nirnaya Tripathi, and Leandro Bento Pompermaier: "A Progression Model of Software Engineering Goals, Challenges, and Practices in Start-Ups". IEEE Transactions on Software Engineering, 47(3), 2021, 10.1109/tse.2019.2900213.

Context: Software start-ups are emerging as suppliers of innovation and software-intensive products. However, traditional software engineering practices are not evaluated in the context, nor adopted to goals and challenges of start-ups. As a result, there is insufficient support for software engineering in the start-up context. Objective: We aim to collect data related to engineering goals, challenges, and practices in start-up companies to ascertain trends and patterns characterizing engineering work in start-ups. Such data allows researchers to understand better how goals and challenges are related to practices. This understanding can then inform future studies aimed at designing solutions addressing those goals and challenges. Besides, these trends and patterns can be useful for practitioners to make more informed decisions in their engineering practice. Method: We use a case survey method to gather first-hand, in-depth experiences from a large sample of software start-ups. We use open coding and cross-case analysis to describe and identify patterns, and corroborate the findings with statistical analysis. Results: We analyze 84 start-up cases and identify 16 goals, 9 challenges, and 16 engineering practices that are common among start-ups. We have mapped these goals, challenges, and practices to start-up life-cycle stages (inception, stabilization, growth, and maturity). Thus, creating the progression model guiding software engineering efforts in start-ups. Conclusions: We conclude that start-ups to a large extent face the same challenges and use the same practices as established companies. However, the primary software engineering challenge in start-ups is to evolve multiple process areas at once, with a little margin for serious errors.

Kochhar2019 Pavneet Singh Kochhar, Eirini Kalliamvakou, Nachiappan Nagappan, Thomas Zimmermann, and Christian Bird: "Moving from Closed to Open Source: Observations from Six Transitioned Projects to GitHub". IEEE Transactions on Software Engineering, 2019, 10.1109/tse.2019.2937025.

Open source software systems have gained a lot of attention in the past few years. With the emergence of open source platforms like GitHub, developers can contribute, store, and manage their projects with ease. Large organizations like Microsoft, Google, and Facebook are open sourcing their in-house technologies in an effort to more broadly involve the community in the development of software systems. Although closed source and open source systems have been studied extensively, there has been little research on the transition from closed source to open source systems. Through this study we aim to: a) provide guidance and insights for other teams planning to open source their projects and b) to help them avoid pitfalls during the transition process. We studied six different Microsoft systems, which were recently open-sourced i.e., CoreFX, CoreCLR, Roslyn, Entity Framework, MVC, and Orleans. This paper presents the transition from the viewpoints of both Microsoft and the open source community based on interviews with eleven Microsoft developer, five Microsoft senior managers involved in the decision to open source, and eleven open-source developers. From Microsoft's perspective we discuss the reasons for the transition, experiences of developers involved, and the transition's outcomes and challenges. Our results show that building a vibrant community, prompt answers, developing an open source culture, security regulations and business opportunities are the factors which persuade companies to open source their products. We also discuss the transition outcomes on processes such as code reviews, version control systems, continuous integration as well as developers' perception of these changes. From the open source community's perspective, we illustrate the response to the open-sourcing initiative through contributions and interactions with the internal developers and provide guidelines for other projects planning to go open source.

Krueger2020 Ryan Krueger, Yu Huang, Xinyu Liu, Tyler Santander, Westley Weimer, and Kevin Leach: "Neurological divide: an fMRI study of prose and code writing". Proc. International Conference on Software Engineering (ICSE), 2020, 10.1145/3377811.3380348.

Software engineering involves writing new code or editing existing code. Recent efforts have investigated the neural processes associated with reading and comprehending code—however, we lack a thorough understanding of the human cognitive processes underlying code writing. While prose reading and writing have been studied thoroughly, that same scrutiny has not been applied to code writing. In this paper, we leverage functional brain imaging to investigate neural representations of code writing in comparison to prose writing. We present the first human study in which participants wrote code and prose while undergoing a functional magnetic resonance imaging (fMRI) brain scan, making use of a full-sized fMRI-safe QWERTY keyboard. We find that code writing and prose writing are significantly dissimilar neural tasks. While prose writing entails significant left hemisphere activity associated with language, code writing involves more activations of the right hemisphere, including regions associated with attention control, working memory, planning and spatial cognition. These findings are unlike existing work in which code and prose comprehension were studied. By contrast, we present the first evidence suggesting that code and prose writing are quite dissimilar at the neural level.


Lamba2020 Hemank Lamba, Asher Trockman, Daniel Armanios, Christian Kästner, Heather Miller, and Bogdan Vasilescu: "Heard it through the Gitvine: an empirical study of tool diffusion across the npm ecosystem". Proc. European Software Engineering Conference/International Symposium on the Foundations of Software Engineering (ESEC/FSE), 2020, 10.1145/3368089.3409705.

Automation tools like continuous integration services, code coverage reporters, style checkers, dependency managers, etc. are all known to provide significant improvements in developer productivity and software quality. Some of these tools are widespread, others are not. How do these automation ``best practices'' spread? And how might we facilitate the diffusion process for those that have seen slower adoption? In this paper, we rely on a recent innovation in transparency on code hosting platforms like GitHub—the use of repository badges—to track how automation tools spread in open-source ecosystems through different social and technical mechanisms over time. Using a large longitudinal data set, multivariate network science techniques, and survival analysis, we study which socio-technical factors can best explain the observed diffusion process of a number of popular automation tools. Our results show that factors such as social exposure, competition, and observability affect the adoption of tools significantly, and they provide a roadmap for software engineers and researchers seeking to propagate best practices and tools.

Lampel2021 Johannes Lampel, Sascha Just, Sven Apel, and Andreas Zeller: "When life gives you oranges: detecting and diagnosing intermittent job failures at Mozilla". Proc. European Software Engineering Conference/International Symposium on the Foundations of Software Engineering (ESEC/FSE), 2021, 10.1145/3468264.3473931.

Continuous delivery of cloud systems requires constant running of jobs (build processes, tests, etc.). One issue that plagues this continuous integration (CI) process are intermittent failures - non-deterministic, false alarms that do not result from a bug in the software or job specification, but rather from issues in the underlying infrastructure. At Mozilla, such intermittent failures are called oranges as a reference to the color of the build status indicator. As such intermittent failures disrupt CI and lead to failures, they erode the developers' trust in the jobs. We present a novel approach that automatically classifies failing jobs to determine whether job execution failures arise from an actual software bug or were caused by flakiness in the job (e.g., test) or the underlying infrastructure. For this purpose, we train classification models using job telemetry data to diagnose failure patterns involving features such as runtime, cpu load, operating system version, or specific platform with high precision. In an evaluation on a set of Mozilla CI jobs, our approach achieves precision scores of 73%, on average, across all data sets with some test suites achieving precision scores good enough for fully automated classification (i.e., precision scores of up to 100%), and recall scores of 82% on average (up to 94%).

Latendresse2021 Jasmine Latendresse, Rabe Abdalkareem, Diego Elias Costa, and Emad Shihab: "How Effective is Continuous Integration in Indicating Single-Statement Bugs?". Proc. International Conference on Mining Software Repositories (MSR), 2021, 10.1109/msr52588.2021.00062.

Continuous Integration (CI) is the process of automatically compiling, building, and testing code changes in the hope of catching bugs as they are introduced into the code base. With bug fixing being a core and increasingly costly task in software development, the community has adopted CI to mitigate this issue and improve the quality of their software products. Bug fixing is a core task in software development and becomes increasingly costly over time. However, little is known about how effective CI is at detecting simple, single-statement bugs.In this paper, we analyze the effectiveness of CI in 14 popular open source Java-based projects to warn about 318 single-statement bugs (SStuBs). We analyze the build status at the commits that introduce SStuBs and before the SStuBs were fixed. We then investigate how often CI indicates the presence of these bugs, through test failure. Our results show that only 2% of the commits that introduced SStuBs have builds with failed tests and 7.5% of builds before the fix reported test failures. Upon close manual inspection, we found that none of the failed builds actually captured SStuBs, indicating that CI is not the right medium to capture the SStuBs we studied. Our results suggest that developers should not rely on CI to catch SStuBs or increase their CI pipeline coverage to detect single-statement bugs.

Lee2020a Daniel Lee, Dayi Lin, Cor-Paul Bezemer, and Ahmed E. Hassan: "Building the perfect game – an empirical study of game modifications". Empirical Software Engineering, 25(4), 2020, 10.1007/s10664-019-09783-w.

Prior work has shown that gamer loyalty is important for the sales of a developer's future games. Therefore, it is important for game developers to increase the longevity of their games. However, game developers cannot always meet the growing and changing needs of the gaming community, due to the often already overloaded schedules of developers. So-called modders can potentially assist game developers with addressing gamers' needs. Modders are enthusiasts who provide modifications or completely new content for a game. By supporting modders, game developers can meet the rapidly growing and varying needs of their gamer base. Modders have the potential to play a role in extending the life expectancy of a game, thereby saving game developers time and money, and leading to a better overall gaming experience for their gamer base. In this paper, we empirically study the metadata of 9,521 mods that were extracted from the Nexus Mods distribution platform. The Nexus Mods distribution platform is one of the largest mod distribution platforms for PC games at the time of our study. The goal of our paper is to provide useful insights about mods on the Nexus Mods distribution platform from a quantitative perspective, and to provide researchers a solid foundation to further explore game mods. To better understand the potential of mods to extend the longevity of a game we study their characteristics, and we study their release schedules and post-release support (in terms of bug reports) as a proxy for the willingness of the modding community to contribute to a game. We find that providing official support for mods can be beneficial for the perceived quality of the mods of a game: games for which a modding tool is provided by the original game developer have a higher median endorsement ratio than mods for games that do not have such a tool. In addition, mod users are willing to submit bug reports for a mod. However, they often fail to do this in a systematic manner using the bug reporting tool of the Nexus Mods platform, resulting in low-quality bug reports which are difficult to resolve. Our findings give the first insights into the characteristics, release schedule and post-release support of game mods. Our findings show that some games have a very active modding community, which contributes to those games through mods. Based on our findings, we recommend that game developers who desire an active modding community for their own games provide the modding community with an officially-supported modding tool. In addition, we recommend that mod distribution platforms, such as Nexus Mods, improve their bug reporting system to receive higher quality bug reports.

Lee2020b Daniel Lee, Gopi Krishnan Rajbahadur, Dayi Lin, Mohammed Sayagh, Cor-Paul Bezemer, and Ahmed E. Hassan: "An empirical study of the characteristics of popular Minecraft mods". Empirical Software Engineering, 25(5), 2020, 10.1007/s10664-020-09840-9.

It is becoming increasingly difficult for game developers to manage the cost of developing a game, while meeting the high expectations of gamers. One way to balance the increasing gamer expectation and development stress is to build an active modding community around the game. There exist several examples of games with an extremely active and successful modding community, with the Minecraft game being one of the most notable ones. This paper reports on an empirical study of 1,114 popular and 1,114 unpopular Minecraft mods from the CurseForge mod distribution platform, one of the largest distribution platforms for Minecraft mods. We analyzed the relationship between 33 features across 5 dimensions of mod characteristics and the popularity of mods (i.e., mod category, mod documentation, environmental context of the mod, remuneration for the mod, and community contribution for the mod), to understand the characteristics of popular Minecraft mods. We firstly verify that the studied dimensions have significant explanatory power in distinguishing the popularity of the studied mods. Then we evaluated the contribution of each of the 33 features across the 5 dimensions. We observed that popular mods tend to have a high quality description and promote community contribution.

LeGoues2018 Claire Le Goues, Ciera Jaspan, Ipek Ozkaya, Mary Shaw, and Kathryn T. Stolee: "Bridging the Gap: From Research to Practical Advice". IEEE Software, 35(5), 2018, 10.1109/ms.2018.3571235.

Software developers need actionable guidance, but researchers rarely integrate diverse types of evidence in a way that indicates the recommendations' strength. A levels-ofevidence framework might allow researchers and practitioners to translate research results to a pragmatically useful form.

LeGoues2021 Claire Le Goues, Michael Pradel, Abhik Roychoudhury, and Satish Chandra: "Automatic Program Repair". IEEE Software, 38(4), 2021, 10.1109/ms.2021.3072577.

An introduction to a special journal issue on automatic program repair.

Lemire2021 Daniel Lemire: "Number parsing at a gigabyte per second". Software: Practice and Experience, 51(8), 2021, 10.1002/spe.2984.

With disks and networks providing gigabytes per second, parsing decimal numbers from strings becomes a bottleneck. We consider the problem of parsing decimal numbers to the nearest binary floating-point value. The general problem requires variable-precision arithmetic. However, we need at most 17 digits to represent 64-bit standard floating-point numbers (IEEE 754). Thus, we can represent the decimal significand with a single 64-bit word. By combining the significand and precomputed tables, we can compute the nearest floating-point number using as few as one or two 64-bit multiplications. Our implementation can be several times faster than conventional functions present in standard C libraries on modern 64-bit systems (Intel, AMD, ARM, and POWER9). Our work is available as open source software used by major systems such as Apache Arrow and Yandex ClickHouse. The Go standard library has adopted a version of our approach.

Lima2021a Luan P. Lima, Lincoln S. Rocha, Carla I. M. Bezerra, and Matheus Paixao: "Assessing exception handling testing practices in open-source libraries". Empirical Software Engineering, 26(5), 2021, 10.1007/s10664-021-09983-3.

Modern programming languages (e.g., Java and C#) provide features to separate error-handling code from regular code, seeking to enhance software comprehensibility and maintainability. Nevertheless, the way exception handling (EH) code is structured in such languages may lead to multiple, different, and complex control flows, which may affect the software testability. Previous studies have reported that EH code is typically neglected, not well tested, and its misuse can lead to reliability degradation and catastrophic failures. However, little is known about the relationship between testing practices and EH testing effectiveness. In this exploratory study, we (i) measured the adequacy degree of EH testing concerning code coverage (instruction, branch, and method) criteria; and (ii) evaluated the effectiveness of the EH testing by measuring its capability to detect artificially injected faults (i.e., mutants) using 7 EH mutation operators. Our study was performed using test suites of 27 long-lived Java libraries from open-source ecosystems. Our results show that instructions and branches within catch blocks and throw instructions are less covered, with statistical significance, than the overall instructions and branches. Nevertheless, most of the studied libraries presented test suites capable of detecting more than 70% of the injected faults. From a total of 12, 331 mutants created in this study, the test suites were able to detect 68% of them.

Lima2021b Igor Lima, Jefferson Silva, Breno Miranda, Gustavo Pinto, and Marcelo d'Amorim: "Exposing bugs in JavaScript engines through test transplantation and differential testing". Software Quality Journal, 29(1), 2021, 10.1007/s11219-020-09537-8.

JavaScript is a popular programming language today with several implementations competing for market dominance. Although a specification document and a conformance test suite exist to guide engine development, bugs occur and have important practical consequences. Implementing correct engines is challenging because the spec is intentionally incomplete and evolves frequently. This paper investigates the use of test transplantation and differential testing for revealing functional bugs in JavaScript engines. The former technique runs the regression test suite of a given engine on another engine. The latter technique fuzzes existing inputs and then compares the output produced by different engines with a differential oracle. We conducted experiments with engines from five major players—Apple, Facebook, Google, Microsoft, and Mozilla—to assess the effectiveness of test transplantation and differential testing. Our results indicate that both techniques revealed several bugs, many of which are confirmed by developers. We reported 35 bugs with test transplantation (23 of these bugs confirmed and 19 fixed) and reported 24 bugs with differential testing (17 of these confirmed and 10 fixed). Results indicate that most of these bugs affected two engines—Apple's JSC and Microsoft's ChakraCore (24 and 26 bugs, respectively). To summarize, our results show that test transplantation and differential testing are easy to apply and very effective in finding bugs in complex software, such as JavaScript engines.

LimaJunior2021 Manoel Limeira Lima Júnior, Daricélio Soares, Alexandre Plastino, and Leonardo Murta: "Predicting the lifetime of pull requests in open-source projects". Journal of Software: Evolution and Process, 33(6), 2021, 10.1002/smr.2337.

A recent survey using industrial projects has shown that providing an estimate of the lifetime of pull requests to developers helps to speed up their conclusion. Previous work has explored pull request lifetime prediction in open-source projects using regression techniques but with a broad margin of error. The first objective of our work was to reduce the average error rate of the prediction obtained by the regression techniques so far. We performed experiments with different regression techniques and achieved a significant decrease in the mean error rate. The second objective of our work was to obtain a more effective and useful predictive model that can classify pull requests according to five discrete time intervals. We proposed new predictive attributes for the estimation of the time intervals and employed attribute selection strategies to identify subsets of attributes that could improve the predictive behavior of the classifiers. Our classification approach achieved the best accuracy in all the 20 projects evaluated in comparison with the literature. The average accuracy was of 45.28% to predict pull request lifetime, with an average normalized improvement of 14.68% in relation to the majority class and 6.49% in relation to the state-of-the-art.

Liu2021 Kui Liu, Dongsun Kim, Tegawende F. Bissyande, Shin Yoo, and Yves Le Traon: "Mining Fix Patterns for FindBugs Violations". IEEE Transactions on Software Engineering, 47(1), 2021, 10.1109/tse.2018.2884955.

Several static analysis tools, such as Splint or FindBugs, have been proposed to the software development community to help detect security vulnerabilities or bad programming practices. However, the adoption of these tools is hindered by their high false positive rates. If the false positive rate is too high, developers may get acclimated to violation reports from these tools, causing concrete and severe bugs being overlooked. Fortunately, some violations are actually addressed and resolved by developers. We claim that those violations that are recurrently fixed are likely to be true positives, and an automated approach can learn to repair similar unseen violations. However, there is lack of a systematic way to investigate the distributions on existing violations and fixed ones in the wild, that can provide insights into prioritizing violations for developers, and an effective way to mine code and fix patterns which can help developers easily understand the reasons of leading violations and how to fix them. In this paper, we first collect and track a large number of fixed and unfixed violations across revisions of software. The empirical analyses reveal that there are discrepancies in the distributions of violations that are detected and those that are fixed, in terms of occurrences, spread and categories, which can provide insights into prioritizing violations. To automatically identify patterns in violations and their fixes, we propose an approach that utilizes convolutional neural networks to learn features and clustering to regroup similar instances. We then evaluate the usefulness of the identified fix patterns by applying them to unfixed violations. The results show that developers will accept and merge a majority (69/116) of fixes generated from the inferred fix patterns. It is also noteworthy that the yielded patterns are applicable to four real bugs in the Defects4J major benchmark for software testing and automated repair.

Louis2020 Annie Louis, Santanu Kumar Dash, Earl T. Barr, Michael D. Ernst, and Charles Sutton: "Where should I comment my code?: a dataset and model for predicting locations that need comments". Proc. International Conference on Software Engineering (ICSE), 2020, 10.1145/3377816.3381736.

Programmers should write code comments, but not on every line of code. We have created a machine learning model that suggests locations where a programmer should write a code comment. We trained it on existing commented code to learn locations that are chosen by developers. Once trained, the model can predict locations in new code. Our models achieved precision of 74% and recall of 13% in identifying comment-worthy locations. This first success opens the door to future work, both in the new where-to-comment problem and in guiding comment generation. Our code and data is available at

Lunn2021 Stephanie Lunn, Monique Ross, Zahra Hazari, Mark Allen Weiss, Michael Georgiopoulos, and Kenneth Christensen: "The Impact of Technical Interviews, and other Professional and Cultural Experiences on Students' Computing Identity". Proc. Conference on Innovation and Technology in Computer Science Education (ITiCSE), 2021, 10.1145/3430665.3456362.

Increasingly companies assess a computing candidate's capabilities using technical interviews (TIs). Yet students struggle to code on demand, and there is already an insufficient amount of computing graduates to meet industry needs. Therefore, it is important to understand students' perceptions of TIs, and other professional experiences (e.g., computing jobs). We surveyed 740 undergraduate computing students at three universities to examine their experiences with the hiring process, as well as the impact of professional and cultural experiences (e.g., familial support) on computing identity. We considered the interactions between these experiences and social identity for groups underrepresented in computing - women, Black/African American, and Hispanic/Latinx students. Among other findings, we observed that students that did not have positive experiences with TIs had a reduced computing identity, but that facing discrimination during technical interviews had the opposite effect. Social support may play a role. Having friends in computing bolsters computing identity for Hispanic/Latinx students, as does a supportive home environment for women. Also, freelance computing jobs increase computing identity for Black/African American students. Our findings are intended to raise awareness of the best way for educators to help diverse groups of students to succeed, and to inform them of the experiences that may influence students' engagement, resilience, and computing identity development.

Luu2021 Quang-Hung Luu, Man F. Lau, Sebastian P.H. Ng, and Tsong Yueh Chen: "Testing multiple linear regression systems with metamorphic testing". Journal of Systems and Software, 182, 2021, 10.1016/j.jss.2021.111062.

Regression is one of the most commonly used statistical techniques. However, testing regression systems is a great challenge because of the absence of test oracle. In this paper, we show that Metamorphic Testing is an effective approach to test multiple linear regression systems. In doing so, we identify intrinsic mathematical properties of linear regression, and then propose 11 Metamorphic Relations to be used for testing. Their effectiveness is examined using mutation analysis with a range of different regression programs. We further look at how the testing could be adopted in a more effective way. Our work is applicable to examine the reliability of predictive systems based on regression that has been widely used in economics, engineering and science, as well as of the regression calculation manipulated by statistical users.


Ma2021 Yuxing Ma and 8 others: "World of code: enabling a research workflow for mining and analyzing the universe of open source VCS data". Empirical Software Engineering, 26(2), 2021, 10.1007/s10664-020-09905-9.

Open source software (OSS) is essential for modern society and, while substantial research has been done on individual (typically central) projects, only a limited understanding of the periphery of the entire OSS ecosystem exists. For example, how are the tens of millions of projects in the periphery interconnected through. technical dependencies, code sharing, or knowledge flow? To answer such questions we: a) create a very large and frequently updated collection of version control data in the entire FLOSS ecosystems named World of Code (WoC), that can completely cross-reference authors, projects, commits, blobs, dependencies, and history of the FLOSS ecosystems and b) provide capabilities to efficiently correct, augment, query, and analyze that data. Our current WoC implementation is capable of being updated on a monthly basis and contains over 18B Git objects. To evaluate its research potential and to create vignettes for its usage, we employ WoC in conducting several research tasks. In particular, we find that it is capable of supporting trend evaluation, ecosystem measurement, and the determination of package usage. We expect WoC to spur investigation into global properties of OSS development leading to increased resiliency of the entire OSS ecosystem. Our infrastructure facilitates the discovery of key technical dependencies, code flow, and social networks that provide the basis to determine the structure and evolution of the relationships that drive FLOSS activities and innovation.

Macho2021 Christian Macho, Stefanie Beyer, Shane McIntosh, and Martin Pinzger: "The nature of build changes: An empirical study of Maven-based build systems". Empirical Software Engineering, 26(3), 2021, 10.1007/s10664-020-09926-4.

Build systems are an essential part of modern software projects. As software projects change continuously, it is crucial to understand how the build system changes because neglecting its maintenance can, at best, lead to expensive build breakage, or at worst, introduce user-reported defects due to incorrectly compiled, linked, packaged, or deployed official releases. Recent studies have investigated the (co-)evolution of build configurations and reasons for build breakage; however, the prior analysis focused on a coarse-grained outcome (i.e., either build changing or not). In this paper, we present BUILDDIFF, an approach to extract detailed build changes from MAVEN build files and classify them into 143 change types. In a manual evaluation of 400 build-changing commits, we show that BUILDDIFF can extract and classify build changes with average precision, recall, and f1-scores of 0.97, 0.98, and 0.97, respectively. We then present two studies using the build changes extracted from 144 open source Java projects to study the frequency and time of build changes. The results show that the top-10 most frequent change types account for 51% of the build changes. Among them, changes to version numbers and changes to dependencies of the projects occur most frequently. We also observe frequently co-occurring changes, such as changes to the source code management definitions, and corresponding changes to the dependency management system and the dependency declaration. Furthermore, our results show that build changes frequently occur around release days. In particular, critical changes, such as updates to plugin configuration parts and dependency insertions, are performed before a release day. The contributions of this paper lay in the foundation for future research, such as for analyzing the (co-)evolution of build files with other artifacts, improving effort estimation approaches by incorporating necessary modifications to the build system specification, or automatic repair approaches for configuration code. Furthermore, our detailed change information enables improvements of refactoring approaches for build configurations and improvements of prediction models to identify error-prone build files.

Masood2020b Zainab Masood, Rashina Hoda, and Kelly Blincoe: "Real World Scrum A Grounded Theory of Variations in Practice". IEEE Transactions on Software Engineering, 2020, 10.1109/tse.2020.3025317.

Scrum, the most popular agile method and project management framework, is widely reported to be used, adapted, misused, and abused in practice. However, not much is known about how Scrum actually works in practice, and critically, where, when, how and why it diverges from Scrum by the book. Through a Grounded Theory study involving semi-structured interviews of 45 participants from 30 companies and observations of five teams, we present our findings on how Scrum works in practice as compared to how it is presented in its formative books. We identify significant variations in these practices such as work breakdown, estimation, prioritization, assignment, the associated roles and artefacts, and discuss the underlying rationales driving the variations. Critically, we claim that not all variations are process misuse/abuse and propose a nuanced classification approach to understanding variations as standard, necessary, contextual, and clear deviations for successful Scrum use and adaptation.

May2019 Anna May, Johannes Wachs, and Anikó Hannák: "Gender differences in participation and reward on Stack Overflow". Empirical Software Engineering, 24(4), 2019, 10.1007/s10664-019-09685-x.

Programming is a valuable skill in the labor market, making the underrepresentation of women in computing an increasingly important issue. Online question and answer platforms serve a dual purpose in this field: they form a body of knowledge useful as a reference and learning tool, and they provide opportunities for individuals to demonstrate credible, verifiable expertise. Issues, such as male-oriented site design or overrepresentation of men among the site's elite may therefore compound the issue of women's underrepresentation in IT. In this paper we audit the differences in behavior and outcomes between men and women on Stack Overflow, the most popular of these Q&A sites. We observe significant differences in how men and women participate in the platform and how successful they are. For example, the average woman has roughly half of the reputation points, the primary measure of success on the site, of the average man. Using an Oaxaca-Blinder decomposition, an econometric technique commonly applied to analyze differences in wages between groups, we find that most of the gap in success between men and women can be explained by differences in their activity on the site and differences in how these activities are rewarded. Specifically, 1) men give more answers than women and 2) are rewarded more for their answers on average, even when controlling for possible confounders such as tenure or buy-in to the site. Women ask more questions and gain more reward per question. We conclude with a hypothetical redesign of the site's scoring system based on these behavioral differences, cutting the reputation gap in half.

Melo2019 Hugo Melo, Roberta Coelho, and Christoph Treude: "Unveiling Exception Handling Guidelines Adopted by Java Developers". Proc. International Conference on Software Analysis, Evolution and Reengineering (SANER), 2019, 10.1109/saner.2019.8668001.

Despite being an old language feature, Java exception handling code is one of the least understood parts of many systems. Several studies have analyzed the characteristics of exception handling code, trying to identify common practices or even link such practices to software bugs. Few works, however, have investigated exception handling issues from the point of view of developers. None of the works have focused on discovering exception handling guidelines adopted by current systems—which are likely to be a driver of common practices. In this work, we conducted a qualitative study based on semi-structured interviews and a survey whose goal was to investigate the guidelines that are (or should be) followed by developers in their projects. Initially, we conducted semi-structured interviews with seven experienced developers, which were used to inform the design of a survey targeting a broader group of Java developers (i.e., a group of active Java developers from top-starred projects on GitHub). We emailed 863 developers and received 98 valid answers. The study shows that exception handling guidelines usually exist (70%) and are usually implicit and undocumented (54%). Our study identifies 48 exception handling guidelines related to seven different categories. We also investigated how such guidelines are disseminated to the project team and how compliance between code and guidelines is verified; we could observe that according to more than half of respondents the guidelines are both disseminated and verified through code inspection or code review. Our findings provide software development teams with a means to improve exception handling guidelines based on insights from the state of practice of 87 software projects.

Mo2021 Ran Mo, Yuanfang Cai, Rick Kazman, Lu Xiao, and Qiong Feng: "Architecture Anti-Patterns: Automatically Detectable Violations of Design Principles". IEEE Transactions on Software Engineering, 47(5), 2021, 10.1109/tse.2019.2910856.

In large-scale software systems, error-prone or change-prone files rarely stand alone. They are typically architecturally connected and their connections usually exhibit architecture problems causing the propagation of error-proneness or change-proneness. In this paper, we propose and empirically validate a suite of architecture anti-patterns that occur in all large-scale software systems and are involved in high maintenance costs. We define these architecture anti-patterns based on fundamental design principles and Baldwin and Clark's design rule theory. We can automatically detect these anti-patterns by analyzing a project's structural relationships and revision history. Through our analyses of 19 large-scale software projects, we demonstrate that these architecture anti-patterns have significant impact on files' bug-proneness and change-proneness. In particular, we show that 1) files involved in these architecture anti-patterns are more error-prone and change-prone; 2) the more anti-patterns a file is involved in, the more error-prone and change-prone it is; and 3) while all of our defined architecture anti-patterns contribute to file's error-proneness and change-proneness, Unstable Interface and Crossing contribute the most by far.

Moraes2021 João Pedro Moraes, Ivanilton Polato, Igor Wiese, Filipe Saraiva, and Gustavo Pinto: "From one to hundreds: multi-licensing in the JavaScript ecosystem". Empirical Software Engineering, 26(3), 2021, 10.1007/s10664-020-09936-2.

Open source licenses create a legal framework that plays a crucial role in the widespread adoption of open source projects. Without a license, any source code available on the internet could not be openly (re)distributed. Although recent studies provide evidence that most popular open source projects have a license, developers might lack confidence or expertise when they need to combine software licenses, leading to a mistaken project license unification.This license usage is challenged by the high degree of reuse that occurs in the heart of modern software development practices, in which third-party libraries and frameworks are easily and quickly integrated into a software codebase.This scenario creates what we call ``multi-licensed'' projects, which happens when one project has components that are licensed under more than one license. Although these components exist at the file-level, they naturally impact licensing decisions at the project-level. In this paper, we conducted a mix-method study to shed some light on these questions. We started by parsing 1,426,263 (source code and non-source code) files available on 1,552 JavaScript projects, looking for license information. Among these projects, we observed that 947 projects (61%) employ more than one license. On average, there are 4.7 licenses per studied project (max: 256). Among the reasons for multi-licensing is to incorporate the source code of third-party libraries into the project's codebase. When doing so, we observed that 373 of the multi-licensed projects introduced at least one license incompatibility issue. We also surveyed with 83 maintainers of these projects aimed to cross-validate our findings. We observed that 63% of the surveyed maintainers are not aware of the multi-licensing implications. For those that are aware, they adopt multiple licenses mostly to conform with third-party libraries' licenses.

MoreiraSoares2020 Daricélio Moreira Soares, Manoel Limeira Lima Júnior, Leonardo Murta, and Alexandre Plastino: "What factors influence the lifetime of pull requests?". Software: Practice and Experience, 51(6), 2020, 10.1002/spe.2946.

When external contributors want to collaborate with an open-source project, they fork the repository, make changes, and send a pull request to the core team. However, the lifetime of a pull request, defined by the time interval between its opening and its closing, has a high variation, potentially affecting the contributor engagement. In this context, understanding the root causes of pull request lifetime is important to both the external contributors and the core team. The former can adopt strategies that increase the chances of fast review, while the latter can establish priorities in the reviewing process, alleviating the pending tasks and improving the software quality. In this work, we mined association rules from 97,463 pull requests from 30 projects in order to find characteristics that have affected the pull requests lifetime. In addition, we present a qualitative analysis, helping to understand the patterns discovered from the association rules. The results indicate that: (i) contributions with shorter lifetimes tend to be accepted; (ii) structural characteristics, such as number of commits, changed files, and lines of code, have influence, in an isolated or combined way, on the pull request lifetime; (iii) the files changed and the directories to which they belong can be robust predictors for pull request lifetime; (iv) the profile of external contributors and their social relationships have influence on lifetime; and (v) the number of comments in a pull request, as well as the developer responsible for the review, are important predictors for its lifetime.

MurphyHill2021 Emerson Murphy-Hill and 8 others: "What Predicts Software Developers' Productivity?". IEEE Transactions on Software Engineering, 47(3), 2021, 10.1109/tse.2019.2900308.

Organizations have a variety of options to help their software developers become their most productive selves, from modifying office layouts, to investing in better tools, to cleaning up the source code. But which options will have the biggest impact? Drawing from the literature in software engineering and industrial/organizational psychology to identify factors that correlate with productivity, we designed a survey that asked 622 developers across 3 companies about these productivity factors and about self-rated productivity. Our results suggest that the factors that most strongly correlate with self-rated productivity were non-technical factors, such as job enthusiasm, peer support for new ideas, and receiving useful feedback about job performance. Compared to other knowledge workers, our results also suggest that software developers' self-rated productivity is more strongly related to task variety and ability to work remotely.


NguyenDuc2021 Anh Nguyen-Duc, Kai-Kristian Kemell, and Pekka Abrahamsson: "The entrepreneurial logic of startup software development: A study of 40 software startups". Empirical Software Engineering, 26(5), 2021, 10.1007/s10664-021-09987-z.

Context: Software startups are an essential source of innovation and software-intensive products. The need to understand product development in startups and to provide relevant support are highlighted in software research. While state-of-the-art literature reveals how startups develop their software, the reasons why they adopt these activities are underexplored. Objective: This study investigates the tactics behind software engineering (SE) activities by analyzing key engineering events during startup journeys. We explore how entrepreneurial mindsets may be associated with SE knowledge areas and with each startup case. Method: Our theoretical foundation is based on causation and effectuation models. We conducted semi-structured interviews with 40 software startups. We used two-round open coding and thematic analysis to describe and identify entrepreneurial software development patterns. Additionally, we calculated an effectuation index for each startup case. Results: We identified 621 events merged into 32 codes of entrepreneurial logic in SE from the sample. We found a systemic occurrence of the logic in all areas of SE activities. Minimum Viable Product (MVP), Technical Debt (TD), and Customer Involvement (CI) tend to be associated with effectual logic, while testing activities at different levels are associated with causal logic. The effectuation index revealed that startups are either effectuation-driven or mixed-logics-driven. Conclusions: Software startups fall into two types that differentiate between how traditional SE approaches may apply to them. Effectuation seems the most relevant and essential model for explaining and developing suitable SE practices for software startups.


Olejniczak2020 Anthony J. Olejniczak and Molly J. Wilson: "Who's writing open access (OA) articles? Characteristics of OA authors at Ph.D.-granting institutions in the United States". Quantitative Science Studies, 1(4), 2020, 10.1162/qss_a_00091.

The open access (OA) publication movement aims to present research literature to the public at no cost and with no restrictions. While the democratization of access to scholarly literature is a primary focus of the movement, it remains unclear whether OA has uniformly democratized the corpus of freely available research, or whether authors who choose to publish in OA venues represent a particular subset of scholars—those with access to resources enabling them to afford article processing charges (APCs). We investigated the number of OA articles with article processing charges (APC OA) authored by 182,320 scholars with known demographic and institutional characteristics at American research universities across 11 broad fields of study. The results show, in general, that the likelihood for a scholar to author an APC OA article increases with male gender, employment at a prestigious institution (AAU member universities), association with a STEM discipline, greater federal research funding, and more advanced career stage (i.e., higher professorial rank). Participation in APC OA publishing appears to be skewed toward scholars with greater access to resources and job security.

Olsson2021 Jesper Olsson, Erik Risfelt, Terese Besker, Antonio Martini, and Richard Torkar: "Measuring affective states from technical debt". Empirical Software Engineering, 26(5), 2021, 10.1007/s10664-021-09998-w.

Context: Software engineering is a human activity. Despite this, human aspects are under-represented in technical debt research, perhaps because they are challenging to evaluate. Objective: This study's objective was to investigate the relationship between technical debt and affective states (feelings, emotions, and moods) from software practitioners. Method: Forty participants (N=40) from twelve companies took part in a mixed-methods approach, consisting of a repeated-measures (r=5) experiment (n=200), a survey, and semi-structured interviews. From the qualitative data, it is clear that technical debt activates a substantial portion of the emotional spectrum and is psychologically taxing. Further, the practitioners' reactions to technical debt appear to fall in different levels of maturity. Results: The statistical analysis shows that different design smells (strong indicators of technical debt) negatively or positively impact affective states. Conclusions: We argue that human aspects in technical debt are important factors to consider, as they may result in, e.g., procrastination, apprehension, and burnout.


Palomba2021 Fabio Palomba, Damian Andrew Tamburri, Francesca Arcelli Fontana, Rocco Oliveto, Andy Zaidman, and Alexander Serebrenik: "Beyond Technical Aspects: How Do Community Smells Influence the Intensity of Code Smells?". IEEE Transactions on Software Engineering, 47(1), 2021, 10.1109/tse.2018.2883603.

Code smells are poor implementation choices applied by developers during software evolution that often lead to critical flaws or failure. Much in the same way, community smells reflect the presence of organizational and socio-technical issues within a software community that may lead to additional project costs. Recent empirical studies provide evidence that community smells are often—if not always—connected to circumstances such as code smells. In this paper we look deeper into this connection by conducting a mixed-methods empirical study of 117 releases from 9 open-source systems. The qualitative and quantitative sides of our mixed-methods study were run in parallel and assume a mutually-confirmative connotation. On the one hand, we survey 162 developers of the 9 considered systems to investigate whether developers perceive relationship between community smells and the code smells found in those projects. On the other hand, we perform a fine-grained analysis into the 117 releases of our dataset to measure the extent to which community smells impact code smell intensity (i.e., criticality). We then propose a code smell intensity prediction model that relies on both technical and community-related aspects. The results of both sides of our mixed-methods study lead to one conclusion: community-related factors contribute to the intensity of code smells. This conclusion supports the joint use of community and code smells detection as a mechanism for the joint management of technical and social problems around software development communities.

Paltoglou2021 Katerina Paltoglou, Vassilis E. Zafeiris, N.A. Diamantidis, and E.A. Giakoumakis: "Automated refactoring of legacy JavaScript code to ES6 modules". Journal of Systems and Software, 181, 2021, 10.1016/j.jss.2021.111049.

The JavaScript language did not specify, until ECMAScript 6 (ES6), native features for streamlining encapsulation and modularity. Developer community filled the gap with a proliferation of design patterns and module formats, with impact on code reusability, portability and complexity of build configurations. This work studies the automated refactoring of legacy ES5 code to ES6 modules with fine-grained reuse of module contents through the named import/export language constructs. The focus is on reducing the coupling of refactored modules through destructuring exported module objects to fine-grained module features and enhancing module dependencies by leveraging the ES6 syntax. We employ static analysis to construct a model of a JavaScript project, the Module Dependence Graph (MDG), that represents modules and their dependencies. On the basis of MDG we specify the refactoring procedure for module migration to ES6. A prototype implementation has been empirically evaluated on 19 open source projects. Results highlight the relevance of the refactoring with a developer intent for fine-grained reuse. The analysis of refactored code shows an increase in the number of reusable elements per project and reduction in the coupling of refactored modules. The soundness of the refactoring is empirically validated through code inspection and execution of projects' test suites.

Passos2021 Leonardo Passos, Rodrigo Queiroz, Mukelabai Mukelabai, Thorsten Berger, Sven Apel, Krzysztof Czarnecki, and Jesus Alejandro Padilla: "A Study of Feature Scattering in the Linux Kernel". IEEE Transactions on Software Engineering, 47(1), 2021, 10.1109/tse.2018.2884911.

Feature code is often scattered across a software system. Scattering is not necessarily bad if used with care, as witnessed by systems with highly scattered features that evolved successfully. Feature scattering, often realized with a pre-processor, circumvents limitations of programming languages and software architectures. Unfortunately, little is known about the principles governing scattering in large and long-living software systems. We present a longitudinal study of feature scattering in the Linux kernel, complemented by a survey with 74, and interviews with nine Linux kernel developers. We analyzed almost eight years of the kernel's history, focusing on its largest subsystem: device drivers. We learned that the ratio of scattered features remained nearly constant and that most features were introduced without scattering. Yet, scattering easily crosses subsystem boundaries, and highly scattered outliers exist. Scattering often addresses a performance-maintenance tradeoff (alleviating complicated APIs), hardware design limitations, and avoids code duplication. While developers do not consciously enforce scattering limits, they actually improve the system design and refactor code, thereby mitigating pre-processor idiosyncrasies or reducing its use.

Patra2021 Jibesh Patra and Michael Pradel: "Semantic bug seeding: a learning-based approach for creating realistic bugs". Proc. European Software Engineering Conference/International Symposium on the Foundations of Software Engineering (ESEC/FSE), 2021, 10.1145/3468264.3468623.

When working on techniques to address the wide-spread problem of software bugs, one often faces the need for a large number of realistic bugs in real-world programs. Such bugs can either help evaluate an approach, e.g., in form of a bug benchmark or a suite of program mutations, or even help build the technique, e.g., in learning-based bug detection. Because gathering a large number of real bugs is difficult, a common approach is to rely on automatically seeded bugs. Prior work seeds bugs based on syntactic transformation patterns, which often results in unrealistic bugs and typically cannot introduce new, application-specific code tokens. This paper presents SemSeed, a technique for automatically seeding bugs in a semantics-aware way. The key idea is to imitate how a given real-world bug would look like in other programs by semantically adapting the bug pattern to the local context. To reason about the semantics of pieces of code, our approach builds on learned token embeddings that encode the semantic similarities of identifiers and literals. Our evaluation with real-world JavaScript software shows that the approach effectively reproduces real bugs and clearly outperforms a semantics-unaware approach. The seeded bugs are useful as training data for learning-based bug detection, where they significantly improve the bug detection ability. Moreover, we show that SemSeed-created bugs complement existing mutation testing operators, and that our approach is efficient enough to seed hundreds of thousands of bugs within an hour.

Peitek2021 Norman Peitek, Sven Apel, Chris Parnin, Andre Brechmann, and Janet Siegmund: "Program Comprehension and Code Complexity Metrics: An fMRI Study". Proc. International Conference on Software Engineering (ICSE), 2021, 10.1109/icse43902.2021.00056.

Background: Researchers and practitioners have been using code complexity metrics for decades to predict how developers comprehend a program. While it is plausible and tempting to use code metrics for this purpose, their validity is debated, since they rely on simple code properties and rarely consider particularities of human cognition. Aims: We investigate whether and how code complexity metrics reflect difficulty of program comprehension. Method: We have conducted a functional magnetic resonance imaging (fMRI) study with 19 participants observing program comprehension of short code snippets at varying complexity levels. We dissected four classes of code complexity metrics and their relationship to neuronal, behavioral, and subjective correlates of program comprehension, overall analyzing more than 41 metrics. Results: While our data corroborate that complexity metrics can-to a limited degree-explain programmers' cognition in program comprehension, fMRI allowed us to gain insights into why some code properties are difficult to process. In particular, a code's textual size drives programmers' attention, and vocabulary size burdens programmers' working memory. Conclusion: Our results provide neuro-scientific evidence supporting warnings of prior research questioning the validity of code complexity metrics and pin down factors relevant to program comprehension. Future Work: We outline several follow-up experiments investigating fine-grained effects of code complexity and describe possible refinements to code complexity metrics.

Pizard2021 Sebastián Pizard, Fernando Acerenza, Ximena Otegui, Silvana Moreno, Diego Vallespir, and Barbara Kitchenham: "Training students in evidence-based software engineering and systematic reviews: a systematic review and empirical study". Empirical Software Engineering, 26(3), 2021, 10.1007/s10664-021-09953-9.

Context Although influential in academia, evidence-based software engineering (EBSE) has had little impact on industry practice. We found that other disciplines have identified lack of training as a significant barrier to Evidence-Based Practice. Objective To build and assess an EBSE training proposal suitable for students with more than 3 years of computer science/software engineering university-level training. Method We performed a systematic literature review (SLR) of EBSE teaching initiatives and used the SLR results to help us to develop and evaluate an EBSE training proposal. The course was based on the theory of learning outcomes and incorporated a large practical content related to performing an SLR. We ran the course with 10 students and based course evaluation on student performance and opinions of both students and teachers. We assessed knowledge of EBSE principles from the mid-term and final tests, as well as evaluating the SLRs produced by the student teams. We solicited student opinions about the course and its value via a student survey, a team survey, and a focus group. The teachers' viewpoint was collected in a debriefing meeting. Results Our SLR identified 14 relevant primary studies. The primary studies emphasized the importance of practical examples (usually based on the SLR process) and used a variety of evaluation methods, but lacked any formal education methodology. We identified 54 learning outcomes covering aspects of EBSE and the SLR method. All 10 students passed the course. Our course evaluation showed that a large percentage of the learning outcomes established for training were accomplished. Conclusions The course proved suitable for students to understand the EBSE paradigm and to be able to apply it to a limited-scope practical assignment. Our learning outcomes, course structure, and course evaluation process should help to improve the effectiveness and comparability of future studies of EBSE training. However, future courses should increase EBSE training related to the use of SLR results.


Qiu2019 Huilian Sophie Qiu, Alexander Nolte, Anita Brown, Alexander Serebrenik, and Bogdan Vasilescu: "Going Farther Together: The Impact of Social Capital on Sustained Participation in Open Source". Proc. International Conference on Software Engineering (ICSE), 2019, 10.1109/icse.2019.00078.

Sustained participation by contributors in opensource software is critical to the survival of open-source projects and can provide career advancement benefits to individual contributors. However, not all contributors reap the benefits of open-source participation fully, with prior work showing that women are particularly underrepresented and at higher risk of disengagement. While many barriers to participation in open-source have been documented in the literature, relatively little is known about how the social networks that open-source contributors form impact their chances of long-term engagement. In this paper we report on a mixed-methods empirical study of the role of social capital (i.e., the resources people can gain from their social connections) for sustained participation by women and men in open-source GitHub projects. After combining survival analysis on a large, longitudinal data set with insights derived from a user survey, we confirm that while social capital is beneficial for prolonged engagement for both genders, women are at disadvantage in teams lacking diversity in expertise.


RakAmnouykit2020 Ingkarat Rak-amnouykit, Daniel McCrevan, Ana Milanova, Martin Hirzel, and Julian Dolby: "Python 3 types in the wild: a tale of two type systems". Proc. International Symposium on Dynamic Languages (ISDL), 2020, 10.1145/3426422.3426981.

Python 3 is a highly dynamic language, but it has introduced a syntax for expressing types with PEP484. This paper ex- plores how developers use these type annotations, the type system semantics provided by type checking and inference tools, and the performance of these tools. We evaluate the types and tools on a corpus of public GitHub repositories. We review MyPy and PyType, two canonical static type checking and inference tools, and their distinct approaches to type analysis. We then address three research questions: (i) How often and in what ways do developers use Python 3 types? (ii) Which type errors do developers make? (iii) How do type errors from different tools compare? Surprisingly, when developers use static types, the code rarely type-checks with either of the tools. MyPy and PyType exhibit false positives, due to their static nature, but also flag many useful errors in our corpus. Lastly, MyPy and PyType embody two distinct type systems, flagging different errors in many cases. Understanding the usage of Python types can help guide tool-builders and researchers. Understanding the performance of popular tools can help increase the adoption of static types and tools by practitioners, ultimately leading to more correct and more robust Python code.

Rahman2020b Mohammad Masudur Rahman, Foutse Khomh, and Marco Castelluccio: "Why are Some Bugs Non-Reproducible? An Empirical Investigation using Data Fusion". Proc. International Conference on Software Maintenance and Evolution (ICSME), 2020, 10.1109/icsme46990.2020.00063.

Software developers attempt to reproduce software bugs to understand their erroneous behaviours and to fix them. Unfortunately, they often fail to reproduce (or fix) them, which leads to faulty, unreliable software systems. However, to date, only a little research has been done to better understand what makes the software bugs non-reproducible. In this paper, we conduct a multimodal study to better understand the non-reproducibility of software bugs. First, we perform an empirical study using 576 non-reproducible bug reports from two popular software systems (Firefox, Eclipse) and identify 11 key factors that might lead a reported bug to non-reproducibility. Second, we conduct a user study involving 13 professional developers where we investigate how the developers cope with non-reproducible bugs. We found that they either close these bugs or solicit for further information, which involves long deliberations and counter-productive manual searches. Third, we offer several actionable insights on how to avoid non-reproducibility (e.g., false-positive bug report detector) and improve reproducibility of the reported bugs (e.g., sandbox for bug reproduction) by combining our analyses from multiple studies (e.g., empirical study, developer study).

Rahman2021 Akond Rahman, Md Rayhanur Rahman, Chris Parnin, and Laurie Williams: "Security Smells in Ansible and Chef Scripts". ACM Transactions on Software Engineering and Methodology, 30(1), 2021, 10.1145/3408897.

Context: Security smells are recurring coding patterns that are indicative of security weakness and require further inspection. As infrastructure as code (IaC) scripts, such as Ansible and Chef scripts, are used to provision cloud-based servers and systems at scale, security smells in IaC scripts could be used to enable malicious users to exploit vulnerabilities in the provisioned systems. Goal: The goal of this article is to help practitioners avoid insecure coding practices while developing infrastructure as code scripts through an empirical study of security smells in Ansible and Chef scripts. Methodology: We conduct a replication study where we apply qualitative analysis with 1,956 IaC scripts to identify security smells for IaC scripts written in two languages: Ansible and Chef. We construct a static analysis tool called Security Linter for Ansible and Chef scripts (SLAC) to automatically identify security smells in 50,323 scripts collected from 813 open source software repositories. We also submit bug reports for 1,000 randomly selected smell occurrences. Results: We identify two security smells not reported in prior work: missing default in case statement and no integrity check. By applying SLAC we identify 46,600 occurrences of security smells that include 7,849 hard-coded passwords. We observe agreement for 65 of the responded 94 bug reports, which suggests the relevance of security smells for Ansible and Chef scripts amongst practitioners. Conclusion: We observe security smells to be prevalent in Ansible and Chef scripts, similarly to that of the Puppet scripts. We recommend practitioners to rigorously inspect the presence of the identified security smells in Ansible and Chef scripts using (i) code review, and (ii) static analysis tools.

Reyes2018 Rolando P. Reyes, Oscar Dieste, Efraín R. Fonseca, and Natalia Juristo: "Statistical errors in software engineering experiments". Proc. International Conference on Software Engineering (ICSE), 2018, 10.1145/3180155.3180161.

Background: Statistical concepts and techniques are often applied incorrectly, even in mature disciplines such as medicine or psychology. Surprisingly, there are very few works that study statistical problems in software engineering (SE). Aim: Assess the existence of statistical errors in SE experiments. Method: Compile the most common statistical errors in experimental disciplines. Survey experiments published in ICSE to assess whether errors occur in high quality SE publications. Results: The same errors as identified in others disciplines were found in ICSE experiments, where 30 of the reviewed papers included several error types such as: a) missing statistical hypotheses, b) missing sample size calculation, c) failure to assess statistical test assumptions, and d) uncorrected multiple testing. This rather large error rate is greater for research papers where experiments are confined to the validation section. The origin of the errors can be traced back to: a) researchers not having sufficient statistical training, and b) a profusion of exploratory research. Conclusions: This paper provides preliminary evidence that SE research suffers from the same statistical problems as other experimental disciplines. However, the SE community appears to be unaware of any shortcomings in its experiments, whereas other disciplines work hard to avoid these threats. Further research is necessary to find the underlying causes and set up corrective measures, but there are some potentially effective actions and are a priori easy to implement: a) improve the statistical training of SE researchers, and b) enforce quality assessment and reporting guidelines in SE publications.

Rico2021 Sergio Rico, Elizabeth Bjarnason, Emelie Engström, Martin Höst, and Per Runeson: "A case study of industry–academia communication in a joint software engineering research project". Journal of Software: Evolution and Process, 2021, 10.1002/smr.2372.

Empirical software engineering research relies on good communication with industrial partners. Conducting joint research both requires and contributes to bridging the communication gap between industry and academia (IA) in software engineering. This study aims to explore communication between the two parties in such a setting. To better understand what facilitates good IA communication and what project outcomes such communication promotes, we performed a case study, in the context of a long-term IA joint project, followed by a validating survey among practitioners and researchers with experience of working in similar settings. We identified five facilitators of IA communication and nine project outcomes related to this communication. The facilitators concern the relevance of the research, practitioners' attitude and involvement in research, frequency of communication and longevity of the collaboration. The project outcomes promoted by this communication include, for researchers, changes in teaching and new scientific venues, and for practitioners, increased awareness, changes to practice, and new tools and source code. Besides, both parties gain new knowledge and develop social-networks through IA communication. Our study presents empirically based insights that can provide advise on how to improve communication in IA research projects and thus the co-creation of software engineering knowledge that is anchored in both practice and research.

Rodeghero2021 Paige Rodeghero, Thomas Zimmermann, Brian Houck, and Denae Ford: "Please Turn Your Cameras on: Remote Onboarding of Software Developers During a Pandemic". Proc. International Conference on Software Engineering (ICSE), 2021, 10.1109/icse-seip52600.2021.00013.

The COVID-19 pandemic has impacted the way that software development teams onboard new hires. Previously, most software developers worked in physical offices and new hires onboarded to their teams in the physical office, following a standard onboarding process. However, when companies transitioned employees to work from home due to the pandemic, there was little to no time to develop new onboarding procedures. In this paper, we present a survey of 267 new hires at Microsoft that onboarded to software development teams during the pandemic. We explored their remote onboarding process, including the challenges that the new hires encountered and their social connectedness with their teams. We found that most developers onboarded remotely and never had an opportunity to meet their teammates in person. This leads to one of the biggest challenges faced by these new hires, building a strong social connection with their team. We use these results to provide recommendations for onboarding remote hires.

RodriguezPerez2020 Gema Rodríguez-Pérez, Gregorio Robles, Alexander Serebrenik, Andy Zaidman, Daniel M. Germán, and Jesus M. Gonzalez-Barahona: "How bugs are born: a model to identify how bugs are introduced in software components". Empirical Software Engineering, 25(2), 2020, 10.1007/s10664-019-09781-y.

When identifying the origin of software bugs, many studies assume that ``a bug was introduced by the lines of code that were modified to fix it''. However, this assumption does not always hold and at least in some cases, these modified lines are not responsible for introducing the bug. For example, when the bug was caused by a change in an external API. The lack of empirical evidence makes it impossible to assess how important these cases are and therefore, to which extent the assumption is valid. To advance in this direction, and better understand how bugs ``are born'', we propose a model for defining criteria to identify the first snapshot of an evolving software system that exhibits a bug. This model, based on the perfect test idea, decides whether a bug is observed after a change to the software. Furthermore, we studied the model's criteria by carefully analyzing how 116 bugs were introduced in two different open source software projects. The manual analysis helped classify the root cause of those bugs and created manually curated datasets with bug-introducing changes and with bugs that were not introduced by any change in the source code. Finally, we used these datasets to evaluate the performance of four existing SZZ-based algorithms for detecting bug-introducing changes. We found that SZZ-based algorithms are not very accurate, especially when multiple commits are found; the F-Score varies from 0.44 to 0.77, while the percentage of true positives does not exceed 63%. Our results show empirical evidence that the prevalent assumption, ``a bug was introduced by the lines of code that were modified to fix it'', is just one case of how bugs are introduced in a software system. Finding what introduced a bug is not trivial: bugs can be introduced by the developers and be in the code, or be created irrespective of the code. Thus, further research towards a better understanding of the origin of bugs in software projects could help to improve design integration tests and to design other procedures to make software development more robust.

Romano2021 Alan Romano, Zihe Song, Sampath Grandhi, Wei Yang, and Weihang Wang: "An Empirical Analysis of UI-Based Flaky Tests". Proc. International Conference on Software Engineering (ICSE), 2021, 10.1109/icse43902.2021.00141.

Flaky tests have gained attention from the research community in recent years and with good reason. These tests lead to wasted time and resources, and they reduce the reliability of the test suites and build systems they affect. However, most of the existing work on flaky tests focus exclusively on traditional unit tests. This work ignores UI tests that have larger input spaces and more diverse running conditions than traditional unit tests. In addition, UI tests tend to be more complex and resource-heavy, making them unsuited for detection techniques involving rerunning test suites multiple times. In this paper, we perform a study on flaky UI tests. We analyze 235 flaky UI test samples found in 62 projects from both web and Android environments. We identify the common underlying root causes of flakiness in the UI tests, the strategies used to manifest the flaky behavior, and the fixing strategies used to remedy flaky UI tests. The findings made in this work can provide a foundation for the development of detection and prevention techniques for flakiness arising in UI tests.

Russo2020 Daniel Russo and Klaas-Jan Stol: "Gender Differences in Personality Traits of Software Engineers". IEEE Transactions on Software Engineering, 2020, 10.1109/tse.2020.3003413.

There is a growing body of gender studies in software engineering to understand diversity and inclusion issues, as diversity is recognized to be a key issue to healthy teams and communities. A second factor often linked to team performance is personality, which has received far more attention. Very few studies, however, have focused on the intersection of these two fields. Hence, we set out to study gender differences in personality traits of software engineers. Through a survey study we collected personality data, using the HEXACO model, of 483 software engineers. The data were analyzed using a Bayesian independent sample t-test and network analysis. The results suggest that women score significantly higher in Openness to Experience, Honesty-Humility, and Emotionality than men. Further, men show higher psychopathic traits than women. Based on these findings, we develop a number of propositions that can guide future research.


Sambasivan2021 Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora M Aroyo: "'Everyone wants to do the model work, not the data work': Data Cascades in High-Stakes AI". Proc. Conference on Human Factors in Computing Systems (HFCS), 2021, 10.1145/3411764.3445518.

AI models are increasingly applied in high-stakes domains like health and conservation. Data quality carries an elevated significance in high-stakes AI due to its heightened downstream impact, impacting predictions like cancer detection, wildlife poaching, and loan allocations. Paradoxically, data is the most under-valued and de-glamorised aspect of AI. In this paper, we report on data practices in high-stakes AI, from interviews with 53 AI practitioners in India, East and West African countries, and USA. We define, identify, and present empirical evidence on Data Cascades—compounding events causing negative, downstream effects from data issues—triggered by conventional AI/ML practices that undervalue data quality. Data cascades are pervasive (92% prevalence), invisible, delayed, but often avoidable. We discuss HCI opportunities in designing and incentivizing data excellence as a first-class citizen of AI, resulting in safer and more robust systems for all.

Sarker2019 Farhana Sarker, Bogdan Vasilescu, Kelly Blincoe, and Vladimir Filkov: "Socio-Technical Work-Rate Increase Associates With Changes in Work Patterns in Online Projects". Proc. International Conference on Software Engineering (ICSE), 2019, 10.1109/icse.2019.00099.

Software developers work on a variety of tasks ranging from the technical, e.g., writing code, to the social, e.g., participating in issue resolution discussions. The amount of work developers perform per week (their work-rate) also varies and depends on project needs and developer schedules. Prior work has shown that while moderate levels of increased technical work and multitasking lead to higher productivity, beyond a certain threshold, they can lead to lowered performance. Here, we study how increases in the short-term work-rate along both the technical and social dimensions are associated with changes in developers' work patterns, in particular communication sentiment, technical productivity, and social productivity. We surveyed active and prolific developers on GitHub to understand the causes and impacts of increased work-rates. Guided by the responses, we developed regression models to study how communication and committing patterns change with increased work-rates and fit those models to large-scale data gathered from traces left by thousands of GitHub developers. From our survey and models, we find that most developers do experience work-rate-increase-related changes in behavior. Most notably, our models show that there is a sizable effect when developers comment much more than their average: the negative sentiment in their comments increases, suggesting an increased level of stress. Our models also show that committing patterns do not change with increased commenting, and vice versa, suggesting that technical and social activities tend not to be multitasked.

Shao2020 Shudi Shao, Zhengyi Qiu, Xiao Yu, Wei Yang, Guoliang Jin, Tao Xie, and Xintao Wu: "Database-Access Performance Antipatterns in Database-Backed Web Applications". Proc. International Conference on Software Maintenance and Evolution (ICSME), 2020, 10.1109/icsme46990.2020.00016.

Database-backed web applications are prone to performance bugs related to database accesses. While much work has been conducted on database-access antipatterns with some recent work focusing on performance impact, there still lacks a comprehensive view of database-access performance antipatterns in database-backed web applications. To date, no existing work systematically reports known antipatterns in the literature, and no existing work has studied database-access performance bugs in major types of web applications that access databases differently.To address this issue, we first summarize all known database-access performance antipatterns found through our literature survey, and we report all of them in this paper. We further collect database-access performance bugs from web applications that access databases through language-provided SQL interfaces, which have been largely ignored by recent work, to check how extensively the known antipatterns can cover these bugs. For bugs not covered by the known antipatterns, we extract new database-access performance antipatterns based on real-world performance bugs from such web applications. Our study in total reports 24 known and 10 new database-access performance antipatterns. Our results can guide future work to develop effective tool support for different types of web applications.

Sharma2021 Pankajeshwara Nand Sharma, Bastin Tony Roy Savarimuthu, and Nigel Stanger: "Extracting Rationale for Open Source Software Development Decisions—A Study of Python Email Archives". Proc. International Conference on Software Engineering (ICSE), 2021, 10.1109/icse43902.2021.00095.

A sound Decision-Making (DM) process is key to the successful governance of software projects. In many Open Source Software Development (OSSD) communities, DM processes lie buried amongst vast amounts of publicly available data. Hidden within this data lie the rationale for decisions that led to the evolution and maintenance of software products. While there have been some efforts to extract DM processes from publicly available data, the rationale behind 'how' the decisions are made have seldom been explored. Extracting the rationale for these decisions can facilitate transparency (by making them known), and also promote accountability on the part of decision-makers. This work bridges this gap by means of a large-scale study that unearths the rationale behind decisions from Python development email archives comprising about 1.5 million emails. This paper makes two main contributions. First, it makes a knowledge contribution by unearthing and presenting the rationale behind decisions made. Second, it makes a methodological contribution by presenting a heuristics-based rationale extraction system called Rationale Miner that employs multiple heuristics, and follows a data-driven, bottom-up approach to infer the rationale behind specific decisions (e.g., whether a new module is implemented based on core developer consensus or benevolent dictator's pronouncement). Our approach can be applied to extract rationale in other OSSD communities that have similar governance structures.

Shrestha2020 Nischal Shrestha, Colton Botta, Titus Barik, and Chris Parnin: "Here we go again: why is it difficult for developers to learn another programming language?". Proc. International Conference on Software Engineering (ICSE), 2020, 10.1145/3377811.3380352.

Once a programmer knows one language, they can leverage concepts and knowledge already learned, and easily pick up another programming language. But is that always the case? To understand if programmers have difficulty learning additional programming languages, we conducted an empirical study of Stack Overflow questions across 18 different programming languages. We hypothesized that previous knowledge could potentially interfere with learning a new programming language. From our inspection of 450 Stack Overflow questions, we found 276 instances of interference that occurred due to faulty assumptions originating from knowledge about a different language. To understand why these difficulties occurred, we conducted semi-structured interviews with 16 professional programmers. The interviews revealed that programmers make failed attempts to relate a new programming language with what they already know. Our findings inform design implications for technical authors, toolsmiths, and language designers, such as designing documentation and automated tools that reduce interference, anticipating uncommon language transitions during language design, and welcoming programmers not just into a language, but its entire ecosystem.

Simon2021 Simon and Juha Sorva: "How Concrete Should an Abstract Be?". Proc. Conference on Innovation and Technology in Computer Science Education (ITiCSE), 2021, 10.1145/3430665.3456342.

For many decades the abstract has served as a standalone summary of an academic publication, one that succinctly informs readers of what they might expect to find upon reading the paper. While some publication venues require abstracts to conform with a specified structure, many others, including ITiCSE, leave the structure entirely to the paper's authors. In this paper we report on the components identified in the abstracts of ITiCSE's full papers and working group reports. We examine the abstracts of all 1496 of these publications from 25 years of ITiCSE to determine what structural elements they employ. We also construct something of an ethos of computing education by compiling assertions from the introductions of many abstracts. We find, among other things, that very few abstracts include all of the components that are recommended in a structured abstract; that a number of abstracts consist of nothing but background; that nearly half of abstracts do not include any results; and that nearly five percent of abstracts include references, despite often not having an associated reference list. As an example from the ethos, we find that industry wants people with soft skills, and it is important that we teach our students these skills. Our analysis will guide future ITiCSE authors as they consider how to formulate their own abstracts.

Sobrinho2021 Elder Vicente de Paulo Sobrinho, Andrea De Lucia, and Marcelo de Almeida Maia: "A Systematic Literature Review on Bad Smells–5 W's: Which, When, What, Who, Where". IEEE Transactions on Software Engineering, 47(1), 2021, 10.1109/tse.2018.2880977.

Bad smells are sub-optimal code structures that may represent problems needing attention. We conduct an extensive literature review on bad smells relying on a large body of knowledge from 1990 to 2017. We show that some smells are much more studied in the literature than others, and also that some of them are intrinsically inter-related (which). We give a perspective on how the research has been driven across time (when). In particular, while the interest in duplicated code emerged before the reference publications by Fowler and Beck and by Brown et al., other types of bad smells only started to be studied after these seminal publications, with an increasing trend in the last decade. We analyzed aims, findings, and respective experimental settings, and observed that the variability of these elements may be responsible for some apparently contradictory findings on bad smells (what). Moreover, we could observe that, in general, papers tend to study different types of smells at once. However, only a small percentage of those papers actually investigate possible relations between the respective smells (co-studies), i.e., each smell tends to be studied in isolation. Despite of a few relations between some types of bad smells have been investigated, there are other possible relations for further investigation. We also report that authors have different levels of interest in the subject, some of them publishing sporadically and others continuously (who). We observed that scientific connections are ruled by a large ``small world'' connected graph among researchers and several small disconnected graphs. We also found that the communities studying duplicated code and other types of bad smells are largely separated. Finally, we observed that some venues are more likely to disseminate knowledge on Duplicate Code (which often is listed as a conference topic on its own), while others have a more balanced distribution among other smells (where). Finally, we provide a discussion on future directions for bad smell research.

SotoValero2021 César Soto-Valero, Nicolas Harrand, Martin Monperrus, and Benoit Baudry: "A comprehensive study of bloated dependencies in the Maven ecosystem". Empirical Software Engineering, 26(3), 2021, 10.1007/s10664-020-09914-8.

Build automation tools and package managers have a profound influence on software development. They facilitate the reuse of third-party libraries, support a clear separation between the application's code and its external dependencies, and automate several software development tasks. However, the wide adoption of these tools introduces new challenges related to dependency management. In this paper, we propose an original study of one such challenge: the emergence of bloated dependencies. Bloated dependencies are libraries that the build tool packages with the application's compiled code but that are actually not necessary to build and run the application. This phenomenon artificially grows the size of the built binary and increases maintenance effort. We propose a tool, called DepClean, to analyze the presence of bloated dependencies in Maven artifacts. We analyze 9,639 Java artifacts hosted on Maven Central, which include a total of 723,444 dependency relationships. Our key result is that 75.1% of the analyzed dependency relationships are bloated. In other words, it is feasible to reduce the number of dependencies of Maven artifacts up to 1/4 of its current count. We also perform a qualitative study with 30 notable open-source projects. Our results indicate that developers pay attention to their dependencies and are willing to remove bloated dependencies: 18/21 answered pull requests were accepted and merged by developers, removing 131 dependencies in total.

Spadini2019 Davide Spadini, Fabio Palomba, Tobias Baum, Stefan Hanenberg, Magiel Bruntink, and Alberto Bacchelli: "Test-Driven Code Review: An Empirical Study". Proc. International Conference on Software Engineering (ICSE), 2019, 10.1109/icse.2019.00110.

Test-Driven Code Review (TDR) is a code review practice in which a reviewer inspects a patch by examining the changed test code before the changed production code. Although this practice has been mentioned positively by practitioners in informal literature and interviews, there is no systematic knowledge of its effects, prevalence, problems, and advantages. In this paper, we aim at empirically understanding whether this practice has an effect on code review effectiveness and how developers' perceive TDR. We conduct (i) a controlled experiment with 93 developers that perform more than 150 reviews, and (ii) 9 semi-structured interviews and a survey with 103 respondents to gather information on how TDR is perceived. Key results from the experiment show that developers adopting TDR find the same proportion of defects in production code, but more in test code, at the expenses of fewer maintainability issues in production code. Furthermore, we found that most developers prefer to review production code as they deem it more critical and tests should follow from it. Moreover, general poor test code quality and no tool support hinder the adoption of TDR.

Spadini2020 Davide Spadini, Gül Çalikli, and Alberto Bacchelli: "Primers or reminders?: the effects of existing review comments on code review". Proc. International Conference on Software Engineering (ICSE), 2020, 10.1145/3377811.3380385.

In contemporary code review, the comments put by reviewers on a specific code change are immediately visible to the other reviewers involved. Could this visibility prime new reviewers' attention (due to the human's proneness to availability bias), thus biasing the code review outcome? In this study, we investigate this topic by conducting a controlled experiment with 85 developers who perform a code review and a psychological experiment. With the psychological experiment, we find that ≈70% of participants are prone to availability bias. However, when it comes to the code review, our experiment results show that participants are primed only when the existing code review comment is about a type of bug that is not normally considered; when this comment is visible, participants are more likely to find another occurrence of this type of bug. Moreover, this priming effect does not influence reviewers' likelihood of detecting other types of bugs. Our findings suggest that the current code review practice is effective because existing review comments about bugs in code changes are not negative primers, rather positive reminders for bugs that would otherwise be overlooked during code review. Data and materials:

Spasic2020 Mirko Spasić and Milena Vujovsević Janivcić: "Verification supported refactoring of embedded SQL". Software Quality Journal, 29(3), 2020, 10.1007/s11219-020-09517-y.

Improving code quality without changing its functionality, e.g., by refactoring or optimization, is an everyday programming activity. Good programming practice requires that each such change should be followed by a check if the change really preserves the code behavior. If such a check is performed by testing, it can be time consuming and still cannot guarantee the absence of differences in behavior between two versions of the code. Hence, tools that could automatically verify code equivalence would be of great help. An area that we are focused on is embedded sql programming. There are a number of approaches for dealing with equivalence of either pairs of imperative code fragments or pairs of sql statements. However, in database-driven applications, simultaneous changes (changes that include both sql and a host language code) are also present and important. Such changes can preserve the overall equivalence without preserving equivalence of these two parts considered separately. In this paper, we propose an automated approach for dealing with equivalence of programs after such changes, a problem that is hardly tackled in literature. Our approach uses our custom first-order logic modeling of sql queries that corresponds to imperative semantics. The approach generates equivalence conditions that can be efficiently checked using smt solvers or first-order logic provers. We implemented the proposed approach as a framework sqlav, which is publicly available and open source.

Spiegler2021 Simone V. Spiegler, Christoph Heinecke, and Stefan Wagner: "An empirical study on changing leadership in agile teams". Empirical Software Engineering, 26(3), 2021, 10.1007/s10664-021-09949-5.

An increasing number of companies aim to enable their development teams to work in an agile manner. When introducing agile teams, companies face several challenges. This paper explores the kind of leadership needed to support teams to work in an agile way. One theoretical agile leadership concept describes a Scrum Master who is supposed to empower the team to lead itself. Empirical findings on such a leadership role are controversial. We still have not understood how leadership unfolds in a team that is by definition self-organizing. Further exploration is needed to better understand leadership in agile teams. Our goal is to explore how leadership changes while the team matures using the example of the Scrum Master. Through a grounded theory study containing 75 practitioners from 11 divisions at the Robert Bosch GmbH we identified a set of nine leadership roles that are transferred from the Scrum Master to the Development Team while it matures. We uncovered that a leadership gap and a supportive internal team climate are enablers of the role transfer process, whereas role conflicts may diminish the role transfer. To make the Scrum Master change in a mature team, team members need to receive trust and freedom to take on a leadership role which was previously filled by the Scrum Master. We conclude with practical implications for managers, Product Owners, Development Teams and Scrum Masters which they can apply in real settings.

Stol2018 Klaas-Jan Stol and Brian Fitzgerald: "The ABC of Software Engineering Research". ACM Transactions on Software Engineering and Methodology, 27(3), 2018, 10.1145/3241743.

A variety of research methods and techniques are available to SE researchers, and while several overviews exist, there is consistency neither in the research methods covered nor in the terminology used. Furthermore, research is sometimes critically reviewed for characteristics inherent to the methods. We adopt a taxonomy from the social sciences, termed here the ABC framework for SE research, which offers a holistic view of eight archetypal research strategies. ABC refers to the research goal that strives for generalizability over Actors (A) and precise measurement of their Behavior (B), in a realistic Context (C). The ABC framework uses two dimensions widely considered to be key in research design: the level of obtrusiveness of the research and the generalizability of research findings. We discuss metaphors for each strategy and their inherent limitations and potential strengths. We illustrate these research strategies in two key SE domains, global software engineering and requirements engineering, and apply the framework on a sample of 75 articles. Finally, we discuss six ways in which the framework can advance SE research.

Stray2021 Viktoria Stray, Raluca Florea, and Lucas Paruch: "Exploring human factors of the agile software tester". Software Quality Journal, 2021, 10.1007/s11219-021-09561-2.

Although extensive research has been conducted on the characteristics of the agile developer, little attention has been given to the features of the software-testing role. This paper explores the human factors of the software testers working in agile projects through a qualitative study focusing on how these factors are perceived. We interviewed 22 agile software practitioners working in three international companies: 14 testers, five developers, and three designers. Additionally, we observed 11 meetings and daily work of 13 participants in one of the companies. Our findings show that the views on the human factors shaping the agile software tester's role were crystallized into seven traits, which the agile team members saw as central for the software-testing role: the ability to see the whole picture, good communication skills, detail-orientation, structuredness, creativeness, curiosity, and adaptability. The testers spent half their day communicating and learned how to mitigate the fact that they had to bring bad news to other project members. They also facilitated communication between the business side and development. Based on our results, we propose the seven traits as dimensions to consider for organizations recruiting agile software testers, as well as a reference for IT and non-IT professionals considering a software-testing career.


Tamburri2020 Damian Andrew Tamburri, Kelly Blincoe, Fabio Palomba, and Rick Kazman: "'The Canary in the Coal Mine…' A cautionary tale from the decline of SourceForge". Software: Practice and Experience, 50(10), 2020, 10.1002/spe.2874.

Forges are online collaborative platforms to support the development of distributed open source software. While once mighty keepers of open source vitality, software forges are rapidly becoming less and less relevant. For example, of the top 10 forges in 2011, only one survives today—SourceForge—the biggest of them all, but its numbers are dropping and its community is tenuous at best. Through mixed-methods research, this article chronicles and analyze the software practice and experiences of the project's history—in particular its architectural and community/organizational decisions. We discovered a number of suboptimal social and architectural decisions and circumstances that, may have led to SourceForge's demise. In addition, we found evidence suggesting that the impact of such decisions could have been monitored, reduced, and possibly avoided altogether. The use of sociotechnical insights needs to become a basic set of design and software/organization monitoring principles that tell a cautionary tale on what to measure and what not to do in the context of large-scale software forge and community design and management.

Tan2020a Xin Tan, Minghui Zhou, and Zeyu Sun: "A first look at good first issues on GitHub". Proc. European Software Engineering Conference/International Symposium on the Foundations of Software Engineering (ESEC/FSE), 2020, 10.1145/3368089.3409746.

Keeping a good influx of newcomers is critical for open source software projects' survival, while newcomers face many barriers to contributing to a project for the first time. To support newcomers onboarding, GitHub encourages projects to apply labels such as good first issue (GFI) to tag issues suitable for newcomers. However, many newcomers still fail to contribute even after many attempts, which not only reduces the enthusiasm of newcomers to contribute but makes the efforts of project members in vain. To better support the onboarding of newcomers, this paper reports a preliminary study on this mechanism from its application status, effect, problems, and best practices. By analyzing 9,368 GFIs from 816 popular GitHub projects and conducting email surveys with newcomers and project members, we obtain the following results. We find that more and more projects are applying this mechanism in the past decade, especially the popular projects. Compared to common issues, GFIs usually need more days to be solved. While some newcomers really join the projects through GFIs, almost half of GFIs are not solved by newcomers. We also discover a series of problems covering mechanism (e.g., inappropriate GFIs), project (e.g., insufficient GFIs) and newcomer (e.g., uneven skills) that makes this mechanism ineffective. We discover the practices that may address the problems, including identifying GFIs that have informative description and available support, and require limited scope and skill, etc. Newcomer onboarding is an important but challenging question in open source projects and our work enables a better understanding of GFI mechanism and its problems, as well as highlights ways in improving them.

Tan2020b Jie Tan, Daniel Feitosa, Paris Avgeriou, and Mircea Lungu: "Evolution of technical debt remediation in Python: A case study on the Apache Software Ecosystem". Journal of Software: Evolution and Process, 33(4), 2020, 10.1002/smr.2319.

In recent years, the evolution of software ecosystems and the detection of technical debt received significant attention by researchers from both industry and academia. While a few studies that analyze various aspects of technical debt evolution already exist, to the best of our knowledge, there is no large-scale study that focuses on the remediation of technical debt over time in Python projects—that is, one of the most popular programming languages at the moment. In this paper, we analyze the evolution of technical debt in 44 Python open-source software projects belonging to the Apache Software Foundation. We focus on the type and amount of technical debt that is paid back. The study required the mining of over 60K commits, detailed code analysis on 3.7K system versions, and the analysis of almost 43K fixed issues. The findings show that most of the repayment effort goes into testing, documentation, complexity, and duplication removal. Moreover, more than half of the Python technical debt is short term being repaid in less than 2 months. In particular, the observations that a minority of rules account for the majority of issues fixed and spent effort suggest that addressing those kinds of debt in the future is important for research and practice.

Tomasdottir2020 Kristín Fjóla Tómasdóttir, Maur'icio Aniche, and Arie van Deursen: "The Adoption of JavaScript Linters in Practice: A Case Study on ESLint". IEEE Transactions on Software Engineering, 46(8), 2020, 10.1109/tse.2018.2871058.

A linter is a static analysis tool that warns software developers about possible code errors or violations to coding standards. By using such a tool, errors can be surfaced early in the development process when they are cheaper to fix. For a linter to be successful, it is important to understand the needs and challenges of developers when using a linter. In this paper, we examine developers' perceptions on JavaScript linters. We study why and how developers use linters along with the challenges they face while using such tools. For this purpose we perform a case study on ESLint, the most popular JavaScript linter. We collect data with three different methods where we interviewed 15 developers from well-known open source projects, analyzed over 9,500 ESLint configuration files, and surveyed 337 developers from the JavaScript community. Our results provide practitioners with reasons for using linters in their JavaScript projects as well as several configuration strategies and their advantages. We also provide a list of linter rules that are often enabled and disabled, which can be interpreted as the most important rules to reason about when configuring linters. Finally, we propose several feature suggestions for tool makers and future work for researchers.

Turcotte2020 Alexi Turcotte, Aviral Goel, Filip Křikava, and Jan Vitek: "Designing types for R, empirically". Proceedings of the ACM on Programming Languages, 4(OOPSLA), 2020, 10.1145/3428249.

The R programming language is widely used in a variety of domains. It was designed to favor an interactive style of programming with minimal syntactic and conceptual overhead. This design is well suited to data analysis, but a bad fit for tools such as compilers or program analyzers. In particular, R has no type annotations, and all operations are dynamically checked at run-time. The starting point for our work are the two questions: what expressive power is needed to accurately type R code? and which type system is the R community willing to adopt? Both questions are difficult to answer without actually experimenting with a type system. The goal of this paper is to provide data that can feed into that design process. To this end, we perform a large corpus analysis to gain insights in the degree of polymorphism exhibited by idiomatic R code and explore potential benefits that the R community could accrue from a simple type system. As a starting point, we infer type signatures for 25,215 functions from 412 packages among the most widely used open source R libraries. We then conduct an evaluation on 8,694 clients of these packages, as well as on end-user code from the Kaggle data science competition website.


Uesbeck2020 P. Merlin Uesbeck, Cole S. Peterson, Bonita Sharif, and Andreas Stefik: "A randomized controlled trial on the effects of embedded computer language switching". Proc. European Software Engineering Conference/International Symposium on the Foundations of Software Engineering (ESEC/FSE), 2020, 10.1145/3368089.3409701.

Polyglot programming, the use of multiple programming languages during the development process, is common practice in modern software development. This study investigates this practice through a randomized controlled trial conducted under the context of database programming. Participants in the study were given coding tasks written in Java and one of three SQL-like embedded languages. One was plain SQL in strings, one was in Java only, and the third was a hybrid embedded language that was closer to the host language. We recorded 109 valid data points. Results showed significant differences in how developers of different experience levels code using polyglot techniques. Notably, less experienced programmers wrote correct programs faster in the hybrid condition (frequent, but less severe, switches), while more experienced developers that already knew both languages performed better in traditional SQL (less frequent but more complete switches). The results indicate that the productivity impact of polyglot programming is complex and experience level dependent.


Venigalla2021 Akhila Sri Manasa Venigalla and Sridhar Chimalakonda: "On the comprehension of application programming interface usability in game engines". Software: Practice and Experience, 51(8), 2021, 10.1002/spe.2985.

Extensive development of games for various purposes including education and entertainment has resulted in increased development of game engines. Game engines are being used on a large scale as they support and simplify game development to a greater extent. Game developers using game engines are often compelled to use various application programming interfaces (APIs) of game engines in the process of game development. Thus, both quality and ease of development of games are greatly influenced by APIs defined in game engines. Hence, understanding API usability in game engines could greatly help in choosing better game engines among the ones that are available for game development and also could help developers in designing better game engines. In this article, we thus aim to evaluate API usability of 95 publicly available game engine repositories on GitHub, written primarily in C++ programming language. We test API usability of these game engines against the eight structural API usability metrics—AMNOI, AMNCI, AMGI, APXI, APLCI, AESI, ATSI, and ADI. We see this research as a first step toward the direction of improving usability of APIs in game engines. We present the results of the study, which indicate that about 25% of the game engines considered have minimal API usability, with respect to the considered metrics. It was observed that none of the considered repositories have ideal (all metric scores equal to 1) API usability, indicating the need for developers to consider API usability metrics while designing game engines.


Weintrop2017 David Weintrop and Uri Wilensky: "Comparing Block-Based and Text-Based Programming in High School Computer Science Classrooms". ACM Transactions on Computing Education, 18(1), 2017, 10.1145/3089799.

The number of students taking high school computer science classes is growing. Increasingly, these students are learning with graphical, block-based programming environments either in place of or prior to traditional text-based programming languages. Despite their growing use in formal settings, relatively little empirical work has been done to understand the impacts of using block-based programming environments in high school classrooms. In this article, we present the results of a 5-week, quasi-experimental study comparing isomorphic block-based and text-based programming environments in an introductory high school programming class. The findings from this study show students in both conditions improved their scores between pre- and postassessments; however, students in the blocks condition showed greater learning gains and a higher level of interest in future computing courses. Students in the text condition viewed their programming experience as more similar to what professional programmers do and as more effective at improving their programming ability. No difference was found between students in the two conditions with respect to confidence or enjoyment. The implications of these findings with respect to pedagogy and design are discussed, along with directions for future work.

Weir2021 Charles Weir, Ingolf Becker, and Lynne Blair: "A Passion for Security: Intervening to Help Software Developers". Proc. International Conference on Software Engineering (ICSE), 2021, 10.1109/icse-seip52600.2021.00011.

While the techniques to achieve secure, privacy-preserving software are now well understood, evidence shows that many software development teams do not use them: they lack the 'security maturity' to assess security needs and decide on appropriate tools and processes; and they lack the ability to negotiate with product management for the required resources. This paper describes a measuring approach to assess twelve aspects of this security maturity; its use to assess the impact of a lightweight package of workshops designed to increase security maturity; and a novel approach within that package to support developers in resource negotiation. Based on trials in eight organizations, involving over 80 developers, this paper demonstrates that (1) development teams can notably improve their security maturity even in the absence of security specialists; and (2) suitably guided, developers can find effective ways to promote security to product management. Empowering developers to make their own decisions and promote security in this way offers a powerful grassroots approach to improving the security of software worldwide.



Yasmin2020 Jerin Yasmin, Yuan Tian, and Jinqiu Yang: "A First Look at the Deprecation of RESTful APIs: An Empirical Study". Proc. International Conference on Software Maintenance and Evolution (ICSME), 2020, 10.1109/icsme46990.2020.00024.

REpresentational State Transfer (REST) is considered as one standard software architectural style to build web APIs that can integrate software systems over the internet. However, while connecting systems, RESTful APIs might also break the dependent applications that rely on their services when they introduce breaking changes, e.g., an older version of the API is no longer supported. To warn developers promptly and thus prevent critical impact on downstream applications, a deprecated-removed model should be followed, and deprecation-related information such as alternative approaches should also be listed. While API deprecation analysis as a theme is not new, most existing work focuses on non-web APIs, such as the ones provided by Java and Android.To investigate RESTful API deprecation, we propose a framework called RADA (RESTful API Deprecation Analyzer). RADA is capable of automatically identifying deprecated API elements and analyzing impacted operations from an OpenAPI specification, a machine-readable profile for describing RESTful web service. We apply RADA on 2,224 OpenAPI specifications of 1,368 RESTful APIs collected from, the largest directory of OpenAPI specifications. Based on the data mined by RADA, we perform an empirical study to investigate how the deprecated-removed protocol is followed in RESTful APIs and characterize practices in RESTful API deprecation. The results of our study reveal several severe deprecation-related problems in existing RESTful APIs. Our implementation of RADA and detailed empirical results are publicly available for future intelligent tools that could automatically identify and migrate usage of deprecated RESTful API operations in client code.

Young2021 Jean-Gabriel Young, Amanda Casari, Katie McLaughlin, Milo Z. Trujillo, Laurent Hebert-Dufresne, and James P. Bagrow: "Which contributions count? Analysis of attribution in open source". Proc. International Conference on Mining Software Repositories (MSR), 2021, 10.1109/msr52588.2021.00036.

Open source software projects usually acknowledge contributions with text files, websites, and other idiosyncratic methods. These data sources are hard to mine, which is why contributorship is most frequently measured through changes to repositories, such as commits, pushes, or patches. Recently, some open source projects have taken to recording contributor actions with standardized systems; this opens up a unique opportunity to understand how community-generated notions of contributorship map onto codebases as the measure of contribution. Here, we characterize contributor acknowledgment models in open source by analyzing thousands of projects that use a model called All Contributors to acknowledge diverse contributions like outreach, finance, infrastructure, and community management. We analyze the life cycle of projects through this model's lens and contrast its representation of contributorship with the picture given by other methods of acknowledgment, including GitHub's top committers indicator and contributions derived from actions taken on the platform. We find that community-generated systems of contribution acknowledgment make work like idea generation or bug finding more visible, which generates a more extensive picture of collaboration. Further, we find that models requiring explicit attribution lead to more clearly defined boundaries around what is and is not a contribution.

Yu2021 Zhongxing Yu, Chenggang Bai, Lionel Seinturier, and Martin Monperrus: "Characterizing the Usage, Evolution and Impact of Java Annotations in Practice". IEEE Transactions on Software Engineering, 47(5), 2021, 10.1109/tse.2019.2910516.

Annotations have been formally introduced into Java since Java 5. Since then, annotations have been widely used by the Java community for different purposes, such as compiler guidance and runtime processing. Despite the ever-growing use, there is still limited empirical knowledge about the actual usage of annotations in practice, the changes made to annotations during software evolution, and the potential impact of annotations on code quality. To fill this gap, we perform the first large-scale empirical study about Java annotations on 1,094 notable open-source projects hosted on GitHub. Our study systematically investigates annotation usage, annotation evolution, and annotation impact, and generates 10 novel and important findings. We also present the implications of our findings, which shed light for developers, researchers, tool builders, and language or library designers in order to improve all facets of Java annotation engineering.


Zhang2021b Haoxiang Zhang, Shaowei Wang, Tse-Hsun Chen, Ying Zou, and Ahmed E. Hassan: "An Empirical Study of Obsolete Answers on Stack Overflow". IEEE Transactions on Software Engineering, 47(4), 2021, 10.1109/tse.2019.2906315.

Stack Overflow accumulates an enormous amount of software engineering knowledge. However, as time passes, certain knowledge in answers may become obsolete. Such obsolete answers, if not identified or documented clearly, may mislead answer seekers and cause unexpected problems (e.g., using an out-dated security protocol). In this paper, we investigate how the knowledge in answers becomes obsolete and identify the characteristics of such obsolete answers. We find that: 1) More than half of the obsolete answers (58.4 percent) were probably already obsolete when they were first posted. 2) When an obsolete answer is observed, only a small proportion (20.5 percent) of such answers are ever updated. 3) Answers to questions in certain tags (e.g., node.js, ajax, android, and objective-c) are more likely to become obsolete. Our findings suggest that Stack Overflow should develop mechanisms to encourage the whole community to maintain answers (to avoid obsolete answers) and answer seekers are encouraged to carefully go through all information (e.g., comments) in answer threads.

Zieris2020 Franz Zieris and Lutz Prechelt: "Explaining pair programming session dynamics from knowledge gaps". Proc. International Conference on Software Engineering (ICSE), 2020, 10.1145/3377811.3380925.

Background: Despite a lot of research on the effectiveness of Pair Programming (PP), the question when it is useful or less useful remains unsettled. Method: We analyze recordings of many industrial PP sessions with Grounded Theory Methodology and build on prior work that identified various phenomena related to within-session knowledge build-up and transfer. We validate our findings with practitioners. Result: We identify two fundamentally different types of required knowledge and explain how different constellations of knowledge gaps in these two respects lead to different session dynamics. Gaps in project-specific systems knowledge are more hampering than gaps in general programming knowledge and are dealt with first and foremost in a PP session. Conclusion: Partner constellations with complementary knowledge make PP a particularly effective practice. In PP sessions, differences in system understanding are more important than differences in general software development knowledge.