To Do




AbuHassan2020 Amjad AbuHassan, Mohammad Alshayeb, and Lahouari Ghouti: "Software smell detection techniques: A systematic literature review". Journal of Software: Evolution and Process, 33(3), 2020, 10.1002/smr.2320.

Software smells indicate design or code issues that might degrade the evolution and maintenance of software systems. Detecting and identifying these issues are challenging tasks. This paper explores, identifies, and analyzes the existing software smell detection techniques at design and code levels. We carried out a systematic literature review (SLR) to identify and collect 145 primary studies related to smell detection in software design and code. Based on these studies, we address several questions related to the analysis of the existing smell detection techniques in terms of abstraction level (design or code), targeted smells, used metrics, implementation, and validation. Our analysis identified several detection techniques categories. We observed that 57% of the studies did not use any performance measures, 41% of them omitted details on the targeted programing language, and the detection techniques were not validated in 14% of these studies. With respect to the abstraction level, only 18% of the studies addressed bad smell detection at the design level. This low coverage urges for more focus on bad smell detection at the design level to handle them at early stages. Finally, our SLR brings to the attention of the research community several opportunities for future research.

Aghajani2019 Emad Aghajani, Csaba Nagy, Olga Lucero Vega-Marquez, Mario Linares-Vasquez, Laura Moreno, Gabriele Bavota, and Michele Lanza: "Software Documentation Issues Unveiled". 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), 10.1109/icse.2019.00122.

(Good) Software documentation provides developers and users with a description of what a software system does, how it operates, and how it should be used. For example, technical documentation (e.g., an API reference guide) aids developers during evolution/maintenance activities, while a user manual explains how users are to interact with a system. Despite its intrinsic value, the creation and the maintenance of documentation is often neglected, negatively impacting its quality and usefulness, ultimately leading to a generally unfavourable take on documentation. Previous studies investigating documentation issues have been based on surveying developers, which naturally leads to a somewhat biased view of problems affecting documentation. We present a large scale empirical study, where we mined, analyzed, and categorized 878 documentation-related artifacts stemming from four different sources, namely mailing lists, Stack Overflow discussions, issue repositories, and pull requests. The result is a detailed taxonomy of documentation issues from which we infer a series of actionable proposals both for researchers and practitioners.

Ajami2018 Shulamyt Ajami, Yonatan Woodbridge, and Dror G. Feitelson: "Syntax, predicates, idioms—what really affects code complexity?". Empirical Software Engineering, 24(1), 2018, 10.1007/s10664-018-9628-3.

Program comprehension concerns the ability to understand code written by others. But not all code is the same. We use an experimental platform fashioned as an online game-like environment to measure how quickly and accurately 220 professional programmers can interpret code snippets with similar functionality but different structures; snippets that take longer to understand or produce more errors are considered harder. The results indicate, inter alia, that for loops are significantly harder than if s, that some but not all negations make a predicate harder, and that loops counting down are slightly harder than loops counting up. This demonstrates how the effect of syntactic structures, different ways to express predicates, and the use of known idioms can be measured empirically, and that syntactic structures are not necessarily the most important factor. We also found that the metrics of time to understanding and errors made are not necessarily equivalent. Thus loops counting down took slightly longer, but loops with unusual bounds caused many more errors. By amassing many more empirical results like these it may be possible to derive better code complexity metrics than we have today, and also to better appreciate their limitations.

Alkhabaz2021 Ridha Alkhabaz, Seth Poulsen, Mei Chen, and Abdussalam Alawini: "Insights from Student Solutions to MongoDB Homework Problems". Proceedings of the 26th ACM Conference on Innovation and Technology in Computer Science Education, 10.1145/3430665.3456308.

We analyze submissions for homework assignments of 527 students in an upper-level database course offered at the University of Illinois at Urbana-Champaign. The ability to query databases is becoming a crucial skill for technology professionals and academics. Although we observe a large demand for teaching database skills, there is little research on database education. Also, despite the industry's continued demand for NoSQL databases, we have virtually no research on the matter of how students learn NoSQL databases, such as MongoDB. In this paper, we offer an in-depth analysis of errors committed by students working on MongoDB homework assignments over the course of two semesters. We show that as students use more advanced MongoDB operators, they make more Reference errors. Additionally, when students face a new functionality of MongoDB operators, such as $group operator, they usually take time to understand it but do not make the same errors again in later problems. Finally, our analysis suggests that students struggle with advanced concepts for a comparable amount of time. Our results suggest that instructors should allocate more time and effort for the discussed topics in our paper.

AlSubaihin2021 Afnan A. Al-Subaihin, Federica Sarro, Sue Black, Licia Capra, and Mark Harman: "App Store Effects on Software Engineering Practices". IEEE Transactions on Software Engineering, 47(2), 2021, 10.1109/tse.2019.2891715.

In this paper, we study the app store as a phenomenon from the developers' perspective to investigate the extent to which app stores affect software engineering tasks. Through developer interviews and questionnaires, we uncover findings that highlight and quantify the effects of three high-level app store themes: bridging the gap between developers and users, increasing market transparency and affecting mobile release management. Our findings have implications for testing, requirements engineering and mining software repositories research fields. These findings can help guide future research in supporting mobile app developers through a deeper understanding of the app store-developer interaction.

Ames2018 Morgan G. Ames: "Hackers, Computers, and Cooperation: A Critical History of Logo and Constructionist Learning". Proceedings of the ACM on Human-Computer Interaction, 2(CSCW), 2018, 10.1145/3274287.

This paper examines the history of the learning theory "constructionism" and its most well-known implementation, Logo, to examine beliefs involving both "C's" in CSCW: computers and cooperation. Tracing the tumultuous history of one of the first examples of computer-supported cooperative learning (CSCL) allows us to question some present-day assumptions regarding the universal appeal of learning to program computers that undergirds popular CSCL initiatives today, including the Scratch programming environment and the "FabLab" makerspace movement. Furthermore, teasing out the individualistic and anti-authority threads in this project and its links to present day narratives of technology development exposes the deeply atomized and even oppositional notions of collaboration in these projects and others under the auspices of CSCW today that draw on early notions of 'hacker culture.' These notions tend to favor a limited view of work, learning, and practice-an invisible constraint that continues to inform how we build and evaluate CSCW technologies.


Bafatakis2019 Nikolaos Bafatakis, Niels Boecker, Wenjie Boon, Martin Cabello Salazar, Jens Krinke, Gazi Oznacar, and Robert White: "Python Coding Style Compliance on Stack Overflow". 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), 10.1109/msr.2019.00042.

Software developers all over the world use Stack Overflow (SO) to interact and exchange code snippets. Research also uses SO to harvest code snippets for use with recommendation systems. However, previous work has shown that code on SO may have quality issues, such as security or license problems. We analyse Python code on SO to determine its coding style compliance. From 1,962,535 code snippets tagged with 'python', we extracted 407,097 snippets of at least 6 statements of Python code. Surprisingly, 93.87% of the extracted snippets contain style violations, with an average of 0.7 violations per statement and a huge number of snippets with a considerably higher ratio. Researchers and developers should, therefore, be aware that code snippets on SO may not representative of good coding style. Furthermore, while user reputation seems to be unrelated to coding style compliance, for posts with vote scores in the range between -10 and 20, we found a strong correlation ($r = -0.87$, $p < 10^-7$) between the vote score a post received and the average number of violations per statement for snippets in such posts.

Balali2018 Sogol Balali, Igor Steinmacher, Umayal Annamalai, Anita Sarma, and Marco Aurelio Gerosa: "Newcomers' Barriers… Is That All? An Analysis of Mentors' and Newcomers' Barriers in OSS Projects". Computer Supported Cooperative Work (CSCW), 27(3-6), 2018, 10.1007/s10606-018-9310-8.

Newcomers' seamless onboarding is important for open collaboration communities, particularly those that leverage outsiders' contributions to remain sustainable. Nevertheless, previous work shows that OSS newcomers often face several barriers to contribute, which lead them to lose motivation and even give up on contributing. A well-known way to help newcomers overcome initial contribution barriers is mentoring. This strategy has proven effective in offline and online communities, and to some extent has been employed in OSS projects. Studying mentors' perspectives on the barriers that newcomers face play a vital role in improving onboarding processes; yet, OSS mentors face their own barriers, which hinder the effectiveness of the strategy. Since little is known about the barriers mentors face, in this paper, we investigate the barriers that affect mentors and their newcomer mentees. We interviewed mentors from OSS projects and qualitatively analyzed their answers. We found 44 barriers: 19 that affect mentors; and 34 that affect newcomers (9 affect both newcomers and mentors). Interestingly, most of the barriers we identified (66%) have a social nature. Additionally, we identified 10 strategies that mentors indicated to potentially alleviate some of the barriers. Since gender-related challenges emerged in our analysis, we conducted nine follow-up structured interviews to further explore this perspective. The contributions of this paper include: identifying the barriers mentors face; bringing the unique perspective of mentors on barriers faced by newcomers; unveiling strategies that can be used by mentors to support newcomers; and investigating gender-specific challenges in OSS mentorship. Mentors, newcomers, online communities, and educators can leverage this knowledge to foster new contributors to OSS projects.

Baltes2020 Sebastian Baltes, George Park, and Alexander Serebrenik: "Is 40 the New 60? How Popular Media Portrays the Employability of Older Software Developers". IEEE Software, 37(6), 2020, 10.1109/ms.2020.3014178.

We studied the public discourse around age and software development, focusing on the United States. This work was designed to build awareness among decision makers in software projects to help them anticipate and mitigate challenges that their older employees may face.

Bao2021 Lingfeng Bao, Xin Xia, David Lo, and Gail C. Murphy: "A Large Scale Study of Long-Time Contributor Prediction for GitHub Projects". IEEE Transactions on Software Engineering, 47(6), 2021, 10.1109/tse.2019.2918536.

The continuous contributions made by long time contributors (LTCs) are a key factor enabling open source software (OSS) projects to be successful and survival. We study Github as it has a large number of OSS projects and millions of contributors, which enables the study of the transition from newcomers to LTCs. In this paper, we investigate whether we can effectively predict newcomers in OSS projects to be LTCs based on their activity data that is collected from Github. We collect Github data from GHTorrent, a mirror of Github data. We select the most popular 917 projects, which contain 75,046 contributors. We determine a developer as a LTC of a project if the time interval between his/her first and last commit in the project is larger than a certain time $T$. In our experiment, we use three different settings on the time interval: 1, 2, and 3 years. There are 9,238, 3,968, and 1,577 contributors who become LTCs of a project in three settings of time interval, respectively. To build a prediction model, we extract many features from the activities of developers on Github, which group into five dimensions: developer profile, repository profile, developer monthly activity, repository monthly activity, and collaboration network. We apply several classifiers including naive Bayes, SVM, decision tree, kNN and random forest. We find that random forest classifier achieves the best performance with AUCs of more than 0.75 in all three settings of time interval for LTCs. We also investigate the most important features that differentiate newcomers who become LTCs from newcomers who stay in the projects for a short time. We find that the number of followers is the most important feature in all three settings of the time interval studied. We also find that the programming language and the average number of commits contributed by other developers when a newcomer joins a project also belong to the top 10 most important features in all three settings of time interval for LTCs. Finally, we provide several implications for action based on our analysis results to help OSS projects retain newcomers.

Barke2019 Helena Barke and Lutz Prechelt: "Role clarity deficiencies can wreck agile teams". PeerJ Computer Science, 5, 2019, 10.7717/peerj-cs.241.

Background One of the twelve agile principles is to build projects around motivated individuals and trust them to get the job done. Such agile teams must self-organize, but this involves conflict, making self-organization difficult. One area of difficulty is agreeing on everybody's role. Background What dynamics arise in a self-organizing team from the negotiation of everybody's role? Method We conceptualize observations from five agile teams (work observations, interviews) by Charmazian Grounded Theory Methodology. Results We define role as something transient and implicit, not fixed and named. The roles are characterized by the responsibilities and expectations of each team member. Every team member must understand and accept their own roles (Local role clarity) and everbody else's roles (Team-wide role clarity). Role clarity allows a team to work smoothly and effectively and to develop its members' skills fast. Lack of role clarity creates friction that not only hampers the day-to-day work, but also appears to lead to high employee turnover. Agile coaches are critical to create and maintain role clarity. Conclusions Agile teams should pay close attention to the levels of Local role clarity of each member and Team-wide role clarity overall, because role clarity deficits are highly detrimental.

Bi2021 Tingting Bi, Wei Ding, Peng Liang, and Antony Tang: "Architecture information communication in two OSS projects: The why, who, when, and what". Journal of Systems and Software, 181, 2021, 10.1016/j.jss.2021.111035.

Architecture information is vital for Open Source Software (OSS) development, and mailing list is one of the widely used channels for developers to share and communicate architecture information. This work investigates the nature of architecture information communication (i.e., why, who, when, and what) by OSS developers via developer mailing lists. We employed a multiple case study approach to extract and analyze the architecture information communication from the developer mailing lists of two OSS projects, ArgoUML and Hibernate, during their development life-cycle of over 18 years. Our main findings are: (a) architecture negotiation and interpretation are the two main reasons (i.e., why) of architecture communication; (b) the amount of architecture information communicated in developer mailing lists decreases after the first stable release (i.e., when); (c) architecture communications centered around a few core developers (i.e., who); (d) and the most frequently communicated architecture elements (i.e., what) are Architecture Rationale and Architecture Model. There are a few similarities of architecture communication between the two OSS projects. Such similarities point to how OSS developers naturally gravitate towards the four aspects of architecture communication in OSS development.

Blackwell2019 Alan F. Blackwell, Marian Petre, and Luke Church: "Fifty years of the psychology of programming". International Journal of Human-Computer Studies, 131, 2019, 10.1016/j.ijhcs.2019.06.009.

Abstract This paper reflects on the evolution (past, present and future) of the 'psychology of programming' over the 50 year period of this anniversary issue. The International Journal of Human-Computer Studies (IJHCS) has been a key venue for much seminal work in this field, including its first foundations, and we review the changing research concerns seen in publications over these five decades. We relate this thematic evolution to research taking place over the same period within more specialist communities, especially the Psychology of Programming Interest Group (PPIG), the Empirical Studies of Programming series (ESP), and the ongoing community in Visual Languages and Human-Centric Computing (VL/HCC). Many other communities have interacted with psychology of programming, both influenced by research published within the specialist groups, and in turn influencing research priorities. We end with an overview of the core theories that have been developed over this period, as an introductory resource for new researchers, and also with the authors' own analysis of key priorities for future research.

Bogart2021 Chris Bogart, Christian Kästner, James Herbsleb, and Ferdian Thung: "When and How to Make Breaking Changes". ACM Transactions on Software Engineering and Methodology, 30(4), 2021, 10.1145/3447245.

Open source software projects often rely on package management systems that help projects discover, incorporate, and maintain dependencies on other packages, maintained by other people. Such systems save a great deal of effort over ad hoc ways of advertising, packaging, and transmitting useful libraries, but coordination among project teams is still needed when one package makes a breaking change affecting other packages. Ecosystems differ in their approaches to breaking changes, and there is no general theory to explain the relationships between features, behavioral norms, ecosystem outcomes, and motivating values. We address this through two empirical studies. In an interview case study, we contrast Eclipse, NPM, and CRAN, demonstrating that these different norms for coordination of breaking changes shift the costs of using and maintaining the software among stakeholders, appropriate to each ecosystem’s mission. In a second study, we combine a survey, repository mining, and document analysis to broaden and systematize these observations across 18 ecosystems. We find that all ecosystems share values such as stability and compatibility, but differ in other values. Ecosystems’ practices often support their espoused values, but in surprisingly diverse ways. The data provides counterevidence against easy generalizations about why ecosystem communities do what they do.

Brown2020 Chris Brown and Chris Parnin: "Understanding the impact of GitHub suggested changes on recommendations between developers". Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 10.1145/3368089.3409722.

Recommendations between colleagues are effective for encouraging developers to adopt better practices. Research shows these peer interactions are useful for improving developer behaviors, or the adoption of activities to help software engineers complete programming tasks. However, in-person recommendations between developers in the workplace are declining. One form of online recommendations between developers are pull requests, which allow users to propose code changes and provide feedback on contributions. GitHub, a popular code hosting platform, recently introduced the suggested changes feature, which allows users to recommend improvements for pull requests. To better understand this feature and its impact on recommendations between developers, we report an empirical study of this system, measuring usage, effectiveness, and perception. Our results show that suggested changes support code review activities and significantly impact the timing and communication between developers on pull requests. This work provides insight into the suggested changes feature and implications for improving future systems for automated developer recommendations, such as providing situated, concise, and actionable feedback.

Butler2019 Simon Butler, Jonas Gamalielsson, Bjorn Lundell, Christoffer Brax, Johan Sjoberg, Anders Mattsson, Tomas Gustavsson, Jonas Feist, and Erik Lonroth: "On Company Contributions to Community Open Source Software Projects". IEEE Transactions on Software Engineering, 2019, 10.1109/tse.2019.2919305.

The majority of contributions to community open source software (OSS) projects are made by practitioners acting on behalf of companies and other organisations. Previous research has addressed the motivations of both individuals and companies to engage with OSS projects. However, limited research has been undertaken that examines and explains the practical mechanisms or work practices used by companies and their developers to pursue their commercial and technical objectives when engaging with OSS projects. This research investigates the variety of work practices used in public communication channels by company contributors to engage with and contribute to eight community OSS projects. Through interviews with contributors to the eight projects we draw on their experiences and insights to explore the motivations to use particular methods of contribution. We find that companies utilise work practices for contributing to community projects which are congruent with the circumstances and their capabilities that support their short- and long-term needs. We also find that companies contribute to community OSS projects in ways that may not always be apparent from public sources, such as employing core project developers, making donations, and joining project steering committees in order to advance strategic interests. The factors influencing contributor work practices can be complex and are often dynamic arising from considerations such as company and project structure, as well as technical concerns and commercial strategies. The business context in which software created by the OSS project is deployed is also found to influence contributor work practices.


Cabral2007 Bruno Cabral and Paulo Marques: "Exception Handling: A Field Study in Java and .NET". European Conference on Object-Oriented Programming (ECOOP 2007), 10.1007/978-3-540-73589-2_8.

Most modern programming languages rely on exceptions for dealing with abnormal situations. Although exception handling was a significant improvement over other mechanisms like checking return codes, it is far from perfect. In fact, it can be argued that this mechanism is seriously limited, if not, flawed. This paper aims to contribute to the discussion by providing quantitative measures on how programmers are currently using exception handling. We examined 32 different applications, both for Java and .NET. The major conclusion for this work is that exceptions are not being correctly used as an error recovery mechanism. Exception handlers are not specialized enough for allowing recovery and, typically, programmers just do one of the following actions: logging, user notification and application termination. To our knowledge, this is the most comprehensive study done on exception handling to date, providing a quantitative measure useful for guiding the development of new error handling mechanisms.

Catolino2019 Gemma Catolino, Fabio Palomba, Damian A. Tamburri, Alexander Serebrenik, and Filomena Ferrucci: "Gender Diversity and Women in Software Teams: How Do They Affect Community Smells?". 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Society (ICSE-SEIS), 10.1109/icse-seis.2019.00010.

As social as software engineers are, there is a known and established gender imbalance in our community structures, regardless of their open-or closed-source nature. To shed light on the actual benefits of achieving such balance, this empirical study looks into the relations between such balance and the occurrence of community smells, that is, sub-optimal circumstances and patterns across the software organizational structure. Examples of community smells are Organizational Silo effects (overly disconnected sub-groups) or Lone Wolves (defiant community members). Results indicate that the presence of women generally reduces the amount of community smells. We conclude that women are instrumental to reducing community smells in software development teams.

Chatterjee2021 Preetha Chatterjee, Kostadin Damevski, Nicholas A. Kraft, and Lori Pollock: "Automatically Identifying the Quality of Developer Chats for Post Hoc Use". ACM Transactions on Software Engineering and Methodology, 30(4), 2021, 10.1145/3450503.

Software engineers are crowdsourcing answers to their everyday challenges on Q&A forums (e.g., Stack Overflow) and more recently in public chat communities such as Slack, IRC, and Gitter. Many software-related chat conversations contain valuable expert knowledge that is useful for both mining to improve programming support tools and for readers who did not participate in the original chat conversations. However, most chat platforms and communities do not contain built-in quality indicators (e.g., accepted answers, vote counts). Therefore, it is difficult to identify conversations that contain useful information for mining or reading, i.e., conversations of post hoc quality. In this article, we investigate automatically detecting developer conversations of post hoc quality from public chat channels. We first describe an analysis of 400 developer conversations that indicate potential characteristics of post hoc quality, followed by a machine learning-based approach for automatically identifying conversations of post hoc quality. Our evaluation of 2,000 annotated Slack conversations in four programming communities (python, clojure, elm, and racket) indicates that our approach can achieve precision of 0.82, recall of 0.90, F-measure of 0.86, and MCC of 0.57. To our knowledge, this is the first automated technique for detecting developer conversations of post hoc quality.

Chattopadhyay2020 Souti Chattopadhyay, Nicholas Nelson, Audrey Au, Natalia Morales, Christopher Sanchez, Rahul Pandita, and Anita Sarma: "A tale from the trenches: cognitive biases and software development". Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, 10.1145/3377811.3380330.

Cognitive biases are hard-wired behaviors that influence developer actions and can set them on an incorrect course of action, necessitating backtracking. While researchers have found that cognitive biases occur in development tasks in controlled lab studies, we still don't know how these biases affect developers' everyday behavior. Without such an understanding, development tools and practices remain inadequate. To close this gap, we conducted a 2-part field study to examine the extent to which cognitive biases occur, the consequences of these biases on developer behavior, and the practices and tools that developers use to deal with these biases. About 70% of observed actions that were reversed were associated with at least one cognitive bias. Further, even though developers recognized that biases frequently occur, they routinely are forced to deal with such issues with ad hoc processes and sub-optimal tool support. As one participant (IP12) lamented: There is no salvation!

Cogo2021 Filipe R. Cogo, Gustavo A. Oliva, Cor-Paul Bezemer, and Ahmed E. Hassan: "An empirical study of same-day releases of popular packages in the npm ecosystem". Empirical Software Engineering, 26(5), 2021, 10.1007/s10664-021-09980-6.

Within a software ecosystem, client packages can reuse provider packages as third-party libraries. The reuse relation between client and provider packages is called a dependency. When a client package depends on the code of a provider package, every change that is introduced in a release of the provider has the potential to impact the client package. Since a large number of dependencies exist within a software ecosystem, releases of a popular provider package can impact a large number of clients. Occasionally, multiple releases of a popular package need to be published on the same day, leading to a scenario in which the time available to revise, test, build, and document the release is restricted compared to releases published within a regular schedule. In this paper, our objective is to study the same-day releases that are published by popular packages in the npm ecosystem. We design an exploratory study to characterize the type of changes that are introduced in same-day releases, the prevalence of same-day releases in the npm ecosystem, and the adoption of same-day releases by client packages. A preliminary manual analysis of the existing release notes suggests that same-day releases introduce non-trivial changes (e.g., bug fixes). We then focus on three RQs. First, we study how often same-day releases are published. We found that the median proportion of regularly scheduled releases that are interrupted by a same-day release (per popular package) is 22%, suggesting the importance of having timely and systematic procedures to cope with same-day releases. Second, we study the performed code changes in same-day releases. We observe that 32% of the same-day releases have larger changes compared with their prior release, thus showing that some same-day releases can undergo significant maintenance activity despite their time-constrained nature. In our third RQ, we study how client packages react to same-day releases of their providers. We observe the vast majority of client packages that adopt the release preceding the same-day release would also adopt the latter without having to change their versioning statement (implicit updates). We also note that explicit adoptions of same-day releases (i.e., adoptions that require a change to the versioning statement of the provider in question) is significantly faster than the explicit adoption of regular releases. Based on our findings, we argue that (i) third-party tools that support the automation of dependency management (e.g., Dependabot) should consider explicitly flagging same-day releases, (ii) popular packages should strive for optimized release pipelines that can properly handle same-day releases, and (iii) future research should design scalable, ecosystem-ready tools that support provider packages in assessing the impact of their code changes on client packages.

Costa2019 Diego Elias Damasceno Costa, Cor-Paul Bezemer, Philip Leitner, and Artur Andrzejak: "What's Wrong With My Benchmark Results? Studying Bad Practices in JMH Benchmarks". IEEE Transactions on Software Engineering, 2019, 10.1109/tse.2019.2925345.

Microbenchmarking frameworks, such as Java's Microbenchmark Harness (JMH), allow developers to write fine-grained performance test suites at the method or statement level. However, due to the complexities of the Java Virtual Machine, developers often struggle with writing expressive JMH benchmarks which accurately represent the performance of such methods or statements. In this paper, we empirically study bad practices of JMH benchmarks. We present a tool that leverages static analysis to identify 5 bad JMH practices. Our empirical study of 123 open source Java-based systems shows that each of these 5 bad practices are prevalent in open source software. Further, we conduct several experiments to quantify the impact of each bad practice in multiple case studies, and find that bad practices often significantly impact the benchmark results. To validate our experimental results, we constructed seven patches that fix the identified bad practices for six of the studied open source projects, of which six were merged into the main branch of the project. In this paper, we show that developers struggle with accurate Java microbenchmarking, and provide several recommendations to developers of microbenchmarking frameworks on how to improve future versions of their framework.


Decan2021 Alexandre Decan and Tom Mens: "What Do Package Dependencies Tell Us About Semantic Versioning?". IEEE Transactions on Software Engineering, 47(6), 2021, 10.1109/tse.2019.2918315.

The semantic versioning (semver) policy is commonly accepted by open source package management systems to inform whether new releases of software packages introduce possibly backward incompatible changes. Maintainers depending on such packages can use this information to avoid or reduce the risk of breaking changes in their own packages by specifying version constraints on their dependencies. Depending on the amount of control a package maintainer desires to have over her package dependencies, these constraints can range from very permissive to very restrictive. This article empirically compares semver compliance of four software packaging ecosystems (Cargo, npm, Packagist and Rubygems), and studies how this compliance evolves over time. We explore to what extent ecosystem-specific characteristics or policies influence the degree of compliance. We also propose an evaluation based on the "wisdom of the crowds" principle to help package maintainers decide which type of version constraints they should impose on their dependencies.

DeOliveiraNeto2019 Francisco Gomes de Oliveira Neto, Richard Torkar, Robert Feldt, Lucas Gren, Carlo A. Furia, and Ziwei Huang: "Evolution of statistical analysis in empirical software engineering research: Current state and steps forward". Journal of Systems and Software, 156, 2019, 10.1016/j.jss.2019.07.002.

Software engineering research is evolving and papers are increasingly based on empirical data from a multitude of sources, using statistical tests to determine if and to what degree empirical evidence supports their hypotheses. To investigate the practices and trends of statistical analysis in empirical software engineering (ESE), this paper presents a review of a large pool of papers from top-ranked software engineering journals. First, we manually reviewed 161 papers and in the second phase of our method, we conducted a more extensive semi-automatic classification of papers spanning the years 2001--2015 and 5,196 papers. Results from both review steps was used to: i) identify and analyze the predominant practices in ESE (e.g., using t-test or ANOVA), as well as relevant trends in usage of specific statistical methods (e.g., nonparametric tests and effect size measures) and, ii) develop a conceptual model for a statistical analysis workflow with suggestions on how to apply different statistical methods as well as guidelines to avoid pitfalls. Lastly, we confirm existing claims that current ESE practices lack a standard to report practical significance of results. We illustrate how practical significance can be discussed in terms of both the statistical analysis and in the practitioner's context.

Dias2021 Edson Dias, Paulo Meirelles, Fernando Castor, Igor Steinmacher, Igor Wiese, and Gustavo Pinto: "What Makes a Great Maintainer of Open Source Projects?". 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), 10.1109/icse43902.2021.00093.

Although Open Source Software (OSS) maintainers devote a significant proportion of their work to coding tasks, great maintainers must excel in many other activities beyond coding. Maintainers should care about fostering a community, helping new members to find their place, while also saying "no" to patches that although are well-coded and well-tested, do not contribute to the goal of the project. To perform all these activities masterfully, maintainers should exercise attributes that software engineers (working on closed source projects) do not always need to master. This paper aims to uncover, relate, and prioritize the unique attributes that great OSS maintainers might have. To achieve this goal, we conducted 33 semi-structured interviews with well-experienced maintainers that are the gatekeepers of notable projects such as the Linux Kernel, the Debian operating system, and the GitLab coding platform. After we analyzed the interviews and curated a list of attributes, we created a conceptual framework to explain how these attributes are connected. We then conducted a rating survey with 90 OSS contributors. We noted that "technical excellence" and "communication" are the most recurring attributes. When grouped, these attributes fit into four broad categories: management, social, technical, and personality. While we noted that "sustain a long term vision of the project" and being "extremely careful" seem to form the basis of our framework, we noted through our survey that the communication attribute was perceived as the most essential one.

Durieux2020 Thomas Durieux, Claire Le Goues, Michael Hilton, and Rui Abreu: "Empirical Study of Restarted and Flaky Builds on Travis CI". Proceedings of the 17th International Conference on Mining Software Repositories, 10.1145/3379597.3387460.

Continuous Integration (CI) is a development practice where developers frequently integrate code into a common codebase. After the code is integrated, the CI server runs a test suite and other tools to produce a set of reports (e.g., the output of linters and tests). If the result of a CI test run is unexpected, developers have the option to manually restart the build, re-running the same test suite on the same code; this can reveal build flakiness, if the restarted build outcome differs from the original build. In this study, we analyze restarted builds, flaky builds, and their impact on the development workflow. We observe that developers restart at least 1.72% of builds, amounting to 56,522 restarted builds in our Travis CI dataset. We observe that more mature and more complex projects are more likely to include restarted builds. The restarted builds are mostly builds that are initially failing due to a test, network problem, or a Travis CI limitations such as execution timeout. Finally, we observe that restarted builds have an impact on development workflow. Indeed, in 54.42% of the restarted builds, the developers analyze and restart a build within an hour of the initial build execution. This suggests that developers wait for CI results, interrupting their workflow to address the issue. Restarted builds also slow down the merging of pull requests by a factor of three, bringing median merging time from 16h to 48h.



Fagerholm2017 Fabian Fagerholm, Marco Kuhrmann, and Jürgen Münch: "Guidelines for using empirical studies in software engineering education". PeerJ Computer Science, 3, 2017, 10.7717/peerj-cs.131.

Software engineering education is supposed to provide students with industry-relevant knowledge and skills. Educators must address issues beyond exercises and theories that can be directly rehearsed in small settings. A way to experience such efects and to increase the relevance of software engineering education is to apply empirical studies in teaching. In our article, we show how diferent types of empirical studies can be used for educational purposes in software engineering. We give examples illustrating how to utilize empirical studies, discuss challenges, and derive an initial guideline that supports teachers to include empirical studies in software engineering courses. This summary refers to the paper Guidelines for Using Empirical Studies in Software Engineering Education [FKM17]. This paper was published in the PeerJ Computer Science journal.

Farzat2021 Fabio de A. Farzat, Marcio de O. Barros, and Guilherme H. Travassos: "Evolving JavaScript Code to Reduce Load Time". IEEE Transactions on Software Engineering, 47(8), 2021, 10.1109/tse.2019.2928293.

JavaScript is one of the most used programming languages for front-end development of Web applications. The increase in complexity of front-end features brings concerns about performance, especially the load and execution time of JavaScript code. In this paper, we propose an evolutionary program improvement technique to reduce the size of JavaScript programs and, therefore, the time required to load and execute them in Web applications. To guide the development of this technique, we performed an experimental study to characterize the patches applied to JavaScript programs to reduce their size while keeping the functionality required to pass all test cases in their test suites. We applied this technique to 19 JavaScript programs varying from 92 to 15,602 LOC and observed reductions from 0.2 to 73.8 percent of the original code, as well as a relationship between the quality of a program’s test suite and the ability to reduce the size of its source code.

Feal2020 'Alvaro Feal, Paolo Calciati, Narseo Vallina-Rodriguez, Carmela Troncoso, and Alessandra Gorla: "Angel or Devil? A Privacy Study of Mobile Parental Control Apps". Proceedings on Privacy Enhancing Technologies, 2020(2), 2020, 10.2478/popets-2020-0029.

Android parental control applications are used by parents to monitor and limit their children’s mobile behaviour (e.g., mobile apps usage, web browsing, calling, and texting). In order to offer this service, parental control apps require privileged access to sys-tem resources and access to sensitive data. This may significantly reduce the dangers associated with kids’ online activities, but it raises important privacy con-cerns. These concerns have so far been overlooked by organizations providing recommendations regarding the use of parental control applications to the public. We conduct the first in-depth study of the Android parental control app’s ecosystem from a privacy and regulatory point of view. We exhaustively study 46 apps from 43 developers which have a combined 20M installs in the Google Play Store. Using a combination of static and dynamic analysis we find that: these apps are on average more permissions-hungry than the top 150 apps in the Google Play Store, and tend to request more dangerous permissions with new releases; 11% of the apps transmit personal data in the clear; 34% of the apps gather and send personal information without appropriate consent; and 72% of the apps share data with third parties (including online advertising and analytics services) without mentioning their presence in their privacy policies. In summary, parental control applications lack transparency and lack compliance with reg ulatory requirements. This holds even for those applications recommended by European and other national security centers.

Ferreira2021 Gabriel Ferreira, Limin Jia, Joshua Sunshine, and Christian Kästner: "Containing Malicious Package Updates in npm with a Lightweight Permission System". CoRR, abs/2103.05769, 2021.

The large amount of third-party packages available in fast-moving software ecosystems, such as Node.js/npm, enables attackers to compromise applications by pushing malicious updates to their package dependencies. Studying the npm repository, we observed that many packages in the npm repository that are used in Node.js applications perform only simple computations and do not need access to filesystem or network APIs. This offers the opportunity to enforce least-privilege design per package, protecting applications and package dependencies from malicious updates. We propose a lightweight permission system that protects Node.js applications by enforcing package permissions at runtime. We discuss the design space of solutions and show that our system makes a large number of packages much harder to be exploited, almost for free.

Foundjem2021 Armstrong Foundjem and Bram Adams: "Release synchronization in software ecosystems". Empirical Software Engineering, 26(3), 2021, 10.1007/s10664-020-09929-1.

Software ecosystems bring value by integrating software projects related to a given domain, such as Linux distributions integrating upstream open-source projects or the Android ecosystem for mobile Apps. Since each project within an ecosystem may potentially have its release cycle and roadmap, this creates an enormous burden for users who must expend the effort to identify and install compatible project releases from the ecosystem manually. Thus, many ecosystems, such as the Linux distributions, take it upon them to release a polished, well-integrated product to the end-user. However, the body of knowledge lacks empirical evidence about the coordination and synchronization efforts needed at the ecosystem level to ensure such federated releases. This paper empirically studies the strategies used to synchronize releases of ecosystem projects in the context of the OpenStack ecosystem, in which a central release team manages the six-month release cycle of the overall OpenStack ecosystem product. We use qualitative analysis on the release team's IRC-meeting logs that comprise two OpenStack releases (one-year long). Thus, we identified, cataloged, and documented ten major release synchronization activities, which we further validated through interviews with eight active OpenStack senior practitioners (members of either the release team or project teams). Our results suggest that even though an ecosystem's power lies in the interaction of inter-dependent projects, release synchronization remains a challenge for both the release team and the project teams. Moreover, we found evidence (and reasons) of multiple release strategies co-existing within a complex ecosystem.

Fucci2020 Davide Fucci, Giuseppe Scanniello, Simone Romano, and Natalia Juristo: "Need for Sleep: The Impact of a Night of Sleep Deprivation on Novice Developers' Performance". IEEE Transactions on Software Engineering, 46(1), 2020, 10.1109/tse.2018.2834900.

We present a quasi-experiment to investigate whether, and to what extent, sleep deprivation impacts the performance of novice software developers using the agile practice of test-first development (TFD). We recruited 45 undergraduates, and asked them to tackle a programming task. Among the participants, 23 agreed to stay awake the night before carrying out the task, while 22 slept normally. We analyzed the quality (i.e., the functional correctness) of the implementations delivered by the participants in both groups, their engagement in writing source code (i.e., the amount of activities performed in the IDE while tackling the programming task) and ability to apply TFD (i.e., the extent to which a participant is able to apply this practice). By comparing the two groups of participants, we found that a single night of sleep deprivation leads to a reduction of 50 percent in the quality of the implementations. There is notable evidence that the developers' engagement and their prowess to apply TFD are negatively impacted. Our results also show that sleep-deprived developers make more fixes to syntactic mistakes in the source code. We conclude that sleep deprivation has possibly disruptive effects on software development activities. The results open opportunities for improving developers' performance by integrating the study of sleep with other psycho-physiological factors in which the software engineering research community has recently taken an interest in.

Furia2019 Carlo Alberto Furia, Robert Feldt, and Richard Torkar: "Bayesian Data Analysis in Empirical Software Engineering Research". IEEE Transactions on Software Engineering, 2019, 10.1109/tse.2019.2935974.

Statistics comes in two main flavors: frequentist and Bayesian. For historical and technical reasons, frequentist statistics have traditionally dominated empirical data analysis, and certainly remain prevalent in empirical software engineering. This situation is unfortunate because frequentist statistics suffer from a number of shortcomings—such as lack of flexibility and results that are unintuitive and hard to interpret—that curtail their effectiveness when dealing with the heterogeneous data that is increasingly available for empirical analysis of software engineering practice. In this paper, we pinpoint these shortcomings, and present Bayesian data analysis techniques that work better on the same data—as they can provide clearer results that are simultaneously robust and nuanced. After a short, high-level introduction to the basic tools of Bayesian statistics, our presentation targets the reanalysis of two empirical studies targeting data about the effectiveness of automatically generated tests and the performance of programming languages. By contrasting the original frequentist analysis to our new Bayesian analysis, we demonstrate concrete advantages of using Bayesian techniques, and we advocate a prominent role for them in empirical software engineering research and practice.


Gao2020 Gao Gao, Finn Voichick, Michelle Ichinco, and Caitlin Kelleher: "Exploring Programmers' API Learning Processes: Collecting Web Resources as External Memory". 2020 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), 10.1109/vl/hcc50065.2020.9127274.

Modern programming frequently requires the use of APIs (Application Programming Interfaces). Yet many programmers struggle when trying to learn APIs. We ran an exploratory study in which we observed participants performing an API learning task. We analyze their processes using a proposed model of API learning, grounded in Cognitive Load Theory, Information Foraging Theory, and External Memory research. The results provide support for the model of API Learning and add new insights into the form and usage of external memory while learning APIs. Programmers quickly curated a set of API resources through Information Foraging which served as external memory and then primarily referred to these resources to meet information needs while coding.

Garcia2021 Boni García, Mario Munoz-Organero, Carlos Alario-Hoyos, and Carlos Delgado Kloos: "Automated driver management for Selenium WebDriver". Empirical Software Engineering, 26(5), 2021, 10.1007/s10664-021-09975-3.

Selenium WebDriver is a framework used to control web browsers automatically. It provides a cross-browser Application Programming Interface (API) for different languages (e.g., Java, Python, or JavaScript) that allows automatic navigation, user impersonation, and verification of web applications. Internally, Selenium WebDriver makes use of the native automation support of each browser. Hence, a platform-dependent binary file (the so-called driver) must be placed between the Selenium WebDriver script and the browser to support this native communication. The management (i.e., download, setup, and maintenance) of these drivers is cumbersome for practitioners. This paper provides a complete methodology to automate this management process. Particularly, we present WebDriverManager, the reference tool implementing this methodology. WebDriverManager provides different execution methods: as a Java dependency, as a Command-Line Interface (CLI) tool, as a server, as a Docker container, and as a Java agent. To provide empirical validation of the proposed approach, we surveyed the WebDriverManager users. The aim of this study is twofold. First, we assessed the extent to which WebDriverManager is adopted and used. Second, we evaluated the WebDriverManager API following Clarke's usability dimensions. A total of 148 participants worldwide completed this survey in 2020. The results show a remarkable assessment of the automation capabilities and API usability of WebDriverManager by Java users, but a scarce adoption for other languages.

Gerosa2021 Marco Gerosa, Igor Wiese, Bianca Trinkenreich, Georg Link, Gregorio Robles, Christoph Treude, Igor Steinmacher, and Anita Sarma: "The Shifting Sands of Motivation: Revisiting What Drives Contributors in Open Source". 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), 10.1109/icse43902.2021.00098.

Open Source Software (OSS) has changed drastically over the last decade, with OSS projects now producing a large ecosystem of popular products, involving industry participation, and providing professional career opportunities. But our field's understanding of what motivates people to contribute to OSS is still fundamentally grounded in studies from the early 2000s. With the changed landscape of OSS, it is very likely that motivations to join OSS have also evolved. Through a survey of 242 OSS contributors, we investigate shifts in motivation from three perspectives: (1) the impact of the new OSS landscape, (2) the impact of individuals' personal growth as they become part of OSS communities, and (3) the impact of differences in individuals' demographics. Our results show that some motivations related to social aspects and reputation increased in frequency and that some intrinsic and internalized motivations, such as learning and intellectual stimulation, are still highly relevant. We also found that contributing to OSS often transforms extrinsic motivations to intrinsic, and that while experienced contributors often shift toward altruism, novices often shift toward career, fun, kinship, and learning. OSS projects can leverage our results to revisit current strategies to attract and retain contributors, and researchers and tool builders can better support the design of new studies and tools to engage and support OSS development.

Glanz2020 Leonid Glanz, Patrick Müller, Lars Baumgärtner, Michael Reif, Sven Amann, Pauline Anthonysamy, and Mira Mezini: "Hidden in Plain Sight: Obfuscated Strings Threatening Your Privacy". Proceedings of the 15th ACM Asia Conference on Computer and Communications Security, 10.1145/3320269.3384745.

String obfuscation is an established technique used by proprietary, closed-source applications to protect intellectual property. Furthermore, it is also frequently used to hide spyware or malware in applications. In both cases, the techniques range from bit-manipulation over XOR operations to AES encryption. However, string obfuscation techniques/tools suffer from one shared weakness: They generally have to embed the necessary logic to deobfuscate strings into the app code. In this paper, we show that most of the string obfuscation techniques found in malicious and benign applications for Android can easily be broken in an automated fashion. We developed StringHound, an open-source tool that uses novel techniques that identify obfuscated strings and reconstruct the originals using slicing. We evaluated StringHound on both benign and malicious Android apps. In summary, we deobfuscate almost 30 times more obfuscated strings than other string deobfuscation tools. Additionally, we analyzed 100,000 Google Play Store apps and found multiple obfuscated strings that hide vulnerable cryptographic usages, insecure internet accesses, API keys, hard-coded passwords, and exploitation of privileges without the awareness of the developer. Furthermore, our analysis reveals that not only malware uses string obfuscation but also benign apps make extensive use of string obfuscation.

Gujral2021 Harshit Gujral, Sangeeta Lal, and Heng Li: "An exploratory semantic analysis of logging questions". Journal of Software: Evolution and Process, 33(7), 2021, 10.1002/smr.2361.

Logging is an integral part of software development. Software practitioners often face issues in software logging, and they post these issues on Q&A websites to take suggestions from the experts. In this study, we perform a three-level empirical analysis of logging questions posted on six popular technical Q&A websites, namely, Stack Overflow (SO), Serverfault (SF), Superuser (SU), Database Administrators (DB), Software Engineering (SE), and Android Enthusiasts (AE). The findings show that logging issues are prevalent across various domains, for example, database, networks, and mobile computing, and software practitioners from different domains face different logging issues. The semantic analysis of logging questions using Latent Dirichlet Allocation (LDA) reveals trends of several existing and new logging topics, such as logging conversion pattern, Android device logging, and database logging. In addition, we observe specific logging topics for each website: DB (log shipping and log file growing/shrinking), SU (event log and syslog configuration), SF (log analysis and syslog configuration), AE (app install and usage tracking), SE (client server logging and exception logging), and SO (log file creation/deletion, Android emulator logging, and logger class of Log4j). We obtain an increasing trend of logging topics on the SO, SU, and DB websites whereas a decreasing trend of logging topics on the SF website.


Hayashi2019 Junichi Hayashi, Yoshiki Higo, Shinsuke Matsumoto, and Shinji Kusumoto: "Impacts of Daylight Saving Time on Software Development". 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), 10.1109/msr.2019.00076.

Daylight saving time (DST) is observed in many countries and regions. DST is not considered on some software systems at the beginning of their developments, for example, software systems developed in regions where DST is not observed. However, such systems may have to consider DST at the requests of their users. Before now, there has been no study about the impacts of DST on software development. In this paper, we study the impacts of DST on software development by mining the repositories on GitHub. We analyze the date when the code related to DST is changed, and we analyze the regions where the developers applied the changes live. Furthermore, we classify the changes into some patterns.

Hoda2021 Rashina Hoda: "Socio-Technical Grounded Theory for Software Engineering". IEEE Transactions on Software Engineering, 2021, 10.1109/tse.2021.3106280.

Grounded Theory (GT), a sociological research method designed to study social phenomena, is increasingly being used to investigate the human and social aspects of software engineering (SE). However, being written by and for sociologists, GT is often challenging for a majority of SE researchers to understand and apply. Additionally, SE researchers attempting ad hoc adaptations of traditional GT guidelines for modern socio-technical (ST) contexts often struggle in the absence of clear and relevant guidelines to do so, resulting in poor quality studies. To overcome these research community challenges and leverage modern research opportunities, this paper presents Socio-Technical Grounded Theory (STGT) designed to ease application and achieve quality outcomes. It defines what exactly is meant by an ST research context and presents the STGT guidelines that expand GT’s philosophical foundations, provide increased clarity and flexibility in its methodological steps and procedures, define possible scope and contexts of application, encourage frequent reporting of a variety of interim, preliminary, and mature outcomes, and introduce nuanced evaluation guidelines for different outcomes. It is hoped that the SE research community and related ST disciplines such as computer science, data science, artificial intelligence, information systems, human computer/robot/AI interaction, human-centered emerging technologies (and increasingly other disciplines being transformed by rapid digitalisation and AI-based augmentation), will benefit from applying STGT to conduct quality research studies and systematically produce rich findings and mature theories with confidence.

Hora2021a Andre Hora: "Googling for Software Development: What Developers Search For and What They Find". 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), 10.1109/msr52588.2021.00044.

Developers often search for software resources on the web. In practice, instead of going directly to websites (e.g., Stack Overflow), they rely on search engines (e.g., Google). Despite this being a common activity, we are not yet aware of what developers search from the perspective of popular software development websites and what search results are returned. With this knowledge, we can understand real-world queries, developers' needs, and the query impact on the search results. In this paper, we provide an empirical study to understand what developers search on the web and what they find. We assess 1.3M queries to popular programming websites and we perform thousands of queries on Google to explore search results. We find that (i) developers' queries typically start with keywords (e.g., Python, Android, etc.), are short (3 words), tend to omit functional words, and are similar among each other; (ii) minor changes to queries do not largely affect the Google search results, however, some cosmetic changes may have a non-negligible impact; and (iii) search results are dominated by Stack Overflow, but YouTube is also a relevant source nowadays. We conclude by presenting detailed implications for researchers and developers.

Hoyos2021 Juan Hoyos, Rabe Abdalkareem, Suhaib Mujahid, Emad Shihab, and Albeiro Espinosa Bedoya: "On the Removal of Feature Toggles: A Study of Python Projects and Practitioners Motivations". Empirical Software Engineering, 26(2), 2021, 10.1007/s10664-020-09902-y.

Feature Toggling is a technique to control the execution of features in a software project. For example, practitioners using feature toggles can experiment with new features in a production environment by exposing them to a subset of users. Some of these toggles require additional maintainability efforts and are expected to be removed, whereas others are meant to remain for a long time. However, to date, very little is known about the removal of feature toggles, which is why we focus on this topic in our paper. We conduct an empirical study that focuses on the removal of feature toggles. We use source code analysis techniques to analyze 12 Python open source projects and surveyed 61 software practitioners to provide deeper insights on the topic. Our study shows that 75% of the toggle components in the studied Python projects are removed within 49 weeks after introduction. However, eventually practitioners remove feature toggles to follow the life cycle of a feature when it becomes stable in production. We also find that not all long-term feature toggles are designed to live that long and not all feature toggles are removed from the source code, opening the possibilities to unwanted risks. Our study broadens the understanding of feature toggles by identifying reasons for their survival in practice and aims to help practitioners make better decisions regarding the way they manage and remove feature toggles.



Jalote2021 Pankaj Jalote and Damodaram Kamma: "Studying Task Processes for Improving Programmer Productivity". IEEE Transactions on Software Engineering, 47(4), 2021, 10.1109/tse.2019.2904230.

Productivity of a software development organization can be enhanced by improving the software process, using better tools/technology, and enhancing the productivity of programmers. This work focuses on improving programmer productivity by studying the process used by a programmer for executing an assigned task, which we call the task process. We propose a general framework for studying the impact of task processes on programmer productivity and also the impact of transferring task processes of high-productivity programmers to average-productivity peers. We applied the framework to a few live projects in Robert Bosch Engineering and Business Solutions Limited, a CMMI Level 5 company. In each project, we identified two groups of programmers: high-productivity and average-productivity programmers. We requested each programmer to video capture their computer screen while executing his/her assigned tasks. We then analyzed these task videos to extract the task processes and then used them to identify the differences between the task processes used by the two groups. Some key differences were found between the task processes, which could account for the difference in productivities of the two groups. Similarities between the task processes were also analyzed quantitatively by modeling each task process as a Markov chain. We found that programmers from the same group used similar task processes, but the task processes of the two groups differed considerably. The task processes of high-productivity programmers were transferred to the average-productivity programmers by training them on the key steps missing in their process but commonly present in the work of their high-productivity peers. A substantial productivity gain was found in the average-productivity programmers as a result of this transfer. The study shows that task processes of programmers impact their productivity, and it is possible to improve the productivity of average-productivity programmers by transferring task processes from high-productivity programmers to them.

Johnson2019 John Johnson, Sergio Lubo, Nishitha Yedla, Jairo Aponte, and Bonita Sharif: "An Empirical Study Assessing Source Code Readability in Comprehension". 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME), 10.1109/icsme.2019.00085.

Software developers spend a significant amount of time reading source code. If code is not written with readability in mind, it impacts the time required to maintain it. In order to alleviate the time taken to read and understand code, it is important to consider how readable the code is. The general consensus is that source code should be written to minimize the time it takes for others to read and understand it. In this paper, we conduct a controlled experiment to assess two code readability rules: nesting and looping. We test 32 Java methods in four categories: ones that follow/do not follow the readability rule and that are correct/incorrect. The study was conducted online with 275 participants. The results indicate that minimizing nesting decreases the time a developer spends reading and understanding source code, increases confidence about the developer's understanding of the code, and also suggests that it improves their ability to find bugs. The results also show that avoiding the do-while statement had no significant impact on level of understanding, time spent reading and understanding, confidence in understanding, or ease of finding bugs. It was also found that the better knowledge of English a participant had, the more their readability and comprehension confidence ratings were affected by the minimize nesting rule. We discuss the implications of these findings for code readability and comprehension.

Johnson2021 Brittany Johnson, Thomas Zimmermann, and Christian Bird: "The Effect of Work Environments on Productivity and Satisfaction of Software Engineers". IEEE Transactions on Software Engineering, 47(4), 2021, 10.1109/tse.2019.2903053.

The physical work environment of software engineers can have various effects on their satisfaction and the ability to get the work done. To better understand the factors of the environment that affect productivity and satisfaction of software engineers, we explored different work environments at Microsoft. We used a mixed-methods, multiple stage research design with a total of 1,159 participants: two surveys with 297 and 843 responses respectively and interviews with 19 employees. We found several factors that were considered as important for work environments: personalization, social norms and signals, room composition and atmosphere, work-related environment affordances, work area and furniture, and productivity strategies. We built statistical models for satisfaction with the work environment and perceived productivity of software engineers and compared them to models for employees in the Program Management, IT Operations, Marketing, and Business Program & Operations disciplines. In the satisfaction models, the ability to work privately with no interruptions and the ability to communicate with the team and leads were important factors among all disciplines. In the productivity models, the overall satisfaction with the work environment and the ability to work privately with no interruptions were important factors among all disciplines. For software engineers, another important factor for perceived productivity was the ability to communicate with the team and leads. We found that private offices were linked to higher perceived productivity across all disciplines.

Jolak2020 Rodi Jolak, Maxime Savary-Leblanc, Manuela Dalibor, Andreas Wortmann, Regina Hebig, Juraj Vincur, Ivan Polasek, Xavier Le Pallec, Sébastien Gérard, and Michel R. V. Chaudron: "Software engineering whispers: The effect of textual vs. graphical software design descriptions on software design communication". Empirical Software Engineering, 25(6), 2020, 10.1007/s10664-020-09835-6.

Software engineering is a social and collaborative activity. Communicating and sharing knowledge between software developers requires much effort. Hence, the quality of communication plays an important role in influencing project success. To better understand the effect of communication on project success, more in-depth empirical studies investigating this phenomenon are needed. We investigate the effect of using a graphical versus textual design description on co-located software design communication. Therefore, we conducted a family of experiments involving a mix of 240 software engineering students from four universities. We examined how different design representations (i.e., graphical vs. textual) affect the ability to Explain, Understand, Recall, and Actively Communicate knowledge. We found that the graphical design description is better than the textual in promoting Active Discussion between developers and improving the Recall of design details. Furthermore, compared to its unaltered version, a well-organized and motivated textual design description—that is used for the same amount of time—enhances the recall of design details and increases the amount of active discussions at the cost of reducing the perceived quality of explaining.

Jones2020 Derek M. Jones: Evidence-based Software Engineering: based on the publicly available data. Knowledge Software, Ltd., 2020, 978-1-8382913-0-3.

This book discusses what is currently known about software engineering, based on an analysis of all the publicly available data. This aim is not as ambitious as it sounds, because there is not a great deal of data publicly available. The intent is to provide material that is useful to professional developers working in industry; until recently researchers in software engineering have been more interested in vanity work, promoted by ego and bluster. The material is organized in two parts, the first covering software engineering and the second the statistics likely to be needed for the analysis of software engineering data.


Kamienski2021 Arthur V. Kamienski, Luisa Palechor, Cor-Paul Bezemer, and Abram Hindle: "PySStuBs: Characterizing Single-Statement Bugs in Popular Open-Source Python Projects". 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), 10.1109/msr52588.2021.00066.

Single-statement bugs (SStuBs) can have a severe impact on developer productivity. Despite usually being simple and not offering much of a challenge to fix, these bugs may still disturb a developer's workflow and waste precious development time. However, few studies have paid attention to these simple bugs, focusing instead on bugs of any size and complexity. In this study, we explore the occurrence of SStuBs in some of the most popular open-source Python projects on GitHub, while also characterizing their patterns and distribution. We further compare these bugs to SStuBs found in a previous study on Java Maven projects. We find that these Python projects have different SStuB patterns than the ones in Java Maven projects and identify 7 new SStuB patterns. Our results may help uncover the importance of understanding these bugs for the Python programming language, and how developers can handle them more effectively.

Kavaler2019 David Kavaler, Asher Trockman, Bogdan Vasilescu, and Vladimir Filkov: "Tool Choice Matters: JavaScript Quality Assurance Tools and Usage Outcomes in GitHub Projects". 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), 10.1109/icse.2019.00060.

Quality assurance automation is essential in modern software development. In practice, this automation is supported by a multitude of tools that fit different needs and require developers to make decisions about which tool to choose in a given context. Data and analytics of the pros and cons can inform these decisions. Yet, in most cases, there is a dearth of empirical evidence on the effectiveness of existing practices and tool choices. We propose a general methodology to model the time-dependent effect of automation tool choice on four outcomes of interest: prevalence of issues, code churn, number of pull requests, and number of contributors, all with a multitude of controls. On a large data set of npm JavaScript projects, we extract the adoption events for popular tools in three task classes: linters, dependency managers, and coverage reporters. Using mixed methods approaches, we study the reasons for the adoptions and compare the adoption effects within each class, and sequential tool adoptions across classes. We find that some tools within each group are associated with more beneficial outcomes than others, providing an empirical perspective for the benefits of each. We also find that the order in which some tools are implemented is associated with varying outcomes.

Kim2021 Dong Jae Kim, Tse-Hsun Chen, and Jinqiu Yang: "The secret life of test smells—an empirical study on test smell evolution and maintenance". Empirical Software Engineering, 26(5), 2021, 10.1007/s10664-021-09969-1.

In recent years, researchers and practitioners have been studying the impact of test smells in test maintenance. However, there is still limited empirical evidence on why developers remove test smells in software maintenance and the mechanism employed for addressing test smells. In this paper, we conduct an empirical study on 12 real-world open-source systems to study the evolution and maintenance of test smells and how test smells are related to software quality. Results show that: 1) Although the number of test smell instances increases, test smell density decreases as systems evolve. 2) However, our qualitative analysis on those removed test smells reveals that most test smell removal (83%) is a by-product of feature maintenance activities. 45% of the removed test smells relocate to other test cases due to refactoring, while developers deliberately address the only 17% of test smells, consisting of largely Exception Catch/Throw and Sleepy Test. 3) Our statistical model shows that test smell metrics can provide additional explanatory power on post-release defects over traditional baseline metrics (an average of 8.25% increase in AUC). However, most types of test smells have a minimal effect on post-release defects. Our study provides insight into developers' perception of test smells and current practices. Future studies on test smells may consider focusing on the specific types of test smells that may have a higher correlation with defect-proneness when helping developers with test code maintenance.

Klotins2021 Eriks Klotins, Michael Unterkalmsteiner, Panagiota Chatzipetrou, Tony Gorschek, Rafael Prikladnicki, Nirnaya Tripathi, and Leandro Bento Pompermaier: "A Progression Model of Software Engineering Goals, Challenges, and Practices in Start-Ups". IEEE Transactions on Software Engineering, 47(3), 2021, 10.1109/tse.2019.2900213.

Context: Software start-ups are emerging as suppliers of innovation and software-intensive products. However, traditional software engineering practices are not evaluated in the context, nor adopted to goals and challenges of start-ups. As a result, there is insufficient support for software engineering in the start-up context. Objective: We aim to collect data related to engineering goals, challenges, and practices in start-up companies to ascertain trends and patterns characterizing engineering work in start-ups. Such data allows researchers to understand better how goals and challenges are related to practices. This understanding can then inform future studies aimed at designing solutions addressing those goals and challenges. Besides, these trends and patterns can be useful for practitioners to make more informed decisions in their engineering practice. Method: We use a case survey method to gather first-hand, in-depth experiences from a large sample of software start-ups. We use open coding and cross-case analysis to describe and identify patterns, and corroborate the findings with statistical analysis. Results: We analyze 84 start-up cases and identify 16 goals, 9 challenges, and 16 engineering practices that are common among start-ups. We have mapped these goals, challenges, and practices to start-up life-cycle stages (inception, stabilization, growth, and maturity). Thus, creating the progression model guiding software engineering efforts in start-ups. Conclusions: We conclude that start-ups to a large extent face the same challenges and use the same practices as established companies. However, the primary software engineering challenge in start-ups is to evolve multiple process areas at once, with a little margin for serious errors.

Kochhar2019 Pavneet Singh Kochhar, Eirini Kalliamvakou, Nachiappan Nagappan, Thomas Zimmermann, and Christian Bird: "Moving from Closed to Open Source: Observations from Six Transitioned Projects to GitHub". IEEE Transactions on Software Engineering, 2019, 10.1109/tse.2019.2937025.

Open source software systems have gained a lot of attention in the past few years. With the emergence of open source platforms like GitHub, developers can contribute, store, and manage their projects with ease. Large organizations like Microsoft, Google, and Facebook are open sourcing their in-house technologies in an effort to more broadly involve the community in the development of software systems. Although closed source and open source systems have been studied extensively, there has been little research on the transition from closed source to open source systems. Through this study we aim to: a) provide guidance and insights for other teams planning to open source their projects and b) to help them avoid pitfalls during the transition process. We studied six different Microsoft systems, which were recently open-sourced i.e., CoreFX, CoreCLR, Roslyn, Entity Framework, MVC, and Orleans. This paper presents the transition from the viewpoints of both Microsoft and the open source community based on interviews with eleven Microsoft developer, five Microsoft senior managers involved in the decision to open source, and eleven open-source developers. From Microsoft's perspective we discuss the reasons for the transition, experiences of developers involved, and the transition's outcomes and challenges. Our results show that building a vibrant community, prompt answers, developing an open source culture, security regulations and business opportunities are the factors which persuade companies to open source their products. We also discuss the transition outcomes on processes such as code reviews, version control systems, continuous integration as well as developers' perception of these changes. From the open source community's perspective, we illustrate the response to the open-sourcing initiative through contributions and interactions with the internal developers and provide guidelines for other projects planning to go open source.

Kosar2018 Tomavz Kosar, Savso Gaberc, Jeffrey C. Carver, and Marjan Mernik: "Program comprehension of domain-specific and general-purpose languages: replication of a family of experiments using integrated development environments". Empirical Software Engineering, 23(5), 2018, 10.1007/s10664-017-9593-2.

Domain-specific languages (DSLs) allow developers to write code at a higher level of abstraction compared with general-purpose languages (GPLs). Developers often use DSLs to reduce the complexity of GPLs. Our previous study found that developers performed program comprehension tasks more accurately and efficiently with DSLs than with corresponding APIs in GPLs. This study replicates our previous study to validate and extend the results when developers use IDEs to perform program comprehension tasks. We performed a dependent replication of a family of experiments. We made two specific changes to the original study: (1) participants used IDEs to perform the program comprehension tasks, to address a threat to validity in the original experiment and (2) each participant performed program comprehension tasks on either DSLs or GPLs, not both as in the original experiment. The results of the replication are consistent with and expanded the results of the original study. Developers are significantly more effective and efficient in tool-based program comprehension when using a DSL than when using a corresponding API in a GPL. The results indicate that, where a DSL is available, developers will perform program comprehension better using the DSL than when using the corresponding API in a GPL.

Krueger2020 Ryan Krueger, Yu Huang, Xinyu Liu, Tyler Santander, Westley Weimer, and Kevin Leach: "Neurological divide: an fMRI study of prose and code writing". Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, 10.1145/3377811.3380348.

Software engineering involves writing new code or editing existing code. Recent efforts have investigated the neural processes associated with reading and comprehending code—however, we lack a thorough understanding of the human cognitive processes underlying code writing. While prose reading and writing have been studied thoroughly, that same scrutiny has not been applied to code writing. In this paper, we leverage functional brain imaging to investigate neural representations of code writing in comparison to prose writing. We present the first human study in which participants wrote code and prose while undergoing a functional magnetic resonance imaging (fMRI) brain scan, making use of a full-sized fMRI-safe QWERTY keyboard. We find that code writing and prose writing are significantly dissimilar neural tasks. While prose writing entails significant left hemisphere activity associated with language, code writing involves more activations of the right hemisphere, including regions associated with attention control, working memory, planning and spatial cognition. These findings are unlike existing work in which code and prose comprehension were studied. By contrast, we present the first evidence suggesting that code and prose writing are quite dissimilar at the neural level.


Latendresse2021 Jasmine Latendresse, Rabe Abdalkareem, Diego Elias Costa, and Emad Shihab: "How Effective is Continuous Integration in Indicating Single-Statement Bugs?". 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), 10.1109/msr52588.2021.00062.

Continuous Integration (CI) is the process of automatically compiling, building, and testing code changes in the hope of catching bugs as they are introduced into the code base. With bug fixing being a core and increasingly costly task in software development, the community has adopted CI to mitigate this issue and improve the quality of their software products. Bug fixing is a core task in software development and becomes increasingly costly over time. However, little is known about how effective CI is at detecting simple, single-statement bugs.In this paper, we analyze the effectiveness of CI in 14 popular open source Java-based projects to warn about 318 single-statement bugs (SStuBs). We analyze the build status at the commits that introduce SStuBs and before the SStuBs were fixed. We then investigate how often CI indicates the presence of these bugs, through test failure. Our results show that only 2% of the commits that introduced SStuBs have builds with failed tests and 7.5% of builds before the fix reported test failures. Upon close manual inspection, we found that none of the failed builds actually captured SStuBs, indicating that CI is not the right medium to capture the SStuBs we studied. Our results suggest that developers should not rely on CI to catch SStuBs or increase their CI pipeline coverage to detect single-statement bugs.

LeGoues2018 Claire Le Goues, Ciera Jaspan, Ipek Ozkaya, Mary Shaw, and Kathryn T. Stolee: "Bridging the Gap: From Research to Practical Advice". IEEE Software, 35(5), 2018, 10.1109/ms.2018.3571235.

Software developers need actionable guidance, but researchers rarely integrate diverse types of evidence in a way that indicates the recommendations' strength. A levels-ofevidence framework might allow researchers and practitioners to translate research results to a pragmatically useful form.

LeGoues2021 Claire Le Goues, Michael Pradel, Abhik Roychoudhury, and Satish Chandra: "Automatic Program Repair". IEEE Software, 38(4), 2021, 10.1109/ms.2021.3072577.

An introduction to a special journal issue on automatic program repair.

Leito2019 Roxanne Leit~ao: "Technology-Facilitated Intimate Partner Abuse: a qualitative analysis of data from online domestic abuse forums". Human–Computer Interaction, 36(3), 2019, 10.1080/07370024.2019.1685883.

This article reports on a qualitative analysis of data gathered from three online discussion forums for victims and survivors of domestic abuse. The analysis focussed on technology-facilitated abuse and the findings cover three main themes, namely, 1) forms of technology-facilitated abuse being discussed on the forums, 2) the ways in which forum members are using technology within the context of intimate partner abuse, and 3) the digital privacy and security advice being exchanged between victims/survivors on the forums. The article concludes with a discussion on the dual role of digital technologies within the context of intimate partner abuse, on the challenges and advantages of digital ubiquity, as well as on the issues surrounding digital evidence of abuse, and the labor of managing digital privacy and security.

Lemire2021 Daniel Lemire: "Number parsing at a gigabyte per second". Software: Practice and Experience, 51(8), 2021, 10.1002/spe.2984.

With disks and networks providing gigabytes per second, parsing decimal numbers from strings becomes a bottleneck. We consider the problem of parsing decimal numbers to the nearest binary floating-point value. The general problem requires variable-precision arithmetic. However, we need at most 17 digits to represent 64-bit standard floating-point numbers (IEEE 754). Thus, we can represent the decimal significand with a single 64-bit word. By combining the significand and precomputed tables, we can compute the nearest floating-point number using as few as one or two 64-bit multiplications. Our implementation can be several times faster than conventional functions present in standard C libraries on modern 64-bit systems (Intel, AMD, ARM, and POWER9). Our work is available as open source software used by major systems such as Apache Arrow and Yandex ClickHouse. The Go standard library has adopted a version of our approach.

Lima2021 Luan P. Lima, Lincoln S. Rocha, Carla I. M. Bezerra, and Matheus Paixao: "Assessing exception handling testing practices in open-source libraries". Empirical Software Engineering, 26(5), 2021, 10.1007/s10664-021-09983-3.

Modern programming languages (e.g., Java and C#) provide features to separate error-handling code from regular code, seeking to enhance software comprehensibility and maintainability. Nevertheless, the way exception handling (EH) code is structured in such languages may lead to multiple, different, and complex control flows, which may affect the software testability. Previous studies have reported that EH code is typically neglected, not well tested, and its misuse can lead to reliability degradation and catastrophic failures. However, little is known about the relationship between testing practices and EH testing effectiveness. In this exploratory study, we (i) measured the adequacy degree of EH testing concerning code coverage (instruction, branch, and method) criteria; and (ii) evaluated the effectiveness of the EH testing by measuring its capability to detect artificially injected faults (i.e., mutants) using 7 EH mutation operators. Our study was performed using test suites of 27 long-lived Java libraries from open-source ecosystems. Our results show that instructions and branches within catch blocks and throw instructions are less covered, with statistical significance, than the overall instructions and branches. Nevertheless, most of the studied libraries presented test suites capable of detecting more than 70% of the injected faults. From a total of 12, 331 mutants created in this study, the test suites were able to detect 68% of them.

LimaJunior2021 Manoel Limeira Lima Júnior, Daricélio Soares, Alexandre Plastino, and Leonardo Murta: "Predicting the lifetime of pull requests in open-source projects". Journal of Software: Evolution and Process, 33(6), 2021, 10.1002/smr.2337.

A recent survey using industrial projects has shown that providing an estimate of the lifetime of pull requests to developers helps to speed up their conclusion. Previous work has explored pull request lifetime prediction in open-source projects using regression techniques but with a broad margin of error. The first objective of our work was to reduce the average error rate of the prediction obtained by the regression techniques so far. We performed experiments with different regression techniques and achieved a significant decrease in the mean error rate. The second objective of our work was to obtain a more effective and useful predictive model that can classify pull requests according to five discrete time intervals. We proposed new predictive attributes for the estimation of the time intervals and employed attribute selection strategies to identify subsets of attributes that could improve the predictive behavior of the classifiers. Our classification approach achieved the best accuracy in all the 20 projects evaluated in comparison with the literature. The average accuracy was of 45.28% to predict pull request lifetime, with an average normalized improvement of 14.68% in relation to the majority class and 6.49% in relation to the state-of-the-art.

Liu2021 Kui Liu, Dongsun Kim, Tegawende F. Bissyande, Shin Yoo, and Yves Le Traon: "Mining Fix Patterns for FindBugs Violations". IEEE Transactions on Software Engineering, 47(1), 2021, 10.1109/tse.2018.2884955.

Several static analysis tools, such as Splint or FindBugs, have been proposed to the software development community to help detect security vulnerabilities or bad programming practices. However, the adoption of these tools is hindered by their high false positive rates. If the false positive rate is too high, developers may get acclimated to violation reports from these tools, causing concrete and severe bugs being overlooked. Fortunately, some violations are actually addressed and resolved by developers. We claim that those violations that are recurrently fixed are likely to be true positives, and an automated approach can learn to repair similar unseen violations. However, there is lack of a systematic way to investigate the distributions on existing violations and fixed ones in the wild, that can provide insights into prioritizing violations for developers, and an effective way to mine code and fix patterns which can help developers easily understand the reasons of leading violations and how to fix them. In this paper, we first collect and track a large number of fixed and unfixed violations across revisions of software. The empirical analyses reveal that there are discrepancies in the distributions of violations that are detected and those that are fixed, in terms of occurrences, spread and categories, which can provide insights into prioritizing violations. To automatically identify patterns in violations and their fixes, we propose an approach that utilizes convolutional neural networks to learn features and clustering to regroup similar instances. We then evaluate the usefulness of the identified fix patterns by applying them to unfixed violations. The results show that developers will accept and merge a majority (69/116) of fixes generated from the inferred fix patterns. It is also noteworthy that the yielded patterns are applicable to four real bugs in the Defects4J major benchmark for software testing and automated repair.

Louis2020 Annie Louis, Santanu Kumar Dash, Earl T. Barr, Michael D. Ernst, and Charles Sutton: "Where should I comment my code?: a dataset and model for predicting locations that need comments". Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: New Ideas and Emerging Results, 10.1145/3377816.3381736.

Programmers should write code comments, but not on every line of code. We have created a machine learning model that suggests locations where a programmer should write a code comment. We trained it on existing commented code to learn locations that are chosen by developers. Once trained, the model can predict locations in new code. Our models achieved precision of 74% and recall of 13% in identifying comment-worthy locations. This first success opens the door to future work, both in the new where-to-comment problem and in guiding comment generation. Our code and data is available at

Luu2021 Quang-Hung Luu, Man F. Lau, Sebastian P.H. Ng, and Tsong Yueh Chen: "Testing multiple linear regression systems with metamorphic testing". Journal of Systems and Software, 182, 2021, 10.1016/j.jss.2021.111062.

Regression is one of the most commonly used statistical techniques. However, testing regression systems is a great challenge because of the absence of test oracle. In this paper, we show that Metamorphic Testing is an effective approach to test multiple linear regression systems. In doing so, we identify intrinsic mathematical properties of linear regression, and then propose 11 Metamorphic Relations to be used for testing. Their effectiveness is examined using mutation analysis with a range of different regression programs. We further look at how the testing could be adopted in a more effective way. Our work is applicable to examine the reliability of predictive systems based on regression that has been widely used in economics, engineering and science, as well as of the regression calculation manipulated by statistical users.


Ma2021 Yuxing Ma, Tapajit Dey, Chris Bogart, Sadika Amreen, Marat Valiev, Adam Tutko, David Kennard, Russell Zaretzki, and Audris Mockus: "World of code: enabling a research workflow for mining and analyzing the universe of open source VCS data". Empirical Software Engineering, 26(2), 2021, 10.1007/s10664-020-09905-9.

Open source software (OSS) is essential for modern society and, while substantial research has been done on individual (typically central) projects, only a limited understanding of the periphery of the entire OSS ecosystem exists. For example, how are the tens of millions of projects in the periphery interconnected through. technical dependencies, code sharing, or knowledge flow? To answer such questions we: a) create a very large and frequently updated collection of version control data in the entire FLOSS ecosystems named World of Code (WoC), that can completely cross-reference authors, projects, commits, blobs, dependencies, and history of the FLOSS ecosystems and b) provide capabilities to efficiently correct, augment, query, and analyze that data. Our current WoC implementation is capable of being updated on a monthly basis and contains over 18B Git objects. To evaluate its research potential and to create vignettes for its usage, we employ WoC in conducting several research tasks. In particular, we find that it is capable of supporting trend evaluation, ecosystem measurement, and the determination of package usage. We expect WoC to spur investigation into global properties of OSS development leading to increased resiliency of the entire OSS ecosystem. Our infrastructure facilitates the discovery of key technical dependencies, code flow, and social networks that provide the basis to determine the structure and evolution of the relationships that drive FLOSS activities and innovation.

Macho2021 Christian Macho, Stefanie Beyer, Shane McIntosh, and Martin Pinzger: "The nature of build changes: An empirical study of Maven-based build systems". Empirical Software Engineering, 26(3), 2021, 10.1007/s10664-020-09926-4.

Build systems are an essential part of modern software projects. As software projects change continuously, it is crucial to understand how the build system changes because neglecting its maintenance can, at best, lead to expensive build breakage, or at worst, introduce user-reported defects due to incorrectly compiled, linked, packaged, or deployed official releases. Recent studies have investigated the (co-)evolution of build configurations and reasons for build breakage; however, the prior analysis focused on a coarse-grained outcome (i.e., either build changing or not). In this paper, we present BUILDDIFF, an approach to extract detailed build changes from MAVEN build files and classify them into 143 change types. In a manual evaluation of 400 build-changing commits, we show that BUILDDIFF can extract and classify build changes with average precision, recall, and f1-scores of 0.97, 0.98, and 0.97, respectively. We then present two studies using the build changes extracted from 144 open source Java projects to study the frequency and time of build changes. The results show that the top-10 most frequent change types account for 51% of the build changes. Among them, changes to version numbers and changes to dependencies of the projects occur most frequently. We also observe frequently co-occurring changes, such as changes to the source code management definitions, and corresponding changes to the dependency management system and the dependency declaration. Furthermore, our results show that build changes frequently occur around release days. In particular, critical changes, such as updates to plugin configuration parts and dependency insertions, are performed before a release day. The contributions of this paper lay in the foundation for future research, such as for analyzing the (co-)evolution of build files with other artifacts, improving effort estimation approaches by incorporating necessary modifications to the build system specification, or automatic repair approaches for configuration code. Furthermore, our detailed change information enables improvements of refactoring approaches for build configurations and improvements of prediction models to identify error-prone build files.

Malik2019 Mashkoor Malik, Alexandre C. G. Schimel, Giuseppe Masetti, Marc Roche, Julian Le Deunf, Margaret F.J. Dolan, Jonathan Beaudoin, Jean-Marie Augustin, Travis Hamilton, and Iain Parnum: "Results from the First Phase of the Seafloor Backscatter Processing Software Inter-Comparison Project". Geosciences, 9(12), 2019, 10.3390/geosciences9120516.

Seafloor backscatter mosaics are now routinely produced from multibeam echosounder data and used in a wide range of marine applications. However, large differences (>5 dB) can often be observed between the mosaics produced by different software packages processing the same dataset. Without transparency of the processing pipeline and the lack of consistency between software packages raises concerns about the validity of the final results. To recognize the source(s) of inconsistency between software, it is necessary to understand at which stage(s) of the data processing chain the differences become substantial. To this end, willing commercial and academic software developers were invited to generate intermediate processed backscatter results from a common dataset, for cross-comparison. The first phase of the study requested intermediate processed results consisting of two stages of the processing sequence: the one-value-per-beam level obtained after reading the raw data and the level obtained after radiometric corrections but before compensation of the angular dependence. Both of these intermediate results showed large differences between software solutions. This study explores the possible reasons for these differences and highlights the need for collaborative efforts between software developers and their users to improve the consistency and transparency of the backscatter data processing sequence.

Mangano2015 Nicolas Mangano, Thomas D. LaToza, Marian Petre, and Andre van der Hoek: "How Software Designers Interact with Sketches at the Whiteboard". IEEE Transactions on Software Engineering, 41(2), 2015, 10.1109/tse.2014.2362924.

Whiteboard sketches play a crucial role in software development, helping to support groups of designers in reasoning about a software design problem at hand. However, little is known about these sketches and how they support design `in the moment', particularly in terms of the relationships among sketches, visual syntactic elements within sketches, and reasoning activities. To address this gap, we analyzed 14 hours of design activity by eight pairs of professional software designers, manually coding over 4000 events capturing the introduction of visual syntactic elements into sketches, focus transitions between sketches, and reasoning activities. Our findings indicate that sketches serve as a rich medium for supporting design conversations. Designers often use general-purpose notations. Designers introduce new syntactic elements to record aspects of the design, or re-purpose sketches as the design develops. Designers constantly shift focus between sketches, using groups of sketches together that contain complementary information. Finally, sketches play an important role in supporting several types of reasoning activities (mental simulation, review of progress, consideration of alternatives). But these activities often leave no trace and rarely lead to sketch creation. We discuss the implications of these and other findings for the practice of software design at the whiteboard and for the creation of new electronic software design sketching tools.

May2019 Anna May, Johannes Wachs, and Anikó Hannák: "Gender differences in participation and reward on Stack Overflow". Empirical Software Engineering, 24(4), 2019, 10.1007/s10664-019-09685-x.

Programming is a valuable skill in the labor market, making the underrepresentation of women in computing an increasingly important issue. Online question and answer platforms serve a dual purpose in this field: they form a body of knowledge useful as a reference and learning tool, and they provide opportunities for individuals to demonstrate credible, verifiable expertise. Issues, such as male-oriented site design or overrepresentation of men among the site's elite may therefore compound the issue of women's underrepresentation in IT. In this paper we audit the differences in behavior and outcomes between men and women on Stack Overflow, the most popular of these Q&A sites. We observe significant differences in how men and women participate in the platform and how successful they are. For example, the average woman has roughly half of the reputation points, the primary measure of success on the site, of the average man. Using an Oaxaca-Blinder decomposition, an econometric technique commonly applied to analyze differences in wages between groups, we find that most of the gap in success between men and women can be explained by differences in their activity on the site and differences in how these activities are rewarded. Specifically, 1) men give more answers than women and 2) are rewarded more for their answers on average, even when controlling for possible confounders such as tenure or buy-in to the site. Women ask more questions and gain more reward per question. We conclude with a hypothetical redesign of the site's scoring system based on these behavioral differences, cutting the reputation gap in half.

Melo2019 Hugo Melo, Roberta Coelho, and Christoph Treude: "Unveiling Exception Handling Guidelines Adopted by Java Developers". 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), 10.1109/saner.2019.8668001.

Despite being an old language feature, Java exception handling code is one of the least understood parts of many systems. Several studies have analyzed the characteristics of exception handling code, trying to identify common practices or even link such practices to software bugs. Few works, however, have investigated exception handling issues from the point of view of developers. None of the works have focused on discovering exception handling guidelines adopted by current systems – which are likely to be a driver of common practices. In this work, we conducted a qualitative study based on semi-structured interviews and a survey whose goal was to investigate the guidelines that are (or should be) followed by developers in their projects. Initially, we conducted semi-structured interviews with seven experienced developers, which were used to inform the design of a survey targeting a broader group of Java developers (i.e., a group of active Java developers from top-starred projects on GitHub). We emailed 863 developers and received 98 valid answers. The study shows that exception handling guidelines usually exist (70%) and are usually implicit and undocumented (54%). Our study identifies 48 exception handling guidelines related to seven different categories. We also investigated how such guidelines are disseminated to the project team and how compliance between code and guidelines is verified; we could observe that according to more than half of respondents the guidelines are both disseminated and verified through code inspection or code review. Our findings provide software development teams with a means to improve exception handling guidelines based on insights from the state of practice of 87 software projects.

Meyer2021 Andre N. Meyer, Earl T. Barr, Christian Bird, and Thomas Zimmermann: "Today Was a Good Day: The Daily Life of Software Developers". IEEE Transactions on Software Engineering, 47(5), 2021, 10.1109/tse.2019.2904957.

What is a good workday for a software developer? What is a typical workday? We seek to answer these two questions to learn how to make good days typical. Concretely, answering these questions will help to optimize development processes and select tools that increase job satisfaction and productivity. Our work adds to a large body of research on how software developers spend their time. We report the results from 5,971 responses of professional developers at Microsoft, who reflected about what made their workdays good and typical, and self-reported about how they spent their time on various activities at work. We developed conceptual frameworks to help define and characterize developer workdays from two new perspectives: good and typical. Our analysis confirms some findings in previous work, including the fact that developers actually spend little time on development and developers' aversion for meetings and interruptions. It also discovered new findings, such as that only 1.7 percent of survey responses mentioned emails as a reason for a bad workday, and that meetings and interruptions are only unproductive during development phases; during phases of planning, specification and release, they are common and constructive. One key finding is the importance of agency, developers' control over their workday and whether it goes as planned or is disrupted by external factors. We present actionable recommendations for researchers and managers to prioritize process and tool improvements that make good workdays typical. For instance, in light of our finding on the importance of agency, we recommend that, where possible, managers empower developers to choose their tools and tasks.

Miller2020 Barton Miller, Mengxiao Zhang, and Elisa Heymann: "The Relevance of Classic Fuzz Testing: Have We Solved This One?". IEEE Transactions on Software Engineering, 2020, 10.1109/tse.2020.3047766.

As fuzz testing has passed its 30th anniversary, and in the face of the incredible progress in fuzz testing techniques and tools, the question arises if the classic, basic fuzz technique is still useful and applicable? In that tradition, we have updated the basic fuzz tools and testing scripts and applied them to a large collection of Unix utilities on Linux, FreeBSD, and MacOS. As before, our failure criteria was whether the program crashed or hung. We found that 9 crash or hang out of 74 utilities on Linux, 15 out of 78 utilities on FreeBSD, and 12 out of 76 utilities on MacOS. A total of 24 different utilities failed across the three platforms. We note that these failure rates are somewhat higher than our in previous 1995, 2000, and 2006 studies of the reliability of command line utilities. In the basic fuzz tradition, we debugged each failed utility and categorized the causes the failures. Classic categories of failures, such as pointer and array errors and not checking return codes, were still broadly present in the current results. In addition, we found a couple of new categories of failures appearing. We present examples of these failures to illustrate the programming practices that allowed them to happen. As a side note, we tested the limited number of utilities available in a modern programming language (Rust) and found them to be of no better reliability than the standard ones.

Mitropoulos2019 Dimitris Mitropoulos, Panos Louridas, Vitalis Salis, and Diomidis Spinellis: "Time Present and Time Past: Analyzing the Evolution of JavaScript Code in the Wild". 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), 10.1109/msr.2019.00029.

JavaScript is one of the web's key building blocks. It is used by the majority of web sites and it is supported by all modern browsers. We present the first large-scale study of client-side JavaScript code over time. Specifically, we have collected and analyzed a dataset containing daily snapshots of JavaScript code coming from Alexa's Top 10000 web sites (~7.5 GB per day) for nine consecutive months, to study different temporal aspects of web client code. We found that scripts change often; typically every few days, indicating a rapid pace in web applications development. We also found that the lifetime of web sites themselves, measured as the time between JavaScript changes, is also short, in the same time scale. We then performed a qualitative analysis to investigate the nature of the changes that take place. We found that apart from standard changes such as the introduction of new functions, many changes are related to online configuration management. In addition, we examined JavaScript code reuse over time and especially the widespread reliance on third-party libraries. Furthermore, we observed how quality issues evolve by employing established static analysis tools to identify potential software bugs, whose evolution we tracked over time. Our results show that quality issues seem to persist over time, while vulnerable libraries tend to decrease.

Mo2021 Ran Mo, Yuanfang Cai, Rick Kazman, Lu Xiao, and Qiong Feng: "Architecture Anti-Patterns: Automatically Detectable Violations of Design Principles". IEEE Transactions on Software Engineering, 47(5), 2021, 10.1109/tse.2019.2910856.

In large-scale software systems, error-prone or change-prone files rarely stand alone. They are typically architecturally connected and their connections usually exhibit architecture problems causing the propagation of error-proneness or change-proneness. In this paper, we propose and empirically validate a suite of architecture anti-patterns that occur in all large-scale software systems and are involved in high maintenance costs. We define these architecture anti-patterns based on fundamental design principles and Baldwin and Clark's design rule theory. We can automatically detect these anti-patterns by analyzing a project's structural relationships and revision history. Through our analyses of 19 large-scale software projects, we demonstrate that these architecture anti-patterns have significant impact on files' bug-proneness and change-proneness. In particular, we show that 1) files involved in these architecture anti-patterns are more error-prone and change-prone; 2) the more anti-patterns a file is involved in, the more error-prone and change-prone it is; and 3) while all of our defined architecture anti-patterns contribute to file's error-proneness and change-proneness, Unstable Interface and Crossing contribute the most by far.

Mokhov2018 Andrey Mokhov, Neil Mitchell, and Simon Peyton Jones: "Build systems `a la carte". Proceedings of the ACM on Programming Languages, 2(ICFP), 2018, 10.1145/3236774.

Build systems are awesome, terrifying—and unloved. They are used by every developer around the world, but are rarely the object of study. In this paper we offer a systematic, and executable, framework for developing and comparing build systems, viewing them as related points in landscape rather than as isolated phenomena. By teasing apart existing build systems, we can recombine their components, allowing us to prototype new build systems with desired properties.

Moldon2021 Lukas Moldon, Markus Strohmaier, and Johannes Wachs: "How Gamification Affects Software Developers: Cautionary Evidence from a Natural Experiment on GitHub". 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), 10.1109/icse43902.2021.00058.

We examine how the behavior of software developers changes in response to removing gamification elements from GitHub, an online platform for collaborative programming and software development. We find that the unannounced removal of daily activity streak counters from the user interface (from user profile pages) was followed by significant changes in behavior. Long-running streaks of activity were abandoned and became less common. Weekend activity decreased and days in which developers made a single contribution became less common. Synchronization of streaking behavior in the platform's social network also decreased, suggesting that gamification is a powerful channel for social influence. Focusing on a set of software developers that were publicly pursuing a goal to make contributions for 100 days in a row, we find that some of these developers abandon this quest following the removal of the public streak counter. Our findings provide evidence for the significant impact of gamification on the behavior of developers on large collaborative programming and software development platforms. They urge caution: gamification can steer the behavior of software developers in unexpected and unwanted directions.

Moraes2021 Jo~ao Pedro Moraes, Ivanilton Polato, Igor Wiese, Filipe Saraiva, and Gustavo Pinto: "From one to hundreds: multi-licensing in the JavaScript ecosystem". Empirical Software Engineering, 26(3), 2021, 10.1007/s10664-020-09936-2.

Open source licenses create a legal framework that plays a crucial role in the widespread adoption of open source projects. Without a license, any source code available on the internet could not be openly (re)distributed. Although recent studies provide evidence that most popular open source projects have a license, developers might lack confidence or expertise when they need to combine software licenses, leading to a mistaken project license unification.This license usage is challenged by the high degree of reuse that occurs in the heart of modern software development practices, in which third-party libraries and frameworks are easily and quickly integrated into a software codebase.This scenario creates what we call "multi-licensed" projects, which happens when one project has components that are licensed under more than one license. Although these components exist at the file-level, they naturally impact licensing decisions at the project-level. In this paper, we conducted a mix-method study to shed some light on these questions. We started by parsing 1,426,263 (source code and non-source code) files available on 1,552 JavaScript projects, looking for license information. Among these projects, we observed that 947 projects (61%) employ more than one license. On average, there are 4.7 licenses per studied project (max: 256). Among the reasons for multi-licensing is to incorporate the source code of third-party libraries into the project's codebase. When doing so, we observed that 373 of the multi-licensed projects introduced at least one license incompatibility issue. We also surveyed with 83 maintainers of these projects aimed to cross-validate our findings. We observed that 63% of the surveyed maintainers are not aware of the multi-licensing implications. For those that are aware, they adopt multiple licenses mostly to conform with third-party libraries' licenses.

MoreiraSoares2020 Daricélio Moreira Soares, Manoel Limeira Lima Júnior, Leonardo Murta, and Alexandre Plastino: "What factors influence the lifetime of pull requests?". Software: Practice and Experience, 51(6), 2020, 10.1002/spe.2946.

When external contributors want to collaborate with an open-source project, they fork the repository, make changes, and send a pull request to the core team. However, the lifetime of a pull request, defined by the time interval between its opening and its closing, has a high variation, potentially affecting the contributor engagement. In this context, understanding the root causes of pull request lifetime is important to both the external contributors and the core team. The former can adopt strategies that increase the chances of fast review, while the latter can establish priorities in the reviewing process, alleviating the pending tasks and improving the software quality. In this work, we mined association rules from 97,463 pull requests from 30 projects in order to find characteristics that have affected the pull requests lifetime. In addition, we present a qualitative analysis, helping to understand the patterns discovered from the association rules. The results indicate that: (i) contributions with shorter lifetimes tend to be accepted; (ii) structural characteristics, such as number of commits, changed files, and lines of code, have influence, in an isolated or combined way, on the pull request lifetime; (iii) the files changed and the directories to which they belong can be robust predictors for pull request lifetime; (iv) the profile of external contributors and their social relationships have influence on lifetime; and (v) the number of comments in a pull request, as well as the developer responsible for the review, are important predictors for its lifetime.

Muhammad2019 Hisham Muhammad, Lucas C. Villa Real, and Michael Homer: "Taxonomy of Package Management in Programming Languages and Operating Systems". Proceedings of the 10th Workshop on Programming Languages and Operating Systems, 10.1145/3365137.3365402.

Package management is instrumental for programming languages and operating systems, and yet it is neglected by both areas as an implementation detail. For this reason, it lacks the same kind of conceptual organization: we lack terminology to classify them or to reason about their design trade-offs. In this paper, we share our experience in both OS and language-specific package manager development, categorizing families of package managers and discussing their design implications beyond particular implementations. We also identify possibilities in the still largely unexplored area of package manager interoperability.

MurphyHill2021 Emerson Murphy-Hill, Ciera Jaspan, Caitlin Sadowski, David Shepherd, Michael Phillips, Collin Winter, Andrea Knight, Edward Smith, and Matthew Jorde: "What Predicts Software Developers' Productivity?". IEEE Transactions on Software Engineering, 47(3), 2021, 10.1109/tse.2019.2900308.

Organizations have a variety of options to help their software developers become their most productive selves, from modifying office layouts, to investing in better tools, to cleaning up the source code. But which options will have the biggest impact? Drawing from the literature in software engineering and industrial/organizational psychology to identify factors that correlate with productivity, we designed a survey that asked 622 developers across 3 companies about these productivity factors and about self-rated productivity. Our results suggest that the factors that most strongly correlate with self-rated productivity were non-technical factors, such as job enthusiasm, peer support for new ideas, and receiving useful feedback about job performance. Compared to other knowledge workers, our results also suggest that software developers' self-rated productivity is more strongly related to task variety and ability to work remotely.


NguyenDuc2021 Anh Nguyen-Duc, Kai-Kristian Kemell, and Pekka Abrahamsson: "The entrepreneurial logic of startup software development: A study of 40 software startups". Empirical Software Engineering, 26(5), 2021, 10.1007/s10664-021-09987-z.

Context: Software startups are an essential source of innovation and software-intensive products. The need to understand product development in startups and to provide relevant support are highlighted in software research. While state-of-the-art literature reveals how startups develop their software, the reasons why they adopt these activities are underexplored. Objective: This study investigates the tactics behind software engineering (SE) activities by analyzing key engineering events during startup journeys. We explore how entrepreneurial mindsets may be associated with SE knowledge areas and with each startup case. Method: Our theoretical foundation is based on causation and effectuation models. We conducted semi-structured interviews with 40 software startups. We used two-round open coding and thematic analysis to describe and identify entrepreneurial software development patterns. Additionally, we calculated an effectuation index for each startup case. Results: We identified 621 events merged into 32 codes of entrepreneurial logic in SE from the sample. We found a systemic occurrence of the logic in all areas of SE activities. Minimum Viable Product (MVP), Technical Debt (TD), and Customer Involvement (CI) tend to be associated with effectual logic, while testing activities at different levels are associated with causal logic. The effectuation index revealed that startups are either effectuation-driven or mixed-logics-driven. Conclusions: Software startups fall into two types that differentiate between how traditional SE approaches may apply to them. Effectuation seems the most relevant and essential model for explaining and developing suitable SE practices for software startups.


Olsson2021 Jesper Olsson, Erik Risfelt, Terese Besker, Antonio Martini, and Richard Torkar: "Measuring affective states from technical debt". Empirical Software Engineering, 26(5), 2021, 10.1007/s10664-021-09998-w.

Context: Software engineering is a human activity. Despite this, human aspects are under-represented in technical debt research, perhaps because they are challenging to evaluate. Objective: This study's objective was to investigate the relationship between technical debt and affective states (feelings, emotions, and moods) from software practitioners. Method: Forty participants (N=40) from twelve companies took part in a mixed-methods approach, consisting of a repeated-measures (r=5) experiment (n=200), a survey, and semi-structured interviews. From the qualitative data, it is clear that technical debt activates a substantial portion of the emotional spectrum and is psychologically taxing. Further, the practitioners' reactions to technical debt appear to fall in different levels of maturity. Results: The statistical analysis shows that different design smells (strong indicators of technical debt) negatively or positively impact affective states. Conclusions: We argue that human aspects in technical debt are important factors to consider, as they may result in, e.g., procrastination, apprehension, and burnout.

Overney2020 Cassandra Overney, Jens Meinicke, Christian Kästner, and Bogdan Vasilescu: "How to not get rich: an empirical study of donations in open source". Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, 10.1145/3377811.3380410.

Open source is ubiquitous and many projects act as critical infrastructure, yet funding and sustaining the whole ecosystem is challenging. While there are many different funding models for open source and concerted efforts through foundations, donation platforms like PayPal, Patreon, and OpenCollective are popular and low-bar platforms to raise funds for open-source development. With a mixed-method study, we investigate the emerging and largely unexplored phenomenon of donations in open source. Specifically, we quantify how commonly open-source projects ask for donations, statistically model characteristics of projects that ask for and receive donations, analyze for what the requested funds are needed and used, and assess whether the received donations achieve the intended outcomes. We find 25,885 projects asking for donations on GitHub, often to support engineering activities; however, we also find no clear evidence that donations influence the activity level of a project. In fact, we find that donations are used in a multitude of ways, raising new research questions about effective funding.


Palomba2021 Fabio Palomba, Damian Andrew Tamburri, Francesca Arcelli Fontana, Rocco Oliveto, Andy Zaidman, and Alexander Serebrenik: "Beyond Technical Aspects: How Do Community Smells Influence the Intensity of Code Smells?". IEEE Transactions on Software Engineering, 47(1), 2021, 10.1109/tse.2018.2883603.

Code smells are poor implementation choices applied by developers during software evolution that often lead to critical flaws or failure. Much in the same way, community smells reflect the presence of organizational and socio-technical issues within a software community that may lead to additional project costs. Recent empirical studies provide evidence that community smells are often—if not always—connected to circumstances such as code smells. In this paper we look deeper into this connection by conducting a mixed-methods empirical study of 117 releases from 9 open-source systems. The qualitative and quantitative sides of our mixed-methods study were run in parallel and assume a mutually-confirmative connotation. On the one hand, we survey 162 developers of the 9 considered systems to investigate whether developers perceive relationship between community smells and the code smells found in those projects. On the other hand, we perform a fine-grained analysis into the 117 releases of our dataset to measure the extent to which community smells impact code smell intensity (i.e., criticality). We then propose a code smell intensity prediction model that relies on both technical and community-related aspects. The results of both sides of our mixed-methods study lead to one conclusion: community-related factors contribute to the intensity of code smells. This conclusion supports the joint use of community and code smells detection as a mechanism for the joint management of technical and social problems around software development communities.

Paltoglou2021 Katerina Paltoglou, Vassilis E. Zafeiris, N.A. Diamantidis, and E.A. Giakoumakis: "Automated refactoring of legacy JavaScript code to ES6 modules". Journal of Systems and Software, 181, 2021, 10.1016/j.jss.2021.111049.

The JavaScript language did not specify, until ECMAScript 6 (ES6), native features for streamlining encapsulation and modularity. Developer community filled the gap with a proliferation of design patterns and module formats, with impact on code reusability, portability and complexity of build configurations. This work studies the automated refactoring of legacy ES5 code to ES6 modules with fine-grained reuse of module contents through the named import/export language constructs. The focus is on reducing the coupling of refactored modules through destructuring exported module objects to fine-grained module features and enhancing module dependencies by leveraging the ES6 syntax. We employ static analysis to construct a model of a JavaScript project, the Module Dependence Graph (MDG), that represents modules and their dependencies. On the basis of MDG we specify the refactoring procedure for module migration to ES6. A prototype implementation has been empirically evaluated on 19 open source projects. Results highlight the relevance of the refactoring with a developer intent for fine-grained reuse. The analysis of refactored code shows an increase in the number of reusable elements per project and reduction in the coupling of refactored modules. The soundness of the refactoring is empirically validated through code inspection and execution of projects' test suites.

Passos2021 Leonardo Passos, Rodrigo Queiroz, Mukelabai Mukelabai, Thorsten Berger, Sven Apel, Krzysztof Czarnecki, and Jesus Alejandro Padilla: "A Study of Feature Scattering in the Linux Kernel". IEEE Transactions on Software Engineering, 47(1), 2021, 10.1109/tse.2018.2884911.

Feature code is often scattered across a software system. Scattering is not necessarily bad if used with care, as witnessed by systems with highly scattered features that evolved successfully. Feature scattering, often realized with a pre-processor, circumvents limitations of programming languages and software architectures. Unfortunately, little is known about the principles governing scattering in large and long-living software systems. We present a longitudinal study of feature scattering in the Linux kernel, complemented by a survey with 74, and interviews with nine Linux kernel developers. We analyzed almost eight years of the kernel's history, focusing on its largest subsystem: device drivers. We learned that the ratio of scattered features remained nearly constant and that most features were introduced without scattering. Yet, scattering easily crosses subsystem boundaries, and highly scattered outliers exist. Scattering often addresses a performance-maintenance tradeoff (alleviating complicated APIs), hardware design limitations, and avoids code duplication. While developers do not consciously enforce scattering limits, they actually improve the system design and refactor code, thereby mitigating pre-processor idiosyncrasies or reducing its use.

Peitek2021 Norman Peitek, Sven Apel, Chris Parnin, Andre Brechmann, and Janet Siegmund: "Program Comprehension and Code Complexity Metrics: An fMRI Study". 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), 10.1109/icse43902.2021.00056.

Background: Researchers and practitioners have been using code complexity metrics for decades to predict how developers comprehend a program. While it is plausible and tempting to use code metrics for this purpose, their validity is debated, since they rely on simple code properties and rarely consider particularities of human cognition. Aims: We investigate whether and how code complexity metrics reflect difficulty of program comprehension. Method: We have conducted a functional magnetic resonance imaging (fMRI) study with 19 participants observing program comprehension of short code snippets at varying complexity levels. We dissected four classes of code complexity metrics and their relationship to neuronal, behavioral, and subjective correlates of program comprehension, overall analyzing more than 41 metrics. Results: While our data corroborate that complexity metrics can-to a limited degree-explain programmers' cognition in program comprehension, fMRI allowed us to gain insights into why some code properties are difficult to process. In particular, a code's textual size drives programmers' attention, and vocabulary size burdens programmers' working memory. Conclusion: Our results provide neuro-scientific evidence supporting warnings of prior research questioning the validity of code complexity metrics and pin down factors relevant to program comprehension. Future Work: We outline several follow-up experiments investigating fine-grained effects of code complexity and describe possible refinements to code complexity metrics.

Pizard2021 Sebastián Pizard, Fernando Acerenza, Ximena Otegui, Silvana Moreno, Diego Vallespir, and Barbara Kitchenham: "Training students in evidence-based software engineering and systematic reviews: a systematic review and empirical study". Empirical Software Engineering, 26(3), 2021, 10.1007/s10664-021-09953-9.

Context Although influential in academia, evidence-based software engineering (EBSE) has had little impact on industry practice. We found that other disciplines have identified lack of training as a significant barrier to Evidence-Based Practice. Objective To build and assess an EBSE training proposal suitable for students with more than 3 years of computer science/software engineering university-level training. Method We performed a systematic literature review (SLR) of EBSE teaching initiatives and used the SLR results to help us to develop and evaluate an EBSE training proposal. The course was based on the theory of learning outcomes and incorporated a large practical content related to performing an SLR. We ran the course with 10 students and based course evaluation on student performance and opinions of both students and teachers. We assessed knowledge of EBSE principles from the mid-term and final tests, as well as evaluating the SLRs produced by the student teams. We solicited student opinions about the course and its value via a student survey, a team survey, and a focus group. The teachers' viewpoint was collected in a debriefing meeting. Results Our SLR identified 14 relevant primary studies. The primary studies emphasized the importance of practical examples (usually based on the SLR process) and used a variety of evaluation methods, but lacked any formal education methodology. We identified 54 learning outcomes covering aspects of EBSE and the SLR method. All 10 students passed the course. Our course evaluation showed that a large percentage of the learning outcomes established for training were accomplished. Conclusions The course proved suitable for students to understand the EBSE paradigm and to be able to apply it to a limited-scope practical assignment. Our learning outcomes, course structure, and course evaluation process should help to improve the effectiveness and comparability of future studies of EBSE training. However, future courses should increase EBSE training related to the use of SLR results.


Qiu2019 Huilian Sophie Qiu, Alexander Nolte, Anita Brown, Alexander Serebrenik, and Bogdan Vasilescu: "Going Farther Together: The Impact of Social Capital on Sustained Participation in Open Source". 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), 10.1109/icse.2019.00078.

Sustained participation by contributors in opensource software is critical to the survival of open-source projects and can provide career advancement benefits to individual contributors. However, not all contributors reap the benefits of open-source participation fully, with prior work showing that women are particularly underrepresented and at higher risk of disengagement. While many barriers to participation in open-source have been documented in the literature, relatively little is known about how the social networks that open-source contributors form impact their chances of long-term engagement. In this paper we report on a mixed-methods empirical study of the role of social capital (i.e., the resources people can gain from their social connections) for sustained participation by women and men in open-source GitHub projects. After combining survival analysis on a large, longitudinal data set with insights derived from a user survey, we confirm that while social capital is beneficial for prolonged engagement for both genders, women are at disadvantage in teams lacking diversity in expertise.


RakAmnouykit2020 Ingkarat Rak-amnouykit, Daniel McCrevan, Ana Milanova, Martin Hirzel, and Julian Dolby: "Python 3 types in the wild: a tale of two type systems". Proceedings of the 16th ACM SIGPLAN International Symposium on Dynamic Languages, 10.1145/3426422.3426981.

Python 3 is a highly dynamic language, but it has introduced a syntax for expressing types with PEP484. This paper ex- plores how developers use these type annotations, the type system semantics provided by type checking and inference tools, and the performance of these tools. We evaluate the types and tools on a corpus of public GitHub repositories. We review MyPy and PyType, two canonical static type checking and inference tools, and their distinct approaches to type analysis. We then address three research questions: (i) How often and in what ways do developers use Python 3 types? (ii) Which type errors do developers make? (iii) How do type errors from different tools compare? Surprisingly, when developers use static types, the code rarely type-checks with either of the tools. MyPy and PyType exhibit false positives, due to their static nature, but also flag many useful errors in our corpus. Lastly, MyPy and PyType embody two distinct type systems, flagging different errors in many cases. Understanding the usage of Python types can help guide tool-builders and researchers. Understanding the performance of popular tools can help increase the adoption of static types and tools by practitioners, ultimately leading to more correct and more robust Python code.

Rahman2020b Mohammad Masudur Rahman, Foutse Khomh, and Marco Castelluccio: "Why are Some Bugs Non-Reproducible? An Empirical Investigation using Data Fusion". 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME), 10.1109/icsme46990.2020.00063.

Software developers attempt to reproduce software bugs to understand their erroneous behaviours and to fix them. Unfortunately, they often fail to reproduce (or fix) them, which leads to faulty, unreliable software systems. However, to date, only a little research has been done to better understand what makes the software bugs non-reproducible. In this paper, we conduct a multimodal study to better understand the non-reproducibility of software bugs. First, we perform an empirical study using 576 non-reproducible bug reports from two popular software systems (Firefox, Eclipse) and identify 11 key factors that might lead a reported bug to non-reproducibility. Second, we conduct a user study involving 13 professional developers where we investigate how the developers cope with non-reproducible bugs. We found that they either close these bugs or solicit for further information, which involves long deliberations and counter-productive manual searches. Third, we offer several actionable insights on how to avoid non-reproducibility (e.g., false-positive bug report detector) and improve reproducibility of the reported bugs (e.g., sandbox for bug reproduction) by combining our analyses from multiple studies (e.g., empirical study, developer study).

Rahman2021 Akond Rahman, Md Rayhanur Rahman, Chris Parnin, and Laurie Williams: "Security Smells in Ansible and Chef Scripts". ACM Transactions on Software Engineering and Methodology, 30(1), 2021, 10.1145/3408897.

Context: Security smells are recurring coding patterns that are indicative of security weakness and require further inspection. As infrastructure as code (IaC) scripts, such as Ansible and Chef scripts, are used to provision cloud-based servers and systems at scale, security smells in IaC scripts could be used to enable malicious users to exploit vulnerabilities in the provisioned systems. Goal: The goal of this article is to help practitioners avoid insecure coding practices while developing infrastructure as code scripts through an empirical study of security smells in Ansible and Chef scripts. Methodology: We conduct a replication study where we apply qualitative analysis with 1,956 IaC scripts to identify security smells for IaC scripts written in two languages: Ansible and Chef. We construct a static analysis tool called Security Linter for Ansible and Chef scripts (SLAC) to automatically identify security smells in 50,323 scripts collected from 813 open source software repositories. We also submit bug reports for 1,000 randomly selected smell occurrences. Results: We identify two security smells not reported in prior work: missing default in case statement and no integrity check. By applying SLAC we identify 46,600 occurrences of security smells that include 7,849 hard-coded passwords. We observe agreement for 65 of the responded 94 bug reports, which suggests the relevance of security smells for Ansible and Chef scripts amongst practitioners. Conclusion: We observe security smells to be prevalent in Ansible and Chef scripts, similarly to that of the Puppet scripts. We recommend practitioners to rigorously inspect the presence of the identified security smells in Ansible and Chef scripts using (i) code review, and (ii) static analysis tools.

Reyes2018 Rolando P. Reyes, Oscar Dieste, Efraín R. Fonseca, and Natalia Juristo: "Statistical errors in software engineering experiments". Proceedings of the 40th International Conference on Software Engineering, 10.1145/3180155.3180161.

Background: Statistical concepts and techniques are often applied incorrectly, even in mature disciplines such as medicine or psychology. Surprisingly, there are very few works that study statistical problems in software engineering (SE). Aim: Assess the existence of statistical errors in SE experiments. Method: Compile the most common statistical errors in experimental disciplines. Survey experiments published in ICSE to assess whether errors occur in high quality SE publications. Results: The same errors as identified in others disciplines were found in ICSE experiments, where 30 of the reviewed papers included several error types such as: a) missing statistical hypotheses, b) missing sample size calculation, c) failure to assess statistical test assumptions, and d) uncorrected multiple testing. This rather large error rate is greater for research papers where experiments are confined to the validation section. The origin of the errors can be traced back to: a) researchers not having sufficient statistical training, and b) a profusion of exploratory research. Conclusions: This paper provides preliminary evidence that SE research suffers from the same statistical problems as other experimental disciplines. However, the SE community appears to be unaware of any shortcomings in its experiments, whereas other disciplines work hard to avoid these threats. Further research is necessary to find the underlying causes and set up corrective measures, but there are some potentially effective actions and are a priori easy to implement: a) improve the statistical training of SE researchers, and b) enforce quality assessment and reporting guidelines in SE publications.

Rico2021 Sergio Rico, Elizabeth Bjarnason, Emelie Engström, Martin Höst, and Per Runeson: "A case study of industry–academia communication in a joint software engineering research project". Journal of Software: Evolution and Process, 2021, 10.1002/smr.2372.

Empirical software engineering research relies on good communication with industrial partners. Conducting joint research both requires and contributes to bridging the communication gap between industry and academia (IA) in software engineering. This study aims to explore communication between the two parties in such a setting. To better understand what facilitates good IA communication and what project outcomes such communication promotes, we performed a case study, in the context of a long-term IA joint project, followed by a validating survey among practitioners and researchers with experience of working in similar settings. We identified five facilitators of IA communication and nine project outcomes related to this communication. The facilitators concern the relevance of the research, practitioners' attitude and involvement in research, frequency of communication and longevity of the collaboration. The project outcomes promoted by this communication include, for researchers, changes in teaching and new scientific venues, and for practitioners, increased awareness, changes to practice, and new tools and source code. Besides, both parties gain new knowledge and develop social-networks through IA communication. Our study presents empirically based insights that can provide advise on how to improve communication in IA research projects and thus the co-creation of software engineering knowledge that is anchored in both practice and research.

Rigger2020 Manuel Rigger and Zhendong Su: "Finding bugs in database systems via query partitioning". Proceedings of the ACM on Programming Languages, 4(OOPSLA), 2020, 10.1145/3428279.

Logic bugs in Database Management Systems (DBMSs) are bugs that cause an incorrect result for a given query, for example, by omitting a row that should be fetched. These bugs are critical, since they are likely to go unnoticed by users. We propose Query Partitioning, a general and effective approach for finding logic bugs in DBMSs. The core idea of Query Partitioning is to, starting from a given original query, derive multiple, more complex queries (called partitioning queries), each of which computes a partition of the result. The individual partitions are then composed to compute a result set that must be equivalent to the original query's result set. A bug in the DBMS is detected when these result sets differ. Our intuition is that due to the increased complexity, the partitioning queries are more likely to stress the DBMS and trigger a logic bug than the original query. As a concrete instance of a partitioning strategy, we propose Ternary Logic Partitioning (TLP), which is based on the observation that a boolean predicate p can either evaluate to TRUE, FALSE, or NULL. Accordingly, a query can be decomposed into three partitioning queries, each of which computes its result on rows or intermediate results for which p, NOT p, and p IS NULL hold. This technique is versatile, and can be used to test WHERE, GROUP BY, as well as HAVING clauses, aggregate functions, and DISTINCT queries. As part of an extensive testing campaign, we found 175 bugs in widely-used DBMSs such as MySQL, TiDB, SQLite, and CockroachDB, 125 of which have been fixed. Notably, 77 of these were logic bugs, while the remaining were error and crash bugs. We expect that the effectiveness and wide applicability of Query Partitioning will lead to its broad adoption in practice, and the formulation of additional partitioning strategies.

RodriguezPerez2018 Gema Rodríguez-Pérez, Gregorio Robles, and Jesús M. González-Barahona: "Reproducibility and credibility in empirical software engineering: A case study based on a systematic literature review of the use of the SZZ algorithm". Information and Software Technology, 99, 2018, 10.1016/j.infsof.2018.03.009.

When identifying the origin of software bugs, many studies assume that "a bug was introduced by the lines of code that were modified to fix it". However, this assumption does not always hold and at least in some cases, these modified lines are not responsible for introducing the bug. For example, when the bug was caused by a change in an external API. The lack of empirical evidence makes it impossible to assess how important these cases are and therefore, to which extent the assumption is valid. To advance in this direction, and better understand how bugs "are born", we propose a model for defining criteria to identify the first snapshot of an evolving software system that exhibits a bug. This model, based on the perfect test idea, decides whether a bug is observed after a change to the software. Furthermore, we studied the model's criteria by carefully analyzing how 116 bugs were introduced in two different open source software projects. The manual analysis helped classify the root cause of those bugs and created manually curated datasets with bug-introducing changes and with bugs that were not introduced by any change in the source code. Finally, we used these datasets to evaluate the performance of four existing SZZ-based algorithms for detecting bug-introducing changes. We found that SZZ-based algorithms are not very accurate, especially when multiple commits are found; the F-Score varies from 0.44 to 0.77, while the percentage of true positives does not exceed 63%. Our results show empirical evidence that the prevalent assumption, "a bug was introduced by the lines of code that were modified to fix it", is just one case of how bugs are introduced in a software system. Finding what introduced a bug is not trivial: bugs can be introduced by the developers and be in the code, or be created irrespective of the code. Thus, further research towards a better understanding of the origin of bugs in software projects could help to improve design integration tests and to design other procedures to make software development more robust.

RodriguezPerez2020 Gema Rodríguez-Pérez, Gregorio Robles, Alexander Serebrenik, Andy Zaidman, Daniel M. Germán, and Jesus M. Gonzalez-Barahona: "How bugs are born: a model to identify how bugs are introduced in software components". Empirical Software Engineering, 25(2), 2020, 10.1007/s10664-019-09781-y.

When identifying the origin of software bugs, many studies assume that "a bug was introduced by the lines of code that were modified to fix it". However, this assumption does not always hold and at least in some cases, these modified lines are not responsible for introducing the bug. For example, when the bug was caused by a change in an external API. The lack of empirical evidence makes it impossible to assess how important these cases are and therefore, to which extent the assumption is valid. To advance in this direction, and better understand how bugs "are born", we propose a model for defining criteria to identify the first snapshot of an evolving software system that exhibits a bug. This model, based on the perfect test idea, decides whether a bug is observed after a change to the software. Furthermore, we studied the model's criteria by carefully analyzing how 116 bugs were introduced in two different open source software projects. The manual analysis helped classify the root cause of those bugs and created manually curated datasets with bug-introducing changes and with bugs that were not introduced by any change in the source code. Finally, we used these datasets to evaluate the performance of four existing SZZ-based algorithms for detecting bug-introducing changes. We found that SZZ-based algorithms are not very accurate, especially when multiple commits are found; the F-Score varies from 0.44 to 0.77, while the percentage of true positives does not exceed 63%. Our results show empirical evidence that the prevalent assumption, "a bug was introduced by the lines of code that were modified to fix it", is just one case of how bugs are introduced in a software system. Finding what introduced a bug is not trivial: bugs can be introduced by the developers and be in the code, or be created irrespective of the code. Thus, further research towards a better understanding of the origin of bugs in software projects could help to improve design integration tests and to design other procedures to make software development more robust.


Sarker2019 Farhana Sarker, Bogdan Vasilescu, Kelly Blincoe, and Vladimir Filkov: "Socio-Technical Work-Rate Increase Associates With Changes in Work Patterns in Online Projects". 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), 10.1109/icse.2019.00099.

Software developers work on a variety of tasks ranging from the technical, e.g., writing code, to the social, e.g., participating in issue resolution discussions. The amount of work developers perform per week (their work-rate) also varies and depends on project needs and developer schedules. Prior work has shown that while moderate levels of increased technical work and multitasking lead to higher productivity, beyond a certain threshold, they can lead to lowered performance. Here, we study how increases in the short-term work-rate along both the technical and social dimensions are associated with changes in developers' work patterns, in particular communication sentiment, technical productivity, and social productivity. We surveyed active and prolific developers on GitHub to understand the causes and impacts of increased work-rates. Guided by the responses, we developed regression models to study how communication and committing patterns change with increased work-rates and fit those models to large-scale data gathered from traces left by thousands of GitHub developers. From our survey and models, we find that most developers do experience work-rate-increase-related changes in behavior. Most notably, our models show that there is a sizable effect when developers comment much more than their average: the negative sentiment in their comments increases, suggesting an increased level of stress. Our models also show that committing patterns do not change with increased commenting, and vice versa, suggesting that technical and social activities tend not to be multitasked.

Scalabrino2018 Simone Scalabrino, Mario Linares-Vásquez, Rocco Oliveto, and Denys Poshyvanyk: "A comprehensive model for code readability". Journal of Software: Evolution and Process, 30(6), 2018, 10.1002/smr.1958.

Unreadable code could compromise program comprehension, and it could cause the introduction of bugs. Code consists of mostly natural language text, both in identifiers and comments, and it is a particular form of text. Nevertheless, the models proposed to estimate code readability take into account only structural aspects and visual nuances of source code, such as line length and alignment of characters. In this paper, we extend our previous work in which we use textual features to improve code readability models. We introduce 2 new textual features, and we reassess the readability prediction power of readability models on more than 600 code snippets manually evaluated, in terms of readability, by 5K+ people. We also replicate a study by Buse and Weimer on the correlation between readability and FindBugs warnings, evaluating different models on 20 software systems, for a total of 3M lines of code. The results demonstrate that (1) textual features complement other features and (2) a model containing all the features achieves a significantly higher accuracy as compared with all the other state-of-the-art models. Also, readability estimation resulting from a more accurate model, ie, the combined model, is able to predict more accurately FindBugs warnings.

Scalabrino2021 Simone Scalabrino, Gabriele Bavota, Christopher Vendome, Mario Linares-Vasquez, Denys Poshyvanyk, and Rocco Oliveto: "Automatically Assessing Code Understandability". IEEE Transactions on Software Engineering, 47(3), 2021, 10.1109/tse.2019.2901468.

Understanding software is an inherent requirement for many maintenance and evolution tasks. Without a thorough understanding of the code, developers would not be able to fix bugs or add new features timely. Measuring code understandability might be useful to guide developers in writing better code, and could also help in estimating the effort required to modify code components. Unfortunately, there are no metrics designed to assess the understandability of code snippets. In this work, we perform an extensive evaluation of 121 existing as well as new code-related, documentation-related, and developer-related metrics. We try to (i) correlate each metric with understandability and (ii) build models combining metrics to assess understandability. To do this, we use 444 human evaluations from 63 developers and we obtained a bold negative result: none of the 121 experimented metrics is able to capture code understandability, not even the ones assumed to assess quality attributes apparently related, such as code readability and complexity. While we observed some improvements while combining metrics in models, their effectiveness is still far from making them suitable for practical applications. Finally, we conducted interviews with five professional developers to understand the factors that influence their ability to understand code snippets, aiming at identifying possible new metrics.

Shao2020 Shudi Shao, Zhengyi Qiu, Xiao Yu, Wei Yang, Guoliang Jin, Tao Xie, and Xintao Wu: "Database-Access Performance Antipatterns in Database-Backed Web Applications". 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME), 10.1109/icsme46990.2020.00016.

Database-backed web applications are prone to performance bugs related to database accesses. While much work has been conducted on database-access antipatterns with some recent work focusing on performance impact, there still lacks a comprehensive view of database-access performance antipatterns in database-backed web applications. To date, no existing work systematically reports known antipatterns in the literature, and no existing work has studied database-access performance bugs in major types of web applications that access databases differently.To address this issue, we first summarize all known database-access performance antipatterns found through our literature survey, and we report all of them in this paper. We further collect database-access performance bugs from web applications that access databases through language-provided SQL interfaces, which have been largely ignored by recent work, to check how extensively the known antipatterns can cover these bugs. For bugs not covered by the known antipatterns, we extract new database-access performance antipatterns based on real-world performance bugs from such web applications. Our study in total reports 24 known and 10 new database-access performance antipatterns. Our results can guide future work to develop effective tool support for different types of web applications.

Sharma2021 Pankajeshwara Nand Sharma, Bastin Tony Roy Savarimuthu, and Nigel Stanger: "Extracting Rationale for Open Source Software Development Decisions—A Study of Python Email Archives". 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), 10.1109/icse43902.2021.00095.

A sound Decision-Making (DM) process is key to the successful governance of software projects. In many Open Source Software Development (OSSD) communities, DM processes lie buried amongst vast amounts of publicly available data. Hidden within this data lie the rationale for decisions that led to the evolution and maintenance of software products. While there have been some efforts to extract DM processes from publicly available data, the rationale behind 'how' the decisions are made have seldom been explored. Extracting the rationale for these decisions can facilitate transparency (by making them known), and also promote accountability on the part of decision-makers. This work bridges this gap by means of a large-scale study that unearths the rationale behind decisions from Python development email archives comprising about 1.5 million emails. This paper makes two main contributions. First, it makes a knowledge contribution by unearthing and presenting the rationale behind decisions made. Second, it makes a methodological contribution by presenting a heuristics-based rationale extraction system called Rationale Miner that employs multiple heuristics, and follows a data-driven, bottom-up approach to infer the rationale behind specific decisions (e.g., whether a new module is implemented based on core developer consensus or benevolent dictator's pronouncement). Our approach can be applied to extract rationale in other OSSD communities that have similar governance structures.

Shrestha2020 Nischal Shrestha, Colton Botta, Titus Barik, and Chris Parnin: "Here we go again: why is it difficult for developers to learn another programming language?". Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, 10.1145/3377811.3380352.

Once a programmer knows one language, they can leverage concepts and knowledge already learned, and easily pick up another programming language. But is that always the case? To understand if programmers have difficulty learning additional programming languages, we conducted an empirical study of Stack Overflow questions across 18 different programming languages. We hypothesized that previous knowledge could potentially interfere with learning a new programming language. From our inspection of 450 Stack Overflow questions, we found 276 instances of interference that occurred due to faulty assumptions originating from knowledge about a different language. To understand why these difficulties occurred, we conducted semi-structured interviews with 16 professional programmers. The interviews revealed that programmers make failed attempts to relate a new programming language with what they already know. Our findings inform design implications for technical authors, toolsmiths, and language designers, such as designing documentation and automated tools that reduce interference, anticipating uncommon language transitions during language design, and welcoming programmers not just into a language, but its entire ecosystem.

Sliwerski2005 Jacek Śliwerski, Thomas Zimmermann, and Andreas Zeller: "When do changes induce fixes?". Proceedings of the 2005 international workshop on Mining software repositories - MSR '05, 10.1145/1083142.1083147.

As a software system evolves, programmers make changes that sometimes cause problems. We analyze CVS archives for fix-inducing changes—changes that lead to problems, indicated by fixes. We show how to automatically locate fix-inducing changes by linking a version archive (such as CVS) to a bug database (such as BUGZILLA). In a first investigation of the MOZILLA and ECLIPSE history, it turns out that fix-inducing changes show distinct patterns with respect to their size and the day of week they were applied.

Sobrinho2021 Elder Vicente de Paulo Sobrinho, Andrea De Lucia, and Marcelo de Almeida Maia: "A Systematic Literature Review on Bad Smells–5 W's: Which, When, What, Who, Where". IEEE Transactions on Software Engineering, 47(1), 2021, 10.1109/tse.2018.2880977.

Bad smells are sub-optimal code structures that may represent problems needing attention. We conduct an extensive literature review on bad smells relying on a large body of knowledge from 1990 to 2017. We show that some smells are much more studied in the literature than others, and also that some of them are intrinsically inter-related (which). We give a perspective on how the research has been driven across time (when). In particular, while the interest in duplicated code emerged before the reference publications by Fowler and Beck and by Brown et al., other types of bad smells only started to be studied after these seminal publications, with an increasing trend in the last decade. We analyzed aims, findings, and respective experimental settings, and observed that the variability of these elements may be responsible for some apparently contradictory findings on bad smells (what). Moreover, we could observe that, in general, papers tend to study different types of smells at once. However, only a small percentage of those papers actually investigate possible relations between the respective smells (co-studies), i.e., each smell tends to be studied in isolation. Despite of a few relations between some types of bad smells have been investigated, there are other possible relations for further investigation. We also report that authors have different levels of interest in the subject, some of them publishing sporadically and others continuously (who). We observed that scientific connections are ruled by a large "small world" connected graph among researchers and several small disconnected graphs. We also found that the communities studying duplicated code and other types of bad smells are largely separated. Finally, we observed that some venues are more likely to disseminate knowledge on Duplicate Code (which often is listed as a conference topic on its own), while others have a more balanced distribution among other smells (where). Finally, we provide a discussion on future directions for bad smell research.

Soremekun2021 Ezekiel Soremekun, Lukas Kirschner, Marcel Böhme, and Andreas Zeller: "Locating faults with program slicing: an empirical analysis". Empirical Software Engineering, 26(3), 2021, 10.1007/s10664-020-09931-7.

Statistical fault localization is an easily deployed technique for quickly determining candidates for faulty code locations. If a human programmer has to search the fault beyond the top candidate locations, though, more traditional techniques of following dependencies along dynamic slices may be better suited. In a large study of 457 bugs (369 single faults and 88 multiple faults) in 46 open source C programs, we compare the effectiveness of statistical fault localization against dynamic slicing. For single faults, we find that dynamic slicing was eight percentage points more effective than the best performing statistical debugging formula; for 66% of the bugs, dynamic slicing finds the fault earlier than the best performing statistical debugging formula. In our evaluation, dynamic slicing is more effective for programs with single fault, but statistical debugging performs better on multiple faults. Best results, however, are obtained by a hybrid approach : If programmers first examine at most the top five most suspicious locations from statistical debugging, and then switch to dynamic slices, on average, they will need to examine 15% (30 lines) of the code. These findings hold for 18 most effective statistical debugging formulas and our results are independent of the number of faults (i.e. single or multiple faults) and error type (i.e. artificial or real errors).

SotoValero2021 César Soto-Valero, Nicolas Harrand, Martin Monperrus, and Benoit Baudry: "A comprehensive study of bloated dependencies in the Maven ecosystem". Empirical Software Engineering, 26(3), 2021, 10.1007/s10664-020-09914-8.

Build automation tools and package managers have a profound influence on software development. They facilitate the reuse of third-party libraries, support a clear separation between the application's code and its external dependencies, and automate several software development tasks. However, the wide adoption of these tools introduces new challenges related to dependency management. In this paper, we propose an original study of one such challenge: the emergence of bloated dependencies. Bloated dependencies are libraries that the build tool packages with the application's compiled code but that are actually not necessary to build and run the application. This phenomenon artificially grows the size of the built binary and increases maintenance effort. We propose a tool, called DepClean, to analyze the presence of bloated dependencies in Maven artifacts. We analyze 9,639 Java artifacts hosted on Maven Central, which include a total of 723,444 dependency relationships. Our key result is that 75.1% of the analyzed dependency relationships are bloated. In other words, it is feasible to reduce the number of dependencies of Maven artifacts up to 1/4 of its current count. We also perform a qualitative study with 30 notable open-source projects. Our results indicate that developers pay attention to their dependencies and are willing to remove bloated dependencies: 18/21 answered pull requests were accepted and merged by developers, removing 131 dependencies in total.

Spadini2019 Davide Spadini, Fabio Palomba, Tobias Baum, Stefan Hanenberg, Magiel Bruntink, and Alberto Bacchelli: "Test-Driven Code Review: An Empirical Study". 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), 10.1109/icse.2019.00110.

Test-Driven Code Review (TDR) is a code review practice in which a reviewer inspects a patch by examining the changed test code before the changed production code. Although this practice has been mentioned positively by practitioners in informal literature and interviews, there is no systematic knowledge of its effects, prevalence, problems, and advantages. In this paper, we aim at empirically understanding whether this practice has an effect on code review effectiveness and how developers' perceive TDR. We conduct (i) a controlled experiment with 93 developers that perform more than 150 reviews, and (ii) 9 semi-structured interviews and a survey with 103 respondents to gather information on how TDR is perceived. Key results from the experiment show that developers adopting TDR find the same proportion of defects in production code, but more in test code, at the expenses of fewer maintainability issues in production code. Furthermore, we found that most developers prefer to review production code as they deem it more critical and tests should follow from it. Moreover, general poor test code quality and no tool support hinder the adoption of TDR.

Spadini2020 Davide Spadini, Gül Çalikli, and Alberto Bacchelli: "Primers or reminders?: the effects of existing review comments on code review". Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, 10.1145/3377811.3380385.

In contemporary code review, the comments put by reviewers on a specific code change are immediately visible to the other reviewers involved. Could this visibility prime new reviewers' attention (due to the human's proneness to availability bias), thus biasing the code review outcome? In this study, we investigate this topic by conducting a controlled experiment with 85 developers who perform a code review and a psychological experiment. With the psychological experiment, we find that ≈70% of participants are prone to availability bias. However, when it comes to the code review, our experiment results show that participants are primed only when the existing code review comment is about a type of bug that is not normally considered; when this comment is visible, participants are more likely to find another occurrence of this type of bug. Moreover, this priming effect does not influence reviewers' likelihood of detecting other types of bugs. Our findings suggest that the current code review practice is effective because existing review comments about bugs in code changes are not negative primers, rather positive reminders for bugs that would otherwise be overlooked during code review. Data and materials:

Spiegler2021 Simone V. Spiegler, Christoph Heinecke, and Stefan Wagner: "An empirical study on changing leadership in agile teams". Empirical Software Engineering, 26(3), 2021, 10.1007/s10664-021-09949-5.

An increasing number of companies aim to enable their development teams to work in an agile manner. When introducing agile teams, companies face several challenges. This paper explores the kind of leadership needed to support teams to work in an agile way. One theoretical agile leadership concept describes a Scrum Master who is supposed to empower the team to lead itself. Empirical findings on such a leadership role are controversial. We still have not understood how leadership unfolds in a team that is by definition self-organizing. Further exploration is needed to better understand leadership in agile teams. Our goal is to explore how leadership changes while the team matures using the example of the Scrum Master. Through a grounded theory study containing 75 practitioners from 11 divisions at the Robert Bosch GmbH we identified a set of nine leadership roles that are transferred from the Scrum Master to the Development Team while it matures. We uncovered that a leadership gap and a supportive internal team climate are enablers of the role transfer process, whereas role conflicts may diminish the role transfer. To make the Scrum Master change in a mature team, team members need to receive trust and freedom to take on a leadership role which was previously filled by the Scrum Master. We conclude with practical implications for managers, Product Owners, Development Teams and Scrum Masters which they can apply in real settings.

Spinellis2021 Diomidis Spinellis and Paris Avgeriou: "Evolution of the Unix System Architecture: An Exploratory Case Study". IEEE Transactions on Software Engineering, 47(6), 2021, 10.1109/tse.2019.2892149.

Unix has evolved for almost five decades, shaping modern operating systems, key software technologies, and development practices. Studying the evolution of this remarkable system from an architectural perspective can provide insights on how to manage the growth of large, complex, and long-lived software systems. Along main Unix releases leading to the FreeBSD lineage we examine core architectural design decisions, the number of features, and code complexity, based on the analysis of source code, reference documentation, and related publications. We report that the growth in size has been uniform, with some notable outliers, while cyclomatic complexity has been religiously safeguarded. A large number of Unix-defining design decisions were implemented right from the very early beginning, with most of them still playing a major role. Unix continues to evolve from an architectural perspective, but the rate of architectural innovation has slowed down over the system's lifetime. Architectural technical debt has accrued in the forms of functionality duplication and unused facilities, but in terms of cyclomatic complexity it is systematically being paid back through what appears to be a self-correcting process. Some unsung architectural forces that shaped Unix are the emphasis on conventions over rigid enforcement, the drive for portability, a sophisticated ecosystem of other operating systems and development organizations, and the emergence of a federated architecture, often through the adoption of third-party subsystems. These findings have led us to form an initial theory on the architecture evolution of large, complex operating system software.

Stol2018 Klaas-Jan Stol and Brian Fitzgerald: "The ABC of Software Engineering Research". ACM Transactions on Software Engineering and Methodology, 27(3), 2018, 10.1145/3241743.

A variety of research methods and techniques are available to SE researchers, and while several overviews exist, there is consistency neither in the research methods covered nor in the terminology used. Furthermore, research is sometimes critically reviewed for characteristics inherent to the methods. We adopt a taxonomy from the social sciences, termed here the ABC framework for SE research, which offers a holistic view of eight archetypal research strategies. ABC refers to the research goal that strives for generalizability over Actors (A) and precise measurement of their Behavior (B), in a realistic Context (C). The ABC framework uses two dimensions widely considered to be key in research design: the level of obtrusiveness of the research and the generalizability of research findings. We discuss metaphors for each strategy and their inherent limitations and potential strengths. We illustrate these research strategies in two key SE domains, global software engineering and requirements engineering, and apply the framework on a sample of 75 articles. Finally, we discuss six ways in which the framework can advance SE research.


Taipalus2021 Toni Taipalus, Hilkka Grahn, and Hadi Ghanbari: "Error messages in relational database management systems: A comparison of effectiveness, usefulness, and user confidence". Journal of Systems and Software, 181, 2021, 10.1016/j.jss.2021.111034.

The database and the database management system (DBMS) are two of the main components of any information system. Structured Query Language (SQL) is the most popular query language for retrieving data from the database, as well as for many other data management tasks. During system development and maintenance, software developers use a considerable amount of time to interpret compiler error messages. The quality of these error messages has been demonstrated to affect software development effectiveness, and correctly formulating queries and fixing them when needed is an important task for many software developers. In this study, we set out to investigate how participants (N = 152) experienced the qualities of error messages of four popular DBMSs in terms of error message effectiveness, perceived usefulness for finding and fixing errors, and error recovery confidence. Our results show differences between the DBMSs by three of the four metrics, and indicate a discrepancy between objective effectiveness and subjective usefulness. The results suggest that although error messages have perceived differences in terms of usefulness for finding and fixing errors, these differences may not necessarily result in differences in query fixing success rates.

Tamburri2020 Damian Andrew Tamburri, Kelly Blincoe, Fabio Palomba, and Rick Kazman: "'The Canary in the Coal Mine…' A cautionary tale from the decline of SourceForge". Software: Practice and Experience, 50(10), 2020, 10.1002/spe.2874.

Forges are online collaborative platforms to support the development of distributed open source software. While once mighty keepers of open source vitality, software forges are rapidly becoming less and less relevant. For example, of the top 10 forges in 2011, only one survives today—SourceForge—the biggest of them all, but its numbers are dropping and its community is tenuous at best. Through mixed-methods research, this article chronicles and analyze the software practice and experiences of the project's history—in particular its architectural and community/organizational decisions. We discovered a number of suboptimal social and architectural decisions and circumstances that, may have led to SourceForge's demise. In addition, we found evidence suggesting that the impact of such decisions could have been monitored, reduced, and possibly avoided altogether. The use of sociotechnical insights needs to become a basic set of design and software/organization monitoring principles that tell a cautionary tale on what to measure and what not to do in the context of large-scale software forge and community design and management.

Tan2020a Xin Tan, Minghui Zhou, and Zeyu Sun: "A first look at good first issues on GitHub". Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 10.1145/3368089.3409746.

Keeping a good influx of newcomers is critical for open source software projects' survival, while newcomers face many barriers to contributing to a project for the first time. To support newcomers onboarding, GitHub encourages projects to apply labels such as good first issue (GFI) to tag issues suitable for newcomers. However, many newcomers still fail to contribute even after many attempts, which not only reduces the enthusiasm of newcomers to contribute but makes the efforts of project members in vain. To better support the onboarding of newcomers, this paper reports a preliminary study on this mechanism from its application status, effect, problems, and best practices. By analyzing 9,368 GFIs from 816 popular GitHub projects and conducting email surveys with newcomers and project members, we obtain the following results. We find that more and more projects are applying this mechanism in the past decade, especially the popular projects. Compared to common issues, GFIs usually need more days to be solved. While some newcomers really join the projects through GFIs, almost half of GFIs are not solved by newcomers. We also discover a series of problems covering mechanism (e.g., inappropriate GFIs), project (e.g., insufficient GFIs) and newcomer (e.g., uneven skills) that makes this mechanism ineffective. We discover the practices that may address the problems, including identifying GFIs that have informative description and available support, and require limited scope and skill, etc. Newcomer onboarding is an important but challenging question in open source projects and our work enables a better understanding of GFI mechanism and its problems, as well as highlights ways in improving them.

Tan2020b Jie Tan, Daniel Feitosa, Paris Avgeriou, and Mircea Lungu: "Evolution of technical debt remediation in Python: A case study on the Apache Software Ecosystem". Journal of Software: Evolution and Process, 33(4), 2020, 10.1002/smr.2319.

In recent years, the evolution of software ecosystems and the detection of technical debt received significant attention by researchers from both industry and academia. While a few studies that analyze various aspects of technical debt evolution already exist, to the best of our knowledge, there is no large-scale study that focuses on the remediation of technical debt over time in Python projects—that is, one of the most popular programming languages at the moment. In this paper, we analyze the evolution of technical debt in 44 Python open-source software projects belonging to the Apache Software Foundation. We focus on the type and amount of technical debt that is paid back. The study required the mining of over 60K commits, detailed code analysis on 3.7K system versions, and the analysis of almost 43K fixed issues. The findings show that most of the repayment effort goes into testing, documentation, complexity, and duplication removal. Moreover, more than half of the Python technical debt is short term being repaid in less than 2 months. In particular, the observations that a minority of rules account for the majority of issues fixed and spent effort suggest that addressing those kinds of debt in the future is important for research and practice.

Tomasdottir2020 Kristín Fjóla Tómasdóttir, Maur'icio Aniche, and Arie van Deursen: "The Adoption of JavaScript Linters in Practice: A Case Study on ESLint". IEEE Transactions on Software Engineering, 46(8), 2020, 10.1109/tse.2018.2871058.

A linter is a static analysis tool that warns software developers about possible code errors or violations to coding standards. By using such a tool, errors can be surfaced early in the development process when they are cheaper to fix. For a linter to be successful, it is important to understand the needs and challenges of developers when using a linter. In this paper, we examine developers' perceptions on JavaScript linters. We study why and how developers use linters along with the challenges they face while using such tools. For this purpose we perform a case study on ESLint, the most popular JavaScript linter. We collect data with three different methods where we interviewed 15 developers from well-known open source projects, analyzed over 9,500 ESLint configuration files, and surveyed 337 developers from the JavaScript community. Our results provide practitioners with reasons for using linters in their JavaScript projects as well as several configuration strategies and their advantages. We also provide a list of linter rules that are often enabled and disabled, which can be interpreted as the most important rules to reason about when configuring linters. Finally, we propose several feature suggestions for tool makers and future work for researchers.

Tomassi2019 David A. Tomassi, Naji Dmeiri, Yichen Wang, Antara Bhowmick, Yen-Chuan Liu, Premkumar T. Devanbu, Bogdan Vasilescu, and Cindy Rubio-Gonzalez: "BugSwarm: Mining and Continuously Growing a Dataset of Reproducible Failures and Fixes". 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), 10.1109/icse.2019.00048.

Fault-detection, localization, and repair methods are vital to software quality; but it is difficult to evaluate their generality, applicability, and current effectiveness. Large, diverse, realistic datasets of durably-reproducible faults and fixes are vital to good experimental evaluation of approaches to software quality, but they are difficult and expensive to assemble and keep current. Modern continuous-integration (CI) approaches, like TRAVIS-CI, which are widely used, fully configurable, and executed within custom-built containers, promise a path toward much larger defect datasets. If we can identify and archive failing and subsequent passing runs, the containers will provide a substantial assurance of durable future reproducibility of build and test. Several obstacles, however, must be overcome to make this a practical reality. We describe BUGSWARM, a toolset that navigates these obstacles to enable the creation of a scalable, diverse, realistic, continuously growing set of durably reproducible failing and passing versions of real-world, open-source systems. The BUGSWARM toolkit has already gathered 3,091 fail-pass pairs, in Java and Python, all packaged within fully reproducible containers. Furthermore, the toolkit can be run periodically to detect fail-pass activities, thus growing the dataset continually.

Tregubov2017 Alexey Tregubov, Barry Boehm, Natalia Rodchenko, and Jo Ann Lane: "Impact of task switching and work interruptions on software development processes". Proceedings of the 2017 International Conference on Software and System Process, 10.1145/3084100.3084116.

Software developers often work on multiple projects and tasks throughout a work day, which may affect their productivity and quality of work. Knowing how working on several projects at a time affects productivity can improve cost and schedule estimations. It also can provide additional insights for better work scheduling and the development process. We want to achieve a better productivity without losing the benefits of work interruptions and multitasking for developers involved in the process. To understand how the development process can be improved, first, we identify work interruptions that mostly have a negative effect on productivity, second, we need to quantitatively evaluate impact of multitasking (task switching, work context switching) and work interruptions on productivity. In this research we study cross-project multitasking among the developers working on multiple projects in an educational setting. We propose a way to evaluate the number of cross-project interruptions among software developers using self-reported work logs. This paper describes the research that found: a) software developers involved in two or more projects on average spend 17% of their development effort on cross-project interruptions, b) the amount of effort spent on interruptions is overestimated by the G. Weinberg's heuristic, c) the correlation between the number of projects and effort spent by developers on cross-project interruptions is relatively weak, and d) there is strong correlation between the number of projects and the number of interruptions developers reported.



Venigalla2021 Akhila Sri Manasa Venigalla and Sridhar Chimalakonda: "On the comprehension of application programming interface usability in game engines". Software: Practice and Experience, 51(8), 2021, 10.1002/spe.2985.

Extensive development of games for various purposes including education and entertainment has resulted in increased development of game engines. Game engines are being used on a large scale as they support and simplify game development to a greater extent. Game developers using game engines are often compelled to use various application programming interfaces (APIs) of game engines in the process of game development. Thus, both quality and ease of development of games are greatly influenced by APIs defined in game engines. Hence, understanding API usability in game engines could greatly help in choosing better game engines among the ones that are available for game development and also could help developers in designing better game engines. In this article, we thus aim to evaluate API usability of 95 publicly available game engine repositories on GitHub, written primarily in C++ programming language. We test API usability of these game engines against the eight structural API usability metrics—AMNOI, AMNCI, AMGI, APXI, APLCI, AESI, ATSI, and ADI. We see this research as a first step toward the direction of improving usability of APIs in game engines. We present the results of the study, which indicate that about 25% of the game engines considered have minimal API usability, with respect to the considered metrics. It was observed that none of the considered repositories have ideal (all metric scores equal to 1) API usability, indicating the need for developers to consider API usability metrics while designing game engines.


Weintrop2017 David Weintrop and Uri Wilensky: "Comparing Block-Based and Text-Based Programming in High School Computer Science Classrooms". ACM Transactions on Computing Education, 18(1), 2017, 10.1145/3089799.

The number of students taking high school computer science classes is growing. Increasingly, these students are learning with graphical, block-based programming environments either in place of or prior to traditional text-based programming languages. Despite their growing use in formal settings, relatively little empirical work has been done to understand the impacts of using block-based programming environments in high school classrooms. In this article, we present the results of a 5-week, quasi-experimental study comparing isomorphic block-based and text-based programming environments in an introductory high school programming class. The findings from this study show students in both conditions improved their scores between pre- and postassessments; however, students in the blocks condition showed greater learning gains and a higher level of interest in future computing courses. Students in the text condition viewed their programming experience as more similar to what professional programmers do and as more effective at improving their programming ability. No difference was found between students in the two conditions with respect to confidence or enjoyment. The implications of these findings with respect to pedagogy and design are discussed, along with directions for future work.

Weir2021 Charles Weir, Ingolf Becker, and Lynne Blair: "A Passion for Security: Intervening to Help Software Developers". 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 10.1109/icse-seip52600.2021.00011.

While the techniques to achieve secure, privacy-preserving software are now well understood, evidence shows that many software development teams do not use them: they lack the 'security maturity' to assess security needs and decide on appropriate tools and processes; and they lack the ability to negotiate with product management for the required resources. This paper describes a measuring approach to assess twelve aspects of this security maturity; its use to assess the impact of a lightweight package of workshops designed to increase security maturity; and a novel approach within that package to support developers in resource negotiation. Based on trials in eight organizations, involving over 80 developers, this paper demonstrates that (1) development teams can notably improve their security maturity even in the absence of security specialists; and (2) suitably guided, developers can find effective ways to promote security to product management. Empowering developers to make their own decisions and promote security in this way offers a powerful grassroots approach to improving the security of software worldwide.

Wessel2020 Mairieli Wessel, Alexander Serebrenik, Igor Wiese, Igor Steinmacher, and Marco A. Gerosa: "Effects of Adopting Code Review Bots on Pull Requests to OSS Projects". 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME), 10.1109/icsme46990.2020.00011.

Software bots, which are widely adopted by Open Source Software (OSS) projects, support developers on several activities, including code review. However, as with any new technology adoption, bots may impact group dynamics. Since understanding and anticipating such effects is important for planning and management, we investigate how several activity indicators change after the adoption of a code review bot. We employed a regression discontinuity design on 1,194 software projects from GitHub. Our results indicate that the adoption of code review bots increases the number of monthly merged pull requests, decreases monthly non-merged pull requests, and decreases communication among developers. Practitioners and maintainers may leverage our results to understand, or even predict, bot effects on their projects' social interactions.



Yasmin2020 Jerin Yasmin, Yuan Tian, and Jinqiu Yang: "A First Look at the Deprecation of RESTful APIs: An Empirical Study". 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME), 10.1109/icsme46990.2020.00024.

REpresentational State Transfer (REST) is considered as one standard software architectural style to build web APIs that can integrate software systems over the internet. However, while connecting systems, RESTful APIs might also break the dependent applications that rely on their services when they introduce breaking changes, e.g., an older version of the API is no longer supported. To warn developers promptly and thus prevent critical impact on downstream applications, a deprecated-removed model should be followed, and deprecation-related information such as alternative approaches should also be listed. While API deprecation analysis as a theme is not new, most existing work focuses on non-web APIs, such as the ones provided by Java and Android.To investigate RESTful API deprecation, we propose a framework called RADA (RESTful API Deprecation Analyzer). RADA is capable of automatically identifying deprecated API elements and analyzing impacted operations from an OpenAPI specification, a machine-readable profile for describing RESTful web service. We apply RADA on 2,224 OpenAPI specifications of 1,368 RESTful APIs collected from, the largest directory of OpenAPI specifications. Based on the data mined by RADA, we perform an empirical study to investigate how the deprecated-removed protocol is followed in RESTful APIs and characterize practices in RESTful API deprecation. The results of our study reveal several severe deprecation-related problems in existing RESTful APIs. Our implementation of RADA and detailed empirical results are publicly available for future intelligent tools that could automatically identify and migrate usage of deprecated RESTful API operations in client code.

Yu2021 Zhongxing Yu, Chenggang Bai, Lionel Seinturier, and Martin Monperrus: "Characterizing the Usage, Evolution and Impact of Java Annotations in Practice". IEEE Transactions on Software Engineering, 47(5), 2021, 10.1109/tse.2019.2910516.

Annotations have been formally introduced into Java since Java 5. Since then, annotations have been widely used by the Java community for different purposes, such as compiler guidance and runtime processing. Despite the ever-growing use, there is still limited empirical knowledge about the actual usage of annotations in practice, the changes made to annotations during software evolution, and the potential impact of annotations on code quality. To fill this gap, we perform the first large-scale empirical study about Java annotations on 1,094 notable open-source projects hosted on GitHub. Our study systematically investigates annotation usage, annotation evolution, and annotation impact, and generates 10 novel and important findings. We also present the implications of our findings, which shed light for developers, researchers, tool builders, and language or library designers in order to improve all facets of Java annotation engineering.


Zampetti2020 Fiorella Zampetti, Carmine Vassallo, Sebastiano Panichella, Gerardo Canfora, Harald Gall, and Massimiliano Di Penta: "An empirical characterization of bad practices in continuous integration". Empirical Software Engineering, 25(2), 2020, 10.1007/s10664-019-09785-8.

Continuous Integration (CI) has been claimed to introduce several benefits in software development, including high software quality and reliability. However, recent work pointed out challenges, barriers and bad practices characterizing its adoption. This paper empirically investigates what are the bad practices experienced by developers applying CI. The investigation has been conducted by leveraging semi-structured interviews of 13 experts and mining more than 2,300 Stack Overflow posts. As a result, we compiled a catalog of 79 CI bad smells belonging to 7 categories related to different dimensions of a CI pipeline management and process. We have also investigated the perceived importance of the identified bad smells through a survey involving 26 professional developers, and discussed how the results of our study relate to existing knowledge about CI bad practices. Whilst some results, such as the poor usage of branches, confirm existing literature, the study also highlights uncovered bad practices, e.g., related to static analysis tools or the abuse of shell scripts, and contradict knowledge from existing literature, e.g., about avoiding nightly builds. We discuss the implications of our catalog of CI bad smells for (i) practitioners, e.g., favor specific, portable tools over hacking, and do not ignore nor hide build failures, (ii) educators, e.g., teach CI culture, not just technology, and teach CI by providing examples of what not to do, and (iii) researchers, e.g., developing support for failure analysis, as well as automated CI bad smell detectors.

Zhang2020 Haoxiang Zhang, Shaowei Wang, Tse-Hsun Chen, and Ahmed E. Hassan: "Reading Answers on Stack Overflow: Not Enough!". IEEE Transactions on Software Engineering, 2020, 10.1109/tse.2019.2954319.

Stack Overflow is one of the most active communities for developers to share their programming knowledge. Answers posted on Stack Overflow help developers solve issues during software development. In addition to posting answers, users can also post comments to further discuss their associated answers. As of Aug 2017, there are 32.3 million comments that are associated with answers, forming a large collection of crowdsourced repository of knowledge on top of the commonly-studied Stack Overflow answers. In this study, we wish to understand how the commenting activities contribute to the crowdsourced knowledge. We investigate what users discuss in comments, and analyze the characteristics of the commenting dynamics, (i.e., the timing of commenting activities and the roles of commenters). We find that: 1) the majority of comments are informative and thus can enhance their associated answers from a diverse range of perspectives. However, some comments contain content that is discouraged by Stack Overflow. 2) The majority of commenting activities occur after the acceptance of an answer. More than half of the comments are fast responses occurring within one day of the creation of an answer, while later comments tend to be more informative. Most comments are rarely integrated back into their associated answers, even though such comments are informative. 3) Insiders (i.e., users who posted questions/answers before posting a comment in a question thread) post the majority of comments within one month, and outsiders (i.e., users who never posted any question/answer before posting a comment) post the majority of comments after one month. Inexperienced users tend to raise limitations and concerns while experienced users tend to enhance the answer through commenting. Our study provides insights into the commenting activities in terms of their content, timing, and the individuals who perform the commenting. For the purpose of long-term knowledge maintenance and effective information retrieval for developers, we also provide actionable suggestions to encourage Stack Overflow users/engineers/moderators to leverage our insights for enhancing the current Stack Overflow commenting system for improving the maintenance and organization of the crowdsourced knowledge.

Zhang2021a Jingxuan Zhang, He Jiang, Zhilei Ren, Tao Zhang, and Zhiqiu Huang: "Enriching API Documentation with Code Samples and Usage Scenarios from Crowd Knowledge". IEEE Transactions on Software Engineering, 47(6), 2021, 10.1109/tse.2019.2919304.

As one key resource to learn Application Programming Interfaces (APIs), a lot of API reference documentation lacks code samples with usage scenarios, thus heavily hindering developers from programming with APIs. Although researchers have investigated how to enrich API documentation with code samples from general code search engines, two main challenges remain to be resolved, including the quality challenge of acquiring high-quality code samples and the mapping challenge of matching code samples to usage scenarios. In this study, we propose a novel approach named ADECK towards enriching API documentation with code samples and corresponding usage scenarios by leveraging crowd knowledge from Stack Overflow, a popular technical Question and Answer (Q&A) website attracting millions of developers. Given an API related Q&A pair, a code sample in the answer is extensively evaluated by developers and targeted towards resolving the question under the specified usage scenario. Hence, ADECK can obtain high-quality code samples and map them to corresponding usage scenarios to address the above challenges. Extensive experiments on the Java SE and Android API documentation show that the number of code-sample-illustrated API types in the ADECK-enriched API documentation is 3.35 and 5.76 times as many as that in the raw API documentation. Meanwhile, the quality of code samples obtained by ADECK is better than that of code samples by the baseline approach eXoaDocs in terms of correctness, conciseness, and usability, e.g., the average correctness values of representative code samples obtained by ADECK and eXoaDocs are 4.26 and 3.28 on a 5-point scale in the enriched Java SE API documentation. In addition, an empirical study investigating the impacts of different types of API documentation on the productivity of developers shows that, compared against the raw and the eXoaDocs-enriched API documentation, the ADECK-enriched API documentation can help developers complete 23.81 and 14.29 percent more programming tasks and reduce the average completion time by 9.43 and 11.03 percent.

Zhang2021b Haoxiang Zhang, Shaowei Wang, Tse-Hsun Chen, Ying Zou, and Ahmed E. Hassan: "An Empirical Study of Obsolete Answers on Stack Overflow". IEEE Transactions on Software Engineering, 47(4), 2021, 10.1109/tse.2019.2906315.

Stack Overflow accumulates an enormous amount of software engineering knowledge. However, as time passes, certain knowledge in answers may become obsolete. Such obsolete answers, if not identified or documented clearly, may mislead answer seekers and cause unexpected problems (e.g., using an out-dated security protocol). In this paper, we investigate how the knowledge in answers becomes obsolete and identify the characteristics of such obsolete answers. We find that: 1) More than half of the obsolete answers (58.4 percent) were probably already obsolete when they were first posted. 2) When an obsolete answer is observed, only a small proportion (20.5 percent) of such answers are ever updated. 3) Answers to questions in certain tags (e.g., node.js, ajax, android, and objective-c) are more likely to become obsolete. Our findings suggest that Stack Overflow should develop mechanisms to encourage the whole community to maintain answers (to avoid obsolete answers) and answer seekers are encouraged to carefully go through all information (e.g., comments) in answer threads.

Zieris2020 Franz Zieris and Lutz Prechelt: "Explaining pair programming session dynamics from knowledge gaps". Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, 10.1145/3377811.3380925.

Background: Despite a lot of research on the effectiveness of Pair Programming (PP), the question when it is useful or less useful remains unsettled. Method: We analyze recordings of many industrial PP sessions with Grounded Theory Methodology and build on prior work that identified various phenomena related to within-session knowledge build-up and transfer. We validate our findings with practitioners. Result: We identify two fundamentally different types of required knowledge and explain how different constellations of knowledge gaps in these two respects lead to different session dynamics. Gaps in project-specific systems knowledge are more hampering than gaps in general programming knowledge and are dealt with first and foremost in a PP session. Conclusion: Partner constellations with complementary knowledge make PP a particularly effective practice. In PP sessions, differences in system understanding are more important than differences in general software development knowledge.