A Hopeful Case for Generative AI in Software Engineering

Reviewed by Mei Nagappan / 2023-05-13
Keywords: Generative AI

tl;dr: Any C-Suite executive who thinks they can replace software engineers (even novices) with generative AI will be at a disadvantage compared to competitors who use it to empower software engineers.

New versions of generative AI like OpenAI's GPT-3.5/4 models have made a huge splash because of their ability to write code, and because of their potential negative impact. Economists at Goldman Sachs projected that "generative AI could expose the equivalent of 300 million full-time jobs to automation" [1], and of course there was the infamous letter asking companies to stop training new generative AI models [2] (but see also [16]). In the software engineering context we therefore need to ask, "Can generative AI automate away programming jobs?"

Caveat: I am a software engineering researcher, not a specialist in AI/ML. Hence, I will restrict my discussion to just software engineering - something that I know a little bit about. This is not a commentary on any other jobs as I do not know enough about them. This is also not about the ethical/legal aspects of the technology. There are several such aspects that I will leave to experts on ethics and the law like my colleague Maura R. Grossman.

What evidence is there that anyone might think of replacing programmers entirely with AI—not anecdotes or TED Talks, but evidence?

Sandoval et al. [3] did a user study to investigate the cybersecurity impact of LLMs on code written by student programmers. They found that the use of LLMs did not introduce new security risks but helped participants generate more correct solutions.
Similarly, Asare et al. [4] investigated whether GitHub Copilot is just as likely to introduce the same software vulnerabilities as human developers. They found that it replicates the original vulnerable code ≈33% of the time while replicating the fixed code ≈25% of the time.
Ziegler et al. [5] compared the results of a Copilot user survey with data directly measured from Copilot usage. They report a strong correlation between the acceptance rates of Copilot suggestions (directly measured) and developer perceptions of productivity (user survey).
Vaithilingam et al. [6] found that Copilot frequently provides good starting points that direct programmers toward a desired solution.
Performance experiments by Erhabor [7] indicate that humans assisted by Copilot produced correct code more frequently than humans without that assistance.
Bubeck et al. [8] evaluated an early version of GPT-4 on several coding activities. They found it was comparable to humans in LeetCode questions, could develop small games, do data visualization, and even understand and reason about code execution. (Sébastien Bubeck also gives a great talk about the study.)

So much for the pros—what about the cons? The answer is, "It's complicated."

Pearce et al. [9] evaluated Copilot with a focus on security. They found that Copilot, on its own, generates vulnerable suggestions about 40% of the time. Note that this is different from Sandoval et al. [3] and Asare et al. [4], where humans are empowered with generative AI tools.
Perry et al. [10] conducted a large-scale study to determine whether users write more insecure code with the Codex model (the one behind Copilot). They found that participants who had access to the Codex assistant wrote significantly less secure code than those without access, and were also more likely to believe they had written secure code. This study contradicts the findings of Sandoval et al. [3].
Vaithilingam et al. [6] evaluated Copilot's usability. One of their main takeaways is that Copilot does not necessarily reduce the time required to complete a task. This contradicts Ziegler et al.'s [5] findings.
Erhabor [7] evaluated the runtime performance of code produced when developers use Copilot versus when they do not. Their results suggest that using Copilot may produce code with significantly slower performance.
The study by Dakhel et al. [11] suggests that Copilot can provide solutions for most fundamental algorithmic problems with the exception of some cases that yield buggy results. They also find that human programmers generate a higher ratio of correct solutions than Copilot. This contradicts the auxiliary findings of Erhabor [7].
Even OpenAI's own report on GPT-4 [12] indicates that the model could not solve approximately 74% of medium-difficulty and 93% of hard-difficulty LeetCode problems.

The jury is therefore still out, even before we worry about the murky legalities of training models on people's data without their prior knowledge or consent. However:

These models are still Question and Answer type models, not independent agents: they need people to prompt them to generate code responses. The need for proper prompt engineering is starting to be widely recognized [13, 14]. This suggests that requirements engineering, long neglected, will become increasingly critical.
While generative AI can write some code reasonably well, it's hard to know whether the code is right or wrong. We can fuzz test to see if the application crashes, but we cannot know if the functional requirements are met. We therefore need humans to see if the code does what is needed, rather than just what was asked for.
Current generative AI models cannot plan ahead [8], and the coding tasks that have been tested with generative AI are quite small. We therefore still need humans to break problems down into smaller pieces so that models can generate code.
Finally, as even OpenAI states [12], models "do not learn from experience". Given this limitation, when a new version of a library is released to fix a bug or to add a feature, generative models won't be able to make use of it without retraining. (Of course, the same can be said of people…)
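
To make the point about fuzz testing concrete, here is a minimal sketch in Python. It is an illustration only, not taken from any of the studies cited above: the function names (`median`, `fuzz`, `meets_requirement`) and the requirement being checked are hypothetical. It shows a plausible-looking function that survives random inputs without crashing, yet fails a requirement that a person had to write down.

```python
import random


def median(values):
    """Hypothetical generated code: looks plausible, but ignores even-length input."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def fuzz(func, trials=1000, seed=0):
    """Call func on random non-empty integer lists and count how many calls raise."""
    rng = random.Random(seed)
    crashes = 0
    for _ in range(trials):
        data = [rng.randint(-100, 100) for _ in range(rng.randint(1, 20))]
        try:
            func(data)
        except Exception:
            crashes += 1
    return crashes


def meets_requirement(func):
    """Human-written check (assumed requirement): the median of an even-length
    list is the mean of its two middle values."""
    return func([1, 2, 3, 4]) == 2.5


if __name__ == "__main__":
    print("crashes under fuzzing:", fuzz(median))
    print("meets functional requirement:", meets_requirement(median))
```

The fuzzer only tells us the function does not raise on these inputs. Whether it computes the value that was actually needed is decided by the check a human specified.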

So, can generative AI automate away software engineering jobs? The answer right now is, "No, but they are still useful." And, as my colleague Asokan notes, while generative AI is not going to automate away SE jobs, it is certainly going to change the way software engineers work. If someone's expertise is only writing code, they will almost certainly be replaced by generative AI in the near future. However, the parts of software engineering before and after writing code will gain increased prominence. As a call to action to my fellow researchers, I say that we should be working to make generative AI more usable for software engineers and to figure out how to incorporate it into our workflows.

This talk from Sal Khan on the educational possibilities of generative AI should serve as an inspiration. If you are interested in a broader discussion of why the fear of generative AI models is just a panic (even though there are some legitimate concerns of evil actors misusing it), you may also enjoy [15].

Thanks to Owura Asare, Partha Chakraborty, Daniel Erhabor, N. Asokan, Semih Salihoglu, Tamer Özsu, Jimmy Lin, and Samer Al-Kiswany for their contributions and feedback.

Mei Nagappan is an associate professor at the University of Waterloo. He presented work on bias in evaluating code contributions for our April 2022 lightning talks.

References:

[1] https://www.goldmansachs.com/insights/pages/generative-ai-could-raise-global-gdp-by-7-percent.html
[2] https://futureoflife.org/open-letter/pause-giant-ai-experiments/
[3] Gustavo Sandoval, Hammond Pearce, Teo Nys, Ramesh Karri, Siddharth Garg, and Brendan Dolan-Gavitt. Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants. 2023. URL: https://www.usenix.org/conference/usenixsecurity23/presentation/sandoval
[4] Owura Asare, Meiyappan Nagappan, and N. Asokan. Is GitHub's Copilot as bad as humans at introducing vulnerabilities in code? arXiv preprint arXiv:2204.04741, 2022. URL: https://arxiv.org/abs/2204.04741
[5] Albert Ziegler, Eirini Kalliamvakou, X. Alice Li, Andrew Rice, Devon Rifkin, Shawn Simister, Ganesh Sittampalam, and Edward Aftandilian. Productivity assessment of neural code completion. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, pages 21–29, San Diego CA USA, June 2022. ACM. URL: https://dl.acm.org/doi/10.1145/3520312.3534864, doi:10.1145/3520312.3534864.
[6] Priyan Vaithilingam, Tianyi Zhang, and Elena L. Glassman. Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models. In CHI Conference on Human Factors in Computing Systems Extended Abstracts, pages 1–7, New Orleans LA USA, April 2022. ACM. URL: https://dl.acm.org/doi/10.1145/3491101.3519665, doi:10.1145/3491101.3519665.
[7] Daniel Erhabor. Measuring the Performance of Code Produced with GitHub Copilot. UWSpace, 2022. URL: http://hdl.handle.net/10012/19000
[8] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023. URL: https://arxiv.org/abs/2303.12712
[9] Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions. In 2022 IEEE Symposium on Security and Privacy (SP), pages 754–768, May 2022. ISSN: 2375-1207. doi:10.1109/SP46214.2022.9833571. URL: https://ieeexplore.ieee.org/document/9833571
[10] Neil Perry, Megha Srivastava, Deepak Kumar, and Dan Boneh. Do Users Write More Insecure Code with AI Assistants? arXiv preprint arXiv:2211.03622, 2022. URL: https://arxiv.org/abs/2211.03622
[11] Arghavan Moradi Dakhel, Vahid Majdinasab, Amin Nikanjam, Foutse Khomh, Michel C. Desmarais, and Zhen Ming Jiang. GitHub Copilot AI pair programmer: Asset or Liability? June 2022. arXiv:2206.15331. URL: http://arxiv.org/abs/2206.15331
[12] OpenAI (2023). URL: https://cdn.openai.com/papers/gpt-4.pdf
[13] Best practices for prompt engineering with OpenAI API. URL: https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api
[14] Prompt engineering. URL: https://help.openai.com/en/collections/3675942-prompt-engineering
[15] Patrick Grady and Daniel Castro. Tech Panics, Generative AI, and the Need for Regulatory Caution. URL: https://datainnovation.org/2023/05/tech-panics-generative-ai-and-regulatory-caution/
[16] Diomidis Spinellis. The hypocritical call to pause giant AI. Accessed May 2023.
