Where should I comment my code? A dataset and model for predicting locations that need comments
Reviewed by Wenxin Jiang / 2022-02-22
Keywords: Machine Learning, Maintenance
Every programmer knows that they're supposed to write code and comments at the same time, but most programs still contain fewer comments than they should. To help programmers figure out where comments are most useful, Louis2020 created a dataset of 41,506 snippets of C/C++ source code, each labeled to indicate whether it should contain a comment. Using this dataset, they evaluated and compared three machine learning models: a LOC model, a sequence model, and a hierarchical sequence model.
Based on both accuracy and generalization to unseen data, the authors found that hierarchical models are the best choice. In particular, a hierarchical sequence model reaches a precision of 74% and a recall of 13%, even though it captures the content of the code only shallowly.
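To make the two figures concrete, here is a minimal sketch of how precision and recall are computed from raw counts. The counts are hypothetical, chosen only so that the ratios land near the reported 74% and 13%; they are not taken from the paper, and the function name `precision_recall` is an illustrative choice.

```python
from typing import Tuple


def precision_recall(true_positives: int,
                     false_positives: int,
                     false_negatives: int) -> Tuple[float, float]:
    """Return (precision, recall) computed from raw confusion counts."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives
    # Guard against division by zero when nothing is predicted or nothing is positive.
    precision = true_positives / predicted_positive if predicted_positive else 0.0
    recall = true_positives / actual_positive if actual_positive else 0.0
    return precision, recall


# Hypothetical counts (NOT from the paper): 1000 snippets truly need a comment,
# the model flags 176 snippets, and 130 of those flags are correct.
tp, fp, fn = 130, 46, 870
precision, recall = precision_recall(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.74 recall=0.13
```

Read this way, the numbers say that roughly three out of four suggested locations are ones where developers did write a comment, while most comment-worthy locations are never suggested at all.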
Their current tool is available at http://groups.inf.ed.ac.uk/cup/comment-locator/. The authors' short-term goal is to incorporate background knowledge and 'surprise'-capturing features into the models. In the future, they plan to segment code into snippets automatically rather than relying on the current heuristic of splitting at blank lines.
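The blank-line heuristic can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it assumes a snippet is any run of non-blank lines, and that a snippet counts as "commented" when it contains a `//` or `/*` marker (markers inside string literals are not handled). The names `split_on_blank_lines` and `label_snippets` are hypothetical.

```python
import re
from typing import List, Tuple

# Assumption: any `//` or `/*` marks a comment; string literals are not handled.
COMMENT_RE = re.compile(r"//|/\*")


def split_on_blank_lines(source: str) -> List[str]:
    """Split source text into snippets separated by one or more blank lines."""
    snippets: List[str] = []
    current: List[str] = []
    for line in source.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the snippet collected so far.
            snippets.append("\n".join(current))
            current = []
    if current:
        snippets.append("\n".join(current))
    return snippets


def label_snippets(source: str) -> List[Tuple[str, bool]]:
    """Pair each snippet with True if it contains a comment marker."""
    return [(s, bool(COMMENT_RE.search(s))) for s in split_on_blank_lines(source)]


example = """int total = 0;
int count = 0;

// Skip negative readings: the sensor reports -1 on failure.
for (int i = 0; i < n; i++) {
    if (v[i] >= 0) { total += v[i]; count++; }
}

return count ? total / count : 0;
"""

for snippet, has_comment in label_snippets(example):
    print(has_comment, "|", snippet.splitlines()[0])
```

A training set built this way would presumably strip the comments from each snippet before feeding it to a model and keep only the boolean as the label, so that the model has to predict comment-worthiness from the code alone.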
Louis2020 Annie Louis, Santanu Kumar Dash, Earl T. Barr, Michael D. Ernst, Charles Sutton: "Where should I comment my code? A dataset and model for predicting locations that need comments." 2020 IEEE 42nd International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER), doi:10.1145/3377816.3381736
Programmers should write code comments, but not on every line of code. We have created a machine learning model that suggests locations where a programmer should write a code comment. We trained it on existing commented code to learn locations that are chosen by developers. Once trained, the model can predict locations in new code. Our models achieved precision of 74% and recall of 13% in identifying comment-worthy locations. This first success opens the door to future work, both in the new where-to-comment problem and in guiding comment generation. Our code and data is available at http://groups.inf.ed.ac.uk/cup/comment-locator/.