The authors have undertaken a project to make bytecode more readable by interspersing it with machine-generated
comments. There are two salient questions regarding this project: Did they (at least mostly) succeed? And to the extent that they did succeed, will these comments actually prove useful?
An important factor to consider here is that one bad comment may be more harmful than nine good comments are helpful. In the absence of a good comment, a programmer can at least read the code and see what it does. But a bad comment may actively mislead the programmer, who, believing they understand what is going on, misreads the code. Thus, even if
automatic comment generation is, say, 90 percent accurate, it still may do more harm than good.
The authors spend some time explaining how they tested their project. They searched Maven (a build and dependency-management tool for Java) repositories, rejecting those that lacked either bytecode or commented source code. The authors also worked to eliminate most template-generated comments, since these are not truly independent instances.
But the paper is less than clear on what they did to detect templates:
Assuming that <> represents any character, then the sentences “get the type of error” and “get the type of event” can be represented by the template “get the type of <>.”
I must admit that this sentence is entirely opaque to me.
In any case, 55,130 bytecode-comment pairs were collected for machine learning. The authors extracted information from the tokens contained in the bytecode, cleverly reducing the number of distinct tokens they had to consider by taking advantage of Java's conventional camelCase identifier naming.
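The paper does not spell out the splitting procedure, but the basic idea can be sketched as follows (the class, method, and regular expression here are my own illustration, not the authors' code): a camelCase identifier such as getErrorType is broken at each lowercase-to-uppercase boundary, yielding the natural-language subtokens get, error, and type.

```java
import java.util.Arrays;
import java.util.List;

public class CamelCaseSplitter {
    // Split a camelCase identifier into lowercase subtokens,
    // e.g. "getErrorType" -> [get, error, type].
    static List<String> split(String identifier) {
        // Insert a space at every boundary where a lowercase letter or
        // digit is followed by an uppercase letter, then split on spaces.
        String spaced = identifier.replaceAll("(?<=[a-z0-9])(?=[A-Z])", " ");
        return Arrays.asList(spaced.toLowerCase().split(" "));
    }

    public static void main(String[] args) {
        System.out.println(split("getErrorType")); // [get, error, type]
    }
}
```

The payoff is that the model's vocabulary shrinks from the huge set of distinct identifiers to a much smaller set of common English subwords.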
The bytecode could be treated as plain text, but this would discard much structural information. To preserve that structure, the authors introduce a control-flow-graph representation of the bytecode.
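To give a sense of what such a representation involves, here is a minimal sketch of one standard step of control-flow-graph construction: finding the "leaders," the instruction offsets where basic blocks begin. The toy instruction list and offsets below are hypothetical and not taken from the paper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class BasicBlocks {
    // A toy bytecode instruction: its byte offset, its opcode mnemonic,
    // and an optional branch target offset (null for non-branches).
    record Insn(int offset, String op, Integer target) {}

    // A basic block starts at: the first instruction, every branch
    // target, and every instruction that follows a branch.
    static List<Integer> leaders(List<Insn> code) {
        Set<Integer> set = new TreeSet<>();
        set.add(code.get(0).offset());
        for (int i = 0; i < code.size(); i++) {
            if (code.get(i).target() != null) {
                set.add(code.get(i).target());
                if (i + 1 < code.size()) set.add(code.get(i + 1).offset());
            }
        }
        return new ArrayList<>(set);
    }

    public static void main(String[] args) {
        // Hypothetical bytecode roughly like "if (x >= 0) skip; x = -x; return x;"
        List<Insn> code = List.of(
                new Insn(0, "iload_1", null),
                new Insn(1, "ifge", 6),   // conditional branch to offset 6
                new Insn(4, "iload_1", null),
                new Insn(5, "ineg", null),
                new Insn(6, "iload_1", null),
                new Insn(7, "ireturn", null));
        System.out.println(leaders(code)); // [0, 4, 6]
    }
}
```

Once instructions are grouped into blocks and edges are drawn between them, the graph records which paths execution can take, information that a flat token sequence loses.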
The authors take some time describing their experimental setup, in which they compare their project against several other
natural language processing models. The evaluation criterion is how similar each method's generated comments are to the original source code comments. They find their model performs significantly better than the state-of-the-art baselines. And when the generated comments were rated by human programmers, the BCGen model also outperformed its rivals. Nevertheless, neither the similarity scores nor the human ratings were particularly high, meaning many of the generated comments were not good.
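The review does not name the similarity metrics, but automatic metrics for this task typically compare token overlap between the generated and reference comments. The following is only a crude illustrative proxy of that idea (a single-direction token-overlap score of my own devising), not the authors' actual metric:

```java
import java.util.Arrays;
import java.util.List;

public class OverlapScore {
    // Fraction of reference-comment tokens that also appear in the
    // generated comment. Real metrics for this task are more involved;
    // this only illustrates the general token-overlap idea.
    static double score(String generated, String reference) {
        List<String> gen = Arrays.asList(generated.toLowerCase().split("\\s+"));
        String[] ref = reference.toLowerCase().split("\\s+");
        int hits = 0;
        for (String tok : ref) {
            if (gen.contains(tok)) hits++;
        }
        return (double) hits / ref.length;
    }

    public static void main(String[] args) {
        // "gets" does not match "get" exactly, so 3 of 4 tokens overlap.
        System.out.println(score("get the type of error", "gets the error type")); // 0.75
    }
}
```

Even a score like this makes clear why the numbers matter: a generated comment can share most of its words with the reference and still miss the one word that carries the meaning.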
So this brings us back to the second question asked at the beginning of the review: Is the activity of automated comment generation actually helpful to working engineers? Unfortunately, the authors do not address this point. Furthermore, for the activities in which bytecode analysis is used, they do not show that interspersing bytecode with comments is even
potentially helpful. For instance, in detecting malware, one is dealing with code that is intentionally deceptive. Comments
automatically generated from such code would seem likely to be derivatively deceptive, and thus of little help in
the task at hand.
In summary, this is a very interesting project that has improved the state of the art in automated comment generation.
But whether it will have practical benefits remains to be seen.