CCFinder/Gemini is a token-based code clone detection system for industrial-strength software. Because its analysis is purely lexical, however, the clone pairs it reports are not always semantically cohesive, and users of CCFinder/Gemini must manually determine if clone pairs can be merged. This paper describes an extension to Gemini, called the Code Clone Shaper, that supports the extraction of semantically cohesive clone pairs. With the extension, CCFinder first reports its list of clone pairs in the usual way. The Code Clone Shaper then parses the source files separately and calculates the positions of blocks (code portions enclosed by a pair of brackets). Finally, it extracts the blocks contained in the clone pairs on the list reported by CCFinder.
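The shaping step can be pictured with a small sketch. This is an illustration only, not the authors' implementation. It assumes that clone fragments and blocks are given as inclusive line ranges, that blocks are delimited by curly braces, and that braces inside strings or comments can be ignored. The names Span, find_blocks, and shape are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Inclusive line range (hypothetical representation, 1-based)."""
    start: int
    end: int


def find_blocks(source: str) -> list[Span]:
    """Return the line spans of brace-delimited blocks.

    Simplification: braces in strings and comments are not skipped,
    which a real parser would have to do.
    """
    stack: list[int] = []
    blocks: list[Span] = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for ch in line:
            if ch == "{":
                stack.append(lineno)
            elif ch == "}" and stack:
                blocks.append(Span(stack.pop(), lineno))
    # Outer blocks sort before the blocks nested inside them.
    return sorted(blocks, key=lambda s: (s.start, -s.end))


def shape(fragment: Span, blocks: list[Span]) -> list[Span]:
    """Keep the outermost blocks lying entirely inside a clone fragment."""
    inside = [b for b in blocks
              if fragment.start <= b.start and b.end <= fragment.end]
    outermost: list[Span] = []
    for b in inside:
        nested = any(o.start <= b.start and b.end <= o.end for o in outermost)
        if not nested:
            outermost.append(b)
    return outermost


SAMPLE = """void f() {
  int a = 0;
  if (a) {
    a++;
  }
  while (a) {
    a--;
  }
}"""

if __name__ == "__main__":
    blocks = find_blocks(SAMPLE)
    # A clone fragment covering lines 2-8 cuts through the method body,
    # so only the two complete inner blocks survive the shaping step.
    print(shape(Span(2, 8), blocks))
```

In this sketch, a fragment that starts or ends in the middle of a block yields only the complete blocks it contains, and a fragment containing no complete block yields nothing. This is one plausible reading of how the number of reported clone pairs can shrink so sharply.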
In the two case studies reported in the paper, the Code Clone Shaper reduced the number of clone pairs to about 1/350 of the original count (from 338,574 to 972 clone pairs) and to about 1/120 (from 12,033 to 103 clone pairs), greatly reducing the manual effort required of users of CCFinder/Gemini. Twenty-eight clones were found in the language tool ANTLR that could be replaced by a single parameterized method. Seven identical methods were found in the build tool ANT that could be replaced by a single method in a superclass.
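The quoted reduction factors are rounded. A quick check of the arithmetic, using only the counts given above, follows. The function name reduction_factor and the labels "first" and "second" are illustrative and simply follow the order in which the figures are reported.

```python
def reduction_factor(before: int, after: int) -> float:
    """How many times smaller the shaped clone-pair list is."""
    return before / after


# Clone-pair counts before and after shaping, as quoted in the review.
case_studies = {
    "first case study": (338_574, 972),
    "second case study": (12_033, 103),
}

if __name__ == "__main__":
    for name, (before, after) in case_studies.items():
        factor = reduction_factor(before, after)
        print(f"{name}: {before} -> {after} clone pairs, about 1/{factor:.0f}")
```

The exact factors come out near 348 and 117, consistent with the rounded 1/350 and 1/120.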
Additional insights might have emerged had a complete investigation been undertaken of all the clone pairs extracted by the Code Clone Shaper. The reader is also left wondering what the consequences would be of using different size thresholds for code clone detection. (The threshold was set at 50 tokens for the two case studies reported.) Nevertheless, this paper is highly recommended to specialists in software maintenance.