This paper describes CCFinder, a token-based code clone detection system. A clone pair is a pair of identical or similar code portions that could be merged into a single routine to reduce the maintenance burden. The paper describes a token-by-token matching algorithm that employs several optimization techniques, making analysis of industrial-strength software practical. Language dependency is limited: developing the Java sub-component took only two person-days. Several metrics are developed, and results are presented for the source code of Java Development Kit 1.3.0, FreeBSD 4.0, NetBSD 1.5, and Linux 2.4.0.
The paper reports many very similar source files in javax/swing/*.java. It also contains a convincing visualization of strong similarities between FreeBSD and NetBSD (over 25,000 clone pairs). Between FreeBSD and Linux, 252 of 1,091 clone pairs (23 percent) were detected across line breaks, indicating how many clones line-by-line matching algorithms can miss.
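The point about line breaks can be illustrated with a minimal sketch. It is not CCFinder's implementation: the tokenizer, the function names, and the two code fragments are assumptions made up for illustration. It shows only that two fragments laid out differently fail a line-by-line comparison but match once compared as token sequences.

```python
import re

# Hypothetical, simplified tokenizer (an assumption, not CCFinder's lexer):
# identifiers, integer literals, and single punctuation characters.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokens(source):
    """Split source text into tokens, discarding whitespace and line breaks."""
    return TOKEN_RE.findall(source)


def same_by_lines(a, b):
    """Line-by-line comparison: stripped, non-empty lines must match in order."""
    lines_a = [ln.strip() for ln in a.splitlines() if ln.strip()]
    lines_b = [ln.strip() for ln in b.splitlines() if ln.strip()]
    return lines_a == lines_b


def same_by_tokens(a, b):
    """Token-by-token comparison: layout is irrelevant, only the tokens count."""
    return tokens(a) == tokens(b)


# The same statement, formatted with different line breaks (made-up fragments).
fragment_a = "if (x > 0) { y = x; }\n"
fragment_b = "if (x > 0)\n{\n    y = x;\n}\n"

print(same_by_lines(fragment_a, fragment_b))   # line-based view
print(same_by_tokens(fragment_a, fragment_b))  # token-based view
```

A real detector searches for common token subsequences across whole files rather than comparing two fragments for equality. The sketch isolates only the layout effect.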
The transformation, optimization, and other implementation techniques employed by CCFinder implicitly define similarity, and hence what counts as a clone pair. The paper shows the dramatic effects of disabling various techniques on the number of clone pairs detected. However, these results are not related to the metric values for clones, so it remains unclear which set of techniques is optimal.
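How a transformation implicitly defines similarity can be sketched as follows. This is an assumed, simplified rule (replace every non-keyword identifier with a placeholder), not the paper's actual transformation rules. The keyword list, the `$id` placeholder, and the sample fragments are all hypothetical. Switching the rule on or off changes whether the same two fragments are reported as a clone pair.

```python
import re

# Hypothetical, simplified tokenizer and keyword list (illustrative only).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")
KEYWORDS = {"if", "else", "for", "while", "return", "int"}


def tokenize(source, normalize_identifiers):
    """Tokenize; optionally replace each non-keyword identifier with '$id'."""
    out = []
    for tok in TOKEN_RE.findall(source):
        is_identifier = IDENT_RE.fullmatch(tok) is not None and tok not in KEYWORDS
        out.append("$id" if normalize_identifiers and is_identifier else tok)
    return out


def is_clone_pair(a, b, normalize_identifiers):
    """Two fragments form a clone pair if their token sequences are equal."""
    return tokenize(a, normalize_identifiers) == tokenize(b, normalize_identifiers)


# Structurally identical loops with different variable names (made-up fragments).
loop_a = "for (i = 0; i < n; i++) total = total + a[i];"
loop_b = "for (k = 0; k < len; k++) sum = sum + v[k];"

print(is_clone_pair(loop_a, loop_b, normalize_identifiers=False))
print(is_clone_pair(loop_a, loop_b, normalize_identifiers=True))
```

With normalization disabled the pair is missed, and with it enabled the pair is reported. The transformation setting, rather than the code alone, determines the clone count.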
The paper does not report applying the tool to itself. Considerable insights might have emerged from such an investigation, particularly had the CCFinder code then been refactored to merge clone pairs. Nevertheless, this paper represents a major contribution to code clone detection, and is highly recommended to specialists in software maintenance.