Computing Reviews
Estimating semantic relatedness in source code
Mahmoud A., Bradshaw G. ACM Transactions on Software Engineering and Methodology 25(1):1-35, 2015. Type: Article
Date Reviewed: Mar 9 2016

Identifying relationships between classes in existing source code may become more accurate with the authors' new method, normalized software distance (NSD), which is introduced, derived, and analyzed in this paper. Correctly identifying relationships between sections of code is an essential starting point for automated tools that improve testing, maintenance, and other tasks in large software systems. Natural language processing (NLP) techniques have been used to discover such relationships; however, source code differs substantially from natural language. The proposed metric, NSD, is shown to be superior to general NLP methods.

The authors analyzed a number of currently proposed methods, including latent semantic analysis (LSA) [1], normalized Google distance (NGD) [2], pointwise mutual information (PMI) [3], path-based methods [4,5], information-content methods [6,7,8], and a definition-of-words method [9].
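Of these, PMI [3] is the easiest to illustrate: it scores how much more often two terms co-occur than chance would predict. A minimal sketch under the usual document-frequency formulation (the function name and counts are illustrative, not taken from the paper):

```python
from math import log2

def pmi(count_xy, count_x, count_y, n_docs):
    """Pointwise mutual information of terms x and y from document counts."""
    p_xy = count_xy / n_docs   # probability both terms appear in a document
    p_x = count_x / n_docs
    p_y = count_y / n_docs
    return log2(p_xy / (p_x * p_y))

# Terms that co-occur more often than chance get positive PMI.
print(pmi(count_xy=8, count_x=10, count_y=10, n_docs=100))
```

In a source-code setting, "documents" would be the code units (for example, classes) in which identifier terms occur.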

Three Java software systems were used for comparison: a student-developed open-source medical application (47.6 KLOC, thousands of lines of code), a subproject of the Apache Ant project (40.9 KLOC), and a financial software package contributed by an industrial partner. For each system, the participants had two or more years of experience with the respective code base. The methods above were applied to the three projects, and the results were compared against the participants' knowledge, using recall analysis and mean average precision as evaluation measures. LSA performed best among the existing methods.
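Mean average precision rewards rankings that place the humanly judged relevant items near the top. A minimal sketch of the standard metric (the data structures here are assumptions for illustration, not the paper's evaluation harness):

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked result list against a relevant set."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# One query: relevant classes found at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2
runs = [(["A", "B", "C"], {"A", "C"})]
print(mean_average_precision(runs))
```

Averaging over many such queries, one per class or concept being matched, yields the single score used to compare methods.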

The analysis revealed a number of issues. Methods based on external sources performed worse than those based on the source code itself. However, the sparsity and lack of uniqueness of terms in source code were a problem for code-based approaches, including perfect term dependence (where two terms appear only together). The authors' approach, NSD, is a hybrid technique that uses the class as the level of granularity for code analysis, and the authors provide a theoretical derivation of it. The resulting measure normalizes the maximum of the logarithms of the inverses of the Bayesian (conditional) probabilities of one term given the occurrence of the other. NSD shows a statistically significant improvement over the other methods on the three software projects, and it also had the lowest time complexity for both preprocessing and relatedness calculation.
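The verbal description above matches the structure of normalized Google distance [2] recast in conditional probabilities over class-level occurrence counts. A minimal sketch under that reading (the function name, arguments, and the exact normalization are assumptions for illustration, not the paper's published formula):

```python
from math import log

def nsd(f_x, f_y, f_xy, n_classes):
    """Normalized-distance sketch from class-level occurrence counts.

    f_x, f_y  : number of classes containing term x (resp. y)
    f_xy      : number of classes containing both terms
    n_classes : total number of classes in the system
    """
    # log(1/P(x|y)) = log(f_y / f_xy); log(1/P(y|x)) = log(f_x / f_xy)
    numerator = max(log(f_x / f_xy), log(f_y / f_xy))
    # Normalize by the larger of log(1/P(x)) and log(1/P(y)).
    denominator = max(log(n_classes / f_x), log(n_classes / f_y))
    return numerator / denominator

# Terms with identical class distributions (f_x == f_y == f_xy) get distance 0.
print(nsd(f_x=5, f_y=5, f_xy=5, n_classes=100))
```

Under this reading, distance 0 means the two terms always appear in the same classes, and values near 1 mean their co-occurrence carries almost no information.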

The paper includes an extensive set of references and a thorough explanation of the experiments, results, derivation of the NSD measure, NSD results, limitations, and related work. This paper would be a good technical introduction to the area of semantic relatedness in source code.

Reviewer: David A. Gustafson. Review #: CR144227 (1605-0319)
1) Deerwester, S.; Dumais, S.; Furnas, G.; Landauer, T.; Harshman, R. Indexing by latent semantic analysis. J. Am. Soc. Inform. Sci. 41, 6(1990), 391–407.
2) Cilibrasi, R.; Vitanyi, P. The Google similarity distance. IEEE Trans. Knowl. Data Eng. 19, 3(2007), 370–383.
3) Church, K.; Hanks, P. Word association norms, mutual information, and lexicography. Comput. Linguistics 16, 1(1990), 22–29.
4) Leacock, C.; Chodorow, M. Combining local context and WordNet similarity for word sense identification. In: WordNet: an electronic lexical database. 24-26, MIT Press, 1998.
5) Wu, Z.; Palmer, M. Verbs semantics and lexical selection. In Proc. of the 32nd Annual Meeting on the Association for Computational Linguistics (ACL 1994). ACL, 1994, 133–138.
6) Resnick, P. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th International Joint Conference on Artificial Intelligence (IJCAI 1995). Morgan Kaufmann, 1995, 448–453.
7) Lin, D. An information-theoretic definition of similarity. In Proc. of the 15th International Conference on Machine Learning (ICML 1998). Morgan Kaufmann, 1998, 296–304.
8) Jiang, J.; Conrath, D. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the 10th International Conference on Research in Computational Linguistics (ROCLING 1997). National Taiwan University, 1997, 19–33.
9) Lesk, M. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proc. of the 5th Annual International Conference on Systems Documentation (SIGDOC 1986). ACM, 1986, 24–26.
Design Tools and Techniques (D.2.2)
Other reviews under "Design Tools and Techniques":

UNIX tool building
Ingham K., Academic Press Prof., Inc., San Diego, CA, 1991. Type: Book (9780123708304). Reviewed: Aug 1 1991

More C tools for scientists and engineers
Baker L. (ed), McGraw-Hill, Inc., New York, NY, 1991. Type: Book (9780070033580). Reviewed: Apr 1 1992

Introduction to programming logic for business applications
Wintermeyer L., Reston Publishing Co., Reston, VA, 1987. Type: Book (9789780835932516). Reviewed: Dec 1 1987
