Today, huge volumes of data about software development are available from a variety of sources, including organizational databases, open-source software project metadata, and other software engineering repositories, mailing lists, discussion forums, and newsletters. Data mining provides the capability to analyze this data and transform it into valuable information.
This excellent paper deals with the mining of data produced during the software development life cycle and stored in software repositories. The authors introduce the concepts, approaches, tasks, and techniques of data mining, and the challenges of mining software repositories. The data mining approaches described are clustering, classification, frequent pattern mining and association rules, data characterization and summarization, change and deviation detection, and text mining.
The paper classifies, describes, and explains the different types of software engineering data. These include documentation, software configuration management data, source code, compiled code, execution traces, problem tracking and bug reports, and mailing lists.
The authors map the various data mining approaches and techniques to the software engineering tasks for which they are helpful. For software development, testing, debugging, maintenance, and reuse, they discuss and summarize the appropriate mining approaches, the input data, and the data analysis results. The authors conclude by discussing the challenges in mining software engineering repositories, which in their opinion require further research.
This paper will be useful to anyone involved in the development, testing, debugging, maintenance, or reuse of software products, from programmers to managers. The knowledge obtained from mining software repositories and related sources will help such readers better understand the development process and thereby refine it, making the software life cycle processes more efficient.