Commercial cloud providers such as Amazon and research grids such as Enabling Grids for E-Science (EGEE) are playing an increasingly important role in computing, especially for big data workloads. However, with hundreds of cloud and grid servers and millions of files, it is a challenge for users to find an appropriate piece of software in the cloud or grid efficiently and effectively. Researchers are looking for ways to meet this challenge.
Minersoft, a search engine for software packages distributed across computing clouds and grids, is one such attempt. Minersoft consists of crawlers that collect software-related information from the clouds, indexers that build inverted indices for search, data storage that holds all information and data needed for search, a query manager that accepts and processes queries and returns the results to the user, and a job manager that coordinates the work among the different components. With it, users can examine, index, and retrieve on request software and related documents in various forms, including binaries, source code, software libraries, and software description documents.
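To make the indexer and query manager roles concrete, here is a minimal sketch of an inverted index with conjunctive (AND) term lookup. It is not Minersoft's implementation: the function names, the tokenizer, and the sample file paths and descriptions are all hypothetical, chosen only to illustrate the general technique.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids (here, file paths) containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical crawled files and their extracted descriptions.
docs = {
    "/opt/blast/README": "BLAST sequence alignment tool",
    "/usr/lib/libfftw3.so": "FFTW fast Fourier transform library",
    "/opt/clustalw/doc.txt": "ClustalW multiple sequence alignment",
}
index = build_inverted_index(docs)
matches = search(index, "sequence alignment")
```

In this sketch the crawler's output is reduced to a dictionary of path-to-text pairs; a real system would also rank the matches rather than return an unordered set.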
One of the interesting concepts used by the authors is that of a software graph, which is similar to a map of a file system starting from a root. Each of the leaves is a file, and each node in the tree (interior or leaf) contains metadata that helps identify or categorize the node (file).
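The tree-with-metadata idea can be sketched as follows. This is an illustration based only on the description above, not the authors' data structure: the `Node` class, the `kind` metadata key, and the sample paths are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the tree: a directory (interior) or a file (leaf)."""
    name: str
    metadata: dict = field(default_factory=dict)
    children: dict = field(default_factory=dict)

    def is_leaf(self):
        return not self.children


def build_tree(files):
    """Build a tree rooted at '/' from a mapping of file path -> metadata."""
    root = Node("/", {"kind": "directory"})
    for path, meta in files.items():
        node = root
        for part in (p for p in path.split("/") if p):
            if part not in node.children:
                # Intermediate nodes default to directories.
                node.children[part] = Node(part, {"kind": "directory"})
            node = node.children[part]
        # The last path component is the file itself; attach its metadata.
        node.metadata = dict(meta)
    return root


def iter_files(node, path=""):
    """Yield (path, metadata) for every leaf, in sorted order."""
    for name, child in sorted(node.children.items()):
        child_path = f"{path}/{name}"
        if child.is_leaf():
            yield child_path, child.metadata
        else:
            yield from iter_files(child, child_path)


# Hypothetical files with a hypothetical 'kind' category.
files = {
    "/usr/bin/gcc": {"kind": "binary"},
    "/usr/lib/libm.so": {"kind": "library"},
    "/usr/share/doc/gcc/README": {"kind": "documentation"},
}
tree = build_tree(files)
listing = list(iter_files(tree))
```

Keeping metadata on interior nodes as well as leaves, as the review describes, lets a search filter by category at any level of the tree.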
The authors conducted two types of experiments to evaluate Minersoft. The first examines system performance, measuring the number of files the system can index and the time needed to index them. On a grid, the crawler, which is written in Python, reads 100,000 files in five to 30 minutes on average, and the indexer takes 15 to 65 minutes per 100,000 files. On clouds, the measurements are of the same order but slower. The second type of experiment concerns the correctness of query answers. The authors used several measures, including Precision@10, mean reciprocal rank (MRR), and normalized discounted cumulative gain (NDCG). All three measures show that the system performs very well.
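For readers unfamiliar with the three retrieval measures, the sketch below uses their common textbook definitions; the exact variants used in the paper may differ. Each ranked result list is represented by relevance grades in rank order (0 means not relevant), and the ideal ranking for NDCG is taken from the retrieved results only, which is a simplifying assumption.

```python
import math


def precision_at_k(relevances, k=10):
    """Fraction of the top-k results that are relevant (grade > 0)."""
    return sum(1 for r in relevances[:k] if r > 0) / k


def reciprocal_rank(relevances):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, r in enumerate(relevances, start=1):
        if r > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(result_lists):
    """Average reciprocal rank over a set of queries."""
    return sum(reciprocal_rank(r) for r in result_lists) / len(result_lists)


def dcg(relevances, k):
    """Discounted cumulative gain: grades discounted by log2(rank + 1)."""
    return sum(
        r / math.log2(rank + 1)
        for rank, r in enumerate(relevances[:k], start=1)
    )


def ndcg(relevances, k=10):
    """DCG divided by the DCG of the same grades in ideal (descending) order."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Two relevant results among the top ten.
p10 = precision_at_k([1, 0, 1, 0, 0, 0, 0, 0, 0, 0])
# First relevant hit at rank 1 for one query and rank 2 for the other.
mrr = mean_reciprocal_rank([[1, 0], [0, 1]])
```

All three measures lie between 0 and 1, with higher values indicating that relevant results appear more often, or earlier, in the ranking.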
The system is very interesting and will be useful to cloud and grid communities. Tools like these can help users locate and identify pieces of software that match their needs.