Computing Reviews

A collaborative approach to build evaluated Web page datasets
Barros R., Rodrigues Nt J., Xexéo G., de Souza J. Future Generation Computer Systems 27(1): 119-126, 2011. Type: Article
Date Reviewed: 10/20/11

In their paper, Barros et al. propose a method of collaboratively collecting Web pages of good quality to be used for studying information retrieval algorithms that require large datasets.

The traditional way of collecting Web pages, crawling the Web from a set of seeding uniform resource locators (URLs), is expensive; it also makes it difficult to preserve the quality of the collected pages. The authors’ proposed method uses a filtering process to weed out low-quality pages in a given context (topic). The filtering process judges the quality of a Web page along three dimensions: completeness, reputation, and timeliness. Quality is determined by examining metadata, such as dates, the number of back-links and forward-links, the page’s “authority and hub scores,” and other factors. The filtering process works as follows. First, Web pages are collected. These pages are then fed through an automatic six-step evaluation process: metadata derivation, fuzzification, definition of SQER (single quality evaluation results), definition of CQD, calculation of CQER (composed quality evaluation results), and defuzzification. The results are then evaluated by human coordinators and evaluators. The coordinator collects the scores from various evaluators and makes a binary decision about whether a Web page is relevant to the topic by taking the median of the evaluations.
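The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors’ implementation: the function names, the triangular fuzzification, the metadata thresholds, and the weighted averaging used to stand in for the CQER step are all assumptions for the sake of the example.

```python
# Illustrative sketch only: the fuzzification ramps, thresholds, and the
# weighted average standing in for the composed (CQER) step are assumptions,
# not the paper's exact fuzzy operators.
from statistics import median

def fuzzify(value, low, high):
    """Map a raw metadata value onto a [0, 1] membership degree (linear ramp)."""
    if value <= low:
        return 0.0
    if value >= high:
        return 1.0
    return (value - low) / (high - low)

def evaluate_page(metadata, weights):
    """Derive per-dimension scores (SQER-like), then compose them (CQER-like)."""
    sqer = {
        "completeness": fuzzify(metadata["word_count"], 100, 1000),
        "reputation": fuzzify(metadata["backlinks"], 5, 500),
        # Fewer days since the last update means a more timely page.
        "timeliness": 1.0 - fuzzify(metadata["days_since_update"], 0, 365),
    }
    total = sum(weights.values())
    return sum(weights[d] * sqer[d] for d in sqer) / total

def coordinator_decision(evaluator_scores, threshold=0.5):
    """Binary relevance decision: the median of the evaluators' scores."""
    return median(evaluator_scores) >= threshold

# Hypothetical context weights and page metadata.
weights = {"completeness": 0.5, "reputation": 0.3, "timeliness": 0.2}
page = {"word_count": 800, "backlinks": 120, "days_since_update": 30}
score = evaluate_page(page, weights)           # a value in [0, 1]
decision = coordinator_decision([0.8, 0.4, 0.7])  # median 0.7 >= 0.5 -> True
```

The median (rather than the mean) makes the coordinator’s decision robust to a single outlying evaluator.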

The authors present the results of a small-scale, proof-of-concept study using their approach. The coordinator first decided the context (“economy,” in this case) and assigned the three quality dimensions different degrees of importance. Seeding Web pages were selected, and then a predetermined number of Web pages (500) were crawled. These pages were fed through the automatic evaluation process, and then through manual evaluation. The results show that both recall and precision increased for queries applied to the collected dataset.
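For readers unfamiliar with the two metrics the study reports, a quick sketch of how precision and recall are computed over a labeled dataset; the retrieved and relevant page sets below are made-up examples, not the authors’ data.

```python
# Precision: fraction of retrieved pages that are relevant.
# Recall: fraction of relevant pages that were retrieved.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result against a labeled ground truth.
p, r = precision_recall(retrieved=["p1", "p2", "p3", "p4"],
                        relevant=["p1", "p2", "p5"])
# p == 0.5 (2 of 4 retrieved are relevant), r == 2/3 (2 of 3 relevant found)
```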

Reviewer: Xiannong Meng | Review #: CR139512 (1203-0305)
