It is a good idea for all Internet users to be aware of the most common approaches to ranking Web pages, databases, and digital documents.
The most popular ranking algorithm is Google’s PageRank, which scores a Web page by the number and importance of its incoming links; the model corresponds to a random walk made by a user who repeatedly clicks outgoing links at random, eventually reaching the information being searched for. Another such algorithm is Kleinberg’s hypertext-induced topic selection (HITS), which establishes a mutually reinforcing relationship between pages that link to many other pages (hubs) and pages that many other pages link to (authorities). Starting from the assumption that “hub pages link to many good pages,” the two scores are recomputed from each other iteratively until they stabilize, so that the most frequently pointed-to pages (authority pages) emerge as the “good pages,” as the authors call them.
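The mutual reinforcement at the heart of HITS can be sketched in a few lines. The following is a minimal illustration on a toy link graph, not the authors’ implementation; the graph, the node names, and the fixed iteration count are all assumptions made for the example:

```python
def hits(graph, iterations=50):
    """Minimal HITS sketch. graph maps each page to the pages it links to."""
    pages = set(graph) | {p for links in graph.values() for p in links}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: sum(hub[q] for q in graph if p in graph[q]) for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# Toy graph: three hub pages pointing at two candidate authorities.
graph = {"h1": ["a", "b"], "h2": ["a", "b"], "h3": ["a"]}
hub, auth = hits(graph)
# Page "a" is linked by the most hubs, so it receives the top authority score.
```

Each round simply feeds one score into the other and renormalizes, which is why heavily interlinked but off-topic pages can dominate both lists, the topic drift problem the reviewed paper addresses.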
Other researchers have previously worked on this algorithm. For example, Li et al. [1] proposed an improvement to a HITS-based algorithm for Web documents by combining the weighting of the incoming links with relevance methods, such as the vector space model, Okapi similarity, cover density ranking, and three-level scoring. Although the weighting alone increased the ranking precision significantly, none of the combinations with the methods mentioned above produced a dramatic improvement.
Similarly, the authors’ experiment aims to achieve better HITS performance, specifically by tackling the topic drift problem. In this problem, the top-ranking pages (both authorities and hubs) are not necessarily those most relevant to the query, but are often simply the ones most commonly retrieved by search engines such as Google or Yahoo! Thus, the authors follow Chakrabarti’s method, which determines information relevance (authority) based on the query matching any of the 50 words surrounding the anchor link. Their contribution is the extraction of text fragments that share some semantic properties with the anchor link. Evaluators were invited to assess the semantics of 50 official and 50 personal target pages from the Open Directory (http://dmoz.org), located in paragraphs, tables, lists, and DIV objects. Thirteen pooling methods were used to compare ten hub and ten authority pages. Although most of the results achieved over 90 percent accuracy, ten queries is hardly a sufficiently large sample, particularly when the measures are based on the judgment of the authors’ students, acting as evaluators.
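The windowing idea borrowed from Chakrabarti can be illustrated with a short sketch. The 50-word radius comes from the description above; the whitespace tokenization and the `window_around_anchor` helper are illustrative assumptions, not the paper’s actual extraction code:

```python
def window_around_anchor(text, anchor, radius=50):
    """Return up to `radius` words on each side of the anchor text,
    a rough stand-in for the 50-word context window used to judge
    whether a link's surroundings match the query."""
    words = text.split()
    anchor_words = anchor.split()
    # Scan for the anchor phrase and slice the surrounding window.
    for i in range(len(words) - len(anchor_words) + 1):
        if words[i:i + len(anchor_words)] == anchor_words:
            start = max(0, i - radius)
            end = min(len(words), i + len(anchor_words) + radius)
            return " ".join(words[start:end])
    return ""  # anchor not found in the page text
```

A query would then be matched against the returned window rather than against the whole page, which is what lets the method discount links whose surrounding text is off-topic.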
Despite some sentence repetitions and an imprecise description of the raw data usage, the paper is quite well written and interesting, especially for students and for developers of small text-based search engines. Most of the references provided date from the mid- to late 1990s, so the work is not really novel; however, the topic itself is interesting. The experiment is presented in detail, which can serve as a good example to other researchers, particularly those working in this specific field.