Web agents need to sift through an unwieldy amount of human-oriented data on the Web, which is not an easy task. Ye and Chua report on a method to infer object models and extract data from semistructured, tabulated Web pages. Their approach is unsupervised: it clusters the data according to a kernel-based metric that measures how frequently the data changes across a set of related Web pages. Intuitively, clusters that rarely change are likely to be menus or banners, whereas clusters that change frequently are likely to be data. This idea is not new (it lies at the heart of OntoMiner [1]), but implementing it with a kernel-based method seems to be new.
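The intuition that stable regions are boilerplate and changing regions are data can be illustrated with a minimal sketch. It is not the authors' algorithm: a plain threshold on the fraction of distinct values stands in for their kernel-based metric and clustering, and the page representation (a dictionary from a region key to its text), the function names, the threshold of 0.5, and the sample pages are all hypothetical.

```python
from collections import defaultdict


def change_rates(pages):
    """Return, for each region key, how much its text varies across pages.

    `pages` is a list of dicts {region_key: text}. The region keys are a
    hypothetical stand-in for whatever page segmentation is used.
    """
    values = defaultdict(list)
    for page in pages:
        for key, text in page.items():
            values[key].append(text)

    rates = {}
    for key, texts in values.items():
        if len(texts) < 2:
            # A single observation carries no evidence of change.
            rates[key] = 0.0
        else:
            # 0.0 when every page shows the same text, 1.0 when all differ.
            rates[key] = (len(set(texts)) - 1) / (len(texts) - 1)
    return rates


def split_regions(pages, threshold=0.5):
    """Split region keys into (static, dynamic) by an assumed threshold."""
    rates = change_rates(pages)
    static = sorted(k for k, r in rates.items() if r < threshold)
    dynamic = sorted(k for k, r in rates.items() if r >= threshold)
    return static, dynamic


# Hypothetical sample: three product pages from the same site.
pages = [
    {"nav": "Home | Products", "banner": "ACME Store",
     "title": "Red kettle", "price": "$20"},
    {"nav": "Home | Products", "banner": "ACME Store",
     "title": "Blue mug", "price": "$5"},
    {"nav": "Home | Products", "banner": "ACME Store",
     "title": "Green pan", "price": "$20"},
]

static, dynamic = split_regions(pages)
print("static:", static)
print("dynamic:", dynamic)
```

In this sample the navigation bar and banner are identical on every page and are treated as boilerplate, whereas the title and price vary and are treated as data. A real system would also have to segment the pages and align regions across them, which the sketch takes as given.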
The authors conducted experiments on several commercial Web sites, and the results indicate that their technique is effective. Unfortunately, the related work section is poor. First, the authors simply enumerate several proposals without sufficient theoretical or empirical comparison; for example, it is not clear whether OntoMiner’s hierarchical partitioning algorithm outperforms kernel-based methods. Second, there is no reference to Crescenzi and Mecca’s recent proposal on wrapper induction [2]. Finally, key concepts such as “token” or “shared attribute” are neither explained nor illustrated in the paper. These shortcomings make it difficult to reproduce the results and to assess them from a comparative point of view.