Computing Reviews, the leading online review service for computing literature.

Harvesting relational tables from lists on the Web
Elmeleegy H., Madhavan J., Halevy A. The VLDB Journal: The International Journal on Very Large Data Bases 20(2): 209-226, 2011. Type: Article

Date Reviewed: Oct 27 2011

Extracting structured information from the Web is a challenge. Information extraction from Hypertext Markup Language (HTML) pages, and wrapper generation in particular, are research fields that pursue this goal. This paper presents ListExtract, a technique for obtaining tables from lists in HTML pages. Because lists are frequent in HTML pages, the problem is both current and interesting. Besides motivating the research, the authors present some potential applications, including question answering, integration for the relational Web, deep Web crawling, and table extraction Web services. The main goal of the technique is to create a set of relational tables from the information extracted from Web lists. Its main features are that it is domain independent and unsupervised.

The paper is well structured. First, a motivating example helps readers understand the problem and justifies the need for a new technique to solve it. Next, an overview of the technique presents its main steps and characteristics, and each step is then explained in detail. The implementation is presented next. Experiments close the presentation of the technique, and related work and conclusions close the paper.

ListExtract focuses on Web lists, a point the authors emphasize by explaining how this problem differs from problems such as wrapper generation. ListExtract consists of three main steps: splitting, alignment, and refinement. Each step is subdivided into several tasks, which are carefully detailed and formalized. ListExtract is a useful, well-structured technique. The quality parameters and algorithms used are specified. It uses a table corpus and a language model to identify segments in lines. ListExtract reuses knowledge from other areas, such as dynamic programming, multiple sequence alignment (MSA), and, of course, information extraction.

A valuable aspect of this paper is the careful specification of the relation between each approach used and other fields and techniques. The target audience of this scientific paper includes scientists and technicians working in information extraction from the Web.
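
To give a feel for what the splitting and alignment steps involve, here is a toy sketch in Python. It is not the authors' implementation, and every name in it (`split_line`, `field_type`, `align`, `lists_to_table`) is hypothetical. It assumes that a plain comma splitter can stand in for the corpus- and language-model-based splitting, that the most common field count gives the number of columns, and that a coarse "num"/"text" type is enough to score a field against a column. The refinement step is left out. Short rows are aligned to the columns with a small dynamic program that leaves unmatched columns empty.

```python
import re
from collections import Counter


def split_line(line):
    # Naive splitter (assumption): break on commas and drop empty pieces.
    return [part.strip() for part in line.split(",") if part.strip()]


def field_type(field):
    # Coarse typing (assumption): all-digit fields are "num", the rest "text".
    return "num" if re.fullmatch(r"\d+", field) else "text"


def align(row, column_types):
    """Place the fields of `row`, in order, into len(column_types) columns.

    Unused columns are filled with None. The placement maximizes the number
    of type matches (+1 per match, -1 per mismatch).
    """
    n, m = len(row), len(column_types)
    if n > m:
        raise ValueError("rows wider than the table are not handled here")

    def score(i, j):
        return 1 if field_type(row[i]) == column_types[j] else -1

    neg_inf = float("-inf")
    # best[i][j]: best score placing the first i fields in the first j columns.
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        best[0][j] = 0  # no fields placed yet, columns so far left empty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            skip_column = best[i][j - 1]
            use_column = best[i - 1][j - 1] + score(i - 1, j - 1)
            best[i][j] = max(skip_column, use_column)

    # Trace back from the full problem to recover the placement.
    aligned = [None] * m
    i, j = n, m
    while j > 0:
        if i > 0 and best[i][j] == best[i - 1][j - 1] + score(i - 1, j - 1):
            aligned[j - 1] = row[i - 1]
            i -= 1
        j -= 1
    return aligned


def lists_to_table(lines):
    rows = [split_line(line) for line in lines]
    # Assumption: the most common field count is the number of columns.
    width = Counter(len(r) for r in rows).most_common(1)[0][0]
    full_rows = [r for r in rows if len(r) == width]
    # Each column takes the majority type seen in the full-width rows.
    column_types = [
        Counter(field_type(r[j]) for r in full_rows).most_common(1)[0][0]
        for j in range(width)
    ]
    return [align(r, column_types) for r in rows]


if __name__ == "__main__":
    sample = [
        "Casablanca, 1942, Michael Curtiz",
        "Vertigo, 1958, Alfred Hitchcock",
        "Metropolis, Fritz Lang",
        "Psycho, 1960, Alfred Hitchcock",
    ]
    for table_row in lists_to_table(sample):
        print(table_row)
```

In this made-up sample, the third line has no year, so the sketch leaves the middle column empty for that row rather than shifting the director's name into it.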

Reviewer: Mercedes Martínez González	Review #: CR139528 (1203-0303)

Relational Databases (H.2.4 ... )

World Wide Web (WWW) (H.3.4 ... )

Information Search And Retrieval (H.3.3 )

Other reviews under "Relational Databases":	Date

A sound and sometimes complete query evaluation algorithm for relational databases with null values Reiter R. Journal of the ACM 33(2): 349-370, 1986. Type: Article	Nov 1 1986

Sort sets in the relational model Ginsburg S., Hull R. Journal of the ACM 33(3): 465-488, 1986. Type: Article	Nov 1 1986

Foundation for object/relational databases Date C., Darwen H., Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, 1998. Type: Book (9780201309782)	Nov 1 1998

Reproduction in whole or in part without permission is prohibited. Copyright 1999-2024 ThinkLoud®