Computing Reviews
Harvesting relational tables from lists on the Web
Elmeleegy H., Madhavan J., Halevy A. The VLDB Journal: The International Journal on Very Large Data Bases 20(2): 209-226, 2011. Type: Article
Date Reviewed: Oct 27 2011

Extracting structured information from the Web remains a challenge. Information extraction from Hypertext Markup Language (HTML) pages, and wrapper generation in particular, are research fields that pursue this goal.

This paper presents ListExtract, a technique for obtaining tables from lists in HTML pages. Because lists are common in HTML pages, the problem is both timely and interesting. Besides motivating this research, the authors present some potential applications, including question answering, integration for the relational Web, deep Web crawling, and table extraction Web services. The main goal of the technique is to create a set of relational tables from the information extracted from Web lists. Its main features are that it is domain independent and unsupervised.

The paper is well structured. First, a motivating example helps readers understand the problem and justifies the need for a new technique to solve it. Next, an overview of the technique presents its main steps and characteristics, and each step is then explained in detail. The implementation follows. Experiments complete the presentation of the technique, and related work and conclusions close the paper.

ListExtract focuses on Web lists, a point the authors emphasize by explaining how this problem differs from related problems such as wrapper generation. The technique consists of three main steps: splitting, alignment, and refinement. Each step is subdivided into several tasks, which are carefully detailed and formalized, and the quality parameters and algorithms used are specified. A table corpus and a language model are used to identify the segments within each list line. ListExtract also reuses knowledge from other areas, such as dynamic programming, multiple sequence alignment (MSA), and, of course, information extraction. A valuable aspect of this paper is its careful account of how each approach used relates to other fields and techniques. Overall, ListExtract is a useful, well-structured technique. The target audience of this scientific paper includes scientists and technicians working in information extraction from the Web.
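
To give readers unfamiliar with sequence alignment a feel for the alignment step, the following is a minimal sketch, not the authors' algorithm. It assumes two list lines have already been split into fields, and it uses a standard pairwise global alignment (Needleman-Wunsch dynamic programming) to place those fields in common columns, padding the shorter row with empty cells. The function names (`field_type`, `match_score`, `align_rows`), the numeric/text type test, and the +1/-1 scores are illustrative assumptions; the paper works with multiple sequence alignment and its own scoring.

```python
from typing import List, Optional, Tuple


def field_type(field: str) -> str:
    """Crude type signature (an assumption for this sketch): 'num' or 'text'."""
    return "num" if field.replace(",", "").isdigit() else "text"


def match_score(a: str, b: str) -> int:
    """Reward fields of the same crude type, penalize mismatched types."""
    return 1 if field_type(a) == field_type(b) else -1


def align_rows(
    row_a: List[str], row_b: List[str], gap: int = -1
) -> Tuple[List[Optional[str]], List[Optional[str]]]:
    """Globally align two field sequences; None marks an empty cell."""
    n, m = len(row_a), len(row_b)

    # score[i][j] = best score for aligning row_a[:i] with row_b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + match_score(row_a[i - 1], row_b[j - 1]),
                score[i - 1][j] + gap,  # row_a field faces an empty cell
                score[i][j - 1] + gap,  # row_b field faces an empty cell
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a: List[Optional[str]] = []
    out_b: List[Optional[str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (
            i > 0
            and j > 0
            and score[i][j]
            == score[i - 1][j - 1] + match_score(row_a[i - 1], row_b[j - 1])
        ):
            out_a.append(row_a[i - 1])
            out_b.append(row_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(row_a[i - 1])
            out_b.append(None)
            i -= 1
        else:
            out_a.append(None)
            out_b.append(row_b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1]


# Hypothetical list lines, already split into fields.
aligned_a, aligned_b = align_rows(["Oslo", "709000", "Norway"], ["Lisbon", "Portugal"])
print(aligned_a)  # ['Oslo', '709000', 'Norway']
print(aligned_b)  # ['Lisbon', None, 'Portugal']
```

In this toy example the second line is missing its numeric field, so the alignment leaves an empty cell in the middle column instead of shifting "Portugal" under the number. When two alignments tie on score, the traceback simply prefers pairing fields over inserting empty cells, which is a simplification.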

Reviewer: Mercedes Martínez González
Review #: CR139528 (1203-0303)
 
Relational Databases (H.2.4 ...)
World Wide Web (WWW) (H.3.4 ...)
Information Search And Retrieval (H.3.3)
 
Other reviews under "Relational Databases":
A sound and sometimes complete query evaluation algorithm for relational databases with null values
Reiter R. Journal of the ACM 33(2): 349-370, 1986. Type: Article
Nov 1 1986
Sort sets in the relational model
Ginsburg S., Hull R. Journal of the ACM 33(3): 465-488, 1986. Type: Article
Nov 1 1986
Foundation for object/relational databases
Date C., Darwen H., Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, 1998. Type: Book (9780201309782)
Nov 1 1998