Computing Reviews
Unsupervised information extraction by text segmentation
Cortez E., da Silva A., Springer Publishing Company, Incorporated, New York, NY, 2013. 124 pp. Type: Book (978-3-319025-96-4)
Date Reviewed: Apr 22 2014

The typical result of a web search is still a ranked list of documents that may contain the information you’re looking for; it is up to the user to extract that information. For search engines to answer a query, instead of just pointing to documents that may contain an answer, they need to perform information extraction on a wide variety of documents, determine the correct (or most plausible) answer, and then present it to the user. IBM’s Watson system demonstrates that this is indeed feasible, and Google’s Knowledge Graph feature is a move in a similar direction.

Cortez and da Silva discuss the problem of information extraction in a specific context. The methods are unsupervised (they do not rely on examples of correct and incorrect attempts), and they perform text segmentation to identify structural and content-based features. These features allow text snippets to be converted into pieces of information, which can then be assembled into larger information entities with an internal structure that the computer system can process further. On a web page of an online store, for example, one will find product information, contact information for the vendor, reviews by customers, and advertisements. Humans use a mix of structural and content-based hints to interpret this jumble of text and images: even if a vendor’s address is not explicitly labeled as such, we know from experience that text snippets containing a number, a street name, a city name, a state abbreviation, and a five-digit number most likely constitute an address. Picking out numbers, especially with a specific digit count, and two-letter state abbreviations is of course no problem for a computer. Determining whether a word or phrase constitutes a street or city name, while usually straightforward for humans, is much harder for computers. There are strong hints, though: street and city names are usually capitalized, and street names often contain labels like “Street,” “Lane,” “Way,” or abbreviations thereof. In the case of cities, it may also be practical to compare candidates against a reference list, especially if state information is also available.
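The address hints described above can be made concrete with a small sketch. This is only an illustration of combining content-based hints, not a method taken from the book: the function names, the tiny reference sets, and the threshold of four hints are all assumptions chosen for the example.

```python
import re

# Hypothetical, deliberately tiny reference data (assumption for illustration);
# a practical system would need far larger lists.
STREET_LABELS = {"street", "st", "str", "lane", "ln", "way", "avenue", "ave", "road", "rd"}
STATE_ABBREVIATIONS = {"CA", "NY", "TX", "WA", "OR"}


def address_hints(snippet: str) -> dict:
    """Return which of the address hints are present in a text snippet."""
    tokens = re.findall(r"[A-Za-z]+|\d+", snippet)
    words = [t for t in tokens if t.isalpha()]
    return {
        # A number at the very start of the snippet (house number).
        "house_number": bool(re.match(r"\s*\d+\b", snippet)),
        # A street label such as "Street", "Lane", "Way", or an abbreviation.
        "street_label": any(w.lower() in STREET_LABELS for w in words),
        # A capitalized word that is not itself a street label (candidate name).
        "capitalized_word": any(
            w[0].isupper() and w.lower() not in STREET_LABELS for w in words
        ),
        # A two-letter state abbreviation from the reference set.
        "state": any(w in STATE_ABBREVIATIONS for w in words),
        # A standalone five-digit number.
        "five_digit_number": bool(re.search(r"\b\d{5}\b", snippet)),
    }


def looks_like_address(snippet: str, threshold: int = 4) -> bool:
    """Flag a snippet as a likely address if enough hints agree (assumed threshold)."""
    return sum(address_hints(snippet).values()) >= threshold
```

With these assumptions, `looks_like_address("1 Grand Avenue, San Luis Obispo, CA 93407")` is true because all five hints fire, while a snippet such as `"Great product, arrived in 2 days"` triggers only the capitalized-word hint and is rejected.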

In the first three chapters, the authors give a brief introduction to information extraction by text segmentation; discuss related work such as web extraction methods and tools and probabilistic graph-based methods; and examine ways of exploiting preexisting datasets to support information extraction by text segmentation. These three chapters provide the foundations for the next three, which contain three previously published papers on specific systems that implement variations of such information extraction approaches.

The first system, on-demand unsupervised information extraction (ONDUX), uses some information from preexisting data, but also relies on content-based features to bootstrap the learning of structure-based features. The presence of the string “Str.” in a text segment that may be an address, for example, is a good indicator that, together with one or a small number of preceding strings, it constitutes a street name. According to the authors, this mutual reinforcement between content-based and structural features leads to consistent and sometimes significant improvements over a baseline system, unsupervised conditional random fields (U-CRF), in both F-measure (a combination of precision and recall) and execution time.
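For readers unfamiliar with the F-measure, the following minimal sketch shows how it combines precision and recall for a set of extracted (label, value) pairs. It assumes the balanced variant (beta = 1) and exact matching of pairs; the review does not say which variant or matching criterion the authors use, and the function name and sample data are hypothetical.

```python
def f_measure(predicted, gold, beta=1.0):
    """Weighted harmonic mean of precision and recall over extracted items.

    beta=1.0 gives the balanced F1 score (an assumption; other weightings exist).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # share of extractions that are correct
    recall = true_positives / len(gold)          # share of correct items that were found
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical example: three of four extracted fields match the reference.
gold = {("street", "Main Str."), ("city", "Springfield"), ("state", "IL"), ("zip", "12345")}
predicted = {("street", "Main Str."), ("city", "Springfield"), ("state", "IN"), ("zip", "12345")}
score = f_measure(predicted, gold)
```

Here precision and recall are both 3/4, so the balanced F-measure is also 0.75.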

The second system, joint unsupervised structure discovery and information extraction (JUDIE), is intended for situations where no useful prior domain information is available, but the source documents contain a fairly large number of records with a similar structure. For simple structures, like a telephone book, this task is fairly easy, but for more complex ones, like bibliographic citations or cooking recipes, the variation across records can make it very challenging. JUDIE relies on a structure discovery method that groups labels into specific records by identifying frequent patterns in the sequences of labels. The performance of JUDIE is compared against ONDUX and U-CRF. While JUDIE outperforms U-CRF for less-regular datasets, its overall performance is very close to that of ONDUX, with advantages in some cases where its structure discovery capability is beneficial.
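The general idea of finding frequent patterns in a label sequence and using them to group labels into records can be sketched as follows. This is a simplified stand-in based on counting contiguous label n-grams, not JUDIE's actual structure discovery algorithm; the function names, the fixed pattern length, and the rule of starting a record at the first label of the most frequent pattern are all assumptions made for illustration.

```python
from collections import Counter


def frequent_label_patterns(labels, length, min_count=2):
    """Count contiguous label sequences of a given length, most frequent first."""
    windows = (tuple(labels[i:i + length]) for i in range(len(labels) - length + 1))
    counts = Counter(windows)
    return [(p, c) for p, c in counts.most_common() if c >= min_count]


def split_records(labels, pattern_length=2):
    """Start a new record at each occurrence of the label that begins
    the most frequent pattern (a simplifying assumption)."""
    patterns = frequent_label_patterns(labels, pattern_length)
    if not patterns:
        return [list(labels)]
    start_label = patterns[0][0][0]
    records = []
    for label in labels:
        if label == start_label or not records:
            records.append([])
        records[-1].append(label)
    return records


# Hypothetical label sequence from citation-like text; the second record lacks a year.
labels = ["author", "title", "year", "author", "title", "author", "title", "year"]
records = split_records(labels)
```

In this example the pair ("author", "title") occurs three times, so each "author" label opens a new record, which yields three records even though one of them has no year. Real data with reordered or repeated fields would need a considerably more robust approach.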

Overall, this book is most beneficial to researchers familiar with information extraction. The first three chapters can serve as an introduction, but the overall scope of the book is too narrow to serve as an overview of the field. The book and the individual chapters are well organized and reasonably easy to follow, with just enough technical details to understand the functioning of the methods and systems described. While the systems described appear to have advantages over similar ones, I don’t have a clear sense of how they compare against a broader class of related systems, such as the open information extraction from the web approach championed by Etzioni et al. [1]. This is understandable, however, since a well-founded comparison of a larger set of such approaches is a significant undertaking, not to mention the difficulty of getting access to systems that may be proprietary or difficult to install in other locations.

Reviewer: Franz Kurfess
Review #: CR142207 (1407-0522)
1) Etzioni, O.; Banko, M.; Soderland, S.; Weld, D. S. Open information extraction from the web. Communications of the ACM 51, (2008), 68–74.
Text Analysis (I.2.7 ...)
Learning (I.2.6)
Segmentation (I.4.6)