Computing Reviews, the leading online review service for computing literature.

Semantic clustering of XML documents
Tagarelli A., Greco S. ACM Transactions on Information Systems 28(1): 1-56, 2010. Type: Article

Date Reviewed: May 28 2010

With the advent of Extensible Markup Language (XML) and its wide adoption in applications, extracting data from semi-structured documents to support data analysis has become an attractive research direction. The structure present in such documents makes it possible to design sophisticated approaches to data management and knowledge discovery that take both content and structural semantics into consideration.

In this paper, Tagarelli and Greco propose a framework, along with algorithms, for clustering semantically related semi-structured documents based on commonalities in their structure and content. First, they apply structure analysis to the XML documents to disambiguate tag names, selecting the most appropriate sense for each one. Following this, they analyze the documents based on their content similarity, using techniques that consider both syntactic and semantic term relevance. An important characteristic of the proposed approach is a novel representation scheme that maps XML document trees into transactions whose items carry both structure and content characteristics. The authors employ a transactional clustering algorithm that quantifies similarity by taking the semantics of the data into consideration. The resulting clusters of transactions then yield a classification of the XML documents for the end user.

The authors demonstrate the effectiveness of the proposed approach through experiments on real-world data, comparing it against state-of-the-art algorithms for clustering XML documents.

Overall, this is interesting work. The paper is well structured, motivated, and presented, and the experimental results look promising. For these reasons, researchers in the field will benefit from reading it.
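To make the idea of mapping XML trees into transactions more concrete, the following is a minimal sketch in Python. It is not the authors' representation or algorithm: the function names `tree_to_transaction` and `jaccard`, the choice of (tag path, term) pairs as items, and the two sample documents are all assumptions made for illustration. Plain Jaccard overlap stands in for the semantics-aware similarity described in the paper, and no tag-sense disambiguation is attempted.

```python
import xml.etree.ElementTree as ET


def tree_to_transaction(xml_text):
    """Flatten an XML document into a set of (tag path, term) items.

    Hypothetical helper: each item pairs a structural feature (the path
    of tags from the root) with a content feature (a lowercased term).
    """
    root = ET.fromstring(xml_text)
    items = set()

    def walk(node, path):
        current = path + "/" + node.tag
        # Terms found directly inside this element become items.
        for term in (node.text or "").lower().split():
            items.add((current, term))
        for child in node:
            walk(child, current)

    walk(root, "")
    return items


def jaccard(a, b):
    """Plain set overlap, used here as a stand-in similarity measure."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two invented sample documents that share structure and some content.
doc_a = "<paper><title>XML clustering</title><author>Greco</author></paper>"
doc_b = "<paper><title>XML mining</title><author>Greco</author></paper>"

t_a = tree_to_transaction(doc_a)
t_b = tree_to_transaction(doc_b)
similarity = jaccard(t_a, t_b)
print(sorted(t_a))
print(similarity)
```

In this sketch, two documents count as similar only when they share both a tag path and a term, which is one simple way for a single item to carry structure and content together. A transactional clustering algorithm would then group the resulting item sets rather than the original trees.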

Reviewer: Aris Gkoulalas-Divanis	Review #: CR138046 (1010-1045)

Textual Databases (H.2.4 ... )

Clustering (H.3.3 ... )

Markup Languages (I.7.2 ... )

Document Preparation (I.7.2)

Information Search And Retrieval (H.3.3)

Reproduction in whole or in part without permission is prohibited. Copyright 1999-2024 ThinkLoud®