Computing Reviews
Path-based methods on categorical structures for conceptual representation of Wikipedia articles
Kucharczyk Ł., Szymański J. Journal of Intelligent Information Systems 48(2): 309-327, 2017. Type: Article
Date Reviewed: Nov 3 2017

The task of determining whether two documents cover similar topics arises in many different contexts. It requires solving two subtasks: extracting significant and useful features from each document, and applying an appropriate (dis)similarity measure to those features. For the former subtask, the bag-of-words (BoW) method, which represents each document as a normalized vector of weighted term frequencies [1], remains quite popular in spite of a number of drawbacks. Various extensions have been considered in the literature to overcome these drawbacks. In particular, the Wikipedia category graph (WCG) has frequently been exploited as a source of additional text features. Because the WCG is based on a basic form of conceptual and hierarchical classification, it is naturally expected to have some advantages over standard BoW features. This motivated the authors to exploit the lexical knowledge of Wikipedia through the WCG. Their main focus is extracting significant relations between categories through appropriate similarity measures. The main contribution, as the authors themselves claim, is thus the introduction of new WCG-based similarity measures.
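To make the BoW baseline described above concrete, the following is a minimal sketch, not the code used in the paper. It assumes whitespace tokenization and raw term counts as the weights (a TF-IDF weighting would be a common alternative), and the function names `bow_vector` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def bow_vector(text):
    """Return an L2-normalized term-frequency vector as a dict.

    Assumption: lowercase whitespace tokenization and raw counts as
    weights; real systems usually add stop-word removal and TF-IDF.
    """
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {term: c / norm for term, c in counts.items()}


def cosine_similarity(u, v):
    """Dot product of two already-normalized sparse vectors."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


doc_a = bow_vector("wikipedia category graph for document clustering")
doc_b = bow_vector("document clustering with a category graph")
doc_c = bow_vector("early printed books")

print(cosine_similarity(doc_a, doc_b))  # shares several terms
print(cosine_similarity(doc_a, doc_c))  # shares no terms
```

Because the vectors are normalized when they are built, the cosine reduces to a plain dot product. Two documents with no terms in common score zero regardless of how related their topics are, which is one of the drawbacks that category-based features aim to address.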

The idea of using the WCG to compute semantic relatedness is not new; it has been used successfully before [2]. In this paper, the authors apply the idea and propose a clustering method that uses Wikipedia categories as an alternative document representation. They further propose three path-based relatedness measures in the conceptual space formed from the WCG, and they implement a clustering search engine to evaluate the proposed methods empirically.
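The review does not spell out the three measures, so the following is only a generic illustration of what a path-based relatedness measure on a category graph can look like, not the authors' formulation. It assumes a tiny hand-made graph (`TOY_WCG`, hypothetical), treats edges as undirected, and scores two categories as 1 / (1 + shortest path length).

```python
from collections import deque

# Hypothetical toy fragment of a category graph: parent -> children.
TOY_WCG = {
    "Science": ["Computer science", "Mathematics"],
    "Computer science": ["Algorithms", "Machine learning"],
    "Mathematics": ["Graph theory"],
}


def build_undirected(graph):
    """Turn the parent -> children mapping into an undirected adjacency map."""
    adj = {}
    for parent, children in graph.items():
        for child in children:
            adj.setdefault(parent, set()).add(child)
            adj.setdefault(child, set()).add(parent)
    return adj


def path_length(adj, src, dst):
    """Shortest number of edges between two categories (BFS), or None."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor == dst:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def relatedness(adj, a, b):
    """Assumed measure: 1 / (1 + shortest path length); 0 if disconnected."""
    dist = path_length(adj, a, b)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)


adj = build_undirected(TOY_WCG)
print(relatedness(adj, "Algorithms", "Machine learning"))  # siblings
print(relatedness(adj, "Algorithms", "Graph theory"))      # farther apart
```

Sibling categories end up closer than categories that meet only at a distant ancestor, which is the intuition behind using the hierarchy rather than term overlap alone.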

For the experiments, the authors use documents that are tagged a priori with Wikipedia categories and employ the OPTICS and K-means clustering algorithms for evaluation. The ultimate goal, however, must be to use a large-scale multilabel text classifier as a first stage that automatically tags raw text with Wikipedia categories; the authors leave this for future work.

Reviewer:  M. Sohel Rahman Review #: CR145636 (1801-0027)
1) Manning, C. D.; Raghavan, P.; Schütze, H. An introduction to information retrieval. Cambridge University Press, New York, NY, 2009.
2) Zesch, T.; Gurevych, I. Analysis of the Wikipedia category graph for NLP applications. In Proceedings of the TextGraphs-2 Workshop (NAACL-HLT). Association for Computational Linguistics, New Brunswick, NJ, 2007, 1–8.
Document And Text Processing (I.7)
Information Search And Retrieval (H.3.3)
Information Storage And Retrieval (H.3)