ComputingReviews.com

Path-based methods on categorical structures for conceptual representation of Wikipedia articles
Kucharczyk Ł., Szymański J. Journal of Intelligent Information Systems 48(2): 309-327, 2017. Type: Article

Date Reviewed: 11/03/17

The task of determining whether two documents cover similar topics arises in many contexts. It involves two subtasks: extracting significant and useful features from each document, and applying an appropriate (dis)similarity measure to those features. Despite a number of drawbacks, the bag of words (BoW) method, which represents each document as a normalized vector of weighted term frequencies [1], remains popular for the first subtask. Various extensions have been proposed in the literature to overcome these drawbacks. In particular, the Wikipedia category graph (WCG) has frequently been exploited as a source of additional text features. Because the WCG encodes a basic form of conceptual and hierarchical classification, it is naturally expected to offer some advantages over standard BoW features. This motivated the authors to exploit the lexical knowledge of Wikipedia through the WCG. Their main focus is extracting significant relations between categories through appropriate similarity measures. The main contribution, as the authors themselves claim, is the introduction of new WCG-based similarity measures.
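To make the BoW baseline concrete, the following is a minimal sketch of normalized, weighted term-frequency vectors compared with cosine similarity. It is only an illustration of the general technique described in [1], not the representation used in the paper under review; the whitespace tokenizer, the smoothed idf weighting, the function names, and the three sample documents are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalized TF-IDF vector (dict: term -> weight) per document."""
    # Assumption: naive lowercase whitespace tokenization, no stemming or stop words.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumption: smoothed idf, so terms present in every document keep a nonzero weight.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors that are already L2-normalized."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(term, 0.0) for term, w in u.items())


# Hypothetical sample documents, chosen only for illustration.
docs = [
    "graph clustering of wikipedia articles",
    "clustering wikipedia articles with category graph",
    "recipes for baking sourdough bread",
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
```

In this sketch the first two documents share several terms and therefore score higher than the first and third, which share none. This also shows a typical BoW drawback: two documents on the same topic that use different vocabulary would score zero, which is the kind of gap that category-based features are meant to close.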

The idea of using the WCG for computing semantic relatedness is not new; it has been used successfully before [2]. Building on this idea, the authors propose a clustering method that uses Wikipedia categories as an alternative document representation. They also propose three path-based relatedness measures in the conceptual space formed from the WCG, and they have implemented a clustering search engine to evaluate the proposed methods empirically.
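The general notion of a path-based relatedness measure on a category graph can be sketched as follows. This review does not describe the three measures proposed in the paper, so the example uses a generic inverse shortest-path-length score as a stand-in; the toy graph, the category names, the scoring formula, and the way document-level relatedness is aggregated are all assumptions for illustration and are not taken from the paper.

```python
from collections import deque

# Hypothetical toy category graph (undirected edges); not real WCG data.
EDGES = [
    ("Science", "Computer science"),
    ("Science", "Biology"),
    ("Computer science", "Machine learning"),
    ("Computer science", "Information retrieval"),
    ("Biology", "Genetics"),
]


def build_graph(edges):
    """Build an undirected adjacency map from a list of (category, category) edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def path_length(graph, source, target):
    """Number of edges on a shortest path (breadth-first search), or None if unreachable."""
    if source not in graph or target not in graph:
        return None
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def relatedness(graph, a, b):
    """Assumed stand-in measure: 1 / (1 + shortest path length), 0.0 if unreachable."""
    dist = path_length(graph, a, b)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)


def document_relatedness(graph, cats_a, cats_b):
    """Average, over the categories of A, of the best-matching category in B."""
    if not cats_a or not cats_b:
        return 0.0
    best = [max(relatedness(graph, a, b) for b in cats_b) for a in cats_a]
    return sum(best) / len(best)


GRAPH = build_graph(EDGES)
```

With this toy graph, "Machine learning" and "Information retrieval" are two edges apart while "Machine learning" and "Genetics" are four edges apart, so a document tagged with the first pair of categories is scored as more related than one tagged with the second. Note that the aggregation in `document_relatedness` is asymmetric; a symmetric variant would average the scores in both directions.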

For the experiments, the authors use documents that are tagged a priori with Wikipedia categories and employ the OPTICS and K-means clustering algorithms for evaluation. However, the ultimate goal must be to use a large-scale multilabel text classifier as a first stage to automatically tag raw text with Wikipedia categories; the authors leave this for future work.

[1] Manning, C. D.; Raghavan, P.; Schütze, H. An introduction to information retrieval. Cambridge University Press, New York, NY, 2009.

[2] Zesch, T.; Gurevych, I. Analysis of the Wikipedia category graph for NLP applications. In Proceedings of the TextGraphs-2 Workshop (NAACL-HLT). Association for Computational Linguistics, New Brunswick, NJ, 2007, 1–8.

Reviewer: M. Sohel Rahman

Review #: CR145636 (1801-0027)
