Computing Reviews

On Using Partial Supervision for Text Categorization
Aggarwal C. (ed), Gates S., Yu P. (ed) IEEE Transactions on Knowledge and Data Engineering 16(2): 245-255, 2004. Type: Article
Date Reviewed: 04/20/05

The automatic classification and categorization of textual documents has been a very active research field since the 1990s. One of the main reasons for this is the considerable amount of digital information now available. Search engine users expect to obtain results through effective content-based technologies, and many experiments have demonstrated that automatic classification and categorization of text data can contribute to this objective.

Most classification techniques are effective for classifying documents into very general categories (such as the higher levels of the Yahoo! categories), but they have proven to be mostly inaccurate in distinguishing fine-grained, related categories. This paper presents an innovative method that automatically categorizes documents using a supervised clustering process. First, the method uses a preexisting taxonomy to supervise, through an automatic classification process, the creation of closely related clusters. The objective is then to identify the specific words that best describe the subject matter of each cluster. Finally, documents are automatically categorized into the taxonomy, including the closely related clusters.
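The three steps described above (seeding clusters from a taxonomy, picking descriptive words for each cluster, and assigning documents) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the function names, the weighting scheme, the `top_k` cutoff, and the toy seed documents are all assumptions chosen for illustration, and cosine similarity stands in for whatever similarity measure the paper actually uses.

```python
from collections import Counter
import math


def tokenize(text):
    # Naive whitespace tokenizer (assumption; a real system would do more).
    return text.lower().split()


def build_centroids(labeled_docs, top_k=5):
    """Build one word-weight centroid per taxonomy category.

    labeled_docs: list of (category, text) pairs taken from a preexisting
    taxonomy. Only the top_k highest-weighted words are kept per category.
    """
    totals = {}
    for category, text in labeled_docs:
        totals.setdefault(category, Counter()).update(tokenize(text))

    # Count in how many categories each word occurs, so that words shared
    # by closely related categories receive a lower weight.
    category_freq = Counter()
    for counts in totals.values():
        category_freq.update(counts.keys())

    n = len(totals)
    centroids = {}
    for category, counts in totals.items():
        weighted = {
            word: count * math.log(1 + n / category_freq[word])
            for word, count in counts.items()
        }
        top = sorted(weighted.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        centroids[category] = dict(top)
    return centroids


def cosine(vec_a, vec_b):
    # Cosine similarity between two sparse word-weight mappings.
    dot = sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def categorize(text, centroids):
    # Assign the document to the category whose centroid is most similar.
    vec = Counter(tokenize(text))
    return max(sorted(centroids), key=lambda c: cosine(vec, centroids[c]))


# Hypothetical seed documents for two closely related music categories.
seed_docs = [
    ("jazz", "saxophone improvisation swing jazz band"),
    ("jazz", "jazz trumpet swing solo"),
    ("opera", "soprano aria opera stage"),
    ("opera", "opera tenor aria libretto"),
]
centroids = build_centroids(seed_docs)
print(categorize("a swing band with a saxophone solo", centroids))
```

Truncating each centroid to its highest-weighted words is one simple way to approximate the idea of keeping only the words that best describe a cluster; the cutoff of five words is arbitrary.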

The method was compared to a manual categorization process using the Yahoo! taxonomy. The results indicate that categorization using this novel method is generally just as good as the manual Yahoo! categorization. The main advantage of this method is that it is completely automatic (while the Yahoo! categorization is entirely manual).

The main contributions of this paper are its validation of the effectiveness of using a clustering process to build a set of fine-grained, related categories, and its demonstration that supervised clustering can be an interesting alternative to traditional categorization using a predefined set of categories.

Reviewer:  Dominic Forest Review #: CR131157 (0510-1174)

Reproduction in whole or in part without permission is prohibited.   Copyright 2024 ComputingReviews.com™