An automatic rule-based method for assigning texts to predetermined categories is described in this straightforward account. A training sample of texts with known classification tags is used to construct rules capable of classifying new unknown texts, based on the occurrence in the texts of particular words and phrases. For example, a text containing the phrase “running back” would be immediately assigned to the category “football”; the same is true of items containing both the words “award” and “player.” Heuristic methods are used to find simple rules that properly separate all classes.
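How rules of this kind might be applied can be sketched briefly. The sketch below covers only rule application, not the heuristic rule construction, which the review does not detail. The rule set, the function names, and the word-boundary matching are illustrative assumptions rather than the authors' implementation; only the two example rules come from the text above.

```python
import re

# Hypothetical rule set: each rule pairs a tuple of required terms with a
# category. A rule fires when every required word or phrase occurs in the text.
RULES = [
    (("running back",), "football"),
    (("award", "player"), "football"),
]

def contains(text, term):
    """Return True if `term` (a word or phrase) occurs in `text` on word boundaries."""
    pattern = r"\b" + r"\s+".join(re.escape(w) for w in term.split()) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def classify_by_rules(text, rules=RULES):
    """Return the set of categories whose rules fire on `text`."""
    return {
        category
        for terms, category in rules
        if all(contains(text, term) for term in terms)
    }
```

A text that satisfies no rule receives an empty set of categories, which leaves room for a default or "unclassified" outcome.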
The authors have applied their method to samples of articles disseminated by the Reuters newswire service; they studied both English- and German-language articles. They compute evaluation parameters such as the percentage of total documents for a given category that are correctly classified, and find that their system performs with a high degree of accuracy. Moreover, no differences were detected between the English and the German collections, suggesting that the method may work across many languages.
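The per-category percentage mentioned above is the measure commonly called recall. A minimal sketch of the computation follows, assuming each document may carry several category tags; the function name and the data layout are hypothetical, and no figures from the paper are reproduced.

```python
def category_recall(true_labels, predicted_labels, category):
    """Percentage of the documents truly in `category` that were assigned to it.

    `true_labels` and `predicted_labels` are parallel lists of sets of tags
    (an assumed layout). Returns None if no document belongs to the category.
    """
    relevant = [i for i, tags in enumerate(true_labels) if category in tags]
    if not relevant:
        return None
    hits = sum(1 for i in relevant if category in predicted_labels[i])
    return 100.0 * hits / len(relevant)
```

Recall alone ignores documents wrongly assigned to a category, so it is normally read alongside a complementary measure such as precision.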
Many competing automatic classification methods have been described in the literature, including systems not based on a construction of formal classification rules. For example, the vocabulary of available training documents can be used to build a class profile (sometimes called a centroid) for each category of interest. Each centroid then consists of a large number of words and phrases plus, possibly, the associated term frequency information. An unknown document can then be classified by comparing its vocabulary globally with all existing centroids, and then assigning the text to the class with the best-matching centroid. Such a method may be more robust than the rule-based approach when the vocabulary changes dynamically over time, because explicit rules are never recorded. The authors prefer the rule-based approach because the rules are “explicitly interpretable” and are “compatible with human-expressed knowledge.”
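The centroid alternative described above can be sketched in a few lines. This is one possible reading under stated assumptions: single words only (no phrases), raw term frequencies as weights, and cosine similarity as the matching function, none of which is specified in the text. All function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_centroids(labeled_docs):
    """Sum term frequencies per category from (text, category) pairs."""
    centroids = {}
    for text, category in labeled_docs:
        centroids.setdefault(category, Counter()).update(tokenize(text))
    return centroids

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify_by_centroid(text, centroids):
    """Assign `text` to the category whose centroid is most similar to it."""
    doc = Counter(tokenize(text))
    return max(centroids, key=lambda c: cosine(doc, centroids[c]))
```

Because a centroid is only a sum of counts, it can be updated as new labeled documents arrive without rebuilding anything, which is one way to picture the robustness to vocabulary change noted above.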
No comparison of the effectiveness of the rule-based classification method with other competing methods is included.