Generally, two kinds of information--textual and contextual--are used for document processing tasks such as classification and retrieval. Textual information is based on characteristics associated with words, for example a word’s frequency, location, and co-occurrences; these characteristics are captured by widely used lexical, syntactic, and even semantic-level text analysis. Contextual information consists of features of the whole discourse that are not directly visible in the text itself, for example its genre, register, domain terminology, and document structure. (The authors of this paper call these features discourse variables.)
No prior work has specifically investigated the correlation between these two kinds of text features. This paper makes a good attempt at testing the impact of discourse variables on one indexing algorithm (n-gram) and two clustering algorithms (k-means and Chen’s), although the authors’ explanation of the results has some limitations.
The authors start with a very good overview of discourse analysis. Their references are quite complete, although it would have been helpful to include Biber’s work on register analysis [1] and the various studies on stylometry. Because these discourse variables are closely interrelated and defined in many ways, clarifying the sense in which the terms are used is very important. The authors clearly explain the meanings of the concepts genre, register, and domain that were adopted for this study: genre denotes document typology, register denotes style, and domain denotes the field of discourse.
It was reasonable for the authors to choose n-gram as the indexing method for their test; it is a widely used, general-purpose method. But the authors did not explain why they chose k-means and Chen’s as examples of classification methods. From a machine learning perspective, both supervised learning and unsupervised learning (clustering) methods are widely used for classification. K-means and Chen’s are both clustering approaches; k-means is a general-purpose method, while Chen’s is limited to text data. No supervised learning methods, such as decision trees (DT) or naive Bayes (NB), are included in this study, and no other general-purpose clustering methods, such as hierarchical agglomerative clustering (HAC), are included either.
The authors conclude that the n-gram method is not affected by discourse variables, while k-means is affected by domain terminology and document structure, and Chen’s algorithm is affected by all the discourse variables. The authors’ explanation did not take into account the different steps of the whole classification process, and thus did not identify which step was actually affected by the discourse variables. Basically, a classification process has two steps: feature selection and classification. A general-purpose classification algorithm performs the second step independently of feature selection, and in that step the similarity metric plays a very important role.
In this paper, n-gram keywords were selected as features and fed into both k-means and Chen’s as input. This feature selection method is clearly not discourse independent; for example, the keyword distribution is correlated with document structure. That would explain the otherwise puzzling result that the general-purpose k-means is affected by the discourse variables: it is the feature selection step that was affected, not the algorithm itself. It is therefore predictable that the conclusion for k-means would hold for other general-purpose classification algorithms, such as DT, NB, or HAC, as long as the same feature set is used.
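This point, that any discourse sensitivity enters through feature selection rather than through the clustering step, can be illustrated with a minimal sketch. This is not the paper’s implementation; the character trigram features, the cosine metric, and the toy k-means below are all illustrative assumptions:

```python
import math
import random
from collections import Counter

def ngram_features(text, n=3):
    """Character n-gram counts: a stand-in for the paper's n-gram indexing.
    Any discourse sensitivity enters here, in the feature selection step."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def kmeans(docs, k=2, iters=10, seed=0):
    """A toy, general-purpose k-means over whatever features it is given."""
    random.seed(seed)
    feats = [ngram_features(d) for d in docs]
    centroids = random.sample(feats, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for f in feats:
            clusters[max(range(k), key=lambda c: cosine(f, centroids[c]))].append(f)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties out
                merged = Counter()
                for f in members:
                    merged.update(f)
                centroids[c] = Counter({g: v / len(members) for g, v in merged.items()})
    return [max(range(k), key=lambda c: cosine(f, centroids[c])) for f in feats]
```

The clustering step sees only the feature vectors, so swapping DT, NB, or HAC in for `kmeans` would leave `ngram_features`, and hence the discourse sensitivity, untouched.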
Chen’s algorithm, because co-word information is also used in its similarity metric, was even more strongly affected by the discourse variables. Note that this correlation between feature selection and the discourse variables does not conflict with the authors’ conclusion that n-gram indexing is not affected: the latter finding is based on a horizontal comparison between n-gram and medical subject headings (MeSH) indexing, while the former is based on a vertical comparison between different values of the same discourse variable.
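Chen’s actual metric is not reproduced here, but a small sketch, with an assumed 0.1 weighting and toy keyword lists, shows how folding corpus-level co-word counts into a similarity metric adds a second channel through which discourse variables can act:

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs_keywords):
    """Corpus-level co-word counts: pairs of keywords seen in the same document."""
    co = Counter()
    for kws in docs_keywords:
        for a, b in combinations(sorted(set(kws)), 2):
            co[(a, b)] += 1
    return co

def coword_similarity(kws1, kws2, co, weight=0.1):
    """Direct keyword overlap, plus partial credit (the 0.1 weight is an
    arbitrary assumption) for non-shared keywords that co-occur corpus-wide."""
    s1, s2 = set(kws1), set(kws2)
    direct = len(s1 & s2)
    indirect = sum(co.get(tuple(sorted((a, b))), 0)
                   for a in s1 - s2 for b in s2 - s1)
    return direct + weight * indirect
```

Because the co-occurrence table is built from the whole corpus, discourse variables can influence this similarity both through each document’s keywords and through the corpus-wide co-word statistics, which is consistent with Chen’s algorithm being the more strongly affected of the two.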
Despite the limitations of the experimental methodology, this paper does important work in showing that classification methods may be affected by contextual aspects even though they appear to use only textual information. This finding suggests that there is no hard boundary between textual and contextual information. Understanding how this impact arises is critical for testing these methods, and for generalizing them to other discourses. Also, as the authors claim, their work provides a better perspective on how to use contextual information in text analysis.