Traditionally, n-grams are derived by extracting groups of words as they appear in a text. This paper describes a new way to formulate n-grams, referred to as syntactic n-grams (sn-grams). Sn-grams are built by extracting groups of words not in the order in which they appear in a text, but according to how they are connected in syntactic parse trees. Derivations from both constituent grammar parses and typed dependency parses are possible. This difference in derivation is interesting and could affect any task in which n-grams are used.
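To make the distinction concrete, the following is a minimal sketch of my own, not the authors' implementation: it contrasts surface bigrams with sn-grams of size two taken along the arcs of a typed dependency parse. The toy sentence and its hand-coded head indices are illustrative assumptions; a real system would obtain them from a parser.

```python
# Sketch: surface bigrams vs. syntactic bigrams (sn-grams of size 2).
# The dependency parse below is hand-built; indices are 0-based and
# the root's head is marked -1.

def surface_ngrams(words, n):
    """Traditional n-grams: consecutive words in surface order."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def syntactic_bigrams(words, heads):
    """Sn-grams of size 2: (head word, dependent word) pairs taken
    along dependency arcs rather than surface adjacency."""
    return [(words[h], words[i]) for i, h in enumerate(heads) if h >= 0]

# "the cat sat on the mat", with a hand-coded dependency analysis:
# the->cat, cat->sat, on->sat, the->mat, mat->on; "sat" is the root.
words = ["the", "cat", "sat", "on", "the", "mat"]
heads = [1, 2, -1, 2, 5, 3]

print(surface_ngrams(words, 2))
# surface bigrams include function-word pairs such as ("on", "the")
print(syntactic_bigrams(words, heads))
# sn-grams pair words with their syntactic heads, e.g. ("sat", "cat"),
# regardless of how far apart they sit in the surface string
```

The point of the contrast is visible even in this toy case: the surface bigram ("on", "the") pairs two stop words that happen to be adjacent, while the corresponding sn-grams link each word to its governor in the tree.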
N-grams have found plenty of uses; indeed, they are the mainstay of many language models. However, they suffer from two main problems. The first is data sparseness: many legitimate word sequences never appear in the training data, a problem that is especially acute when little data is available. The second involves stop words, words that carry little meaning on their own. Because traditional n-grams are extracted from the surface order of a text, stop words can make their way into them. The use of sn-grams can potentially overcome, or at least alleviate, both problems.
The authors apply sn-grams to the problem of author attribution, the task of identifying the author of a piece of text. Comparing an approach based on sn-grams to one based on traditional n-grams, the paper shows that sn-grams perform better.
It would have been more interesting if the authors had compared sn-grams to other related techniques, such as path features and string kernels. Though not exactly equivalent, these are common ways of exploiting syntactic information.
Furthermore, while sn-grams outperform n-grams for author attribution, the case for the superiority of sn-grams would have been more convincing if either a more state-of-the-art approach to the problem had been used as the comparative baseline [1], or the authors had chosen another problem that better highlights the value of the sn-gram approach. Many advanced approaches to author attribution, such as those based on topic models, have been applied with good success.
I would suggest that the authors compare the performance of sn-grams to that of conditional random fields (CRFs) [2] in future work. Since a typical CRF classifier uses n-gram features, it would be quite exciting if the authors could show that sn-grams boost the performance of such classifiers.
I definitely found this paper interesting to read. The idea of syntactic n-grams has the potential to be very useful. The paper is also clearly written, and the approach is adequately explained.