One might think that automatic language identification (LI) is straightforward; surely, distinguishing English from Polish is easy. This review shows that the problem is much harder than one might expect. For example, distinguishing Modern Standard Arabic (MSA) from the Egyptian dialect is hard, especially when the text is a tweet. The review sets out to cover the current state of the art of LI. To give some idea of its scope and of the material reviewed, the paper runs to more than 100 pages, more than one-third of which is bibliography.
The authors divide the LI process into four steps. First, a document representation is selected. Second, a language model for each language in a predefined set is derived from a training corpus. Third, a function is defined that determines how well a given document fits the language model for each training language. Finally, the language of the document is predicted.
Language models can be built from n-grams of characters (bearing in mind that “character” is not well defined for all languages), from n-grams of words, or from many other possible features. The authors give an exhaustive survey of the features that can be used. The probability of such a feature occurring in a text can then feed into the quantitative function used for language identification. The authors also survey papers that use a mixture of feature types. Support vector machines, decision trees, and neural networks, among others, have been used as classifiers.
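To make the four steps and the character n-gram features concrete, here is a minimal sketch in Python. It is not taken from the paper under review: the function names, the two-language toy corpus, the choice of trigrams, and the add-one smoothed log-likelihood score are all assumptions made for illustration, and a real system would train on far more text.

```python
import math
from collections import Counter

# Hypothetical toy training corpus; far too small for real use.
TOY_CORPUS = {
    "en": ["the quick brown fox jumps over the lazy dog",
           "language identification is harder than it looks"],
    "pl": ["szybki brązowy lis przeskakuje nad leniwym psem",
           "identyfikacja języka jest trudniejsza niż się wydaje"],
}

def char_ngrams(text, n=3):
    """Step 1: represent a document as overlapping character n-grams."""
    # Pad with spaces so that word-initial and word-final n-grams are kept.
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(corpus, n=3):
    """Step 2: derive one n-gram count model per language."""
    return {lang: Counter(g for doc in docs for g in char_ngrams(doc, n))
            for lang, docs in corpus.items()}

def score(text, counts, vocab_size, n=3):
    """Step 3: add-one smoothed log-likelihood of the text under one model."""
    total = sum(counts.values())
    return sum(math.log((counts[g] + 1) / (total + vocab_size))
               for g in char_ngrams(text, n))

def identify(text, models, n=3):
    """Step 4: predict the language whose model fits the text best."""
    # Shared vocabulary: every n-gram seen in training, plus one slot
    # standing in for all unseen n-grams.
    vocab_size = len(set().union(*models.values())) + 1
    return max(models,
               key=lambda lang: score(text, models[lang], vocab_size, n))

models = train(TOY_CORPUS)
```

Calling `identify("the dog is lazy", models)` then returns the label of the best-scoring model. Very short inputs such as tweets or search phrases supply only a handful of n-grams to score, which is one way to see why document length matters so much in evaluation.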
The paper also surveys empirical evaluations of LI systems. One issue here is the document length needed to identify the language; think of tweets or search phrases. The authors note that standardized datasets are needed to enable comparisons between systems. A section of the review is devoted to application areas. Another section reviews off-the-shelf language identifiers. A lengthy section covers research directions and open issues, among them multilingual documents.
The paper contains a significant number of tables, each listing the papers that cover a specified topic. These will be very useful to readers wishing to pursue a particular topic further. The paper is an impressive contribution that highlights the complexities of an area that one might overlook.