This detailed and well-written paper presents a study on the normalization of informal text. Normalization is the task of converting or correcting informal language into its formal equivalent; an example would be expanding the abbreviation “tmr” to “tomorrow.” Informal language is prevalent in many modern applications and poses an obstacle to language processing technologies, most of which were researched and developed on proper, formal language.
The authors examined two models: one based on sequence labeling with a conditional random field (CRF), and the other based on statistical machine translation (MT). Both were shown to outperform competitive baselines. The authors went on to show that the two models can be easily combined into a hybrid system that further improves performance.
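To make the hybrid idea concrete, here is a minimal sketch of one generic way a token-level labeler and a translation model could be combined; this is an illustration of the general strategy, not the paper's actual combination method, and the confidence threshold and mocked model outputs are my own assumptions.

```python
# Illustrative hybrid (NOT the paper's method): prefer the CRF's token-level
# output when it is confident, and back off to the MT system's output
# otherwise. Both systems are mocked with hard-coded, token-aligned outputs.

def hybrid_normalize(crf_tokens, crf_confidences, mt_tokens, threshold=0.8):
    """Pick each token from the CRF when confident, else from the MT output.

    Assumes both systems emit token-aligned sequences of equal length;
    the 0.8 threshold is an arbitrary placeholder.
    """
    return [
        crf_tok if conf >= threshold else mt_tok
        for crf_tok, conf, mt_tok in zip(crf_tokens, crf_confidences, mt_tokens)
    ]

crf_out = ["see", "you", "tmr"]       # CRF left "tmr" unchanged...
crf_conf = [0.95, 0.91, 0.40]         # ...but with low confidence
mt_out = ["see", "you", "tomorrow"]   # MT expanded it

print(hybrid_normalize(crf_out, crf_conf, mt_out))  # ['see', 'you', 'tomorrow']
```

Even a simple back-off rule like this shows why the two approaches complement each other: the labeler is precise on token-level substitutions, while the translation model can recover cases the labeler is unsure about.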
The paper is worth reading for several reasons. First, it gives a good introduction to the problem and to related work, which is informative and will be useful for new researchers in the field. Second, it explains in detail the experiments that were conducted, and many of the authors' decisions are soundly justified. It is a convincing piece of work and a good reference for sound scientific writing.
The paper piqued my interest in this area of research and made me want to test some ideas that came to mind while reading it. For example, I suspect that a pure language-model baseline would have performed better than reported. The authors did not elaborate on how their language model was derived, but considering that the work was first done in 2011, it would be interesting to see whether newer language models built on larger text corpora would yield a stronger baseline.
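To illustrate what such a pure language-model baseline might look like, here is a minimal sketch: each informal token is mapped to a set of candidate expansions, and the candidate with the highest language-model score is kept. The candidate lexicon and unigram counts below are toy placeholders of my own, not data from the paper.

```python
# Hypothetical pure-LM normalization baseline: score candidate expansions of
# each token with a smoothed unigram model and keep the best-scoring one.
import math

# Toy candidate lexicon: informal token -> possible formal expansions.
CANDIDATES = {
    "tmr": ["tomorrow", "timer"],
    "u": ["you", "u"],
    "gr8": ["great"],
}

# Toy unigram counts standing in for a model trained on a formal corpus.
UNIGRAM_COUNTS = {"tomorrow": 120, "timer": 15, "you": 900, "u": 2, "great": 300}
TOTAL = sum(UNIGRAM_COUNTS.values())

def log_prob(word: str) -> float:
    # Add-one (Laplace) smoothing over the toy vocabulary.
    return math.log((UNIGRAM_COUNTS.get(word, 0) + 1) / (TOTAL + len(UNIGRAM_COUNTS)))

def normalize(tokens):
    # Tokens without candidate expansions pass through unchanged.
    return [max(CANDIDATES.get(tok, [tok]), key=log_prob) for tok in tokens]

print(normalize(["u", "free", "tmr"]))  # ['you', 'free', 'tomorrow']
```

A real version of this baseline would use an n-gram (or neural) model over sentence context rather than isolated unigram scores, which is precisely where larger corpora and newer models could make the baseline much stronger.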
I recommend this paper to researchers interested in this area. It is well written and informative, and I believe any time spent reading it would be worthwhile.