In machine translation, system efficiency refers to translation quality, which largely depends on factors such as source-to-target language similarity; the available language resources; and the translation model, that is, the algorithm that maps source-language samples into the target language. Many machine translation systems employ language models, which use a probability distribution to assign a probability to a sequence of a fixed number of words. In this paper, the authors propose an approach that uses comparable corpora to extract knowledge about the Persian language for use in translation from English. To compare the corpora in the two languages, the documents are first aligned and similarity scores are then computed. The authors follow the Kullback-Leibler divergence model; in ambiguous cases, they apply a naïve Bayes rule. Prior probabilities are estimated with the Jelinek-Mercer method and Dirichlet prior smoothing. The alignment similarity is then normalized.
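To make the scoring step concrete, here is a minimal sketch of how two smoothed language models might be compared with Kullback-Leibler divergence. It is not the authors' implementation. It assumes unigram models and assumes that both documents have already been mapped to a shared vocabulary (for instance through a bilingual lexicon, which is not shown). The function names, the toy documents, and the values of `lam` and `mu` are all illustrative choices, not taken from the paper.

```python
import math
from collections import Counter


def collection_model(docs):
    """Maximum-likelihood unigram model over all documents (background model)."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def jelinek_mercer(doc, coll_probs, lam=0.5):
    """Linear interpolation of the document model with the collection model.

    lam is the weight given to the collection model (assumed value).
    """
    counts = Counter(doc)
    n = len(doc)
    return {
        w: (1 - lam) * (counts[w] / n if n else 0.0) + lam * p
        for w, p in coll_probs.items()
    }


def dirichlet(doc, coll_probs, mu=10.0):
    """Dirichlet prior smoothing: mu pseudo-counts drawn from the collection model."""
    counts = Counter(doc)
    n = len(doc)
    return {w: (counts[w] + mu * p) / (n + mu) for w, p in coll_probs.items()}


def kl_divergence(p, q):
    """KL(p || q) over a shared vocabulary; lower means more similar."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)


# Toy documents, assumed to be already projected into one vocabulary.
doc_a = "the cat sat on the mat".split()
doc_b = "the cat lay on the mat".split()
doc_c = "stock prices fell sharply today".split()

background = collection_model([doc_a, doc_b, doc_c])

jm_a, jm_b, jm_c = (jelinek_mercer(d, background) for d in (doc_a, doc_b, doc_c))
dir_a, dir_b, dir_c = (dirichlet(d, background) for d in (doc_a, doc_b, doc_c))

print("JM        a vs b:", kl_divergence(jm_a, jm_b))
print("JM        a vs c:", kl_divergence(jm_a, jm_c))
print("Dirichlet a vs b:", kl_divergence(dir_a, dir_b))
print("Dirichlet a vs c:", kl_divergence(dir_a, dir_c))
```

Both smoothing methods give every vocabulary word a nonzero probability, which keeps the logarithm in the divergence finite. Under either method, the two near-identical documents should score a lower divergence than the unrelated pair. A normalization step, as mentioned in the paper, would then rescale such scores so that they are comparable across document pairs.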
Overall, this work suggests that even a relatively poor resource, such as comparable corpora, can be exploited efficiently for knowledge extraction. Such language knowledge is clearly a valuable foundation for translation.
The topic is interesting, but the presentation is somewhat vague: the paper omits some precise information, such as how translation quality is measured; the weighting in the models is intuitive rather than formal; and word translations are treated as statistically independent, which limits vocabulary coverage, especially for technical terms. Nevertheless, my overall opinion is positive, and I recommend the paper to students working in the field, as well as to those who plan to enter it.