Most research on improving statistical machine translation has focused on the translation model component of the Fundamental Equation (2.2). We now examine this translation model probability, P(F|E), in more detail.

The translation model probability cannot be reliably estimated from whole sentences treated as units, because of data sparseness. Instead, the sentences are decomposed into sequences of words. In the IBM models, words in the target sentence are aligned with the word in the source sentence that generated them. The translation in Figure 2.2 contains an example of a reordered alignment ("black cat" to "chat noir") and of a many-to-one alignment ("fish" to "le poisson").

The black cat likes fish

Le chat noir aime le poisson

Figure 2.2: Example of a word-aligned translation from English to French.

2.3.1 Definitions

Suppose we have E, an English sentence with I words, and F, a French sentence with J words.

According to the noisy channel model, a word alignment aligns a French word f_j with the English word e_i that generated it (Brown et al. 1993). We call A the set of word alignments that account for all words in F.

Then the probability of the alignment is the product of individual word alignment probabilities.
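The product can be written out explicitly. The following is only a sketch, assuming that each French word f_j is generated by exactly one English word whose position is denoted a_j; the index a_j and the per-word probability p are notation introduced here for illustration and may differ from that of Brown et al. (1993):

```latex
P(F, A \mid E) = \prod_{j=1}^{J} p\left(f_j, a_j \mid e_{a_j}\right)
```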

P(F, A|E) is called the statistical alignment model. However, there are multiple possible sets A of word alignments for a sentence pair. Therefore, to obtain the probability of a particular translation, we sum over the set 𝒜 of all such A.
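The summation over alignments can be illustrated with a brute-force sketch for a toy sentence pair. It rests on several assumptions that are not part of the definitions above: each French word links to exactly one English word, every alignment is equally likely a priori (as in the simplest IBM model), there is no null word, and the word translation probabilities in `t` are made-up values. The function names are hypothetical.

```python
from itertools import product
from math import prod

# Made-up word translation probabilities t[(f, e)] = p(f | e), for illustration only.
t = {
    ("le", "the"): 0.8, ("chat", "the"): 0.2,
    ("le", "cat"): 0.1, ("chat", "cat"): 0.9,
}

def alignment_probability(F, E, A, t):
    """P(F, A | E) for one alignment A, where A[j] is the index of the
    English word assumed to have generated the French word F[j]."""
    prior = 1.0 / (len(E) ** len(F))  # assumed uniform prior over alignments
    return prior * prod(t[(f, E[i])] for f, i in zip(F, A))

def translation_probability(F, E, t):
    """P(F | E): sum of P(F, A | E) over every possible alignment A."""
    all_alignments = product(range(len(E)), repeat=len(F))
    return sum(alignment_probability(F, E, A, t) for A in all_alignments)

E = ["the", "cat"]
F = ["le", "chat"]
p = translation_probability(F, E, t)
```

The enumeration visits I^J alignments, so it is practical only for very short sentences; it is meant to make the sum concrete, not to be efficient.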

2.3.2 Calculating the Translation Model Probability

P(F|E) would be simple to compute if word-aligned bilingual corpora were available. Because they are generally unavailable, word-level alignments are estimated from the sentence-aligned corpora that are accessible for many language pairs.

This alignment extraction is performed with an Expectation Maximization (EM) algorithm (Dempster et al. 1977). EM methods estimate hidden parameter values by maximizing the likelihood of a training set. In this case, EM estimates word translation probabilities (and other hidden variables of the translation model) by selecting those that maximize the sentence alignment probability of the sentence-aligned parallel corpus. The more closely the training set represents the text the machine translation system will be used on, the more accurate the EM parameter estimates will be. However, EM is guaranteed only to converge to a local maximum of the likelihood, so it may not produce the globally optimal translation probabilities.
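As an illustration of how EM can recover word translation probabilities from sentence-aligned data alone, here is a minimal sketch in the style of the simplest IBM model. It is not the full procedure: it assumes uniform alignment probabilities, omits the null word, fertility, and distortion, and uses a toy corpus invented for this example. The name `train_model1` is hypothetical.

```python
from collections import defaultdict

def train_model1(corpus, iterations=10):
    """corpus: list of (english_words, french_words) sentence pairs.
    Returns a dict t[(f, e)] approximating p(f | e)."""
    french_vocab = {f for _, F in corpus for f in F}
    # Start from a uniform distribution over the French vocabulary.
    t = defaultdict(lambda: 1.0 / len(french_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected number of times e generates f
        total = defaultdict(float)  # expected number of words generated by e
        # E-step: distribute each French word over the English words
        # of its sentence, in proportion to the current estimates.
        for E, F in corpus:
            for f in F:
                norm = sum(t[(f, e)] for e in E)
                for e in E:
                    posterior = t[(f, e)] / norm
                    count[(f, e)] += posterior
                    total[e] += posterior
        # M-step: re-estimate p(f | e) from the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return dict(t)

# Toy sentence-aligned corpus (invented for illustration).
corpus = [
    (["the", "house"], ["la", "maison"]),
    (["the", "flower"], ["la", "fleur"]),
    (["a", "flower"], ["une", "fleur"]),
]
t = train_model1(corpus, iterations=10)
```

Although no word alignments are given, "la" co-occurs with "the" in two sentence pairs, so its estimated probability given "the" grows with each iteration at the expense of "maison" and "fleur".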

For a training corpus of S aligned sentence pairs, the EM algorithm estimates the translation model parameters θ.
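Written as a maximum likelihood objective, and assuming the S sentence pairs are independent, the estimate can be sketched as follows. Here θ stands for the model parameters, and the symbols \hat{\theta}, F_s, and E_s (the s-th French and English sentences) are notation assumed for this illustration:

```latex
\hat{\theta} = \arg\max_{\theta} \prod_{s=1}^{S} \sum_{A} p_{\theta}\left(F_s, A \mid E_s\right)
```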

Given the word alignment probability estimates θ, the IBM translation models then collectively compute the translation model P(F|E) for a word-based statistical machine translation system.

Word-based statistical machine translation models required some elaborate model parameters (e.g. fertility, the likelihood of a word translating to more than one word) to account for how some real-world translations could be produced. Word-based models were limited in their handling of reordering, null words, and non-compositional phrases (which cannot be translated as a function of their constituent words). Recent phrase-based approaches to statistical machine translation, which use larger chunks of language as their basis, have improved upon the word-based models.