This chapter is an introduction to statistical language models. A statistical language model is a probabilistic way to capture regularities of a particular language, in the form of word-order constraints. In other words, a statistical language model expresses the likelihood that a sequence of words is a fluent sequence in a particular language. Plausible sequences of words are given high probabilities whereas nonsensical ones are given low probabilities.

We briefly mentioned statistical language models in Chapter 2; here we explain them in more detail, with the goal of motivating the use of factored language models for statistical machine translation. Section 3.1 explains the n-gram model and associated smoothing techniques that are the basis for statistical language modeling. A linguistically motivated approach to language modeling, probabilistic context-free grammars, is briefly described in Section 3.2. Section 3.3 details the language model and smoothing scheme used in this thesis: Factored Language Models with Generalized Parallel Backoff. Section 3.4 mentions a useful method for generating factored language models. Finally, Section 3.5 describes the metrics for evaluating statistical language models.

3.1 The N-gram Language Model

The most widespread statistical language model, the n-gram model, was proposed by Jelinek and Mercer (Bahl et al. 1983) and has proved to be simple and robust. Much like phrase-based statistical machine translation, the n-gram language model has dominated the field since its introduction despite disregarding any inherent linguistic properties of the language being modeled. Language is reduced to a sequence of arbitrary symbols with no deep structure or meaning – yet this simplification works.

The n-gram model makes the simplifying assumption that the nth word w depends only on the history h, which consists of the n – 1 preceding words. By neglecting the leading terms, it models language as a Markov chain of order n – 1.
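The approximation can be written out explicitly. The following is a sketch in assumed notation (a sentence of N words w_1, ..., w_N, with w_i the word at position i; these symbols are introduced here for illustration): the chain rule gives the exact decomposition, and the n-gram model truncates each conditioning context to the n – 1 most recent words.

```latex
\[
P(w_1, \ldots, w_N)
  = \prod_{i=1}^{N} P(w_i \mid w_1, \ldots, w_{i-1})
  \approx \prod_{i=1}^{N} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
\]
```

For a trigram model (n = 3), each factor on the right reduces to P(w_i | w_{i-2}, w_{i-1}).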

The value of n trades off the stability of the estimate (i.e. its variance) against its appropriateness (i.e. its bias) (Rosenfeld 2000). A high n provides a more accurate model, but a low n provides more reliable estimates. Despite the simplicity of the n-gram language model, obtaining accurate n-gram probabilities can be difficult because of data sparseness. Given infinite amounts of relevant data, the next word following a given history can be reasonably predicted with just the maximum likelihood estimate (MLE):

\[ P_{\mathrm{MLE}}(w \mid h) = \frac{\mathrm{Count}(h, w)}{\mathrm{Count}(h)} \]

The Count function simply measures the number of times something was observed in the training corpus. However, for a corpus of any size there is a value of n beyond which n-grams occur too infrequently to be estimated reliably. Because of this, trigrams are a common choice for n-gram language models based on multi-million-word corpora. Rosenfeld (1996) reports that a language model trained on 38 million words found one third of test trigrams, from the same domain, to be previously unseen. By comparison, this thesis uses the Europarl corpus, which is a standard parallel text, yet contains only 14 million words in each language. Rosenfeld (1996) furthermore showed that the majority of observed trigrams appeared only once in the training corpus.
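To make the role of counts concrete, the sketch below estimates n-gram probabilities by dividing the count of a history followed by a word by the count of the history, and then measures the two sparseness statistics discussed above: the fraction of test n-grams never seen in training, and the fraction of training n-grams seen exactly once. The function names and the toy corpus are illustrative assumptions, not part of any toolkit used in this thesis.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def train_counts(tokens, n):
    """Count n-grams and their (n-1)-word histories in a training corpus."""
    grams = ngrams(tokens, n)
    ngram_counts = Counter(grams)
    # Count a history only where a word follows it, so that the estimates
    # for each observed history sum to one.
    history_counts = Counter(g[:-1] for g in grams)
    return ngram_counts, history_counts


def p_mle(word, history, ngram_counts, history_counts):
    """Count(h, w) / Count(h); returns 0.0 if the history was never observed."""
    h = tuple(history)
    if history_counts[h] == 0:
        return 0.0
    return ngram_counts[h + (word,)] / history_counts[h]


def unseen_fraction(test_tokens, n, ngram_counts):
    """Fraction of test n-grams that never occurred in training."""
    test = ngrams(test_tokens, n)
    return sum(1 for g in test if ngram_counts[g] == 0) / len(test)


def singleton_fraction(ngram_counts):
    """Fraction of distinct training n-grams observed exactly once."""
    return sum(1 for c in ngram_counts.values() if c == 1) / len(ngram_counts)


# Toy corpus (hypothetical data, far too small for real estimation).
train = "the cat sat on the mat the cat ate".split()
test = "the cat sat on the rug".split()

ngram_counts, history_counts = train_counts(train, 3)
p_seen = p_mle("sat", ("the", "cat"), ngram_counts, history_counts)
p_unseen = p_mle("slept", ("the", "cat"), ngram_counts, history_counts)
frac_unseen = unseen_fraction(test, 3, ngram_counts)
frac_single = singleton_fraction(ngram_counts)
```

Even on this toy corpus, a word that never followed "the cat" in training receives probability zero, which illustrates why raw counts alone are a poor estimator when data are sparse.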

Because the value of n tends to be chosen to improve model accuracy, many of the resulting n-grams are observed only rarely or not at all, and direct MLE computation of n-gram probabilities from counts is likely to be highly inaccurate.