I admit, that I have some kind of obsessive disorder for understanding the theory behind of working things. A couple of years ago, I was working on search related projects. I have really loved working with search engines for years, because you can still cope with real-life applications and at the same time, you are aware of the science behind the information and text retrieval systems while evaluating the results. This passion of understanding, how search engines work and what their limits are, took me from the practice world to the beautiful theory, that focuses on, how human languages are processed i.e understood by machines.
From those days, as I was educating myself in this field, the term I remember, perplexity, which is a measure, that is used to evaluate language models and an intrinsic method to compare them. It roughly answers the question, how successful your language model is. A (statistical) language model is basically a probabilistic distribution on words. They are used to find out, how likely a word appears after an another. Let’s consider the example, that we have a vocabulary, :
and let’s say, the is the set of strings, that we can write in combination of the words from our vocabulary:
The STOP word has a special meaning. It is a terminal word, which has the purpose to point, where the sentences terminate. It also helps us to define the probabilistic distributions at the end of the sentences.
In language modelling problems, we deal with inputs, “training sets” and outputs, probabilistic distributions p, which our models need to learn from training sets. “p” is a probabilistic function, which must satisfy the following rule:
For all where
We are still investigating on, how likely the sentences appears in the training set. A naive approach to estimate these probabilities would be:
where N is the number of the sentences in our training set and is the number of times that the word sequence has been seen in the training set. In fact, there is a small problem with our naive estimation solution. Think about the sentences, that they don’t exist in the training set, yet. Then, the would be zero. However, there will be sentences outside of the training set, and new sentences are likely seen overtime. But, for the sake of ease, I am going to put this problem and its solutions aside.
If we had some experiments with the real world examples, some realistic estimations might be seen like following:
In fact, none of these values are empiric. I’ve just tried to give some consistent example estimations to give you an intuition about a reasonable distribution. For instance, according to these values, it is more likely to find sentences in our training set like “I love Engineering” or “Engineering and Science” than the one, “love love”. It is actually, what we might expect. If you want to see more realistic distributions, you can check out Google’s N-Gram Viewer. You can even embed it into your blog article
Now, I hope, that you got at least a clear picture of language modelling problem. Basically, we do work with the probability of word occurrences in a word sequence, e.g sentences, for instance, as the Trigram language model is intended.
The Trigram language model is in fact really a simple and widely used. As I was studying this field, I encountered it in many language processing text books, that is used as a reference to explain the challenges in language modelling problems. Even though trigram language models fail in long-range dependencies, it is still very useful in computational linguistics (as well as information retrieval). The model consists of two things:
First, finite set of words, a vocabulary,
and second; u,v,w such that and
For every sentence , and the probability of the sentence within Trigram model is:
where we define
if we apply the science to the following sentence “The quick brown fox jumps over the lazy dog”, the probability distribution would look like:
It was all about the language models. Now, we have a language model to evaluate. Before we begin evaluating our model, as I mentioned before, in order to be able to calculate the estimations used in probability distributions, like in our naive example, you must provide some training data to the language model.
– A couple of years ago, if you were following the tech news, you have probably heard of Google’s huge effort. This action even concluded in conflicts with the copyright holders. At first glance, not thinking evil for a moment, this effort might be taken within the Google Books Library Project. However, wouldn’t these millions of books be a great training set for the algorithms used in search engines as well as in language processing research ?
While you are evaluating your language model, you will need to provide a test set (unseen data), which has not been touched yet, during training your language model. In language model tests, we are essentially interested in probability distribution of the test set i.e it is the way to disclose, how good are the predictions made by the language model. Like flipping coins, we can calculate the probability of multiple independent events by multiplying their chances together. However, in computer science, in order to normalise the probability distribution and moreover, because of the fact, that the additive operations are executed more effectively than the multiplication by hardware, log-probability is a more preferred way as follows:
For some test sentences
where the perplexity is for
As you can see, we first normalise the probability distribution by dividing it with the total number of test sentences. Since it is the negative value of the in , the lower the perplexity is the better our language model. Here is a comparison of language models after training them with 38M words from WSJ:
Perplexity is a standard measure for language model and an intrinsic method to evaluate your language models. It tells us how successful our language model is. The perplexity is in fact not the only tool to evaluate language models. You can put your language model into applications like machine translation, speech recognition, so on (extrinsic approach) and evaluate the results. However, this approach would take a long time till you get your answers.