Perplexity: A non-scientific overview

I admit that I have a somewhat obsessive urge to understand the theory behind the things I work with. A couple of years ago, I was working on search-related projects. I have loved working with search engines for years, because you get to build real-life applications while staying aware of the science behind information and text retrieval systems as you evaluate the results. This passion for understanding how search engines work and where their limits lie took me from the world of practice to the beautiful theory of how human languages are processed, i.e. understood, by machines.

From those days of educating myself in this field, one term stuck with me: perplexity, a measure used to evaluate language models and an intrinsic method to compare them. It roughly answers the question of how successful your language model is. A (statistical) language model is basically a probability distribution over words; it tells us how likely a word is to appear after another. Let's consider an example. Say we have a vocabulary \Upsilon:

\Upsilon = \{\text{I, love, Science, Engineering, Fiction, don't, and, } \dots\}

and let's say \Upsilon' is the set of strings that we can form by combining words from our vocabulary:

\Upsilon' = \{\text{I love Science STOP, love STOP, Science Engineering STOP, don't STOP, } \dots\}

The STOP word has a special meaning. It is a terminal symbol whose purpose is to mark where a sentence ends. It also helps us define the probability distribution at the end of a sentence.

In language modelling problems, we deal with inputs, the training sets, and outputs, probability distributions p, which our models need to learn from those training sets. p is a probability function, so it must satisfy the following rule:

\sum\limits_{x \in \Upsilon'} p(x) = 1 \quad \text{and} \quad p(x) \geq 0 \text{ for all } x \in \Upsilon'

We are still asking how likely a sentence x \in \Upsilon' is, based on the training set. A naive approach to estimate these probabilities would be:

p(x_1,x_2,...,x_n) = \frac{c(x_1,x_2,...,x_n)}{N}

where N is the number of sentences in our training set and c(x_1,x_2,...,x_n) is the number of times the word sequence has been seen in the training set. In fact, there is a small problem with our naive estimate. Think about sentences that do not exist in the training set yet: for those, p(x) would be zero. However, there will always be sentences outside of the training set, and new sentences are likely to appear over time. For the sake of simplicity, I am going to put this problem and its solutions (smoothing techniques) aside.
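To make this naive estimate a bit more concrete, here is a tiny Scala sketch; the toy corpus, the sentence representation and the function name are purely hypothetical choices of mine:

// Naive maximum-likelihood estimate: count of the exact sentence divided by
// the total number of training sentences. Purely illustrative.
def naiveSentenceProbability(trainingSet: Seq[String], sentence: String): Double = {
  val n = trainingSet.size                  // N: number of training sentences
  val c = trainingSet.count(_ == sentence)  // c(x1,...,xn): occurrences of this exact sentence
  c.toDouble / n
}

val trainingSet = Seq("I love Science STOP", "I love Engineering STOP", "I love Science STOP")
naiveSentenceProbability(trainingSet, "I love Science STOP") // 2/3
naiveSentenceProbability(trainingSet, "love love STOP")      // 0.0 -- the sparsity problem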

If we ran some experiments with real-world data, realistic estimates might look like the following:

p(\text{I love Engineering STOP}) = 5 \times 10^{-9}
p(\text{love love STOP}) = 2 \times 10^{-15}
p(\text{Engineering and Science STOP}) = 3 \times 10^{-7}

In fact, none of these values is empirical. I have just tried to give some consistent example estimates to give you an intuition about a reasonable distribution. For instance, according to these values, sentences like “I love Engineering” or “Engineering and Science” are more likely to be found in our training set than “love love”, which is what we would expect. If you want to see more realistic distributions, you can check out Google’s N-Gram Viewer. You can even embed it into your blog article :)

Now I hope you have at least a clear picture of the language modelling problem. Basically, we work with the probabilities of word occurrences in a word sequence, e.g. a sentence, which is exactly what the trigram language model is built around.

The trigram language model is in fact really simple and widely used. As I was studying this field, I encountered it in many language processing textbooks, where it serves as a reference model to explain the challenges in language modelling. Even though trigram language models fail at long-range dependencies, they are still very useful in computational linguistics (as well as in information retrieval). The model consists of two things:

First, a finite vocabulary \Upsilon,
and second, a parameter q(w|u,v) for every trigram (u, v, w) such that w \in \Upsilon \cup \left\{ STOP \right\} and u,v \in \Upsilon \cup \left\{ \star \right\}.

For every sentence x_1, x_2, ..., x_n with x_i \in \Upsilon for i = 1, ..., n-1 and x_n = STOP, the probability of the sentence under the trigram model is:

p(x_1, x_2, ..., x_n) = \prod\limits_{i = 1}^n q(x_i|x_{i-2}, x_{i-1}) where we define x_0 = x_{-1} = \star

If we apply this to the sentence s = “the quick brown fox jumps over the lazy dog”, the probability decomposes as: p(x_1, x_2, ..., x_n) = q(the|\star , \star)\times q(quick|\star, the)\times q(brown|the, quick)\times ... \times q(STOP|lazy, dog)
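If you prefer code over formulas, the factorisation can be sketched in a few lines of Scala; the q function and all the names below are my own illustrative assumptions, not part of any real toolkit:

// Probability of a sentence under a trigram model, given some estimated
// conditional probabilities q(u, v, w) = q(w | u, v). Illustrative sketch only.
def sentenceProbability(words: Seq[String], q: (String, String, String) => Double): Double = {
  val padded = (Seq("*", "*") ++ words) :+ "STOP"  // x_{-1} = x_0 = *, the last token is STOP
  padded.sliding(3).map { case Seq(u, v, w) => q(u, v, w) }.product
}

// e.g. sentenceProbability("the quick brown fox jumps over the lazy dog".split(" ").toSeq, q)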

That was all about language models. Now we have a language model to evaluate. Before we begin, remember that in order to calculate the estimates used in the probability distribution, like in our naive example, you must provide some training data to the language model.

A couple of years ago, if you were following the tech news, you probably heard of Google’s huge effort to digitise millions of books, which even led to conflicts with copyright holders. At first glance, not thinking evil for a moment, this effort can be seen as part of the Google Books Library Project. But wouldn’t these millions of books also be a great training set for the algorithms used in search engines as well as in language processing research?

While evaluating your language model, you will need to provide a test set of unseen data, which was not touched while training the model. In language model tests, we are essentially interested in the probability the model assigns to the test set, i.e. it reveals how good the predictions made by the language model are. As with flipping coins, we can calculate the probability of multiple independent events by multiplying their individual probabilities. However, in practice we prefer log-probabilities: they normalise the result, additions are executed more efficiently than multiplications by hardware, and products of many small probabilities underflow quickly. So we work with the following identity:

For some test sentences s_1, s_2, ..., s_m

\log \prod\limits_{i = 1}^m p(s_i) = \sum\limits_{i = 1}^m \log p(s_i)

where the perplexity is defined as p = 2^{-l} for l=\frac{1}{M}\sum_{i = 1}^m \log p(s_i), with M being the total number of words in the test set.

As you can see, we first normalise the total log-probability by dividing it by M, the total number of words in the test set. Since the exponent is the negative of l, the lower the perplexity, the better our language model. Here is a comparison of language models after training them on 38M words from WSJ:

(Table: perplexity comparison of the trained language models.)

Perplexity is a standard measure for language models and an intrinsic method to evaluate them: it tells us how successful our language model is. Perplexity is, in fact, not the only tool for evaluating language models. You can also plug your language model into applications like machine translation or speech recognition and evaluate the end results (the extrinsic approach). However, that approach takes much longer before you get your answers.
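If you want to play with the numbers yourself, here is a small, purely illustrative Scala sketch of the perplexity computation; the function names and inputs are hypothetical choices of mine:

// Perplexity = 2^(-l), where l is the average log2-probability per word
// over the whole test set. Illustrative sketch only.
def perplexity(testSentences: Seq[Seq[String]], sentenceProb: Seq[String] => Double): Double = {
  val totalWords   = testSentences.map(_.size).sum  // M: total number of words in the test set
  val totalLogProb = testSentences.map(s => math.log(sentenceProb(s)) / math.log(2)).sum
  val l = totalLogProb / totalWords
  math.pow(2, -l)
}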

Lift it!

Partial functions are widely used in Scala. Just like their mathematical counterpart, a partial function f: X => Y is a generalisation of a function that does not necessarily map every element of the domain X to Y. You won’t get a response from the function if you pass a value it does not know how to deal with; instead, something exceptional will happen.

You see partial functions frequently, almost everywhere. Take the following example actor:

class HttpServerActor extends Actor {
  def receive = {
    case StartOperation => start()
    case StopNow => stop()
    case StopGracefulIn(x) => stop(x)
  }
}

HttpServerActor reacts to StartOperation, StopNow and StopGracefulIn, i.e. the receive function can only handle these messages. But what happens if the function cannot handle the input, i.e. the parameter is not within the function’s domain?

In fact, if an actor receives an event it cannot handle, surprisingly, nothing happens, i.e. you will not see an exception thrown at all. This behaviour is required to keep the actor’s state healthy, but that is out of scope for now. The actor example is just intended to show you that you meet partial functions almost everywhere.
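To get an intuition for why nothing blows up, here is a simplified, purely conceptual sketch of how a runtime could guard a partial receive function before applying a message to it; this is not how Akka actually implements it:

// Conceptual sketch: only apply the partial function if the message lies in
// its domain; otherwise log and drop the message. Not Akka's real internals.
def dispatch(receive: PartialFunction[Any, Unit], message: Any): Unit =
  if (receive.isDefinedAt(message)) receive(message)
  else println(s"unhandled message: $message")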

Let’s consider the following example. We want to calculate the absolute values of given numbers. So we first define a (buggy) partial function, which returns the absolute value of the number provided. However, the function is only defined for negative and positive numbers, so it cannot handle zero:

def abs: PartialFunction[Int, Int] = {
  case x: Int if x < 0 => x * (-1)
  case x: Int if x > 0 => x
}

abs(-1) // gives 1.
abs(1)  // gives 1.
abs(0)  // scala.MatchError: 0 (of class java.lang.Integer)

While the first two abs calls return the expected results, the third call fails by throwing a MatchError. The error might be legitimate application behaviour in your case, as long as you do not make the application’s behaviour dependent on exception handling, which would defeat referential transparency. On the other hand, if you want to process a list of integers which may contain some zeros, you sometimes don’t want to break the iteration because of an invalid input, but just move on to the next item in the list and ignore the invalid one, e.g.:

val numbers = List(-4, -2, 0, 2, 4, 6)
numbers.map(abs) // throws scala.MatchError: 0 (of class java.lang.Integer)

The input zero breaks the iteration because of the invalid value within the list. Sometimes this is not really what we want; in other words, we want to keep the iteration going while invalid inputs are being ignored. In this case, you can utilise lifting to transform the function into a safer one, e.g. one using Option types:

val numbers = List(-4, -2, 0, 2, 4, 6)
numbers.map(abs.lift)

The map call now produces Option values:
Some(4), Some(2), None, Some(2), Some(4), Some(6). For the value that does not lie in abs’ domain, we get a None instead of an exception. What happened?
Basically, the lift method converts the partial function abs: PartialFunction[Int, Int] into an ordinary (total) function of type Int => Option[Int].
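Here is the lifted function on its own, with a hypothetical val name just for demonstration:

val safeAbs: Int => Option[Int] = abs.lift
safeAbs(-4) // Some(4)
safeAbs(0)  // None -- no MatchError anymore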

flatMap redux

If you use the lifted function in combination with flatMap:

val numbers = List(-4, -2, 0, 2, 4, 6)
numbers.flatMap(abs.lift)

you will even get the numeric results back instead of Option values: List(4, 2, 2, 4, 6). The None produced for zero is simply flattened away.

In our examples, we did not write any custom lifting logic; abs was lifted to the standard Int => Option[Int] form, where the missing value is represented by None. You could, however, write your own lifting helper that targets another container type, as sketched below.
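Such a helper could look like the following sketch; liftToEither is a hypothetical function of mine, not something from the standard library:

// Hypothetical helper: lift a partial function into Either, keeping the
// rejected input on the Left side instead of silently dropping it.
def liftToEither[A, B](pf: PartialFunction[A, B]): A => Either[A, B] =
  a => if (pf.isDefinedAt(a)) Right(pf(a)) else Left(a)

numbers.map(liftToEither(abs)) // List(Right(4), Right(2), Left(0), Right(2), Right(4), Right(6))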

In fact, there is another method on the PartialFunction type, isDefinedAt, which you can use to find out whether a value lies within the function’s domain:

val numbers = List(-4, -2, 0, 2, 4, 6)
numbers.filter(abs.isDefinedAt).map(abs)

In the example above, we first filter the numbers using isDefinedAt and then apply the map. The result is the same as in the flatMap example.
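As a side note, this filter-then-map combination is essentially what the standard collect method does with a partial function, so the same result can be written more concisely:

numbers.collect(abs) // List(4, 2, 2, 4, 6)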