Language Models Overview
2019 has been called the year of Natural Language Processing (NLP) by many machine learning practitioners, and for good reason. The release of GPT-2 (Generative Pretrained Transformer) by OpenAI and of BERT (Bidirectional Encoder Representations from Transformers) by Google are two of the most notable breakthroughs, which have brought NLP applications to the next level. Today, state-of-the-art NLP systems can analyze textual data and extract knowledge from it with high precision, creating a lot of value in use cases such as text classification, question answering, named entity recognition, and relation extraction. Every NLP task differs in its implementation and in the mechanics of how algorithms process data and make inferences. However, most NLP tasks share a common element known as a language model, which makes these tasks possible. This is why in this blog post we give a brief overview of language models.
To start with, building a language model is a difficult process, simply because there are so many languages, and each one has its own grammar, punctuation, morphology and other features unique to it. In a nutshell, a language model computes the probability of a sequence of words (a sentence), where each word is conditioned on the words used before it in that sentence within a fixed-size window. In mathematical terms, this can be expressed as follows:
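A standard form of such an expression is the fixed-window factorization sketched here. The notation is an assumption rather than the author's own: $w_i$ denotes the i-th word, N the number of words in the sequence, and J the size of the context window.

```latex
P(w_1, \dots, w_N) \approx \prod_{i=1}^{N} P\left(w_i \mid w_{i-J}, \dots, w_{i-1}\right)
```

Each factor is the probability of one word given only the J words before it, and multiplying the factors gives the probability of the whole sequence.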
Simply put, the mathematical expression above shows how a language model calculates the probability of a sequence of N words, given the words used before them. In essence, the language model provides a way to decide which word is the better choice given the last J words (known as the context), so that a more relevant sentence receives a higher probability than a less relevant one.
See the following examples:
Welcome to Starbucks! Would you like a cup of coffee? P ≈ 0.95
Welcome to Starbucks! Would you like to rent an apartment? P ≈ 0.65
Welcome to Starbucks! Cup you would a like coffee of? P ≈ 0.15
From the examples above, it is clear that the first version of the sentence ("Would you like a cup of coffee?"), given the preceding context ("Welcome to Starbucks!"), is more relevant than the other two. This is why it receives the highest probability, while the less relevant versions receive lower ones.
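The ranking in the examples above can be reproduced with a very small model. The following is a minimal sketch, not the method behind the figures quoted above: it assumes a bigram model (a context window of one word), add-one smoothing, and a tiny made-up corpus. The names `CORPUS`, `tokenize` and `continuation_log_prob` are hypothetical and exist only for this illustration.

```python
import math
import re
from collections import Counter

# Tiny made-up training corpus (an assumption for illustration only).
CORPUS = [
    "welcome to starbucks would you like a cup of coffee",
    "welcome to starbucks would you like a cup of tea",
    "would you like a cup of coffee",
    "would you like to rent an apartment",
]


def tokenize(text):
    """Lower-case the text and keep only word tokens (punctuation is dropped)."""
    return re.findall(r"[a-z']+", text.lower())


# Count how often each word follows each one-word context.
bigram_counts = Counter()
context_counts = Counter()
vocabulary = set()
for line in CORPUS:
    tokens = tokenize(line)
    vocabulary.update(tokens)
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1


def continuation_log_prob(context, continuation):
    """Log-probability of `continuation` given `context` under the bigram model."""
    context_tokens = tokenize(context)
    if not context_tokens:
        raise ValueError("context must contain at least one word")
    tokens = context_tokens + tokenize(continuation)
    total = 0.0
    for i in range(len(context_tokens), len(tokens)):
        prev, word = tokens[i - 1], tokens[i]
        # Add-one (Laplace) smoothing keeps unseen word pairs from getting probability zero.
        prob = (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocabulary))
        total += math.log(prob)
    return total


context = "Welcome to Starbucks!"
candidates = [
    "Would you like a cup of coffee?",
    "Would you like to rent an apartment?",
    "Cup you would a like coffee of?",
]
scores = {c: continuation_log_prob(context, c) for c in candidates}
for sentence, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{score:8.3f}  {sentence}")
```

The scores are log-probabilities, so they are negative and closer to zero means more likely. On this toy corpus the coffee sentence ranks first, the apartment sentence second, and the scrambled sentence last, which matches the ordering in the examples, although the absolute values are not comparable to the probabilities quoted there.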
It is quite straightforward to understand how such language models work. In practice, however, this approach is computationally very expensive, because it requires estimating probabilities for all possible combinations of words in a sentence. Given the large vocabulary of any language (many thousands of unique words), the number of combinations grows exponentially with the length of the sentence, which makes the task even more challenging.
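A quick calculation shows how fast the number of combinations grows. This is a back-of-the-envelope sketch: the vocabulary size of 10,000 words is an assumed round figure, and `count_sequences` is a hypothetical helper written for this illustration.

```python
def count_sequences(vocab_size, length):
    """Number of distinct word sequences of a given length over a vocabulary."""
    return vocab_size ** length


VOCAB_SIZE = 10_000  # assumed round figure, not a measured vocabulary size

for length in range(1, 6):
    print(f"length {length}: {count_sequences(VOCAB_SIZE, length):,} possible sequences")
```

With these assumptions, a five-word sequence already has 10^20 possible forms, far more than any corpus could cover, which is why counting every combination directly does not scale.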
To a certain degree, this complexity is what has made the neural network approach (Recurrent Neural Networks in particular) to building language models so successful recently. Not only do neural networks help achieve better performance, but they also require fewer resources to build a decent language model. This is achieved by using special types of neural networks designed to learn sequential patterns. Such networks can be unrolled over a data sequence of a given length while optimizing only a fixed number of parameters, and at the same time they take into account all the previous words that have appeared in the context.
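The idea of a fixed number of parameters reused across a sequence can be sketched in a few lines. This is a minimal, untrained recurrent step in plain Python, assuming a simple tanh recurrence and one-hot word vectors; it leaves out the output layer and the training loop that a real language model needs. All names (`init_params`, `rnn_step`, `run_rnn`, `count_params`) are hypothetical.

```python
import math
import random


def init_params(input_size, hidden_size, seed=0):
    """Create small random weights; untrained, for illustration only."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_xh": matrix(hidden_size, input_size),   # input -> hidden
        "W_hh": matrix(hidden_size, hidden_size),  # hidden -> hidden (the recurrence)
        "b_h": [0.0] * hidden_size,                # hidden bias
    }


def count_params(params):
    """Total number of scalar weights across all matrices and vectors."""
    total = 0
    for value in params.values():
        if value and isinstance(value[0], list):
            total += sum(len(row) for row in value)
        else:
            total += len(value)
    return total


def rnn_step(params, h_prev, x):
    """One step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    h_new = []
    for i in range(len(h_prev)):
        total = params["b_h"][i]
        total += sum(w * xj for w, xj in zip(params["W_xh"][i], x))
        total += sum(w * hj for w, hj in zip(params["W_hh"][i], h_prev))
        h_new.append(math.tanh(total))
    return h_new


def run_rnn(params, inputs):
    """Unroll the same step over every input; the hidden state carries the context."""
    h = [0.0] * len(params["b_h"])
    for x in inputs:
        h = rnn_step(params, h, x)
    return h


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


vocab_size, hidden_size = 5, 4
params = init_params(vocab_size, hidden_size)

# Two sequences of word indices with very different lengths.
short_seq = [one_hot(i % vocab_size, vocab_size) for i in range(3)]
long_seq = [one_hot(i % vocab_size, vocab_size) for i in range(50)]

h_short = run_rnn(params, short_seq)
h_long = run_rnn(params, long_seq)
print("parameters:", count_params(params))
print("hidden state sizes:", len(h_short), len(h_long))
```

The same 40 weights process both the 3-word and the 50-word sequence, and in each case the final hidden state has the same size. Only the number of times the step is applied changes with the length of the input.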
This approach has become a backbone of the state-of-the-art language models developed recently. Hopefully, researchers and the wider AI community will find ways to improve it even further in the coming years.
Ildar Abdrashitov, Business Intelligence Analyst
Missing Link Technologies