Natural Language Processing has received a lot of attention recently due to its ability to automate conversational and language-related tasks in social media, e-commerce, and customer service. Although complex statistical language processing and modeling have existed for decades, the performance of NLP systems has improved significantly only relatively recently. A number of factors play a part in this improvement, and probably one of the most significant is the invention of new approaches for generating word representations. Among these new methods, Word2Vec is the most well-known.
Before Word2Vec became the default choice for generating word vectors (representations), words were commonly represented as one-hot encoded vectors. For the text sample shown below, each word's representation would look like this:
Word representations (vectors)
As you can see from the example above, each word is represented as a sparse vector (array) in which every element is 0 except one. The index of the only non-zero element points to a certain word in the text's vocabulary (the set of unique words present in the text).
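A minimal sketch of this encoding (the sample sentence and variable names are illustrative, not taken from the article):

```python
# Build one-hot vectors for a tiny example text.
text = "the quick brown fox jumps over the lazy dog"
vocab = sorted(set(text.split()))               # vocabulary: unique words in the text
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # All zeros except a single 1 at the word's vocabulary index.
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

print(one_hot("fox"))  # a sparse vector with exactly one non-zero element
```

Note that the vector length equals the vocabulary size, which is exactly the scaling problem described next.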
This method seems pretty straightforward; however, it has a couple of important disadvantages:
Vector size depends on the vocabulary size. This means that for a large text corpus with a large vocabulary, the vectors will also be very large (thousands or even millions of elements each), which makes using such vectors in further processing very computationally expensive
Words are represented as discrete symbols and have no notion of similarity. Imagine two words in the text:
“BEAUTIFUL” <-vector representation-> [0,0,0,1,0,0]
“MAGNIFICENT” <-vector representation-> [1,0,0,0,0,0]
Even though ‘Beautiful’ and ‘Magnificent’ are synonyms in natural language and mean the same thing, their vector representations are completely orthogonal and have nothing in common.
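The orthogonality is easy to verify: the dot product, a standard similarity measure between vectors, is zero for any two distinct one-hot vectors, using the two representations shown above:

```python
# One-hot vectors for the two synonyms from the example above.
beautiful   = [0, 0, 0, 1, 0, 0]
magnificent = [1, 0, 0, 0, 0, 0]

# Dot product as a similarity score: for distinct one-hot vectors it is always 0,
# so the encoding sees no relationship at all between the synonyms.
dot = sum(a * b for a, b in zip(beautiful, magnificent))
print(dot)  # 0
```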
Word2Vec, originally developed by Google researchers, addresses the disadvantages of one-hot encoded representations by fixing the size of word vectors (e.g., a corpus with a vocabulary of 1 million unique words can be represented by vectors of a much smaller, fixed size) and by incorporating each word's context into its representation, so that vectors of similar words (words that most frequently appear in the same contexts) are located closer to each other in the vector space. Without digging into the nitty-gritty details, let's take a high-level look at how the Word2Vec algorithm works:
Text sample (Wikipedia)
The algorithm (Word2Vec skip-gram) iteratively scans the entire text corpus word by word, from beginning to end. At every iteration, the scanned word is treated as the central word (marked in red in the example above), and the words surrounding it are defined as its context words (marked in green). Before scanning begins, the algorithm randomly initializes vectors for all words in the corpus vocabulary. As it moves through the corpus, these vectors are updated and optimized so as to maximize the probability of correctly predicting the context words given each central word. When the algorithm finishes, the resulting word vectors incorporate context information: words that appear in similar contexts end up with similar vectors, located close to each other in the vector space.
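The steps above can be sketched in a few dozen lines. This is a toy illustration only: the corpus, window size, dimensionality, and learning rate are arbitrary assumptions, and it uses a full softmax over the vocabulary, whereas real Word2Vec implementations use negative sampling or hierarchical softmax for speed:

```python
import numpy as np

# Toy corpus; every word will in turn be the "central" word.
corpus = "the king rules the kingdom while the queen rules the castle".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, dim, window, lr = len(vocab), 8, 2, 0.05   # assumed hyperparameters

rng = np.random.default_rng(0)
W_in  = rng.normal(scale=0.1, size=(V, dim))  # central-word vectors, randomly initialized
W_out = rng.normal(scale=0.1, size=(V, dim))  # context-word vectors

def pairs():
    # Scan the corpus word by word; words within `window` positions
    # on either side of the central word are its context words.
    for pos, center in enumerate(corpus):
        for off in range(-window, window + 1):
            ctx = pos + off
            if off != 0 and 0 <= ctx < len(corpus):
                yield idx[center], idx[corpus[ctx]]

for epoch in range(200):
    for c, o in pairs():
        v = W_in[c]
        scores = W_out @ v
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                  # P(context word | central word), softmax
        grad = probs.copy()
        grad[o] -= 1.0                        # gradient of -log P(o | c)
        W_in[c] -= lr * (W_out.T @ grad)      # nudge vectors to raise that probability
        W_out   -= lr * np.outer(grad, v)
```

After training, words sharing contexts (here "king" and "queen" both appear inside "the … rules the …") tend to end up with nearby vectors, which is the behavior the paragraph above describes.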
2D projections of Word2Vec vectors for words with similar contexts
Ildar Abdrashitov, Business Intelligence Analyst, Missing Link Technologies