
Language Vectorized

  • Writer: MLT
  • May 16, 2019
  • 3 min read

Natural Language Processing (NLP) has attracted a lot of attention recently because of its ability to automate conversational and language-related tasks in social media, e-commerce and customer service. Although complex statistical language processing and modeling have been around for decades, the performance of NLP systems has improved significantly only relatively recently. A number of factors contributed to this improvement, and probably one of the most significant is the invention of new approaches to generating word representations. Among these new methods, Word2Vec is considered the best known.


Before Word2Vec became a default choice for generating word vectors (representations), words were commonly represented as one-hot encoded vectors. For the text sample shown below, each word's representation would look like this:



Word representations (vectors)


As you can see from the example above, each word is represented as a sparse vector (array) in which every element is 0 except one. The index of the single non-zero element identifies the word in the text vocabulary (the set of unique words present in the text).
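One-hot encoding as described here can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the original figure: the sample sentence and the helper names `build_vocabulary` and `one_hot` are assumptions made for this example.

```python
def build_vocabulary(tokens):
    """Return the unique words of the text in order of first appearance."""
    vocabulary = []
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)
    return vocabulary


def one_hot(word, vocabulary):
    """Return a list of zeros with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector


# Hypothetical sample sentence (not the text used in the original figure).
tokens = "the quick brown fox jumps over the lazy dog".split()
vocabulary = build_vocabulary(tokens)

# Every vector has exactly as many elements as there are unique words.
for word in vocabulary:
    print(f"{word:>6} -> {one_hot(word, vocabulary)}")
```

Note that the length of each vector equals the vocabulary size, so adding new unique words to the text makes every vector longer.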


This method is straightforward; however, it has a couple of important disadvantages:

  • Vector size depends on vocabulary size. For a large text corpus with a large vocabulary, the vectors will also be very large (thousands or even millions of elements each), which makes using them in further processing computationally expensive.

  • Words are represented as discrete symbols with no notion of similarity. Imagine we have two words in the text:

“BEAUTIFUL” <-vector representation-> [0,0,0,1,0,0]

“MAGNIFICENT” <-vector representation-> [1,0,0,0,0,0]


Even though the words ‘Beautiful’ and ‘Magnificent’ are synonyms in natural language and mean nearly the same thing, their vector representations are orthogonal and have nothing in common.
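The orthogonality can be checked numerically. The sketch below uses cosine similarity, a common way to compare word vectors (the article itself does not name a specific measure, so this choice is an assumption), on the two vectors from the example above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# One-hot vectors taken from the example in the text.
beautiful = [0, 0, 0, 1, 0, 0]
magnificent = [1, 0, 0, 0, 0, 0]

print(cosine_similarity(beautiful, magnificent))  # 0.0: no similarity at all
print(cosine_similarity(beautiful, beautiful))    # 1.0: a word only matches itself
```

Any two different one-hot vectors give 0, so this representation cannot tell a synonym from an unrelated word.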


Word2Vec, originally developed by Google researchers, addresses the disadvantages of one-hot encoding in two ways. First, it fixes the size of the word vectors (e.g. a text corpus with a vocabulary of 1 million unique words can be represented by vectors of a much smaller, fixed size). Second, it incorporates the context of each word into its representation, so that vectors of similar words (words that most frequently appear in the same contexts) are located close to each other in the vector space. Without digging into the nitty-gritty details, let’s have a high-level look at how the Word2Vec algorithm works:



Text sample (Wikipedia)


The algorithm (Word2Vec Skip-gram) iteratively scans the entire text corpus word by word, from the beginning of the text to its end. At every iteration the scanned word is treated as the central word (marked in red in the example above), and its context words are defined (marked in green in the example above). As you can see from the example, context words are the words that surround the central word. When the algorithm starts scanning the corpus, it randomly initializes vectors for all words in the corpus vocabulary. As it moves through the corpus, these vectors are updated so that the probability of correctly predicting the context words given the central word is maximized. When the algorithm is done, the resulting word vectors incorporate context information: words that are contextually close to each other have similar vectors that lie close to each other in the vector space.
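The scanning and updating steps can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the original Word2Vec implementation: it uses a full softmax over the vocabulary (practical implementations typically rely on approximations such as negative sampling), and the sample sentence, the function names and all parameter values (`window`, `dim`, `epochs`, `lr`) are hypothetical choices for this example.

```python
import math
import random


def skipgram_pairs(tokens, window=2):
    """Return (central word, context word) pairs for every position in the text."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(tokens, dim=8, window=2, epochs=50, lr=0.05, seed=0):
    """Train skip-gram vectors with a full softmax and plain SGD."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    # Random initialization: one "central" and one "context" vector per word.
    center_vecs = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
    context_vecs = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
    pairs = [(index[c], index[o]) for c, o in skipgram_pairs(tokens, window)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for c, o in pairs:
            v = center_vecs[c]
            # Probability of each vocabulary word being a context word of c.
            scores = [sum(a * b for a, b in zip(v, u)) for u in context_vecs]
            probs = softmax(scores)
            total += -math.log(probs[o])
            # Gradient step that raises the probability of the true context word.
            grad_v = [0.0] * dim
            for k, u in enumerate(context_vecs):
                err = probs[k] - (1.0 if k == o else 0.0)
                for d in range(dim):
                    grad_v[d] += err * u[d]
                    u[d] -= lr * err * v[d]
            for d in range(dim):
                v[d] -= lr * grad_v[d]
        losses.append(total / len(pairs))
    return vocab, center_vecs, losses


# Hypothetical sample sentence (not the Wikipedia sample from the figure).
tokens = "the quick brown fox jumps over the lazy dog".split()
vocab, vectors, losses = train(tokens)
print(f"average loss: first epoch {losses[0]:.3f}, last epoch {losses[-1]:.3f}")
```

Each vector here has `dim` elements regardless of vocabulary size, and the average loss (the negative log-probability of the true context words) should fall as training proceeds. A corpus this small is far too little data to produce meaningful similarities; it only shows the mechanics.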



2D projections of Word2Vec vectors for words that appear in similar contexts


Ildar Abdrashitov, Business Intelligence Analyst, Missing Link Technologies
