Word2vec with gensim – a simple word embedding example

In this short article, we show a simple example of how to use Gensim and word2vec for word embedding.

Word2vec

Word2vec is a famous algorithm for natural language processing (NLP) created by Tomas Mikolov's team at Google. It is a group of related models used to produce word embeddings, namely CBOW and skip-gram. The models are considered shallow: they are two-layer neural networks trained to reconstruct the linguistic contexts of words. It was a small step in machine learning but made a huge impact in NLP compared to earlier algorithms, such as latent semantic analysis.

(Figure: the word2vec skip-gram model)

While skip-gram predicts context words from an input target word, CBOW is the reverse model: it predicts the target word from its context words.

The Gensim library

Gensim is an open-source Python library for natural language processing, developed and maintained by the Czech natural language processing researcher Radim Řehůřek. The Gensim library enables us to develop word embeddings by training our own word2vec models on a custom corpus with either the CBOW or skip-gram algorithm.

To install and import gensim:
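A minimal sketch, assuming a Jupyter/Colab cell (hence the leading “!”) and the gensim 3.x package:

    # Install gensim (run once in a notebook cell)
    !pip install gensim

    # Imports used throughout this example
    import gensim
    from gensim.models import Word2Vec
    from gensim.utils import simple_preprocess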

The word embedding example

In this example, I use a text file downloaded from norvig.com and let the word2vec library train its models. After that, we will look at some results. In practice, to get better results, you will need much bigger data and some hyperparameter tuning.

The code is written for a Jupyter notebook in Google Colab using Python 2.x, as there is a bug with this example under Python 3.x.

Download a big text file

This text file from norvig.com is 6MB, which is still small in terms of big data, but it should be enough for demo purposes. We download the file using requests and save it to the local drive as “big.txt”. Alternatively, you can use “from gensim.test.utils import common_texts” for a small sample corpus.
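A minimal sketch of the download step; the exact file location on norvig.com is an assumption:

    import requests

    # Fetch the ~6MB corpus and save it locally as "big.txt"
    response = requests.get("https://norvig.com/big.txt")
    with open("big.txt", "wb") as f:
        f.write(response.content)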

Read text file

We then read the file and output the first 3 lines:
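A minimal sketch, assuming the “big.txt” saved above:

    # Read the corpus as a list of raw lines
    with open("big.txt") as f:
        lines = f.readlines()

    # Show the first 3 raw lines
    for line in lines[:3]:
        print(line)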

Pre-Process words

As can be seen, the words in “lines” are raw. We can do simple processing on the list using “gensim.utils.simple_preprocess”. This converts each fetched line into a list of lowercase tokens.
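A minimal sketch; tokenising per line, so that word_list is a list of token lists (the format Word2Vec expects), is an assumption about the original code:

    from gensim.utils import simple_preprocess

    # Tokenise each raw line into a list of lowercase tokens
    word_list = [simple_preprocess(line) for line in lines]
    print(len(word_list))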

So we have 100k+ words in word_list.

Create a Word2Vec model

The hyperparameters of this model (a training sketch follows the list):

  • size: The number of dimensions of the embeddings; the default is 100.
  • window: The maximum distance between a target word and the words around it; the default is 5.
  • min_count: The minimum count of words to consider when training the model; words occurring fewer times than this are ignored. The default is 5.
  • workers: The number of worker threads used during training; the default is 3.
  • sg: The training algorithm, either CBOW (0) or skip-gram (1). The default is CBOW.
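A minimal training sketch; the values below are assumptions, except size=32, which matches the vector length reported later. Note that gensim 4.x renamed the size parameter to vector_size:

    from gensim.models import Word2Vec

    # Train a word2vec model on the tokenised lines
    model = Word2Vec(
        word_list,    # list of token lists built above
        size=32,      # embedding dimensionality (vector_size in gensim 4.x)
        window=5,     # context window
        min_count=1,  # keep even rare words for this small demo
        workers=3,    # worker threads
        sg=1,         # 1 = skip-gram, 0 = CBOW (skip-gram assumed here)
    )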

If logging is enabled, gensim reports progress during the training process.

Check embedded words

If there is no error, you should be able to query the vector representing a word. Each vector has length 32.
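For example (the query word is illustrative):

    # Query the embedding for a word; the result is a 32-dimensional vector
    vector = model.wv["she"]
    print(vector.shape)  # (32,)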

Check similarity

We can now find the words that are most similar to a given word.
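For example (the query word is illustrative):

    # Words closest to "she" in the embedding space
    print(model.wv.most_similar("she", topn=5))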

We can also check the similarity between a word pair. The examples below make sense: “she” is close to “mrs” while not really relevant to “gun”. This is what the model learned from the corpus.
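A sketch of the pairwise check, using the word pairs mentioned above:

    # Cosine similarity between word pairs
    print(model.wv.similarity("she", "mrs"))  # relatively high
    print(model.wv.similarity("she", "gun"))  # relatively low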

Predict output word

Word2vec can also predict a list of likely output words for given context words. The current gensim implementation exposes this via predict_output_word, which only works for models trained with negative sampling (the default).
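A minimal sketch; the context words are illustrative:

    # Most probable output words for the given context words
    print(model.predict_output_word(["she", "said"], topn=5))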

Conclusion

We have shown a simple example of how to use gensim's word2vec library. It is a powerful tool not only for NLP but also for other applications, such as search or recommender systems. We will cover those topics in a future post, possibly with a new implementation in TensorFlow 2.0.
