A.I, Data and Software Engineering

Word2vec with gensim – a simple word embedding example


In this short article, we show a simple example of how to use Gensim and word2vec for word embedding.

Word2vec

Word2vec is a famous algorithm for natural language processing (NLP) created by a team of researchers led by Tomas Mikolov. It is a group of related models used to produce word embeddings, namely CBOW (continuous bag-of-words) and skip-gram. The models are considered shallow: they are two-layer neural networks trained to reconstruct the linguistic contexts of words. It was a small step in machine learning but made a huge impact on NLP compared to earlier algorithms, such as latent semantic analysis.

Figure: the word2vec skip-gram model.

While skip-gram predicts context words from a given input (target) word, CBOW is the reverse: it predicts the target word from its context words.
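
To make the difference concrete, here is a toy sketch (purely illustrative, not gensim code) that builds the training pairs each model sees for a window of size 2:

# Toy illustration of the training pairs used by each model (window = 2).
sentence = "the quick brown fox jumps".split()
window = 2
for i, target in enumerate(sentence):
    # context words within the window, excluding the target itself
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    skipgram_pairs = [(target, c) for c in context]  # skip-gram: target -> each context word
    cbow_pair = (context, target)                    # CBOW: all context words -> target
    print(target, skipgram_pairs, cbow_pair)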

The Gensim library

Gensim is an open-source Python library for natural language processing, developed and maintained by the Czech NLP researcher Radim Řehůřek. It lets us build word embeddings by training our own word2vec models on a custom corpus with either the CBOW or the skip-gram algorithm.

To install and import gensim:

!pip install gensim
import gensim 
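
Optionally, enable Python's standard logging so gensim reports training progress at INFO level (this is how the EPOCH lines shown later in this post are produced):

import logging
# gensim logs epochs, effective words and words/s at INFO level
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)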

The word embedding example

In this example, we use a text file downloaded from norvig.com and let gensim train a word2vec model on it. After that, we will look at some results. In practice, to get better results, you will need much more data and will also have to tweak some hyperparameters.

The code below is written for a Jupyter notebook in Google Colab and targets Python 3.

Download a big text file

This text file from norvig.com is about 6 MB, which is still small in terms of big data, but it should be enough for demo purposes. We download the file using requests and save it to the local drive as "big.txt". Alternatively, you can use "from gensim.test.utils import common_texts" to get a built-in sample corpus instead.

import requests

textURL = 'https://norvig.com/big.txt'
# download the file and save it locally as big.txt
r = requests.get(textURL)
with open("big.txt", "wb") as f:
    f.write(r.content)
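
If you do not want to download anything, the common_texts corpus mentioned above is already a list of tokenised sentences and can be fed straight into Word2Vec; a minimal sketch (gensim 3.x argument names, as used in the rest of this post):

from gensim.test.utils import common_texts
from gensim.models import Word2Vec
# common_texts is a tiny built-in sample corpus: a list of token lists
tiny_model = Word2Vec(common_texts, size=32, min_count=1)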

Read text file

text_file = "big.txt"
# read all lines at once
with open(text_file, 'rb') as f:
    lines = f.readlines()
print(lines[:3])

Output the first 3 lines:

['The Project Gutenberg EBook of The Adventures of Sherlock Holmes\n', 'by Sir Arthur Conan Doyle\n', '(#15 in our series by Sir Arthur Conan Doyle)\n']

Pre-Process words

As can be seen, the text in "lines" is raw. We can do simple processing on the list with gensim.utils.simple_preprocess, which converts each line into a list of lowercase tokens.

# process sentences to tokens
processedLines = [gensim.utils.simple_preprocess(sentence) for sentence in lines]
# flatten the token lists into one long word list
word_list = [word for words in processedLines for word in words]
# check the length of the list
print('Length: ', len(word_list))
# check first five words
word_list[:5]

Output:

Length: 1064778
['the', 'project', 'gutenberg', 'ebook', 'of']

So we have over a million words in word_list.

Create a Word2Vec model

The main hyperparameters of this model (a sketch that sets each of them explicitly follows the list):

  • size: The number of dimensions of the embedding vectors; the default is 100.
  • window: The maximum distance between a target word and the words around it; the default is 5.
  • min_count: The minimum count of words to consider when training the model; words occurring fewer times than this are ignored. The default is 5.
  • workers: The number of worker threads used to train the model; the default is 3.
  • sg: The training algorithm, either CBOW (0) or skip-gram (1); the default is CBOW.
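
For reference, here is an illustrative sketch that sets every hyperparameter from the list explicitly with the default values above (gensim 3.x argument names; not actually run in this post):

reference_model = gensim.models.Word2Vec(
    [word_list],
    size=100,     # dimensionality of the word vectors
    window=5,     # max distance between target and context words
    min_count=5,  # ignore words occurring fewer than 5 times
    workers=3,    # number of worker threads
    sg=0          # 0 = CBOW, 1 = skip-gram
    )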
For this post, we train with a smaller set of arguments and rely on the defaults for everything else:

model = gensim.models.Word2Vec(
    [word_list],     # a single "sentence" containing the whole corpus
    negative=10,     # number of negative samples
    iter=50,         # number of training epochs
    min_count=1,     # keep every word
    size=32          # dimension of the word vectors
    )

The output of the training process:

...
2019-3-11 10:30:34,757 : INFO : EPOCH - 50 : training on 1064778 raw words (10000 effective words) took 0.0s, 260315 effective words/s
2019-3-11 10:30:34,758 : INFO : training on a 53238900 raw words (500000 effective words) took 2.2s, 228090 effective words/s

Check embedded words

If there is no error, you should be able to query the vector that represents a word. Each vector has a length of 32, as specified by size.

model.wv['uncourteous']
#output vector for 'uncourteous'
array([ 7.1157895e-02,  3.0961020e-02, -8.3555140e-02,  1.4138509e-02,
       -6.7576997e-02,  1.2850691e-01, -1.1335255e-01, -1.0118033e-01,
       -9.3394592e-03,  1.1703445e-01, -1.2325902e-01, -4.7652371e-02,
       -3.4051174e-01,  3.2857012e-02,  4.3575674e-02,  6.2586945e-03,
       -1.0634376e-02, -2.9684594e-04, -1.7043892e-02, -1.7614692e-02,
       -1.3355084e-01, -1.0865550e-01,  1.4331852e-01,  2.4548984e-01,
        7.0743695e-02, -7.7710636e-03, -3.7634671e-02, -8.4391050e-02,
       -4.7245864e-02, -6.7963079e-02, -8.1062794e-02,  1.2597212e-01],
      dtype=float32)
model.wv['she']
#output vector for 'she'
array([ 2.2985563e+00,  1.5476345e+00, -1.6383197e+00,  1.0864263e-02,
       -9.1463244e-01,  1.8228959e+00, -3.5029898e+00, -2.7226317e+00,
        5.6958675e-01,  3.1518939e+00, -2.0932381e+00, -1.8218745e+00,
       -5.9538774e+00,  1.3396950e+00,  1.0417372e+00,  5.9076434e-01,
        8.6964197e-02,  3.9082132e-03, -8.0454630e-01,  1.0120409e+00,
       -2.6888485e+00, -3.3639653e+00,  1.9557896e+00,  4.9678087e+00,
        8.7341714e-01, -3.4626475e-01, -9.6429044e-01, -1.2232714e+00,
       -4.3153864e-01, -5.5255091e-01, -1.4339647e+00,  1.6336327e+00],
      dtype=float32)

Check similarity

We can now find the words most similar to a given word.

model.wv.most_similar(positive="she")[:3]
#Output the first 3 words found similar to "she"
[('not', 0.9891390800476074),
 ('mrs', 0.9884595274925232),
 ('chances', 0.9869956970214844)]
model.wv.most_similar(positive="think")[:5]
#output first 5 words found similar to "think"
[('remarked', 0.9967506527900696),
 ('sorry', 0.9963732957839966),
 ('that', 0.9953688979148865),
 ('alone', 0.9952799081802368),
 ('accent', 0.9949144721031189)]
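
most_similar also accepts a negative list, which is how the classic word-analogy arithmetic (e.g. "king" - "man" + "woman") is expressed. On a corpus this small the result is unlikely to be as clean as "queen", so treat this as a sketch only:

# vector("king") - vector("man") + vector("woman")
model.wv.most_similar(positive=["king", "woman"], negative=["man"])[:3]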

We can also check the similarity between a pair of words. The examples below make sense: "she" is close to "mrs" but not particularly related to "gun". This is what the model learned from the corpus.

model.wv.similarity("mrs", "she")
0.9884596
model.wv.similarity("she", "gun")
0.18504377
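
Under the hood, wv.similarity is simply the cosine similarity between the two word vectors, which you can verify by hand:

import numpy as np

def cosine(u, v):
    # cosine similarity: dot product divided by the product of the vector norms
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(model.wv['mrs'], model.wv['she']))  # should match model.wv.similarity("mrs", "she")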

Predict output words

Word2vec can also predict a list of likely output words for a given set of context words, using the current model's predict_output_word method.

model.predict_output_word(["with", "hercules", "limbs"])  # tokens are lowercase in our vocabulary
#output
[('loathed', 0.003917875),
 ('moustached', 0.0032442596),
 ('rocket', 0.00320722),
 ('settee', 0.0031377329),
 ('recommence', 0.0025945716),
 ('irene', 0.0025766355),
 ('headed', 0.0025396906),
 ('darlington', 0.002532876),
 ('wallenstein', 0.0022341162),
 ('refreshingly', 0.0021978107)]
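
To reuse the trained embeddings later, for example in a search or recommendation pipeline, the model can be saved to disk and loaded back; a minimal sketch with an arbitrary file name:

# persist the trained model and load it again later
model.save("word2vec_big.model")
reloaded = gensim.models.Word2Vec.load("word2vec_big.model")
print(reloaded.wv.similarity("mrs", "she"))  # same result as before saving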

Conclusion

We have shown a simple example of how to use the word2vec implementation in gensim. It is a powerful tool not only for NLP but also for other applications, such as search and recommender systems. We will cover those topics in a future post, possibly with a new implementation in TensorFlow 2.0.

