In this short article, we show a simple example of how to use Gensim and word2vec for word embedding.
Word2vec is a well-known algorithm for natural language processing (NLP), created by a team of researchers led by Tomas Mikolov. It is a group of related models used to produce word embeddings, namely CBOW (continuous bag-of-words) and skip-gram. The models are considered shallow: two-layer neural networks trained to reconstruct the linguistic contexts of words. It was a small step in machine learning but made a huge impact in NLP compared to earlier algorithms, such as latent semantic analysis.
While the skip-gram model predicts context words from a given input word, CBOW is the reverse: it predicts a target word from its surrounding context words.
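To make the difference concrete, here is a minimal sketch (plain Python, not part of Gensim) of the (target, context) windows both models train on, with a window of size 2. Skip-gram trains on target-to-context pairs; CBOW trains on context-to-target with the same windows:

```python
def training_pairs(tokens, window=2):
    """For each position, pair the target word with its context words."""
    pairs = []
    for i, target in enumerate(tokens):
        # words up to `window` positions to the left and right of the target
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((target, context))
    return pairs

sentence = ["the", "quick", "brown", "fox"]
for target, context in training_pairs(sentence):
    print(target, context)
# the ['quick', 'brown']
# quick ['the', 'brown', 'fox']
# brown ['the', 'quick', 'fox']
# fox ['quick', 'brown']
```

Skip-gram would learn from pairs like ("quick", "the"), ("quick", "brown"), while CBOW would learn to predict "quick" from ["the", "brown", "fox"].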
The Gensim library
Gensim is an open-source Python library for natural language processing, developed and maintained by the Czech NLP researcher Radim Řehůřek. The Gensim library enables us to develop word embeddings by training our own word2vec models on a custom corpus, with either the CBOW or skip-gram algorithm.
To install and import gensim:
!pip install gensim
import gensim
The word embedding example
In this example, I use a text file downloaded from norvig.com and let word2vec train its model on it. After that, we will look at some results. In practice, to get better results, you will need much bigger data and will also have to tweak some hyperparameters.
The code is written for a Jupyter notebook in Google Colab and runs on Python 3.
Download a big text file
This text file from norvig.com is 6 MB, which is still small in terms of big data, but it should be enough for demo purposes. We download the file using requests and save it to the local drive under the name "big.txt". Alternatively, you can use "from gensim.test.utils import common_texts" for a small built-in sample corpus.
import requests

textURL = 'https://norvig.com/big.txt'
r = requests.get(textURL)
with open("big.txt", "wb") as f:
    f.write(r.content)
Read text file
text_file = "big.txt"
# read all lines at once (text mode, so we get strings rather than bytes)
with open(text_file, encoding='utf-8') as f:
    lines = f.readlines()
print(lines[:3])
Output the first 3 lines:
['The Project Gutenberg EBook of The Adventures of Sherlock Holmes\n', 'by Sir Arthur Conan Doyle\n', '(#15 in our series by Sir Arthur Conan Doyle)\n']
# process each line into a list of lowercase tokens
processedLines = [gensim.utils.simple_preprocess(line) for line in lines]
# flatten the token lists into a single word list
word_list = [word for words in processedLines for word in words]
# check the length of the list
print('Length:', len(word_list))
# check the first five words
word_list[:5]
Length: 1064778 ['the', 'project', 'gutenberg', 'ebook', 'of']
So we have about a million words in word_list.
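For reference, simple_preprocess lowercases the text, splits it into tokens, and drops tokens shorter than 2 or longer than 15 characters. A rough pure-Python equivalent (a sketch for intuition, not Gensim's actual implementation, which also handles Unicode and accent stripping):

```python
import re

def simple_preprocess_sketch(text, min_len=2, max_len=15):
    """Lowercase, tokenize on alphabetic runs, and filter by token length."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

simple_preprocess_sketch("The Project Gutenberg EBook of Sherlock Holmes!")
# → ['the', 'project', 'gutenberg', 'ebook', 'of', 'sherlock', 'holmes']
```

Note that punctuation and one-letter tokens are discarded, which is why the word list is cleaner than the raw text.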
Create a Word2Vec model
The hyperparameters of this model:
- vector_size: The number of dimensions of the embeddings; the default is 100. (This parameter was called size in Gensim 3.x.)
- window: The maximum distance between a target word and the words around it. The default window is 5.
- min_count: The minimum count of words to consider when training the model; words occurring fewer times than this are ignored. The default min_count is 5.
- workers: The number of worker threads used to train the model. The default is 3.
- sg: The training algorithm, either CBOW (0) or skip-gram (1). The default training algorithm is CBOW.
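The min_count filter in particular matters on small corpora: rare words have too few training examples to learn good vectors. A hedged sketch of what the filter does (Gensim's actual vocabulary builder is more involved):

```python
from collections import Counter

def filter_vocab(word_list, min_count=5):
    """Keep only words that occur at least min_count times."""
    counts = Counter(word_list)
    return {w for w, c in counts.items() if c >= min_count}

words = ["to", "be", "or", "not", "to", "be"]
filter_vocab(words, min_count=2)  # → {'to', 'be'}
```

In the training call below we set min_count=1 so that even rare words like "uncourteous" stay in the vocabulary; on a larger corpus you would normally keep the default or raise it.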
model = gensim.models.Word2Vec(
    [word_list],      # the corpus: here, one long tokenized "sentence"
    negative=10,      # number of negative samples
    epochs=50,        # called iter in Gensim 3.x
    min_count=1,
    vector_size=32,   # dimension of the word vector; called size in Gensim 3.x
)
The output of the training process:
...
2019-3-11 10:30:34,757 : INFO : EPOCH - 50 : training on 1064778 raw words (10000 effective words) took 0.0s, 260315 effective words/s
2019-3-11 10:30:34,758 : INFO : training on 53238900 raw words (500000 effective words) took 2.2s, 228090 effective words/s
Check embedded words
If there is no error, you should be able to query the vector representing a word. Each vector has 32 dimensions.
model.wv['uncourteous'] # output the vector for 'uncourteous'
array([ 7.1157895e-02,  3.0961020e-02, -8.3555140e-02,  1.4138509e-02,
       -6.7576997e-02,  1.2850691e-01, -1.1335255e-01, -1.0118033e-01,
       -9.3394592e-03,  1.1703445e-01, -1.2325902e-01, -4.7652371e-02,
       -3.4051174e-01,  3.2857012e-02,  4.3575674e-02,  6.2586945e-03,
       -1.0634376e-02, -2.9684594e-04, -1.7043892e-02, -1.7614692e-02,
       -1.3355084e-01, -1.0865550e-01,  1.4331852e-01,  2.4548984e-01,
        7.0743695e-02, -7.7710636e-03, -3.7634671e-02, -8.4391050e-02,
       -4.7245864e-02, -6.7963079e-02, -8.1062794e-02,  1.2597212e-01],
      dtype=float32)
model.wv['she'] # output the vector for 'she'
array([ 2.2985563e+00,  1.5476345e+00, -1.6383197e+00,  1.0864263e-02,
       -9.1463244e-01,  1.8228959e+00, -3.5029898e+00, -2.7226317e+00,
        5.6958675e-01,  3.1518939e+00, -2.0932381e+00, -1.8218745e+00,
       -5.9538774e+00,  1.3396950e+00,  1.0417372e+00,  5.9076434e-01,
        8.6964197e-02,  3.9082132e-03, -8.0454630e-01,  1.0120409e+00,
       -2.6888485e+00, -3.3639653e+00,  1.9557896e+00,  4.9678087e+00,
        8.7341714e-01, -3.4626475e-01, -9.6429044e-01, -1.2232714e+00,
       -4.3153864e-01, -5.5255091e-01, -1.4339647e+00,  1.6336327e+00],
      dtype=float32)
We now can find the words that are similar to a given word.
model.wv.most_similar(positive="she")[:3] # output the first 3 words found similar to "she"
[('not', 0.9891390800476074), ('mrs', 0.9884595274925232), ('chances', 0.9869956970214844)]
model.wv.most_similar(positive="think")[:5] # output the first 5 words found similar to "think"
[('remarked', 0.9967506527900696), ('sorry', 0.9963732957839966), ('that', 0.9953688979148865), ('alone', 0.9952799081802368), ('accent', 0.9949144721031189)]
We can also check the similarity between a pair of words. The examples below make sense: "she" is close to "mrs" but not really relevant to "gun". This is what the model learned from the corpus.
model.wv.similarity("mrs", "she")
0.9884596
model.wv.similarity("she", "gun")
0.18504377
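Under the hood, similarity and most_similar score word pairs by the cosine similarity of their vectors. A minimal NumPy sketch, where vec_a and vec_b stand for embedding vectors such as model.wv['she']:

```python
import numpy as np

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))

cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0]))  # → 1.0
cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # → 0.0
```

Because cosine similarity ignores vector length, two words can be highly similar even when their raw vector magnitudes differ a lot, as with 'uncourteous' and 'she' above.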
Predict output words
Word2vec can also predict a list of likely words for a given set of context words.
model.predict_output_word(["with", "Hercules", "limbs"])
# output
[('loathed', 0.003917875), ('moustached', 0.0032442596), ('rocket', 0.00320722), ('settee', 0.0031377329), ('recommence', 0.0025945716), ('irene', 0.0025766355), ('headed', 0.0025396906), ('darlington', 0.002532876), ('wallenstein', 0.0022341162), ('refreshingly', 0.0021978107)]
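Conceptually, predict_output_word combines the input vectors of the known context words and scores every vocabulary word against that combined vector through the output layer with a softmax. A toy NumPy sketch with a made-up 4-word vocabulary (the matrices here are random and illustrative, not the trained model's weights):

```python
import numpy as np

def predict_output_sketch(context_ids, W_in, W_out):
    """Mean of the context input vectors, then softmax over output scores."""
    h = W_in[context_ids].mean(axis=0)   # hidden layer: mean context vector
    scores = W_out @ h                   # one score per vocabulary word
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
W_in = rng.normal(size=(4, 8))   # toy input embeddings (vocab=4, dim=8)
W_out = rng.normal(size=(4, 8))  # toy output embeddings
probs = predict_output_sketch([0, 2], W_in, W_out)
probs  # one probability per vocabulary word; the softmax sums to 1
```

This is why predict_output_word returns probabilities over the whole vocabulary rather than a single answer.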
We have shown a simple example of how to use Gensim's word2vec implementation. It is a powerful tool not only for NLP but also for other applications, such as search and recommender systems. We will cover those topics in a future post, possibly with a new implementation in TensorFlow 2.0.