
Word2vec with TensorFlow 2.0 – a simple CBOW implementation


On the TensorFlow website, there is a good example of a word-embedding implementation with Keras. Nevertheless, we are curious to see what it looks like when implementing word2vec with pure TensorFlow 2.0.

What is CBOW

In the previous article, we introduced Word2vec (w2v) with the Gensim library. Word2vec is a two-layer neural network trained to reconstruct the linguistic contexts of words. The model is simple, yet it was a significant step forward for natural language processing (NLP) compared with earlier algorithms such as latent semantic analysis.

CBOW is short for continuous bag of words, one of the two w2v architectures. The CBOW model tries to predict the current target word (the centre word) from the source context words (the surrounding words). It first represents words as vectors, and we can decide the dimension of those vectors (the embedding dimension). For example, if we choose an embedding dimension of 2, the word “hello” can be represented as [0.213, 0.4543].

To train these representations, CBOW uses target words and context windows. Considering a simple sentence, “the quick brown fox jumps over the lazy dog”, we can form (context_window, target_word) pairs. With a context window of size 2, we get examples like ([quick, fox], brown), ([the, brown], quick), ([the, dog], lazy) and so on. The model then tries to predict the target_word from the context_window words.
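
To make the pairing concrete, here is a small, self-contained sketch (separate from the model built below) that generates such (context_window, target_word) pairs for the sample sentence; the variable names are only for illustration.

# Illustrative only: build (context_window, target_word) pairs
# with a window of 2 words on each side of the target.
sentence = "the quick brown fox jumps over the lazy dog".split()
WINDOW = 2
pairs = []
for i, target in enumerate(sentence):
    context = sentence[max(i - WINDOW, 0):i] + sentence[i + 1:i + 1 + WINDOW]
    pairs.append((context, target))
print(pairs[:3])
# [(['quick', 'brown'], 'the'),
#  (['the', 'brown', 'fox'], 'quick'),
#  (['the', 'quick', 'fox', 'jumps'], 'brown')]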

[Figure: CBOW model architecture]

Implement W2v CBOW with TF2

CBOW has an input layer, a hidden layer, and an output layer. The input layer is one-hot encoded. The weights and bias of the hidden layer form the word embedding. We then project from the embedding space back onto the vocabulary and predict the neighbouring word with a softmax over the output layer (a shape-level sketch follows the setup code below).

Enable TF 2.0.

from __future__ import absolute_import, division, print_function, unicode_literals
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf
import numpy as np
print(tf.__version__)
##Output
#TensorFlow 2.x selected.
#2.0.0-rc2
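
Before building the full model class, the following minimal sketch shows the forward pass and the tensor shapes involved; the vocabulary size V=7 and embedding dimension D=3 here are made-up toy values.

# Illustrative forward pass only; V and D are toy values.
V, D = 7, 3
x = tf.one_hot([2], depth=V)                        # one-hot input word, shape (1, V)
W1 = tf.Variable(tf.random.normal([V, D]))          # hidden-layer weights = embedding matrix
b1 = tf.Variable(tf.random.normal([D]))             # hidden-layer bias
W2 = tf.Variable(tf.random.normal([D, V]))          # output-layer weights
b2 = tf.Variable(tf.random.normal([V]))             # output-layer bias
hidden = tf.matmul(x, W1) + b1                      # shape (1, D): the word embedding
output = tf.nn.softmax(tf.matmul(hidden, W2) + b2)  # shape (1, V): probabilities over the vocabulary
print(hidden.shape, output.shape)                   # (1, 3) (1, 7)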

Build the w2v-cbow model

The model takes the vocabulary size, the embedding dimension, and an optimizer to train with. By default, we use stochastic gradient descent (SGD), but you can also try the Adam optimizer.

class Word2Vec:
  def __init__(self, vocab_size=0, embedding_dim=5, optimizer='sgd', epochs=10000):
    self.vocab_size = vocab_size
    self.embedding_dim = embedding_dim
    self.epochs = epochs
    if optimizer == 'adam':
      self.optimizer = tf.optimizers.Adam()
    else:
      self.optimizer = tf.optimizers.SGD(learning_rate=0.1)

The training method takes two inputs, x_train and y_train, which are one-hot vectors. We iteratively update the weights and biases with gradients recorded by tf.GradientTape().

  def train(self, x_train=None, y_train=None):
    # hidden layer: these weights and bias form the word embedding
    self.W1 = tf.Variable(tf.random.normal([self.vocab_size, self.embedding_dim]))
    self.b1 = tf.Variable(tf.random.normal([self.embedding_dim]))
    # output layer: projects the embedding back onto the vocabulary
    self.W2 = tf.Variable(tf.random.normal([self.embedding_dim, self.vocab_size]))
    self.b2 = tf.Variable(tf.random.normal([self.vocab_size]))
    for epoch in range(self.epochs):
      with tf.GradientTape() as t:
        hidden_layer = tf.add(tf.matmul(x_train, self.W1), self.b1)
        output_layer = tf.nn.softmax(tf.add(tf.matmul(hidden_layer, self.W2), self.b2))
        cross_entropy_loss = tf.reduce_mean(-tf.math.reduce_sum(y_train * tf.math.log(output_layer), axis=[1]))
      grads = t.gradient(cross_entropy_loss, [self.W1, self.b1, self.W2, self.b2])
      self.optimizer.apply_gradients(zip(grads, [self.W1, self.b1, self.W2, self.b2]))
      if epoch % 1000 == 0:
        print(cross_entropy_loss)
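
Note that taking tf.math.log of a softmax by hand can become numerically unstable when some probabilities get very close to zero. A common alternative (not what the code above uses) is to keep the raw output-layer scores as logits and let TensorFlow combine the softmax and the cross-entropy; the toy values below are only for illustration.

# Sketch: the manual loss versus TensorFlow's combined softmax + cross-entropy.
y_true = tf.constant([[0., 1., 0.]])      # a one-hot target
logits = tf.constant([[2.0, 0.5, -1.0]])  # raw output-layer scores (before softmax)
manual = tf.reduce_mean(-tf.reduce_sum(y_true * tf.math.log(tf.nn.softmax(logits)), axis=1))
stable = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_true, logits=logits))
print(manual.numpy(), stable.numpy())     # the two values match (up to floating-point error)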

Optionally, we can add a method to check what a word looks like after embedding.

  def vectorized(self, word_idx):
    # embedding of a word = its row of W1 shifted by the hidden-layer bias
    return (self.W1 + self.b1)[word_idx]

One-hot input

We will convert the corpus into one-hot representations before testing the model.

corpus_raw = 'He is the king . The king is royal . She is the royal  queen '

Let's do some simple pre-processing: convert to lower case and remove the periods (.). Gensim also provides a simple pre-processing function, as shown in the previous example.

# convert to lower case
corpus_raw = corpus_raw.lower()
# raw sentences is a list of sentences.
raw_sentences = corpus_raw.split('.')
sentences = []
for sentence in raw_sentences:
    sentences.append(sentence.split())
#sentences:
#[['he', 'is', 'the', 'king'], ['the', 'king', 'is', 'royal'], ['she', 'is', 'the', 'royal', 'queen']]

We set the window size to 2.

data = []
WINDOW_SIZE = 2
for sentence in sentences:
    for word_index, word in enumerate(sentence):
        for nb_word in sentence[max(word_index - WINDOW_SIZE, 0) : min(word_index + WINDOW_SIZE, len(sentence)) + 1] :
            if nb_word != word:
                data.append([word, nb_word])
#data:
#[['he', 'is'], ['he', 'the'], ['is', 'he'], ['is', 'the'], ['is', 'king'], ['the', 'he'], ['the', 'is'], ['the', 'king'], ['king', 'is'], ['king', 'the'], ['the', 'king'], ['the', 'is'], ['king', 'the'], ['king', 'is'], ['king', 'royal'], ['is', 'the'], ['is', 'king'], ['is', 'royal'], ['royal', 'king'], ['royal', 'is'], ['she', 'is'], ['she', 'the'], ['is', 'she'], ['is', 'the'], ['is', 'royal'], ['the', 'she'], ['the', 'is'], ['the', 'royal'], ['the', 'queen'], ['royal', 'is'], ['royal', 'the'], ['royal', 'queen'], ['queen', 'the'], ['queen', 'royal']]

For convenience, we create two helper dictionaries, word2int and int2word, which map each word to its integer index and back.

words = []
for word in corpus_raw.split():
    if word != '.': # because we don't want to treat . as a word
        words.append(word)
words = set(words) # so that all duplicate words are removed
word2int = {}
int2word = {}
vocab_size = len(words) # gives the total number of unique words
for i,word in enumerate(words):
    word2int[word] = i
    int2word[i] = word

Encode the word pairs into x_train and y_train.

# function to convert numbers to one hot vectors
def to_one_hot(data_point_index, vocab_size):
    temp = np.zeros(vocab_size)
    temp[data_point_index] = 1
    return temp
x_train = [] # input word
y_train = [] # output word
for data_word in data:
    x_train.append(to_one_hot(word2int[ data_word[0] ], vocab_size))
    y_train.append(to_one_hot(word2int[ data_word[1] ], vocab_size))
# convert them to numpy arrays
x_train = np.asarray(x_train, dtype='float32')
y_train = np.asarray(y_train, dtype='float32')
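
As a quick sanity check, with the toy corpus above there are 34 word pairs and 7 unique words, so both arrays should have shape (34, 7).

print(x_train.shape, y_train.shape)
# (34, 7) (34, 7)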

Train the model

w2v = Word2Vec(vocab_size=vocab_size, optimizer='adam', epochs=10000)
w2v.train(x_train, y_train)
#training process
tf.Tensor(2.8971386, shape=(), dtype=float32)
tf.Tensor(1.4061855, shape=(), dtype=float32)
tf.Tensor(1.3393705, shape=(), dtype=float32)
tf.Tensor(1.324885, shape=(), dtype=float32)
tf.Tensor(1.3221014, shape=(), dtype=float32)
tf.Tensor(1.3211844, shape=(), dtype=float32)
tf.Tensor(1.320798, shape=(), dtype=float32)
tf.Tensor(1.3206141, shape=(), dtype=float32)
tf.Tensor(1.3205199, shape=(), dtype=float32)
tf.Tensor(1.3204701, shape=(), dtype=float32)

Let's check what the word “queen” looks like in vector space:

w2v.vectorized(word2int['queen'])
<tf.Tensor: id=1920489, shape=(5,), dtype=float32, numpy=
array([-0.34213448,  0.83041203,  1.1423318 , -0.87035054,  2.8295236 ],
      dtype=float32)>
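
To compare words numerically rather than by eye, one could compute cosine similarities between the embedding vectors. The sketch below reuses the vectorized method defined above; the helper function cosine_similarity is ours, not part of the model.

# Sketch: cosine similarity between two embedded words.
def cosine_similarity(a, b):
    a, b = a.numpy(), b.numpy()
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

queen = w2v.vectorized(word2int['queen'])
royal = w2v.vectorized(word2int['royal'])
king  = w2v.vectorized(word2int['king'])
print(cosine_similarity(queen, royal), cosine_similarity(queen, king))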

Visualise words in 2d space

Transform to 2d space

from sklearn.manifold import TSNE
from sklearn import preprocessing
# embeddings of every word; row i is the vector of the word with index i
vectors = (w2v.W1 + w2v.b1).numpy()
model = TSNE(n_components=2, random_state=0)
# note: recent scikit-learn versions require perplexity < n_samples,
# e.g. TSNE(n_components=2, perplexity=5, random_state=0) for this tiny vocabulary
np.set_printoptions(suppress=True)
vectors = model.fit_transform(vectors)
normalizer = preprocessing.Normalizer(norm='l2')
vectors = normalizer.fit_transform(vectors)

Plot all words

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.set_xlim(left=-1, right=1)
ax.set_ylim(bottom=-1, top=1)
for word in words:
    print(word, vectors[word2int[word]][1])
    ax.annotate(word, (vectors[word2int[word]][0],vectors[word2int[word]][1] ))
plt.show()
she 0.03894128
is -0.53304255
queen -0.97676146
he -0.99200153
the 0.51811576
royal -0.7622982
king 0.93413407

[Figure: words in 2D space]

Conclusion

Yeah!

In this article, we showed a simple version of Word2vec in a pure TensorFlow 2.0 implementation. There is no more tf.Session involved; instead, the training step is wrapped in tf.GradientTape(). The result could be better with a larger corpus; nevertheless, it already shows an interesting pattern: “queen” ends up close to “royal” and far away from “king”.
