On the TensorFlow website, there is a good example of a word embedding implementation with Keras. Nevertheless, we are curious to see what it looks like when implementing word2vec with pure TensorFlow 2.0.
What is CBOW
In the previous article, we introduced Word2vec (w2v) with the Gensim library. Word2vec consists of a two-layer neural network trained to reconstruct the linguistic contexts of words. The model is simple, yet it proved more effective for natural language processing (NLP) than earlier algorithms such as latent semantic analysis.
CBOW is short for continuous bag of words, one of the w2v model architectures. CBOW tries to predict the current target word (the centre word) from the source context words (the surrounding words). It first represents each word as a vector, and we can choose the dimension of these vectors (the embedding dimension). For example, with an embedding dimension of 2, the word "hello" might be represented as [0.213, 0.4543].
To train these representations, CBOW uses target words and context windows. Consider the simple sentence "the quick brown fox jumps over the lazy dog". It can be turned into pairs of (context_window, target_word): with a context window of size 2, we get examples like ([quick, fox], brown), ([the, brown], quick), ([the, dog], lazy) and so on. The model then tries to predict the target_word from the context_window words; a short sketch of this pairing follows this paragraph.
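As a minimal illustrative sketch (the names sentence, half_window and pairs are ours and not part of the tutorial code below, which works with plain word pairs instead), these (context_window, target_word) pairs could be generated like this:

# Illustrative sketch only: build the (context_window, target_word) pairs described
# above ("window of size 2" = one context word on each side of the target).
sentence = "the quick brown fox jumps over the lazy dog".split()
half_window = 1  # one context word on each side

pairs = []
for i, target in enumerate(sentence):
    context = sentence[max(i - half_window, 0):i] + sentence[i + 1:i + 1 + half_window]
    pairs.append((context, target))

print(pairs[2])   # (['quick', 'fox'], 'brown')
print(pairs[1])   # (['the', 'brown'], 'quick')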

Implement W2v CBOW with TF2
CBOW has an input layer, a hidden layer, and an output layer. The input layer is one-hot encoded. The weights and bias of the hidden layer form the word embedding. We then take the hidden (embedded) representation and predict a neighbouring word, using softmax on the output layer; a small shape sketch follows the setup code below.
Enable TF 2.0.
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
import numpy as np

print(tf.__version__)

## Output
# TensorFlow 2.x selected.
# 2.0.0-rc2
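To make the shapes in the architecture description concrete, here is a minimal sketch of that forward pass with random weights. The values vocab_size=7 and embedding_dim=5 mirror the toy corpus used later in this article; the variable names are ours, not part of the model class below.

import tensorflow as tf

vocab_size, embedding_dim = 7, 5

x = tf.one_hot([3], depth=vocab_size)                              # (1, 7)  one-hot input word
W1 = tf.Variable(tf.random.normal([vocab_size, embedding_dim]))    # hidden weights
b1 = tf.Variable(tf.random.normal([embedding_dim]))                # hidden bias
W2 = tf.Variable(tf.random.normal([embedding_dim, vocab_size]))    # output weights
b2 = tf.Variable(tf.random.normal([vocab_size]))                   # output bias

hidden = tf.matmul(x, W1) + b1                                     # (1, 5)  the word's embedding
probs = tf.nn.softmax(tf.matmul(hidden, W2) + b2)                  # (1, 7)  distribution over neighbours
print(hidden.shape, probs.shape)                                   # (1, 5) (1, 7)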
Build the w2v-cbow model
The model takes the vocabulary size, the embedding dimension, and an optimizer for training. By default we use stochastic gradient descent (SGD), but you can also try the Adam optimizer.
class Word2Vec:
    def __init__(self, vocab_size=0, embedding_dim=5, optimizer='sgd', epochs=10000):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim  # use the argument; the default of 5 matches the vectors shown later
        self.epochs = epochs
        if optimizer == 'adam':
            self.optimizer = tf.optimizers.Adam()
        else:
            self.optimizer = tf.optimizers.SGD(learning_rate=0.1)
The training method, also part of the Word2Vec class, takes two inputs, x_train and y_train, which are one-hot vectors. We repeatedly update the weights and biases, computing gradients with tf.GradientTape().
def train(self, x_train=None, y_train=None):
    self.W1 = tf.Variable(tf.random.normal([self.vocab_size, self.embedding_dim]))
    self.b1 = tf.Variable(tf.random.normal([self.embedding_dim]))  # bias
    self.W2 = tf.Variable(tf.random.normal([self.embedding_dim, self.vocab_size]))
    self.b2 = tf.Variable(tf.random.normal([self.vocab_size]))

    for _ in range(self.epochs):
        with tf.GradientTape() as t:
            hidden_layer = tf.add(tf.matmul(x_train, self.W1), self.b1)
            output_layer = tf.nn.softmax(tf.add(tf.matmul(hidden_layer, self.W2), self.b2))
            cross_entropy_loss = tf.reduce_mean(
                -tf.math.reduce_sum(y_train * tf.math.log(output_layer), axis=[1]))
        grads = t.gradient(cross_entropy_loss, [self.W1, self.b1, self.W2, self.b2])
        self.optimizer.apply_gradients(zip(grads, [self.W1, self.b1, self.W2, self.b2]))
        if _ % 1000 == 0:
            print(cross_entropy_loss)
Optionally, we can add a method to the class to check what a word looks like after embedding.
def vectorized(self, word_idx):
    return (self.W1 + self.b1)[word_idx]
One-hot input
We will convert the corpus into one-hot representations before training the model.
corpus_raw = 'He is the king . The king is royal . She is the royal queen '
Let's do some simple pre-processing: convert to lower case and remove the periods (.). Gensim also provides a simple pre-processing function, as shown in the previous example (see the sketch after the next code block).
# convert to lower case
corpus_raw = corpus_raw.lower()

# raw_sentences is a list of sentences.
raw_sentences = corpus_raw.split('.')
sentences = []
for sentence in raw_sentences:
    sentences.append(sentence.split())

# sentences:
# [['he', 'is', 'the', 'king'], ['the', 'king', 'is', 'royal'], ['she', 'is', 'the', 'royal', 'queen']]
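As an aside, if you prefer not to hand-roll this step, gensim's simple_preprocess does roughly the same job. This is only a sketch, assuming gensim is installed; note that it returns a flat token list rather than a list of sentences:

from gensim.utils import simple_preprocess

# lower-cases, tokenizes, and drops punctuation in one call
tokens = simple_preprocess(corpus_raw)
# ['he', 'is', 'the', 'king', 'the', 'king', 'is', 'royal', 'she', 'is', 'the', 'royal', 'queen']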
We set the window size to 2.
data = []
WINDOW_SIZE = 2

for sentence in sentences:
    for word_index, word in enumerate(sentence):
        for nb_word in sentence[max(word_index - WINDOW_SIZE, 0) : min(word_index + WINDOW_SIZE, len(sentence)) + 1]:
            if nb_word != word:
                data.append([word, nb_word])

# data:
# [['he', 'is'], ['he', 'the'], ['is', 'he'], ['is', 'the'], ['is', 'king'], ['the', 'he'], ['the', 'is'],
#  ['the', 'king'], ['king', 'is'], ['king', 'the'], ['the', 'king'], ['the', 'is'], ['king', 'the'],
#  ['king', 'is'], ['king', 'royal'], ['is', 'the'], ['is', 'king'], ['is', 'royal'], ['royal', 'king'],
#  ['royal', 'is'], ['she', 'is'], ['she', 'the'], ['is', 'she'], ['is', 'the'], ['is', 'royal'],
#  ['the', 'she'], ['the', 'is'], ['the', 'royal'], ['the', 'queen'], ['royal', 'is'], ['royal', 'the'],
#  ['royal', 'queen'], ['queen', 'the'], ['queen', 'royal']]
For convenience, we create two helper dictionaries, word2int and int2word, which map between words and their corresponding integer indices.
words = []
for word in corpus_raw.split():
    if word != '.':  # because we don't want to treat . as a word
        words.append(word)

words = set(words)  # so that all duplicate words are removed

word2int = {}
int2word = {}
vocab_size = len(words)  # gives the total number of unique words

for i, word in enumerate(words):
    word2int[word] = i
    int2word[i] = word
Encode to x_train and y_train
# function to convert numbers to one-hot vectors
def to_one_hot(data_point_index, vocab_size):
    temp = np.zeros(vocab_size)
    temp[data_point_index] = 1
    return temp

x_train = []  # input word
y_train = []  # output word

for data_word in data:
    x_train.append(to_one_hot(word2int[data_word[0]], vocab_size))
    y_train.append(to_one_hot(word2int[data_word[1]], vocab_size))

# convert them to numpy arrays
x_train = np.asarray(x_train, dtype='float32')
y_train = np.asarray(y_train, dtype='float32')
Train the model
w2v = Word2Vec(vocab_size=vocab_size, optimizer='adam', epochs=10000)
w2v.train(x_train, y_train)
# training process
tf.Tensor(2.8971386, shape=(), dtype=float32)
tf.Tensor(1.4061855, shape=(), dtype=float32)
tf.Tensor(1.3393705, shape=(), dtype=float32)
tf.Tensor(1.324885, shape=(), dtype=float32)
tf.Tensor(1.3221014, shape=(), dtype=float32)
tf.Tensor(1.3211844, shape=(), dtype=float32)
tf.Tensor(1.320798, shape=(), dtype=float32)
tf.Tensor(1.3206141, shape=(), dtype=float32)
tf.Tensor(1.3205199, shape=(), dtype=float32)
tf.Tensor(1.3204701, shape=(), dtype=float32)
Let's check what the word "queen" looks like in vector space:
w2v.vectorized(word2int['queen'])
<tf.Tensor: id=1920489, shape=(5,), dtype=float32, numpy=
array([-0.34213448,  0.83041203,  1.1423318 , -0.87035054,  2.8295236 ],
      dtype=float32)>
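To sanity-check the embeddings numerically rather than only visually, a small helper like the sketch below could rank words by cosine similarity to a given word. closest() is a hypothetical name of ours, not part of the Word2Vec class:

def closest(word, top_n=3):
    # cosine similarity between `word` and every other word in the vocabulary
    target = w2v.vectorized(word2int[word])
    sims = {}
    for other in words:
        if other == word:
            continue
        vec = w2v.vectorized(word2int[other])
        cos = tf.reduce_sum(target * vec) / (tf.norm(target) * tf.norm(vec))
        sims[other] = float(cos)
    return sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(closest('queen'))   # e.g. [('royal', ...), ...] -- exact values depend on the run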
Visualise words in 2d space
Transform to 2d space
from sklearn.manifold import TSNE
from sklearn import preprocessing

# gather the embedding of every word, indexed consistently with word2int
vectors = np.array([w2v.vectorized(i).numpy() for i in range(vocab_size)])

# note: recent scikit-learn versions require perplexity < n_samples, so a tiny
# vocabulary like this one may need TSNE(..., perplexity=5) or smaller
model = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)
vectors = model.fit_transform(vectors)

normalizer = preprocessing.Normalizer(norm='l2')
vectors = normalizer.fit_transform(vectors)
Plot all words
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_xlim(left=-1, right=1)
ax.set_ylim(bottom=-1, top=1)

for word in words:
    print(word, vectors[word2int[word]][1])
    ax.annotate(word, (vectors[word2int[word]][0], vectors[word2int[word]][1]))

plt.show()

she 0.03894128
is -0.53304255
queen -0.97676146
he -0.99200153
the 0.51811576
royal -0.7622982
king 0.93413407
Conclusion

In this article, we showed a simple version of Word2vec implemented in pure TensorFlow 2.0. There is no tf.Session involved any more; instead, we compute gradients for all training parameters with tf.GradientTape(). The result would improve with a larger corpus; nevertheless, it already shows an interesting pattern: "queen" lies close to "royal" and far from "king".