A.I, Data and Software Engineering

Word2vec with TensorFlow 2.0 – a simple CBOW implementation


On the TensorFlow website, there is a good example of a word embedding implementation with Keras. Nevertheless, we are curious to see what word2vec looks like when implemented in PURE TensorFlow 2.0.

What is CBOW

In the previous article, we introduced Word2vec (w2v) with the Gensim library. Word2vec models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. The model is simple, yet it has been more influential in natural language processing (NLP) than earlier algorithms such as latent semantic analysis.

CBOW is short for continuous bag of words, one of the model architectures in w2v. The CBOW architecture tries to predict the current target word (the centre word) from the source context words (the surrounding words). It first represents words as vectors. We can choose the dimension of these vectors (the embedding dimension). For example, with an embedding dimension of 2, the word “hello” might be represented as [0.213, 0.4543].

To train this representation, CBOW uses targets and context windows. Consider a simple sentence, “the quick brown fox jumps over the lazy dog”. It can be split into pairs of (context_window, target_word). With a context window of size 2, that is, one word on each side of the target, we get examples like ([quick, fox], brown), ([the, brown], quick), ([the, dog], lazy) and so on. The model then tries to predict the target_word from the context_window words.
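The pairing described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the article's original code: the helper name `context_target_pairs` and its `half_window` parameter (the number of words taken on each side of the target) are assumptions made for this example.

```python
def context_target_pairs(tokens, half_window=1):
    """Return (context_words, target_word) pairs for a list of tokens.

    half_window is the number of words taken on each side of the target,
    so half_window=1 gives a context window of 2 words.
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - half_window):i]
        right = tokens[i + 1:i + 1 + half_window]
        context = left + right
        # Skip edge positions that do not have a full context window.
        if len(context) == 2 * half_window:
            pairs.append((context, target))
    return pairs


sentence = "the quick brown fox jumps over the lazy dog"
pairs = context_target_pairs(sentence.split())
for context, target in pairs:
    print(context, "->", target)
```

Skipping the first and last words is one possible choice; another would be to keep them with a shorter context.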

Figure: CBOW model (word2vec implementation with TensorFlow 2.0)

Implement W2v CBOW with TF2

CBOW has an input layer, a hidden layer, and an output layer. The input layer is one-hot encoded. The weight and bias of the hidden layer form the word embedding. Next, we take the representation in the embedding dimension and use it to predict the neighbouring word. The prediction is made with softmax.

Enable TF 2.0.

Build the w2v-cbow model

The model takes the vocabulary size, the embedding dimension, and an optimizer for training. By default, we use stochastic gradient descent (SGD) as the optimizer, but you can also try the Adam optimizer.

The training method takes two inputs, x_train and y_train, which are one-hot vectors. We repeatedly update the weights and biases using gradients recorded with tf.GradientTape().

Optionally, we can add a method to check what a word looks like after embedding.
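A minimal sketch of such a model is given here, assuming a single linear hidden layer followed by a softmax output. The class name `Word2VecCBOW`, the method names, and the default hyperparameters are hypothetical, not the article's original code. When no optimizer is passed, the sketch applies the plain SGD update by hand with `assign_sub`; passing a Keras optimizer such as `tf.keras.optimizers.Adam()` is an optional variation.

```python
import numpy as np
import tensorflow as tf


class Word2VecCBOW:
    """Hypothetical minimal model: linear hidden layer plus softmax output."""

    def __init__(self, vocab_size, embedding_dim=2, optimizer=None, learning_rate=0.1):
        self.optimizer = optimizer          # None means hand-written SGD
        self.learning_rate = learning_rate
        # Hidden layer: its weight and bias together form the word embedding.
        self.W1 = tf.Variable(tf.random.normal([vocab_size, embedding_dim]))
        self.b1 = tf.Variable(tf.random.normal([embedding_dim]))
        # Output layer: maps the embedding back to one score per vocabulary word.
        self.W2 = tf.Variable(tf.random.normal([embedding_dim, vocab_size]))
        self.b2 = tf.Variable(tf.random.normal([vocab_size]))
        self.variables = [self.W1, self.b1, self.W2, self.b2]

    def loss(self, x, y):
        x = tf.cast(x, tf.float32)
        y = tf.cast(y, tf.float32)
        hidden = tf.matmul(x, self.W1) + self.b1
        logits = tf.matmul(hidden, self.W2) + self.b2
        # Softmax and cross-entropy are fused here for numerical stability.
        per_example = tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits)
        return tf.reduce_mean(per_example)

    def train(self, x_train, y_train, epochs=1000):
        for _ in range(epochs):
            # The tape records the forward pass so gradients can be computed.
            with tf.GradientTape() as tape:
                current_loss = self.loss(x_train, y_train)
            grads = tape.gradient(current_loss, self.variables)
            if self.optimizer is not None:
                self.optimizer.apply_gradients(zip(grads, self.variables))
            else:
                for var, grad in zip(self.variables, grads):
                    var.assign_sub(self.learning_rate * grad)
        return float(self.loss(x_train, y_train))

    def vectorized(self, word_idx):
        # Embedding of a word: its row of W1 plus the hidden-layer bias.
        return (self.W1 + self.b1)[word_idx].numpy()


# Toy data (an assumption for the demo): each one-hot input maps to another word.
x_demo = np.eye(4, dtype=np.float32)
y_demo = np.roll(x_demo, 1, axis=0)
model = Word2VecCBOW(vocab_size=4, embedding_dim=2)
print("final loss:", model.train(x_demo, y_demo, epochs=200))
print("embedding of word 0:", model.vectorized(0))
```

Computing the loss from logits avoids taking the log of a softmax output that may underflow to zero, which is why the sketch does not call `tf.nn.softmax` explicitly.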

One-hot input

We convert the input corpus to a one-hot representation before testing the model.

Let us do some simple pre-processing: we change the text to lower case and remove the periods (.). Gensim also provides a simple pre-processing function, as shown in the previous example.
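A small sketch of this pre-processing step follows. The toy corpus and the function name `preprocess` are assumptions for illustration only.

```python
def preprocess(corpus):
    """Lower-case each sentence, drop periods, and split on whitespace."""
    sentences = []
    for line in corpus:
        tokens = line.lower().replace(".", "").split()
        if tokens:  # ignore lines that are empty after cleaning
            sentences.append(tokens)
    return sentences


# Hypothetical toy corpus, not the one used in the article.
corpus = ["The queen is a royal woman.", "The king is a royal man."]
sentences = preprocess(corpus)
print(sentences)
```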

We set the window size to 2.

For convenience, create two helper dictionaries, word2int and int2word. They map between words and their corresponding integer values.
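One way to build the two dictionaries is sketched here. The tokenised sentences are a made-up example, and sorting the vocabulary is an assumption that keeps the integer ids reproducible between runs.

```python
# Hypothetical tokenised sentences, standing in for the pre-processed corpus.
sentences = [["the", "queen", "is", "royal"], ["the", "king", "is", "royal"]]

# Sorting the vocabulary makes the word-to-id assignment deterministic.
vocab = sorted({word for sentence in sentences for word in sentence})

word2int = {word: idx for idx, word in enumerate(vocab)}
int2word = {idx: word for word, idx in word2int.items()}

print(word2int)
print(int2word)
```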

Encode to x_train and y_train
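The encoding step might look like the sketch below. It assumes one training row per (context word, target word) pair so that both x_train and y_train stay strictly one-hot, and it reads a window size of 2 as one word on each side of the target (`half_window=1`). Both choices, along with the helper names, are assumptions for this example; averaging the context vectors into a single input row would be another option.

```python
import numpy as np


def one_hot(index, vocab_size):
    """Return a float32 vector of zeros with a single 1 at `index`."""
    vec = np.zeros(vocab_size, dtype=np.float32)
    vec[index] = 1.0
    return vec


def encode(sentences, word2int, half_window=1):
    """Build one-hot (context word -> target word) training arrays."""
    vocab_size = len(word2int)
    x_rows, y_rows = [], []
    for tokens in sentences:
        for i, target in enumerate(tokens):
            start = max(0, i - half_window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + half_window]
            for context_word in neighbours:
                x_rows.append(one_hot(word2int[context_word], vocab_size))
                y_rows.append(one_hot(word2int[target], vocab_size))
    return np.array(x_rows), np.array(y_rows)


# Hypothetical inputs for the demo.
sentences = [["queen", "is", "royal"]]
word2int = {"is": 0, "queen": 1, "royal": 2}
x_train, y_train = encode(sentences, word2int)
print(x_train.shape, y_train.shape)
```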

Train the model

Let us check what the word “queen” looks like in vector space:

Visualise words in 2d space

Transform to 2d space
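The original transformation code is not shown here, so the sketch below uses PCA computed with NumPy's SVD as a stand-in technique for reducing embeddings to two dimensions. The function name `to_2d` and the random placeholder embeddings are assumptions. If the embedding dimension is already 2, this step can be skipped.

```python
import numpy as np


def to_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Placeholder embeddings: 6 words with 5 dimensions each.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 5))
points = to_2d(embeddings)
print(points.shape)
```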

Plot all words
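A plotting sketch with matplotlib follows, assuming each word already has a 2D coordinate. The words, coordinates, and the helper name `plot_words` are made up for illustration and are not results from the trained model.

```python
import matplotlib

# Non-interactive backend so the sketch also runs without a display;
# drop this line when working in a notebook.
matplotlib.use("Agg")
import matplotlib.pyplot as plt


def plot_words(words, points):
    """Scatter the 2D points and label each one with its word."""
    fig, ax = plt.subplots(figsize=(6, 6))
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    ax.scatter(xs, ys)
    for word, x, y in zip(words, xs, ys):
        ax.annotate(word, (x, y), xytext=(4, 4), textcoords="offset points")
    return fig, ax


# Hypothetical coordinates, for illustration only.
words = ["queen", "king", "royal"]
points = [(0.2, 0.9), (-0.8, -0.4), (0.3, 0.7)]
fig, ax = plot_words(words, points)
# fig.savefig("words_2d.png") would write the figure to disk.
```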

Figure: words in 2D space



In this article, we showed a simple version of Word2vec implemented in pure TensorFlow 2.0. No tf.Session is involved any more. The training step is wrapped in tf.GradientTape(), which records the operations so that gradients for all trainable parameters can be computed. The results would likely improve with a larger corpus; nevertheless, the model already shows an interesting result: queen ~ royal (close) and queen >< king (far away).
