
Latent Dirichlet Allocation (LDA) and Topic Modelling in Python


Topic modelling is a type of statistical modelling for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is an example of a topic model and is used to classify the text in a document to a particular topic. It builds a topic-per-document model and a words-per-topic model, both modelled as Dirichlet distributions.
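To make the Dirichlet picture concrete, here is a minimal sketch of LDA's generative assumptions (not part of the original tutorial; the topic count, vocabulary size and the alpha/beta hyperparameters below are arbitrary illustrative values):

import numpy as np

# Illustrative only: each document gets a Dirichlet-distributed mixture of topics,
# and each topic is a Dirichlet-distributed distribution over the vocabulary.
num_topics, vocab_size = 10, 1000
alpha, beta = 0.1, 0.01                # hypothetical symmetric priors

topic_word = np.random.dirichlet([beta] * vocab_size, size=num_topics)  # words per topic
doc_topic = np.random.dirichlet([alpha] * num_topics)                   # topics for one document

# Generating one word: pick a topic for the word, then pick the word from that topic.
z = np.random.choice(num_topics, p=doc_topic)
w = np.random.choice(vocab_size, p=topic_word[z])
print(z, w)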

Here, we are going to apply LDA to a set of documents and split them into topics. Let’s get started!

The Data

The data set we’ll use is a list of over one million news headlines published over a period of 15 years and can be downloaded from Kaggle.

import pandas as pd
data = pd.read_csv('../input/million-headlines/abcnews-date-text.csv')
data_text = data[['headline_text']].copy()  # copy to avoid a SettingWithCopyWarning
data_text['index'] = data_text.index
documents = data_text

Take a peek at the data.

print(len(documents))
print(documents[:5])

1226258
                                       headline_text  index
0  aba decides against community broadcasting lic...      0
1     act fire witnesses must be aware of defamation      1
2     a g calls for infrastructure protection summit      2
3           air nz staff in aust strike for pay rise      3
4      air nz strike to affect australian travellers      4

Data Pre-processing

Firstly, we will perform the following steps:

  • Tokenization: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
  • Words with 3 or fewer characters are removed.
  • All stopwords are removed.
  • Words are lemmatized — words in third person are changed to first person and verbs in past and future tenses are changed into present.
  • Words are stemmed — words are reduced to their root form (a short example of the lemmatize/stem difference follows this list).
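To see the difference between the last two steps, here is a quick illustrative check (assuming NLTK and its WordNet data are available; not part of the original tutorial):

from nltk.stem import WordNetLemmatizer, SnowballStemmer

lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer('english')

print(lemmatizer.lemmatize('went', pos='v'))   # 'go'    — lemmatization maps to a dictionary form
print(stemmer.stem('elections'))               # 'elect' — stemming chops the word down to a root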

Loading gensim and nltk libraries

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import numpy as np
np.random.seed(2018)
import nltk
nltk.download('wordnet')

Write a function to perform the lemmatization and stemming preprocessing steps on the data set.

stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    # Lemmatize (verbs reduced to their base form), then stem to the root form
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize and lowercase, drop stopwords and tokens of 3 characters or fewer,
    # then lemmatize and stem what remains
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

Select a document to preview after preprocessing.

doc_sample = documents[documents['index'] == 4310].values[0][0]
print('original document: ')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

The output:

original document: 
['ratepayers', 'group', 'wants', 'compulsory', 'local', 'govt', 'voting']


 tokenized and lemmatized document: 
['ratepay', 'group', 'want', 'compulsori', 'local', 'govt', 'vote']

It worked!

Preprocess the headline text, saving the results as ‘processed_docs’

processed_docs = documents['headline_text'].map(preprocess)
processed_docs[:10]

The output:

0            [decid, communiti, broadcast, licenc]
1                               [wit, awar, defam]
2           [call, infrastructur, protect, summit]
3                      [staff, aust, strike, rise]
4             [strike, affect, australian, travel]
5               [ambiti, olsson, win, tripl, jump]
6           [antic, delight, record, break, barca]
7    [aussi, qualifi, stosur, wast, memphi, match]
8            [aust, address, secur, council, iraq]
9                         [australia, lock, timet]
Name: headline_text, dtype: object

Bag of Words on the Data Set

Create a dictionary from ‘processed_docs’ containing the number of times a word appears in the training set.

dictionary = gensim.corpora.Dictionary(processed_docs)
for k, v in dictionary.items():  # (token_id, token) pairs
    print(k, v)
    if k > 9:
        break

The output:

0 broadcast
1 communiti
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit

Filter out tokens that appear in:

  • fewer than 15 documents (an absolute number), or
  • more than 50% of the documents (a fraction of the total corpus size, not an absolute number).

After those two steps, keep only the first 100,000 most frequent tokens.

dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
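As an optional sanity check (not in the original post), you can confirm how much the filtering shrank the vocabulary:

# Illustrative check: the filtered vocabulary is far smaller than the raw one
# and is capped at 100,000 tokens by keep_n.
print('vocabulary size after filtering:', len(dictionary))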

Gensim doc2bow

For each document, we create a bag-of-words representation: a list of (token_id, token_count) pairs recording which dictionary words appear in the document and how many times. We save this to ‘bow_corpus’, then check our earlier selected document.

bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[4310]

The output:

[(162, 1), (240, 1), (292, 1), (589, 1), (838, 1), (3570, 1), (3571, 1)]

Preview Bag Of Words for our sample preprocessed document.

bow_doc_4310 = bow_corpus[4310]
for i in range(len(bow_doc_4310)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_4310[i][0],
                                                     dictionary[bow_doc_4310[i][0]],
                                                     bow_doc_4310[i][1]))

The output:

Word 162 ("govt") appears 1 time.
Word 240 ("group") appears 1 time.
Word 292 ("vote") appears 1 time.
Word 589 ("local") appears 1 time.
Word 838 ("want") appears 1 time.
Word 3570 ("compulsori") appears 1 time.
Word 3571 ("ratepay") appears 1 time.

TF-IDF

Next, we create a TF-IDF model object using models.TfidfModel on ‘bow_corpus’ and save it to ‘tfidf’, then apply the transformation to the entire corpus and call the result ‘corpus_tfidf’. Finally, we preview the TF-IDF scores for our first document.

from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

The output:

[(0, 0.5842699484464488),
 (1, 0.38798859072167835),
 (2, 0.5008422243250992),
 (3, 0.5071987254965034)]
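For intuition: with default settings, gensim's TF-IDF weight for a term is its in-document count multiplied by log2(N / df), where N is the number of documents and df is the number of documents containing the term, and each document's vector is then L2-normalized. A rough illustrative check (assuming the defaults; this does not reproduce gensim's internals exactly):

import math

# Recompute the unnormalized weight of the first term in the first document
# and compare it with the normalized score printed above.
N = dictionary.num_docs                    # number of documents seen by the dictionary
term_id, tf_count = bow_corpus[0][0]       # (token_id, count) of the first term
df = dictionary.dfs[term_id]               # number of documents containing that term
raw_weight = tf_count * math.log2(N / df)  # gensim's default idf uses log base 2
print(term_id, raw_weight)                 # before per-document L2 normalization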

Running LDA using Bag of Words

Next, we train our LDA model using gensim.models.LdaMulticore and save it to ‘lda_model’.

lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)
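Training on over a million headlines takes a while, so it is worth persisting the result. A brief sketch using gensim's standard save/load methods (the file name is our own choice):

# Save the trained model so training does not have to be repeated,
# and reload it later from the same path.
lda_model.save('lda_bow.model')
# ... later ...
lda_model = gensim.models.LdaMulticore.load('lda_bow.model')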

For each topic, we will explore the words occurring in that topic and their relative weights.

for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

The output:

Topic: 0 
Words: 0.031*"polic" + 0.028*"case" + 0.025*"death" + 0.021*"charg" + 0.018*"court" + 0.017*"murder" + 0.016*"coronavirus" + 0.016*"attack" + 0.015*"alleg" + 0.012*"trial"
Topic: 1 
Words: 0.021*"market" + 0.018*"world" + 0.018*"women" + 0.015*"island" + 0.015*"final" + 0.014*"australian" + 0.013*"return" + 0.011*"fall" + 0.011*"street" + 0.011*"beach"
Topic: 2 
Words: 0.041*"coronavirus" + 0.031*"covid" + 0.025*"live" + 0.022*"nation" + 0.021*"coast" + 0.019*"restrict" + 0.014*"water" + 0.013*"plan" + 0.013*"gold" + 0.011*"park"
Topic: 3 
Words: 0.039*"sydney" + 0.021*"crash" + 0.020*"die" + 0.018*"adelaid" + 0.014*"miss" + 0.013*"polic" + 0.011*"break" + 0.011*"driver" + 0.011*"pandem" + 0.010*"search"
Topic: 4 
Words: 0.040*"year" + 0.030*"melbourn" + 0.020*"open" + 0.019*"canberra" + 0.015*"accus" + 0.015*"jail" + 0.014*"work" + 0.013*"face" + 0.013*"life" + 0.013*"record"
Topic: 5 
Words: 0.028*"govern" + 0.019*"health" + 0.019*"state" + 0.018*"school" + 0.017*"chang" + 0.017*"help" + 0.013*"coronavirus" + 0.012*"indigen" + 0.012*"feder" + 0.012*"communiti"
Topic: 6 
Words: 0.073*"australia" + 0.045*"trump" + 0.024*"donald" + 0.018*"border" + 0.018*"tasmania" + 0.015*"elect" + 0.015*"peopl" + 0.014*"covid" + 0.013*"say" + 0.012*"scott"
Topic: 7 
Words: 0.042*"queensland" + 0.034*"victoria" + 0.023*"news" + 0.021*"hous" + 0.021*"bushfir" + 0.014*"time" + 0.012*"west" + 0.012*"price" + 0.011*"farmer" + 0.010*"guilti"
Topic: 8 
Words: 0.032*"china" + 0.029*"test" + 0.022*"south" + 0.015*"coronavirus" + 0.013*"north" + 0.012*"rural" + 0.012*"presid" + 0.012*"train" + 0.012*"minist" + 0.011*"talk"
Topic: 9 
Words: 0.025*"call" + 0.021*"rise" + 0.020*"victorian" + 0.018*"morrison" + 0.017*"royal" + 0.017*"tasmanian" + 0.017*"claim" + 0.014*"commiss" + 0.011*"town" + 0.010*"million"

Can you distinguish different topics using the words in each topic and their corresponding weights?
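Beyond eyeballing the word lists, the topics can also be scored quantitatively. One common option is gensim's CoherenceModel; a brief sketch (the ‘c_v’ measure and the variable names are our choices, not from the original post):

from gensim.models import CoherenceModel

# Higher coherence generally indicates more interpretable topics.
coherence_model = CoherenceModel(model=lda_model,
                                 texts=processed_docs,
                                 dictionary=dictionary,
                                 coherence='c_v')
print('Coherence (c_v):', coherence_model.get_coherence())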

Running LDA using TF-IDF

lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

The output:

Topic: 0 Word: 0.008*"thursday" + 0.008*"liber" + 0.008*"govern" + 0.007*"sexual" + 0.007*"victorian" + 0.006*"alan" + 0.006*"abus" + 0.006*"social" + 0.006*"video" + 0.006*"disabl"
Topic: 1 Word: 0.010*"interview" + 0.009*"michael" + 0.009*"andrew" + 0.009*"coronavirus" + 0.008*"pandem" + 0.008*"lockdown" + 0.007*"domest" + 0.007*"grandstand" + 0.007*"violenc" + 0.007*"univers"
Topic: 2 Word: 0.023*"coronavirus" + 0.022*"covid" + 0.018*"news" + 0.015*"market" + 0.008*"price" + 0.007*"australian" + 0.007*"stori" + 0.007*"rural" + 0.007*"vaccin" + 0.007*"share"
Topic: 3 Word: 0.020*"donald" + 0.015*"drum" + 0.010*"tuesday" + 0.008*"david" + 0.007*"histori" + 0.005*"america" + 0.005*"abbott" + 0.004*"elder" + 0.004*"harvest" + 0.004*"shorten"
Topic: 4 Word: 0.012*"elect" + 0.008*"presid" + 0.007*"say" + 0.006*"govern" + 0.005*"border" + 0.005*"leader" + 0.005*"biden" + 0.005*"labor" + 0.005*"protest" + 0.005*"hong"
Topic: 5 Word: 0.012*"australia" + 0.011*"south" + 0.010*"north" + 0.009*"queensland" + 0.008*"weather" + 0.008*"world" + 0.007*"west" + 0.007*"monday" + 0.007*"storm" + 0.007*"leagu"
Topic: 6 Word: 0.017*"bushfir" + 0.012*"royal" + 0.011*"morrison" + 0.011*"commiss" + 0.010*"climat" + 0.009*"restrict" + 0.009*"wednesday" + 0.008*"coronavirus" + 0.008*"peter" + 0.007*"chang"
Topic: 7 Word: 0.009*"christma" + 0.007*"station" + 0.006*"farm" + 0.006*"plead" + 0.005*"coronavirus" + 0.005*"light" + 0.005*"fund" + 0.005*"plan" + 0.005*"august" + 0.005*"explain"
Topic: 8 Word: 0.034*"trump" + 0.018*"countri" + 0.013*"hour" + 0.011*"health" + 0.010*"care" + 0.010*"friday" + 0.009*"coronavirus" + 0.009*"sport" + 0.008*"mental" + 0.008*"age"
Topic: 9 Word: 0.019*"polic" + 0.016*"charg" + 0.014*"murder" + 0.013*"crash" + 0.013*"woman" + 0.009*"shoot" + 0.009*"court" + 0.009*"death" + 0.009*"alleg" + 0.009*"kill"

Again, can you distinguish different topics using the words in each topic and their corresponding weights?

Performance evaluation by classifying sample document using LDA Bag of Words model

We will check where our test document would be classified.

processed_docs[4310]

['ratepay', 'group', 'want', 'compulsori', 'local', 'govt', 'vote']

for index, score in sorted(lda_model[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))

The output:

Score: 0.4076267182826996	 
Topic: 0.028*"govern" + 0.019*"health" + 0.019*"state" + 0.018*"school" + 0.017*"chang" + 0.017*"help" + 0.013*"coronavirus" + 0.012*"indigen" + 0.012*"feder" + 0.012*"communiti"

Score: 0.354687362909317	 
Topic: 0.025*"call" + 0.021*"rise" + 0.020*"victorian" + 0.018*"morrison" + 0.017*"royal" + 0.017*"tasmanian" + 0.017*"claim" + 0.014*"commiss" + 0.011*"town" + 0.010*"million"

Score: 0.15015438199043274	 
Topic: 0.073*"australia" + 0.045*"trump" + 0.024*"donald" + 0.018*"border" + 0.018*"tasmania" + 0.015*"elect" + 0.015*"peopl" + 0.014*"covid" + 0.013*"say" + 0.012*"scott"

Score: 0.012506603263318539	 
Topic: 0.041*"coronavirus" + 0.031*"covid" + 0.025*"live" + 0.022*"nation" + 0.021*"coast" + 0.019*"restrict" + 0.014*"water" + 0.013*"plan" + 0.013*"gold" + 0.011*"park"

Score: 0.012504367157816887	 
Topic: 0.040*"year" + 0.030*"melbourn" + 0.020*"open" + 0.019*"canberra" + 0.015*"accus" + 0.015*"jail" + 0.014*"work" + 0.013*"face" + 0.013*"life" + 0.013*"record"

Score: 0.012504314072430134	 
Topic: 0.032*"china" + 0.029*"test" + 0.022*"south" + 0.015*"coronavirus" + 0.013*"north" + 0.012*"rural" + 0.012*"presid" + 0.012*"train" + 0.012*"minist" + 0.011*"talk"

Score: 0.01250426471233368	 
Topic: 0.042*"queensland" + 0.034*"victoria" + 0.023*"news" + 0.021*"hous" + 0.021*"bushfir" + 0.014*"time" + 0.012*"west" + 0.012*"price" + 0.011*"farmer" + 0.010*"guilti"

Score: 0.012503987178206444	 
Topic: 0.031*"polic" + 0.028*"case" + 0.025*"death" + 0.021*"charg" + 0.018*"court" + 0.017*"murder" + 0.016*"coronavirus" + 0.016*"attack" + 0.015*"alleg" + 0.012*"trial"

Score: 0.012503987178206444	 
Topic: 0.021*"market" + 0.018*"world" + 0.018*"women" + 0.015*"island" + 0.015*"final" + 0.014*"australian" + 0.013*"return" + 0.011*"fall" + 0.011*"street" + 0.011*"beach"

Score: 0.012503987178206444	 
Topic: 0.039*"sydney" + 0.021*"crash" + 0.020*"die" + 0.018*"adelaid" + 0.014*"miss" + 0.013*"polic" + 0.011*"break" + 0.011*"driver" + 0.011*"pandem" + 0.010*"search"

Our test document receives its highest probability for the topic dominated by words such as “govern”, “state” and “communiti”, which fits a headline about local government voting, so the classification looks accurate.
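The same idea extends to the whole corpus. A small sketch (our own helper code, not from the original post) that records the dominant topic of each headline:

# Assign each headline its single highest-scoring topic (shown on a small
# slice of the corpus to keep the loop quick).
dominant_topics = []
for bow in bow_corpus[:1000]:
    doc_topics = lda_model.get_document_topics(bow)
    best_topic, best_score = max(doc_topics, key=lambda t: t[1])
    dominant_topics.append(best_topic)
print(dominant_topics[:10])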

Performance evaluation by classifying sample document using LDA TF-IDF model.

for index, score in sorted(lda_model_tfidf[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))

The output:

Score: 0.7066717147827148	 
Topic: 0.009*"christma" + 0.007*"station" + 0.006*"farm" + 0.006*"plead" + 0.005*"coronavirus" + 0.005*"light" + 0.005*"fund" + 0.005*"plan" + 0.005*"august" + 0.005*"explain"

Score: 0.19328047335147858	 
Topic: 0.020*"donald" + 0.015*"drum" + 0.010*"tuesday" + 0.008*"david" + 0.007*"histori" + 0.005*"america" + 0.005*"abbott" + 0.004*"elder" + 0.004*"harvest" + 0.004*"shorten"

Score: 0.012508566491305828	 
Topic: 0.012*"elect" + 0.008*"presid" + 0.007*"say" + 0.006*"govern" + 0.005*"border" + 0.005*"leader" + 0.005*"biden" + 0.005*"labor" + 0.005*"protest" + 0.005*"hong"

Score: 0.012507441453635693	 
Topic: 0.008*"thursday" + 0.008*"liber" + 0.008*"govern" + 0.007*"sexual" + 0.007*"victorian" + 0.006*"alan" + 0.006*"abus" + 0.006*"social" + 0.006*"video" + 0.006*"disabl"

Score: 0.012506011873483658	 
Topic: 0.023*"coronavirus" + 0.022*"covid" + 0.018*"news" + 0.015*"market" + 0.008*"price" + 0.007*"australian" + 0.007*"stori" + 0.007*"rural" + 0.007*"vaccin" + 0.007*"share"

Score: 0.012505752965807915	 
Topic: 0.017*"bushfir" + 0.012*"royal" + 0.011*"morrison" + 0.011*"commiss" + 0.010*"climat" + 0.009*"restrict" + 0.009*"wednesday" + 0.008*"coronavirus" + 0.008*"peter" + 0.007*"chang"

Score: 0.012505417689681053	 
Topic: 0.034*"trump" + 0.018*"countri" + 0.013*"hour" + 0.011*"health" + 0.010*"care" + 0.010*"friday" + 0.009*"coronavirus" + 0.009*"sport" + 0.008*"mental" + 0.008*"age"

Score: 0.012504937127232552	 
Topic: 0.012*"australia" + 0.011*"south" + 0.010*"north" + 0.009*"queensland" + 0.008*"weather" + 0.008*"world" + 0.007*"west" + 0.007*"monday" + 0.007*"storm" + 0.007*"leagu"

Score: 0.012504936195909977	 
Topic: 0.010*"interview" + 0.009*"michael" + 0.009*"andrew" + 0.009*"coronavirus" + 0.008*"pandem" + 0.008*"lockdown" + 0.007*"domest" + 0.007*"grandstand" + 0.007*"violenc" + 0.007*"univers"

Score: 0.012504755519330502	 
Topic: 0.019*"polic" + 0.016*"charg" + 0.014*"murder" + 0.013*"crash" + 0.013*"woman" + 0.009*"shoot" + 0.009*"court" + 0.009*"death" + 0.009*"alleg" + 0.009*"kill"

Again, our test document is assigned to a single topic with a clearly dominant probability (about 0.71), so the TF-IDF model also gives a confident classification.

Testing the model on an unseen document

unseen_document = 'How a Pentagon deal became an identity crisis for Google'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

The output:

Score: 0.34983524680137634	 Topic: 0.031*"polic" + 0.028*"case" + 0.025*"death" + 0.021*"charg" + 0.018*"court"
Score: 0.2172440141439438	 Topic: 0.032*"china" + 0.029*"test" + 0.022*"south" + 0.015*"coronavirus" + 0.013*"north"
Score: 0.18334510922431946	 Topic: 0.040*"year" + 0.030*"melbourn" + 0.020*"open" + 0.019*"canberra" + 0.015*"accus"
Score: 0.14952439069747925	 Topic: 0.042*"queensland" + 0.034*"victoria" + 0.023*"news" + 0.021*"hous" + 0.021*"bushfir"
Score: 0.01667620614171028	 Topic: 0.025*"call" + 0.021*"rise" + 0.020*"victorian" + 0.018*"morrison" + 0.017*"royal"
Score: 0.016675885766744614	 Topic: 0.021*"market" + 0.018*"world" + 0.018*"women" + 0.015*"island" + 0.015*"final"
Score: 0.016675855964422226	 Topic: 0.041*"coronavirus" + 0.031*"covid" + 0.025*"live" + 0.022*"nation" + 0.021*"coast"
Score: 0.01667516678571701	 Topic: 0.028*"govern" + 0.019*"health" + 0.019*"state" + 0.018*"school" + 0.017*"chang"
Score: 0.016674060374498367	 Topic: 0.039*"sydney" + 0.021*"crash" + 0.020*"die" + 0.018*"adelaid" + 0.014*"miss"
Score: 0.016674060374498367	 Topic: 0.073*"australia" + 0.045*"trump" + 0.024*"donald" + 0.018*"border" + 0.018*"tasmania"
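If new headlines keep arriving, the trained model does not have to be rebuilt from scratch: gensim's LDA supports online updates. A brief sketch (the new_headlines list here is a made-up example):

# Preprocess and vectorize the new headlines with the same dictionary,
# then fold them into the existing model with an online update.
new_headlines = ['council considers compulsory voting for local elections']  # hypothetical
new_bow = [dictionary.doc2bow(preprocess(h)) for h in new_headlines]
lda_model.update(new_bow)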

Source code can be found publicly on Kaggle.

Reference:

Udacity — NLP
