Topic modelling is a type of statistical modelling for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is an example of a topic model and is used to assign the text in a document to one or more topics. It builds a topics-per-document model and a words-per-topic model, both modelled as Dirichlet distributions.
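To make the two Dirichlet-based models more concrete, here is a minimal sketch of the generative story LDA assumes, written in plain Python. The toy vocabulary, the number of topics, and the prior values are all made up for illustration; this is not how gensim fits a model, only how LDA imagines documents being produced.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, doc_alpha, length, rng):
    """Generate one toy document following the LDA generative story."""
    # 1. Draw this document's mixture over topics.
    theta = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position from the document's mixture.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's distribution over the vocabulary.
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return theta, words

rng = random.Random(2018)
vocab = ["polic", "court", "market", "price", "storm", "flood"]  # toy vocabulary (assumption)
num_topics = 3  # illustrative value
# Each topic is a distribution over the vocabulary, itself drawn from a Dirichlet.
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(num_topics)]
theta, words = generate_document(topic_word, vocab, [0.5] * num_topics, 6, rng)
print(theta)
print(words)
```

Training LDA runs this story in reverse: given only the words, it estimates the per-document topic mixtures and the per-topic word distributions.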
Here, we are going to apply LDA to a set of documents and split them into topics. Let’s get started!
The Data
The data set we’ll use is a list of over one million news headlines published over a period of 15 years and can be downloaded from Kaggle.
import pandas as pd
data = pd.read_csv('../input/million-headlines/abcnews-date-text.csv')
# Copy the column so that adding 'index' does not write to a view of `data`
data_text = data[['headline_text']].copy()
data_text['index'] = data_text.index
documents = data_text
Take a peek at the data.
print(len(documents))
print(documents[:5])
1226258
headline_text index
0 aba decides against community broadcasting lic... 0
1 act fire witnesses must be aware of defamation 1
2 a g calls for infrastructure protection summit 2
3 air nz staff in aust strike for pay rise 3
4 air nz strike to affect australian travellers 4
Data Pre-processing
Firstly, we will perform the following steps:
- Tokenization: Split the text into words. Lowercase the words and remove punctuation.
- Words that have 3 or fewer characters are removed.
- All stopwords are removed.
- Words are lemmatized: words in the third person are changed to the first person, and verbs in past and future tenses are changed into the present tense.
- Words are stemmed: words are reduced to their root form.
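The tokenizing and filtering steps above can be sketched without any external library. This is only a stand-in: the regular expression approximates gensim's tokenizer, TOY_STOPWORDS is a hypothetical, deliberately tiny stopword set, and lemmatization and stemming are left out because they need NLTK. The `min_len=4` default mirrors the `len(token) > 3` check used in the tutorial code.

```python
import re

# Hypothetical, deliberately tiny stopword set; gensim's STOPWORDS is much larger.
TOY_STOPWORDS = {"for", "the", "and", "with", "from"}

def toy_tokenize(text, min_len=4):
    """Lowercase, keep alphabetic runs, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) >= min_len]

tokens = toy_tokenize("Air NZ staff in Aust strike for pay rise")
print(tokens)
```

For this headline, the short words ("air", "nz", "in", "pay") and the stopword "for" are dropped, leaving the four longer words.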
Loading gensim and nltk libraries
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import numpy as np
import nltk
np.random.seed(2018)  # fix NumPy's random seed
nltk.download('wordnet')  # WordNet data needed by WordNetLemmatizer
Write functions to perform the lemmatization, stemming, and preprocessing steps on the data set.
def lemmatize_stemming(text):
    # Lemmatize as a verb, then stem with the Snowball (English) stemmer
    stemmer = SnowballStemmer('english')
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        # Keep tokens that are not stopwords and are longer than 3 characters
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
Select a document to preview after preprocessing.
doc_sample = documents[documents['index'] == 4310].values[0][0]
print('original document: ')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))
The output:
original document:
['ratepayers', 'group', 'wants', 'compulsory', 'local', 'govt', 'voting']
tokenized and lemmatized document:
['ratepay', 'group', 'want', 'compulsori', 'local', 'govt', 'vote']
It worked!
Preprocess the headline text, saving the results as ‘processed_docs’.
processed_docs = documents['headline_text'].map(preprocess)
processed_docs[:10]
The output:
0 [decid, communiti, broadcast, licenc]
1 [wit, awar, defam]
2 [call, infrastructur, protect, summit]
3 [staff, aust, strike, rise]
4 [strike, affect, australian, travel]
5 [ambiti, olsson, win, tripl, jump]
6 [antic, delight, record, break, barca]
7 [aussi, qualifi, stosur, wast, memphi, match]
8 [aust, address, secur, council, iraq]
9 [australia, lock, timet]
Name: headline_text, dtype: object
Bag of Words on the Data set
Create a dictionary from ‘processed_docs’ that maps each unique word to an integer id and records how often each word appears in the training set.
dictionary = gensim.corpora.Dictionary(processed_docs)
for k, v in dictionary.iteritems():
    print(k, v)
    if k > 9:
        break
The output:
0 broadcast
1 communiti
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit
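As a rough picture of what such a dictionary holds, the sketch below builds a word-to-id mapping and per-word document counts in plain Python. The function name build_vocabulary is made up for illustration, and assigning ids in sorted order within each document is an assumption that happens to reproduce the ids printed above; it is not a description of gensim's internals.

```python
def build_vocabulary(docs):
    """Assign an integer id to each distinct token and count document frequencies."""
    token2id = {}
    doc_freq = {}
    for doc in docs:
        # Visit each distinct token once per document, in sorted order (assumption).
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return token2id, doc_freq

# The first three preprocessed headlines from the output shown earlier.
docs = [
    ["decid", "communiti", "broadcast", "licenc"],
    ["wit", "awar", "defam"],
    ["call", "infrastructur", "protect", "summit"],
]
token2id, doc_freq = build_vocabulary(docs)
for token, idx in token2id.items():
    print(idx, token)
```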
Filter out tokens that appear in
- fewer than 15 documents (absolute number) or
- more than 50% of documents (a fraction of the total corpus size, not an absolute number).
- After the above two steps, keep only the 100000 most frequent tokens.
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
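The three filtering rules can be illustrated on a handful of made-up document frequencies. This is a sketch of the logic only: filter_vocabulary is a hypothetical helper, the counts are invented, and the alphabetical tie-break is our own choice.

```python
def filter_vocabulary(doc_freq, num_docs, no_below, no_above, keep_n):
    """Return the tokens that survive the three filtering rules."""
    kept = [
        token for token, df in doc_freq.items()
        if df >= no_below and df <= no_above * num_docs
    ]
    # Most frequent first; ties broken alphabetically for a stable result.
    kept.sort(key=lambda token: (-doc_freq[token], token))
    return kept[:keep_n]

# Invented document frequencies for a corpus of 100 documents.
doc_freq = {"polic": 40, "summit": 3, "said": 90, "court": 25, "flood": 16}
survivors = filter_vocabulary(doc_freq, num_docs=100, no_below=15, no_above=0.5, keep_n=2)
print(survivors)
```

Here "summit" is too rare, "said" is too common, and "flood" passes both frequency tests but is cut by the keep_n limit.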
Gensim doc2bow
For each document, we create a list of (word id, count) pairs reporting which words appear and how many times each one appears. We save the result to ‘bow_corpus’, then check the document we selected earlier.
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[4310]
[(162, 1), (240, 1), (292, 1), (589, 1), (838, 1), (3570, 1), (3571, 1)]
Preview Bag Of Words for our sample preprocessed document.
bow_doc_4310 = bow_corpus[4310]
for i in range(len(bow_doc_4310)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_4310[i][0],
                                                     dictionary[bow_doc_4310[i][0]],
                                                     bow_doc_4310[i][1]))
Word 162 ("govt") appears 1 time.
Word 240 ("group") appears 1 time.
Word 292 ("vote") appears 1 time.
Word 589 ("local") appears 1 time.
Word 838 ("want") appears 1 time.
Word 3570 ("compulsori") appears 1 time.
Word 3571 ("ratepay") appears 1 time.
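The bag-of-words conversion itself is simple enough to sketch by hand. The helper to_bow below is hypothetical; the ids are copied from the preview above, and "unknown" stands in for any token that is missing from the dictionary and is therefore skipped.

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Ids copied from the preview above.
token2id = {"govt": 162, "group": 240, "vote": 292, "local": 589,
            "want": 838, "compulsori": 3570, "ratepay": 3571}
tokens = ["ratepay", "group", "want", "compulsori", "local", "govt", "vote", "unknown"]
bow = to_bow(tokens, token2id)
print(bow)
```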
TF-IDF
Next, we create a tf-idf model object using models.TfidfModel on ‘bow_corpus’ and save it to ‘tfidf’, then apply the transformation to the entire corpus and call it ‘corpus_tfidf’. Finally, we preview TF-IDF scores for our first document.
from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break
The output:
[(0, 0.5842699484464488),
(1, 0.38798859072167835),
(2, 0.5008422243250992),
(3, 0.5071987254965034)]
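To see where weights like these come from, here is a small sketch of one common TF-IDF formulation: raw count times log2(N / document frequency), followed by scaling the vector to unit length. We assume this matches gensim's default settings; the document frequencies and corpus size below are invented, so the numbers will not match the output above.

```python
import math

def tfidf_vector(bow, doc_freq, num_docs):
    """Weight (token_id, count) pairs by count * log2(N / df), then L2-normalise."""
    weighted = [(tid, count * math.log2(num_docs / doc_freq[tid])) for tid, count in bow]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(tid, w / norm) for tid, w in weighted] if norm > 0 else []

# Invented document frequencies for a four-token headline in a 1,000-document corpus.
doc_freq = {0: 10, 1: 100, 2: 20, 3: 20}
bow = [(0, 1), (1, 1), (2, 1), (3, 1)]
vector = tfidf_vector(bow, doc_freq, num_docs=1000)
print(vector)
print(sum(w * w for _, w in vector))  # squared weights of a unit-length vector sum to 1
```

The rarest token (id 0) gets the largest weight and the most common one (id 1) the smallest. The four weights printed for our first document behave the same way: their squares add up to roughly 1.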
Running LDA using Bag of Words
Next, we train our LDA model using gensim.models.LdaMulticore and save it to ‘lda_model’.
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)
For each topic, we will explore the words occurring in that topic and its relative weight.
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))
Topic: 0
Words: 0.031*"polic" + 0.028*"case" + 0.025*"death" + 0.021*"charg" + 0.018*"court" + 0.017*"murder" + 0.016*"coronavirus" + 0.016*"attack" + 0.015*"alleg" + 0.012*"trial"
Topic: 1
Words: 0.021*"market" + 0.018*"world" + 0.018*"women" + 0.015*"island" + 0.015*"final" + 0.014*"australian" + 0.013*"return" + 0.011*"fall" + 0.011*"street" + 0.011*"beach"
Topic: 2
Words: 0.041*"coronavirus" + 0.031*"covid" + 0.025*"live" + 0.022*"nation" + 0.021*"coast" + 0.019*"restrict" + 0.014*"water" + 0.013*"plan" + 0.013*"gold" + 0.011*"park"
Topic: 3
Words: 0.039*"sydney" + 0.021*"crash" + 0.020*"die" + 0.018*"adelaid" + 0.014*"miss" + 0.013*"polic" + 0.011*"break" + 0.011*"driver" + 0.011*"pandem" + 0.010*"search"
Topic: 4
Words: 0.040*"year" + 0.030*"melbourn" + 0.020*"open" + 0.019*"canberra" + 0.015*"accus" + 0.015*"jail" + 0.014*"work" + 0.013*"face" + 0.013*"life" + 0.013*"record"
Topic: 5
Words: 0.028*"govern" + 0.019*"health" + 0.019*"state" + 0.018*"school" + 0.017*"chang" + 0.017*"help" + 0.013*"coronavirus" + 0.012*"indigen" + 0.012*"feder" + 0.012*"communiti"
Topic: 6
Words: 0.073*"australia" + 0.045*"trump" + 0.024*"donald" + 0.018*"border" + 0.018*"tasmania" + 0.015*"elect" + 0.015*"peopl" + 0.014*"covid" + 0.013*"say" + 0.012*"scott"
Topic: 7
Words: 0.042*"queensland" + 0.034*"victoria" + 0.023*"news" + 0.021*"hous" + 0.021*"bushfir" + 0.014*"time" + 0.012*"west" + 0.012*"price" + 0.011*"farmer" + 0.010*"guilti"
Topic: 8
Words: 0.032*"china" + 0.029*"test" + 0.022*"south" + 0.015*"coronavirus" + 0.013*"north" + 0.012*"rural" + 0.012*"presid" + 0.012*"train" + 0.012*"minist" + 0.011*"talk"
Topic: 9
Words: 0.025*"call" + 0.021*"rise" + 0.020*"victorian" + 0.018*"morrison" + 0.017*"royal" + 0.017*"tasmanian" + 0.017*"claim" + 0.014*"commiss" + 0.011*"town" + 0.010*"million"
Can you distinguish different topics using the words in each topic and their corresponding weights?
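When comparing topics by eye, it can help to turn the printed strings back into (word, weight) pairs. The sketch below does this with a regular expression; parse_topic is a hypothetical helper, and the input is the first five terms of Topic 0 from the output above.

```python
import re

def parse_topic(topic_string):
    """Turn a string such as '0.031*"polic" + 0.028*"case"' into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic_0 = '0.031*"polic" + 0.028*"case" + 0.025*"death" + 0.021*"charg" + 0.018*"court"'
parsed = parse_topic(topic_0)
for word, weight in parsed:
    print(f"{word:<8}{weight:.3f}")
```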
Running LDA using TF-IDF
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))
The output:
Topic: 0 Word: 0.008*"thursday" + 0.008*"liber" + 0.008*"govern" + 0.007*"sexual" + 0.007*"victorian" + 0.006*"alan" + 0.006*"abus" + 0.006*"social" + 0.006*"video" + 0.006*"disabl"
Topic: 1 Word: 0.010*"interview" + 0.009*"michael" + 0.009*"andrew" + 0.009*"coronavirus" + 0.008*"pandem" + 0.008*"lockdown" + 0.007*"domest" + 0.007*"grandstand" + 0.007*"violenc" + 0.007*"univers"
Topic: 2 Word: 0.023*"coronavirus" + 0.022*"covid" + 0.018*"news" + 0.015*"market" + 0.008*"price" + 0.007*"australian" + 0.007*"stori" + 0.007*"rural" + 0.007*"vaccin" + 0.007*"share"
Topic: 3 Word: 0.020*"donald" + 0.015*"drum" + 0.010*"tuesday" + 0.008*"david" + 0.007*"histori" + 0.005*"america" + 0.005*"abbott" + 0.004*"elder" + 0.004*"harvest" + 0.004*"shorten"
Topic: 4 Word: 0.012*"elect" + 0.008*"presid" + 0.007*"say" + 0.006*"govern" + 0.005*"border" + 0.005*"leader" + 0.005*"biden" + 0.005*"labor" + 0.005*"protest" + 0.005*"hong"
Topic: 5 Word: 0.012*"australia" + 0.011*"south" + 0.010*"north" + 0.009*"queensland" + 0.008*"weather" + 0.008*"world" + 0.007*"west" + 0.007*"monday" + 0.007*"storm" + 0.007*"leagu"
Topic: 6 Word: 0.017*"bushfir" + 0.012*"royal" + 0.011*"morrison" + 0.011*"commiss" + 0.010*"climat" + 0.009*"restrict" + 0.009*"wednesday" + 0.008*"coronavirus" + 0.008*"peter" + 0.007*"chang"
Topic: 7 Word: 0.009*"christma" + 0.007*"station" + 0.006*"farm" + 0.006*"plead" + 0.005*"coronavirus" + 0.005*"light" + 0.005*"fund" + 0.005*"plan" + 0.005*"august" + 0.005*"explain"
Topic: 8 Word: 0.034*"trump" + 0.018*"countri" + 0.013*"hour" + 0.011*"health" + 0.010*"care" + 0.010*"friday" + 0.009*"coronavirus" + 0.009*"sport" + 0.008*"mental" + 0.008*"age"
Topic: 9 Word: 0.019*"polic" + 0.016*"charg" + 0.014*"murder" + 0.013*"crash" + 0.013*"woman" + 0.009*"shoot" + 0.009*"court" + 0.009*"death" + 0.009*"alleg" + 0.009*"kill"
Again, can you distinguish different topics using the words in each topic and their corresponding weights?
Performance evaluation by classifying sample document using LDA Bag of Words model
We will check where our test document would be classified.
processed_docs[4310]
['ratepay', 'group', 'want', 'compulsori', 'local', 'govt', 'vote']
for index, score in sorted(lda_model[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))
The output:
Score: 0.4076267182826996
Topic: 0.028*"govern" + 0.019*"health" + 0.019*"state" + 0.018*"school" + 0.017*"chang" + 0.017*"help" + 0.013*"coronavirus" + 0.012*"indigen" + 0.012*"feder" + 0.012*"communiti"
Score: 0.354687362909317
Topic: 0.025*"call" + 0.021*"rise" + 0.020*"victorian" + 0.018*"morrison" + 0.017*"royal" + 0.017*"tasmanian" + 0.017*"claim" + 0.014*"commiss" + 0.011*"town" + 0.010*"million"
Score: 0.15015438199043274
Topic: 0.073*"australia" + 0.045*"trump" + 0.024*"donald" + 0.018*"border" + 0.018*"tasmania" + 0.015*"elect" + 0.015*"peopl" + 0.014*"covid" + 0.013*"say" + 0.012*"scott"
Score: 0.012506603263318539
Topic: 0.041*"coronavirus" + 0.031*"covid" + 0.025*"live" + 0.022*"nation" + 0.021*"coast" + 0.019*"restrict" + 0.014*"water" + 0.013*"plan" + 0.013*"gold" + 0.011*"park"
Score: 0.012504367157816887
Topic: 0.040*"year" + 0.030*"melbourn" + 0.020*"open" + 0.019*"canberra" + 0.015*"accus" + 0.015*"jail" + 0.014*"work" + 0.013*"face" + 0.013*"life" + 0.013*"record"
Score: 0.012504314072430134
Topic: 0.032*"china" + 0.029*"test" + 0.022*"south" + 0.015*"coronavirus" + 0.013*"north" + 0.012*"rural" + 0.012*"presid" + 0.012*"train" + 0.012*"minist" + 0.011*"talk"
Score: 0.01250426471233368
Topic: 0.042*"queensland" + 0.034*"victoria" + 0.023*"news" + 0.021*"hous" + 0.021*"bushfir" + 0.014*"time" + 0.012*"west" + 0.012*"price" + 0.011*"farmer" + 0.010*"guilti"
Score: 0.012503987178206444
Topic: 0.031*"polic" + 0.028*"case" + 0.025*"death" + 0.021*"charg" + 0.018*"court" + 0.017*"murder" + 0.016*"coronavirus" + 0.016*"attack" + 0.015*"alleg" + 0.012*"trial"
Score: 0.012503987178206444
Topic: 0.021*"market" + 0.018*"world" + 0.018*"women" + 0.015*"island" + 0.015*"final" + 0.014*"australian" + 0.013*"return" + 0.011*"fall" + 0.011*"street" + 0.011*"beach"
Score: 0.012503987178206444
Topic: 0.039*"sydney" + 0.021*"crash" + 0.020*"die" + 0.018*"adelaid" + 0.014*"miss" + 0.013*"polic" + 0.011*"break" + 0.011*"driver" + 0.011*"pandem" + 0.010*"search"
Our test document has the highest probability (about 0.41) of belonging to the topic led by “govern”, “health” and “state”, which is a plausible match for a headline about compulsory local government voting.
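Picking the assigned topic programmatically amounts to taking the pair with the largest score. In the sketch below, the three highest scores are copied from the output above, and the topic ids (5, 9 and 6) were matched by comparing the word lists with the topic listing printed earlier; dominant_topic is a hypothetical helper.

```python
# (topic_id, score) pairs for the three highest-scoring topics, copied from the output above.
doc_topics = [(5, 0.4076267182826996), (9, 0.354687362909317), (6, 0.15015438199043274)]

def dominant_topic(topic_scores):
    """Return the (topic_id, score) pair with the highest score."""
    return max(topic_scores, key=lambda pair: pair[1])

best_id, best_score = dominant_topic(doc_topics)
print(best_id, round(best_score, 3))
```

Note that the runner-up topic is not far behind, so it is worth looking at more than just the top score before trusting an assignment.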
Performance evaluation by classifying sample document using LDA TF-IDF model.
for index, score in sorted(lda_model_tfidf[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))
The output:
Score: 0.7066717147827148
Topic: 0.009*"christma" + 0.007*"station" + 0.006*"farm" + 0.006*"plead" + 0.005*"coronavirus" + 0.005*"light" + 0.005*"fund" + 0.005*"plan" + 0.005*"august" + 0.005*"explain"
Score: 0.19328047335147858
Topic: 0.020*"donald" + 0.015*"drum" + 0.010*"tuesday" + 0.008*"david" + 0.007*"histori" + 0.005*"america" + 0.005*"abbott" + 0.004*"elder" + 0.004*"harvest" + 0.004*"shorten"
Score: 0.012508566491305828
Topic: 0.012*"elect" + 0.008*"presid" + 0.007*"say" + 0.006*"govern" + 0.005*"border" + 0.005*"leader" + 0.005*"biden" + 0.005*"labor" + 0.005*"protest" + 0.005*"hong"
Score: 0.012507441453635693
Topic: 0.008*"thursday" + 0.008*"liber" + 0.008*"govern" + 0.007*"sexual" + 0.007*"victorian" + 0.006*"alan" + 0.006*"abus" + 0.006*"social" + 0.006*"video" + 0.006*"disabl"
Score: 0.012506011873483658
Topic: 0.023*"coronavirus" + 0.022*"covid" + 0.018*"news" + 0.015*"market" + 0.008*"price" + 0.007*"australian" + 0.007*"stori" + 0.007*"rural" + 0.007*"vaccin" + 0.007*"share"
Score: 0.012505752965807915
Topic: 0.017*"bushfir" + 0.012*"royal" + 0.011*"morrison" + 0.011*"commiss" + 0.010*"climat" + 0.009*"restrict" + 0.009*"wednesday" + 0.008*"coronavirus" + 0.008*"peter" + 0.007*"chang"
Score: 0.012505417689681053
Topic: 0.034*"trump" + 0.018*"countri" + 0.013*"hour" + 0.011*"health" + 0.010*"care" + 0.010*"friday" + 0.009*"coronavirus" + 0.009*"sport" + 0.008*"mental" + 0.008*"age"
Score: 0.012504937127232552
Topic: 0.012*"australia" + 0.011*"south" + 0.010*"north" + 0.009*"queensland" + 0.008*"weather" + 0.008*"world" + 0.007*"west" + 0.007*"monday" + 0.007*"storm" + 0.007*"leagu"
Score: 0.012504936195909977
Topic: 0.010*"interview" + 0.009*"michael" + 0.009*"andrew" + 0.009*"coronavirus" + 0.008*"pandem" + 0.008*"lockdown" + 0.007*"domest" + 0.007*"grandstand" + 0.007*"violenc" + 0.007*"univers"
Score: 0.012504755519330502
Topic: 0.019*"polic" + 0.016*"charg" + 0.014*"murder" + 0.013*"crash" + 0.013*"woman" + 0.009*"shoot" + 0.009*"court" + 0.009*"death" + 0.009*"alleg" + 0.009*"kill"
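With the TF-IDF model, our test document has the highest probability (about 0.71) of belonging to the topic led by “christma”, “station” and “farm”. The link between that topic and the headline is less obvious here than with the Bag of Words model.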
Testing model on unseen document
unseen_document = 'How a Pentagon deal became an identity crisis for Google'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))
Score: 0.34983524680137634 Topic: 0.031*"polic" + 0.028*"case" + 0.025*"death" + 0.021*"charg" + 0.018*"court"
Score: 0.2172440141439438 Topic: 0.032*"china" + 0.029*"test" + 0.022*"south" + 0.015*"coronavirus" + 0.013*"north"
Score: 0.18334510922431946 Topic: 0.040*"year" + 0.030*"melbourn" + 0.020*"open" + 0.019*"canberra" + 0.015*"accus"
Score: 0.14952439069747925 Topic: 0.042*"queensland" + 0.034*"victoria" + 0.023*"news" + 0.021*"hous" + 0.021*"bushfir"
Score: 0.01667620614171028 Topic: 0.025*"call" + 0.021*"rise" + 0.020*"victorian" + 0.018*"morrison" + 0.017*"royal"
Score: 0.016675885766744614 Topic: 0.021*"market" + 0.018*"world" + 0.018*"women" + 0.015*"island" + 0.015*"final"
Score: 0.016675855964422226 Topic: 0.041*"coronavirus" + 0.031*"covid" + 0.025*"live" + 0.022*"nation" + 0.021*"coast"
Score: 0.01667516678571701 Topic: 0.028*"govern" + 0.019*"health" + 0.019*"state" + 0.018*"school" + 0.017*"chang"
Score: 0.016674060374498367 Topic: 0.039*"sydney" + 0.021*"crash" + 0.020*"die" + 0.018*"adelaid" + 0.014*"miss"
Score: 0.016674060374498367 Topic: 0.073*"australia" + 0.045*"trump" + 0.024*"donald" + 0.018*"border" + 0.018*"tasmania"
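The near-identical low scores in these listings (about 0.0167 here, about 0.0125 for document 4310) have a simple possible explanation. Under a symmetric Dirichlet prior, a topic that attracts none of a document's tokens still keeps the prior's share of the probability mass. The sketch below works through that arithmetic. It rests on two assumptions we have not verified against the trained model: that the prior is 1 / num_topics per topic, and that five tokens of the unseen headline survive preprocessing and dictionary lookup. baseline_score is a hypothetical helper.

```python
def baseline_score(num_topics, num_tokens, alpha=None):
    """Score of a topic that attracts no tokens, under a symmetric Dirichlet prior."""
    if alpha is None:
        alpha = 1.0 / num_topics  # assumed symmetric prior of 1 / num_topics
    # The topic keeps only its prior mass; the denominator is total prior mass plus token count.
    return alpha / (num_topics * alpha + num_tokens)

unseen_floor = baseline_score(num_topics=10, num_tokens=5)    # assumed 5 surviving tokens
doc_4310_floor = baseline_score(num_topics=10, num_tokens=7)  # document 4310 has 7 tokens
print(unseen_floor)
print(doc_4310_floor)
```

Both values land very close to the lowest scores printed in the listings, which suggests those topics received essentially no support from the document.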
Source code can be found publicly on Kaggle.