STTM approach (Gibbs Sampling Dirichlet Mixture Model or GSDMM)

admin 1/18/2021 07:04:00 pm python

One of the most popular topic modeling approaches is Latent Dirichlet Allocation (LDA), a generative probabilistic model that uncovers the latent variables governing the semantics of a document; these variables represent abstract topics. A typical use of LDA (and of topic modeling in general) is to apply it to a collection of news articles to identify common themes or topics such as science, politics, or finance. However, one shortcoming of LDA is that it doesn’t work well with shorter texts such as tweets, because a short text offers very few word co-occurrences to learn from. This is where more recent short text topic modeling (STTM) approaches, some of which build upon LDA, come in handy and perform better!

This series of posts is designed to show and explain how to use Python to apply a specific STTM approach, the Gibbs Sampling Dirichlet Mixture Model (GSDMM), to health tweets from Twitter. It combines data scraping and cleaning, programming, data visualization, and machine learning.

Part 1: Scraping Tweets From Twitter

Part 2: Cleaning and Preprocessing Tweets

Part 3: Applying Short Text Topic Modeling

Part 4: Visualize Topic Modeling Results
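As a rough preview of what Part 3 works toward, here is a minimal pure-Python sketch of the GSDMM idea: every tweet is assigned to exactly one cluster (topic), and a collapsed Gibbs sampler repeatedly reassigns each tweet according to how many tweets a cluster already holds and how well its words match the tweet. This is not the code used in the series. The function name `gsdmm`, its parameters (`K`, `alpha`, `beta`, `n_iters`, `seed`), and the toy tokenized tweets are all assumptions made for illustration.

```python
import random
from collections import Counter


def gsdmm(docs, K=8, alpha=0.1, beta=0.1, n_iters=15, seed=0):
    """Illustrative GSDMM sketch: returns one cluster label per document.

    docs is a list of token lists. K is the maximum number of clusters.
    alpha and beta are the Dirichlet priors (hypothetical defaults).
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    m_z = [0] * K                         # documents per cluster
    n_z = [0] * K                         # total words per cluster
    n_zw = [Counter() for _ in range(K)]  # per-cluster word counts

    # Start from a random assignment of documents to clusters.
    labels = []
    for doc in docs:
        z = rng.randrange(K)
        labels.append(z)
        m_z[z] += 1
        n_z[z] += len(doc)
        n_zw[z].update(doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            # Take the document out of its current cluster.
            z = labels[d]
            m_z[z] -= 1
            n_z[z] -= len(doc)
            n_zw[z].subtract(doc)

            # Unnormalized probability of each cluster for this document.
            # Plain floats are fine for tweet-length texts; longer texts
            # would need log-space arithmetic to avoid underflow.
            doc_counts = Counter(doc)
            weights = []
            for k in range(K):
                num = 1.0
                for w, c in doc_counts.items():
                    for j in range(c):
                        num *= n_zw[k][w] + beta + j
                den = 1.0
                for i in range(len(doc)):
                    den *= n_z[k] + vocab_size * beta + i
                weights.append((m_z[k] + alpha) * num / den)

            # Sample a new cluster and put the document back.
            z = rng.choices(range(K), weights=weights)[0]
            labels[d] = z
            m_z[z] += 1
            n_z[z] += len(doc)
            n_zw[z].update(doc)

    return labels


# Toy, made-up tokenized tweets (not real data).
tweets = [
    ["flu", "vaccine", "clinic"],
    ["flu", "shot", "vaccine"],
    ["diet", "sugar", "weight"],
    ["sugar", "diet", "exercise"],
]
print(gsdmm(tweets, K=3, n_iters=10, seed=1))
```

Because each tweet gets a single label rather than a mixture of topics, the model needs far less evidence per document than LDA does. `K` is only an upper bound: clusters that end up empty simply go unused.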

https://towardsai.net/p/programming/tweet-topic-modeling-part-2-cleaning-and-preprocessing-tweets