Why does an LDA model generate a topic that includes stop words even though these words don't exist in the data?

Jack on 27 Sep 2021
Answered: the cyclist on 27 Sep 2021
Hello and good day to you.
I am doing topic modeling with Latent Dirichlet Allocation (LDA), which requires preprocessing (cleaning) the data first. I therefore carried out the preprocessing steps in the following order:
However, when the LDA model generates its topics (a topic in LDA being a collection of probably related words), one topic contains stop words even though they were removed from the data. I also checked the data, and there is not a single stop word in it. Why are these stop words still there, showing up as one of the resulting topics, although they do not even exist in the vocabulary of the model?
Please help!

Answers (1)

the cyclist on 27 Sep 2021
I don't think it is possible to answer this question well without seeing the data.
I think it is extraordinarily unlikely that the stop word does not appear in the data, if it shows up in a topic. Perhaps you are somehow accidentally incorporating another corpus, besides your data? Another possibility is that a stop word (e.g. "run") does not appear in your data, but a related word (e.g. "running") does appear, and there is an algorithm that is doing trimming of words to their root words.
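Both possibilities can be checked by comparing the words reported for the topic against the stop list and against the model's vocabulary. The following is only a language-neutral sketch in Python, not the asker's code: `topic_words`, `vocabulary`, `stop_words`, and the toy `crude_stem` function are hypothetical stand-ins for whatever the real pipeline produces.

```python
def crude_stem(word):
    """Toy suffix stripper (an assumption, standing in for a real stemmer)."""
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def diagnose(topic_words, vocabulary, stop_words):
    """Map each stop word found in a topic to the vocabulary words that could explain it.

    An empty list means neither the word nor anything that stems to it is in
    the vocabulary, which points to a source outside the cleaned data.
    """
    vocab = set(vocabulary)
    stops = set(stop_words)
    report = {}
    for word in topic_words:
        if word in stops:
            report[word] = sorted(
                v for v in vocab if v == word or crude_stem(v) == word
            )
    return report


# Hypothetical example, reusing "run" from the answer as the stop word.
stop_words = ["the", "run", "is"]
vocabulary = ["running", "topic", "model"]
topic_words = ["run", "topic", "the"]
print(diagnose(topic_words, vocabulary, stop_words))
```

In this made-up example, "run" is traced back to "running", while "the" has no source in the vocabulary at all, which would suggest that another corpus or word list is being picked up somewhere.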
One thing you could try, to debug this weirdness, is to run your code on half your data, to see if these stop words still show up. If they do, split that half again; if they don't, run it on the other half. Keep slicing up the data, and maybe you can narrow down to see exactly which part of your corpus is causing the "error".
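The halving procedure can be automated. The sketch below is a minimal illustration in Python under stated assumptions: `shows_stop_words` is a hypothetical callable that would, in practice, rerun the whole preprocessing and LDA fit on the given subset and report whether any stop word appears in a topic. Here it is replaced by a trivial check so the example runs on its own. Because LDA fitting is stochastic, a real check would probably need a fixed random seed to give repeatable answers.

```python
def find_offending_docs(docs, shows_stop_words):
    """Recursively halve docs to isolate the documents that trigger the problem."""
    if not shows_stop_words(docs):
        return []
    if len(docs) == 1:
        return list(docs)
    mid = len(docs) // 2
    left = find_offending_docs(docs[:mid], shows_stop_words)
    right = find_offending_docs(docs[mid:], shows_stop_words)
    # If neither half reproduces the problem alone, report the whole slice.
    return (left + right) or list(docs)


# Hypothetical stand-in for "fit the model on this subset and inspect the topics".
def contains_the(subset):
    return any("the" in doc.split() for doc in subset)


docs = ["topic model", "the model", "latent topics", "word counts"]
print(find_offending_docs(docs, contains_the))
```

With a real model fit as the check, each call is expensive, so the appeal of halving is that a single offending document is found after roughly log2(N) rounds of refitting rather than N.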
