How to improve K-means clustering with TF-IDF?
Geovane Gomes
on 7 Oct 2024
Commented: Christopher Creutzig
on 22 Oct 2024
Hi all,
I’m currently working on a project where I need to classify company segments based on their activity descriptions.
I’ve implemented K-means clustering using TF-IDF for feature extraction from text data. However, the current clustering results aren’t entirely accurate, especially when it comes to grouping semantically similar segments (e.g., "cars" and "vehicles" are placed into separate clusters). Is it possible to optimise it, or to use another approach rather than TF-IDF?
See cluster 13. More than 50% of the items were assigned to this cluster. I also tried using other distance parameters, but the results didn't improve.
Here is my code:
clear
close
% load and preprocess
d = readtable("segmentos95Translated.xlsx");
t = d.TRANSLATED;
for i = 1:height(t)
    str = t{i};
    splitStr = strsplit(str, 'EXCEPT');
    t{i} = strtrim(splitStr{1});
end
for i = 1:height(t)
    str = t{i};
    splitStr = strsplit(str, 'WITHOUT PREDOMINANCE');
    t{i} = strtrim(splitStr{1});
end
% tokenization
t = lower(t);
t = tokenizedDocument(t);
t = removeStopWords(t);
t = normalizeWords(t);
customStopWords = ["manufactur","activ",",","rental","(",")","*","exempt", ...
    "commerci","repres","agent","trade","product","retail","sale","waiv","special","wholesal"];
t = removeWords(t,customStopWords);
% bag of words and TF-IDF
bag = bagOfWords(t);
tfidfMatrix = tfidf(bag);
X = full(tfidfMatrix);
% kmeans
rng(1)
numClusters = 25; % about 10%
[idx, C, sumd, D] = kmeans(X, numClusters);
d.clusters = idx;
% display results
for i = 1:numClusters
    fprintf('Cluster %d:\n', i);
    disp(d.TRANSLATED(idx == i));
end
sortrows(groupcounts(d,"clusters"),"Percent","descend")
Accepted Answer
Sandeep Mishra
on 8 Oct 2024
Hi Geovane,
I can observe that you are trying to enhance the accuracy of your K-means clustering implementation.
The current implementation using 'TF-IDF' treats every word as an independent feature, so it cannot capture semantic relationships between words. As a result, synonyms and related terms such as "cars" and "vehicles" are treated as completely distinct.
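To make this concrete, here is a minimal sketch in base MATLAB. The three-term vocabulary and the IDF weights are hypothetical values chosen for illustration, not taken from the original data. It shows that two documents with no tokens in common have a cosine similarity of exactly zero in TF-IDF space, however close the words are in meaning.

```matlab
% Hypothetical three-term vocabulary after stemming: ["car" "vehicl" "repair"]
countsCar     = [1 0 0];   % document containing only "car"
countsVehicle = [0 1 0];   % document containing only "vehicl"

% Hypothetical IDF weights. TF-IDF only rescales each column, so any
% positive values lead to the same conclusion.
idf = [1.2 1.5 0.4];
a = countsCar .* idf;
b = countsVehicle .* idf;

% Cosine similarity between the two TF-IDF vectors
cosSim = dot(a, b) / (norm(a) * norm(b));
fprintf('Cosine similarity: %g\n', cosSim);
```

Because the two vectors have no non-zero column in common, their dot product is zero, so K-means has no reason to place them in the same cluster.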
To resolve this, you can use word embeddings such as 'fastText' which represent words in a continuous vector space, capturing semantic meanings.
You can leverage the 'Text Analytics Toolbox Model for fastText English 16 Billion Token Word Embedding' add-on in MATLAB to implement 'fastText' word embedding.
Consider the following implementation:
% Converting tokenized documents to cell array
textData = arrayfun(@(doc) joinWords(doc), t, 'UniformOutput', false);
% Loading fastText word embedding
emb = fastTextWordEmbedding;
% Converting text to embedding
X = zeros(numel(textData), emb.Dimension);
for i = 1:numel(textData)
    words = split(textData{i});
    validWords = words(isVocabularyWord(emb, words));
    if ~isempty(validWords)
        vecs = word2vec(emb, validWords);
        X(i, :) = mean(vecs, 1);
    end
end
[idx, C] = kmeans(X, numClusters);
Refer to the following MATLAB Central File Exchange page to learn more about the ‘Text Analytics Toolbox Model for fastText English 16 Billion Token Word Embedding’ support package: https://www.mathworks.com/matlabcentral/fileexchange/66229-text-analytics-toolbox-model-for-fasttext-english-16-billion-token-word-embedding
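One possible refinement, offered as a sketch rather than a tested recipe for this dataset: with averaged word vectors, direction often matters more than magnitude, so cosine distance may be worth trying in place of the default squared Euclidean distance. Documents with no in-vocabulary words keep an all-zero row in X, and cosine distance is undefined for those, so they are excluded before clustering. The matrix below is synthetic stand-in data so that the snippet runs on its own; in practice X and numClusters would come from the code above.

```matlab
% Synthetic stand-in for the embedding matrix X (hypothetical data):
% two well-separated groups plus two all-zero rows.
rng(1)
X = [randn(20, 5) + 3; randn(20, 5) - 3; zeros(2, 5)];
numClusters = 2;

% Rows that stayed all-zero had no in-vocabulary words. Cosine distance
% is undefined for them, so cluster only the non-zero rows.
hasVector = any(X ~= 0, 2);

% Unclustered documents keep NaN as their cluster index.
idx = nan(size(X, 1), 1);
idx(hasVector) = kmeans(X(hasVector, :), numClusters, ...
    'Distance', 'cosine', 'Replicates', 10);
```

The 'Replicates' option reruns K-means from several random starts and keeps the best solution, which can reduce the chance of one oversized cluster caused by a poor initialisation.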
I hope this helps.
4 comments
Christopher Creutzig
on 22 Oct 2024
Also worth checking out are documentEmbedding and, for a different workflow with “soft clustering,” fitlda.
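As a rough illustration of the fitlda suggestion, the sketch below fits an LDA topic model to a bag-of-words model and reads off each document's topic mixture. The four example documents and the choice of two topics are hypothetical; with the original data, the existing bag variable would be passed instead. It requires Text Analytics Toolbox.

```matlab
% Hypothetical example documents standing in for the activity descriptions
docs = tokenizedDocument([ ...
    "car repair and vehicle maintenance"
    "vehicle parts and car accessories"
    "bakery bread and pastry"
    "pastry shop and bread bakery"]);
bag = bagOfWords(docs);

% Fit an LDA model. The number of topics is an assumption to tune.
rng(1)
numTopics = 2;
mdl = fitlda(bag, numTopics, 'Verbose', 0);

% Each row is one document's mixture over topics ("soft" assignment)
topicMix = mdl.DocumentTopicProbabilities;

% A hard assignment, if needed, is the most probable topic per document
[~, dominantTopic] = max(topicMix, [], 2);
```

Unlike K-means, each document gets a probability for every topic, so a description that spans two segments is not forced into a single group.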