MATLAB Answers

John
0

measuring term frequency of words

Asked by John
on 3 Dec 2017
Latest activity Edited by Christopher Creutzig on 15 Oct 2018
I have been able to obtain a bag of words from a document. Please, how can I interact with the bag of words array, so I may make calculations on the frequency of terms within each document?
str = extractFileText('file.txt');
paras = split(str,"</P>");
paras(end) = []; % the split left an empty last entry
paras = extractAfter(paras,">") % Drop the "<P ID=n>" from the beginning
tdoc = tokenizedDocument(lower(paras));
bag = bagOfWords(tdoc)
I have this result:
For clarification, I believe the columns are the terms, while the rows are the documents. Am I right?
I loaded 2 txt files (1 document set, 1 query set) I want to evaluate similarity between each document and each query by Cosine similarity, tf-idf or whatsoever means.

  0 Comments

Sign in to comment.

1 Answer

Answer by Christopher Creutzig on 4 Dec 2017
 Accepted Answer

See the bagOfWords documentation. E.g., you can use the tfidf function, you can extract bag.Counts and use pdist(bag.Counts,'cosine'), you can use fitlsa for what is essentially a principal component analysis for dimensionality reduction, or fitlda to train/fit a topic model.

  2 Comments

John
on 7 Dec 2017
I need to compute the similarity between each query loaded in QueTF and each document in DocTF.
How may I do that? QueTF and DocTF are both bag of words.
What is the significance of pdist2?
I am having problems applying this to the bag of words.
Cosss = pdist2(QueTF,DocTF,'cosine');
John, you need to encode both sets of documents with the same bag-of-words model. (That model not only contains counts, it also has a specific mapping which word to put into which position, and if you use tfidf, you need to use the same idf factors for consistency within your analysis.) Something like this:
corpus = tokenizedDocument(corpusData);
bow = bagOfWords(corpus);
query = tokenizedDocument(queryData);
queryVectors = encode(bow,query);
dists = pdist2(queryVectors,bow.Counts,'cosine');

Sign in to comment.