Most frequent word in text
How can I print every word in a text together with how many times it appears, one word per line, ordered from most to least frequent?
4 comments
Walter Roberson
on 27 Nov 2019
Edited: Walter Roberson
on 27 Nov 2019
"No" ? So "can't" is not a "word", and "John's" is not a word, and "self-expression" is not a word? If the file happened to contain
John's self-expression can't runh 7 tlick.
then what would the desired output be?
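To make the question concrete, here is a minimal sketch of how three possible definitions of a "word" treat that sentence. The regular expressions are illustrative assumptions, not a rule taken from the original question.

```matlab
% Sketch: three hypothetical definitions of a "word" applied to the same text.
str = 'John''s self-expression can''t runh 7 tlick.';

% 1) Anything separated by whitespace (keeps "7" and the trailing period).
bySpace = strsplit(str);

% 2) Runs of letters only (splits "John's" into "John" and "s").
lettersOnly = regexp(str, '[A-Za-z]+', 'match');

% 3) Letters, allowing apostrophes or hyphens inside a word (assumed rule).
withInner = regexp(str, '[A-Za-z]+(?:[''-][A-Za-z]+)*', 'match');

fprintf('%d, %d and %d tokens\n', ...
    numel(bySpace), numel(lettersOnly), numel(withInner));
```

Each definition yields a different word list and therefore different counts, so the rule has to be settled before counting.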
Answers (1)
Image Analyst
on 27 Nov 2019
Try this:
str = '123 zxy abc def abc def abc last word';
% str = fileread(fileName); % Read in text from disk file.
words = strsplit(str); % Split the text at whitespace into a cell array of words.
uniqueWords = unique(words) % Unique words in sorted order (no semicolon, so it is displayed).
numUniqueWords = length(uniqueWords)
wordCounts = zeros(numUniqueWords, 1);
for k = 1 : numUniqueWords
    thisWord = uniqueWords(k);
    indexes = ismember(words, thisWord); % True wherever this word occurs.
    wordCounts(k) = sum(indexes); % Number of occurrences of this word.
end
% Show results in command window
wordCounts
Do you have the Text Analytics Toolbox? There are probably functions in that toolbox to get a histogram of words easier than this.
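The question also asks for the output ordered from most to least frequent, one word per line. A minimal sketch of those remaining steps follows. It goes beyond the code above and is only one possible approach: it uses the same sample string, swaps in accumarray for the counting loop, and assumes words are separated by whitespace.

```matlab
% Sketch: count words, then print them from most to least frequent.
str = '123 zxy abc def abc def abc last word';  % same sample text as above
words = strsplit(strtrim(str));                 % trim so no empty word appears at the ends
[uniqueWords, ~, idx] = unique(words);          % idx maps each word to its unique entry
wordCounts = accumarray(idx(:), 1);             % occurrences of each unique word
[sortedCounts, order] = sort(wordCounts, 'descend');
sortedWords = uniqueWords(order);

% One word per line, followed by its count.
for k = 1 : numel(sortedWords)
    fprintf('%s %d\n', sortedWords{k}, sortedCounts(k));
end
```

For this sample the first two lines printed are "abc 3" and "def 2"; the remaining words each appear once.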
3 comments
Walter Roberson
on 27 Nov 2019
If you have a cell array of character vectors containing the words, then you can use
randperm(number_of_words, number_to_choose_randomly)
to get that many distinct random indices; indexing the cell array with them gives a cell array of character vectors holding that many randomly chosen words. After that, your task is reduced to displaying them, for example with
fprintf('%s\n', TheCellArray{:});
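Put together, the two steps might look like the following minimal sketch. The word list and the number of words to pick are hypothetical values chosen for illustration.

```matlab
% Sketch: pick a few distinct words at random and print one per line.
words = {'zxy', 'abc', 'def', 'last', 'word'};  % hypothetical cell array of words
numToChoose = 3;                                % hypothetical number of words to pick
pick = randperm(numel(words), numToChoose);     % distinct random indices into words
chosenWords = words(pick);                      % cell array of the chosen words
fprintf('%s\n', chosenWords{:});
```

Because randperm returns indices without repetition, no position in the list is picked twice.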
Image Analyst
on 27 Nov 2019
Roger, you might find Talk To Transformer fun. It will generate sentences using a neural network. So even though it generates gibberish, it's not just random words. The grammar is right, with nouns, adjectives, etc. in the right places and the sentence structure right. For example, when I typed in "I like to use MATLAB Answers.", this is how it completed the paragraph:
"I like to use MATLAB Answers. There's one new way to run a simulation if I have time, and that's to run the Model of a Power Grapher experiment with a mesh that's made of a grid that covers the corresponding coordinates. The reason for this is that the Lattice Proximal layer doesn't cover each coordinate perfectly, meaning that each layer overlaps some areas, which introduces a kind of noise to the output image. My current theory is that the noise causes the software not to converge as well. Unfortunately, I don't have the equipment."
I've seen one professor feed the complete works of Shakespeare into a network. After the first epoch the output was just random letters; then, after a few hundred more, it was breaking them into words, then sentences. After even more epochs it was getting the grammar right. With more and more epochs the text became more reasonable and sounded less like gibberish. He thinks that if he trained it for weeks, it might produce something that sounded very reasonable.