Count the number of times a word begins with "co" in a text using Text Analytics Toolbox

Dear community,
I have a PDF with news headlines. For each headline, I need to count the total number of words, the number of words starting with "co", and the number of times the word "price" appears. I do not have much experience with the Text Analytics Toolbox in MATLAB. As far as I can see, "tokenizedDocument" already gives the total number of words (or tokens) per headline, and "context" finds the occurrences of a specific word. However, I do not know how to make MATLAB look for words starting with "co". Also, how can I display this information in a table?
I have attached my PDF and my code.
I really appreciate any help you can provide!
filename = "Factiva_sample_headlines_1.pdf";
str = extractFileText(filename);
textData = split(str,[newline newline]); % split the text into separate news items
textData = textData(~contains(textData,"Page")); % drop entries that contain page numbering (textData is a string array, so cellfun is not needed)
cleanedDocuments = tokenizedDocument(textData); % create an array of tokenized documents
  12 comments
Angelavtc on 22 Apr 2022
Oh la la, it seems more complex than expected :( Perhaps I should move to other software 😭. In any case, thank you very much @Stephen!
Angelavtc on 23 Apr 2022
@Stephen Sorry to bother you again, but I have managed to convert the file to HTML format (https://drive.google.com/file/d/1Z5bW98_gWohr2appS8zKgxCC1_mlLpzc/view?usp=sharing). Now the problem is that when I use:
filename = "Factiva_1.html";
str = extractFileText(filename);
only one article gets loaded. Any idea how to make MATLAB read all of them and classify them by title, date and body?
Thank you very much!


Accepted Answer

Jonas on 21 Apr 2022
Edited: Jonas on 21 Apr 2022
Are you looking for something like this example, applied to your textData?
a = {'cotrol', 'alcotro', 'conect', 'trial', 'co'};
cellfun(@(in) strncmp(in,'co',2), a) % strncmp avoids an indexing error on words shorter than two characters
ans =
1×5 logical array
1 0 1 0 1
You can sum that array to get the total number of words starting with "co".
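Applied to the original question, the same idea could be sketched per headline roughly as follows. This is only a sketch under assumptions: the three headlines are made-up stand-ins for textData, the variable and column names are arbitrary, and lowercasing plus erasePunctuation is one possible choice so that "Co..." and "price," are counted as well.

```matlab
% Hypothetical sample headlines standing in for textData
textData = [ ...
    "Copper price climbs as companies cut output"
    "Oil price steady"
    "Markets await central bank decision"];

% Lowercase and drop punctuation before tokenizing (an assumption, see above)
documents = tokenizedDocument(erasePunctuation(lower(textData)));

numDocs  = numel(documents);
numWords = doclength(documents);       % number of tokens per headline
numWords = numWords(:);                % force a column for the table
numCo    = zeros(numDocs,1);
numPrice = zeros(numDocs,1);

for k = 1:numDocs
    words = string(documents(k));      % tokens of headline k as a string array
    numCo(k)    = sum(startsWith(words,"co"));   % words beginning with "co"
    numPrice(k) = sum(words == "price");         % exact matches of "price"
end

% Collect everything in one table, one row per headline
results = table(textData, numWords, numCo, numPrice, ...
    'VariableNames', {'Headline','NumWords','NumCo','NumPrice'})
```

Without erasePunctuation, punctuation marks become tokens of their own and are included in the count returned by doclength.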
  8 comments
Walter Roberson on 25 Apr 2022
See also contains() and patternarray
(but I am old-fashioned and find regexp easier to work with... but it can get tricky!)
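As a rough illustration of the regexp route on the raw text (the headline below is made up, and the variable names are arbitrary):

```matlab
% Hypothetical headline; "economists" contains "co" but does not start with it
headline = "Copper price climbs as companies cut output, economists say";

% \< anchors the match at the start of a word; \w* consumes the rest of the word
coWords = regexp(headline, '\<co\w*', 'match', 'ignorecase');
numCo = numel(coWords);

% Whole-word match for "price" (\> anchors at the end of a word)
numPrice = numel(regexp(headline, '\<price\>', 'match', 'ignorecase'));

% contains() only tests whether "co" occurs anywhere in the text
hasCoAnywhere = contains(headline, "co", 'IgnoreCase', true);
```

Note that contains() on its own would also be true for a headline whose only "co" sits inside a word such as "economists", which is why the word-start anchor matters here.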
