How to extract a list of unique words from a set of one-row strings

Harrison on 14 Nov 2024 at 0:58
Commented: Harrison on 15 Nov 2024 at 16:56
Basically I have a set of 11 strings of words, and each string has no repeating words, but I need a list of every unique word in all 11 strings.
I've found that this works for one string at a time, but I can't get a list for all 11 strings this way.
A{1} = updatedDocuments(1,1)
B{1} = strjoin(unique(strtrim(strsplit(A{1}, ',')))', '')
Is it possible to index A{1} as updatedDocuments(1:11,1) or do something similar?

Accepted Answer

Madheswaran on 14 Nov 2024 at 9:32
Edited: Madheswaran on 15 Nov 2024 at 5:17
I am assuming the following:
  • 'updatedDocuments' is an array of 'tokenizedDocument'
  • Each document contains text that is comma-separated and doesn't end with a comma
To get the unique words from the entire set of strings, you can use the following approach:
% Remove the comma tokens from the documents if you don't want "," to be
% included in 'uniqueWords'
updatedDocuments = removeWords(updatedDocuments, ",");
uniqueWords = updatedDocuments.Vocabulary;
If 'updatedDocuments' is a cell array of character vectors, you can use the following approach:
allWords = strjoin(updatedDocuments(1:11,1), ','); % Join all documents into one comma-separated char vector
allWords = strtrim(strsplit(allWords, ',')); % Split with comma as delimiter and trim whitespace
uniqueWords = unique(allWords); % unique words (1 x n cell where n is the number of unique words)
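As a minimal sketch of the cell-array approach, here is a small worked example. The three sample documents are made up for illustration and are not the asker's data; the sketch joins the documents with a comma directly, so no empty word is produced at the end.

```matlab
% Hypothetical sample data: a 3x1 cell array of comma-separated char vectors
updatedDocuments = {'apple, banana, cherry'; ...
                    'banana, date'; ...
                    'cherry, apple, fig'};

% Join every document into one char vector, using a comma as the separator
allWords = strjoin(updatedDocuments(:,1)', ',');

% Split on commas and trim surrounding whitespace from each word
allWords = strtrim(strsplit(allWords, ','));

% unique returns the distinct words in sorted order (1x5 cell array here)
uniqueWords = unique(allWords)
```

For this sample input, uniqueWords contains 'apple', 'banana', 'cherry', 'date' and 'fig'.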
For more information, refer to the following documentation pages:
  1. https://mathworks.com/help/textanalytics/ref/tokenizeddocument.html
  2. https://mathworks.com/help/matlab/ref/double.unique.html
Hope this helps!
  3 Comments
Madheswaran on 15 Nov 2024 at 5:18
That is because I assumed 'updatedDocuments' to be a cell array of character vectors. If 'updatedDocuments' is an array of 'tokenizedDocument', resolving this issue is straightforward. I have updated the answer to include a solution for the 'tokenizedDocument' case, in addition to the existing explanation.
Let me know if that helps!
Harrison on 15 Nov 2024 at 16:56
That's exactly right! Thank you!!


More Answers (1)

Paul on 14 Nov 2024 at 1:09
If UpdatedDocuments is a 1D cell array of chars ...
UpdatedDocuments{1} = 'one,two,three,one';
UpdatedDocuments{2} = 'one,two,three,two';
UpdatedDocuments{3} = 'one,two,three,three';
result = cellfun(@(S) strjoin(unique(strtrim(strsplit(S, ','))),','),UpdatedDocuments,'Uni',false)
result = 1x3 cell array
{'one,three,two'} {'one,three,two'} {'one,three,two'}
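The call above removes duplicates within each document separately. If a single list across all documents is wanted instead, one possible sketch (with made-up sample data, and 'stable' chosen here only to preserve first-appearance order) is:

```matlab
% Hypothetical sample data: a 1x3 cell array of comma-separated char vectors
UpdatedDocuments = {'one,two,three,one', 'one,two,four,two', 'five,two,three,three'};

% Split each document on commas and trim; each cell of parts holds a 1xN cell of words
parts = cellfun(@(S) strtrim(strsplit(S, ',')), UpdatedDocuments, 'UniformOutput', false);

% Concatenate the per-document word lists into one 1xM cell array
allWords = [parts{:}];

% 'stable' keeps the words in order of first appearance instead of sorting them
uniqueWords = unique(allWords, 'stable')
```

Dropping the 'stable' option returns the same words in sorted order.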
  1 Comment
Paul on 15 Nov 2024 at 1:06
The Vocabulary property of tokenizedDocument returns the unique words in the array
documents = tokenizedDocument([
"an example of a short sentence an example of a short sentence "
"a second short sentence a second short sentence"]);
documents
documents =
  2x1 tokenizedDocument:

    12 tokens: an example of a short sentence an example of a short sentence
     8 tokens: a second short sentence a second short sentence
documents.Vocabulary
ans = 1x7 string array
"an" "example" "of" "a" "short" "sentence" "second"
