Find words common across multiple string cells

Tejas on 26 Oct 2020
Commented: Tejas on 27 Oct 2020
I have a cell array in which each cell holds a string array of a different length; each string array is a column of single words. Something like this:
words{1,1} = ["sphere";"geometry";"number";"algebra";"function"];
words{1,2} = ["geometry";"equation";"nonlinear";"partial";"function"];
words{1,3} = ["number";"derivative";"function";"topology";"equation";"theory"];
words{1,4} = ["equation";"integral";"geometry";"function";"singular"];
I want to find the words that appear in at least a specified number of cells. For example, if I ask for words common to at least 4 cells, I should get back
common_words = "function";
If I want words common in at least 3 cells, I should get back
common_words = ["geometry";"function";"equation"];
I can use intersect in a loop (however inefficient that might be) if the words are required to be common in all the cells. However, how do I go about finding intersections of a specific number of cells? As per my understanding, that would require combinations, and it would increase computation time exponentially with increasing cells. Is there an efficient way to do this or would I have to take combinations?
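For reference, the all-cells case mentioned above can be sketched as a simple loop over intersect. This is only an illustration using the example data from the question; the variable name common_all is hypothetical, and string arrays are assumed to be available (R2016b or later).

```matlab
% Example data from the question (string column vectors of different lengths).
words = {["sphere";"geometry";"number";"algebra";"function"], ...
         ["geometry";"equation";"nonlinear";"partial";"function"], ...
         ["number";"derivative";"function";"topology";"equation";"theory"], ...
         ["equation";"integral";"geometry";"function";"singular"]};

% Start with the first cell and intersect with each later cell in turn.
% Only words present in every cell survive all the intersections.
common_all = words{1};   % hypothetical name, not from the original post
for k = 2:numel(words)
    common_all = intersect(common_all, words{k});
end

disp(common_all)
```

This handles only the "common to every cell" case; it does not extend directly to "at least n of the cells" without enumerating combinations, which is the difficulty the question raises.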
4 comments
Stephen23 on 26 Oct 2020
Is the cell array or are the strings particularly large? Would there be any memory issues if they were concatenated or merged together?
Tejas on 26 Oct 2020
There are 40 cells in the array, and the largest string vector is 3238x1. I can reduce this by removing repeated words within each string vector, but I think the maximum length would still be about 3000. The mean vector length across all cells is in fact around 2000, since the initial cells have smaller string vectors. If it helps, I've attached the file containing these strings.
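With the figures quoted above (40 cells, about 2000 words each on average), concatenating everything would give a single vector of roughly 40 × 2000 = 80,000 words. A minimal sketch for checking these sizes is shown below; it uses the small example from the question because the attached file is not reproduced here, and the variable names len and total are hypothetical.

```matlab
% Example data from the question; the real data would have 40 cells.
words = {["sphere";"geometry";"number";"algebra";"function"], ...
         ["geometry";"equation";"nonlinear";"partial";"function"], ...
         ["number";"derivative";"function";"topology";"equation";"theory"], ...
         ["equation";"integral";"geometry";"function";"singular"]};

% Remove repeated words within each cell (unique also sorts each vector).
words = cellfun(@unique, words, 'UniformOutput', false);

% Number of words in each cell, and the size of the concatenated vector.
len   = cellfun(@numel, words);   % hypothetical name
total = sum(len);                 % hypothetical name

fprintf('cells: %d, longest: %d, mean: %.1f, total: %d\n', ...
        numel(words), max(len), mean(len), total);
```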


Accepted Answer

Stephen23 on 27 Oct 2020
Edited: Stephen23 on 27 Oct 2020
My ancient MATLAB version does not support string arrays, so I used cell arrays of character vectors instead, but I would expect the same code to work for string arrays as well. Approach: take the unique words in each cell, concatenate them all, then count how often each word occurs using a histogram function:
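% Example data as cell arrays of character vectors
words{1,1} = {'sphere';'geometry';'number';'algebra';'function'};
words{1,2} = {'geometry';'equation';'nonlinear';'partial';'function'};
words{1,3} = {'number';'derivative';'function';'topology';'equation';'theory'};
words{1,4} = {'equation';'integral';'geometry';'function';'singular'};
% Remove duplicates within each cell so that a word counts once per cell
tmp = cellfun(@unique,words,'UniformOutput',false);
% Stack all cells into one long column
tmp = vertcat(tmp{:});
% idx maps each entry of tmp to its position in the unique list uni
[uni,~,idx] = unique(tmp);
% Count how many cells contain each unique word
cnt = histc(idx,1:max(idx));
% Keep the words that appear in at least 3 cells
out = uni(cnt>=3)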
Or as a function:
>> fun = @(n) uni(cnt>=n);
>> fun(4)
ans =
'function'
>> fun(3)
ans =
'equation'
'function'
'geometry'
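The answer above was written for cell arrays of character vectors. A minimal sketch of the same idea applied directly to string arrays follows, assuming R2016b or later. It swaps in accumarray for the counting step (histc is no longer recommended in recent releases); that substitution, and the names allWords and commonIn, are illustrative choices and not part of the original answer.

```matlab
% Example data from the question, as string column vectors.
words = {["sphere";"geometry";"number";"algebra";"function"], ...
         ["geometry";"equation";"nonlinear";"partial";"function"], ...
         ["number";"derivative";"function";"topology";"equation";"theory"], ...
         ["equation";"integral";"geometry";"function";"singular"]};

% Deduplicate within each cell so a word counts at most once per cell.
tmp = cellfun(@unique, words, 'UniformOutput', false);

% Stack everything into one long string column.
allWords = vertcat(tmp{:});          % hypothetical name

% uni holds the sorted unique words; idx maps each entry of allWords to uni.
[uni, ~, idx] = unique(allWords);

% Occurrences per unique word, which equals the number of cells containing it.
cnt = accumarray(idx, 1);

% Hypothetical helper: words that appear in at least n cells.
commonIn = @(n) uni(cnt >= n);

disp(commonIn(3))
disp(commonIn(4))
```

Because unique sorts its output, the result comes back in alphabetical order rather than in the order listed in the question. The cost is dominated by sorting the concatenated vector, so no combinations of cells need to be enumerated.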
1 comment
Tejas on 27 Oct 2020
This works for me! Thank you.

Release: R2020a
