counting pattern frequency in a string

Question

Mark le 22 Déc 2017

0
Lien

Utiliser le lien direct vers cette question

https://fr.mathworks.com/matlabcentral/answers/374238-counting-pattern-frequency-in-a-string

Commenté : Mark le 22 Déc 2017

Looking for ideas on the fastest way to count the number of occurrences of letter patterns in a string. For example, for the test string 'ZXCVBNMZXCVBAS' with a pattern length of 4, it would generate the following table: ZXCV=2, XCVB=2, CVBN=1, etc... The brute force way to do it is to start with an empty list of 4-letter strings, and pull 4-letters blocks from the test string, moving over one letter at a time. For each block, check if already on the list, if so increment the count; if not, add to the end of the list.

The problem with this is that if you have very long test strings (say 10^8 and higher) and move to higher pattern 'word' lengths, say 8 or 10 letters, the number of unique patterns is huge and so the thing slows down as each 8 letter block is compared against a huge list.

Another idea would be to assign each possible pattern with an index: AAAA=1, AAAB=2, so that you can just calculate the index using base 26 conversion for any given string, and increment the value at that index of a master vector. (eg AABA=27, etc., so increment vector(27) by one). But again, with longer pattern lengths, there is not enough memory to store a vector with that many indices, covering all the words (26^8 or 26^10, etc.).

So, is there some way to do this efficiently when the strings and pattern lengths get big? Something with sparsity, or smart sorting, or indexing?

Thank you

2 commentaires
Afficher AucuneMasquer Aucune

Jos (10584) le 22 Déc 2017

Basic questions first: do you really need all patterns? If so, why?

Mark le 22 Déc 2017

Modifié(e) : Mark le 22 Déc 2017

I don't need all patterns, mostly because the number of possible 10-letter sequences in a test string of 10^6 letters, is just under 10^6, while the number of 10-letter patterns is 26^10. But there is no a priori list of possible patterns for a given test string, though one could quickly be generated from the string of course. I'm not sure if that answers your question? To be clear, I'm only interested in the patterns that appear in the test string, not the ones that don't. So in that sense, I don't need all patterns.

Connectez-vous pour commenter.

Connectez-vous pour répondre à cette question.

Answer 1

Stephen23 le 22 Déc 2017

1
Lien

Utiliser le lien direct vers cette réponse

https://fr.mathworks.com/matlabcentral/answers/374238-counting-pattern-frequency-in-a-string#answer_297455

Modifié(e) : Stephen23 le 22 Déc 2017

Ouvrir dans MATLAB Online

Hummm, interesting problem. I would approach this slightly differently, taking advantage of two features to try and achieve some reasonably efficient solution:

The requested substring length num is small and fixed.
MATLAB'S inbuilt functions are highly optimized and quite efficient.

Given those points, I would simply loop over the string num times: 1st iteration start from position 1, 2nd iteration starting from position 2, etc. Within each iteration you can reshape the largest possible subset of the main string into a matrix and easily find the unique rows. This intermediate step slightly reduces memory requirements and builds up a "semi-unique" set of all substrings of the requested length. Also collect their indices and use histc to count the substring occurrences for that iteration. Then after the loop use efficient unique again to get the final list of unique substrings, together with accumarray to add the histc output counts together.

% Fake Data:
%str = char(randi([65,90],1,1e7));
str = 'ZXCVBNMZXCVBAS';
num = 4;
tot = numel(str);
% Loop over <num> starting indices:
tmp = cell(1,num);
idh = cell(1,num);
for k = 1:num
  vec = k-1:num:tot;
  [tmp{k},~,idt] = unique(reshape(str(k:vec(end)),num,[]).','rows');
  idh{k} = histc(idt,unique(idt)); % save the count
end
% Count unique rows:
[out,~,ido] = unique(vertcat(tmp{:}),'rows');
cnt = accumarray(ido,vertcat(idh{:}));

For that small sample string it returns:

>> out
out =
BNMZ
CVBA
CVBN
MZXC
NMZX
VBAS
VBNM
XCVB
ZXCV
>> cnt
cnt =
   1
   1
   1
   1
   1
   1
   1
   2
   2

which seems to be correct. For the longer string (1e7 elements) it took my laptop

Elapsed time is 9.031 seconds.

to run, and I would imagine it scales roughly linearly with num. How does that compare to the methods you are using now?

1 commentaire
Afficher -1 commentaires plus anciensMasquer -1 commentaires plus anciens

Mark le 22 Déc 2017

This is excellent, supremely fast. Running brute force for num=3 on something around a 10^7 length test string was taking upwards of 10 mins when I stopped the script, whereas num=5 or more takes 12 seconds on your version. Perfect! Thank you

Connectez-vous pour commenter.

counting pattern frequency in a string

2 commentaires
Afficher AucuneMasquer Aucune

Réponse acceptée

1 commentaire
Afficher -1 commentaires plus anciensMasquer -1 commentaires plus anciens

Plus de réponses (0)

Voir également

Catégories

Tags

Produits

Community Treasure Hunt

counting pattern frequency in a string

2 commentaires Afficher AucuneMasquer Aucune

Réponse acceptée

1 commentaire Afficher -1 commentaires plus anciensMasquer -1 commentaires plus anciens

Plus de réponses (0)

Voir également

Catégories

Tags

Produits

Community Treasure Hunt

2 commentaires
Afficher AucuneMasquer Aucune

1 commentaire
Afficher -1 commentaires plus anciensMasquer -1 commentaires plus anciens