counting pattern frequency in a string
Afficher commentaires plus anciens
Looking for ideas on the fastest way to count the number of occurrences of letter patterns in a string. For example, for the test string 'ZXCVBNMZXCVBAS' with a pattern length of 4, it would generate the following table: ZXCV=2, XCVB=2, CVBN=1, etc... The brute force way to do it is to start with an empty list of 4-letter strings, and pull 4-letters blocks from the test string, moving over one letter at a time. For each block, check if already on the list, if so increment the count; if not, add to the end of the list.
The problem with this is that if you have very long test strings (say 10^8 and higher) and move to higher pattern 'word' lengths, say 8 or 10 letters, the number of unique patterns is huge and so the thing slows down as each 8 letter block is compared against a huge list.
Another idea would be to assign each possible pattern with an index: AAAA=1, AAAB=2, so that you can just calculate the index using base 26 conversion for any given string, and increment the value at that index of a master vector. (eg AABA=27, etc., so increment vector(27) by one). But again, with longer pattern lengths, there is not enough memory to store a vector with that many indices, covering all the words (26^8 or 26^10, etc.).
So, is there some way to do this efficiently when the strings and pattern lengths get big? Something with sparsity, or smart sorting, or indexing?
Thank you
2 commentaires
Jos (10584)
le 22 Déc 2017
Basic questions first: do you really need all patterns? If so, why?
Réponse acceptée
Plus de réponses (0)
Catégories
En savoir plus sur Characters and Strings dans Centre d'aide et File Exchange
Produits
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!