Regular expressions on uint8 or single byte characters

Martin Hoecker on 25 Aug 2013
I have a 200 MB text file encoded in UTF-8. My maximum array size is around 350 MB, so I can safely read it in with fopen followed by fread(fid,'*uint8'). To use regular expressions, I need to turn this into a char array. Since MATLAB stores char as 2 bytes per character, this blows up the array size by at least a factor of two (depending on encoding, but for my application I can ignore all fancy characters), and thus leads to an "out of memory" error.
I wrote some code that breaks up the original array so that the regular expression matching works on smaller chunks, but I am still wondering: can I somehow run regular expressions directly on the uint8 array? Or is there a char-like variable type that uses only 1 byte per character?
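The chunking approach mentioned in the question might look roughly like the sketch below. It is not the original poster's code: the function name `chunkedRegexp`, its arguments, and the strategy of cutting each chunk at the last line feed are all assumptions made for illustration. It also assumes the content is in the ASCII range and that the pattern never needs to match across a line break.

```matlab
function matches = chunkedRegexp(filename, pattern, chunkSize)
% Hypothetical helper: run regexp over a file in newline-aligned chunks.
% Assumes ASCII-range content and a pattern that never spans a line feed.
fid = fopen(filename, 'r');
if fid == -1
    error('Could not open %s', filename);
end
closer = onCleanup(@() fclose(fid));   % close the file even on error
carry = uint8([]);                     % unfinished last line of previous chunk
matches = {};
while true
    raw = fread(fid, chunkSize, '*uint8')';   % bytes as a row vector
    atEnd = numel(raw) < chunkSize;           % short read means end of file
    buf = [carry, raw];
    if atEnd
        cut = numel(buf);                     % process everything that is left
    else
        cut = find(buf == 10, 1, 'last');     % position of the last line feed
        if isempty(cut)
            cut = 0;                          % no complete line yet
        end
    end
    carry = buf(cut+1:end);
    % Only this chunk is expanded to 2-byte char, not the whole file.
    matches = [matches, regexp(char(buf(1:cut)), pattern, 'match')]; %#ok<AGROW>
    if atEnd
        break
    end
end
end
```

A call such as `m = chunkedRegexp('filename.txt', '\d+', 16*2^20)` (placeholder file name and pattern) would then keep the peak char allocation near twice the chunk size rather than twice the file size.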
  5 comments
dpb on 26 Aug 2013
Instead of 'uint8', try 'uchar'. Not sure it'll help, but it is at least a character class, not an integer.
Cedric on 27 Aug 2013
Edited: Cedric on 27 Aug 2013
Actually, it would be simpler to tell us what you are trying to match rather than the pattern (copy/paste a chunk of the file content or a string, and explain what you want to extract). With a little luck, we can do this using STRFIND (which works on uint8 arrays) or some numeric test on the uint8 values.
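As a rough illustration of the STRFIND idea, the sketch below locates a literal marker directly in the byte array and converts only a short window after each hit to char for a small regexp. The sample data, the `id=` marker, and the 32-byte window are invented for the example. That `strfind` accepts uint8 input is taken from the comment above; the documentation describes text inputs, so this may depend on the MATLAB release.

```matlab
% Stand-in for the uint8 row vector obtained from fread (sample data only).
bytes  = uint8('id=17; name=foo; id=42;');
needle = uint8('id=');                 % literal marker to search for

% Search the bytes directly, without converting the whole array to char.
starts = strfind(bytes, needle);

% Convert only a small window after each hit and parse it with regexp.
ids = nan(1, numel(starts));
for k = 1:numel(starts)
    stop    = min(starts(k) + 31, numel(bytes));
    snippet = char(bytes(starts(k):stop));
    tok     = regexp(snippet, '^id=(\d+)', 'tokens', 'once');
    if ~isempty(tok)
        ids(k) = str2double(tok{1});
    end
end
```

Because only the windows are expanded to char, the extra memory stays small no matter how large `bytes` is.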


Answers (0)
