Replacing characters with integers in a very long string

I have a string of a few million characters and want to convert it to a vector of integers according to simple rules, such as 'C' = -1 and so forth. My implementation works but takes forever and uses gigabytes of memory, mainly because of the str2num function, as far as I understand. Is there a more efficient way to do this?
sequence = fileread('sourcefile.txt');
sequence_num = strrep(sequence, 'A', '0 ');
sequence_num = strrep(sequence_num,'C','-1 ');
sequence_num = strrep(sequence_num,'G', '1 ');
sequence_num = strrep(sequence_num,'T', '0 ');
sequence_num = regexprep(sequence_num,'\r\n','');
sequence_num = str2num(sequence_num);
sequence_num = int32(sequence_num);

Accepted Answer

I don’t know what structure ‘sequence’ has. I created it as a cell array here:
bases = {'A','C','T','G'}; % Cell Array
sequence = bases(randi(4, 1, 20)); % Create Data
skew = zeros(1, length(sequence)+1,'int32'); % Preallocate
Cix = find(ismember(sequence, 'C')); % Indices of 'C'
Gix = find(ismember(sequence, 'G')); % Indices of 'G'
skew(Cix+1) = -1; % Replace With Integer
skew(Gix+1) = +1; % Replace With Integer

7 Comments

Thank you. Can I adapt it to work as well if the input is a char array of size 1 x 5000000?
My pleasure.
Yes:
t0 = clock;
bases = ['A','C','T','G']; % Character Array
sequence = bases(randi(4, 1, 5000000));
skew = zeros(1, length(sequence)+1,'int32');
Cix = find(ismember(sequence, 'C'));
Gix = find(ismember(sequence, 'G'));
skew(Cix+1) = -1;
skew(Gix+1) = +1;
t1 = clock;
fprintf(1, '\tMy code needs only %.3f seconds\n', etime(t1,t0))
My code needs only 0.450 seconds
Thank you, your solution worked fine! This is a 40x reduction in computation time and completely solves the memory problem. Memory consumption is now negligible!
Now the slowest part of my code is another replacement operation that I use to get rid of newline characters in the string, inherited from the source text file, which has lines of about 70 characters. Here is how I import the file:
sequence = fileread('source.txt');
And here is the slow command I use to get rid of newline characters. Is it the best option?
sequence = regexprep(sequence,'\r\n','');
My pleasure!
Without having ‘source.txt’ (or a representative sample of it), it is not possible to suggest an improvement.
The only possibility that comes immediately to mind is to read your file line-by-line with the fgetl function. Then, save it to a ‘.mat’ file so you only have to read it from the text file once. (See the documentation for the save, load, and matfile functions for details on how to use them if you’re not familiar with ‘.mat’ files.)
From the documentation for fgetl:
  • tline = fgetl( fileID ) returns the next line of the specified file, removing the newline characters.
Reading your file line-by-line in a while loop may not be faster than what you’re currently doing, so you will have to experiment. See the documentation for fgetl to understand how to use it with the ischar function. (Also see the documentation for the eof end-of-file indicator in the event that it would be best to use it instead of ischar in your application.)
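The line-by-line approach described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the small sample file written at the top, the variable names (tmpFile, parts), and the output name sequence.mat are made up here and do not come from the original post.

```matlab
% Write a small sample file so the sketch is self-contained
% (hypothetical stand-in for the real 'source.txt').
tmpFile = [tempname '.txt'];
fid = fopen(tmpFile, 'w');
fprintf(fid, 'ACGT\nTGCA\nGG\n');
fclose(fid);

% Read the file line by line; fgetl drops the line terminator
% and returns -1 (not a char) once the end of the file is reached.
fid = fopen(tmpFile, 'r');
assert(fid ~= -1, 'Could not open %s', tmpFile);
parts = {};                       % one cell per line
tline = fgetl(fid);
while ischar(tline)
    parts{end+1} = tline;         %#ok<AGROW>
    tline = fgetl(fid);
end
fclose(fid);
delete(tmpFile);

% Concatenate once at the end rather than growing a char array per line.
sequence = [parts{:}];

% Cache the result so the text file only has to be parsed once
% (hypothetical output file name).
save(fullfile(tempdir, 'sequence.mat'), 'sequence');
```

On later runs, load(fullfile(tempdir, 'sequence.mat')) would restore the variable without touching the text file. Whether the loop beats a single fileread plus a replacement is something to measure on the real data.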
@Paolo: strrep is much faster than regexprep:
sequence = strrep(sequence, sprintf('\r\n'), '');
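If the file might contain bare line feeds as well as CR+LF pairs, a logical mask is another option worth trying. The sketch below is not something proposed in this thread; it is an illustration that builds a synthetic test string instead of reading 'source.txt', and it makes no claim about which variant is faster. tic/toc simply prints whatever your machine measures.

```matlab
% Build a test string of 70-character lines separated by CR+LF
% (synthetic stand-in for the contents of the real file).
bases = 'ACTG';
nLines = 1000;
body = bases(randi(4, nLines, 70));          % nLines x 70 char matrix
eol = repmat(sprintf('\r\n'), nLines, 1);    % nLines x 2 char matrix
raw = reshape([body, eol].', 1, []);         % flatten row by row to 1 x N

% Variant 1: strrep, as suggested above (removes only the CR+LF pair).
tic;
seq1 = strrep(raw, sprintf('\r\n'), '');
tStrrep = toc;

% Variant 2: logical mask, drops every CR (13) and LF (10) character.
tic;
seq2 = raw(raw ~= char(13) & raw ~= char(10));
tMask = toc;

fprintf('strrep: %.4f s, mask: %.4f s\n', tStrrep, tMask);
```

The two variants give the same result when every line ends in CR+LF; they differ only if a lone CR or LF appears, which the mask removes and strrep leaves in place.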
Another simplification:
bases = ['A','C','T','G'];
sequence = bases(randi(4, 1, 5000000));
skew = zeros(1, length(sequence), 'int32');
Cix = (sequence == 'C');
Gix = (sequence == 'G');
skew(Cix) = -1;
skew(Gix) = +1;
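When there are more than two rules, a lookup table indexed by character code is one way to apply them all in a single indexing step. This is a sketch of a different technique from the logical masks above, assuming the sequence contains only characters with codes 1 to 255 (plain ASCII); the name lut and the short example string are made up for illustration.

```matlab
% Lookup table indexed by character code (assumes codes 1..255,
% which holds for plain ASCII sequence data).
lut = zeros(1, 255, 'int32');
lut(double('C')) = -1;
lut(double('G')) = 1;
% 'A', 'T', and any other character stay 0.

sequence = 'ACGTTGCCA';           % small hypothetical example
skew = lut(double(sequence));     % one indexing operation maps every character

disp(skew)
```

Adding a rule is then a single extra assignment into lut, and the result is already of class int32.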
Thank you @Star and @Jan. All in all, your help sped up my code about 700 times; it now takes 0.17 s for a bacterial genome. About 250x came from @Star's suggestions, and a further 3x from @Jan's final simplification.
Our pleasure!
It is always more gratifying to help with real-world research. We wish you well!
