Best solution to finding repeating characters on a line.
Matthew Worker
on 13 Jul 2021
Commented: Rena Berman
on 26 Sep 2023
I am looking for any instances of two characters (e/d) being repeated in a row greater than or equal to 10 times. I just want to either print every line where this occurs to the command line, or stop and print the location of the stop every time it is detected. Basically, I am trying to find where e and d show up over ten times grouped together in a large data file. For example:
asdfsdfsdfsasdfsdfsdfsasdfsdfsdfs
asseefadfefeeedddeeedddasdfsdf
asdfsdfsdfsasdfsdfsdfsasdfsdfsdfs
asseefadfefeeedddeeedddasdfsdf
The script would then print out line 2 and line 4 in the command line.
Thank you for your help
Accepted Answer
Stephen23
on 13 Jul 2021
Edited: Stephen23
on 13 Jul 2021
inp = {'asdfsdfsdfsasdfsdfsdfsasdfsdfsdfs';'asseefadfefaaadddaaadddasdfsdf';'asdfsdfsdfsasdfsdfsdfsasdfsdfsdfs';'asseefadfefaaadddaaadddasdfsdf'};
rgx = '(.)(??$1*)(.?)(??[$1$2]*)';
spl = regexp(inp,rgx,'match');
idx = cellfun(@(c)any(cellfun(@numel,c)>9),spl);
find(idx)
5 Comments
Walter Roberson
on 13 Jul 2021
The bold text does not represent repetitions this time, unless you mean repetition between lines. In the previous example there were two halves, with the second being the same as the first.
If the task is to find places where there is a string of at least 10 d or e characters then
'[de]{10,}'
can find that, and the 'once' and isempty and indexing from my Answer gives you the rest. It just depends on your having used readlines() on the file.
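A minimal sketch of how those pieces might fit together, assuming the lines are already in a cell array. Here the four sample lines from the question are typed in by hand instead of being read with readlines(), and the variable names (txt, startIdx, hasRun, lineNumbers) are only illustrative:

```matlab
% Sample lines from the question, entered by hand for illustration.
txt = {'asdfsdfsdfsasdfsdfsdfsasdfsdfsdfs'; ...
       'asseefadfefeeedddeeedddasdfsdf'; ...
       'asdfsdfsdfsasdfsdfsdfsasdfsdfsdfs'; ...
       'asseefadfefeeedddeeedddasdfsdf'};

% With 'once', regexp returns the start index of the first run of
% 10 or more d/e characters in each line, or [] when there is none.
startIdx = regexp(txt, '[de]{10,}', 'once');

% Lines whose result is non-empty contain such a run.
hasRun = ~cellfun(@isempty, startIdx);
lineNumbers = find(hasRun);

% Print the line number and the text of every matching line.
for k = lineNumbers.'
    fprintf('line %d: %s\n', k, txt{k});
end
```

For this input, lineNumbers holds 2 and 4, which are the two lines the question expects to be reported.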
Stephen23
on 13 Jul 2021
Edited: Rena Berman
on 22 Sep 2023
Matthew Worker: are the specific characters known in advance? Or do you want to detect them automatically? (i.e. detect any two characters that are repeated more than 10 times contiguously)
Are there any particular patterns that you need to include/exclude? (e.g. does 10 'e' characters in a row count, or does the sequence have to include at least one 'd' character?).
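To make the second question concrete, here is a rough sketch of the two interpretations side by side. The test strings and the helper hasBoth are made up for this illustration and are not from the original data:

```matlab
% Hypothetical test strings: a run of 12 'e' only, and a mixed d/e run.
txt = {['xx' repmat('e', 1, 12) 'xx']; 'xxeeedddeeedddxx'};

% Extract every run of 10 or more d/e characters from each string.
runs = regexp(txt, '[de]{10,}', 'match');

% Interpretation 1: any such run counts, even with only one of the letters.
anyRun = ~cellfun(@isempty, runs);

% Interpretation 2: a run counts only if it contains both d and e.
hasBoth = @(r) any(r == 'd') && any(r == 'e');
mixedRun = cellfun(@(c) any(cellfun(hasBoth, c)), runs);
```

Under the first interpretation both strings are flagged. Under the second, only the string with the mixed run is.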
More Answers (1)
Walter Roberson
on 13 Jul 2021
You say "10 or over", so is it correct that the program needs to find all possible patterns? For example,
'adadadadaaaadadadadaaa'
(length 22) should be located if it exists?
S = {'asseefadfefaaadddaaadddasdfsdf', 'asseeadadadadaaaadadadadaaadfsdf'}
matches = regexp(S, '([ad]{5,})\1', 'match');
celldisp(matches)
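Note that the back-reference \1 requires the second half of the run to be an exact copy of the first half, which is stricter than just asking for 10 or more a/d characters. A small sketch of the difference, using made-up inputs that are not from the question:

```matlab
% Hypothetical inputs: the first run repeats its first half exactly,
% the second run is 11 characters long but has no repeated half.
S = {'xxaaadddaaadddxx', 'xxaaaaaddddddxx'};

% Back-reference pattern: a block of 5+ a/d characters immediately
% followed by an identical copy of that block.
repeated = regexp(S, '([ad]{5,})\1', 'match', 'once');

% Plain character-class pattern: any run of 10+ a/d characters.
anyRun = regexp(S, '[ad]{10,}', 'match', 'once');

disp(repeated)
disp(anyRun)
```

The second string is matched only by the plain '[ad]{10,}' pattern, so which one to use depends on whether the halves really have to be identical.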
5 Comments
Walter Roberson
on 14 Jul 2021
Example of reading from file:
%create a file for demonstration purposes only
tname = [tempname() '.txt'];
fid = fopen(tname, 'w');
T = regexprep('asseefadfefaaadddaaadddasdfsdf\nasseeadadadadaaaadadadadaaadfsdf\nasdfsdfsdfsasdfsdfsdfsasdfsdfsdfs\nasseefadfefaaadddaaadddasdfsdf\nasdfsdfsdfsasdfsdfsdfsasdfsdfsdfs\nasseefadfefaaadddaaadddasdfsdf\n', 'a', 'e');
fprintf(fid, T);
fclose(fid);
%okay, main function
filename = tname;
S = readlines(filename);
matches = S(~cellfun(@isempty, regexp(S, '[de]{10}', 'once')));
matches
%alternative without readlines
S = regexp(fileread(filename), '\r?\n', 'split');
matches = S(~cellfun(@isempty, regexp(S, '[de]{10}', 'once')));
matches
%alternative without splitting
S = fileread(filename);
matches = regexp(S, '^.*[de]{10}.*$', 'match', 'dotexceptnewline', 'lineanchors');
matches
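If the file is too large to read in one go, or if the line number and position of each hit are wanted, a line-by-line loop is another possibility. This is only a sketch: the demonstration file contents and the names fname, lineNo and hits are made up for the example.

```matlab
% Write a small demonstration file (contents are made up for this sketch).
fname = [tempname() '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', 'asdfsdfsdfs', 'asseefeeedddeeedddasdf', 'asdfsdfsdfs');
fclose(fid);

% Read line by line so the whole file never has to be held in memory.
fid = fopen(fname, 'r');
lineNo = 0;
hits = zeros(0, 2);            % rows of [line number, start column]
tline = fgetl(fid);
while ischar(tline)            % fgetl returns -1 (numeric) at end of file
    lineNo = lineNo + 1;
    col = regexp(tline, '[de]{10,}', 'once');   % [] when there is no run
    if ~isempty(col)
        fprintf('line %d, column %d: %s\n', lineNo, col, tline);
        hits(end+1, :) = [lineNo, col]; %#ok<AGROW>
    end
    tline = fgetl(fid);
end
fclose(fid);
delete(fname);
```

To stop at the first detection instead of listing them all, a break could be placed right after the fprintf inside the if block.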