Finding the repeated substrings

4 vues (au cours des 30 derniers jours)
Reshma Ravi
Reshma Ravi le 1 Juin 2017
I have a DNA sequence that is AAGTCAAGTCAATCG and I split into substrings such as AAGT,AGTC,GTCA,TCAA,CAAG,AAGT and so on. Then I have to find the repeated substirngs and their frequency counts ,that is here AAGT is repeated twice so I want to get AAGT - 2.How is this possible .
  2 commentaires
Stephen23
Stephen23 le 1 Juin 2017
See Andrei Bobrov's answer for an efficient solution.
Andrei Bobrov
Andrei Bobrov le 2 Juin 2017
Thank you Stephen!

Connectez-vous pour commenter.

Réponse acceptée

KSSV
KSSV le 1 Juin 2017
str = {'AAGT','AGTC','GTCA','TCAA','CAAG','AAGT'} ;
idx = cellfun(@(x) find(strcmp(str, x)==1), unique(str), 'UniformOutput', false) ;
L = cellfun(@length,idx) ;
Ridx = find(L>1) ;
for i = 1:length(Ridx)
st = str(idx{Ridx}) ;
fprintf('%s string repeated %d times\n',st{1},length(idx{Ridx}))
end

Plus de réponses (2)

Andrei Bobrov
Andrei Bobrov le 1 Juin 2017
A = 'AAGTCAAGTCAATCG';
B = hankel(A(1:end-3),A(end-3:end));
[a,~,c] = unique(B,'rows','stable');
out = table(a,accumarray(c,1),'VariableNames',{'DNA','counts'});
  5 commentaires
Stephen23
Stephen23 le 26 Août 2018
tabulate requires the Statistics and Machine Learning Toolbox, which not everyone has.
Ivan Savelyev
Ivan Savelyev le 14 Août 2019
Hi.
I have a question. Some time i have a ladder-like results (nested sequences) like this :
AAAAAAAAA which will be calculated (with frame size 3 as) as 6 AAAA sequences, wich is not correct in some cases ( it is also about ATATATA type of sequences). Is there a solution or algorithms to filter nested repeats ?
Thanx a lot.

Connectez-vous pour commenter.


Steven Lord
Steven Lord le 14 Août 2019
For the original question you could convert the char data into a categorical array and call histcounts.
>> C = categorical({'AAGT','AGTC','GTCA','TCAA','CAAG','AAGT'})
C =
1×6 categorical array
AAGT AGTC GTCA TCAA CAAG AAGT
>> [counts, uniquevalues] = histcounts(C)
counts =
2 1 1 1 1
uniquevalues =
1×5 cell array
{'AAGT'} {'AGTC'} {'CAAG'} {'GTCA'} {'TCAA'}

Catégories

En savoir plus sur Call Python from MATLAB dans Help Center et File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by