Finding the repeated substrings
I have a DNA sequence, AAGTCAAGTCAATCG, which I split into overlapping substrings such as AAGT, AGTC, GTCA, TCAA, CAAG, AAGT and so on. I then need to find the repeated substrings and their frequency counts. For example, AAGT occurs twice here, so I want to get AAGT - 2. How can this be done?
2 comments
Stephen23
on 1 Jun 2017
See Andrei Bobrov's answer for an efficient solution.
Andrei Bobrov
on 2 Jun 2017
Thank you Stephen!
Accepted Answer
More Answers (2)
Andrei Bobrov
on 1 Jun 2017
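A = 'AAGTCAAGTCAATCG';
% Each row of B is one overlapping 4-character window of A.
B = hankel(A(1:end-3),A(end-3:end));
% Unique rows in order of first appearance; c maps each row of B to its unique row.
[a,~,c] = unique(B,'rows','stable');
% Count how often each unique row occurs and collect the result in a table.
out = table(a,accumarray(c,1),'VariableNames',{'DNA','counts'});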
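The table above lists every substring, including those that occur only once. Since the question asks specifically for the repeated ones, the following is a minimal sketch of one way to filter them. The window length `k` and the names `kmers`, `counts`, `isRepeated`, and `repeated` are illustrative assumptions and are not part of the original answer.

```matlab
A = 'AAGTCAAGTCAATCG';
k = 4;                                   % window length (assumed; the question uses 4)

% Each row of B is one overlapping window of length k.
B = hankel(A(1:end-k+1), A(end-k+1:end));

% Unique windows in order of first appearance, plus an index back into them.
[kmers, ~, idx] = unique(B, 'rows', 'stable');
counts = accumarray(idx, 1);

% Keep only the windows that were seen at least twice.
isRepeated = counts > 1;
repeated = table(kmers(isRepeated, :), counts(isRepeated), ...
    'VariableNames', {'DNA', 'counts'});
disp(repeated)
```

For the sequence in the question this should leave AAGT, AGTC, GTCA and TCAA, each with a count of 2.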
5 comments
Anthony Tracy
on 24 Aug 2018
If it's alright, I had a question about the use of unique. Why not use tabulate? Just curious.
Thanks!
Image Analyst
on 24 Aug 2018
Maybe he didn't know about it - I didn't.
outT = tabulate(B)
out =
  8×2 table
    DNA     counts
    ____    ______
    AAGT      2
    AGTC      2
    GTCA      2
    TCAA      2
    CAAG      1
    CAAT      1
    AATC      1
    ATCG      1
outT =
  8×3 cell array
    {'AAGT'}    {[2]}    {[16.6666666666667]}
    {'AGTC'}    {[2]}    {[16.6666666666667]}
    {'GTCA'}    {[2]}    {[16.6666666666667]}
    {'TCAA'}    {[2]}    {[16.6666666666667]}
    {'CAAG'}    {[1]}    {[8.33333333333333]}
    {'CAAT'}    {[1]}    {[8.33333333333333]}
    {'AATC'}    {[1]}    {[8.33333333333333]}
    {'ATCG'}    {[1]}    {[8.33333333333333]}
Anthony Tracy
on 24 Aug 2018
Yeah, that's fair. I was just curious, since I was looking at both and wondering why I might want to use one over the other. It seems to come down mainly to whether I want a table or a cell array.
Thanks!
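If the table form is preferred but `tabulate` is more convenient, the two can be combined. This is only a sketch: `tabulate` ships with the Statistics and Machine Learning Toolbox, and the variable names `T` and `'percent'` are assumptions chosen for illustration.

```matlab
A = 'AAGTCAAGTCAATCG';
B = hankel(A(1:end-3), A(end-3:end));   % one 4-character window per row

% With char input and an output argument, tabulate returns a cell array
% whose columns are value, count, and percentage.
outT = tabulate(B);

% Convert the cell array into a table with named columns.
T = cell2table(outT, 'VariableNames', {'DNA', 'counts', 'percent'});
disp(T)
```

Compared with the `unique`/`accumarray` route, this also gives the percentage column, at the cost of a toolbox dependency.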
Ivan Savelyev
on 14 Aug 2019
Hi.
I have a question. Sometimes I get ladder-like results (nested sequences) like this:
AAAAAAAAA, which will be counted (with frame size 3) as 6 AAAA sequences, which is not correct in some cases (the same applies to ATATATA-type sequences). Is there a solution or an algorithm to filter nested repeats?
Thanks a lot.
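One possible reading of "filtering nested repeats" is to count only non-overlapping occurrences of a motif. The sketch below does this with a greedy left-to-right pass; it is an assumption about what is wanted, not a method given anywhere in this thread, and the names `seq`, `motif`, `hits`, `kept`, and `nextFree` are hypothetical. Sliding a window of length 4 over nine A's gives six overlapping hits, and the greedy pass keeps two of them.

```matlab
seq   = 'AAAAAAAAA';            % nine A's, as in the example above
motif = 'AAAA';                 % hypothetical motif of length 4

hits = strfind(seq, motif);     % all start positions, overlaps included

% Greedy left-to-right pass: accept a hit only if it starts at or after
% the position where the previously accepted hit ended.
kept = zeros(1, 0);
nextFree = 1;
for p = hits
    if p >= nextFree
        kept(end+1) = p; %#ok<AGROW>
        nextFree = p + numel(motif);
    end
end

fprintf('overlapping: %d, non-overlapping: %d\n', numel(hits), numel(kept));
```

Whether a greedy pass is the right rule depends on the application; for tandem repeats such as ATATATA, a different definition of "nested" may be more appropriate.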
Steven Lord
on 14 Aug 2019
For the original question you could convert the char data into a categorical array and call histcounts.
>> C = categorical({'AAGT','AGTC','GTCA','TCAA','CAAG','AAGT'})
C =
1×6 categorical array
AAGT AGTC GTCA TCAA CAAG AAGT
>> [counts, uniquevalues] = histcounts(C)
counts =
2 1 1 1 1
uniquevalues =
1×5 cell array
{'AAGT'} {'AGTC'} {'CAAG'} {'GTCA'} {'TCAA'}
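The example above types the substrings in by hand. A minimal sketch of how the same idea might be applied to the full sequence, reusing the `hankel` construction from Andrei Bobrov's answer, could look like this (the name `result` is an assumption for illustration):

```matlab
A = 'AAGTCAAGTCAATCG';
B = hankel(A(1:end-3), A(end-3:end));    % one 4-character window per row

% Convert the rows to a categorical array, one element per window.
C = categorical(cellstr(B));

% For categorical input, histcounts returns one count per category.
% The categories come back in sorted order, not order of appearance.
[counts, uniquevalues] = histcounts(C);

result = table(uniquevalues(:), counts(:), 'VariableNames', {'DNA', 'counts'});
disp(result)
```

Note that, unlike `unique(...,'stable')`, this lists the substrings alphabetically.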