How to read specific lines from a text file and store them in an array?
I have a text file containing a multiple sequence alignment (MSA) of protein sequences. The contents of the file look like this:
>gi|73961569|ref|XP_547536.2| osteocalcin [C. lupus familiaris]
MRSLMVLALLAVAALCLCLAGPADAKPSSAESRKGGATFVSKREGSEVVRRLRRYLDSGL
GAPVPYPDPLEPKREVCELNPNCDELADHIGFQEAYQRFYGPV-
>gi|27806301|ref|NP_776674.1| osteocalcin preproprotein
MRTPMLLALLALAT--LCLAGRADAKPGDAESGK-GAAFVSKQEGSEVVKRLRRYLDHWL
GAPAPYPDPLEPKREVCELNPDCDELADHIGFQEAYRRFYGPV-
From this file I only want to extract the lines containing the actual sequences (the ones NOT starting with the '>' symbol) and store them in an array for later use. Note that lines 2 and 3 together form one single sequence, so I also need to join them into a single string and store it in a single element of the array. How can I do that?
I wanted to use 'fileread', but it reads the whole file at once, so it does not seem helpful.
Comments
TastyPastry on 20 Oct 2015 (edited 20 Oct 2015)
To clarify, are your sequences supposed to be formatted like this, where there are two separate sequences starting with >gi?
>gi|73961569|ref|XP_547536.2| osteocalcin [C. lupus familiaris]
MRSLMVLALLAVAALCLCLAGPADAKPSSAESRKGGATFVSKREGSEVVRRLRRYLDSGL
GAPVPYPDPLEPKREVCELNPNCDELADHIGFQEAYQRFYGPV-
>gi|27806301|ref|NP_776674.1| osteocalcin preproprotein
MRTPMLLALLALAT--LCLAGRADAKPGDAESGK-GAAFVSKQEGSEVVKRLRRYLDHWL
GAPAPYPDPLEPKREVCELNPDCDELADHIGFQEAYRRFYGPV-
Answers (1)
per isakson on 20 Oct 2015
Try
>> out = cssm
out =
[1x104 char] [1x104 char] [1x104 char] [1x104 char] [1x104 char] [1x104 char]
>> out{3}
ans =
MRSLMVLALLAVAALCLCLAGPADAKPSSAESRKGGATFVSKREGSEVVRRLRRYLDSGLGAPVPYPDPLEPKREVCELNPNCDELADHIGFQEAYQRFYGPV-
>>
where
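function out = cssm
    % Read the whole file into one character vector.
    str = fileread( 'cssm.txt' );
    % Match the text between each '>gi...' header line and the next header (or end of file).
    cac = regexp( str, '(?<=>gi[^\n]+\n).+?(?=\n>gi|$)', 'match' );
    out = cell(1,length(cac));
    for jj = 1 : length( cac )
        % Remove line breaks (including Windows '\r\n') so each sequence is one string.
        out{jj} = regexprep( cac{jj}, '[\r\n]', '' );
    end
end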
and cssm.txt contains three copies of the text in your question (hence the six sequences in the output).
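If you would rather not load the whole file at once, the same result can probably be obtained by reading line by line with fopen/fgetl. The following is only a minimal sketch, not a tested drop-in solution: the function name readFastaSequences is hypothetical, it assumes every record starts with a header line beginning with '>', and it would need to be saved as readFastaSequences.m.

```matlab
function seqs = readFastaSequences(filename)
% Hypothetical helper: returns a 1-by-N cell array holding one
% concatenated sequence per '>' header line found in the file.
fid = fopen(filename, 'r');
if fid == -1
    error('Could not open %s', filename);
end
closer = onCleanup(@() fclose(fid));  % close the file even if an error occurs

seqs = {};
tline = fgetl(fid);
while ischar(tline)                   % fgetl returns -1 at end of file
    tline = strtrim(tline);           % drop a stray '\r' and surrounding blanks
    if isempty(tline)
        % blank line: nothing to do
    elseif tline(1) == '>'
        seqs{end+1} = '';             % header line: start a new, empty sequence
    elseif ~isempty(seqs)
        seqs{end} = [seqs{end} tline];  % sequence line: append to the current one
    end
    tline = fgetl(fid);
end
end
```

Calling seqs = readFastaSequences('cssm.txt') should then give one cell per sequence, so seqs{1} would hold lines 2 and 3 of the file joined into a single string. Unlike the regexp version, this sketch does not depend on the headers starting with '>gi'.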