How to read specific lines from a text file and store them in an array?

Rasif Ajwad on 20 Oct 2015
Commented: Rasif Ajwad on 20 Oct 2015
I have a text file containing a multiple sequence alignment (MSA) of protein sequences. The contents of the file look like this:
>gi|73961569|ref|XP_547536.2| osteocalcin [C. lupus familiaris]
MRSLMVLALLAVAALCLCLAGPADAKPSSAESRKGGATFVSKREGSEVVRRLRRYLDSGL
GAPVPYPDPLEPKREVCELNPNCDELADHIGFQEAYQRFYGPV-
>gi|27806301|ref|NP_776674.1| osteocalcin preproprotein
MRTPMLLALLALAT--LCLAGRADAKPGDAESGK-GAAFVSKQEGSEVVKRLRRYLDHWL
GAPAPYPDPLEPKREVCELNPDCDELADHIGFQEAYRRFYGPV-
From this file I just want to extract the lines containing the actual sequences (the ones NOT starting with the '>' symbol) and store them in an array for future use. Note that lines 2 and 3 together form one single sequence, so I also need to join them into a single string and store it in one position of the array. How can I do that?
I wanted to use 'fileread', but it reads the whole file at once, so it does not seem helpful.
3 comments
TastyPastry on 20 Oct 2015
Edited: TastyPastry on 20 Oct 2015
To clarify, are your sequences supposed to be formatted like this, where there are two separate sequences starting with >gi?
>gi|73961569|ref|XP_547536.2| osteocalcin [C. lupus familiaris]
MRSLMVLALLAVAALCLCLAGPADAKPSSAESRKGGATFVSKREGSEVVRRLRRYLDSGL
GAPVPYPDPLEPKREVCELNPNCDELADHIGFQEAYQRFYGPV-
>gi|27806301|ref|NP_776674.1| osteocalcin preproprotein
MRTPMLLALLALAT--LCLAGRADAKPGDAESGK-GAAFVSKQEGSEVVKRLRRYLDHWL
GAPAPYPDPLEPKREVCELNPDCDELADHIGFQEAYRRFYGPV-
Rasif Ajwad on 20 Oct 2015
Yes. Each record starts with '>gi', but the actual sequence starts on the next line: 'MRSLM...'

Answers (1)

per isakson on 20 Oct 2015
Try
>> out = cssm
out =
[1x104 char] [1x104 char] [1x104 char] [1x104 char] [1x104 char] [1x104 char]
>> out{3}
ans =
MRSLMVLALLAVAALCLCLAGPADAKPSSAESRKGGATFVSKREGSEVVRRLRRYLDSGLGAPVPYPDPLEPKREVCELNPNCDELADHIGFQEAYQRFYGPV-
>>
where
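function out = cssm
    % Read the whole file into one character vector.
    str = fileread( 'cssm.txt' );
    % Capture the text between each '>gi...' header line and the next header (or the end of the text).
    cac = regexp( str, '(?<=>gi[^\n]+\n).+?(?=\n>gi|$)', 'match' );
    out = cell(1,length(cac));
    for jj = 1 : length( cac )
        % Remove the line breaks so that each sequence becomes one string.
        out{jj} = regexprep( cac{jj}, '\n', '' );
    end
end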
and cssm.txt contains three copies of the sample text from your question.
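
If reading the whole file at once is a concern, a line-by-line variant is also possible. The following is only a minimal sketch, not part of the answer above: the function name read_fasta_sequences is hypothetical, and it assumes that every record starts with a header line beginning with '>' and that all following lines up to the next header belong to the same sequence.

```matlab
function seqs = read_fasta_sequences(filename)
% READ_FASTA_SEQUENCES  Collect sequences from a FASTA-style text file.
%   Hypothetical helper for illustration: a line starting with '>' opens a
%   new record, and the lines after it are joined into one string.
    fid = fopen(filename, 'r');
    if fid == -1
        error('Could not open file %s', filename);
    end
    closer = onCleanup(@() fclose(fid));    % close the file even on error

    seqs = {};
    tline = fgetl(fid);
    while ischar(tline)                     % fgetl returns -1 at end of file
        tline = strtrim(tline);             % drop a stray '\r' or spaces
        if isempty(tline)
            % blank line: nothing to do
        elseif tline(1) == '>'
            seqs{end+1} = '';               % header: start a new sequence
        elseif ~isempty(seqs)
            seqs{end} = [seqs{end} tline];  % continuation: append to current
        end
        tline = fgetl(fid);
    end
end
```

Calling seqs = read_fasta_sequences('cssm.txt') should then return a cell array with one joined sequence per header, comparable to out above. Lines that appear before the first header are ignored in this sketch. If the Bioinformatics Toolbox is installed, its fastaread function reads this file format directly.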
