searching through RNA sequence for a string of characters

10 vues (au cours des 30 derniers jours)
John Jamison
John Jamison le 24 Avr 2017
Modifié(e) : dpb le 25 Avr 2017
Hi all,
I am creating a program that gets a user input (txt file) of DNA nucleotides (random lists of A, G, C, T), and gives an output file of the RNA complement (got this done), the percentages of RNA nucleotides (got this done), and asks the user if they would like to search through the RNA sequence for a particular string of nucleotides—if the string exists in the RNA file, it will tell the location in the sequence and repeat that given string.
example: the user inputs "UACG," and my program says that string of basepairs occurred at location 59.
How can I do this? My current program operates by searching through the file on a per character basis, and converts each DNA basepair to an RNA basepair.
Thank you

Réponses (3)

Joseph Cheng
Joseph Cheng le 24 Avr 2017
so, do you store the full compliment data in an array such that you can
%gen dummy data
comp = 'AGCT';
list = comp(randi(4,1,100))
position = strfind(list,'CAT')
where position will be the index in the string that has CAT in it?
  2 commentaires
John Jamison
John Jamison le 24 Avr 2017
I dont believe so. Below is some of what I do:
[base, num] = fread(dnaFile, 1, 'char');
fwrite(rnaFile, out);
Joseph Cheng
Joseph Cheng le 24 Avr 2017
well then i would suggest that you do, unless the array is too large. if you read in the whole text file into an array you should be able to then use strfind to find the position, otherwise its a tricky swapping in and out of a buffer as you search for consecutive matches by looking one by one

Connectez-vous pour commenter.


dpb
dpb le 24 Avr 2017
Modifié(e) : dpb le 24 Avr 2017
Again, all the reason in the world to read the whole string at once as recommended earlier.
But, the answer to the question posed, given what you did previously is simply
idx=strfind(out,ASKEDFORSTRING);
AFTER the loop is complete. Fuggettabout trying to do this on a character-at-a-time basis.
idx will be empty if the sequence doesn't exist, or a vector of 1 to N position(s) where the sequence is found.
Done.
I can't emphasize enough how you should convert this from the looping you're doing to processing the input string as a vector...you'll find things will go much faster after you do, the code will become almost trivial by comparison and much simpler to develop/debug/maintain.
  6 commentaires
John Jamison
John Jamison le 24 Avr 2017
My code is the same as from the previous post. Here is my dna.txt file that I use for my file
dpb
dpb le 24 Avr 2017
Modifié(e) : dpb le 25 Avr 2017
"My code is the same as from the previous post"
Can't be--you had to have processed the input string instead of continuing to try to read a file character-by-character or you would have had an FEOF error on the first next read.
We need to see the changes you did make...for the NaN result, it looks like you never collected a single count so the result was 0/0 for all.
ADDENDUM
OK, the input file is OK; one hurdle cleared.
What you need to do is after the above read statement, if you're still hung up on looping solution, replace the while loop with a counted for loop
for i=1:length(base)
and remove the fread statement at the end of the loop entirely.
Oh, just dawned on me--if you left the code alone, since base now refers to the entire string not just a single character, the various if clauses will never be satisfied. You must address each character in the string in turn:
for i=1:length(base)
if base(i)=='C'
...
etc., etc., etc., ...
But, of course, the thing to do when you read the full string is to use the vectorized solution that I showed you...there's where the real simplification comes from.

Connectez-vous pour commenter.


dpb
dpb le 25 Avr 2017
Modifié(e) : dpb le 25 Avr 2017
OK, while this isn't really answering the question of the thread (I did that up above), I fixed your script to operate on the full string. This isn't fully vectorized yet but goes a fair step along the way eliminating the unnecessary big loop and switch block for the lookup table approach.
% set up lookup table inputs
DNA=['C','G','T','A']; % base characters in DNA sequence
RNA=['G','C','A','U']; % corresponding characters in RNA
% get input file, do bookkeeping for output file, etc., ..
filename=uigetfile('*.txt','Select DNA sequence file');
if filename==0, error('User Canceled'),end
[dnaFile, message] = fopen(filename, 'r');
% %if input file has extension .dna, switch to .rna
if length(filename) > 4 && strcmp(filename(end-3 : end), '.dna')
outputFile = [filename(1:end-4) '.rna'];
else
outputFile = [filename '.rna'];
end
[rnaFile, message] = fopen(outputFile, 'w');
% now read full string into char array
dna=fread(dnaFile,'*char').'; % orient as row vector
fclose(dnaFile); clear dnaFile % we're done with input file
% and translate DNA --> RNA
rna=RNA(arrayfun(@(c) find(c==DNA),dna));
% compute counts, frequencies
counts=zeros(size(RNA)).';
for i=1:length(RNA)
counts(i)=sum(rna==RNA(i));
end
freqs=100*counts/sum(counts);
% output results
fprintf('Base pair frequencies: \n')
for i=1:length(RNA)
fprintf('%c: %6.2f\n',RNA(i),freqs(i));
end
fwrite(rnaFile, rna);
rnaFile=fclose(rnaFile); clear rnaFile
Sample run from your dna.txt--
> dna2rna
Your output file will be: dna.txt.rna
Base pair frequencies:
G: 14.63
C: 31.71
A: 26.83
U: 26.83
>>
The searching would be on the variable rna as outlined above after getting the desired sequence for which to search from the user.
I attached the m-file as well...HTH.
ADDENDUM
OK, I added the lookup; using the search string you typed in the original question the nul result is the expected answer...there is no sequence like that in the output string. I ran one and picked one I knew would work and results were:
>> dna2rna
Your output file will be: dna.txt.rna
Base pair frequencies:
G: 14.63
C: 31.71
A: 26.83
U: 26.83
Enter RNA sequence of interest: aauu
Sequence AAUU locations (empty report-->not found)
Occurrence Location
1 3
2 16
3 28
>>
So, the lookup does work if there is anything to be found.
  2 commentaires
John Jamison
John Jamison le 25 Avr 2017
Thank you. How can I get it to search for a user input of RNA basepairs?
I have the following, and thought it would work, but it always returns a " [] ", which means it is finding nothing...
inputBase = input('Enter RNA sequence of interest: ', 's');
location = strfind(out, inputBase);
dpb
dpb le 25 Avr 2017
Modifié(e) : dpb le 25 Avr 2017
Well, syntax looks ok but again we can't debug what we can't see, sorry.
Same old song, umpteenth verse. SHOW us enough we can see what you did and what you used to do it with (both code and data)...we can't see your terminal from here.
This can be done simply from command line with the output string you operated on and the search string as a minimum...
ADDENDUM
You did notice if you used my updated code that the output variable is now a more expressive name rna, not out, I hope?
If you're still stuck on your old code, need what said above...show us the string you're searching and for what...remember strfind is case-sensitive.

Connectez-vous pour commenter.

Catégories

En savoir plus sur Genomics and Next Generation Sequencing dans Help Center et File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by