How to see if characters are present in a string array.

I am trying to write some code that will take a short amino acid sequence, e.g. 'GSA', and then search through a string array of sequences to find the number and indices of matches, ignoring the order of the characters. As long as each character is present, I would like to consider it a hit.
Here is the code I have so far, which kind of works. InputSeq is the sequence I would like to search for, and AAseq is the string array of sequences that I would be searching through. This code only produces a match if all characters are present AND the order is correct.
InputSeq = "GSA";
AAseq = [ SGD; SGS; SGA; SGV; SGS; SGA; SGD; SGS; SGS; SGY; SGD; SGS; SGI.........];
result = ismember(InputSeq, AAseq)
This kind of works, but it will not register a match if the order of the characters does not match.

Accepted Answer

Stephen23
Stephen23 on 3 Dec 2021
Edited: Stephen23 on 3 Dec 2021
Assuming that all string elements contain exactly the same number of characters, you can do this easily with basic logical operations on character arrays:
A = "GSA";
B = ["SGD";"SGS";"SGA";"SGV";"SGS";"SGA";"SGD";"SGS";"SGS";"SGY";"SGD";"SGS";"SGI"]
B = 13×1 string array
"SGD" "SGS" "SGA" "SGV" "SGS" "SGA" "SGD" "SGS" "SGS" "SGY" "SGD" "SGS" "SGI"
X = all(sort(char(A))==sort(char(B),2),2)
X = 13×1 logical array
0 0 1 0 0 1 0 0 0 0 0 0 0
Or without sorting:
X = all(any(char(A)==permute(char(B),[1,3,2]),3),2)
X = 13×1 logical array
0 0 1 0 0 1 0 0 0 0 0 0 0
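The question also asks for the number and indices of matches. A minimal sketch of how the logical mask above could be turned into both, using the same example data, follows. The second half uses made-up inputs (A2, B2 are hypothetical names) to illustrate one difference I am fairly sure of: the sorted comparison requires the same characters with the same counts, while the version without sorting only requires every character of A to occur somewhere in the row.

```matlab
% Same example data as in the answer above.
A = "GSA";
B = ["SGD";"SGS";"SGA";"SGV";"SGS";"SGA";"SGD";"SGS";"SGS";"SGY";"SGD";"SGS";"SGI"];

% Logical mask of order-independent matches (sorted comparison).
X = all(sort(char(A)) == sort(char(B), 2), 2);

matchIdx   = find(X);  % row indices of the matches
numMatches = nnz(X);   % number of matches

% Hypothetical inputs with a repeated character, to compare both approaches.
A2 = "GSS";
B2 = ["GSA";"SGS"];
sortedHit  = all(sort(char(A2)) == sort(char(B2), 2), 2);                  % same characters, same counts
presentHit = all(any(char(A2) == permute(char(B2), [1,3,2]), 3), 2);      % each character of A2 occurs in the row
```

With these inputs, matchIdx should contain 3 and 6. For the hypothetical A2 and B2, sortedHit is true only for "SGS", whereas presentHit is true for both rows, so which version is appropriate depends on whether repeated residues should count.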

3 Comments

Thanks! This worked the best for me, but I had to make some changes to the way I sorted my character array. The way you coded it, it alphabetized by row and didn't alphabetize the columns. I solved it by doing this. I think it is because my variable was already a character array rather than a string.
A = 'GSA'
B = ['SGD';'SGS';'SGA';'SGV';'SGS';'SGA';'SGD';'SGS';'SGS';'SGY';'SGD';'SGS';'SGI']
for i = 1:size(B,1) % loop over rows; length(B) would return the largest dimension
B(i,:) = sort(B(i,:));
end
Result = all(sort(A) == B, 2);
MatchIdx = find(Result == 1);
MatchIdx =
3
6
You don't need the loop, you can simply specify the sort dimension argument:
A = 'GSA'
A = 'GSA'
B = ['SGD';'SGS';'SGA';'SGV';'SGS';'SGA';'SGD';'SGS';'SGS';'SGY';'SGD';'SGS';'SGI']
B = 13×3 char array
'SGD' 'SGS' 'SGA' 'SGV' 'SGS' 'SGA' 'SGD' 'SGS' 'SGS' 'SGY' 'SGD' 'SGS' 'SGI'
X = all(sort(A)==sort(B,2),2)
X = 13×1 logical array
0 0 1 0 0 1 0 0 0 0 0 0 0
Yep, you're right! That worked. Thank you!


More Answers (1)

You could use multiple contains() tests.
But I suggest that instead you do something like
ismember(sort(char(InputSeq)), cellfun(@sort, cellstr(AAseq), 'uniform', 0))
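For reference, the "multiple contains() tests" idea might look something like the sketch below. This is only one possible reading of that suggestion: the loop and the variable name hit are my own, and the third sequence is a made-up example of a longer entry. It checks only that every character of InputSeq occurs somewhere in each sequence, so it also works when the sequences differ in length.

```matlab
InputSeq = "GSA";
AAseq = ["SGD"; "SGA"; "ASGK"];   % hypothetical data; the last entry is longer on purpose

% Start with all true and AND in one contains() test per character.
hit = true(size(AAseq));
for c = char(InputSeq)            % c takes the values 'G', 'S', 'A' in turn
    hit = hit & contains(AAseq, c);
end

matchIdx = find(hit);             % indices of sequences containing every character
```

Here hit is the same size as AAseq, which is what find() needs to return the matching indices.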

2 Comments

Elijah Roberts
Elijah Roberts on 2 Dec 2021
Edited: Elijah Roberts on 2 Dec 2021
That only returns a single true or false, i.e. "InputSeq is found somewhere in AAseq." I would like to get a logical array of the same size as AAseq, so I can get the indices of all the matching sequences.
I had some luck with this. I also trimmed the input sequence down to 'GS', and the AAseq entries are all two characters long as well.
Matches = ismember(InputSeq, AAseq); (both variables are char arrays)
This gave me a 96x2 logical array. Column one seems to be "is G a member" and column two is "is S a member".
This kind of works for me. If I can get the row indices where both columns are true, I will be good.
I tried this
MatchIndex = find(Matches == [1 1])
but it just gave me every index where there is a 1, rather than the indices where both columns are 1.
ismember( cellfun(@sort, cellstr(AAseq), 'uniform', 0), sort(char(InputSeq)) )
You could also use strcmp()
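A rough sketch of the strcmp() route, assuming AAseq is a char matrix with one sequence per row (the data here is shortened from the question). The last two lines use a made-up logical matrix (Matches below is hypothetical, not the real 96x2 result) to show how rows where every column is true could be picked out with all(...,2), which is what the find(Matches == [1 1]) attempt was after.

```matlab
InputSeq = 'GSA';
AAseq = ['SGD';'SGS';'SGA';'SGV';'SGS';'SGA'];

% Sort the characters within each sequence, then compare against the sorted input.
sortedSeqs = cellfun(@sort, cellstr(AAseq), 'UniformOutput', false);
mask = strcmp(sortedSeqs, sort(InputSeq));   % one logical value per sequence
matchIdx = find(mask);

% Hypothetical two-column logical matrix, standing in for the ismember result.
Matches = logical([1 1; 1 0; 0 1; 1 1]);
bothTrue = find(all(Matches, 2));            % rows where both columns are true
```

Comparing with == [1 1] produces another two-column logical matrix, so find() reports every true element; collapsing each row with all(...,2) first gives one value per row.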


Release

R2019b
