help with regexpi expression match
I have a question regarding regexpi expression match which may be an easy one (not for me). I have a set of strings from a single cell. An example is as follows:
d2 = {'chromosome 1:NC_011985.1/CP000628.1; chromosome 2:NC_011983.1/CP000629.1; plasmid pAgK84:NC_011994.1/CP000632.1; plasmid pAtK84b:NC_011990.1/CP000630.1; plasmid pAtK84c:NC_011987.1/CP000631.1 chromosome Mycobacterium_bovis_AF2122/97:NC_002945.4/LT708304.1'};
I would like a regexpi expression to pick up the two names that follow "chromosome", i.e. the ones found after the ":" and after the "/". For example, I want to match NC_011985.1, CP000628.1, NC_011983.1, CP000629.1, NC_002945.4 and LT708304.1, but ignore the names that follow "plasmid". I chose this large string as an example because I want the names to be preceded by the word "chromosome". However, as you can see, after the word "chromosome" there may be a number, a word or even nothing, followed by a colon ":", a name (that we want to keep), and then another name that follows "/" (that we also want to keep). Keeping all the names in one cell is fine; I just want to pick up these names.
I currently use the following code:
accession6 = regexpi(d2,'(?<=:)\w+','match');
Using this as a base, I do not know how to require that the match be preceded by the word "chromosome", followed by an optional number, words or even nothing, without messing it up. That part would have to come before the necessary ":" and "/" parts of the expression that precede the names we want to keep.
Any help would be super appreciated.
1 comment
Stephen23
on 4 Dec 2017
Edited: Stephen23
on 4 Dec 2017
"Any help would be super appreciated"
You might like to download my FEX submission iregexp, an interactive regular expression tool.
It lets you quickly experiment with different regular expressions and shows all of regexp's outputs in real-time as you type.
Answers (1)
per isakson
on 4 Dec 2017
Edited: per isakson
on 4 Dec 2017
One expression
- 'chromosome' followed by one or more characters other than ':', then one ':'
- capturing group of one or more letters, digits, underscores, or '.' (greedy)
- zero or more characters other than '/', then one '/'
- capturing group of one or more letters, digits, underscores, or '.' (greedy)
The match is repeated until no more matches are found.
>> cac = regexpi( d2, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens' );
>> cac{:}{:}
ans =
'NC_011985.1' 'CP000628.1'
ans =
'NC_011983.1' 'CP000629.1'
ans =
'NC_002945.4' 'LT708304.1'
>>
If the string itself is passed (d2{:}, assuming d2 contains one string), the outer cell level disappears:
>> cac = regexpi( d2{:}, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens' );
>> cac
cac =
{1x2 cell} {1x2 cell} {1x2 cell}
>>
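Since the question says that keeping all the names in one cell is fine, the nested tokens can be flattened afterwards. The following is only a sketch: the variable names str, expr, tok and names are made up for illustration, and the input string is the one from the question.

```matlab
% Input string copied from the question (split across lines for readability)
str = ['chromosome 1:NC_011985.1/CP000628.1; ' ...
       'chromosome 2:NC_011983.1/CP000629.1; ' ...
       'plasmid pAgK84:NC_011994.1/CP000632.1; ' ...
       'plasmid pAtK84b:NC_011990.1/CP000630.1; ' ...
       'plasmid pAtK84c:NC_011987.1/CP000631.1 ' ...
       'chromosome Mycobacterium_bovis_AF2122/97:NC_002945.4/LT708304.1'];

% Same expression as in the answer above
expr = 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)';

% tok is a 1x3 cell; each element is a 1x2 cell holding one name pair
tok = regexpi(str, expr, 'tokens');

% Concatenate the three 1x2 cells horizontally into one 1x6 cell of names
names = [tok{:}];
disp(names)
```

The brackets around tok{:} concatenate the inner cells side by side, so the pairs stay in the order in which they appear in the string.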
6 comments
per isakson
on 4 Dec 2017
Edited: per isakson
on 4 Dec 2017
An alternative that uses @JM's approach: in a first step, match "name/name" between
- look-behind: (?<=chromosome[^:]+[:])
- look-ahead: (?=;|$)
and in a second step, split the two names at the slash
cac = regexpi( d2{:}, '(?<=chromosome[^:]+[:])[\w\.]+[/][\w\.]+(?=;|$)', 'match' );
cac = regexp( cac, '/', 'split' );
cac{:}
ans =
'NC_011985.1' 'CP000628.1'
ans =
'NC_011983.1' 'CP000629.1'
ans =
'NC_002945.4' 'LT708304.1'
>>
per isakson
on 4 Dec 2017
chromosome[^:;] (with a semicolon, as proposed by @Guillaume) is better than chromosome[^:] (without): if the colon after 'chromosome' is missing from the string, the latter runs on into the next entry and returns a plasmid name pair, whereas with the semicolon that pair is skipped altogether.
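A small sketch of that difference, using a made-up test string (not from the question) in which the colon after "chromosome 1" has been left out. The variable names are hypothetical.

```matlab
% Hypothetical input: the ':' after 'chromosome 1' is missing
str = 'chromosome 1 NC_011985.1/CP000628.1; plasmid pAgK84:NC_011994.1/CP000632.1';

% Without ';' in the negated class, [^:]+ runs past the ';' up to the plasmid's ':'
loose  = regexpi(str, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)',  'tokens');

% With ';' in the negated class, the match cannot cross into the next entry
strict = regexpi(str, 'chromosome[^:;]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens');

disp(loose{1})          % the plasmid pair is picked up by mistake
disp(isempty(strict))   % nothing is returned for the malformed entry
```

Which behaviour is preferable depends on the data; with well-formed input such as the d2 in the question, both expressions should return the same three pairs.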