help with regexpi expression match

Question

J M on 4 Dec 2017


Edited: per isakson on 4 Dec 2017

I have a question about matching with a regexpi expression, which may be an easy one (but not for me). I have a set of strings in a single cell. An example is as follows:

d2 = {'chromosome 1:NC_011985.1/CP000628.1; chromosome 2:NC_011983.1/CP000629.1; plasmid pAgK84:NC_011994.1/CP000632.1; plasmid pAtK84b:NC_011990.1/CP000630.1; plasmid pAtK84c:NC_011987.1/CP000631.1 chromosome Mycobacterium_bovis_AF2122/97:NC_002945.4/LT708304.1'};

I would like a regexpi expression to pick up the two names that follow "chromosome", found after the ":" and after the "/". For example, I want to match NC_011985.1, CP000628.1, NC_011983.1, CP000629.1, NC_002945.4 and LT708304.1, but ignore the names that follow "plasmid". I chose this long string as an example because I want the names to be preceded by the word "chromosome". As you can see, the word "chromosome" may be followed by a number, a word or even nothing, then a colon ":", a name (that we want to keep), and then another name after the "/" (that we also want to keep). Keeping all the names in one cell is fine; I just want to pick up these names.

If I use the following code:

accession6 = regexpi(d2,'(?<=:)\w+','match');

Using this as a base, I do not know how to require that the match be preceded by the word "chromosome", followed by an optional number, a word, or nothing at all, without breaking the expression. That part would have to come before the required ":" and "/" parts that precede the names we want to keep.
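As a quick check of what the base pattern actually returns, here is a minimal sketch run against the d2 string above. The variable name firstHits is only for illustration. Two things stand out: \w does not match '.', so each match stops before the version suffix, and nothing ties the match to "chromosome", so the plasmid accessions come back as well.

```matlab
d2 = {'chromosome 1:NC_011985.1/CP000628.1; chromosome 2:NC_011983.1/CP000629.1; plasmid pAgK84:NC_011994.1/CP000632.1; plasmid pAtK84b:NC_011990.1/CP000630.1; plasmid pAtK84c:NC_011987.1/CP000631.1 chromosome Mycobacterium_bovis_AF2122/97:NC_002945.4/LT708304.1'};

% Base attempt from the question: word characters directly after a ':'
accession6 = regexpi(d2, '(?<=:)\w+', 'match');

% d2 is a cell array, so the matches are nested one level deeper
firstHits = accession6{1};

disp(numel(firstHits))   % one match per ':' in the string
disp(firstHits{1})       % NC_011985, the '.1' suffix is cut off
disp(firstHits{3})       % NC_011994, which belongs to a plasmid
```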

Any help would be super appreciated.

1 Comment

Stephen23 on 4 Dec 2017

Edited: Stephen23 on 4 Dec 2017

"Any help would be super appreciated"

You might like to download my FEX submission iregexp, an interactive regular expression tool:

https://www.mathworks.com/matlabcentral/fileexchange/48930-interactive-regular-expression-tool

It lets you quickly experiment with different regular expressions and shows all of regexp's outputs in real-time as you type.


Answer 1

per isakson on 4 Dec 2017


Edited: per isakson on 4 Dec 2017


One expression, read left to right:

'chromosome' followed by one or more characters other than ':', then a single ':'
a capturing group of one or more letters, digits, underscores, or '.' (greedy)
zero or more characters other than '/', then a single '/'
a capturing group of one or more letters, digits, underscores, or '.' (greedy)

The search then repeats until no more matches are found.

>> cac = regexpi( d2, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens' );
>> cac{:}{:}
ans = 
    'NC_011985.1'    'CP000628.1'
ans = 
    'NC_011983.1'    'CP000629.1'
ans = 
    'NC_002945.4'    'LT708304.1'
>>
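The question mentions that nothing at all may follow "chromosome". The expression above uses [^:]+, which needs at least one character before the ':'. Below is a hedged sketch of that edge case; the string s and its accession names are made up for illustration and are not part of the original data. Swapping + for * is my suggestion, not something proposed in the thread.

```matlab
% Hypothetical input: "chromosome" is followed directly by the colon
s = 'chromosome:NC_000001.1/CP000001.1';

% Original quantifier: at least one character is required before ':'
plusTok = regexpi(s, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens');

% Relaxed quantifier: zero or more characters before ':'
starTok = regexpi(s, 'chromosome[^:]*[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens');

disp(isempty(plusTok))   % the + version finds nothing here
disp(starTok{1})         % the * version returns the name pair
```

On the d2 string from the question, both quantifiers give the same three pairs, because every "chromosome" there is followed by at least a space.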

If d2 contains only one string, pass the string itself with d2{:}, and the result loses the outer cell level:

>> cac = regexpi( d2{:}, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens' );
>> cac
cac = 
    {1x2 cell}    {1x2 cell}    {1x2 cell}
>>
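Since the question says that keeping all the names in one cell is fine, the nested token cells can be flattened with a comma-separated list. This is a small sketch; the variable name names is my own choice.

```matlab
d2 = {'chromosome 1:NC_011985.1/CP000628.1; chromosome 2:NC_011983.1/CP000629.1; plasmid pAgK84:NC_011994.1/CP000632.1; plasmid pAtK84b:NC_011990.1/CP000630.1; plasmid pAtK84c:NC_011987.1/CP000631.1 chromosome Mycobacterium_bovis_AF2122/97:NC_002945.4/LT708304.1'};

% One 1x2 cell of tokens per chromosome entry
cac = regexpi(d2{:}, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens');

% Concatenate the 1x2 cells side by side into a single 1x6 cell of names
names = [cac{:}];

disp(names)
```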

6 Comments

per isakson on 4 Dec 2017

Edited: per isakson on 4 Dec 2017


An alternative that builds on @JM's look-behind approach. The first step matches "name/name" between

look-behind: (?<=chromosome[^:]+[:])
look-ahead: (?=;|$)

and the second step splits the two names at the slash.

cac = regexpi( d2{:}, '(?<=chromosome[^:]+[:])[\w\.]+[/][\w\.]+(?=;|$)', 'match' );
cac = regexp( cac, '/', 'split' );
cac{:}
ans = 
    'NC_011985.1'    'CP000628.1'
ans = 
    'NC_011983.1'    'CP000629.1'
ans = 
    'NC_002945.4'    'LT708304.1'
>>

per isakson on 4 Dec 2017

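chromosome[^:;]+, with a semicolon in the negated class (proposed by @Guillaume), is better than chromosome[^:]+ without it. If the colon after 'chromosome' is missing from the string, [^:]+ runs past the ';' into the next entry and returns a plasmid name pair. With the semicolon excluded, the match cannot cross into the next entry, so the malformed chromosome pair is missed altogether.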
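A minimal sketch of that difference, assuming a deliberately malformed string s in which the colon after "chromosome 2" has been dropped. The string and the variable names loose and strict are illustrative only.

```matlab
% Hypothetical malformed input: no ':' after "chromosome 2"
s = 'chromosome 2 NC_011983.1/CP000629.1; plasmid pAgK84:NC_011994.1/CP000632.1';

% Without ';' in the class, [^:]+ runs on to the plasmid's colon
loose  = regexpi(s, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)',  'tokens');

% With ';' in the class, the match cannot leave the chromosome entry
strict = regexpi(s, 'chromosome[^:;]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens');

disp(loose{1})          % the plasmid pair is returned by mistake
disp(isempty(strict))   % the malformed entry is skipped instead
```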
