
How to capture tokens using regular expressions?

Patrick Mboma on 16 Sep 2015
Commented: Cedric on 19 Sep 2015
Dear all, I would like to capture two parts of a sequence of strings. I would like to call the first part "main" and the second part "digits". The expressions in the strings have a distinct pattern in that they either have ONE underscore or parentheses. What I am looking to capture is the part before the underscore or the opening parenthesis (main) and the part after the underscore or inside the parenthesis (digits). As an example, the typical exercise will be of the form
expression={'abcd_1','ghsa(22)','gaver_45','fadae(8)'}
out=regexp(expression,pattern,'name')
The result should be a cell array where each cell contains a structure with fields "main" and "digits". In the first case, for instance, the result should be
main='abcd' and digits='1'.
What I am missing is the right "pattern". Any suggestions?
  5 comments
Patrick Mboma on 19 Sep 2015
Thanks a lot Cedric!!!
Cedric on 19 Sep 2015
My pleasure!


Answers (2)

Benjamin Kraus on 16 Sep 2015
expression={'abcd_1','ghsa(22)','gaver_45','fadae(8)'};
pattern = '(?<main>[a-zA-Z]+)(?:[_\(])(?<digits>[0-9]+)\)?';
out = regexp(expression,pattern,'once','names');
The pattern breaks down like this:
  • (?<main>[a-zA-Z]+) - A token named "main" that matches one or more letters.
  • (?:[_\(]) - A non-capturing group that matches either an underscore or "(".
  • (?<digits>[0-9]+) - A token named "digits" that matches one or more digits.
  • \)? - An optional ")" character at the end.
The 'once' option returns only the first match in each input string. I think in this case you can leave it out.
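To check the breakdown above, here is a minimal sketch that applies the pattern and reads the named tokens back. It assumes the final parenthesis in the pattern is escaped as \)? so that it matches a literal ")". The loop and the fprintf call are only there for illustration.

```matlab
% Sketch: apply the pattern and read the named tokens back.
expression = {'abcd_1','ghsa(22)','gaver_45','fadae(8)'};
% Assumption: the trailing ")" is escaped so it is treated as a literal character.
pattern = '(?<main>[a-zA-Z]+)(?:[_\(])(?<digits>[0-9]+)\)?';
out = regexp(expression, pattern, 'names', 'once');

% With a cell array input, out is a cell array with one struct per string.
for k = 1:numel(out)
    fprintf('%s: main=%s, digits=%s\n', expression{k}, out{k}.main, out{k}.digits);
end
```

Each struct has the fields main and digits, so out{2}.digits holds the text between the parentheses of 'ghsa(22)'.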
  1 comment
Patrick Mboma on 17 Sep 2015
Dear Benjamin,
Thanks for your input. Your solution works for the example, but it needs to be refined because the first part, main, may also include some digits. For instance,
whatever345whatever_100
would also be something I would like to capture. It is the second part that would only include digits.
A potential algorithm would be to say everything before an opening parenthesis or an underscore is to be captured in "main", while everything after an underscore or inside parentheses is to be captured in "digits".
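One possible way to express that rule as a single pattern is sketched below. This is an illustration under the assumptions stated in the question (exactly one underscore, or one pair of parentheses, per string), not necessarily the pattern that was finally adopted in this thread. The ^ and $ anchors are an added assumption that each string must match in full.

```matlab
% Sketch: "main" is everything before the first "_" or "(", "digits" is what follows.
expression = {'abcd_1','ghsa(22)','whatever345whatever_100','fadae(8)'};

% [^_\(]+  one or more characters that are neither "_" nor "(" (letters or digits)
% [_\(]    the separator itself, not captured
% \d+      one or more digits, captured as "digits"
% \)?      an optional closing parenthesis
pattern = '^(?<main>[^_\(]+)[_\(](?<digits>\d+)\)?$';
out = regexp(expression, pattern, 'names', 'once');

disp(out{3}.main)    % part of 'whatever345whatever_100' before the underscore
disp(out{3}.digits)  % part after the underscore
```

The pattern is deliberately lenient: it would also accept a malformed input such as 'abc_12)'. If that matters, the underscore form and the parenthesis form would have to be matched by separate alternatives.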



Kirby Fears on 16 Sep 2015
This isn't the most efficient or elegant solution, but it solves the problem. Let me know if your data is large enough that this code is slow. I can optimize it.
ex={'abcd_1','ghsa(22)','gaver_45','fadae(8)'};
% Split each string at '_', '(' or ')'.
temp=cellfun(@(s)strsplit(s,{'_','(',')'}),ex,'UniformOutput',false);
% The first piece is the main part; the second piece is the digits.
ex_main=cellfun(@(s)s{1},temp,'UniformOutput',false);
ex_digit=cellfun(@(s)s{2},temp,'UniformOutput',false);
clear temp;
  1 comment
Patrick Mboma on 17 Sep 2015
Dear Kirby,
There are many ways to solve this problem and what you are suggesting is definitely one way to do it. However, I would like to use the elegance of regular expressions and get to practice something I am not very good at yet.
In my current solution for instance, I first use regular expressions to transform all the inputs into the same format
whatever_45
and then I look for the underscore, etc. But this takes several lines of code.
Thanks for your input!
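For readers following along, the two-step approach described in this comment might look roughly like the sketch below. The original code is not shown in the thread, so the variable names and the exact patterns here are assumptions made for illustration.

```matlab
% Sketch of a two-step approach (names and patterns are hypothetical).
ex = {'abcd_1','ghsa(22)','gaver_45','fadae(8)'};

% Step 1: rewrite name(22) as name_22 so that every input has the same format.
normalized = regexprep(ex, '\((\d+)\)$', '_$1');

% Step 2: split each normalized string at the underscore.
parts = regexp(normalized, '_', 'split');
ex_main   = cellfun(@(p) p{1}, parts, 'UniformOutput', false);
ex_digits = cellfun(@(p) p{2}, parts, 'UniformOutput', false);
```

This relies on each string containing a single underscore after normalization, as stated in the question.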

