How to capture tokens using regular expressions?

Question

Patrick Mboma on 16 Sep 2015

https://fr.mathworks.com/matlabcentral/answers/243437-how-to-capture-tokens-using-regular-expressions

Commented: Cedric on 19 Sep 2015

Dear all, I would like to capture two parts of a sequence of strings. I would like to call the first part "main" and the second part "digits". The expressions in the strings have a distinct pattern in that they either have ONE underscore or parentheses. What I am looking to capture is the part before the underscore or the opening parenthesis (main) and the part after the underscore or inside the parenthesis (digits). As an example, the typical exercise will be of the form

 expression={'abcd_1','ghsa(22)','gaver_45','fadae(8)'}
 out=regexp(expression,pattern,'names')

The result should be a cell array where each cell contains a structure with fields "main" and "digits". In the first case, for instance, the result should be

main='abcd' and digits='1'.

What I am missing is the right "pattern". Any suggestions?

Cedric on 17 Sep 2015

Edited: Cedric on 17 Sep 2015

Dear Patrick,

In summary, for extracting and validating digits and a decimal point, I would write a pattern like

'(.*?)[\(_]([\d\.]*)'

which explicitly requires the second part to be zero or more (*) elements of the set ([]) of digits (\d) or decimal points (\.). Yet if I wanted to leave validation to STR2DOUBLE, I would extract whatever is in parentheses or after the underscore:

'(.*?)[\(_]([^\)]*)'

which translates to zero or more (*) characters that are not in the set ([^]) containing the literal closing parenthesis. Another way is given by Benjamin, who adds an optional closing parenthesis.
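To make the difference between the two patterns concrete, here is a minimal sketch that applies both to a few strings. The variable names (strict, loose, tokStrict, tokLoose, valStrict, valLoose) are illustrative only, and the entry 'x(1a)' is a made-up malformed string added as an assumption to show where the two patterns diverge.

```matlab
% Two of the example strings plus one hypothetical malformed entry.
expression = {'abcd_1', 'ghsa(22)', 'x(1a)'};

strict = '(.*?)[\(_]([\d\.]*)';   % second token: digits and decimal points only
loose  = '(.*?)[\(_]([^\)]*)';    % second token: anything up to a closing parenthesis

% With a cell input and 'once', each output cell holds the tokens of the first match.
tokStrict = regexp(expression, strict, 'tokens', 'once');
tokLoose  = regexp(expression, loose,  'tokens', 'once');

% Convert the second token of each match to a number.
valStrict = cellfun(@(t) str2double(t{2}), tokStrict);
valLoose  = cellfun(@(t) str2double(t{2}), tokLoose);
```

For the malformed entry, the strict pattern should stop at the first non-digit and yield 1, whereas the loose pattern hands '1a' to STR2DOUBLE, which returns NaN and thereby flags the entry as invalid.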

I also asked how these strings are defined initially, because the context is important. If you are dealing with a reasonable number of cells, performing pattern matching on a cell array will be efficient enough. If, on the contrary, you have e.g. a 1 GB file of entries to process, it may be much more efficient to work on it "manually". To illustrate, say the file contains

 name1_45 
 name2(45)
 name2b_32
 name2c(84)
 ..

then you could load it as a char array, replace all '_', '(', ')', newlines, and carriage returns with white space, and extract names and contents in one shot with SSCANF or TEXTSCAN:

 % - Dummy file content.
 content = sprintf( 'name1_45\nname2(45)\nname2b_32\nname2c(84)\n' ) ;
 % - Flag elements to replace.
 doReplace = content == '_' | content == '(' | content == ')' | content == 10 ;
 % - Replace with space.
 content(doReplace) = ' ' ;
 % - Parse.
 parsed = textscan( content, '%s %f' ) ;

(10 is the ASCII code of the newline \n; the code should also handle 13, the carriage return. It may be possible to make this even more efficient using BSXFUN.) With that we get

 >> parsed
 parsed = 
    {4x1 cell}    [4x1 double]
 >> parsed{1}
 ans = 
    'name1'
    'name2'
    'name2b'
    'name2c'
 >> parsed{2}
 ans =
    45
    45
    32
    84
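As a sketch of the carriage-return handling mentioned above, the variant below uses ISMEMBER to flag all separator characters at once instead of chaining comparisons. The input with Windows-style line endings and the names separators, names, and values are assumptions for illustration, not part of the original code.

```matlab
% Dummy content with \r\n line endings (assumed input).
content = sprintf('name1_45\r\nname2(45)\r\nname2b_32\r\nname2c(84)\r\n');

% Characters to blank out: underscore, parentheses, newline (10), carriage return (13).
separators = ['_()' char(10) char(13)];
content(ismember(content, separators)) = ' ';

% Parse alternating name / number fields separated by white space.
parsed = textscan(content, '%s %f');
names  = parsed{1};   % column cell array of names
values = parsed{2};   % column vector of doubles
```

The logic is the same as in the original snippet; only the construction of the replacement mask changes.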

Patrick Mboma on 19 Sep 2015

Thanks a lot Cedric!!!

Cedric on 19 Sep 2015

My pleasure!

Answer 1

Benjamin Kraus on 16 Sep 2015

https://fr.mathworks.com/matlabcentral/answers/243437-how-to-capture-tokens-using-regular-expressions#answer_192653

expression={'abcd_1','ghsa(22)','gaver_45','fadae(8)'};
pattern = '(?<main>[a-zA-Z]+)(?:[_\(])(?<digits>[0-9]+)\)?';
out = regexp(expression,pattern,'once','names');

The pattern breaks down like this:

(?<main>[a-zA-Z]+) - A token named "main" containing only letters.
(?:[_\(]) - A non-capturing group matching either an underscore or "(".
(?<digits>[0-9]+) - A token named "digits" containing only digits.
\)? - An optional ")" character at the end.

The 'once' option returns only the first match in each input string. I think in this case you can leave it out.
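For readers who want to use the result, here is a minimal sketch of how the named tokens might be read back. It assumes every input string matches the pattern and escapes the closing parenthesis so that it is matched literally; the names s, mainParts, and digitParts are illustrative.

```matlab
expression = {'abcd_1','ghsa(22)','gaver_45','fadae(8)'};
% Named tokens "main" and "digits"; the trailing \)? matches an optional ")".
pattern = '(?<main>[a-zA-Z]+)[_\(](?<digits>[0-9]+)\)?';

% With a cell input and 'once', the result is a cell array of scalar structs.
out = regexp(expression, pattern, 'names', 'once');

% Concatenate into a struct array (assumes no string failed to match).
s = [out{:}];
mainParts  = {s.main};                 % cell array of character vectors
digitParts = str2double({s.digits});   % numeric row vector
```

Individual results can also be read directly, e.g. out{1}.main and out{1}.digits for the first string.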

Patrick Mboma on 17 Sep 2015

Dear Benjamin,

Thanks for your input. Your solution would work but would probably need to be refined, in the sense that the first part, "main", may also include some digits. For instance,

whatever345whatever_100

would also be something I would like to capture. It is the second part that would only include digits.

A potential algorithm would be to say everything before an opening parenthesis or an underscore is to be captured in "main", while everything after an underscore or inside parentheses is to be captured in "digits".
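The algorithm described above could be sketched with named tokens as follows. This is one possible pattern under the stated assumptions (a single underscore or one pair of parentheses per string), not necessarily the one finally adopted in the thread.

```matlab
expression = {'whatever345whatever_100', 'abcd_1', 'ghsa(22)'};

% main:   one or more characters that are neither "_" nor "("
% digits: one or more digits after the separator, then an optional ")"
pattern = '(?<main>[^_\(]+)[_\(](?<digits>\d+)\)?';

out = regexp(expression, pattern, 'names', 'once');

firstMain   = out{1}.main;     % part before the underscore
firstDigits = out{1}.digits;   % part after the underscore
```

Because "main" is defined by exclusion rather than as letters only, digits inside the first part no longer prevent a match.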

Answer 2

Kirby Fears on 16 Sep 2015

https://fr.mathworks.com/matlabcentral/answers/243437-how-to-capture-tokens-using-regular-expressions#answer_192648

This isn't the most efficient or elegant solution, but it solves the problem. Let me know if your data is large enough that this code is slow. I can optimize it.

ex={'abcd_1','ghsa(22)','gaver_45','fadae(8)'};
temp=cellfun(@(s)strsplit(s,{'_','(',')'}),ex,'UniformOutput',false);
ex_main=cellfun(@(s)s{1},temp,'UniformOutput',false);
ex_digit=cellfun(@(s)s{2},temp,'UniformOutput',false);
clear temp;

Patrick Mboma on 17 Sep 2015

Dear Kirby,

There are many ways to solve this problem and what you are suggesting is definitely one way to do it. However, I would like to use the elegance of regular expressions and get to practice something I am not very good at yet.

In my current solution for instance, I first use regular expressions to transform all the inputs into the same format

whatever_45

then I look for the underscore, etc. But this entails several lines of code.

Thanks for your input!
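The two-step approach described in this comment (normalize to the whatever_45 form, then split at the underscore) is not shown in the thread. A hypothetical reconstruction, with all variable names and both patterns being assumptions, might look like this:

```matlab
expression = {'abcd_1','ghsa(22)','gaver_45','fadae(8)'};

% Step 1: rewrite a trailing '(digits)' as '_digits'; other strings are left unchanged.
normalized = regexprep(expression, '\((\d+)\)$', '_$1');

% Step 2: split each normalized string at the underscore.
parts = regexp(normalized, '_', 'split');
mainParts  = cellfun(@(p) p{1}, parts, 'UniformOutput', false);
digitParts = cellfun(@(p) p{2}, parts, 'UniformOutput', false);
```

This illustrates why a single pattern with named tokens is more compact: the normalization and the split collapse into one REGEXP call.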
