Regex: How can I perform positive lookbehind for a specific sequence of characters?

Adam Brann on 14 Sep 2021
Commented: Adam Brann on 14 Sep 2021
EDIT: Changed 'Negative lookbehind' to 'Positive lookbehind'
Hi,
I am attempting to separate the first name from each entry in a list of names, using regex. The format of the names is as follows:
<last name>, <title>. <first name> <middle names> (<other name>)
Where <middle names> and (<other name>) are optional.
I'm new to regex and currently find it hard to intuit. It seems to me that I need a positive lookbehind to capture the word preceded by a '.' followed by a whitespace character, which would give me the first names, but it's not working how I'd like! See the code below:
load titanic.mat
% Attempt #1 (Matches words preceded by '.' characters OR whitespace characters -
% I need it to match '.' followed by a whitespace... how???
name_first = regexp(train.Name, '(?<=[\.\s])([A-Z][a-z]+)', 'match')
% Attempt #2 (Captures unwanted '. ' before first names)
name_first2 = regexp(train.Name, '\.\s([A-Z][a-z]+)', 'match')
% Attempt #3 (Attempt to capture 3rd word, doesn't work)
name_first3 = regexp(train.Name, '(\w.*\w){3}', 'match')
Alternative solutions are great, but ideally I'd like to understand WHY my current code doesn't work (specifically attempt #1), and how I might make it work using a positive lookbehind that looks behind for a specific sequence of characters (i.e. returns a word preceded by 'abc').
Thanks in advance for your help.
4 comments
Walter Roberson on 14 Sep 2021
Edited: Walter Roberson on 14 Sep 2021
% I need it to match '.' followed by a whitespace... how???
Use
name_first = regexp(train.Name, '(?<=\.\s)([A-Z][a-z]+)', 'match')
But consider using \s+ instead of \s.
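To make the difference from attempt #1 concrete, here is a minimal sketch on a single sample string. The name used is a hypothetical example in the stated format, not necessarily an entry in the poster's titanic.mat, and the variable names are illustrative only. The last call shows one possible way to allow several whitespace characters without a lookbehind at all, by returning a token instead of the whole match.

```matlab
% Hypothetical sample entry in the "<last name>, <title>. <first name> ..." format;
% not taken from the poster's titanic.mat.
name = 'Braund, Mr. Owen Harris';

% [\.\s] is a character class: ONE character that is either '.' or whitespace,
% so every capitalised word that follows a dot or a space qualifies.
byClass = regexp(name, '(?<=[\.\s])([A-Z][a-z]+)', 'match');

% \.\s is a sequence: a '.' immediately followed by one whitespace character.
bySequence = regexp(name, '(?<=\.\s)([A-Z][a-z]+)', 'match');

% Alternative without lookbehind: match '.' plus one or more whitespace
% characters, but return only the parenthesised token.
byToken = regexp(name, '\.\s+([A-Z][a-z]+)', 'tokens', 'once');

disp(byClass)      % cell array holding 'Mr', 'Owen' and 'Harris'
disp(bySequence)   % cell array holding only 'Owen'
disp(byToken{1})   % Owen
```

The character class explains why attempt #1 also returned the title and the middle names: each of those words is preceded by a single space, which satisfies the class on its own.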
Also, are you sure you do not need to handle names with apostrophes, like O'Rorke? Are you sure you do not need to handle names with dashes, like Fitz-Williams? Are you sure you do not need to handle surnames with spaces, such as van Horton? That last one, incidentally, is also an example of a name that starts with a lower-case letter.
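As a rough illustration of those cases, the sketch below compares the original first-name pattern with a looser character class that also accepts lower-case initials, apostrophes and hyphens. The sample names and the variable names are hypothetical, and the looser class is only one possible starting point; it does not attempt to cover every naming convention.

```matlab
% Hypothetical sample names, chosen only to illustrate the cases raised above.
names = {'Fitz-Williams, Mrs. Anne-Marie', 'O''Rorke, Mr. D''Arcy James'};

% Original first-name pattern: one capital letter followed by lower-case
% letters, so it stops at the first hyphen or apostrophe.
strict = regexp(names, '(?<=\.\s)[A-Z][a-z]+', 'match', 'once');

% Looser sketch: letters of either case, apostrophes and hyphens.
% ('' inside a MATLAB character vector is a literal single quote.)
loose = regexp(names, '(?<=\.\s)[A-Za-z''\-]+', 'match', 'once');

disp(strict)   % 'Anne' for the first name, empty for the second
disp(loose)    % 'Anne-Marie' and 'D'Arcy'
```

With the strict pattern the second name yields no match at all, because the capital D is followed by an apostrophe rather than a lower-case letter.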
Adam Brann on 14 Sep 2021
Thanks for your answer, exactly what I needed. I mistakenly thought the characters to be 'looked behind for' needed to be inside square brackets.
Excellent points regarding the 'unusual' names, I'll go away and have a think about how I might write a regexp to capture those cases. Many thanks for your help.
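For the general case asked about in the question (a word preceded by a specific sequence such as 'abc'), the sequence can be written directly inside the lookbehind, without square brackets. The following is a small sketch; the test strings and variable names are made up for illustration.

```matlab
% Hypothetical test string.
str = 'xyz abcWord abc123 defOther';

% Word characters that are immediately preceded by the literal sequence 'abc'.
afterAbc = regexp(str, '(?<=abc)\w+', 'match');

% The same idea with a longer sequence: only first names that follow 'Mr. '.
% Special characters such as '.' still need escaping inside the lookbehind.
mrName   = regexp('Braund, Mr. Owen Harris', '(?<=Mr\.\s)\w+', 'match', 'once');
missName = regexp('Heikkinen, Miss. Laina',  '(?<=Mr\.\s)\w+', 'match', 'once');

disp(afterAbc)          % cell array holding 'Word' and '123'
disp(mrName)            % Owen
disp(isempty(missName)) % 1, because 'Miss. ' does not end in 'Mr. '
```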

Release: R2021a
