Names can be so difficult to parse into first, middle, last, and suffix.
This group of names (saved at regex.com) is giving me a migraine.
The desired parse is actually /^(.)(\b[Vv][ao]n\b\s\w+|\b[Dd][eu]\s\b\w+)(.)/ which groups 'De La', but I want to make sure that 'La name' is also included and grouped properly so I focused on the difference between 'De La name' and 'La name' to make sure the logic works.
Also not sure how to incorporate (De La \w+)|(La \w+) into the rest of the regex.
TIA
** Update (per @lemon's request) **
The name string Emile La Sére
should return (Emile) (La Sére)
without losing the diacritical on the "e"
Justin De Witt Bowersock
should return (Justin) (De Witt) (Bowersock)
Monica De La Cruz
should return (Monica) (De La Cruz)
Robert M. La Follette
should return (Robert M.) (La Follette)
or ideally (Robert) (M.) (La Follette)
Henry St. John
should return (Henry) (St. John)
Edward St. Loe Livermore
should return (Edward) (St. Loe) (Livermore)
Oscar L. Auf der Heide
should return (Oscar) (L.) (Auf der Heide)
I've been able to successfully parse these in various groupings. I don't know if it is possible to parse the whole range in a single pattern.
The main pattern that partially works is (^.*)\b([Vv][ao]n\s\w+|[Dd][ue]\s\w+|[Dd]e\s[Ll]a\s\w+|St\.\s\w+)\s*(.*)
however, the crossover between De Witt, [Dd]e [Ll]a Cruz and '[Ll]a Follette' is giving me a headache.
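The crossover can be reproduced with a short check. This is only a sketch: JavaScript is assumed for illustration, and the variable names are made up. It runs the partially working pattern above against two of the sample names.

```javascript
// The partially working pattern from above, unchanged.
const partial = /(^.*)\b([Vv][ao]n\s\w+|[Dd][ue]\s\w+|[Dd]e\s[Ll]a\s\w+|St\.\s\w+)\s*(.*)/;

// "De La Cruz" is split, because the [Dd][ue]\s\w+ alternative is tried
// before [Dd]e\s[Ll]a\s\w+ and already matches "De La".
const cruz = partial.exec("Monica De La Cruz");
console.log(cruz[2], "|", cruz[3]); // De La | Cruz

// A bare "La ..." surname is not matched at all: no alternative starts with La.
const follette = partial.exec("Robert M. La Follette");
console.log(follette); // null
```

Note that \w in JavaScript only covers ASCII word characters, so a name like Sére would also be cut at the accented letter.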
Also I am a novice regex wizard so there's that.
** Update 2 ** This pattern from @The fourth bird is almost perfect. I dressed it up with a couple of additions to catch the previously unmentioned outliers, so it's almost airtight (assuming there are no other pattern outliers I've missed).
** Update **
Thanks to @The fourth bird this pattern is the one that works.
As you already pointed out, names can be really difficult to parse. See a nice read about Falsehoods Programmers Believe About Names.
For the provided example data, you might use:
^(.*?)\b((?:[Vv][ao]n|(?:[Dd][eu]\s+)?La|[Dd][eu]|St\.|Auf\s+der)\s+\p{L}+)(.*)
^ : Start of string
(.*?) : Capture group 1, match any character, as few as possible
\b : A word boundary
( : Capture group 2
  (?: : Non capture group for the alternatives
    [Vv][ao]n : Match V or v, then a or o, and then n
    | : Or
    (?:[Dd][eu]\s+)?La : Optionally match D or d, then e or u, and 1+ whitespace chars, followed by La
    | : Or
    [Dd][eu] : Match D or d, then e or u
    | : Or
    St\. : Match St.
    | : Or
    Auf\s+der : Match Auf der with 1+ whitespace chars in between
  ) : Close the non capture group
  \s+ : Match 1+ whitespace chars
  \p{L}+ : Match any letter 1+ times
) : Close group 2
(.*) : Capture group 3, match the rest of the line (any character, 0+ times)
See a regex demo.
When using JavaScript, include the u flag so that the Unicode property escape \p{L} is recognized:
const regex = /^(.*?)\b((?:[Vv][ao]n|(?:[Dd][eu]\s+)?La|[Dd][eu]|St\.|Auf\s+der)\s+\p{L}+)(.*)/gmu;
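As a rough usage sketch, the pattern can be wrapped in a small helper and run over the sample names from the question. The helper name splitName, the trimming of each group, and the dropped g/m flags (each name is matched as its own string here) are assumptions made for this example only.

```javascript
// Same pattern as above; g and m are left out because every name is a separate string.
const nameRegex = /^(.*?)\b((?:[Vv][ao]n|(?:[Dd][eu]\s+)?La|[Dd][eu]|St\.|Auf\s+der)\s+\p{L}+)(.*)/u;

// Hypothetical helper: returns the three captured groups, trimmed,
// or null when none of the listed particles occurs in the name.
function splitName(name) {
  const m = nameRegex.exec(name);
  if (m === null) return null;
  return m.slice(1).map((part) => part.trim());
}

const samples = [
  "Emile La Sére",
  "Justin De Witt Bowersock",
  "Monica De La Cruz",
  "Robert M. La Follette",
  "Henry St. John",
  "Edward St. Loe Livermore",
  "Oscar L. Auf der Heide",
];

for (const sample of samples) {
  console.log(sample, "->", splitName(sample));
}
// For example:
// Monica De La Cruz -> [ 'Monica', 'De La Cruz', '' ]
// Justin De Witt Bowersock -> [ 'Justin', 'De Witt', 'Bowersock' ]
```

The middle initial stays in the first group (Robert M.), so splitting it off would need a separate step.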
Note that \s can also match a newline. When using PCRE, for example, you might replace \s with \h to match horizontal whitespace chars (no newlines); see this regex demo.
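JavaScript has no \h, so there a negated class such as [^\S\r\n] (whitespace, but not CR or LF) can serve as an approximation. The sketch below assumes that substitution and uses a made-up two-line input to show the difference. It is not an exact equivalent of \h, since the class still allows vertical tab, form feed, and the Unicode line separators.

```javascript
// Approximation of PCRE's \h: any whitespace except carriage return and line feed.
const H = "[^\\S\\r\\n]";

// Original pattern, where \s may run across a line break.
const loose = /^(.*?)\b((?:[Vv][ao]n|(?:[Dd][eu]\s+)?La|[Dd][eu]|St\.|Auf\s+der)\s+\p{L}+)(.*)/gmu;

// Same pattern with every \s replaced by the horizontal-only class.
const strict = new RegExp(
  `^(.*?)\\b((?:[Vv][ao]n|(?:[Dd][eu]${H}+)?La|[Dd][eu]|St\\.|Auf${H}+der)${H}+\\p{L}+)(.*)`,
  "gmu"
);

// Hypothetical input: the first line ends in a particle with no surname after it.
const text = "Mary De\nAnna La Salle";

const looseMatches = [...text.matchAll(loose)];
const strictMatches = [...text.matchAll(strict)];

console.log(JSON.stringify(looseMatches[0][2]));  // "De\nAnna" (spans two lines)
console.log(JSON.stringify(strictMatches[0][2])); // "La Salle"
```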