I need to a regular expression to extract names from a GEDCOM file. The format is:
Fred Joseph /Smith/
Where the text bounded by the / is the surname and the Fred Joseph are the forenames. The complication is that the surname could be at any place in the text or may not be there at all. I need something that will extract the surname and capture everything else as the forenames.
This is as far as I have got and I have tried making groups optional with the ? qualifier but to no avail:
As you can see it has several problems: If the surname is missing nothing gets captured, the forename(s) sometimes have leading and trailing spaces, and I have 3 capture groups when I'd really like 2. Even better would be if the capture group for the surname didn't include the '/' characters.
Any help would be much appreciated.
For your last line, I'm not sure there is a way to join the group 1 with group 3 into a single group.
Here is my proposed solution. It doesn't capture spaces around forenames.
^(?:\h*([a-z\h]+\b)\h*)?(?:\/([a-z\h]+)\/)?(?:\h*([a-z\h]+\b)\h*)?$
To correctly match the names, care to use the insensitive flag, and if you test all lines at once, use multiline flag.
^
start of the line(?:\h*([a-z\h]+\b)\h*)?
first non-capturing group that matches 0 or 1 time:
\h*
0 or more horizontal spaces([a-z\h]+\b)
captures in a group letters and spaces, but stops at the end of the last word\h*
matches the possible remaining spaces without capturing(?:\/([a-z\h]+)\/)?
second non-capturing group that matches 0 or 1 time a name in a capturing group surrounded by slashes(?:\h*([a-z\h]+\b)\h*)?
third non-capturing group doing the same as first one, capturing the names in a third group.$
end of the line