regexmultilinegedcom

Regular expression for end of line


I am trying to parse a GEDCOM file using regular expressions and am almost there, but the expression grabs the next line of the text for lines where there is optional text at the end of line. Each record should be a single line.

This is an extract from the file:

0 HEAD
1 CHAR UTF-8
1 SOUR Ancestry.com Family Trees
2 VERS (2010.3)
2 NAME Ancestry.com Family Trees
2 CORP Ancestry.com
1 GEDC
2 VERS 5.5
2 FORM LINEAGE-LINKED
0 @P6@ INDI
1 BIRT

And this is the regular expression I am using:

(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)

This works for all lines except those that do not contain any text at the end, such as the first one. For instance, the last capture group for the first record contains the '1 CHAR UTF-8'.

Here's a screenshot from regex101.com, showing how the purple capture group bleeds onto the next line:

Regex101 screen shot

I have tried using the $ qualifier to limit the .* to just line ends, but this fails as the second line is also a line end.


Solution

  • The \s pattern matches newline symbols. Replace it with a regular space, or [^\S\r\n], or \h if it is PCRE, or [\p{Zs}\t].

    (\d+) +(@\S+@)? *(\S+) +(.*)
    

    See the regex demo

    If you need to match lines, you may add a multiline option and add anchors (^ at the start and $ at the end of the patten) on both sides (see another demo).