I would like to parse a MARC record with a regular expression and return the field as the first captured group and the value as the second captured group. Here's what I've got thus far for the regex:
(\n[0-9]{3})[ 0-9]{4}([^\n]*)
The last capture group there, ([^\n]*), captures everything up to the next line break, which works great with lines like:
001 868229892
100 1 Montgomery, L. M.|q(Lucy Maud),|d1874-1942.,|eauthor.
245 10 Anne of Green Gables /|cL.M. Montgomery.
250 Aladdin hardcover edition.
264 1 New York :|bAladdin,|c2014.
300 440 pages ;|c22 cm
336 text|2rdacontent.
337 unmediated|2rdamedia.
338 volume|2rdacarrier.
However, when it comes to values which break over lines, the regex no longer works:
520 Anne, an eleven-year-old orphan, is sent by mistake to
live with a lonely, middle-aged brother and sister on a
Prince Edward Island farm and proceeds to make an
indelible impression on everyone around her.
650 0 Shirley, Anne (Fictitious character)|vJuvenile fiction.
The match should stop at the 650 line above, so the regex should capture everything up to a line break followed by 3 digits.
I did try ([^\n0-9]*), but that is a character class: it matches any run of characters that are neither digits nor line breaks, in any order. I need it to stop only at a line break followed by 3 digits, in that exact sequence.
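To make the problem with the character class concrete, here is a minimal Python sketch. Python and the `re` module are assumptions (the question does not name a regex flavor), and the sample string, including its spacing, is made up for illustration.

```python
import re

# Hypothetical sample line; the exact column spacing is an assumption.
sample = "\n264  1 New York :|bAladdin,|c2014."

# The attempted pattern: the negated class excludes every digit,
# not just the sequence "line break followed by 3 digits".
attempt = re.compile(r"(\n[0-9]{3})[ 0-9]{4}([^\n0-9]*)")

match = attempt.search(sample)
field = match.group(1).strip()
value = match.group(2)

print(field)  # 264
print(value)  # New York :|bAladdin,|c
```

The value is cut off at the first digit of 2014, even though no new field starts there.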
Try this regex, as demonstrated on regex101:
(\n[0-9]{3})[ 0-9]{4}([^\n]+(?:\n\s+[^\n]+)*)
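The capture group ([^\n]+(?:\n\s+[^\n]+)*) matches:
[^\n]+ - one or more characters other than a line break (the first line of the value)
(?:\n\s+[^\n]+)* - zero or more continuation lines, each consisting of a line break, one or more whitespace characters, and then one or more characters other than a line break
Because a new field starts with a digit right after the line break, \n\s+ cannot match there, so the capture stops before the next field. This relies on continuation lines being indented.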
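As a rough sketch of how this pattern might be used, the Python snippet below runs it over a shortened record. Python is an assumption, as are the sample string, its column spacing, and the indentation of the continuation lines; the `record` and `fields` names are just for illustration.

```python
import re

# Hypothetical record; each field starts after a line break, and
# continuation lines are assumed to be indented with spaces.
record = (
    "\n001    868229892"
    "\n245 10 Anne of Green Gables /|cL.M. Montgomery."
    "\n520    Anne, an eleven-year-old orphan, is sent by mistake to"
    "\n       live with a lonely, middle-aged brother and sister on a"
    "\n       Prince Edward Island farm."
    "\n650  0 Shirley, Anne (Fictitious character)|vJuvenile fiction."
)

pattern = re.compile(r"(\n[0-9]{3})[ 0-9]{4}([^\n]+(?:\n\s+[^\n]+)*)")

# Strip the leading line break from the tag and collapse the
# wrapped value into a single line.
fields = [
    (tag.strip(), " ".join(value.split()))
    for tag, value in pattern.findall(record)
]

for tag, value in fields:
    print(tag, "=>", value)
```

Joining the wrapped value with single spaces is a design choice for display; keep the raw second group if the original line breaks matter.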