regex pcre marc

How can I parse MARC records with a Regular Expression?


I would like to parse a MARC record with a regular expression and return the field as the first captured group and the value as the second captured group. Here's what I've got thus far for the regex:

(\n[0-9]{3})[ 0-9]{4}([^\n]*)

The last capture group there ([^\n]*) is capturing everything up until the next line break, which works great with lines like:

001    868229892 
100 1  Montgomery, L. M.|q(Lucy Maud),|d1874-1942.,|eauthor. 
245 10 Anne of Green Gables /|cL.M. Montgomery. 
250    Aladdin hardcover edition. 
264  1 New York :|bAladdin,|c2014. 
300    440 pages ;|c22 cm 
336    text|2rdacontent. 
337    unmediated|2rdamedia. 
338    volume|2rdacarrier. 

However, when it comes to values which break over lines, the regex no longer works:

520    Anne, an eleven-year-old orphan, is sent by mistake to 
       live with a lonely, middle-aged brother and sister on a 
       Prince Edward Island farm and proceeds to make an 
       indelible impression on everyone around her. 
650  0 Shirley, Anne (Fictitious character)|vJuvenile fiction. 

The match should stop at the 650 line above, so the regex needs to capture everything up to a line break followed by 3 digits.

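I did try ([^\n0-9]*), but a character class works one character at a time: it matches any run of characters that are neither digits nor line breaks, so it stops at the first digit anywhere in the value. I need it to stop only at a line break followed by 3 digits, in that exact sequence.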
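To illustrate the character-class problem, here is a minimal Python sketch. It assumes Python's `re` module behaves like PCRE for this pattern, and the sample string is the value of the 264 field shown earlier.

```python
import re

# Value of the 264 field from the sample record above.
value = "New York :|bAladdin,|c2014."

# The negated class excludes every digit, not just "newline + 3 digits",
# so the match ends at the first digit it meets (the "2" in 2014).
CLASS_MATCH = re.match(r"[^\n0-9]*", value).group(0)

print(repr(CLASS_MATCH))
```

The year is cut off even though no line break is involved, which is why a character class alone cannot express "stop at a line break followed by 3 digits".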


Solution

  • Try this regex, as demonstrated on regex101:

    (\n[0-9]{3})[ 0-9]{4}([^\n]+(?:\n\s+[^\n]+)*)

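    The capture group ([^\n]+(?:\n\s+[^\n]+)*) matches the rest of the field's first line, followed by any number of continuation lines that begin with whitespace.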
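As a rough check of the regex above, the following Python sketch runs it over a shortened copy of the sample record. The `parse_fields` helper, the prepended newline (so the first field also matches) and the joining of continuation lines with single spaces are assumptions made for illustration; they are not part of the original answer.

```python
import re

# Pattern from the answer: group 1 is the newline plus the 3-digit tag,
# group 2 is the value, including whitespace-indented continuation lines.
FIELD_RE = re.compile(r"(\n[0-9]{3})[ 0-9]{4}([^\n]+(?:\n\s+[^\n]+)*)")


def parse_fields(record: str) -> list[tuple[str, str]]:
    """Hypothetical helper: return (tag, value) pairs for each field."""
    # The pattern requires a leading newline, so add one for the first field.
    text = "\n" + record
    fields = []
    for m in FIELD_RE.finditer(text):
        tag = m.group(1).strip()
        # Collapse the wrapped lines of a value into a single line.
        value = " ".join(part.strip() for part in m.group(2).split("\n"))
        fields.append((tag, value))
    return fields


SAMPLE = "\n".join([
    "001    868229892",
    "520    Anne, an eleven-year-old orphan, is sent by mistake to",
    "       live with a lonely, middle-aged brother and sister on a",
    "       Prince Edward Island farm and proceeds to make an",
    "       indelible impression on everyone around her.",
    "650  0 Shirley, Anne (Fictitious character)|vJuvenile fiction.",
])

FIELDS = parse_fields(SAMPLE)
for tag, value in FIELDS:
    print(tag, "->", value)
```

Note that `\s` also matches newlines, so a blank line between two fields could cause the following field to be absorbed into the previous value. If the input may contain blank lines, replacing `\s+` with `[ \t]+` is one possible tightening.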