I am writing a regex to extract various pieces of information from an EDIFACT UN code list. There are tens of thousands of codes and I do not want to type them all in, so I have decided to parse the text file with a regex and extract the parts I need. The text file is structured so that those parts are easy to identify.
I built the following regex and tested it in Regex Hero, but I cannot get the codeComment group to match everything up to a double line break. I tried the character class [^\n\n], but the match still does not run up to the double line break.
Note: I have selected the Multiline option on Regex Hero.
(?<element>\d+)\s\s(?<elementName>.*)\[[B|C|I]\]\s+Desc: (?<desc>[^\n]*\s*[^\n]*)
^\s*Repr: (?<type>a(?:n)?)..(?<length>\d+)
^\s*(?<code>\d+)\s*(?<codeName>[^\n]*)
^\s{14}(?<codeComment>[^\n]*)
This is the example text I am using to match.
----------------------------------------------------------------------
1073 Document line action code [B]
Desc: Code indicating an action associated with a line of a
document.
Repr: an..3
1 Included in document/transaction
The document line is included in the
document/transaction.
should capture this as well.
2 Excluded from document/transaction
The document line is excluded from the
document/transaction.
What I want is for codeComment to contain the following:
The document line is included in the
document/transaction.
should capture this as well.
but it is only extracting the first line:
The document line is included in the
In a character class, every character counts once, no matter how often you write it. So a character class can't be used to check for consecutive linebreaks. But you can use a lookahead assertion:
^\s{14}(?<codeComment>(?s)(?:(?!\n\n).)*)
(?s) switches on singleline mode (to allow the dot to match newlines).
(?!\n\n) asserts that there are no two consecutive linebreaks at the current position.
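To see the difference side by side, here is a minimal sketch in Python rather than .NET, so treat it as an illustration and not a drop-in pattern for Regex Hero. Python writes named groups as (?P<name>...) and, in recent versions, does not accept (?s) in the middle of a pattern, so the flag is passed as re.DOTALL instead. The sample layout is an assumption: the text as posted has lost its indentation, so the 14-space indent and the blank line between code entries are reconstructed from the \s{14} in the question.

```python
import re

# Assumed layout (the posted sample lost its indentation): comment lines are
# indented by 14 spaces and code entries are separated by one blank line.
INDENT = " " * 14
sample = "\n".join([
    "     1    Included in document/transaction",
    INDENT + "The document line is included in the",
    INDENT + "document/transaction.",
    INDENT + "should capture this as well.",
    "",
    "     2    Excluded from document/transaction",
    INDENT + "The document line is excluded from the",
    INDENT + "document/transaction.",
]) + "\n"

# [^\n\n] is the same class as [^\n], so the match ends at the first line break.
class_pattern = re.compile(r"^\s{14}(?P<codeComment>[^\n\n]*)", re.MULTILINE)
first_line_only = class_pattern.search(sample).group("codeComment")

# Tempered dot: consume one character at a time, but only while the current
# position is not the start of two consecutive line breaks.
# re.DOTALL plays the role of the inline (?s) in the .NET pattern.
lookahead_pattern = re.compile(
    r"^\s{14}(?P<codeComment>(?:(?!\n\n).)*)",
    re.MULTILINE | re.DOTALL,
)
full_comment = lookahead_pattern.search(sample).group("codeComment")

# Collapse the line breaks and continuation indentation for display.
print(first_line_only)
print(" ".join(full_comment.split()))
```

Note that the captured group still contains the line breaks and the indentation of the continuation lines, so you will probably want to normalize the whitespace afterwards, as the last line does.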