I have been struggling to get a regex that can capture data extracted from a not-so-clean pdf file:
Each line should contain 1. school id (5 digits), 2. school name, 3. applications#(number), 4. another number (=offer#)
A clean line looks like "10394 ABC School 50 34" and can be captured using ([0-9]{5})\s{2,}(\D+)\s+(\d*)\s+(\d*). A normal case looks like https://regex101.com/r/Mwv3bJ/1 (ignore the negative lookbehind).
The problem I am struggling with is that a few schools have a partial postcode (1 or 2 letters followed by 1 or 2 digits, such as W19 or SW2) in the name, so "10422 XYZ College W9 60 33" will be captured as (id: 10422)(school: XYZ College W)(applications: 9)(offers: 60). https://regex101.com/r/YeNmT7/1
I want (3: applications#) not to capture any digit immediately preceded by a letter; if such a \D{1,2}\d{1,2} sequence exists in the name, it should be captured by (2: school name). I tried a non-capturing group (?:^\D{1,2}\d{1,2}$) to get rid of any potential postcode, but it is not working.
Examples:
Please advise.
You may use
([0-9]{5})\s{2,}([^\d\s]\D*(?:\s[a-zA-Z]{1,2}\d{1,2})?)\s+(\d+)\s+(\d+)
See this demo. Or, a bit more optimized:
([0-9]{5})\s{2,}([^\d\s]+(?:\s+[^\d\s]+)*(?:\s+[a-zA-Z]{1,2}\d{1,2})?)\s+(\d+)\s+(\d+)
See the regex demo.
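As a quick check, here is a minimal Python sketch that applies both patterns to the two sample lines from the question. The helper name parse_line and the sample list are made up for illustration, and the samples assume two or more spaces after the id, since the patterns require \s{2,} there.

```python
import re

# First pattern from the answer.
PATTERN = re.compile(
    r"([0-9]{5})\s{2,}([^\d\s]\D*(?:\s[a-zA-Z]{1,2}\d{1,2})?)\s+(\d+)\s+(\d+)"
)
# The "a bit more optimized" variant from the answer.
PATTERN_OPT = re.compile(
    r"([0-9]{5})\s{2,}([^\d\s]+(?:\s+[^\d\s]+)*(?:\s+[a-zA-Z]{1,2}\d{1,2})?)\s+(\d+)\s+(\d+)"
)


def parse_line(line, pattern=PATTERN):
    """Hypothetical helper: return (id, name, applications, offers) or None."""
    m = pattern.search(line)
    if m is None:
        return None
    school_id, name, applications, offers = m.groups()
    return school_id, name, int(applications), int(offers)


# Sample lines based on the question (assumed spacing: two spaces after the id).
SAMPLES = [
    "10394  ABC School 50 34",
    "10422  XYZ College W9 60 33",
]

for line in SAMPLES:
    print(parse_line(line), parse_line(line, PATTERN_OPT))
```

In both cases the partial postcode W9 should stay inside the school name, so the last two groups hold the applications and offers numbers.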
If the initial number must contain 5 digits only, add a word boundary, \b.
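To illustrate what the word boundary changes, here is a small Python sketch. It assumes \b is placed directly in front of ([0-9]{5}), and the six-digit id in the test line is a made-up example, not data from the question.

```python
import re

# Shared tail of the first pattern (everything after the id group).
TAIL = r"\s{2,}([^\d\s]\D*(?:\s[a-zA-Z]{1,2}\d{1,2})?)\s+(\d+)\s+(\d+)"

UNBOUNDED = re.compile(r"([0-9]{5})" + TAIL)   # no word boundary
BOUNDED = re.compile(r"\b([0-9]{5})" + TAIL)   # word boundary before the id

line = "910422  XYZ College W9 60 33"  # hypothetical line with a six-digit id

loose = UNBOUNDED.search(line)
strict = BOUNDED.search(line)

# Without \b, the last five digits of the longer number are picked up as an id.
print(loose.group(1) if loose else None)
# With \b, the line is rejected because the number is not exactly five digits.
print(strict)
```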
Details:
- \b - a word boundary
- ([0-9]{5}) - Group 1: five digits
- \s{2,} - two or more whitespace chars
- ([^\d\s]\D*(?:\s[a-zA-Z]{1,2}\d{1,2})?) - Group 2:
  - [^\d\s]\D* - a char that is neither a digit nor a whitespace char, and then zero or more non-digits
  - (?:\s[a-zA-Z]{1,2}\d{1,2})? - an optional sequence of a whitespace and then one or two ASCII letters and then one or two digits
- \s+ - one or more whitespaces
- (\d+) - Group 3: one or more digits
- \s+(\d+) - one or more whitespaces and then Group 4 capturing one or more digits.

Note that [^\d\s]+(?:\s+[^\d\s]+)* matches one or more chars other than digits and whitespaces, and then zero or more repetitions of one or more whitespaces followed by one or more chars other than digits and whitespaces.
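The same breakdown can be kept next to the pattern itself. Below is a sketch using Python's re.VERBOSE flag so each piece carries its own comment. The name SCHOOL_LINE and the sample line are assumptions for illustration only.

```python
import re

# First pattern from the answer, with the optional \b, written in verbose mode.
SCHOOL_LINE = re.compile(
    r"""
    \b([0-9]{5})                        # Group 1: five digits after a word boundary
    \s{2,}                              # two or more whitespace chars
    (                                   # Group 2: school name
        [^\d\s]\D*                      #   a non-digit, non-whitespace char, then non-digits
        (?:\s[a-zA-Z]{1,2}\d{1,2})?     #   optional partial postcode such as W9 or SW2
    )
    \s+(\d+)                            # Group 3: applications
    \s+(\d+)                            # Group 4: offers
    """,
    re.VERBOSE,
)

# Hypothetical sample line with a two-letter postcode prefix.
m = SCHOOL_LINE.search("10500  Some School SW2 12 7")
print(m.groups() if m else None)
```

Whitespace and # comments inside the pattern are ignored in verbose mode, so the regex behaves the same as the one-line version.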