I have to construct a regex that matches client codes that look like:
With X a number between 0 and 9.
The regex needs to be strong enough so we don't extract codes that are within another string. The use of word boundaries was my first idea.
The regex looks like this: \b\d{3}[\.\/]\d{3,6}(?:\/\d{3})?\b
The problem with word boundaries is that it also matches dots. So a number like "123/456.12" would match "123/456" as the client number. So then I came up with the following regex: (?<!\S)\d{3}[\.\/]\d{3,6}(?:\/\d{3})?(?!\S)
. It uses lookbehind and lookahead and checks if that character is a white space. This matches most of the client codes correctly.
But there is still one last issue. We are using a Google OCR text to extract the codes from. This means that a valid code can be found in the text like 123/456\n
, \n123/456
, \n123/456\n
, etc. Checking if the previous and or next characters are white space doesn't work because the literal "\n" is not included in this. If I do something like (?<!\S|\\n)
as the word boundary it also includes a back and/or forward slash for some reason. Currently I came up with the following regex (?<![^\r\n\t\f\v n])\d{3}[\.\/]\d{3,6}(?:\/\d{3})?(?![^\r\n\t\f\v \\])
, but that only checks if the previous character is a "n" or white space and the next a backslash or white space. So strings like "lorem\123/456" would still find a match. I need some way to include the "\n" in the white space characters without breaking the lookahead/lookbehind.
Do you guys have any idea how to solve this issue? All input is appreciated. Thx!
It seems you want to subtract \n
from the whitespace boundaries. You can use
re.findall(r'(?<![^\s\n])\d{3}[./]\d{3,6}(?:/\d{3})?(?![^\s\n])', text)
See the Python demo and this regex demo.
If the \n
are combinations of \
and n
chars, you need to make sure the \S
in the lookarounds does not match those:
import re
text = r'Codes like 123/456\n \n123/3456 \n123/23456\n etc are correct \n333.3333/333\n'
print( re.findall(r'(?<!\S(?<!\\n))\d{3}[./]\d{3,6}(?:/\d{3})?(?!(?!\\n)\S)', text) )
# => ['123/456', '123/3456', '123/23456', '333.3333/333']
See this Python demo.
Details:
(?<![^\s\n])
- a negative lookbehind that matches a location that is not immediately preceded with a char other than whitespace and an LF char(?<!\S(?<!\\n))
- a left whitespace boundary that does not trigger if the non-whitespace is the n
from the \n
char combination\d{3}
- theree digits[./]
- a .
or /
\d{3,6}
- three to six digits(?:/\d{3})?
- an optional sequence of /
and three digits(?![^\s\n])
- a negative lookahead that requires no char other than whitespace and LF immediately to the right of the current location.(?!(?!\\n)\S)
- a right whitespace boundary that does not trigger if the non-whitespace is the \
char followed with n
.