I want to extract IBAN numbers from text with Python. The challenge is that an IBAN can be written in many ways, with spaces between the digits, so I find it difficult to translate this into a useful regex pattern.
I have written a demo version which tries to match all German and Austrian IBAN numbers from text.
^DE([0-9a-zA-Z]\s?){20}$
I have seen similar questions on Stack Overflow. However, the combination of the different ways to write IBAN numbers and the need to extract them from running text makes it very difficult to solve my problem.
Hope you can help me with that!
In general, to match German and Austrian IBAN codes, you can use
codes = re.findall(r'\b(DE(?:\s*[0-9]){20}|AT(?:\s*[0-9]){18})\b(?!\s*[0-9])', text)
Details:
\b - word boundary
(DE(?:\s*[0-9]){20}|AT(?:\s*[0-9]){18}) - Group 1: DE and 20 repetitions of a digit with any amount of whitespace in between, or AT and then 18 repetitions of a single digit, each optionally preceded by any amount of whitespace
\b(?!\s*[0-9]) - word boundary that is NOT immediately followed by zero or more whitespaces and an ASCII digit.
See this regex demo.
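As a minimal sketch of how the strict pattern might be used, the snippet below compiles it once and strips the internal whitespace from each match. The helper name extract_ibans and the sample text (which uses commonly published example IBANs) are illustrative assumptions, not part of the question.

```python
import re

# Strict pattern from the answer: DE + 20 digits or AT + 18 digits,
# with any amount of whitespace allowed before each digit.
IBAN_RE = re.compile(
    r'\b(DE(?:\s*[0-9]){20}|AT(?:\s*[0-9]){18})\b(?!\s*[0-9])'
)

def extract_ibans(text):
    """Return every match with its internal whitespace removed.

    Hypothetical helper for illustration only.
    """
    return [re.sub(r'\s+', '', match) for match in IBAN_RE.findall(text)]

# Sample text (assumed for this example).
sample = (
    "Please pay to DE89 3704 0044 0532 0130 00 or, "
    "for Austria, AT611904300234573201. Ref 12345."
)

print(extract_ibans(sample))
# ['DE89370400440532013000', 'AT611904300234573201']

# The trailing lookahead rejects a candidate that is followed by more digits.
print(extract_ibans("DE89 3704 0044 0532 0130 00 12"))
# []
```

Note that this only checks the shape of the string. It does not verify the IBAN check digits.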
For the data shown in the question, which includes malformed IBAN codes, you can use
\b(?:DE|AT)(?:\s?[0-9a-zA-Z]){18}(?:(?:\s?[0-9a-zA-Z]){2})?\b
See the regex demo. Details:
\b - word boundary
(?:DE|AT) - DE or AT
(?:\s?[0-9a-zA-Z]){18} - eighteen occurrences of an optional whitespace and then an alphanumeric char
(?:(?:\s?[0-9a-zA-Z]){2})? - an optional occurrence of two sequences of an optional whitespace and an alphanumeric char
\b - word boundary.
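The lenient pattern can be tried out the same way. This is only a sketch: the helper name extract_loose_ibans and the sample strings are made up for illustration.

```python
import re

# Lenient pattern from the answer: DE or AT followed by 18 alphanumeric
# chars (each optionally preceded by one whitespace), plus an optional
# further 2 chars.
LENIENT_IBAN_RE = re.compile(
    r'\b(?:DE|AT)(?:\s?[0-9a-zA-Z]){18}(?:(?:\s?[0-9a-zA-Z]){2})?\b'
)

def extract_loose_ibans(text):
    """Return every lenient match with whitespace removed.

    Hypothetical helper for illustration only.
    """
    return [
        re.sub(r'\s', '', match.group())
        for match in LENIENT_IBAN_RE.finditer(text)
    ]

# Sample text (assumed for this example).
text = "IBAN: DE12 3456 7890 1234 5678 90, alt: AT12 3456 7890 1234 5678."

print(extract_loose_ibans(text))
# ['DE12345678901234567890', 'AT123456789012345678']
```

Because this pattern also accepts letters, a short word that directly follows a 20-character code can be absorbed into the match: "AT12 3456 7890 1234 5678 is mine" yields 'AT123456789012345678is'. If that matters for your data, the stricter digit-only pattern is probably the safer choice.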