I want to automatically analyse a string for all the types of characters it contains and also create a regex pattern based on a column that holds a sample template.
Later, any string related to this pattern could then be cleaned so that only the allowed characters remain, and aligned with the pattern.
For example, the samples could be:
"A111AA1": the possible characters are letters and digits only; the pattern should be 1 letter, then 3 digits, followed by 2 letters and 1 digit.
"11AA-111A": the possible characters are letters, digits and a hyphen/dash; the pattern should be 2 digits, 2 letters, a dash, 3 digits, 1 letter.
Is this possible without manual if-else hardcoding? There could be more than 1000 unique patterns.
Thanks.
To extract all the possible characters in a string, I've created the following function. It builds a regex character class from the (allowed) characters that occur in the pattern.
If you know a better method, let me know.
import re

def extractCharsFromPattern(pattern: str) -> str:
    allowedChars = []
    # Reduce the string to its distinct characters (sorted for a stable result)
    pattern = ''.join(sorted(set(pattern)))
    # Letters
    if re.search(r"[a-zA-Z]", pattern):
        allowedChars.append("a-zA-Z")
        pattern = re.sub(r"[a-zA-Z]", "", pattern)
    # Digits
    if re.search(r"[0-9]", pattern):
        allowedChars.append("0-9")
        pattern = re.sub(r"[0-9]", "", pattern)
    # Special chars: escape them so '-', ']', '^' or '\' stay literal inside the class
    allowedChars.append(re.escape(pattern))
    # Wrap everything in a regex character class
    return "[" + "".join(allowedChars) + "]"
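For the cleaning step mentioned in the question, one option is to negate the allowed class and strip whatever matches. The following is only a sketch under that assumption: `allowed_class` and `strip_disallowed` are hypothetical helper names, only ASCII letters and digits are treated as ranges, and the sample is assumed to be non-empty.

```python
import re

def allowed_class(sample: str) -> str:
    """Build a character class such as '[a-zA-Z0-9\\-]' from a sample string."""
    parts = []
    if any(c.isascii() and c.isalpha() for c in sample):
        parts.append("a-zA-Z")
    if any(c.isascii() and c.isdigit() for c in sample):
        parts.append("0-9")
    # Every other character is kept as an escaped literal (sorted for stable output)
    others = sorted({c for c in sample if not (c.isascii() and c.isalnum())})
    parts.extend(re.escape(c) for c in others)
    return "[" + "".join(parts) + "]"

def strip_disallowed(sample: str, text: str) -> str:
    """Remove every character of text that the sample does not allow."""
    negated = "[^" + allowed_class(sample)[1:]  # '[abc]' becomes '[^abc]'
    return re.sub(negated, "", text)

print(allowed_class("11AA-111A"))                      # [a-zA-Z0-9\-]
print(strip_disallowed("11AA-111A", "12 ab_-345/c"))   # 12ab-345c
```

Because the literals go through `re.escape`, a sample containing `]`, `^` or a backslash should still yield a valid class.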
If your patterns are that simple, you can map each character of the sample to a regex token, for example:
patterns = ["A111AA1", "11AA-111A"]
for pattern in patterns:
    # Map each character to a regex token; '???' flags a character with no rule yet
    re_pattern = ''.join(r'\d' if c.isdigit() else r'[a-zA-Z]' if c.isalpha() else r'-' if c == '-' else '???' for c in pattern)
    print(pattern, '-->', re_pattern)
A111AA1 --> [a-zA-Z]\d\d\d[a-zA-Z][a-zA-Z]\d
11AA-111A --> \d\d[a-zA-Z][a-zA-Z]-\d\d\d[a-zA-Z]
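With more than 1000 samples, the same per-character mapping could be made a little more general by escaping unknown characters instead of flagging them, and by collapsing runs of the same token into a quantifier. This is a minimal sketch, not a definitive implementation; `char_token` and `sample_to_regex` are names made up for illustration, and only ASCII letters and digits are classified.

```python
import re
from itertools import groupby

def char_token(c: str) -> str:
    """Map one sample character to a regex token."""
    if c.isascii() and c.isdigit():
        return r"\d"
    if c.isascii() and c.isalpha():
        return "[a-zA-Z]"
    return re.escape(c)  # any other character has to match literally

def sample_to_regex(sample: str) -> str:
    """Turn a sample such as 'A111AA1' into a compact regex."""
    pieces = []
    # groupby yields one group per run of identical consecutive tokens
    for token, run in groupby(char_token(c) for c in sample):
        n = len(list(run))
        pieces.append(token if n == 1 else f"{token}{{{n}}}")
    return "".join(pieces)

for sample in ["A111AA1", "11AA-111A"]:
    print(sample, "-->", sample_to_regex(sample))
# A111AA1 --> [a-zA-Z]\d{3}[a-zA-Z]{2}\d
# 11AA-111A --> \d{2}[a-zA-Z]{2}\-\d{3}[a-zA-Z]
```

Note that `\d` also matches non-ASCII digits on `str` input unless the pattern is compiled with `re.ASCII`; swap in `[0-9]` if that matters. Using `re.fullmatch` with the result checks the whole string rather than a prefix.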
From your comments, if you just want a character class, you'd chain it all together. Here is an example one-liner, but based on your requirements you'd put it in a function:
>>> import re
>>> s = "AA-22"
>>> r = ('['                                          # start of character class
...      + ('a-z' if re.search(r'[a-z]', s) else '')  # have a lowercase?
...      + ('A-Z' if re.search(r'[A-Z]', s) else '')  # have an uppercase?
...      + ('0-9' if re.search(r'[0-9]', s) else '')  # have a number?
...      + ('-' if re.search(r'-', s) else '')        # have a dash?
...      + ']'                                        # end of character class
...      + '{' + str(len(s)) + '}'                    # enforce a length?
... )
>>> r
'[A-Z0-9-]{5}'
>>> re.search(r, "BB-44").group(0)
'BB-44'
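Wrapped in a function, the chain above might look like the sketch below. The names `CLASS_CHECKS` and `class_regex` and the `fix_length` flag are assumptions added for illustration; the table only covers the four cases from the one-liner, and the sample is assumed to contain at least one of them (otherwise the class would be empty and invalid).

```python
import re

# (regex that detects the class in the sample, text to emit inside [...])
CLASS_CHECKS = [
    (r"[a-z]", "a-z"),
    (r"[A-Z]", "A-Z"),
    (r"[0-9]", "0-9"),
    (r"-", r"\-"),
]

def class_regex(sample: str, fix_length: bool = True) -> str:
    """Build '[...]{n}' (or '[...]+') from the classes present in sample."""
    body = "".join(emit for probe, emit in CLASS_CHECKS if re.search(probe, sample))
    length = f"{{{len(sample)}}}" if fix_length else "+"
    return f"[{body}]{length}"

r = class_regex("AA-22")
print(r)                                    # [A-Z0-9\-]{5}
print(bool(re.fullmatch(r, "BB-44")))       # True
print(bool(re.fullmatch(r, "BB-44-XYZ")))   # False
```

`re.fullmatch` is used here because `re.search` would also accept a longer string that merely contains five allowed characters in a row. New character types can be supported by adding a row to the table rather than another branch.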