python regex

Python RegEx: Extract all possible types of chars from a string and automatically create a RegEx pattern based on a sample


I want to automatically analyse a string for all the types of chars it contains, and also create a RegEx pattern from a column that holds a sample template.
Later, any string that is supposed to follow this pattern could then be cleaned so that only the allowed chars remain, and then aligned with the pattern.

For example samples could be:

"A111AA1" - means that the possible chars are only letters and digits; the pattern should be: 1 letter, then 3 digits, followed by 2 letters and 1 digit.

"11AA-111A" - means that the possible chars are letters, digits and a hyphen/dash; the pattern should be: 2 digits, 2 letters, dash, 3 digits, 1 letter.

Is this possible without hard-coding if-else branches by hand? There could be more than 1000 unique patterns.

Thanks.

Update

To extract all possible chars from a string I've created the following function. It builds a RegEx character class from the chars that occur in the pattern (the allowed chars).
If you know a better method, let me know.

import re

def extractCharsFromPattern(pattern: str) -> str:
    allowedChars = []

    # Letters
    if re.search(r"[a-zA-Z]", pattern):
        allowedChars.append("a-zA-Z")
    # Digits
    if re.search(r"[0-9]", pattern):
        allowedChars.append("0-9")
    # Special chars: everything else, deduplicated, sorted and escaped
    # so that chars like "-", "]" or "^" stay literal inside the class
    special = sorted(set(re.sub(r"[a-zA-Z0-9]", "", pattern)))
    allowedChars.extend(re.escape(c) for c in special)

    # Prepare in regex format
    return "[" + "".join(allowedChars) + "]"

Solution

  • If your patterns are that simple, you can map each character of the sample to a regex token and join the tokens to get a regex, for example:

    import re

    patterns = ["A111AA1", "11AA-111A"]
    for pattern in patterns:
        # digit -> \d, letter -> [a-zA-Z], dash stays as is, anything else is escaped
        re_pattern = ''.join(r'\d' if c.isdigit() else r'[a-zA-Z]' if c.isalpha() else '-' if c == '-' else re.escape(c) for c in pattern)
        print(pattern, '-->', re_pattern)
    
    A111AA1   --> [a-zA-Z]\d\d\d[a-zA-Z][a-zA-Z]\d
    11AA-111A --> \d\d[a-zA-Z][a-zA-Z]-\d\d\d[a-zA-Z]
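
The question describes the patterns as counts ("3 digits, followed by 2 letters"), so the same per-character mapping could also collapse runs of identical tokens into quantifiers. This is only a sketch: `char_token` and `sample_to_regex` are hypothetical names, and it assumes ASCII letters and digits, with every other char matched literally.

```python
import re
from itertools import groupby

def char_token(c: str) -> str:
    """Map one sample character to a regex token (hypothetical helper)."""
    if c.isascii() and c.isdigit():
        return r"\d"
    if c.isascii() and c.isalpha():
        return "[a-zA-Z]"
    return re.escape(c)  # any other char is matched literally

def sample_to_regex(sample: str) -> str:
    """Collapse runs of identical tokens into token{n}."""
    parts = []
    for token, run in groupby(char_token(c) for c in sample):
        n = len(list(run))
        parts.append(token if n == 1 else f"{token}{{{n}}}")
    return "".join(parts)

for sample in ["A111AA1", "11AA-111A"]:
    print(sample, "-->", sample_to_regex(sample))
```

With more than 1000 samples this would run once per row of the template column, so no per-pattern branching is needed. Use `re.fullmatch` rather than `re.search` when checking a value, otherwise longer strings that merely contain the pattern also pass.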
    

    From your comments, if you just want a character class, you'd chain the checks together. Here is an example one-liner, but based on your requirements you'd put it in a function:

    >>> import re
    >>> s = "AA-22"
    >>> r = ('['                                   # start of character class
      +  ('a-z' if re.search(r'[a-z]', s) else '') # have a lowercase?
      + ('A-Z' if re.search(r'[A-Z]', s) else '')  # have an uppercase?
      + ('0-9' if re.search(r'[0-9]', s) else '')  # have a number?
      + ('-' if re.search(r'-', s) else '')        # have a dash
      + ']'                                        # end of character class
      +  '{' + str(len(s)) + '}'                   # enforce a length?
    )
    # '[A-Z0-9-]{5}'
    >>> re.search(r, "BB-44").group(0)
    # 'BB-44'
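
Put into a function, the character-class idea might look like the sketch below. It also covers the cleaning step from the question by negating the class. The names `allowed_class` and `clean` are made up for illustration, and the sketch assumes a non-empty sample (an empty one would produce the invalid class `[]`).

```python
import re

def allowed_class(sample: str) -> str:
    """Build a character class from the kinds of chars seen in the sample."""
    parts = []
    if re.search(r"[a-z]", sample):
        parts.append("a-z")
    if re.search(r"[A-Z]", sample):
        parts.append("A-Z")
    if re.search(r"[0-9]", sample):
        parts.append("0-9")
    # Remaining chars are escaped so "-", "]" or "^" stay literal.
    others = sorted(set(re.sub(r"[a-zA-Z0-9]", "", sample)))
    parts.extend(re.escape(c) for c in others)
    return "[" + "".join(parts) + "]"

def clean(value: str, sample: str) -> str:
    """Drop every char of value that the sample's class does not allow."""
    negated = "[^" + allowed_class(sample)[1:]
    return re.sub(negated, "", value)

sample = "AA-22"
cls = allowed_class(sample)
strict = cls + "{" + str(len(sample)) + "}"   # class plus enforced length
cleaned = clean(" BB_-44 ", sample)
print(cls, cleaned, bool(re.fullmatch(strict, cleaned)))
```

Note that a character class only restricts which chars may appear, not their order, so after cleaning you would still validate against the positional pattern if the layout matters.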