python regex text-normalization

How to normalize text with regex?


How can I normalize text with regex when some conditional rules apply?

Suppose we have a string like this: One T933 two, three35.4. four 9,3 8.5 five M2x13 M4.3x2.1

And I want to normalize it like this: one t 933 two three 35.4 four 9,3 8.5 five m2x13 m4.3x2.1

  1. Remove all dots and commas (except those covered by rule 4).
  2. Split letters from numbers unless the token starts with the letter 'M': T933 --> T 933
  3. Convert everything to lowercase.
  4. Do not split if there is a dot or comma between digits: 35.4 --> 35.4 or 9,3 --> 9.3 (if the separator is a comma, replace it with a dot).

What I have so far is this:

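import re

def process(text, **kwargs):
    text = text.replace(',', '.')
    # The capturing group makes re.split keep the matched numbers in the result
    parts = re.split(r'(-?\d*\.?\d+)', text)
    text = ' '.join(parts)
    # str.lower() returns a new string, so its result has to be returned or assigned
    return text.lower()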

but there is no condition for tokens that start with the letter 'M', so they are split as well. Also, for some reason I get some unnecessary spaces after processing the string.
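
The unnecessary spaces can be traced to how re.split and ' '.join interact. The sketch below reuses the pattern from process() on two sample strings chosen only for illustration; the cleanup step at the end is an assumption about what is wanted, not part of the original code.

```python
import re

# Same pattern as in process(); the capturing group keeps the numbers in the result.
NUMBER = r'(-?\d*\.?\d+)'

parts = re.split(NUMBER, 'T933 two')
print(parts)            # the piece after the number still starts with its own space

joined = ' '.join(parts)
print(repr(joined))     # joining adds a second space next to the existing one

# A string that ends with a number yields a trailing empty piece,
# which turns into a trailing space after the join.
print(re.split(NUMBER, 'M2x13'))

# One possible cleanup (an assumption, not part of the original code):
# split on whitespace and re-join to collapse runs of spaces.
collapsed = ' '.join(joined.split())
print(repr(collapsed))
```

In other words, the pieces returned by re.split keep the whitespace that was already in the input, and joining them with a space adds more on top of it.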

Are there any ideas how to do that with regex? Or with helper methods like replace, lower, join and so on?


Solution

  • I can suggest a solution like

    re.sub(r'[.,](?!(?<=\d.)\d)', '', re.sub(r'(?<=[^\W\d_])(?<![MmXx])(?=\d)|(?<=\d)(?=[^\W\d_])', ' ', text)).lower()
    

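    The outer re.sub is meant to remove dots or commas when not between digits:

      • [.,] - a dot or a comma
      • (?!(?<=\d.)\d) - a negative lookahead that fails the match if the next char is a digit and the char right before the matched dot or comma is a digit too

    The inner re.sub replaces with a space the following pattern:

      • (?<=[^\W\d_])(?<![MmXx])(?=\d) - a location between a letter (other than M, m, X or x) and a digit
      • | - or
      • (?<=\d)(?=[^\W\d_]) - a location between a digit and a letter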

    See the Python demo:

    import re
    text = 'One T933 two, three35.4. four 9,3 8.5 five M2x13 M4.3x2.1 aa88aa'
    print( re.sub(r'[.,](?!(?<=\d.)\d)', '', re.sub(r'(?<=[^\W\d_])(?<![MmXx])(?=\d)|(?<=\d)(?=[^\W\d_])', ' ', text)).lower() )
    

    Output:

    one t 933 two three 35.4 four 9,3 8.5 five m2 x13 m4.3 x2.1 aa 88 aa
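
The output above turns M2x13 into m2 x13, while the question asks for m2x13. A minimal sketch of one way to keep such tokens intact is to process the text token by token and skip the splitting step for tokens that begin with M followed by a digit. The function name normalize, the two pattern names, and the exact "M plus digit" test are assumptions made for illustration; the punctuation pattern is the one from the solution above. Commas between digits are left as they are, as in the expected output of the question.

```python
import re

# Position between a letter and a digit, or between a digit and a letter.
LETTER_DIGIT_BOUNDARY = re.compile(r'(?<=[^\W\d_])(?=\d)|(?<=\d)(?=[^\W\d_])')
# A dot or comma that is not enclosed by digits (same pattern as in the solution above).
STRAY_PUNCT = re.compile(r'[.,](?!(?<=\d.)\d)')
# Assumption: a "protected" token starts with M or m immediately followed by a digit.
PROTECTED = re.compile(r'[Mm]\d')

def normalize(text):
    tokens = []
    for token in text.split():
        # Only split letters from digits in tokens that are not protected.
        if not PROTECTED.match(token):
            token = LETTER_DIGIT_BOUNDARY.sub(' ', token)
        tokens.append(token)
    # Re-join with single spaces, drop stray punctuation, then lowercase.
    return STRAY_PUNCT.sub('', ' '.join(tokens)).lower()

text = 'One T933 two, three35.4. four 9,3 8.5 five M2x13 M4.3x2.1'
print(normalize(text))
# one t 933 two three 35.4 four 9,3 8.5 five m2x13 m4.3x2.1
```

Splitting on whitespace first also avoids doubled spaces, because the tokens are re-joined with exactly one space each.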