
Adding leading zero with regular expression


Say I have strings like

'1 - hello.mp3'
'22 - hellox.mp3'
'223 - hellox.mp3'
'hellox.mp3'

I would like the output to be

'001 - hello.mp3'
'022 - hellox.mp3'
'223 - hellox.mp3'
'hellox.mp3'

That is, if the string starts with a number, prepend zeros to pad that number to three digits.

Is there a way to achieve this using a regex in Python?


Solution

  • Yes, regexes can do that. Use re.sub() with a callback function:

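    import re
    
    def pad_number(match):
        # group(1) holds the digits captured by the parentheses in the pattern
        number = int(match.group(1))
        # "03d": zero-padded decimal, at least 3 characters wide
        return format(number, "03d")
    
    text = "1 - hello.mp3"  # example input; any string works
    fixed_text = re.sub(r"^(\d+)", pad_number, text)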
    

    The pattern I used, ^(\d+), matches one or more digits (\d is a digit, and + matches at least once but greedily takes in all the digits that follow), but only at the start of the string (^ is the 'start of text' anchor here).

    Then, for each matched pattern, the pad_number() function is called, and the string that that function returns is used to replace the matched pattern. Because the pattern uses a capturing group (everything between ( and ) is such a group) the function can access the matched digits by calling match.group(1).
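To make the match object less abstract, here is a minimal sketch that inspects what the pattern captures before any replacement happens. The sample strings and the variable names are only illustrative assumptions, not part of the question.

```python
import re

PATTERN = re.compile(r"^(\d+)")  # same pattern as above, compiled for reuse

# Digits at the start: the match object exposes them through group(1).
leading = PATTERN.search("22 - hellox.mp3")
leading_digits = leading.group(1)  # the captured text, still a string: '22'
leading_span = leading.span()      # (start, end) of the matched text: (0, 2)

# Digits elsewhere in the string: ^ prevents a match, so search() returns
# None and re.sub() would never call the replacement function.
elsewhere = PATTERN.search("hello 22.mp3")
```

Because the callback only runs when there is a match, strings without a leading number pass through re.sub() unchanged.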

    The function turns the digits into an integer, then uses the format() function to turn that integer back into text, but this time as a 0-padded number 3 characters wide; that's what the 03 formatting instruction tells format() to produce.
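As a quick illustration of the formatting step on its own, the sketch below calls format() with the same "03d" specification on a few hand-picked integers (the values are assumptions chosen to show each case).

```python
# "03d" means: pad with zeros (0), minimum width 3, decimal integer (d).
padded_short = format(7, "03d")    # shorter than 3 digits, so zeros are added
padded_exact = format(223, "03d")  # already 3 digits, returned as-is
padded_long = format(4281, "03d")  # longer than 3 digits, never truncated

# int() discards any leading zeros already present in the matched text,
# so the result is normalized to the requested width.
renormalized = format(int("0007"), "03d")
```

Here padded_short is '007', padded_exact is '223', padded_long is '4281', and renormalized is '007'.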

    Note that the pattern can match more than three digits. Limiting the number of digits it matches doesn't make much sense unless there is a strict upper bound you want to enforce (at which point you also need to add a restriction that the next character is not a digit). The format(number, "03d") instruction produces a number at least 3 digits wide, but it can handle longer values.

    Demo:

    >>> import re
    >>> samples = [
    ...     '1 - hello.mp3',
    ...     '22 - hellox.mp3',
    ...     '223 - hellox.mp3',
    ...     'hellox.mp3',
    ... ]
    >>> def pad_number(match):
    ...     number = int(match.group(1))
    ...     return format(number, "03d")
    ...
    >>> for sample in samples:
    ...     result = re.sub(r"^(\d+)", pad_number, sample)
    ...     print(f"{sample!r:20} -> {result!r:20}")
    ...
    '1 - hello.mp3'      -> '001 - hello.mp3'
    '22 - hellox.mp3'    -> '022 - hellox.mp3'
    '223 - hellox.mp3'   -> '223 - hellox.mp3'
    'hellox.mp3'         -> 'hellox.mp3'
    

    Again, take into account that this method doesn't special-case strings with 4 or more digits at the start; you simply get a longer sequence of digits:

    >>> re.sub(r"^(\d+)", pad_number, "4281 - 4 digits")
    '4281 - 4 digits'
    >>> re.sub(r"^(\d+)", pad_number, "428117 - 6 digits")
    '428117 - 6 digits'
    

    This would happen even if we limited the \d pattern to only match up to 3 digits (e.g. with \d{1,3}).
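The restriction on the next character mentioned earlier can be written as a negative lookahead, (?!\d). The sketch below is only an illustration: it assumes a padding width of 5 (wider than the 3-digit limit) so that the difference becomes visible, and the helper name pad_to_five is hypothetical.

```python
import re

def pad_to_five(match):
    # Hypothetical helper: pad the captured digits to 5 characters.
    return format(int(match.group(1)), "05d")

sample = "4281 - 4 digits"

# Without the lookahead, only the first 3 digits ("428") are matched and
# padded, and the remaining "1" is left behind, corrupting the number.
unguarded = re.sub(r"^(\d{1,3})", pad_to_five, sample)

# With (?!\d), a run of 4 or more digits cannot match at all, so the
# string is left untouched.
guarded = re.sub(r"^(\d{1,3})(?!\d)", pad_to_five, sample)

# Shorter numbers are still padded as usual.
short = re.sub(r"^(\d{1,3})(?!\d)", pad_to_five, "22 - hellox.mp3")
```

With these inputs, unguarded is '004281 - 4 digits', guarded is '4281 - 4 digits', and short is '00022 - hellox.mp3'.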

    If you want to make the padding width configurable, you can put everything in a nested function and use string formatting to build the format specification:

    import re
    
    def pad_leading_number(text, width):
        def pad_number(match):
            number = int(match.group(1))
            return format(number, f"0{width}d")
    
        return re.sub(r"^(\d+)", pad_number, text)
    

    Demo:

    >>> pad_leading_number("22 - hellox.mp3", 3)
    '022 - hellox.mp3'
    >>> pad_leading_number("22 - hellox.mp3", 7)
    '0000022 - hellox.mp3'
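For comparison, here is a more compact variant. It is not the approach used above: it swaps int() and format() for str.zfill() applied directly to the matched text, and the function name pad_leading_number_zfill is a hypothetical one chosen for this sketch.

```python
import re

def pad_leading_number_zfill(text, width):
    # m.group() is the whole match (the leading digits) as a string;
    # str.zfill() left-pads it with zeros up to the requested width.
    return re.sub(r"^\d+", lambda m: m.group().zfill(width), text)

padded = pad_leading_number_zfill("22 - hellox.mp3", 3)
untouched = pad_leading_number_zfill("hellox.mp3", 3)
kept_zeros = pad_leading_number_zfill("0005 - hellox.mp3", 3)
```

One behavioral difference to be aware of: zfill() never removes characters, so a prefix such as "0005" stays "0005" at width 3, whereas the int() round trip above would normalize it to "005".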