python, regex, python-3.11

Regular expression lookahead anchor with multiple matches


I am using regular expressions in Python 3.11 (because it supports the atomic group syntax (?>...), see https://docs.python.org/3/library/re.html) to transform the string below into a dictionary by iterating over the matches of a pattern:

import re

string = '''Latitude (degrees): 4010.44 Longitude (degrees): 58.000 Radiation database: year month H(h)_m 2005 Jan 57.77 2005 Feb 77.76 2005 Mar 120.58 H(h)_m: Irradiation plane (kWh/m2/mo)'''


for match in re.finditer(r'(?P<key>(?>[A-Z][ a-z\_\(\)]*))\: *(?P<value>.+?)(?: |$)', string):
    # The key is the short pattern before ":", starting with an uppercase letter.
    # The value is everything after the ": " and before the next key.
    print(match[1], ":", match[2])

I haven't been able to get the following output:

Latitude (degrees): 4010.44
Longitude (degrees): 58.000
Radiation database: year month H(h)_m 2005 Jan 57.77 2005 Feb 77.76 2005 Mar 120.58
H(h)_m: Irradiation plane (kWh/m2/mo)

I know this is because of the lazy (?P<value>.+?) pattern, but if I remove the ?, the <value> group also captures some of the following <key> parts.

How can I match as much as possible but stop the <value> group before the next <key>?


Solution

  • First of all, you do not need an atomic group here; it gives no additional advantage in this context.

    You can use

    import re
    
    text = '''Latitude (degrees): 4010.44 Longitude (degrees): 58.000 Radiation database: year month H(h)_m 2005 Jan 57.77 2005 Feb 77.76 2005 Mar 120.58 H(h)_m: Irradiation plane (kWh/m2/mo)'''
    
    for match in re.finditer(r'(?P<key>\b[A-Z][ a-z_()]*): *(?P<value>.+?)(?=\b[A-Z][ a-z_()]*:|$)', text):
        # The key is the short pattern before ":", starting with an uppercase letter.
        # The value is everything after the ": " and before the next key.
        print(match.group("key"), ":", match.group("value"))
    

    See the Python demo.

    Output:

    Latitude (degrees) : 4010.44 
    Longitude (degrees) : 58.000 
    Radiation database : year month H(h)_m 2005 Jan 57.77 2005 Feb 77.76 2005 Mar 120.58 
    H(h)_m : Irradiation plane (kWh/m2/mo)
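
Since the question asks for a dictionary, the same pattern can be sketched as a small helper. This is only a minimal illustration: the names KEY, PAIR and parse_pairs are hypothetical (not from the question or from the re module), and the values are stripped because the lazy value group keeps the space that precedes the next key.

```python
import re

text = (
    "Latitude (degrees): 4010.44 Longitude (degrees): 58.000 "
    "Radiation database: year month H(h)_m 2005 Jan 57.77 2005 Feb 77.76 "
    "2005 Mar 120.58 H(h)_m: Irradiation plane (kWh/m2/mo)"
)

# Hypothetical names for illustration: the key sub-pattern is written once
# and reused in the lookahead, so both places stay in sync.
KEY = r"\b[A-Z][ a-z_()]*"
PAIR = re.compile(rf"(?P<key>{KEY}): *(?P<value>.+?)(?={KEY}:|$)")


def parse_pairs(s):
    """Return a dict of key/value pairs; values are stripped of the
    trailing space left in front of the next key."""
    return {m["key"]: m["value"].strip() for m in PAIR.finditer(s)}


result = parse_pairs(text)
for key, value in result.items():
    print(f"{key}: {value}")
```

Reusing the key sub-pattern inside the lookahead is what keeps "H(h)_m 2005" inside the "Radiation database" value: it only counts as the start of a new key when a colon follows directly.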
    

    Here is the regex demo.

    Details: