I am using regular expressions in Python 3.11 (because it allows the (?>...)
pattern, https://docs.python.org/3/library/re.html) to transform the string below into a dictionary by iterating over pattern matches:
import re
string = '''Latitude (degrees): 4010.44 Longitude (degrees): 58.000 Radiation database: year month H(h)_m 2005 Jan 57.77 2005 Feb 77.76 2005 Mar 120.58 H(h)_m: Irradiation plane (kWh/m2/mo)'''
for match in re.finditer(r'(?P<key>(?>[A-Z][ a-z\_\(\)]*))\: *(?P<value>.+?)(?: |$)', string):
    # Key is the short pattern before ":" starting with an uppercase letter
    # Value must be the remainder, after the ": " and before the next key.
    print(match[1], ":", match[2])
I haven't been able to get the following output:
Latitude (degrees): 4010.44
Longitude (degrees): 58.000
Radiation database: year month H(h)_m 2005 Jan 57.77 2005 Feb 77.76 2005 Mar 120.58
H(h)_m: Irradiation plane (kWh/m2/mo)
I know this is because of the lazy (?P<value>.+?) pattern, but when I remove the ?, the <value> group also captures some unintended <key> text.
How can I match as much as possible and still stop the <value> group before the next <key>?
First of all, you do not need an atomic group here; it gives you no additional advantage in this context.
You can use
import re
text = '''Latitude (degrees): 4010.44 Longitude (degrees): 58.000 Radiation database: year month H(h)_m 2005 Jan 57.77 2005 Feb 77.76 2005 Mar 120.58 H(h)_m: Irradiation plane (kWh/m2/mo)'''
for match in re.finditer(r'(?P<key>\b[A-Z][ a-z_()]*): *(?P<value>.+?)(?=\b[A-Z][ a-z_()]*:|$)', text):
    # Key is the short pattern before ":" starting with an uppercase letter
    # Value must be the remainder, after the ": " and before the next key.
    print(match.group("key"), ":", match.group("value"))
See the Python demo.
Output:
Latitude (degrees) : 4010.44
Longitude (degrees) : 58.000
Radiation database : year month H(h)_m 2005 Jan 57.77 2005 Feb 77.76 2005 Mar 120.58
H(h)_m : Irradiation plane (kWh/m2/mo)
Here is the regex demo.
Details:
(?P<key>\b[A-Z][ a-z_()]*) - Group "key": a word boundary, an uppercase ASCII letter, then zero or more spaces, lowercase ASCII letters, underscores, or parentheses
: * - a colon and then zero or more spaces
(?P<value>.+?) - Group "value": one or more chars other than line break chars, as few as possible
(?=\b[A-Z][ a-z_()]*:|$) - a positive lookahead that matches a location immediately followed by either the end of the string, or a word boundary, an uppercase ASCII letter, zero or more spaces, lowercase ASCII letters, underscores, or parentheses, and then a : char
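Since the question asks for a dictionary rather than printed lines, here is a minimal sketch of how the same pattern could be used to build one. The names KEY, PAIR_RE and parse_pairs are hypothetical helpers introduced only for this illustration, and the .strip() call is an added assumption: the lazy value group stops right before the next key, so it keeps the separating space.

```python
import re

# Key sub-pattern from the answer, reused in both the group and the lookahead.
KEY = r'\b[A-Z][ a-z_()]*'

# key, colon, optional spaces, lazy value, then a lookahead for the next key or end of string.
PAIR_RE = re.compile(rf'(?P<key>{KEY}): *(?P<value>.+?)(?={KEY}:|$)')


def parse_pairs(text):
    """Return a dict mapping each key to its value (hypothetical helper)."""
    # strip() removes the space left between a value and the next key.
    return {m.group("key"): m.group("value").strip() for m in PAIR_RE.finditer(text)}


text = '''Latitude (degrees): 4010.44 Longitude (degrees): 58.000 Radiation database: year month H(h)_m 2005 Jan 57.77 2005 Feb 77.76 2005 Mar 120.58 H(h)_m: Irradiation plane (kWh/m2/mo)'''

result = parse_pairs(text)
for key, value in result.items():
    print(f"{key}: {value}")
```

Note that a dict keeps only the last value if the same key appears twice; if repeated keys are possible in your data, collecting the (key, value) pairs in a list may be the safer choice.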