I have a text:
{{Verkleinerungsformen}}
:[1] [[Äpfelchen]], [[Äpfelein]], [[Äpflein]]
{{Oberbegriffe}}
:[1] [[Kernobst]], [[Obst]]; [[Frucht]]
:[4] [[Kot]]
:[7] [[Gut]]
{{Unterbegriffe}}
:[1] [[Augustapfel]], [[Bohnapfel]], [[Bratapfel]], [[Essapfel]], [[Fallapfel]],
I'm interested in extracting all items under {{Oberbegriffe}} that have the pattern [[Text]], including every following line until it reaches a line that does not start with :[NUMBER-HERE].
So in the above example it should return an array of these strings:
Kernobst, Obst, Frucht, Kot, Gut
What I have tried is:
re.search(r'{{Oberbegriffe}}\n(?::?\n)?([^\n]+)', text)
But it matches only the full first line. It's OK if there is a way to extract all lines with the pattern so that it returns this string:
:[1] [[Kernobst]], [[Obst]]; [[Frucht]]
:[4] [[Kot]]
:[7] [[Gut]]
You may extract the blocks using
(?m)^{{Oberbegriffe}}(?:\n:\[\d+].*)*
Then, use the \[\[([^][]+)]] pattern to extract the values you need.
Regex details
- (?m) - an inline modifier, same as re.M / re.MULTILINE
- ^ - start of a line
- {{Oberbegriffe}} - literal text
- (?:\n:\[\d+].*)* - 0 or more repetitions of a newline followed by :[, then 1+ digits, ], and then any 0 or more characters other than line break chars, as many as possible.

The second regex, \[\[([^][]+)]], matches [[, then capturing group #1 matching any 1 or more chars other than [ and ], and then ]].
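The breakdown above can also be written into the pattern itself with Python's re.VERBOSE flag, which ignores unescaped whitespace and allows comments. This is only a sketch: the names BLOCK_RE and sample are made up for illustration, and the braces are escaped here for readability even though Python treats them as literals in this position.

```python
import re

# Hypothetical name; same pattern as above, spelled out piece by piece.
BLOCK_RE = re.compile(
    r"""
    ^                      # start of a line (needs re.MULTILINE)
    \{\{Oberbegriffe\}\}   # the literal header text
    (?:                    # repeated zero or more times:
        \n:\[\d+\]         #   a newline, then ":[", 1+ digits, "]"
        .*                 #   the rest of that line
    )*
    """,
    re.MULTILINE | re.VERBOSE,
)

# Made-up sample in the same shape as the text in the question.
sample = (
    "{{Oberbegriffe}}\n"
    ":[1] [[Kernobst]], [[Obst]]; [[Frucht]]\n"
    ":[4] [[Kot]]\n"
    "{{Unterbegriffe}}\n"
    ":[1] [[Augustapfel]]\n"
)

match = BLOCK_RE.search(sample)
block = match.group(0) if match else ""
print(block)
```

The match stops before {{Unterbegriffe}} because that line does not start with a colon followed by a bracketed number.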
In Python:
import re

with open(filepath, 'r') as fr:
    blocks = re.findall(r'^{{Oberbegriffe}}(?:\n:\[\d+].*)*', fr.read(), flags=re.M)
# One list of [[...]] values per matched block
print([re.findall(r'\[\[([^][]+)]]', block) for block in blocks])
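To try the two-step approach without a file, here is a minimal self-contained sketch that runs both patterns on the sample text from the question and flattens the result into a single list, which is the shape the question asks for. The variable names text, blocks and items are chosen for illustration only.

```python
import re

# Sample text copied from the question.
text = """{{Verkleinerungsformen}}
:[1] [[Äpfelchen]], [[Äpfelein]], [[Äpflein]]
{{Oberbegriffe}}
:[1] [[Kernobst]], [[Obst]]; [[Frucht]]
:[4] [[Kot]]
:[7] [[Gut]]
{{Unterbegriffe}}
:[1] [[Augustapfel]], [[Bohnapfel]], [[Bratapfel]], [[Essapfel]], [[Fallapfel]],
"""

# Step 1: grab each {{Oberbegriffe}} header plus the ":[N] ..." lines after it.
blocks = re.findall(r'^{{Oberbegriffe}}(?:\n:\[\d+].*)*', text, flags=re.M)

# Step 2: pull every [[...]] value out of the blocks, flattened into one list.
items = [name for block in blocks
         for name in re.findall(r'\[\[([^][]+)]]', block)]

print(items)  # ['Kernobst', 'Obst', 'Frucht', 'Kot', 'Gut']
```

If the text contains several {{Oberbegriffe}} sections, this flattened version merges their items; keep the nested list from the snippet above if the sections need to stay separate.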