python regex

Regex: match all lines with a pattern after a heading until the pattern stops matching


I have a text:


{{Verkleinerungsformen}}
:[1] [[Äpfelchen]], [[Äpfelein]], [[Äpflein]]

{{Oberbegriffe}}
:[1] [[Kernobst]], [[Obst]]; [[Frucht]]
:[4] [[Kot]]
:[7] [[Gut]]

{{Unterbegriffe}}
:[1] [[Augustapfel]], [[Bohnapfel]], [[Bratapfel]], [[Essapfel]], [[Fallapfel]], 


I'm interested in extracting all items of the form [[Text]] under {{Oberbegriffe}}, including every following line, until reaching a line that does not begin with :[NUMBER-HERE].

So in the above example it should return an array of these strings:

Kernobst, Obst, Frucht, Kot, Gut

What I have tried:

re.search(r'{{Oberbegriffe}}\n(?::?\n)?([^\n]+)', text)

But it matches only the first full line. It would also be fine if there were a way to extract all lines with the pattern, so that it returns this string:

:[1] [[Kernobst]], [[Obst]]; [[Frucht]]
:[4] [[Kot]]
:[7] [[Gut]]

Solution

  • You may extract the blocks using

    (?m)^{{Oberbegriffe}}(?:\n:\[\d+].*)*
    


    Then, use the \[\[([^][]+)]] pattern to extract the values you need from each block.

    Regex details

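    The first regex - (?m)^{{Oberbegriffe}}(?:\n:\[\d+].*)* - matches {{Oberbegriffe}} at the start of a line ((?m) makes ^ match at line starts), then zero or more repetitions of a newline followed by :[, one or more digits, ] and the rest of that line. The match therefore stops at the first line that does not start with :[NUMBER].

    The second regex - \[\[([^][]+)]] - matches [[, then captures into group #1 one or more chars other than [ and ], and then matches ]].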

    In Python:

    import re

    with open(filepath, 'r') as fr:  # filepath: path to the file holding the text
        blocks = re.findall(r'^{{Oberbegriffe}}(?:\n:\[\d+].*)*', fr.read(), flags=re.M)
    # One list of items per matched block
    print([re.findall(r'\[\[([^][]+)]]', block) for block in blocks])
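
The snippet above prints one list per matched block, while the question asks for a single flat array. As a minimal sketch of the same two-step approach, the version below runs on an in-memory copy of the sample text and flattens the result. The names `SAMPLE` and `extract_items` are hypothetical helpers introduced only for this illustration, and the heading is passed through `re.escape` on the assumption that it may contain regex metacharacters.

```python
import re

# Shortened copy of the sample text from the question (assumption: the real
# input has the same layout of {{Heading}} lines followed by :[N] lines).
SAMPLE = """{{Verkleinerungsformen}}
:[1] [[Äpfelchen]], [[Äpfelein]], [[Äpflein]]

{{Oberbegriffe}}
:[1] [[Kernobst]], [[Obst]]; [[Frucht]]
:[4] [[Kot]]
:[7] [[Gut]]

{{Unterbegriffe}}
:[1] [[Augustapfel]], [[Bohnapfel]]
"""


def extract_items(text, heading):
    """Return all [[...]] items listed directly under {{heading}}."""
    # Step 1: the heading line plus every directly following ":[N] ..." line.
    block_re = re.compile(
        r'^\{\{' + re.escape(heading) + r'\}\}(?:\n:\[\d+\].*)*', re.M
    )
    # Step 2: the text between [[ and ]] inside each matched block.
    item_re = re.compile(r'\[\[([^\]\[]+)\]\]')
    return [
        item
        for block in block_re.findall(text)
        for item in item_re.findall(block)
    ]


print(extract_items(SAMPLE, "Oberbegriffe"))
# ['Kernobst', 'Obst', 'Frucht', 'Kot', 'Gut']
```

If the heading does not occur in the text, the first pattern finds no block and the function returns an empty list.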