python, regex, python-re, findall, textmatching

Reading text between multiple newline characters and whitespaces using regex


I'm trying to read these underlined headings using regex.

Each heading is preceded by two or more newline characters and two or more whitespace characters, and is followed by ONE whitespace character and two newline characters. The headings are in all CAPITAL letters.

I tried with r"(\n{2,}\s{2,})(?:([A-Z]+)\s([A-Z]*))" but it did not work.


Any help is greatly appreciated! Thanks in advance.


Solution

  • This appears to work.

    import re

    print(re.findall(r'\n{2,}\s{2,}([A-Z\s]+)\s\n', data, re.X))
    

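    Based on the snippet above, this returns the following (each match keeps a trailing space, because the greedy [A-Z\s]+ group also consumes the whitespace after the heading):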

    ['ROBOT ', 'TRAFFIC LIGHT ', 'TRAFFIC LIGHT ']
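For anyone who wants to try this without the original file, here is a minimal, self-contained sketch. The sample text in `data` is hypothetical (the real input was only shown as an image), and the stricter pattern `HEADING` is an assumption about the layout described in the question: it takes the leading whitespace to be spaces or tabs and the words in a heading to be separated by single spaces. It is offered as one possible variant, not as the definitive fix.

```python
import re

# Hypothetical sample text; the original data was only shown as an image.
data = (
    "intro line\n"
    "\n\n"
    "   ROBOT \n"
    "\n"
    "A robot is a machine.\n"
    "\n\n"
    "   TRAFFIC LIGHT \n"
    "\n"
    "A traffic light controls traffic.\n"
)

# Pattern from the answer. The greedy class [A-Z\s]+ backtracks until it is
# followed by a whitespace character and a newline, so the captured text
# keeps the single space after the heading.
loose = re.findall(r'\n{2,}\s{2,}([A-Z\s]+)\s\n', data)

# Stricter variant (assumed layout): two or more newlines, two or more
# spaces/tabs, capitalised words separated by single spaces, then exactly
# one space and two newlines. The capture group excludes the trailing space.
HEADING = re.compile(r'\n{2,}[ \t]{2,}([A-Z]+(?: [A-Z]+)*) \n\n')
strict = HEADING.findall(data)

print(loose)   # ['ROBOT ', 'TRAFFIC LIGHT ']
print(strict)  # ['ROBOT', 'TRAFFIC LIGHT']
```

If the original pattern is good enough for the real data, calling `.strip()` on each match is a simpler way to drop the trailing space. The `re.X` flag is omitted here because the pattern contains no literal whitespace or comments for it to ignore.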