Tags: regex, text-extraction, pcre2

Regex to get the text between upper-case chapter titles


I am trying to extract chapters/sections from txt files that were generated by running pdftotext on Portuguese lawsuit documents. Initially I tried this regex to at least get each chapter title:

^[A-Z\s\d\W]+$

It appeared to work for this example: https://regex101.com/r/FQKsy4/1

But, for this one: https://regex101.com/r/BEO55p/3

I got some matches that are not titles.

So, how can I get not only each chapter/section title but also its content?

I tried a regex to get each chapter and its content, but it did not work very well on some documents.


Solution

  • An approach using 2 capture groups (group 1 for the title, group 2 for the content that follows it):

    ^[^\S\n]*([A-Z][^a-z]*)((?:\n(?![^\S\n]*[A-Z][^a-z\n]*$).*)*)$
    

    A slightly more PCRE-specific approach, using \h for horizontal whitespace, \R for any newline sequence, and an atomic group:

    ^\h*([A-Z][^a-z]*)((?>\R(?!\h*[A-Z][^a-z\r\n]*$).*)*)$
    

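As a rough illustration of how the two capture groups can be consumed, here is a minimal Python sketch that uses the first pattern unchanged. The function name split_chapters and the sample text are made up for this example and are not taken from the documents in the question. Python's built-in re module does not support \h or \R, so the second (PCRE) pattern is not used here.

```python
import re

# First pattern from the answer, unchanged.
# re.MULTILINE makes ^ and $ match at line boundaries.
CHAPTER_RE = re.compile(
    r"^[^\S\n]*([A-Z][^a-z]*)((?:\n(?![^\S\n]*[A-Z][^a-z\n]*$).*)*)$",
    re.MULTILINE,
)


def split_chapters(text):
    """Return a list of (title, content) pairs found in text.

    Hypothetical helper: group 1 is the upper-case title,
    group 2 is every following line up to the next title.
    """
    return [
        (m.group(1).strip(), m.group(2).strip())
        for m in CHAPTER_RE.finditer(text)
    ]


# Made-up sample text, only to show the shape of the result.
SAMPLE = (
    "INTRODUCTION\n"
    "This is the first paragraph.\n"
    "It has two lines.\n"
    "CHAPTER 1\n"
    "Body of chapter one.\n"
)

for title, content in split_chapters(SAMPLE):
    print(repr(title), repr(content))
```

Note that the title group uses [^a-z]*, which can also match newlines, so consecutive lines without lower-case letters end up in the same title. Whether that is desirable depends on the documents.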