regex, python-re, python-regex

Capturing any character between two specified words including new lines


I need to capture the title between the words TITLE and JOURNAL, and to exclude the case in which the captured string is Direct Submission.
For instance, in the following text,

  TITLE     The Identification of Novel Diagnostic Marker Genes for the
            Detection of Beer Spoiling Pediococcus damnosus Strains Using the
            BlAst Diagnostic Gene findEr
  JOURNAL   PLoS One 11 (3), e0152747 (2016)
   PUBMED   27028007
  REMARK    Publication Status: Online-Only
REFERENCE   2  (bases 1 to 462)
  AUTHORS   Behr,J., Geissler,A.J. and Vogel,R.F.
  TITLE     Direct Submission
  JOURNAL   Submitted (04-AUG-2015) Technische Mikrobiologie, Technische

the captured string should be only
'The Identification of Novel Diagnostic Marker Genes for the Detection of Beer Spoiling Pediococcus damnosus Strains Using the BlAst Diagnostic Gene findEr', either with or without newline characters (preferably without).
I tried applying regular expressions such as those offered here and here, but couldn't apply them to my needs.
Thanks.


Solution

  • (?<=TITLE)[\S\s]*?(?=JOURNAL)

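    This should work. (?<=TITLE) is a lookbehind that makes sure the match is preceded by TITLE, and (?=JOURNAL) is a lookahead that makes sure it is followed by JOURNAL. In between, [\S\s]*? lazily matches any character, including newlines, so the match stops at the first JOURNAL.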

    To exclude Direct Submission, use (?<=TITLE)(?!\s*Direct Submission)[\S\s]*?(?=JOURNAL). Note that the negative lookahead only checks the beginning of the title, so this approach also excludes any title that merely starts with Direct Submission.
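
A minimal Python sketch of the two patterns above, run against the sample text from the question. The helper name `extract_titles` and the constant names are made up for illustration. `EXACT_ONLY` is an assumed stricter variant that is not part of the answer: it rejects a title only when Direct Submission is the whole title. Newlines and indentation are collapsed with `str.split` and `" ".join`.

```python
import re

# Sample text copied from the question.
text = """\
  TITLE     The Identification of Novel Diagnostic Marker Genes for the
            Detection of Beer Spoiling Pediococcus damnosus Strains Using the
            BlAst Diagnostic Gene findEr
  JOURNAL   PLoS One 11 (3), e0152747 (2016)
   PUBMED   27028007
  REMARK    Publication Status: Online-Only
REFERENCE   2  (bases 1 to 462)
  AUTHORS   Behr,J., Geissler,A.J. and Vogel,R.F.
  TITLE     Direct Submission
  JOURNAL   Submitted (04-AUG-2015) Technische Mikrobiologie, Technische
"""

# Pattern from the answer: everything between TITLE and the next JOURNAL.
ALL_TITLES = re.compile(r"(?<=TITLE)[\S\s]*?(?=JOURNAL)")

# Pattern from the answer: skip titles that begin with "Direct Submission".
NO_DIRECT = re.compile(r"(?<=TITLE)(?!\s*Direct Submission)[\S\s]*?(?=JOURNAL)")

# Assumed variant (not from the answer): skip a title only when
# "Direct Submission" is the entire text before JOURNAL.
EXACT_ONLY = re.compile(
    r"(?<=TITLE)(?!\s*Direct Submission\s*JOURNAL)[\S\s]*?(?=JOURNAL)"
)


def extract_titles(pattern, source):
    """Return all matches with runs of whitespace (including newlines)
    collapsed to single spaces and the ends trimmed."""
    return [" ".join(match.split()) for match in pattern.findall(source)]


print(extract_titles(ALL_TITLES, text))  # both titles
print(extract_titles(NO_DIRECT, text))   # only the first title
```

With the sample text, `NO_DIRECT` and `EXACT_ONLY` return the same single title. They differ only for a title such as "Direct Submission of strain data", which `NO_DIRECT` drops and `EXACT_ONLY` keeps.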