I need to capture the title between the words TITLE and JOURNAL, but skip the match when the captured string is Direct Submission.
For instance, in the following text,
TITLE The Identification of Novel Diagnostic Marker Genes for the
Detection of Beer Spoiling Pediococcus damnosus Strains Using the
BlAst Diagnostic Gene findEr
JOURNAL PLoS One 11 (3), e0152747 (2016)
PUBMED 27028007
REMARK Publication Status: Online-Only
REFERENCE 2 (bases 1 to 462)
AUTHORS Behr,J., Geissler,A.J. and Vogel,R.F.
TITLE Direct Submission
JOURNAL Submitted (04-AUG-2015) Technische Mikrobiologie, Technische
the captured string needs to be only
'The Identification of Novel Diagnostic Marker Genes for the Detection of Beer Spoiling Pediococcus damnosus Strains Using the BlAst Diagnostic Gene findEr'
(either with or without newline characters, preferably without).
I tried applying regular expressions such as those offered here and here, but couldn't adapt them to my needs.
Thanks.
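(?<=TITLE)[\S\s]*?(?=JOURNAL)
should work. The lookbehind (?<=TITLE) makes sure that the match is preceded by TITLE, and the lookahead (?=JOURNAL) makes sure that it is followed by JOURNAL. In between, [\S\s]*? lazily matches any characters, including newlines, so the match stops at the first JOURNAL. Note that the match keeps the surrounding whitespace and the embedded newlines.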
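As a rough sketch of how this could be applied, assuming Python's re module (the question does not name a language) and a shortened copy of the sample text, the whitespace can be collapsed after matching to get the title without newline characters:

```python
import re

# Shortened copy of the sample text from the question.
text = """TITLE The Identification of Novel Diagnostic Marker Genes for the
Detection of Beer Spoiling Pediococcus damnosus Strains Using the
BlAst Diagnostic Gene findEr
JOURNAL PLoS One 11 (3), e0152747 (2016)
PUBMED 27028007
REFERENCE 2 (bases 1 to 462)
TITLE Direct Submission
JOURNAL Submitted (04-AUG-2015) Technische Mikrobiologie, Technische
"""

pattern = re.compile(r"(?<=TITLE)[\S\s]*?(?=JOURNAL)")

# Each raw match still contains the leading space and the line breaks;
# split() + join() collapses every run of whitespace into a single space.
titles = [" ".join(m.split()) for m in pattern.findall(text)]

for title in titles:
    print(title)
```

Without the exclusion, this pattern returns both titles, including Direct Submission.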
To exclude Direct Submission, use (?<=TITLE)(?!\s*Direct Submission)[\S\s]*?(?=JOURNAL). However, this approach will also exclude any title that starts with Direct Submission. Here is the result.
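The difference can be checked with a small sketch, again assuming Python's re module. The third record and the stricter pattern are illustrative additions, not part of the question: the stricter negative lookahead also requires JOURNAL to follow directly, so only a title that is exactly Direct Submission is skipped.

```python
import re

text = (
    "TITLE The Identification of Novel Diagnostic Marker Genes for the\n"
    "Detection of Beer Spoiling Pediococcus damnosus Strains Using the\n"
    "BlAst Diagnostic Gene findEr\n"
    "JOURNAL PLoS One 11 (3), e0152747 (2016)\n"
    "TITLE Direct Submission\n"
    "JOURNAL Submitted (04-AUG-2015) Technische Mikrobiologie, Technische\n"
    # Hypothetical record, not from the question, to show the difference.
    "TITLE Direct Submission of Genomes\n"
    "JOURNAL Hypothetical Journal\n"
)

# Pattern from the answer: skips every title that starts with "Direct Submission".
loose = re.compile(r"(?<=TITLE)(?!\s*Direct Submission)[\S\s]*?(?=JOURNAL)")

# Assumed stricter variant: skips only titles that are exactly "Direct Submission".
strict = re.compile(
    r"(?<=TITLE)(?!\s*Direct Submission\s*JOURNAL)[\S\s]*?(?=JOURNAL)"
)

def extract_titles(pattern, source):
    """Return all matches with whitespace runs collapsed to single spaces."""
    return [" ".join(m.split()) for m in pattern.findall(source)]

loose_titles = extract_titles(loose, text)
strict_titles = extract_titles(strict, text)

print(loose_titles)
print(strict_titles)
```

Here the first pattern keeps only the long title, while the stricter one also keeps the hypothetical "Direct Submission of Genomes".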