python regex text matching

Regular expression for matching everything until a word is found


I have a block of text that is repeated several times. Here is a sample of that text:

DEMO of the text

The idea is to have a regular expression with three groups and apply it to every occurrence throughout the text. Here is an example of the expected matches:

group1 = HORIZON-CL5-2021-D1-01
group2 (Opening) = 15 Apr 2021
group3 (Deadline(s)) = 07 Sep 2021


group1 = HORIZON-CL5-2022-D1-01-two-stage
group2 (Opening) = 04 Nov 2021
group3 (Deadline(s)) = 15 Feb 2022 (First Stage), 07 Sep 2022 (Second Stage)

I am trying with this regular expression:

\n(HORIZON-\S+-[A-Z]{1}\d{1}-\d{2}).*?^Opening

It almost works, but the regular expression still needs to express two more things:

  1. In some cases, extra text follows the last number of HORIZON..., as in the second example:

HORIZON-CL5-2022-D1-01-two-stage

  2. It should capture everything until the word 'Opening:' appears at the beginning of a line. I thought I was doing this with the .*?^Opening part of the expression, but that does not seem to be correct.

How can I solve this?


Solution

  • To capture the -two-stage suffix in group 1, add \S* (zero or more non-whitespace characters) to the end of the existing group.

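    You don't need the s modifier to make the dot match a newline. Instead, match every line that does not start with Opening using a negative lookahead, then match the Opening line and capture the opening date and the deadline part in their own capture groups. Because the pattern starts with ^, enable multiline mode (re.MULTILINE in Python) so that ^ matches at the start of every line.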

    Note that you can omit {1}, since a token matches exactly once by default.

    ^(HORIZON-\S+-[A-Z]\d-\d{2}\S*)(?:\r?\n(?!Opening\b).*)*\r?\nOpening: (.+)\r?\nDeadline\(s\): (.+)
    

    Regex demo
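As a rough sketch of how the pattern above could be applied in Python: the sample text below is hypothetical, with a layout assumed from the expected groups in the question (the original demo text is not reproduced here), and the names SAMPLE, PATTERN and matches are illustrative only.

```python
import re

# Hypothetical sample: the line layout is an assumption based on the
# expected groups listed in the question, not the original demo text.
SAMPLE = """\
HORIZON-CL5-2021-D1-01
Some descriptive line
Another line
Opening: 15 Apr 2021
Deadline(s): 07 Sep 2021

HORIZON-CL5-2022-D1-01-two-stage
Some descriptive line
Opening: 04 Nov 2021
Deadline(s): 15 Feb 2022 (First Stage), 07 Sep 2022 (Second Stage)
"""

# re.MULTILINE makes ^ match at the start of every line, not only
# at the start of the whole string.
PATTERN = re.compile(
    r"^(HORIZON-\S+-[A-Z]\d-\d{2}\S*)"  # group 1: topic id plus optional suffix
    r"(?:\r?\n(?!Opening\b).*)*"        # skip lines that do not start with Opening
    r"\r?\nOpening: (.+)"               # group 2: opening date
    r"\r?\nDeadline\(s\): (.+)",        # group 3: deadline(s)
    re.MULTILINE,
)

# One (topic, opening, deadlines) tuple per repeated block.
matches = [m.groups() for m in PATTERN.finditer(SAMPLE)]
for topic, opening, deadlines in matches:
    print(topic, opening, deadlines, sep=" | ")
```

re.finditer walks through every repeated block in turn, so no explicit loop over lines is needed.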

    You can make the date-like groups as specific as you want, since .+ is a broad match.

    For example:

    ^(HORIZON-\S+-[A-Z]\d-\d{2}\S*)(?:\r?\n(?!Opening\b).*)*\r?\nOpening: (\d{2} [A-Z][a-z]{2} \d{4})\r?\nDeadline\(s\): (\d{2} [A-Z][a-z]{2} \d{4}.*)
    

    Regex demo
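To illustrate what the stricter pattern buys you, here is a minimal Python sketch under stated assumptions: the input text is hypothetical, the second block (HORIZON-CL5-2022-D1-02 with a non-date Opening line) is invented purely for the example, and the helper names STRICT, DATE and results are not from the original answer.

```python
import re

# Stricter variant: Opening and Deadline(s) must start with a
# "DD Mon YYYY" style date.
STRICT = re.compile(
    r"^(HORIZON-\S+-[A-Z]\d-\d{2}\S*)"
    r"(?:\r?\n(?!Opening\b).*)*"
    r"\r?\nOpening: (\d{2} [A-Z][a-z]{2} \d{4})"
    r"\r?\nDeadline\(s\): (\d{2} [A-Z][a-z]{2} \d{4}.*)",
    re.MULTILINE,
)

# Helper pattern (an addition for illustration) to pull the individual
# dates out of a multi-stage deadline string.
DATE = re.compile(r"\d{2} [A-Z][a-z]{2} \d{4}")

# Hypothetical input: the second block is made up and has an Opening
# line that is not a date, so the strict pattern should skip it.
TEXT = """\
HORIZON-CL5-2022-D1-01-two-stage
Some descriptive line
Opening: 04 Nov 2021
Deadline(s): 15 Feb 2022 (First Stage), 07 Sep 2022 (Second Stage)

HORIZON-CL5-2022-D1-02
Opening: to be announced
Deadline(s): 07 Sep 2022
"""

results = [
    (m.group(1), m.group(2), DATE.findall(m.group(3)))
    for m in STRICT.finditer(TEXT)
]
print(results)
```

With the broad .+ groups the second block would also match and yield "to be announced" as the opening date; whether rejecting such blocks is desirable depends on how clean the source text is.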