python regex regex-lookarounds lookbehind

Extracting two strings from between two characters. Why doesn't my regex match and how can I improve it?


I'm learning about regular expressions and I want to extract a string from a text that has the following characteristic:

Meaning I want to capture the string that comes between the letters C and P, as well as the string that comes after the letter P, and concatenate them into a single string, while discarding the letters C and P.

Examples of valid strings would be:

c45AFP2
CAPF
c56Bp26
CA6C22pAAA

For the above examples what I want would be to extract the following, in the same order:

45AF2     # Original string: c45AFP2
AF        # Original string: CAPF
56B26     # Original string: c56Bp26
A6C22AAA  # Original string: CA6C22pAAA

Examples of invalid strings would be:

BCA6C22pAAA  # It doesn't begin with C
c56Bp  # There aren't any characters after P
c45AF0P2  # Contains a zero

I'm using Python and I want a regex to extract the two strings: the one that comes between the characters C and P, and the one that comes after P.

So far I've come up with this:

(?<=\A[cC])[a-fA-F1-9]*(?<=[pP])[a-fA-F1-9]*

A breakdown would be:

(?<=\A[cC]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [cC] and that [cC] must be at the beginning of the string

[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times

(?<=[pP]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [pP]

[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times

But with the above regex I can't match any of the strings!
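A minimal reproduction, assuming the standard re module and re.search (the variable names here are only for illustration):

```python
import re

# The first attempt from above, unchanged.
first_attempt = re.compile(r"(?<=\A[cC])[a-fA-F1-9]*(?<=[pP])[a-fA-F1-9]*")

# The four strings listed as valid examples.
samples = ["c45AFP2", "CAPF", "c56Bp26", "CA6C22pAAA"]

# search() returns None when there is no match anywhere in the string.
first_results = [first_attempt.search(s) for s in samples]
for s, r in zip(samples, first_results):
    print(s, "->", r)
```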

When I insert a | in between (?<=[cC])[a-fA-F1-9]* and (?<=[pP])[a-fA-F1-9]* it works.

Meaning the below regex works:

(?<=[cC])[a-fA-F1-9]*|(?<=[pP])[a-fA-F1-9]*

I know that | means that it should match at most one of the specified expressions. But it's non-greedy and it returns the first match that it finds. The remaining expressions aren't tested, right?

But using | means the string BCA6C22pAAA is a partial match to AAA since it comes after P, even though the first assertion isn't true, since it doesn't begin with a C.
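This can be reproduced with a short script, assuming re.findall is used to collect every match (the variable names are only for illustration). Each alternative is tried on its own, so nothing ties the part after P to a leading C:

```python
import re

# The alternation version from above, unchanged.
alternation = re.compile(r"(?<=[cC])[a-fA-F1-9]*|(?<=[pP])[a-fA-F1-9]*")

# A string that should be accepted.
valid_parts = alternation.findall("c45AFP2")

# A string that should be rejected because it does not begin with C,
# yet both alternatives still find something to match.
invalid_parts = alternation.findall("BCA6C22pAAA")

print(valid_parts)
print(invalid_parts)
```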

That shouldn't be the case. I want it to only match if all conditions explained in the beginning are true.

Could someone explain to me why my first attempt doesn't produce the result I want? Also, how can I improve my regex?

I still need it to:

Thank you


Solution

  • To match both groups before and after P or p

    (?<=^[Cc])[1-9a-fA-F]+(?=[Pp]([1-9a-fA-F]+$))
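The first attempt in the question cannot match because P is not in [a-fA-F1-9]: wherever the first character class stops, the preceding character is never P, so (?<=[pP]) always fails. The pattern above avoids this by checking P with a lookahead and capturing the part after it in group 1. A minimal sketch of how it might be used in Python follows; the helper name extract is hypothetical, and joining group 0 with group 1 is an assumption about how the concatenated result is meant to be built:

```python
import re

# Pattern from the solution above, unchanged.
pattern = re.compile(r"(?<=^[Cc])[1-9a-fA-F]+(?=[Pp]([1-9a-fA-F]+$))")

def extract(text):
    """Return the part between C and P joined with the part after P, or None."""
    m = pattern.search(text)
    if m is None:
        return None
    # group(0) is the text between C and P;
    # group(1) is the text after P, captured inside the lookahead.
    return m.group(0) + m.group(1)

examples = [
    "c45AFP2", "CAPF", "c56Bp26", "CA6C22pAAA",  # valid in the question
    "BCA6C22pAAA", "c56Bp", "c45AF0P2",          # invalid in the question
]
for s in examples:
    print(s, "->", extract(s))
```

With this approach the three invalid strings from the question yield None, since the start-of-string check, the character class, and the non-empty tail after P all have to hold in a single match.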