I'm learning about regular expressions and I want to extract a string from a text that has the following characteristics:
- It begins with the letter C, in either lowercase or uppercase, which is then followed by a number of hexadecimal characters (meaning it can contain the letters A to F and numbers from 1 to 9, with no zeros included).
- These are followed by the letter P, also either in lowercase or uppercase, and then by at least one more hexadecimal character.
Meaning I want to capture the strings that come in between the letters C and P as well as the string that comes after the letter P and concatenate them into a single string, while discarding the letters C and P.
Examples of valid strings would be:
c45AFP2
CAPF
c56Bp26
CA6C22pAAA
For the above examples what I want would be to extract the following, in the same order:
45AF2 # Original string: c45AFP2
AF # Original string: CAPF
56B26 # Original string: c56Bp26
A6C22AAA # Original string: CA6C22pAAA
Examples of invalid strings would be:
BCA6C22pAAA # It doesn't begin with C
c56Bp # There aren't any characters after P
c45AF0P2 # Contains a zero
I'm using Python and I want a regex to extract the two strings that come both in between the characters C and P as well as after P.
So far I've come up with this:
(?<=\A[cC])[a-fA-F1-9]*(?<=[pP])[a-fA-F1-9]*
A breakdown would be:
(?<=\A[cC])
Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [cC] and that [cC] must be at the beginning of the string
[a-fA-F1-9]*
Matches a single character in the list between zero and unlimited times
(?<=[pP])
Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [pP]
[a-fA-F1-9]*
Matches a single character in the list between zero and unlimited times
But with the above regex I can't match any of the strings!
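A minimal way to reproduce this, assuming the pattern is run with Python's `re` module against the valid examples listed above (the variable names are only for illustration):

```python
import re

# First attempt from above: the second lookbehind sits right after the hex run.
first_attempt = re.compile(r"(?<=\A[cC])[a-fA-F1-9]*(?<=[pP])[a-fA-F1-9]*")

valid_examples = ["c45AFP2", "CAPF", "c56Bp26", "CA6C22pAAA"]

# Store the match object (or None) for every example that should be valid.
results = {s: first_attempt.search(s) for s in valid_examples}

for s, m in results.items():
    print(s, "->", m.group(0) if m else None)
```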
When I insert a | in between (?<=[cC])[a-fA-F1-9]* and (?<=[pP])[a-fA-F1-9]* it works.
Meaning the below regex works:
(?<=[cC])[a-fA-F1-9]*|(?<=[pP])[a-fA-F1-9]*
I know that | means that it should match at most one of the specified regex expressions. But it's non greedy and it returns the first match that it finds. The remaining expressions aren’t tested, right?
But using | means the string BCA6C22pAAA is a partial match to AAA since it comes after P, even though the first assertion isn't true, since it doesn't begin with a C.
That shouldn't be the case. I want it to only match if all conditions explained in the beginning are true.
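The unwanted match can be seen with a short check, again assuming Python's `re` module (`leaked` is just an illustrative name):

```python
import re

# Alternation version: each branch is tried on its own at every position,
# so the second branch can match without the first one ever having matched.
alternation = re.compile(r"(?<=[cC])[a-fA-F1-9]*|(?<=[pP])[a-fA-F1-9]*")

# This string is supposed to be invalid because it does not begin with C.
leaked = alternation.findall("BCA6C22pAAA")
print(leaked)
```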
Could someone explain to me why my first attempt doesn't produce the result I want? Also, how can I improve my regex?
I still need it to:
Thank you
To match both groups before and after P or p:
(?<=^[Cc])[1-9a-fA-F]+(?=[Pp]([1-9a-fA-F]+$))
(?<=^[Cc])
- Positive Lookbehind. Must match C or c at the start of the line
[1-9a-fA-F]+
- Matches hexadecimal characters (excluding zero) one or more times
(?=[Pp]
- Positive Lookahead for p or P
([1-9a-fA-F]+$)
- Capture group for one or more hexadecimal characters following the p or P, up to the end of the line
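A minimal sketch of how this pattern could be used in Python to build the concatenated string. The helper name `extract` is hypothetical; the overall match is the part between C and P, and group 1 is the part after P that was captured inside the lookahead:

```python
import re

# Pattern from this answer. group(0) is the run between C and P,
# group(1) is the run after P, captured inside the lookahead.
PATTERN = re.compile(r"(?<=^[Cc])[1-9a-fA-F]+(?=[Pp]([1-9a-fA-F]+$))")


def extract(text):
    """Return both hex runs joined together, or None if text does not qualify."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return m.group(0) + m.group(1)


samples = ["c45AFP2", "CAPF", "c56Bp26", "CA6C22pAAA",
           "BCA6C22pAAA", "c56Bp", "c45AF0P2"]
for s in samples:
    print(s, "->", extract(s))
```

The last three samples are the invalid strings from the question, and `extract` returns `None` for each of them.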
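As for why the first attempt in the question never matches: a lookbehind only inspects the character just before the current position and consumes nothing. Right after [a-fA-F1-9]* that character is either a hexadecimal character or the leading C, never P, so (?<=[pP]) cannot succeed. A small sketch of this, where the second pattern is only an illustrative variant that consumes the P instead of looking behind for it:

```python
import re

s = "c45AFP2"

# Question's first attempt: the lookbehind for P is checked before P is reached.
failing = re.search(r"(?<=\A[cC])[a-fA-F1-9]*(?<=[pP])[a-fA-F1-9]*", s)

# Illustrative variant: P is consumed as part of the match.
consuming = re.search(r"(?<=\A[cC])[a-fA-F1-9]*[pP][a-fA-F1-9]*", s)

print(failing)
print(consuming.group(0))
```

The variant does match, but the P becomes part of the result, which is why the pattern in this answer keeps P inside a lookahead and captures what follows it in a group.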