import re
text = """
This is a line.
Short
Long line
<!-- Comment line -->
"""
pattern = r'''(?:^.{1,3}$|^.{4}(?<!<!--))'''
matches = re.findall(pattern, text, flags=re.MULTILINE)
print(matches)
OUTPUT with pattern = r'''(?:^.{1,3}$|^.{4}(?<!<!--))''':
['This', 'Shor', 'Long']
OUTPUT with pattern = r'''(?:^.{1,3}$|^.{3}(?<!<!--))''':
['Thi', 'Sho', 'Lon', '<!-']
OUTPUT with pattern = r'''(?:^.{1,3}$|^.{5}(?<!<!--))''':
['This ', 'Short', 'Long ', '<!-- ']
Any number other than 4 in .{4}(?<!<!--) causes <!-- to be matched and displayed. How?
Here is the regex pattern broken down:
(?:            # match either
  ^.{1,3}$     # ...a line of 1 to 3 characters, any characters (e.g. "aaa")
  |            # ...or
  ^.{4}        # ...4 characters of any kind, from the start of a line
  (?<!         # provided those 4 characters are not
    <!--       # these ones
  )
)
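The breakdown above can also be written as a working pattern. This is a minimal sketch, assuming the same sample text as in the question and using `re.VERBOSE` so the comments can live inside the regex; the name `verbose_pattern` is only illustrative.

```python
import re

text = """
This is a line.
Short
Long line
<!-- Comment line -->
"""

# Same pattern as in the question, spread out with re.VERBOSE.
# In verbose mode, whitespace in the pattern is ignored and '#' starts a comment.
verbose_pattern = re.compile(
    r"""
    (?:
        ^.{1,3}$      # a whole line of 1 to 3 characters
      |               # or
        ^.{4}         # the first 4 characters of a line
        (?<!<!--)     # unless the 4 characters just consumed are <!--
    )
    """,
    flags=re.MULTILINE | re.VERBOSE,
)

matches = verbose_pattern.findall(text)
print(matches)  # ['This', 'Shor', 'Long']
```

The result is the same as for the one-line pattern, since verbose mode only changes how the pattern is written, not what it matches.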
Now the basic pattern has been broken down, we can look at the variants:
r'''(?:^.{1,3}$|^.{3}(?<!<!--))'''
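With this one, the second part no longer does what was intended: it consumes three characters, but the lookbehind still checks the four characters immediately before the current position and compares them with the four-character string "<!--". On the comment line, those four characters are the preceding newline followed by <!-, which is not <!--, so the lookbehind succeeds. That is why <!- is part of the output.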
r'''(?:^.{1,3}$|^.{5}(?<!<!--))'''
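The same applies here as for the previous example, except that in this case the second part consumes five characters, not three. The lookbehind then checks the last four of them, which on the comment line are !-- followed by a space. Once again that is not <!--, so <!-- plus the trailing space is matched.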
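To see this directly, here is a small sketch that prints what the lookbehind actually inspects for each width. It assumes the sample text from the question; `lookbehind_windows` is a hypothetical helper written for this illustration, not part of the `re` module.

```python
import re

text = """
This is a line.
Short
Long line
<!-- Comment line -->
"""

def lookbehind_windows(n):
    """For each line with at least n characters, return (consumed, window):
    the n characters that ^.{n} consumes, and the 4 characters ending at
    that position, which is what (?<!<!--) compares against '<!--'."""
    results = []
    for m in re.finditer(r"^.{%d}" % n, text, flags=re.MULTILINE):
        end = m.end()
        window = text[max(0, end - 4):end]
        results.append((m.group(), window))
    return results

for n in (3, 4, 5):
    consumed, window = lookbehind_windows(n)[-1]  # the comment line is the last one
    # The last column is True when the negative lookbehind would succeed.
    print(n, repr(consumed), repr(window), window != "<!--")

# 3 '<!-' '\n<!-' True
# 4 '<!--' '<!--' False
# 5 '<!-- ' '!-- ' True
```

Only with a width of 4 does the inspected window line up exactly with <!--, which is why that is the only variant that rejects the comment line.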
Hope this helps!