import re

fruit_list = ['apple banana', 'apple', 'pineapple', 'banana', 'banana apple', 'kiwi']
fruit = re.compile('|'.join(fruit_list))
fruit_re = [re.compile(r'\b(' + re.escape(fruit) + r')\b') for fruit in fruit_list]
fruit_re.append(re.compile(r'([#@])(\w+)'))
string = "this is pooapple is banana apple #apple"
for ft in fruit_re:
    match = re.finditer(ft, string)
    print(type(match))
    for mat in match:
        print(mat.span())
        print(mat.group())
        print("****************")
Above is the code that I am working with. The issue is that this snippet captures both #apple and the apple inside #apple. How do I ensure that only #apple is captured, and not the apple inside it?
(27, 32)
apple
****************
(34, 39)
apple
****************
<class 'callable_iterator'>
<class 'callable_iterator'>
(20, 26)
banana
****************
<class 'callable_iterator'>
(20, 32)
banana apple
****************
<class 'callable_iterator'>
<class 'callable_iterator'>
(33, 39)
#apple
****************
In the above output I am only interested in #apple (33, 39), not apple (34, 39).
Thank you.
Instead of placing word boundaries (\b) around each fruit in the list, you can use whitespace boundaries, e.g. match on:
(?<!\S)apple(?!\S)
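The issue here is that in #apple, the leftmost a is on a word boundary, since it is preceded by #, which is a non-word character. The lookarounds (?<!\S) and (?!\S) instead require whitespace, or the start or end of the string, on either side of the match.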
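A quick way to see the difference, as a minimal sketch using the string from the question (the names `word_bounded` and `space_bounded` are illustrative and not part of the original script):

```python
import re

string = "this is pooapple is banana apple #apple"

# \b matches between '#' (non-word) and 'a' (word), so the apple in #apple is found.
word_bounded = re.compile(r'\bapple\b')
# (?<!\S) and (?!\S) require whitespace or the string edge on each side.
space_bounded = re.compile(r'(?<!\S)apple(?!\S)')

word_spans = [m.span() for m in word_bounded.finditer(string)]
space_spans = [m.span() for m in space_bounded.finditer(string)]

print(word_spans)   # [(27, 32), (34, 39)]
print(space_spans)  # [(27, 32)]
```

With whitespace boundaries, only the standalone apple at (27, 32) remains, so (33, 39) is reported solely by the `[#@]` pattern.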
Your updated script:
import re

fruit_list = ['apple banana', 'apple', 'pineapple', 'banana', 'banana apple', 'kiwi']
fruit = re.compile('|'.join(fruit_list))
fruit_re = [re.compile(r'(?<!\S)(' + re.escape(fruit) + r')(?!\S)') for fruit in fruit_list]
fruit_re.append(re.compile(r'([#@])(\w+)'))
string = "this is pooapple is banana apple #apple"
for ft in fruit_re:
    match = re.finditer(ft, string)
    print(type(match))
    for mat in match:
        print(mat.span())
        print(mat.group())
        print("****************")
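One consequence of whitespace boundaries worth keeping in mind: a fruit followed directly by punctuation, such as `apple,`, no longer matches. If that matters for your input, one possible variant (an assumption on my part, not something the script above does) is to keep word-style boundaries but also reject a preceding `#` or `@`. The sample text and the names `strict` and `relaxed` below are made up for illustration:

```python
import re

# Hypothetical sample text, chosen to include punctuation and tag prefixes.
text = "apple, #apple and @apple but pineapple"

# Whitespace boundaries, as in the updated script.
strict = re.compile(r'(?<!\S)apple(?!\S)')
# Illustrative variant: no word character, '#' or '@' before; no word character after.
relaxed = re.compile(r'(?<![\w#@])apple(?!\w)')

strict_spans = [m.span() for m in strict.finditer(text)]
relaxed_spans = [m.span() for m in relaxed.finditer(text)]

print(strict_spans)   # []
print(relaxed_spans)  # [(0, 5)]
```

Which of the two is appropriate depends on whether punctuation can sit next to the fruit names in your data.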