python regex

Regex in Python - Only capture exact match


import re
fruit_list = ['apple banana', 'apple', 'pineapple', 'banana', 'banana apple',  'kiwi']
fruit = re.compile('|'.join(fruit_list))
fruit_re = [ re.compile(r'\b('+re.escape(fruit)+r')\b') for fruit in fruit_list]
fruit_re.append(re.compile( r'([#@])(\w+)'))

string = "this is pooapple is banana apple #apple"

for ft in fruit_re:
    
    match = re.finditer(ft, string)  
    print(type(match))
    for mat in match:
        
        print(mat.span())
        print(mat.group())
        print("****************")

Above is the code that I am working with. The issue is that this snippet captures both #apple and the apple inside #apple. How do I ensure that only #apple is captured, and not the apple inside it?

(27, 32)
apple
****************
(34, 39)
apple
****************
<class 'callable_iterator'>
<class 'callable_iterator'>
(20, 26)
banana
****************
<class 'callable_iterator'>
(20, 32)
banana apple
****************
<class 'callable_iterator'>
<class 'callable_iterator'>
(33, 39)
#apple
****************

In the above output I am only interested in #apple (33, 39) and not apple (34, 39).

Ty


Solution

  • Instead of placing word boundaries (\b) around the fruit in the list, you can use whitespace boundaries, e.g. match on:

    (?<!\S)apple(?!\S)
    

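    The issue is that in #apple, the leftmost a sits on a word boundary, since it is preceded by #, which is a non-word character, so \bapple\b matches there. The lookarounds (?<!\S) and (?!\S) instead require the match to be preceded and followed by whitespace, or by the start or end of the string.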

    Your updated script:

    import re

    fruit_list = ['apple banana', 'apple', 'pineapple', 'banana', 'banana apple',  'kiwi']

    # One pattern per fruit, bounded by whitespace (or string start/end)
    # instead of \b, so the apple inside #apple is no longer matched.
    fruit_re = [re.compile(r'(?<!\S)(' + re.escape(fruit) + r')(?!\S)') for fruit in fruit_list]
    # Hashtags and mentions are still captured by their own pattern.
    fruit_re.append(re.compile(r'([#@])(\w+)'))

    string = "this is pooapple is banana apple #apple"

    for ft in fruit_re:
        # Call finditer on the compiled pattern directly.
        match = ft.finditer(string)
        print(type(match))
        for mat in match:
            print(mat.span())
            print(mat.group())
            print("****************")
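To see the difference between the two kinds of boundary in isolation, here is a minimal sketch that runs only the apple pattern against the string from the question. The variable names (word_bounded, space_bounded, and so on) are made up for this illustration and are not part of the original script.

```python
import re

text = "this is pooapple is banana apple #apple"

# Word boundaries: \b also matches between '#' and 'a', so the
# apple inside '#apple' is found as well.
word_bounded = re.compile(r'\b(' + re.escape('apple') + r')\b')

# Whitespace boundaries: the match must be preceded and followed by
# whitespace, or by the start/end of the string.
space_bounded = re.compile(r'(?<!\S)(' + re.escape('apple') + r')(?!\S)')

word_spans = [m.span() for m in word_bounded.finditer(text)]
space_spans = [m.span() for m in space_bounded.finditer(text)]
tag_spans = [m.span() for m in re.finditer(r'[#@]\w+', text)]

print(word_spans)   # [(27, 32), (34, 39)]
print(space_spans)  # [(27, 32)]
print(tag_spans)    # [(33, 39)]
```

One thing to keep in mind: whitespace boundaries are stricter than \b, so an apple that is directly followed by punctuation (for example "apple,") would not match either. Whether that matters depends on your input.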