I am trying to search for exact words in a file. I read the file by lines and loop through the lines to find the exact words. As the in
keyword is not suitable for finding exact words, I am using a regex pattern.
import re

def findWord(w):
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search
The problem with this function is that it doesn't recognize square brackets, as in [xyz].
For example
findWord('data_var_cod[0]')('Cod_Byte1 = DATA_VAR_COD[0]')
returns None
whereas
findWord('data_var_cod')('Cod_Byte1 = DATA_VAR_COD')
returns <_sre.SRE_Match object at 0x0000000015622288>
Can anybody please help me to tweak the regex pattern?
That's because the regex engine treats the square brackets as a character class, which is special regex syntax. To get rid of this problem you need to escape the special characters in your search term. You can use the re.escape function:
def findWord(w):
    return re.compile(r'\b({0})\b'.format(re.escape(w)), flags=re.IGNORECASE).search
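To see what escaping changes, here is a small sketch. The sample strings are made up for illustration, and the word boundaries are left out on purpose so that only the effect of re.escape is visible. Without escaping, [0] is a character class that matches the single character 0, so the pattern looks for data_var_cod0 rather than the literal brackets.

```python
import re

term = 'data_var_cod[0]'

# Unescaped: "[0]" is a character class matching the single character "0".
unescaped = re.compile(term, flags=re.IGNORECASE)
print(bool(unescaped.search('DATA_VAR_COD0')))    # True
print(bool(unescaped.search('DATA_VAR_COD[0]')))  # False

# Escaped: the brackets are matched literally.
escaped = re.compile(re.escape(term), flags=re.IGNORECASE)
print(bool(escaped.search('DATA_VAR_COD[0]')))    # True
print(bool(escaped.search('DATA_VAR_COD0')))      # False
```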
Also, as a more Pythonic way to get all matches, you can use re.findall(), which returns a list of matches, or re.finditer(), which returns an iterator of match objects.
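A minimal sketch of both calls, assuming a made-up input line with two occurrences of the term. The pattern here has no groups and no word boundaries, so findall returns the full matched strings.

```python
import re

text = 'Cod_Byte1 = DATA_VAR_COD[0]; Cod_Byte2 = data_var_cod[0]'
pattern = re.compile(re.escape('data_var_cod[0]'), flags=re.IGNORECASE)

# findall returns the matched strings as a list.
all_matches = pattern.findall(text)
print(all_matches)  # ['DATA_VAR_COD[0]', 'data_var_cod[0]']

# finditer yields match objects, which also carry the match positions.
spans = [m.span() for m in pattern.finditer(text)]
print(spans)
```

If you only need the text of each hit, findall is enough; finditer is the better fit when you also want to know where each hit is.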
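But this is still not complete, because a word boundary (\b) only matches between a word character and a non-word character. If your search term starts or ends with a non-word character such as [ or ], the boundary only matches when a word character sits directly next to it, so a term surrounded by spaces is not found: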
>>> ss = 'hello string [processing] in python.'
>>> re.compile(r'\b({0})\b'.format(re.escape('[processing]')), flags=re.IGNORECASE).search(ss)
>>>
>>> re.compile(r'({})'.format(re.escape('[processing]')), flags=re.IGNORECASE).search(ss).group(0)
'[processing]'
So I suggest removing the word boundaries if your words contain non-word characters.
But as a more general approach, you can use the following regex, which matches a word that is preceded by a space or the start of the string and is followed (checked with a positive lookahead) by a space, a dot, or the end of the string:
r'(?: |^)({})(?=[. ]|$)'
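Plugged into the original helper, that pattern could look like the sketch below. The name find_word is only illustrative, and the third test string is an assumption added to show that partial words are rejected. Note that the leading space is part of the overall match, so the word itself is read from group(1) rather than group(0).

```python
import re

def find_word(w):
    # The term must be preceded by a space or the start of the string,
    # and followed by ".", a space, or the end of the string.
    pattern = r'(?: |^)({})(?=[. ]|$)'.format(re.escape(w))
    return re.compile(pattern, flags=re.IGNORECASE).search

m = find_word('data_var_cod[0]')('Cod_Byte1 = DATA_VAR_COD[0]')
print(m.group(1))  # DATA_VAR_COD[0]

m2 = find_word('[processing]')('hello string [processing] in python.')
print(m2.group(1))  # [processing]

# A partial word is not matched, because "_" follows "DATA_VAR".
print(find_word('data_var')('Cod_Byte1 = DATA_VAR_COD[0]'))  # None
```

This only treats spaces and dots as separators. If your file uses other delimiters (commas, tabs, parentheses), they would have to be added to the pattern.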