When I compare two strings, Python reports that they are not equal even though they look identical.
I am extracting text from 2 PDFs. The extracted text looks the same, but one of the strings appears to use a slightly different font. Why does this happen?
str1 = 'Conﬁrmations'  # contains the single ligature character U+FB01
str2 = 'Confirmations'
str1 == str2
False
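The problem is that the "ﬁ" in the first string is a single ligature character, U+FB01 (https://en.wikipedia.org/wiki/Typographic_ligature), while in the second string it is two separate characters, "f" followed by "i". They render almost identically, but they are different code points, so the comparison returns False.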
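To confirm this on your own strings, you can list the non-ASCII characters they contain. The snippet below is only a sketch: the helper name non_ascii_chars is made up for illustration, and the ligature is written as the escape \ufb01 so it cannot be confused with a plain "fi".

```python
import unicodedata

def non_ascii_chars(text):
    # Return (code point, Unicode name) for every character outside ASCII.
    return [(hex(ord(ch)), unicodedata.name(ch, 'UNKNOWN'))
            for ch in text if ord(ch) > 127]

str1 = 'Con\ufb01rmations'   # "fi" as one ligature character
str2 = 'Confirmations'       # "f" and "i" as two characters

print(len(str1), len(str2))      # 12 13
print(non_ascii_chars(str1))     # [('0xfb01', 'LATIN SMALL LIGATURE FI')]
print(non_ascii_chars(str2))     # []
```

The differing lengths are usually the quickest hint that a hidden ligature is involved.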
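You can use a function to check whether the ligature is present and substitute it with plain text:
def ligature(string):
    # '\ufb01' is the single-character "fi" ligature.
    if '\ufb01' in string:
        # str.replace returns a new string, so the result must be assigned.
        string = string.replace('\ufb01', 'fi')
    return string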
You can also add more if statements for other ligatures if you find them in your text.
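As an alternative to listing ligatures one by one, Unicode compatibility normalization (NFKC) from the standard unicodedata module expands them all at once. This is a sketch, not necessarily the right choice for every text: the function name normalize_ligatures is hypothetical, and NFKC also rewrites other compatibility characters (for example superscript digits and full-width letters), which may be more than you want.

```python
import unicodedata

def normalize_ligatures(text):
    # NFKC replaces compatibility characters such as U+FB01 ("fi" ligature)
    # and U+FB02 ("fl" ligature) with their plain-letter equivalents.
    return unicodedata.normalize('NFKC', text)

str1 = 'Con\ufb01rmations'   # extracted text containing the ligature
str2 = 'Confirmations'       # plain text

print(str1 == str2)                       # False
print(normalize_ligatures(str1) == str2)  # True
```

Applying the same normalization to both strings before comparing them keeps the comparison symmetric.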