python-3.x ligature

How to compare two strings that contain different Unicode characters?


When I compare two strings, Python reports that they are not equal even though they look identical.

I am extracting text from 2 PDFs. The extracted text is the same, but I can see a font change in one of them, and I do not understand why.

str1 = 'Conﬁrmations'

str2 = 'Confirmations'

str1 == str2

False


Solution

  • The problem is that the "fi" in the first string is a single ligature character, "ﬁ" (U+FB01; see https://en.wikipedia.org/wiki/Typographic_ligature), while in the second string it is the two separate characters "f" and "i". The strings look the same but contain different code points, so they compare as unequal.

    You can use a function that replaces the ligature with its plain-text equivalent:

    def ligature(string):
        # '\ufb01' is the single-character "fi" ligature (U+FB01).
        # str.replace returns a new string, so return its result.
        return string.replace('\ufb01', 'fi')
    

    You can add further replacements for other ligatures if you find more in your text.
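
As a rough sketch of how this could be generalised to several ligatures, the block below shows two options. The names `LIGATURES`, `expand_ligatures` and `normalize_compat` are made up for illustration and are not part of the answer above. The first option uses an explicit table, so only the listed characters change. The second uses the standard library's `unicodedata.normalize` with the NFKC form, which also expands these ligatures but rewrites other compatibility characters too (for example superscript digits and full-width letters), so it may change more than you want.

```python
import unicodedata

# Hypothetical table covering the common Latin f-ligatures (U+FB00..U+FB04).
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# Build the translation table once; each key is a single character.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Replace only the ligatures listed in LIGATURES."""
    return text.translate(_TABLE)


def normalize_compat(text):
    """Apply Unicode NFKC normalization, which expands compatibility
    characters such as ligatures (and other compatibility forms)."""
    return unicodedata.normalize("NFKC", text)


str1 = "Con\ufb01rmations"  # contains the single-character "fi" ligature
str2 = "Confirmations"      # plain "f" followed by "i"

print(str1 == str2)                                      # False
print(expand_ligatures(str1) == str2)                    # True
print(normalize_compat(str1) == normalize_compat(str2))  # True
```

If you go the normalization route, normalize both strings before comparing so that each side goes through the same transformation.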