python postgresql character-encoding database-migration iso-8859-2

Problem converting Polish characters (ISO-8859-2) to Postgres with UTF-8 encoding in Python


For a project I am working on, I have many .unl files (Informix) for different countries that need to be imported into Postgres. To do that, I am translating the Informix schema to a Postgres schema using Python.

My Python script opens every .unl file with this line:

open(file, 'r', encoding='latin1')

For countries that use latin1, the script works fine and the data looks correct in Postgres. The exception is Poland.

When I specify encoding='latin2' for Poland, the import script still runs successfully, but the Polish text ends up garbled in Postgres. The output unexpectedly looks like this:

[image: garbled Polish text in Postgres]

With the correct encoding, the expected result should look like this:

[image: expected Polish text in Postgres]

I have tried but still can't figure out how to fix it. I would really appreciate any suggestions on how to solve this problem. Thank you in advance!


Solution

  • This is a clear case of mojibake: the file appears to be UTF-8 encoded but is being read as latin2.

    The following (partially commented) code snippet reproduces it: type .\SO\78540135.py

    file = r'.\SO\78540135.txt'
    str_text = 'Aleksańdra Świętochowskiego'
    
    # create a sample file: utf-8 encoded
    with open(file, 'w', encoding='utf-8') as f:
        f.write(str_text)
    
    # read the file using the wrong encoding
    with open(file, 'r', encoding='latin2') as f:
        str_name = f.read()
    
    print('\nmojibake', str_name)
    
    # read the file using the correct encoding
    with open(file, 'r', encoding='utf-8') as f:
        str_name = f.read()
    
    print('\nUTF8text', str_name)
    

    Output: python .\SO\78540135.py

    mojibake AleksaĹdra ĹwiÄtochowskiego
    
    UTF8text Aleksańdra Świętochowskiego
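
Building on the snippet above, here is a minimal sketch of two possible remedies. It assumes the Polish .unl files really are UTF-8 encoded, which the question does not confirm. The helper names `read_text` and `repair_mojibake` are hypothetical and chosen only for illustration. The first tries UTF-8 before falling back to a legacy codec; the second undoes the damage in text that was already decoded with the wrong codec.

```python
def read_text(path, fallback="latin2"):
    """Read a file as UTF-8; use the legacy codec only if UTF-8 decoding fails.

    Hypothetical helper: assumes a file is either valid UTF-8 or in `fallback`.
    """
    with open(path, "rb") as f:
        raw = f.read()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so treat it as the legacy encoding.
        return raw.decode(fallback)


def repair_mojibake(text, wrong="latin2", right="utf-8"):
    """Reverse the effect of decoding `right`-encoded bytes with `wrong`.

    Re-encoding with the wrong codec recovers the original bytes, which
    are then decoded with the codec they were actually written in.
    """
    return text.encode(wrong).decode(right)
```

For example, `repair_mojibake(str_name)` applied to the garbled string from the first `read()` above should give back the original name. Trying UTF-8 first is a reasonable default because text in a legacy single-byte encoding that contains accented characters rarely happens to be valid UTF-8, but this is a heuristic, not a guarantee.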