python windows character-encoding

Error running Python script: 'utf-8' codec can't decode byte 0xed in position 79: invalid continuation byte


C:\inetpub\wwwroot\proyecto_transporte>python conexion.py
Unexpected error: 'utf-8' codec can't decode byte 0xed in position 79: invalid continuation byte

I have already tried saving the file as ANSI, and I have also verified that it is saved as UTF-8, but nothing works. I don't know what else to try. If anyone has had a similar issue, I would really appreciate your help.

Here is the content of my script:

import psycopg2

def verificar_conexion():
    try:
        conexion = psycopg2.connect(
            host="10.240.0.96",
            database="transporte_dev",
            user="postgres",
            password="***********",
            port="5432"
        )

        print("Conexion exitosa a PostgreSQL")

        cursor = conexion.cursor()
        cursor.execute("SELECT version();")
        version = cursor.fetchone()
        print(f"Version de PostgreSQL: {version[0]}")

        cursor.close()
        conexion.close()
        print("Conexion cerrada correctamente")
        
        return True
        
    except psycopg2.OperationalError as e:
        print(f"Error de conexion: {e}")
        return False
    except Exception as e:
        print(f"Error inesperado: {e}")
        return False

if __name__ == "__main__":
    verificar_conexion()

Solution

  • The root issue is that the data is most likely not valid UTF-8. To check, assume that the offending byte (0xED) is part of a valid UTF-8 sequence and see what that implies. Any byte from 0xE0 to 0xEF can only be the first byte of a 3-byte sequence, and a sequence that starts with 0xED always encodes a codepoint between U+D000 and U+DFFF. That leaves two options.


    Option 1: Valid codepoint from U+D000 to U+D7FF

    This option is unlikely. It would mean that the decoder raised the error by mistake while reading a valid codepoint from the Hangul Syllables or Hangul Jamo Extended-B block. If you are not working with Hangul characters, you can disregard this possibility.

    Option 2: Invalid codepoint from U+D800 to U+DFFF

    These codepoints are reserved for UTF-16 surrogates, and UTF-8 explicitly forbids encoding them. If you are working with supplementary codepoints (U+10000 to U+10FFFF, that is, any character outside the BMP), this strongly suggests that your data went through UTF-16. Such characters can be represented in UTF-8, just not as encoded surrogates. For example, U+1F602 (😂) has a UTF-8 encoding that uses 4 bytes, none of which form a surrogate (F0 9F 98 82), and a UTF-16 encoding that uses a high and low surrogate pair (D83D DE02).
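
The byte arithmetic behind both options can be checked in Python. This is only a minimal sketch: `decode_three_byte` and `is_strict_utf8` are hypothetical helper names introduced here for illustration, and the test bytes are made up rather than taken from the question's data.

```python
def decode_three_byte(seq: bytes) -> int:
    """Combine the payload bits of a 3-byte UTF-8 sequence, without validating it."""
    b0, b1, b2 = seq
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_strict_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Lead byte 0xED with the smallest and largest possible continuation bytes.
lowest = decode_three_byte(bytes([0xED, 0x80, 0x80]))
highest = decode_three_byte(bytes([0xED, 0xBF, 0xBF]))
print(f"0xED covers U+{lowest:04X} .. U+{highest:04X}")

# Option 1: U+D7FF (ED 9F BF) is accepted.
hangul_ok = is_strict_utf8(bytes([0xED, 0x9F, 0xBF]))
# Option 2: U+D800 (ED A0 80) is a surrogate and is rejected.
surrogate_ok = is_strict_utf8(bytes([0xED, 0xA0, 0x80]))
print(hangul_ok, surrogate_ok)

# The emoji from the text, encoded both ways (bytes.hex with a separator needs Python 3.8+).
laughing = "\U0001F602"
utf8_hex = laughing.encode("utf-8").hex(" ")
utf16_hex = laughing.encode("utf-16-be").hex(" ")
print(utf8_hex, "|", utf16_hex)
```

The strict decoder accepts the sequence below the surrogate range and rejects the one inside it, which is the split between Option 1 and Option 2.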


    Therefore, unless you are working with Hangul characters, your data is not UTF-8 encoded. One likely candidate is UTF-16, as stated above. Another is a legacy single-byte encoding such as ISO-8859-1 or Windows-1252: in both of these, ED encodes í, which UTF-8 instead encodes as C3 AD. Mohammed's answer may help you convert your data from whatever encoding it currently uses to UTF-8.
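
To test the single-byte hypothesis, one approach is to look at the bytes around the failing position and try decoding them as Windows-1252. The following is a rough sketch under stated assumptions: `describe_decode_failure` is a hypothetical helper, and `sample` is invented text standing in for whatever bytes actually failed to decode. It relies only on the `start` attribute that Python sets on `UnicodeDecodeError`.

```python
def describe_decode_failure(raw: bytes, context: int = 10) -> str:
    """Report where strict UTF-8 decoding fails and show nearby bytes read as cp1252."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        begin = max(exc.start - context, 0)
        window = raw[begin:exc.start + context]
        # errors="replace" covers the few byte values cp1252 leaves undefined.
        guess = window.decode("cp1252", errors="replace")
        return (f"byte 0x{raw[exc.start]:02x} at position {exc.start}; "
                f"read as cp1252 the surrounding text is {guess!r}")
    return "valid UTF-8"


# Hypothetical input: Spanish text that was written out as Windows-1252.
sample = "error en la línea 79".encode("cp1252")
report = describe_decode_failure(sample)
print(report)

# If the cp1252 reading looks right, re-encode the text as UTF-8.
repaired = sample.decode("cp1252")
utf8_bytes = repaired.encode("utf-8")
print(utf8_bytes)
```

If the cp1252 reading produces sensible text with í where the error occurred, that supports the legacy-encoding explanation. If it produces nonsense, another encoding such as UTF-16 is worth trying instead.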