I've to transfer data from one postgreSQL DB (old) into another postgresSQL DB (new). Old is encoded in win1252. New is encoded in utf-8.
I've already tried different methods ex. pandas.to_sql, sqlalchemy, psycopg2 and so on but failing all the time due to encoding "issues". I've done some researches and the most valid thing looks like an issue on the driver side. As far as I know psycopg2 uses the unicode driver but with my source database version (PostgreSQL 9.4.20 on x86_64) I've to use ANSI to bypass these encoding issues.
I've tested with an ETL tool if it's possible to export the affected table without encoding issues. It was possible without issues. Due to this test I'm pretty sure it's no real encoding issue instead of an driver handling issue.
When I just used a sample to test if loading the data in general works, I already noticed pandas is to slow. I've to load 1.2 mio records. But this runs for ever. Therefore the postgreSQL copy method is may preferred method. From my perspective psycopg2 is using the standard connection string (https://halvar.at/python/odbc_dsn_connection_strings/). But I've to use the ANSI driver.
I tried to pass an SQLAlchemy to thy psycopg2 connector. But this does not work.
stage_engine_string = ("{PostgreSQL ANSI}+psycopg2://" + str(stage_user) + ":" + str(stage_password) + "@" + str(stage_host) + ":" + str(stage_port) + "/" + str(stage_database))
because
conn = psycopg2.connect(**params)
only allows to pass the arguments.
host =
database =
user =
password =
port =
Before I tried the above I tried for ex.
cur.copy_to(open("sql_tmp_export.csv", "w", encoding="utf-8", errors="ignore"), "table", sep=";", columns=("no","description"))
,
conn.decode("win1250").encode("utf8")
and
conn.set_client_encoding("win1250")
but I receive an encoidng issue all the time. Based on the doc of postgres switching between utf8 and win1250 should never be an problem.
On the ETL tool I'd a similar issue but was able to solve it via sending an
"set client_encoding=\"windows-1250\"
after esthablishing the connection to the database.
But if I try this in psycopg2 to
cur.execute("set client_encoding=\"windows-1250\;select * from table")
I stil get the encoding issue.
Any clue if I have an option to pass the driver on builing up a psycopg2 connection? I think this should solve my issue.
My general problem has been fixed now as well. But the solution was strange. If someone stucks on something similar, I simply run the same script twice but first of all with limit and offset.
def any_postrgres_method_to_load_data_from_db:
conn = some_lib.conect(var1, var2)
cur = conn.cursor()
sql_pre_statement = """\
set client_encoding = "Windows-1250"
"""
cur.execute(sql_pre_statement)
sql_statement = """\
select * from n
"""
cur.execute(sql_statement)
df = pandas.read_sql_query(sql, conn)
df.to_csv("sql_tmp_export.csv", index=False)
The script above returned several encoding issues. After running the script slightly adjusted as shown below ones, I was able to run the original one working.
def any_postrgres_method_to_load_data_from_db:
conn = some_lib.conect(var1, var2)
cur = conn.cursor()
sql_pre_statement = """\
set client_encoding = "Windows-1250"
"""
cur.execute(sql_pre_statement)
sql_statement = """\
select * from n offset 500 limit 1000
"""
cur.execute(sql_statement)
df = pandas.read_sql_query(sql, conn)
df.to_csv("sql_tmp_export.csv", index=False)
I can't really explain this. I've just the feeling that there was something strange in the cache on the remote db.