pandasamazon-redshiftpg8000

How to avoid byte quoted column headers when reading redshift data with pg8000


I'm loading some redshift data in pandas via pg8000:

>>>> df = pd.read_sql(sql_text, pg8000_cnx)

pg8000_cnx is a connection object to a redshift database. I always get columns that are byte quoted, ex:

>>> df.columns
Index([b'column_',
       b'column_2',
       ...
       b'column_n'],
       dtype='object', length=142)

I know I can use .decode('utf-8') and some list comprehension to convert the columns back to string, but I was trying to have them loaded in string format from redshift itself. Tried playing with different CLIENT_ENCODING settings with the pg8000 cursor, but still get byte quoted columns in the pandas df.

import pg8000
conn = pg8000.connect(**db_details)
cursor = conn.cursor()
cursor.execute("SET CLIENT_ENCODING TO 'UTF8'")
cursor.execute(sql)
cursor.fetchone()
# load df here

Is there some redshift or pg8000 option I am missing that would load the column headers in utf-8 decoded strings from the db connection object itself?


Solution

  • This is fixed in pg8000 1.16.6, so upgrading to the latest version should solve the problem.

    So from verion 1.16.6 pg8000 returns column names as str rather than bytes.