python, oracle, character-encoding, export-to-csv, multibyte-characters

Extracting data from an Oracle DB in Python: facing issues with multibyte characters


I'm trying to take data from an Oracle DB just to get it into CSV files. When I try with the default encoding I get the error: 'utf-8' codec can't decode byte 0xd1 in position 14: invalid continuation byte. The database character set is AL32UTF8 and the national character set is AL16UTF16. Some clients have entered records in Spanish, so some characters such as ñ come back replaced by ?. I've tried changing the encoding to ISO-8859-1, and that works for the first two queries, but the others fail with a partial multibyte character error. When I check the data with a query I also see other ? replacements.
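
Before changing client encodings, it can help to look at the bytes Oracle has actually stored. A minimal diagnostic sketch, using a cursor like the one in the code below (the table and column names are placeholders); DUMP() returns a plain ASCII description of the stored bytes, so it can be fetched even when the value itself won't decode:

# Hypothetical table/column names. DUMP(expr, 1016) prints the stored
# bytes in hex together with the character set name, without the client
# trying to decode the value itself.
cursor.execute(
  "SELECT DUMP(some_column, 1016) FROM some_table WHERE ROWNUM <= 20"
)
for (dumped,) in cursor:
  print(dumped)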

Here is my code (please ignore the credentials and the queries, I just put in placeholder values):

import os
import cx_Oracle
import pandas as pd

os.environ['NLS_LANG'] = 'AMERICAN_AMERICA.AL32UTF8'

connection = None
cursor = None

dir = './csv_files/'
# Placeholder file names and queries (dict keys must be unique).
queries = {
  'file_name_1.csv': "SELECT * FROM table_one",
  'file_name_2.csv': "SELECT * FROM table_two",
}

if not os.path.exists(dir):
  os.makedirs(dir)

try:
  dsn_tns = cx_Oracle.makedsn('171.11.1.111', '1521', sid='ñññ')
  connection = cx_Oracle.connect(user, password, dsn=dsn_tns, encoding='ISO-8859-1')
  cursor = connection.cursor()

  for file_name, query in queries.items():
    try:
      cursor.execute(query)
      rows = cursor.fetchall()
      columns = [col[0] for col in cursor.description]

      dataFrame = pd.DataFrame(rows, columns=columns)

      file_path = os.path.join(dir, file_name)

      dataFrame.to_csv(file_path, index=False, encoding='utf-16', errors='replace')

    except Exception as e:
      print(f"Error in query for {file_name}: {e}")
      continue

except Exception as e:
  print(f"Error in connection: {e}")

finally:
  # Close both resources, not just the cursor.
  if cursor is not None:
    cursor.close()
  if connection is not None:
    connection.close()

How can I handle this issue without deleting the conflicting data from the DB?

I've tried changing the encoding, but none of the encodings I've tried work.


Solution

  • Posting the comment as an answer:

    (1) Check for corrupt data, or query the raw data (for example with DUMP(), as in the sketch under the question). Note that both of these identify data issues in your database that you should fix.

    (2) Use python-oracledb, which replaced cx_Oracle.

    (3) Skip the overhead of pandas and write the data using the csv module; see https://github.com/oracle/python-oracledb/blob/main/samples/write_csv.py and the sketch after this list.

    (4) Tune the arraysize when fetching (see the previous example).
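
Putting (2)–(4) together, here is a minimal sketch of the fetch-and-write loop, in the style of the linked write_csv.py sample. The connection details, query, and file name are placeholders:

import csv
import oracledb

# Placeholder credentials and DSN. python-oracledb's default "thin" mode
# needs no Oracle client libraries and always uses UTF-8.
connection = oracledb.connect(user="user", password="password",
                              dsn="171.11.1.111:1521/service_name")
cursor = connection.cursor()

# (4) A larger arraysize means fewer network round trips on big result sets.
cursor.arraysize = 1000

cursor.execute("SELECT * FROM table_one")  # placeholder query

with open("./csv_files/file_name_1.csv", "w", newline="",
          encoding="utf-8") as f:
  writer = csv.writer(f)
  writer.writerow([col[0] for col in cursor.description])  # header row
  while True:
    rows = cursor.fetchmany()  # fetches cursor.arraysize rows per call
    if not rows:
      break
    writer.writerows(rows)

cursor.close()
connection.close()

Note that no client-side encoding setting will repair bytes that were stored incorrectly, which is why point (1), fixing the corrupt rows in the database, comes first.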