python, oracle, character-encoding, export-to-csv, multibyte-characters

Extracting data from an Oracle DB in Python: facing issues with multibyte characters


I'm trying to take data from an Oracle DB just to get it into CSV files. When I try with the default encoding I get the error: 'utf-8' codec can't decode byte 0xd1 in position 14: invalid continuation byte. The database character set is AL32UTF8 and the national character set is AL16UTF16. Some clients have entered records in Spanish, so some characters such as ñ come back replaced by ?. I've tried changing the encoding to ISO-8859-1, and that works for the first two queries, but the others fail with a partial multibyte character error. When I check the data with a query I also see other ? replacements.
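
Before changing client encodings, it can help to look at the bytes Oracle has actually stored. A minimal diagnostic sketch, using a cursor like the one in the code below (the table and column names are placeholders); DUMP() returns a plain ASCII description of the stored bytes, so it can be fetched even when the value itself won't decode:

# Hypothetical table/column names. DUMP(expr, 1016) prints the stored
# bytes in hex together with the character set name, without the client
# trying to decode the value itself.
cursor.execute(
  "SELECT DUMP(some_column, 1016) FROM some_table WHERE ROWNUM <= 20"
)
for (dumped,) in cursor:
  print(dumped)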

Here is my code (please ignore the credentials and the queries, I just put in placeholder values):

import os
import cx_Oracle
import pandas as pd

os.environ['NLS_LANG'] = 'AMERICAN_AMERICA.AL32UTF8'

connection = None
cursor = None

dir = './csv_files/'
# Placeholder file names and queries (dict keys must be unique).
queries = {
  'file_name_1.csv': "SELECT * FROM table_one",
  'file_name_2.csv': "SELECT * FROM table_two",
}

if not os.path.exists(dir):
  os.makedirs(dir)

try:
  dsn_tns = cx_Oracle.makedsn('171.11.1.111', '1521', sid='ñññ')
  connection = cx_Oracle.connect(user, password, dsn=dsn_tns, encoding='ISO-8859-1')
  cursor = connection.cursor()

  for file_name, query in queries.items():
    try:
      cursor.execute(query)
      rows = cursor.fetchall()
      columns = [col[0] for col in cursor.description]

      dataFrame = pd.DataFrame(rows, columns=columns)

      file_path = os.path.join(dir, file_name)

      dataFrame.to_csv(file_path, index=False, encoding='utf-16', errors='replace')

    except Exception as e:
      print(f"Error in query for {file_name}: {e}")
      continue

except Exception as e:
  print(f"Error in connection: {e}")

finally:
  # Close both resources, not just the cursor.
  if cursor is not None:
    cursor.close()
  if connection is not None:
    connection.close()

How can I handle this issue without deleting the conflicting data from the DB?

I've tried changing the encoding, but none of the encodings I've tried work.


Solution

  • Posting the comment as an answer:

    (1) Check for corrupt data, or query the raw data (for example with DUMP(), as in the sketch under the question). Note that both of these identify data issues in your database that you should fix.

    (2) Use python-oracledb, which replaced cx_Oracle.

    (3) Skip the overhead of pandas and write the data using the csv module; see https://github.com/oracle/python-oracledb/blob/main/samples/write_csv.py and the sketch after this list.

    (4) Tune the arraysize when fetching (see the previous example).
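
Putting (2)–(4) together, here is a minimal sketch of the fetch-and-write loop, in the style of the linked write_csv.py sample. The connection details, query, and file name are placeholders:

import csv
import oracledb

# Placeholder credentials and DSN. python-oracledb's default "thin" mode
# needs no Oracle client libraries and always uses UTF-8.
connection = oracledb.connect(user="user", password="password",
                              dsn="171.11.1.111:1521/service_name")
cursor = connection.cursor()

# (4) A larger arraysize means fewer network round trips on big result sets.
cursor.arraysize = 1000

cursor.execute("SELECT * FROM table_one")  # placeholder query

with open("./csv_files/file_name_1.csv", "w", newline="",
          encoding="utf-8") as f:
  writer = csv.writer(f)
  writer.writerow([col[0] for col in cursor.description])  # header row
  while True:
    rows = cursor.fetchmany()  # fetches cursor.arraysize rows per call
    if not rows:
      break
    writer.writerows(rows)

cursor.close()
connection.close()

Note that no client-side encoding setting will repair bytes that were stored incorrectly, which is why point (1), fixing the corrupt rows in the database, comes first.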