python-3.xpandascratedb

How to raise an exception when bulk inserting faulty rows into crateDB using Python Pandas to_sql()


I am bulk inserting data into crateDB using

pandas.to_sql(tableId, 'crate://xxxxxxx:4200', if_exists='append', index=False, chunksize=20000)

The data contains a few faulty rows where one column of type 'numeric' contains a value of type 'object' / 'string'.

This results in a typecasting error while inserting.

When I set the chunksize parameter of to_sql() to 1, an exception is raised and I can catch the issue.

When chunksize is > 0, to_sql() just continues and I cannot make sure that data was inserted properly.

What am I missing, and what would be a good approach to fix this problem?


Solution

  • The problem most likely is that pandas doesn't return anything on to_sql() also see https://github.com/pandas-dev/pandas/issues/23998 and https://github.com/pandas-dev/pandas/issues/42645 . Since the bulk_insert in CrateDB doesn't completely fail, but only skips the faulty rows, no exception is raised, but only an array with a value per row is returned. That return however never is processed in pandas :/

    One could change the SQL alchemy implementation of CrateDB, but this would in return break any implementation that actually uses the results array.