Tags: python, pandas, sqlalchemy, pyodbc, 4d-database

Pandas read_sql with pyodbc to handle corrupt data


I am working with a client that has a 4D database. Tableau won't connect to it. (That's a whole other problem, and if you know the answer to that, let me know.) What we've decided to do is essentially keep two copies of the data. I am building a tool in Python that will take any arbitrary table from their database and store a copy of it in a MySQL database. It will then run periodically and update the MySQL copy as new data is added.

I would prefer to use SQLAlchemy, but it does not support 4D, so I'm using pyodbc with pandas. To read the data I'm using

data_chunks = pandas.read_sql("SELECT * FROM table_name", con=pyodbc_connection, chunksize=100000)

Then I turn around and use

chunk_df.to_sql("table_name", con=sqlalchemy_mysql_connection, index=False, if_exists="append")

to write it to the MySQL database.
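
Put together, the copy loop looks something like the sketch below. The connection_string and mysql_url values are placeholders standing in for my actual connection details:

    import pandas
    import pyodbc
    import sqlalchemy
    
    connection_string = "DSN=4D_database"               # placeholder
    mysql_url = "mysql+pymysql://user:pw@host/db_name"  # placeholder
    
    pyodbc_connection = pyodbc.connect(connection_string)
    sqlalchemy_mysql_connection = sqlalchemy.create_engine(mysql_url)
    
    # stream the source table in chunks and append each chunk to MySQL
    data_chunks = pandas.read_sql("SELECT * FROM table_name",
                                  con=pyodbc_connection, chunksize=100000)
    for chunk_df in data_chunks:
        chunk_df.to_sql("table_name", con=sqlalchemy_mysql_connection,
                        index=False, if_exists="append")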

Unfortunately, some of the tables I'm reading contain corrupt data, and I get a ValueError saying that the year xxxxx is out of range.

The last function called in the traceback was data = cursor.fetchmany(chunksize), which I believe is from pyodbc.

How can I read data from an arbitrary table, handle the corrupt data gracefully, and continue on?


Solution

  • You could conceivably use a pyodbc Output Converter function to intercept the corrupted date values and "fix" them using code similar to this:

    import struct
    import datetime
    
    def unpack_sql_type_timestamp(raw_bytes):
        # raw_bytes is the 16-byte ODBC SQL_TIMESTAMP_STRUCT for the column
        y, m, d, h, n, s, f = struct.unpack("<h5HI", raw_bytes)
        # clamp corrupt years into datetime's valid range (1..9999)
        if y > 9999:
            y = 9999
        elif y < 1:
            y = 1
        # the fraction field is nanoseconds; datetime wants microseconds
        return datetime.datetime(y, m, d, h, n, s, f // 1000)
    
    pyodbc_connection = pyodbc.connect(connection_string)
    
    pyodbc_connection.add_output_converter(
        pyodbc.SQL_TYPE_TIMESTAMP, 
        unpack_sql_type_timestamp
    )
    
    data_chunks = pandas.read_sql_query(
        "SELECT * FROM table_name", 
        con=pyodbc_connection, 
        chunksize=100000
    )
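
    The converter runs on the raw column bytes before pyodbc attempts its own SQL_TYPE_TIMESTAMP conversion, so the out-of-range year never reaches the code that raises the ValueError; corrupt rows simply come through clamped to the sentinel years 1 or 9999, which you can filter or flag after loading. To sanity-check the converter before pointing it at a real table, you can feed it a hand-packed SQL_TIMESTAMP_STRUCT (this little test is my own addition, not part of the original answer):

    # pack a 16-byte SQL_TIMESTAMP_STRUCT with an impossible year
    raw = struct.pack("<h5HI", 20214, 7, 15, 12, 30, 0, 0)
    print(unpack_sql_type_timestamp(raw))  # -> 9999-07-15 12:30:00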