pythondataframesqlitepython-polarspolars

When reading a database table with polars, how do I avoid a SchemaError?


I have a large table_to_load in a database file my_database.db that I am trying to read into a Python program as a polars DataFrame. Here is the code that does the reading:

import polars as pl

conn = sqlite3.connect('my_database.db')
df = pl.read_database(connection=conn, query='SELECT * FROM table_to_load', infer_schema_length=None)
conn.close()

When I run this code, pl.read_database throws the error, "polars.exceptions.SchemaError: failed to determine supertype of i64 and binary." The traceback is at the end of the post. I have several closely related questions about this:

  1. Each column of table_to_load verifiably already has one of the three sqlite3 datatypes integer, real, and text. Can I instruct polars to just use a reasonable analog of each datatype? For example, i64 seems like a reasonable analog of integer.
  2. Why does polars struggle to determine the datatype if it is reading the whole column as instructed by the parameter setting infer_schema_length=None, and is there any way to resolve this? (It looks like there is also a schema_overrides parameter, but this would be challenging to use in my application because my table columns are not static.)
  3. In the traceback, why does polars not reveal the column or columns that are generating the SchemaError, and is there a way to request that it do so? In my case, this would be helpful because table_to_load has hundreds of columns.

The traceback of the SchemaError:

Traceback (most recent call last):
  File "c:\Users\user0\AppData\Local\Programs\Python\Python312\test_module.py", line 115, in <module>
    main_function()
  File "c:\Users\user0\AppData\Local\Programs\Python\Python312\test_module.py", line 54, in main_function
    df = pl.read_database(connection=conn, query='SELECT * FROM table_to_load', infer_schema_length=None)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user0\AppData\Local\Programs\Python\Python312\Lib\site-packages\polars\io\database\functions.py", line 251, in read_database  
    ).to_polars(
      ^^^^^^^^^^
  File "C:\Users\user0\AppData\Local\Programs\Python\Python312\Lib\site-packages\polars\io\database\_executor.py", line 543, in to_polars      
    frame = frame_init(
            ^^^^^^^^^^^
  File "C:\Users\user0\AppData\Local\Programs\Python\Python312\Lib\site-packages\polars\io\database\_executor.py", line 300, in _from_rows     
    return frames if iter_batches else next(frames)  # type: ignore[arg-type]
                                       ^^^^^^^^^^^^
  File "C:\Users\user0\AppData\Local\Programs\Python\Python312\Lib\site-packages\polars\io\database\_executor.py", line 283, in <genexpr>
    DataFrame(
  File "C:\Users\user0\AppData\Local\Programs\Python\Python312\Lib\site-packages\polars\dataframe\frame.py", line 377, in __init__
    self._df = sequence_to_pydf(
               ^^^^^^^^^^^^^^^^^
  File "C:\Users\user0\AppData\Local\Programs\Python\Python312\Lib\site-packages\polars\_utils\construction\dataframe.py", line 461, in sequence_to_pydf
    return _sequence_to_pydf_dispatcher(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user0\AppData\Local\Programs\Python\Python312\Lib\functools.py", line 909, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user0\AppData\Local\Programs\Python\Python312\Lib\site-packages\polars\_utils\construction\dataframe.py", line 674, in _sequence_of_tuple_to_pydf
    return _sequence_of_sequence_to_pydf(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user0\AppData\Local\Programs\Python\Python312\Lib\site-packages\polars\_utils\construction\dataframe.py", line 590, in _sequence_of_sequence_to_pydf
    pydf = PyDataFrame.from_rows(
           ^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.SchemaError: failed to determine supertype of i64 and binary


  [1]: https://www.sqlite.org/datatype3.html

Solution

  • As was pointed out in comments, the issue is that sqlite is not strict on types so a column can be of mixed types. In contrast, polars is strict so if you have a mixed type for which there isn't a defined supertype then it simply won't work without a cast on the sqlite end. In my test if I use read_database_uri instead of read_database then the error message does report the failing column. With that, you can put the query in a loop of trys such that if it fails it'll change the query to explicitly cast the failed column.

    Here's the function:

    def retry_qry(qry: str, conn_str: str, fallback_sqlite_type="text") -> pl.DataFrame:
        casted_cols: set[str] | None = None
        col_order: list[str] | None = None
        after_from: str | None = None
        table: str | None = None
        REerr = None
        while True:
            try:
                df = pl.read_database_uri(qry, conn_str)
                if casted_cols is not None:
                    import warnings
    
                    msg = f"Your query had type errors and was changed to\n{qry}"
                    warnings.warn(msg)
                return df
            except Exception as err:
                import re
                import sqlite3
    
                if REerr is None:
                    REerr = re.compile(r"(?<=name:\s)\w+")
                recheck = REerr.search(str(err))
                if recheck is None:
                    raise
                cast_col = recheck.group()
                if (
                    casted_cols is None
                    or col_order is None
                    or table is None
                    or after_from is None
                ):
                    after_select = qry.lower().split("select", maxsplit=1)[1]
                    before_from, after_from = after_select.split("from", maxsplit=1)
                    cols = before_from.strip()
                    table = after_from.strip().replace("\n"," ").split(" ")[0]
                    if cols == "*":
                        import sqlite3
    
                        with sqlite3.connect(conn_str.replace("sqlite:///", "")) as cn:
                            cur = cn.cursor()
                            cur.execute(f"PRAGMA table_info({table})")
                            p = cur.fetchall()
                            col_order = [x[1] for x in p]
                            casted_cols = set()
                    else:
                        # if you've got functions like sum(your_col) or aliases then 
                        # this won't work. You could check against the table definition
                        # for substrings but that's a different issue
                        col_order = [x.strip() for x in cols.split(",")]
                        casted_cols = set()
                if cast_col in casted_cols or cast_col not in col_order:
                    raise
                casted_cols.add(cast_col)
                new_cols_sql = ", ".join(
                    [
                        f"cast({x} as {fallback_sqlite_type}) as {x}"
                        if x in casted_cols
                        else x
                        for x in col_order
                    ]
                )
                qry = f"select {new_cols_sql} from {after_from}"```
    

    Note again that this function uses read_database_uri which takes a connection string instead of a connection object. Also, note that this is not meant to be bullet proof as it doesn't do real sql parsing. It'll fail for anything but the most simple queries. Here's an example of it in action.

    import sqlite3
    import os
    import polars as pl
    
    db_file="mydb.db"
    conn = sqlite3.connect(db_file)
    cur = conn.cursor()
    
    cur.execute("""
        CREATE TABLE IF NOT EXISTS users (
            name TEXT,
            age INTEGER
        )
    """)
    
    cur.executemany(
        "INSERT INTO users (name, age) VALUES (?, ?)",
        [("Alice", 30), ("Bob", 25), ("Charlie", b"jfjf")],
    )
    
    conn.commit()
    conn.close()
    
    
    conn_str = f"sqlite:///{os.path.abspath(db_file)}"
    
    df = pl.read_database_uri(qry, conn_str) # this raises, as we expect
    # RuntimeError: Invalid column type Blob at index: 2, name: age
    
    
    df = retry_qry(qry, conn_str) # this gives us a warning but we have a df returned.
    # UserWarning: Your query had type errors and was changed to
    # select id,name,cast(age as text) as age from  users