
ConnectorX - cx.read_sql() intermittently returns a DataFrame of all-zero rows or crashes when the table is being concurrently updated


When using df = connectorx.read_sql(conn=cx_con, query="SELECT * FROM table;"), I occasionally get a DataFrame back with the correct columns but every row filled with zeros, or the process crashes with the message Process finished with exit code -1073741819 (0xC0000005). This only happens when the table is concurrently being updated with df.to_sql("table", con=con_in, if_exists="append").

My program reads a table from a local database that is continuously updated by a concurrently running program. The issue does not occur when I read the table with pandas.read_sql_query() (which is far slower). This suggests that connectorx handles concurrent read/write access to the table differently than the pandas read does. Is this a bug in connectorx, or is there something I can do to prevent it?
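
A minimal sketch of the two programs (hypothetical names and connection strings; my_table stands in for my real table):

    # writer program -- continuously appends rows
    import pandas as pd
    from sqlalchemy import create_engine

    con_in = create_engine("sqlite:///local.db")
    while True:
        pd.DataFrame({"value": [1.0, 2.0, 3.0]}).to_sql(
            "my_table", con=con_in, if_exists="append", index=False
        )

    # reader program -- polls the same table with ConnectorX
    import connectorx as cx

    cx_con = "sqlite:///local.db"
    df = cx.read_sql(conn=cx_con, query="SELECT * FROM my_table;")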

I'm using PyCharm 2022.2.1, Windows 11, and Python 3.10.


Solution

  • Reason for the error

    ConnectorX is designed more for OLAP scenarios, where the data is static and queries are read-only. The zero rows and the crashes are both caused by inconsistency between the multiple queries ConnectorX issues. To achieve maximum speed-up, ConnectorX issues queries to fetch metadata before fetching the real query result, including (sketched after the list below):

    1. a LIMIT 1 query to get the result schema (column types and names)
    2. a COUNT query to get the number of rows in the result
    3. a MIN/MAX query to get the minimum and maximum values of the partition column (if partitioning is enabled)
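
    Conceptually, the derived metadata queries look roughly like this (a sketch only; the exact SQL ConnectorX generates varies by database and is an assumption here):

    # Hypothetical illustration of the metadata queries derived from a user
    # query; my_table stands in for the question's table.
    query = "SELECT * FROM my_table"

    schema_query = f"SELECT * FROM ({query}) AS t LIMIT 1"         # 1. result schema
    count_query = f"SELECT COUNT(*) FROM ({query}) AS t"           # 2. row count
    minmax_query = f"SELECT MIN(id), MAX(id) FROM ({query}) AS t"  # 3. partition bounds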

    The first two are used to pre-allocate the destination pandas.DataFrame in advance (steps 1 and 2 in the example workflow below). Allocating the destination up front lets ConnectorX stream the result values directly into it, avoiding extra data copies and result concatenation (steps 4-6 below, done in a streaming fashion).

    [Figure: workflow of ConnectorX]

    If the data is being updated concurrently, the result of the count query, X, may differ from the actual number of rows the real query returns, Y. In that case the program may crash (if X < Y, because more rows arrive than were allocated) or return rows of all zeros (if X > Y, because the pre-allocated rows past Y are never written).
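
    A toy model of the failure mode (an illustration only, not ConnectorX internals, which write through native buffers rather than NumPy):

    import numpy as np

    # Pre-allocate the destination from the COUNT result (X), zero-initialized,
    # then stream the Y rows the real query returns into it.
    def fetch_into_preallocated(count_x, actual_rows_y):
        buffer = np.zeros(count_x)
        for i in range(actual_rows_y):
            buffer[i] = i + 1.0  # raises IndexError when X < Y; ConnectorX
                                 # instead writes out of bounds (0xC0000005)
        return buffer            # trailing rows stay zero when X > Y

    print(fetch_into_preallocated(5, 3))  # [1. 2. 3. 0. 0.] -> "all zeros" rows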

    Possible workarounds

    Avoid the COUNT query by returning Arrow

    One possible way to avoid the count query is to set return_type to arrow2, getting the result in Arrow format first. Since an Arrow table consists of multiple record batches, ConnectorX can allocate memory on demand without issuing the count query. After getting the Arrow result, you can convert it to pandas with the efficient to_pandas API provided by pyarrow. Here is an example:

    import connectorx as cx

    # Fetch as Arrow so ConnectorX can allocate record batches on demand
    # instead of pre-allocating from an up-front COUNT query.
    table = cx.read_sql(conn, query, return_type="arrow2")
    # Convert the pyarrow.Table to a pandas.DataFrame.
    df = table.to_pandas(split_blocks=False, date_as_object=False)
    

    However, note that if you are using partitioning, the result might still be incorrect due to inconsistency between the min/max query and the multiple partitioned queries.

    Add predicates for consistency

    If your data is append-only in an ordered way, for example with a monotonically increasing ID column, you can add a predicate like ID <= current max ID so that concurrently appended rows are filtered out of both the count query and the real fetch. If you are using partitioning, you can also partition on this ID column so that the partitioned results stay consistent.
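
    A sketch of this approach (assuming a monotonically increasing integer column named id; my_table and the connection string cx_con stand in for your own):

    import connectorx as cx

    # Pin a snapshot: rows appended after this point are excluded everywhere.
    max_id = cx.read_sql(cx_con, "SELECT MAX(id) AS max_id FROM my_table")["max_id"].iloc[0]

    # The same predicate filters the COUNT query and the real fetch alike.
    df = cx.read_sql(
        cx_con,
        f"SELECT * FROM my_table WHERE id <= {max_id}",
        partition_on="id",  # partitioning on the pinned column keeps the
        partition_num=4,    # min/max and per-partition queries consistent
    )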