When using df = connectorx.read_sql(conn=cx_con, query=f"SELECT * FROM table;"), I occasionally get a DataFrame back with the correct columns but every row all zeros, or the process crashes with the message Process finished with exit code -1073741819 (0xC0000005). This only happens when the table is being updated at the same time with df.to_sql("table", con=con_in, if_exists="append").
My program reads a table from a local database that I am continuously updating from a concurrently running program. The issue does not occur when I read the table using pandas.read_sql_query() (which is far slower). This suggests that connectorx mishandles concurrent read/write traffic on the table in a way the pandas read does not. Is this a bug in connectorx, or is there something I can do to prevent it?
I'm using PyCharm 2022.2.1, Windows 11, and Python 3.10.
ConnectorX is designed more for OLAP scenarios, where the data is static and queries are read-only. The zero rows / crashes are caused by inconsistency between the multiple queries ConnectorX issues. To achieve maximum speed-up, ConnectorX issues queries to fetch metadata before fetching the real query result, including:
- A limit 1 query to get the result schema (column names and types)
- A count query to get the number of rows in the result
- A min/max query to get the minimum and maximum values of the partition column (if partitioning is enabled)

The first two are used to pre-allocate the destination pandas.DataFrame in advance (steps 1 and 2 in the workflow sketched below). Having the DataFrame from the beginning lets ConnectorX stream the result values directly into the final destination, avoiding extra data copies and result concatenation (steps 4-6, done in a streaming fashion).
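For intuition, here is a minimal Python sketch of that workflow against SQLite (a conceptual illustration only; the real ConnectorX implementation is in Rust and differs in detail, and read_sql_sketch is a hypothetical name):

import sqlite3
import pandas as pd

def read_sql_sketch(db_path, query):
    con = sqlite3.connect(db_path)
    cur = con.cursor()
    # Step 1: limit-1 query to discover the column names (ConnectorX also
    # infers the column types here).
    cur.execute(f"SELECT * FROM ({query}) LIMIT 1")
    columns = [d[0] for d in cur.description]
    # Step 2: count query to learn how many rows to pre-allocate.
    cur.execute(f"SELECT COUNT(*) FROM ({query})")
    n_rows = cur.fetchone()[0]
    # Step 3: pre-allocate the destination, filled with zeros -- this is
    # where the "all rows are zeros" symptom comes from.
    df = pd.DataFrame(0, index=range(n_rows), columns=columns, dtype=object)
    # Steps 4-6: stream the real result directly into the pre-allocated
    # buffers, avoiding intermediate copies and concatenation.
    cur.execute(query)
    for i, row in enumerate(cur.fetchall()):
        df.iloc[i] = row  # raises if the table grew after the count query
    con.close()
    return df  # trailing rows stay zero if fewer rows arrived than counted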
If the data is being updated concurrently, the result X of the count query may differ from the real number of rows Y that the query returns. In that case, the program may crash (if X < Y) or return some rows that are all zeros (if X > Y).
One way to avoid the count query is to set return_type to arrow2 to get the result in Arrow format first. Since an Arrow table consists of multiple record batches, ConnectorX can allocate memory on demand without issuing the count query. After getting the Arrow result, you can convert it to pandas using the efficient to_pandas API provided by pyarrow. Here is an example:
import connectorx as cx

# Fetch as Arrow record batches, so memory is allocated on demand and no
# upfront count query is needed.
table = cx.read_sql(conn, query, return_type="arrow2")

# Convert to pandas with pyarrow's to_pandas.
df = table.to_pandas(split_blocks=False, date_as_object=False)
However, one thing to note: if you are using partitioning, the result may still be incorrect due to inconsistency between the min/max query and the multiple partitioned queries.
If your data is append-only in a structured way, for example via a monotonically increasing ID column, you can add a predicate like ID <= current max ID so that concurrently appended rows are filtered out of both the count query and the fetch query. If you are using partitioning, you can also partition on this ID column so that the result stays consistent. A sketch of this approach follows.
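For example (a sketch assuming your table is named table and has a monotonically increasing id column; adjust the names to your schema):

import connectorx as cx

# Pin the snapshot first: rows appended after this point have a larger id
# and are excluded from both the internal count query and the actual fetch.
max_id = cx.read_sql(cx_con, "SELECT MAX(id) AS max_id FROM table").iloc[0, 0]

df = cx.read_sql(
    cx_con,
    f"SELECT * FROM table WHERE id <= {max_id}",
    partition_on="id",  # partitioning on the same column keeps the
    partition_num=4,    # min/max partition queries consistent as well
)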