python, sqlalchemy, batch-processing, execute, chunking

SQLAlchemy isn't batching rows / using a server-side cursor via `yield_per`


Following the documentation and the code snippet from https://docs.sqlalchemy.org/en/14/core/connections.html#streaming-with-a-fixed-buffer-via-yield-per (posted directly below), my query is not being batched into chunks of 50_000 rows.

    from sqlalchemy import text

    with engine.connect() as conn:
        result = conn.execution_options(yield_per=100).execute(text("select * from table"))

        for partition in result.partitions():
            # partition is an iterable that will be at most 100 items
            for row in partition:
                print(f"{row}")
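
For what it's worth, the docs describe yield_per as shorthand for enabling the stream_results execution option and then calling Result.yield_per() on the result. A minimal sketch of that equivalent spelling (assuming the same engine) looks like:

    from sqlalchemy import text

    with engine.connect() as conn:
        # stream_results asks the driver for a server-side cursor where supported
        result = conn.execution_options(stream_results=True).execute(text("select * from table"))
        result.yield_per(100)  # sets the buffer/partition size to 100 rows
        for partition in result.partitions():
            for row in partition:
                print(row)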

Here I am writing the results of a SQL query to a CSV file:

    import csv
    import tqdm

    with engine.connect() as conn:  # connect to the database
        with open(csv_path, mode='w', newline='') as f:

            c = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
            c.writerow(columns)  # write column headers

            # execute a query that returns 110k rows, expecting batches of 50k rows
            result = conn.execution_options(yield_per=50_000).execute(sql_string)
            for partition in result.partitions():
                for row in tqdm.tqdm(partition):  # each partition has only 1 row, when I expect 50k
                    c.writerow(row)

This sql_string returns 110k rows, so I expect 3 iterations of result.partitions(); however, I am seeing 110k iterations. I fear this is equivalent to DDoS-ing my own SQL Server database.

I've also tried doing this without partitions(), and the same thing happens: no batching.

    with engine.connect() as conn:
        with open(csv_path, mode='w', newline='') as f:
            c = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
            c.writerow(columns)
            for row in tqdm.tqdm(conn.execution_options(yield_per=50_000).execute(sql_string)): # I expect 3 iterations, but I get 110k.
                c.writerow(row)

The server seems to handle this OK (I get about 22k iterations/second), but I still wonder: am I DDoS-ing myself?

I am using SQLAlchemy 1.4.42 (pip install --upgrade 'sqlalchemy==1.4.42') and Microsoft SQL Server 2019 - 15.0.X (X64).
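
As a rough sanity check on whether streaming is even possible here, the dialect exposes a supports_server_side_cursors flag (an internal attribute, so treat this as a hint rather than a guarantee):

    # if this prints False, stream_results/yield_per can't use a
    # server-side cursor and the driver may buffer all rows client-side
    print(engine.dialect.supports_server_side_cursors)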


Solution

  • The yield_per argument was having no effect in execution_options(). In the example snippets I posted in my question, fetchone() gets called N times, where N is the query result's row count. I discovered these fetchone() calls by reading the traceback after hitting Ctrl-C during an execution (see the profiling sketch at the end of this answer for a less disruptive way to count them). That's a lot of server calls.

    Instead of concerning myself with yield_per, I was able to achieve batching by calling fetchmany(batch_size) on the result until it was exhausted.

        import csv

        with engine.connect() as conn:
            print('Connected!...')
            with open(csv_path, mode='w', newline='', encoding='utf-8') as f:
                c = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
                c.writerow(columns)
                result = conn.execute(sql_string)
                counter = 1
                while True:
                    print(f'Querying batch {counter}...')
                    rows = result.fetchmany(50_000)  # batch size
                    print(f'Contains {len(rows)} rows...')
                    if not rows:
                        break
                    print(f'Writing batch {counter} to CSV...')
                    for row in rows:
                        c.writerow(row)
                    counter += 1
                result.close()
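
    As a variation on the same idea, the fetchmany() loop can be wrapped in a small generator helper (iter_batches is a hypothetical name, not a SQLAlchemy API), which also lets csv.writer.writerows() write each batch in one call:

        import csv

        def iter_batches(result, size=50_000):
            # yield lists of up to `size` rows until the result is exhausted
            while True:
                rows = result.fetchmany(size)
                if not rows:
                    return
                yield rows

        with engine.connect() as conn:
            with open(csv_path, mode='w', newline='', encoding='utf-8') as f:
                c = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
                c.writerow(columns)
                result = conn.execute(sql_string)
                for batch in iter_batches(result):
                    c.writerows(batch)  # write a whole batch at once
                result.close()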
    

    I don't understand the reason for yield_per in SQLAlchemy 1.4.42, or why the SQLAlchemy docs don't mention using fetchmany() instead. Oh well.
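
    For anyone who wants to confirm what their driver is doing without a Ctrl-C traceback, here is a minimal sketch using Python's built-in cProfile to count the DBAPI fetch calls. It assumes the same engine and sql_string as above; whether fetchone shows up in the stats depends on the driver:

        import cProfile
        import pstats

        def read_all():
            # consume the whole result the same way as the original loop
            with engine.connect() as conn:
                result = conn.execution_options(yield_per=50_000).execute(sql_string)
                for partition in result.partitions():
                    for row in partition:
                        pass

        profiler = cProfile.Profile()
        profiler.runcall(read_all)

        # C-level cursor methods appear as built-in calls; the ncalls column
        # shows how many times fetchone()/fetchmany() were actually invoked
        pstats.Stats(profiler).print_stats('fetch')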