Following the documentation and the code snippet from https://docs.sqlalchemy.org/en/14/core/connections.html#streaming-with-a-fixed-buffer-via-yield-per (posted directly below), my query is not being batched into 50,000-row chunks.
with engine.connect() as conn:
    result = conn.execution_options(yield_per=100).execute(text("select * from table"))

    for partition in result.partitions():
        # partition is an iterable that will be at most 100 items
        for row in partition:
            print(f"{row}")
Here I am writing the results of a SQL query to a CSV file:
with engine.connect() as conn:  # connect to database
    with open(csv_path, mode='w', newline='') as f:
        c = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
        c.writerow(columns)  # write column headers
        result = conn.execution_options(yield_per=50_000).execute(sql_string)  # execute query that results in 110k rows, expecting batches of 50k rows
        for partition in result.partitions():
            for row in tqdm.tqdm(partition):  # each partition only has 1 row, when I expect 50k rows
                c.writerow(row)
This sql_string returns 110k rows, so I expect 3 iterations of result.partitions(); instead I am seeing 110k iterations. I fear this is equivalent to DDoS-ing my own SQL Server database.
I've also tried doing this without partitions(), and the same thing happens: no batching.
with engine.connect() as conn:
    with open(csv_path, mode='w', newline='') as f:
        c = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
        c.writerow(columns)
        for row in tqdm.tqdm(conn.execution_options(yield_per=50_000).execute(sql_string)):  # I expect 3 iterations, but I get 110k.
            c.writerow(row)
The server seems to handle this OK (I get about 22k iterations per second), but I still wonder: am I DDoS-ing myself?
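For context on what I think yield_per is supposed to do: if I'm reading the 1.4 docs right, yield_per on a Core connection is shorthand for the stream_results and max_row_buffer execution options, so I believe the spelling below should be equivalent. I'm not sure whether the SQL Server DBAPI actually honours it; conn and sql_string are the same objects as in the snippets above.

# Presumably equivalent to yield_per=50_000, per my reading of the 1.4 docs:
# stream_results asks for a streaming (server-side) cursor, max_row_buffer
# caps the client-side buffer.
result = conn.execution_options(
    stream_results=True,
    max_row_buffer=50_000,
).execute(sql_string)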
I am using SQLAlchemy 1.4.42 (pip install --upgrade 'sqlalchemy==1.4.42') and Microsoft SQL Server 2019 - 15.0.X (X64).
The yield_per argument was having no effect in execution_options(). In the example snippets I posted in my question, fetchone() gets called N times, where N is the query result's row count. I discovered these fetchone() calls from the traceback produced by hitting Ctrl-C during an execution. That's a lot of server calls.
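Roughly how I caught it (a minimal reconstruction of that debugging step, not my exact session; result and c are the same objects as in the snippets above):

import traceback

try:
    for row in tqdm.tqdm(result):
        c.writerow(row)
except KeyboardInterrupt:
    # Hitting Ctrl-C mid-loop prints a stack that bottoms out in the
    # DBAPI cursor's fetchone(), i.e. rows are being pulled one at a time.
    traceback.print_exc()
    raise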
Instead of concerning myself with yield_per, I was able to achieve pagination by using fetchmany(batch_size).
with engine.connect() as conn:
    print('Connected!...')
    with open(csv_path, mode='w', newline='', encoding='utf-8') as f:
        c = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
        c.writerow(columns)
        result = conn.execute(sql_string)
        counter = 1
        while True:
            print(f'Querying batch {counter}...')
            rows = result.fetchmany(50_000)  # batch size
            print(f'Contains {len(rows)} rows...')
            if not rows:
                break
            print(f'Writing batch {counter} to CSV...')
            for row in rows:
                c.writerow(row)
            counter += 1
        result.close()
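For what it's worth, the same fetchmany() loop can be pulled out into a small generator so the CSV-writing code stays flat. iter_batches is just a helper name I'm making up here, not anything from SQLAlchemy:

def iter_batches(result, batch_size=50_000):
    """Yield lists of rows from a result by calling fetchmany() until it is exhausted."""
    while True:
        rows = result.fetchmany(batch_size)
        if not rows:
            return
        yield rows

# usage with the same conn / sql_string / c as above:
# for batch in iter_batches(conn.execute(sql_string)):
#     c.writerows(batch)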
I don't understand the reason for yield_per in SQLAlchemy 1.4.42, or why the SQLAlchemy docs don't reference using fetchmany() instead. Oh well.