Currently I am inserting into a Postgres database using psycopg2. The data is large and the write frequency is high, so my database has WAL disabled and a few other optimizations for faster writes. When I use psycopg2 with execute_values, I am able to write a batch of 1000 rows in 0.1-0.15 seconds.
```python
import time
import logging

from psycopg2.extras import execute_values
from sqlalchemy import create_engine

logger = logging.getLogger(__name__)

self.engine = create_engine('postgresql+psycopg2://postgres:password@localhost/postgres',
                            pool_size=DB_POOL_SIZE, max_overflow=20)

def insert_data_todb(self, table_name, batch_data):
    try:
        t1 = time.perf_counter()
        insert_sql = f"INSERT INTO {table_name} ({self._market_snapshot_columns_str}) VALUES %s;"
        # Drop down to the raw psycopg2 connection so execute_values can be used
        with self.engine.connect() as conn, conn.connection.cursor() as cur:
            execute_values(cur, insert_sql, batch_data)
        t2 = time.perf_counter()
        logger.info(f"Inserted {len(batch_data)} records in {t2 - t1} seconds")
    except Exception as ex:
        logger.error(f"Error inserting batch data into {table_name}:")
        logger.exception(ex)
```
I uninstalled psycopg2, installed psycopg 3.2, and used psycopg3's executemany like this:
```python
import time
import logging

import psycopg  # psycopg 3
from sqlalchemy import create_engine

logger = logging.getLogger(__name__)

self.engine = create_engine('postgresql+psycopg://postgres:password@localhost/postgres',
                            pool_size=DB_POOL_SIZE, max_overflow=20)

def insert_data_todb(self, table_name, batch_data):
    try:
        t1 = time.perf_counter()
        # Build one placeholder per column, e.g. (%s, %s, %s)
        placeholders = ', '.join(['%s'] * len(batch_data[0]))
        insert_sql = f"INSERT INTO {table_name} ({self._market_snapshot_columns_str}) VALUES ({placeholders});"
        with self.engine.connect() as conn, conn.connection.cursor() as cur:
            cur.executemany(insert_sql, batch_data)  # pass the batch data directly
        t2 = time.perf_counter()
        logger.info(f"Inserted {len(batch_data)} records in {t2 - t1} seconds")
    except Exception as ex:
        logger.error(f"Error inserting batch data into {table_name}:")
        logger.exception(ex)
```
My psycopg3 code is way slower! It takes 8-20 seconds to insert the same batches.
These examples aren't equivalent, and it's not about the psycopg version: `cur.executemany()` in the second example runs one INSERT per row, while `execute_values()` in the first example constructs INSERTs with longer VALUES lists, which is typically much more efficient.
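psycopg3 has no direct `execute_values()` equivalent, but you can get the same effect by building multi-row VALUES statements yourself. A minimal sketch, assuming `conn` is an open psycopg connection and all rows have the same length (`insert_multivalues` and `page_size` are illustrative names, not a psycopg API):

```python
import psycopg

def insert_multivalues(conn, table, columns, rows, page_size=1000):
    """Insert rows in pages, one multi-row VALUES statement per page."""
    row_ph = "(" + ", ".join(["%s"] * len(columns)) + ")"
    with conn.cursor() as cur:
        for start in range(0, len(rows), page_size):
            page = rows[start:start + page_size]
            sql = (f"INSERT INTO {table} ({', '.join(columns)}) "
                   f"VALUES {', '.join([row_ph] * len(page))}")
            # Flatten the page: one execute() sends one statement covering N rows.
            # page_size also keeps len(columns) * N under Postgres's 65535
            # bind-parameter limit.
            cur.execute(sql, [v for row in page for v in row])
    conn.commit()
```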
Both of these, however, lose to a very simple batch insert that uses COPY instead of INSERT. Quoting the result summary from an example benchmark:
| Function | Time (seconds) | Memory (MB) |
|---|---|---|
| insert_one_by_one() | 128.8 | 0.08203125 |
| insert_executemany() | 124.7 | 2.765625 |
| insert_executemany_iterator() | 129.3 | 0.0 |
| insert_execute_batch() | 3.917 | 2.50390625 |
| insert_execute_batch_iterator(page_size=1) | 130.2 | 0.0 |
| insert_execute_batch_iterator(page_size=100) | 4.333 | 0.0 |
| insert_execute_batch_iterator(page_size=1000) | 2.537 | 0.2265625 |
| insert_execute_batch_iterator(page_size=10000) | 2.585 | 25.4453125 |
| insert_execute_values() | 3.666 | 4.50390625 |
| insert_execute_values_iterator(page_size=1) | 127.4 | 0.0 |
| insert_execute_values_iterator(page_size=100) | 3.677 | 0.0 |
| insert_execute_values_iterator(page_size=1000) | 1.468 | 0.0 |
| insert_execute_values_iterator(page_size=10000) | 1.503 | 2.25 |
| copy_stringio() | 0.6274 | 99.109375 |
| copy_string_iterator(size=1024) | 0.4536 | 0.0 |
| copy_string_iterator(size=8192) | 0.4596 | 0.0 |
| copy_string_iterator(size=16384) | 0.4649 | 0.0 |
| copy_string_iterator(size=65536) | 0.6171 | 0.0 |
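In psycopg3, COPY is exposed directly on the cursor, so the fastest path here needs no string building at all. A minimal sketch under the same table/column assumptions as above (`copy_rows` is an illustrative name; `cur.copy()` and `copy.write_row()` are real psycopg3 APIs):

```python
import psycopg

def copy_rows(conn, table, columns, rows):
    """Stream rows to the server with COPY ... FROM STDIN."""
    stmt = f"COPY {table} ({', '.join(columns)}) FROM STDIN"
    with conn.cursor() as cur:
        with cur.copy(stmt) as copy:
            for row in rows:
                copy.write_row(row)  # psycopg handles the wire encoding
    conn.commit()
```

With psycopg2 you would instead feed a file-like object to its COPY support, which is roughly what the copy_stringio()/copy_string_iterator() rows in the table above measure.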