I am using a PostgreSQL database and a big pandas DataFrame (~3500 records) that has to be uploaded into one of the tables in the database.
First, I have to check which records already exist in the database and, afterwards, upload only the non-existing ones.
The problem is that the CPU gets heavily loaded and sometimes my old "server" crashes.
So I divided the queries into smaller ones.
For the upload, I use the df.to_sql() method with the option chunksize=150, so that the upload takes place in chunks of 150 records.
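For reference, the upload call looks roughly like this (the connection string and table name are placeholders):

from sqlalchemy import create_engine

# Placeholder connection details; the real ones point at my PostgreSQL server.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/mydb")

# Append the dataframe to the table in chunks of 150 rows.
df.to_sql("my_table", engine, if_exists="append", index=False, chunksize=150)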
For the existence check, I use the dataframe column that is defined as the primary key in the database table (it is of type datetime / Timestamp), so I end up with a long where clause like this:
SELECT datetime FROM my_table
WHERE datetime = datetime_row1_df
OR datetime = datetime_row2_df
...
OR datetime = datetime_row3500_df
In order to reduce the CPU load, I divided this query into smaller ones, so that the first one tests for the existence in the database of the dataframe rows from 1 to 150, the second query tests for the existence of the dataframe rows from 151 to 300, and so on.
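In code, each chunked check looks roughly like this (a sketch using psycopg directly and a parameter list instead of the hand-built OR chain; connection details and names are placeholders):

import psycopg

# Placeholder connection details.
conn = psycopg.connect("dbname=mydb user=user password=secret host=localhost")

# Convert the pandas Timestamps to plain datetimes for the driver.
timestamps = [ts.to_pydatetime() for ts in df["datetime"]]

existing = set()
with conn.cursor() as cur:
    # Check the timestamps 150 at a time.
    for start in range(0, len(timestamps), 150):
        chunk = timestamps[start:start + 150]
        cur.execute("SELECT datetime FROM my_table WHERE datetime = ANY(%s)", (chunk,))
        existing.update(row[0] for row in cur.fetchall())

conn.close()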
So I have two questions:
1. Regarding the chunksize=150 option of the df.to_sql() method: is it a good way to reduce the CPU load?
2. Regarding the division of the where clause in the SQL query: I'm wondering whether what I'm doing actually puts even more strain on the CPU. I think that with the first chunk the server scans the entire table, with the second chunk it scans it again, and so on. Would it be better to keep it as a single long query? If so, what alternatives do I have to reduce the CPU load for this query?

Doing a bulk insert of a large dataset with individual INSERT statements (which is what df.to_sql() generates) is not recommended. You should use COPY FROM instead to load your dataset into Postgres; it is far faster:
COPY my_table FROM '/app/records.csv' CSV HEADER;
Data validation and sanitization should be done on the dataframe beforehand, and the column names must match those in the Postgres table.
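If the CSV file cannot be placed on the database server (server-side COPY FROM a file also needs elevated privileges), a client-side copy straight from the dataframe can look like this. This is only a sketch with psycopg 3; connection details and the extra column names are assumptions:

import io
import psycopg

# Dump the dataframe to an in-memory CSV buffer, without header,
# with the columns in the same order as the COPY column list below.
buf = io.StringIO()
df.to_csv(buf, index=False, header=False)
buf.seek(0)

with psycopg.connect("dbname=mydb user=user password=secret host=localhost") as conn:
    with conn.cursor() as cur:
        # COPY ... FROM STDIN streams the data from the client process.
        with cur.copy("COPY my_table (datetime, col_a, col_b) FROM STDIN WITH (FORMAT csv)") as copy:
            copy.write(buf.read())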
Using a copy operation you can load data into the database from any Python iterable (a list of tuples, or any iterable of sequences): the Python values are adapted as they would be in normal querying. To perform such an operation, use a COPY ... FROM STDIN with Cursor.copy() and use write_row() on the resulting object in a with block. On exiting the block the operation will be concluded:
records = [(10, 20, "hello"), (40, None, "world")]

with cursor.copy("COPY sample (col1, col2, col3) FROM STDIN") as copy:
    for record in records:
        copy.write_row(record)
If an exception is raised inside the block, the operation is interrupted and the records inserted so far are discarded.
This is another way of using COPY ... FROM STDIN when you have a Python iterable instead of a pandas dataframe. Note that looping over a large dataframe row by row (for example with iterrows()) will also take a lot of processing time, so it is better to convert it into plain tuples first, as in the sketch below.
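For example, the dataframe can be turned into an iterable of plain tuples first (a sketch; table and column names are placeholders, and depending on your dtypes you may need to convert numpy scalars to plain Python values):

import psycopg

# itertuples(name=None) yields plain tuples and is much cheaper than iterrows().
records = df[["col1", "col2", "col3"]].itertuples(index=False, name=None)

with psycopg.connect("dbname=mydb user=user password=secret host=localhost") as conn:
    with conn.cursor() as cur:
        with cur.copy("COPY sample (col1, col2, col3) FROM STDIN") as copy:
            for record in records:
                copy.write_row(record)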
If your dataset has rows that conflict with some of the existing rows in the table, then you have to copy the dataset to a temporary table first. Then you do a bulk insert operation from the temporary table to the main table with ON CONFLICT DO UPDATE clause
to overwrite the conflicting rows.
Frank Liao's answer gives the exact query that you need:
sql = """ CREATE TABLE temp_h ( time , name, description ); COPY temp_h FROM STDIN With CSV; INSERT INTO table_a(time, name, description) SELECT * FROM temp_h ON conflict (time) DO update set name=EXCLUDED.name, description=EXCLUDED.description; DROP TABLE temp_h; """
If the temporary table has no rows that conflict with the main table, you can also temporarily disable all indexes of the main table for the insert operation. This is useful for future loads, once your dataset is already clean and nothing in the staging table conflicts with the main table.
When you run a large query (insert/update) on a huge table with several indexes, those indexes can seriously slow down the query execution.
With PostgreSQL it can be much faster to disable the indexes before running the query and to rebuild them afterwards.
As answered by Milovan Zogovic in this answer:
The problem was with indexes. The history table had 160M indexed rows. By running either COPY FROM or INSERT INTO .. SELECT, it was taking a lot of time not to insert rows, but to update indexes. When I disabled the indexes, it imported 3M rows in 10 seconds. Now I need to find a faster way of reindexing the big table.
But note that you cannot use INSERT ... ON CONFLICT in that case, as ON CONFLICT requires the unique index (or constraint) to be in place.
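A rough sketch of that pattern, assuming the staging table temp_h from above is already loaded and conflict-free (the index name and column are hypothetical; the primary key is left in place, since dropping constraints is a bigger operation):

import psycopg

with psycopg.connect("dbname=mydb user=user password=secret host=localhost") as conn:
    with conn.cursor() as cur:
        # Drop the secondary index(es) on the main table to speed up the load.
        cur.execute("DROP INDEX IF EXISTS table_a_name_idx")

        # Plain insert from the staging table; no ON CONFLICT here,
        # so this only works when the staging data has no conflicting rows.
        cur.execute("INSERT INTO table_a (time, name, description) SELECT time, name, description FROM temp_h")

        # Rebuild the index once the load is done.
        cur.execute("CREATE INDEX table_a_name_idx ON table_a (name)")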