postgresql, psycopg2

PostgreSQL + psycopg: Bulk insert large data with a PostgreSQL function call


I am working with a very large amount of very simple data (point clouds). I want to insert this data into a simple table in a PostgreSQL database using Python.

An example of the insert statement I need to execute is as follows:

INSERT INTO points_postgis (id_scan, scandist, pt) VALUES (1, 32.656, ST_MakePoint(1.1, 2.2, 3.3));

Note the call to the Postgresql function ST_MakePoint in the INSERT statement.

I must call this billions (yes, billions) of times, so obviously I must insert the data into PostgreSQL in a more optimized way. There are many strategies to bulk insert the data, as this article presents in a very good and informative way (executemany, copy, etc.): https://hakibenita.com/fast-load-data-python-postgresql
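
For reference, a minimal sketch of one of those strategies (psycopg2's execute_values helper) is shown below; the connection string, table, and column names are placeholders, and no server-side function is involved yet:

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=test user=postgres")  # placeholder DSN
with conn, conn.cursor() as cur:
    rows = [(1, 32.656, 1.1), (2, 18.101, 4.5)]        # sample rows
    # execute_values batches the rows into multi-row INSERT statements
    execute_values(
        cur,
        "INSERT INTO simple_table (id_scan, scandist, val) VALUES %s",
        rows,
        page_size=10000,  # rows per statement
    )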

But no example shows how to do these inserts when you need to call a function on the server side. My question is: how can I bulk INSERT data when I need to call a function on the server side of a PostgreSQL database using psycopg?

Any help is greatly appreciated! Thank you!


Please note that using a CSV doesn't make much sense because my data is huge. Alternatively, I already tried to fill a temp table with simple columns for the 3 inputs of the ST_MakePoint function and then, after all the data is in this temp table, call an INSERT/SELECT. The problem is that this takes a lot of time, and the amount of disk space I need for it is nonsensical.


Solution

  • The most important thing, in order to do this within a reasonable time and with minimum effort, is to break the task down into component parts, so that you can take advantage of different Postgres features separately.

    Firstly, you will want to create the table minus the geometry transformation. Such as:

    create table temp_table (
        id_scan bigint, 
        scandist numeric, 
        pt_1 numeric, 
        pt_2 numeric, 
        pt_3 numeric
    );
    
    

    Since we do not add any indexes or constraints, this will most likely be the fastest way to get the "raw" data into the RDBMS.

    The best way to do this would be with the COPY command, which you can use either from Postgres directly (if you have sufficient access), or via the Python interface by using https://www.psycopg.org/docs/cursor.html#cursor.copy_expert

    Here is example code to achieve this:

    import psycopg2

    # Connect to the target database (the target_* connection parameters are assumed to be defined elsewhere)
    iconn_string = "host={0} user={1} dbname={2} password={3} sslmode={4}".format(target_host, target_usr, target_db, target_pw, "require")
    iconn = psycopg2.connect(iconn_string)
    import_cursor = iconn.cursor()

    # Stream the CSV file into the raw staging table via COPY ... FROM STDIN
    csv_filename = '/path/to/my_file.csv'
    copy_sql = "COPY temp_table (id_scan, scandist, pt_1, pt_2, pt_3) FROM STDIN WITH CSV HEADER DELIMITER ',' QUOTE '\"' ESCAPE '\\' NULL AS 'null'"
    with open(csv_filename, mode='r', encoding='utf-8', errors='ignore') as csv_file:
        import_cursor.copy_expert(copy_sql, csv_file)
    iconn.commit()
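
    Since the question notes that writing a huge intermediate CSV to disk may be impractical, it is worth noting that copy_expert() accepts any file-like object, not just a real file. A minimal sketch of streaming generated rows through an in-memory buffer in batches (generate_points() and the tuple layout are assumptions; the connection and cursor are reused from the snippet above):

    import io
    import itertools

    def copy_in_batches(rows, batch_size=1000000):
        # `rows` is any iterable of (id_scan, scandist, x, y, z) tuples,
        # e.g. produced by a hypothetical generate_points() generator.
        copy_sql = "COPY temp_table (id_scan, scandist, pt_1, pt_2, pt_3) FROM STDIN WITH CSV"
        it = iter(rows)
        while True:
            batch = list(itertools.islice(it, batch_size))
            if not batch:
                break
            buf = io.StringIO()
            for i, d, x, y, z in batch:
                buf.write("{0},{1},{2},{3},{4}\n".format(i, d, x, y, z))
            buf.seek(0)
            import_cursor.copy_expert(copy_sql, buf)  # COPY one in-memory batch at a time
        iconn.commit()

    copy_in_batches(generate_points())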
    
    

    The next step will be to efficiently create the table you want from the existing raw data. You will then be able to populate your actual target table with a single SQL statement, and let the RDBMS do its magic.

    Once the data is in the RDBMS, it makes sense to optimize it a little and add an index or two if applicable (preferably a primary key or unique index, to speed up the transformation).

    This will be dependent on your data / use case, but something like this should help:

    alter table temp_table add primary key (id_scan); --if unique
    -- or
    create index idx_temp_table_1 on temp_table(id_scan); --if not unique
    

    To move data from raw into your target table:

    with temp_t as (
        select id_scan, scandist, ST_MakePoint(pt_1, pt_2, pt_3) as pt from temp_table
    )
    INSERT INTO points_postgis (id_scan, scandist, pt)
    SELECT temp_t.id_scan, temp_t.scandist, temp_t.pt
    FROM temp_t;
    

    This will select all the data from the previous table and transform it in one go.
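
    If you want to drive this step from Python as well, a minimal sketch (reusing the iconn connection from the COPY example; the statement is equivalent to the one above, just without the CTE) could look like this:

    transform_sql = """
        INSERT INTO points_postgis (id_scan, scandist, pt)
        SELECT id_scan, scandist, ST_MakePoint(pt_1, pt_2, pt_3)
        FROM temp_table;
    """
    with iconn, iconn.cursor() as cur:
        cur.execute(transform_sql)             # the transformation runs entirely inside the database
        cur.execute("DROP TABLE temp_table;")  # optional: reclaim the staging space afterwards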

    A second option you could use is similar. You can load all the raw data into points_postgis directly, while keeping it separated into 3 temp columns. Then add the geometry column, fill it with an update, and drop the temp columns:

    alter table points_postgis add column pt geometry;
    update points_postgis set pt = ST_MakePoint(pt_1, pt_2, pt_3);
    alter table points_postgis drop column pt_1, drop column pt_2, drop column pt_3;

    The main takeaway is that the most performant option is not to concentrate on the final table state, but to break the work down into easily achievable chunks. Postgres will easily handle both the import of billions of rows and their transformation afterwards.