Tags: python, database, oracle-database, insert, oracle18c

How to insert 1 million rows into Oracle database with Python?


I have ~100,000 to 1,000,000 rows to insert into an Oracle 18c database. I'm quite new to Oracle and to data of this order of magnitude. I reckon there must be an optimal way to do it, but for now I've only managed to implement a row-by-row insertion:

def insertLines(connection, table_name, column_names, rows):
    cursor = connection.cursor()
    if table_exists(connection, table_name):
        for row in rows:
            sql = 'INSERT INTO {} ({}) VALUES ({})'.format(table_name, column_names, row)
            cursor.execute(sql)
    cursor.close()

Is there a clear way in cx_Oracle (the Python Oracle library) to batch the rows and get better performance?

EDIT: I read the data from a CSV file.


Solution

  • If your data is already in Python, then use executemany(). With this many rows you would probably still make multiple calls, inserting the records in batches.

    The latest release of cx_Oracle (which has been renamed to python-oracledb) runs in a 'Thin' mode by default that bypasses the Oracle Client libraries, which in many cases makes data loads faster. The usage and functionality of executemany() is unchanged in the new release. Install it with something like python -m pip install oracledb. Here's the current documentation for Executing Batch Statements and Bulk Loading. Also see the upgrading documentation.

    Here's an example using the python-oracledb namespace. If you still use cx_Oracle then change the import to be import cx_Oracle as oracledb:

    import oracledb
    import csv
    
    # ...
    # Connect and open a cursor here...
    # ...
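    # For instance (the credentials and DSN below are placeholder
    # assumptions for this sketch, not values from the original answer):
    con = oracledb.connect(user="hr", password="hr_password",
                           dsn="dbhost.example.com/orclpdb1")
    cursor = con.cursor()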
    
    # Predefine the memory areas to match the table definition.
    # This can improve performance by avoiding memory reallocations.
    # Here, one parameter is passed for each of the columns.
    # "None" is used for the ID column, since the size of NUMBER isn't
    # variable.  The "25" matches the maximum expected data size for the
    # NAME column.
    cursor.setinputsizes(None, 25)
    
    # Adjust the number of rows to be inserted in each iteration
    # to meet your memory and performance requirements
    batch_size = 10000
    
    with open('testsp.csv', 'r') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        sql = "insert into test (id,name) values (:1, :2)"
        data = []
        for line in csv_reader:
            data.append((line[0], line[1]))
            if len(data) % batch_size == 0:
                cursor.executemany(sql, data)
                data = []
        if data:
            cursor.executemany(sql, data)
        con.commit()
    

    There is a full sample at samples/load_csv.py.
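
    For comparison, the insertLines() function from the question could be rewritten along these lines to use executemany() with bind variables. This is only a sketch: it assumes column_names is a list of column names and rows is a list of tuples, and the insert_rows() name is just illustrative:

    def insert_rows(connection, table_name, column_names, rows, batch_size=10000):
        # Sketch only: assumes column_names is a list/tuple of column names
        # and rows is a list of row tuples.
        # Build one positional bind placeholder per column, e.g. "(:1, :2)"
        placeholders = ", ".join(":{}".format(i + 1) for i in range(len(column_names)))
        sql = "insert into {} ({}) values ({})".format(
            table_name, ", ".join(column_names), placeholders)
        cursor = connection.cursor()
        # Send the rows in batches instead of one execute() call per row
        for start in range(0, len(rows), batch_size):
            cursor.executemany(sql, rows[start:start + batch_size])
        connection.commit()
        cursor.close()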

    As pointed out by others: