Tags: python, database, oracle-database, insert, oracle18c

How to insert 1 million rows into Oracle database with Python?


I have ~100,000 to 1,000,000 rows to insert into an Oracle 18c database. I'm quite new to Oracle and to data of this order of magnitude. I reckon there must be an optimal way to do it, but for now I've only managed to implement a row-by-row insertion:

def insertLines(connection, table_name, column_names, rows):
    cursor = connection.cursor()
    if table_exists(connection, table_name):
        for row in rows:
            sql = 'INSERT INTO {} ({}) VALUES ({})'.format(table_name, column_names, row)
            cursor.execute(sql)
    cursor.close()

Is there a clear way in cx_Oracle (the Python Oracle library) to batch the rows and get better performance?

EDIT: I read the data from a CSV file.


Solution

  • If your data is already in Python, then use executemany(). With this many rows you would probably still make multiple calls, inserting the records in batches.

    The latest release of cx_Oracle (which has been renamed to python-oracledb) runs in a 'Thin' mode by default that bypasses the Oracle Client libraries, which in many cases makes data loads faster. The usage and functionality of executemany() is unchanged in the new release. Install it with something like python -m pip install oracledb. Here's the current documentation for Executing Batch Statements and Bulk Loading. Also see the upgrading documentation.

    Here's an example using the python-oracledb namespace. If you still use cx_Oracle then change the import to be import cx_Oracle as oracledb:

    import oracledb
    import csv
    
    # ...
    # Connect and open a cursor here...
    # ...
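    # For instance (the credentials and DSN below are placeholder
    # assumptions for this sketch, not values from the original answer):
    con = oracledb.connect(user="hr", password="hr_password",
                           dsn="dbhost.example.com/orclpdb1")
    cursor = con.cursor()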
    
    # Predefine the memory areas to match the table definition.
    # This can improve performance by avoiding memory reallocations.
    # Here, one parameter is passed for each of the columns.
    # "None" is used for the ID column, since the size of NUMBER isn't
    # variable.  The "25" matches the maximum expected data size for the
    # NAME column.
    cursor.setinputsizes(None, 25)
    
    # Adjust the number of rows to be inserted in each iteration
    # to meet your memory and performance requirements
    batch_size = 10000
    
    with open('testsp.csv', 'r') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        sql = "insert into test (id,name) values (:1, :2)"
        data = []
        for line in csv_reader:
            data.append((line[0], line[1]))
            if len(data) % batch_size == 0:
                cursor.executemany(sql, data)
                data = []
        if data:
            cursor.executemany(sql, data)
        con.commit()
    

    There is a full sample at samples/load_csv.py.
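
    For comparison, the insertLines() function from the question could be rewritten along these lines to use executemany() with bind variables. This is only a sketch: it assumes column_names is a list of column names and rows is a list of tuples, and the insert_rows() name is just illustrative:

    def insert_rows(connection, table_name, column_names, rows, batch_size=10000):
        # Sketch only: assumes column_names is a list/tuple of column names
        # and rows is a list of row tuples.
        # Build one positional bind placeholder per column, e.g. "(:1, :2)"
        placeholders = ", ".join(":{}".format(i + 1) for i in range(len(column_names)))
        sql = "insert into {} ({}) values ({})".format(
            table_name, ", ".join(column_names), placeholders)
        cursor = connection.cursor()
        # Send the rows in batches instead of one execute() call per row
        for start in range(0, len(rows), batch_size):
            cursor.executemany(sql, rows[start:start + batch_size])
        connection.commit()
        cursor.close()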

    As pointed out by others: