pythonsql-servercsvpyodbcbulkinsert

How to speed up bulk insert to MS SQL Server using pyodbc


Below is my code that I'd like some help with. I am having to run it over 1,300,000 rows meaning it takes up to 40 minutes to insert ~300,000 rows.

I figure bulk insert is the route to go to speed it up? Or is it because I'm iterating over the rows via for data in reader: portion?

#Opens the prepped csv file
with open (os.path.join(newpath,outfile), 'r') as f:
    #hooks csv reader to file
    reader = csv.reader(f)
    #pulls out the columns (which match the SQL table)
    columns = next(reader)
    #trims any extra spaces
    columns = [x.strip(' ') for x in columns]
    #starts SQL statement
    query = 'bulk insert into SpikeData123({0}) values ({1})'
    #puts column names in SQL query 'query'
    query = query.format(','.join(columns), ','.join('?' * len(columns)))

    print 'Query is: %s' % query
    #starts curser from cnxn (which works)
    cursor = cnxn.cursor()
    #uploads everything by row
    for data in reader:
        cursor.execute(query, data)
        cursor.commit()

I am dynamically picking my column headers on purpose (as I would like to create the most pythonic code possible).

SpikeData123 is the table name.


Solution

  • Update - May 2022: bcpandas and bcpyaz are wrappers for Microsoft's bcp utility.


    Update - April 2019: As noted in the comment from @SimonLang, BULK INSERT under SQL Server 2017 and later apparently does support text qualifiers in CSV files (ref: here).


    BULK INSERT will almost certainly be much faster than reading the source file row-by-row and doing a regular INSERT for each row. However, both BULK INSERT and BCP have a significant limitation regarding CSV files in that they cannot handle text qualifiers (ref: here). That is, if your CSV file does not have qualified text strings in it ...

    1,Gord Thompson,2015-04-15
    2,Bob Loblaw,2015-04-07
    

    ... then you can BULK INSERT it, but if it contains text qualifiers (because some text values contains commas) ...

    1,"Thompson, Gord",2015-04-15
    2,"Loblaw, Bob",2015-04-07
    

    ... then BULK INSERT cannot handle it. Still, it might be faster overall to pre-process such a CSV file into a pipe-delimited file ...

    1|Thompson, Gord|2015-04-15
    2|Loblaw, Bob|2015-04-07
    

    ... or a tab-delimited file (where represents the tab character) ...

    1→Thompson, Gord→2015-04-15
    2→Loblaw, Bob→2015-04-07
    

    ... and then BULK INSERT that file. For the latter (tab-delimited) file the BULK INSERT code would look something like this:

    import pypyodbc
    conn_str = "DSN=myDb_SQLEXPRESS;"
    cnxn = pypyodbc.connect(conn_str)
    crsr = cnxn.cursor()
    sql = """
    BULK INSERT myDb.dbo.SpikeData123
    FROM 'C:\\__tmp\\biTest.txt' WITH (
        FIELDTERMINATOR='\\t',
        ROWTERMINATOR='\\n'
        );
    """
    crsr.execute(sql)
    cnxn.commit()
    crsr.close()
    cnxn.close()
    

    Note: As mentioned in a comment, executing a BULK INSERT statement is only applicable if the SQL Server instance can directly read the source file. For cases where the source file is on a remote client, see this answer.