python · sql-server · pandas · bulkinsert · pandas-to-sql

CSV to SQL Server: bulk import nightmare (T-SQL and/or Pandas)


I am trying to bulk insert a .CSV file into SQL Server without much success.

A bit of background:

1. I needed to insert 16 million records into a SQL Server (2017) DB. Each record has 130 columns. The .CSV comes from an API call to one of our vendors, which I am not allowed to mention. The data types are integers, floats and strings.

2. I tried the usual: BULK INSERT, but I could not get past the data type errors. I posted a question about it here but could not make it work (a reconstruction of that attempt, with placeholder paths, follows this list).

3. I then experimented with Python and tried every method I could find, but everybody warned that pandas.to_sql was very slow. I got stuck with data type and string truncation errors, different from the ones BULK INSERT raised (a sketch of explicit dtype pinning, which to_sql supports, also follows this list).

4. Without many options left, I tried pd.to_sql, and while it did not raise any data type or truncation errors, it kept failing because the tempdb database ran out of space. I could not get past this error either, although I had plenty of disk space and all my data files (and log files) were set to autogrowth without limit.
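
For context, my BULK INSERT attempt was along these lines (reconstructed here through SQLAlchemy so everything stays in Python; the path and the terminators are placeholders for my real ones):

from sqlalchemy import create_engine, text

engine = create_engine("mssql+pyodbc://@myDSN")

# Raw string so the backslashes in the path and the '\n' terminator
# reach SQL Server literally; FIRSTROW = 2 skips the header row
with engine.begin() as conn:
    conn.execute(text(r"""
        BULK INSERT dbo.myTable
        FROM 'C:\data\myFile.csv'
        WITH (FIRSTROW = 2,
              FIELDTERMINATOR = ',',
              ROWTERMINATOR = '\n');
    """))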
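
And this is the kind of explicit type mapping pd.to_sql accepts through its dtype argument; it pins the SQL type per column when to_sql creates the table, in case the defaults are what trips the truncation errors. The column names are placeholders, not my real 130:

import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.types import Integer, Float, NVARCHAR

engine = create_engine("mssql+pyodbc://@myDSN")

# Placeholder columns standing in for the real 130; if_exists='replace'
# lets to_sql create the table with exactly these types
df.to_sql('myTable', engine, schema='dbo', if_exists='replace', index=False,
          dtype={'some_id': Integer(),
                 'some_measure': Float(),
                 'some_text': NVARCHAR(length=255)})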

I got stuck at that point. My code (for the pd.to_sql piece) was simple:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://@myDSN")

# df already holds the full contents of the .CSV at this point
df.to_sql('myTable', engine, schema='dbo', if_exists='append', index=False, chunksize=100)

I am not really sure what else to try; any word of advice is welcome. All the code and examples I have seen deal with small datasets (not many columns). I am willing to try any other method, and I would appreciate any pointers.

Thanks!


Solution

  • I just wanted to share this dirty piece of code in case it helps anybody else. Note that I am very aware it is not optimal at all and it is slow, but I was able to insert about 16 million records in ten minutes without overloading my machine.

    I tried doing it in small batches with:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("mssql+pyodbc://@myDSN")

    batch = 1000
    a = 0
    while a < len(df):
        b = a + batch
        try:
            # .iloc slices are end-exclusive, so consecutive batches
            # neither overlap nor skip rows
            df.iloc[a:b].to_sql('myTable', engine, schema='dbo',
                                if_exists='append', index=False, chunksize=100)
        except Exception as e:
            print(f'Error between {a} and {b}: {e}')
        a = b  # advance even after a failure so a bad batch cannot retry forever


    Ugly as hell but worked for me.
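
    A cleaner variant of the same idea would read the .CSV in chunks instead of slicing a frame that already sits fully in memory, and turn on batched inserts in the pyodbc dialect. A minimal sketch; the file name is a placeholder, and fast_executemany assumes a reasonably recent ODBC driver:

    import pandas as pd
    from sqlalchemy import create_engine

    # fast_executemany makes the pyodbc dialect send rows in batches
    # instead of one round trip per row
    engine = create_engine("mssql+pyodbc://@myDSN", fast_executemany=True)

    # 'myFile.csv' is a placeholder; chunksize makes read_csv yield
    # DataFrames of at most 1,000 rows, so the 16 million rows never
    # have to fit in memory all at once
    for chunk in pd.read_csv('myFile.csv', chunksize=1000):
        chunk.to_sql('myTable', engine, schema='dbo',
                     if_exists='append', index=False)

    Each chunk is committed on its own, which keeps every individual transaction small instead of one giant insert.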

    I am open to all criticism and advice. As I mentioned, I am posting this in case it helps anybody else, but I am also looking forward to receiving some constructive feedback.