I am trying to bulk insert a .CSV
file into SQL Server without much success.
A bit of background:
1. I needed to insert 16 million records into a SQL Server (2017) DB. Each record has 130 columns. I have a field in the .CSV
resulting from an API call from one of our vendors which I am not allowed to mention. I had integer, floats and strings data types.
2. I tried the usual: BULK INSERT
but I could not get passed the data type errors. I posted a question here but could not make it work.
3. I tried experimenting with python and tried every method I could find but pandas.to_sql
for everybody warned it was very slow. I got stuck with data type and string truncate errors. Different to the ones from BULK INSERT
.
4. Without much options I tried pd.to_sql
and while it did not raise any data type or truncation errors it was failing due to running out of space in my tmp SQL database. I could not pass this error either although I had plenty of space and all my data files (and log files) were set to autogrowth without limit.
I got stuck at that point. My code (for the pd.to_sql
piece) was simple:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("mssql+pyodbc://@myDSN")
df.to_sql('myTable', engine, schema='dbo', if_exists='append',index=False,chunksize=100)
I am not really sure what else to try, any word of advice is welcome. All codes and examples I have seen deal with small datasets (not many columns). I am willing to try any other method. I would appreciate any pointers.
Thanks!
I just wanted to share this dirty piece of code just in case it helps anybody else. Note that I am very aware this is not optimal at all, it is slow but I was able to insert about 16 million records in ten minutes without overloading my machine.
I tried doing it in small batches with:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("mssql+pyodbc://@myDSN")
a = 1
b = 1001
while b <= len(df):
try:
df[a:b].to_sql('myTable', engine, schema='dbo', if_exists='append',index=False,chunksize=100)
a = b + 1
b = b + 1000
except:
print(f'Error between {a} and {b}')
continue
Ugly as hell but worked for me.
I am open to all critics and advises. As I mentioned, I am posting this in case it helps anybody else but also looking forward to receive some constructive feedback.