pythonpandasazure-sql-databasepandas-to-sql

How to speed up pandas to_sql


I am trying to upload data to a MS Azure Sql database using pandas to_sql and it takes very long. I often have to run it before I go to bed and wake up in the morning and it is done but has taken several hours and if there is an error that comes up I am not able to address it. Here is the code I have:

params = urllib.parse.quote_plus(
'Driver=%s;' % driver +
'Server=%s,1433;' % server +
'Database=%s;' % database +
'Uid=%s;' % username +
'Pwd={%s};' % password +
'Encrypt=yes;' +
'TrustServerCertificate=no;'
)

conn_str = 'mssql+pyodbc:///?odbc_connect=' + params
engine = create_engine(conn_str)

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True
        cursor.commit()
        
connection = engine.connect()
connection

Then I run this command for the sql ingestion:

master_data.to_sql('table_name', engine, chunksize=500, if_exists='append', method='multi',index=False)

I have played around with the chunksize and the sweet spot seems to be 100, which isn't fast enough considering I am usually trying to upload 800,000-2,000,000 records at a time. If I increase it beyond that I will get an error which seems to only be related to the chunk size.

OperationalError: (pyodbc.OperationalError) ('08S01', '[08S01] [Microsoft][ODBC Driver 17 for SQL Server]Communication link failure (0) (SQLExecDirectW)')

Solution

  • Not sure if you have your issue resolved but did want to provide an answer here for the benefit of providing Azure SQL Database libraries for Python specific information and some useful resources to investigate and resolve this issue, as applicable.

    An example of using pyodbc to directly query an Azure SQL Database: Quickstart: Use Python to query Azure SQL Database Single Instance & Managed Instance

    An example of using Pandas dataframe: How to read and write to an Azure SQL database from a Pandas dataframe

    main.py

    """Read write to Azure SQL database from pandas"""
    import pyodbc
    import pandas as pd
    import numpy as np
    from sqlalchemy import create_engine
    
    # 1. Constants
    AZUREUID = 'myuserid'                                    # Azure SQL database userid
    AZUREPWD = '************'                                # Azure SQL database password
    AZURESRV = 'shareddatabaseserver.database.windows.net'   # Azure SQL database server name (fully qualified)
    AZUREDB = 'Pandas'                                      # Azure SQL database name (if it does not exit, pandas will create it)
    TABLE = 'DataTable'                                      # Azure SQL database table name
    DRIVER = 'ODBC Driver 13 for SQL Server'                 # ODBC Driver
    
    def main():
    """Main function"""
    
    # 2. Build a connectionstring
    connectionstring = 'mssql+pyodbc://{uid}:{password}@{server}:1433/{database}?driver={driver}'.format(
        uid=AZUREUID,
        password=AZUREPWD,
        server=AZURESRV,
        database=AZUREDB,
        driver=DRIVER.replace(' ', '+'))
    
    # 3. Read dummydata into dataframe 
    df = pd.read_csv('./data/data.csv')
    
    # 4. Create SQL Alchemy engine and write data to SQL
    engn = create_engine(connectionstring)
    df.to_sql(TABLE, engn, if_exists='append')
    
    # 5. Read data from SQL into dataframe
    query = 'SELECT * FROM {table}'.format(table=TABLE)
    dfsql = pd.read_sql(query, engn)
    
    print(dfsql.head())
    
    
    if __name__ == "__main__":
        main()
    

    And finally, the following resources should assist in comparing specific implementations, with performance issues, with the below information where the Stack Overflow thread is likely the best resource but the Monitoring and Performance tuning document is useful to investigate and mitigate ay server-side performance issues, etc.

    Speeding up pandas.DataFrame.to_sql with fast_executemany of pyODBC Monitoring and performance tuning in Azure SQL Database and Azure SQL Managed Instance

    Regards, Mike