pythonpandasimport-from-excel

Faster way to read Excel files to pandas dataframe


I have a 14MB Excel file with five worksheets that I'm reading into a Pandas dataframe, and although the code below works, it takes 9 minutes!

Does anyone have suggestions for speeding it up?

import pandas as pd

def OTT_read(xl,site_name):
    df = pd.read_excel(xl.io,site_name,skiprows=2,parse_dates=0,index_col=0,
                       usecols=[0,1,2],header=None,
                       names=['date_time','%s_depth'%site_name,'%s_temp'%site_name])
    return df

def make_OTT_df(FILEDIR,OTT_FILE):
    xl = pd.ExcelFile(FILEDIR + OTT_FILE)
    site_names = xl.sheet_names
    df_list = [OTT_read(xl,site_name) for site_name in site_names]
    return site_names,df_list

FILEDIR='c:/downloads/'
OTT_FILE='OTT_Data_All_stations.xlsx'
site_names_OTT,df_list_OTT = make_OTT_df(FILEDIR,OTT_FILE)

Solution

  • As others have suggested, csv reading is faster. So if you are on windows and have Excel, you could call a vbscript to convert the Excel to csv and then read the csv. I tried the script below and it took about 30 seconds.

    # create a list with sheet numbers you want to process
    sheets = map(str,range(1,6))
    
    # convert each sheet to csv and then read it using read_csv
    df={}
    from subprocess import call
    excel='C:\\Users\\rsignell\\OTT_Data_All_stations.xlsx'
    for sheet in sheets:
        csv = 'C:\\Users\\rsignell\\test' + sheet + '.csv' 
        call(['cscript.exe', 'C:\\Users\\rsignell\\ExcelToCsv.vbs', excel, csv, sheet])
        df[sheet]=pd.read_csv(csv)
    

    Here's a little snippet of python to create the ExcelToCsv.vbs script:

    #write vbscript to file
    vbscript="""if WScript.Arguments.Count < 3 Then
        WScript.Echo "Please specify the source and the destination files. Usage: ExcelToCsv <xls/xlsx source file> <csv destination file> <worksheet number (starts at 1)>"
        Wscript.Quit
    End If
    
    csv_format = 6
    
    Set objFSO = CreateObject("Scripting.FileSystemObject")
    
    src_file = objFSO.GetAbsolutePathName(Wscript.Arguments.Item(0))
    dest_file = objFSO.GetAbsolutePathName(WScript.Arguments.Item(1))
    worksheet_number = CInt(WScript.Arguments.Item(2))
    
    Dim oExcel
    Set oExcel = CreateObject("Excel.Application")
    
    Dim oBook
    Set oBook = oExcel.Workbooks.Open(src_file)
    oBook.Worksheets(worksheet_number).Activate
    
    oBook.SaveAs dest_file, csv_format
    
    oBook.Close False
    oExcel.Quit
    """;
    
    f = open('ExcelToCsv.vbs','w')
    f.write(vbscript.encode('utf-8'))
    f.close()
    

    This answer benefited from Convert XLS to CSV on command line and csv & xlsx files import to pandas data frame: speed issue