python python-2.7 pandas openpyxl

How to read Excel Workbook (pandas) that may have multiple worksheets


First, I want to say that I am not an expert by any means. I am reasonably versed but short on time, and I am learning Python later than I should have!

Question:
I have a workbook that will occasionally have more than one worksheet. When reading in the workbook, I will not know the number of sheets or their names. The data arrangement will be the same on every sheet, with some columns going by the name 'Unnamed'. Everything I try or find online uses pandas.ExcelFile to gather all the sheets, which is fine, but I need to be able to skip 4 rows, read only the 42 rows after that, and parse specific columns. Although the sheets have the same structure, the column names may or may not match from sheet to sheet, and I would like the data merged.

So here is what I have:

import pandas as pd
from openpyxl import load_workbook

# Load in the file location and name
cause_effect_file = r'C:\Users\Owner\Desktop\C&E Template.xlsx'

# Set up the ability to write dataframe to the same workbook
book = load_workbook(cause_effect_file)
writer = pd.ExcelWriter(cause_effect_file) 
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)

# Get the file skip rows and parse columns needed
xl_file = pd.read_excel(cause_effect_file, skiprows=4, parse_cols = 'B:AJ', na_values=['NA'], convert_float=False)

# Loop through the sheets loading data in the dataframe
dfi = {sheet_name: xl_file.parse(sheet_name)
          for sheet_name in xl_file.sheet_names}

# Remove columns labeled as un-named
for col in dfi:
    if r'Unnamed' in col:
        del dfi[col]

# Write dataframe to sheet so we can see what the data looks like
dfi.to_excel(writer, "PyDF", index=False)

# Save it back to the book
writer.save()

The link to the file I am working with is below: Excel File


Solution

  • Try modifying the following to suit your specific needs:

    import pandas as pd
    
    # path is the location of the workbook, e.g. cause_effect_file in the question
    df = pd.DataFrame()
    xls = pd.ExcelFile(path)
    

    Then iterate over all the available data sheets:

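    frames = []
    for name in xls.sheet_names:
        # header=4 skips 4 rows; usecols is the current name for parse_cols
        a = xls.parse(name, header=4, usecols='B:AJ')
        a["Sheet Name"] = name  # record which sheet each row came from
        frames.append(a)
    # DataFrame.append was removed in pandas 2.0, so concatenate once instead
    df = pd.concat(frames)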
    

    You can adjust the header row and the columns to read for each sheet. I added a column that will indicate the name of the data sheet the row came from.
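For the other requirements in the question (only 42 rows, dropping the 'Unnamed' columns, merging sheets whose column names differ), here is a minimal sketch rather than a definitive implementation. It assumes a reasonably recent pandas where read_excel accepts sheet_name=None, usecols and nrows, and it assumes the 42 rows are counted below the header row. The function names read_all_sheets and combine_sheets are hypothetical, chosen only for illustration.

```python
import pandas as pd


def read_all_sheets(path):
    """Read every sheet of the workbook into a dict of {sheet name: DataFrame}."""
    # sheet_name=None asks pandas for all sheets at once.
    # header=4 skips the first 4 rows and uses the next row as column names;
    # nrows=42 keeps at most 42 data rows below that header (assumption).
    return pd.read_excel(path, sheet_name=None, header=4,
                         usecols="B:AJ", nrows=42, na_values=["NA"])


def combine_sheets(sheets):
    """Drop 'Unnamed' columns, tag rows with their sheet name, and stack the sheets."""
    frames = []
    for name, frame in sheets.items():
        # pandas labels blank header cells "Unnamed: <n>"
        keep = ~frame.columns.astype(str).str.startswith("Unnamed")
        kept = frame.loc[:, keep].copy()
        kept["Sheet Name"] = name
        frames.append(kept)
    # Columns missing from a sheet are filled with NaN in the combined frame.
    return pd.concat(frames, ignore_index=True, sort=False)
```

With the workbook from the question, this would be called as df = combine_sheets(read_all_sheets(cause_effect_file)). Keeping the reading and the merging separate means the merge step can be checked on small hand-made dataframes without an Excel file.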