pythonfor-loopif-statement

Read multiple excel files to dataframes using for loop by reading the month in each filename


I have 12 Excel files. Each is based on a month of the year and therefore each filename ends in 'Month YYYY'. For example, the file for March of 2021 ends in 'March 2021.xlsx'. I want to read each Excel file, select certain columns, drop empty rows, then merge each file into one excel file as a named worksheet. However, I want to search the file's name, identify the month and then rename the second column to say that month.

How do I add a code to have the month of each file be used as the 'new name' for the second column of each df?

Here's an example using two months:

File one: January 2021.xlsx

A B
1 x
3 x

File three: February 2021.xlsx

A B
3 x
5 x

I want to rename B to represent the month of the respective excel file and then merge to get:

A January February
1 x 0
3 x x
5 0 x

This is what I have done so far.

#Store Excel files in a python list
excel_files = list(Path(DATA_DIR).glob('*.xlsx'))
#Read each file into a dataframe

dfs = []

for excel_file in excel_files:
    df = pd.read_excel(excel_file,sheet_name='Sheet1',header=5,usecols='A,F',skipfooter=8)
    df.dropna(how='any', axis=0, inplace = True)
    df.rename(columns={'old-name': 'new-name'}, inplace=True)
    dfs.append(df)
#Compile the list of dataframes to merge
data_frames = [dfs[0], dfs[1],dfs[2] ... dfs[11]]
#Merge all dataframes
df_merged = reduce(lambda  left,right: pd.merge(left,right,on=['A'],
                                            how='outer'), data_frames).fillna(0)

I need help adding the code to have the month of each file be used as the 'new name' for the second column of each df?


Solution

  • I think your question is similar to this: Extract month, day and year from date regex

    an advanced way to do this would be using regex, which is laid out a little in that prior post.

    a simpler way to do this would be to split (or rsplit) the on (' '), assuming that there is a space in front of the month as well as after: excel_file = "first bit of names MONTH 2021.xlsx":

    for excel_file in excel_files:
        new-name = str(excel_file).rsplit(' ', 2)[-2] # creates a list [[first bit of names...], [MONTH], [2021.xlsx]] and takes the 2nd to last element
        df = pd.read_excel(excel_file,sheet_name='Sheet1',header=5,usecols='A,F',skipfooter=8)
        df.dropna(how='any', axis=0, inplace = True)
        df.rename(columns={'old-name': new-name}, inplace=True)
        dfs.append(df)
    

    I think this answers the question, but you may have another problem getting the "old name" in the way you propose. Hope this helps

    *edited to match comment