pythonexcelpandasdataframedata-extraction

Skipping one specific excel tab from multiple tabs in multiple excel sheet (Pandas Python)


I have a routine in place to convert my multiple excel files, with multiple tabs and multiple columns (some tabs are present in the excel sheets, some are not, but the column structuring inside all the tabs is the same for all the sheets) to a dictionary of dictionaries. I'm facing an issue while skipping one specific tab from some of the excel sheets. I know we define the name of the sheets which we want to include in the data structure in the sheet_name parameter in the read_excel function of pandas. But, the problem here is that I want to skip one specific tab (Sheet1) from all the excel sheets, and also, the tab names I'm defining other than that in the sheet_name parameter are not present in each of the excel sheets. Please let me know if there are any workarounds here. Thank you!!

#Assigning the path to the folder variable
folder = r'specified_path'

#Changing the directory to the database directory
os.chdir(folder) 

#Getting the list of files from the assigned path
files = os.listdir(folder) 

#Joining the list of files to the assigned path
for archivedlist in files:
    local_path = os.path.join(folder, archivedlist)
    print("Joined Path: ", local_path)

#Reading the data from the files in the dictionary data structure
main_dict = {}
def readdataframe(files):
    df_dict = {}
    for element in files:
        df_dict[element] = pd.read_excel(element, sheet_name = ["Sheet2", "Sheet3", "Sheet4", 
                                                            "Sheet5", "Sheet6", "Sheet7",
                                                           "Sheet8"])
        print(df_dict[element].keys)
    return df_dict

print(readdataframe(files))

I want to skip sheet1 from all the excel files wherever it is present and want to extract the sheets[2-8] from all the excel files if they are present there. Also, a side note is that I could extract all the data from all the excel files when I was using sheet_name = None, but that is not the expected result.

Lastly, all the tabs which are extracted from all the excel sheets should be a pandas data frame.


Solution

  • I was able to resolve this query by creating two functions. The first function I created takes the input as the sheet name I want to skip/delete and the master dictionary (df_dict). Below is the code for the function:

    def delete_key(rm_key, df_dict):
        '''This routine is used to delete any tab from a nested dictionary '''
        
        #Checking for the tab name if it is present in the master dictionary. If yes, delete it directly from there
        if rm_key in df_dict:
            del df_dict[rm_key]
            
        #Looping in the master dictionary to check for the tab name to be deleted
        for val in df_dict.values():
            if isinstance(val, dict):
                df_dict = delete_key(rm_key, val) #Deleting the whole tab with its value from the master dictionary using a recursive routine
        
        return df_dict
    

    We need to call this function once we get our data structure from the routine mentioned in the question. The changes in that routine are as follows:

    folder = r'specified_path'
    
    files = os.listdir(folder)
    
    def readdataframe(files):
        '''This routine is used to read multiple excel files into a nested 
        dictionary of data frames'''
        
        for element in files:
            df_dict[element] = pd.read_excel(element, sheet_name = None)
            
            for num in df_dict[element]:
                df_dict[element][num] = pd.DataFrame.from_dict(df_dict[element][num])
                print("Filename: ", element, "Tab Name: ", num, "Type: ", type(df_dict1[element][num]))
        return df_dict
    

    When we execute both of these functions, we get the output as a dictionary of data frames which is not having the sheet that we want to skip.

    Please follow these routines, and they will work. Let me know if you face any issues.

    For simplicity, I have created three excel files with the same number of tabs inside them (Sheet1, Sheet2, Sheet3). The columns inside the tabs are also the same. Please check below the output. We get this output by running the readdataframe(files) function.

    Output:
    
    Joined Path:  specified_path\1.xlsx
    Joined Path:  specified_path\2.xlsx
    Joined Path:  specified_path\3.xlsx
    Filename:  1.xlsx Tab Name:  Sheet1 Type:  <class 'pandas.core.frame.DataFrame'>
    Filename:  1.xlsx Tab Name:  Sheet2 Type:  <class 'pandas.core.frame.DataFrame'>
    Filename:  1.xlsx Tab Name:  Sheet3 Type:  <class 'pandas.core.frame.DataFrame'>
    Filename:  2.xlsx Tab Name:  Sheet1 Type:  <class 'pandas.core.frame.DataFrame'>
    Filename:  2.xlsx Tab Name:  Sheet2 Type:  <class 'pandas.core.frame.DataFrame'>
    Filename:  2.xlsx Tab Name:  Sheet3 Type:  <class 'pandas.core.frame.DataFrame'>
    Filename:  3.xlsx Tab Name:  Sheet1 Type:  <class 'pandas.core.frame.DataFrame'>
    Filename:  3.xlsx Tab Name:  Sheet2 Type:  <class 'pandas.core.frame.DataFrame'>
    Filename:  3.xlsx Tab Name:  Sheet3 Type:  <class 'pandas.core.frame.DataFrame'>
    {'1.xlsx': {'Sheet1':    A  B  C  D
    0  1  1  1  2
    1  2  2  4  2
    2  3  3  2  4
    3  4  1  3  3, 'Sheet2':    A
    0  1
    1  2
    2  3
    3  4, 'Sheet3':    B
    0  3
    1  4
    2  5
    3  6}, '2.xlsx': {'Sheet1':    A  B  C  D
    0  1  1  1  2
    1  2  2  4  2
    2  3  3  2  4
    3  4  1  3  3, 'Sheet2':    A
    0  1
    1  2
    2  3
    3  4, 'Sheet3':    B
    0  3
    1  4
    2  5
    3  6}, '3.xlsx': {'Sheet1':    A  B  C  D
    0  1  1  1  2
    1  2  2  4  2
    2  3  3  2  4
    3  4  1  3  3, 'Sheet2':    A
    0  1
    1  2
    2  3
    3  4, 'Sheet3':    B
    0  3
    1  4
    2  5
    3  6}}
    

    Once we get this output, we can delete Sheet1 using delete_key('Sheet1', df_dict) function. The output after running this function is as follows:

    Output:
    
    {'1.xlsx': {'Sheet2':    A
    0  1
    1  2
    2  3
    3  4,
    'Sheet3':    B
    0  3
    1  4
    2  5
    3  6},
    '2.xlsx': {'Sheet2':    A
    0  1
    1  2
    2  3
    3  4,
    'Sheet3':    B
    0  3
    1  4
    2  5
    3  6},
    '3.xlsx': {'Sheet2':    A
    0  1
    1  2
    2  3
    3  4,
    'Sheet3':    B
    0  3
    1  4
    2  5
    3  6}}
    

    This is how we can see that Sheet one was removed from all the excel files.