I have a data structure in my code which is a dictionary of dictionaries. The nested dictionary has all the keys as pandas data frames. Basically, I had multiple excel files with multiple tabs and columns, so I created this data structure as I wanted to further do some modeling on this data. Now, I want to extract two columns from one specific tab of each excel file(if they are present in that file) and print them in a new master data frame. I tried some routines but was not able to get the expected result. Please find below the code that I tried to resolve this issue.
def text_extraction_to_dataframe(dict1, process_key):
'''This routine is used to extract any required column from the data into a new dataframe with the file name as new
column attached to it'''
#Initializing new data frame
df = pd.DataFrame()
df['ExcelFile'] = ''
#Running nested for-loops to get into our data structure(dictionary of dictionaries)
for key, value in dict1.items():
for key1, value1 in value.items():
#Checking if the required tab matches to the key
if key1 == process_key:
df = pd.DataFrame(value1) #Extracting all the data from the tab to the new dataframe
df['ExcelFile'] = key.split('.')[0] #Appending the data frame with new column as the filename
#Removing unnecessary columns from the data frame and only keeping column3 and column4
df = df.drop(columns = ['colum_1', 'column2'])
return df
text_extraction_to_dataframe(dictionary, 'tab_name')
This routine is not extracting all the data from all the columns of each excel file.
Also, I want to get the last column of the master data frame as the excel file name.
Basically, the structure of master df will be [column3, column4, excelfilename]
Let me know if you need anything else other than this. Any help would be appreciated.
I solved this query by adding all the data frames into a list and then concatenating those. Please find below the code.
def text_extraction_to_dataframe(dictionary, process_key):
'''This routine is used to extract any required column from the data into a
new data frame with the file name as a new column attached to it'''
#List to append all the read data frames
master_df1 = []
length = len(dictionary)
#Running nested for-loops to get into our data structure(dictionary of dictionaries)
for key, value in dictionary.items():
for key1, value1 in value.items():
#Checking if the required tab matches to the key
if key1 == process_key:
df = pd.DataFrame(value1)
#Adding the excel file name as the last column in each data frame
df['ExcelFile'] = key.split('.')[0]
#Appending all data frames in the list
master_df1.append(df)
#Concatenating all the data frames in the master data frame
master_df1 = pd.DataFrame(pd.concat(master_df1, ignore_index=True))
#Dropping unnecessary column
master_df1 = master_df1.drop(columns=['column1', 'column2'])
return master_df1