I'm trying to create a data ingestion routine to load data from multiple excel files with multiple tabs and columns in a data structure using python. The structuring of the tabs in each of the excel files is the same. Can someone please help me with my code? Please let me know what can be changed here.
folder = r'specified_path'
#Changing the directory to the database directory
os.chdir(folder)
#Getting the list of files from the assigned path
files = os.listdir(folder)
#Joining the list of files to the assigned path
for archivedlist in files:
local_path = os.path.join(folder, archivedlist)
print("Joined Path: ", local_path)
#Reading the data from the files in the dictionary data structure
main_dict = {}
def readdataframe(files):
df_dict = {}
for element in files:
df_dict[element] = pd.read_excel(element, sheet_name = None)
print(df_dict[element].keys)
return df_dict
print(readdataframe(files))
Let's assume you have two files in a directory called excel_test
:
1.xlsx
Sheet1 Sheet2
col1 col2 col1 col2
1 2 3 4
2.xlsx
Sheet1 Sheet2
col1 col2 col1 col2
5 6 7 8
You can store your extracted data in multiple ways, let's see some methods:
1) A single dictionary
A dictionary where all the keys are strings composed by the "name of the file, underscore, sheet name" and the values are pd.DataFrame
s
import pandas as pd
import os
files_dir = "excel_test/"
files = os.listdir(files_dir)
# here will be stored all the content from your files and sheets
sheets_content = {}
# reading all files inside the folder
for file in files:
# reading the content of a xlsx file
data = pd.ExcelFile(files_dir+file)
# iterating through all sheets
for sheet in data.sheet_names:
# saving the content of the sheet for that file (-5 deletes the .xlsx part from the name of the file and makes everything more readable)
sheets_content[file[:-5]+"_"+sheet] = data.parse(sheet)
print(sheets_content)
Output:
{'1_Sheet1':
col1 col2
0 1 2,
'1_Sheet2':
col1 col2
0 3 4,
'2_Sheet1':
col1 col2
0 5 6,
'2_Sheet2':
col1 col2
0 7 8
}
2) A dictionary of dictionaries
Store all xlsx files in a dictionary with as keys the file names and as value another dictionary. The inner dictionary has keys as sheets names and values are pd.DataFrame
s:
import pandas as pd
import os
files_dir = "excel_test/"
files = os.listdir(files_dir)
sheets_content = {}
for file in files:
data = pd.ExcelFile(files_dir+file)
file_data = {}
for sheet in data.sheet_names:
file_data[sheet] = data.parse(sheet)
sheets_content[file[:-5]] = file_data
Output:
{'1':
{'Sheet1':
col1 col2
0 1 2,
'Sheet2':
col1 col2
0 3 4},
'2':
{'Sheet1':
col1 col2
0 5 6,
'Sheet2':
col1 col2
0 7 8}
}
3) A list of dictionaries
Store all xlsx files in an array where each element is a dictionary with as keys the sheets names and values pd.DataFrame
s:
import pandas as pd
import os
files_dir = "excel_test/"
files = os.listdir(files_dir)
sheets_content = []
for file in files:
data = pd.ExcelFile(files_dir+file)
file_data = {}
for sheet in data.sheet_names:
file_data[sheet] = data.parse(sheet)
sheets_content.append(file_data)
Output:
[
{'Sheet1':
col1 col2
0 1 2,
'Sheet2':
col1 col2
0 3 4},
{'Sheet1':
col1 col2
0 5 6,
'Sheet2':
col1 col2
0 7 8}
]