Need to extract data from multiple tables which are located in the same worksheet in an excel(.xlsx) file. Every table has a distinct header above the column names and every table ends with the row "total" (total row is not needed). There are currently 3 tables I need from this file but there will be a 4th table added in the future which needs to be extracted as well. I have the header and the column names for that table as well. The cell ranges are not fixed and the rows can increase or decrease in the next file so the tables need to be picked up dynamically.
The output needs to be one big table which contains all the distinct columns from all the 3 tables and the columns from the 4th table as and when it is avaiable in the source file.
I am trying on Data flows but not able to extract the tables. I am open to working on ADF or Synapse also if there is a solution.
The sample input is shown below(All tables are in the same sheet):
Trial Balance | ||||
---|---|---|---|---|
Acc Number | Original Balance | Current Balance | Interest Rate | Original Date |
546172 | 2354673 | 231473 | 10% | 07-03-2023 |
758682 | 2375764 | 137487 | 9% | 09-09-2020 |
536357 | 6473635 | 474661 | 13% | 11-05-2022 |
Total | 658496 | 784651 |
Delinquency Report | |||
---|---|---|---|
Acc Number | Original Balance | Current Balance | Next Due Date |
546172 | 2354673 | 231473 | 07-03-2024 |
536357 | 6473635 | 474661 | 09-09-2024 |
758682 | 2375764 | 137487 | 11-05-2024 |
Total | 658496 | 784651 |
You can use synapse notebook with pandas functionality to combine all of you tables.
Below i am giving an example for 2 tables in you excel file similarly you alter below code for more tables.
df = pd.read_excel("https://vjgsblob.blob.core.windows.net/data/csvs/extracted_data.xlsx?sp=r&st=2024-09-13T07:07:11Z&se=2024-09-13T15:07:11Z&spr=https&sv=2022-11-02&sr=b&sig=2AmroznyDPtH5boSZym8yxpLibsD0X4te18Phj%2BpAxc%3D").dropna(how='all').dropna(axis=1, how='all')
df = df.reset_index(drop=True)
df.columns = [f'col{i}' for i,j in enumerate(df.columns)]
# Getting the column names where table name indicator presents
trialBalancCol = [col for col in df.columns if df[col].str.contains('Trial Balance', na=False).any()][0]
deliquencyReportCol = [col for col in df.columns if df[col].str.contains('Deliquency Report', na=False).any()][0]
# Getting the position, which is used to split them.
trialBalance = df[df[trialBalancCol]=='Trial Balance'].index[0]
deliquencyReport = df[df[deliquencyReportCol]=='Deliquency Report'].index[0]
print(trialBalance,deliquencyReport)
#splitting the tables
dfTrialBalance = df[(df.index>trialBalance) & (df.index<deliquencyReport)].dropna(how='all').dropna(axis=1, how='all')
dfDeliquencyReport = df[df.index>deliquencyReport].dropna(how='all').dropna(axis=1, how='all')
#Removing total row
totalCol = [col for col in dfTrialBalance.columns if dfTrialBalance[col].str.contains('total', na=False).any()][0]
dfTrialBalance = dfTrialBalance[df[totalCol]!='total']
totalCol = [col for col in dfDeliquencyReport.columns if dfDeliquencyReport[col].str.contains('total', na=False).any()][0]
dfDeliquencyReport = dfDeliquencyReport[df[totalCol]!='total']
# Creating the column names
dfTrialBalance.columns = dfTrialBalance.head(1).values.flatten().tolist()
dfDeliquencyReport.columns = dfDeliquencyReport.head(1).values.flatten().tolist()
#Removing the column name row and resetting the index
dfTrialBalance = dfTrialBalance.iloc[1:].reset_index(drop=True)
dfDeliquencyReport = dfDeliquencyReport.iloc[1:].reset_index(drop=True)
#Joining the tables
ResTable = pd.merge(dfTrialBalance,dfDeliquencyReport,on='Acc Number',how='left')
ResTable
After executing all above code you save it back to storage account in csv format(ResTable.to_csv(<path>)
).
Again, alter this code according to your data.
Input :
Output:
Acc Number | Original Bal | Current Bal | Interest rate | Original Date | Original Balance | Current Balance | Next Due Date | |
---|---|---|---|---|---|---|---|---|
546172 | 2354673 | 231473 | 0.1 | 2023-07-03 00:00:00 | 2354673 | 231473 | 2024-07-03 00:00:00 | |
758682 | 2375764 | 137487 | 0.09 | 2020-09-09 00:00:00 | NaN | NaN | NaN | |
536357 | 6473635 | 474661 | 0.13 | 2022-11-05 00:00:00 | 6473635 | 474661 | 2024-09-09 00:00:00 |