azureazure-data-factoryextractapache-synapse

Split multiple tables from a single sheet in excel using data flows


Need to extract data from multiple tables which are located in the same worksheet in an excel(.xlsx) file. Every table has a distinct header above the column names and every table ends with the row "total" (total row is not needed). There are currently 3 tables I need from this file but there will be a 4th table added in the future which needs to be extracted as well. I have the header and the column names for that table as well. The cell ranges are not fixed and the rows can increase or decrease in the next file so the tables need to be picked up dynamically.

The output needs to be one big table which contains all the distinct columns from all the 3 tables and the columns from the 4th table as and when it is avaiable in the source file.

I am trying on Data flows but not able to extract the tables. I am open to working on ADF or Synapse also if there is a solution.

The sample input is shown below(All tables are in the same sheet):

Trial Balance
Acc Number Original Balance Current Balance Interest Rate Original Date
546172 2354673 231473 10% 07-03-2023
758682 2375764 137487 9% 09-09-2020
536357 6473635 474661 13% 11-05-2022
Total 658496 784651
Delinquency Report
Acc Number Original Balance Current Balance Next Due Date
546172 2354673 231473 07-03-2024
536357 6473635 474661 09-09-2024
758682 2375764 137487 11-05-2024
Total 658496 784651

Solution

  • You can use synapse notebook with pandas functionality to combine all of you tables.

    Below i am giving an example for 2 tables in you excel file similarly you alter below code for more tables.

    df = pd.read_excel("https://vjgsblob.blob.core.windows.net/data/csvs/extracted_data.xlsx?sp=r&st=2024-09-13T07:07:11Z&se=2024-09-13T15:07:11Z&spr=https&sv=2022-11-02&sr=b&sig=2AmroznyDPtH5boSZym8yxpLibsD0X4te18Phj%2BpAxc%3D").dropna(how='all').dropna(axis=1, how='all')
    df = df.reset_index(drop=True)
    df.columns = [f'col{i}' for i,j in enumerate(df.columns)]
    
    # Getting the column names where table name indicator presents
    trialBalancCol = [col for col in df.columns if df[col].str.contains('Trial Balance', na=False).any()][0]
    deliquencyReportCol = [col for col in df.columns if df[col].str.contains('Deliquency Report', na=False).any()][0]
    
    # Getting the position, which is used to split them.
    trialBalance = df[df[trialBalancCol]=='Trial Balance'].index[0]
    deliquencyReport = df[df[deliquencyReportCol]=='Deliquency Report'].index[0]
    print(trialBalance,deliquencyReport)
    
    #splitting the tables
    
    dfTrialBalance = df[(df.index>trialBalance) & (df.index<deliquencyReport)].dropna(how='all').dropna(axis=1, how='all')
    dfDeliquencyReport = df[df.index>deliquencyReport].dropna(how='all').dropna(axis=1, how='all')
    
    #Removing total row
    totalCol = [col for col in dfTrialBalance.columns if dfTrialBalance[col].str.contains('total', na=False).any()][0]
    dfTrialBalance = dfTrialBalance[df[totalCol]!='total']
    totalCol = [col for col in dfDeliquencyReport.columns if dfDeliquencyReport[col].str.contains('total', na=False).any()][0]
    dfDeliquencyReport = dfDeliquencyReport[df[totalCol]!='total']
    
    # Creating the column names
    dfTrialBalance.columns =  dfTrialBalance.head(1).values.flatten().tolist()
    dfDeliquencyReport.columns =  dfDeliquencyReport.head(1).values.flatten().tolist()
    
    #Removing the column name row and resetting the index
    dfTrialBalance = dfTrialBalance.iloc[1:].reset_index(drop=True)
    dfDeliquencyReport = dfDeliquencyReport.iloc[1:].reset_index(drop=True)
    
    #Joining the tables
    ResTable = pd.merge(dfTrialBalance,dfDeliquencyReport,on='Acc Number',how='left')
    ResTable
    

    After executing all above code you save it back to storage account in csv format(ResTable.to_csv(<path>)).

    Again, alter this code according to your data.

    Input : enter image description here

    Output:

    Acc Number Original Bal Current Bal Interest rate Original Date Original Balance Current Balance Next Due Date
    546172 2354673 231473 0.1 2023-07-03 00:00:00 2354673 231473 2024-07-03 00:00:00
    758682 2375764 137487 0.09 2020-09-09 00:00:00 NaN NaN NaN
    536357 6473635 474661 0.13 2022-11-05 00:00:00 6473635 474661 2024-09-09 00:00:00