pythonexcelmulti-index

Using python to read a multi-column table


I want to use python to read an Excel file and transform it into a different structure (Example).

My current approach is following

  1. Read the black area (it always has a fixed number of columns, so know where to start and to end)
  2. Read the remaining red area
  3. Merge the data

Python dataframe example to generate the red marked area above:

df = pd.DataFrame({'col1':  {0: 'Year'  , 1: 'Option'   , 2: 'Category' , 3: 'Type' , 4: 'Country'  , 5: 'Australia', 6: 'New Zealand'},
                   'col2':  {0: '2024'  , 1: 'S'        , 2: 'FTE'      , 3: 'A'    , 4: ''         , 5: '-1,0'     , 6: '-2,0'},
                   'col3':  {0: ''      , 1: ''         , 2: 'Budget'   , 3: 'B'    , 4: 'EUR'      , 5: '-100,5'   , 6: '-200,5'},
                   'col4':  {0: ''      , 1: ''         , 2: ''         , 3: 'C'    , 4: 'EUR'      , 5: '-1000'    , 6: '-2000'},
                   'col5':  {0: ''      , 1: 'T'        , 2: 'FTE'      , 3: 'A'    , 4: ''         , 5: '1,0'      , 6: '2,0'},
                   'col6':  {0: ''      , 1: ''         , 2: 'Budget'   , 3: 'B'    , 4: 'EUR'      , 5: '100,5'    , 6: '200,5'},
                   'col7':  {0: ''      , 1: ''         , 2: ''         , 3: 'C'    , 4: 'EUR'      , 5: '1000'     , 6: '2000'},
                   'col8':  {0: '2025'  , 1: 'S'        , 2: 'FTE'      , 3: 'A'    , 4: ''         , 5: '-3,0'     , 6: '-4,0'},
                   'col9':  {0: ''      , 1: ''         , 2: 'Budget'   , 3: 'B'    , 4: 'EUR'      , 5: '-300,5'   , 6: '-400,5'},
                   'col10': {0: ''      , 1: ''         , 2: ''         , 3: 'C'    , 4: 'EUR'      , 5: '3000'     , 6: '-4000'},
                   'col11': {0: ''      , 1: 'T'        , 2: 'FTE'      , 3: 'A'    , 4: ''         , 5: '3,0'      , 6: '4,0'},
                   'col12': {0: ''      , 1: ''         , 2: 'Budget'   , 3: 'B'    , 4: 'EUR'      , 5: '300,5'    , 6: '400,5'},
                    'col13': {0: ''      , 1: ''         , 2: ''         , 3: 'C'    , 4: 'EUR'     , 5: '3000'     , 6: '4000'},
                    })

I am struggling to read the data and set the multi index columns of the dataframe, as the country column does not fit into the hierarchy.

Because I have to use df = pd.read_excel(...usecols='T:Z', header=None... I am reading the data and the header separately and then add the headers using df.columns = pd.MultiIndex.from_arrays(...).

The result (for 2024) looks like this

2024
S T
A B C A B C
Country
0 Australia -1,0 -100,5 -1000 1,0 100,5 1000
1 New Zealand -2,0 -200,5 -2000 2,0 200,5 2000

Here I am stuck, I tried using .stack and .melt to achieve the target structure, however was not able to achieve it.


Solution

  • Regarding the red area, you can use the following code to obtain the desired format:

    table = df.T # Assuming "df" is the dataframe you gave    
    # Reset and eliminate the index
    table.reset_index(inplace=True)
    del table['index']
    
    # Reset the dataframe to make the columns the first row
    table.columns = table.iloc[0]
    table = table[1:]
    
    # Remove irrelevant columns
    del table['Country']
    del table['Category']
    
    # Blank entries must be populated with the previous entry
    def ReplaceBlankEntriesWithPrevious(col):
        out = col.copy()
        for i in range(1, len(col)+1):
            out[i] = col[i] if col[i] != "" else out[i-1]
        return out
    
    table['Year'] = ReplaceBlankEntriesWithPrevious(table['Year'])
    table['Option'] = ReplaceBlankEntriesWithPrevious(table['Option'])
    
    # Set two of the three indexes
    table.set_index(['Year', 'Option'], inplace=True)
    
    # Swap the "Type" and "Country" positions
    table = table.pivot(columns='Type')
    table = table.stack(level=0)
    table.rename_axis(index={0:"Country"}, inplace=True)
    
    # Optionally, you can swap the index order to put it in the order of 
    # Country, Year, Option
    table = table.swaplevel(i=0, j=2)
    table = table.swaplevel(i=1, j=2)
    

    Note that I started by transposing the table, since both "Year" and "Option" were columns. Additionally, I had to fill empty cells with their implied values. This solution isn't the most elegant, but it gets the job done. I would welcome a more succinct solution.