I want to use python to read an Excel file and transform it into a different structure (Example).
My current approach is following
Python dataframe example to generate the red marked area above:
df = pd.DataFrame({'col1': {0: 'Year' , 1: 'Option' , 2: 'Category' , 3: 'Type' , 4: 'Country' , 5: 'Australia', 6: 'New Zealand'},
'col2': {0: '2024' , 1: 'S' , 2: 'FTE' , 3: 'A' , 4: '' , 5: '-1,0' , 6: '-2,0'},
'col3': {0: '' , 1: '' , 2: 'Budget' , 3: 'B' , 4: 'EUR' , 5: '-100,5' , 6: '-200,5'},
'col4': {0: '' , 1: '' , 2: '' , 3: 'C' , 4: 'EUR' , 5: '-1000' , 6: '-2000'},
'col5': {0: '' , 1: 'T' , 2: 'FTE' , 3: 'A' , 4: '' , 5: '1,0' , 6: '2,0'},
'col6': {0: '' , 1: '' , 2: 'Budget' , 3: 'B' , 4: 'EUR' , 5: '100,5' , 6: '200,5'},
'col7': {0: '' , 1: '' , 2: '' , 3: 'C' , 4: 'EUR' , 5: '1000' , 6: '2000'},
'col8': {0: '2025' , 1: 'S' , 2: 'FTE' , 3: 'A' , 4: '' , 5: '-3,0' , 6: '-4,0'},
'col9': {0: '' , 1: '' , 2: 'Budget' , 3: 'B' , 4: 'EUR' , 5: '-300,5' , 6: '-400,5'},
'col10': {0: '' , 1: '' , 2: '' , 3: 'C' , 4: 'EUR' , 5: '3000' , 6: '-4000'},
'col11': {0: '' , 1: 'T' , 2: 'FTE' , 3: 'A' , 4: '' , 5: '3,0' , 6: '4,0'},
'col12': {0: '' , 1: '' , 2: 'Budget' , 3: 'B' , 4: 'EUR' , 5: '300,5' , 6: '400,5'},
'col13': {0: '' , 1: '' , 2: '' , 3: 'C' , 4: 'EUR' , 5: '3000' , 6: '4000'},
})
I am struggling to read the data and set the multi index columns of the dataframe, as the country column does not fit into the hierarchy.
Because I have to use df = pd.read_excel(...usecols='T:Z', header=None...
I am reading the data and the header separately and then add the headers using df.columns = pd.MultiIndex.from_arrays(...)
.
The result (for 2024) looks like this
2024 | |||||||
---|---|---|---|---|---|---|---|
S | T | ||||||
A | B | C | A | B | C | ||
Country | |||||||
0 | Australia | -1,0 | -100,5 | -1000 | 1,0 | 100,5 | 1000 |
1 | New Zealand | -2,0 | -200,5 | -2000 | 2,0 | 200,5 | 2000 |
Here I am stuck, I tried using .stack
and .melt
to achieve the target structure, however was not able to achieve it.
Regarding the red area, you can use the following code to obtain the desired format:
table = df.T # Assuming "df" is the dataframe you gave
# Reset and eliminate the index
table.reset_index(inplace=True)
del table['index']
# Reset the dataframe to make the columns the first row
table.columns = table.iloc[0]
table = table[1:]
# Remove irrelevant columns
del table['Country']
del table['Category']
# Blank entries must be populated with the previous entry
def ReplaceBlankEntriesWithPrevious(col):
out = col.copy()
for i in range(1, len(col)+1):
out[i] = col[i] if col[i] != "" else out[i-1]
return out
table['Year'] = ReplaceBlankEntriesWithPrevious(table['Year'])
table['Option'] = ReplaceBlankEntriesWithPrevious(table['Option'])
# Set two of the three indexes
table.set_index(['Year', 'Option'], inplace=True)
# Swap the "Type" and "Country" positions
table = table.pivot(columns='Type')
table = table.stack(level=0)
table.rename_axis(index={0:"Country"}, inplace=True)
# Optionally, you can swap the index order to put it in the order of
# Country, Year, Option
table = table.swaplevel(i=0, j=2)
table = table.swaplevel(i=1, j=2)
Note that I started by transposing the table, since both "Year" and "Option" were columns. Additionally, I had to fill empty cells with their implied values. This solution isn't the most elegant, but it gets the job done. I would welcome a more succinct solution.