I have a problem I just can`t find a working solution for.
As a starting point, I have an array created from an XML file containing the following information:
| Location | Description | Status | ..... | Product| Name | Price | Product| Name | Price | ...
| 1 | "Foo" | 1 | | | 124 | 4.2 | | 122 | 8.8 |
| 2 | "Bar" | 1 | | | 122 | 8.9 | | NaN | NaN |
The first n columns contain generic information, which is the same for all "products" in the corresponding row. The last 24 columns contain 8 subsets of 3 columns with the same repeating set of names ["Product", "Name", "Price"].
For further handling of the data, I want to "stack" the data of the last 24 columns as subsets of three to get something like this:
| Location | Description | Status | ..... | Product| Name | Price |
| 1 | "Foo" | 1 | | | 124 | 4.2 |
| | | | | | 122 | 8.8 |
| 2 | "Bar" | 1 | | | 122 | 8.9 |
| | | | | | NaN | NaN |
or
| Location | Description | Status | ..... | Product| Name | Price |
| 1 | "Foo" | 1 | | | 124 | 4.2 |
| 1 | "Foo" | 1 | | | 122 | 8.8 |
| 2 | "Bar" | 1 | | | 122 | 8.9 |
| 2 | "Bar" | 1 | | | NaN | NaN |
I have tried to solve my transformation problem as follows:
Step 1: De-duplicate the identical column names (to later be able to use df.stack():
cols=pd.Series(result_list_xml_t[0])
for dup in cols[cols.duplicated()].unique():
cols[cols[cols == dup].index.values.tolist()] = [dup + '.' + str(i) if i != 0 else dup for i in range(sum(cols == dup))]
Step 2: Create a MultiIndex to group
columns_xml = pd.MultiIndex.from_arrays([xml_groups,cols])
With this, I create a pandas df that looks like:
| base | prod1 | prod 2
| Location | Description | Status | ..... | Product.1| Name.1 | Price.1 | Product.2| Name.2 | Price.2 | ...
1 | 1 | "Foo" | 1 | | | 124 | 4.2 | | 122 | 8.8 |
2 | 2 | "Bar" | 1 | | | 122 | 8.9 | | NaN | NaN |
Now, I would like to only stack the prod-groups while maintaining the base-group. Can this be done by using the stack-function or is there another, more efficient approach?
I tried to work with df.stack because of this thread:
Stack Overflow - Unnnest multi index data frame
but df.stack(future_stack=True,level=0)
does not seem to get me closer to my desired format.
| | base | prod1 | prod2 | prod3| ...
0 | Location | 1 |
0 | Description| "Foo" |
0 | Status | 1 |
...
0 | Product.1 | | | | |
0 | Name.1 | | 124 | | |
0 | Price.1 | | 4.2 | | |
...
I also tried to go with a combination of .melt() and .set_index() proposed by the following thread Unstack or Pivot in Pandas, but that also did not get me closer.
Create MultiIndex
by first n
columns first by DataFrame.set_index
, create counter in duplicated columns by GroupBy.cumcount
, so possible use DataFrame.stack
:
print (df)
Location Description Status Product Name Price Product Name Price
0 1 Foo 1 a 124 4.2 v 122.0 8.8
1 2 Bar 1 b 122 8.9 b NaN NaN
n = 3
out = df.set_index(df.columns[:n].tolist())
out.columns = [out.columns, out.columns.to_series().groupby(level=0).cumcount()]
out = out.stack(dropna=False).droplevel(-1).reset_index()
print (out)
Location Description Status Product Name Price
0 1 Foo 1 a 124.0 4.2
1 1 Foo 1 v 122.0 8.8
2 2 Bar 1 b 122.0 8.9
3 2 Bar 1 b NaN NaN