pythonpandas

Stacking/duplicating rows in a DataFrame using only a subset of columns


I have a problem I just can`t find a working solution for.

As a starting point, I have an array created from an XML file containing the following information:

| Location | Description | Status | ..... | Product| Name | Price | Product| Name | Price | ...
| 1        | "Foo"       | 1      |       |        | 124  | 4.2   |        | 122  | 8.8   |
| 2        | "Bar"       | 1      |       |        | 122  | 8.9   |        | NaN  | NaN   |

The first n columns contain generic information, which is the same for all "products" in the corresponding row. The last 24 columns contain 8 subsets of 3 columns with the same repeating set of names ["Product", "Name", "Price"].

For further handling of the data, I want to "stack" the data of the last 24 columns as subsets of three to get something like this:

| Location | Description | Status | ..... | Product| Name | Price |
| 1        | "Foo"       | 1      |       |        | 124  | 4.2   |
|          |             |        |       |        | 122  | 8.8   |
| 2        | "Bar"       | 1      |       |        | 122  | 8.9   |
|          |             |        |       |        | NaN  | NaN   |

or

| Location | Description | Status | ..... | Product| Name | Price |
| 1        | "Foo"       | 1      |       |        | 124  | 4.2   |
| 1        | "Foo"       | 1      |       |        | 122  | 8.8   |
| 2        | "Bar"       | 1      |       |        | 122  | 8.9   |
| 2        | "Bar"       | 1      |       |        | NaN  | NaN   |

I have tried to solve my transformation problem as follows:

Step 1: De-duplicate the identical column names (to later be able to use df.stack():

cols=pd.Series(result_list_xml_t[0])
for dup in cols[cols.duplicated()].unique(): 
    cols[cols[cols == dup].index.values.tolist()] = [dup + '.' + str(i) if i != 0 else dup for i in range(sum(cols == dup))]

Step 2: Create a MultiIndex to group

columns_xml = pd.MultiIndex.from_arrays([xml_groups,cols])

With this, I create a pandas df that looks like:

  | base                                    | prod1                       | prod 2
  | Location | Description | Status | ..... | Product.1| Name.1 | Price.1 | Product.2| Name.2 | Price.2 | ...
1 | 1        | "Foo"       | 1      |       |          | 124    | 4.2     |          | 122    | 8.8     |
2 | 2        | "Bar"       | 1      |       |          | 122    | 8.9     |          | NaN    | NaN     |

Now, I would like to only stack the prod-groups while maintaining the base-group. Can this be done by using the stack-function or is there another, more efficient approach?

I tried to work with df.stack because of this thread: Stack Overflow - Unnnest multi index data frame but df.stack(future_stack=True,level=0) does not seem to get me closer to my desired format.

  |            | base  | prod1 | prod2 | prod3| ...
0 | Location   | 1     |
0 | Description| "Foo" |
0 | Status     | 1     |
...
0 | Product.1  |       |       |      |      |
0 | Name.1     |       | 124   |      |      |
0 | Price.1    |       | 4.2   |      |      |
...

I also tried to go with a combination of .melt() and .set_index() proposed by the following thread Unstack or Pivot in Pandas, but that also did not get me closer.


Solution

  • Create MultiIndex by first n columns first by DataFrame.set_index, create counter in duplicated columns by GroupBy.cumcount, so possible use DataFrame.stack:

    print (df)
       Location Description  Status Product  Name  Price Product   Name  Price
    0         1         Foo       1       a   124    4.2       v  122.0    8.8
    1         2         Bar       1       b   122    8.9       b    NaN    NaN
    
    n = 3
    out = df.set_index(df.columns[:n].tolist())
    out.columns = [out.columns, out.columns.to_series().groupby(level=0).cumcount()]
    
    out = out.stack(dropna=False).droplevel(-1).reset_index()
    print (out)
       Location Description  Status Product   Name  Price
    0         1         Foo       1       a  124.0    4.2
    1         1         Foo       1       v  122.0    8.8
    2         2         Bar       1       b  122.0    8.9
    3         2         Bar       1       b    NaN    NaN