python pandas dataframe

Pandas Dataframe: add columns based on list of samples and column headers


I want to add columns to my DataFrame whose values are looked up from a list of sample names stored in one column, where each sample name is also the header of another column. In detail: for the 11 column, I want to add three columns named 11_1, 11_2 and 11_3 that hold, for each row, the values of the samples listed in 11 (in list order), and then do the same for the 00 column.

A small part of my input data:

import numpy as np
import pandas as pd

df_matrix_data = {'11': [['P4-1', 'P4-2', 'P4-3'], ['P4-1', 'P4-3', 'P4-4']],
                  '00': [['P4-4', 'P4-6', 'P4-7',], ['P4-2', 'P4-5', 'P4-7']],
                  'P4-1': [1, 2], 'P4-2': [6, 8], 'P4-3': [5, 2], 'P4-4': [2, 3], 'P4-5': [np.nan, 2], 'P4-6': [6, np.nan],
                  'P4-7': [3, 2]}
df_matrix = pd.DataFrame.from_dict(df_matrix_data)

which looks like this:

                   11                  00  P4-1  P4-2  P4-3  P4-4  P4-5  P4-6  P4-7
0  [P4-1, P4-2, P4-3]  [P4-4, P4-6, P4-7]     1     6     5     2   NaN   6.0     3
1  [P4-1, P4-3, P4-4]  [P4-2, P4-5, P4-7]     2     8     2     3   2.0   NaN     2

and the desired output should look like this:

                   11                  00  P4-1  P4-2  P4-3  P4-4  P4-5  P4-6  P4-7  11_1  11_2  11_3  00_1  00_2  00_3
0  [P4-1, P4-2, P4-3]  [P4-4, P4-6, P4-7]     1     6     5     2   NaN   6.0     3     1     6     5     2     6     3
1  [P4-1, P4-3, P4-4]  [P4-2, P4-5, P4-7]     2     8     2     3   2.0   NaN     2     2     2     3     8     2     2

Any ideas on how to do this?


Solution

  • A possible solution:

    df_matrix.assign(
        **{f"{k}_{i+1}": df_matrix.apply(
            lambda row: row[row[k][i]], axis=1) 
           for k in ['11', '00'] for i in range(3)})
    

    It uses a dictionary comprehension inside assign, iterating over each list column (e.g., '11') and each list position (0 to 2). For each pair it creates a column such as 11_1 by taking the sample name at that position (e.g., row['11'][0]) and looking up the value stored under that name in the same row.


    To avoid the row-wise apply, which is slow on large frames:

    df_matrix.assign(
        **{f"{k}_{i+1}": df_matrix.values[
        np.arange(len(df_matrix)), 
        df_matrix.columns.get_indexer(df_matrix[k].str[i])]
           for k in ['11', '00'] for i in range(3)})
    

    It uses Index.get_indexer to convert the sample names to column positions, then picks one value per row from df_matrix.values with NumPy integer-array indexing.

    Output:

                       11                  00  P4-1  P4-2  P4-3  P4-4  P4-5  P4-6  \
    0  [P4-1, P4-2, P4-3]  [P4-4, P4-6, P4-7]     1     6     5     2   NaN   6.0   
    1  [P4-1, P4-3, P4-4]  [P4-2, P4-5, P4-7]     2     8     2     3   2.0   NaN   
    
       P4-7  11_1  11_2  11_3  00_1  00_2  00_3  
    0     3     1     6     5     2   6.0     3  
    1     2     2     2     3     8   2.0     2
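
To make the vectorized lookup easier to follow, here is a minimal sketch that unrolls a single step of it (the first sample of column 11) on the question's data. The intermediate names first_names, col_pos, row_pos and values are only illustrative and are not part of the answer above. The guard against -1 is an added assumption about what you would want: get_indexer returns -1 for a name that is not a column, and NumPy would then silently read the last column.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question.
df_matrix = pd.DataFrame({
    '11': [['P4-1', 'P4-2', 'P4-3'], ['P4-1', 'P4-3', 'P4-4']],
    '00': [['P4-4', 'P4-6', 'P4-7'], ['P4-2', 'P4-5', 'P4-7']],
    'P4-1': [1, 2], 'P4-2': [6, 8], 'P4-3': [5, 2], 'P4-4': [2, 3],
    'P4-5': [np.nan, 2], 'P4-6': [6, np.nan], 'P4-7': [3, 2],
})

# Step 1: take the first sample name from each row's list in column '11'.
first_names = df_matrix['11'].str[0]

# Step 2: translate those names into integer column positions.
col_pos = df_matrix.columns.get_indexer(first_names)

# get_indexer marks unknown names with -1; fail loudly instead of
# letting NumPy interpret -1 as "last column".
if (col_pos < 0).any():
    raise KeyError("some sample names are not columns of df_matrix")

# Step 3: pair every row position with its column position and pick
# one cell per row from the underlying array.
row_pos = np.arange(len(df_matrix))
values = df_matrix.to_numpy()[row_pos, col_pos]

print(col_pos)
print(values)
```

Because the frame mixes lists and numbers, the underlying array has object dtype, so columns built this way are object columns as well. If numeric dtypes matter, calling infer_objects() on the result is one way to restore them.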