pythonpandasdataframenumpynumpy-ndarray

How to subtract a dataframe with repeating sample IDs by a ref data set


I have the following reference dataframe (ref_df) and a dataframe_1 (df1)

ref_df
| Sample_Name | Value |
| --------    | -------- |
| CTRL        | 18.17    |
| E-1         | 19.06    |
| FIB         | 17.78    |
| 10nM Benz   | 17.89    |

df1 
| Sample_Name | Value |
| --------    | -------- |
| CTRL        | 27.49    |
| E-1         | 28.89    |
| FIB         | 24.81    |
| 10nM Benz   | 24.24    |
| CTRL        | 27.36    |
| E-1         | 28.87    |
| FIB         | 24.66    |
| 10nM Benz   | 24.32    |```

I would like to take make the following happen: df1 - ref_df to obtain the df_result dataframe. 
Notice how each Sample id is subtracted by it corresponding id in the ref_df. 
The problem here seems to be the difference in size of the dfs and making sure 
each value is subtracted by the correct corresponding ref_df ID.


``` df_result 
| Sample_Name | Value |
| --------    | -------- |
| CTRL        | 9.32    |
| E-1         | 9.83    |
| FIB         | 7.03    |
| 10nM Benz   | 6.35    |
| CTRL        | 9.19    |
| E-1         | 9.81    |
| FIB         | 6.88    |
| 10nM Benz   | 6.43    | 

How do I subtract the ref_df from df1 while making sure the correct things get subtracted according to Sample_name. On top of that, I'm unsure how to address the mismatch in the number of values (since ref_df has 4 values and df1 has 8). I'd like to keep the duplicate values as they are replicates and will need them for additional arithmetic later on.

I tried duplicating the ref_df to account for the mismatch in shape between the two dataframes, however that results in subtracting taking place without taking into account Sample_Name


Solution

  • Try:

    df_result = (
        df1.set_index("Sample_Name")["Value"] - ref_df.set_index("Sample_Name")["Value"]
    ).reset_index()
    

    Prints:

      Sample_Name  Value
    0   10nM Benz   6.35
    1   10nM Benz   6.43
    2        CTRL   9.32
    3        CTRL   9.19
    4         E-1   9.83
    5         E-1   9.81
    6         FIB   7.03
    7         FIB   6.88
    

    OR: Put result into df1:

    df1["Result"] = df1["Value"] - df1["Sample_Name"].map(
        ref_df.set_index("Sample_Name")["Value"]
    )
    print(df1)
    

    Prints:

      Sample_Name  Value  Result
    0        CTRL  27.49    9.32
    1         E-1  28.89    9.83
    2         FIB  24.81    7.03
    3   10nM Benz  24.24    6.35
    4        CTRL  27.36    9.19
    5         E-1  28.87    9.81
    6         FIB  24.66    6.88
    7   10nM Benz  24.32    6.43