pythonpandassplitreformatting

Renaming columns using the first part of the string in the first row of a dataframe


I imported a dataset with different column names that all start with Area, Mean, StdDev but have () with different strings inside. What I want to do is split the string in each column by the "( )" and set the first part as the column headers and the second part as a row under that header. I was looking into stacking and unstacking similar to this link or .split() and .explode() in this link but I don't know how to apply it to mine. It is slightly different. My initial dataframe looks like df1. What I am trying to do is df2 but at the end, I am trying to get a format like df3. If you can help me getting to either df2 or df3 that would be great, otherwise I can keep playing around with it after I mange to get it to look like df2.

My dataset looks like:

df1 = pd.DataFrame({
1: {'1': "Area(E10_1_nucleus)", '2': 435}, 
2: {'1': "Mean(E10_1_nucleus)", '2': 313},
3: {'1': "StdDev(E10_1_nucleus)", '2': 150}, 
4: {'1': "Area(E10_1_cytoplasm)", '2': 635},
5: {'1': "Mean(E10_1_cytoplasm)", '2': 847}, 
6: {'1': "StdDev(E10_1_cytoplasm)", '2': 321}})

I am trying to do the following:

df2= pd.DataFrame({
1: {'1': "Area", '2':'E10_1_nucleus','3': 435}, 
2: {'1': "Mean",'2':'E10_1_nucleus' ,'3': 313},
3: {'1': "StdDev",'2':'E10_1_nucleus' ,'3': 150}, 
4: {'1': "Area",'2':'E10_1_cytoplasm' ,'3': 635},
5: {'1': "Mean",'2':'E10_1_cytoplasm' ,'3': 847}, 
6: {'1': "StdDev",'2':'E10_1_cytoplasm' ,'3': 331}})

The final desired format is the following:

df3= pd.DataFrame({
'Label': {'1':'E10_1_nucleus' ,'2':'E10_1_cytoplasm' },
'Area': {'1': 435,'2':635}, 
'Mean': {'1': 313,'2':847},
'StdDev': {'1': 150,'2':331}})

Solution

  • Using pivot

    df = df.transpose()
    df = (
        df
        .assign(
            Aggregate=df["1"].str.replace(r"\(.*\)", "", regex=True),
            Label=df.pop("1").str.extract(r"\((.*?)\)", expand=True)
        )
        .pivot(index="Label", columns="Aggregate", values="2")
        .astype(int)
        .reset_index()
        .rename_axis(None, axis=1)
    )
    
    print(df)
    
                 Label Area Mean StdDev
    0  E10_1_cytoplasm  635  847    321
    1    E10_1_nucleus  435  313    150