python dataframe group-by

Python: Only 2 unique column names in dataframe, 3105 columns total. How to get average of row, grouped by unique column name


My dataframe is in the linked image. To keep it simple, it currently looks something like this:

Gene    Cell_A  Cell_B  Cell_B  Cell_B  Cell_A
Gene_A  0       4       35.5    4.5     3.5
Gene_B  1.3     52      3.4     2.4     0
Gene_C  2.3     3.3     32      0       2

There are 3105 columns in total, all named either Cell_A or Cell_B, and around 13k (I think?) rows of genes. What I want is the average value per gene (row), grouped by unique column name. In the end I would have just 2 columns, Cell_A and Cell_B, each holding the per-gene (per-row) average.

I expect it has something to do with either agg or groupby, but I have no idea where to even start. If you can offer some guidance I would be very grateful!


Solution

  • You are right: you want to group the columns by name and take the mean of each group.

    First, move the Gene column into the index so it is kept as the row label and excluded from the averaging:

    df = df.set_index(['Gene']) 
    

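    Then group along the column axis by column name and take the mean:

    df.groupby(by=df.columns, axis=1).mean()

    Note that axis=1 in groupby is deprecated in recent pandas releases (2.1 and later, as far as I know). There, the equivalent is to transpose, group by the index labels, take the mean, and transpose back.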
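
To try this on the small sample from the question, here is a minimal sketch. It uses the transpose form (`df.T.groupby(level=0).mean().T`) rather than `axis=1`, on the assumption that you may be running a newer pandas where `axis=1` grouping is deprecated. The variable names `data`, `columns` and `means` are only for illustration.

```python
import pandas as pd

# Sample values copied from the question; note the repeated column names.
columns = ["Gene", "Cell_A", "Cell_B", "Cell_B", "Cell_B", "Cell_A"]
data = [
    ["Gene_A", 0.0, 4.0, 35.5, 4.5, 3.5],
    ["Gene_B", 1.3, 52.0, 3.4, 2.4, 0.0],
    ["Gene_C", 2.3, 3.3, 32.0, 0.0, 2.0],
]

# Keep the gene names as the row index so they are not averaged.
df = pd.DataFrame(data, columns=columns).set_index("Gene")

# Transpose so the duplicated column names become index labels,
# average the rows that share a label, then transpose back.
means = df.T.groupby(level=0).mean().T

print(means)
```

The result has one row per gene and one column per unique name, so for Gene_A the Cell_A value is the mean of 0 and 3.5, i.e. 1.75. The same line should work unchanged on the full 3105-column frame, since it only relies on the column labels.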