pythonpandasdataframe

combine two dataframes element-wise as long format


I have two dfs,

df1

       ARHGEF10L     HIF3A     RNF17     RNF10     RNF11
NCBP1        NaN -0.432931       NaN -0.233554  0.165081
NCBP2   0.184332 -0.077655  0.331873 -0.449421  0.153836
RPL37        NaN       NaN  0.192629       NaN -0.089123
DHX9   -0.115242 -0.133209 -0.207657 -0.267636  0.363868
TCOF1        NaN       NaN  0.084838  0.140575 -0.122832

df2:

       ARHGEF10L     HIF3A     RNF17     RNF10     RNF11
NCBP1        NaN  0.000067       NaN  0.038310       NaN
NCBP2        NaN       NaN  0.002809  0.000033       NaN
RPL37        NaN       NaN       NaN       NaN       NaN
DHX9         NaN       NaN       NaN  0.017100  0.000979
TCOF1        NaN       NaN       NaN       NaN       NaN

Now I want to create a new df that colude 4 columns: gene1(row index of both df1 and df2), gene2(column index of both df1 and df2), value1(df1 value) and value2(df2 value)

So it would be like 25*4

gene1    gene2      value1     value2
NCBP1   ARHGEF10L   NaN        Nan
NCBP1   HIF3A      -0.432931   0.000067
NCBP1   RNF17      NaN         NaN 
NCBP1   RNF10      -0.233554   0.038310
NCBP1   RNF11      0.165081    NaN
NCBP2   ARHGEF10L   0.184332   Nan
NCBP2   HIF3A      -0.077655   NaN
NCBP2   RNF17      0.331873    0.002809 
....
TCOF1   ARHGEF10L    NaN       NaN 
TCOF1   HIF3A        NaN       NaN 
...

I have this fake code but not sure for the middle part

def coef_fdr_table(df1,df2):
    column_names = ['gene1','gene2', 'value1', 'value2']
    df = pd.DataFrame(columns = column_names)
    for i in range(25):
            df.iloc[i,2] = df1[...]
            df.iloc[i,3] = df2[...]
    df.set_index(['gene1'],inplace = True)
    return(df)

Any suggestion is appreciated!


Solution

  • You can just concat the two dataframes with a MultiIndex and reshape:

    (pd.concat({'value1': df1,
                'value2': df2,
               }, axis=1)
       .stack(1, dropna=False)           # value identifiers to rows
       .rename_axis(['gene1', 'gene2'])  # set future column names
       .reset_index()                    # index to columns
    )
    

    output:

        gene1      gene2    value1    value2
    0   NCBP1  ARHGEF10L       NaN       NaN
    1   NCBP1      HIF3A -0.432931  0.000067
    2   NCBP1      RNF10 -0.233554  0.038310
    3   NCBP1      RNF11  0.165081       NaN
    4   NCBP1      RNF17       NaN       NaN
    5   NCBP2  ARHGEF10L  0.184332       NaN
    ...