pythonpandasdataframedata-analysisdata-filtering

How do I remove columns in Pandas that contains non-zero in less than 1% of number of rows?


I have the following dataset:

    Col1    Col2    Col3    Col4    Col5    Col6    Col7    Col8    Col9    Col10   ... 

Col991  Col992  Col993  Col994  Col995  Col996  Col997  Col998  Col999  Col1000
rows                                                                                    
Row1    0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
Row2    0   0   0   0   0   23  0   0   0   0   ... 0   0   0   0   7   0   0   0   0   0
Row3    97  0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
Row4    0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
Row5    0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Row496  182 0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   116 0   0   0
Row497  0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
Row498  0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
Row499  0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
Row500  0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   125 0   0   0

I am trying to remow columns where total number of nonzeros entries is less than 1% of the number of rows.

I can calculate the percentage of nonzeros entries columnwise

(df[df > 0.0].count()/df.shape[0])*100

How should I use this to get df with those columns where number of columns have nonzeros in more than 1% of the rows only? Further, how should I change code to remove rows where nonzeros is less than 1% of columns?


Solution

  • You can use loc to get the specified columns or rows for your new df as shown in this answer, essentially you can do this:

    df.loc[rows, cols]  # accepts boolean lists/arrays
    

    So the df with removed columns can be achieved with this:

    col_condition = df[df > 0].count() / df.shape[0] >= .01
    df_ = df[:, col_condition]
    

    If you need to switch between columns and rows you can simply transpose the dataframe with

    df.T
    

    So the same for rows where number of nonzeros is less than 1% of length of columns:

    row_condition = df.T[df.T > 0].count() / df.shape[1] >= .01
    df_ = df[row_condition]
    

    And in a bit more concise forms:

    df_ = df.loc[:, df.gt(0).mean() >= .01]  # keep columns
    df_ = df[df.T.gt(0).mean() >= .01]  # keep rows