I have the following dataset:
Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9 Col10 ...
Col991 Col992 Col993 Col994 Col995 Col996 Col997 Col998 Col999 Col1000
rows
Row1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Row2 0 0 0 0 0 23 0 0 0 0 ... 0 0 0 0 7 0 0 0 0 0
Row3 97 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Row4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Row5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Row496 182 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 116 0 0 0
Row497 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Row498 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Row499 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Row500 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 125 0 0 0
I am trying to remow columns where total number of nonzeros entries is less than 1% of the number of rows.
I can calculate the percentage of nonzeros entries columnwise
(df[df > 0.0].count()/df.shape[0])*100
How should I use this to get df
with those columns where number of columns have nonzeros in more than 1% of the rows only? Further, how should I change code to remove rows where nonzeros is less than 1% of columns?
You can use loc to get the specified columns or rows for your new df as shown in this answer, essentially you can do this:
df.loc[rows, cols] # accepts boolean lists/arrays
So the df with removed columns can be achieved with this:
col_condition = df[df > 0].count() / df.shape[0] >= .01
df_ = df[:, col_condition]
If you need to switch between columns and rows you can simply transpose the dataframe with
df.T
So the same for rows where number of nonzeros is less than 1% of length of columns:
row_condition = df.T[df.T > 0].count() / df.shape[1] >= .01
df_ = df[row_condition]
And in a bit more concise forms:
df_ = df.loc[:, df.gt(0).mean() >= .01] # keep columns
df_ = df[df.T.gt(0).mean() >= .01] # keep rows