pythonpandascountmultiple-columns

Pandas count number of specific columns that match certain conditions


I am working on figuring out how to count number of specific columns that match certain conditions and filtering any above 1. I am thinking it would require adding an additional column that includes the count(), and then a separate step of seeing if the count() is greater than 1. Finally, removing the unnecessary count() column.

For the example below, I am interested in only colB, colC, colD, and colE greater than 0.

The example attempted code I have tried looks like:

#Step a
filtData['countCol'] = filtData[(filtData['colB']>0) & (filtData['colC']>0) &
        (filtData['colD']>0) & (filtData['colE']>0)].count()
#Step b
filtData['countCol'] = filtData[filtData['countCol'] > 1]

#Step c
filtData = filtData.drop(columns=['countCol'])

Input:

    colA    colB    colC    colD    colE    colF
0   1105    0.00    867     3.4     0.00    text1
1   1106    3       3.22    1       3       text2
2   1107    0.5     0       0       1       text3
3   1110    0       23      0       0       text4
4   1019    9       0.0     2       0       text5
5   1267    0       0.0     0       2       text6

Output Step a:

    colA    colB    colC    colD    colE    colF    countCol
0   1105    0.00    867     3.4     0.00    text1   2
1   1106    3       3.22    1       3       text2   4
2   1107    0.5     0       0       1       text3   2
3   1110    0       23      0       0       text4   1
4   1019    9       0.0     2       0       text5   2
5   1267    0       0.0     0       2       text6   1

Output Step b:

    colA    colB    colC    colD    colE    colF    countCol
0   1105    0.00    867     3.4     0.00    text1   2
1   1106    3       3.22    1       3       text2   4
2   1107    0.5     0       0       1       text3   2
4   1019    9       0.0     2       0       text5   2

Output Step c:

    colA    colB    colC    colD    colE    colF
0   1105    0.00    867     3.4     0.00    text1
1   1106    3       3.22    1       3       text2
2   1107    0.5     0       0       1       text3
4   1019    9       0.0     2       0       text5

If there is a way to perform this in one step that is elegant (isn't too advanced of an expression to understand), that would be ideal. I am still learning pandas, so maybe performing the filtering I am looking for would need to be broken in the three sub-steps shown.


Solution

  • You can filter columns by list, comapre by 0 abd count Trues by sum, last filter rows greater like 1 in boolean indexing:

    out = filtData[(filtData[['colB','colC','colD','colE']]>0).sum(axis=1) > 1]
    

    Or is possible use DataFrame.gt for greater:

    out = filtData[filtData[['colB','colC','colD','colE']].gt(0).sum(axis=1).gt(1)]
    

    print (out)
       colA  colB    colC  colD  colE   colF
    0  1105   0.0  867.00   3.4   0.0  text1
    1  1106   3.0    3.22   1.0   3.0  text2
    2  1107   0.5    0.00   0.0   1.0  text3
    4  1019   9.0    0.00   2.0   0.0  text5
    

    How it working:

    print (filtData[['colB','colC','colD','colE']].gt(0))
        colB   colC   colD   colE
    0  False   True   True  False
    1   True   True   True   True
    2   True  False  False   True
    3  False   True  False  False
    4   True  False   True  False
    5  False  False  False   True
    
    print (filtData[['colB','colC','colD','colE']].gt(0).sum(axis=1))
    0    2
    1    4
    2    2
    3    1
    4    2
    5    1
    dtype: int64
    
    print (filtData[['colB','colC','colD','colE']].gt(0).sum(axis=1).gt(1))
    0     True
    1     True
    2     True
    3    False
    4     True
    5    False
    dtype: bool