python pandas multilevel-analysis

Filter columns by values in a row in Pandas


I have obtained the statistics for my dataframe with df.describe() in Pandas:

    statistics = df.describe()

I want to filter the statistics dataframe based on the count row:

    main    Meas1      Meas2  Meas3  Meas4   Meas5
    sublvl  Value      Value  Value  Value   Value
    count   7.000000   1.0    1.0    582.00  97.000000
    mean    30         37.0   26.0   33.03   16.635350

I want something like this: keep only the columns whose count is greater than 30, dropping all the others, and return them in a new dataframe (or give me a list of all main labels that have count > 30).

For the above example, I want:

    main       Meas4    Meas5
    sublvl     Value    Value       
    count      582.00   97.000000       
    mean       33.03    16.635350

and the list [Meas4, Meas5].

I have tried

    thresh = statistics.columns[statistics['count']>30]

and variations thereof.


Solution

  • import pandas as pd
    
    df = pd.DataFrame.from_dict({'name':[1,2,3,4,5], 'val':[1, None,None,None,None]})
    
    df
    
       name  val
    0     1  1.0
    1     2  NaN
    2     3  NaN
    3     4  NaN
    4     5  NaN
    

    If you want to use describe(), note that it does not summarize every column: by default, only columns with numeric data types are included.

    You can filter on its count row like this:

    statistics = df.describe()
    
    # to describe all columns, including non-numeric ones, you can do this
    statistics = df.describe(include='all')
    
    # keep the columns whose count exceeds the threshold
    [column for column in statistics.columns if statistics.loc['count', column] > 3]
    # output: ['name']
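
The same filter can also be written without a Python loop. The following is a minimal sketch, assuming the example dataframe from this answer; the names `mask`, `filtered` and `selected` are only illustrative. It compares the whole count row against the threshold and uses the resulting boolean Series to select columns.

```python
import pandas as pd

df = pd.DataFrame({'name': [1, 2, 3, 4, 5], 'val': [1, None, None, None, None]})
statistics = df.describe()

# Boolean Series indexed by column label: True where count exceeds the threshold
mask = statistics.loc['count'] > 3

# Restrict the statistics to the selected columns (gives a new dataframe)
filtered = statistics.loc[:, mask]

# Plain list of the selected column labels
selected = filtered.columns.tolist()
print(selected)
```

Here `filtered` is the new dataframe the question asks for, and `selected` is the list of matching column labels.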
    

    As discussed in the comments, the columns in the question are a MultiIndex, so each column label is a tuple. To keep only the first level of each selected label, we can do this:

    # Only for MultiIndex columns: on flat string column names, column[0] would just be the first character.
    # [column[0] for column in statistics.columns if statistics.loc['count', column] > 3]
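
To make the MultiIndex case concrete, here is a hedged sketch that rebuilds the count and mean rows shown in the question by hand (the original measurement data is not available, so the frame is constructed directly from those numbers) and applies the threshold of 30. The variable names are assumptions for illustration.

```python
import pandas as pd

# Two-level column index matching the layout in the question
columns = pd.MultiIndex.from_product(
    [['Meas1', 'Meas2', 'Meas3', 'Meas4', 'Meas5'], ['Value']],
    names=['main', 'sublvl'],
)

# Only the 'count' and 'mean' rows from the question are reproduced here
statistics = pd.DataFrame(
    [[7.0, 1.0, 1.0, 582.0, 97.0],
     [30.0, 37.0, 26.0, 33.03, 16.635350]],
    index=['count', 'mean'],
    columns=columns,
)

# True for every column whose count is greater than 30
mask = statistics.loc['count'] > 30

# New dataframe with only the columns that pass the threshold
filtered = statistics.loc[:, mask]

# First-level ('main') labels of the columns that remain
mains = filtered.columns.get_level_values('main').tolist()
print(filtered)
print(mains)
```

Selecting by level name with `get_level_values('main')` avoids indexing into the label tuples by position, so it keeps working if the level order changes.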
    

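    Alternatively, work on the original dataframe: for each column, check whether its count exceeds the threshold and, if so, add it to the chosen_columns list. Note that len(df[column].value_counts()) is the number of distinct non-null values, which equals the non-null count only when no value repeats: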

    chosen_columns = []
    for column in df.columns:
        if len(df[column].value_counts()) > 3:
            chosen_columns.append(column)
    
    # chosen_columns output: ['name']
    

    Or, using count(), which returns the number of non-null entries directly:

    chosen_columns = []
    for column in df.columns:
        if df[column].count() > 3:
            chosen_columns.append(column)
    
    # chosen_columns output: ['name']
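
The loop over columns can also be collapsed into a single vectorized step. This is a small sketch under the same assumptions as the example above (same dataframe, threshold of 3); `counts` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'name': [1, 2, 3, 4, 5], 'val': [1, None, None, None, None]})

# Number of non-null entries per column, as a Series indexed by column label
counts = df.count()

# Keep the labels whose count exceeds the threshold
chosen_columns = counts[counts > 3].index.tolist()
print(chosen_columns)
```

`df[chosen_columns]` then gives the original data restricted to those columns.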