I have obtained the statistics for my dataframe by df.describe() in Pandas.
statistics = df.describe()
I want to filter the statistics dataframe based on count:
main       Meas1  Meas2  Meas3   Meas4      Meas5
sublvl     Value  Value  Value   Value      Value
count   7.000000    1.0    1.0  582.00  97.000000
mean          30   37.0   26.0   33.03  16.635350
I want to filter out every column whose count is less than 30 and keep only the columns with count > 30 in a new dataframe (or get a list of all main labels that have count > 30).
For the above example, I want:
main     Meas4      Meas5
sublvl   Value      Value
count   582.00  97.000000
mean     33.03  16.635350
and [Meas4, Meas5]
I have tried
thresh = statistics.columns[statistics['count']>30]
And variations thereof.
import pandas as pd
df = pd.DataFrame.from_dict({'name':[1,2,3,4,5], 'val':[1, None,None,None,None]})
df
   name  val
0     1  1.0
1     2  NaN
2     3  NaN
3     4  NaN
4     5  NaN
If you want to use describe(), note that it does not return all columns: by default, only columns with numeric data types are included.
You can do it this way:
statistics = df.describe()
# to describe all columns you can do this
statistics = df.describe(include = 'all')
[column for column in statistics.columns if statistics.loc['count'][column] > 3]
# output ['name']
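The same selection can also be written with a boolean mask instead of a list comprehension, which yields both the filtered statistics dataframe and the list of labels. The attempt in the question, statistics['count'], looks for a column named count, but count is a row label of the describe() output, so it has to be reached with .loc['count']. This is a minimal sketch: the data and the threshold of 3 are made up for illustration (the question uses 30).

```python
import pandas as pd

# Hypothetical data: column "a" has 5 non-null values, column "b" has 1.
df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [1.0, None, None, None, None]})
statistics = df.describe()

threshold = 3  # illustrative; the question uses 30

# Boolean Series indexed by column label: True where count exceeds the threshold.
mask = statistics.loc["count"] > threshold

filtered = statistics.loc[:, mask]        # statistics restricted to those columns
kept = statistics.columns[mask].tolist()  # the column labels themselves
print(kept)
```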
As discussed in the comments, the columns in the question form a MultiIndex. To keep only the first level of each column label, we can do this:
# Only valid for MultiIndex columns: on flat string labels, column[0] would return the first character of the label.
# [column[0] for column in statistics.columns if statistics.loc['count'][column] > 3]
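For the two-level layout shown in the question, a mask-based sketch might look like the following. The dataframe here is invented to mimic the main/sublvl structure (only the label names Meas1, Meas4, Meas5 and Value come from the question); the counts are arbitrary values chosen so that two columns pass the threshold of 30.

```python
import numpy as np
import pandas as pd

# Hypothetical two-level columns mimicking the question's main/sublvl layout.
columns = pd.MultiIndex.from_tuples(
    [("Meas1", "Value"), ("Meas4", "Value"), ("Meas5", "Value")],
    names=["main", "sublvl"],
)
data = np.full((40, 3), np.nan)
data[:7, 0] = 30.0   # Meas1: 7 non-null values
data[:, 1] = 33.0    # Meas4: 40 non-null values
data[:35, 2] = 16.0  # Meas5: 35 non-null values
df = pd.DataFrame(data, columns=columns)

statistics = df.describe()
mask = statistics.loc["count"] > 30  # one boolean per (main, sublvl) column

filtered = statistics.loc[:, mask]
# Level 0 of the column MultiIndex holds the "main" labels.
main_names = filtered.columns.get_level_values(0).unique().tolist()
print(main_names)
```

Calling unique() guards against a main label appearing more than once when it has several sublevels.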
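Alternatively, without describe(): for each column, check whether its count is greater than the threshold and add it to the chosen_columns list:
chosen_columns = []
for column in df.columns:
    # Note: len(value_counts()) is the number of distinct non-null values,
    # which equals the non-null count only when the column has no duplicates.
    if len(df[column].value_counts()) > 3:
        chosen_columns.append(column)
# chosen_columns output: ['name']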
OR:
chosen_columns = []
for column in df.columns:
    if df[column].count() > 3:
        chosen_columns.append(column)
# chosen_columns output: ['name']
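The loop above can also be collapsed into a vectorized form, since DataFrame.count() returns the non-null count of every column at once. A short sketch using the same example dataframe:

```python
import pandas as pd

df = pd.DataFrame({"name": [1, 2, 3, 4, 5], "val": [1, None, None, None, None]})

counts = df.count()  # Series of non-null counts, indexed by column label
chosen_columns = counts[counts > 3].index.tolist()
print(chosen_columns)
```

Unlike describe() with its defaults, this covers non-numeric columns as well.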