I have a dataframe with many columns, all of which contain text data mixed with NaNs.
I want to count the number of characters in each cell of each column and then drop any row where every cell has fewer than 5 characters (if any cell has 5 or more characters, the row is kept).
I was considering making a new column with str.len() for each column and then filtering out rows using that, but it seems very cumbersome.
Example:
>>> df
  column_1 column_2 column_3
0   werhio   dsfhjk       dh   ---> not filtered because some cells have 5 or more characters
1     sgds      fuo        g   ---> filtered
2   wqyuio  dsfjklh   fhkjfj
3      fhi        d     fgho   ---> filtered
4  sadfhkj  sdjfkhs   yyisdk
>>> df_filtered
  column_1 column_2 column_3
0   werhio   dsfhjk       dh
2   wqyuio  dsfjklh   fhkjfj
4  sadfhkj  sdjfkhs   yyisdk
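To make this reproducible, here is a minimal reconstruction of the example frame (values copied from the printout above; the real data would also contain NaNs, which are omitted here):

import pandas as pd

df = pd.DataFrame({
    'column_1': ['werhio', 'sgds', 'wqyuio', 'fhi', 'sadfhkj'],
    'column_2': ['dsfhjk', 'fuo', 'dsfjklh', 'd', 'sdjfkhs'],
    'column_3': ['dh', 'g', 'fhkjfj', 'fgho', 'yyisdk'],
})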
No need to create new columns. If you want to filter on the total number of characters per row, just apply str.len to all columns, aggregate with the row-wise sum, and use the resulting boolean Series for boolean indexing:
thresh = 10
out = df[df.apply(lambda x: x.str.len()).sum(axis=1).ge(thresh)]
Output:
  column_1 column_2 column_3
0   werhio   dsfhjk       dh
2   wqyuio  dsfjklh   fhkjfj
4  sadfhkj  sdjfkhs   yyisdk
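Regarding the NaNs mentioned in the question: str.len() returns NaN for missing values, sum(axis=1) skips NaN by default, and comparisons like ge evaluate to False for NaN, so missing cells never count toward the threshold. A minimal sketch (the NaN placement here is illustrative, not from the original data):

import numpy as np

tmp = pd.DataFrame({'a': ['werhio', np.nan], 'b': [np.nan, 'g']})
lengths = tmp.apply(lambda x: x.str.len())  # NaN stays NaN
lengths.sum(axis=1)                         # 6.0 and 1.0: NaN is skipped
lengths.sum(axis=1).ge(5)                   # True, False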
If instead you want to keep rows where at least one cell has a given minimum length (which is what the question asks for), compare to the threshold before aggregating with any:
thresh = 5
df[df.apply(lambda x: x.str.len()).ge(thresh).any(axis=1)]
Output:
  column_1 column_2 column_3
0   werhio   dsfhjk       dh
2   wqyuio  dsfjklh   fhkjfj
4  sadfhkj  sdjfkhs   yyisdk
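And if you need every cell in the row to meet the minimum length, swap any for all (a sketch under the same setup; with the example data only rows 2 and 4 survive, since row 0 contains the 2-character 'dh'):

thresh = 5
df[df.apply(lambda x: x.str.len()).ge(thresh).all(axis=1)]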
Intermediates for OPTION 1:
# df.apply(lambda x: x.str.len())
   column_1  column_2  column_3
0         6         6         2
1         4         3         1
2         6         7         6
3         3         1         4
4         7         7         6

# df.apply(lambda x: x.str.len()).sum(axis=1)
0    14
1     8
2    19
3     8
4    20
dtype: int64

# df.apply(lambda x: x.str.len()).sum(axis=1).ge(thresh)
0     True
1    False
2     True
3    False
4     True
dtype: bool
Intermediates for OPTION 2:
# df.apply(lambda x: x.str.len()).ge(thresh)
   column_1  column_2  column_3
0      True      True     False
1     False     False     False
2      True      True      True
3     False     False     False
4      True      True      True

# df.apply(lambda x: x.str.len()).ge(thresh).any(axis=1)
0     True
1    False
2     True
3    False
4     True
dtype: bool
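As a generalization (not part of the original answer), the boolean frame from OPTION 2 can also be summed to require at least k qualifying cells per row; with k = 1 this is equivalent to any(axis=1):

thresh, k = 5, 2
# keep rows where at least k cells have thresh or more characters
df[df.apply(lambda x: x.str.len()).ge(thresh).sum(axis=1).ge(k)]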