pythonpandasstringdataframe

Count number of characters in each row and drop if all are below a certain number


I have a dataframe with many columns, all of which contain text data mixed with NaNs.

I want to count the number of characters in each cell in each column and then drop any rows where all the columns have less than 5 characters (if any cells have more than 5 characters, then the row is not dropped).

I was considering making a new column with str.len for each column and then filter out rows using that, but it seems very cumbersome.

Example:

>>> df
   column_1    column_2    column_3
0   werhi      dsfhjk       dh      ---> not filtered because some columns have more than 5 characters
1   sgds        fuo         g       ---> filtered
2   wqyuio      dsklh       fhkjfj
3   fhi         d           fgho    ---> filtered
4   sadfhkj     sdjfkhs     yyisdk

>>> df_filtered
   column_1    column_2    column_3
0   werhio      dsfhjk      dh
2   wqyuio      dsfjklh     fhkjfj
4   sadfhkj     sdjfkhs     yyisdk

Solution

  • No need to create new columns.

    OPTION 1: minimal total string length per row

    just apply to all columns, aggregate the total sum and create a boolean Series for boolean indexing:

    thresh = 10
    out = df[df.apply(lambda x: x.str.len()).sum(axis=1).ge(thresh)]
    

    Output:

      column_1 column_2 column_3
    0   werhio   dsfhjk       dh
    2   wqyuio  dsfjklh   fhkjfj
    4  sadfhkj  sdjfkhs   yyisdk
    

    OPTION 2: minimal string length for any strings in a row

    If you want to ensure all values in a column have at least a given size, compare to the threshold before aggregating with any:

    thresh = 5
    df[df.apply(lambda x: x.str.len()).ge(thresh).any(axis=1)]
    

    Output:

      column_1 column_2 column_3
    0   werhio   dsfhjk       dh
    2   wqyuio  dsfjklh   fhkjfj
    4  sadfhkj  sdjfkhs   yyisdk
    

    Intermediates

    OPTION 1:

    # df.apply(lambda x: x.str.len())
       column_1  column_2  column_3
    0         6         6         2
    1         4         3         1
    2         6         7         6
    3         3         1         4
    4         7         7         6
    
    # df.apply(lambda x: x.str.len()).sum(axis=1)
    0    14
    1     8
    2    19
    3     8
    4    20
    dtype: int64
    
    # df.apply(lambda x: x.str.len()).sum(axis=1).ge(thresh)
    0     True
    1    False
    2     True
    3    False
    4     True
    dtype: bool
    

    OPTION 2:

    # df.apply(lambda x: x.str.len().ge(5))
       column_1  column_2  column_3
    0      True      True     False
    1     False     False     False
    2      True      True      True
    3     False     False     False
    4      True      True      True
    
    # df.apply(lambda x: x.str.len().ge(5)).any(axis=1)
    0     True
    1    False
    2     True
    3    False
    4     True
    dtype: bool