pythonpandas

DataFrame column: show only values that are not numerical


Consider:

  1. We have one Pandas DataFrame.
  2. This DataFrame has one column that should be filled only with floats.
  3. However, the datatype is "object", meaning there is at least one non-numeric value in there.
  4. I want to see this/these non-numerical value(s) so I can decide how to proceed.

Question: how to filter this column in a way that only the non-numerical cells are returned?

Code example:

url = ('https://ml-repository-krakers.s3-eu-west-1.amazonaws.com/','kaggle+/churn_modelling/Telco-Customer-Churn.csv')
df = pd.read_csv(url)
df['TotalCharges'][ df['TotalCharges'] == ' ' ]

As can be seen, the dataset has spaces on a column that should be float.

Notice the question has the premise that we don't already know that there are only these non-float values. We don't know how many nor which non-float values are in there.


Solution

  • Your TotalCharges column consists only of strings, some of which represent float numbers. So you need to find the rows with those values that do not represent floats.

    To do this, you need to form a mask by trying to convert values to numeric. In the code below to_numeric attempts this conversion and because of coerce gives NaN values where this is not possible. The mask is formed by testing for these Nan values. This mask can then be used to select only these rows:

    import pandas as pd
    
    url = ('https://ml-repository-krakers.s3-eu-west-1.amazonaws.com/'
            'kaggle+/churn_modelling/Telco-Customer-Churn.csv')
    df = pd.read_csv(url)
    
    
    mask = pd.to_numeric(df['TotalCharges'], errors = 'coerce').isna()
    df2 = df[mask]
    
    print(df2)
    

    gives:

          customerID  gender  SeniorCitizen  ... MonthlyCharges TotalCharges  Churn
    488   4472-LVYGI  Female              0  ...          52.55                  No
    753   3115-CZMZD    Male              0  ...          20.25                  No
    936   5709-LVOEQ  Female              0  ...          80.85                  No
    1082  4367-NUYAO    Male              0  ...          25.75                  No
    1340  1371-DWPAZ  Female              0  ...          56.05                  No
    3331  7644-OMVMY    Male              0  ...          19.85                  No
    3826  3213-VVOLG    Male              0  ...          25.35                  No
    4380  2520-SGTTA  Female              0  ...          20.00                  No
    5218  2923-ARZLG    Male              0  ...          19.70                  No
    6670  4075-WKNIU  Female              0  ...          73.35                  No
    6754  2775-SEFEE    Male              0  ...          61.90                  No
    
    [11 rows x 21 columns]