Consider:
Question: how to filter this column in a way that only the non-numerical cells are returned?
Code example:
url = ('https://ml-repository-krakers.s3-eu-west-1.amazonaws.com/','kaggle+/churn_modelling/Telco-Customer-Churn.csv')
df = pd.read_csv(url)
df['TotalCharges'][ df['TotalCharges'] == ' ' ]
As can be seen, the dataset has spaces on a column that should be float.
Notice the question has the premise that we don't already know that there are only these non-float values. We don't know how many nor which non-float values are in there.
Your TotalCharges
column consists only of strings, some of which represent float numbers. So you need to find the rows with those values that do not represent floats.
To do this, you need to form a mask by trying to convert values to numeric. In the code below to_numeric
attempts this conversion and because of coerce
gives NaN values where this is not possible. The mask is formed by testing for these Nan values. This mask can then be used to select only these rows:
import pandas as pd
url = ('https://ml-repository-krakers.s3-eu-west-1.amazonaws.com/'
'kaggle+/churn_modelling/Telco-Customer-Churn.csv')
df = pd.read_csv(url)
mask = pd.to_numeric(df['TotalCharges'], errors = 'coerce').isna()
df2 = df[mask]
print(df2)
gives:
customerID gender SeniorCitizen ... MonthlyCharges TotalCharges Churn
488 4472-LVYGI Female 0 ... 52.55 No
753 3115-CZMZD Male 0 ... 20.25 No
936 5709-LVOEQ Female 0 ... 80.85 No
1082 4367-NUYAO Male 0 ... 25.75 No
1340 1371-DWPAZ Female 0 ... 56.05 No
3331 7644-OMVMY Male 0 ... 19.85 No
3826 3213-VVOLG Male 0 ... 25.35 No
4380 2520-SGTTA Female 0 ... 20.00 No
5218 2923-ARZLG Male 0 ... 19.70 No
6670 4075-WKNIU Female 0 ... 73.35 No
6754 2775-SEFEE Male 0 ... 61.90 No
[11 rows x 21 columns]