I am looking for guidance on how to check for missing values in a DataFrame that are not the typical "NaN" or "np.nan" in Python. I have a dataset/DataFrame that has a string literal "?" representing missing data. How can I identify this string as a missing value?
When I run the usual pandas commands, such as:
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])
pandas doesn't treat these cells as missing: the sum of null values is 0 for every column, and filtering for counts > 0 comes back empty.
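You can use df.replace("?", pd.NA) to encode "?" as a missing value. Note that replace returns a new DataFrame, so assign the result back (df = df.replace("?", pd.NA)). After that, isnull() and other missing-data-aware methods will treat those cells as missing.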
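import pandas as pd
data = {"x": [1, 2, "?"], "y": [3, "?", 5]}
df = pd.DataFrame(data)
# "?" is an ordinary string, so it is not counted as null
print(df.isnull().sum())
# x 0
# y 0
# replace "?" with pd.NA and assign the result back
df = df.replace("?", pd.NA)
print(df.isnull().sum())
# x 1
# y 1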
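If the data comes from a file, another option is to mark "?" as missing at load time instead of replacing it afterwards. The sketch below assumes the data is a CSV; the csv_text string is a made-up stand-in for your real file, and io.StringIO is used only so the example runs without one. The na_values parameter of pd.read_csv tells the parser which extra strings to treat as missing.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file (assumption).
csv_text = "x,y\n1,3\n2,?\n?,5\n"

# na_values adds "?" to the strings the parser treats as missing.
df = pd.read_csv(io.StringIO(csv_text), na_values="?")

# The original check now reports the missing cells.
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])

# Because "?" never enters the frame as a string, the columns
# are parsed as numeric rather than object.
print(df.dtypes)
```

A side benefit of this route is that the columns end up with a numeric dtype, whereas after df.replace on an existing DataFrame the affected columns stay as object dtype unless you convert them.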