[SOLVED] Python - How to check for missing values not represented by NaN?

I am looking for guidance on how to detect missing values in a pandas DataFrame when they are not encoded as NaN (np.nan). My dataset uses the string literal "?" to represent missing data. How can I get pandas to treat this string as a missing value?

When I run the usual pandas commands, such as:

missing_values = df.isnull().sum()

print(missing_values[missing_values > 0])

pandas doesn't treat these cells as missing: isnull() reports 0 nulls for every column, so the missing_values > 0 filter matches nothing and the print shows only an empty Series.

Solution

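You can use df.replace("?", pd.NA) to encode "?" as a missing value. Once replaced, those cells are recognized by isnull() and by the other missing-data methods such as dropna() and fillna(). Note that replace returns a new DataFrame, so assign the result back. Columns that mixed numbers with "?" will typically still have the object dtype afterwards, so convert them (for example with pd.to_numeric) if you need numeric operations.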

import pandas as pd

data = {"x": [1, 2, "?"], "y": [3, "?", 5]}
df = pd.DataFrame(data)

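# Before the replacement, "?" is an ordinary string, so nothing counts as null.
print(df.isnull().sum())
# x    0
# y    0

# Replace the placeholder with pd.NA so pandas recognizes it as missing.
df = df.replace("?", pd.NA)
print(df.isnull().sum())
# x    1
# y    1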
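
Two related approaches may also be useful, depending on where the data comes from. The sketch below is not part of the original answer. It assumes the data is loaded from a CSV file (the csv_text string is a made-up stand-in for that file) and that the affected columns are meant to be numeric. The first option declares "?" as a missing-value marker at load time via the na_values parameter of pd.read_csv. The second counts the placeholder directly in an existing DataFrame and then coerces each column to a numeric dtype with pd.to_numeric, which turns anything unparseable into NaN.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file.
csv_text = "x,y\n1,3\n2,?\n?,5\n"

# Option 1: declare "?" as a missing-value marker when loading the data.
df_loaded = pd.read_csv(io.StringIO(csv_text), na_values=["?"])
print(df_loaded.isnull().sum())

# Option 2: for a DataFrame that already exists, count the placeholder
# directly, then coerce each column to numeric so "?" becomes NaN.
df_raw = pd.DataFrame({"x": [1, 2, "?"], "y": [3, "?", 5]})
placeholder_counts = (df_raw == "?").sum()
df_numeric = df_raw.apply(pd.to_numeric, errors="coerce")
print(placeholder_counts)
print(df_numeric.dtypes)
```

Be aware that errors="coerce" converts every non-numeric value to NaN, not just "?", so it is only appropriate when the column should contain nothing but numbers.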