python pandas null missing-data

Python - How to check for missing values not represented by NaN?


I am looking for guidance on how to check for missing values in a DataFrame that are not the typical "NaN" or "np.nan" in Python. I have a dataset/DataFrame that has a string literal "?" representing missing data. How can I identify this string as a missing value?

When I run usual commands using Pandas like:

missing_values = df.isnull().sum()

print(missing_values[missing_values > 0])

pandas doesn't treat these cells as missing: the null count is 0 for every column, so filtering for counts > 0 prints nothing.


Solution

  • You can use df.replace("?", pd.NA) to encode "?" as a proper missing value, so that isnull() and other missing-data operations handle those cells correctly. Note that replace returns a new DataFrame, so assign the result back.

    import pandas as pd
    
    data = {"x": [1, 2, "?"], "y": [3, "?", 5]}
    df = pd.DataFrame(data)
    
    print(df.isnull().sum())
    # x    0
    # y    0
    
    df = df.replace("?", pd.NA)
    print(df.isnull().sum())
    # x    1
    # y    1
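
  • If the data comes from a file, it may be simpler to deal with "?" at load time. The sketch below is only an illustration: the inline CSV text and the names raw, as_text, and placeholder_counts are made up for the example and stand in for your real file. It first counts the "?" cells directly, then lets read_csv treat "?" as missing via na_values, and finally shows pd.to_numeric with errors="coerce" for a DataFrame that is already in memory.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the real file.
raw = "x,y\n1,3\n2,?\n?,5\n"

# Read everything as text so the "?" placeholder can be counted directly.
as_text = pd.read_csv(io.StringIO(raw), dtype=str)
placeholder_counts = as_text.eq("?").sum()
print(placeholder_counts)

# Tell the parser that "?" means missing; the columns then parse as numbers.
df = pd.read_csv(io.StringIO(raw), na_values="?")
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])
print(df.dtypes)

# For a DataFrame that already exists, coerce a column to numeric.
# Caution: this turns ANY non-numeric value into NaN, not only "?".
existing = pd.DataFrame({"x": [1, 2, "?"]})
existing["x"] = pd.to_numeric(existing["x"], errors="coerce")
print(existing["x"].isnull().sum())
```

    With na_values the columns end up numeric rather than object, which df.replace alone does not do, so later arithmetic and aggregation should behave as expected.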