python, dataframe, replace, nan, in-place

dropna is filling the whole column with None rather than just dropping the NaN values


I'm working with a DataFrame of start and end locations, kind of like an Uber app.

I was given a DataFrame with 2 columns that have lots of missing values: some of them are real NaN and others are strings written as "nan". An example of how it is built could be this:

    import numpy as np
    import pandas as pd

    data = [
        ["Sevilla", "Granada"],
        ["nan", "Madrid"],
        ["Burgos", np.nan],
        ["Badajoz", "Valencia"],
        [np.nan, np.nan],
        [np.nan, "nan"],
        [np.nan, "Barcelona"],
    ]
    df = pd.DataFrame(
        data,
        columns=["start_station_name", "end_station_name"],
    )
    print(df)

So I need to eliminate both the "nan" strings and the real NaN values. The main problem is that when I use .dropna(inplace=True) it eliminates all the values and None appears instead; how do I keep the good ones?

What I've tried is the following.

First, I replaced all the "nan" strings (which are not real NaN values) with np.nan. Second, I dropped all the NaN values, trying to keep the city names.

    df["start_station_name"] = df["start_station_name"].replace("nan",np.nan)
    df["end_station_name"] = df["end_station_name"].replace("nan",np.nan)

    df["start_station_name"] = df["start_station_name"].dropna(inplace=True)
    df["end_station_name"] = df["end_station_name"].dropna(inplace=True)

    print(df[["start_station_name","end_station_name"]])

It only shows None (so all the values have become None).


Solution

  • You need to combine replace and dropna on the whole DataFrame, not column by column at the Series level:

    out = df.replace('nan', float('nan')).dropna()
    

    Output:

      start_station_name end_station_name
    0            Sevilla          Granada
    3            Badajoz         Valencia
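
    For reference, a self-contained sketch of the same approach on the sample data (the subset= argument is only an optional extra, useful if the real DataFrame has other columns whose NaNs should not trigger a drop):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(
        {
            "start_station_name": ["Sevilla", "nan", "Burgos", "Badajoz", np.nan, np.nan, np.nan],
            "end_station_name": ["Granada", "Madrid", np.nan, "Valencia", np.nan, "nan", "Barcelona"],
        }
    )

    # turn the 'nan' strings into real missing values, then drop incomplete rows
    out = df.replace("nan", np.nan).dropna(subset=["start_station_name", "end_station_name"])
    print(out)  # keeps rows 0 (Sevilla/Granada) and 3 (Badajoz/Valencia)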
    

    Your original approach didn't work for two reasons: dropna(inplace=True) returns None, so assigning its result back sets the whole column to None; and even without inplace=True, the shortened column is realigned on the DataFrame's index when it is assigned back, which recreates the NaNs.
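
    A quick way to see both failure modes (a minimal illustrative sketch, not part of the fix):

    import numpy as np
    import pandas as pd

    s = pd.Series(["Sevilla", np.nan, "Burgos"])
    # 1) with inplace=True, dropna returns None, so assigning the result wipes the column
    print(s.dropna(inplace=True))  # prints: None

    # 2) even without inplace, the shorter Series is realigned on the original index
    #    when assigned back, so the missing positions remain NaN
    df2 = pd.DataFrame({"start_station_name": ["Sevilla", np.nan, "Burgos"]})
    df2["start_station_name"] = df2["start_station_name"].dropna()
    print(df2)  # row 1 is still NaN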

    Note from the original answer (not needed here, but left in case it's useful to others):

    If for some reason you want to keep the existing 'nan' strings, but still treat them as missing when dropping the rows in which all values are NaN/'nan':

    out = df[df.replace('nan', float('nan')).notna().any(axis=1)]
    

    Output:

      start_station_name end_station_name
    0            Sevilla          Granada
    1                nan           Madrid  # notice the 'nan'
    2             Burgos              NaN
    3            Badajoz         Valencia
    6                NaN        Barcelona
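
    The same filter can also be written with the mask as an explicit intermediate step, which makes the intent a bit clearer (assumes numpy imported as np):

    mask = df.replace('nan', np.nan).notna().any(axis=1)  # True where the row has at least one real value
    out = df[mask]  # rows filtered, original values (including the 'nan' strings) kept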