
Fill NA values of a column with elements of another column


I have a DataFrame that looks like this:

    A   B   
0   0.0 2.0 
1   3.0 4.0 
2   NaN 1.0 
3   2.0 NaN 
4   NaN 1.0 
5   4.8 NaN 
6   NaN 1.0 

I want to apply this line of code: df['A'] = df['B'].fillna(df['A'])

I expect an intermediate step and a final output like these:

    A   B   
0   2.0 2.0 
1   4.0 4.0 
2   1.0 1.0 
3   NaN NaN 
4   1.0 1.0 
5   NaN NaN 
6   1.0 1.0 

    A   B   
0   2.0 2.0 
1   4.0 4.0 
2   1.0 1.0 
3   2.0 NaN 
4   1.0 1.0 
5   4.8 NaN 
6   1.0 1.0 

However, I receive this error:

TypeError: Unsupported type Series

This is probably because, each time there is an NA, it tries to fill it with the whole Series rather than with the single element that shares the NA's index.

I receive the same error with df['C'] = df['B'].fillna(df['A']), so the problem does not seem to be that I first overwrite the values of A with those of B and then try to fill the NAs of B from a column that is technically the same as B.

I'm in a Databricks environment and I'm working with Koalas DataFrames, which I expected to behave like pandas ones. Can you help me?


Solution

  • Another option

    Suppose the following dataset

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame(data={'State':[1,2,3,4,5,6, 7, 8, 9, 10], 
                             'Sno Center': ["Guntur", "Nellore", "Visakhapatnam", "Biswanath", "Doom-Dooma", "Guntur", "Labac-Silchar", "Numaligarh", "Sibsagar", "Munger-Jamalpu"], 
                             'Mar-21': [121, 118.8, 131.6, 123.7, 127.8, 125.9, 114.2, 114.2, 117.7, 117.7],
                             'Apr-21': [121.1, 118.3, 131.5, np.nan, 128.2, 128.2, 115.4, 115.1, np.nan, 118.3]})
    
    df
        State  Sno Center      Mar-21  Apr-21
    0   1      Guntur          121.0   121.1
    1   2      Nellore         118.8   118.3
    2   3      Visakhapatnam   131.6   131.5
    3   4      Biswanath       123.7   NaN
    4   5      Doom-Dooma      127.8   128.2
    5   6      Guntur          125.9   128.2
    6   7      Labac-Silchar   114.2   115.4
    7   8      Numaligarh      114.2   115.1
    8   9      Sibsagar        117.7   NaN
    9   10     Munger-Jamalpu  117.7   118.3
    

    Then

    df.loc[(df["Mar-21"].notnull()) & (df["Apr-21"].isna()), "Apr-21"] = df["Mar-21"]
    
    df
        State  Sno Center      Mar-21  Apr-21
    0   1      Guntur          121.0   121.1
    1   2      Nellore         118.8   118.3
    2   3      Visakhapatnam   131.6   131.5
    3   4      Biswanath       123.7   123.7
    4   5      Doom-Dooma      127.8   128.2
    5   6      Guntur          125.9   128.2
    6   7      Labac-Silchar   114.2   115.4
    7   8      Numaligarh      114.2   115.1
    8   9      Sibsagar        117.7   117.7
    9   10     Munger-Jamalpu  117.7   118.3
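
To connect this back to the question, here is a minimal sketch of the same boolean-mask idea applied to the A/B frame from the question. It is written against plain pandas, not Koalas, so whether each call behaves identically on a Koalas DataFrame is an assumption that should be checked in the Databricks environment. The names `masked` and `filled` are illustrative only. The second variant uses `Series.where`, which is not part of the answer above but gives the same result without passing a Series to `fillna`.

```python
import numpy as np
import pandas as pd

# The A/B frame from the question, rebuilt in plain pandas (not Koalas).
df = pd.DataFrame({
    "A": [0.0, 3.0, np.nan, 2.0, np.nan, 4.8, np.nan],
    "B": [2.0, 4.0, 1.0, np.nan, 1.0, np.nan, 1.0],
})

# Variant 1: boolean mask, in the spirit of the answer above.
# Wherever B has a value, copy it into A; rows where B is NaN keep their A.
masked = df.copy()
masked.loc[masked["B"].notnull(), "A"] = masked["B"]

# Variant 2: Series.where keeps B where the condition holds and
# takes the value from A (aligned on the index) everywhere else.
filled = df["B"].where(df["B"].notnull(), df["A"])

print(masked)           # A becomes [2.0, 4.0, 1.0, 2.0, 1.0, 4.8, 1.0]; B is unchanged
print(filled.tolist())  # [2.0, 4.0, 1.0, 2.0, 1.0, 4.8, 1.0]
```

Both variants reproduce the final table the question expects: A takes the value of B wherever B is present and keeps its own value where B is NaN.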