pythonpandasdataframereplacemissing-data

Replace missing rows in dataframe A, along with their corresponding values with data from dataframe B


I have two dataframes (A and B). Dataframe A contains weather from 2010 to 2013. But the row which is suppoesd to contain data for the first day is missing for each of the years (e.g. '2013-01-01'). But I have a second dataframe B where I have saved the missing data. How do I create a new row in dataframe A and include the missing dates and data from dataframe B. I need to do this for all the years shown in dataframe B.

Dataframe_A:

    TIMESTAMP   SR  TEMP2MMAX   TEMP2MMIN   PRECIP
0   2013-01-02  10.90   -6.33   -15.78  0.00
1   2013-01-03  10.93   0.20    -16.25  2.54
2   2013-01-04  11.19   -1.25   -15.91  0.51
3   2013-01-05  11.38   0.67    -19.19  0.00
4   2013-01-06  11.36   2.90    -13.49  0.00
... ... ... ... ... ...
359 2013-12-27  11.26   14.48   -6.43   0.00
360 2013-12-28  11.41   15.77   -6.73   0.00
361 2013-12-29  11.51   15.33   -6.30   0.00
362 2013-12-30  7.40    -3.06   -12.48  0.00
363 2013-12-31  11.33   11.21   -11.87  0.00

364 rows × 5 columns

Dataframe_B:

DATE    SRAD    TMAX    TMIN    RAIN
0   2010-01-01  6.7 5.8 -9.6    0.0
1   2011-01-01  7.9 3.3 -14.4   0.0
2   2012-01-01  9.9 5.0 -5.3    0.0
3   2013-01-01  6.6 -6.3    -15.8   0.0

My desired output is Dataframe_A as shown below:

    TIMESTAMP    SR  TEMP2MMAX  TEMP2MMIN  PRECIP
**0   2010-01-01  6.7      5.8      -9.6     0.0**
1   2010-01-02  9.45       3.45     -11.23     0.0
2   2010-01-03  1.52      -7.49     -10.96     0.0
3   2010-01-04  6.33       2.03      -7.72     0.0
4   2010-01-05  4.66       2.16      -5.07     0.0
..         ...   ...        ...        ...     ...
**0   2013-01-01  6.6      -6.3     -15.8    0.00**
1   2013-01-02  10.90      -6.33     -15.78    0.00
2   2013-01-03  10.93       0.20     -16.25    2.54
3   2013-01-04  11.19      -1.25     -15.91    0.51
4   2013-01-05  11.38       0.67     -19.19    0.00

I tried the code below, but it doesn't work

for year in range(2010, 2014):
    first_row = Dataframe_B[Dataframe_B['DATE']['TIMESTAMP'].dt.year == year].iloc[0]
    for year in range(2010, 2014):
        pd.concat([Dataframe_A[Dataframe_A['TIMESTAMP'].dt.year == year], first_row.to_frame()],
        , ignore_index=True)

Output:

0   TIMESTAMP   SR  TEMP2MMAX   TEMP2MMIN   PRECIP
0   NaN 2013-01-02  10.90   -6.33   -15.78  0.00
1   NaN 2013-01-03  10.93   0.20    -16.25  2.54
2   NaN 2013-01-04  11.19   -1.25   -15.91  0.51
3   NaN 2013-01-05  11.38   0.67    -19.19  0.00
4   NaN 2013-01-06  11.36   2.90    -13.49  0.00
... ... ... ... ... ... ...
364 2013-01-01 00:00:00 NaT NaN NaN NaN NaN
365 6.6 NaT NaN NaN NaN NaN
366 -6.3    NaT NaN NaN NaN NaN
367 -15.8   NaT NaN NaN NaN NaN
368 0.0 NaT NaN NaN NaN NaN

I also tried a second code like this below:

for year in range(2010, 2014):
    second_df = Dataframe_A[Dataframe_A['DATE'].dt.year == year].copy()
    first_row_B = Dataframe_B[Dataframe_B['DATE'].dt.year == year].iloc[0]
    for year in range(2010, 2014):
        second_df = second_df[second_df['TIMESTAMP'] != pd.Timestamp(f'{year}-01-01')]
    second_df = pd.concat([first_row_B, second_df], axis=0, ignore_index=True).sort_values(by='TIMESTAMP').reset_index(drop=True)

The second code (when year ==2013) produced the result below:

DATE    SRAD    TMAX    TMIN    RAIN    TIMESTAMP   SR  TEMP2MMAX   TEMP2MMIN   PRECIP
0   NaT NaN NaN NaN NaN 2013-01-02  10.90   -6.33   -15.78  0.00
1   NaT NaN NaN NaN NaN 2013-01-03  10.93   0.20    -16.25  2.54
2   NaT NaN NaN NaN NaN 2013-01-04  11.19   -1.25   -15.91  0.51
3   NaT NaN NaN NaN NaN 2013-01-05  11.38   0.67    -19.19  0.00
4   NaT NaN NaN NaN NaN 2013-01-06  11.36   2.90    -13.49  0.00
... ... ... ... ... ... ... ... ... ... ...
363 NaT NaN NaN NaN NaN 2013-12-31  11.33   11.21   -11.87  0.00
364 2010-01-01  6.7 5.8 -9.6    0.0 NaT NaN NaN NaN NaN
365 2011-01-01  7.9 3.3 -14.4   0.0 NaT NaN NaN NaN NaN
366 2012-01-01  9.9 5.0 -5.3    0.0 NaT NaN NaN NaN NaN
367 2013-01-01  6.6 -6.3    -15.8   0.0 NaT NaN NaN NaN NaN
368 rows × 10 columns

Solution

  • What you tried went wrong because Dataframe_A and Dataframe_B have columns with different names. Try to rename the columns before concatenating.

    Dataframe_B = Dataframe_B.rename(columns={'DATE': 'TIMESTAMP', 'SRAD': 'SR', 'TMAX': 'TEMP2MMAX', 'TMIN': 'TEMP2MMIN', 'RAIN': 'PRECIP'})
    Dataframe_C = pd.concat([Dataframe_A, Dataframe_B])