pythonpandasgroup-byrowffill

Python df add rows by date, so each group ends on the same date. Ffill remaining rows


to use the geo-plot animation frame I want all my groups to end on the same date. This would avoid the last frame to grey out certain countries. Currently, the newest data point according to date is 'Timestamp('2021-05-13 00:00:00')'.

So in the next step I want to add new rows according to all countries so that they have rows until the newest date in the df. Columns 'people_vaccinated_per_hundred' and 'people_fully_vaccinated_per_hundred' can be populated using ffill.

Data: Data Set

So ideally, if e.g. norway has 1 day less than the newest data point '2021-05-13', then it should add a new row as shown below. And this should be done for all other countries in the df.

Example

    country iso_code    date    people_vaccinated_per_hundred   people_fully_vaccinated_per_hundred
12028   Norway  NOR 2021-05-02  0.00    NaN
12029   Norway  NOR 2021-05-03  0.00    NaN
12188   Norway  NOR ...         ...     ...
12188   Norway  NOR 2021-05-11  27.81   9.55
12189   Norway  NOR 2021-05-12  28.49   10.42

Add new row
12189   Norway  NOR 2021-05-13  28.49   10.42

Solution

  • A blunt approach to this might be to create a cartesian product of the countries and dates, then join on this to create empty values for every missing combination of dates and country.

    countries = df.loc[:, ['country', 'iso_code']].drop_duplicates()
    dates = df.loc[:, 'date'].drop_duplicates()
    all_countries_dates = countries.merge(dates, how='cross')
    
    df.merge(all_countries_dates, how='right', on=['country', 'iso_code', 'date'])
    

    With a dataset like:

    country       iso_code  date        people_vaccinated   people_fully_vaccinated
    Norway        NOR       2021-05-09  0.00                1.00
    Norway        NOR       2021-05-10  0.00                3.00
    Norway        NOR       2021-05-11  27.81               9.55
    Norway        NOR       2021-05-12  28.49               10.42
    Norway        NOR       2021-05-13  28.49               10.42
    United States USA       2021-05-09  23.00               3.00
    United States USA       2021-05-10  23.00               3.00
    

    This transformation would give you:

    country       iso_code  date        people_vaccinated   people_fully_vaccinated
    Norway        NOR       2021-05-09  0.00                1.00
    Norway        NOR       2021-05-10  0.00                3.00
    Norway        NOR       2021-05-11  27.81               9.55
    Norway        NOR       2021-05-12  28.49               10.42
    Norway        NOR       2021-05-13  28.49               10.42
    United States USA       2021-05-09  23.00               3.00
    United States USA       2021-05-10  23.00               3.00
    United States USA       2021-05-11  NaN                 NaN
    United States USA       2021-05-12  NaN                 NaN
    United States USA       2021-05-13  NaN                 NaN
    

    After this you could use fillna to change the empty values for the added rows.