Two year average from xarray dataset


I have an xarray Dataset called DataList with a "time" coordinate and a "Value" variable.

My goal is to produce two-year averages from 10 years of data. A one-year average works perfectly, but I run into issues when attempting a two-year average.

"time" is in datetime64 format and it is hourly data starting from:

print(DataList["time"][0])
2014-01-01T00:00:00.000000000

and ending at:

print(DataList["time"][-1])    
2023-12-31T23:00:00.000000000

When I attempt the following code for a single year average, it works perfectly:

YearlyAverage = DataList["Value"].resample(time = "1Y").mean(dim="time")

I get a total of 10 values, one for each year, each labelled with the last day of that year:

2014-12-31, 2015-12-31, ... 2023-12-31

Now for the problem. To produce a two-year average instead of a one-year average, I changed .resample(time = "1Y") to .resample(time = "2Y"). This almost works, but it gives me the wrong time steps. It first calculates a single-year average for 2014 and then proceeds with two-year bins for 2015-2016, 2017-2018, 2019-2020, 2021-2022 and 2023-2024, so I get a total of six values:

2014-12-31, 2016-12-31, 2018-12-31, 2020-12-31, 2022-12-31, 2024-12-31

The first and last time steps are each the average of a single year, 2014 and 2023; 2013 and 2024 do not exist in my data. It is as if the averaging starts from 2013 (which is why the first bin ends at 2014), but there is no 2013 in my data, so I have no idea what is going on. I would understand it if there were a faulty value at the beginning, but the first time step in my data is definitely in 2014.

So why does the resampling not start from the first year in my data, and how can I fix it to get the following time outputs?

2015-12-31, 2017-12-31, 2019-12-31, 2021-12-31, 2023-12-31

These steps would represent the two year averages for 2014-2015, 2016-2017, 2018-2019, 2020-2021, 2022-2023

Of course, for a small dataset like this it is easy to do manually, but I am curious why this happens, in case I run into a similar issue with larger data. If anyone has any idea, I would be very thankful!

Here is an example code snippet that reproduces the problem. It creates data with the same time axis as mine:

import random as rnd
import pandas as pd
import xarray as xr

datelist = pd.date_range(start ='01-01-2014',end ='01-01-2024', freq ='1H')
datelist = datelist.tolist()
datelist.pop() # remove last value so datelist ends at 23.00 on 31-12-2023

values = []
n = 87648
for i in range(n):
    values.append(rnd.randint(0,10)) # list of values

DataList = xr.Dataset(
        {
        "time": (["time"], datelist),
        "Value": (["time"],values,{"units": "-"}),
        }
 )

Values_1Y = DataList["Value"].resample(time = "1Y").mean(dim="time")
print(Values_1Y["time"])
# this is absolutely correct
# first year average is taken at the end
# of first year: 2014-12-31T00:00:00.000000000 

Values_2Y = DataList["Value"].resample(time = "2Y").mean(dim="time")
print(Values_2Y["time"])
# this results in the first step being
# 2014-12-31T00:00:00.000000000 (should be end of 2015)
# and last step 2024-12-31T00:00:00.000000000
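
One way to check which samples each bin actually receives is to count them instead of averaging. The following is a minimal sketch in plain pandas, under the assumption that xarray hands the time binning over to pandas. The daily `samples` series and the variable names are made up for illustration and stand in for the hourly data above.

```python
import pandas as pd

# Hypothetical stand-in for the data above: one sample per day
# (instead of per hour) over 2014-01-01 .. 2023-12-31.
days = pd.date_range("2014-01-01", "2023-12-31", freq="D")
samples = pd.Series(1, index=days)

# Offset objects are used instead of the "2Y" / "2YS" strings only because
# the string aliases differ between pandas versions.
per_year_end_bin = samples.resample(pd.offsets.YearEnd(2)).count()
per_year_start_bin = samples.resample(pd.offsets.YearBegin(2)).count()

print(per_year_end_bin)    # year-end bins, labelled by their right edge
print(per_year_start_bin)  # year-start bins, labelled by their left edge
```

If the first and last year-end bins hold roughly half as many samples as the others, each of them covers only one year of data, which would match the single-year averages described above.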

Solution

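  • Use the year-start frequency 2YS instead of the year-end frequency 2Y. The two-year bins are then anchored at 1 January of the first year in the data, so each bin covers two full years (2014-2015, 2016-2017, ...) and is labelled with its start date: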

    Values_2Y = DataList["Value"].resample(time="2YS").mean(dim="time")
    print(Values_2Y["time"])
    
    <xarray.DataArray 'time' (time: 5)> Size: 40B
    array(['2014-01-01T00:00:00.000000000', '2016-01-01T00:00:00.000000000',
           '2018-01-01T00:00:00.000000000', '2020-01-01T00:00:00.000000000',
           '2022-01-01T00:00:00.000000000'], dtype='datetime64[ns]')
    Coordinates:
      * time     (time) datetime64[ns] 40B 2014-01-01 2016-01-01 ... 2022-01-01
    

    If you want to 'center' the interval label, you can do so via timedelta arithmetic (careful with leap years if you're dealing with yearly intervals ;-) a fixed 365-day shift lands on 1 January or on 31 December, depending on whether the first year of the bin is a leap year):

    print(Values_2Y["time"] + pd.Timedelta(days=365))
    
    <xarray.DataArray 'time' (time: 5)> Size: 40B
    array(['2015-01-01T00:00:00.000000000', '2016-12-31T00:00:00.000000000',
           '2019-01-01T00:00:00.000000000', '2020-12-31T00:00:00.000000000',
           '2023-01-01T00:00:00.000000000'], dtype='datetime64[ns]')
    Coordinates:
      * time     (time) datetime64[ns] 40B 2014-01-01 2016-01-01 ... 2022-01-01
    

    For calendar-aware shifts, consider using pandas DateOffsets. These won't work with xarray directly, however; you'd have to extract a pandas datetime Series first, e.g. pd.Series(Values_2Y["time"]) + pd.DateOffset(years=1).
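
If the labels should instead mark the last day of each two-year bin (2015-12-31, 2017-12-31, ... as asked for in the question), the DateOffset idea can be sketched as follows. This is only an illustration: the `starts` index is typed in by hand to mirror the 2YS labels shown above, whereas with the real result it could presumably be obtained with `Values_2Y["time"].to_index()`.

```python
import pandas as pd

# Start-of-bin labels as produced by resample(time="2YS") above,
# written out by hand so the example runs without the dataset.
starts = pd.DatetimeIndex(
    ["2014-01-01", "2016-01-01", "2018-01-01", "2020-01-01", "2022-01-01"]
)

# Calendar-aware shift: two years forward, then one day back.
# This lands on 31 December of the second year, leap year or not.
ends = starts + pd.DateOffset(years=2) - pd.Timedelta(days=1)
print(ends)

# With the xarray result at hand, the new labels could then be attached via
# Values_2Y.assign_coords(time=ends)   (not executed in this sketch)
```

Because the shift is expressed in calendar years rather than a fixed number of days, it does not drift when a bin contains a leap year.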