python pandas time-series

Differencing Time Series & Create Stationary Time Series - Pandas


I want to difference a time series to make it stationary. However, taking the first lag is not guaranteed to make the series stationary. Consider the example Pandas dataframe below:

import pandas as pd

test = {'A':[10,15,19,24,23]}
test_df = pd.DataFrame(test)

Using the diff() method I can take the first lag as expected, but if I attempt diff(2), i.e. a lag period of 2, I do not get the results I expect.

Expected Output

+----+-------+-------+
| A  | Lag 1 | Lag 2 |
+----+-------+-------+
| 10 | NA    | NA    |
| 15 | 5     | NA    |
| 19 | 4     | -1    |
| 24 | 5     | 1     |
| 23 | -1    | -6    |
+----+-------+-------+

Resulting Output

+----------------+
|  A  lag1  lag2 |
+----------------+
| 10   NaN   NaN |
| 15   5.0   NaN |
| 19   4.0   9.0 |
| 24   5.0   9.0 |
| 23  -1.0   4.0 |
+----------------+

The above output was generated using test_df['lag2'] = test_df['A'].diff(2). How can I obtain the expected output, and how can I regenerate the original time series using only the Lag 2 series?

Edit 1 This question does not pertain to any data type conversion or NaNs and is incorrectly marked as a duplicate. The expected output is clearly stated, and the scope of the question is completely different from the one mentioned here.

Edit 2 To work with a larger number of samples, the following dummy data frame can be used.

import numpy as np

test = np.random.randint(100, size=500)
test_df = pd.DataFrame(test, columns = ['A'])

Edit 3 To explain the expected output further, please consider the table below.

+----+-------+-------+
| A  | Lag 1 | Lag 2 |
+----+-------+-------+
| 10 | NA    | NA    |
| 15 | 5     | NA    |
| 19 | 4     | -1    |
| 24 | 5     | 1     |
| 23 | -1    | -6    |
| 50 | 27    | 28    |
| 34 | -16   | -43   |
| 56 | 22    | 38    |
| 33 | -23   | -45   |
| 26 | -7    | 16    |
| 45 | 19    | 26    |
+----+-------+-------+

test = {'A': [10,15,19,24,23,50,34,56,33,26,45]}
test_df = pd.DataFrame(test)

Lag 1 of this column can be created using test_df['lag1'] = test_df['A'].diff(). But to create lag 2 I need to do test_df['lag2'] = test_df['A'].diff().diff(). This approach does not scale to cases where I have to take 365 lags. Hence I need a solution that takes the lag of the original series A, then recursively takes the lag of lag1 to generate lag2, and so on.

Once we've created the lagged term lag2, how can we retrieve the original series from it?


Solution

  • diff(2) gives you the difference between a_n and a_{n-2}. What you want is the difference between consecutive items of the already diff-ed column. A little algebra shows that this is (a_n - a_{n-1}) - (a_{n-1} - a_{n-2}) = a_n - 2 * a_{n-1} + a_{n-2}. That is enough to explain why you do not get the expected result.

    What you want is to iterate diff():

    df['Lag 1'] = df['A'].diff()
    df['Lag 2'] = df['A'].diff().diff()
    

    With df = pd.DataFrame({'A': [10,15,19,24,23,50,34,56,33,26,45]}), it gives, as expected:

         A  Lag 1  Lag 2
    0   10    NaN    NaN
    1   15    5.0    NaN
    2   19    4.0   -1.0
    3   24    5.0    1.0
    4   23   -1.0   -6.0
    5   50   27.0   28.0
    6   34  -16.0  -43.0
    7   56   22.0   38.0
    8   33  -23.0  -45.0
    9   26   -7.0   16.0
    10  45   19.0   26.0
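
To make the distinction concrete, here is a minimal sketch on the first five values of the example. The variable names (seasonal, second, second_by_hand) are mine and purely illustrative; the only pandas calls used are Series.diff and Series.shift.

```python
import pandas as pd

a = pd.Series([10, 15, 19, 24, 23], name="A")

# diff(2) subtracts the value two rows back: a_n - a_{n-2}
seasonal = a.diff(2)

# diff().diff() differences the first difference: a_n - 2*a_{n-1} + a_{n-2}
second = a.diff().diff()

# The same second difference written out with explicit shifts
second_by_hand = a - 2 * a.shift(1) + a.shift(2)

print(pd.DataFrame({"A": a, "diff(2)": seasonal, "diff().diff()": second}))
```

Here diff(2) reproduces the 9, 9, 4 column from the question, while diff().diff() gives the expected -1, 1, -6.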
    

    The little maths above suggests that Lag 365 combines the current value with the 365 previous ones, weighted by the signed binomial coefficients (-1)^i * C(365, i). So IMHO it is simpler to do:

    s = df['A']
    for i in range(365): s = s.diff()
    df['Lag 365'] = s
    

    This works if you do not want to keep the intermediary Lag i columns.
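
As a sanity check on the binomial remark, the sketch below compares iterated diff() with the closed form sum over i of (-1)^i * C(n, i) * a_{t-i}. It uses a small order (n = 4) only for illustration and assumes Python 3.8+ for math.comb; the names iterated and closed_form are mine. For n = 365 the coefficients become astronomically large, which is one more reason to prefer the loop.

```python
from math import comb

import pandas as pd

a = pd.Series([10, 15, 19, 24, 23, 50, 34, 56, 33, 26, 45])
n = 4  # small differencing order, chosen for illustration

# Iterated differencing: apply diff() n times
iterated = a
for _ in range(n):
    iterated = iterated.diff()

# Closed form: each term is a signed binomial coefficient times a shifted copy
closed_form = sum((-1) ** i * comb(n, i) * a.shift(i) for i in range(n + 1))

print(pd.DataFrame({"iterated": iterated, "closed_form": closed_form}))
```

Both columns agree from row n onward, and the first n rows are NaN in each.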


    You can retrieve the initial values from a diff-ed column with cumsum, provided you also have the first value: df['Lag 1'].fillna(df.iloc[0,0]).cumsum() gives back df['A']. So, to be able to restore the initial values from a column that was diff-ed n times, I would use a slight variation of diff that keeps the initial value instead of the initial NaN:

    def difx(s):
        return s.diff().combine_first(s)
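
The single-difference case can be checked directly. This is a minimal sketch on a five-row frame; first_value, patched and restored are illustrative names of mine, and it assumes the data contains no NaN other than the one diff() introduces.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 15, 19, 24, 23]})
df["Lag 1"] = df["A"].diff()

# Replace the leading NaN with the first original value...
first_value = df["A"].iloc[0]
patched = df["Lag 1"].fillna(first_value)

# ...then a cumulative sum rebuilds the original column (as floats)
restored = patched.cumsum()

# difx-style differencing produces the patched column in one step
same_as_difx = df["A"].diff().combine_first(df["A"])
```

Since patched and same_as_difx hold the same values, keeping the first value at differencing time saves having to look it up again later.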
    

    Then, to compute the 4th diff, I would use

    s = df['A']
    for i in range(4): s = difx(s)
    df['Lag 4'] = s
    

    It gives:

    0      10.0
    1     -25.0
    2      19.0
    3      -2.0
    4      -9.0
    5      41.0
    6    -105.0
    7     152.0
    8    -164.0
    9     144.0
    10    -51.0
    

    We can now retrieve the initial values with

    s = df['Lag 4']
    for i in range(4): s = s.cumsum()
    

    We get back as expected the initial values:

    0     10.0
    1     15.0
    2     19.0
    3     24.0
    4     23.0
    5     50.0
    6     34.0
    7     56.0
    8     33.0
    9     26.0
    10    45.0
    

    If your Series is long enough, you can do the same for Lag 365 by just replacing the 4 in my examples with 365.
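
Putting the two loops together, a round trip could be sketched as below. The helper names diff_keep_first and undo_diff are hypothetical (not pandas functions), and the sketch assumes the input has no missing values.

```python
import pandas as pd


def diff_keep_first(s: pd.Series, order: int) -> pd.Series:
    """Difference `order` times, keeping the leading value instead of NaN."""
    for _ in range(order):
        s = s.diff().combine_first(s)
    return s


def undo_diff(d: pd.Series, order: int) -> pd.Series:
    """Invert diff_keep_first by applying cumsum `order` times."""
    for _ in range(order):
        d = d.cumsum()
    return d


a = pd.Series([10, 15, 19, 24, 23, 50, 34, 56, 33, 26, 45], name="A")
lag4 = diff_keep_first(a, 4)
restored = undo_diff(lag4, 4)

print(pd.DataFrame({"A": a, "Lag 4": lag4, "restored": restored}))
```

With integer-valued data like this the round trip is exact. With real-valued data, repeated cumsum can accumulate floating-point rounding error, so comparing the restored series with a tolerance (for example numpy.allclose) is probably safer than testing for exact equality.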