pythonpandasstatsmodelslinearmodels

Pandas OLS Random Effects model - weird prediction values


I am trying to create a RE-model using Pandas. I only have previous experience in Stata.

My aim is to generate predicted temperatures by regressing the temperature data on a set of dummies. The first set of dummies represents the month of the year while the second dummy represents the station.

I suspect I am doing something wrong the predicted values for id = 1 and month = 1 is zero. When performing the same RE-model in Stata I receive a predicted value of -23.02825. Now, there could of course be some difference in the RE-model between Pandas and Stata that I am unaware of but I suspect that I have somewhat specified the code wrong. Does someone have insights about this?

I am not a experienced poster here on StackOverflow, so if I need to add any additional information, just let me know!

import pandas as pd

import statsmodels.api as sm

from patsy import dmatrices

from linearmodels.panel import RandomEffects

data = pd.read_stata("/Users/user/Desktop/dataset2.dta") 

print(data.head()) 

        id  value      dates  month  antal       year  iid
0        1      0 2000-01-01    1.0   62.0 2000-01-01    1
1        1     21 2000-01-02    1.0   62.0 2000-01-01    1
2        1     32 2000-01-03    1.0   62.0 2000-01-01    1
3        1     31 2000-01-04    1.0   62.0 2000-01-01    1
4        1     11 2000-01-05    1.0   62.0 2000-01-01    1
...     ..    ...        ...    ...    ...        ...  ...
174903  55    -56 2010-12-27   12.0   31.0 2010-01-01   49
174904  55    -76 2010-12-28   12.0   31.0 2010-01-01   49
174905  55    -93 2010-12-29   12.0   31.0 2010-01-01   49
174906  55    -90 2010-12-30   12.0   31.0 2010-01-01   49
174907  55     10 2010-12-31   12.0   31.0 2010-01-01   49

[174908 rows x 7 columns]

data = data.astype({'iid': 'int16'})

print(data)

data = data.astype({'id': 'int16'})

data = data.astype({'antal': 'int8'})

data = data.astype({'month': 'int8'})

data.dtypes

id                int16
value             int32
dates    datetime64[ns]
month              int8
antal              int8
year     datetime64[ns]
iid               int16
dtype: object


d_iid = pd.get_dummies(data.iid, prefix='id', dtype='int8').iloc[:,1:]
data = pd.concat([data, d_iid], axis=1)

d_m = pd.get_dummies(data.month, prefix='m').iloc[:,1:]
data = pd.concat([data, d_m], axis=1)

data = data.set_index(["iid", "dates"])

print(data)

                id  value  month  antal       year  id_2  id_3  id_4  id_5  \
iid dates                                                                    
1   2000-01-01   1      0      1     62 2000-01-01     0     0     0     0   
    2000-01-02   1     21      1     62 2000-01-01     0     0     0     0   
    2000-01-03   1     32      1     62 2000-01-01     0     0     0     0   
    2000-01-04   1     31      1     62 2000-01-01     0     0     0     0   
    2000-01-05   1     11      1     62 2000-01-01     0     0     0     0   
...             ..    ...    ...    ...        ...   ...   ...   ...   ...   
49  2010-12-27  55    -56     12     31 2010-01-01     0     0     0     0   
    2010-12-28  55    -76     12     31 2010-01-01     0     0     0     0   
    2010-12-29  55    -93     12     31 2010-01-01     0     0     0     0   
    2010-12-30  55    -90     12     31 2010-01-01     0     0     0     0   
    2010-12-31  55     10     12     31 2010-01-01     0     0     0     0   

                id_6  ...  m_3  m_4  m_5  m_6  m_7  m_8  m_9  m_10  m_11  m_12  
iid dates             ...                                                       
1   2000-01-01     0  ...    0    0    0    0    0    0    0     0     0     0  
    2000-01-02     0  ...    0    0    0    0    0    0    0     0     0     0  
    2000-01-03     0  ...    0    0    0    0    0    0    0     0     0     0  
    2000-01-04     0  ...    0    0    0    0    0    0    0     0     0     0  
    2000-01-05     0  ...    0    0    0    0    0    0    0     0     0     0  
...              ...  ...  ...  ...  ...  ...  ...  ...  ...   ...   ...   ...  
49  2010-12-27     0  ...    0    0    0    0    0    0    0     0     0     1  
    2010-12-28     0  ...    0    0    0    0    0    0    0     0     0     1  
    2010-12-29     0  ...    0    0    0    0    0    0    0     0     0     1  
    2010-12-30     0  ...    0    0    0    0    0    0    0     0     0     1  
    2010-12-31     0  ...    0    0    0    0    0    0    0     0     0     1  

[174908 rows x 62 columns]

var=data.loc[:, 'id_2':'m_12']


mod = RandomEffects(data.value, var)

re_res = mod.fit()

print(re_res)

                        RandomEffects Estimation Summary                        
================================================================================
Dep. Variable:                  value   R-squared:                        0.9129
Estimator:              RandomEffects   R-squared (Between):              0.9993
No. Observations:              174908   R-squared (Within):               0.7416
Date:                Fri, Oct 09 2020   R-squared (Overall):              0.9129
Time:                        18:15:09   Log-likelihood                -8.937e+05
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                   3.216e+04
Entities:                          47   P-value                           0.0000
Avg Obs:                       3721.4   Distribution:               F(57,174851)
Min Obs:                       1096.0                                           
Max Obs:                       4018.0   F-statistic (robust):          3.216e+04
                                        P-value                           0.0000
Time periods:                    4018   Distribution:               F(57,174851)
Avg Obs:                       43.531                                           
Min Obs:                       41.000                                           
Max Obs:                       46.000                                           
                                                                                
                             Parameter Estimates                              
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
id_2          -32.954     0.7098    -46.425     0.0000     -34.345     -31.563
id_3          -71.447     0.9984    -71.562     0.0000     -73.404     -69.491
id_4          -20.982     0.6981    -30.056     0.0000     -22.350     -19.614
id_5          -20.982     0.6981    -30.056     0.0000     -22.350     -19.614
id_6          -17.287     0.7680    -22.511     0.0000     -18.792     -15.782
id_7          -26.455     0.6981    -37.897     0.0000     -27.824     -25.087
id_8          -51.832     0.6981    -74.248     0.0000     -53.200     -50.464
id_9          -51.832     0.6981    -74.248     0.0000     -53.200     -50.464
id_10         -11.651     0.6981    -16.690     0.0000     -13.019     -10.283
id_11          8.3476     0.7003     11.921     0.0000      6.9751      9.7201
id_12          20.876     0.6982     29.902     0.0000      19.508      22.245
id_13          12.719     0.6982     18.218     0.0000      11.351      14.088
id_14          10.063     0.6981     14.415     0.0000      8.6947      11.431
id_15         -137.28     0.6981    -196.66     0.0000     -138.65     -135.92
id_16          23.770     0.6981     34.050     0.0000      22.402      25.138
id_17          38.704     0.6981     55.443     0.0000      37.336      40.073
id_18          48.107     0.9304     51.706     0.0000      46.283      49.930
id_19          112.55     0.6991     161.00     0.0000      111.18      113.92
id_20          111.20     0.6981     159.30     0.0000      109.84      112.57
id_21          107.49     0.6981     153.98     0.0000      106.12      108.86
id_22          112.01     0.9761     114.75     0.0000      110.10      113.92
id_23          22.266     0.6981     31.895     0.0000      20.898      23.634
id_24         -26.090     0.6981    -37.373     0.0000     -27.458     -24.722
id_25         -52.292     0.6982    -74.900     0.0000     -53.660     -50.924
id_26         -86.101     0.6981    -123.34     0.0000     -87.469     -84.733
id_27          73.427     1.2464     58.911     0.0000      70.984      75.869
id_28          30.520     0.6981     43.719     0.0000      29.152      31.888
id_29          50.253     0.6981     71.986     0.0000      48.884      51.621
id_30          50.380     0.6981     72.167     0.0000      49.011      51.748
id_31          69.253     0.6981     99.204     0.0000      67.885      70.622
id_32          36.982     0.6981     52.976     0.0000      35.614      38.350
id_33          68.688     0.6981     98.394     0.0000      67.320      70.056
id_34          5.7040     0.6981     8.1709     0.0000      4.3358      7.0723
id_35          7.8369     0.6981     11.226     0.0000      6.4687      9.2052
id_36          7.8414     0.6981     11.233     0.0000      6.4732      9.2097
id_37          6.1834     0.6981     8.8576     0.0000      4.8151      7.5516
id_38          19.922     0.6981     28.538     0.0000      18.554      21.290
id_39          9.7777     0.6986     13.995     0.0000      8.4084      11.147
id_41          7.0468     0.6981     10.094     0.0000      5.6785      8.4150
id_42         -14.053     0.6981    -20.131     0.0000     -15.421     -12.685
id_43          11.974     0.6981     17.152     0.0000      10.605      13.342
id_44          10.657     0.9111     11.697     0.0000      8.8716      12.443
id_45          23.552     0.6981     33.737     0.0000      22.183      24.920
id_46          9.1107     0.6981     13.051     0.0000      7.7425      10.479
id_48          6.9631     0.6981     9.9745     0.0000      5.5949      8.3314
id_49          4.3713     0.6981     6.2618     0.0000      3.0030      5.7395
m_2            6.2046     0.4627     13.411     0.0000      5.2978      7.1115
m_3            36.329     0.4509     80.562     0.0000      35.445      37.212
m_4            84.415     0.4549     185.57     0.0000      83.524      85.307
m_5            129.82     0.4509     287.93     0.0000      128.94      130.71
m_6            166.63     0.4546     366.52     0.0000      165.74      167.52
m_7            188.15     0.4509     417.30     0.0000      187.26      189.03
m_8            181.04     0.4509     401.55     0.0000      180.16      181.93
m_9            138.76     0.4543     305.42     0.0000      137.87      139.65
m_10           94.589     0.4488     210.78     0.0000      93.709      95.468
m_11           47.314     0.4514     104.81     0.0000      46.429      48.199
m_12           5.9907     0.4483     13.363     0.0000      5.1121      6.8694
==============================================================================

re_res.predict()

    fitted_values
iid dates   
1   2000-01-01  0.000000
2000-01-02  0.000000
2000-01-03  0.000000
2000-01-04  0.000000
2000-01-05  0.000000
... ... ...
49  2010-12-27  10.362003
2010-12-28  10.362003
2010-12-29  10.362003
2010-12-30  10.362003
2010-12-31  10.362003
174908 rows × 1 columns

Solution

  • So, I found out that Python does not add an intercept (Pandas/Statsmodel OLS predicting future values).

    Thus, the solution:

    data['intercept'] = 1
    var=data.loc[:,'intercept':'m_12']