python statsmodels patsy

How to identify the subjects in the predicted results of OLS in Statsmodels?


I am running a linear regression with Statsmodels in a Jupyter notebook. The data is in a DataFrame called "train_base", where the "id" column uniquely identifies each subject in my database. train_base looks like this:

id     y     x0     x1     x2
a123   20     8      1      3
b789   33     8      3      2
d782   77     9      6      5      

The main chunk of code is shown below. Note that I use another DataFrame called "test_base" to make predictions. It has the same structure as "train_base", except that it has no "y" column:

import statsmodels.formula.api as smf

results = smf.ols('y ~ x0 + x1 + x2', data=train_base).fit()
predictions = results.predict(test_base)
predictions.head()

The predictions are like this:

0   -0.054789
1   -0.036042
2   -0.043962
3   -0.135725
4   -0.409129
dtype: float64

It seems to me that the first column shown in the predictions is the index of the original train_base (am I correct?). Since I need to identify the predicted value for each individual of my test base, what do I have to do to have the "id" column in the predictions?


Solution

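  • First: the left-hand column shown in the predictions is the row index of test_base (the DataFrame passed to predict), not the index of train_base, and it is not the id column. Second: I don't have access to your data to test my suggestion, but since the predictions keep the index of test_base, something like the following should work:

    import pandas as pd

    predict = pd.DataFrame({'id': test_base['id'],
                            'predict': results.predict(test_base)})
    predict

    This builds a DataFrame with the id in one column and the predicted value in the other. It works because the predictions and test_base['id'] share the same index, so pandas matches each prediction to the id of its row. Note that the ids must come from test_base, not train_base. Also avoid passing the id column as the index argument of pd.DataFrame: the predictions are a Series, so pandas would reindex them by the id labels and return NaN.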
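
For reference, here is a minimal, self-contained sketch of the same idea. The data is made up to stand in for train_base and test_base (all ids and values are invented for illustration), and the names by_column and by_index are hypothetical. It assumes that predict returns a Series carrying the index of the DataFrame it receives, which is what the output in the question suggests.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up training data in the same layout as train_base.
train_base = pd.DataFrame({
    "id": ["a123", "b789", "d782", "e456", "f321"],
    "y":  [20, 33, 77, 41, 58],
    "x0": [8, 8, 9, 7, 9],
    "x1": [1, 3, 6, 2, 5],
    "x2": [3, 2, 5, 4, 1],
})

# Made-up test data: same columns, but no "y".
test_base = pd.DataFrame({
    "id": ["g111", "h222", "k333"],
    "x0": [8, 9, 7],
    "x1": [2, 4, 1],
    "x2": [3, 1, 5],
})

results = smf.ols("y ~ x0 + x1 + x2", data=train_base).fit()

# The predictions are indexed like test_base (0, 1, 2), not by id.
predictions = results.predict(test_base)

# Option 1: put ids and predictions side by side.
# pandas aligns the two Series on their shared index.
by_column = pd.DataFrame({"id": test_base["id"], "predict": predictions})

# Option 2: make "id" the index before predicting,
# so the returned Series is labelled by id directly.
by_index = results.predict(test_base.set_index("id"))

print(by_column)
print(by_index)
```

Option 1 keeps the id as an ordinary column, which is convenient for merging or exporting. Option 2 is shorter if you only need to look up a prediction by id, for example by_index["h222"].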