pythonpython-3.xmachine-learningstatsmodelspatsy

How to make a sm.Logit regresiion on patsy matrix?


I wanted to create a Logit plot for a natural cubic spline function with four degrees of freedom for P(wage > 250) but the error occurs for some reason. I don't understand why, because OLS works fine.

Here is the code (it should be fully working without any tweaks, except, of course, the part in question):

import pandas as pd
from patsy import dmatrix
import statsmodels.api as sm
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(7,5))
df = pd.read_csv('http://web.stanford.edu/~oleg2/hse/wage/wage.csv').sort_values(by=['age'])
ind_df = df[['wage', 'age']].copy()

plt.xlabel('Age', fontsize=15)
plt.ylabel('Wage', fontsize=15)
plt.ylim((0,333))

d = 4
knots = [df.age.quantile(0.25), df.age.quantile(0.5), df.age.quantile(0.75)]

my_spline_transformation = f"bs(train, knots={knots}, degree={d}, include_intercept=True)"

transformed = dmatrix( my_spline_transformation, {"train": df.age}, return_type='dataframe' )

lft = sm.Logit( (df.age > 250), transformed )
y_grid1 = lft.predict(transformed)

plt.show()

The error is:

ValueError: shapes (3000,9) and (3000,9) not aligned: 9 (dim 1) != 3000 (dim 0)

I tried transposing the dataframe, but the result is some jibberish graphs. How do I do this properly?


Solution

  • First, your dependent variable is wrong, it should be df.wage>250 instead of df.age>250 .

    Second, I am not sure you need a spline with 4 degrees of freedom (meaning up to the x^4 polynomial) for a single variable. If you look at your data, it is not so complex:

    df = pd.read_csv('http://web.stanford.edu/~oleg2/hse/wage/wage.csv').sort_values(by=['age'])
    df['wage_factor'] = (df.wage > 250).astype('int')
    fig,ax = plt.subplots(1,2,figsize=(8,3))
    df.plot.scatter(x='age',y='wage',ax=ax[0])
    df.plot.scatter(x='age',y='wage_factor',ax=ax[1])
    

    enter image description here

    Third, after you call sm.Logit() , you need to fit it. See below for how it should go:

    d = 3
    knots = [30,60]
    
    my_spline_transformation = f"bs(train, knots={knots}, degree={d}, include_intercept=True)"
    transformed = dmatrix( my_spline_transformation, {"train": df.age}, return_type='dataframe' )
    
    lft = sm.Logit( (df.wage>250), transformed)
    res = lft.fit()
    y_grid1 = res.predict(transformed)
    

    These would work without an error. I am not so sure if the results make sense because in this example, your targets are severely imbalanced and logistic regression will be quite problematic.