I have the code below, which is supposed to create two dataframes from the given columns. df's Region column has 5 values: W, E, N, S and C. However, the resulting dataframe has columns only for W, E, N and S, plus an Intercept column.
import statsmodels.api as sm
from patsy import dmatrices
df = sm.datasets.get_rdataset('Guerry','HistData').data
vars = ['Department','Lottery','Literacy','Wealth','Region']
df = df[vars]
df = df.dropna()
# Department Lottery Literacy Wealth Region
# 0 Ain 41 37 73 E
# 1 Aisne 38 51 22 N
# 2 Allier 66 13 61 C
# 3 Basses-Alpes 80 46 76 E
# 4 Hautes-Alpes 79 69 83 E
y, X = dmatrices('Lottery ~ Literacy + Wealth + Region', data=df, return_type='dataframe')
print(X.columns.tolist())
# ['Intercept', 'Region[T.E]', 'Region[T.N]', 'Region[T.S]', 'Region[T.W]', 'Literacy', 'Wealth']
When I change the last line as shown below, it works fine and the dataframe shows all 5 Region values.
y, X = dmatrices('Literacy + Wealth + Region ~ Lottery', data=df, return_type='dataframe')
print(y.columns.tolist())
# ['Region[C]', 'Region[E]', 'Region[N]', 'Region[S]', 'Region[W]', 'Literacy', 'Wealth']
Can someone please explain the reason for this? And what is the Intercept column that is created by the first version instead of Region C?
Patsy automatically adds a constant "Intercept" term to the right-hand side of formulas. This leads to a design matrix with an Intercept column of all 1's. For example:
import pandas as pd
import patsy
data = patsy.demo_data("a", "b", "y")
# a b y
# 0 a1 b1 1.764052
# 1 a1 b2 0.400157
# 2 a2 b1 0.978738
# 3 a2 b2 2.240893
# 4 a1 b1 1.867558
# 5 a1 b2 -0.977278
# 6 a2 b1 0.950088
# 7 a2 b2 -0.151357
mat = patsy.dmatrices("y ~ a + b ", data, return_type='dataframe')[1]
print(mat)
yields
Intercept a[T.a2] b[T.b2]
0 1.0 0.0 0.0
1 1.0 0.0 1.0
2 1.0 1.0 0.0
3 1.0 1.0 1.0
4 1.0 0.0 0.0
5 1.0 0.0 1.0
6 1.0 1.0 0.0
7 1.0 1.0 1.0
Patsy analyzes the expressions on each side of the formula, and only adds new terms when such a term is needed to add the required flexibility to the model. In terms of the design matrix, this means that a new column is not added unless the vector space spanned by the columns is expanded by the addition of the new column. In other words, a new column which is already in the span of the other columns would be redundant and so it is not added.
When you have a categorical variable which must equal W, E, N, S, or C, knowing that the value of the variable is not W, E, N, or S is equivalent to knowing the variable equals C.
Look at the output from the previous example. Knowing that the a variable is not a2 is equivalent to knowing it equals a1. In terms of the design matrix, the column space would not be increased by including an a1 column, since Intercept - a2 is a1. (Below, the a1 column is labeled a[T.a1], and similarly for a2):
Intercept a[T.a2] b[T.b2] a[T.a1]
0 1.0 0.0 0.0 1.0
1 1.0 0.0 1.0 1.0
2 1.0 1.0 0.0 0.0
3 1.0 1.0 1.0 0.0
4 1.0 0.0 0.0 1.0
5 1.0 0.0 1.0 1.0
6 1.0 1.0 0.0 0.0
7 1.0 1.0 1.0 0.0
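The redundancy of the a1 column can also be checked numerically. The following is a minimal sketch, not part of patsy itself: it copies the three-column design matrix printed earlier into a NumPy array by hand and compares the matrix rank before and after appending an a1 column. The variable names (X, a1, X_with_a1) are chosen here purely for illustration.

```python
import numpy as np

# Columns: Intercept, a[T.a2], b[T.b2] (copied by hand from the patsy output above)
X = np.array([
    [1, 0, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
], dtype=float)

# The a1 indicator is exactly Intercept - a2
a1 = X[:, 0] - X[:, 1]
X_with_a1 = np.column_stack([X, a1])

rank_before = np.linalg.matrix_rank(X)
rank_after = np.linalg.matrix_rank(X_with_a1)
print(rank_before, rank_after)  # 3 3
```

The rank stays at 3, so the extra column adds nothing to the column space.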
Similarly, in your situation, no column is added for the categorical value C, because Intercept - (E + N + S + W) equals C, where each letter stands for that region's indicator column.
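Here is a small sketch of that identity. It assumes a made-up toy DataFrame (not the Guerry data) with the same five region labels, and it uses the formula "0 + Region" only to obtain the full set of indicator columns for comparison.

```python
import numpy as np
import pandas as pd
import patsy

# Hypothetical toy data with the same five region labels as the question
toy = pd.DataFrame({"Region": ["W", "E", "N", "S", "C", "E", "C"]})

# With an Intercept, the first level (C) gets no column of its own
with_intercept = patsy.dmatrix("Region", toy, return_type="dataframe")

# Without an Intercept, every level gets its own indicator column
full = patsy.dmatrix("0 + Region", toy, return_type="dataframe")

# Intercept minus the four remaining indicators reproduces the C indicator
dummy_cols = [c for c in with_intercept.columns if c.startswith("Region[")]
reconstructed_c = with_intercept["Intercept"] - with_intercept[dummy_cols].sum(axis=1)

print(np.allclose(reconstructed_c, full["Region[C]"]))  # True
```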
Now we can return to your original code and understand the result more clearly:
import statsmodels.api as sm
from patsy import dmatrices
df = sm.datasets.get_rdataset('Guerry','HistData').data
vars_ = ['Department','Lottery','Literacy','Wealth','Region']
df = df[vars_]
df = df.dropna()
formula1 = 'Lottery ~ Literacy + Wealth + Region'
print(formula1)
y1, X1 = dmatrices(formula1, data=df, return_type='dataframe')
print('LHS: {}'.format(y1.columns.tolist()))
# ['Lottery']
print('RHS: {}'.format(X1.columns.tolist()))
# ['Intercept', 'Region[T.E]', 'Region[T.N]', 'Region[T.S]', 'Region[T.W]', 'Literacy', 'Wealth']
formula2 = 'Literacy + Wealth + Region ~ Lottery'
print(formula2)
y2, X2 = dmatrices(formula2, data=df, return_type='dataframe')
print('LHS: {}'.format(y2.columns.tolist()))
# ['Region[C]', 'Region[E]', 'Region[N]', 'Region[S]', 'Region[W]', 'Literacy', 'Wealth']
print('RHS: {}'.format(X2.columns.tolist()))
# ['Intercept', 'Lottery']
Notice that an Intercept has been automatically added to the right-hand side of each formula. When there is both an Intercept term and a categorical variable on the same side of the formula, one value of the categorical variable is always missing because its presence would not expand the design matrix's column space.
You can tell patsy not to add an Intercept column by including + 0 on the right-hand side of the formula, or by including - 1. They both do the same thing.
formula3 = 'Lottery ~ Literacy + Wealth + Region + 0'
print(formula3)
y3, X3 = dmatrices(formula3, data=df, return_type='dataframe')
print('LHS: {}'.format(y3.columns.tolist()))
print('RHS: {}'.format(X3.columns.tolist()))
Now, the right-hand side has a Region[C] column:
LHS: ['Lottery']
RHS: ['Region[C]', 'Region[E]', 'Region[N]', 'Region[S]', 'Region[W]', 'Literacy', 'Wealth']
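To illustrate that + 0 and - 1 are interchangeable, here is a short sketch using patsy's built-in demo data rather than the Guerry dataset, so that it runs without a network connection. The variable names are illustrative only.

```python
import pandas as pd
import patsy

# Small built-in demo data set with two categorical columns, a and b
data = pd.DataFrame(patsy.demo_data("a", "b", "y"))

# Two spellings of "no Intercept"
X_plus0 = patsy.dmatrix("a + b + 0", data, return_type="dataframe")
X_minus1 = patsy.dmatrix("a + b - 1", data, return_type="dataframe")

print(X_plus0.columns.tolist())
print(X_minus1.columns.tolist())
print(X_plus0.equals(X_minus1))
```

Both matrices should have the same columns, with no Intercept column. The first categorical variable, a, gets a column for every level, while b still has one level dropped, because a full set of b columns would again be redundant.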