
dmatrices doesn't show a column


I have the code below, which is supposed to create two dataframes from the given columns. df's Region column has 5 values: W, E, N, S and C. However, the resulting dataframe has only W, E, N, S and an Intercept column.

import statsmodels.api as sm
from patsy import dmatrices
df = sm.datasets.get_rdataset('Guerry','HistData').data
vars = ['Department','Lottery','Literacy','Wealth','Region']
df = df[vars]
df = df.dropna()
#      Department  Lottery  Literacy  Wealth Region
# 0           Ain       41        37      73      E
# 1         Aisne       38        51      22      N
# 2        Allier       66        13      61      C
# 3  Basses-Alpes       80        46      76      E
# 4  Hautes-Alpes       79        69      83      E

y, X = dmatrices('Lottery ~ Literacy + Wealth + Region', data=df, return_type='dataframe')
print(X.columns.tolist())
# ['Intercept', 'Region[T.E]', 'Region[T.N]', 'Region[T.S]', 'Region[T.W]', 'Literacy', 'Wealth']

When I change the last line as below, it works fine and shows all 5 Region values in the dataframe.

y, X = dmatrices('Literacy + Wealth + Region ~ Lottery', data=df, return_type='dataframe')
print(y.columns.tolist())
# ['Region[C]', 'Region[E]', 'Region[N]', 'Region[S]', 'Region[W]', 'Literacy', 'Wealth']

Can someone please explain the reason for this? And what is the Intercept column that is created in the first snippet instead of Region C?


Solution

  • Patsy automatically adds a constant "Intercept" term to the right-hand side of formulas. This produces a design matrix with an Intercept column of all 1s. For example:

    import pandas as pd
    import patsy
    
    data = patsy.demo_data("a", "b", "y")
    #     a   b         y
    # 0  a1  b1  1.764052
    # 1  a1  b2  0.400157
    # 2  a2  b1  0.978738
    # 3  a2  b2  2.240893
    # 4  a1  b1  1.867558
    # 5  a1  b2 -0.977278
    # 6  a2  b1  0.950088
    # 7  a2  b2 -0.151357
    
    mat = patsy.dmatrices("y ~ a + b ", data, return_type='dataframe')[1]
    print(mat)
    

    yields

       Intercept  a[T.a2]  b[T.b2]
    0        1.0      0.0      0.0
    1        1.0      0.0      1.0
    2        1.0      1.0      0.0
    3        1.0      1.0      1.0
    4        1.0      0.0      0.0
    5        1.0      0.0      1.0
    6        1.0      1.0      0.0
    7        1.0      1.0      1.0
    

    Patsy analyzes the expressions on each side of the formula, and only adds new terms when such a term is needed to add the required flexibility to the model. In terms of the design matrix, this means that a new column is not added unless the vector space spanned by the columns is expanded by the addition of the new column. In other words, a new column which is already in the span of the other columns would be redundant and so it is not added.

    When you have a categorical variable which must equal W, E, N, S, or C, knowing that the value of the variable is not W, E, N, or S is equivalent to knowing the variable equals C.

    Look at the output from the previous example. Knowing that the a variable is not a2 is equivalent to knowing it equals a1. In terms of the design matrix, the column space would not be increased by including an a1 column, since Intercept - a2 is a1. The table below appends such an a1 column by hand (labeled a[T.a1], by analogy with a[T.a2]); patsy itself does not produce it:

       Intercept  a[T.a2]  b[T.b2]  a[T.a1]
    0        1.0      0.0      0.0      1.0
    1        1.0      0.0      1.0      1.0
    2        1.0      1.0      0.0      0.0
    3        1.0      1.0      1.0      0.0
    4        1.0      0.0      0.0      1.0
    5        1.0      0.0      1.0      1.0
    6        1.0      1.0      0.0      0.0
    7        1.0      1.0      1.0      0.0
    

    Similarly, in your situation, no column is added for the categorical value C, because Intercept - (W + E + N + S) equals C.
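
The redundancy argument can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the arrays are copied by hand from the table above rather than produced by patsy, and the variable names are only illustrative.

```python
import numpy as np

# Columns copied by hand from the table above (not generated by patsy).
intercept = np.ones(8)
a2 = np.array([0, 0, 1, 1, 0, 0, 1, 1], dtype=float)  # a[T.a2]
b2 = np.array([0, 1, 0, 1, 0, 1, 0, 1], dtype=float)  # b[T.b2]
a1 = np.array([1, 1, 0, 0, 1, 1, 0, 0], dtype=float)  # hand-added a1 indicator

# Design matrix as patsy builds it, and the same matrix with a1 appended.
reduced = np.column_stack([intercept, a2, b2])
with_a1 = np.column_stack([intercept, a2, b2, a1])

# The a1 column is exactly Intercept - a2 ...
a1_is_redundant = np.array_equal(a1, intercept - a2)

# ... so appending it leaves the rank (dimension of the column space) unchanged.
rank_reduced = np.linalg.matrix_rank(reduced)
rank_with_a1 = np.linalg.matrix_rank(with_a1)

print(a1_is_redundant, rank_reduced, rank_with_a1)
```

If the two ranks agree, the extra column adds no new direction to the column space, which is the condition under which patsy leaves a column out.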


    Now we can return to your original code and understand the result more clearly:

    import statsmodels.api as sm
    from patsy import dmatrices
    
    df = sm.datasets.get_rdataset('Guerry','HistData').data
    vars_ = ['Department','Lottery','Literacy','Wealth','Region']
    df = df[vars_]
    df = df.dropna()
    
    formula1 = 'Lottery ~ Literacy + Wealth + Region'
    print(formula1)
    y1, X1 = dmatrices(formula1, data=df, return_type='dataframe')
    print('LHS: {}'.format(y1.columns.tolist()))
    # LHS: ['Lottery']
    print('RHS: {}'.format(X1.columns.tolist()))
    # RHS: ['Intercept', 'Region[T.E]', 'Region[T.N]', 'Region[T.S]', 'Region[T.W]', 'Literacy', 'Wealth']
    
    formula2 = 'Literacy + Wealth + Region ~ Lottery'
    print(formula2)
    
    y2, X2 = dmatrices(formula2, data=df, return_type='dataframe')
    print('LHS: {}'.format(y2.columns.tolist()))
    # LHS: ['Region[C]', 'Region[E]', 'Region[N]', 'Region[S]', 'Region[W]', 'Literacy', 'Wealth']
    print('RHS: {}'.format(X2.columns.tolist()))
    # RHS: ['Intercept', 'Lottery']
    

    Notice that an Intercept has been automatically added to the right-hand side of each formula. When an Intercept term and a categorical variable are on the same side of the formula, patsy omits one level of the categorical variable (by default the first level in sorted order, here C), because that column would not expand the design matrix's column space. In formula2, Region sits on the left-hand side, where no Intercept is added, so all five levels appear.
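
A small sketch of the same effect, assuming patsy, pandas and NumPy are installed. It uses the toy data from patsy.demo_data instead of the Guerry dataset so that it runs without a download; the variable names are only illustrative.

```python
import numpy as np
import patsy

data = patsy.demo_data("a", "b", "y")

# No Intercept: patsy gives the categorical one indicator column per level.
full = patsy.dmatrix("a + 0", data, return_type="dataframe")

# Default formula: an Intercept is added and the first level (a1) is dropped.
reduced = patsy.dmatrix("a", data, return_type="dataframe")

print(full.columns.tolist())
print(reduced.columns.tolist())

# Each row belongs to exactly one level, so the indicator columns sum to 1,
# which is exactly the Intercept column. That is why one of them is redundant.
row_sums = full.sum(axis=1).to_numpy()
sums_match = np.allclose(row_sums, reduced["Intercept"].to_numpy())
print(sums_match)
```

The same reasoning applies to Region: its five indicator columns add up to the Intercept column, so only four of them are kept alongside it.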


    You can tell patsy not to add an Intercept column by including + 0 or - 1 on the right-hand side of the formula. The two are equivalent:

    formula3 = 'Lottery ~ Literacy + Wealth + Region + 0'
    print(formula3)
    y3, X3 = dmatrices(formula3, data=df, return_type='dataframe')
    print('LHS: {}'.format(y3.columns.tolist()))
    print('RHS: {}'.format(X3.columns.tolist()))
    

    Now, the right-hand side has a Region[C] column:

    LHS: ['Lottery']
    RHS: ['Region[C]', 'Region[E]', 'Region[N]', 'Region[S]', 'Region[W]', 'Literacy', 'Wealth']
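
As a quick check that + 0 and - 1 behave the same, here is a minimal sketch on patsy's toy demo_data (chosen so it runs offline; the variable names are only illustrative). It also shows a detail worth knowing: with two categorical variables and no Intercept, only the first one gets a column for every level, because the second one's full set of indicators would again be redundant.

```python
import patsy

data = patsy.demo_data("a", "b", "y")

# Two spellings for "no Intercept" on the right-hand side.
X_plus0 = patsy.dmatrices("y ~ a + b + 0", data, return_type="dataframe")[1]
X_minus1 = patsy.dmatrices("y ~ a + b - 1", data, return_type="dataframe")[1]

cols_plus0 = X_plus0.columns.tolist()
cols_minus1 = X_minus1.columns.tolist()

print(cols_plus0)
print(cols_minus1)
print(cols_plus0 == cols_minus1)
```

Here a is expanded into a[a1] and a[a2], while b still contributes only b[T.b2], since a b[b1] column would equal a[a1] + a[a2] - b[T.b2].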