Tags: python, machine-learning, scikit-learn, patsy

One-hot encoding in patsy


For regressions I usually encode categorical variables using sklearn's OneHotEncoder.
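
For reference, a minimal sketch of what I do today (the toy data is just for illustration):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    X = np.array([["low"], ["mid"], ["high"], ["mid"]])

    # One indicator column per category, no column dropped.
    enc = OneHotEncoder()
    print(enc.fit_transform(X).toarray())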

I am now exploring patsy, but it doesn't appear to offer one-hot encoding: http://patsy.readthedocs.io/en/latest/categorical-coding.html

Is it possible to specify one-hot encoding using patsy?


Solution

  • There are two things to know here that might help: (1) patsy by default includes an intercept (there's an invisible 1 + at the beginning of every formula), and (2) when coding a categorical variable, patsy automatically chooses an encoding strategy that avoids creating an over-parameterized model.

    If you combine an intercept + full-rank one-hot encoding, then you get an over-parameterized model. So patsy switches to treatment coding (= basically dropping one column from the one-hot encoding you're thinking of). This avoids creating a linear dependence between your encoding columns and the intercept column.
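
    To see the default behavior concretely, here's a minimal sketch (the data frame and level names are made up for illustration):

    import pandas as pd
    from patsy import dmatrix

    df = pd.DataFrame({"a": ["low", "mid", "high", "mid"]})

    # With the implicit intercept, patsy uses treatment coding: one level
    # ("high", which sorts first) is absorbed into the intercept, and the
    # columns come out as Intercept, a[T.low], a[T.mid].
    print(dmatrix("a", df))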

    An easy way to avoid this is to remove the intercept -- then patsy won't be worried about the linear dependence, and will use the kind of one-hot encoding you're expecting: y ~ -1 + a (the -1 cancels out the invisible 1 to remove the intercept).
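
    Continuing the sketch above, dropping the intercept flips patsy into the full-rank coding you want:

    # "-1" (or equivalently "0") cancels the implicit intercept, so patsy
    # now emits one indicator column per level: a[high], a[low], a[mid].
    print(dmatrix("-1 + a", df))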

    Alternatively, if you really want an over-parameterized model, the docs page you linked to explains (further down) how to define arbitrary custom encoding schemes. For example:

    import numpy as np
    from patsy import ContrastMatrix

    class FullRankOneHot:
        # Called to generate a full-rank encoding: the identity matrix gives
        # one indicator column per level, and "[My.<level>]" is the suffix
        # patsy appends to each column name.
        def code_with_intercept(self, levels):
            return ContrastMatrix(np.eye(len(levels)),
                                  ["[My.%s]" % (level,) for level in levels])

        # Called to generate a non-full-rank encoding. But we don't care,
        # we do what we want, and return a full-rank encoding anyway.
        # Take that, patsy.
        def code_without_intercept(self, levels):
            return self.code_with_intercept(levels)


    Then you can use it like: y ~ 1 + C(a, FullRankOneHot).
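
    As a quick check, continuing the toy example from above (patsy instantiates the class itself when you pass it inside C(...)):

    # Intercept plus one "[My.<level>]" column per level of a -- an
    # over-parameterized design matrix, exactly as requested.
    print(dmatrix("1 + C(a, FullRankOneHot)", df))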