pandasmachine-learninghashscikit-learnstring-hashing

How exactly does feature hashing work?


I have read many online articles on feature hashing of categorical variables for machine learning. Unfortunately, I still couldn't grasp the concept and understand how it works. I will illustrate my confusion through the sample dataset and hashing function below that I grabbed from another site:

>>>data

   pop     state  year
0  1.5      Ohio  2000
1  1.7      Ohio  2001
2  3.6  New York  2002
3  2.4    Nevada  2001
4  2.9    Nevada  2002
5  1.8    Oregon  2003

>>> def hash_col(df, col, N):
       cols = [col + "_" + str(i) for i in range(N)]
       def xform(x): tmp = [0 for i in range(N)]; tmp[hash(x) % N] = 1; return pd.Series(tmp,index=cols)
       df[cols] = df[col].apply(xform)
       return df.drop(col,axis=1)

The below functions are used to print out the different transformed output by specifying different number of dimensions (or in other words, hashed features):

>>> print(hash_col(data, 'state',4))

   pop  year  state_0  state_1  state_2  state_3
0  1.5  2000        0        0        1        0
1  1.7  2001        0        0        1        0
2  3.6  2002        0        0        0        1
3  2.4  2001        0        1        0        0
4  2.9  2002        0        1        0        0
5  1.8  2003        0        0        0        1

>>> print(hash_col(data, 'state',5))
   pop  year  state_0  state_1  state_2  state_3  state_4
0  1.5  2000        1        0        0        0        0
1  1.7  2001        1        0        0        0        0
2  3.6  2002        1        0        0        0        0
3  2.4  2001        0        0        1        0        0
4  2.9  2002        0        0        1        0        0
5  1.8  2003        0        0        0        0        1

>>> print(hash_col(data, 'state',6))
   pop  year  state_0  state_1  state_2  state_3  state_4  state_5
0  1.5  2000        0        0        0        0        1        0
1  1.7  2001        0        0        0        0        1        0
2  3.6  2002        0        0        0        0        0        1
3  2.4  2001        0        0        0        1        0        0
4  2.9  2002        0        0        0        1        0        0
5  1.8  2003        0        0        0        0        0        1

What I can't understand is what does each of the 'state_0', 'state_1', 'state_2' etc. column represent. Also, since there are 4 unique states in my dataset (Ohio, New York, Nevada, Oregon), why are all the '1' allocated to just 3 'state_n' columns instead of 4 as in one hot encoding? For example, when I set the number of dimensions to 6, the output had two '1' in state_3, state_4 and state_5, but there was no '1' in state_0, state_1 and state_2. Any feedback would be greatly appreciated!


Solution

  • Feature hashing is typically used when you don't know all the possible values of a categorical variable. Because of this, we can't create a static mapping from categorical values to columns. So a hash function is used to determine which column each categorical value corresponds to.

    This is not the best use case because we know there are exactly 50 states and could just use one hot encoding.

    A hash function will also have collisions, where different values are mapped to the same value. That is what's happening here. Two different state names hash to the same value after the modulus operation in the hash function.

    One way to alleviate collisions is to make your feature space(number of columns) larger than the number of possible categorical values.