I have a file:
data = pd.read_csv('data.csv')
That file contains categorical text data about digital users, such as source ('google', 'facebook', 'twitter') and country ('US', 'FR', 'GER').
Using the sklearn.feature_extraction.DictVectorizer class, I've managed to turn these categories into numpy arrays. I then created a dictionary that contains the text categories as keys and the vectorized numpy array for the relevant category as the value, i.e.:
{'google': np.array([0., 0., 0., 0., 1.]),
 'facebook': np.array([1., 0., 0., 0., 0.]),
 'FR': np.array([0., 0., 1.])}
What I would ideally like to do is replace each text category (e.g., 'google') with its vectorized numpy array value (e.g., np.array([0., 0., 0., 0., 1.])), so that I can then use a feature reduction algorithm to reduce the features down to 2 for visualization purposes.
So ideally, rows in the data that read:
source  | country
google  | FR
twitter | US
would instead read:
source                         | country
np.array([0., 0., 0., 0., 1.]) | np.array([0., 0., 1.])
np.array([1., 0., 0., 0., 0.]) | np.array([1., 0., 0.])
Could someone recommend the best way to go about this?
Perhaps this is a slightly more succinct way of converting the categorical data to a numerical representation. I had to brush up on it a little, since I've been using R mostly lately. This blog post was a great resource.
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Small test frame mirroring the data in the question
d = {'source': pd.Series(['google', 'facebook', 'twitter', 'twitter'],
                         index=['1', '2', '3', '4']),
     'country': pd.Series(['GER', 'GER', 'US', 'FR'],
                          index=['1', '2', '3', '4'])}
df = pd.DataFrame(d)

# One dictionary per row: transpose, convert to dict, keep only the row dicts
df_as_dicts = df.T.to_dict().values()
df.T gives the transpose, and applying to_dict() to it produces a dictionary keyed by row index whose values are the per-row dictionaries that DictVectorizer wants; the values() call then keeps just those row dictionaries and drops the indices.
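Note that the same list of row dictionaries can also be obtained in a single call, without the transpose, using the 'records' orientation:

df_as_dicts = df.to_dict(orient='records')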
df_as_dicts:
[{'source': 'google', 'country': 'GER'},
 {'source': 'facebook', 'country': 'GER'},
 {'source': 'twitter', 'country': 'US'},
 {'source': 'twitter', 'country': 'FR'}]
Then the conversion using DictVectorizer follows:
vectorizer = DictVectorizer(sparse=False)
d_as_vecs = vectorizer.fit_transform(df_as_dicts)
resulting in:
array([[0., 1., 0., 0., 1., 0.],
       [0., 1., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 1.],
       [1., 0., 0., 0., 0., 1.]])
get_feature_names_out() (get_feature_names() in older scikit-learn versions) allows us to retrieve the column names for this array from the vectorizer, if we want to check our result:
vectorizer.get_feature_names_out()
array(['country=FR', 'country=GER', 'country=US', 'source=facebook',
       'source=google', 'source=twitter'], dtype=object)
Note that DictVectorizer sorts the feature names, which is why the country columns come first.
This confirms that the conversion has given us a correct one-hot encoded representation of the test data.
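Since the stated goal was to reduce the features down to 2 for visualization, here is a minimal sketch of that last step, assuming the d_as_vecs array from above (PCA is just one choice of reduction algorithm; TruncatedSVD is the usual pick if you keep the vectorizer's output sparse):

from sklearn.decomposition import PCA

# Project the 6 one-hot columns down to 2 dimensions for plotting
pca = PCA(n_components=2)
reduced = pca.fit_transform(d_as_vecs)
# 'reduced' is an (n_rows, 2) array; each row is now a point for a scatter plot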