I have a file:
data = pd.read_csv('data.csv')
That file contains categorical text data about digital users, such as source ('google', 'facebook', 'twitter') and country ('US', 'FR', 'GER').
Using the sklearn.feature_extraction.DictVectorizer class, I've managed to turn these categories into numpy arrays. I then created a dictionary that contains the text categories as keys and the vectorized numpy array for the relevant category as the value, i.e.:
{'google': np.array([0., 0., 0., 0., 1.]),
 'facebook': np.array([1., 0., 0., 0., 0.]),
 'FR': np.array([0., 0., 1.])}
What I would ideally like to do is replace each text category (e.g., 'google') with its vectorized numpy array value (e.g., np.array([0., 0., 0., 0., 1.])), so that I can then use a feature reduction algorithm to reduce the features down to 2 for visualization purposes.
So ideally, rows in the data that read:
source  | country
google  | FR
twitter | US
would instead read:
source                         | country
np.array([0., 0., 0., 0., 1.]) | np.array([0., 0., 1.])
np.array([1., 0., 0., 0., 0.]) | np.array([1., 0., 0.])
Could someone recommend the best way to go about this?
Perhaps this is a slightly more succinct way of converting the categorical data to a numerical representation. I had to brush up on it a little, since I've been using R mostly lately. This blog post was a great resource.
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Small test frame mirroring the data in the question
d = {'source': pd.Series(['google', 'facebook', 'twitter', 'twitter'],
                         index=['1', '2', '3', '4']),
     'country': pd.Series(['GER', 'GER', 'US', 'FR'],
                          index=['1', '2', '3', '4'])}
df = pd.DataFrame(d)

# One dictionary per row: transpose, convert to dict, keep only the row dicts
df_as_dicts = df.T.to_dict().values()
df.T gives the transpose, and applying to_dict() to it produces a dictionary keyed by row index whose values are the per-row dictionaries that DictVectorizer wants; the values() call then keeps just those row dictionaries and drops the indices.
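Note that the same list of row dictionaries can also be obtained in a single call, without the transpose, using the 'records' orientation:

df_as_dicts = df.to_dict(orient='records')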
df_as_dicts:
[{'source': 'google', 'country': 'GER'},
 {'source': 'facebook', 'country': 'GER'},
 {'source': 'twitter', 'country': 'US'},
 {'source': 'twitter', 'country': 'FR'}]
Then the conversion using DictVectorizer follows:
vectorizer = DictVectorizer(sparse=False)
d_as_vecs = vectorizer.fit_transform(df_as_dicts)
resulting in:
array([[0., 1., 0., 0., 1., 0.],
       [0., 1., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 1.],
       [1., 0., 0., 0., 0., 1.]])
get_feature_names_out() (get_feature_names() in older scikit-learn versions) allows us to retrieve the column names for this array from the vectorizer, if we want to check our result:
vectorizer.get_feature_names_out()
array(['country=FR', 'country=GER', 'country=US', 'source=facebook',
       'source=google', 'source=twitter'], dtype=object)
Note that DictVectorizer sorts the feature names, which is why the country columns come first.
This confirms that the conversion has given us a correct one-hot encoded representation of the test data.
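Since the stated goal was to reduce the features down to 2 for visualization, here is a minimal sketch of that last step, assuming the d_as_vecs array from above (PCA is just one choice of reduction algorithm; TruncatedSVD is the usual pick if you keep the vectorizer's output sparse):

from sklearn.decomposition import PCA

# Project the 6 one-hot columns down to 2 dimensions for plotting
pca = PCA(n_components=2)
reduced = pca.fit_transform(d_as_vecs)
# 'reduced' is an (n_rows, 2) array; each row is now a point for a scatter plot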