I am trying to build a text classifier with scikit-learn, using bag-of-words vectorization as the input to a classifier. However, I was wondering how I would add another variable to the input apart from the text itself. Say I want to add the number of words in the text in addition to the text (because I think it may affect the result). How should I go about doing so?
Do I have to add another classifier on top of that one? Or is there a way to add that input to the vectorized text?
Scikit-learn classifiers work with NumPy arrays and SciPy sparse matrices. This means that after vectorizing your text, you can append your new features as extra columns of the feature matrix (not quite trivially, but it is doable). The catch is that in text categorization the vectorizer returns a sparse matrix, so ordinary NumPy column stacking does not work; scipy.sparse.hstack has to be used instead.
The code below is modified from the text mining example in the scikit-learn SciPy 2013 tutorial.
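from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
import numpy as np
import scipy.sparse  # import the submodule explicitly; plain `import scipy` may not load it

# Assumption: `categories` is not defined in the original snippet. This list of
# four newsgroups is only an example and may differ from the one used for the
# output shown below.
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

# Load the text data
twenty_train_subset = load_files('datasets/20news-bydate-train/',
    categories=categories, encoding='latin-1')

# Turn the text documents into sparse TF-IDF vectors
vectorizer = TfidfVectorizer(min_df=2)
X_train_only_text_features = vectorizer.fit_transform(twenty_train_subset.data)
print(type(X_train_only_text_features))
print("X_train_only_text_features", X_train_only_text_features.shape)

# Number of documents (rows)
size = X_train_only_text_features.shape[0]
print("size", size)

# Placeholder extra feature: a dense column of ones, one value per document
ones_column = np.ones(size).reshape(size, 1)
print("ones_column", ones_column.shape)

# Convert the dense column to a sparse matrix so it can be stacked
new_column = scipy.sparse.csr_matrix(ones_column)
print(type(new_column))
print("new_column", new_column.shape)

# Stack the new column to the left of the text features
X_train = scipy.sparse.hstack([new_column, X_train_only_text_features])
print("X_train", X_train.shape)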
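The output is as follows (recorded under Python 2, which is why one shape carries an `L` suffix; Python 3 prints `(2034, 1)`):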
<class 'scipy.sparse.csr.csr_matrix'>
X_train_only_text_features (2034, 17566)
size 2034
ones_column (2034L, 1L)
<class 'scipy.sparse.csr.csr_matrix'>
new_column (2034, 1)
X_train (2034, 17567)
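To add the word count from the question instead of a column of ones, the same stacking could look roughly like the sketch below. It is only a minimal sketch: the toy corpus, the labels, and the helper name `add_word_count` are made up for illustration, and counting words by splitting on whitespace is an assumption about what "number of words" should mean.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus and labels, for illustration only
texts = [
    "the rocket reached orbit",
    "the shuttle launch was delayed by weather",
    "new graphics card renders faster",
    "the graphics driver update fixed the rendering bug",
]
labels = [0, 0, 1, 1]


def add_word_count(X_text, docs):
    """Prepend a word-count column to a sparse text-feature matrix (hypothetical helper)."""
    # One count per document, shaped as a column: (n_docs, 1)
    counts = np.array([len(d.split()) for d in docs], dtype=float).reshape(-1, 1)
    # hstack returns COO by default; CSR is requested so rows can be sliced later
    return sparse.hstack([sparse.csr_matrix(counts), X_text], format="csr")


vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(texts)
X_train = add_word_count(X_text, texts)

clf = MultinomialNB().fit(X_train, labels)

# New documents must go through the same two steps: transform, then add the count
new_docs = ["the launch reached orbit"]
X_new = add_word_count(vectorizer.transform(new_docs), new_docs)
predicted = clf.predict(X_new)
print(X_train.shape, predicted)
```

MultinomialNB needs non-negative features, which a word count satisfies. The raw count is on a much larger scale than the TF-IDF values, though, so it may be worth trying a scaled version of it (for example, dividing by the longest document length) and comparing results.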