[SOLVED] Sklearn other inputs in addition to text for text classification

I am trying to build a text classifier with scikit-learn using a bag-of-words model: the text is vectorized and then fed into a classifier. However, I was wondering how I would add another variable to the input apart from the text itself. Say I want to add the number of words in the text as an extra feature (because I think it may affect the result). How should I go about doing so?
Do I have to add another classifier on top of that one? Or is there a way to add that input to the vectorized text?

Solution

Scikit-learn classifiers work with NumPy arrays and SciPy sparse matrices. This means that after vectorizing your text, you can append your new features to the feature matrix as extra columns (not trivially, but it is doable). The catch is that in text categorization the vectorizer returns a sparse matrix, so ordinary NumPy column operations do not work on it; the new column has to be stacked with scipy.sparse.hstack instead.
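As a minimal sketch of the idea for the word-count feature from the question (the toy documents and the whitespace-based word count are assumptions made for illustration, not part of the original answer), the extra column can be appended like this:

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, for illustration only)
docs = [
    "the cat sat on the mat",
    "dogs bark",
    "a quick brown fox jumps over the lazy dog",
]

# Bag-of-words features: a SciPy sparse matrix with one row per document
vectorizer = CountVectorizer()
X_text = vectorizer.fit_transform(docs)

# Extra feature (assumption): number of whitespace-separated words per document,
# shaped as a single column so it lines up with the rows of X_text
word_counts = np.array([len(d.split()) for d in docs], dtype=float).reshape(-1, 1)

# Convert the dense column to sparse and stack it to the right of the text features
X_combined = sparse.hstack([X_text, sparse.csr_matrix(word_counts)]).tocsr()

print(X_text.shape, X_combined.shape)
```

The combined matrix has one more column than the text-only matrix. Depending on the classifier, it may be worth rescaling the extra column so that it is on a scale comparable to the text features.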

The code below is modified from the text mining example in the scikit-learn tutorial given at SciPy 2013.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
import numpy as np
from scipy import sparse

# Assumption: `categories` was not defined in the original snippet. These are
# believed to be the four newsgroups used in the tutorial; adjust to your data.
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

# Load the text data
twenty_train_subset = load_files('datasets/20news-bydate-train/',
    categories=categories, encoding='latin-1')

# Turn the text documents into vectors of word frequencies
vectorizer = TfidfVectorizer(min_df=2)
X_train_only_text_features = vectorizer.fit_transform(twenty_train_subset.data)

print(type(X_train_only_text_features))
print("X_train_only_text_features", X_train_only_text_features.shape)

size = X_train_only_text_features.shape[0]
print("size", size)

# Placeholder extra feature: a column of ones, one row per document
ones_column = np.ones(size).reshape(size, 1)
print("ones_column", ones_column.shape)

# Convert the dense column to a sparse matrix so it can be stacked
new_column = sparse.csr_matrix(ones_column)
print(type(new_column))
print("new_column", new_column.shape)

# Stack the new column to the left of the sparse text features
X_train = sparse.hstack([new_column, X_train_only_text_features])

print("X_train", X_train.shape)

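The output is as follows (from the original Python 2 run; Python 3 prints the shapes without the L suffix, and the class path may differ between SciPy versions):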

<class 'scipy.sparse.csr.csr_matrix'>
X_train_only_text_features (2034, 17566)
size 2034
ones_column (2034L, 1L)
<class 'scipy.sparse.csr.csr_matrix'>
new_column (2034, 1)
X_train (2034, 17567)
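
To actually use the stacked matrix, the same extra column has to be appended at prediction time, so that training and test matrices have the same number of columns. A rough sketch under stated assumptions follows: the helper name add_word_count, the toy texts, and the labels are all hypothetical, and MultinomialNB is used only because the answer imports it (it requires non-negative features, which word counts satisfy).

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


def add_word_count(X_text, texts):
    """Append a word-count column (hypothetical extra feature) to a sparse matrix."""
    counts = np.array([len(t.split()) for t in texts], dtype=float).reshape(-1, 1)
    return sparse.hstack([X_text, sparse.csr_matrix(counts)]).tocsr()


# Toy training data (hypothetical, for illustration only)
train_texts = [
    "cheap pills buy now",
    "limited offer buy cheap",
    "meeting agenda for monday attached please review before the call",
    "please find the quarterly report attached for your review",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Fit the vectorizer on the training texts only, then append the extra column
vectorizer = TfidfVectorizer()
X_train = add_word_count(vectorizer.fit_transform(train_texts), train_texts)

clf = MultinomialNB()
clf.fit(X_train, train_labels)

# At prediction time: transform (not fit_transform), then append the same feature
test_texts = ["buy cheap pills", "agenda for the monday review call attached"]
X_test = add_word_count(vectorizer.transform(test_texts), test_texts)
predictions = clf.predict(X_test)

print(X_train.shape, X_test.shape, predictions)
```

Keeping the feature construction in one helper makes it harder to forget the extra column on new data. So a second classifier on top is not required; a single classifier can consume the combined matrix.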