I want to analyze data that has been incorrectly classified by a model built with scikit-learn, so that I can improve my feature generation. I have a method for doing this, but I am new to both scikit-learn and pandas, so I'd like to know if there is a more efficient/direct way to accomplish it. It seems like something that would be part of a standard workflow, but in the research I did, I didn't find anything directly addressing this backwards mapping from the model's classifications, through the feature matrix, to the original data.
Here's the context/workflow I'm using, as well as the solution I've devised. Below that is sample code.
Context. My workflow looks like this: raw data -> feature matrix -> train/test split -> train model -> predict on the test set -> analyze the misclassified items.
Solution. Carry the original row indices through train_test_split, then use the test-set indices of the errors to look the corresponding rows up in the original data.
Here's code associated with an example using tweets. Again, this works, but is there a more direct/smarter way to do it?
# imports assumed by this snippet
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import tweet_fns   # my own module containing check_predictions (shown below)

# take a sample of our original data
data = tweet_df[0:100]['texts']
y = tweet_df[0:100]['truth']

# create the feature vectors
vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
X = vec.fit_transform(data)   # this is now the feature matrix

# split the feature matrix into train/test subsets, keeping the array indices
# so each test row can be traced back to the original X (and hence to data)
indices = np.arange(X.shape[0])
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, indices, test_size=0.2, random_state=state)

# fit and test a model
forest = RandomForestClassifier()
forest.fit(X_train, y_train)
predictions = forest.predict(X_test)

# get the indices of the false negatives and false positives in the test set
false_neg, false_pos = tweet_fns.check_predictions(predictions, y_test)

# map the false-negative indices in the test set (features) back to their original data (text)
print("False negatives:\n")
pd.options.display.max_colwidth = 140
for i in false_neg:
    original_index = idx_test[i]
    print(data.iloc[original_index])
and the check_predictions function:
def check_predictions(predictions, truth):
    # take a 1-dim array of predictions from a model and a 1-dim truth vector, and measure agreement;
    # returns the indices of the false negatives and false positives in the predictions
    truth = truth.astype(bool)
    predictions = predictions.astype(bool)
    print(sum(predictions == truth), 'of', len(truth), 'or',
          float(sum(predictions == truth)) / float(len(truth)), 'match')
    # false positives
    print("false positives:", sum(predictions & ~truth))
    # false negatives
    print("false negatives:", sum(~predictions & truth))
    false_neg = np.nonzero(~predictions & truth)   # these are tuples of arrays
    false_pos = np.nonzero(predictions & ~truth)
    return false_neg[0], false_pos[0]   # we just want the arrays
Your workflow is:
raw data -> features -> split -> train -> predict -> error analysis on the labels
There is row-for-row correspondence between the predictions and the feature matrix, so if you want to do error analysis on the features, there should be no problem. If you want to see what raw data is associated with errors, then you have to either do the split on the raw data, or else track which data rows mapped to which test rows (your current approach).
The first option looks like:
fit transformer on raw data -> split raw data -> transform train/test separately -> train/test -> ...
That is, it uses fit before splitting and transform after splitting, leaving you with the raw data partitioned in the same way as the labels.
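Here is a minimal sketch of that option, reusing the names from the question (tweet_df, state, and check_predictions are assumed to exist as above; the vectorizer and classifier settings are just copied from the question, not a recommendation):

# fit the vectorizer on the raw text first, then split the raw text itself,
# so the test split stays human-readable
data = tweet_df[0:100]['texts']
y = tweet_df[0:100]['truth']

vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
vec.fit(data)                          # fit on the raw data before splitting

data_train, data_test, y_train, y_test = train_test_split(
    data, y, test_size=0.2, random_state=state)

X_train = vec.transform(data_train)    # transform each split separately
X_test = vec.transform(data_test)

forest = RandomForestClassifier()
forest.fit(X_train, y_train)
predictions = forest.predict(X_test)

false_neg, false_pos = check_predictions(predictions, y_test)

# the test rows are raw text, so errors map straight back to the tweets
print("False negatives:\n")
for i in false_neg:
    print(data_test.iloc[i])

One design note: fitting the vectorizer on all of the raw data before the split means the idf statistics have already seen the test rows; if that leakage matters for your evaluation, fit on data_train only after splitting, and the mapping back to raw text works the same way.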