I want to run classification on a column that has a few possible values, but I want to consolidate them into fewer labels.
For example, a job may have multiple end states: `success`, `fail`, `error`, `killed`. I want to classify the jobs into two groups of end states: one that includes `error` and `killed`, and another that includes only `success` and `fail`.
I cannot find a way to do that with sklearn's `LabelEncoder`, other than manually changing the target column myself (by assigning `1` to `success` or `fail` and `0` to everything else).
EDIT: an example. This is what I need to happen:
>>> label_binarize(['success','fail','error','killed', 'success'], classes=(['success', 'fail']))
array([[1],
       [1],
       [0],
       [0],
       [1]])
Unfortunately, `label_binarize` (or `LabelBinarizer`, for that matter) encodes each class in its own column. This is not what I want:
>>> label_binarize(['success','fail','error','killed', 'success'], classes=['success', 'fail'])
array([[1, 0],
       [0, 1],
       [0, 0],
       [0, 0],
       [1, 0]])
Any ideas on how to do that?
Maybe you should check out `label_binarize`. You could set `success` as the only class, so that every other label defaults to 0. This gives the same result as changing the data prior to encoding, but might fit better into your pipeline.
from sklearn.preprocessing import label_binarize
label_binarize(['success','fail','error','killed', 'success'], classes=['success'])
Output
array([[1],
       [0],
       [0],
       [0],
       [1]])
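If both `success` and `fail` should map to 1, as in the question, one option outside of `label_binarize` is a plain membership test. The following is a minimal sketch, assuming NumPy is available; the names `states` and `positive_states` are made up for illustration and do not come from sklearn.

```python
import numpy as np

# End states observed for each job (same sample data as in the question).
states = np.array(['success', 'fail', 'error', 'killed', 'success'])

# Hypothetical name: the labels that should collapse into the positive class (1).
# Every label not listed here becomes 0.
positive_states = ['success', 'fail']

# np.isin returns a boolean mask; cast it to int and reshape it into a column
# so it has the same (n_samples, 1) layout as the label_binarize output.
y = np.isin(states, positive_states).astype(int).reshape(-1, 1)
print(y)
```

This is effectively the "change the target column first" approach written as a single expression, so it may be easier to wrap in a small function if it needs to sit inside a pipeline.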