Tags: r, classification, party, predictive

Unbalanced dataset, Classification tree and cost matrix in R


I'm trying to create a classification model to predict one of two classes: "Hit" or "Miss".

The dataset contains around 80% "Hits", so it is highly unbalanced; models such as classification trees (ctree from the party package) therefore predict every outcome as "Hit" and still achieve about 80% accuracy.

I have tried undersampling and the SMOTE algorithm, without success.

How can I specify a cost matrix to penalize the model when it classifies a "Miss" as a "Hit"?


Solution

  • You can do that with the weights argument to ctree. Since you do not provide any data, I will illustrate with bogus data.

    library(party)
    
    ## Some bogus data: two predictors and a class with roughly 80/20 imbalance
    set.seed(42)
    class = factor(sample(1:2, 500, replace=TRUE, prob=c(0.8, 0.2)))
    ## The class 2 points are shifted slightly in both x and y
    x1 = rnorm(500)
    x2 = rnorm(500, 0.7, 0.9)
    x = ifelse(class == 1, x1, x2)
    y1 = rnorm(500)
    y2 = rnorm(500, 0.7, 0.9)
    y = ifelse(class == 1, y1, y2)
    Imbalanced = data.frame(x, y, class)
    

    Just using ctree on this data makes it classify all data as class 1.

    CT1 = ctree(class ~ ., data=Imbalanced)
    table(predict(CT1))
      1   2 
    500   0 
    

    But if you set the weights, you can make it find more of the class 2 data.

    ## Weight the minority class 2 twice as heavily as class 1
    W = ifelse(class==1, 1, 2)
    CT2 = ctree(class ~ ., data=Imbalanced, weights=W)
    table(predict(CT2), class)
       class
          1   2
      1 336  44
      2  63  57
    

    Notice that the overall accuracy has gone down, but more of the class 2 points are now correctly classified. If you use a really big weighting factor, you can capture almost all of the class 2 points (at the expense of an even greater loss of overall accuracy).

    ## A much larger weight on the minority class 2
    W = ifelse(class==1, 1, 5)
    CT3 = ctree(class ~ ., data=Imbalanced, weights=W)
    table(predict(CT3), class)
       class
          1   2
      1 178   4
      2 221  97
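
    To connect this back to the cost-matrix question: a simple way to use an explicit cost matrix with ctree is to convert it into case weights, weighting each observation by the cost of misclassifying its true class. Below is a minimal sketch under that assumption; the cost values and the names costs and CT4 are made up for illustration, not anything built into the party package.

    ## Hypothetical cost matrix: rows = true class, columns = predicted class.
    ## Misclassifying a true class 2 ("Miss") as class 1 ("Hit") costs 5;
    ## the reverse mistake costs 1; correct predictions cost 0.
    costs = matrix(c(0, 1,
                     5, 0),
                   nrow=2, byrow=TRUE,
                   dimnames=list(true=c("1","2"), pred=c("1","2")))
    
    ## Weight each observation by the cost of misclassifying its true class
    ## (ctree only accepts non-negative integer-valued weights).
    W = ifelse(class == 1, costs["1","2"], costs["2","1"])
    CT4 = ctree(class ~ ., data=Imbalanced, weights=W)
    table(predict(CT4), class)
    

    With these particular costs the weights reduce to the same 1-versus-5 scheme used for CT3, so this mainly shows how a cost matrix maps onto the weights argument rather than a new model.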