r party

Error in kids_node(node)[[i]] : subscript out of bounds in partykit


I'm trying to replicate the procedure proposed here on my data.

target is the categorical variable that I want to predict, and I want to force the first split of the classification tree to be made on split.variable (also categorical). By the nature of the objects, if split.variable is 1 then target can only be 1, while if it is 0 then target can be 0 or 1. This leads to:

> table(training_set$target, training_set$split.variable)
     0  1
  0 69  0
  1 59 56

I'm able to create tr1 and tr2. tr3 returns an error [Error in contrasts<-(*tmp*, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels] because, if I'm correct, that branch is "empty" (target only takes one value there), so tr3 is not needed [see also this post].

tr1 <- ctree(target ~ split.variable,     data = training_set, maxdepth = 1) # create the first split at split.variable
tr2 <- ctree(target ~ split.variable + ., data = training_set,  # then the left branch...
             subset = predict(tr1, type = "node") == 2)
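
A quick way to check the explanation for the tr3 error is to count the distinct target values in the right branch. The sketch below rebuilds only the two columns from the table above (the counts 69, 59 and 56 are taken from that table; the remaining predictors are left out), so it is an illustration under that assumption rather than the real data set.

```r
# Rebuild target and split.variable from the cross-table counts above.
# Assumption: only these two columns matter for the check.
toy <- data.frame(
  target         = factor(rep(c(0, 1, 1), times = c(69, 59, 56))),
  split.variable = factor(rep(c(0, 0, 1), times = c(69, 59, 56)))
)

# Same layout as table(training_set$target, training_set$split.variable)
tab <- table(toy$target, toy$split.variable)

# Right branch: rows with split.variable == 1, with unused levels dropped
right <- droplevels(subset(toy, split.variable == "1"))

# Number of distinct target values left in that branch
n_right_levels <- nlevels(right$target)
n_right_levels
```

If the result is 1, the response is constant in that branch, which is consistent with the "2 or more levels" message and with the branch being a plain terminal node.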

fix_ids <- function(x, startid = 1L) {
  id <- startid - 1L
  new_node <- function(x) {
    id <<- id + 1L
    if(is.terminal(x)) return(partynode(id, info = info_node(x)))
    partynode(id,
              split = split_node(x),
              kids = lapply(kids_node(x), new_node),
              surrogates = surrogates_node(x),
              info = info_node(x))
  }
  
  return(new_node(x))   
}

no <- node_party(tr1)
no$kids <- list(
  fix_ids(node_party(tr2), startid = 2L)
  #, fix_ids(node_party(tr3), startid = 5L)
  )
no # visualize the structure    
[1] root
|   [2] V2 <= 1
|   |   [3] V15 <= -2.489 *
|   |   [4] V15 > -2.489 *

mdf <- model.frame(target ~ split.variable + ., data = training_set)
tr <- party(no, 
            data = mdf,
            fitted = data.frame(
              "(fitted)" = fitted_node(no, data = mdf),
              "(response)" = model.response(mdf),
              check.names = FALSE),
            terms = terms(mdf))

However, running party(...) gives the following error:

Error in kids_node(node)[[i]] : subscript out of bounds

The only reference to this error that I was able to find is this GitHub issue.

Here is the traceback:

8: is.terminal(node)
7: fitted_node(kids_node(node)[[i]], data, vmatch, obs[indx], perm)
6: fitted_node(no, data = mdf)
5: data.frame(`(fitted)` = fitted_node(no, data = mdf), `(response)` = model.response(mdf), 
       check.names = FALSE)
4: party(no, data = mdf, fitted = data.frame(`(fitted)` = fitted_node(no, 
       data = mdf), `(response)` = model.response(mdf), check.names = FALSE), 
       terms = terms(mdf), )
3: .is.positive.intlike(x)
2: .traceback(x, max.lines = max.lines)
1: traceback(party(no, data = mdf, fitted = data.frame(`(fitted)` = fitted_node(no, 
       data = mdf), `(response)` = model.response(mdf), check.names = FALSE), 
       terms = terms(mdf), ))

I can't tell whether the issue is related to the missing branch, to mlr, or to something else specific to my data.


Solution

  • Your issue

    The problem is that in no$kids you define only the first subtree and leave out the second subtree (consisting of just a terminal node). The root split still sends observations to a second kid, so fitted_node() fails when it tries to access it. You can simply set this up with the correct id as partynode(5L), i.e.,

    no$kids <- list(
      fix_ids(node_party(tr2), startid = 2L),
      partynode(5L)
    )
    

    This is already sufficient here. If the node you're subsetting had an info associated with it (not the case here), you would also have to pass that on:

    no$kids <- list(
      fix_ids(node_party(tr2), startid = 2L),
      partynode(5L, info = info_node(kids_node(tr1$node)[[2L]]))
    )
    

    After that you can follow the steps from the other answer to set up your constparty object.
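
    The whole sequence can be sketched on toy data as follows. This is a minimal illustration, not the original training_set: the data frame d, its column x and the cut point 0 are made up, and the nodes are built by hand with partynode() and partysplit() instead of being taken from tr1 and tr2. It shows that fitted_node() fails when the second kid is missing and works once partynode(5L) is supplied.

```r
library(partykit)

# Hypothetical stand-in for training_set (columns: 1 = target, 2 = split.variable, 3 = x)
set.seed(42)
n <- 120
d <- data.frame(
  target         = factor(rep(c("0", "1"), length.out = n)),
  split.variable = factor(rep(c("0", "1"), each = n / 2)),
  x              = rnorm(n)
)

# Left subtree (ids 2, 3, 4): numeric split on column 3 at the made-up cut point 0
left <- partynode(2L,
  split = partysplit(3L, breaks = 0),
  kids  = list(partynode(3L), partynode(4L)))

# Complete root: factor split on column 2, left subtree plus terminal node 5
no <- partynode(1L,
  split = partysplit(2L, index = 1:2),
  kids  = list(left, partynode(5L)))

# Incomplete root as in the question: second kid dropped by direct assignment
bad <- no
bad$kids <- list(left)
msg <- tryCatch(fitted_node(bad, data = d),
                error = function(e) conditionMessage(e))

# With both kids present, every observation gets a terminal node id
fit <- fitted_node(no, data = d)

tr <- party(no,
  data   = d,
  fitted = data.frame(
    "(fitted)"   = fit,
    "(response)" = d$target,
    check.names  = FALSE),
  terms  = terms(target ~ ., data = d))
tr <- as.constparty(tr)
```

    Here msg holds the error message from the incomplete node, while tr is a constparty that can be printed, plotted and used with predict() as usual.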

    More generally

    I don't understand why you are doing this in the first place. If split.variable = 1 always implies target = 1, then there seems to be no point in modeling that. So why not just model the subset of the data with split.variable = 0?
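
    Modeling only the split.variable = 0 subset could look roughly like this. The data are simulated (d, x and the helper predict_target() are hypothetical names, not part of partykit or of the original data), and the rule "split.variable = 1 implies target = 1" is hard-coded in the helper rather than learned.

```r
library(partykit)

# Simulated stand-in for training_set: target is always "1" when split.variable is "1"
set.seed(1)
n <- 200
d <- data.frame(
  split.variable = factor(rep(c("0", "1"), each = n / 2)),
  x              = rnorm(n)
)
d$target <- factor(ifelse(d$split.variable == "1" | d$x > 0, "1", "0"),
                   levels = c("0", "1"))

# Fit the tree only where target actually varies, without split.variable
d0  <- subset(d, split.variable == "0", select = -split.variable)
tr0 <- ctree(target ~ ., data = d0)

# Hypothetical helper: apply the known rule first, use the tree for the rest
predict_target <- function(newdata) {
  out <- factor(rep("1", nrow(newdata)), levels = levels(d$target))
  is0 <- newdata$split.variable == "0"
  if (any(is0)) {
    out[is0] <- predict(tr0, newdata = newdata[is0, , drop = FALSE])
  }
  out
}

pred <- predict_target(d)
```

    The tree then only has to explain the part of the data where target is not already determined.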

    But even if you decide that you want to model it, ctree chooses split.variable as the first split anyway. So all of this manual forcing of the split does not seem to be necessary in the first place.

    training_set <- read.csv("training_set.txt")
    training_set <- transform(training_set,
      target = factor(target),
      split.variable = factor(split.variable)
    )
    tr <- ctree(target ~ ., data = training_set)
    tr
    ## Model formula:
    ## target ~ split.variable + var1 + var2 + var3 + var4 + var5 + 
    ##     var6 + var7 + var8 + var9 + var10 + var11 + var12 + var13 + 
    ##     var14 + var15 + var16 + var17 + var18 + var19 + var20 + var21 + 
    ##     var22 + var23 + var24 + var25 + var26 + var27 + var28 + var29
    ## 
    ## Fitted party:
    ## [1] root
    ## |   [2] split.variable in 0
    ## |   |   [3] var13 <= -2.489: 1 (n = 28, err = 32.1%)
    ## |   |   [4] var13 > -2.489: 0 (n = 54, err = 25.9%)
    ## |   [5] split.variable in 1: 1 (n = 48, err = 0.0%)
    ## 
    ## Number of inner nodes:    2
    ## Number of terminal nodes: 3
    plot(tr)
    

    Visualization of constparty object created by plot(tr)