I'm trying to replicate the procedure proposed here on my data. target is the categorical variable that I want to predict, and I would like to force the first split of the classification tree to be made on split.variable (also categorical). Because of the nature of the objects, if split.variable is 1, target can only be 1, while if it is 0, target can be 0 or 1. This leads to:
> table(training_set$target, training_set$split.variable)
     0  1
  0 69  0
  1 59 56
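As a quick check of the relationship described above, the cross-tabulation can be inspected programmatically. This is only a sketch: the data frame `toy` is hypothetical and is rebuilt from the counts shown in the table, not from the real training set.

```r
# Hypothetical toy data reproducing the counts in the table above
toy <- data.frame(
  target         = factor(rep(c(0, 1, 1), times = c(69, 59, 56))),
  split.variable = factor(rep(c(0, 0, 1), times = c(69, 59, 56)))
)

tab <- table(toy$target, toy$split.variable)

# A column with exactly one non-zero cell means the response is constant
# within that level of split.variable, so nothing is left to split there.
pure <- apply(tab, 2, function(col) sum(col > 0) == 1)
pure
```

A level flagged as pure corresponds to a branch that can only ever be a terminal node.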
I'm able to create tr1 and tr2 (tr3 returns an error [Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels] because, if I'm correct, it's "empty", so there is no need for it [see also this post]).
tr1 <- ctree(target ~ split.variable, data = training_set, maxdepth = 1) # create the first split at split.variable
tr2 <- ctree(target ~ split.variable + ., data = training_set, # then the left branch...
  subset = predict(tr1, type = "node") == 2)
fix_ids <- function(x, startid = 1L) {
  id <- startid - 1L
  new_node <- function(x) {
    id <<- id + 1L
    if(is.terminal(x)) return(partynode(id, info = info_node(x)))
    partynode(id,
      split = split_node(x),
      kids = lapply(kids_node(x), new_node),
      surrogates = surrogates_node(x),
      info = info_node(x))
  }
  return(new_node(x))
}
no <- node_party(tr1)
no$kids <- list(
  fix_ids(node_party(tr2), startid = 2L)
  #, fix_ids(node_party(tr3), startid = 5L)
)
no # visualize the structure
[1] root
| [2] V2 <= 1
| | [3] V15 <= -2.489 *
| | [4] V15 > -2.489 *
mdf <- model.frame(target ~ split.variable + ., data = training_set)
tr <- party(no,
  data = mdf,
  fitted = data.frame(
    "(fitted)" = fitted_node(no, data = mdf),
    "(response)" = model.response(mdf),
    check.names = FALSE),
  terms = terms(mdf))
But when running party(...) I get the following error:
Error in kids_node(node)[[i]] : subscript out of bounds
The only reference to this error that I was able to find is this GitHub issue.
Here is the traceback:
8: is.terminal(node)
7: fitted_node(kids_node(node)[[i]], data, vmatch, obs[indx], perm)
6: fitted_node(no, data = mdf)
5: data.frame(`(fitted)` = fitted_node(no, data = mdf), `(response)` = model.response(mdf),
check.names = FALSE)
4: party(no, data = mdf, fitted = data.frame(`(fitted)` = fitted_node(no,
data = mdf), `(response)` = model.response(mdf), check.names = FALSE),
terms = terms(mdf), )
3: .is.positive.intlike(x)
2: .traceback(x, max.lines = max.lines)
1: traceback(party(no, data = mdf, fitted = data.frame(`(fitted)` = fitted_node(no,
data = mdf), `(response)` = model.response(mdf), check.names = FALSE),
terms = terms(mdf), ))
I can't tell whether this is an issue related to the missing branch, to mlr, or to something else specific to my data.
Your issue
The problem is that in no$kids you define only the first subtree and leave out the second one (consisting of just a terminal node). You can simply set it up with the correct id as partynode(5L), i.e.,
no$kids <- list(
  fix_ids(node_party(tr2), startid = 2L),
  partynode(5L)
)
This is already sufficient here. If that terminal node had an info associated with it (not the case here), you would also have to pass that on:
no$kids <- list(
  fix_ids(node_party(tr2), startid = 2L),
  partynode(5L, info = info_node(kids_node(tr1$node)[[2L]]))
)
After that you can follow the steps from the other answer to set up your constparty object.
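The effect of the missing kid can be reproduced on a small hand-built tree, without fitting anything. The sketch below is an illustration under assumptions: the data frame `d`, the subtree `sub` (a stand-in for node_party(tr2)) and the root (a stand-in for node_party(tr1)) are all hypothetical. Instead of hard-coding 5L, it derives the id of the terminal node from the renumbered subtree.

```r
library(partykit)

# Renumbering helper, copied from the question
fix_ids <- function(x, startid = 1L) {
  id <- startid - 1L
  new_node <- function(x) {
    id <<- id + 1L
    if (is.terminal(x)) return(partynode(id, info = info_node(x)))
    partynode(id,
      split = split_node(x),
      kids = lapply(kids_node(x), new_node),
      surrogates = surrogates_node(x),
      info = info_node(x))
  }
  new_node(x)
}

# Hypothetical toy data: column 2 is split.variable, column 3 is x
d <- data.frame(
  target         = factor(c(0, 0, 1, 1, 1, 1)),
  split.variable = c(0, 0, 0, 0, 1, 1),
  x              = c(-3, -3, 1, 1, 0, 0)
)

# Stand-in for node_party(tr2): a subtree splitting on x, numbered from 1
sub <- partynode(1L, split = partysplit(3L, breaks = -2),
  kids = list(partynode(2L), partynode(3L)))
left <- fix_ids(sub, startid = 2L)        # ids become 2, 3, 4
next_id <- max(nodeids(left)) + 1L        # id for the missing terminal node

# Stand-in for node_party(tr1): root split on split.variable, both kids present
root <- partynode(1L, split = partysplit(2L, breaks = 0),
  kids = list(left, partynode(next_id)))

# Mimic the question: a binary split with only one kid attached
broken <- root
broken$kids <- list(left)
err <- tryCatch(fitted_node(broken, data = d),
  error = function(e) conditionMessage(e))

# With both kids, every observation can be routed to a terminal node
fitted_ids <- fitted_node(root, data = d)
tr <- party(root, data = d,
  fitted = data.frame("(fitted)" = fitted_ids,
    "(response)" = d$target, check.names = FALSE),
  terms = terms(target ~ ., data = d))
tr <- as.constparty(tr)
tr
```

Here `err` holds the message raised when observations are routed to a kid that does not exist, while `fitted_ids` shows the node assignment once the terminal node is in place.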
More generally
I don't understand why you are doing this in the first place. If split.variable = 1 always implies target = 1, then there seems to be no point in modeling that. So why not just model the subset of the data with split.variable = 0?
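A minimal sketch of that alternative, under assumptions: the simulated data frame `d`, the predictor `x` and the wrapper `predict_target()` are hypothetical names introduced only for illustration. The tree is fitted on the split.variable = 0 subset, and the known rule is applied to everything else.

```r
library(partykit)

# Hypothetical simulated data: target is always 1 when split.variable is 1
set.seed(1)
n <- 200
d <- data.frame(split.variable = factor(rbinom(n, 1, 0.3)), x = rnorm(n))
d$target <- factor(
  ifelse(d$split.variable == "1", 1, as.integer(d$x > 0)),
  levels = c(0, 1)
)

# Fit the tree only where the outcome is actually uncertain
fit0 <- ctree(target ~ x, data = subset(d, split.variable == "0"))

# Hypothetical wrapper: deterministic rule for split.variable == 1,
# tree prediction for split.variable == 0
predict_target <- function(newdata) {
  out <- factor(rep("1", nrow(newdata)), levels = levels(d$target))
  idx <- newdata$split.variable == "0"
  if (any(idx)) {
    out[idx] <- predict(fit0, newdata = newdata[idx, , drop = FALSE],
      type = "response")
  }
  out
}

pred <- predict_target(d)
table(pred, d$split.variable)
```

This keeps the deterministic part out of the model entirely, so the tree only has to explain the cases where target varies.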
But even if you decide that you want to model it, ctree chooses split.variable as the first split anyway, so none of this manual forcing of the split is necessary.
training_set <- read.csv("training_set.txt")
training_set <- transform(training_set,
  target = factor(target),
  split.variable = factor(split.variable)
)
tr <- ctree(target ~ ., data = training_set)
tr
## Model formula:
## target ~ split.variable + var1 + var2 + var3 + var4 + var5 +
## var6 + var7 + var8 + var9 + var10 + var11 + var12 + var13 +
## var14 + var15 + var16 + var17 + var18 + var19 + var20 + var21 +
## var22 + var23 + var24 + var25 + var26 + var27 + var28 + var29
##
## Fitted party:
## [1] root
## | [2] split.variable in 0
## | | [3] var13 <= -2.489: 1 (n = 28, err = 32.1%)
## | | [4] var13 > -2.489: 0 (n = 54, err = 25.9%)
## | [5] split.variable in 1: 1 (n = 48, err = 0.0%)
##
## Number of inner nodes: 2
## Number of terminal nodes: 3
plot(tr)
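Which variable ctree picks at the root can also be checked programmatically rather than read off the printout. The sketch below uses hypothetical simulated data (`d`, `noise`), not the training set from the question, and it assumes that the split's variable id indexes the columns of the data stored in the fitted object. Whether split.variable wins at the root on real data depends on how strongly the other predictors are associated with target.

```r
library(partykit)

# Hypothetical simulated data: target is always 1 when split.variable is 1,
# and a coin flip otherwise; noise is unrelated to target
set.seed(42)
n <- 300
d <- data.frame(split.variable = factor(rbinom(n, 1, 0.3)), noise = rnorm(n))
d$target <- factor(
  ifelse(d$split.variable == "1", 1, rbinom(n, 1, 0.5)),
  levels = c(0, 1)
)

tr <- ctree(target ~ ., data = d)

# Look up the name of the variable used by the root split, if there is one
root <- node_party(tr)
root_var <- if (is.terminal(root)) {
  NA_character_
} else {
  names(tr$data)[varid_split(split_node(root))]
}
root_var
```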