Warning to those who read this later ...
My background in tree and graph algorithms meant that I expected this to be a simple question with a simple answer. It is not.
In the context I come from, this would work through functions such as left and right that return the left and right child of a given node, so that root(fit) is the root node and left(root(fit)) is the left child of the root. Other functions, such as split(node), would give information about the decision made at that node.
However, the real answer is that almost no one who uses rpart, including the developer, thinks of a decision tree as a tree in that sense. Most of the answers given involve merely printing the tree out in human-readable form as a diagram or text.
Several people have asked this question, and the best answer is to look at getAnywhere(summary.rpart), which lists the code that produces the text version of the tree. It is quite mucky.
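For anyone who wants to read that code, here is a minimal way to pull it up. It assumes the rpart package is installed. The name is quoted because summary.rpart is an S3 method and may not be visible on the search path; the variable names found and src are just for illustration.

```r
library(rpart)

# Look up every definition of summary.rpart that R can find,
# including methods registered inside the rpart namespace.
found <- utils::getAnywhere("summary.rpart")

found$where              # where the definition(s) were located
src <- found$objs[[1]]   # the function object itself
print(src)               # the code that builds the text summary
```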
The original question follows.
I have a decision tree obtained by ...
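library(rpart)

makeDecisionTree <- function(ndata, fp, fn) {
  # rpart.control() takes minsplit / minbucket / maxdepth. The original call
  # used nobs, mincut and minsize, which are tree::tree.control() names;
  # minbucket and minsplit are assumed here to be the closest equivalents.
  ctrl <- rpart.control(minsplit = 20, minbucket = 2, maxdepth = 20)
  # minsplit = 1 was also passed directly to rpart(); it is dropped here
  # so that ctrl is the single source of control settings.
  mytree <- rpart(formula = PredictedLabel ~ .,
                  data = ndata,
                  method = "class",
                  control = ctrl,
                  parms = list(split = "information",
                               # 2x2 loss matrix, filled row by row
                               loss = matrix(c(0, fp, fn, 0),
                                             byrow = TRUE, nrow = 2)))
  return(mytree)
}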
I can print it out and get a listing of the decisions such as x>6, and I can understand what the tree is doing and so on. But, what I cannot see is how to work directly with the tree structure - in the sense of a recursive descent of the tree from the root node, under program control.
I have got to the point where I am seriously considering printing the tree to a text file and parsing the resulting file just to get the actual tree structure. As this seems somewhat absurd, I assume that I am missing something.
I have looked at the structure of the types, and looked at the splits matrix and so on. But, it is not clear to me how any of this produces a tree structure.
This is only a partial answer, but it was the core of the clue I needed to keep going. And I put it here in case someone else finds this question when confronted with the same problem.
Suppose that you create a fitted tree by ...
fit = rpart( ... )
Then fit$frame is a data frame whose rows describe the nodes. The name of each row is the node number. The children of node n are 2n and 2n+1. The columns include var, which gives the name of the field used in the split that the node represents. For a leaf node, var is <leaf>.
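The numbering rule can be wrapped in a few small helpers. This is only a sketch: the names left_child, right_child, parent_node, node_depth and has_node are my own inventions, not part of rpart.

```r
# Hypothetical helpers built only on the 2n / 2n+1 numbering rule.
left_child  <- function(n) 2 * n
right_child <- function(n) 2 * n + 1
parent_node <- function(n) n %/% 2   # returns 0 for the root, which has no parent

# Depth of a node, counting the root (node 1) as depth 0.
node_depth <- function(n) {
  d <- 0
  while (n > 1) {
    n <- n %/% 2
    d <- d + 1
  }
  d
}

# Node numbers are stored as row names, so look a node up by its number as text.
has_node <- function(frame, n) as.character(n) %in% rownames(frame)
```

For instance, parent_node(61) is 30 and node_depth(121) is 6. A child number computed this way only refers to a real node if has_node(fit$frame, n) is TRUE.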
In the example below, 1 is the root node, with child nodes 2 = 2*1+0 and 3 = 2*1+1. The top node was split on MOB.
          var      n     wt    dev yval  complexity ncompete
1         MOB 121841 121841 295428    1 0.249854448        4
2      <leaf>  30302  30302   5514    1 0.000000000        0
3   MONTHS_TO  91539  91539 216100    2 0.205237824        4
6   MONTHS_TO  26002  26002  37842    1 0.092878806        4
12     <leaf>  18270  18270   1788    1 0.010000000        0
13  MONTHS_TO   7732   7732   8615    2 0.017249550        4
26     <leaf>   1622   1622   1644    1 0.010000000        0
27     <leaf>   6110   6110   1875    2 0.010000000        0
7   STAT_CD_1  65537  65537 117625    2 0.048749611        4
14     <leaf>   4724   4724   5028    1 0.000000000        0
15      DISCH  60813  60813  98195    2 0.019574990        4
30     SENDER  11248  11248  27522    1 0.014570386        4
60       PORT  10344  10344  22770    1 0.014570386        4
120    <leaf>   6878   6878  10908    1 0.005544498        0
121    <leaf>   3466   3466   7445    2 0.003743721        0
61     <leaf>    904    904    560    2 0.010000000        0
31     <leaf>  49565  49565  64890    2 0.000000000        0
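With that numbering, a recursive descent from the root under program control can be sketched as follows. The frame here is rebuilt by hand from the node numbers and the var column of the listing above, so the example runs without the original data; with a real fit you would pass fit$frame instead. walk is a hypothetical helper of my own, not an rpart function.

```r
# Node numbers and split variables copied from the listing above.
nodes <- c(1, 2, 3, 6, 12, 13, 26, 27, 7, 14, 15, 30, 60, 120, 121, 61, 31)
vars  <- c("MOB", "<leaf>", "MONTHS_TO", "MONTHS_TO", "<leaf>", "MONTHS_TO",
           "<leaf>", "<leaf>", "STAT_CD_1", "<leaf>", "DISCH", "SENDER",
           "PORT", "<leaf>", "<leaf>", "<leaf>", "<leaf>")
frame <- data.frame(var = vars, row.names = as.character(nodes),
                    stringsAsFactors = FALSE)

# Recursive pre-order descent: visit the node, then child 2n, then child 2n+1.
# Returns one text line per node, indented by depth.
walk <- function(frame, node = 1, depth = 0) {
  # as.character() also copes with var being a factor, as in a real fit$frame
  var  <- as.character(frame[as.character(node), "var"])
  line <- paste0(strrep("  ", depth), node, ": ", var)
  if (var == "<leaf>") {
    return(line)                        # leaves have no children
  }
  c(line,
    walk(frame, 2 * node,     depth + 1),   # left child
    walk(frame, 2 * node + 1, depth + 1))   # right child
}

out <- walk(frame)
cat(out, sep = "\n")
```

The visiting order comes out the same as the row order of the listing, which suggests the frame is stored in pre-order. Replacing the paste0 line with whatever work is needed per node turns this into a general traversal.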
For a more complete description, please see
https://henckr.github.io/distRforest/reference/rpart.object.html
But you will also need this