r rpart

How to extract the tree structure from an rpart object?


Warning to those who read this later ...

My background in tree and graph algorithms led me to expect this to be a simple question with a simple answer. It is not.

In the context I come from, this would work through functions such as left and right that return the left and right child of the current node, so that root(fit) is the root node and left(root(fit)) is the left child of the root node. Other functions, such as split(node), would give the information about the decision made at that node.

However, the real answer is that almost no one who uses rpart, including the developer, thinks of a decision tree as a tree in that sense. Most of the answers given merely print the tree in human-readable form, as a diagram or as text.

Several people have asked this question, and the best answer is to look at getAnywhere(summary.rpart), which lists the code that produces the text version of the tree. It is quite mucky.

The original question follows.


I have a decision tree obtained by ...

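library(rpart)

makeDecisionTree <- function(ndata, fp, fn) {

   # nobs, mincut and minsize are arguments of tree::tree.control();
   # rpart.control() silently ignores them. minbucket and minsplit are
   # assumed here to be the closest rpart equivalents.
   ctrl <- rpart.control(minsplit = 20, minbucket = 2, maxdepth = 20)

   # minsplit is set in ctrl only: rpart() lets `control` override a
   # minsplit passed directly, so the original minsplit=1 had no effect.
   mytree <- rpart( formula = PredictedLabel ~ . ,
                    data=ndata,
                    method="class",
                    control = ctrl,
                    parms=list(split="information",
                               loss=matrix(c(0, fp, fn, 0),
                                           byrow=TRUE, nrow=2))   )
   return(mytree)
}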

I can print it out and get a listing of the decisions such as x>6, and I can understand what the tree is doing and so on. What I cannot see is how to work directly with the tree structure, in the sense of a recursive descent of the tree from the root node, under program control.

I have got to the point where I am seriously considering printing the tree to a text file and parsing the resulting file just to get the actual tree structure. As this seems somewhat absurd, I assume that I am missing something.

I have looked at the structure of the types, and looked at the splits matrix and so on. But, it is not clear to me how any of this produces a tree structure.


Solution

  • This is only a partial answer, but it was the clue I needed to keep going. I put it here in case someone else finds this question when confronted with the same problem.

    Suppose that you create a fitted tree by ...

    fit = rpart( ... )
    

    Then fit$frame is a data frame whose rows describe the nodes. The name of each row is its node number, and the children of node n are nodes 2n and 2n+1. The columns include var, which gives the name of the variable used in the split at that node. For a leaf node, var is <leaf>.

    In the example below, 1 is the root node, with child nodes 2=2*1+0 and 3=2*1+1. The top node was split on MOB.

                   var      n     wt    dev yval  complexity ncompete
    1              MOB 121841 121841 295428    1 0.249854448        4
    2           <leaf>  30302  30302   5514    1 0.000000000        0
    3        MONTHS_TO  91539  91539 216100    2 0.205237824        4
    6        MONTHS_TO  26002  26002  37842    1 0.092878806        4
    12          <leaf>  18270  18270   1788    1 0.010000000        0
    13       MONTHS_TO   7732   7732   8615    2 0.017249550        4
    26          <leaf>   1622   1622   1644    1 0.010000000        0
    27          <leaf>   6110   6110   1875    2 0.010000000        0
    7        STAT_CD_1  65537  65537 117625    2 0.048749611        4
    14          <leaf>   4724   4724   5028    1 0.000000000        0
    15           DISCH  60813  60813  98195    2 0.019574990        4
    30          SENDER  11248  11248  27522    1 0.014570386        4
    60            PORT  10344  10344  22770    1 0.014570386        4
    120         <leaf>   6878   6878  10908    1 0.005544498        0
    121         <leaf>   3466   3466   7445    2 0.003743721        0
    61          <leaf>    904    904    560    2 0.010000000        0
    31          <leaf>  49565  49565  64890    2 0.000000000        0
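
    The numbering rule can be checked against the table above with a little arithmetic. This is only a sketch in base R: the helper names (parent_of, children_of, depth_of, has_no_children) are made up for illustration and are not part of rpart, and the node numbers are copied by hand from the row names of the example frame.

    ```r
    # Node numbers copied from the row names of the example frame above.
    nodes <- c(1, 2, 3, 6, 12, 13, 26, 27, 7, 14, 15, 30, 60, 120, 121, 61, 31)

    # Hypothetical helpers built only on the 2n / 2n+1 numbering rule.
    parent_of   <- function(n) if (n == 1) NA else n %/% 2
    children_of <- function(n) c(left = 2 * n, right = 2 * n + 1)
    depth_of    <- function(n) floor(log2(n))   # the root has depth 0

    # A node is a leaf when neither child number appears in the frame.
    has_no_children <- function(n, nodes) !any(children_of(n) %in% nodes)

    # Leaves recovered from the numbering alone; they should match the
    # rows whose var column is <leaf>.
    leaves <- nodes[vapply(nodes, has_no_children, logical(1), nodes = nodes)]
    print(leaves)
    ```

    Working from the numbers alone is enough to rebuild the shape of the tree; the var column then says what each internal node splits on.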
    

    For a more complete description, please see

    https://henckr.github.io/distRforest/reference/rpart.object.html

    But you will also need this

    R: Extracting Rules from a Decision Tree
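
Putting the pieces of this answer together, the root/left/right style of access asked for in the question can be sketched on top of fit$frame. This is a minimal sketch, not an rpart API: every helper below (node_ids, frame_row, root, left, right, split_var, is_leaf, as_nested, visit_order) is a hypothetical name, and the fit uses the kyphosis data that ships with rpart purely as a stand-in. It only reports which variable each node splits on; the thresholds and split directions live in fit$splits and are not decoded here.

```r
library(rpart)

# Assumption: a small example fit on the kyphosis data shipped with rpart.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Node numbers are the row names of fit$frame.
node_ids <- function(fit) as.numeric(rownames(fit$frame))

# Row of fit$frame describing `node`, or NA if the node does not exist.
frame_row <- function(fit, node) match(node, node_ids(fit))

# Hypothetical navigation helpers (not part of rpart).
root  <- function(fit) 1
left  <- function(fit, node) 2 * node
right <- function(fit, node) 2 * node + 1
split_var <- function(fit, node) as.character(fit$frame$var[frame_row(fit, node)])
is_leaf   <- function(fit, node) split_var(fit, node) == "<leaf>"

# Recursive descent from the root, building a nested list that mirrors the tree.
as_nested <- function(fit, node = root(fit)) {
  row <- frame_row(fit, node)
  if (is_leaf(fit, node)) {
    return(list(node = node, n = fit$frame$n[row], yval = fit$frame$yval[row]))
  }
  list(node = node, n = fit$frame$n[row], var = split_var(fit, node),
       left  = as_nested(fit, left(fit, node)),
       right = as_nested(fit, right(fit, node)))
}

# Node numbers in the order the descent visits them (parent before children).
visit_order <- function(tree) {
  c(tree$node,
    if (!is.null(tree$left)) c(visit_order(tree$left), visit_order(tree$right)))
}

tree <- as_nested(fit)
str(tree, max.level = 2)
```

Rows are looked up with match() on the numeric node numbers rather than by character row name, because as.character() can render large node numbers in scientific notation and deep trees produce large numbers.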