Is anyone familiar with how to figure out what's going on inside a gbm model in R?
Let's say we want to predict Petal.Length in iris. To keep it simple, I ran:
library(gbm)
tg <- gbm(Petal.Length ~ ., data = iris)
This works, and when you run:
summary(tg)
you get:
Hit <Return> to see next plot:
                      var rel.inf
Petal.Width   Petal.Width   67.39
Species           Species   32.61
Sepal.Length Sepal.Length    0.00
Sepal.Width   Sepal.Width    0.00
This makes sense intuitively. When you run pretty.gbm.tree(tg), you get:
  SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight    Prediction
0        2  0.8000000000        1         2           3       184.6764     75  0.0001366667
1       -1 -0.0022989091       -1        -1          -1         0.0000     22 -0.0022989091
2       -1  0.0011476604       -1        -1          -1         0.0000     53  0.0011476604
3       -1  0.0001366667       -1        -1          -1         0.0000     75  0.0001366667
So clearly gbm thinks that you split by variable #2 and get back three separate regressions. I assume that SplitVar==2 is Petal.Width, going by the order you see in str(iris).
But what data is missing? iris has no missing data. And then how do we see what is going on in each of the three nodes that were created?
Let's say I wanted to code this up in C++ for production. I don't see how one would know what to code, beyond knowing that you should do something different depending on whether Petal.Width > 0.8.
Thanks,
Josh
See the function gbm2sas in the package mlmeta, which uses metaprogramming to convert the R object to SAS format.
The SAS format is similar to C++, so it is both easy to read and easy to adapt to C++.
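If you would rather hand-code it, the table printed by pretty.gbm.tree(tg) in the question is enough to evaluate one tree. Below is a minimal C++ sketch of that idea, not gbm's own code. The names Node, kTree, eval_tree and predict are made up for illustration. It relies on a few assumptions about the table. SplitVar is taken to be a 0-based index into the predictors (tg$var.names), with -1 marking a terminal node. SplitCodePred is read as the threshold on a split row and as the node's value on a terminal row. A row goes to LeftNode when its value is below the threshold, to RightNode otherwise, and to MissingNode when the value is NA. That last rule is why a MissingNode row shows up even though iris has no missing data. Only continuous splits are handled. A split on a factor such as Species stores an index into tg$c.splits instead of a threshold and would need extra code.

```cpp
#include <cmath>
#include <vector>

// One row of the pretty.gbm.tree() table (hypothetical struct for this sketch).
struct Node {
    int split_var;           // 0-based predictor index; -1 means terminal node
    double split_code_pred;  // threshold if split node, prediction if terminal
    int left, right, missing;  // row indices of the child nodes
};

// The first tree from the question, copied row by row.
static const std::vector<Node> kTree = {
    { 2,  0.8000000000,  1,  2,  3},
    {-1, -0.0022989091, -1, -1, -1},
    {-1,  0.0011476604, -1, -1, -1},
    {-1,  0.0001366667, -1, -1, -1},
};

// Walk one tree for a row of predictor values x (NaN stands in for NA).
// Assumes continuous splits only: left if x < threshold, right otherwise.
double eval_tree(const std::vector<Node>& tree, const std::vector<double>& x) {
    int i = 0;
    while (tree[i].split_var != -1) {
        const Node& n = tree[i];
        const double v = x[n.split_var];
        if (std::isnan(v)) {
            i = n.missing;
        } else if (v < n.split_code_pred) {
            i = n.left;
        } else {
            i = n.right;
        }
    }
    return tree[i].split_code_pred;
}

// Assumed form of the full model: the initial value (tg$initF in R)
// plus the terminal-node value contributed by every tree.
double predict(double init_f,
               const std::vector<std::vector<Node>>& trees,
               const std::vector<double>& x) {
    double f = init_f;
    for (const auto& tree : trees) {
        f += eval_tree(tree, x);
    }
    return f;
}
```

With x ordered as Sepal.Length, Sepal.Width, Petal.Width, Species, eval_tree(kTree, x) returns the value of row 1 when Petal.Width is below 0.8 and the value of row 2 otherwise. To cover the whole model you would dump every tree with pretty.gbm.tree(tg, i.tree = k) for k from 1 to tg$n.trees and pass them all to predict, then compare the result against predict(tg, ...) in R before trusting it in production.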