I tried several times to apply the pmml function from package pmml to a random forest model ('model.rf') created by package randomForest:
> library(randomForest)
> dim(data)
[1] 32000 76
> model.rf <- randomForest(x=data[,2:76],y=data[,1],type='regression',ntree=150)
> library(pmml)
> model.rf.pmml<-pmml(model.rf)
Each time, R ran for several hours on my Windows 8 system (i7-4500U, 8 GB RAM) and then crashed.
The model is quite large. The .RData file (containing only the model) is approx. 10 MB on disk, and:
> model.rf$forest$nrnodes
[1] 5819
Is the crash due to insufficient memory? I noticed that the R process occupied virtually all of the available memory before crashing. If so, what kind of system would I need to convert my model to PMML?
Also, from the iris example it seems that the size on disk increases by a factor of ~15, because XML, unlike R data files, is not a compressed format:
> library(randomForest)
> iris.rf <- randomForest(Species ~ ., data=iris, ntree=20)
> save(iris.rf,file='iris.rf.RData')
> iris.rf.pmml<-pmml(iris.rf)
> saveXML(iris.rf.pmml,file='iris.rf.xml')
iris.rf.RData --> 4 KB
iris.rf.xml --> 59 KB
Is this factor constant? Will the PMML version of my model be ~150 MB on disk?
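One way to check empirically whether the factor is constant is to repeat the iris comparison for several forest sizes. The following is only a minimal sketch: the helper name `size_ratio` and the choice of `ntree` values are illustrative assumptions, and a ratio measured on a small classification forest may not carry over to a large regression forest.

```r
library(randomForest)
library(pmml)

set.seed(1)

# Hypothetical helper: fit a forest with `ntree` trees on iris, write it
# out both as .RData and as PMML/XML, and compare the two file sizes.
size_ratio <- function(ntree) {
  rf <- randomForest(Species ~ ., data = iris, ntree = ntree)

  rdata_file <- tempfile(fileext = ".RData")
  xml_file   <- tempfile(fileext = ".xml")

  save(rf, file = rdata_file)
  XML::saveXML(pmml(rf), file = xml_file)

  sizes <- file.info(c(rdata_file, xml_file))$size
  unlink(c(rdata_file, xml_file))  # clean up the temporary files

  data.frame(ntree       = ntree,
             rdata_bytes = sizes[1],
             xml_bytes   = sizes[2],
             ratio       = sizes[2] / sizes[1])
}

# Illustrative forest sizes; small enough to convert in seconds.
res <- do.call(rbind, lapply(c(5, 20, 50), size_ratio))
print(res)
```

If the `ratio` column stays roughly flat as `ntree` grows, multiplying the .RData size by that ratio is a plausible first guess for the size of the XML file.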
Unfortunately, the R pmml package has both memory and speed limitations. When I released the current version, I did not realize how big "big data" could be! I should add that Windows is not very memory-efficient: there have been many models I could not export on a Windows machine but was able to produce, faster and with better use of memory, on a Linux or Mac computer.
I have been making improvements on both fronts for the next release. For now, here are the numbers from one experiment: for an RF model with 500 trees, applied to a dataset with 50 variables and 50,000 rows (~18 MB), creating the PMML model took 5 hours on a Linux machine. The average number of nodes per tree was 4000.
As a general rule of thumb, the memory used to save a pmml object is ~2.5x that of the R object, as you found. The memory used just to save the object as an XML file is a major factor.
With the development version of the package (not yet released), the same conversion took 1 hour 15 minutes instead of 5 hours.
The numbers above are for a Linux machine; I expect them to more than double on a Windows machine. Please consider using a non-Windows machine for the analysis of large datasets. I am sure this applies to most R packages, not just pmml!
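For a very rough sense of scale, the benchmark above can be extrapolated to the model in the question. This is a back-of-the-envelope sketch only: it assumes that conversion time grows linearly with the total number of nodes, which has not been verified, and it uses `nrnodes` (5819) as an upper bound on the node count of each of the 150 trees. All variable names are illustrative.

```r
# Benchmark figures quoted above (Linux machine).
bench_trees          <- 500
bench_nodes_per_tree <- 4000   # average nodes per tree
bench_hours_released <- 5      # current release
bench_hours_dev      <- 1.25   # development version (1 h 15 min)

# Figures from the question; nrnodes is treated as a per-tree upper bound.
asker_trees          <- 150
asker_nodes_per_tree <- 5819

bench_nodes <- bench_trees * bench_nodes_per_tree
asker_nodes <- asker_trees * asker_nodes_per_tree

# ASSUMPTION: time scales linearly with the total node count.
scale <- asker_nodes / bench_nodes

est_hours_released <- scale * bench_hours_released
est_hours_dev      <- scale * bench_hours_dev
speedup            <- bench_hours_released / bench_hours_dev

cat(sprintf("Total nodes (upper bound): %d\n", asker_nodes))
cat(sprintf("Relative size vs. benchmark: %.2f\n", scale))
cat(sprintf("Estimated hours, current release: %.1f\n", est_hours_released))
cat(sprintf("Estimated hours, development version: %.1f\n", est_hours_dev))
cat(sprintf("Speedup of development version: %.0fx\n", speedup))
```

The estimate ignores the different number of predictors (75 versus 50) and any Windows overhead, so it should be read as an order of magnitude at best.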