I'm trying to use Renjin to build models from data that exists in a Java program. I have an ArrayList of POJOs where each attribute is either a String, a double, or an int. If I call toString(), the records look like this:
Record{id='uibbd923e5929b43', countryCode='FR', revenue=3.14159, count=1}
Record{id='uicdd967e5942b55', countryCode='GB', revenue=0.07, count=49}
...
I instantiated R, running inside the JVM, like this:
ScriptEngineManager manager = new ScriptEngineManager();
ScriptEngine engine = manager.getEngineByName("Renjin");
... and put the ArrayList of records into R:
engine.put("records", records);
Inside R, the records are stored as a list of <externalptr> objects. It's possible to see a string representation of the values stored inside each pointer, e.g.
engine.eval("print(data.frame(lapply(records, as.character), stringsAsFactors=FALSE))");
However, I really want these stored as a data frame with the correct data types, not as a list of external pointers that can only be viewed as strings.
How can I convert the list of externalptr objects to a data frame?
This is my lame workaround, at least for now. Write the data to a tab-delimited file:
CSVWriter writer = new CSVWriter(new FileWriter("tmp/output.csv"), '\t');
writer.writeNext(new String[] {"id", "countryCode", "revenue", "count"});
for (Record record : records){
// String.valueOf works whether the getters return primitives or boxed values
writer.writeNext(new String[]{record.getId(),
record.getCountryCode(),
String.valueOf(record.getRevenue()),
String.valueOf(record.getCount())});
}
writer.close();
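Then have Renjin read the file into a data frame, stating the tab separator explicitly and keeping strings as character vectors:
engine.eval("df <- read.table(\"tmp/output.csv\", header = TRUE, sep = \"\\t\", stringsAsFactors = FALSE)");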
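If pulling in a CSV library just for this workaround is undesirable, the same tab-delimited text can be produced with the standard library alone. The following is a minimal sketch, not a replacement for a real CSV writer: the class name TsvExport and the method toTsv are hypothetical, and it assumes that no value contains a tab, a newline, or a quote character, since it does no escaping.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper: renders a header and rows as tab-delimited text.
final class TsvExport {

    // Assumes no cell contains a tab, newline, or quote (no escaping is done).
    static String toTsv(String[] header, List<Object[]> rows) {
        StringBuilder sb = new StringBuilder(String.join("\t", header)).append('\n');
        for (Object[] row : rows) {
            StringJoiner line = new StringJoiner("\t");
            for (Object cell : row) {
                // String.valueOf handles String, Double, and Integer alike
                line.add(String.valueOf(cell));
            }
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        List<Object[]> rows = Arrays.asList(
                new Object[] {"uibbd923e5929b43", "FR", 3.14159, 1},
                new Object[] {"uicdd967e5942b55", "GB", 0.07, 49});
        String tsv = toTsv(new String[] {"id", "countryCode", "revenue", "count"}, rows);

        // Write to a temporary file; its path could then be handed to read.table
        Path out = Files.createTempFile("records", ".tsv");
        Files.write(out, tsv.getBytes(StandardCharsets.UTF_8));
        System.out.println("Wrote " + out);
    }
}
```

In the real program, each Object[] row would be filled from the getters of one Record.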
For now, I decided to use Rserve instead because it offers a lot more flexibility. One downside of Rserve (vs Renjin) is that we now need to ensure that R is running and has the necessary packages installed.
This is something that might be useful to put together as a little helper library, but for the moment, you can "manually" construct a data.frame step by step in Java in the following manner:
StringArrayVector.Builder id = new StringArrayVector.Builder();
StringArrayVector.Builder country = new StringArrayVector.Builder();
DoubleArrayVector.Builder revenue = new DoubleArrayVector.Builder();
for(Record record : records) {
id.add(record.getId());
country.add(record.getCountryCode());
revenue.add(record.getRevenue());
}
ListVector.NamedBuilder myDf = new ListVector.NamedBuilder();
myDf.setAttribute(Symbols.CLASS, StringVector.valueOf("data.frame"));
myDf.setAttribute(Symbols.ROW_NAMES, new RowNamesVector(records.size()));
myDf.add("id", id.build());
myDf.add("country", country.build());
myDf.add("revenue", revenue.build());
// Build the list and hand it to the R session under the name "df"
engine.put("df", myDf.build());
As you can see from the above, a data.frame object is actually just a list of columns. A collection of Java Beans is essentially a row-based format, so it takes a bit of fiddling to turn it into a collection of columns.
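The row-to-column pivot itself can be sketched in plain Java, independent of any Renjin classes. This is only an illustration under stated assumptions: Row is a hypothetical stand-in for the Record bean from the question, and Columns and pivot are names made up for this example. In practice the resulting arrays would be fed into the Renjin vector builders instead.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of turning row-based beans into column arrays.
final class ColumnPivot {

    // Stand-in for the Record bean in the question (one object per row).
    static final class Row {
        final String id;
        final String countryCode;
        final double revenue;
        final int count;

        Row(String id, String countryCode, double revenue, int count) {
            this.id = id;
            this.countryCode = countryCode;
            this.revenue = revenue;
            this.count = count;
        }
    }

    // One array per column; every array has length equal to the row count.
    static final class Columns {
        final String[] id;
        final String[] countryCode;
        final double[] revenue;
        final int[] count;

        Columns(int rows) {
            id = new String[rows];
            countryCode = new String[rows];
            revenue = new double[rows];
            count = new int[rows];
        }
    }

    // Single pass over the rows, writing each field into its column.
    static Columns pivot(List<Row> rows) {
        Columns cols = new Columns(rows.size());
        int i = 0;
        for (Row r : rows) {
            cols.id[i] = r.id;
            cols.countryCode[i] = r.countryCode;
            cols.revenue[i] = r.revenue;
            cols.count[i] = r.count;
            i++;
        }
        return cols;
    }

    public static void main(String[] args) {
        Columns cols = pivot(Arrays.asList(
                new Row("uibbd923e5929b43", "FR", 3.14159, 1),
                new Row("uicdd967e5942b55", "GB", 0.07, 49)));
        System.out.println(Arrays.toString(cols.countryCode)); // prints [FR, GB]
        System.out.println(Arrays.toString(cols.count));       // prints [1, 49]
    }
}
```

Keeping primitive arrays for the numeric columns preserves the double and int types, which is the point of avoiding the string-only representation.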
It's also important to add the "row.names" attribute which is used by functions like nrow() to get the dimensions of the data.frame object.
The RowNamesVector above is a specialized implementation of StringVector that computes the row names "1", "2", "3", etc. on demand, without allocating memory for all of the strings.
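The idea of computing row names on demand can be illustrated with a small standard-library sketch. This is not Renjin's implementation: LazyRowNames is a hypothetical class that only mimics the described behaviour by deriving each name from its index instead of storing it.

```java
import java.util.AbstractList;
import java.util.List;

// Hypothetical illustration: a read-only list of "1", "2", "3", ...
// that stores only its size and builds each name when it is requested.
final class LazyRowNames extends AbstractList<String> {

    private final int size;

    LazyRowNames(int size) {
        if (size < 0) {
            throw new IllegalArgumentException("size must be non-negative");
        }
        this.size = size;
    }

    @Override
    public String get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        // R row names are 1-based, list indexes are 0-based
        return Integer.toString(index + 1);
    }

    @Override
    public int size() {
        return size;
    }

    public static void main(String[] args) {
        // No million-element String array is allocated here
        List<String> names = new LazyRowNames(1_000_000);
        System.out.println(names.get(0));      // prints 1
        System.out.println(names.get(999_999)); // prints 1000000
        System.out.println(names.size());      // prints 1000000
    }
}
```

The trade-off is a small cost per lookup in exchange for constant memory, which matters for data frames with many rows.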