java r renjin

Convert ArrayList of POJO objects to R dataframe with Renjin


I'm trying to use Renjin to build models from data that exists in a Java program. I have an ArrayList of POJOs in which each attribute is a String, a double, or an int. If I call toString(), the records look like this:

Record{id='uibbd923e5929b43', countryCode='FR', revenue=3.14159, count=1}
Record{id='uicdd967e5942b55', countryCode='GB', revenue=0.07, count=49}
...

I instantiated R, running inside the JVM, like this:

ScriptEngineManager manager = new ScriptEngineManager();
ScriptEngine engine = manager.getEngineByName("Renjin");

... and put the ArrayList of records into R:

engine.put("records", records);

Inside R, the records are stored as a list of <externalptr> objects. It's possible to see a string representation of the values stored inside the pointer, e.g.

engine.eval("print(data.frame(lapply(records, as.character), stringsAsFactors=FALSE))");

However, I really want these stored as a dataframe, with the correct datatypes, instead of a list of external pointers that can be viewed as a string.

How can I convert the list of externalptr objects to a dataframe?

Update:

This is my lame workaround, at least for now. Write the data to a CSV:

CSVWriter writer = new CSVWriter(new FileWriter("tmp/output.csv"), '\t');
writer.writeNext(new String[] {"id", "countryCode", "revenue", "count"});

for (Record record : records) {
    // String.valueOf works whether the getters return primitives or wrapper types
    writer.writeNext(new String[] {
            record.getId(),
            record.getCountryCode(),
            String.valueOf(record.getRevenue()),
            String.valueOf(record.getCount())});
}

writer.close();

Then have Renjin read the CSV into a dataframe:

engine.eval("df <- read.table(\"tmp/output.csv\", header = TRUE)");
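If pulling in a CSV library just for this round trip feels heavy, the same tab-separated text can be produced with the standard library alone. The following is only a minimal sketch under stated assumptions: the `Record` class is a hypothetical stand-in with the four fields shown in the toString() output above, `TsvExport` and `toTsv` are made-up names, and no field contains a tab, quote, or newline (nothing is escaped).

```java
import java.util.List;

// Hypothetical stand-in for the Record POJO described in the question.
final class Record {
    private final String id;
    private final String countryCode;
    private final double revenue;
    private final int count;

    Record(String id, String countryCode, double revenue, int count) {
        this.id = id;
        this.countryCode = countryCode;
        this.revenue = revenue;
        this.count = count;
    }

    String getId() { return id; }
    String getCountryCode() { return countryCode; }
    double getRevenue() { return revenue; }
    int getCount() { return count; }
}

// Hypothetical helper: renders records as tab-separated text with a header row.
final class TsvExport {
    static String toTsv(List<Record> records) {
        StringBuilder out = new StringBuilder("id\tcountryCode\trevenue\tcount\n");
        for (Record r : records) {
            // Assumes no field contains a tab, quote, or newline; no escaping is done.
            out.append(r.getId()).append('\t')
               .append(r.getCountryCode()).append('\t')
               .append(r.getRevenue()).append('\t')
               .append(r.getCount()).append('\n');
        }
        return out.toString();
    }
}
```

The resulting string could then be written with `Files.write(Paths.get("tmp/output.csv"), TsvExport.toTsv(records).getBytes(StandardCharsets.UTF_8))`. On the R side, passing `sep = "\t"` to read.table makes the delimiter explicit rather than relying on the whitespace default.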

Update:

For now, I decided to use Rserve instead because it offers a lot more flexibility. One downside of Rserve (vs Renjin) is that we now need to ensure that R is running and has the necessary packages installed.


Solution

  • This is something that might be useful to put together as a little helper library, but for the moment, you can "manually" construct a data.frame step by step in Java in the following manner:

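    // The vector types, builders and Symbols live in org.renjin.sexp;
    // RowNamesVector is assumed to be in org.renjin.primitives.vector.
    StringArrayVector.Builder id = new StringArrayVector.Builder();
    StringArrayVector.Builder country = new StringArrayVector.Builder();
    DoubleArrayVector.Builder revenue = new DoubleArrayVector.Builder();
    for (Record record : records) {
        id.add(record.getId());
        country.add(record.getCountryCode());
        revenue.add(record.getRevenue());
    }

    // A data.frame is a named list of columns carrying "class" and "row.names" attributes
    ListVector.NamedBuilder myDf = new ListVector.NamedBuilder();
    myDf.setAttribute(Symbols.CLASS, StringVector.valueOf("data.frame"));
    myDf.setAttribute(Symbols.ROW_NAMES, new RowNamesVector(records.size()));
    myDf.add("id", id.build());
    myDf.add("country", country.build());
    myDf.add("revenue", revenue.build());

    // Make the finished data.frame visible to R under the name "df"
    engine.put("df", myDf.build());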
    

    A data.frame object, as you can see from the above, is actually just a list of columns, so it takes a bit of fiddling to turn a collection of Java Beans, which is essentially a row-based format, into a collection of columns.

    It's also important to add the "row.names" attribute which is used by functions like nrow() to get the dimensions of the data.frame object.

    The RowNamesVector above is a specialized implementation of StringVector which computes the row.names "1", "2", "3", etc. on demand, without allocating memory for all of the strings.
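To make the two ideas above concrete without depending on Renjin at all, here is a plain-Java sketch of the row-to-column pivot and of row names computed on demand. It is an illustration only, not Renjin's implementation: `Record` is a hypothetical stand-in for the POJO in the question, and `ColumnPivot`, `toColumns` and `LazyRowNames` are made-up names.

```java
import java.util.AbstractList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the Record POJO described in the question.
final class Record {
    private final String id;
    private final String countryCode;
    private final double revenue;
    private final int count;

    Record(String id, String countryCode, double revenue, int count) {
        this.id = id;
        this.countryCode = countryCode;
        this.revenue = revenue;
        this.count = count;
    }

    String getId() { return id; }
    String getCountryCode() { return countryCode; }
    double getRevenue() { return revenue; }
    int getCount() { return count; }
}

// Row names "1", "2", ... produced on request; only the length is stored.
// This mimics the idea behind RowNamesVector, not its actual code.
final class LazyRowNames extends AbstractList<String> {
    private final int size;

    LazyRowNames(int size) { this.size = size; }

    @Override
    public String get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        return Integer.toString(index + 1); // R row names are 1-based
    }

    @Override
    public int size() { return size; }
}

// Pivots row-oriented beans into one typed array per column.
final class ColumnPivot {
    static Map<String, Object> toColumns(List<Record> records) {
        int n = records.size();
        String[] id = new String[n];
        String[] countryCode = new String[n];
        double[] revenue = new double[n];
        int[] count = new int[n];

        int i = 0;
        for (Record r : records) {
            id[i] = r.getId();
            countryCode[i] = r.getCountryCode();
            revenue[i] = r.getRevenue();
            count[i] = r.getCount();
            i++;
        }

        // LinkedHashMap keeps the columns in insertion order, like a named list.
        Map<String, Object> columns = new LinkedHashMap<>();
        columns.put("id", id);
        columns.put("countryCode", countryCode);
        columns.put("revenue", revenue);
        columns.put("count", count);
        return columns;
    }
}
```

Each array here plays the role of one column builder in the Renjin code, and `new LazyRowNames(records.size())` plays the role of the row.names attribute: asking for element 2 yields "3" without the other strings ever being created.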