javaexcelheap-memoryfile-conversionpcf

Write 1 Million record from XLS file to CVS file in Java


I have a scenario in my project where user uploads XLS file with 1Million record and I need to conver this xls file to csv file and then sql server job will process the csv file.

I have a process to convert xls file to csv - but when I deploy this code in PCF it starts throwing heap memory erorr. I am also getting the same error in my local environment.

Code Snapshot:

String inputFileName = "UserInput.xls";

FileInputStream input_document = new FileInputStream(new File(inputFileName));
Workbook my_xls_workbook = StreamingReader.builder().open(input_document);//this line throws out of memory error.
Sheet sheet = my_xls_workbook.getSheetAt(0);

...... remaing code reads the sheet object and perform conversion.


Solution

  • The method you're using to load the file into a Workbook seems to be eager , i.e. it will just read the whole document into memory, parse it, and then return the result if it didn't run out of memory first.

    However, on the README page of the project, they show what you should be doing instead to avoid that:

    import com.monitorjbl.xlsx.StreamingReader;
    
    InputStream is = new FileInputStream(new File("/path/to/workbook.xlsx"));
    Workbook workbook = StreamingReader.builder()
        .rowCacheSize(100)    // number of rows to keep in memory (defaults to 10)
        .bufferSize(4096)     // buffer size to use when reading InputStream to file (defaults to 1024)
        .open(is);            // InputStream or File for XLSX file (required)
    

    After this, iterate over the workbooks, but do not use the get(0) method as that seems to require loading all workbooks into memory at once.

    for (Sheet sheet : workbook){
        System.out.println(sheet.getSheetName());
        for (Row r : sheet) {
            for (Cell c : r) {
                System.out.println(c.getStringCellValue());
            }
        }
    }
    

    Have you tried this? If you did, you should file a bug because clearly it shouldn't use up all memory available as that's the whole point of the library existing.