java, csv, memory, memory-management, apache-commons-csv

Does the Apache Commons CSV framework offer a memory-efficient incremental/sequential mode for reading large files?


The Apache Commons CSV project works quite well for parsing comma-separated values, tab-delimited data, and similar data formats.

My impression is that this tool reads a file entirely, with the resulting line objects kept in memory. But I am not sure; I cannot find any documentation regarding this behavior.

For parsing very large files, I should like to do an incremental read, one line at a time, or perhaps a relatively small number of lines at a time, to avoid exhausting available memory.

With regard only to memory usage, the idea here is like how a SAX parser for XML reads incrementally to minimize use of RAM, versus a DOM-style XML parser that reads a document entirely into memory to provide tree traversal.

Questions:

  • Does Commons CSV offer such a memory-efficient, incremental/sequential reading mode?

Solution

  • My impression is that this tool reads a file entirely, with the resulting line objects kept in memory

    No. The use of memory is governed by how you choose to interact with your CSVParser object.

    The Javadoc for CSVParser addresses this issue explicitly, in its sections Parsing record wise versus Parsing into memory, with a caution:

    Parsing into memory may consume a lot of system resources depending on the input. For example if you're parsing a 150MB file of CSV data the contents will be read completely into memory.

    I took a quick glance at the source code, and indeed parsing record wise seems to be reading from its input source a chunk at a time, not all at once. But see for yourself.

    Parsing record wise

    In its section Parsing record wise, the Javadoc shows how to incrementally read one CSVRecord at a time by iterating over the CSVParser itself, which implements Iterable<CSVRecord>.

    CSVParser parser = CSVParser.parse(csvData, CSVFormat.RFC4180);  // csvData is the CSV input, e.g. a String
    for (CSVRecord csvRecord : parser) {
        // process each record as it is read
    }
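
    For a genuinely large input, that same record-wise loop can be fed from a buffered Reader. Below is a minimal sketch of the approach, assuming a hypothetical file at /tmp/large.csv encoded in UTF-8; the path, class name, and per-record processing are placeholders, not part of the library's documentation.

    import java.io.Reader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVParser;
    import org.apache.commons.csv.CSVRecord;

    public class LargeCsvReader {
        public static void main(String[] args) throws Exception {
            // Hypothetical path; substitute your own large CSV file.
            try (Reader reader = Files.newBufferedReader(
                         Paths.get("/tmp/large.csv"), StandardCharsets.UTF_8);
                 CSVParser parser = new CSVParser(reader, CSVFormat.RFC4180)) {
                // Each iteration pulls the next record from the underlying
                // reader, so only one CSVRecord is held in memory at a time.
                for (CSVRecord record : parser) {
                    System.out.println(record.get(0));  // placeholder processing
                }
            }
        }
    }

    Closing the parser via try-with-resources also closes the underlying reader.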
    

    Parsing into memory

    In contrast, the Parsing into memory section shows the use of CSVParser::getRecords to load all the CSVRecord objects into a List all at once, in memory. So obviously a very large input file could blow out memory on a constrained machine.

    Reader in = new StringReader("a,b\nc,d");  // EXCEL format uses the comma delimiter
    try (CSVParser parser = new CSVParser(in, CSVFormat.EXCEL)) {
        List<CSVRecord> list = parser.getRecords();  // loads every record into memory at once
    }