The Apache Commons CSV project works quite well for parsing comma-separated values, tab-delimited data, and similar formats.
My impression is that this tool reads a file entirely, with the resulting line objects kept in memory. But I am not sure; I cannot find any documentation of this behavior.
For parsing very large files, I would like to do an incremental read, one line at a time or perhaps a relatively small number of lines at a time, to avoid exhausting memory.
With regard only to the aspect of memory usage, the idea here is like how a SAX parser for XML reads incrementally to minimize use of RAM versus a DOM style XML parser that reads a document entirely into memory to provide tree-traversal.
Questions:
My impression is that this tool reads a file entirely with the resulting line objects kept in memory
No. The use of memory is governed by how you choose to interact with your CSVParser object.
The Javadoc for CSVParser addresses this issue explicitly, in its sections Parsing record wise versus Parsing into memory, with a caution:
Parsing into memory may consume a lot of system resources depending on the input. For example if you're parsing a 150MB file of CSV data the contents will be read completely into memory.
I took a quick glance at the source code, and indeed parsing record wise seems to be reading from its input source a chunk at a time, not all at once. But see for yourself.
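To make the chunk-at-a-time idea concrete, here is a minimal sketch using only the JDK. It is not Commons CSV code and does not reflect how that library is implemented; the class name LazyRecords and its methods are hypothetical, and the comma splitting is deliberately naive (no quoting or escaping). It only illustrates the difference between pulling one record per iteration and draining everything into a list.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration only: NOT Commons CSV, and quoted fields are not handled.
class LazyRecords implements Iterable<String[]> {
    private final BufferedReader reader; // fills a fixed-size buffer, never the whole input
    private int linesRead = 0;           // lines pulled from the reader so far

    LazyRecords(Reader source) {
        this.reader = new BufferedReader(source);
    }

    int linesRead() {
        return linesRead;
    }

    // Record-wise access: each step pulls at most one more line from the source.
    // All iterators share the same underlying reader, so this is single-pass.
    @Override
    public Iterator<String[]> iterator() {
        return new Iterator<String[]>() {
            private String pending = null; // line fetched ahead but not yet handed out

            @Override
            public boolean hasNext() {
                if (pending == null) {
                    try {
                        pending = reader.readLine();
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                    if (pending != null) {
                        linesRead++;
                    }
                }
                return pending != null;
            }

            @Override
            public String[] next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String line = pending;
                pending = null;
                return line.split(",", -1); // naive split, keeps trailing empty fields
            }
        };
    }

    // Into-memory access: drains the remaining input, so every record is held at once.
    List<String[]> readAll() {
        List<String[]> all = new ArrayList<>();
        for (String[] fields : this) {
            all.add(fields);
        }
        return all;
    }

    public static void main(String[] args) {
        LazyRecords records = new LazyRecords(new StringReader("a,b\nc,d\ne,f"));
        String[] first = records.iterator().next();
        // prints "a b after 1 line(s) read"
        System.out.println(first[0] + " " + first[1] + " after " + records.linesRead() + " line(s) read");
    }
}
```

With the iterator, memory use is bounded by the reader's buffer plus the current record. Calling readAll() instead makes memory grow with the size of the input, which is the trade-off the Javadoc caution describes.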
In the section Parsing record wise, it shows how to incrementally read one CSVRecord at a time by looping over the Iterable that is CSVParser.
CSVParser parser = CSVParser.parse(csvData, CSVFormat.RFC4180);
for (CSVRecord csvRecord : parser) {
...
}
In contrast, the Parsing into memory section shows the use of CSVParser::getRecords to load all the CSVRecord objects into a List all at once, in memory. So obviously a very large input file could blow out memory on a constrained machine.
Reader in = new StringReader("a;b\nc;d");
CSVParser parser = new CSVParser(in, CSVFormat.EXCEL);
List<CSVRecord> list = parser.getRecords();