hadoop, column-oriented

Why are column-oriented file formats not well suited to streaming writes?


Hadoop: The Definitive Guide (4th edition) has a paragraph on page 137:

Column-oriented formats need more memory for reading and writing, since they have to buffer a row split in memory, rather than just a single row. Also, it’s not usually possible to control when writes occur (via flush or sync operations), so column-oriented formats are not suited to streaming writes, as the current file cannot be recovered if the writer process fails. On the other hand, row-oriented formats like sequence files and Avro datafiles can be read up to the last sync point after a writer failure. It is for this reason that Flume (see Chapter 14) uses row-oriented formats.
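For the row-oriented side, this is the kind of control the book is describing. Below is a minimal sketch, assuming the standard Avro Java API (org.apache.avro.file.DataFileWriter); the schema and file name are made up for illustration. Each call to sync() ends the current block and emits a synchronization marker, so a reader can recover everything written before that marker even if the writer process later crashes.

    // Row-oriented streaming write with explicit sync points (Avro data file).
    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class RowOrientedStreamingWrite {
        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Event\","
                + "\"fields\":[{\"name\":\"msg\",\"type\":\"string\"}]}");

            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, new File("events.avro"));
                for (int i = 0; i < 1000; i++) {
                    GenericRecord rec = new GenericData.Record(schema);
                    rec.put("msg", "event-" + i);
                    writer.append(rec);
                    if (i % 100 == 0) {
                        // Ends the current block and writes a sync marker;
                        // records up to this point survive a writer crash.
                        writer.sync();
                    }
                }
            }
        }
    }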

I don't understand why the current block cannot be recovered in the case of a failure. Can someone explain the technical difficulty behind this statement:

we cannot control when writes occur (via flush or sync operations)


Solution

  • I don't understand why the current block cannot be recovered in the case of a failure.

    Simply because there is no block to recover. The explanation is quite clear: columnar formats (ORC, Parquet, etc.) decide for themselves when to flush. If there was no flush, there is no 'block'. Since Flume cannot control when the columnar memory buffers get written out to storage, it cannot rely on such formats.
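
    For contrast, here is a minimal sketch of a columnar write path, assuming the Parquet Avro bindings (AvroParquetWriter / ParquetWriter); the schema and file name are again made up. write() only fills in-memory column buffers, and the writer exposes no flush() or sync() to the caller; the buffered row group and the file footer are only materialized on close(), so a crash before that leaves nothing recoverable in the file.

        // Column-oriented write: records are buffered per row group in memory,
        // with no caller-controlled flush or sync points.
        import org.apache.avro.Schema;
        import org.apache.avro.generic.GenericData;
        import org.apache.avro.generic.GenericRecord;
        import org.apache.hadoop.fs.Path;
        import org.apache.parquet.avro.AvroParquetWriter;
        import org.apache.parquet.hadoop.ParquetWriter;

        public class ColumnOrientedWrite {
            public static void main(String[] args) throws Exception {
                Schema schema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"Event\","
                    + "\"fields\":[{\"name\":\"msg\",\"type\":\"string\"}]}");

                try (ParquetWriter<GenericRecord> writer =
                         AvroParquetWriter.<GenericRecord>builder(new Path("events.parquet"))
                             .withSchema(schema)
                             .build()) {
                    for (int i = 0; i < 1000; i++) {
                        GenericRecord rec = new GenericData.Record(schema);
                        rec.put("msg", "event-" + i);
                        // Only adds the record to in-memory column buffers; the
                        // writer decides internally when a row group is flushed.
                        writer.write(rec);
                    }
                } // data and footer are only fully written out at close()
            }
        }

    If the process dies inside the loop, the partially written Parquet file has no footer and the buffered row group was never flushed, which is exactly why Flume sticks to row-oriented formats with recoverable sync points.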