hadoop hive column-oriented

Sequence order in the column-oriented formats chapter of Hadoop: The Definitive Guide?


On page 137 of Hadoop: The Definitive Guide, 4th edition, the book discusses column-oriented file formats and shows the picture below.

[image: the book's figure comparing row-oriented and column-oriented storage layouts]

In the RCFile layout, why is the order of the numbers 1,4,2,5,3,6,7,10,8,11,9,12 rather than 1,4,7,10,2,5,8,11,3,6,9,12?


Solution

  • First of all, RCFile is not a purely columnar format; it is a Record Columnar File. Both RCFile and ORC are splittable. This means you do not have to read the whole file to get only a few rows, and the file can be read in parallel by many containers. This is why splits are needed.

    Each split contains a group of rows and can be read independently of the other splits, and within a split the values are grouped by column. Similar data compresses better, so storing each column's values together improves compression. In your example each split contains only two rows, which is why the order is 1,4,2,5,3,6 (first split) followed by 7,10,8,11,9,12 (second split). In practice a split can contain 10000 or more rows.

    What the official documentation says about RC file:

    Also read about ORC. ORC keeps indexes with min/max statistics, so stripes that cannot match a query's filter can be skipped at the lowest level of the reader. This feature is called predicate pushdown.
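To make the ordering from the question concrete, here is a minimal Python sketch. It assumes the table in the figure has four rows and three columns, numbered 1 to 12 in row order. The function names `row_major` and `record_columnar` are made up for illustration. The sketch models only the order in which values are laid out, not the real RCFile on-disk format (headers, sync markers, compression).

```python
def row_major(table):
    """Row-oriented layout: every row is stored whole, one after another."""
    return [value for row in table for value in row]


def record_columnar(table, rows_per_split):
    """RCFile-style ordering: cut the table into row splits, then store
    each split column by column (illustrative only)."""
    layout = []
    for start in range(0, len(table), rows_per_split):
        split = table[start:start + rows_per_split]
        for column in zip(*split):  # transpose the split: rows -> columns
            layout.extend(column)
    return layout


# Assumed logical table: 4 rows x 3 columns, numbered in row order.
table = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12],
]

print(row_major(table))                          # row-oriented order
print(record_columnar(table, rows_per_split=2))  # two rows per split
print(record_columnar(table, rows_per_split=4))  # one split for all rows
```

With two rows per split, the sketch yields 1,4,2,5,3,6,7,10,8,11,9,12, the order shown in the book. The order 1,4,7,10,2,5,8,11,3,6,9,12 only appears when all four rows fall into a single split.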