[SOLVED] Do we need to create an index file (with lzop) if compression type is RECORD instead of block?

As I understand it, an index file is needed to make the output splittable. If mapred.output.compression.type=SequenceFile.CompressionType.RECORD, do we still need to create an index file?

Solution

Short answer:

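The RECORD and BLOCK values of the compression.type property apply to sequence files, not to plain text files (which can be compressed independently with lzo, gzip, bz2, ...). Sequence files are splittable at their sync markers whichever type you pick, so an index file only matters for LZO-compressed text files.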

More info:

LZO is a compression codec that compresses and decompresses faster than gzip, and it can also be made splittable. This is possible because an LZO file is composed of many smaller (~256K) blocks of compressed data, so jobs can be split along block boundaries. A gzip file, by contrast, is a single compressed stream that a reader cannot start decoding partway through.
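To make the block idea concrete, here is a minimal sketch of why independently compressed blocks allow a reader to start in the middle of a file. It is not LZO: it uses zlib from the Python standard library as a stand-in codec, and the function names (compress_in_blocks, read_block) are made up for illustration.

```python
import zlib

# Uncompressed bytes per block, mirroring the ~256K figure mentioned above.
BLOCK_SIZE = 256 * 1024


def compress_in_blocks(data, block_size=BLOCK_SIZE):
    """Compress data as independent blocks; return (payload, offsets)."""
    payload = bytearray()
    offsets = []  # byte offset in payload where each compressed block starts
    for start in range(0, len(data), block_size):
        offsets.append(len(payload))
        # Each block is compressed on its own, so it carries no
        # dependency on the blocks that came before it.
        payload += zlib.compress(data[start:start + block_size])
    return bytes(payload), offsets


def read_block(payload, offsets, i):
    """Decompress block i alone, without reading any earlier block."""
    end = offsets[i + 1] if i + 1 < len(offsets) else len(payload)
    return zlib.decompress(payload[offsets[i]:end])
```

A reader that knows the offsets can jump straight to block 2 and decode it. Without the offsets it has no way to find where a block starts, which is the gap an index file fills.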

When you specify mapred.output.compression.codec as LzoCodec, Hadoop generates .lzo_deflate files. These contain the raw compressed data without any header and cannot be decompressed with the lzop -d command. Hadoop can read these files in the map phase, but you cannot inspect them with the standard lzop tool, which makes your life hard.

When you specify LzopCodec as the compression codec, Hadoop generates .lzo files. These contain the header and can be decompressed using lzop -d.
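If you are unsure which kind of file you have, one quick check is to look for the lzop header at the start of the file. The sketch below assumes the 9-byte magic sequence used by the lzop tool; the helper name has_lzop_header is hypothetical.

```python
# Magic bytes that the lzop tool writes at the start of its files
# (assumed here: 0x89 'L' 'Z' 'O' 0x00 0x0d 0x0a 0x1a 0x0a).
LZOP_MAGIC = b"\x89LZO\x00\r\n\x1a\n"


def has_lzop_header(first_bytes):
    """Return True if the given leading bytes begin with the lzop magic."""
    return first_bytes.startswith(LZOP_MAGIC)
```

For a local copy of a file, you could call it as has_lzop_header(open(path, "rb").read(len(LZOP_MAGIC))). A headerless .lzo_deflate file would be expected to return False.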

However, neither .lzo nor .lzo_deflate files are splittable by default. This is where LzoIndexer comes into play. It generates an index file that records where each compressed block begins. With that index, multiple map tasks can process different parts of the same file.
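As a rough illustration of how an index turns one file into several splits, the sketch below groups a list of block start offsets into byte ranges of roughly a target size. This is an assumption-laden toy, not the logic Hadoop or LzoIndexer actually uses, and the name splits_from_index and its parameters are invented for the example.

```python
def splits_from_index(block_offsets, file_length, target_split_size):
    """Group consecutive blocks into (start, end) byte ranges.

    block_offsets: sorted byte offsets where each compressed block begins
                   (the kind of information an index file would hold).
    file_length: total length of the compressed file in bytes.
    target_split_size: minimum number of bytes to put in a split before
                       starting a new one.
    """
    splits = []
    if not block_offsets:
        return splits
    start = block_offsets[0]
    for offset in block_offsets[1:]:
        # Only cut at a block boundary, never in the middle of a block.
        if offset - start >= target_split_size:
            splits.append((start, offset))
            start = offset
    # The last split runs to the end of the file.
    splits.append((start, file_length))
    return splits
```

Every split starts exactly at a block boundary, so each map task can begin decompressing at its own start offset without reading the bytes that belong to another task.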

See this Cloudera blog post and LzoIndexer for more info.