hadoop snappy

Is Snappy splittable or not splittable?


According to this Cloudera post, Snappy IS splittable.

For MapReduce, if you need your compressed data to be splittable, BZip2, LZO, and Snappy formats are splittable, but GZip is not. Splittability is not relevant to HBase data.

But according to Hadoop: The Definitive Guide, Snappy is NOT splittable.

There is also some conflicting information on the web. Some sources say it's splittable, some say it's not.


Solution

  • Both are correct, but at different levels.

    According to the Cloudera blog post http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/:

    One thing to note is that Snappy is intended to be used with a container format, like Sequence Files or Avro Data Files, rather than being used directly on plain text, for example, since the latter is not splittable and can’t be processed in parallel using MapReduce. This is different to LZO, where it is possible to index LZO compressed files to determine split points so that LZO files can be processed efficiently in subsequent processing.

    This means that if a whole text file is compressed with Snappy then the file is NOT splittable. But if each record inside the file is compressed with Snappy then the file could be splittable, for example in Sequence files with block compression.

    To be clear, this layout:

    <START-FILE>
      <START-SNAPPY-BLOCK>
         FULL CONTENT
      <END-SNAPPY-BLOCK>
    <END-FILE>
    

    is not the same as this one:

    <START-FILE>
      <START-SNAPPY-BLOCK1>
         RECORD1
      <END-SNAPPY-BLOCK1>
      <START-SNAPPY-BLOCK2>
         RECORD2
      <END-SNAPPY-BLOCK2>
      <START-SNAPPY-BLOCK3>
         RECORD3
      <END-SNAPPY-BLOCK3>
    <END-FILE>
    

    Snappy blocks are NOT splittable, but files made up of independent Snappy blocks are splittable.
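
As a rough illustration of the second layout, here is a minimal Python sketch of a container in which every record is compressed on its own. It is not Hadoop code: `zlib` stands in for Snappy (which is not in the standard library), and the 4-byte length prefix plus the `offsets` list are a made-up framing for this example only. `write_container` and `read_from` are hypothetical helper names. Real SequenceFiles use sync markers rather than an offset index, but the idea is the same: a reader can start at any block boundary without touching the bytes before it.

```python
import struct
import zlib  # stand-in for Snappy; the principle does not depend on the codec


def write_container(records):
    """Compress each record independently and prefix it with a 4-byte length."""
    out = bytearray()
    offsets = []  # byte offset where each compressed block starts
    for rec in records:
        block = zlib.compress(rec)
        offsets.append(len(out))
        out += struct.pack(">I", len(block)) + block
    return bytes(out), offsets


def read_from(data, offset):
    """Decode all blocks from a block boundary onward, ignoring earlier bytes."""
    records = []
    while offset < len(data):
        (size,) = struct.unpack_from(">I", data, offset)
        offset += 4
        records.append(zlib.decompress(data[offset:offset + size]))
        offset += size
    return records


records = [b"RECORD1", b"RECORD2", b"RECORD3"]
data, offsets = write_container(records)

# A second "mapper" can start at block 2 without reading block 1.
print(read_from(data, offsets[1]))
```

If the whole file were one compressed stream instead, there would be no such boundary to start from, which is why a plain Snappy-compressed text file has to be read by a single task.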