Tags: java, hadoop, parquet, codec, snappy

Apache dependency bug? org.apache.parquet.hadoop.codec.SnappyCodec was not found error in Apache library


I'm currently trying to read a Parquet file in Java without using Spark. Here's what I have so far, based on Adam Melnyk's blog post on the subject.

Code

        ParquetFileReader reader = ParquetFileReader.open(file);
        MessageType schema = reader.getFooter().getFileMetaData().getSchema();
        List<Type> fields = schema.getFields();
        PageReadStore pages;
-->     while ((pages = reader.readNextRowGroup()) != null) {
            long rows = pages.getRowCount();
            LOG.info("Number of rows: " + rows);
            MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
            RecordReader recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema));

            for (int i = 0; i < rows; i++) {
                SimpleGroup simpleGroup = (SimpleGroup) recordReader.read();
                simpleGroups.add(simpleGroup);
            }
        }

(note that the arrow marks line 167, where the error is thrown in my code)
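For context, a rough sketch of how the reader is opened without Spark (simplified; the path is just a placeholder, and `file`, `LOG`, and `simpleGroups` are declared along these lines):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.parquet.example.data.simple.SimpleGroup;
        import org.apache.parquet.hadoop.util.HadoopInputFile;
        import org.apache.parquet.io.InputFile;

        // Open the Parquet file through parquet-hadoop's HadoopInputFile rather than through Spark.
        Configuration conf = new Configuration();
        InputFile file = HadoopInputFile.fromPath(new Path("/path/to/data.parquet"), conf); // placeholder path
        List<SimpleGroup> simpleGroups = new ArrayList<>();
        // ... the reader loop shown above follows here ...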

Error Message

org.apache.parquet.hadoop.BadConfigurationException: Class org.apache.parquet.hadoop.codec.SnappyCodec was not found
        at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:243)
        at org.apache.parquet.hadoop.CodecFactory$HeapBytesDecompressor.<init>(CodecFactory.java:96)
        at org.apache.parquet.hadoop.CodecFactory.createDecompressor(CodecFactory.java:212)
        at org.apache.parquet.hadoop.CodecFactory.getDecompressor(CodecFactory.java:201)
        at org.apache.parquet.hadoop.CodecFactory.getDecompressor(CodecFactory.java:42)
        at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1519)
        at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1402)
        at org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1023)
        at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:928)
        at [myClassPath]([myClass].java:167)

Dependencies

 <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-hdfs</artifactId>
   <version>3.1.1.3.1.4.0-315</version>
 </dependency>
 <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-common</artifactId>
   <version>3.1.1.3.1.4.0-315</version>
 </dependency>
 <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-launcher_2.12</artifactId>
   <version>3.0.0-preview2</version>
 </dependency>
 <dependency>
   <groupId>org.apache.parquet</groupId>
   <artifactId>parquet-avro</artifactId>
   <version>1.12.0</version>
 </dependency>

It seems as though the SnappyCodec class cannot be found by the CodecFactory class, but I looked into my referenced libraries and the class is there (screenshot: referenced libraries).

CodecFactory should be able to recognize the SnappyCodec class. Any recommendations? Thanks


Solution

  • Found a solution.

    So the problem was that the SnappyCodec class was being shaded by the maven shade plugin I have configured for my application.

    I realized this after packaging the jar with maven, opening that jar with WinZip, and checking the codec directory of the packaged jar (where I found that SnappyCodec.class no longer existed).

    The solution was that I needed to add the following filters to the configuration of my maven shade plugin:

    <filter>
        <artifact>org.apache.parquet:parquet-hadoop</artifact>
        <includes>
              <include>**</include>
        </includes>
    </filter>
    <filter>
        <artifact>org.apache.parquet:parquet-column</artifact>
        <includes>
              <include>**</include>
        </includes>
    </filter>
    <filter>
        <artifact>org.apache.parquet:parquet-encoding</artifact>
        <includes>
              <include>**</include>
        </includes>
    </filter>
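
    For reference, these filters sit inside the shade plugin's <configuration> element. A rough sketch of the full plugin block (the version shown here is just an example, keep whatever you already use):

        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-shade-plugin</artifactId>
          <version>3.2.4</version> <!-- example version only -->
          <executions>
            <execution>
              <phase>package</phase>
              <goals>
                <goal>shade</goal>
              </goals>
              <configuration>
                <filters>
                  <filter>
                    <artifact>org.apache.parquet:parquet-hadoop</artifact>
                    <includes>
                      <include>**</include>
                    </includes>
                  </filter>
                  <!-- the parquet-column and parquet-encoding filters from above go here too -->
                </filters>
              </configuration>
            </execution>
          </executions>
        </plugin>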
    

    Basically, maven-shade was stripping seemingly random classes out of the parquet-hadoop artifact. By adding the <include> filter, maven-shade left every class inside that artifact untouched, so SnappyCodec.class was no longer removed.

    After doing this, I needed to add the other two filters, because using the <include> filter on the parquet-hadoop artifact excluded every other parquet-* artifact from the compiled jar. So I had to explicitly include parquet-column and parquet-encoding as well, since my application uses some classes from those artifacts.

    This configuration meant that maven-shade would not touch these three artifacts, so every class present in them before packaging remains in the jar after compiling/packaging with maven (and is therefore available at runtime, which it wasn't before, causing the original error). Awesome!
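
    As a quick sanity check after repackaging (just an illustrative snippet, not part of the original setup), the codec class can be looked up by name, which is essentially what CodecFactory attempts when it throws the original error:

        // Illustrative check: resolve the codec class by name, as CodecFactory does internally.
        try {
            Class.forName("org.apache.parquet.hadoop.codec.SnappyCodec");
            System.out.println("SnappyCodec is on the classpath");
        } catch (ClassNotFoundException e) {
            System.out.println("SnappyCodec is still missing from the packaged jar");
        }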