I'm currently trying to read a Parquet file in Java without using Spark. Here's what I have so far, based on Adam Melnyk's blog post on the subject.
Code
List<SimpleGroup> simpleGroups = new ArrayList<>();

ParquetFileReader reader = ParquetFileReader.open(file);
MessageType schema = reader.getFooter().getFileMetaData().getSchema();
List<Type> fields = schema.getFields();
PageReadStore pages;
--> while ((pages = reader.readNextRowGroup()) != null) {
        long rows = pages.getRowCount();
        LOG.info("Number of rows: " + rows);

        MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
        RecordReader<Group> recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
        for (int i = 0; i < rows; i++) {
            SimpleGroup simpleGroup = (SimpleGroup) recordReader.read();
            simpleGroups.add(simpleGroup);
        }
    }
reader.close();
(Note: the arrow marks line 167 of my code, where the error is thrown.)
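Once the row groups are read, the plan is to pull column values off each SimpleGroup. A minimal sketch of that step, assuming a string column "name" and an int column "age" (placeholder names, not my actual schema):

// Minimal sketch: read typed values out of the collected groups.
// The column names "name" and "age" are placeholders.
for (SimpleGroup group : simpleGroups) {
    String name = group.getString("name", 0); // 0 = first value of the field
    int age = group.getInteger("age", 0);
    LOG.info(name + " is " + age);
}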
Error Message
org.apache.parquet.hadoop.BadConfigurationException: Class org.apache.parquet.hadoop.codec.SnappyCodec was not found
at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:243)
at org.apache.parquet.hadoop.CodecFactory$HeapBytesDecompressor.<init>(CodecFactory.java:96)
at org.apache.parquet.hadoop.CodecFactory.createDecompressor(CodecFactory.java:212)
at org.apache.parquet.hadoop.CodecFactory.getDecompressor(CodecFactory.java:201)
at org.apache.parquet.hadoop.CodecFactory.getDecompressor(CodecFactory.java:42)
at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1519)
at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1402)
at org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1023)
at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:928)
at [myClassPath]([myClass].java:167)
Dependencies
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>3.1.1.3.1.4.0-315</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.1.1.3.1.4.0-315</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-launcher_2.12</artifactId>
    <version>3.0.0-preview2</version>
</dependency>
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.12.0</version>
</dependency>
It seems as though CodecFactory cannot find the SnappyCodec class, but I looked through my project's referenced libraries and the class is definitely there.
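A quick way to check whether the class is actually visible at runtime (rather than just in the IDE's referenced libraries) is a reflection probe; this is only a diagnostic sketch:

// Diagnostic sketch: throws ClassNotFoundException if the codec class
// is missing from the runtime classpath (e.g. from the packaged jar).
try {
    Class.forName("org.apache.parquet.hadoop.codec.SnappyCodec");
    LOG.info("SnappyCodec is on the runtime classpath");
} catch (ClassNotFoundException e) {
    LOG.error("SnappyCodec is NOT on the runtime classpath", e);
}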
CodecFactory should be able to recognize the SnappyCodec class. Any recommendations? Thanks
Found a solution.
So the problem was that the SnappyCodec class was being stripped out of the shaded jar by the Maven Shade plugin I have configured for my application.
I realized this after packaging the jar with Maven, opening it with WinZip, and checking the codec directory of the packaged jar, where SnappyCodec.class no longer existed.
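If you don't have WinZip handy, the same check can be done programmatically. A rough sketch, where the jar path is just a placeholder:

import java.io.IOException;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class CheckShadedJar {
    public static void main(String[] args) throws IOException {
        // Placeholder path to the jar produced by `mvn package`.
        try (JarFile jar = new JarFile("target/my-app.jar")) {
            jar.stream()
               .map(JarEntry::getName)
               .filter(name -> name.startsWith("org/apache/parquet/hadoop/codec/"))
               .forEach(System.out::println); // SnappyCodec.class should be listed here
        }
    }
}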
The solution was to add the following filters to the configuration of my Maven Shade plugin:
<filter>
    <artifact>org.apache.parquet:parquet-hadoop</artifact>
    <includes>
        <include>**</include>
    </includes>
</filter>
<filter>
    <artifact>org.apache.parquet:parquet-column</artifact>
    <includes>
        <include>**</include>
    </includes>
</filter>
<filter>
    <artifact>org.apache.parquet:parquet-encoding</artifact>
    <includes>
        <include>**</include>
    </includes>
</filter>
Basically, maven-shade was dropping seemingly random classes from the parquet-hadoop artifact. By adding the <include> filter, maven-shade no longer touched any of the classes inside that artifact, so SnappyCodec.class was kept.
After doing this, I needed to add the other two filters, because using the <include> filter on the parquet-hadoop artifact ended up excluding every other parquet-* artifact from the compiled jar. So I had to explicitly tell it to include parquet-column and parquet-encoding as well, since my application uses other classes from those artifacts.
With this configuration, maven-shade leaves these three artifacts alone, so every class that was in them before packaging is still there after packaging with Maven (and therefore present at runtime, which is exactly what was missing before and what caused the original error). Awesome!
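For reference, here is roughly where those filters sit in the maven-shade-plugin configuration; the plugin version and execution details below are illustrative, not necessarily what you need:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.2.4</version> <!-- example version -->
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <filters>
                    <!-- the three <filter> blocks shown above go here -->
                </filters>
            </configuration>
        </execution>
    </executions>
</plugin>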