I have recently been trying to build a metadata-driven pipeline in ADF using the NYC taxi data, but the process fails for every file except one. With that single file I can also 'Preview data' in ADF. For all other files in the same container and directory I get the following error:
An error occurred when invoking java, message:
java.lang.NoClassDefFoundError:Could not initialize class com.github.luben.zstd.RecyclingBufferPool
total entry:19
org.apache.parquet.hadoop.codec.ZstandardCodec.createInputStream(ZstandardCodec.java:90)
org.apache.parquet.hadoop.codec.ZstandardCodec.createInputStream(ZstandardCodec.java:83)
org.apache.parquet.hadoop.CodecFactory$HeapBytesDecompressor.decompress(CodecFactory.java:112)
org.apache.parquet.hadoop.ColumnChunkPageReadStore$ColumnChunkPageReader.readDictionaryPage(ColumnChunkPageReadStore.java:236)
org.apache.parquet.column.impl.ColumnReaderBase.<init>(ColumnReaderBase.java:410)
org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:46)
org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:82)
org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:271)
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:177)
org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:141)
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:230)
org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
com.microsoft.datatransfer.bridge.parquet.ParquetBatchReaderBridge.<init>(ParquetBatchReaderBridge.java:70)
com.microsoft.datatransfer.bridge.parquet.ParquetBatchReaderBridge.open(ParquetBatchReaderBridge.java:64)
com.microsoft.datatransfer.bridge.parquet.ParquetFileBridge.createReader(ParquetFileBridge.java:22)
I cannot find any knowledge base article online that identifies the source of this error, nor anyone else reporting a similar issue.
Since I can read and copy only one of the files, my hunch is that this might be a permission or policy issue, but I am quite new to Azure and not sure what to look into if it is not permissions.
In the meantime, I will keep reading about permissions and policies for containers, directories, and files, but I may be looking in the wrong direction.
I would appreciate any suggestions on what I should check or where I could continue researching to understand the issue.
This is the second error I also get when selecting other files; reading through it suggests there is some restriction related to an authorization or security mechanism:
An error occurred when invoking java, message: java.lang.UnsatisfiedLinkError:D:\Users\_azbatchtask_1\AppData\Local\Temp\libzstd-jni-1.5.5-59177584671814094752.dll: Your organization used Device Guard to block this app. Contact your support person for more info
no zstd-jni-1.5.5-5 in java.library.path
Unsupported OS/arch, cannot find /win/amd64/libzstd-jni-1.5.5-5.dll or load zstd-jni-1.5.5-5 from system libraries. Please try building from source the jar or providing libzstd-jni-1.5.5-5 in your system.
total entry:30
java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
java.lang.Runtime.loadLibrary0(Runtime.java:870)
java.lang.System.loadLibrary(System.java:1122)
com.github.luben.zstd.util.Native$1.run(Native.java:69)
com.github.luben.zstd.util.Native$1.run(Native.java:67)
java.security.AccessController.doPrivileged(Native Method)
com.github.luben.zstd.util.Native.loadLibrary(Native.java:67)
com.github.luben.zstd.util.Native.load(Native.java:154)
com.github.luben.zstd.util.Native.load(Native.java:85)
com.github.luben.zstd.ZstdOutputStreamNoFinalizer.<clinit>(ZstdOutputStreamNoFinalizer.java:18)
com.github.luben.zstd.RecyclingBufferPool.<clinit>(RecyclingBufferPool.java:18)
org.apache.parquet.hadoop.codec.ZstandardCodec.createInputStream(ZstandardCodec.java:90)
org.apache.parquet.hadoop.codec.ZstandardCodec.createInputStream(ZstandardCodec.java:83)
org.apache.parquet.hadoop.CodecFactory$HeapBytesDecompressor.decompress(CodecFactory.java:112)
org.apache.parquet.hadoop.ColumnChunkPageReadStore$ColumnChunkPageReader.readDictionaryPage(ColumnChunkPageReadStore.java:236)
org.apache.parquet.column.impl.ColumnReaderBase.<init>(ColumnReaderBase.java:410)
org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:46)
org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:82)
org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:271)
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:177)
org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:141)
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:230)
org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
com.microsoft.datatransfer.bridge.parquet.ParquetBatchReaderBridge.<init>(ParquetBatchReaderBridge.java:70)
com.microsoft.datatransfer.bridge.parquet.ParquetBatchReaderBridge.open(ParquetBatchReaderBridge.java:64)
com.microsoft.datatransfer.bridge.parquet.ParquetFileBridge.createReader(ParquetFileBridge.java:22)
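Both stack traces point at the ZSTD codec (zstd-jni) rather than at storage access, so one thing I can check is which compression codec each file actually uses. Below is a minimal sketch of that check using pyarrow against the files I downloaded locally (the downloads/ folder and file pattern are just placeholders for wherever the test files sit, and pyarrow is assumed to be installed):

import glob
import pyarrow.parquet as pq

# Print the compression codec of the first column chunk in each row group.
# If the one working file reports SNAPPY while the failing files report ZSTD,
# the problem is the codec on the runtime, not container/directory permissions.
for path in sorted(glob.glob("downloads/yellow_tripdata_*.parquet")):
    meta = pq.ParquetFile(path).metadata
    codecs = {
        meta.row_group(rg).column(0).compression
        for rg in range(meta.num_row_groups)
    }
    print(path, codecs)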
Here is a view of the schema of the files:
{
"type" : "record",
"name" : "schema",
"fields" : [ {
"name" : "VendorID",
"type" : [ "null", "int" ],
"default" : null
}, {
"name" : "tpep_pickup_datetime",
"type" : [ "null", {
"type" : "long",
"logicalType" : "local-timestamp-micros"
} ],
"default" : null
}, {
"name" : "tpep_dropoff_datetime",
"type" : [ "null", {
"type" : "long",
"logicalType" : "local-timestamp-micros"
} ],
"default" : null
}, {
"name" : "passenger_count",
"type" : [ "null", "long" ],
"default" : null
}, {
"name" : "trip_distance",
"type" : [ "null", "double" ],
"default" : null
}, {
"name" : "RatecodeID",
"type" : [ "null", "long" ],
"default" : null
}, {
"name" : "store_and_fwd_flag",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "PULocationID",
"type" : [ "null", "int" ],
"default" : null
}, {
"name" : "DOLocationID",
"type" : [ "null", "int" ],
"default" : null
}, {
"name" : "payment_type",
"type" : [ "null", "long" ],
"default" : null
}, {
"name" : "fare_amount",
"type" : [ "null", "double" ],
"default" : null
}, {
"name" : "extra",
"type" : [ "null", "double" ],
"default" : null
}, {
"name" : "mta_tax",
"type" : [ "null", "double" ],
"default" : null
}, {
"name" : "tip_amount",
"type" : [ "null", "double" ],
"default" : null
}, {
"name" : "tolls_amount",
"type" : [ "null", "double" ],
"default" : null
}, {
"name" : "improvement_surcharge",
"type" : [ "null", "double" ],
"default" : null
}, {
"name" : "total_amount",
"type" : [ "null", "double" ],
"default" : null
}, {
"name" : "congestion_surcharge",
"type" : [ "null", "double" ],
"default" : null
}, {
"name" : "Airport_fee",
"type" : [ "null", "double" ],
"default" : null
} ]
}
Below I've attached screenshots of the NYC Yellow Taxi data dictionary and a screenshot of some files I've downloaded to test the ADF pipeline.
The error indicates a missing dependency class when you preview data in the dataset or run a copy activity.
Some of the Parquet files require additional drivers and dependencies to be read successfully, so use a Data Flow, which reads the data on a Spark cluster that includes all the required dependencies.
I got the same error for the NYC trip data, but when I tried it in a Data Flow it read successfully.
So, while creating the dataset, set the Import schema option to None and save it.
Next, use this dataset in the Data Flow source settings.
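If you want to sanity-check the same point outside ADF, you can try a Spark Parquet read locally, since Spark ships with zstd support; this is only a rough sketch, assuming PySpark is installed and one of the downloaded files is used (the file path here is just an example, not the exact blob path):

from pyspark.sql import SparkSession

# Spark bundles the zstd codec, so a file that fails in the copy activity
# should still read here, which mirrors what the Data Flow does on its cluster.
spark = SparkSession.builder.appName("zstd-parquet-check").getOrCreate()

df = spark.read.parquet("downloads/yellow_tripdata_2023-01.parquet")  # example path
df.printSchema()
print(df.count())

spark.stop()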