I am trying to create an AvroParquetReader for a Parquet file stored as a block blob in an Azure storage account, but I get this error: Caused by: java.lang.RuntimeException: InputBuffer@7a70b9e9 is not a Parquet file. Expected magic number at tail, but found [0, 0, 0, 0]
public void parquetReader() {
    BlobServiceClient blobServiceClient =
        new BlobServiceClientBuilder()
            .endpoint("https://" + storageAccountName + ".blob.core.windows.net/")
            .credential(new StorageSharedKeyCredential(storageAccountName, blobKey))
            .connectionString(storageAccountConnectionString)
            .buildClient();
    BlobContainerClient blobContainerClient =
        blobServiceClient.getBlobContainerClient(containerName);
    String path = "data/first/test.parquet";
    BlockBlobClient blockBlobClient = blobContainerClient.getBlobClient(path).getBlockBlobClient();
    InputBuffer inputBuffer =
        InputBuffer.create(
            blockBlobClient.openInputStream(), Math.toIntExact(blockBlobClient.getProperties().getBlobSize()));
    ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(inputBuffer).build(); // getting error here
}
How can I resolve this error?
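Caused by: java.lang.RuntimeException: InputBuffer@7a70b9e9 is not a Parquet file. Expected magic number at tail, but found [0, 0, 0, 0]
This error probably occurs because you are passing the blob to AvroParquetReader as a stream-backed buffer, whereas the reader expects a file. Parquet reads the footer from the end of the file first, so it needs the complete content; the zeros at the tail suggest the buffer was not fully filled from the stream.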
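To see what that check is looking at, here is a minimal sketch that reads a stream completely into a byte array and then tests whether the last four bytes are the Parquet magic PAR1. The names ParquetMagicCheck, readFully and endsWithMagic are made up for illustration and are not part of parquet-java or the Azure SDK. It also assumes the zeros come from a buffer that was sized to the blob but only partially filled, which I have not verified against your InputBuffer class.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ParquetMagicCheck {
    // A Parquet file starts and ends with the 4-byte magic "PAR1".
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    // Hypothetical helper: keep reading until `size` bytes have arrived.
    // A single InputStream.read(...) call may return fewer bytes than requested,
    // which would leave the rest of the array filled with zeros.
    public static byte[] readFully(InputStream in, int size) throws IOException {
        byte[] data = new byte[size];
        int offset = 0;
        while (offset < size) {
            int n = in.read(data, offset, size - offset);
            if (n < 0) {
                throw new IOException("Stream ended after " + offset + " of " + size + " bytes");
            }
            offset += n;
        }
        return data;
    }

    // Returns true if the last four bytes of the buffer are "PAR1".
    public static boolean endsWithMagic(byte[] data) {
        if (data.length < MAGIC.length) {
            return false;
        }
        byte[] tail = Arrays.copyOfRange(data, data.length - MAGIC.length, data.length);
        return Arrays.equals(tail, MAGIC);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes, not a real Parquet file: only the head and tail magic matter here.
        byte[] fake = "PAR1....PAR1".getBytes(StandardCharsets.US_ASCII);
        byte[] full = readFully(new ByteArrayInputStream(fake), fake.length);
        System.out.println(endsWithMagic(full));         // true
        System.out.println(endsWithMagic(new byte[12])); // false: an unfilled buffer is all zeros
    }
}
```

If endsWithMagic returns false on the bytes you read from the blob, the problem is in how the buffer is filled rather than in the Parquet file itself.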
You can use the code below, which downloads the blob (the Parquet file) to a temporary file, reads it, and deletes the temporary file afterwards.
Code:
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobClientBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import java.io.File;
import java.io.IOException;

String connectionString = "xxxxxx";
String containerName = "test";
String blobName = "sample/data.parquet";
BlobClient blob = new BlobClientBuilder()
    .connectionString(connectionString)
    .containerName(containerName)
    .blobName(blobName)
    .buildClient();
String tempFilePath = "data.parquet";
try {
    // Download the blob to a temporary file
    blob.downloadToFile(tempFilePath);
    // Create a ParquetReader for the Parquet file with Avro support.
    // try-with-resources closes the reader even if reading fails.
    Path parquetFile = new Path(tempFilePath);
    try (ParquetReader<GenericRecord> reader =
            AvroParquetReader.<GenericRecord>builder(parquetFile).build()) {
        // Iterate over the records in the file; read() returns null at the end.
        GenericRecord record;
        while ((record = reader.read()) != null) {
            // Do something with the record.
            System.out.println(record);
        }
    }
} catch (IOException e) {
    e.printStackTrace();
} finally {
    // Delete the temporary file
    File tempFile = new File(tempFilePath);
    if (tempFile.exists() && !tempFile.delete()) {
        System.err.println("Failed to delete the temporary file: " + tempFilePath);
    }
}
Output:
{"column0": "first", "column1": " last"}
{"column0": "Jorge", "column1": "Frank"}
{"column0": "Hunter", "column1": "Moreno"}
{"column0": "Esther", "column1": "Guzman"}
{"column0": "Dennis", "column1": "Stephens"}
{"column0": "Nettie", "column1": "Franklin"}
{"column0": "Stanley", "column1": "Gibson"}
{"column0": "Eugenia", "column1": "Greer"}
{"column0": "Jeffery", "column1": "Delgado"}
{"column0": "Clara", "column1": "Cross"}
{"column0": "Bernice", "column1": "Vega"}
{"column0": "Kevin", "column1": "Diaz"}
{"column0": "Henrietta", "column1": "Rivera"}
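The code above writes to a fixed name (data.parquet) in the working directory, which can collide if two readers run at the same time or if an earlier run left the file behind. One possible variation, sketched here with only java.nio.file, is to create a uniquely named temporary file, hand it to a callback, and delete it afterwards. TempFileScope, withTempFile and PathAction are hypothetical names; in real code the callback would do the blob download and the Parquet read.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TempFileScope {
    // Hypothetical callback type: receives the temp file path and returns a result.
    @FunctionalInterface
    public interface PathAction<T> {
        T apply(Path tempFile) throws IOException;
    }

    // Creates a uniquely named temp file, runs the action, and always deletes the file.
    public static <T> T withTempFile(String suffix, PathAction<T> action) throws IOException {
        Path temp = Files.createTempFile("blob-", suffix);
        try {
            return action.apply(temp);
        } finally {
            Files.deleteIfExists(temp);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for "download, then read": write 4 bytes and report the file size.
        Long size = withTempFile(".parquet", temp -> {
            Files.write(temp, new byte[4]);
            return Files.size(temp);
        });
        System.out.println(size); // 4
    }
}
```

Note that Files.createTempFile already creates an empty file, so the download inside the callback would need to overwrite it. As far as I know the overload blob.downloadToFile(path, true) does that, but check it against the SDK version you are using.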
Reference: Apache Parquet Java Example: A Step-by-Step Guide (hatchjs.com)