Tags: java, azure, azure-blob-storage, azure-storage, parquet

Why do I get the "is not a Parquet file" error when creating a Parquet reader?


I am trying to create an AvroParquetReader for a Parquet file stored as a block blob in an Azure storage account, but I get this error: Caused by: java.lang.RuntimeException: InputBuffer@7a70b9e9 is not a Parquet file. Expected magic number at tail, but found [0, 0, 0, 0]


public void parquetReader(){

    BlobServiceClient blobServiceClient =
        new BlobServiceClientBuilder()
            .endpoint("https://" + storageAccountName + ".blob.core.windows.net/")
            .credential(new StorageSharedKeyCredential(storageAccountName, blobKey))
            .connectionString(storageAccountConnectionString)
            .buildClient();

    BlobContainerClient blobContainerClient =
        blobServiceClient.getBlobContainerClient(containerName);

    String path = "data/first/test.parquet";

    BlockBlobClient blockBlobClient = blobContainerClient.getBlobClient(path).getBlockBlobClient();

    
    InputBuffer inputBuffer =
        InputBuffer.create(
            blockBlobClient.openInputStream(), Math.toIntExact(blockBlobClient.getProperties().getBlobSize()));

    ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(inputBuffer).build(); // getting error here

  }

How can I resolve this error?


Solution

  • but getting an error - Caused by: java.lang.RuntimeException: InputBuffer@7a70b9e9 is not a Parquet file. Expected magic number at tail, but found [0, 0, 0, 0]

    The error probably occurs because you are passing the blob as a stream to AvroParquetReader, whereas its builder expects a file (a Hadoop Path or a Parquet InputFile).
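
For context on the message itself: a regular Parquet file starts and ends with the 4-byte ASCII marker PAR1, and the reader looks for that marker at the end of its input. A tail of [0, 0, 0, 0] suggests that the bytes handed to the reader were not a complete copy of the blob. The sketch below is a hypothetical diagnostic helper (ParquetMagicCheck and its method names are made up for illustration and are not part of Parquet or the Azure SDK) that checks both ends of a byte array using only the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Parquet or the Azure SDK.
final class ParquetMagicCheck {
    // A regular Parquet file begins and ends with the ASCII bytes "PAR1".
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    // True if the first four bytes are the Parquet magic.
    static boolean hasMagicAtHead(byte[] data) {
        return data.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOfRange(data, 0, MAGIC.length), MAGIC);
    }

    // True if the last four bytes are the Parquet magic.
    static boolean hasMagicAtTail(byte[] data) {
        return data.length >= MAGIC.length
                && Arrays.equals(
                        Arrays.copyOfRange(data, data.length - MAGIC.length, data.length), MAGIC);
    }

    public static void main(String[] args) {
        // Stand-in for a fully read file: magic, some payload, magic.
        byte[] complete = "PAR1....PAR1".getBytes(StandardCharsets.US_ASCII);
        // Stand-in for a buffer of the right size that was only partly filled:
        // Arrays.copyOf pads the missing tail with zero bytes.
        byte[] partlyFilled =
                Arrays.copyOf("PAR1..".getBytes(StandardCharsets.US_ASCII), complete.length);

        System.out.println(hasMagicAtTail(complete));     // true
        System.out.println(hasMagicAtTail(partlyFilled)); // false, the tail is [0, 0, 0, 0]
    }
}
```

Running such a check against the bytes you actually pass to the reader can help distinguish a partly read buffer from data that was never a Parquet file.
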

    You can use the code below, which downloads the blob (the Parquet file) to a temporary file, reads it, and deletes the temporary file afterwards.

    Code:

            import com.azure.storage.blob.BlobClient;
            import com.azure.storage.blob.BlobClientBuilder;
            import org.apache.avro.generic.GenericRecord;
            import org.apache.hadoop.fs.Path;
            import org.apache.parquet.avro.AvroParquetReader;
            import org.apache.parquet.hadoop.ParquetReader;
            import java.io.File;
            import java.io.IOException;

            String connectionString = "xxxxxx";
            String containerName = "test";
            String blobName = "sample/data.parquet";
    
            BlobClient blob = new BlobClientBuilder()
                    .connectionString(connectionString)
                    .containerName(containerName)
                    .blobName(blobName)
                    .buildClient();
            String tempFilePath = "data.parquet";
    
            try {
                // Download the blob to a temporary file, overwriting any leftover copy
                // (without the overwrite flag the call fails if the file already exists).
                blob.downloadToFile(tempFilePath, true);
    
                // Create a ParquetReader for the Parquet file with Avro support.
                // try-with-resources closes the reader even if reading fails.
                Path parquetFile = new Path(tempFilePath);
                try (ParquetReader<GenericRecord> reader =
                        AvroParquetReader.<GenericRecord>builder(parquetFile).build()) {
                    // Iterate over the records in the file; read() returns null at the end.
                    GenericRecord record;
                    while ((record = reader.read()) != null) {
                        // Do something with the record.
                        System.out.println(record);
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                // Delete the temporary file
                File tempFile = new File(tempFilePath);
                if (tempFile.exists() && !tempFile.delete()) {
                    System.err.println("Failed to delete the temporary file: " + tempFilePath);
                }
            }
    

    Output:

    {"column0": "first", "column1": " last"}
    {"column0": "Jorge", "column1": "Frank"}
    {"column0": "Hunter", "column1": "Moreno"}
    {"column0": "Esther", "column1": "Guzman"}
    {"column0": "Dennis", "column1": "Stephens"}
    {"column0": "Nettie", "column1": "Franklin"}
    {"column0": "Stanley", "column1": "Gibson"}
    {"column0": "Eugenia", "column1": "Greer"}
    {"column0": "Jeffery", "column1": "Delgado"}
    {"column0": "Clara", "column1": "Cross"}
    {"column0": "Bernice", "column1": "Vega"}
    {"column0": "Kevin", "column1": "Diaz"}
    {"column0": "Henrietta", "column1": "Rivera"}
    

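The download, read, and delete steps can also be wrapped in a small helper so that cleanup always runs. This is only a sketch using the JDK: TempFileScope, FileAction, and withTempFile are hypothetical names introduced for illustration, and the lambda body stands in for the blob download and Parquet read. Note that Path here is java.nio.file.Path, not the Hadoop Path used with AvroParquetReader.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: runs an action against a temporary file and always deletes it.
final class TempFileScope {
    // Hypothetical callback type for the work done while the temporary file exists.
    interface FileAction<T> {
        T apply(Path file) throws IOException;
    }

    static <T> T withTempFile(String suffix, FileAction<T> action) throws IOException {
        // Creates a uniquely named, empty file in the default temp directory.
        Path temp = Files.createTempFile("blob-", suffix);
        try {
            return action.apply(temp);
        } finally {
            // Runs whether the action returned normally or threw.
            Files.deleteIfExists(temp);
        }
    }

    public static void main(String[] args) throws IOException {
        long size = withTempFile(".parquet", file -> {
            // Placeholder for the real work: download the blob to `file`, then read it.
            // The file already exists at this point, so a download call would need
            // to be allowed to overwrite it.
            Files.write(file, "placeholder".getBytes(StandardCharsets.UTF_8));
            return Files.size(file);
        });
        System.out.println("bytes written: " + size);
    }
}
```

A unique temporary name also avoids collisions when the same code runs more than once at a time, which a fixed name such as data.parquet does not.
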

    Reference: Apache Parquet Java Example: A Step-by-Step Guide (hatchjs.com)