java apache-beam avro apache-beam-io

Error when using Apache Beam ParquetIO to read data from a Parquet file due to an Avro schema exception


I am using the Apache Beam ParquetIO.read(schema) method to read data from a parquet file. When performing the read I was getting the following error: java.lang.NullPointerException: null of com.namespace.myfield field myfield.

This was occurring because the field in question had a null value in the source data. I updated the Avro schema passed to ParquetIO.read(schema) to include a union, so that it now looks like this:

{
   "type": "record",
   "name": "TABLE",
   "namespace": "com.namespace",
   "fields": [
      {
         "name": "myfield",
         "type": [
            "null",
            {
                "type": "fixed",
                "name": "myfield",
                "size": 5,
                "logicalType": "decimal",
                "precision": 10,
                "scale": 5
            }
         ]
      }
   ]
}

My thinking was this would allow the value to be null or of the fixed type required.

When I run the same code now I get a different error: org.apache.avro.UnresolvedUnionException: Not in union ["null",{"type":"fixed","name":"myfield","namespace":"com.namespace","size":5,"logicalType":"decimal","precision":10,"scale":5}]: [0, 0, 0, 0, 0]

When I step through the code in a debugger, the exception appears to be thrown from the resolveUnion method of the org.apache.avro.generic.GenericData class. It looks as though it is unable to find the required fixed type because it can't handle the complex type within the union.

Has anyone had any experience of getting ParquetIO working with reading a file using an avro schema that contains a union of null and a fixed type?

For reference, I am using version 2.19.0 of beam-sdks-java-io-parquet, and I believe this in turn uses version 1.8.2 of org.apache.avro. I am unsure whether this is occurring because of a known bug in the older versions being used, or whether I am missing something in the format of the schema.

UPDATE: It now looks like the error occurs because the lookup searches for the fixed type within the union by the name "myfield", whereas it can only be found under its fully qualified name, "com.namespace.myfield". I am not entirely sure what to change so that the lookup includes the namespace.


Solution

  • So I figured this out, for anyone who runs into the same issue. ParquetIO.read() in Apache Beam uses org.apache.avro.generic. When resolving the union, the resolveUnion method of the GenericData class runs this line of code:

    Integer i = union.getIndexNamed(getSchemaName(datum));
    

    The getIndexNamed method is called for the fixed type. Inside this method there is a map called indexByName, which contains the elements of the union. The line of code above looks up an entry called 'myfield'. However, 'myfield' is not in that map: when the map was created, the type was added under its full name (including the namespace), so its key is 'com.namespace.myfield'. As a result, the union can never be resolved.

    If I remove the namespace from the record it is able to resolve the union with no issues.
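
To make the mismatch concrete, here is a minimal sketch of the lookup as I understand it from stepping through the debugger. It is not Avro's actual code: the class UnionLookupSketch and its indexByName helper are hypothetical stand-ins built on a plain map, and the only assumption taken from above is that named types are keyed by their full name while the lookup uses the short name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Simplified stand-in for the union name lookup described above.
 * This is NOT Avro's implementation; all names here are hypothetical.
 */
class UnionLookupSketch {

    /** Builds a name -> branch index map, keying named types by their full name. */
    static Map<String, Integer> indexByName(String namespace, String... branchNames) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (int i = 0; i < branchNames.length; i++) {
            String name = branchNames[i];
            // "null" is a primitive type, so it is never qualified with a namespace.
            boolean qualify = namespace != null && !name.equals("null");
            index.put(qualify ? namespace + "." + name : name, i);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, Integer> withNamespace = indexByName("com.namespace", "null", "myfield");
        Map<String, Integer> withoutNamespace = indexByName(null, "null", "myfield");

        // Short-name lookup misses when the key carries the namespace.
        System.out.println(withNamespace.get("myfield"));               // null
        // The full name is the only key that matches.
        System.out.println(withNamespace.get("com.namespace.myfield")); // 1
        // Without a namespace the short name is the key, so the lookup succeeds.
        System.out.println(withoutNamespace.get("myfield"));            // 1
    }
}
```

In this model the first lookup returning null corresponds to the UnresolvedUnionException, and the last one corresponds to the schema with the namespace removed.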