scala apache-spark google-bigquery

java.lang.NullPointerException: null when loading data into a Scala case class


I am reading BigQuery table data and loading it into a case class. While loading, I get this NullPointerException:

java.lang.NullPointerException: null
    at org.apache.spark.unsafe.UTF8StringBuilder.append(UTF8StringBuilder.java:76) ~[spark-unsafe_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_fieldToString_0_0$(Unknown Source) ~[?:?]
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_fieldToString_1_2$(Unknown Source) ~[?:?]
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_elementToString_1$(Unknown Source) ~[?:?]
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?]


select cm from `test-project.MAPPINGS.dataset_configurations_test` t
,unnest(column_mappings) as cm
where t.data_type='PM'
and cm.column_name in ('XPI_MIN','WEEK');

[{
  "cm": {
    "mapping_type": "aggregation",
    "source_column_name": "COLLECTTIME",
    "column_name": "WEEK",
    "name": "WEEK",
    "display_name": null,
    "description": null,
    "keep_source_column": "true",
    "formula": "DATE_FORMAT(COLLECTTIME, \"w\")",
    "functions": {
      "fun_temporal": "FIRST",
      "fun_regional": "FIRST",
      "fun_temporal_unit": [],
      "fun_regional_unit": []
    }
  }
}, {
  "cm": {
    "mapping_type": "ingestion",
    "source_column_name": "XPI_Min",
    "column_name": "XPI_MIN",
    "name": "XPI_MIN",
    "display_name": "XPI_Min",
    "description": "XPI_Min",
    "keep_source_column": "false",
    "formula": null,
    "functions": {
      "fun_temporal": "MIN",
      "fun_regional": "MIN",
      "fun_temporal_unit": [{
        "key": null,
        "value": null
      }],
      "fun_regional_unit": []
    }
  }
}]

This is the structure of the case class:

  case class Functions
  (
    fun_temporal: Option[String],
    fun_regional: Option[String],
    fun_temporal_unit: Option[Map[String,String]],
    fun_regional_unit: Option[Map[String,String]],
  )

The code fails when it tries to load the column XPI_Min, whose fun_temporal_unit contains an entry with a null key and a null value.

I can fix it by updating the BigQuery table data as shown below, but that would be too much overhead for us, since we would have to update a huge number of records. I am looking for a solution within the case class declaration, or one that uses Scala/Spark.

update `test-project.MAPPINGS.dataset_configurations_test` a
set column_mappings=
ARRAY(
    SELECT AS STRUCT mapping_type,source_column_name,column_name,b.name,display_name,description,keep_source_column,formula,
    STRUCT(functions.fun_temporal as fun_temporal
    , functions.fun_regional as  fun_regional
    ,  CAST(NULL as ARRAY<STRUCT<key STRING, value STRING>>)  as fun_temporal_unit
, CAST(NULL as ARRAY<STRUCT<key STRING, value STRING>>)  as fun_regional_unit
) as functions
FROM UNNEST(column_mappings) b where b.column_name='XPI_MIN'
)
where a.name='SNIR_XPI' and a.technology='MW'
;

Solution

  • It would help if you posted a reproducible example to verify this, but it looks like you are trying to build a map with a null key, which is not allowed:

    MapType value, keys are not allowed to have null values

    You will have to do one of three things: fix the data at the source; remove any null-key entries via a select/projection before attempting to show or map the data; or treat the column as a Seq[(String, String)] and handle the nulls in your code.
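
A minimal sketch of the last option, assuming each unit column arrives as an array of key/value structs (as the JSON output in the question suggests). The names KeyValue, FunctionsRow, and toMap are hypothetical and do not come from the original post. Modelling each entry as a case class with Option fields, rather than as a Map, lets a null key be represented and then dropped in plain Scala:

```scala
// Hypothetical names: KeyValue, FunctionsRow and toMap are illustrative only.

// One entry of an ARRAY<STRUCT<key STRING, value STRING>> column.
// Both fields are Option so that a null key or value can be represented.
final case class KeyValue(key: Option[String], value: Option[String])

// Same fields as the Functions case class in the question, but the unit
// columns are sequences of entries instead of Map[String, String].
final case class FunctionsRow(
  fun_temporal: Option[String],
  fun_regional: Option[String],
  fun_temporal_unit: Option[Seq[KeyValue]],
  fun_regional_unit: Option[Seq[KeyValue]]
)

object FunctionsRow {
  // Build a Map from the entries, dropping every entry whose key is null.
  // Values stay optional so that a null value under a real key is not lost.
  def toMap(entries: Option[Seq[KeyValue]]): Map[String, Option[String]] =
    entries.getOrElse(Seq.empty).collect {
      case KeyValue(Some(k), v) => k -> v
    }.toMap
}
```

For the XPI_MIN row in the question, FunctionsRow.toMap(row.fun_temporal_unit) would return an empty map, because its only entry has a null key. Whether Spark decodes the column into this shape depends on the schema the connector actually reports, so check printSchema before relying on it.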