My AWS Glue job for Hudi CDC is failing on a column that is the precombine field (see the error message below). I have validated that there are no NULL values in this column: it has an AFTER UPDATE trigger and a default of NOW(). When I query the parquet files using Spark, the only records that show NULL are those marked with an operation ('op') of DELETE. From my understanding, only the PRIMARY KEY is transmitted on a DELETE operation and nothing else.
Why is Hudi failing on a NULL precombine value for the DELETE operation? How can I fix this? Am I missing an option? Any help is greatly appreciated.
Error message:
2022-06-06 19:05:13,633 ERROR [Executor task launch worker for task 2.0 in stage 46.0 (TID 264)] executor.Executor (Logging.scala:logError(94)): Exception in task 2.0 in stage 46.0 (TID 264) org.apache.hudi.exception.HoodieException: The value of last_modified_date can not be null
Hudi options:
options = {
    "hoodie_overrides": {
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.precombine.field": "last_modified_date",
        "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.NonPartitionedExtractor",
        "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
        "hoodie.datasource.hive_sync.support_timestamp": "true",
    }
}
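For context, a minimal sketch of how these overrides might be merged into the full set of write options and handed to the Hudi writer in the Glue/Spark job; the table name, database, target path, and the df dataframe are placeholders, not taken from the question:

# Sketch only: table name, database, path, and `df` are assumptions.
hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.table.name": "my_table",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": "my_db",
    "hoodie.datasource.hive_sync.table": "my_table",
}
hudi_options.update(options["hoodie_overrides"])

(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/hudi/my_table/"))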
Spark query of parquet files:
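A minimal sketch of the kind of PySpark inspection described above; the parquet path is a placeholder, and the literal used for the delete marker depends on the CDC source:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("inspect-cdc-parquet").getOrCreate()

# Placeholder path to the raw CDC parquet files.
df = spark.read.parquet("s3://my-bucket/cdc/landing/my_table/")

# Count records per operation, split by whether the precombine field is NULL.
(df.groupBy("op", F.col("last_modified_date").isNull().alias("precombine_is_null"))
    .count()
    .show())

# Look at the DELETE records themselves ("DELETE" as described in the question;
# some CDC sources emit "D" or "d" instead).
df.filter(F.col("op") == "DELETE").select("id", "op", "last_modified_date").show(truncate=False)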
Answer: You can try setting the configuration hoodie.combine.before.delete to "false". When it is left enabled (the default), Hudi deduplicates the incoming DELETE records using the precombine field before applying them, so the NULL precombine values on those records trigger the error above; setting it to "false" skips that step for deletes.
Source: hoodie.combine.before.delete (Hudi configuration docs)
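For example, the override could be appended to the existing dictionary before writing (a sketch; only the new key is shown):

# Skip precombine-based dedup of incoming DELETE records, whose
# precombine field arrives as NULL from the CDC source.
options["hoodie_overrides"]["hoodie.combine.before.delete"] = "false"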