Tags: apache-flink, flink-streaming, flink-sql, flink-cep, pyflink

Can a source ignore unknown fields in Apache Flink?


Suppose many services push events to a single Kafka topic, and I want to use Flink to process those events. The events are heterogeneous, but they all share a few common fields.

For example, every event contains these three common fields, shown here as JSON:

{
  "id": 1,
  "msg": "hello",
  "state": 5
}

Each service also adds its own fields that differ from event to event, e.g. a, b, c. How can I ignore those unknown fields and work only with the recognized ones?


Solution

  • Use SimpleStringSchema to read your records from Kafka as strings. Then add a process function that parses each incoming String as JSON and extracts only the "recognized fields".

    See Immerok's Creating Dead Letter Queues recipe for code that shows how to read from Kafka as a JSON string, and then parse it in a downstream operator.
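
The parsing step described above can be sketched in plain Python, which would fit a PyFlink job. This is only an illustration, not the code from the recipe: the helper name extract_known_fields and the KNOWN_FIELDS tuple are hypothetical, and returning None for records that are malformed or lack a common field is an assumption (a real job might route those records to a dead letter queue instead).

```python
import json
from typing import Optional

# Fields shared by every event, taken from the example in the question.
KNOWN_FIELDS = ("id", "msg", "state")


def extract_known_fields(raw: str) -> Optional[dict]:
    """Parse one Kafka record (a JSON string) and keep only the recognized fields.

    Returns None when the record is not a JSON object or is missing one of
    the common fields (an assumed policy; adjust to your needs).
    """
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON
    if not isinstance(event, dict):
        return None  # valid JSON, but not an object (e.g. a list or a number)
    if any(name not in event for name in KNOWN_FIELDS):
        return None  # a common field is missing
    # Service-specific fields such as a, b, c are simply never copied over.
    return {name: event[name] for name in KNOWN_FIELDS}
```

In a PyFlink DataStream job that reads the topic with SimpleStringSchema, a function like this could be applied with something along the lines of stream.map(extract_known_fields).filter(lambda e: e is not None), so that downstream operators only ever see id, msg, and state.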