apache-spark cdc

Shareplex CDC output - complete after image possible?


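Shareplex CDC offers 3 JSON sub-structs per CDC record: meta (operation and table), data (the changed columns with their new values) and key (the before image of the row), as in the example below.

This is what our data engineers say, and the documentation also seems to describe only this layout.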

My question is: how can we get the complete after image of the record, including both changed and unchanged columns? Maybe it is simply not possible.

{
  "meta":{
    "op":"upd",
    "table":"BILL.PRODUCTS"
  },
  "data":{
    "PRICE":"3599"
  },
  "key":{
    "PRODUCT_ID":"230117",
    "DESCRIPTION":"Hamsberry vintage tee, cherry",
    "PRICE":"4099"
  }
}

As far as I can see, the above format is awkward to handle in Spark, whether the schema is inferred per batch or the complete schema is defined up front and the missing columns then come through as NULL values.


Solution

  • No, this is not possible out of the box.

    What you can do is read the JSON from Kafka, build the after image as shown below, write it to a new Kafka topic and proceed from there:

    import org.json4s._
    import org.json4s.jackson.JsonMethods._
    
    val jsonS = 
    """
    {
      "meta":{
        "op":"upd",
        "table":"BILL.PRODUCTS"
      },
      "data":{
        "PRICE":"3599"
      },
      "key":{
        "PRODUCT_ID":"230117",
        "DESCRIPTION":"Hamsberry vintage tee, cherry",
        "PRICE":"4099"
      }
    }
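    """
    
    val jsonNN = parse(jsonS)
    val meta = jsonNN \ "meta"   // operation type and table name (not needed for the merge)
    val data = jsonNN \ "data"   // changed columns with their new values
    val key  = jsonNN \ "key"    // before image of the row
    
    // diff is taken relative to key:
    //   changed = fields in both whose values differ, carrying the value from data
    //   added   = fields that exist only in data
    //   deleted = fields that exist only in key, i.e. the unchanged columns
    val Diff(changed, added, deleted) = key diff data
    
    // After image = new values + unchanged columns (+ any columns present only in data)
    val afterImage = changed merge deleted merge added
    
    // Convert to JSON
    println(pretty(render(afterImage)))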
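
    The same idea can be wrapped in a small function that is applied to each Kafka message value before it is written to the new topic. This is only a sketch: the name afterImage is made up for illustration, and it assumes that "key" always carries the full before image, as in the sample record. It uses json4s merge directly instead of diff, since merge keeps every field of the left side and lets the right side win where names clash.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Hypothetical helper, not part of Shareplex or json4s.
// Assumption: "key" holds the full before image, as in the sample record.
def afterImage(record: String): String = {
  val json = parse(record)
  // merge keeps all fields of "key"; for fields that also exist in "data"
  // the value from "data" replaces the old one, and fields that exist only
  // in "data" are appended.
  val merged = (json \ "key") merge (json \ "data")
  compact(render(merged))
}

val sample =
  """{"meta":{"op":"upd","table":"BILL.PRODUCTS"},
    |"data":{"PRICE":"3599"},
    |"key":{"PRODUCT_ID":"230117","DESCRIPTION":"Hamsberry vintage tee, cherry","PRICE":"4099"}}""".stripMargin

// Prints the record with PRODUCT_ID and DESCRIPTION unchanged and the new PRICE
println(afterImage(sample))
```

    Because the function is a plain String => String, it could be called from a map over the message values or registered as a Spark UDF. Other operation types are not covered here and would need their own handling.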