dataframe, apache-spark, objectmapper, apache-spark-dataset

Spark DataFrame vs. traditional object mapper


A traditional object mapper is typically used to abstract application code away from the database. In my scenario, I am using Spark to read data from a source and convert it to a DataFrame. The target in my case is GCP BigQuery (BQ). In this scenario, is there any advantage to using a traditional object mapper to map to the GCP BQ table? Or does Spark's DataFrame (or some other Spark feature) already serve the purpose of an object mapper?

I am looking to understand the importance of an object mapper on top of having a Spark DataFrame.


Solution

  • If you must convert to objects (instead of using DataFrame / Row directly), Spark provides Encoders. For performance reasons, you typically want to keep as much of your transformation code as possible in the Spark Column API (or SQL directly). Any time you use your own classes there is a cost for serializing and deserializing the objects from Spark's own InternalRow format (see the first sketch below).

    For those occasions where you really need your own JVM classes, you have bean encoders (also used with Java), product encoders (case classes in Scala), Kryo, etc. Where you want more specific, Scala-only encoding you can use Frameless, but the default Spark product encoders already provide most of what is typically needed.

    There is no equivalent to an ORM-style automatic join to related tables in Spark; you must wire that up yourself from the various datasets (see the second sketch below), and again, adding an object-mapper layer for that is probably not worthwhile.
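
A minimal Scala sketch of the two styles discussed above, assuming a hypothetical "orders" source with `id`, `country` and `amount` columns (the paths, class and column names are illustrative, not from the question): most of the pipeline stays on the untyped Column API, and `.as[Order]` is used only where a typed case class is genuinely needed.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions._

// Hypothetical record shape for the typed path.
case class Order(id: Long, country: String, amount: Double)

object EncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("encoder-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Untyped DataFrame path: the data never leaves Spark's InternalRow
    // representation, so Catalyst can optimize the whole pipeline.
    val ordersDf = spark.read.parquet("/tmp/orders") // illustrative path
    val totalsDf = ordersDf
      .filter(col("amount") > 0)
      .groupBy(col("country"))
      .agg(sum(col("amount")).as("total_amount"))

    // Typed path: .as[Order] uses the implicit product encoder for the case
    // class. Each typed filter/map deserializes the row into an Order object
    // and serializes the result back - the cost mentioned above.
    val orders: Dataset[Order] = ordersDf.as[Order]
    val bigOrders: Dataset[Order] = orders.filter(o => o.amount > 1000.0)

    totalsDf.show()
    bigOrders.show()
    spark.stop()
  }
}
```

The typed filter is the only place where `Order` objects are materialized from `InternalRow`; the aggregation stays entirely in Spark's internal format, which is why keeping as much as possible on the Column API is usually the better trade-off.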
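And, continuing the same sketch with a hypothetical "countries" lookup table (assumed to carry `country` and `region` columns), this is what replaces an ORM's automatic relationship loading: an explicit join you wire up yourself.

```scala
// Continuation of the sketch above (ordersDf already loaded).
// No lazy relation loading exists, so the lookup is an explicit join.
val countriesDf = spark.read.parquet("/tmp/countries") // illustrative path
val enrichedDf = ordersDf
  .join(countriesDf, Seq("country"), "left")           // explicit join key
  .select(col("id"), col("country"), col("region"), col("amount"))
```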