[SOLVED] Spark Encoders: when to use beans()

Spark Encoders: when to use beans()

I came across a memory management problem while using Spark's caching mechanism. I am currently utilizing Encoders with Kryo and was wondering if switching to beans would help me reduce the size of my cached dataset.

Basically, what are the pros and cons of using beans over Kryo serialization when working with Encoders? Are there any performance improvements? Is there a way to compress a cached Dataset apart from caching with SER option?

For the record, I have found a similar topic that tackles the comparison between the two. However, it doesn't go into the details of this comparison.

Solution

Whenever you can. Unlike generic binary Encoders, which use general purpose binary serialization and store whole objects as opaque blobs, Encoders.bean[T] leverages the structure of an object, to provide class specific storage layout.

This difference becomes obvious when you compare the schemas created using Encoders.bean and Encoders.kryo.

Why does it matter?

You get efficient field access using SQL API without any need for deserialization and full support for all Dataset transformations.
With transparent field serialization you can fully utilize columnar storage, including built-in compression.

So when to use kryo Encoder? In general when nothing else works. Personally I would avoid it completely for data serialization. The only really useful application I can think of is serialization of aggregation buffer (check for example How to find mean of grouped Vector columns in Spark SQL?).