I read that the Kryo serializer can provide faster serialization when used in Apache Spark. However, I'm using Spark through Python.
Do I still get notable benefits from switching to the Kryo serializer?
Kryo won't make a major impact on PySpark, because PySpark stores data on the JVM side as byte[] objects, which are fast to serialize even with Java serialization. But it may be worth a try: you would just set the spark.serializer configuration and not try to register any classes.
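As a minimal sketch, that configuration could look like this in PySpark (the app name is illustrative):

```python
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("kryo-test")  # hypothetical app name
    # Switch the JVM-side serializer to Kryo. No class registration is
    # needed, since PySpark ships its data to the JVM as byte arrays.
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
)
sc = SparkContext(conf=conf)
```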
What might make a bigger impact is storing your data as MEMORY_ONLY_SER and enabling spark.rdd.compress, which will compress your cached data. In Java this can add some CPU overhead, but since Python runs quite a bit slower anyway, it might not matter. It might even speed up computation by reducing GC pressure or letting you cache more data.
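Here is a minimal sketch of that approach, assuming a Spark 1.x-era PySpark; the input path is hypothetical. (Note that in Spark 2.0+ the Python StorageLevel.MEMORY_ONLY_SER constant was removed, because PySpark data is always stored in serialized form, so MEMORY_ONLY behaves the same way there.)

```python
from pyspark import SparkConf, SparkContext, StorageLevel

conf = (
    SparkConf()
    .setAppName("compressed-cache")  # hypothetical app name
    # Compress serialized cached partitions: costs some CPU, saves memory.
    .set("spark.rdd.compress", "true")
)
sc = SparkContext(conf=conf)

rdd = sc.textFile("hdfs:///path/to/data")  # hypothetical input path
# Cache in serialized form; with spark.rdd.compress enabled, the cached
# blocks are compressed too, so more data fits in memory.
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
```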
Reference: Matei Zaharia's answer on the Spark users mailing list.