apache-spark, pyspark, kryo

Do you benefit from the Kryo serializer when you use PySpark?


I read that the Kryo serializer can provide faster serialization when used in Apache Spark. However, I'm using Spark through Python.

Do I still get notable benefits from switching to the Kryo serializer?


Solution

  • Kryo won’t make a major impact on PySpark, because PySpark just stores data as byte[] objects, which are fast to serialize even with the default Java serializer.

    But it may be worth a try: just set the spark.serializer configuration and don't register any classes.

    What might make more of an impact is storing your data as MEMORY_ONLY_SER and enabling spark.rdd.compress, which will compress your data.

    In Java this can add some CPU overhead, but Python runs quite a bit slower, so it might not matter. It might also speed up computation by reducing GC pressure or letting you cache more data.

    Reference: Matei Zaharia's answer on the mailing list.
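
    The suggestions above can be sketched roughly as follows. This is a minimal example, not a benchmark: the app name and data are arbitrary, and note that in Spark 2.x+ the Python StorageLevel no longer exposes MEMORY_ONLY_SER, because PySpark data is always stored serialized, so MEMORY_ONLY behaves equivalently there.

    ```python
    from pyspark import SparkConf, SparkContext, StorageLevel

    # Arbitrary example configuration: switch to Kryo and enable RDD compression.
    conf = (
        SparkConf()
        .setAppName("kryo-test")   # hypothetical app name
        .setMaster("local[*]")     # local mode, just for this sketch
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.rdd.compress", "true")
    )
    sc = SparkContext(conf=conf)

    rdd = sc.parallelize(range(100000))
    # Cache in memory in serialized form; with spark.rdd.compress=true the
    # cached partitions are also compressed. (Use MEMORY_ONLY_SER on Spark 1.x.)
    rdd.persist(StorageLevel.MEMORY_ONLY)

    n = rdd.count()
    print(n)  # 100000
    sc.stop()
    ```

    Whether this helps depends on your workload: compression trades CPU for memory, so it pays off mainly when caching large datasets that would otherwise spill or evict.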