time-complexity, apache-spark-sql, space-complexity, memory-consumption

What is the time complexity and memory footprint of DataFrame operations in Spark?


What is the algorithmic complexity and/or memory consumption of DataFrame operations in Spark? I can't find any information about this in the documentation.

One useful example would be the memory/disk footprint of extending a DataFrame with another column via withColumn(). In Python (with automatic garbage collection), is it better to do table = table.withColumn(…), or does extended_table = table.withColumn(…) take about the same amount of memory?

PS: Let's say that both tables are persisted with persist().


Solution

  • Assigning the result to the same variable or to a new variable makes no difference. Spark only uses these assignments to build the lineage graph from the operations you specify; the operations in the lineage graph are executed only when you invoke an actual Spark action.

    Additional memory is only needed when you cache intermediate results via .cache() or .persist(), as the sketch below illustrates.
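
    A minimal PySpark sketch of this behaviour (the SparkSession setup, column names, and sample data are assumptions made for illustration): both assignment styles are equally cheap, because withColumn() is lazy and only extends the lineage graph until an action runs.

        from pyspark.sql import SparkSession
        from pyspark.sql import functions as F

        spark = SparkSession.builder.master("local[*]").getOrCreate()

        table = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

        # Both assignments are lazy: each withColumn() call only adds a node to
        # the lineage graph; no data is materialized and no extra memory is used.
        table = table.withColumn("id_plus_one", F.col("id") + 1)            # reassign
        extended_table = table.withColumn("id_times_two", F.col("id") * 2)  # new variable

        # persist() only marks the result for caching; memory is consumed once
        # an action (count(), collect(), show(), ...) materializes the data.
        extended_table.persist()
        extended_table.count()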