As per the documentation:
Shuffle spill (memory) is the size of the deserialized form of the shuffled data in memory.
Shuffle spill (disk) is the size of the serialized form of the data on disk.
My understanding of shuffle is this:
For each existing partition: new_partition = hash(partitioning_id) % 200; target_executor = new_partition % num_executors
where % is the modulo operator and num_executors is the number of executors on the cluster. Is my understanding of the shuffle operation correct?
Can you help me put the definition of shuffle spill (memory) and shuffle spill (disk) in the context of the shuffle mechanism (the one described above if it is correct)? For example (maybe): "shuffle spill (disk) is the part that is happening in point 2 mentioned above where the 200 partitions are dumped to the disk of their respective nodes" (I do not know if it is correct to say that; just giving an example)
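To make the mechanism I described concrete, here is how I picture it in plain Python (this is only a toy model of my own understanding, not Spark's actual code; 200 would be the default value of spark.sql.shuffle.partitions, and num_executors is an assumed cluster size):

```python
# Toy model of the routing I described above: each key is hashed into one of
# 200 shuffle partitions, and each shuffle partition is then assigned to an
# executor. This is my mental model, not Spark's real implementation.

NUM_SHUFFLE_PARTITIONS = 200  # Spark's default spark.sql.shuffle.partitions
NUM_EXECUTORS = 4             # assumed cluster size for this example

def route(partitioning_id):
    """Return (shuffle_partition, target_executor) for a given key."""
    new_partition = hash(partitioning_id) % NUM_SHUFFLE_PARTITIONS
    target_executor = new_partition % NUM_EXECUTORS
    return new_partition, target_executor

# Every record with the same key lands in the same partition (and executor):
assert route("user_42") == route("user_42")
```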
Let's take a look at the documentation, where we can find this:
Shuffle read: Total shuffle bytes and records read, includes both data read locally and data read from remote executors
This is what your executor loads into memory when stage processing starts; you can think of it as the shuffle files prepared in the previous stage by other executors.
Shuffle write: Bytes and records written to disk in order to be read by a shuffle in a future stage
This is the size of your stage's output, which may be picked up by the next stage for processing; in other words, it is the size of the shuffle files this stage created.
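The write/read pairing can be illustrated with a toy model (this is only a sketch of the idea, not Spark's actual shuffle file format; all file names and helpers here are my own invention): stage 1 buckets records by key and writes one file per target partition, which is what "shuffle write" counts, and stage 2 later reads those files back, which is what "shuffle read" counts.

```python
import os
import pickle
import tempfile

NUM_PARTITIONS = 4  # stands in for spark.sql.shuffle.partitions

def stage1_write(records, shuffle_dir):
    """Map side: bucket records by key hash and write one file per partition.
    Returns the total bytes written (what the 'Shuffle write' metric counts)."""
    buckets = {p: [] for p in range(NUM_PARTITIONS)}
    for key, value in records:
        buckets[hash(key) % NUM_PARTITIONS].append((key, value))
    bytes_written = 0
    for p, rows in buckets.items():
        data = pickle.dumps(rows)
        with open(os.path.join(shuffle_dir, f"shuffle_part_{p}.bin"), "wb") as f:
            f.write(data)
        bytes_written += len(data)
    return bytes_written

def stage2_read(partition, shuffle_dir):
    """Reduce side: load one partition's shuffle file ('Shuffle read')."""
    with open(os.path.join(shuffle_dir, f"shuffle_part_{partition}.bin"), "rb") as f:
        return pickle.loads(f.read())

with tempfile.TemporaryDirectory() as d:
    written = stage1_write([("a", 1), ("b", 2), ("a", 3)], d)
    read_back = [r for p in range(NUM_PARTITIONS) for r in stage2_read(p, d)]
```

Note that all records sharing a key end up in the same partition file, which is exactly why a later stage can aggregate them locally.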
And now, what is shuffle spill?
Shuffle spill (memory) is the size of the deserialized form of the shuffled data in memory.
Shuffle spill (disk) is the size of the serialized form of the data on disk.
Shuffle spill happens when your executor is reading shuffle files but they cannot fit into that executor's execution memory. When this happens, some chunk of the data is removed from memory and written to disk (it is spilled to disk, in other words).
Moving back to your question: what is the difference between spill (memory) and spill (disk)? They describe exactly the same chunk of data. The first metric describes the space that data occupied in memory before it was moved to disk; the second describes its size once written to disk. The two metrics may differ because the data may be represented differently on disk, for example it may be compressed.
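You can see the same effect outside Spark. A sketch, using pickle and zlib as stand-ins for Spark's serializer and compression codec (those stand-ins are my assumption, not what Spark actually uses internally): the deserialized in-memory form of repetitive data is much larger than its serialized, compressed form.

```python
import pickle
import sys
import zlib

# In-memory, deserialized form: a list of Python objects.
rows = [(i, "some repetitive payload " * 4) for i in range(10_000)]

# Rough stand-in for "Spill (memory)": bytes occupied by the deserialized
# objects in memory.
deserialized_size = sys.getsizeof(rows) + sum(
    sys.getsizeof(r) + sys.getsizeof(r[1]) for r in rows
)

# Stand-in for "Spill (disk)": the serialized, compressed bytes that would
# actually be written out. Repetitive data compresses well, so this is far
# smaller than the in-memory size.
spilled_bytes = zlib.compress(pickle.dumps(rows))

assert len(spilled_bytes) < deserialized_size
```

This is why the quoted docs below note that the disk figure "tends to be much smaller" than the memory figure.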
If you want to read more:
"Shuffle spill (memory) is the size of the deserialized form of the data in memory at the time when we spill it, whereas shuffle spill (disk) is the size of the serialized form of the data on disk after we spill it. This is why the latter tends to be much smaller than the former. Note that both metrics are aggregated over the entire duration of the task (i.e. within each task you can spill multiple times)."
Spill is represented by two values (these two values are always presented together):
Spill (Memory): the size of the data as it exists in memory before it is spilled.
Spill (Disk): the size of the spilled data after it has been serialized, compressed, and written to disk.