apache-sparkpalantir-foundryfoundry-code-repositoriesfoundry-code-workbooksfoundry-contour

How do I identify that my Foundry job's stage has skew?


I have a job running with a stage that seems to be taking a long time. I've heard that this might be due to something called 'skew'.

How do I know if I'm being impacted by this?

I know this is commonly associated with joins, windows, and other operations that incur shuffles but I don't know how to identify it.


Solution

    1. Open Spark Details like this
    2. Identify either currently running stage or your slowest stage overall
    3. Click on this stage's row to reveal Stage Details button Stage Details button
    4. Click on Stage Details button
    5. Look at metrics of stage at the top of your screen. If you see a smaller number of tasks running significantly longer than the others, this means you have skew Skew
    6. If you click on the slowest task, you'll find the task highlighted in the overview below, which will indicate the sizes of inputs / outputs. Skew Details

    In the above example, there is a task in this job + stage that is taking orders of magnitude longer to run because its input size is orders of magnitude larger than the other tasks.

    This is the definition of a skewed task / skewed stage.

    If you want to know what value is causing this task to be slow, check out the guidance over here