I have a job running with a stage that seems to be taking a long time. I've heard that this might be due to something called 'skew'.
How do I know if I'm being impacted by this?
I know skew is commonly associated with joins, windows, and other operations that incur shuffles, but I don't know how to identify it.
In the above example, one task in this job and stage is taking orders of magnitude longer to run than the others because its input size is orders of magnitude larger than theirs.
This is the definition of a skewed task / skewed stage.
If you want to know which value is causing this task to be slow, check out the guidance over here.
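
The definition above can be turned into a rough check: compare each task's input size with the median input size for the stage and flag tasks that are far larger. The following is a minimal sketch in plain Python, not a Spark API. The function name `find_skewed_tasks`, the `ratio_threshold` of 10, and the sample byte counts are all hypothetical; in practice you would read the per-task input sizes for the stage from your job's task metrics.

```python
from statistics import median


def find_skewed_tasks(input_bytes_by_task, ratio_threshold=10.0):
    """Return (task_id, input_bytes, ratio_to_median) for unusually large tasks.

    ratio_threshold is an assumed cutoff, not a standard value: a task is
    flagged when its input is at least this many times the median input.
    """
    sizes = list(input_bytes_by_task.values())
    if not sizes:
        return []

    typical = median(sizes)
    if typical == 0:
        typical = 1  # avoid dividing by zero when most tasks read nothing

    skewed = [
        (task_id, size, size / typical)
        for task_id, size in input_bytes_by_task.items()
        if size / typical >= ratio_threshold
    ]
    # Largest offenders first.
    return sorted(skewed, key=lambda item: item[2], reverse=True)


# Hypothetical per-task input sizes (bytes) for one stage.
task_inputs = {
    0: 120_000_000,
    1: 115_000_000,
    2: 130_000_000,
    3: 48_000_000_000,  # one task reads far more than the rest
    4: 125_000_000,
}

for task_id, size, ratio in find_skewed_tasks(task_inputs):
    print(f"task {task_id}: {size} bytes, {ratio:.0f}x the median")
```

The median is used rather than the mean because a single very large task pulls the mean up and can hide the skew it causes.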