apache-spark, google-cloud-dataproc

Spark application running without disk error exceptions despite growing shuffle data


I have a Dataproc cluster spun up in Google Cloud.

I am executing a Spark application on it. This application acts like a web server: it listens for requests, triggers Spark jobs (i.e. Spark actions) and returns the results. The cluster is dedicated to my Spark application only; no other jobs run on it. Each node in the cluster has 375 GB of disk attached to it.

While the Spark app is fulfilling these requests, the Spark jobs (actions) it triggers create a lot of shuffle data.

What I anticipated is that, since the Spark application keeps running and keeps receiving requests, it would exhaust the disk space with shuffle data at some point. I can even see in the Spark UI that the aggregated shuffle data keeps growing and has gone past 375 GB, yet the application keeps fulfilling new requests without throwing any disk error exception.

In the application, I have enabled the external shuffle service as well.

So it is clear that the application is removing shuffle data, but I am not sure which Spark process removes it.

Is it the executor process itself, the external shuffle service process running on each node, or the driver process?

Could someone shed some light on this?

Thanks


Solution

  • Spark has a component for application-wide cleanup: ContextCleaner. It runs on the driver and removes shuffle files once their ShuffleDependency instances are no longer referenced.

    /**
     * An asynchronous cleaner for RDD, shuffle, and broadcast state.
     *
     * This maintains a weak reference for each RDD, ShuffleDependency, and Broadcast of interest,
     * to be processed when the associated object goes out of scope of the application. Actual
     * cleanup is performed in a separate daemon thread.
     */
    private[spark] class ContextCleaner(
    ...
    

    So the timing of the cleanup effectively depends on how JVM garbage collection behaves. A relevant quote from the pull request "[SPARK-5750][SPARK-3441][SPARK-5836][CORE] Added documentation explaining shuffle" is below:

    I know that there has been some concern about the shuffle files filling up disk, but that as of now can happen because of one or more of the following reasons.

    1. GC does not kick in for a long time (very high driver memory). The solution may often be to periodically call GC.
    2. Nothing goes out of scope and so nothing is GCed.
    3. There are some issues reported with shuffle files not being cleaned up in Mesos.

    The 3rd one is a bug and we will fix it. The first two should be clarified in the docs.
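
    Reason 1 above suggests a workaround for long-running drivers with large heaps: trigger a full GC on the driver periodically so that unreferenced ShuffleDependency objects get collected and ContextCleaner can delete their shuffle files. Below is a minimal sketch; the object name, interval, and scheduling approach are illustrative, not something Spark provides under these names.

    import java.util.concurrent.{Executors, TimeUnit}

    // Hypothetical helper: hint the JVM to run a full GC on the driver at a
    // fixed interval, so ContextCleaner's weak references get processed.
    object PeriodicGcTrigger {
      def start(intervalMinutes: Long = 30): Unit = {
        val scheduler = Executors.newSingleThreadScheduledExecutor()
        scheduler.scheduleWithFixedDelay(new Runnable {
          override def run(): Unit = System.gc() // request a full GC (the JVM may ignore the hint)
        }, intervalMinutes, intervalMinutes, TimeUnit.MINUTES)
      }
    }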

    To reduce the delay before resources are released, I would set Dataset and RDD variables to null and call Dataset.unpersist() / RDD.unpersist() (if the data was cached) as soon as they are no longer needed. Please see the page When are Java objects eligible for garbage collection? in Oracle Blogs and the section Removing Data in Spark's RDD Programming Guide for more details.
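
    As a concrete illustration, here is a minimal sketch of that pattern, assuming a long-running application that builds a cached, shuffled DataFrame per request; the class and method names are made up for the example. Unpersisting and dropping the reference lets the ShuffleDependency go out of scope, so a later GC allows ContextCleaner to remove the corresponding shuffle files.

    import org.apache.spark.sql.{DataFrame, SparkSession}

    class RequestHandler(spark: SparkSession) {
      // Reference held by a long-lived object in the "server" application.
      private var cached: DataFrame = _

      def handle(path: String): Long = {
        cached = spark.read.parquet(path)
          .groupBy("key").count()   // produces a ShuffleDependency
          .cache()
        val result = cached.count() // the action that materializes the shuffle files
        cached.unpersist()          // release the cached blocks right away
        cached = null               // drop the reference so GC can reclaim it and
                                    // ContextCleaner can delete the shuffle files
        result
      }
    }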

    The cleaner is enabled by default. Please see the Spark Configuration page for more information.
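
    For completeness, here is a sketch of the cleaner-related settings I would check when building the session; the property names are, to the best of my knowledge, the ones listed on that page, and the values shown are illustrative rather than recommendations.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("long-running-spark-server")
      .config("spark.cleaner.referenceTracking", "true")           // ContextCleaner on (the default)
      .config("spark.cleaner.referenceTracking.blocking", "true")  // block on non-shuffle cleanup tasks (the default)
      .config("spark.cleaner.periodicGC.interval", "30min")        // how often the driver requests a full GC
      .config("spark.shuffle.service.enabled", "true")             // external shuffle service, as in the question
      .getOrCreate()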