apache-spark, google-cloud-platform, mapreduce, google-cloud-logging, dataproc

Error while scanning intermediate done dir - dataproc spark job


Our Spark aggregation jobs are taking far longer than expected to complete. They are supposed to finish in about 5 minutes but are taking 30 to 40 minutes.

The Dataproc cluster logging says it is trying to scan the intermediate done dir ("Error while scanning intermediate done dir"), and the message keeps appearing.


If I look at the Spark UI, it shows that the job itself actually completes quickly, so it seems to be held up by something else.

I tried to look at the YARN logs for the given application id, but I could not find a similar error message there. Where can I see the same log message that appears in Cloud Logging?

yarn logs -applicationId application_1700468925211_1632269
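
For reference, the message shows up in Cloud Logging with a query along the lines below (the resource type and filter fields are my assumptions about how the job history server logs are labelled on Dataproc, and the cluster/project names are placeholders):

gcloud logging read \
  'resource.type="cloud_dataproc_cluster"
   AND resource.labels.cluster_name="my-cluster"
   AND textPayload:"Error while scanning intermediate done dir"' \
  --project=my-project --limit=20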

I read some articles which say it has something to do with the job history server, which keeps scanning the directory in a loop.

for reference: https://issues.apache.org/jira/browse/MAPREDUCE-6684

Looking at the mapred-site.xml file, I found the below properties, which point to the GCS location of the temp bucket where Dataproc stores job details:

mapreduce.jobhistory.done-dir
mapreduce.jobhistory.intermediate-done-dir
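
On our cluster they look roughly like this (the bucket name and paths below are illustrative placeholders, not the actual values):

<property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>gs://my-dataproc-temp-bucket/mapreduce-job-history/done</value>
</property>
<property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>gs://my-dataproc-temp-bucket/mapreduce-job-history/intermediate-done</value>
</property>

The same mapred-site.xml also contains the two properties below: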

<property>
    <name>mapreduce.jobhistory.always-scan-user-dir</name>
    <value>true</value>
    <description>Enable history server to always scan user dir.</description>
</property>


<property>
    <name>mapreduce.jobhistory.recovery.enable</name>
    <value>true</value>
    <description>Enable history server to recover server state on startup.</description>
</property>

Can we set the above properties to false in order to resolve the problem? I am very skeptical about the approach since we are facing this issue in production and it cannot be replicated in a lower environment. Looking forward to meaningful suggestions.


Solution

  • I logged into the master node, edited the mapred-site.xml file, and set the mapreduce.jobhistory.always-scan-user-dir property to false:

    cd /etc/hadoop/conf/
    sudo vi mapred-site.xml
    
    <property>
        <name>mapreduce.jobhistory.always-scan-user-dir</name>
        <value>false</value>
        <description>Enable history server to always scan user dir.</description>
    </property>
    

    After this I cancelled the existing running jobs and stopped and restarted the Dataproc cluster. Jobs are now running fine and completing within the expected execution time.
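
    In case it helps, the stop and restart was done with the standard gcloud commands, roughly as below (cluster name and region are placeholders):

    gcloud dataproc clusters stop my-cluster --region=us-central1
    gcloud dataproc clusters start my-cluster --region=us-central1

    Restarting only the job history server daemon on the master node (something like sudo systemctl restart hadoop-mapreduce-historyserver, though the exact service name may vary by image version) might also be enough to pick up the change, but I have not verified that.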

    I am not sure about the root cause of this issue, though, as we have other clusters with similar configuration and they are running absolutely fine.
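
    For new clusters, the same setting can presumably be applied at creation time through a cluster property instead of editing the file on the master by hand, e.g. (cluster name and region are placeholders):

    gcloud dataproc clusters create my-cluster \
        --region=us-central1 \
        --properties='mapred:mapreduce.jobhistory.always-scan-user-dir=false'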