Error while scanning intermediate done dir - dataproc spark job

Our Spark aggregation jobs are taking far too long to complete. They are supposed to finish in about 5 minutes but are taking 30 to 40 minutes.

Dataproc cluster logging shows that something is trying to scan the intermediate done dir, and this message keeps appearing.

In the Spark UI, the job itself shows a much shorter run time, so it seems to be held up by something else.

I looked at the YARN logs for a given application ID, but I could not find a similar error message there. Where can I see the same log that appears in Cloud Logging?

yarn logs -applicationId application_1700468925211_1632269

I read some articles that say this is related to the JobHistory server, which keeps scanning the directory in a loop.

for reference: https://issues.apache.org/jira/browse/MAPREDUCE-6684

Looking at the mapred-site.xml file, I found the properties below, which point to a location in a temp GCS bucket used to store Dataproc job details.

mapreduce.jobhistory.done-dir
mapreduce.jobhistory.intermediate-done-dir

<property>
    <name>mapreduce.jobhistory.always-scan-user-dir</name>
    <value>true</value>
    <description>Enable history server to always scan user dir.</description>
</property>


<property>
    <name>mapreduce.jobhistory.recovery.enable</name>
    <value>true</value>
    <description>Enable history server to recover server state on startup.</description>
</property>
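
To double-check what the history server is actually configured with before changing anything, a small script along these lines could list the relevant properties. This is only a minimal sketch: the function name `jobhistory_properties` is hypothetical, and the default path `/etc/hadoop/conf/mapred-site.xml` is an assumption about where the file lives on the cluster node, so adjust it if yours differs.

```python
import xml.etree.ElementTree as ET

# Assumed location of the Hadoop client config on the node; pass another path if needed.
DEFAULT_PATH = "/etc/hadoop/conf/mapred-site.xml"


def jobhistory_properties(path=DEFAULT_PATH):
    """Return {name: value} for every mapreduce.jobhistory.* property in the file."""
    root = ET.parse(path).getroot()
    props = {}
    for prop in root.iter("property"):
        # findtext returns None when the child element is missing, hence the fallback.
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name.startswith("mapreduce.jobhistory."):
            props[name] = value
    return props
```

Calling `jobhistory_properties()` on a node and printing the result would show the done-dir, intermediate-done-dir, always-scan-user-dir and recovery.enable values side by side, which makes it easier to compare production against the lower environment.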

Can we set the above properties to false in order to resolve the problem? I am very skeptical about this approach, since we are facing the issue in production and cannot reproduce it in a lower environment. Looking forward to meaningful suggestions.
