I have a Spark application (2.4.5) that uses Kafka as its source with large batch windows (5 minutes); in our application we only really care about the RDD from that specific interval when processing the data.
What is happening is that our application crashes from time to time, either with an OutOfMemory exception on the driver (running in client mode) or with a GC OutOfMemory error on the executors. After a lot of research, it seemed that we were not handling state properly, which was causing the lineage to grow indefinitely. We considered fixing the problem either with a batch approach, where we control the offsets read from Kafka and create the RDDs from them ourselves (which would truncate the lineage; a sketch of that idea is below), or by enabling checkpointing.
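To give an idea of the batch-style alternative, here is a minimal sketch (Scala, spark-streaming-kafka-0-10). The topic name, partition, offsets, and Kafka parameters are placeholders, not our real configuration; the point is that `KafkaUtils.createRDD` builds a fresh RDD from an explicit offset range each interval, so no lineage accumulates across intervals:

```scala
import scala.collection.JavaConverters._
import org.apache.spark.SparkContext
import org.apache.spark.streaming.kafka010.{KafkaUtils, OffsetRange}
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// Placeholder consumer settings; the real values come from our job config.
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",
  "key.deserializer"   -> "org.apache.kafka.common.serialization.StringDeserializer",
  "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "group.id"           -> "my-batch-group"
).asJava

def processInterval(sc: SparkContext, fromOffset: Long, untilOffset: Long): Unit = {
  // One offset range per partition for the 5-minute window (single partition shown).
  val offsetRanges = Array(OffsetRange("my-topic", 0, fromOffset, untilOffset))

  // A brand-new RDD is created from the explicit offsets, so its lineage
  // starts fresh every interval instead of growing across batches.
  val rdd = KafkaUtils.createRDD[String, String](sc, kafkaParams, offsetRanges, PreferConsistent)

  rdd.map(_.value).foreach(println) // stand-in for our real processing
}
```

For the checkpointing alternative, my understanding is that `ssc.checkpoint(directory)` on the StreamingContext would periodically persist RDDs and cut the lineage as well, at the cost of writing to reliable storage.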
During the investigation, someone found an issue that was not really similar to ours (YARN heap usage growing over time) but which was solved by tweaking some UI parameters (listed below; how we apply them is sketched after the list):
- spark.ui.retainedJobs=50
- spark.ui.retainedStages=50
- spark.ui.retainedTasks=500
- spark.worker.ui.retainedExecutors=50
- spark.worker.ui.retainedDrivers=50
- spark.sql.ui.retainedExecutions=50
- spark.streaming.ui.retainedBatches=50
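These are plain Spark configuration properties, so we simply set them on the `SparkConf` at startup. A minimal sketch of how we apply them (the app name is a placeholder):

```scala
import org.apache.spark.SparkConf

// Apply the retention settings from the list above before building the context.
val conf = new SparkConf()
  .setAppName("kafka-interval-job") // placeholder app name
  .set("spark.ui.retainedJobs", "50")
  .set("spark.ui.retainedStages", "50")
  .set("spark.ui.retainedTasks", "500")
  .set("spark.worker.ui.retainedExecutors", "50")
  .set("spark.worker.ui.retainedDrivers", "50")
  .set("spark.sql.ui.retainedExecutions", "50")
  .set("spark.streaming.ui.retainedBatches", "50")
```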
Since these are UI parameters, it doesn't make sense to me that they would affect the application's memory usage, unless they change how the application stores the information it serves to the UI. Early tests show that the application does indeed run longer without OOM issues.
Can anyone explain what impact these parameters have on applications? Can they really affect an application's memory usage? Are there any other parameters I should look into to get the whole picture (I'm wondering if there is a "factor" parameter that needs to be tweaked so that memory allocation is appropriate for our case)?
Thank you