Spark Streaming Kafka backpressure

We have a Spark Streaming application that reads data from a Kafka queue through a receiver, applies some transformations, and writes the output to HDFS. The batch interval is 1 minute. We have already tuned backpressure and the spark.streaming.receiver.maxRate parameter, so it works fine most of the time.

But we still have one problem. When HDFS is completely down, the batch job hangs for a long time (say HDFS is unavailable for 4 hours; the job then hangs for 4 hours). The receiver does not know that the job has not finished, so it keeps receiving data for those 4 hours. This causes an OOM exception, the whole application goes down, and we lose a lot of data.

So my question is: is it possible to let the receiver know that the job has not finished, so that it receives less (or even no) data, and then receives more data to catch up once the job finishes? In the scenario above, while HDFS is down the receiver would read less data from Kafka, so the blocks generated during those 4 hours would be really small and neither the receiver nor the application would go down. Once HDFS is back, the receiver would read more data and start catching up.

Khichabia answered 15/4, 2016 at 7:57

You can enable backpressure by setting the property spark.streaming.backpressure.enabled=true. This dynamically adjusts the receiving rate, and therefore your batch sizes, based on how quickly earlier batches were processed, which avoids situations where you get an OOM from queue build-up. The rate estimator has a few parameters:

  • spark.streaming.backpressure.pid.proportional - weight of the response to the error in the last batch (default 1.0)
  • spark.streaming.backpressure.pid.integral - weight of the response to the accumulated error, effectively a dampener (default 0.2)
  • spark.streaming.backpressure.pid.derived - weight of the response to the trend in the error (useful for reacting quickly to changes, default 0.0)
  • spark.streaming.backpressure.pid.minRate - the minimum rate, in records per second, that the estimator will output; change it to reduce undershoot in high-throughput jobs (default 100)
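
To make the settings above concrete, here is a minimal sketch that collects them in a dict and renders them as spark-submit --conf arguments. The values are simply the defaults quoted in the list; the helper name to_conf_args and the application file your_app.py are hypothetical and only there for illustration.

```python
# Backpressure settings, using the default values quoted in the list above.
BACKPRESSURE_CONF = {
    "spark.streaming.backpressure.enabled": "true",
    "spark.streaming.backpressure.pid.proportional": "1.0",
    "spark.streaming.backpressure.pid.integral": "0.2",
    "spark.streaming.backpressure.pid.derived": "0.0",
    "spark.streaming.backpressure.pid.minRate": "100",
}


def to_conf_args(conf):
    """Render a property dict as a flat list of spark-submit --conf arguments.

    Hypothetical helper, not part of Spark.
    """
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # your_app.py is a placeholder for the real streaming application.
    command = ["spark-submit"] + to_conf_args(BACKPRESSURE_CONF) + ["your_app.py"]
    print(" ".join(command))
```

Only spark.streaming.backpressure.enabled has to be set explicitly; the pid.* entries are listed here just so they are easy to tweak in one place.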

The defaults are pretty good, but I simulated the response of the algorithm to various parameters here.
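
For a rough feel of how the four parameters interact, the following is a simplified, standalone sketch of a PID-style rate estimator. It is modeled loosely on the idea behind Spark's estimator and is not Spark's actual code; the class name PidRateEstimator, its method signature, and the batch numbers in the demo are all assumptions made up for illustration.

```python
class PidRateEstimator:
    """Simplified PID rate estimator (illustrative sketch, not Spark's code)."""

    def __init__(self, batch_interval_ms, proportional=1.0, integral=0.2,
                 derived=0.0, min_rate=100.0):
        self.batch_interval_ms = batch_interval_ms
        self.proportional = proportional
        self.integral = integral
        self.derived = derived
        self.min_rate = min_rate
        self.first_run = True
        self.latest_time_ms = -1
        self.latest_rate = -1.0
        self.latest_error = -1.0

    def compute(self, time_ms, num_elements, processing_delay_ms,
                scheduling_delay_ms):
        """Return a new rate limit in records/s, or None if no update is made."""
        if not (time_ms > self.latest_time_ms and num_elements > 0
                and processing_delay_ms > 0):
            return None
        delay_since_update_s = (time_ms - self.latest_time_ms) / 1000.0
        # Rate at which the finished batch was actually processed.
        processing_rate = num_elements / processing_delay_ms * 1000.0
        # Proportional term: gap between the allowed rate and the achieved rate.
        error = self.latest_rate - processing_rate
        # Integral term: backlog implied by the time the batch waited in the queue.
        historical_error = (scheduling_delay_ms * processing_rate
                            / self.batch_interval_ms)
        # Derivative term: how fast the error is changing.
        d_error = (error - self.latest_error) / delay_since_update_s
        new_rate = max(
            self.latest_rate
            - self.proportional * error
            - self.integral * historical_error
            - self.derived * d_error,
            self.min_rate,
        )
        self.latest_time_ms = time_ms
        if self.first_run:
            # The first completed batch only seeds the state.
            self.latest_rate = processing_rate
            self.latest_error = 0.0
            self.first_run = False
            return None
        self.latest_rate = new_rate
        self.latest_error = error
        return new_rate


if __name__ == "__main__":
    estimator = PidRateEstimator(batch_interval_ms=60_000)
    # (completion time ms, records, processing delay ms, scheduling delay ms)
    batches = [
        (60_000, 60_000, 30_000, 0),        # healthy sink
        (120_000, 60_000, 60_000, 0),       # sink slows down
        (180_000, 60_000, 120_000, 60_000), # sink slower, batches start queueing
    ]
    for batch in batches:
        print(batch, "->", estimator.compute(*batch))
```

In this sketch a new estimate is produced only when compute is called with the statistics of a finished batch, so a batch that never completes yields no update at all. That is worth keeping in mind when reasoning about a sink that hangs rather than merely slows down.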

Wastebasket answered 4/12, 2016 at 12:36