Spark Streaming + Kafka vs just Kafka

Why and when would one choose to use Spark Streaming with Kafka?

Suppose I have a system receiving thousands of messages per second through Kafka. I need to apply some real-time analytics to these messages and store the results in a DB.

I have two options:

  1. Create my own worker that reads messages from Kafka, runs the analytics algorithm, and stores the result in the DB. In the Docker era it is easy to scale this worker across my entire cluster with a single scale command. I just need to make sure the number of partitions is equal to or greater than the number of workers, and all is good: I have true concurrency. (A rough sketch of such a worker follows this list.)

  2. Create a Spark cluster with Kafka streaming input. Let the Spark cluster do the analytics computations and then store the result.
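
To make option 1 concrete, here is a bare-bones sketch of the worker I have in mind (the broker address, topic name, and the analyze/saveToDb calls are just placeholders for my own code):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AnalyticsWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");        // placeholder broker
        props.put("group.id", "analytics-workers");          // all worker replicas share one group
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));   // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    Object result = analyze(record.value());  // my analytics algorithm
                    saveToDb(result);                         // my DB write
                }
            }
        }
    }

    private static Object analyze(String message) { /* real-time analytics here */ return message; }
    private static void saveToDb(Object result) { /* JDBC / driver call here */ }
}
```

Scaling it is then just running more replicas of that container; the consumer group spreads the partitions across them.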

Is there any case where the second option is the better choice? It sounds to me like it is just extra overhead.

Cyrillic answered 23/7, 2017 at 8:11 Comment(6)
It depends. With Spark Streaming you get Kafka consumer scalability out of the box because of the way streaming is built: you can parallelize by the number of partitions you have and not worry about consumer groups, etc. When reading manually, you have to manage offsets and the distribution of topics between the worker nodes yourself. In addition, you get computation parallelism by definition of using a DStream, which again, if your computation is "heavy", you'll have to build on your own.Anstus
And on the contrary, learning a framework such as Spark just to handle a small amount of traffic may definitely be an overhead. Do you really need all the scalability right now? How much traffic will this be handling? Will there be data peaks? This varies highly by the use case, not something one can answer on StackOverflow.Anstus
I get dozens of terabytes per day, so it's not a small amount. If I have more partitions than workers then everything is automatically concurrent, as each worker is assigned to different partitions. It is all done automatically by Kafka.Cyrillic
Are they all reading from the same topics under the same consumer group? Are your messages partitioned by some key? I can think of many things off the top of my head that you get for free with Spark, but that isn't a discussion in comments.Anstus
You can separate #1 even further and have an even simpler Kafka Streams app that consumes Kafka messages, does streaming analytics, and publishes to an output topic, which then goes to a separate Kafka Connector that stores the results in a DB.Sesterce
Same topic, same consumer group, no key.Cyrillic

In a Docker era it is easy to scale this worker through my entire cluster

If you already have that infrastructure available, then great, use that. Bundle your Kafka libraries in some minimal container with health checks and whatnot, and for the most part that works fine. Adding a Kafka client dependency + a database dependency is all you really need, right?

If you're not using Spark, Flink, etc., you will need to handle Kafka errors, retries, and offset/commit handling in your own code rather than letting the framework handle those for you.
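
For example, with a plain consumer you would typically disable auto-commit and only commit offsets after the database write has succeeded. A rough sketch of that loop (assuming the consumer was created with enable.auto.commit=false and is already subscribed; analyze/saveToDb are placeholders for your code):

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

class ManualCommitLoop {
    static void run(KafkaConsumer<String, String> consumer) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            try {
                for (ConsumerRecord<String, String> record : records) {
                    saveToDb(analyze(record.value()));   // process, then persist
                }
                consumer.commitSync();                   // commit only once the batch is safely stored
            } catch (Exception e) {
                // retries, backoff, dead-lettering, poison-message handling: all yours to decide here;
                // frameworks like Spark or Flink give you task retries and checkpointing for this
            }
        }
    }

    static Object analyze(String message) { return message; }  // placeholder analytics
    static void saveToDb(Object result) { }                    // placeholder DB write
}
```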

I'll add here that if you want Kafka + database interactions, check out the Kafka Connect API. There are existing connectors for JDBC, Mongo, Couchbase, Cassandra, etc. already.
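
As a rough illustration, sinking an output topic into Postgres with the Confluent JDBC sink connector is mostly configuration rather than code (exact property names depend on the connector version, so treat this as a sketch, not a working config):

```properties
name=analytics-db-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=analytics-results
connection.url=jdbc:postgresql://db:5432/analytics
auto.create=true
insert.mode=upsert
pk.mode=record_key
```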

If you need more complete processing power, I'd go for Kafka Streams rather than separately maintaining a Spark cluster, and that's still "just Kafka".
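
A bare-bones Kafka Streams version of the analytics step looks something like this (topic names and the analyze call are placeholders). It pairs naturally with the Connect sketch above: the app only talks to Kafka, and Connect owns the database.

```java
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class AnalyticsStreamsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "analytics-app");     // acts as the consumer group
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");     // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG,
                  org.apache.kafka.common.serialization.Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,
                  org.apache.kafka.common.serialization.Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("events");           // placeholder input topic
        events.mapValues(AnalyticsStreamsApp::analyze)                       // your analytics per record
              .to("analytics-results");                                      // output topic -> Kafka Connect -> DB
        new KafkaStreams(builder.build(), props).start();
    }

    private static String analyze(String value) { /* real-time analytics here */ return value; }
}
```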

Create a Spark cluster

Let's assume you don't want to maintain that, or rather you aren't able to pick between YARN, Mesos, Kubernetes, or Standalone. And if you are running the first three, it might be worth looking at running Docker on those anyway.
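
For what it's worth, the Spark program itself is short; the overhead under discussion is the cluster around it, not the code. A rough Structured Streaming sketch against Kafka (broker, topic, and checkpoint path are placeholders):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkKafkaJob {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("kafka-analytics").getOrCreate();

        Dataset<Row> events = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "kafka:9092")   // placeholder broker
            .option("subscribe", "events")                      // placeholder topic
            .load();

        // The value column arrives as bytes; cast it and apply your analytics with DataFrame ops or a UDF
        Dataset<Row> results = events.selectExpr("CAST(value AS STRING) AS message");

        results.writeStream()
            .format("console")                                  // illustration only; a real job would write to the DB
            .option("checkpointLocation", "/tmp/checkpoints")   // placeholder path
            .start()
            .awaitTermination();
    }
}
```

The console sink is just for illustration; writing to a real database would typically go through foreachBatch or a JDBC sink.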

You're exactly right that it is extra overhead, so I find it's all up to what you have available (for example, an existing Hadoop / YARN cluster with idle memory resources), or what you're willing to support internally (or pay for vendor services, e.g. Kafka & Databricks in some hosted solution).

Plus, Spark isn't running the latest Kafka client library (it wasn't until Spark 2.4.0 that it moved to the Kafka 2.0 client, I believe), so you'll need to determine if that's a selling point.

For actual streaming libraries, rather than Spark batches, Apache Beam or Flink would probably let you do the same types of workloads against Kafka.


In general, in order to scale a producer / consumer, you need some form of resource scheduler. Installing Spark may not be difficult for some, but knowing how to use it efficiently and tune it for appropriate resources can be.

Thorvald answered 13/5, 2018 at 22:47 Comment(4)
Spark provides windowing functions and, when combined with HyperLogLog for example, can do approximate state management without storing all the data in an external system and querying it in a loop.Iolite
Isn't Kafka the "external system"? Spark is generally all in memory. In other words, I'm not sure how/where that's persistent for resiliency.Thorvald
I am just trying to say that Spark provides windowing functions which you don't get out of the box with a Docker-based solution.Iolite
Why would Docker affect windowing? If it just launches a Driver on a remote machine then Docker isn't the problem. I feel like the question was more asking about only the Kafka Brokers, and adding any processing layer, which happened to be Dockerized... At the end of the day, it's still a JVM running the code.Thorvald
