Spark Streaming + Kafka vs just Kafka

Why and when would one choose to use Spark Streaming with Kafka?

Suppose I have a system receiving thousands of messages per second through Kafka. I need to apply some real-time analytics to these messages and store the results in a DB.

I have two options:

  1. Create my own worker that reads messages from Kafka, runs the analytics algorithm, and stores the result in the DB. In the Docker era it is easy to scale this worker across my entire cluster with a single scale command. I just need to make sure the number of partitions is equal to or greater than the number of workers, and all is good: I have true concurrency. (A rough sketch of such a worker follows this list.)

  2. Create a Spark cluster with Kafka streaming input. Let the Spark cluster do the analytics computations and then store the result.
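
To make option 1 concrete, here is a bare-bones sketch of the worker I have in mind (the broker address, topic name, and the analyze/saveToDb calls are just placeholders for my own code):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AnalyticsWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");        // placeholder broker
        props.put("group.id", "analytics-workers");          // all worker replicas share one group
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));   // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    Object result = analyze(record.value());  // my analytics algorithm
                    saveToDb(result);                         // my DB write
                }
            }
        }
    }

    private static Object analyze(String message) { /* real-time analytics here */ return message; }
    private static void saveToDb(Object result) { /* JDBC / driver call here */ }
}
```

Scaling it is then just running more replicas of that container; the consumer group spreads the partitions across them.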

Is there any case where the second option is the better choice? It sounds to me like it is just extra overhead.

Cyrillic answered 23/7, 2017 at 8:11 Comment(6)
It depends. With Spark Streaming you get Kafka consumer scalability out of the box because of the way streaming is built: you can parallelize by the number of partitions you have and not worry about consumer groups, etc. When reading manually, you have to manage offsets and the distribution of topics between the worker nodes yourself. In addition, you get computation parallelism by definition of using a DStream, which again, if your computation is "heavy", you'll have to build on your own.Anstus
And on the contrary, learning a framework such as Spark just to handle a small amount of traffic may definitely be an overhead. Do you really need all the scalability right now? How much traffic will this be handling? Will there be data peaks? This varies highly by the use case, not something one can answer on StackOverflow.Anstus
I get dozens of terabytes per day, so it's not a small amount. If I have more partitions than workers then everything is automatically concurrent, as each worker is assigned to different partitions. It is all done automatically by Kafka.Cyrillic
Are they all reading from the same topics under the same consumer group? Are your messages partitioned by some key? I can think of many things off the top of my head that you get for free with Spark, but that isn't a discussion in comments.Anstus
You can separate #1 even further and have an even simpler Kafka Streams app that consumes Kafka messages, does streaming analytics, and publishes to an output topic, which then goes to a separate Kafka Connector that stores the results in a DB.Sesterce
Same topic, same consumer group, no key.Cyrillic

In a Docker era it is easy to scale this worker through my entire cluster

If you already have that infrastructure available, then great, use that. Bundle your Kafka libraries in some minimal container with health checks and whatnot, and for the most part that works fine. Adding a Kafka client dependency + a database dependency is all you really need, right?

If you're not using Spark, Flink, etc., you will need to handle Kafka errors, retries, and offset/commit handling in your own code rather than letting the framework handle those for you.
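
For example, with a plain consumer you would typically disable auto-commit and only commit offsets after the database write has succeeded. A rough sketch of that loop (assuming the consumer was created with enable.auto.commit=false and is already subscribed; analyze/saveToDb are placeholders for your code):

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

class ManualCommitLoop {
    static void run(KafkaConsumer<String, String> consumer) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            try {
                for (ConsumerRecord<String, String> record : records) {
                    saveToDb(analyze(record.value()));   // process, then persist
                }
                consumer.commitSync();                   // commit only once the batch is safely stored
            } catch (Exception e) {
                // retries, backoff, dead-lettering, poison-message handling: all yours to decide here;
                // frameworks like Spark or Flink give you task retries and checkpointing for this
            }
        }
    }

    static Object analyze(String message) { return message; }  // placeholder analytics
    static void saveToDb(Object result) { }                    // placeholder DB write
}
```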

I'll add here that if you want Kafka + database interactions, check out the Kafka Connect API. There are existing connectors for JDBC, Mongo, Couchbase, Cassandra, etc. already.
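
As a rough illustration, sinking an output topic into Postgres with the Confluent JDBC sink connector is mostly configuration rather than code (exact property names depend on the connector version, so treat this as a sketch, not a working config):

```properties
name=analytics-db-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=analytics-results
connection.url=jdbc:postgresql://db:5432/analytics
auto.create=true
insert.mode=upsert
pk.mode=record_key
```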

If you need more complete processing power, I'd go for Kafka Streams rather than separately maintaining a Spark cluster, and that's still "just Kafka".
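
A bare-bones Kafka Streams version of the analytics step looks something like this (topic names and the analyze call are placeholders). It pairs naturally with the Connect sketch above: the app only talks to Kafka, and Connect owns the database.

```java
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class AnalyticsStreamsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "analytics-app");     // acts as the consumer group
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");     // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG,
                  org.apache.kafka.common.serialization.Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,
                  org.apache.kafka.common.serialization.Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("events");           // placeholder input topic
        events.mapValues(AnalyticsStreamsApp::analyze)                       // your analytics per record
              .to("analytics-results");                                      // output topic -> Kafka Connect -> DB
        new KafkaStreams(builder.build(), props).start();
    }

    private static String analyze(String value) { /* real-time analytics here */ return value; }
}
```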

Create a Spark cluster

Let's assume you don't want to maintain that, or rather you aren't able to pick between YARN, Mesos, Kubernetes, or Standalone. And if you are running the first three, it might be worth looking at running Docker on those anyway.
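
For what it's worth, the Spark program itself is short; the overhead under discussion is the cluster around it, not the code. A rough Structured Streaming sketch against Kafka (broker, topic, and checkpoint path are placeholders):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkKafkaJob {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("kafka-analytics").getOrCreate();

        Dataset<Row> events = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "kafka:9092")   // placeholder broker
            .option("subscribe", "events")                      // placeholder topic
            .load();

        // The value column arrives as bytes; cast it and apply your analytics with DataFrame ops or a UDF
        Dataset<Row> results = events.selectExpr("CAST(value AS STRING) AS message");

        results.writeStream()
            .format("console")                                  // illustration only; a real job would write to the DB
            .option("checkpointLocation", "/tmp/checkpoints")   // placeholder path
            .start()
            .awaitTermination();
    }
}
```

The console sink is just for illustration; writing to a real database would typically go through foreachBatch or a JDBC sink.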

You're exactly right that it is extra overhead, so I find it's all up to what you have available (for example, an existing Hadoop / YARN cluster with idle memory resources), or what you're willing to support internally (or pay for vendor services, e.g. Kafka & Databricks in some hosted solution).

Plus, Spark isn't running the latest Kafka client library (it wasn't until Spark 2.4.0 that it moved to the Kafka 2.0 client, I believe), so you'll need to determine if that's a selling point.

For actual streaming libraries, rather than Spark batches, Apache Beam or Flink would probably let you do the same types of workloads against Kafka.


In general, in order to scale a producer / consumer, you need some form of resource scheduler. Installing Spark may not be difficult for some, but knowing how to use it efficiently and tune it for appropriate resources can be.

Thorvald answered 13/5, 2018 at 22:47 Comment(4)
Spark provides windowing functions and, when combined with HyperLogLog for example, can do approximate state management without storing all the data in an external system and querying it in a loop.Iolite
Isn't Kafka the "external system"? Spark is generally all in memory. In other words, I'm not sure how/where that's persistent for resiliency.Thorvald
I am just trying to say that Spark provides windowing functions which you don't get out of the box with a Docker-based solution.Iolite
Why would Docker affect windowing? If it just launches a Driver on a remote machine then Docker isn't the problem. I feel like the question was more asking about only the Kafka Brokers, and adding any processing layer, which happened to be Dockerized... At the end of the day, it's still a JVM running the code.Thorvald
