Akka Stream Kafka vs Kafka Streams

I am currently working with Akka Stream Kafka to interact with Kafka, and I was wondering what the differences are compared to Kafka Streams.

I know that the Akka-based approach implements the Reactive Streams specification and handles back-pressure, functionality that Kafka Streams seems to be lacking.

What would be the advantage of using Kafka Streams over Akka Streams Kafka?

Maulmain answered 11/8, 2017 at 6:17 Comment(1)
Confluent addresses the backpressure issue here docs.confluent.io/current/streams/…. "Kafka Streams does not use a backpressure mechanism because it does not need one." Records are never buffered in memory between processing stages.Bye

Your question is very general, so I'll give a general answer from my point of view.

First, I've got two usage scenarios:

  1. Cases where I'm reading data from Kafka, processing it, and writing some output back to Kafka; for these I'm using Kafka Streams exclusively (see the sketch after this list).
  2. Cases where either the data source or the sink is not Kafka; for those I'm using Akka Streams.
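
For illustration, here is a rough sketch of what scenario 1 can look like with the Kafka Streams Scala DSL (not the answer author's code; the application id, broker address and topic names are made up, and a recent kafka-streams-scala is assumed):

    import java.util.Properties

    import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
    import org.apache.kafka.streams.scala.StreamsBuilder
    import org.apache.kafka.streams.scala.ImplicitConversions._
    import org.apache.kafka.streams.scala.serialization.Serdes._

    object Scenario1 extends App {
      val props = new Properties()
      props.put(StreamsConfig.APPLICATION_ID_CONFIG, "scenario-1-demo")   // hypothetical app id
      props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // hypothetical broker

      val builder = new StreamsBuilder()
      builder
        .stream[String, String]("input-topic")  // hypothetical source topic
        .mapValues(_.toUpperCase)               // stand-in for the real processing
        .to("output-topic")                     // hypothetical sink topic

      val streams = new KafkaStreams(builder.build(), props)
      streams.start()
      sys.addShutdownHook(streams.close())
    }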

This already allows me to answer the part about back-pressure: for the first scenario above, there is a back-pressure mechanism in Kafka Streams.

Let's now focus only on the first scenario described above, and see what I would lose if I decided to stop using Kafka Streams:

  • Some of my stream processing stages need a persistent (distributed) state store, and Kafka Streams provides it for me. This is something Akka Streams doesn't provide (see the sketch after this list).
  • Scaling: Kafka Streams automatically rebalances the load as soon as a new instance of a stream processor is started, or as soon as one gets killed. This works inside the same JVM as well as across nodes: scaling up and out. This is not provided by Akka Streams.
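
As a rough illustration of the state-store point, here is a sketch of a counting stage that gets a named local store (RocksDB by default) backed by a compacted changelog topic, so the state survives restarts and follows the partition on rebalance; the topic and store names are invented:

    import org.apache.kafka.streams.scala.{ByteArrayKeyValueStore, StreamsBuilder}
    import org.apache.kafka.streams.scala.ImplicitConversions._
    import org.apache.kafka.streams.scala.serialization.Serdes._
    import org.apache.kafka.streams.scala.kstream.Materialized

    val builder = new StreamsBuilder()
    builder
      .stream[String, String]("page-views")  // hypothetical input topic
      .groupByKey
      .count()(Materialized.as[String, Long, ByteArrayKeyValueStore]("view-counts-store"))
      // ^ a named local state store, replicated through a compacted changelog topic in Kafka
      .toStream
      .to("page-view-counts")                // hypothetical output topic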

Those are the biggest differences that matter to me; I hope that makes sense to you!

Hypophysis answered 11/8, 2017 at 7:45 Comment(6)
I think you misunderstood my question; I am specifically talking about akka-streams kafka, which is made to interact with Kafka using Akka Stream constructs.Maulmain
That's what I understood. Akka Streams Kafka is just a Kafka consumer/producer wrapped as an Akka Streams source/sink. As such my answer seems valid. What do you think is not appropriate?Hypophysis
@FredericA. the point about scaling is true for Akka Streams too when Kafka is a source. You don't lose it if you decide to use Akka Streams.Belloir
@DanielWojda is correct, this works by defining a consumer group for the stream source. That way there will be only one active consumer per topic partition. When using reactive-kafka for example, this functionality is provided by the Kafka consumer backing the stream source.Hypophysis
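
For context, a minimal sketch of what that looks like with Akka Stream Kafka (Alpakka Kafka): the group id on the consumer settings is what lets Kafka's group management spread partitions across instances. Broker, group and topic names are invented, and Akka 2.6+ is assumed so the ActorSystem provides the materializer:

    import akka.actor.ActorSystem
    import akka.kafka.{ConsumerSettings, Subscriptions}
    import akka.kafka.scaladsl.Consumer
    import akka.stream.scaladsl.Sink
    import org.apache.kafka.common.serialization.StringDeserializer

    implicit val system: ActorSystem = ActorSystem("consumer-group-demo")

    val settings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("localhost:9092") // hypothetical broker
      .withGroupId("my-processing-group")     // every instance started with this group id
                                              // gets a share of the topic's partitions

    Consumer
      .plainSource(settings, Subscriptions.topics("input-topic")) // hypothetical topic
      .map(_.value)
      .runWith(Sink.foreach(println))
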
Case 2: if you already have Kafka infrastructure, you can just deploy Kafka Connect and continue from there.Inshrine
@FredericA. could you please elaborate on the statement "there is a back-pressure mechanism in Kafka Streams"? How does Kafka Streams provide back-pressure handling?Froebel

The big advantage of Akka Streams over Kafka Streams is the possibility of implementing very complex processing graphs that can be cyclic, with fan-in/fan-out and feedback loops. Kafka Streams only allows acyclic graphs, if I am not wrong. It would be very complicated to implement a cyclic processing graph on top of Kafka Streams.
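
For illustration, here is a rough sketch of a cyclic Akka Streams graph with a feedback edge, built with the GraphDSL; the element type and filtering rule are invented, and as the Akka documentation warns, cycles need care (a preferred merge or a buffer) to stay live:

    import akka.actor.ActorSystem
    import akka.stream.ClosedShape
    import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, MergePreferred, RunnableGraph, Sink, Source}

    object CyclicGraph extends App {
      implicit val system: ActorSystem = ActorSystem("cyclic-demo")

      val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
        import GraphDSL.Implicits._

        val merge    = b.add(MergePreferred[Int](1))     // preferred port serves the feedback edge
        val bcast    = b.add(Broadcast[Int](2))
        val double   = b.add(Flow[Int].map(_ * 2))
        val feedback = b.add(Flow[Int].filter(_ < 1000)) // elements leave the cycle once they grow large

        Source(List(1, 2, 3)) ~> merge ~> double ~> bcast ~> Sink.foreach[Int](println)
        merge.preferred <~ feedback <~ bcast

        ClosedShape
      })

      graph.run() // note: the cycle never completes on its own, so the graph keeps idling after processing
    }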

Mangosteen answered 24/11, 2017 at 9:42 Comment(2)
This is incorrect; cyclic streams are possible with Kafka Streams.Kurtzman
The biggest advantage of Akka is that you do not need any additional server/broker (as Kafka does). Akka Streams is just a stream processing tool powered by actors, and it is currently introducing a remote streaming mechanism as well. It is a tool like a brick, with which you can build any kind of complex building. Kafka, on the other hand, is a streaming tool with persistence (a log database) and support for joining multiple streams, or a stream with table records. So if you need streams with join features and a stream timeline, you can use Kafka. I hope this will be available in Akka very soon.Slowly

I found this article to give a good summary of the distributed design concerns that Kafka Streams addresses (it complements Akka Streams).

https://www.beyondthelines.net/computing/kafka-streams/

Message ordering: Kafka maintains a sort of append-only log where it stores all the messages. Each message has a sequence id, also known as its offset. The offset is used to indicate the position of a message in the log. Kafka Streams uses these message offsets to maintain ordering.

Partitioning: Kafka splits a topic into partitions, and each partition is replicated among different brokers. Partitioning allows the load to be spread, and replication makes the application fault-tolerant (if a broker is down, the data is still available). That's good for data partitioning, but we also need to distribute the processing in a similar way. Kafka Streams uses a processor topology that relies on Kafka group management. This is the same group management that is used by the Kafka consumer to distribute load evenly among consumers (this work is mainly managed by the brokers).

Fault tolerance: data replication ensures data fault tolerance. Group management has fault tolerance built in, as it redistributes the workload among the remaining live instances.

State management: Kafka Streams provides local storage backed by a Kafka changelog topic which uses log compaction (keeping only the latest value for a given key).

Reprocessing: When starting a new version of the app, we can reprocess the logs from the start to compute the new state, then redirect the traffic to the new instance and shut down the old application.

Time management: “Stream data is never complete and can always arrive out-of-order,” therefore one must distinguish between event time and processing time and handle them correctly.
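
To make the event-time vs processing-time distinction concrete, here is a rough sketch of a custom TimestampExtractor for Kafka Streams; the payload format carrying the event time is purely hypothetical:

    import org.apache.kafka.clients.consumer.ConsumerRecord
    import org.apache.kafka.streams.processor.TimestampExtractor

    // Hypothetical payload format: "<eventTimeMillis>|<body>", e.g. "1518000000000|user-42 clicked"
    class EventTimeExtractor extends TimestampExtractor {
      override def extract(record: ConsumerRecord[AnyRef, AnyRef], partitionTime: Long): Long =
        record.value match {
          case s: String => s.takeWhile(_ != '|').toLong // use the event time carried in the payload
          case _         => record.timestamp             // fall back to the record's own timestamp
        }
    }

    // registered through the streams configuration, e.g.:
    // props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, classOf[EventTimeExtractor].getName)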

The author also says: "Using this change-log topic Kafka Stream is able to maintain a “table view” of the application state."

My take is that this applies mostly to an enterprise application where the "application state" is ... small.

For a data science application working with "big data", the "application state" produced by a combination of data munging, machine learning models and business logic to orchestrate all of this will likely not be managed well with Kafka Streams.

Also, I am thinking that using a "pure functional event sourcing runtime" like https://github.com/notxcain/aecor will help make the mutations explicit and separate the application logic from the technology used to manage the persistent form of the state, through principled management of state mutation and IO "effects" (functional programming).

In other words, the business logic does not become tangled with the Kafka APIs.

Seeley answered 3/2, 2018 at 12:51 Comment(3)
[My take is that this applies mostly to an enterprise application where the "application state" is ... small.] - I would rather say this is a pretty myopic view of what Kafka Streams actually is. Kafka, at its core, operates on key-values - the 'table view' is basically a very summarized reference to the stream-table duality, as handled by Kafka. Kafka is intended to be used (and is used, actually) for true big (huge?) data platforms.Unskillful
Say I have an analysis which produces a series of large matrices which are persisted already (in some way: Spark RDDs, etc.) and I want to send domain events to other components referencing these matrices. Would you send the matrices themselves in Kafka?Seeley
For a data science application working with "big data": the data is already persisted and is not changing, so you don't need Kafka or Akka Streams for that purpose; you need a distributed computing framework, e.g. Spark.Inshrine

Akka Streams emerged as a dataflow-centric abstraction on top of the Akka actor model. It is a high-performance library built for the JVM and especially well suited to general-purpose microservices.

As far as Kafka Streams is concerned, it is a client library used to process unbounded data. It is used to read data from Kafka topics, process it, and write the results to new topics.

Loon answered 25/3, 2021 at 12:59 Comment(0)

Well, I have used both of those, and I have a pretty good idea about their strengths and weaknesses.

If you are solely concentrated on Kafka and you don't have too much experience with stream processing, Kafka Streams is a good out-of-the-box solution that helps you understand the streaming concepts. Its Achilles heel, in my opinion, is its datastore: RocksDB, which backs stateful scenarios with KTables or internal state stores.

If you use the Kafka Streams library, RocksDB installs itself in the background transparently, which is great for a beginner but troublesome for an experienced developer. RocksDB is a key/value database, like Cassandra; it has most of Cassandra's strengths but also its weaknesses, a major one being that you can only query things by primary key, which for most real-life scenarios is a huge limitation. There are ways to plug in your own datastore, but they are not that well documented and can be a great challenge. Also, RocksDB is really good at loading single values, but if you have to iterate over things, the performance degrades significantly once the dataset grows beyond roughly 100 000 entries.
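
As a rough sketch of what "query only by primary key" means in practice, here is an interactive query against a state store (the store name and types are invented; a recent Kafka Streams with StoreQueryParameters is assumed):

    import org.apache.kafka.streams.{KafkaStreams, StoreQueryParameters}
    import org.apache.kafka.streams.state.QueryableStoreTypes

    // `streams` is a running KafkaStreams instance that materialized a key/value store named "counts-store"
    def lookup(streams: KafkaStreams, key: String): Option[Long] = {
      val store = streams.store(
        StoreQueryParameters.fromNameAndType(
          "counts-store",
          QueryableStoreTypes.keyValueStore[String, java.lang.Long]()))

      Option(store.get(key)).map(_.longValue) // point lookup by key: cheap and fast
      // anything else (filtering by value, secondary attributes, ...) means
      // store.all() or store.range(...), i.e. scanning the store
    }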

Unfortunately, because RocksDB is embedded so deeply in Kafka Streams, it is also not that easy to implement a CQRS solution with it.

And as mentioned above, it has no concept of back-pressure; the Kafka consumer hands over records one by one, and in a scenario where you have to scale out, that can be a real bottleneck. Be really careful about the statement that Kafka Streams does not need a back-pressure mechanism; as this Netflix blog points out, it can really cause unpleasant effects.
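
By contrast, here is a rough sketch of how back-pressure surfaces in Akka Stream Kafka: a slow downstream stage limits how fast the source polls, instead of letting an in-memory queue grow. Names and the artificial delay are made up; Akka 2.6+ and Alpakka Kafka are assumed:

    import scala.concurrent.Future
    import scala.concurrent.duration._

    import akka.actor.ActorSystem
    import akka.kafka.{ConsumerSettings, Subscriptions}
    import akka.kafka.scaladsl.Consumer
    import akka.pattern.after
    import akka.stream.scaladsl.Sink
    import org.apache.kafka.common.serialization.StringDeserializer

    implicit val system: ActorSystem = ActorSystem("backpressure-demo")
    import system.dispatcher

    val settings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("localhost:9092") // hypothetical broker
      .withGroupId("slow-consumers")          // hypothetical group

    Consumer
      .plainSource(settings, Subscriptions.topics("events")) // hypothetical topic
      .mapAsync(parallelism = 4) { record =>
        // a deliberately slow "write to an external system": at most 4 records in flight,
        // so demand propagates upstream and the Kafka source stops racing ahead
        after(100.millis, system.scheduler)(Future.successful(record.value))
      }
      .runWith(Sink.ignore)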

"By the following morning, alerts were received regarding high memory consumption and GC latencies, to the point where the service was unresponsive to HTTP requests. An investigation of the JVM memory dump revealed an internal Kafka message concurrent queue whose size had grown uncontrollably to over 1.3 million elements. The cause for this abnormal queue growth is due to Spring KafkaListener’s lack of native back-pressure support."

So what are the advantages and disadvantages of Akka Streams compared to Kafka Streams? First of all, Akka is not that much of an out-of-the-box framework; you have to understand the concepts much better. It is not coupled to a single persistence option: you can choose whatever you want. It has direct support for the CQRS pattern (Akka Projection), so you are not bound to querying your data only by primary key. The Akka developers thought a lot about scaling out and back-pressure, and committed a lot of code to the Kafka code base to improve performance.

So if you are only working with Kafka and are new to stream processing, you can use Kafka Streams, but be prepared that at some point you may hit a wall and switch to Akka Streams.

If you want to see working details/examples, I have two blog posts about it; you can check those out: blog1 blog2

Introjection answered 21/7, 2022 at 6:58 Comment(0)
