Apache Kafka vs Apache Storm

7

112

Apache Kafka: distributed messaging system
Apache Storm: real-time message processing

How can we use both technologies in a real-time data pipeline for processing event data?

In a real-time data pipeline, both seem to me to do the same job, so where does each one fit?

Palladio answered 16/2, 2014 at 7:31 Comment(0)
183

You use Apache Kafka as a distributed, robust queue that can handle high-volume data and lets you pass messages from one endpoint to another.

Storm is not a queue. It is a system with distributed real-time processing abilities, meaning you can execute all kinds of manipulations on real-time data in parallel.

The common flow of these tools (as I know it) goes as follows:

real-time-system --> Kafka --> Storm --> NoSql --> BI(optional)

So your real-time app handles high-volume data and sends it to a Kafka queue. Storm pulls the data from Kafka and applies whatever manipulation is required. At this point you usually want to get some benefit from the data, so you either send it to a NoSQL database for additional BI calculations, or you simply query that NoSQL database from any other system.
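
To make the flow above concrete, here is a minimal single-process sketch in Python. It is only a toy model under stated assumptions: `queue.Queue` stands in for the Kafka queue, a plain function stands in for the Storm processing step, and a dict stands in for the NoSQL store. None of the names (`events_topic`, `produce`, `process_pending`, `page_view_store`) come from Kafka or Storm; they are hypothetical and exist only for illustration.

```python
import json
import queue
from collections import defaultdict

# Stand-in for a Kafka queue: an in-memory FIFO (not Kafka's API).
events_topic = queue.Queue()

# Stand-in for a NoSQL store: page -> number of views.
page_view_store = defaultdict(int)


def produce(event: dict) -> None:
    """Real-time app side: serialize the event and put it on the queue."""
    events_topic.put(json.dumps(event))


def process_pending() -> int:
    """Processing side: pull every queued event and update the store.

    Returns the number of events handled in this call.
    """
    handled = 0
    while True:
        try:
            raw = events_topic.get_nowait()
        except queue.Empty:
            return handled
        event = json.loads(raw)
        page_view_store[event["page"]] += 1
        handled += 1
```

Calling `produce({"page": "/home"})` a few times and then `process_pending()` leaves the per-page counts in `page_view_store`, which any other component could then query. The point of the queue in the middle is that the producer and the processor never call each other directly.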

Kirit answered 16/2, 2014 at 7:53 Comment(3)
Thanks Forhas. This is very helpful. One question: can we use Apache Kafka to aggregate Apache log files, or do we still need Flume for that? - Palladio
I guess you can, although I'm not familiar with such a flow. Maybe you can check Splunk for your needs (just a guess). - Kirit
I recommend using Graylog and connecting it to Apache Kafka. Graylog already has a Kafka input plugin. - Worker
44

I know this is an older thread, and the comparisons of Apache Kafka and Storm were valid and correct when they were written. It is worth noting, though, that Apache Kafka has evolved a lot over the years: since version 0.10 (April 2016) Kafka has included a Kafka Streams API, which provides stream processing capabilities without the need for additional software such as Storm. Kafka also includes the Connect API for connecting to various sources and sinks (destinations) of data.

Announcement blog - https://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/

Current Apache documentation - https://kafka.apache.org/documentation/streams/

In Kafka 0.11, the stream processing functionality was further expanded to provide exactly-once semantics and transactions.

https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/

Carbrey answered 19/8, 2017 at 19:29 Comment(3)
So basically a real-time system now communicates directly with Kafka as the endpoint, and Kafka stores the data, e.g. to a DB? - Fairway
Yes, Kafka now includes Kafka Connect to talk to databases and other data sources (syslog, JMS, log files, etc.), Kafka Streams to do the stream processing (joins, transforms, filters, aggregations), and Kafka Connect again to write out to another database or repository. - Carbrey
Kafka Streams is apparently not as good as Storm, according to this answer: https://mcmap.net/q/195887/-what-capabilities-does-apache-storm-offer-that-are-not-now-covered-by-kafka-streaming-closed - Ingrained
43

Kafka and Storm have slightly different purposes:

Kafka is a distributed message broker that can handle a large number of messages per second. It uses the publish-subscribe paradigm and relies on topics and partitions. Kafka uses ZooKeeper to share and save state between brokers. So Kafka is basically responsible for transferring messages from one machine to another.
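
As a rough illustration of what "topics and partitions" means, here is a small in-memory Python sketch. It is not Kafka's API: `ToyTopic` and its methods are hypothetical names, and CRC32 is used as the key hash purely as an assumption for the example (Kafka's own default partitioner uses a different hash function). The idea it shows is that records with the same key land in the same partition, and each subscriber reads from an offset it tracks itself.

```python
import zlib


class ToyTopic:
    """A toy topic: a fixed number of append-only partitions."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key: str) -> int:
        # Assumption: CRC32 as a stable hash. Only the idea (hash the key,
        # take it modulo the partition count) mirrors keyed partitioning.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def publish(self, key: str, value: str) -> int:
        """Append a record and return the partition it was written to."""
        index = self.partition_for(key)
        self.partitions[index].append((key, value))
        return index

    def read(self, partition: int, offset: int) -> list:
        """Return all records in a partition from the given offset onward."""
        return self.partitions[partition][offset:]
```

Because the partition is derived from the key, all events for one key (say, one user) keep their relative order inside a single partition, while different keys can be spread across partitions and consumed in parallel.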

Storm is a scalable, fault-tolerant, real-time analytics system (think of Hadoop in real time). It consumes data from sources (spouts) and passes it to a pipeline of processing steps (bolts). You combine them in a topology. So Storm is basically a computation unit (aggregation, machine learning).
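
The spout/bolt idea can be sketched with plain Python generators. This is not Storm's API and it runs sequentially in one process, whereas Storm distributes spouts and bolts across workers; the function names (`sentence_spout`, `split_bolt`, `count_bolt`, `run_topology`) are hypothetical and chosen only for this example.

```python
from collections import Counter
from typing import Iterable, Iterator


def sentence_spout(sentences: Iterable[str]) -> Iterator[str]:
    """Spout: emits items from a source one at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream: Iterable[str]) -> Iterator[str]:
    """Bolt: turns each sentence into lowercase words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream: Iterable[str]) -> Counter:
    """Bolt: aggregates the word stream into counts."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts


def run_topology(sentences: Iterable[str]) -> Counter:
    """Topology: wire the spout into the chain of bolts."""
    return count_bolt(split_bolt(sentence_spout(sentences)))
```

For example, `run_topology(["Kafka feeds Storm", "Storm counts words"])` counts `"storm"` twice and every other word once. In a combined setup, the spout would read from Kafka instead of from an in-memory list.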


But you can use them together: for example, your application uses Kafka to send data to other servers, which use Storm to run some computation on it.

Decern answered 26/1, 2015 at 7:48 Comment(0)
17

This is how it works

Kafka - To provide a realtime stream

Storm - To perform some operations on that stream

You might take a look at the GitHub project https://github.com/abhishekgoel137/kafka-nodejs-d3js.

(D3js is a graph-representation library)

Ideal case:

Realtime application -> Kafka -> Storm -> NoSQL -> d3js

This repository is based on:

Realtime application -> Kafka -> <plain Node.js> -> NoSQL -> d3js
Prepay answered 30/1, 2015 at 18:24 Comment(1)
Abhishek, the link mentioned in the above answer is broken. Can you please update the link? - Coussoule
5

As everyone has explained, Apache Kafka is a continuous messaging queue.

Apache Storm is a continuous processing tool.

In this setup, Kafka receives data from sites such as Facebook or Twitter through their APIs, Storm processes that data, and you can store the processed data in any database you like.

https://github.com/miguno/kafka-storm-starter

Follow it and you will get some idea of how this works.

Wain answered 31/12, 2015 at 7:22 Comment(0)
3

When I have a use case that requires me to visualize or alert on patterns (think of Twitter trends) while continuing to process the events, I have several options.
NiFi would allow me to process an event and update a persistent data store with low(er) batch aggregation and very, very little custom coding.
Storm (lots of custom coding) gives me nearly real-time access to the trending events.
If I can wait for many seconds, then I can batch out of Kafka into HDFS (Parquet) and process there.
If I need to know in seconds, I need NiFi, and probably even Storm. (Think of monitoring thousands of earth stations, where I need to see small-region weather conditions for tornado warnings.)

Pforzheim answered 7/5, 2018 at 5:51 Comment(0)
0

Simply put, Kafka sends messages from one node to another, and Storm processes the messages. Check this example of how you can integrate Apache Kafka with Storm.

Calia answered 20/1, 2020 at 8:45 Comment(0)
