Apache Flume vs Apache Flink difference

I need to read a stream of data from some source (in my case it's a UDP stream, but that shouldn't matter), transform each record, and write it to HDFS.

Is there any difference between using Flume or Flink for this purpose?

I know I can use Flume with a custom interceptor to transform each event.

But I am new to Flink, and it looks to me like Flink can do the same.

Which one is better to choose? Is there a difference in performance?

Please, help!

Arietta answered 4/10, 2016 at 16:59

Disclaimer: I'm a committer and PMC member of Apache Flink. I do not have detailed knowledge about Apache Flume.

Moving streaming data from various sources into HDFS is one of the primary use cases for Apache Flume as far as I can tell. It is a specialized tool and I would assume it has a lot of related functionality built in. I cannot comment on Flume's performance.

Apache Flink is a platform for data stream processing. It is more generic and feature-rich than Flume (e.g., support for event time, advanced windowing, high-level APIs, and fault-tolerant, stateful applications). You can implement and run many different kinds of stream processing applications with Flink, including streaming analytics and CEP.

Flink features a rolling file sink that writes data streams to HDFS files, and it lets you implement all kinds of custom behavior via user-defined functions. However, it is not a specialized tool for data ingestion into HDFS, so do not expect a lot of built-in functionality for this use case. Flink provides very good throughput and low latency.
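To illustrate what a rolling file sink does conceptually: records are grouped into bucket directories, usually by time, and a new part file is started when the bucket changes or the current file grows too large. Below is a minimal plain-Java sketch of the bucketing step only. The class name `HourlyBucketPath`, the hourly UTC pattern, and the base path are assumptions made for illustration; this is not Flink's API.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Illustrative helper, not part of Flink: maps a record timestamp
// to the directory of the hourly bucket it would be written into.
class HourlyBucketPath {
    // One directory per UTC hour, e.g. 2016-10-04--19 (the pattern is an assumption).
    private static final DateTimeFormatter HOUR =
        DateTimeFormatter.ofPattern("yyyy-MM-dd--HH").withZone(ZoneOffset.UTC);

    // Returns basePath + "/" + the hour bucket for the given epoch timestamp.
    static String bucketFor(String basePath, long epochMillis) {
        return basePath + "/" + HOUR.format(Instant.ofEpochMilli(epochMillis));
    }
}
```

All records whose timestamps fall within the same hour map to the same directory, so a sink only needs to open a new file when `bucketFor` returns a path it has not seen yet.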

If you do not need more than simple record-level transformations, I'd first try to solve your use case with Flume. I would expect Flume to come with a few features that you would need to implement yourself when choosing Flink. If you expect to do more advanced stream processing in the future, Flink is definitely worth a look.
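To make "simple record-level transformation" concrete, here is a minimal, tool-agnostic sketch in plain Java. It assumes a hypothetical record format (comma-separated UTF-8 fields that should come out trimmed and tab-separated); the class and method names are made up for illustration. The same function could be called from a Flume interceptor or from a Flink map function.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical record format: comma-separated UTF-8 fields in,
// trimmed tab-separated fields out. Not tied to Flume or Flink.
class RecordTransform {
    static byte[] transform(byte[] body) {
        String line = new String(body, StandardCharsets.UTF_8).trim();
        // Limit -1 keeps trailing empty fields instead of silently dropping them.
        String[] fields = line.split(",", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                out.append('\t');
            }
            out.append(fields[i].trim());
        }
        return out.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```

Because the function is stateless and looks at one record at a time, it needs none of the windowing or state features mentioned above, which is why either tool can host it.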

Wynny answered 4/10, 2016 at 19:57

Disclaimer: I'm a committer of Apache Flume. I do not have detailed knowledge about Apache Flink.

For the use case you have described, Flume could be the right choice.

You could use the Exec Source until the netcat UDP source gets committed to the codebase.

For the transformation, it's hard to provide suggestions, but you might want to take a look at the Morphline Interceptor.

Regarding the channel, I would recommend the Memory Channel: since the source is UDP, which gives no delivery guarantee anyway, some negligible data loss should be acceptable.
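The trade-off behind that recommendation can be sketched with a toy bounded in-memory buffer: it is fast because nothing touches disk, but whatever it holds is gone if the process dies, and events that arrive while it is full are lost. This is a plain-Java illustration only, not Flume's implementation; `BoundedEventBuffer` and its methods are hypothetical names.

```java
import java.util.concurrent.ArrayBlockingQueue;

// Toy stand-in for an in-memory channel; NOT Flume's implementation.
class BoundedEventBuffer {
    private final ArrayBlockingQueue<byte[]> queue;
    private long dropped = 0;

    BoundedEventBuffer(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
    }

    // Non-blocking put: returns false and counts a drop when the buffer is full.
    synchronized boolean put(byte[] event) {
        boolean accepted = queue.offer(event);
        if (!accepted) {
            dropped++;
        }
        return accepted;
    }

    // Returns the oldest buffered event, or null if the buffer is empty.
    byte[] take() {
        return queue.poll();
    }

    synchronized long droppedCount() {
        return dropped;
    }
}
```

With a capacity sized for normal bursts, drops should be rare, which matches the "negligible data loss" that a UDP source already implies.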

Sink-wise, the HDFS Sink probably covers your needs.

Ardath answered 9/10, 2016 at 3:40
