Event Sourcing with Kinesis - Replaying and Persistence

I am trying to implement an event-driven architecture using Amazon Kinesis as the central event log of the platform. The idea is pretty much the same as the one presented by Nordstrom with the Hello-Retail project.

I have done similar things with Apache Kafka before, but Kinesis seems to be a cost-effective alternative to Kafka and I decided to give it a shot. I am, however, facing some challenges related to event persistence and replaying. I have two questions:

  1. Are you using Kinesis for this kind of use case, or would you recommend using it?
  2. Since Kinesis cannot retain events forever (the way Kafka can), how should replays from consumers be handled?

I'm currently using a Lambda function (Firehose is also an option) to persist all events to Amazon S3. A consumer could then read past events from storage and afterwards start listening for new events coming from the stream. But I'm not happy with this solution: while replaying from S3, consumers cannot use Kinesis checkpoints (the equivalent of Kafka's consumer offsets). Plus, Java's KCL does not support the AFTER_SEQUENCE_NUMBER iterator type yet, which would be useful in such an implementation.
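
To make the switch-over concrete, here is a minimal sketch in Python with boto3 of that replay-then-tail pattern. It assumes a single shard and a made-up bucket, prefix, and stream name, and it assumes the archiving Lambda writes one JSON object per line carrying the event payload and its original Kinesis sequence number. Note that the low-level GetShardIterator API does accept AFTER_SEQUENCE_NUMBER even where the KCL does not. A real implementation would also need per-shard checkpointing, resharding handling, and error handling.

    import json
    import time

    import boto3

    s3 = boto3.client("s3")
    kinesis = boto3.client("kinesis")

    BUCKET = "my-event-archive"    # hypothetical archive bucket
    PREFIX = "events/"             # hypothetical key prefix used by the Lambda
    STREAM = "platform-event-log"  # hypothetical stream name

    def replay_then_tail(handle_event):
        last_seq = None

        # Phase 1: replay archived events from S3 in key order.
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
                for line in body.splitlines():
                    event = json.loads(line)
                    handle_event(event["payload"])
                    last_seq = event["sequence_number"]

        # Phase 2: resume from the live stream right after the last archived
        # event (assumes that event is still within the stream's retention).
        shard_id = kinesis.describe_stream(StreamName=STREAM)[
            "StreamDescription"]["Shards"][0]["ShardId"]
        if last_seq:
            it = kinesis.get_shard_iterator(
                StreamName=STREAM, ShardId=shard_id,
                ShardIteratorType="AFTER_SEQUENCE_NUMBER",
                StartingSequenceNumber=last_seq)["ShardIterator"]
        else:
            it = kinesis.get_shard_iterator(
                StreamName=STREAM, ShardId=shard_id,
                ShardIteratorType="TRIM_HORIZON")["ShardIterator"]

        while it:
            resp = kinesis.get_records(ShardIterator=it, Limit=100)
            for record in resp["Records"]:
                handle_event(json.loads(record["Data"]))
            it = resp["NextShardIterator"]
            time.sleep(0.2)  # stay under the 5 reads/sec per-shard limit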

Saks answered 12/12, 2017 at 11:36 Comment(0)

First question: yes, I am using Kinesis Streams when I need to process the received log/event data before storing it in S3. When I don't, I use Kinesis Firehose.

Second question: Kinesis Streams can retain data for up to seven days. That is not forever, but it should be enough time to process your events; whether it is depends on how valuable the events being processed are to you.
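
One detail worth noting: seven days is the maximum, not the default. Streams start with 24 hours of retention, and you have to extend it yourself, for example with boto3 (the stream name here is just a placeholder):

    import boto3

    boto3.client("kinesis").increase_stream_retention_period(
        StreamName="platform-event-log",  # placeholder stream name
        RetentionPeriodHours=168,         # the seven-day maximum
    )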

If I do not need to process the event stream before storing it in S3, then I use Kinesis Firehose writing to S3. That way I do not have to worry about event failures, persistence, and so on. I then process the data stored in S3 with whichever tool fits best; I use Amazon Athena often, and Amazon Redshift too.
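
For what that path looks like from the producer side, here is a hedged sketch with boto3; the delivery stream name and event shape are assumptions. Firehose buffers the records and delivers them to S3 on its own, so there is no consumer code to manage:

    import json

    import boto3

    firehose = boto3.client("firehose")

    def publish(events):
        # Firehose concatenates records as-is, so newline-delimit them to
        # keep the resulting S3 objects line-oriented for Athena.
        records = [{"Data": (json.dumps(e) + "\n").encode()} for e in events]
        resp = firehose.put_record_batch(
            DeliveryStreamName="event-archive",  # placeholder delivery stream
            Records=records,  # capped at 500 records per call; chunk if needed
        )
        # Individual records can fail even when the call succeeds; a real
        # producer would retry the failed subset.
        if resp["FailedPutCount"]:
            raise RuntimeError(f"{resp['FailedPutCount']} records failed")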

You don't mention how much data you are processing or how it is being processed. If it is large, multiple MB/sec or higher, then I would definitely use Kinesis Firehose. With Kinesis Streams you have to manage throughput (shard capacity) yourself.
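
To give a feel for what managing that means, here is a back-of-the-envelope shard-sizing calculation using the published per-shard limits (1 MB/s or 1,000 records/s in, 2 MB/s out); the workload numbers are made up:

    import math

    ingest_mb_per_s = 5.0  # assumed peak write throughput
    records_per_s = 2000   # assumed peak record rate
    consumers = 3          # readers sharing each shard's 2 MB/s egress

    shards = max(
        math.ceil(ingest_mb_per_s / 1.0),              # write bandwidth limit
        math.ceil(records_per_s / 1000),               # write record-rate limit
        math.ceil(ingest_mb_per_s * consumers / 2.0),  # read bandwidth limit
    )
    print(shards)  # -> 8 shards for this example workload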

One issue that I have with Kinesis Streams is that I don't like the client libraries, so I prefer to write everything myself. Kinesis Firehose reduces coding for custom applications, as you just store the data in S3 and process it afterwards.

I like to think of S3 as my big data lake. I prefer to throw everything into S3 without preprocessing and then use various tools to pull out the data that I need. By doing this I remove lots of points of failure that need to be managed.

Musette answered 12/12, 2017 at 18:45 Comment(3)
Hi, thanks for your answer. However, I would like to use Kinesis Streams primarily as a broker for event-sourced applications. It is an absolute must that consumers be able to replay events at any given time. My problem is that there are a lot of gotchas on the consumer side to make that work, like querying S3 to fetch old events until you find your place in the Kinesis stream and can start listening to it for new events. What I am interested in is how people are handling this specific situation. I also don't like the client libraries... maybe I'll come up with something that abstracts this whole approach. – Saks
About the data lake, I am already moving in that direction. The idea is that event-sourced asynchronous microservices (consumers) will process a stream of events and then publish their state (aggregation / projection / snapshot) to S3 as well. That state would then be used as the source for Athena/Redshift queries. This is basically what Kafka Streams does, in a pretty cool way. – Saks
Without a much deeper understanding of how everything works with regard to producers and consumers in your setup, I cannot offer more. I prefer to put everything into S3 first and then process it from there. – Musette
