Reliable Webhook dispatching system
I am having a hard time figuring out a reliable and scalable solution for a webhook dispatch system.

The current system uses RabbitMQ with a queue for webhooks (let's call it events), which are consumed and dispatched. This system worked for some time, but now there are a few problems:

  • If a single system user generates too many events, they fill up the queue and other users don't receive their webhooks for a long time
  • If I split all events into multiple queues (by URL hash, roughly as sketched after this list), it reduces the chance of the first problem, but it still happens from time to time when a very busy user hits the same queue
  • If I try to put each URL into its own queue, the challenge is to dynamically create/assign consumers to those queues. As far as the RabbitMQ documentation goes, the API is very limited when it comes to filtering for non-empty queues or for queues that have no consumers assigned.
  • As far as Kafka goes, from everything I have read about it, the situation is the same within the scope of a single partition.
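For clarity, this is roughly what I mean by splitting on a URL hash (a minimal sketch; the queue name pattern and the fixed bucket count are just illustrative):

    import hashlib

    NUM_QUEUES = 8  # illustrative fixed number of hash buckets

    def queue_for_url(url: str) -> str:
        """Map a webhook URL to one of NUM_QUEUES queues with a stable hash."""
        bucket = int(hashlib.sha1(url.encode("utf-8")).hexdigest(), 16) % NUM_QUEUES
        return f"events.{bucket}"

    # All events for the same URL always land in the same queue,
    # but a very busy URL can still dominate its whole bucket.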

So, the question is - is there a better way/system for this purpose? Maybe I am missing a very simple solution that would allow one user to not interfere with another user?

Thanks in advance!

Tricksy answered 21/8, 2021 at 14:59 Comment(6)
I feel like hashing is the correct solution. You can implement incoming rate limiting to prevent bad actors from slowing a particular queue/partition down. - Wizardry
Won't incoming rate limiting slow down the producers? Also, it would mean that "slow" messages need to go somewhere else anyway. - Tricksy
I didn't understand how you split events into multiple queues using URL hashing. Can you give some explanation, please? - Amphictyon
@nsv Each webhook handler has a unique URL. Each webhook handler can have multiple events assigned to it. So when an event is created, it is put into the queue for its respective webhook handler, and since each webhook handler has a unique URL, it's basically the same thing. - Tricksy
@Tricksy But how do you manage a queue per URL when you have so many URLs? - Amphictyon
@nsv Well, the queues are cleaned up when not in use. A slight drawback here is that I still had to use a locking mechanism (in my case, implemented with Redis) for creating and deleting the queues. A queue cannot be deleted by the housekeeping process if a message was posted to it less than 30 seconds ago. If 30 seconds pass and the queue is empty, it is deleted. The Redis lock prevents the queue from being deleted while it is being posted to. - Tricksy

So, I am not sure if this is the correct way to solve this problem, but this is what I came up with.

Prerequisites: RabbitMQ with the message deduplication plugin

So my solution involves:

  • g:events queue - let's call it the parent queue. This queue contains the names of all child queues that need to be processed. It could probably be replaced with some other mechanism (like a Redis sorted set), but then you would have to implement the ack logic yourself.
  • g:events:<url> - these are the child queues. Each one contains only the events that need to be sent out to that URL.

When posting a webhook payload to RabbitMQ, you post the actual data to the child queue, and additionally post the name of the child queue to the parent queue. The deduplication plugin won't allow the same child queue name to be posted twice, meaning that only a single consumer may pick up that child queue for processing.

All your consumers consume the parent queue, and after receiving a message, they start consuming the child queue named in it. Once the child queue is empty, you acknowledge the parent message and move on.

This method allows very fine control over which child queues get processed. If some child queue is taking too much time, just ack the parent message and republish the same data to the end of the parent queue.
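Roughly, the publish side and a worker loop could look like this (a simplified pika sketch; the x-message-deduplication queue argument and the x-deduplication-header message header follow the deduplication plugin's conventions as I understand them, and deliver_webhook stands in for the actual HTTP POST):

    import time
    import pika

    # Connection details and queue names are illustrative.
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()

    PARENT = "g:events"
    # The parent queue deduplicates on the x-deduplication-header value,
    # so each child queue name is listed at most once at any time.
    ch.queue_declare(queue=PARENT, durable=True,
                     arguments={"x-message-deduplication": True})

    def publish_webhook(url: str, payload: bytes) -> None:
        child = f"g:events:{url}"
        ch.queue_declare(queue=child, durable=True)
        # 1) the actual payload goes to the child queue
        ch.basic_publish(exchange="", routing_key=child, body=payload)
        # 2) the child queue's *name* goes to the parent queue
        ch.basic_publish(
            exchange="", routing_key=PARENT, body=child.encode(),
            properties=pika.BasicProperties(
                headers={"x-deduplication-header": child}),
        )

    def worker_loop() -> None:
        while True:
            method, _props, body = ch.basic_get(queue=PARENT, auto_ack=False)
            if method is None:
                time.sleep(0.5)           # parent queue is empty, back off briefly
                continue
            child = body.decode()
            url = child[len("g:events:"):]
            while True:                   # drain the child queue
                m, _p, payload = ch.basic_get(queue=child, auto_ack=False)
                if m is None:
                    break
                deliver_webhook(url, payload)   # hypothetical HTTP POST helper
                ch.basic_ack(m.delivery_tag)
            ch.basic_ack(method.delivery_tag)   # child is drained, ack the parent entry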

I understand that this is probably not the most efficient approach (there's also a bit of overhead from constantly posting to the parent queue), but it is what it is.

Tricksy answered 14/10, 2021 at 14:44 Comment(0)

You may experiment with several RabbitMQ features to mitigate your issue (without removing it completely):

  • Use a public random exchange to split events across several queues. It will mitigate large spikes of events and dispatch work to several consumers.

  • Set TTL policies on your queues. This way, RabbitMQ can republish (dead-letter) events to another group of queues (through another private random exchange, for example) if they are not processed fast enough.

You may have several "cycles" of events, with varying configuration (i.e. the number of cycles and the TTL value for each cycle). Your first cycle handles fresh events the best it can, mitigating spikes through several queues under a random exchange. If it fails to handle events fast enough, events are moved to another cycle with dedicated queues and consumers.

This way, you can ensure that fresh events have a better chance of being handled quickly, as they will always be published into the first cycle (and not behind a pile of old events from another user).
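For illustration, one such cycle could be declared like this (a minimal pika sketch; queue and exchange names and the 10-second TTL are made up, and per-queue arguments are used here although the same can be expressed with policies):

    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()

    # Second-cycle exchange and queue, receiving events the first cycle
    # failed to process in time.
    ch.exchange_declare(exchange="events.cycle2", exchange_type="fanout", durable=True)
    ch.queue_declare(queue="events.cycle2.q", durable=True)
    ch.queue_bind(queue="events.cycle2.q", exchange="events.cycle2")

    # First-cycle queues: a message sitting here longer than 10 seconds is
    # dead-lettered to the second-cycle exchange instead of blocking fresh events.
    # (A random or consistent-hash exchange plugin could front these queues.)
    for i in range(4):
        ch.queue_declare(
            queue=f"events.cycle1.{i}",
            durable=True,
            arguments={
                "x-message-ttl": 10_000,                   # per-message TTL, in ms
                "x-dead-letter-exchange": "events.cycle2", # where expired messages go
            },
        )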

Expiratory answered 26/8, 2021 at 10:12 Comment(5)
Well yes, but this also introduces the possibility that a single user starts receiving events in parallel, breaking the "order of events". - Tricksy
Yes indeed. I hadn't realised it was an issue while reading your post. I guess that leaves you with the "put each URL in its own queue" solution then. - Expiratory
Well, the main challenge is: how to dynamically connect consumers to these queues? And how to evenly spread all consumers between all of the queues? - Tricksy
Can't your consumers be "aware" of the queue distribution you want, so they can create the queues when they start and remove them when the job is done? - Expiratory
Well, that's the question here :D Is there a way to evenly distribute consumers across dynamically created queues, so that each consumer takes a queue that has no consumers, works on it, and then moves to the next? - Tricksy

If you need ordering, unfortunately you depend on user input.

But in the Kafka world, there are a few things to mention here:

  • You can achieve exactly-once semantics with transactions, which allows you to build a system similar to what regular AMQP brokers give you.
  • Kafka supports partitioning by key, which lets you preserve processing order for the same key (in your case, the user id).
  • Throughput can be increased by tuning the producer, broker, and consumer sides (batch size, in-flight requests, etc.; see the Kafka documentation for more parameters).
  • Kafka supports message compression, which reduces network traffic and increases throughput (it only costs a little extra CPU with fast compression algorithms like LZ4).

Partitions are the most important thing in your scenario. You can increase the partition count to process more messages at the same time. You can have as many consumers as you have partitions within the same consumer group; if you scale beyond the partition count, the extra consumers won't be able to read anything and will stay unassigned.
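For example, keying every event by user id keeps one user's events on one partition, and therefore in order (a short kafka-python sketch; the topic name, serializers, and payload are only illustrative):

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        key_serializer=str.encode,                          # user id -> bytes
        value_serializer=lambda v: json.dumps(v).encode(),  # payload dict -> JSON bytes
    )

    # The default partitioner hashes the key, so every event with the same
    # user id lands on the same partition and keeps its relative order.
    producer.send("webhook-events", key="user-42",
                  value={"url": "https://example.com/hook", "event": "order.created"})
    producer.flush()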

Unlike regular AMQP brokers, Kafka does not remove messages after you read them; it just tracks offsets per consumer group id. This lets you do several things with the same data at once, like calculating a real-time user count in a separate process.
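As an illustration, a separate process with a different group id can read the same topic independently, because offsets are tracked per consumer group (again a kafka-python sketch with made-up names):

    from kafka import KafkaConsumer

    # A process in group "realtime-user-count" would read the same topic
    # independently, without taking messages away from this dispatcher group.
    consumer = KafkaConsumer(
        "webhook-events",
        bootstrap_servers="localhost:9092",
        group_id="webhook-dispatcher",
        auto_offset_reset="earliest",
    )

    for record in consumer:
        print(record.partition, record.key, record.value)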

Southwards answered 31/8, 2021 at 13:31 Comment(3)
Not really what I am looking for. By the way, I have found a way to deal with this problem using RabbitMQ. Will post an answer a bit later. - Tricksy
@Tricksy I'm interested in your solution if you can find some time to share it here. - Expiratory
@Expiratory I've posted an answer to the question. - Tricksy