How to reliably process a queue?
Asked Answered
F

6

8

Take this conceptually simple task: consuming a queue, sending an email for each entry.

A simple approach would be:

while true:
   entry = queue.pop()
   sendMail();

The problem here is, if the consumer crashes after popping but before/during sending the mail, a mail is lost. So you change it to:

while true:
   entry = queue.peek()
   sendMail();
   queue.pop();

But now, if the consumer crashes after mailing, but before popping, the mail will be sent again when the consumer comes back up.

What's the best-practice way of handling this problem?

Sending email is just an example here which be substituted for any mission critical task. Also assume that the popping of the queue is the only record of the mail having been sent, so the mail subsystem does not record anything itself.

Faunus answered 12/4, 2016 at 9:8 Comment(8)
Your problem is not so much a problem of processing a queue but of distinguishing the states of having the mail sent or not. If you can distinguish these states, eg by catching different exceptions from call to sendMail, you can decide whether to pop or not.Inhibitor
@WolfS Catching exceptions is not a hard problem, but let's assume sendMail cannot fail. However, the consumer can crash instantly after it, but before popping (due to e.g. a power failure).Faunus
Then one person gets two emails. If this is something mission critical, would assume the servers are located in a data center with back up power and have UPSs. And in the event the UPS power is needed the servers have have been configured to shut down gracefully.Fronia
@Fronia Sending mails is just an example, assume they are mission critical. Preventing a failure with UPS etc. is nice, but I'm wondering if there's a common way of dealing with the inevitable failure that occurs at some point in time.Faunus
I think the general question here is how to wrap a transaction around two potentially-unrelated actions or commands.Outspan
Perhaps a better example is a financial transaction where a customer is charged and a product order is placed. The software cannot be allowed to charge a customer without placing an order, nor can it be allowed to place an order without charging the customer. How can we guarantee these actions always occur together, despite potentially-random hardware or software failures?Outspan
You say sending a mail is only an example. But whether you send an email or update a database makes a big difference. That is because what you need (as already mentioned) is a transaction. And mail servers do not support transactions. You may have even one or more mail relay hosts in between which each one could fail without you ever getting noticed. Transactions always have the possibility to roll back if anything failed or even if you are unsure whether it failed or not. You can not roll back a sent email. And what if mail is delivered, but reader dies while reading the first line ;-)Inhibitor
I'd say Two Generals Problem. Can't be solved perfectly, decide on your compromise :)Resistive
R
3

Doesn't your requirement seem like trying to solve the two generals problem (which doesn't have a deterministic solution/limit)? https://en.wikipedia.org/wiki/Two_Generals%27_Problem

Peek - Process - Remove

You want to only remove when you've ensured successful processing, and ensure proper removal. Well any of those messages can be lost/programs can crash at any step.

Most robust messaging queues rely on a set of acks + repeated tries (deliveries) to get the desired behavior (until the acks come back).

But it's actually impossible to guarantee perfect behavior in every scenario. You just have to end up weighing odds and make an engineering compromise between repeated (atleast attempted) processing and "never" (infinite memory etc) losing a message - specific to your actual application needs. Again not a new problem :), and unlikely you'll need to write scratch code for it - like i mentioned, most MQs solve this exact problem.

Resistive answered 20/4, 2016 at 21:22 Comment(1)
I had a feeling this isn't fundamentally solvable, but it's good to know this has been proven. So it's all about mitigating the risks then.Faunus
A
3

I am proposing two solutions here. The first is a proposed design (can be elaborated after further brainstorming) based on my experience and the second one is a short and quick solution. Have a look, ponder and you can choose whichever suits you.

THE LONG WAY OUT – CREATE IT FROM SCRATCH

If you are planning to create a fault-tolerant and a highly-available queue system, you would have to address the major challenge you are facing.

How to ensure that no messages are lost?

Know your producers and consumers: In order to design a solution, first we need to have knowledge of our producers and consumers. Single producer - single consumer. Single producer – multiple consumers. Multiple producers – multiple consumers. The best approach would be to create a mechanism which caters to multiple producers - multiple consumers and in addition to that, is configurable to cater to any of the three scenarios.

Next question; how do we do that? Simple answer, if we can somehow create a configurable mechanism which has the ability to take multiple messages and broadcast it to multiple consumers. That mechanism also has the ability to read the configurations, validate messages (optional, you can add it in consumers too), store messages for a small duration, track acknowledgements, decompose one message to many, aggregate many messages to one, have an ‘action plan’ when dealing with timeouts or failures and implement that ‘action – plan’.

Elaborating the mechanism: Let us call this mechanism, a Broker. So in your solution the broker will be placed in the following manner. The solid arrows are messages, and the dotted ones are acknowledgements.
Broker Block Diagram

I am avoiding going into the detailed design of a broker here, as it will be out of context.

Handling failures: Identify the possible point of failures 1. Producers 2. Consumers 3. Broker 4. Network

Producer Failures: - If there is a replication, and alternate producers keep sending messages without impacting the functionality, the throughput may be affected, until the original producer is up and running again.

For Consumer Failures and Network Failures, the broker can maintain a mechanism which will keep the messages until an acknowledgement (let us call it ack, for the sake of brevity) is received. Once the ack is received, the message corresponding to the ack is removed.

The consumer has to deal with this scenario a little differently. Let us say, that Consumer keeps the following variables in the state a. Last received message b. Consumer state = (Active, Dormant, Re-Started).

The moment consumer is started, its value can be (RE-STARTED).The last received message of the consumer is updated with every message received from the broker, and the state is changed to ACTIVE. If consumer tries to send an ack to the broker and the connection times out, or there are issues with network, the state CHANGES to DORMANT, and it is preserved. For the two scenarios of RE-STARTED and DORMANT, a validation is performed whether the processing of the Last received message is done. If yes, it sends the ack again to the broker and waits for the next message. The moment, the next message is received, the state can be changed to ACTIVE and the processing can begin as normal.

The broker on the other hand just keeps the last sent message until the ack is received. To overcome the failures of the broker, a master-slave configuration can be prepared wherein, the state of the broker is replicated and the messages are re-directed to the other broker, in case the first one becomes unavailable.

THE SHORT SOLUTION:

Use a JMS like @Marcin proposed. I have personally worked upon RabbitMQ(http://previous.rabbitmq.com/v3_4_x/features.html) and feel that for most of the distributed computing scenarios, this will just work. You can configure high-availability (http://previous.rabbitmq.com/v3_4_x/ha.html) and it also comes with a nice user interface where you can monitor your queues and messages.

You are, however, encouraged to check out a JMS system that suits your needs.

Hope this helps

Axial answered 20/4, 2016 at 7:18 Comment(2)
Most JMS queues (and RabbitMQ in particular) have an "at least once" delivery guarantee by default, which doesn't solve the OP's question. Even with the JMS in transactional mode, it doesn't solve the OP's problem: instead of "delivered at least once," transactional mode for the JMS means "successfully processed and acknowledged exactly once." But since email can't be sent transactionally (let's assume), the system is still susceptible to the failure scenario the OP describes.Actinomycin
Thanks for your answer, but JMS can hardly be the solution to a language-agnostic problem :) In fact I have a system with JMS in mind, but it doesn't change the problem, since that just makes the MQ the "producer" here.Faunus
R
3

Doesn't your requirement seem like trying to solve the two generals problem (which doesn't have a deterministic solution/limit)? https://en.wikipedia.org/wiki/Two_Generals%27_Problem

Peek - Process - Remove

You want to only remove when you've ensured successful processing, and ensure proper removal. Well any of those messages can be lost/programs can crash at any step.

Most robust messaging queues rely on a set of acks + repeated tries (deliveries) to get the desired behavior (until the acks come back).

But it's actually impossible to guarantee perfect behavior in every scenario. You just have to end up weighing odds and make an engineering compromise between repeated (atleast attempted) processing and "never" (infinite memory etc) losing a message - specific to your actual application needs. Again not a new problem :), and unlikely you'll need to write scratch code for it - like i mentioned, most MQs solve this exact problem.

Resistive answered 20/4, 2016 at 21:22 Comment(1)
I had a feeling this isn't fundamentally solvable, but it's good to know this has been proven. So it's all about mitigating the risks then.Faunus
D
1

You could use JMS queue. It gives you transacts. Message will go down from the queue when it has been processed correctly.

Digital answered 15/4, 2016 at 9:12 Comment(5)
What is "processed correctly"? What if it has been processed, but the message about it having been processed was not sent by the consumer?Faunus
Adding to queue and getting message from queue is transactional. "processed correctly" means that transaction has completed with successDigital
@MarcinSzymczak, the JMS transaction only guarantees that a consumer has received the message. If the consumer crashes halfway through its business logic, the message may be lost even though the JMS transaction was successful.Outspan
You can also unbound it from transaction and send confirmation manually whenever you want.Digital
Manual confirmation trades one problem for another: if the consumer crashes while attempting to send a confirmation, the processed message will be processed again. This is only acceptable when the processing is idempotent and relatively inexpensive.Outspan
K
1

If you are serious about this statement:

"Also assume that the popping of the queue is the only record of the mail having been sent,"

Then you cannot guarantee the reliable processing property.

Proof: Say I have a hypothetical program that guarantees the reliable processing property. Run the program once, and as soon as it tries to send a email we "intervene" by causing the email to fail and killing the thread simultaneously. Then, run the program (it will respawn a new thread I suppose) until the program determines that the email was sent or not (this must be a point in the program, otherwise the program runs indefinitely without making progress.) Now say we recorded the operation of the program and play it back in a parallel universe (any calls to random number generators should return the same) where we intervened by causing the email to succeed and killing the thread simultaneously. The program must conclude the same thing as it did before, which is a contradiction since the program was supposed to guarantee the reliable processing property without having any records of whether an email was sent or not, and in one of the simulations it must have been wrong about whether the email was sent or not.

Here's a solution that uses an email system that will tell you whether you have sent an email, and that will send an email only if you have not already sent one.

while true:
    task = queue.peek()
    if (task_email_sent_already(task)){
         //Then we failed after emailing but before pop
         goto pop_step;
    }
    if (task.done){
         //Then we failed after doing the task but before sending email
         goto email_step;
    }
    //run_task needs to be written transactionally to set task.done on completion. 
    //Think transactional memory with persistent logging.
    run_task(task); 
    LABEL email_step
    send_email_if_already_not_sent(task);
    LABEL pop_step
    queue.pop()

It's important that send_email_if_already_not_sent does not send two emails if called twice for the same task, as otherwise the above code could result in a duplicate email (if the email succeeds, but there is some "lag time" before task_email_already_sent returns true.)

If you make some assumption on how long it can be from where an email succeeds but before task_email_already_sent returns true, i.e. you assume that no email takes longer than 5 seconds to run, then you can just write in a local log that you sent an email at a certain time X, and you spin until time X+5 seconds before checking task_email_sent_already. But of course this is risky since you may send a duplicate email if some email does take more than 5 seconds to send.

Kelleykelli answered 21/4, 2016 at 1:5 Comment(0)
E
0

You have to divide responsibilities. Main actions that you perform:

  • Pick an email
  • Send the email
  • Remove the email from queue on success
  • Resend the email on failure

Let's assign actions to the responsible objects:

  • Pick — to consumer
  • Send — to sender
  • Remove — to result handler
  • Resend — to result handler

So, based on this, you need to have: a separate sender that only sends an email and notifies about the status of the operation; a separate consumer of emails that knows what to do and from where to get data; a separate handler of operation results.

It should work like this:

  1. Consumer picks an email from a queue. Is there any exceptions? Consumer handles them. Otherwise to the next step.

  2. Consumer passes an email to sender. Sender sends the email. Is there any exceptions? Sender notifies about that anybody who is subscribed if it can't handle the exception by itself. If there is no exceptions, notify that an email has been sent successfully.

  3. Handler receives a notification. Is there any exceptions? Do nothing. If there are no exceptions, remove an email from the queue. If removing an email cause an exception to be raised, then it is a bug and should be fixed.

  4. Consumer picks another email...

    There is another alternative, however. Handler and consumer could be united in one entity, but this approach leads to obvious obstacles. Also, you can have two queues: the first one should contain emails that need to be send, and the other one — passed to a sender. They a only passed, we don't know are they sent already or not. As before, sender should be a separate entity and should notify about successful send operation. When consumer notified, it removes an email from the first queue and from the second. If there is an exception in consumer, it looks to the second queue, finds unsent emails and resends them.

Asynchronous interaction and single responsibility are our best friends.

Everson answered 18/4, 2016 at 14:50 Comment(1)
I don't think you grasped OP's problem; it is about the communication, not handling possible exceptions that might occur.Yon
E
0

The problem of the producer is that it does not know if a message has been processed entirely or not. So: let the consumer confirm that the email has been arrived and only pop the mail then.

One little problem left: The message has been processed, but confirmation could not be sent any more. To solve this, provide for each mail a unique id that is sent together with the mail. The consumer can identify duplicates then (however, it has to persist the latest received id somehow so that this approach survives a crash...).

Endomorphism answered 21/4, 2016 at 12:12 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.