I am proposing two solutions here. The first is a proposed design (can be elaborated after further brainstorming) based on my experience and the second one is a short and quick solution. Have a look, ponder and you can choose whichever suits you.
THE LONG WAY OUT – CREATE IT FROM SCRATCH
If you are planning to create a fault-tolerant and a highly-available queue system, you would have to address the major challenge you are facing.
How to ensure that no messages are lost?
Know your producers and consumers: In order to design a solution, first we need to have knowledge of our producers and consumers. Single producer - single consumer. Single producer – multiple consumers. Multiple producers – multiple consumers.
The best approach would be to create a mechanism which caters to multiple producers - multiple consumers and in addition to that, is configurable to cater to any of the three scenarios.
Next question; how do we do that? Simple answer, if we can somehow create a configurable mechanism which has the ability to take multiple messages and broadcast it to multiple consumers. That mechanism also has the ability to read the configurations, validate messages (optional, you can add it in consumers too), store messages for a small duration, track acknowledgements, decompose one message to many, aggregate many messages to one, have an ‘action plan’ when dealing with timeouts or failures and implement that ‘action – plan’.
Elaborating the mechanism: Let us call this mechanism, a Broker. So in your solution the broker will be placed in the following manner. The solid arrows are messages, and the dotted ones are acknowledgements.
I am avoiding going into the detailed design of a broker here, as it will be out of context.
Handling failures: Identify the possible point of failures
1. Producers
2. Consumers
3. Broker
4. Network
Producer Failures: - If there is a replication, and alternate producers keep sending messages without impacting the functionality, the throughput may be affected, until the original producer is up and running again.
For Consumer Failures and Network Failures, the broker can maintain a mechanism which will keep the messages until an acknowledgement (let us call it ack, for the sake of brevity) is received. Once the ack is received, the message corresponding to the ack is removed.
The consumer has to deal with this scenario a little differently. Let us say, that Consumer keeps the following variables in the state
a. Last received message
b. Consumer state = (Active, Dormant, Re-Started).
The moment consumer is started, its value can be (RE-STARTED).The last received message of the consumer is updated with every message received from the broker, and the state is changed to ACTIVE. If consumer tries to send an ack to the broker and the connection times out, or there are issues with network, the state CHANGES to DORMANT, and it is preserved. For the two scenarios of RE-STARTED and DORMANT, a validation is performed whether the processing of the Last received message is done. If yes, it sends the ack again to the broker and waits for the next message. The moment, the next message is received, the state can be changed to ACTIVE and the processing can begin as normal.
The broker on the other hand just keeps the last sent message until the ack is received. To overcome the failures of the broker, a master-slave configuration can be prepared wherein, the state of the broker is replicated and the messages are re-directed to the other broker, in case the first one becomes unavailable.
THE SHORT SOLUTION:
Use a JMS like @Marcin proposed. I have personally worked upon RabbitMQ(http://previous.rabbitmq.com/v3_4_x/features.html) and feel that for most of the distributed computing scenarios, this will just work. You can configure high-availability (http://previous.rabbitmq.com/v3_4_x/ha.html) and it also comes with a nice user interface where you can monitor your queues and messages.
You are, however, encouraged to check out a JMS system that suits your needs.
Hope this helps
sendMail
cannot fail. However, the consumer can crash instantly after it, but before popping (due to e.g. a power failure). – Faunus