I am currently looking for a good middleware to build a solution to for a monitoring and maintenance system. We are tasked with the challenge to monitor, gather data from and maintain a distributed system consisting of up to 10,000 individual nodes.
The system is clustered into groups of 5-20 nodes. Each group produces data (as a team) by processing incoming sensor data. Each group has a dedicated node (blue boxes) acting as a facade/proxy for the group, exposing data and state from the group to the outside world. These clusters are geographically separated and may connect to the outside world over different networks (one may run over fiber, another over 3G/Satellite). It is likely we will experience both shorter (seconds/minutes) and longer (hours) outages. The data is persisted by each cluster locally.
This data needs to be collected (continuously and reliably) by external & centralized server(s) (green boxes) for further processing, analysis and viewing by various clients (orange boxes). Also, we need to monitor the state of all nodes through each groups proxy node. It is not required to monitor each node directly, even though it would be good if the middleware could support that (handle heartbeat/state messages from ~10,000 nodes). In case of proxy failure, other methods are available to pinpoint individual nodes.
Furthermore, we need to be able to interact with each node to tweak settings etc. but that seems to be more easily solved since that is mostly manually handled per-node when needed. Some batch tweaking may be needed, but all-in-all it looks like a standard RPC situation (Web Service or alike). Of course, if the middleware can handle this too, via some Request/Response mechanism that would be a plus.
Requirements:
- 1000+ nodes publishing/offering continuous data
- Data needs to be reliably (in some way) and continuously gathered to one or more servers. This will likely be built on top of the middleware using some kind of explicit request/response to ask for lost data. If this could be handled automatically by the middleware this is of course a plus.
- More than one server/subscriber needs to be able to be connected to the same data producer/publisher and receive the same data
- Data rate is max in the range of 10-20 per second per group
- Messages sizes range from maybe ~100 bytes to 4-5 kbytes
- Nodes range from embedded constrained systems to normal COTS Linux/Windows boxes
- Nodes generally use C/C++, servers and clients generally C++/C#
- Nodes should (preferable) not need to install additional SW or servers, i.e. one dedicated broker or extra service per node is expensive
- Security will be message-based, i.e. no transport security needed
We are looking for a solution that can handle the communication between primarily proxy nodes (blue) and servers (green) for the data publishing/polling/downloading and from clients (orange) to individual nodes (RPC style) for tweaking settings.
There seems to be a lot of discussions and recommendations for the reversed situation; distributing data from server(s) to many clients, but it has been harder to find information related to the described situation. The general solution seems to be to use SNMP, Nagios, Ganglia etc. to monitor and modify large number of nodes, but the tricky part for us is the data gathering.
We have briefly looked at solutions like DDS, ZeroMQ, RabbitMQ (broker needed on all nodes?), SNMP, various monitoring tools, Web Services (JSON-RPC, REST/Protocol Buffers) etc.
So, do you have any recommendations for an easy-to-use, robust, stable, light, cross-platform, cross-language middleware (or other) solution that would fit the bill? As simple as possible but not simpler.