I have a few multi-core computers connected by an InfiniBand network. I would like to run low-latency computations on a pool of shared memory, using remote atomic operations. I know RDMA is the way to go. On each node I would allocate a protection domain and register a memory region in it for data sharing.
The online RDMA examples mostly focus on a single connection between a single-threaded server and a single-threaded client. I would now like to run a multi-threaded process on each InfiniBand node, and I am puzzled about the following:
How many queue pairs should I prepare on each node, for a cluster of n nodes and m threads in total? To be more specific, can multiple threads on the same node share the same queue pair?
How many completion queues should I prepare on each node? I will have multiple threads issuing remote read/write/CAS operations on each node. If they share a common completion queue, the completions from different threads will be mixed together. If each thread has its own completion queue, there will be a great many of them.
Would you recommend an existing library instead of writing this software myself? (Hmm, or should I write one and open-source it? :-)
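To make the completion-queue concern more concrete, here is one idea I have been considering, as a rough sketch only: every work request and every work completion in libibverbs carries a user-defined 64-bit `wr_id`, so a thread could pack its own index into that value, and whoever polls a shared completion queue could route each completion back to the issuing thread. The bit layout and the helper names `make_wr_id`, `wr_id_thread`, and `wr_id_seq` below are my own assumptions, not part of any library, and the snippet deliberately does not call libibverbs at all. I do not know whether this is a sound design, which is part of what I am asking.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout (my assumption, not a verbs convention):
 * top 16 bits  = index of the thread that posted the request,
 * low 48 bits  = per-thread sequence number. */
#define WRID_SEQ_BITS 48
#define WRID_SEQ_MASK ((UINT64_C(1) << WRID_SEQ_BITS) - 1)

/* Build the value to store in the wr_id field of a work request. */
static inline uint64_t make_wr_id(uint16_t thread_idx, uint64_t seq)
{
    assert(seq <= WRID_SEQ_MASK); /* sequence number must fit in 48 bits */
    return ((uint64_t)thread_idx << WRID_SEQ_BITS) | seq;
}

/* Recover the issuing thread from the wr_id reported in a completion. */
static inline uint16_t wr_id_thread(uint64_t wr_id)
{
    return (uint16_t)(wr_id >> WRID_SEQ_BITS);
}

/* Recover the per-thread sequence number from a completion's wr_id. */
static inline uint64_t wr_id_seq(uint64_t wr_id)
{
    return wr_id & WRID_SEQ_MASK;
}
```

With something like this, a poller draining the shared completion queue could call `wr_id_thread()` on each completion's `wr_id` and hand the result to that thread, but I am unsure whether that scales better than one completion queue per thread.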
Thank you for any suggestions.