RDMA memory sharing

I have a few multi-core computers connected by Infiniband network. I would like to have some low-latency computation on a pool of shared memory, with remote atomic operations. I know RDMA is the way to go. On each node I would register a memory region (and protection domain) for data sharing.

The online RDMA examples often focus at a single connection between a single-threaded server and a single-threaded client. Now I would like to have a multi-threaded process on each of the Infiniband node. I am very puzzled about the following...

How many queue pairs should I prepare on each node, for a cluster of n nodes and m threads in total? To be more specific, can multiple threads on the same node share the same queue pair?
How many completion queues should I prepare on each node? I will have multiple threads issuing remote read/write/cas operations on each node. If they were to share a common completion queue, the completion events will be mixed up. If the threads have their own separated completion queues, there would be really a lot of them.
Do you suggest me to have any existing libraries instead of writing this software? (hmm, or I should write one and open-source it? :-)

Thank you for your kind suggestion(s).

On Linux at least, the InfiniBand verbs library is completely thread-safe. So you can use as many or as few queue pairs (QPs) in your multi-threaded app as you want -- multiple threads can post work requests to a single QP safely, although of course you will have to make sure whatever tracking of outstanding requests, etc. that you do in your own application is thread-safe.

It is true that each send queue and each receive queue (remember that QP is really a pair of queues :) is attached to a single completion queue (CQ). So if you want each thread to have its own CQ then each thread will need its own QP to submit work into.

In general QPs and CQs are not really a limited resource -- you can easily have hundreds or thousands on a single node without trouble. So you can design your app without worrying too much about the absolute number of queues you're using. This is not to say you don't have to worry about scalability -- for example if you have a lot of receive queues and a lot of buffers per queue, then you may tie up too much memory in receive buffering, so you end up needing to use shared receive queues (SRQs).

There are a number of middleware libraries that use IB; probably MPI (eg http://open-mpi.org/) is the best-known one, and it's probably worth evaluating that before you get too far into reinventing things. The MPI developers have also published a lot of research about using IB/RDMA efficiently, which is probably worth seeking out in case you do decide to build your own system.

Recommended topics

Hot tags