How is TCP (kernel) bypass implemented?

Suppose I want to avoid the overhead of the Linux kernel in handling incoming packets and instead grab each packet directly from user space. I have searched around a bit, and it seems that all one needs to do is use raw sockets with some socket options. Is this the case, or is it more involved than this? If it is, what can I search for or reference in order to implement something like this?

Varicocele answered 30/7, 2012 at 20:43 Comment(4)
What is "the overhead" you are trying to avoid, specifically? - Vandenberg
Context switching away from my application when the kernel handles network packets. Also, I am curious about how this is implemented. - Varicocele
@PalaceChan: as already said, to intercept packets you need root privileges (and you need to talk to the kernel), because all packets traverse the kernel networking stack. - Genesia
I've been reading about things like TCP offload, InfiniBand, and RDMA. It still seems hazy, though, as some of this stuff appears proprietary (like InfiniBand). So it seems kernel bypass requires very specific hardware? - Varicocele

There are many techniques for networking with kernel bypass.

First, if you are sending messages to another process on the same machine, you can do so through a shared memory region with no jumps into the kernel.
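
As a rough illustration of the same-machine case, here is a minimal sketch using Python's `multiprocessing.shared_memory`. The length-prefixed layout and the helper names (`write_message`, `read_message`, `demo`) are my own assumptions for the example, not a standard protocol. Setting up the mapping does go through the kernel; the point is that once the region is mapped, passing data is just a memory copy. A real design would also need some synchronization between writer and reader (for example a sequence counter or a ring buffer), which is left out here.

```python
import struct
from multiprocessing import shared_memory

# Illustrative layout (an assumption): 4-byte little-endian length, then payload.
HEADER = struct.Struct("<I")


def write_message(buf, payload: bytes) -> None:
    """Copy a length-prefixed message into the shared region."""
    HEADER.pack_into(buf, 0, len(payload))
    buf[HEADER.size:HEADER.size + len(payload)] = payload


def read_message(buf) -> bytes:
    """Read the length prefix, then copy the payload out of the region."""
    (length,) = HEADER.unpack_from(buf, 0)
    return bytes(buf[HEADER.size:HEADER.size + length])


def demo() -> bytes:
    writer = shared_memory.SharedMemory(create=True, size=4096)
    try:
        # A second process would attach using the same name. It is attached
        # in-process here only to keep the sketch self-contained.
        reader = shared_memory.SharedMemory(name=writer.name)
        try:
            write_message(writer.buf, b"hello")
            return read_message(reader.buf)
        finally:
            reader.close()
    finally:
        writer.close()
        writer.unlink()
```

Calling `demo()` returns the bytes that were written through the first mapping and read back through the second.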

Passing packets over a network without involving the kernel gets more interesting, and involves specialized hardware that gets direct access to user memory. This idea is called remote direct memory access (RDMA).

Here's one way it can work (this is what InfiniBand hardware does). The application registers a memory buffer with the RDMA hardware. This buffer is pinned in physical memory, because the hardware keeps writing to the physical memory region and swapping it out would be bad. A control region is also mapped into userspace memory. When an application is ready to use the buffer to send or receive a message, it writes a command to the control region. The hardware takes the data from a registered buffer on one end and places it into another registered buffer at the other end.
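
To make the sequence of steps concrete, here is a toy in-memory model of that flow. It is not real RDMA and does not use any real library: `ToyAdapter`, `register`, `post_recv`, `post_send`, and the completion tuples are all hypothetical names invented for this sketch, and plain `bytearray` objects stand in for pinned buffers. It only shows the order of operations: register buffers, post a receive, post a send command, and let the "hardware" move the bytes.

```python
from collections import deque


class ToyAdapter:
    """Toy stand-in for an RDMA adapter. Every name here is hypothetical."""

    def __init__(self):
        self.registered = {}        # key -> bytearray (stands in for pinned memory)
        self.recv_queue = deque()   # keys of buffers posted for receiving
        self.completions = deque()  # finished work, polled by the application
        self.peer = None            # the adapter at the other end of the link
        self._next_key = 1

    def register(self, buffer: bytearray) -> int:
        """Register a buffer and return a key that later commands refer to."""
        key = self._next_key
        self._next_key += 1
        self.registered[key] = buffer
        return key

    def post_recv(self, key: int) -> None:
        """Tell the adapter that a registered buffer may receive a message."""
        if key not in self.registered:
            raise KeyError(f"buffer {key} is not registered")
        self.recv_queue.append(key)

    def post_send(self, key: int, length: int) -> None:
        """Model of a send command: copy into the peer's posted receive buffer."""
        src = self.registered[key]
        dst_key = self.peer.recv_queue.popleft()
        dst = self.peer.registered[dst_key]
        if length > len(src) or length > len(dst):
            raise ValueError("message does not fit in the registered buffers")
        dst[:length] = src[:length]
        self.completions.append(("send", key, length))
        self.peer.completions.append(("recv", dst_key, length))


def demo():
    a, b = ToyAdapter(), ToyAdapter()
    a.peer, b.peer = b, a
    send_buf = bytearray(b"ping" + bytes(60))
    recv_buf = bytearray(64)
    send_key = a.register(send_buf)
    recv_key = b.register(recv_buf)
    b.post_recv(recv_key)        # receiver makes a buffer available first
    a.post_send(send_key, 4)     # sender issues the command
    return bytes(recv_buf[:4]), b.completions.popleft()
```

In this model the application never copies the payload itself: it only issues commands and later polls for completions, which is the part that mirrors the description above.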

Clearly, this is too low level, so there are abstractions that make programming RDMA hardware easier. OFED verbs are one such abstraction.

The InfiniBand software stack has one extra interesting bit: the Sockets Direct Protocol (SDP) that is used for compatibility with existing applications. It works by inserting an LD_PRELOAD shim that translates standard socket API calls into IB verbs.

InfiniBand is just what I'm most familiar with. RoCE/iWARP hardware is very similar from the programmer's perspective, but uses a different transport than InfiniBand (TCP using an offload engine in iWARP, Ethernet in RoCE). There are/were also other approaches to RDMA (Quadrics, for example).

Kennethkennett answered 30/7, 2012 at 21:50 Comment(6)
Is RDMA (being a hardware feature of the NIC) the only "proprietary" aspect of doing this? For example, IB verbs does not seem to be open source, and neither does OFED. - Varicocele
The OFED stack is open source (on Linux at least; I haven't worked with the Windows version). Part of it is in the kernel, the rest is provided by libibverbs and friends. Vendors such as Mellanox have their own versions of the OFED stack that include proprietary tweaks. Also, quite a lot of magic happens in hardware, and its firmware is closed source. - Kennethkennett
How can hardware reach the memory of a user-space process? Does it use physical or virtual addresses? If virtual, who points the IOMMU at the page table of the current process? - Thies
@Thies Depends on the vendor. IB hardware does not do address translation and uses physical addresses, hence the need to pin memory buffers. This is done in the kernel portion of the IB HCA driver, and is an expensive operation (this is the main reason unexpected queues are needed in MPI over IB, and true zero-copy is only used for large messages). Quadrics had an MMU on the card. - Kennethkennett
@Greg Inozemtsev, when the card uses an IOMMU, how will it know which virtual address space to use? (Each process in Unix runs in its own virtual address space.) - Thies
@Thies I'm not too familiar with Quadrics, but I know that using it without pinning buffers required a patched kernel. They basically synchronized with the card's MMU any time the host page tables changed. - Kennethkennett
