OpenSSL SSL_write from multiple buffers / SSL_writev

Asked 5/7, 2016 at 8:28 Answered 30/12, 2023 at 5:28

I've written a networking server that uses OpenSSL for SSL/TLS. The server sends and receives large blocks of data and performs various transformations in between. For performance reasons, the transformations are done mainly on vector descriptors (see iovec from POSIX), which avoids expensive memory moves (memcpy() etc.). When the data are ready to be sent, I use the POSIX writev() function, which gathers the data from memory using these vectors and usually sends it as a single network packet.
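For reference, the gather-write described above looks roughly like this. It is a minimal sketch, assuming a blocking descriptor and iovcnt no larger than IOV_MAX; the helper name writev_all is hypothetical. It resumes after partial writes, which writev() is allowed to perform.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: write every byte described by iov[0..iovcnt) to fd.
 * writev() may write fewer bytes than requested, so we advance through the
 * array and retry. Modifies the caller's iov array in place.
 * Returns 0 on success, -1 on error (errno set by writev). */
int writev_all(int fd, struct iovec *iov, int iovcnt)
{
    while (iovcnt > 0) {
        ssize_t n = writev(fd, iov, iovcnt);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal: retry */
            return -1;
        }
        /* Skip the iovecs that were written completely. */
        while (iovcnt > 0 && (size_t)n >= iov->iov_len) {
            n -= (ssize_t)iov->iov_len;
            iov++;
            iovcnt--;
        }
        /* Trim the partially written iovec, if any. */
        if (iovcnt > 0) {
            iov->iov_base = (char *)iov->iov_base + n;
            iov->iov_len -= (size_t)n;
        }
    }
    return 0;
}
```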

With OpenSSL this is not entirely possible, because as far as I know OpenSSL offers only the SSL_write() function. That means I have to call this function for every vector entry I want to send. Unfortunately, this causes every vectored chunk of data to be transmitted in its own SSL record, which introduces unwanted and unnecessary network overhead.
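To put a number on that overhead: every TLS record carries a 5-byte header plus cipher-dependent expansion, and holds at most 16384 (2^14) bytes of plaintext. For AES-GCM the expansion is an 8-byte explicit nonce and a 16-byte tag in TLS 1.2, or a 1-byte inner content type and a 16-byte tag in TLS 1.3 (ignoring optional padding and TCP/IP headers). The sketch below compares one SSL_write() per vector against a single consolidated write; the function names are illustrative.

```c
#include <stddef.h>

#define TLS_MAX_PLAINTEXT 16384u  /* 2^14 bytes of plaintext per record */
#define TLS_HEADER_LEN 5u

/* Per-record expansion for AES-GCM suites, without record padding. */
#define TLS12_GCM_EXPANSION (TLS_HEADER_LEN + 8u + 16u) /* header + explicit nonce + tag = 29 */
#define TLS13_GCM_EXPANSION (TLS_HEADER_LEN + 1u + 16u) /* header + content type + tag = 22 */

/* Minimum number of records needed to carry len bytes in one write. */
static size_t records_for(size_t len)
{
    return len == 0 ? 0 : (len + TLS_MAX_PLAINTEXT - 1) / TLS_MAX_PLAINTEXT;
}

/* Overhead bytes when each chunk goes through its own SSL_write(). */
size_t overhead_per_chunk(const size_t *lens, size_t n, size_t expansion)
{
    size_t records = 0;
    for (size_t i = 0; i < n; i++)
        records += records_for(lens[i]);
    return records * expansion;
}

/* Overhead bytes when all chunks are consolidated into one SSL_write(). */
size_t overhead_consolidated(const size_t *lens, size_t n, size_t expansion)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += lens[i];
    return records_for(total) * expansion;
}
```

For four small chunks of 100, 200, 300 and 400 bytes under TLS 1.3, that is 88 bytes of record overhead versus 22 bytes for one consolidated write.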

My question is: Is there an SSL_writev() equivalent of writev()? Or, more generally, is there a technique to tell OpenSSL to stash SSL_write() data into one SSL application data record (type 23) without sending it (and then, of course, some kind of flush() function)?

Edit: As discussed below, a viable approach is to consolidate the vectored data into one big chunk prior to a final single SSL_write() call. However, this incurs the overhead of two copies (the first during consolidation, the second when SSL_write() performs the AES encryption). A theoretical SSL_writev() call would not introduce this overhead.
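The consolidation step from the edit above (the "pull up" discussed in the comments) can be sketched as follows. The helper name iov_pullup is hypothetical; this is the first of the two copies the question wants to eliminate.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical "pull up" helper: copy the bytes described by an iovec array
 * into one freshly allocated contiguous buffer. On success returns the
 * buffer (caller frees) and stores its length in *out_len; returns NULL on
 * allocation failure or size_t overflow. */
unsigned char *iov_pullup(const struct iovec *iov, int iovcnt, size_t *out_len)
{
    size_t total = 0;
    for (int i = 0; i < iovcnt; i++) {
        if (iov[i].iov_len > SIZE_MAX - total)
            return NULL;                    /* total would overflow */
        total += iov[i].iov_len;
    }
    unsigned char *buf = malloc(total ? total : 1);
    if (buf == NULL)
        return NULL;
    size_t off = 0;
    for (int i = 0; i < iovcnt; i++) {
        if (iov[i].iov_len > 0)             /* avoid memcpy from a NULL base */
            memcpy(buf + off, iov[i].iov_base, iov[i].iov_len);
        off += iov[i].iov_len;
    }
    *out_len = total;
    return buf;
}
```

The resulting buffer would then go to a single SSL_write() call. Since SSL_write() takes an int length, very large totals would need splitting, or SSL_write_ex() (OpenSSL 1.1.1 and later) with its size_t length.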

Arborvitae answered 5/7, 2016 at 8:28 Comment(12)

I think you will need a "pull up" function. I.e., one that combines multiple buffers into one. – Longshore 5/7, 2016 at 14:22

That's exactly what I do today. But it is quite expensive because it moves large chunks of data in memory. – Arborvitae 5/7, 2016 at 17:57

It seems like the large chunks amortize the cost of TCP overhead. Maybe you should allow the separate writes for large data, and only "pull up" smaller ones. Also, if you compile with -march=native and -O3, you should get the SSE4 and AVX versions of memcpy and memmove on modern hardware. They are lightning fast because they move 16, 32 and 64 bytes at a time. – Longshore 5/7, 2016 at 18:09
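The hybrid strategy suggested in this comment could look like the sketch below: chunks at or above a threshold are written straight from the caller's memory, while runs of smaller chunks are copied into a staging buffer and written together. The write_fn callback stands in for a wrapper around SSL_write_ex(); the function names, the callback type, and the counting sink are all hypothetical, and the threshold is a tuning assumption.

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Sink for one write, e.g. a wrapper around SSL_write_ex(). Returns 0 on success. */
typedef int (*write_fn)(void *ctx, const void *buf, size_t len);

/* Hypothetical hybrid writer: large chunks are written without copying,
 * consecutive small chunks are coalesced in `staging`. Requires
 * threshold <= staging_cap. Returns 0 on success, -1 on error. */
int hybrid_writev(const struct iovec *iov, int iovcnt, size_t threshold,
                  unsigned char *staging, size_t staging_cap,
                  write_fn fn, void *ctx)
{
    if (threshold > staging_cap)
        return -1;
    size_t used = 0;
    for (int i = 0; i < iovcnt; i++) {
        size_t len = iov[i].iov_len;
        int direct = len >= threshold;
        /* Flush staged bytes before a direct write (to keep ordering)
         * or when the next small chunk would not fit. */
        if (used > 0 && (direct || len > staging_cap - used)) {
            if (fn(ctx, staging, used) != 0)
                return -1;
            used = 0;
        }
        if (direct) {
            if (fn(ctx, iov[i].iov_base, len) != 0)
                return -1;
        } else if (len > 0) {
            memcpy(staging + used, iov[i].iov_base, len);
            used += len;
        }
    }
    if (used > 0 && fn(ctx, staging, used) != 0)
        return -1;
    return 0;
}

/* Example sink for illustration only: counts calls and bytes. */
struct write_stats { int calls; size_t bytes; };

static int count_writes(void *ctx, const void *buf, size_t len)
{
    struct write_stats *s = ctx;
    (void)buf;
    s->calls++;
    s->bytes += len;
    return 0;
}
```

With chunks of 10, 20, 5000 and 30 bytes and a 4096-byte threshold, this issues three writes: the first two small chunks together, the large chunk on its own, then the trailing small chunk.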

@ateska "But it is quite expensive b/c it moves a large chunks of data in the memory." On Linux, writev() is actually implemented as just a wrapper around write() that allocates a temporary buffer, copies the writev() buffers into the temp buffer, then calls write(). If you're running on Linux and writev() is working for you without SSL, just write your own SSL_writev() wrapper. – Vocalist 5/7, 2016 at 18:25

@AndrewHenle - good point about the writev() implementation. I was hoping that it uses the kernel's scatter/gather feature. – Arborvitae 5/7, 2016 at 18:59

@Longshore - agreed, efficient copying is important. Yet it still means that the SSL version will do 2 copies: the 1st is the "pull up", the 2nd is the AES (or similar) encryption during SSL write. Both can indeed be 'accelerated' by SSE4/AVX and AES-NI respectively. I'm looking to consolidate that into 1 copy; that is the original idea behind SSL_writev(). – Arborvitae 5/7, 2016 at 19:04

@ateska - the "SSL version will do 2 copies: 1st is "pull up", 2nd is AES (or similar) encryption..." is a slightly different requirement. You should edit your question and add that information. It's a good requirement and question, but I did not pay it any mind due to the wording of the current question (which I thought was closer to the scatter/gather you mentioned). – Longshore 5/7, 2016 at 19:17

@AndrewHenle fortunately this is not true, or performance would suffer. Here is Linux's writev() implementation; it uses only iovecs and low-level iterators: git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/… – Pound 14/1, 2019 at 6:53

@WillyTarreau It doesn't matter what the actual kernel system call may or may not do if the user-space implementation translates the program's writev() function call into the write() system call. And how do you know "performance would suffer"? The writev() kernel code must make multiple copies from user space to kernel space, one for each of the memory areas copied. The write() kernel code only has to copy from user space to kernel space once, and if the user is using direct IO, even that can be skipped. It's not clear performance would suffer at all. – Vocalist 15/1, 2019 at 10:29

@AndrewHenle please read the code I pointed, you will see that it does not make multiple copies but iterates over vectors. – Pound 18/3, 2019 at 3:32

First, I looked at the kernel code you linked. Again, that's irrelevant if the user-space program translates the writev() function to the write() system call, just like glibc does. Not only that, "iterating over a vector" is implemented by doing multiple copies of data. You need to dig deeper into the Linux kernel. Stopping at do_writev() isn't enough. You need to look at vfs_writev() and then do_iter_write(), and see that the code does multiple copies. – Vocalist 18/3, 2019 at 11:49

@AndrewHenle great to read this. I'm not actually sure how relevant the OP's copies even are in the first place. You have to decrypt, process and re-encrypt all that data. I imagine this one move is likely negligible, especially if you take advantage of SSE instructions (and that's without assuming he's entering the kernel often for other things like memory allocation). Is this something you'd agree with? I know x86 has AES instructions now, but even so, aren't we talking about a 5% performance increase at best here? Plus, even if he's sending jumbo frames, we're talking ~9 KB max; it's not much. – Pfeifer 30/12, 2023 at 5:04

My strategy would be to use SSL for key exchange, close the SSL session, and encrypt+send the stuff myself on the regular socket. You could still do all you had been doing before including SSL, but then also get to choose exactly how you'd be encrypting your data.

If you have the benefit of reasonably recent hardware, you have x86's AES instructions. So I personally would try my hand at some inlining and manual loop folding to see how crazy fast I could make it.

Security warning: While OpenSSL and AES (or any other industry-standard cryptography libraries/algorithms you may use) are by themselves pretty damn secure, doing this requires extreme attention and a good amount of knowledge. Using AES has many pitfalls, and cryptography is generally a hard subject with lots of hidden dangers. Many people would recommend you don't do this at all; I personally recommend at least trying to assess where you are on the Dunning-Kruger curve and gaining a full understanding of how OpenSSL does it and why.

Implementation Notes

These are pointers; they only go as far as what I know. The above warning should be taken very seriously. Even if you have a cryptographer with a 20-year career by your side, consider that 1. plenty of people do the same thing the wrong way their entire career, 2. even very smart people fall prey to considering only the scope they operate in (e.g. it's hard to consider side-channel attacks if you're just the person working the math and aren't particularly well versed in computer architecture), and 3. even if they're Einstein-level smart and well versed in all adjacent subjects, remember that Einstein never accepted quantum mechanics as a complete theory. We're all human and no human is perfect.

Key generation is quite possibly the most important thing here. If you do it wrong and someone reverse engineers your client, they could gain the ability to guess the keys of any communication session.
Obviously this means you shouldn't hardcode or reuse keys, but it also means you should be very careful about how you generate them in the first place.
If you use a pseudo-random number generator and you fail to seed it properly - either seeding it with a constant or seeding it with "current time" (as is often done in example code) - an attacker will be able to listen in on everything.
I would strongly advise using OpenSSL's key generation (and making damn sure you use it properly).
If you want to try your hand at DIYing it for fun, just want to learn a bit about it, or are in an environment where no existing implementation is available, here is OpenSSL's "Random Numbers" wiki page. Often there is an instruction or peripheral that derives random numbers from hardware entropy sources. True randomness is not a trivial matter: here is Tom Scott's video on Cloudflare's physical random number generator, which begins to illustrate the difficulty and importance of the problem.

Now let's say you properly implemented your key generation, and you transferred the key through the SSL channel. And let's say there are no inherent security risks in closing that channel and maintaining communication (can't tell you with certainty there aren't).

Now you're doing AES stream encryption. The stream part is very important: if you just encrypt every packet with the same key in the same way, then any data that's predictable across packets can be used to help break your encryption.
This is why OpenSSL uses specific modes of operation for AES that are considered secure. They are an added layer on top of the block cipher itself; counter-based modes such as GCM turn AES into a stream cipher and, with a unique nonce per record, mitigate the vulnerability described above.
In the case of TLS 1.3, the cipher suites use either the Galois/Counter Mode (GCM) or the Counter with CBC-MAC (CCM) variants of AES, which means that's likely what you should go with as well (there's also ChaCha20-Poly1305, but then you don't benefit from the dedicated AES hardware instructions).
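One concrete rule from TLS 1.3 (RFC 8446, section 5.3) shows how a GCM nonce is kept unique without sending it on the wire. The 64-bit record sequence number is encoded big-endian, left-padded with zeros to the 12-byte IV length, and XORed with the static write IV derived during key exchange. A sketch of that construction (the function name is hypothetical). The property that matters is that the sequence number never repeats under the same key.

```c
#include <stdint.h>
#include <string.h>

#define GCM_IV_LEN 12

/* Build the per-record nonce as TLS 1.3 does:
 * static_iv XOR (zero-padded, big-endian 64-bit sequence number).
 * A (key, nonce) pair must never be reused with GCM, so `seq` must
 * strictly increase and the key must be replaced before it wraps. */
void record_nonce(const uint8_t static_iv[GCM_IV_LEN], uint64_t seq,
                  uint8_t nonce[GCM_IV_LEN])
{
    memcpy(nonce, static_iv, GCM_IV_LEN);
    /* XOR the sequence number into the last 8 bytes, least significant last. */
    for (int i = 0; i < 8; i++)
        nonce[GCM_IV_LEN - 1 - i] ^= (uint8_t)(seq >> (8 * i));
}
```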

Additionally you have to make sure things like packet size don't inherently give away important information.
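One common mitigation is to pad every plaintext up to a fixed bucket size before encrypting it, so that the ciphertext length reveals only the bucket. A sketch of the length calculation; the helper name and any particular bucket size are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: round len up to the next multiple of `bucket`.
 * An empty message still occupies one bucket, so "nothing" looks the same
 * as "something small". Returns 0 for bucket == 0 or on overflow. */
size_t padded_length(size_t len, size_t bucket)
{
    if (bucket == 0)
        return 0;
    size_t buckets = len / bucket + (len % bucket != 0);
    if (buckets == 0)
        buckets = 1;
    if (buckets > SIZE_MAX / bucket)
        return 0;
    return buckets * bucket;
}
```

If you stay on TLS 1.3 with OpenSSL 1.1.1 or later, SSL_CTX_set_block_padding() applies the same idea to the records OpenSSL produces.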

Lastly, the implementation code itself has to be bulletproof. This goes for every level of your application, but remember that you're implementing a protocol. If your logic is funky or you mismanage a buffer, the consequences can range from leaking your private keys to giving the attacker root access to your entire system and network, and them using it to ruin your uranium enrichment centrifuges (see Stuxnet). Something like this even happened with OpenSSL itself (see Heartbleed).
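Heartbleed illustrates the buffer point well: the heartbeat handler trusted a length field sent by the peer and echoed back that many bytes of memory, even when the actual message was shorter. A minimal sketch of the defensive pattern, assuming a hypothetical wire format of a 2-byte big-endian length followed by the payload:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Parse [len: 2 bytes, big-endian][payload: len bytes] from untrusted input.
 * The declared length is checked against both the bytes actually received
 * and the destination capacity before anything is copied.
 * Returns the payload length on success, -1 on malformed or oversized input. */
long parse_length_prefixed(const uint8_t *in, size_t in_len,
                           uint8_t *out, size_t out_cap)
{
    if (in_len < 2)
        return -1;                          /* not even a full header */
    size_t declared = ((size_t)in[0] << 8) | in[1];
    if (declared > in_len - 2)
        return -1;                          /* peer claims more than it sent */
    if (declared > out_cap)
        return -1;                          /* would overflow our buffer */
    memcpy(out, in + 2, declared);
    return (long)declared;
}
```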

Pfeifer answered 30/12, 2023 at 5:28 Comment(0)


You can use a BIO_f_buffer() to achieve this. Wrap your network layer BIO in a BIO_f_buffer() filter BIO and set that as the write BIO for your SSL object. This will cause all data written out to stay in the buffer until you issue a flush on it.

Demmer answered 5/7, 2016 at 19:18 Comment(7)

BTW, you may want to wait to do this until after the handshake has completed - otherwise you will have to manually issue "flush" commands for each flight of handshake messages that are exchanged. – Demmer 5/7, 2016 at 19:22

Agree, but it will still create an SSL application record (type 23) for each call of SSL_write. That means an unwanted network overhead that I want to avoid. – Arborvitae 5/7, 2016 at 21:15

Yes it will - although from your description it sounded like your main concern was to avoid multiple network packets. That is different to a TLS record. Multiple records can be contained within a single TCP packet, or split up across many. By buffering and flushing in the way I propose that gives the network layer the best opportunity to transmit the data across the network in the most efficient way possible. The overhead is then limited to the additional record header bytes (5 bytes) plus the MAC size (depends on the ciphersuite). If your aim is to reduce the number of records then... – Demmer 5/7, 2016 at 21:58

...you can do the same thing in reverse, i.e. put a BIO_f_buffer() in front of an SSL BIO and flush it through to the SSL layer when you are ready. – Demmer 5/7, 2016 at 21:59

The general problem I have with this answer is that there is an unnecessary sacrifice in either the network overhead or a mem-copy operation. My goal is to avoid both (because I have a working solution on this level already). From an architectural point of view, it is quite straightforward: ... – Arborvitae 13/8, 2016 at 14:21

I have an unencrypted buffer in scatter/gather form because of the app-specific data transformations. I can use a symmetric encryption function (e.g. AES) to 'walk' that scatter/gather buffer (see the writev function), do its job and put the output into another buffer, in one go, as one SSL record (type 23). No memcpy, no network overhead. To my current knowledge, no existing SSL implementation offers this feature. – Arborvitae 13/8, 2016 at 14:21

There's kernel TLS support now, starting from Linux 4.13, which allows you to literally use writev() and seamlessly encrypt outgoing traffic, with certain limitations. Details here: lwn.net/Articles/666509 – Salmi 18/12, 2017 at 18:51
