My strategy would be to use SSL for the key exchange, close the SSL session, and then encrypt and send the data myself over the regular socket. You could still do everything you were doing before, including SSL, but you'd also get to choose exactly how your data is encrypted.
If you have reasonably recent x86 hardware, you have the AES-NI instructions. So I'd personally try my hand at some inlining and manual loop unrolling to see how crazy fast I could make it.
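Before spending time hand-tuning AES code, it's worth confirming the CPU actually has those instructions. A minimal sketch, assuming Linux on x86, where the kernel lists an "aes" flag in /proc/cpuinfo when AES-NI is present; the function name has_aes_ni() is made up for illustration.

```python
def has_aes_ni(cpuinfo_path: str = "/proc/cpuinfo"):
    """Return True/False if an x86 'flags' line is found, or None if it can't be determined."""
    try:
        with open(cpuinfo_path, encoding="utf-8", errors="replace") as f:
            for line in f:
                # x86 Linux lists CPU features as: "flags\t\t: fpu vme ... aes ..."
                if line.startswith("flags") and ":" in line:
                    return "aes" in line.split(":", 1)[1].split()
    except OSError:
        # Not Linux, or /proc is unavailable: we simply don't know.
        return None
    # No x86-style flags line (e.g. ARM, which reports "Features" instead).
    return None


if __name__ == "__main__":
    print("AES-NI available:", has_aes_ni())
```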
Security warning: While OpenSSL and AES (or any other industry-standard cryptography libraries or algorithms you may use) are pretty damn secure by themselves, doing this requires extreme attention and a good amount of knowledge. Using AES has many pitfalls, and cryptography in general is a hard subject with lots of hidden dangers. Many people would recommend you don't do this at all; I personally recommend at least trying to assess where you are on the Dunning-Kruger curve and gaining a full understanding of how OpenSSL does it and why.
Implementation Notes
These are pointers, and they only go as far as what I know. The above warning should be taken very seriously. Even if you have a cryptographer with a 20-year career by your side, consider that 1. plenty of people do the same thing the wrong way their entire career, 2. even very smart people fall prey to considering only the scope they operate in (e.g. it's hard to consider side-channel attacks if you're just the guy working on the math and aren't particularly well versed in computer architecture), and 3. even if they're Einstein-level smart and well versed in all adjacent subjects, remember that Einstein never accepted quantum mechanics as a complete theory. We're all human, and no human is perfect.
Key generation is quite possibly the most important thing here. If you do it wrong and someone reverse engineers your client, they could gain the ability to guess the keys of any communication session.
Obviously this means you shouldn't hardcode or reuse keys, but it also means you should be very careful about how you generate them in the first place.
If you use a pseudo-random number generator and fail to seed it properly (seeding it with a constant, or with the "current time" as is often done in example code), an attacker will be able to listen in on everything.
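To make the time-seeding problem concrete, here is a minimal sketch of the anti-pattern and the matching attack, using Python's general-purpose random module (a Mersenne Twister, not meant for cryptography). For simplicity the "attacker" compares each candidate against the key itself; in practice they would test each candidate by trying to decrypt captured traffic. The function names are hypothetical.

```python
import random
import time


def weak_key_from_time(seed: int) -> bytes:
    """ANTI-PATTERN: a 128-bit 'key' from a general-purpose PRNG seeded with a timestamp."""
    rng = random.Random(seed)  # Mersenne Twister: fine for simulations, not for keys
    return rng.getrandbits(128).to_bytes(16, "big")


def recover_seed(observed_key: bytes, around: int, window: int = 3600):
    """Try every whole-second seed within +/- window of a guessed time."""
    for candidate in range(around - window, around + window + 1):
        if weak_key_from_time(candidate) == observed_key:
            return candidate
    return None


if __name__ == "__main__":
    created_at = int(time.time())
    key = weak_key_from_time(created_at)
    # The attacker only needs a rough idea of when the session started.
    print(recover_seed(key, around=created_at + 600) == created_at)  # True
```

A two-hour window is only about 7,200 guesses, which takes a fraction of a second; the 128-bit key length doesn't matter when the seed has that little entropy.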
I would strongly advise using OpenSSL's key generation (and making damn sure you use it properly).
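As a minimal sketch of the safe alternative in Python: ssl.RAND_bytes() draws from OpenSSL's own CSPRNG, and secrets.token_bytes() draws from the operating system's. The helper names are illustrative; the point is a fresh key from a cryptographic generator for every session, never a hardcoded or reused one.

```python
import secrets
import ssl


def new_session_key(num_bytes: int = 32) -> bytes:
    """Fresh key material from OpenSSL's CSPRNG; 32 bytes suits AES-256."""
    # ssl.RAND_bytes raises ssl.SSLError rather than returning weak bytes
    # if OpenSSL's generator can't produce output.
    return ssl.RAND_bytes(num_bytes)


def new_session_key_os(num_bytes: int = 32) -> bytes:
    """Alternative: the OS CSPRNG via the standard-library secrets module."""
    return secrets.token_bytes(num_bytes)
```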
If you want to try your hand at DIYing it for fun, just want to learn a bit about it, or are in an environment with no existing implementation, here is OpenSSL's "Random Numbers" wiki page. Often there is a CPU instruction or peripheral that derives random numbers from hardware entropy sources (on x86, RDRAND and RDSEED). True randomness is not a trivial matter: here is Tom Scott's video on Cloudflare's physical random number generator, which begins to illustrate the difficulty and importance of the problem.
Now let's say you've implemented key generation properly and transferred the key through the SSL channel. And let's say there are no inherent security risks in closing that channel and continuing to communicate (I can't tell you with certainty there aren't).
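The mechanics of that hand-off can be sketched with Python's ssl module, whose SSLSocket.unwrap() performs the TLS shutdown and returns the underlying plain socket. This is a minimal sketch assuming both peers already hold a connected ssl.SSLSocket and agree on a 32-byte key; both sides must call unwrap() for the shutdown to complete, and the function names are hypothetical. It shows only the plumbing, not that the hand-off is safe.

```python
import ssl

KEY_LEN = 32  # assumption: both sides agree on a 256-bit AES key


def send_key_then_unwrap(tls_sock):
    """Sender: generate a key, send it inside TLS, then drop back to the plain socket."""
    key = ssl.RAND_bytes(KEY_LEN)
    tls_sock.sendall(key)
    plain_sock = tls_sock.unwrap()  # TLS shutdown; returns the underlying socket
    return key, plain_sock


def recv_key_then_unwrap(tls_sock):
    """Receiver: read exactly KEY_LEN bytes (recv may return fewer), then unwrap."""
    buf = bytearray()
    while len(buf) < KEY_LEN:
        chunk = tls_sock.recv(KEY_LEN - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before the whole key arrived")
        buf += chunk
    plain_sock = tls_sock.unwrap()
    return bytes(buf), plain_sock
```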
Now you're doing AES stream encryption. The stream part is very important: if you just encrypt every packet with the same key in the same way, identical plaintext produces identical ciphertext, and any data that's predictable across packets can be used to help break your encryption.
This is why OpenSSL uses specific modes of operation for AES that are considered secure. A mode of operation is a layer on top of the block cipher itself; counter-based modes turn AES into a stream cipher and, as long as every message under a given key gets a unique nonce, they mitigate the vulnerability described above.
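Why the nonce has to be unique can be shown without real AES. A minimal sketch: the keystream below is a toy built from SHA-256 in counter mode, standing in for AES-CTR (the counter mode that GCM and CCM use for their encryption step); it is for illustration only and not a cipher to use.

```python
import hashlib


def toy_keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """TOY stand-in for AES-CTR: SHA-256(key || nonce || counter) blocks. Illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def stream_encrypt(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt (same operation) by XORing with the keystream."""
    return xor_bytes(data, toy_keystream(key, nonce, len(data)))


if __name__ == "__main__":
    key, nonce = bytes(32), bytes(12)
    p1, p2 = b"GET /account?id=1001", b"GET /account?id=2002"
    c1 = stream_encrypt(key, nonce, p1)
    c2 = stream_encrypt(key, nonce, p2)  # nonce reused: the mistake
    # The keystream cancels out: an attacker who guesses p1 recovers p2.
    print(xor_bytes(xor_bytes(c1, c2), p1))  # b'GET /account?id=2002'
```

With a reused nonce, XORing the two ciphertexts cancels the keystream entirely and leaves the XOR of the two plaintexts, so predictable content in one packet exposes the other.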
In the case of TLS 1.3, the cipher suites use either Galois/Counter Mode (GCM) or Counter with CBC-MAC (CCM) for AES, which means that's likely what you should go with as well. Both are authenticated (AEAD) modes, so besides hiding the content they also detect tampering. (There's also ChaCha20-Poly1305, but then you don't get the dedicated AES hardware instructions.)
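If you do roll your own AES-GCM channel, the rule that matters most is never reusing a nonce under the same key. One way to guarantee that, borrowed from TLS 1.3 itself (RFC 8446, section 5.3), is to keep a static IV per direction and XOR a per-record sequence number into it. A minimal sketch of that construction; deriving the keys and IVs is out of scope here, and the function name is illustrative.

```python
def tls13_record_nonce(static_iv: bytes, seq_num: int) -> bytes:
    """Per-record AEAD nonce following RFC 8446, section 5.3."""
    if len(static_iv) < 8:
        raise ValueError("TLS 1.3 IVs are at least 8 bytes (12 for AES-GCM)")
    if not 0 <= seq_num < 2**64:
        raise ValueError("sequence number must fit in 64 bits; rekey before it wraps")
    # 64-bit sequence number, big-endian, left-padded with zeros to the IV length...
    padded_seq = seq_num.to_bytes(len(static_iv), "big")
    # ...XORed with the per-direction static IV from the key schedule.
    return bytes(i ^ s for i, s in zip(static_iv, padded_seq))
```

The result is what you'd pass as the nonce to the AES-GCM encrypt call for that record, incrementing the counter after every record and giving each direction its own key and IV.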
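Additionally, you have to make sure things like packet size don't inherently give away important information: with AES-GCM the ciphertext is exactly as long as the plaintext (plus a fixed-size tag), so a "yes" and a "no" can be told apart by length alone.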
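One mitigation is to pad every plaintext up to a fixed bucket size before encrypting, so short messages become indistinguishable by length. A minimal sketch loosely modelled on TLS 1.3's record padding (a non-zero marker byte followed by zeros); the 256-byte bucket and the function names are assumptions.

```python
def pad_to_bucket(plaintext: bytes, bucket: int = 256) -> bytes:
    """Append a 0x01 marker, then zeros, so the length is a multiple of `bucket`."""
    if bucket <= 0:
        raise ValueError("bucket must be positive")
    marked = plaintext + b"\x01"  # marker protects any trailing zeros in the data
    pad_len = (-len(marked)) % bucket
    return marked + b"\x00" * pad_len


def unpad_from_bucket(padded: bytes) -> bytes:
    """Strip the trailing zeros and the 0x01 marker added by pad_to_bucket."""
    stripped = padded.rstrip(b"\x00")
    if not stripped or stripped[-1] != 0x01:
        raise ValueError("malformed padding")
    return stripped[:-1]
```

Padding hides length only within a bucket, not across buckets, and costs bandwidth, so the bucket size is a trade-off.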
Lastly, the implementation code itself has to be bulletproof. This goes for every level of your application, but remember that you're implementing a protocol. If your logic is funky or you mismanage a buffer, the damage could range from leaking your private keys to giving an attacker root access to your entire system and network, and them using it to wreck your uranium enrichment centrifuges (see Stuxnet). Something like this even happened with OpenSSL itself (see Heartbleed).
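Heartbleed came down to trusting a length field sent by the peer: the heartbeat handler copied as many bytes as the request claimed, leaking adjacent memory. A minimal sketch of the defensive habit, using a hypothetical framing of a 2-byte big-endian length followed by the payload. In Python an over-long slice just comes back short; in C the same missing check reads past the buffer.

```python
import struct

HEADER = struct.Struct("!H")  # hypothetical framing: 2-byte big-endian payload length


def parse_record(buf: bytes) -> bytes:
    """Return the payload of one length-prefixed record, rejecting inconsistent lengths."""
    if len(buf) < HEADER.size:
        raise ValueError("truncated header")
    (claimed_len,) = HEADER.unpack_from(buf, 0)
    payload = buf[HEADER.size:]
    # The Heartbleed mistake was acting on claimed_len without this comparison.
    if claimed_len != len(payload):
        raise ValueError(f"length field says {claimed_len}, but {len(payload)} bytes arrived")
    return payload
```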
-march=native and -O3, then you should get the SSE4 and AVX versions of memcpy and memmove on modern hardware. They are lightning fast because they move 16, 32 and 64 bytes at a time. – Longshore
writev() is actually implemented as just a wrapper around write() that allocates a temporary buffer, copies the writev() buffers into the temp buffer, then calls write(). If you're running on Linux and writev() is working for you without SSL, just write your own SSL_writev() wrapper. – Vocalist
writev() function call into the write() system call. And how do you know the "performance would suffer"? The writev() kernel code must make multiple copies from user space to kernel space, one for each of the memory areas copied. The write() kernel code only has to copy from user space to kernel space once, and if the user is using direct IO, even that can be skipped. It's not clear performance would suffer at all. – Vocalist
writev() function to the write() system call, just like glibc does. Not only that, "iterating over a vector" is implemented by doing multiple copies of data. You need to dig deeper into the Linux kernel. Stopping at do_writev() isn't enough. You need to go look at vfs_writev() and then do_iter_write(), and see that the code does multiple copies. – Vocalist
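The SSL_writev() wrapper suggested in the comments amounts to gathering the buffers into one and issuing a single write, since a TLS connection has to encrypt the data before it can go to the kernel anyway. A minimal Python sketch of the same idea; the name ssl_writev is hypothetical, and tls_sock is assumed to be an ssl.SSLSocket or anything with a sendall() method.

```python
def ssl_writev(tls_sock, buffers) -> int:
    """Hypothetical SSL_writev(): copy all buffers into one temporary, then send once."""
    # bytes.join accepts any bytes-like objects (bytes, bytearray, memoryview).
    data = b"".join(buffers)
    tls_sock.sendall(data)
    return len(data)
```

Joining first also means the data goes out as a few large TLS records rather than at least one small record per buffer.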