Does this prefetch256() function offer any protection against cache timing attacks on AES?

This is a borderline topic. Since I wanted to know about the programming side (CPU cache memory, reading CPU cache lines, etc.), I'm posting it here.

I was implementing the AES algorithm in C/C++. Since performing GF(2^8) multiplications is computationally expensive without hardware support, I optimized the implementation to use lookup tables for the AES S-box. But unfortunately, lookup-table-based implementations are vulnerable to cache-timing attacks. Therefore, being very new to CPU caches, I started learning how they work, and how to circumvent such an attack without any additional computational cost.

I realize that in practice there are AES-NI instructions and bit-sliced AES implementations, but they are way beyond my current understanding.

I've learned from crypto.SE that the easiest method is to read all cache lines, i.e. read the entire table, before I do the lookup (this also affects my performance). But I do not know how to implement it in software, or the intricate concepts behind it.

In OpenSSL's C reference implementation, aes-x86core.c, the authors have implemented this function:

static void prefetch256(const void *table)
{
    volatile unsigned long *t=(void *)table,ret;
    unsigned long sum;
    int i;

    /* 32 is common least cache-line size */
    for (sum=0,i=0;i<256/sizeof(t[0]);i+=32/sizeof(t[0]))   sum ^= t[i];

    ret = sum;
}
  • In the for loop, i is incremented by 8 if sizeof(t[0]) is 4. Since the AES S-box is an unsigned char array of 256 elements, and when we call prefetch256(sbox) the sbox pointer is cast to an unsigned long pointer, every element dereferenced with t[i] is 4 bytes. But I don't understand why we skip 32 bytes ahead rather than reading the table fully (to protect against cache timing attacks)?
  • What is the motivation behind sum ^= t[i] and setting ret = sum?

  • Are there other simpler solutions to protect my implementation from cache timing attacks? Will the following simpler code help me:

       unsigned char tmp[256];
       memcpy(tmp, sbox, 256); // memcpy reads the full sbox table quickly
Abducent answered 9/5, 2020 at 16:13 Comment(2)
The motivation behind sum, ret, and the volatile keyword is to prevent the compiler from optimizing the code. Without that, the compiler may decide that the code doesn't have any observable results, and delete the entire body of the function. For code like this, part of the release procedure should be to check the assembly code of this function to make sure it was implemented correctly. And you need to check every call site in the assembly to make sure that the function is actually called, not inlined and optimized out. The optimizer is your adversary, watch it closely.Demilitarize
One small addition to Brendan's fine answer. From the article you linked, your goal is: "an implementation in which every call to a subroutine always returns in exactly x seconds, where x is the maximum time it ever takes to execute that routine on every possible authorized input." One way to achieve that is to take a high-precision time stamp at the beginning of the function, and then sit in a loop at the end of the function (doing meaningless computations similar to the actual algorithm) until x amount of time has passed.Demilitarize

In the for loop, i is incremented by 8 if sizeof(t[0]) is 4.

But I don't understand why we skip 32 bytes ahead rather than reading the table fully (to protect against cache timing attacks)?

In the for loop, i is incremented by the equivalent of 32 chars (regardless of what sizeof(t[0]) happens to be) because "32 chars" (or 32 bytes) is what the authors determined is the minimum cache line size for all CPUs they care about. Note that you only need to read from 1 byte of the cache line to ensure that the entire cache line is fetched into cache.
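
As a minimal sketch (not OpenSSL's code; the 32-byte minimum line size is the same assumption their comment makes), touching one byte per cache line could look like the loop below; the volatile qualifier matters for the reasons discussed just after this:

#include <stddef.h>

static void touch_one_byte_per_line(const void *table, size_t table_size)
{
    const volatile unsigned char *p = table;  /* volatile: see below */
    unsigned char sink = 0;
    size_t i;

    /* One read per assumed-32-byte line is enough to pull the whole
       line into the cache. */
    for (i = 0; i < table_size; i += 32)
        sink ^= p[i];
    (void)sink;
}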

What is the motivation behind sum ^= t[i] and setting ret = sum?

A good compiler will notice that the data you're reading is never used, and will avoid doing an "unnecessary" (for correctness of the "C abstract machine") read to improve performance, without knowing that the "unnecessary" read is necessary for reasons the compiler can't be expected to know about. To prevent the compiler from optimizing like that, you need to trick it into thinking the data is actually used, or use volatile. The OpenSSL authors are doing both (trying to trick the compiler into not optimizing by doing sum ^= t[i] and ret = sum;, and also using volatile), possibly because (historically) plenty of old compilers had bugs involving volatile.
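
A more modern alternative to the sum/ret trick (an assumption of this sketch, not what OpenSSL does; it relies on the GCC/Clang extended-asm extension) is an empty asm statement used as an optimization barrier:

#include <stddef.h>

static void prefetch_table(const unsigned char *table, size_t len)
{
    unsigned long sum = 0;
    size_t i;

    for (i = 0; i < len; i += 32)   /* 32 = assumed minimum line size */
        sum ^= table[i];

    /* Tells the compiler that sum is "used" and memory may be touched,
       so the loop above can't be deleted as dead code. */
    __asm__ __volatile__("" : : "r"(sum) : "memory");
}

Either way, as the first comment under the question says, inspect the generated assembly to confirm the reads survive optimization.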

Also note that there's still a very tiny timing problem - the cache could be flushed (e.g. by a task switch) after the data was prefetched but before part of the table is used for AES; so there's an (extremely tiny) chance that an attacker can still use a cache timing attack to determine which part of the table AES used. See "Confidence In Cache Timing Attack Prevention" (below).

Will the following simpler code help me:

It's very likely that the compiler will turn your code into literally nothing (and if it didn't, it'd have the same problems as prefetch256() and may be slower because you're writing to memory rather than only reading).

Are there other simpler solutions to protect my implementation from cache timing attacks?

Everything is a compromise between complexity, portability, security, features and performance; and "simpler" almost always means sacrificing portability, quality, features and/or performance. You can't sacrifice quality or features here, because protecting against cache timing attacks is the whole point; and you can't buy simplicity by sacrificing performance, because the prefetch approach is already about as simple as it can be.

You might (or might not) be able to make it simpler by sacrificing portability. For example, if you know that the whole table fits in a single cache line on one computer (and is aligned to a cache line boundary), then you could do nothing for that one computer and say the code should never be used on any other computer.
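
As a hypothetical sketch of that non-portable idea (C11 alignas and a 64-byte line size are assumptions here), you would pin the table to one line and rely on any access fetching all of it:

#include <stdalign.h>
#include <stdint.h>

/* Only sound if this one target's cache line is 64 bytes and the whole
   table fits in it; a full 256-byte AES s-box would not. */
static alignas(64) const uint8_t tiny_table[64] = { 0x63, 0x7c /* , ... */ };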

Confidence In Cache Timing Attack Prevention

One of the key factors in guarding against cache timing attacks is how much control the attacker has. The typical scenario is that the attacker floods the cache (pollutes it with known data to cause its previous contents to become evicted due to "least recently used"), then lets the victim do something, then measures how quickly it can access its known data to determine if that known data is still in the cache or was evicted. If a cache line of known data was evicted the attacker knows that the victim accessed something that has the same location in the cache as the evicted cache line.

The worst possible case is that the attacker is able to do this extremely often (e.g. for every instruction the victim executes), the cache has no associativity (direct mapped), and the attacker either knows everything about how the victim uses virtual addresses and the relationship between the victim's virtual addresses and locations in the cache (likely including the relationship between virtual addresses and physical addresses), or is in the same process (e.g. a shared library, where they can access the table themselves to determine if it was accessed, instead of relying on the eviction of other data).

In this case the only defense is to ensure that all memory access patterns are always the same (never dependent on any data). This is extremely hard, but not impossible. For example, if you want to read one byte from a table (e.g. byte = table[index] where you don't want the attacker to know anything about index), you can read all preceding cache lines, then read the byte you want, then read all following cache lines (so that it always looks like a sequential read of the whole table); and do those accesses at a fixed rate (no "pause before reading the byte you want" and no "pause after reading the byte you want", including no pause caused by branch mispredictions).

If you do this, you can have extremely high confidence in your ability to guard against cache timing attacks (up to guaranteeing that your code is immune to all possible cache timing attacks).

However; it's almost impossible for the attacker to gain that level of control, extremely hard to write code like this, and code like that would have large performance overheads.

At the other extreme; you can do nothing and have no confidence in your ability to prevent cache timing attacks. Everything else falls between these extremes.

The question then is; what is a good compromise? This depends on too many factors - how much control the attacker has, if the attacker is in the same process, if the attacker can repeat the attack (and use a probabilistic approach to defeat any "noise"), how much the data is worth to the attacker (sane thieves don't spend more than $2 in an attempt to steal something that's worth $2 to the thief), how much the data is worth to the victim (nobody spends $100 per day to protect something that can be replaced for $2), what other mitigations are in place (e.g. most operating systems provide "address space layout randomization" now), etc.

For a good compromise, for your purposes, I'm personally fond of the "make it look like a sequential read of the whole table" approach; but a significantly simpler version that doesn't care very much about the fixed rate of accesses (or the "pause before/after reading the piece you want") and doesn't care about cache line sizes (accessing every byte won't cost much if the table is only 256 bytes, partly because most accesses will be cache hits). An example of this might be:

#include <stdint.h>
#include <stddef.h>

extern const uint8_t table[256];            /* the s-box */

uint16_t fetch_byte_from_table(int index) {
    size_t table_entries = sizeof(table)/sizeof(table[0]);
    uint8_t dummy = 0;
    uint8_t data = 0;
    size_t i;

    /* Read every entry so the access pattern never depends on index */
    for(i = 0; i < table_entries; i++) {
        if(i == (size_t)index) {
            data ^= table[i];
        } else {
            dummy ^= table[i];
        }
    }
    return ((uint16_t)dummy << 8) | data;    // Caller can ignore the higher 8 bits
}

Of course there are tricks you can use to try to hide or avoid the branches (e.g. data ^= table[i] * (i == index); dummy ^= table[i] * (i != index);), but they depend on the compiler and target CPU.
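
For illustration only, here's a mask-based sketch of that idea; whether it actually stays branch-free has to be verified in the generated assembly, since a compiler is free to reintroduce branches:

#include <stdint.h>

uint8_t fetch_byte_branchless(const uint8_t table[256], unsigned index)
{
    uint8_t data = 0;
    unsigned i;

    for (i = 0; i < 256; i++) {
        /* mask is 0xFF when i == index and 0x00 otherwise, so every
           iteration does identical work on every table entry */
        uint8_t mask = (uint8_t)-(uint8_t)(i == index);
        data ^= table[i] & mask;
    }
    return data;
}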

Mirepoix answered 9/5, 2020 at 18:18 Comment(48)
Thank you for your reply! I am using only a 256-byte S-box table. My implementation is different from OpenSSL's, which uses several other lookup tables that merge the SubBytes(), ShiftRows() and MixColumns() steps into one. So, by using the same prefetch256() function as OpenSSL's authors, how much confidence can I gain that the probability of a cache timing attack on my implementation will be negligible? The S-box table is used by the AES SubBytes() function. Should I call prefetch256(sbox) on every round before SubBytes() is called?Abducent
@VivekanandV: Added a section on confidence in cache timing attack prevention (I think I got a little carried away/wrote too much - sorry). :-)Mirepoix
Thank you for your explanation, but this new method drastically affects my performance. I tried to use a uint32_t * for accessing my 256-byte S-box, to make fetch_byte_from_table() faster (and then extract the byte), but that causes endianness issues on different platforms. Does replacing my 256-byte table with a two-dimensional unsigned char sbox[16][16] array help against a cache timing attack (again, performance is compromised)? I learned that the CPU will most probably access a 2D array from RAM. Am I on the right track?Abducent
@VivekanandV: If you replace the table with an unsigned char sbox[16][16] you could reduce the number of dummy reads by a factor of 16 (e.g. dummy ^= table[i][0];); which would work if and only if cache line size is >= 16 bytes. For most CPUs people still care about the cache line size is 64 bytes, so touching every 16th byte would be fine (and touching every 64th byte would be faster). However; this will increase complexity and reduce portability; and you'd need to ensure the table is aligned on the chosen boundary (so there's no "half a cache line" that wasn't touched).Mirepoix
Thank you for all your clarifications and explanations! I will definitely implement that. Can I solve this issue using multithreading, e.g. by running a worker thread that constantly reads the 256-byte S-box and fetches values on request? Can you please share your thoughts on that?Abducent
@VivekanandV: On a system with a single CPU, a "fetcher thread" wouldn't help. On a system with many CPUs normally there's no guarantee that your threads will be running at the same time (especially if attacker starts many threads to take CPUs away from your threads) so it won't help much in that case either. For some operating systems, there might be something intended for "real-time scheduling guarantees" that might be useful, but it'd be incredibly difficult for an OS to provide that without allowing "denial of service/CPU" attacks. Mostly; it doesn't seem like a better alternative.Mirepoix
Are there alternatives other than reading the full table? I can't find any, other than reading the full table as you have suggested! :) The performance loss is huge, but anyway, security is always more important! :)Abducent
@VivekanandV: There's lots of alternatives, if you're willing to sacrifice simplicity and/or portability and/or security in exchange for performance. For one example, for some CPUs (e.g. 64-bit 80x86 with AVX) you could load the whole table into registers once then use the registers for everything with no security risk (because you're not using memory/cache); but you'd need to sacrifice portability. For another example, you could maybe interleave the S-box with the state array and do the substitutions in random order, so attacker can't guess why a cache line was used (sacrificing simplicity).Mirepoix
Can I totally replace the lookup table with a function that returns the (hard-coded) S-box value after checking against the input?Abducent
@VivekanandV: Yes, it's possible to replace S-box with code. If the code is like switch(input) { case 0: return 0x63; case 1: ... (with 256 cases) then you might replace the possibility of cache timing attacks with the possibility of branch direction/target timing attacks (which is less bad because that info is shared by less CPUs) but you might not (compiler might convert to a jump table that is still susceptible to cache timing attacks). If it's a mathematical equation then it'll probably be complex and slow (I don't think you can calculate s_box[n] without calculating s_box[n-1]).Mirepoix
To avoid branch prediction I made a function that returns the S-box value by evaluating a single expression: return 0x63 * (x==0) ^ 0x7c * (x==1) ^ /* ... and so on for the remaining 254 possible inputs */ but again my performance dropped five-fold! :( I think the expression is too large! Is == slow?Abducent
@VivekanandV: Depending on target CPU and compiler; the compiler will probably convert == into either a branch (with branch prediction vulnerabilities and branch misprediction performance penalties) or a conditional instruction (that is likely to cause the CPU's pipeline to stall on a data dependency); but with a sufficiently good optimizer a compiler may also convert your expression into a table (like return compile_time_generated_table[input]). Would recommend disassembling the resulting code (and considering portability problems).Mirepoix
I was able to improve my performance with grouped conditional statements, but it's still slower than my previous benchmarks. In your previous comments you mentioned interleaving the S-box with the state. Can you please tell me how to achieve that? The AES encryption function operates on a 16-byte (128-bit) block; for the sake of performance my implementation treats it as four 32-bit unsigned integers, and this forms the AES state. Also, my compiler did not optimize the previous code into a lookup table. And calculating the S-box requires a Galois field inversion plus an affine transformation of the input!Abducent
@VivekanandV: The basic idea is, rather than having uint8_t sbox[256] and uint8_t state[256] you'd have uint8_t interleaved_s_box_and_state[256][2]. However, that can't work - you have to make the sizes match. To do that (with 256 byte s-box and 16 byte state array), you'd want 16 bytes of s-box for each byte of state array; so it'd be like uint8_t interleaved_s_box_and_state[16][17]. To access values in the state array it'd be like interleaved_s_box_and_state[entry][16], and to access values in the s-box it'd be like interleaved_s_box_and_state[entry/16][entry%16].Mirepoix
@VivekanandV: Of course that means every time you access the state array you cause CPU to fetch whatever else happens to be in the same cache line (part of the s-box), which makes it hard for attacker to know when you access the s-box and when you access the state array (especially if you can do state array accesses in a randomized order).Mirepoix
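
A concrete sketch of the layout described in the two comments above (the names il, il_get_sbox and il_get_state are illustrative assumptions, not from the thread):

#include <stdint.h>

/* 16 rows of 17 bytes: bytes 0..15 of each row hold s-box entries and
   byte 16 holds one state byte, so a state access also pulls s-box
   data into its cache line (alignment to a line boundary is assumed
   to be handled elsewhere). */
static uint8_t il[16][17];

static uint8_t il_get_sbox(unsigned entry)  { return il[entry / 16][entry % 16]; } /* entry < 256 */
static uint8_t il_get_state(unsigned entry) { return il[entry][16]; }              /* entry < 16 */
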
Thank you for that! I am doing AES in CTR mode with a zero IV, so the input to the AES encryption function is a 128-bit counter which is publicly known. This counter gets incremented on every call to the AES encryption function, producing a pseudorandom stream that gets XORed with the plaintext to give the ciphertext. Only the user-supplied key is secret here. So does knowing the public counter give the attacker any advantage, even if I use an S-box interleaved with the state, given that he now knows the contents of the interleaved array?Abducent
@VivekanandV: If attacker knows the counter and can figure out which parts of the s-box you accessed, then they can figure out the contents of your state array and obtain the key. For "fast attacker" (able to check what was cached between each of your accesses), if attacker knows which cache line was fetched they only know that the value was one of about 60 (depending on cache line size and implementation) values. For that reason "fast attacker" needs multiple attempts - e.g. first time they might find out that the value was one of 60 possibilities, 2nd time they rule out half the... 1/3Mirepoix
@VivekanandV: ...possibilities and are left with 30, 3rd time they rule out more and maybe they're left with 15 possibilities, etc. However; in practice the attacker won't be that fast (if attacker has that much control then they probably have kernel or hardware access and can probably just read your state array directly). If attacker is slower then they have more doubt and need more attempts (e.g. maybe billions of attempts instead of thousands); until the attacker is just plain too slow (and only sees "whole s-box was cached" because they weren't fast enough). 2/3Mirepoix
@VivekanandV: For attackers that matter in practice (slower than "fast attacker" but faster than "too slow attacker") the interleaving strategy causes doubt and increases the number of attempts attacker needs to make. In other words, depending on relative speeds (how fast attacker is relative to how quickly you do the accesses) interleaving can increase doubt enough to make an attack impractical.Mirepoix
Thank you for your reply and detailed explanation; I understood how interleaving helps. Recently I thought of a new idea. Usually a 256-entry S-box lookup table looks like this: u8 sbox[256] = {s0, s1, s2, s3, s4, s5, s6, s7, ...}. What if I cluster the values into a 32-bit-entry table like u32 sbox32[64] = {s0s1s2s3, s4s5s6s7, ...}? Now, to access a byte x, I do EXB(sbox32[x>>2], x&3), where EXB is #define EXB(x, n) (u8)(((x) >> (24 - 8 * (n))) & 0xff). In this scheme, if I access the x-th byte I'm indexing by the truncated division of x by 4...Abducent
Please read my previous comment. Since I'm doing a truncated division to access the element, the attacker knows only x/4. Now the attacker will have to guess which byte among those I accessed, assuming he has no knowledge of the byte extraction, which depends on the unknown remainder x%4. Since AES SubBytes operates on 16 bytes of data, there are 4^16 possibilities. Does this method increase the difficulty for the attacker?Abducent
@VivekanandV: That won't really make any difference (unless it's deliberately misaligned), because accessing a byte or 4 bytes will both cause the whole cache line to be fetched the same. The only case where it'd help is if the 4 bytes are deliberately misaligned (e.g. first 2 bytes of the access are the last bytes of one cache line, second 2 bytes are the first bytes of the next cache line) so that it causes 2 cache lines to be fetched. In that case you'll break portability because some CPUs don't support misaligned accesses.Mirepoix
So, not knowing the exact byte that the Sbox function returns doesn't make the attack difficult? So, if I interleave the new 32 bit per entry Sbox with the state, like our previous idea, will that increase the difficulty for the attacker?Abducent
@VivekanandV: Interleaving 32-bit entries will be the same as interleaving bytes (unless the array is deliberately misaligned). Note that it's best to think in terms of cache lines, not bytes or entries, because that's what the attacker sees.Mirepoix
Thank you, I know this topic is very niche. In practice professional software developers won't go this far; they use AES-NI to be certain that their implementations are safe. If the attacker sees the cache lines, but has some difficulty knowing which byte I used, because the byte extraction uses bitwise operators and is independent of the CPU cache, doesn't that give me an additional advantage?Abducent
@VivekanandV: Making it harder for attacker to know something that they already can't know doesn't give you any advantage. Assuming 64-byte cache line size (the most common), if you access a byte with x = s_box[123] the attacker knows that you accessed one of the bytes from s_box[64] to s_box[127] (because they're all on the same cache line). If you access the same cache line some other way (e.g. as a 32-bit read with byte extraction) then it makes no difference to the attacker at all because they still know that the cache line corresponds to bytes s_box[64] to s_box[127].Mirepoix
Thank you very much! I am going to stick to the interleaved state-with-S-box model.Abducent
I've decided to use the interleaved model... :) Is there any reason why the state elements were placed at the end of every row? Also, as I said before, my implementation treats the state as 4 unsigned integers (equivalent to 16 bytes); this is done to increase performance for the other AES steps. So the interleaved state with S-box will look like u32 interleaved_sbox_state[4][65]; I can access the state by interleaved_sbox_state[i][64] where 0 <= i <= 3, and the S-box bytes by (u8)interleaved_sbox_state[x>>6][x&63]. Thank you very much!!! :)Abducent
@VivekanandV: Yes; putting state elements at the end helps to guard against having the last s-box elements on a cache line all by themselves (e.g. for 32 byte cache line size, "(256+16) / CACHE_LINE_SIZE" has a remainder of 16). Of course that also assumes that the array begins on a cache line boundary (which is something you'd want anyway).Mirepoix
Thank you very much! :) My implementation now performs better than with all the other methods I tried to mitigate side channel attacks. Does altering the scheme as I mentioned in the previous comment, to u32 interleaved_sbox_state[4][65], give me the same effect, in terms of security and performance, as uint8_t interleaved_s_box_and_state[16][17]? Thank you very much for all your help, guidance and intuition! :)Abducent
I think I ran into a problem while making it uint32_t interleaved_sbox_state[4][65]. The size of the array is now 1040 bytes, and every S-box entry in this 2D array carries 3 null bytes. Isn't that a waste of space, w.r.t. the array? Also, what about the values fitting into cache lines? Can you please explain it to me?Abducent
@VivekanandV: To avoid wasting space you'd want to do similar byte extraction for s-box entries. This results in an array of 4 rows; with 16 (4-byte) s-box entries then one (4-byte) state array entry in each row; like uint32_t interleaved_sbox_state[4][17];. However; "16 * 4 bytes = 64 bytes between state array entries >= size of cache line" so some groups of s-box entries must be on a cache line by themselves without any state array data for attacker to see. With the 3 byte padding it's even worse (4 times as easy for attacker to determine which s-box entries you looked at).Mirepoix
@VivekanandV: The only way to fix that (while using uint32_t) is to insert dummy entries - e.g. uint32_t interleaved_sbox_state[16][5];, where interleaved_sbox_state[i][4] contains "state array or dummy entry" (and i%4 == 0 is a real state array entry and i%4 != 0 is a dummy entry that has to be accessed to ensure s-box entries on that cache line get cached). That probably makes it hard to justify interleaving (probably easier/faster to just do dummy accesses to s-box and forget about interleaving).Mirepoix
Since I don't want to risk security, I've decided to stick to your original idea of using unsigned char sbox_state_il[16][17]. To be compatible with the other 32-bit-word-based state methods of AES (equivalent to 16 bytes), which I've already implemented for the sake of performance, I made an on-demand converter from 16 bytes of the unsigned char array to 4 words of the state array, plus an inverse converter. Now my problem is the performance cost of these two converters. Is it safe to initially construct the 4-word state array from the u8 array, make a copy of the u32 array, and use that?Abducent
Please read my previous comment. Or is it mandatory that I read the state bytes from the interleaved unsigned char array every time, before I do the AES SubBytes? If I keep a copy of the constructed uint32_t state array of 4 elements, I can save performance by not doing the conversions and swapping on demand, for the other AES steps, on every round! Will reading only the S-box entries from the interleaved u8 array in SubBytes, without reading the state bytes from the interleaved state, pose a security threat to me?Abducent
Please read my previous two comments! :) Or can I get away with all that by doing only a dummy read of the interleaved array's state elements, while using the initially constructed uint32_t array of 4 words for the rest of the AES calculations and updating only that u32 1D array? In this case, can the attacker detect that I am not updating the state bytes of the interleaved state (the two-dimensional u8 array), located at the end of every row? Thank you very much in advance!!!Abducent
@VivekanandV: It's safe to convert the state array from uint32_t to the interleaved format (and the reverse conversion), because the access pattern will always be the same and won't depend on any data. You don't need to read from the interleaved data for rounds that aren't using the s-box (because you're not trying to obscure s-box accesses when you're not doing any s-box accesses). If you're doing dummy reads of the (state array part of) the interleaved data (and using the uint32_t state array), then there's no point interleaving (faster to do dummy reads from s-box).Mirepoix
Are there methods to make our original linear sbox[256] lookup (with dummy reads) faster, but with the same security? Thank you very much in advance!!! :)Abducent
@VivekanandV: For all 80x86 (but not some other architectures) you can do a "deliberately misaligned across cache line boundary" read to double the number of cache lines fetched by a dummy read. For most 80x86 (and some other CPUs) you can use prefetch instructions (if and only if it won't be "TLB miss") instead, which would have different timing and may (and may not) have slightly worse security but should be faster. Mostly; performance gains will be from reducing security (e.g. "fetch whole s-box, then do whole SubBytes step" vs. "fetch whole s-box each time you want one byte of s-box").Mirepoix
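
A sketch of the prefetch-instruction route just mentioned, using the GCC/Clang __builtin_prefetch builtin (an assumption of this example; on some targets it compiles to nothing, and it doesn't remove the TLB-miss caveat above):

#include <stddef.h>

static void prefetch_sbox(const unsigned char sbox[256])
{
    size_t i;

    /* Ask the CPU to pull each assumed-64-byte line of the s-box toward
       the cache without an architectural load. */
    for (i = 0; i < 256; i += 64)
        __builtin_prefetch(&sbox[i], 0 /* read */, 3 /* high locality */);
}
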
If I fetch the table only once before SubBytes, can you please tell me what an observing attacker will see in my access pattern? If I read the whole 256-byte table and then read 16 bytes, how observable is the effect to an attacker in the long run?Abducent
@VivekanandV: It depends on how fast the attacker is (relative to how fast you are). For an "extremely fast attacker able to check which cache lines were used between each of your (dummy or real) reads" it'd be a serious problem; but in practice (for any normal situation) you can expect the attacker to be significantly slower; and in that case the attacker might try to flush cache lines while you're doing the earlier reads, to determine what the later ("not dummy") reads accessed, and repeat that process many many times to increase the chance of getting lucky timing.Mirepoix
@VivekanandV: Note that "100% secure under all conditions" isn't really the goal. The goal is "just make it hard enough that the attacker goes and finds an different/easier way instead" (which can include attacker not bothering - e.g. a sane thief won't spend more than $2 in an attempt to steal something worth $2, because they'll decide it's easier to just buy their own).Mirepoix
In our previous comments you mentioned working with CPU registers as an alternative. I was writing this AES implementation for an ARMv7 processor that has no AES intrinsics, but supports NEON registers. So why not group the elements of the 256-byte S-box into 16 uint8x16_t NEON registers? Will this protect me from cache timing attacks?Abducent
@VivekanandV: Yes; because you wouldn't be using memory to access s-box you wouldn't be using caches. It just takes some branches and shifts to get the right byte from the right register. For older 80x86 (without AES, but with 64-bit) there's SSE (similar to NEON - 16 registers able to store 16 bytes each). For portability, you'd probably end up with 4+ versions with conditional code (like #ifdef CAN_USE_NEON, #elif CAN_USE_AES, #elif CAN_USE_SSE, and #else /* Have to worry about cache timing */.Mirepoix
Is it safe to dynamically allocate "registers" like uint8x16_t *s = (uint8x16_t *)calloc(16, sizeof(uint8x16_t));? Then I get a multidimensional-like array to work with: the S-box elements would be like s[x >> 4][x & 15]. Is this safe, or should I use 16 different registers?Abducent
@VivekanandV: It'd be "compiler implementation dependent or worse"; but mostly I'd expect that it's not safe to make assumptions about if/how the compiler maps (uint8x16_t) variables to CPU registers. Instead I'd use assembly language for everything starting with "load s-box into registers" and ending with "compiler can use those registers for other things now".Mirepoix
Currently I am using a (not dynamically allocated) array uint8x16x4_t s[4] and a for loop to initialize it. I am a beginner at assembly, so it would be difficult for me to hand-code this process. Using arrays lets me avoid branches, reducing the penalty of branch misprediction. I tried using four separate registers with branching, but the application became 11x slower. The array version saves performance and gives easy access to S-box entries! Thank you very much in advance!!! :)Abducent
Please read my previous comment. ARM32 has 16 q registers, so the 256 S-box bytes will fill all the available registers. But I think using registers will make an attacker's job more difficult. In the case of AArch64 there are more registers available! And the attacker cannot (I guess) statistically learn exactly which bytes I'm using and come to the right conclusions, if I'm working with registers!Abducent
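
A sketch of the register-resident lookup discussed in the comments above, assuming AArch64 (vqtbl4q_u8 only exists there; 32-bit ARMv7 NEON has only the narrower vtbl/vtbx forms, so this exact code won't build for it). In real use the table would be kept live in registers inside the encryption routine; passing it through a pointer here is only for illustration, and per the advice above you'd check the generated assembly for spills to memory:

#include <arm_neon.h>
#include <stdint.h>

/* t[0..3] hold the 256 s-box bytes. vqtbl4q_u8 indexes a 64-byte table
   and returns 0 for out-of-range lanes; the uint8 subtraction wraps, so
   only the in-range lookup selects the real value and the rest yield 0,
   letting a plain OR combine the four quarters with no branches. */
static uint8_t sbox_lookup_neon(const uint8x16x4_t t[4], uint8_t x)
{
    uint8x16_t idx = vdupq_n_u8(x);
    uint8x16_t r   = vqtbl4q_u8(t[0], idx);

    r = vorrq_u8(r, vqtbl4q_u8(t[1], vsubq_u8(idx, vdupq_n_u8(64))));
    r = vorrq_u8(r, vqtbl4q_u8(t[2], vsubq_u8(idx, vdupq_n_u8(128))));
    r = vorrq_u8(r, vqtbl4q_u8(t[3], vsubq_u8(idx, vdupq_n_u8(192))));
    return vgetq_lane_u8(r, 0);
}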
