Is vec_sld endian sensitive?

I'm working on a PowerPC machine with in-core crypto. I'm having trouble porting AES key expansion from big endian to little endian using built-ins. Big endian works, but little endian does not.

The algorithm below is the snippet presented in an IBM blog article. I think I have the issue isolated to line 2 below:

typedef __vector unsigned char  uint8x16_p8;
uint8x16_p8 r0 = {0};

r3 = vec_perm(r1, r1, r5);       /* line  1 */
r6 = vec_sld(r0, r1, 12);        /* line  2 */
r3 = vcipherlast(r3, r4);        /* line  3 */

r1 = vec_xor(r1, r6);            /* line  4 */
r6 = vec_sld(r0, r6, 12);        /* line  5 */
r1 = vec_xor(r1, r6);            /* line  6 */
r6 = vec_sld(r0, r6, 12);        /* line  7 */
r1 = vec_xor(r1, r6);            /* line  8 */
r4 = vec_add(r4, r4);            /* line  9 */

// r1 is ready for next round
r1 = vec_xor(r1, r3);            /* line 10 */

Upon entering the function, both big endian and little endian have the following parameters:

(gdb) p r1
$1 = {0x2b, 0x7e, 0x15, 0x16, 0x28, 0xae, 0xd2, 0xa6, 0xab, 0xf7, 0x15, 0x88,
  0x9, 0xcf, 0x4f, 0x3c}
(gdb) p r5
$2 = {0xd, 0xe, 0xf, 0xc, 0xd, 0xe, 0xf, 0xc, 0xd, 0xe, 0xf, 0xc, 0xd, 0xe,
  0xf, 0xc}

However, after executing line 2, r6 has the value:

Little endian machine:

(gdb) p r6
$3 = {0x28, 0xae, 0xd2, 0xa6, 0xab, 0xf7, 0x15, 0x88, 0x9, 0xcf, 0x4f, 0x3c,
  0x0, 0x0, 0x0, 0x0}

(gdb) p $vs0
$3 = {uint128 = 0x8815f7aba6d2ae28000000003c4fcf09, v2_double = {
    4.9992689728788323e-315, -1.0395462025288474e-269}, v4_float = {
    0.0126836384, 0, -1.46188823e-15, -4.51291888e-34}, v4_int32 = {
    0x3c4fcf09, 0x0, 0xa6d2ae28, 0x8815f7ab}, v8_int16 = {0xcf09, 0x3c4f, 0x0,
    0x0, 0xae28, 0xa6d2, 0xf7ab, 0x8815}, v16_int8 = {0x9, 0xcf, 0x4f, 0x3c,
    0x0, 0x0, 0x0, 0x0, 0x28, 0xae, 0xd2, 0xa6, 0xab, 0xf7, 0x15, 0x88}}

Big endian machine:

(gdb) p r6
$4 = {0x0, 0x0, 0x0, 0x0, 0x2b, 0x7e, 0x15, 0x16, 0x28, 0xae, 0xd2, 0xa6,
  0xab, 0xf7, 0x15, 0x88}

Notice the odd rotation on the little endian machine.

When I disassemble on the little endian machine after line 2 executes:

 (gdb) disass $pc
 <skip multiple pages>

    0x0000000010000dc8 <+168>:   lxvd2x  vs12,r31,r9
    0x0000000010000dcc <+172>:   xxswapd vs12,vs12
    0x0000000010000dd0 <+176>:   xxlor   vs32,vs0,vs0
    0x0000000010000dd4 <+180>:   xxlor   vs33,vs12,vs12
    0x0000000010000dd8 <+184>:   vsldoi  v0,v0,v1,12
    0x0000000010000ddc <+188>:   xxlor   vs0,vs32,vs32
    0x0000000010000de0 <+192>:   xxswapd vs0,vs0
    0x0000000010000de4 <+196>:   li      r9,64
    0x0000000010000de8 <+200>:   stxvd2x vs0,r31,r9
 => 0x0000000010000dec <+204>:   li      r9,48
    0x0000000010000df0 <+208>:   lxvd2x  vs0,r31,r9
    0x0000000010000df4 <+212>:   xxswapd vs34,vs0

(gdb) p $v0
$5 = void

(gdb) p $vs0
$4 = {uint128 = 0x8815f7aba6d2ae28000000003c4fcf09, v2_double = {
    4.9992689728788323e-315, -1.0395462025288474e-269}, v4_float = {
    0.0126836384, 0, -1.46188823e-15, -4.51291888e-34}, v4_int32 = {
    0x3c4fcf09, 0x0, 0xa6d2ae28, 0x8815f7ab}, v8_int16 = {0xcf09, 0x3c4f, 0x0,
    0x0, 0xae28, 0xa6d2, 0xf7ab, 0x8815}, v16_int8 = {0x9, 0xcf, 0x4f, 0x3c,
    0x0, 0x0, 0x0, 0x0, 0x28, 0xae, 0xd2, 0xa6, 0xab, 0xf7, 0x15, 0x88}}

I have no idea why r6 is not the expected value. Ideally, I would examine the VSX registers on both machines. Unfortunately, GDB is also problematic on both machines, so I can't reliably do things like disassemble and print vector registers.

Is vec_sld endian sensitive? Or is there something else wrong?

Wiese answered 21/9, 2017 at 10:46 Comment(0)

Little endian with PowerPC/AltiVec can get a little mind-bending at times - if you need to make your code work with both big and little endian then it helps to define some portability macros, e.g. for vec_sld:

#ifdef __BIG_ENDIAN__
  #define VEC_SLD(va, vb, shift) vec_sld(va, vb, shift)
#else
  #define VEC_SLD(va, vb, shift) vec_sld(vb, va, 16 - (shift))
#endif

You'll probably find this helpful for all intrinsics which involve horizontal/positional operations or narrowing/widening, e.g. vec_mergeh/vec_mergel, vec_pack et al, vec_unpack, vec_perm, vec_mule/vec_mulo, vec_splat, vec_lvsl/vec_lvsr, etc.
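To see why swapping the operands and using 16 - shift works, here is a minimal sketch in plain C that models the shift on byte arrays instead of using the real intrinsics, so it runs on any machine. The names vec16, vsldoi_model, sld_be, sld_le and sld_le_swapped are hypothetical helpers invented for this illustration. The sketch assumes that the underlying vsldoi instruction always works on register byte order (most significant byte first), and that on little endian element 0 sits in the least significant byte of the register.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 16 bytes held in element order: b[0] is vector element 0. */
typedef struct { uint8_t b[16]; } vec16;

/* Model of vsldoi (assumption): concatenate two registers, most
   significant byte first, and keep 16 bytes starting at offset sh.
   The real instruction only accepts an immediate of 0..15. */
static vec16 vsldoi_model(vec16 ra, vec16 rb, unsigned sh)
{
    uint8_t cat[32];
    vec16 out;
    assert(sh < 16);
    memcpy(cat, ra.b, 16);
    memcpy(cat + 16, rb.b, 16);
    memcpy(out.b, cat + sh, 16);
    return out;
}

/* Reverse the 16 bytes (element order <-> register order on LE). */
static vec16 reverse16(vec16 v)
{
    vec16 r;
    for (int i = 0; i < 16; i++)
        r.b[i] = v.b[15 - i];
    return r;
}

/* Big endian: element order is the same as register order. */
static vec16 sld_be(vec16 a, vec16 b, unsigned sh)
{
    return vsldoi_model(a, b, sh);
}

/* Little endian (assumed model): element order is reversed register
   order, so convert on the way in and on the way out. */
static vec16 sld_le(vec16 a, vec16 b, unsigned sh)
{
    return reverse16(vsldoi_model(reverse16(a), reverse16(b), sh));
}

/* The #else branch of VEC_SLD applied to the LE model.
   Only meaningful for sh in 1..15, since 16 - 0 is out of range. */
static vec16 sld_le_swapped(vec16 a, vec16 b, unsigned sh)
{
    return sld_le(b, a, 16 - sh);
}

/* Byte-wise equality of two modelled vectors. */
static int same(vec16 x, vec16 y)
{
    return memcmp(x.b, y.b, 16) == 0;
}
```

Feeding the r0 and r1 values from the question into this model with a shift of 12 reproduces both GDB dumps of r6: sld_le gives the rotated little endian value, while sld_be and sld_le_swapped both give the big endian value. Note that the swapped form only applies to shifts of 1 to 15.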

Staggs answered 21/9, 2017 at 11:0 Comment(9)
Thanks Paul. The IBM docs don't say anything about vec_sld (or friends) being endian sensitive. If you don't mind me asking, do you have a reference? Or is this hard-won experience? For reference, here is the IBM doc on vec_sld. - Wiese
This is hard-won experience - I did a lot of embedded work with AltiVec a long time ago with both big and little endian targets. I wish I could give you all the portability macros which take a lot of the pain out of this, but IP issues prevent this. However, the above approach should help to take you in the right direction. If you hit any further issues then post a new question with an altivec tag and I'll try to help if I can. - Staggs
P.S. You might find it helpful to get the Motorola AltiVec docs - they don't cover the more modern stuff like crypto, but I think you might find them helpful for all the general stuff - just Google for "AltiVec PIM" and "AltiVec PEM" - you will want both (one is for assembly and the other is for intrinsics). - Staggs
Heh - I can remember times when the endianness stuff used to make my brain hurt, so occasionally I'd use a "brute force" approach and put together a simple example and try all possible combinations to see which one worked, e.g. for vec_sld there are two possible orderings of the registers, and two possible values for the shift, so four combinations to try. ;-) - Staggs
No problem, and don’t worry about the bounty etc. And thanks for the credit in your code - crypto is not really an area I know much about but I’ll take a look anyway. It’s good to know that people are still doing serious work with AltiVec. - Staggs
How on earth did IBM manage to write an endian-specific bit shift function? - Fidellia
@Lundin: I think it's an inherent problem with SIMD - for dual endianness architectures you would have to be able to optionally byte reverse all vectors on load/store, or byte reverse SIMD data lanes within the SIMD ALU paths, both of which would require a lot of unnecessary silicon. Normally you're only working with one endianness or the other, so you just code accordingly (I have had the dubious pleasure of working on code that needs to support both, which is where the fun begins). - Staggs
@jww: hey, thanks for the bonus (the first one I've ever had, I think) - totally unnecessary but I appreciate the kind gesture! - Staggs
@jww: cool project - I don't have much free time at present, but I do plan to retire some time (I'm getting old!), so maybe I'll have time to contribute to projects like this when that happens. - Staggs
