Broadcast a byte value to all 16 XMM slots in Delphi ASM

This is easy in AVX with the VBROADCASTSS instruction, or in SSE if the value were doubles or floats.

How do I broadcast a single 8-bit value to every slot in an XMM register in Delphi ASM?

Southland answered 5/1, 2015 at 13:15 Comment(0)

You mean you have a byte in the LSB of an XMM register and want to duplicate it across all lanes of that register? I don't know Delphi's inline assembly syntax, but in Intel/MASM syntax it could be done something like this:

punpcklbw xmm0,xmm0    ; xxxxxxxxABCDEFGH -> xxxxxxxxEEFFGGHH
punpcklwd xmm0,xmm0    ; xxxxxxxxEEFFGGHH -> xxxxxxxxGGGGHHHH
punpckldq xmm0,xmm0    ; xxxxxxxxGGGGHHHH -> xxxxxxxxHHHHHHHH
punpcklqdq xmm0,xmm0   ; xxxxxxxxHHHHHHHH -> HHHHHHHHHHHHHHHH
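
For anyone who wants to try the chain outside of inline assembly, here is a minimal sketch using C SSE2 intrinsics, each of which should compile to the instruction named in its comment. The function name `broadcast_byte_unpack` and the output-buffer convention are assumptions made for illustration, and the initial `movd` load is the one mentioned in the comments below.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: broadcast one byte to 16 bytes with the unpack chain. */
void broadcast_byte_unpack(uint8_t value, uint8_t out[16])
{
    __m128i v = _mm_cvtsi32_si128(value);  /* movd: byte zero-extended into the low dword */
    v = _mm_unpacklo_epi8(v, v);           /* punpcklbw: low word holds the byte twice */
    v = _mm_unpacklo_epi16(v, v);          /* punpcklwd: low dword holds it 4 times */
    v = _mm_unpacklo_epi32(v, v);          /* punpckldq: low qword holds it 8 times */
    v = _mm_unpacklo_epi64(v, v);          /* punpcklqdq: all 16 bytes */
    _mm_storeu_si128((__m128i *)out, v);   /* movdqu: unaligned store of the result */
}
```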

Witty answered 5/1, 2015 at 13:38 Comment(4)

Yes, that's the idea. How do I load the initial byte into the LSB? The references I've found are again oriented to floats. – Southland 5/1, 2015 at 13:45

The MOVD instruction lets you move the contents of a 32-bit register or memory location into an xmm register. – Witty 5/1, 2015 at 14:6

I'm assuming the last instruction should read "punpcklqdq" :) – Southland 5/1, 2015 at 14:15

You can pshufd xmm0, xmm0, 0 once you get to dword elements. Or better: punpcklbw / pshuflw / punpcklqdq (faster on Merom / K8 and earlier when pshufd and other 128b shuffles with granularity less than 64-bit are slower). – Florrieflorry 9/11, 2017 at 5:36

Michael's answer will work. As an alternative, if you can assume the SSSE3 instruction set, then using Packed Shuffle Bytes pshufb would also work.

Assuming (1) an 8-bit value in AL (for example) and (2) the desired broadcast destination to be XMM1, and (3) that another register, say XMM0, is available, this will do the trick:

movd   xmm1, eax  ;// move value in AL (part of EAX) into XMM1
pxor   xmm0, xmm0 ;// clear xmm0 to create the appropriate mask for pshufb
pshufb xmm1, xmm0 ;// broadcast lowest value into all slots of xmm1

And yes, Delphi's BASM understands SSSE3.
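
The same three steps can be sketched with C intrinsics, which may be handy for checking the result before porting it to BASM. This assumes GCC or Clang on x86 (the `TARGET_SSSE3` macro and the function name `broadcast_byte_pshufb` are made up for this example) and a CPU that supports SSSE3.

```c
#include <tmmintrin.h>  /* SSSE3 intrinsics (also pulls in SSE2) */
#include <stdint.h>

/* Assumption: GCC/Clang per-function target attribute, so no -mssse3 flag is needed. */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSSE3 __attribute__((target("ssse3")))
#else
#define TARGET_SSSE3
#endif

/* Hypothetical helper: broadcast one byte to 16 bytes with pshufb. */
TARGET_SSSE3
void broadcast_byte_pshufb(uint8_t value, uint8_t out[16])
{
    __m128i v    = _mm_cvtsi32_si128(value);  /* movd   xmm1, eax */
    __m128i mask = _mm_setzero_si128();       /* pxor   xmm0, xmm0 */
    v = _mm_shuffle_epi8(v, mask);            /* pshufb xmm1, xmm0: every index selects byte 0 */
    _mm_storeu_si128((__m128i *)out, v);      /* unaligned store of the result */
}
```

Because every mask byte is zero, each destination byte takes source byte 0, so whatever sits in the upper bytes of EAX does not affect the result.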

Suint answered 19/9, 2015 at 21:58 Comment(0)

The fastest option is SSSE3's pshufb, if it's available.

; SSSE3
pshufb      xmm0,  xmm1       ; where xmm1 is zeroed, e.g. with pxor xmm1,xmm1

Otherwise you should usually use this:

; SSE2 only
punpcklbw   xmm0, xmm0        ; xxxxxxxxABCDEFGH -> xxxxxxxxEEFFGGHH
pshuflw     xmm0, xmm0, 0     ; xxxxxxxxEEFFGGHH -> xxxxxxxxHHHHHHHH
punpcklqdq  xmm0, xmm0        ; xxxxxxxxHHHHHHHH -> HHHHHHHHHHHHHHHH

This is better than punpcklbw / punpcklwd followed by pshufd xmm0, xmm0, 0 because some CPUs (including Merom and K8) have only 64-bit shuffle units. On such CPUs, pshuflw is fast, and so is punpcklqdq, but pshufd and punpck with granularity finer than 64 bits are slow. So this sequence uses only one "slow shuffle" instruction, versus three for bw / wd / pshufd.

On all later CPUs, there's no difference between those two 3-instruction sequences, so it doesn't cost us anything to tune for old CPUs in this case. See also http://agner.org/optimize/ for instruction tables.

This is the sequence from Michael's answer with the middle two instructions replaced by pshuflw.
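
As a rough illustration, the SSE2-only sequence looks like this with C intrinsics. The function name `broadcast_byte_sse2` and the output buffer are assumptions for the example, and the leading `movd` is only there to get the byte into the register.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: punpcklbw / pshuflw / punpcklqdq broadcast. */
void broadcast_byte_sse2(uint8_t value, uint8_t out[16])
{
    __m128i v = _mm_cvtsi32_si128(value);  /* movd: byte in the lowest lane */
    v = _mm_unpacklo_epi8(v, v);           /* punpcklbw: low word = byte twice */
    v = _mm_shufflelo_epi16(v, 0);         /* pshuflw imm8=0: low word copied to the 4 low words */
    v = _mm_unpacklo_epi64(v, v);          /* punpcklqdq: low qword copied to the high qword */
    _mm_storeu_si128((__m128i *)out, v);   /* unaligned store of the result */
}
```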

If your byte is in an integer register to start with, you can use a multiply by 0x01010101 to broadcast it to 4 bytes. e.g.

; movzx   eax, whatever
imul   edx, eax, 0x01010101    ; edx = al repeated 4 times
movd   xmm0, edx               ; low dword of xmm0 = the 4-byte pattern
pshufd xmm0, xmm0, 0           ; copy the low dword to all four dword lanes

Note that imul's non-immediate source operand can be memory, but it has to be a 32-bit memory location with your byte zero-extended to 32 bits.
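
To see why the multiply works: for a byte b, b * 0x01010101 = b + (b << 8) + (b << 16) + (b << 24), and since b is at most 0xFF the four terms never overlap or carry. A minimal sketch in C, where the name `broadcast_byte_mul` is an assumption for illustration:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: integer multiply to 4 bytes, then pshufd to 16. */
void broadcast_byte_mul(uint8_t value, uint8_t out[16])
{
    /* imul: the byte must be zero-extended first, which the cast does here */
    uint32_t dword = (uint32_t)value * 0x01010101u;
    __m128i v = _mm_cvtsi32_si128((int)dword);  /* movd: pattern into the low dword */
    v = _mm_shuffle_epi32(v, 0);                /* pshufd imm8=0: low dword to all 4 dwords */
    _mm_storeu_si128((__m128i *)out, v);        /* unaligned store of the result */
}
```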

If your data starts in memory, loading into an integer register first is probably not worth it. Just movd to an xmm register. (Or possibly pinsrb if you need to avoid a wider load to avoid crossing a page or maybe a cache line. But that has a false dependency on the old value of the register where movd doesn't.)

If instruction throughput is more of an issue than latency, it can be worth considering pmuludq if you can't use pshufb, even though it has 5 cycle latency on most CPUs.

; low 32 bits of xmm0 = your byte, **zero extended**
pmuludq xmm0, xmm7        ; xmm7 = 0x01010101 in the low 32 bits
pshufd  xmm0, xmm0, 0
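
A sketch of the pmuludq variant with C intrinsics, in case it helps to experiment with it. The function name is hypothetical, and the constant is rebuilt on every call here, whereas the assembly above assumes it already sits in xmm7.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: vector multiply to 4 bytes, then pshufd to 16. */
void broadcast_byte_pmuludq(uint8_t value, uint8_t out[16])
{
    __m128i v = _mm_cvtsi32_si128(value);       /* low dword = byte, zero-extended */
    __m128i k = _mm_cvtsi32_si128(0x01010101);  /* plays the role of xmm7 */
    v = _mm_mul_epu32(v, k);                    /* pmuludq: 32x32->64 multiply of the low dwords */
    v = _mm_shuffle_epi32(v, 0);                /* pshufd imm8=0: low dword to all 4 dwords */
    _mm_storeu_si128((__m128i *)out, v);        /* unaligned store of the result */
}
```

The product is at most 0xFF * 0x01010101 = 0xFFFFFFFF, so it always fits in the low dword that pshufd then copies.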

Florrieflorry answered 9/11, 2017 at 5:47 Comment(3)

Wow, you sure know your SSE. One question: how would pinsrb ever cross a page? – Southland 10/11, 2017 at 6:52

@IamIC: It wouldn't, that's why you'd use it instead of movd to get a byte into the low byte of an xmm register. – Florrieflorry 10/11, 2017 at 16:34

"... multiply by 0x01010101 to broadcast it..." - that's clever. – Albescent 15/6, 2018 at 6:44
