This is easy in AVX with the VBROADCASTSS/VBROADCASTSD instructions, or in SSE if the values were doubles or floats.
How do I broadcast a single 8-bit value to every slot in an XMM register in Delphi ASM?
You mean you have a byte in the LSB of an XMM register and want to duplicate it across all lanes of that register? I don't know Delphi's inline assembly syntax, but in Intel/MASM syntax it could be done something like this:
punpcklbw xmm0,xmm0 ; xxxxxxxxABCDEFGH -> xxxxxxxxEEFFGGHH
punpcklwd xmm0,xmm0 ; xxxxxxxxEEFFGGHH -> xxxxxxxxGGGGHHHH
punpckldq xmm0,xmm0 ; xxxxxxxxGGGGHHHH -> xxxxxxxxHHHHHHHH
punpcklqdq xmm0,xmm0 ; xxxxxxxxHHHHHHHH -> HHHHHHHHHHHHHHHH
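For readers who want to check the lane diagram above outside of Delphi, here is a minimal sketch of the same four-step unpack chain written with C SSE2 intrinsics. It assumes an x86 compiler that provides `<emmintrin.h>`; the function name `broadcast_u8_unpack` and the `out` buffer are made up for illustration and are not part of the original answer.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (assumes an x86 target) */
#include <stdint.h>

/* Hypothetical helper: broadcast b to 16 bytes with the
   punpcklbw / punpcklwd / punpckldq / punpcklqdq chain. */
static void broadcast_u8_unpack(uint8_t b, uint8_t out[16])
{
    __m128i v = _mm_cvtsi32_si128(b);    /* movd: byte in lane 0, rest zero   */
    v = _mm_unpacklo_epi8(v, v);         /* punpcklbw:  low 2 bytes  are b    */
    v = _mm_unpacklo_epi16(v, v);        /* punpcklwd:  low 4 bytes  are b    */
    v = _mm_unpacklo_epi32(v, v);        /* punpckldq:  low 8 bytes  are b    */
    v = _mm_unpacklo_epi64(v, v);        /* punpcklqdq: all 16 bytes are b    */
    _mm_storeu_si128((__m128i *)out, v); /* unaligned store so any buffer works */
}
```

Each intrinsic should compile to the instruction named in its comment, so the sketch is only a way to test the sequence, not a different algorithm.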
The MOVD instruction lets you move the contents of a 32-bit register or memory location into an xmm register. – Witty
Use pshufd xmm0, xmm0, 0 once you get to dword elements. Or better: punpcklbw / pshuflw / punpcklqdq (faster on Merom / K8 and earlier, where pshufd and other 128-bit shuffles with granularity less than 64 bits are slower). – Florrieflorry
Michael's answer will work. As an alternative, if you can assume the SSSE3 instruction set, then using Packed Shuffle Bytes (pshufb) would also work.
Assuming (1) an 8-bit value in AL (for example), (2) the desired broadcast destination is XMM1, and (3) another register, say XMM0, is available, this will do the trick:
movd xmm1, eax ;// move value in AL (part of EAX) into XMM1
pxor xmm0, xmm0 ;// clear xmm0 to create the appropriate mask for pshufb
pshufb xmm1, xmm0 ;// broadcast lowest value into all slots of xmm1
And yes, Delphi's BASM understands SSSE3.
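The same three instructions can be sketched in C with the SSSE3 intrinsic `_mm_shuffle_epi8`, which may help when testing the idea before porting it to BASM. This assumes GCC or Clang on x86 (the `target("ssse3")` attribute lets it build without a `-mssse3` flag) and a CPU that actually supports SSSE3; `broadcast_u8_pshufb` is a hypothetical name.

```c
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (assumes an x86 target) */
#include <stdint.h>

/* Hypothetical helper: broadcast b to 16 bytes with pshufb and an all-zero mask. */
#if defined(__GNUC__) && !defined(__SSSE3__)
__attribute__((target("ssse3")))  /* GCC/Clang only: enable SSSE3 for this function */
#endif
static void broadcast_u8_pshufb(uint8_t b, uint8_t out[16])
{
    __m128i v    = _mm_cvtsi32_si128(b);   /* movd   xmm1, eax              */
    __m128i mask = _mm_setzero_si128();    /* pxor   xmm0, xmm0             */
    v = _mm_shuffle_epi8(v, mask);         /* pshufb xmm1, xmm0             */
    _mm_storeu_si128((__m128i *)out, v);   /* every byte now equals byte 0  */
}
```

An all-zero mask makes every destination byte select source byte 0, which is why the upper bytes of EAX do not matter in the assembly version.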
The fastest option is SSSE3's pshufb, if it's available.
; SSSE3
pshufb xmm0, xmm1 ; where xmm1 is zeroed, e.g. with pxor xmm1,xmm1
Otherwise you should usually use this:
; SSE2 only
punpcklbw xmm0, xmm0 ; xxxxxxxxABCDEFGH -> xxxxxxxxEEFFGGHH
pshuflw xmm0, xmm0, 0 ; xxxxxxxxEEFFGGHH -> xxxxxxxxHHHHHHHH
punpcklqdq xmm0, xmm0 ; xxxxxxxxHHHHHHHH -> HHHHHHHHHHHHHHHH
This is better than punpckl bw / wd -> pshufd xmm0, xmm0, 0 because some CPUs (including Merom and K8) have only 64-bit shuffle units. On such CPUs, pshuflw is fast, and so is punpcklqdq, but pshufd and punpck with granularity less than 64 bits are slow. So this sequence uses only one "slow shuffle" instruction, vs. 3 for bw / wd / pshufd.
On all later CPUs, there's no difference between those two 3-instruction sequences, so it doesn't cost us anything to tune for old CPUs in this case. See also http://agner.org/optimize/ for instruction tables.
This is the sequence from Michael's answer with the middle two instructions replaced by pshuflw.
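As a rough way to test the SSE2-only sequence, here is a sketch using C intrinsics. It assumes an x86 compiler with `<emmintrin.h>`; `broadcast_u8_pshuflw` is a hypothetical name chosen for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (assumes an x86 target) */
#include <stdint.h>

/* Hypothetical helper: punpcklbw / pshuflw / punpcklqdq broadcast. */
static void broadcast_u8_pshuflw(uint8_t b, uint8_t out[16])
{
    __m128i v = _mm_cvtsi32_si128(b);    /* movd: byte in lane 0               */
    v = _mm_unpacklo_epi8(v, v);         /* punpcklbw: low word = b,b          */
    v = _mm_shufflelo_epi16(v, 0);       /* pshuflw imm 0: word 0 -> low 4 words */
    v = _mm_unpacklo_epi64(v, v);        /* punpcklqdq: low qword -> both halves */
    _mm_storeu_si128((__m128i *)out, v);
}
```

Only the first step works at sub-64-bit granularity across the full register, which matches the "one slow shuffle" argument above.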
If your byte is in an integer register to start with, you can use a multiply by 0x01010101 to broadcast it to 4 bytes, e.g.:
; movzx eax, whatever
imul edx, eax, 0x01010101 ; edx = al repeated 4 times
movd xmm0, edx ; the broadcast dword is in edx, not eax
pshufd xmm0, xmm0, 0
Note that imul's non-immediate source operand can be memory, but it has to be a 32-bit memory location with your byte zero-extended to 32 bits.
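The multiply trick itself is plain integer arithmetic, so it can be sketched in portable C. The helper name `splat_low_byte` is hypothetical; the explicit mask plays the role of the movzx, since the trick only works on a zero-extended byte.

```c
#include <stdint.h>

/* Hypothetical helper: replicate the low byte of x into all 4 bytes of a dword.
   The mask stands in for movzx; without it, higher bits of x would carry
   into the replicated bytes. */
static uint32_t splat_low_byte(uint32_t x)
{
    return (x & 0xFFu) * 0x01010101u;
}
```

For example, `splat_low_byte(0xAB)` gives `0xABABABAB`. Multiplying an unmasked `0x1AB` by `0x01010101` would instead give `0xACACACAB` after truncation to 32 bits, which is why the zero extension matters.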
If your data starts in memory, loading into an integer register first is probably not worth it. Just movd to an xmm register. (Or possibly pinsrb if you need to avoid a wider load to avoid crossing a page or maybe a cache line. But that has a false dependency on the old value of the register, where movd doesn't.)
If instruction throughput is more of an issue than latency, it can be worth considering pmuludq if you can't use pshufb, even though it has 5 cycle latency on most CPUs.
; low 32 bits of xmm0 = your byte, **zero extended**
pmuludq xmm0, xmm7 ; xmm7 = 0x01010101 in the low 32 bits
pshufd xmm0, xmm0, 0
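A sketch of the pmuludq variant with C SSE2 intrinsics, assuming an x86 compiler with `<emmintrin.h>`. The name `broadcast_u8_pmuludq` is hypothetical, and the constant is built with a movd here purely for illustration (the answer assumes it is already sitting in xmm7).

```c
#include <emmintrin.h>  /* SSE2 intrinsics (assumes an x86 target) */
#include <stdint.h>

/* Hypothetical helper: pmuludq by 0x01010101, then pshufd to fill the register. */
static void broadcast_u8_pmuludq(uint8_t b, uint8_t out[16])
{
    __m128i v = _mm_cvtsi32_si128(b);           /* low dword = zero-extended byte  */
    __m128i k = _mm_cvtsi32_si128(0x01010101);  /* multiplier in the low dword     */
    v = _mm_mul_epu32(v, k);                    /* pmuludq: low dword = b repeated 4x */
    v = _mm_shuffle_epi32(v, 0);                /* pshufd imm 0: dword 0 -> all dwords */
    _mm_storeu_si128((__m128i *)out, v);
}
```

The product of a byte and 0x01010101 never exceeds 32 bits, so the upper half of the 64-bit pmuludq result stays zero and can be ignored.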
Use movd to get a byte into the low byte of an xmm register. – Florrieflorry
© 2022 - 2024 — McMap. All rights reserved.