How does XCHG work in Intel assembly language?

Asked 30/4, 2018 at 14:13 Answered 30/4, 2018 at 14:40

arrays assembly x86 instruction-set


How does the xchg instruction work in the following code? It is given that arrayD is a DWORD array of 1,2,3.

mov  eax, arrayD      ; eax=1
xchg eax, [arrayD+4]  ; eax=2 arrayD=2,1,3

Why isn't the array 1,1,3 after the xchg?

Doll answered 30/4, 2018 at 14:13 Comment(14)

Why do you think the array has the values 2,1,3 at the end? – Laktasic 30/4, 2018 at 14:19

mov eax, arrayD does NOT set eax to 1. It loads the address of arrayD. What you want is mov eax, [arrayD]. Edit: I misread the initial state. – Whaleboat 30/4, 2018 at 14:19

@Whaleboat That depends on the assembler. Some assemblers treat arrayD and [arrayD] the same. – Laktasic 30/4, 2018 at 14:21

@Laktasic yeah right. OP should have specified assembler. – Whaleboat 30/4, 2018 at 14:22

So what is the correct code to change arrayD from 1,2,3 to 3,1,2? – Doll 30/4, 2018 at 14:22

Is that NASM syntax, or GAS .intel_syntax? If it's GAS, then mov eax, arrayD is in fact a load, but ; is not the comment character! Is it maybe MASM syntax? I think [arrayD+4] might be legal MASM syntax, even though many people write arrayD[4] or arrayD+4 with symbols outside square brackets in MASM. – Froh 30/4, 2018 at 14:23

1,2,3 to 3,1,2 is more than one swap. – Whaleboat 30/4, 2018 at 14:24

@AlloysiusGoh: I wouldn't use xchg at all; it's slow with a memory operand because it does an atomic exchange (implicit lock prefix. See also agner.org/optimize). For 1,2,3 -> 3,1,2, I'd load all 3 values into eax,ecx, and edx, then store them back. Or do a 64-bit load of the first 2. e.g. mov rax, qword ptr [arrayD] / mov edx, [arrayD+8] / mov [arrayD], edx / mov qword ptr [arrayD+4], rax. Assuming you're using x86-64. If you're on 32-bit, you can use movq into XMM0. – Froh 30/4, 2018 at 14:27

@PeterCordes It is Intel syntax. My lecturer was just giving an example of how xchg works. – Doll 30/4, 2018 at 14:30

Or I'd use movdqu xmm0, [arrayD] / pshufd xmm0,xmm0, _MM_SHUFFLE(4,2,1,3) / movdqu [arrayD], xmm0. Use vpmaskmovd or AVX512 masked load/store if you need to avoid load/store of the dword past the end of the array. – Froh 30/4, 2018 at 14:31

@AlloysiusGoh: there are at least 3 flavours of Intel syntax used by different assemblers, see stackoverflow.com/tags/intel-syntax/info. If the first instruction is supposed to be a load, then presumably it's MASM. – Froh 30/4, 2018 at 14:33

@PeterCordes From the way my lecturer seemed to explain it, it is safe to assume it's MASM – Doll 30/4, 2018 at 14:34

I always suggest using square brackets even in MASM, to make the memory access easy to spot when reading the source and to be consistent with non-symbol memory references like mov eax,[ebx]. MASM ignores the [] around symbol names, so you can write mov eax,[arrayD] in that case. As for the +4: are you aware that memory is byte-addressable, so a 32-bit value occupies 4 bytes? The first element of the array occupies addresses arrayD+0, arrayD+1, arrayD+2 and arrayD+3. The second element starts at address arrayD+4 and occupies memory up to arrayD+7. – Heavyweight 30/4, 2018 at 14:38

In my earlier comment, shuffle indices are 0-indexed, so actually _MM_SHUFFLE(3,1,0,2) – Froh 3/8, 2023 at 2:22


xchg works like Intel's documentation says.

I think the comment on the 2nd line is wrong. It should be eax=2, arrayD = 1,1,3. So you're correct, and you should email your instructor to say you think you've found a mistake, unless you missed something in your notes.

xchg only stores one element, and it can't magically look back in time to know where the value in eax came from and swap two memory locations with one xchg instruction.
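To make that concrete, here is a rough C model of the two instructions, assuming the MASM reading where mov eax, arrayD is a load. The names xchg32 and lecture_example are made up for illustration, and the model ignores atomicity entirely; it only shows which locations get written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper modelling `xchg reg, [mem]`:
   the register and exactly ONE memory dword trade values. */
void xchg32(uint32_t *reg, uint32_t *mem) {
    uint32_t tmp = *reg;
    *reg = *mem;
    *mem = tmp;
}

/* Models the two instructions from the question (MASM reading assumed).
   The value comments assume array_d starts as 1,2,3.
   Returns the final value of eax. */
uint32_t lecture_example(uint32_t *array_d) {
    assert(array_d != NULL);
    uint32_t eax = array_d[0];   /* mov  eax, arrayD      ; eax = 1            */
    xchg32(&eax, &array_d[1]);   /* xchg eax, [arrayD+4]  ; eax = 2, [+4] = 1  */
    return eax;                  /* array_d[0] was only read, never written    */
}
```

Starting from 1,2,3, this returns 2 and leaves the array as 1,1,3: the first element is only ever read, so nothing can change it.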

The only way to swap 1,2 to 2,1 in one instruction would be a 64-bit rotate, like rol qword ptr [arrayD], 32 (x86-64 only).
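As a sketch of why the rotate works: rotating a 64-bit value by 32 swaps its two 32-bit halves, and the first two dwords of the array are exactly those halves. The function name below is hypothetical, and this is a C model of the effect on memory, not the instruction itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper modelling `rol qword ptr [arrayD], 32`:
   treat the first two dwords as one 64-bit value and rotate it by 32 bits,
   which swaps the low and high halves. */
void swap_first_two_by_rotate(uint32_t *array_d) {
    uint64_t q;
    assert(array_d != NULL);
    memcpy(&q, array_d, sizeof q);   /* 64-bit load of elements 0 and 1 */
    q = (q << 32) | (q >> 32);       /* rotate left by 32               */
    memcpy(array_d, &q, sizeof q);   /* 64-bit store back               */
}
```

Applied to 1,2,3 this gives 2,1,3; the third element is outside the 8 bytes touched.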

BTW, don't use xchg with a memory operand if you care about performance. It has an implicit lock prefix on 386 and later, so it's a full memory barrier, and even apart from waiting for the store buffer to drain, it takes about 20 CPU cycles on Haswell/Skylake (http://agner.org/optimize/ and https://uops.info/). Of course, multiple instructions can be in flight at once, but xchg mem,reg is 8 uops, vs. 2 total for separate load + store. xchg doesn't stall the pipeline, but the memory barrier hurts a lot (stopping later loads from being started early as well as waiting for earlier loads and stores to fully complete). It's also a lot of work for the CPU to do to make it atomic.

Related:

swapping 2 registers in 8086 assembly language(16 bits) (how to efficiently swap a register with memory). xchg is only useful for this case if you need atomicity, or if you care about code-size but not speed. Or on CPUs before 386, where xchg doesn't imply lock.
Can num++ be atomic for 'int num'?
Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? (for xchg reg,reg, no memory barrier)
Are loads and stores the only instructions that gets reordered? - Instruction-level parallelism around mfence vs. a locked operation

Froh answered 30/4, 2018 at 14:40 Comment(0)
