Is there a list of deprecated x86 instructions?

Asked 2/2, 2011 at 22:7 Answered 3/2, 2011 at 13:17

assembly x86 intel deprecated instructions

I'm taking an x86 assembly language programming class and know that certain instructions shouldn't be used anymore -- because they're slow on modern processors; for example, the loop instruction.

I haven't been able to find any list of instructions that are considered deprecated and should be avoided; any guidance would be appreciated.

Consumerism answered 2/2, 2011 at 22:7 Comment(4)

Basically, all complex instructions are slower than simple ones. While x86 is technically a CISC architecture, it's faster to use it like a RISC. – Telles 3/2, 2011 at 10:49

@ruslik: That's not necessarily true. If you have a complex instruction that performs the same functionality as a set of simple instructions, you're likely to get better performance with the dedicated instruction. It is more likely to be optimized and may even have dedicated hardware that you're missing out on by using only simple instructions. – Diarmid 3/2, 2011 at 13:20

@Nathan Fellman, cases like the enter/leave instructions come to mind with regard to this. However, I can't think of any cases where the complex instructions are faster than the simple ones. Can you name a few for me so that I am better informed? – Jimmiejimmy 3/2, 2011 at 17:42

You won't find a list as they really aren't ever "deprecated"... Just the ones that you use for a given situation will shift depending on the CPU architecture. Also, some are unavailable in certain circumstances, but they're still not deprecated... – Odellodella 2/6, 2011 at 19:24

Your best bet is to consult Intel's official optimization guide.

This and other manuals can be found here.

Diarmid answered 3/2, 2011 at 13:17 Comment(2)

That's exactly the kind of document that I was looking for! Thank you. – Consumerism 4/2, 2011 at 18:11

See also the x86 tag wiki for links to optimization resources, esp. Agner Fog's excellent guides and insn tables. – Specialism 4/3, 2016 at 4:26

Oh, but there still might be a good reason to use the loop instruction. For example, loop label requires only two bytes, whereas dec cx followed by jnz label requires three. Sometimes code size is more important than speed.

I would suggest, however, that if you're just learning x86 assembly (especially if this is your first foray into assembly language), you first concentrate on how to do things. Once you've gotten a better feel for how things work, then worry about making them faster.

Avalanche answered 2/2, 2011 at 22:25 Comment(2)

dec cx or dec rcx may require more bytes in 32 or 64-bit mode – Lobelia 26/4, 2014 at 4:25

For asm beginners, minimizing number of instructions, and esp. number of branches, is a reasonable approximation for efficiency. I often see beginner questions with a huge amount of branching (e.g. compare and branch, then unconditional branch to somewhere else, or even jmp label / label: to jump over a blank line between blocks...). I guess thinking about branches as fall-through-or-not takes practice. It's very true that trying to write code that avoids pitfalls on a range of AMD and Intel CPUs is hard. But see agner.org/optimize for insn tables and uarch writeups. – Specialism 4/3, 2016 at 4:24

All CPU instructions remain 100% functional, to keep compatibility with older CPUs. So why avoid any instruction? There are no truly deprecated x86 instructions! But we can say:

1) All string instructions, like rep movsb, are slower.

2) xlat is slow and very rarely used.

3) The stack frame instructions ENTER and LEAVE are also slow.

4) Under Windows (XP, Vista, ...) the IN and OUT instructions are effectively deprecated, but only in CPU ring 3 (application level); int nn is also deprecated, except int3 (debugger trap).

EDIT: I added a simple speed test to compare the string instruction repe cmpsb against an explicit loop on different CPUs.

The test was written in the Delphi IDE, but the asm part is easy to translate to any other environment.

program ProjectTest;

{$APPTYPE CONSOLE}

uses SysUtils, windows;

const
  ArraySize = 50000;

var
  StartTicks    :int64;
  EndTicks      :int64;
  arA           :array [0..ArraySize - 1]of byte;
  arB           :array [0..ArraySize - 1]of byte;

begin
  FillChar(ArA, SizeOf(ArA), 255);          //Set all bytes to 0xFF
  FillChar(ArB, SizeOf(ArB), 255);          //Set all bytes to 0xFF

repeat
  Sleep(100);       //Calm down
  asm
//Save  StartTicks
    rdtsc
    mov         dword ptr [StartTicks], eax
    mov         dword ptr [StartTicks + 4], edx
//Test LOOP
    push        esi         //Delphi asm blocks must preserve ESI and EDI
    push        edi
    mov         ecx, -ArraySize
    mov         edi, offset arA + ArraySize
    mov         esi, offset arB + ArraySize
@loop:
    mov         al,[esi + ecx]
    cmp         [edi + ecx], al
    jnz         @exit
    inc         ecx
    jnz         @loop
@exit:
    pop         edi
    pop         esi
//Save  EndTicks
    rdtsc
    mov         dword ptr [EndTicks], eax
    mov         dword ptr [EndTicks + 4], edx
  end;

  WriteLn('Loop ticks : ' + IntToStr(EndTicks - StartTicks));

  Sleep(100);       //Calm down
  asm
//Save  StartTicks
    rdtsc
    mov         dword ptr [StartTicks], eax
    mov         dword ptr [StartTicks + 4], edx
//Test REP
    push        esi         //Delphi asm blocks must preserve ESI and EDI
    push        edi
    cld
    mov         ecx, ArraySize
    mov         edi, offset arA
    mov         esi, offset arB
    repe        cmpsb
    pop         edi
    pop         esi
//Save  EndTicks
    rdtsc
    mov         dword ptr [EndTicks], eax
    mov         dword ptr [EndTicks + 4], edx
  end;

  WriteLn('Rep ticks  : ' + IntToStr(EndTicks - StartTicks));

  ReadLn                    //Wait keyboard
until false;

end.

Test results for ArraySize = 50000 (averages):

1) My Intel Pentium 4 (single core): Loop ticks: 232000; Rep ticks: 233000

2) My Intel Core 2 Quad: Loop ticks: 158000; Rep ticks: 375000

Nunuance answered 3/2, 2011 at 9:27 Comment(9)

These details are very CPU-dependent, to say the least. What you write is very true for some CPUs, and very untrue for others. – Diarmid 3/2, 2011 at 13:18

Agreed... These details are valid for newer Intel CPUs with built-in Hyper-Threading technology. – Nunuance 3/2, 2011 at 14:0

@GJ, you mention that "all string istructions" [sic] are slower. Slower than what? Also, are you sure that all the int nn style instructions are deprecated besides int 3? How about int 1, the single-step interrupt? – Jimmiejimmy 3/2, 2011 at 17:45

@mrduclaw, int 1 under IA32? :) The first 32 interrupt vector numbers are reserved by Intel for system use, so the system can use it! As for string instruction speed: they are slower than code that performs the same function without string instructions. – Nunuance 3/2, 2011 at 19:29

@GJ, yes int 1 under IA32. I think you are confusing a connection between "reserved for system use" and "deprecated"? I don't know any debuggers that don't allow for single-stepping through code, so it seems hardly "deprecated". But with regard to the string instructions, can you give me a series of assembly instructions that require fewer clock cycles to execute than, say, repe cmpsb, so that I can compare? – Jimmiejimmy 4/2, 2011 at 16:21

I said: "There is no realy deprecated x86 instructions!" But under some OSes, like Windows XP, they do exist at application level. As for repe cmpsb, I have added a simple test to my answer. – Nunuance 5/2, 2011 at 12:36

rep stos / rep movs are fast on Intel CPUs. For memset/memcpy of more than ~128B, they can beat any SSE or AVX loop when used with aligned inputs. This is especially true for IvyBridge and later, with the ERMSB feature (which makes rep stosb the best choice, rather than rep stosq and then cleanup). repe / repne compare / search instructions are not particularly fast, and you can beat the pants off them with a good SSE pcmpeqb loop. – Specialism 4/3, 2016 at 4:38

The Intel optimization manual has a chapter on memset/memcpy, with graphs of their optimized SSE implementation (from their high-performance library) vs. rep movs, for various sizes and alignments. Don't lump rep movs together with repe cmps. rep movs can use stores that avoid read-for-ownership overhead on cache misses, but without evicting the written data from cache. (With SSE, you have to choose regular stores (worse for large buffers) vs. movnt stores (much worse for small buffers since it evicts written lines from cache).) – Specialism 4/3, 2016 at 4:39

Also, enter is slow, and should be avoided. leave is not bad. Agner Fog lists it as 3 uops for Intel, but IDK if that includes a stack-engine sync uop. If so, then it's as fast as mov rsp, rbp / pop rbp, else 1 uop worse but saves code size. gcc uses it (usually only in functions with C99 variable-length arrays, where it's cumbersome to add the right amount to rsp to restore rsp that way.) And of course, with the default -fomit-frame-pointer, gcc usually only makes stack frames in functions with variable-length arrays. Clang doesn't use it at all. – Specialism 4/3, 2016 at 5:10

If you want to know what to avoid, go directly to the processor manufacturers. Both Intel and AMD publish manuals for the instruction sets their processors support and to what degree they support them. Your best bet is probably the optimization volumes. But if you're only just starting out, take Jim's advice: get the thing working first, before you worry about speed.

Gorges answered 3/2, 2011 at 4:12 Comment(0)
