Intel x86 assembly optimization techniques for expanding 8 bits to 8 boolean bytes of 0 or 1
Asked Answered
L

13

24

I am learning assembler quite a while and I am trying to rewrite some simple procedures \ functions to it to see performance benefits (if any). My main development tool is Delphi 2007 and first examples will be in that language but they can be easily translated to other languages as well.

The problem states as:

We have given an unsigned byte value in which each of the eight bits represents a pixel in one row of a screen. Each single pixel can be solid (1) or transparent (0). So in other words, we have 8 pixels packed in one byte value. I want to unpack those pixels into an eight byte array in the way that youngest pixel(bit) will land under the lowest index of the array and so on. Here is an example:

One byte value -----------> eight byte array

10011011 -----------------> [1][1][0][1][1][0][0][1]

Array index number ------->  0  1  2  3  4  5  6  7

Below I present five methods which are solving the problem. Next I will show their time comparison and how I did measure those times.

My questions consist of two parts:

1.

I am asking you for detailed answer concerning methods DecodePixels4a and DecodePixels4b. Why method 4b is somewhat slower than the 4a?

If for example it is slower because my code is not aligned correctly then show me which instructions in a given method could be better aligned and how to do this to not break the method.

I would like to see real examples behind the theory. Please bear in mind that I am learning assembly and I want to gain knowledge from your answers which allows me in the future to writing better optimized code.

2.

Can you write faster routine than DecodePixels4a? If so, please present it and describe optimization steps that you have taken. By faster routine I mean routine that runs in the shortest period of time in your test environment among all the routines presented here.

All Intel family processors are allowed and those which are compatible with them.

Below you will find routines written by me:

procedure DecodePixels1(EncPixels: Byte; var DecPixels: TDecodedPixels);
var
  i3: Integer;
begin
  DecPixels[0] := EncPixels and $01;
  for i3 := 1 to 7 do
  begin
    EncPixels := EncPixels shr 1;
    DecPixels[i3] := EncPixels and $01;
    //DecPixels[i3] := (EncPixels shr i3) and $01;  //this is even slower if you replace above 2 lines with it
  end;
end;


//Lets unroll the loop and see if it will be faster.
procedure DecodePixels2(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  DecPixels[0] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[1] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[2] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[3] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[4] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[5] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[6] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[7] := EncPixels and $01;
end;


procedure DecodePixels3(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  asm
    push eax;
    push ebx;
    push ecx;
    mov bl, al;
    and bl, $01;
    mov [edx], bl;
    mov ecx, $00;
@@Decode:
    inc ecx;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + ecx], bl;
    cmp ecx, $07;
    jnz @@Decode;
    pop ecx;
    pop ebx;
    pop eax;
  end;
end;


//Unrolled assembly loop
procedure DecodePixels4a(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  asm
    push eax;
    push ebx;
    mov bl, al;
    and bl, $01;
    mov  [edx], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $01], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $02], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $03], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $04], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $05], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $06], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $07], bl;
    pop ebx;
    pop eax;
  end;
end;


// it differs compared to 4a only in switching two instructions (but seven times)
procedure DecodePixels4b(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  asm
    push eax;
    push ebx;
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx], bl;        //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $01], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $02], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $03], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $04], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $05], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $06], bl;  //
    mov bl, al;
    and bl, $01;
    mov [edx + $07], bl;
    pop ebx;
    pop eax;
  end;
end;

And here is how do I test them:

program Test;

{$APPTYPE CONSOLE}

uses
  SysUtils, Windows;

type
  TDecodedPixels = array[0..7] of Byte;
var
  Pixels: TDecodedPixels;
  Freq, TimeStart, TimeEnd :Int64;
  Time1, Time2, Time3, Time4a, Time4b: Extended;
  i, i2: Integer;

begin
  if QueryPerformanceFrequency(Freq) then
  begin
    for i2 := 1 to 100 do
    begin
      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels1(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time1 := Time1 + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels2(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time2 := Time2 + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels3(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time3 := Time3 + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels4a(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time4a := Time4a + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels4b(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time4b := Time4b + ((TimeEnd - TimeStart) / Freq * 1000);

    end;
    Writeln('Time1 : ' + FloatToStr(Time1 / 100) + ' ms.    <- Delphi loop.');
    Writeln('Time2 : ' + FloatToStr(Time2 / 100) + ' ms.    <- Delphi unrolled loop.');
    Writeln('Time3 : ' + FloatToStr(Time3/ 100) + ' ms.    <- BASM loop.');
    Writeln('Time4a : ' + FloatToStr(Time4a / 100) + ' ms.    <- BASM unrolled loop.');
    Writeln('Time4b : ' + FloatToStr(Time4b / 100) + ' ms.    <- BASM unrolled loop instruction switch.');
  end;
  Readln;
end.

Here are the results from my machine ( Intel® Pentium® E2180 on Win32 XP) :

Time1  : 1,68443549919493 ms.     <- Delphi loop.
Time2  : 1,33773024572211 ms.     <- Delphi unrolled loop.
Time3  : 1,37015271374424 ms.     <- BASM loop.
Time4a : 0,822916962526627 ms.    <- BASM unrolled loop.
Time4b : 0,862914462301607 ms.    <- BASM unrolled loop instruction switch.

The results are pretty stable - times vary only by few percent between each test I've made. And that was always true: Time1 > Time3 > Time 2 > Time4b > Time4a

So I think that de difference between Time4a and Time4b depends of that instructions switch in the method DecodePixels4b. Sometimes it is 4% sometimes it is up to 10% but 4b is always slower than 4a.

I was thinking about another method with usage of MMX instructions to write into memory eight bytes at one time, but I can't figure out fast way to unpack byte into the 64 bit register.

Thank you for your time.


Thank you guys for your valuable input. Whish I could answer all of you at the same time, unfortunately compared to the modern CPU's I have only one "pipe" and can execute only one instruction "reply" at the time ;-) So, I will try sum up some things over here and write additional comments under your answers.

First of all, I wanted to say that before posting my question I came up with the solution presented by Wouter van Nifterick and it was actually way slower then my assembly code. So I've decided not to post that routine here, but you may see that I took the same approach also in my loop Delphi version of the routine. It is commented there because it was giving me worser results.

This is a mystery for me. I've run my code once again with Wouter's and PhilS's routines and here are the results:

Time1  : 1,66535493194387 ms.     <- Delphi loop.
Time2  : 1,29115785420688 ms.     <- Delphi unrolled loop.
Time3  : 1,33716934524107 ms.     <- BASM loop.
Time4a : 0,795041753757838 ms.    <- BASM unrolled loop.
Time4b : 0,843520166815013 ms.    <- BASM unrolled loop instruction switch.
Time5  : 1,49457681191307 ms.     <- Wouter van Nifterick, Delphi unrolled
Time6  : 0,400587402866258 ms.    <- PhiS, table lookup Delphi
Time7  : 0,325472442519827 ms.    <- PhiS, table lookup Delphi inline
Time8  : 0,37350491544239 ms.     <- PhiS, table lookup BASM

Look at the Time5 result, quite strange isn't it? I guess I have different Delphi version, since my generated assembly code differs from that provided by Wouter.

Second major edit:


I know why routine 5 was slower on my machnie. I had checked "Range checking" and "Overflow checking" in my compiler options. I've added assembler directive to routine 9 to see if it helps. It seems that with this directive assembly procedure is as good as Delphi inline variant or even slightly better.

Here are the final results:

Time1  : 1,22508325749317 ms.     <- Delphi loop.
Time2  : 1,33004145373084 ms.     <- Delphi unrolled loop.
Time3  : 1,1473583622526 ms.      <- BASM loop.
Time4a : 0,77322594033463 ms.     <- BASM unrolled loop.
Time4b : 0,846033593023372 ms.    <- BASM unrolled loop instruction switch.
Time5  : 0,688689382044384 ms.    <- Wouter van Nifterick, Delphi unrolled
Time6  : 0,503233741036693 ms.    <- PhiS, table lookup Delphi
Time7  : 0,385254722925063 ms.    <- PhiS, table lookup Delphi inline
Time8  : 0,432993919452751 ms.    <- PhiS, table lookup BASM
Time9  : 0,362680491244212 ms.    <- PhiS, table lookup BASM with assembler directive

Third major edit:


In opinion @Pascal Cuoq and @j_random_hacker the difference in execution times between routines 4a, 4b and 5 is caused by the data dependency. However I have to disagree with that opinion basing on the further tests that I've made.

I've also invented new routine 4c based on 4a. Here it is:

procedure DecodePixels4c(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  asm
    push ebx;
    mov bl, al;
    and bl, 1;
    mov [edx], bl;
    mov bl, al;
    shr bl, 1;
    and bl, 1;
    mov [edx + $01], bl;
    mov bl, al;
    shr bl, 2;
    and bl, 1;
    mov [edx + $02], bl;
    mov bl, al;
    shr bl, 3;
    and bl, 1;
    mov [edx + $03], bl;
    mov bl, al;
    shr bl, 4;
    and bl, 1;
    mov [edx + $04], bl;
    mov bl, al;
    shr bl, 5;
    and bl, 1;
    mov [edx + $05], bl;
    mov bl, al;
    shr bl, 6;
    and bl, 1;
    mov [edx + $06], bl;
    shr al, 7;
    and al, 1;
    mov [edx + $07], al;
    pop ebx;
  end;
end;

I would say it is pretty data dependent.

And here are the tests and results. I've made four tests to make sure there is no accident. I've also added new times for the routines proposed by GJ (Time10a, Time10b).

          Test1  Test2  Test3  Test4

Time1   : 1,211  1,210  1,220  1,213
Time2   : 1,280  1,258  1,253  1,332
Time3   : 1,129  1,138  1,130  1,160

Time4a  : 0,690  0,682  0,617  0,635
Time4b  : 0,707  0,698  0,706  0,659
Time4c  : 0,679  0,685  0,626  0,625
Time5   : 0,715  0,682  0,686  0,679

Time6   : 0,490  0,485  0,522  0,514
Time7   : 0,323  0,333  0,336  0,318
Time8   : 0,407  0,403  0,373  0,354
Time9   : 0,352  0,378  0,355  0,355
Time10a : 1,823  1,812  1,807  1,813
Time10b : 1,113  1,120  1,115  1,118
Time10c : 0,652  0,630  0,653  0,633
Time10d : 0,156  0,155  0,172  0,160  <-- current winner!

As you may see the results of 4a, 4b, 4c and 5 are very close to each other. Why is that? Because I've removed from 4a, 4b (4c already doesn't have it) two instructions: push eax and pop eax. Since I know I wont use anywhere else in my code the value under eax I do not have to prereserve it. Now my code has only one pair of push/pop so as the routine 5. Routine 5 prereserves value of eax beacause it firstly make copy of it under ecx but it deson't prereserve ecx.

So my conclusion is that: the difference in time execution of 5 and 4a and 4b (before the third edit) didn't concern data dependecny but was caused by additional pair of push / pop instructions.

I am very interested in your comments.

After a few days GJ invented even faster routine (Time 10d) than PhiS's. Nice work GJ!

Lode answered 12/9, 2009 at 11:28 Comment(11)
Nitpick: I think you mean "solid or transparent". "Opaque" means "can't be seen through".Kierkegaardian
Normally, the Delphi "assembler" directive doesn't do anything (just for backwards compatibility with Turbo Pascal), so I am a bit surprised. Which Delphi version are you using? Do you have any compiler options turned on to always generate stack frames or something similar?Sherfield
I just checked, marking the BASM version with "assembler" has no effect for me in Delphi 2009.Sherfield
On my machine, the BASM version operates faster still if the line "mov ecx, dword ptr PUint64DecPix" is replaced with "lea ecx, Uint64DecPix" (which I had originally done but for some reason not in the version I posted). On a more general note, the MOVQ-implementation I used may or may not be faster than one doing 2 moves through general-purpose registers, depending on the processor used.Sherfield
@PhiS: here are my compiler options: img220.imageshack.us/img220/2621/compileroptions.png I am using Delphi 2007. Also, you may notice in my third edit that routine 9 (with assembler directive) is faster then 8 in 3 out of 4 tests. Thanks for additional info I'll check proposed replacements later.Lode
One thing I noticed: you use begin..end around your asm..end block. That's not necessary for asm functions in Delphi. In fact, it may be something you may wish to avoid, because the compiler will in many cases introduce additional code (stack frames, local variable copies) in that case. Also, in Delphi you need not preserve EAX, ECX or EDX across function calls (you can get rid of these push/pop's). It's hidden somewhere in the Delphi help ("Assembly Procedures and Functions", "Assembly Expressions" et al.)Sherfield
Regarding the third edit, if the removal of a push/pop pair is significant, you should definitely inline this piece of code where it is used, because the call/ret pair has at least as large a cost as the push/pop one that you removed.Callie
@Pascal: The problem here is that Delphi does not allow assembly procedures to be inlined (only Pascal ones).Sherfield
@PhiS: Thank you, didn't know about that. And why EBX is the only one register which needs to be pushed?Lode
@Wodzu: EBX, ESI, EDI, ESP, EBP need to be preserved in Delphi assembly functions. The reason is simply that this is the calling convention they chose. Also, the Direction Flag should always be restored and if you use MMX (but not XMM) registers, you must revert to FPU mode by the end of the routine (i.e., use the EMMS instruction). Data are (generally) passed to functions via EAX, EDX and ECX, then the stack. If your function returns something, it is returned in AL/AX/EAX/EDX:EAX([u]int64) or ST(0) (floating point values), or some other things in @Result (passed to proc as a hidden param)Sherfield
Related: How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD and How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?Mercurio
P
6

Your asm code is relativity slow because use stack end write 8 times to memory. Check this one...

procedure DecodePixels(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
  xor   ecx, ecx
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 1
  mov   [DecPixels + 4], ecx
  xor   ecx, ecx
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 1
  mov   [DecPixels], ecx
end;

Maybe is even faster than code with lookup table!

Improved version:

procedure DecodePixelsI(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
  mov   ecx, 0    //Faster than: xor   ecx, ecx
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  ror   ecx, 16
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  mov   [DecPixels + 4], ecx
  mov   ecx, 0    //Faster than: xor   ecx, ecx
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  ror   ecx, 16
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  mov   [DecPixels], ecx
end;

Version 3:

procedure DecodePixelsX(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
  add   al, al
  setc  byte ptr[DecPixels + 7]
  add   al, al
  setc  byte ptr[DecPixels + 6]
  add   al, al
  setc  byte ptr[DecPixels + 5]
  add   al, al
  setc  byte ptr[DecPixels + 4]
  add   al, al
  setc  byte ptr[DecPixels + 3]
  add   al, al
  setc  byte ptr[DecPixels + 2]
  add   al, al
  setc  byte ptr[DecPixels + 1]
  setnz byte ptr[DecPixels]
end;

Version 4:

const Uint32DecPix : array [0..15] of cardinal = (
  $00000000, $00000001, $00000100, $00000101,
  $00010000, $00010001, $00010100, $00010101,
  $01000000, $01000001, $01000100, $01000101,
  $01010000, $01010001, $01010100, $01010101
  );

procedure DecodePixelsY(EncPixels: byte; var DecPixels: TDecodedPixels); inline;
begin
  pcardinal(@DecPixels)^ := Uint32DecPix[EncPixels and $0F];
  pcardinal(cardinal(@DecPixels) + 4)^ := Uint32DecPix[(EncPixels and $F0) shr 4];
end;
Peppers answered 14/9, 2009 at 19:4 Comment(13)
Thanks GJ for your interests. Unfortunately your routine is the slowests from every routines in my tests. See updated results in my answer. Once again thanks, will analyze your routine later.Lode
Jeah... I didn't test it... I have forgotn that instruction "rcl ecx, 8" is slow. So the new version is about 3 times faster.Peppers
How did you measure that it is 3 times faster? It is about 40% faster according to my tests. +1 For the new method.Lode
It's depend of CPU, on single core CPU was very fast but on my 4 core CPU only about 40%! Check version 3...Peppers
So far so good... Actualy we need only 16 dword big look up tabe! Check version 4:Peppers
Thanks GJ for you work on this. You've made the fastest procedure!Lode
+1 for your version 4, a clever and relatively high-level solution that beats all the ASM. With a variant record for TDecodedPixel you could eliminate the ugly type casting too, for even nicer code. Unfortunately you edited your answer too often, so you won't get the points for it...Lemire
Interesting! Actually, I think part of the reason of why this is very fast is because the compiler does a great job optimising here and automatically inlines it (no call/ret): //DecodePixelsGJ_Y (155, Pixels2); in the loop disassembles to: mov eax,$0000000b mov eax,[eax*4+$40f1e4] mov edx,$00414d94 mov [edx],eax mov eax,$00000009 mov eax,[eax*4+$40f1e4] mov edx,$00414d94 add edx,$04 mov [edx],eax If you set the compiler options to not inline this, it's way slower. My original inlined version appears slower, because there the compiler makes data move through memory, unlike here.Sherfield
@PhiS: you are rigth, but if we declare constant 155 as var and than put the value 155 in to, compiler can't do suchlike optimisation but code execution is still very fast.Peppers
@GJ I promote your answer as the correct since you gave the fastest algorithm. I didn't know that to often edit = no points for the answer. It's a shame...Lode
xor ecx,ecx is not faster than mov ecx,ecx. The false read dependency is optimized away by the processor, at least since the early incarnations of the P6 (about 1995). Both have the same execution time, but because the xor version needs less space in the code cache, it is to be preferred.Velocipede
What CPU did you test this on? Was mov ecx, 0 really faster in your testing? If so, that's probably because of interaction with the partial-register stalls created by writing cl and ch and reading ecx. See stackoverflow.com/questions/41573502/… for details about how partial registers behave on different x86 CPUs.Mercurio
Also note that rcl by 1 is much faster than rcl by any other count on Intel CPUs. e.g. 2c latency for rcl cl,1 vs. 6c latency for rcl ecx,8 on Haswell. (agner.org/optimize)Mercurio
D
16

In general, I'd personally stay away from trying to optimize code by using tricks on assembler level, unless you really need that extra 2 or 3% of speed, and you're willing to pay the price of code that is harder to read, maintain and port.

To squeeze that last 1%, you might even have to maintain several versions optimized per processor, and if newer processors and an improved pascal compiler comes along, you're not going to benefit from it.

This Delphi code is faster than your fastest assembler code:

procedure DecodePixels5(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  DecPixels[0] := (EncPixels shr 0) and $01;
  DecPixels[1] := (EncPixels shr 1) and $01;
  DecPixels[2] := (EncPixels shr 2) and $01;
  DecPixels[3] := (EncPixels shr 3) and $01;
  DecPixels[4] := (EncPixels shr 4) and $01;
  DecPixels[5] := (EncPixels shr 5) and $01;
  DecPixels[6] := (EncPixels shr 6) and $01;
  DecPixels[7] := (EncPixels shr 7) and $01;
end;


Results:

Time1  : 1,03096806151283 ms.    <- Delphi loop.
Time2  : 0,740308641141395 ms.   <- Delphi unrolled loop.
Time3  : 0,996602425688886 ms.   <- BASM loop.
Time4a : 0,608267951561275 ms.   <- BASM unrolled loop.
Time4b : 0,574162510648039 ms.   <- BASM unrolled loop instruction switch.
Time5  : 0,499628206138524 ms. !!!  <- Delphi unrolled loop 5.

It's fast because the operations can be done with registers only, instead of needing to store and fetch memory. Modern processors execute this partly in parallel (a new operation can be started before the previous finished), because the results of the consecutive instructions are independent of each other.

The machine code looks like this:

  push ebx;
  // DecPixels[0] := (EncPixels shr 0) and 1;
  movzx ecx,al
  mov ebx,ecx
  //  shr ebx,$00
  and bl,$01
  mov [edx],bl
  // DecPixels[1] := (EncPixels shr 1) and 1;
  mov ebx,ecx
  shr ebx,1
  and bl,$01
  mov [edx+$01],bl
  // DecPixels[2] := (EncPixels shr 2) and 1;
  mov ebx,ecx
  shr ebx,$02
  and bl,$01
  mov [edx+$02],bl
  // DecPixels[3] := (EncPixels shr 3) and 1;
  mov ebx,ecx
  shr ebx,$03
  and bl,$01
  mov [edx+$03],bl
  // DecPixels[4] := (EncPixels shr 4) and 1;
  mov ebx,ecx
  shr ebx,$04
  and bl,$01
  mov [edx+$04],bl
  // DecPixels[5] := (EncPixels shr 5) and 1;
  mov ebx,ecx
  shr ebx,$05
  and bl,$01
  mov [edx+$05],bl
  // DecPixels[6] := (EncPixels shr 6) and 1;
  mov ebx,ecx
  shr ebx,$06
  and bl,$01
  mov [edx+$06],bl
  // DecPixels[7] := (EncPixels shr 7) and 1;
  shr ecx,$07
  and cl,$01
  mov [edx+$07],cl
  pop ebx;

Edit: As suggested, a table lookup is indeed faster.

var
  PixelLookup:Array[byte] of TDecodedPixels;

// You could precalculate, but the performance gain would hardly be worth it because you call this once only.
for I := 0 to 255 do
  DecodePixels5b(I, PixelLookup[I]);


procedure DecodePixels7(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  DecPixels := PixelLookup[EncPixels];
end;

Results:

Time1  : 1,03096806151283 ms.    <- Delphi loop.
Time2  : 0,740308641141395 ms.   <- Delphi unrolled loop.
Time3  : 0,996602425688886 ms.   <- BASM loop.
Time4a : 0,608267951561275 ms.   <- BASM unrolled loop.
Time4b : 0,574162510648039 ms.   <- BASM unrolled loop instruction switch.
Time5  : 0,499628206138524 ms. !!!  <- Delphi unrolled loop 5.
Time7 : 0,251533475182096 ms.    <- simple table lookup
Descend answered 12/9, 2009 at 12:34 Comment(3)
Another possible reason for the improved speed: You now have 8 independent flows of execution, which can be executed (partially) in parallel on modern superscalar processors (esp. P4 and up). Before, each bit's computation could not begin until the previous bit's computation had completed.Kierkegaardian
Thank you Wouter for your reply. As I said in my edited question - I took the same approach before asking the question and on my machine the result was worse than the times measured with methods 1 and 2 which I've provided in the question. Also I don't quite get this: "It's faster because the operations can be done with registers only, instead of needing to store and fetch memory." I don't think this is the right explanation since my method 4a and 4b also do not store and fetch memory apart from writing the unpacked bits into the memory. My assembly methods relay only on the CPU registers.Lode
The original assembly uses no memory loads. Your version uses exactly the same number of memory stores. The only thing that I can think of is that your is more efficient at avoiding pipeline stalls.Hardcastle
P
6

Your asm code is relativity slow because use stack end write 8 times to memory. Check this one...

procedure DecodePixels(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
  xor   ecx, ecx
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 1
  mov   [DecPixels + 4], ecx
  xor   ecx, ecx
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 1
  mov   [DecPixels], ecx
end;

Maybe is even faster than code with lookup table!

Improved version:

procedure DecodePixelsI(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
  mov   ecx, 0    //Faster than: xor   ecx, ecx
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  ror   ecx, 16
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  mov   [DecPixels + 4], ecx
  mov   ecx, 0    //Faster than: xor   ecx, ecx
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  ror   ecx, 16
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  mov   [DecPixels], ecx
end;

Version 3:

procedure DecodePixelsX(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
  add   al, al
  setc  byte ptr[DecPixels + 7]
  add   al, al
  setc  byte ptr[DecPixels + 6]
  add   al, al
  setc  byte ptr[DecPixels + 5]
  add   al, al
  setc  byte ptr[DecPixels + 4]
  add   al, al
  setc  byte ptr[DecPixels + 3]
  add   al, al
  setc  byte ptr[DecPixels + 2]
  add   al, al
  setc  byte ptr[DecPixels + 1]
  setnz byte ptr[DecPixels]
end;

Version 4:

const Uint32DecPix : array [0..15] of cardinal = (
  $00000000, $00000001, $00000100, $00000101,
  $00010000, $00010001, $00010100, $00010101,
  $01000000, $01000001, $01000100, $01000101,
  $01010000, $01010001, $01010100, $01010101
  );

procedure DecodePixelsY(EncPixels: byte; var DecPixels: TDecodedPixels); inline;
begin
  pcardinal(@DecPixels)^ := Uint32DecPix[EncPixels and $0F];
  pcardinal(cardinal(@DecPixels) + 4)^ := Uint32DecPix[(EncPixels and $F0) shr 4];
end;
Peppers answered 14/9, 2009 at 19:4 Comment(13)
Thanks GJ for your interests. Unfortunately your routine is the slowests from every routines in my tests. See updated results in my answer. Once again thanks, will analyze your routine later.Lode
Jeah... I didn't test it... I have forgotn that instruction "rcl ecx, 8" is slow. So the new version is about 3 times faster.Peppers
How did you measure that it is 3 times faster? It is about 40% faster according to my tests. +1 For the new method.Lode
It's depend of CPU, on single core CPU was very fast but on my 4 core CPU only about 40%! Check version 3...Peppers
So far so good... Actualy we need only 16 dword big look up tabe! Check version 4:Peppers
Thanks GJ for you work on this. You've made the fastest procedure!Lode
+1 for your version 4, a clever and relatively high-level solution that beats all the ASM. With a variant record for TDecodedPixel you could eliminate the ugly type casting too, for even nicer code. Unfortunately you edited your answer too often, so you won't get the points for it...Lemire
Interesting! Actually, I think part of the reason of why this is very fast is because the compiler does a great job optimising here and automatically inlines it (no call/ret): //DecodePixelsGJ_Y (155, Pixels2); in the loop disassembles to: mov eax,$0000000b mov eax,[eax*4+$40f1e4] mov edx,$00414d94 mov [edx],eax mov eax,$00000009 mov eax,[eax*4+$40f1e4] mov edx,$00414d94 add edx,$04 mov [edx],eax If you set the compiler options to not inline this, it's way slower. My original inlined version appears slower, because there the compiler makes data move through memory, unlike here.Sherfield
@PhiS: you are rigth, but if we declare constant 155 as var and than put the value 155 in to, compiler can't do suchlike optimisation but code execution is still very fast.Peppers
@GJ I promote your answer as the correct since you gave the fastest algorithm. I didn't know that to often edit = no points for the answer. It's a shame...Lode
xor ecx,ecx is not faster than mov ecx,ecx. The false read dependency is optimized away by the processor, at least since the early incarnations of the P6 (about 1995). Both have the same execution time, but because the xor version needs less space in the code cache, it is to be preferred.Velocipede
What CPU did you test this on? Was mov ecx, 0 really faster in your testing? If so, that's probably because of interaction with the partial-register stalls created by writing cl and ch and reading ecx. See stackoverflow.com/questions/41573502/… for details about how partial registers behave on different x86 CPUs.Mercurio
Also note that rcl by 1 is much faster than rcl by any other count on Intel CPUs. e.g. 2c latency for rcl cl,1 vs. 6c latency for rcl ecx,8 on Haswell. (agner.org/optimize)Mercurio
S
5

Expanding on Nick D's answer, I tried the following table-lookup based versions, all of which are faster than the implementations you give (and faster than Wouter van Nifterick's code).

Given the following packed array:


      const Uint64DecPix : PACKED ARRAY [0..255] OF UINT64 =
  ( $0000000000000000, $0000000000000001, $0000000000000100, $0000000000000101, $0000000000010000, $0000000000010001, $0000000000010100, $0000000000010101, $0000000001000000, $0000000001000001, $0000000001000100, $0000000001000101, $0000000001010000, $0000000001010001, $0000000001010100, $0000000001010101,
    $0000000100000000, $0000000100000001, $0000000100000100, $0000000100000101, $0000000100010000, $0000000100010001, $0000000100010100, $0000000100010101, $0000000101000000, $0000000101000001, $0000000101000100, $0000000101000101, $0000000101010000, $0000000101010001, $0000000101010100, $0000000101010101,
    $0000010000000000, $0000010000000001, $0000010000000100, $0000010000000101, $0000010000010000, $0000010000010001, $0000010000010100, $0000010000010101, $0000010001000000, $0000010001000001, $0000010001000100, $0000010001000101, $0000010001010000, $0000010001010001, $0000010001010100, $0000010001010101,
    $0000010100000000, $0000010100000001, $0000010100000100, $0000010100000101, $0000010100010000, $0000010100010001, $0000010100010100, $0000010100010101, $0000010101000000, $0000010101000001, $0000010101000100, $0000010101000101, $0000010101010000, $0000010101010001, $0000010101010100, $0000010101010101,
    $0001000000000000, $0001000000000001, $0001000000000100, $0001000000000101, $0001000000010000, $0001000000010001, $0001000000010100, $0001000000010101, $0001000001000000, $0001000001000001, $0001000001000100, $0001000001000101, $0001000001010000, $0001000001010001, $0001000001010100, $0001000001010101,
    $0001000100000000, $0001000100000001, $0001000100000100, $0001000100000101, $0001000100010000, $0001000100010001, $0001000100010100, $0001000100010101, $0001000101000000, $0001000101000001, $0001000101000100, $0001000101000101, $0001000101010000, $0001000101010001, $0001000101010100, $0001000101010101,
    $0001010000000000, $0001010000000001, $0001010000000100, $0001010000000101, $0001010000010000, $0001010000010001, $0001010000010100, $0001010000010101, $0001010001000000, $0001010001000001, $0001010001000100, $0001010001000101, $0001010001010000, $0001010001010001, $0001010001010100, $0001010001010101,
    $0001010100000000, $0001010100000001, $0001010100000100, $0001010100000101, $0001010100010000, $0001010100010001, $0001010100010100, $0001010100010101, $0001010101000000, $0001010101000001, $0001010101000100, $0001010101000101, $0001010101010000, $0001010101010001, $0001010101010100, $0001010101010101,
    $0100000000000000, $0100000000000001, $0100000000000100, $0100000000000101, $0100000000010000, $0100000000010001, $0100000000010100, $0100000000010101, $0100000001000000, $0100000001000001, $0100000001000100, $0100000001000101, $0100000001010000, $0100000001010001, $0100000001010100, $0100000001010101,
    $0100000100000000, $0100000100000001, $0100000100000100, $0100000100000101, $0100000100010000, $0100000100010001, $0100000100010100, $0100000100010101, $0100000101000000, $0100000101000001, $0100000101000100, $0100000101000101, $0100000101010000, $0100000101010001, $0100000101010100, $0100000101010101,
    $0100010000000000, $0100010000000001, $0100010000000100, $0100010000000101, $0100010000010000, $0100010000010001, $0100010000010100, $0100010000010101, $0100010001000000, $0100010001000001, $0100010001000100, $0100010001000101, $0100010001010000, $0100010001010001, $0100010001010100, $0100010001010101,
    $0100010100000000, $0100010100000001, $0100010100000100, $0100010100000101, $0100010100010000, $0100010100010001, $0100010100010100, $0100010100010101, $0100010101000000, $0100010101000001, $0100010101000100, $0100010101000101, $0100010101010000, $0100010101010001, $0100010101010100, $0100010101010101,
    $0101000000000000, $0101000000000001, $0101000000000100, $0101000000000101, $0101000000010000, $0101000000010001, $0101000000010100, $0101000000010101, $0101000001000000, $0101000001000001, $0101000001000100, $0101000001000101, $0101000001010000, $0101000001010001, $0101000001010100, $0101000001010101,
    $0101000100000000, $0101000100000001, $0101000100000100, $0101000100000101, $0101000100010000, $0101000100010001, $0101000100010100, $0101000100010101, $0101000101000000, $0101000101000001, $0101000101000100, $0101000101000101, $0101000101010000, $0101000101010001, $0101000101010100, $0101000101010101,
    $0101010000000000, $0101010000000001, $0101010000000100, $0101010000000101, $0101010000010000, $0101010000010001, $0101010000010100, $0101010000010101, $0101010001000000, $0101010001000001, $0101010001000100, $0101010001000101, $0101010001010000, $0101010001010001, $0101010001010100, $0101010001010101,
    $0101010100000000, $0101010100000001, $0101010100000100, $0101010100000101, $0101010100010000, $0101010100010001, $0101010100010100, $0101010100010101, $0101010101000000, $0101010101000001, $0101010101000100, $0101010101000101, $0101010101010000, $0101010101010001, $0101010101010100, $0101010101010101);
PUint64DecPix : pointer = @Uint64DecPix;

you can write the following:


procedure DecodePixelsPS1Pas (EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  DecPixels := TDecodedPixels(Uint64DecPix[EncPixels]);
end;

procedure DecodePixelsPS1PasInline (EncPixels: Byte; var DecPixels: TDecodedPixels); inline; begin DecPixels := TDecodedPixels(Uint64DecPix[EncPixels]); end;

procedure DecodePixelsPS1Asm (EncPixels: Byte; var DecPixels: TDecodedPixels); asm lea ecx, Uint64DecPix //[<-Added in EDIT 3] //mov ecx, dword ptr PUint64DecPix - alternative to the above line (slower for me) movzx eax, al movq xmm0, [8*eax+ecx] //Using XMM rather than MMX so we don't have to issue emms at the end movq [edx], xmm0 //use MOVQ because it doesn't need mem alignment end;

The standard PAS and ASM implementations are fairly similar speed-wise, but the PAS implementation marked with "INLINE" is the fastest because it gets rid of all the call/ret involved in calling the routine.

--EDIT--: I forgot to say: since you are implicitly assuming something about the memory layout of your TDecodedPixels structure, it would be better if you declare it as


PACKED ARRAY [0..7] of byte

--EDIT2--: Here are my results for comparison:


Time1 : 2.51638266874701 ms.    <- Delphi loop.
Time2 : 2.11277620479698 ms.    <- Delphi unrolled loop.
Time3 : 2.21972066282167 ms.    <- BASM loop.
Time4a : 1.34093090043567 ms.    <- BASM unrolled loop.
Time4b : 1.52222070123437 ms.    <- BASM unrolled loop instruction switch.
Time5 : 1.17106364076999 ms.    <- Wouter van Nifterick
TimePS1 : 0.633099318488802 ms.    <- PS.Pas
TimePS2 : 0.551617593856202 ms.    <- PS.Pas Inline
TimePS3 : 0.70921094720139 ms.    <- PS.Asm (speed for version before 3rd EDIT)
Sherfield answered 12/9, 2009 at 13:2 Comment(5)
Note that my Asm implementation makes assumptions about the available instruction sets (SSE2).Sherfield
Thank you PhiS for your solution to the second part of my question. There is also an "assembler" directive which I've added to your assembly method to see if it helps.Lode
@Wodzu: The "assembler" directive doesn't do anything in modern Delphi versions. It's just for backward-compatibility with Turbo Pascal code, where you needed to mark assembly-only procedures/functions thus.Sherfield
Changing "mov ecx, dword ptr PUint64DecPix" to "lea ecx, Uint64DecPix" in the assembly version is still faster for me.Sherfield
Instead of lea, just use the LUT address as a displacement in the load: movq xmm0, [8*eax+ Uint64DecPix]. And BTW, it's not surprising that LEA is faster than leading the address from a pointer stored in memory. But (in 32-bit mode) it gains you nothing over using the address directly, or mov ecx, OFFSET Uint64DecPix. In 64-bit mode, you might need a RIP-relative LEA...Mercurio
E
4

Compilers do very good job at optimizing small routines.

I would optimize your code by using a lookup table.
Since you decode a single byte - 256 different states - you can precalculate 256 arrays with the unpacked values.

Edit: Note that Pentium processors can execute specific instructions in parallel (Superscalar architecture), it is called pairing.

Elm answered 12/9, 2009 at 11:45 Comment(2)
Thank you Nick. I've readed about pairing in document under download.intel.com/ids/mmx/MMX_Manual_Tech_Developers_Guide.pdf And inventinon of method 4b was inspired by this document ;)Lode
Pairing rules for the U/V pipes only apply to actual P5 / PMMX CPUs, not Pentium II or later which use out-of-order execution. See agner.org/optimize. Optimizing for modern CPUs is different from optimizing for P5. (But not downvoting because a LUT is a good idea.)Mercurio
B
4

Pure software solution

Using the beautiful technique from this question, which was again inspired by this question we'll have a great solution like this with only one line of code (excluding declarations)

type TPackedDecodedPixels = record
case integer of
  0: (a: TDecodedPixels);
  1: (v: Int64);
end;

procedure DecodePixels(EncPixels: byte; var DecPixels: TDecodedPixels); inline;
const
  magic = $8040201008040201;
  mask  = $8080808080808080;
begin
  TPackedDecodedPixels(DecPixels).v := SwapEndian(((EncPixels*magic) and mask) shr 7);
end;

Of course you need to make sure that DecPixels is properly 8-byte aligned or you may suffer from some slow down (or even segfaults on other architectures). You can also easily vectorize the function to make it faster

Explanation

Assume we have the following bit pattern as abcdefgh. We'll want the output array to contain

0000000a 0000000b 0000000c 0000000d 0000000e 0000000f 0000000g 0000000h (1)

Reading that in little endian as a 64-bit integer we'll get %0000000h0000000g0000000f0000000e0000000d0000000c0000000b0000000a. We have to find a magic number that shifts the original bits to the positions that we can extract the necessary bits

Let's multiply the value with the magic number

  |  b7  ||  b6  ||  b4  ||  b4  ||  b3  ||  b2  ||  b1  ||  b0  |
                                                          abcdefgh (1-byte value)
x 1000000001000000001000000001000000001000000001000000001000000001
  ────────────────────────────────────────────────────────────────
= h0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh

At this point all the pixels' bits have been moved to the most significant bits of the corresponding bytes. As they already lied in the right place, we just need to strip out the remaining bits with and

  |  b7  ||  b6  ||  b4  ||  b4  ||  b3  ||  b2  ||  b1  ||  b0  |
  h0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh
& 1000000010000000100000001000000010000000100000001000000010000000
  ────────────────────────────────────────────────────────────────
= h0000000g0000000f0000000e0000000d0000000c0000000b0000000a0000000 (8-byte array)

Now the pixels' bits are in the most significant bits of the corresponding bytes, we need to do a logical right shift by 7 to move them to the least significant position. Because the OP wants the value in reversed order, we need SwapEndian() to convert the bytes to big endian. If you just want little endian you can stop at this step

So the magic number is %1000000001000000001000000001000000001000000001000000001000000001 = $8040201008040201 and the mask is %1000000010000000100000001000000010000000100000001000000010000000 = $8080808080808080. Of course in reality to solve the problem and get those values we need to do backwards from the final result → multiplied result → magic number


But why did I put the bytes in little endian at (1) and then have to convert back to big endian? Why don't just arrange the bytes in big endian order and find the magic number for that? In case you're wondering about that then it's because that way it'll only work for at most 7 bits at a time. I did that way in my old answer and have to split a bit off then combine it back later

                                                          0abcdefg
x 0000000000000010000001000000100000010000001000000100000010000001
  ────────────────────────────────────────────────────────────────
= 00000000abcdefgabcdefgabcdefgabcdefgabcdefgabcdefgabcdefgabcdefg
& 0000000000000001000000010000000100000001000000010000000100000001
  ────────────────────────────────────────────────────────────────    
= 000000000000000a0000000b0000000c0000000d0000000e0000000f0000000g

Hardware support

This is actually a special case of bit expand with a constant mask. In AVX2 Intel introduced the pdep instruction in the BMI2 instruction set for that purpose, so you just need a single instruction to get the result. In other languages you can use this with the intrinsic function _pext_u64. Unfortunately AFAIK Free Pascal doesn't support it and you have to use assembly directly. However the expression will look like this

TPackedDecodedPixels(DecPixels).v := _pext_u64(EncPixels, $0101010101010101);

Correctness check

I tried comparing the OP's version with both my versions and didn't find any problem until now. The compiler output is like this

mov al, dil
mov rbx, rsi
movzx edi, al
movabs rax, 0x8040201008040201
imul rdi, rax
movabs rax, 0x8080808080808080
and rdi, rax
shr rdi, 0x7
call 4016a0 <SYSTEM_$$_SWAPENDIAN$INT64$$INT64>
mov QWORD PTR [rbx], rax

The FPC output is still pretty much sub-optimal because the compiler doesn't know to replace the call to SwapEndian with BSWAP, and it copies data unnecessarily. Why mov al, dil; movzx edi, al instead of just movzx edi, dil? As you can see, outputs from C and C++ compilers are a lot better

See How to create a byte out of 8 bool values (and vice versa)?

Burnham answered 26/5, 2014 at 17:59 Comment(4)
Thank you very much, that is very interesting idea. I wonder how it will perform in comparision to the others. I will update my results when I have time to run my tests again.Lode
@Lode I've fixed the code. This method uses 64-bit arithmetics so it'll be much faster on x86_64. Besides, if you often do this with a large number of pixels then consider using SIMD. Next year when AVX-512 comes out you may unpack 64 pixels or at least 8 64-bit words at a timeBurnham
@Wodzu, the compiler can be smart when calling the different DecodePixels routines with a constant (155). If it can precompile the result it will do so and replace the call with just assigning the result. To avoid this in your test comparison program, pass a variable with 155 instead.Helotry
Related: How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD has an answer without BMI2 which would also work for 16 bits -> 16 bytes. But the _mm_set1_epi8() costs several instructions without AVX2, so you're comment about multiply tricks being better is probably accurate. How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)? has an AVX2 answer.Mercurio
C
3

I was about to give the same algorithm as Wouter van Nifterick.

In addition, I would explain the better performance in terms of dependency chains. In each of the versions that you proposed, when you unrolled your basic loop, you kept a dependency between two successive iterations: each of your shr al, $01; requires the previous value of al to have been computed. If you organize your unrolled iterations such that they can be executed in parallel, they will actually be on a modern processor. Don't be fooled by false dependencies that can be suppressed by register renaming.

Someone pointed out that the Pentium can execute two instructions at once. That's true, but modern processors (since the Pentium Pro, PII,..., Core, Core 2) are executing much more than two instructions at the same time, when they have the chance -- that is, when there is no dependency between the instructions being executed. Note how in Wouter van Nifterick's version each line can be executed independently from the others.

http://www.agner.org/optimize/ has all the information you could ever need to understand the architecture of modern processors and how to take advantage of them.

Callie answered 12/9, 2009 at 12:54 Comment(9)
Good explanation and link! +1.Kierkegaardian
Thank you Pascal for your answer. However I think that your answer only refers to my Delphi versions of the routines. Assembly routines which I've provided are working in very similar fashion to assembly code generated from Wouter van Nifterick routine.Lode
No! Your assembly routine 4b is not at all similar to 5. 4b has a long dependency chain on the final value of al. During execution of 4b, an Out-Of-Order processor will most of the time be waiting for the previous valus of al to be computed so that it can compute the new value of al. By contrast, in the assembly generated for version 5, there is no such long dependency chain(if you understand register renaming. For this, read the material at agner.org/optimize). The instructions can be executed several at a time.Callie
j_random_hacker is saying the same thing in his comment to Wouter van Nifterick's answer, if you prefer his way of saying it.Callie
@Wodzu: Pascal is right, there is a big difference between your 3, 4a and 4b versions and WvN's. This makes a significant difference on modern CPUs.Kierkegaardian
@Kierkegaardian I have to disagree with you and Pascal on that. Please take a look at my third edit of the question. Please take a look at the routine 5c which I've invented. It is even more data dependent that the previous ones, isn't it? In my opinion the difference in time execution was caused by additional pair of push / pop instructions in my routines (4a, 4b). After removing those pairs, execution times of 4a, 4b and 5 are simmilar.Lode
@Wodzu: Interesting, your 4c is a useful data point. What CPU are you using?Kierkegaardian
@j_random_hacker: I am using Intel Pentium Dual Core E2180Lode
@Wodzu: Unfortunately I don't know anything about that CPU.Kierkegaardian
S
2

if you only support 80386 and above you can use BTcc and SETcc set of instructions in this manner:

BT ax,1
SETC [dx]
inc dx

BT ax,2
SETC [dx]
inc dx

etc

Sororate answered 12/9, 2009 at 12:31 Comment(6)
You could also scan only for those bits that are set, using BSF or BSR.Sherfield
@PhiS: Be warned that Intel's own optimisation manuals suggest avoiding BSF and BSR (among others) as they are microcoded -- essentially, interpreted on the CPU from a tiny "program" in ROM. So they're good for size optimisation, but not speed. (But of course the only real way to know is to test it!)Kierkegaardian
Thanks Dmitry I haven't know those instructions.Lode
@j_random_hacker: bsf / bsr are fast on Intel P6 and later; single uop with 3 cycle latency. (agner.org/optimize) Are you looking at some ancient version of the manual about optimizing for P5 Pentium? bsf/bsr are slightly slow on AMD (where only tzcnt/lzcnt are fast), so if you write your code to work correctly for tzcnt or bsf, use rep bsf so CPUs that support it will decode it as tzcnt.Mercurio
@PeterCordes: At the time when I was investigating CPU instruction latencies, the latest chip -- Pentium 4 I think -- had 0.5-cycle-latency "simple" arithmetic instructions (ADD, SUB, etc. -- but not ADC or SBB I remember discovering to my chagrin), and much higher latencies even for "simple" shifts and rotates. Just checked and BSF had a latency of 4 cycles on P4. 3 cycles on later CPUs is still very slow compared to the single-cycle latencies typical of other ALU instructions.Kierkegaardian
@j_random_hacker: oh right, I forgot to look at P4, and a lot of the stuff in Intel's optimization manual is for P4. Yup, P4 famously didn't have a barrel shifter, so especially before Prescott it sucked at shifts (and thus at bitfield stuff). P4's BSF is fairly normal at 4c latency, but Prescott's BSF is 16c latency! (still only 2 uops though). 3c latency vs. 1 for add on modern Intel is not a disaster: that's the same as multiply, popcnt, or a complex LEA. And unlike P4, it's still single-uop, so it has low impact on throughput if out-of-order exec can hide the latency.Mercurio
F
2

How about something like:

/* input byte in eax, address to store result in edx */
and eax, 0xff    /* may not be needed */
mov ebx, eax
shl ebx, 7
or  eax, ebx
mov ebx, eax
shl ebx, 14
or  eax, ebx
mov ebx, eax
and eax, 0x01010101
mov [edx], eax
shr ebx, 4
and ebx, 0x01010101
mov [edx+4], ebx
Floyd answered 15/9, 2009 at 22:15 Comment(2)
Thanks Chris, however it produces bad results.Lode
My fault, it produces good results. Thanks for the answer:) I will add it to the benchmark.Lode
R
1

The likely reason that 4b is faster than 4a is that it parallelizes better. From 4a:

mov bl, al;
and bl, $01;          // data dep (bl)
mov  [edx], bl;       // data dep (bl)
shr al, $01;
mov bl, al;           // data dep (al)
and bl, $01;          // data dep (bl)
mov [edx + $01], bl;  // data dep (bl)

Instructions marked "data dep" cannot begin executing until the previous instruction has finished, and I've written the registers that cause this data dependency. Modern CPUs are capable of starting an instruction before the last one has completed, if there is no dependency. But the way you've ordered these operations prevents this.

In 4b, you have fewer data dependencies:

mov bl, al;
and bl, $01;          // data dep (bl)
shr al, $01;
mov [edx], bl;
mov bl, al;
and bl, $01;          // data dep (bl)
shr al, $01;
mov [edx + $01], bl;

With this instruction ordering, fewer of the instructions depend on the previous instruction, so there is more opportunity for parallelism.

I can't guarantee that this is the reason for the speed difference, but it is a likely candidate. Unfortunately it is hard to come across answers as absolute as the ones you are looking for; modern processors have branch predictors, multi-level caches, hardware pre-fetchers, and all sorts of other complexities that can make it difficult to isolate the reasons for performance differences. The best you can do is read a lot, perform experiments, and get familiar with the tools for taking good measurements.

Rhoden answered 12/9, 2009 at 12:16 Comment(2)
Sounds like a good (and appropriately tentative :) ) explanation to me. Would also explain the blazing speed of Wouter van Nifterick's code.Kierkegaardian
It would be a good answer if not the one thing - 4b is SLOWER than 4a. I've created routine 4b for the same reasons that you pointed out Josh. And I was very confused seeing the benchmark results.Lode
I
0

I guess it's that writing to memory (actually, cache memory) is slower than working with registers.

So,

mov [edx+...], bl
shr al, $01;
mov bl, al;

gives the processor some time to write bl to memory before the bl register is needed again, while

shr al, $01;
mov [edx], bl;
mov bl, al;

needs bl immediately so the processor has to stop and wait for the memory write to complete.

This is surprising to me. Modern Intel processors do crazy pipelining and register renaming so in my opinion, if anything, DecodePixels4b should be faster, since the dependencies of each instruction are further back. The above is all the explanation I can offer, apart from this:

x86 is a terrible instruction set, and Intel does amazing and very advanced hocus-pocus to make it efficient. If I were you, I would look into something else. There's very little demand for megaMcOptimised software for PCs today. My friendly suggestion is to look into processors for mobile devices (mainly ARM), because in mobile devices, processor speed, power consumption and battery life concerns mean that micro-optimised software is more important. And ARM has a superior instruction set to x86.

Indira answered 12/9, 2009 at 12:13 Comment(2)
I doubt this is the reason; register renaming (en.wikipedia.org/wiki/Register_renaming) should prevent stalls due to waiting for a register to become available.Rhoden
Thanks Artelius. I tought so too, thats why I've switched shr with mov. It seems that there must some other factor which causes that 4b is slower than 4a.Lode
F
0

SIMD

If you extend the algorithm to processing arrays, then SIMD becomes an optimisation option. Here's a SIMD version that's 1/3 the time of an optimised C equivalent:

int main ()
{
  const int
    size = 0x100000;

  unsigned char
    *source = new unsigned char [size],
    *dest,
    *dest1 = new unsigned char [size * 32],
    *dest2 = new unsigned char [size * 32];

  for (int i = 0 ; i < size ; ++i)
  {
    source [i] = rand () & 0xff;
  }

  LARGE_INTEGER
    start,
    middle,
    end;

  QueryPerformanceCounter (&start);
  dest = dest1;
  for (int i = 0 ; i < size ; ++i)
  {
    unsigned char
      v = source [i];

    for (int b = 0 ; b < 8 ; ++b)
    {
      *(dest++) = (v >> b) & 1;
    }
  }
  unsigned char
    bits [] = {1,2,4,8,16,32,64,128,1,2,4,8,16,32,64,128},
    zero [] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
    ones [] = {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1};

  QueryPerformanceCounter (&middle);
  __asm
  {
    movdqu xmm1,bits
    movdqu xmm2,zero
    movdqu xmm3,ones
    mov ecx,0x100000/4
    mov esi,source
    mov edi,dest2
l1:
    lodsd
    movd xmm0,eax
    movd xmm4,eax
    punpcklbw xmm0,xmm0
    punpcklbw xmm4,xmm4
    punpcklwd xmm0,xmm0
    punpcklwd xmm4,xmm4
    punpckldq xmm0,xmm0
    punpckhdq xmm4,xmm4
    pand xmm0,xmm1
    pand xmm4,xmm1
    pcmpeqb xmm0,xmm2
    pcmpeqb xmm4,xmm2
    paddb xmm0,xmm3
    paddb xmm4,xmm3
    movdqu [edi],xmm0
    movdqu [edi+16],xmm4
    add edi,32
    dec ecx
    jnz l1
  }
  QueryPerformanceCounter (&end);

  cout << "Time taken = " << (middle.QuadPart - start.QuadPart) << endl;
  cout << "Time taken = " << (end.QuadPart - middle.QuadPart) << endl;
  cout << "memcmp = " << memcmp (dest1, dest2, size * 32) << endl;

  return 0;
}
Flavouring answered 15/9, 2009 at 14:44 Comment(2)
Zero an xmm reg with pxor xmm2,xmm2. lodsd / movd xmm0, eax is a bad way to write movd xmm0, [esi] / add esi, 4. Also, copy the xmm register with a movdqa instead of using movd twice. Actually, you're doing the same shuffles on 2 copies of eax for the first 2 steps. That's crazy, copy the punpcklwd result. Or better, copy+shuffle with pshufd.Mercurio
But other than the bad load and unpack strategy, this is a good way to implement bitmap -> vector (i.e. the inverse of pmovmskb: see also stackoverflow.com/questions/21622212/…).Mercurio
C
-1

As you notice, the difference of speed in 4a and 4b implementation is because of CPU optimization (by execute multiple instructions in parallel / pipelining instruction). But the factor is not in the operands, but because of the nature of operator itself.

4a Instruction Sequence:
AND - MOV - SHR

4b Instruction Sequence:
AND - SHR - MOV

Both AND and SHR use Flags register, so these two instructions has wait state in their pipeline.

Read them as follow:

4a: AND (piped) MOV (piped) SHR
4b: AND (WAIT) SHR (piped) MOV

Conclusion: 4b has 7 more wait-state in it's pipeline than 4a, thus it's slower.

Josh mentioned that there's data dependencies, i.e.:

mov bl, al;
and bl, $01;          // data dep (bl)

but it's not entirely true since those two instruction can partially be executed in paralel in CPU level:

mov bl, al -> (A:) read al (B:) write bl  => (2 clocks in i386)
and bl, 01 -> (C:) read 01 (D:) write bl  => idem

Sequentially they take 4 clocks, but pipelined they take only 3 "clocks" (actually the term "clock" is not adequate in pipeline perspective but I used it in context of simplicity)

[--A--][--B--]
 [--C--]<wait>[---D--]
Chandos answered 12/9, 2009 at 11:28 Comment(1)
immediate-count shr doesn't have an input dependency on flags on any modern x86 microarchitecture. Register-renaming avoids the write-after-write hazard. See agner.org/optimize, and also this Q&A for some more details on flag handling for shifts.Mercurio
S
-1

Incredible smart solution Chris, what would you do with the inverse problem: make a byte from an array of 8 bytes?

Non optimized solution for the inverse problem:

BtBld PROC Array:DWORD, Pixels:DWORD
  mov  eax, [Array]
  add  eax, 7
  mov  edx, [Pixels]

  mov  bx, 0

  mov  ecx, 8
rpt:  or  bx, [eax]
  dec  eax
  shl  bx, 1
  loop rpt
  shr  bx, 1
  mov  [edx], bl
  ret
BtBld ENDP
Sinh answered 12/9, 2009 at 11:28 Comment(1)
movq xmm0, [Array] / pslld xmm0, 7 / pmovmskb eax, xmm0 gives you the low bit of each byte of Array. (shift them and then extract the high bit of each byte with pmovmskb). You could also pcmpeqb against zero instead of shifting, to do a packed compare for zero / non-zero.Mercurio

© 2022 - 2024 — McMap. All rights reserved.