Count integers in [1..N] with K zero bits below the leading 1? (popcount for a contiguous range without HW POPCNT)

I have the following task: count how many numbers between 1 and N have exactly K zero non-leading bits. (E.g. 7 = 111₂ has 0 of them, 4 = 100₂ has 2.)

N and K satisfy the condition 0 ≤ K, N ≤ 1000000000.

This version uses POPCNT and is fast enough on my machine:

%include "io.inc"

section .bss
    n resd 1
    k resd 1
    ans resd 1
section .text
global CMAIN
CMAIN:
    GET_DEC 4,n
    GET_DEC 4,k
    mov ecx,1
    mov edx,0
    ;ecx is counter from 1 to n

loop_:
    mov eax, ecx
    popcnt eax,eax      ; eax = number of set bits in ecx
    mov edx, 32
    sub edx, eax        ; edx = 32 - popcount = number of zero bits

    mov eax, ecx        ; now count the leading zero bits
    bsr eax, eax        ; eax = index of the highest set bit
    xor eax, 0x1f       ; eax = 31 - bit index = clz (ecx is non-zero)
    sub edx, eax
    mov eax, edx
    ; all of the above computes something like (gcc):
    ; eax=32-__builtin_clz(x)-_mm_popcnt_u32(x);

    cmp eax,[k]         ; does ecx have exactly k non-leading zero bits?
    jnz notk
    ;if so, then increment ans
    
    mov edx,[ans]
    add edx,1
    mov [ans],edx
notk:
    ;increment counter, compare to n and loop
    inc ecx
    cmp ecx,dword[n]
    jna loop_
    
    ;print ans
    PRINT_DEC 4,ans
    xor  eax, eax
    ret

It should be okay in terms of speed (~0.8 sec), but it wasn't accepted: I guess the CPU used on the testing server is too old, because the judge reports a runtime error.

I tried a precomputing trick with a 64K * 4-byte lookup table, but it wasn't fast enough:

%include "io.inc"
section .bss
    n resd 1
    k resd 1
    ans resd 1
    wordbits resd 65536; bits set in numbers from 0 to 65536
section .text
global CMAIN
CMAIN:
    mov ebp, esp; for correct debugging
    mov ecx,0
    ;mov eax, ecx
    ;fill in wordbits, ecx is wordbits array index
precount_:
    mov eax,ecx
    xor ebx,ebx
    ;c is ebx, v is eax
    ;for (c = 0; v; c++){
    ;    v &= v - 1; // clear the least significant bit set
    ;}
lloop_:
    mov edx,eax
    dec edx
    and eax,edx
    inc ebx
    test eax,eax
    jnz lloop_
    
    ;computed bits set
    mov dword[wordbits+4*ecx],ebx
    
    inc ecx
    cmp ecx,65536
    jna precount_
    
    ;0'th element should be 0
    mov dword[wordbits],0
    
    GET_DEC 4,edi;n
    GET_DEC 4,esi;k
    
    mov ecx,1
    xor edx,edx
    xor ebp,ebp
    
loop_:
    mov eax, ecx
    ;popcnt eax,eax
    mov edx,ecx
    and eax,0xFFFF 
    shr edx,16
    mov eax,dword[wordbits+4*eax]
    add eax,dword[wordbits+4*edx]
    ;the previous lines emulate the missing popcnt instruction:
    ; they simply do eax=wordbits[x & 0xFFFF] + wordbits[x >> 16]
    mov edx, 32
    sub edx, eax
    ;and the same as before: 
    ;non-leading zero bits=32-bits set-__builtin_clz(x)
    mov eax, ecx
    bsr eax, eax
    xor eax, 0x1f
    sub edx, eax
    mov eax, edx

    ;compare to k again to see if this number has exactly k 
    ;non-leading zero bits

    cmp edx, esi
    jnz notk

    ;increment ebp (answer) if so
    mov edx, ebp
    add edx, 1
    mov ebp, edx
    ;and (or) go to the next iteration
notk:
    inc ecx
    cmp ecx, edi
    jna loop_
    
    ;print the answer, which is in ebp
    PRINT_DEC 4, ebp
    xor  eax, eax
    ret

(>1 sec)

Should I speed up the second program (if so, how?), or somehow replace POPCNT with some other instructions (which ones?)? I guess SSE2 and older should be available.

Crag answered 7/3, 2021 at 18:45 Comment(6)
The bithack in How to count the number of set bits in a 32-bit integer? is reasonably good, but significantly slower than popcnt. It's also possible to emulate with SSSE3 pshufb, but only a few CPUs have SSSE3 without popcnt (e.g. Core 2, and first-gen Core 2 has slow pshufb). Probably for this case you'd have better luck with algorithmic tricks like your second attempts, not just using a drop-in replacement for popcnt. But you didn't comment your code so it's not that easy to follow the logic.Casual
popcount of a 32-bit integer always fits in 1 byte, so your table should "only" be 64k, not 4 * 64k. Still too large to fit in L1d cache, though. Also, your first version should use registers, not [ans]. Push/pop EBX around your function so you can use it, too, avoiding a loop-carried dependency through a store/reload. Also, you can count down ECX from [n] to 1, avoiding a memory compare in the loop. (Looks like you optimized better in your 2nd one, but you use some call-preserved registers without saving/restoring them.)Casual
@PeterCordes Yes, it should. But it was easier to use 4 times more memory than to deal with the smaller parts of 32-bit registers. I also tried submitting code containing "pshufb mm1,mm2" (and "pshufb xmm1,xmm2"), and it didn't produce errors. So, I guess, it should be possible to use this instruction. But how? Also, replacing memory with registers made things worse: 1.1s vs 0.8Crag
Code alignment of exactly where your branches land relative to 32-byte boundaries matters on recent Intel CPUs (because of a workaround for a CPU bug that disables the uop cache for lines with a cmp/jcc that spans a 32-byte boundary). See this answer. I wouldn't be surprised if changing things around made that worse. Use perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread,idq.dsb_uops,idq.mite_uops -r 1 ./a.out or similar. to check for MITE uopsCasual
On my i7-6700k Skylake, I sped up your original version (with popcnt) from 1.3 sec to 0.54 sec (for k=8, N=1000000000) by significantly optimizing the 32-clz-popcnt math (to take advantage of bsr = 31-clz), and avoiding the JCC erratum problem, and making the common case (no increment) the fall-through (0.9 sec vs. 0.5 sec). Now it's running almost 2 cycles per iteration, 3.4 front-end uops / clock. (4.3 IPC thanks to macro-fusion). So pretty close to maxing out throughput of the popcnt / bsr unit on port 1. godbolt.org/z/n4vh5x has source and perf resultsCasual
@PeterCordes on my i7-8700k performance gain is not that great: from 0.74 to 0.58 sec (for the same k and n). I also tried replacing popcnt here, but the result was "Time-limit exceeded" againCrag

First of all, a server too old to have popcnt will be significantly slower in other ways and have different bottlenecks. Given that it has pshufb but not popcnt, it's a first- or second-gen Core 2 (Conroe or Penryn). See Agner Fog's microarch PDF (on https://agner.org/optimize/). It also runs at lower clock speeds, so the best you can do on that CPU might not be enough to let brute-force work.

There are probably algorithmic improvements that could save huge amounts of time, like noting that every 4 increments cycle the low 2 bits through a 00, 01, 10, 11 pattern: 2 zeros happens once per four increments, 1 zero happens twice, no zeros happens once. For every number >= 4, these 2 bits are below the leading bit and thus part of the count. Generalizing this into a combinatorics formula for each MSB-position between 1 and log2(N) might be a way to do vastly less work. Handling the numbers between 2^M and N is less obvious.
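
For instance, numbers whose leading 1 is at bit position m (the range [2^m, 2^(m+1)-1]) have m free bit positions below it, so exactly C(m, K) of them have K zero bits: for m = 2 (numbers 4..7) and K = 1, C(2,1) = 2, matching 5 (101) and 6 (110). Summing C(m, K) over every MSB position below the top one covers all numbers less than 2^floor(log2(N)), leaving only that less-obvious top partial range up to N.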


Versions here:

  • cleaned up popcnt version, 536ms on i7-6700k @ 3.9GHz, no algorithmic optimization across iterations. For k=8, N=1000000000
  • Naive LUT version (2 loads per iteration, no inter-iteration optimization): ~595 ms on a good run, more often ~610 ms for k=8, N=1000000000. Core2Duo (Conroe) @ 2.4GHz: 1.69 s. (A couple worse versions of that in the edit history, the first having partial-register stalls on Core 2.)
  • (unfinished, cleanup code not written) Optimized LUT version (unrolled, and high-half/MSB BSR work hoisted, leaving only 1 lookup (cmp/jcc) per iteration), 210 ms on Skylake, 0.58s on Core 2 @ 2.4GHz. The time should be realistic; we're doing all the work, just missing the last 2^16 iterations where the MSB is in the low 16. Handling any necessary corner cases in the outer loop, and cleanup, shouldn't affect speed by more than 1%.
  • (even more unfinished): vectorize the optimized LUT version with pcmpeqb / psubb (with psadbw in an outer loop, like How to count character occurrences using SIMD shows - the inner loop reduces to counting byte elements in a fixed-size array that match a value calculated in the outer loop. Just like the scalar version). 18ms on Skylake, ~0.036s on Core 2. Those times are probably now including a considerable amount of startup overhead. But as expected/hoped, about 16x faster on both.
  • Histogram the wordbits table once (perhaps as you generate it). Instead of searching 64kiB to find matching bytes, just look up the answer for every outer-loop iteration! That should let you go thousands of times faster for large N. (Although you still need to handle the low 1..64K and the partial range when N isn't a multiple of 64K.) See the C sketch right after this list.
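
A minimal C sketch of that histogram idea (function and variable names are mine; wordbits is the byte popcount-of-16-bit-values LUT used throughout this answer):

/* hist[c] = how many 16-bit values have popcount c; build it once from the LUT */
static unsigned hist[17];

static void build_hist(const unsigned char *wordbits)
{
    for (unsigned i = 0; i < 65536; i++)
        hist[wordbits[i]]++;
}

/* matches in one full 64K block whose numbers all share the high half hi16
   (and therefore the MSB position msb): we need
   popcount(low16) == msb + 1 - k - popcount(hi16) */
static unsigned block_count(unsigned hi16, unsigned msb, unsigned k,
                            const unsigned char *wordbits)
{
    int target = (int)msb + 1 - (int)k - wordbits[hi16];
    return (target >= 0 && target <= 16) ? hist[target] : 0;
}

The partial top block (when N isn't a multiple of 64K) and the numbers below 65536 still need a loop or a bit more math.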

To usefully measure the faster versions, you could slap a repeat loop around the whole thing so the whole process still takes some measurable time, like half a second. (Since it's asm, no compiler will optimize away the work from doing the same N,k repeatedly.) Or you could do the timing inside the program, with rdtsc if you know the TSC frequency. But being able to use perf stat on the whole process easily is nice, so I'd keep doing that (take out the printf and make a static executable to further minimize startup overhead).


You seem to be asking about micro-optimizing the brute-force approach that still checks every number separately. (There are significant optimizations possible in how you implement the 32 - clz - popcnt == k check, though.)

There are other ways to do popcnt that are generally faster, e.g. bithacks like in How to count the number of set bits in a 32-bit integer?. But when you have a lot of popcounting to do in a tight loop (enough to keep a lookup table hot in cache), the LUT can be good.
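
For reference, the bithack from that Q&A looks like this in C (a standard SWAR popcount, no LUT and no popcnt):

static inline unsigned popcount32(unsigned x)
{
    x = x - ((x >> 1) & 0x55555555);                 // pairs of bits -> 2-bit sums
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);  // 2-bit sums -> 4-bit sums
    x = (x + (x >> 4)) & 0x0F0F0F0F;                 // 4-bit sums -> per-byte sums
    return (x * 0x01010101) >> 24;                   // add the four bytes together
}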

If you have fast SSSE3 pshufb, it could be worth using it to do a SIMD popcount for four dwords in parallel in an XMM register (auto-vectorizing the loop), or even better in a YMM register with AVX2. (First-gen Core2 has pshufb but it's not single uop until 2nd-gen Core2. Still possibly worth it.)
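
With intrinsics, that SSSE3 popcount is the well-known nibble-LUT trick (a sketch, not code from any of the benchmarks below):

#include <tmmintrin.h>   // SSSE3

/* per-byte popcount of 16 bytes at once; sum the bytes afterwards,
   e.g. with _mm_sad_epu8 against zero for per-qword totals */
static inline __m128i popcount_epi8(__m128i v)
{
    const __m128i lut  = _mm_setr_epi8(0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    const __m128i mask = _mm_set1_epi8(0x0F);
    __m128i lo = _mm_and_si128(v, mask);                     // low nibble of each byte
    __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), mask);  // high nibble of each byte
    return _mm_add_epi8(_mm_shuffle_epi8(lut, lo),
                        _mm_shuffle_epi8(lut, hi));
}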

Or much better, using SIMD to count LUT elements that match what we're looking for, for a given high-half of a number.


The brute force checking contiguous ranges of numbers opens up a major optimization for the LUT strategy: the upper n bits of the number only change once per 2^n increments. So you can hoist the count of those bits out of an inner-most loop. This also can make it worth using a smaller table (that fits in L1d cache).
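
In C, that structure looks roughly like this (a sketch under the same assumptions as the optimized asm further down: wordbits is a byte popcount table for 16-bit values, and the 1..65535 range where the MSB is in the low half still needs separate handling):

/* count matches for all numbers in (65535, N]; names are mine */
static unsigned count_high_blocks(unsigned N, unsigned k, const unsigned char *wordbits)
{
    unsigned ans = 0;
    for (unsigned hi = N & 0xFFFF0000u; hi >= 0x10000; hi -= 0x10000) {
        unsigned msb = 31 - __builtin_clz(hi);            // loop-invariant for the whole 64K block
        int target = (int)msb + 1 - (int)k - wordbits[hi >> 16];  // required popcount of the low 16 bits
        unsigned limit = (hi == (N & 0xFFFF0000u)) ? (N & 0xFFFF) : 0xFFFF;
        for (unsigned lo = 0; lo <= limit; lo++)          // the hot inner loop:
            ans += (wordbits[lo] == target);              // one byte load + compare per number
    }
    return ans;
}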

Speaking of which, your 64k * 4 table is 256KiB, the size of your L2 cache. This means it's probably having to come in from L3 every time you loop through it. Your desktop CPU should have enough L3 bandwidth for that (and the access pattern is contiguous thanks to the increments), and modern servers have larger L2, but there's very little reason not to use a byte LUT (popcnt(-1) is only 32). Modern Intel CPUs (since about Haswell) don't rename AL separately from the rest of EAX/RAX, and a movzx byte load is just as cheap as a mov dword load.

; General LUT lookup with two 16-bit halves
    movzx  edx, cx            ; low 16 bits
    mov    eax, ecx
    shr    eax, 16            ; high 16 bits
    movzx  edx, byte [wordbits + edx]
    add     dl,      [wordbits + eax]
      ; no partial-reg stall for reading EDX after this, on Intel Sandybridge and later
      ; on Core 2, set up so you can cmp al,dl later to avoid it

On an Intel CPU so old that it doesn't support popcnt, that will cause a partial-register stall. Do the next compare with cmp al, dl instead. (Use lea or add or sub on the bsr result, instead of the popcount LUT load, so you can avoid a partial-register stall.)

Normally you'd want to use a smaller LUT, like maybe 11 bits per step, so 3 steps handles a whole 32-bit number (2^11 = 2048 bytes, a small fraction of 32k L1d). But with this sequential access pattern, hardware prefetch can handle it and fully hide the latency, especially when the L1d prefetches will hit in L2. Again, this is good because this loop touches no memory other than this lookup table. Lookup tables are a lot worse in the normal case where significant amounts of other work happen between each popcount, or you have any other valuable data in cache you'd rather not evict.
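
With a hypothetical 2048-byte table lut11[i] = popcount(i), that split would look like:

    unsigned pc = lut11[x & 0x7FF] + lut11[(x >> 11) & 0x7FF] + lut11[x >> 22];   // 11 + 11 + 10 bits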


Optimized for Skylake (i7-6700k), even with 2 LUT accesses per iteration: 0.600 seconds at 3.9GHz

vs. 0.536 seconds with popcnt. Hoisting the high-half LUT lookup (and maybe the 32 constant) might even let that version be faster.

Note: a CPU so old that it doesn't have popcnt will be significantly different from Skylake. Optimizing this for Skylake is a bit silly, unless you take this further and wind up beating the popcnt version on Skylake, which is possible if we can hoist the BSR work by having nested loops, with an inner loop that uses the same BSR result for the whole range of numbers from 2^m .. 2^(m+1)-1 (clamped to a 64k range so you can also hoist the high half popcnt LUT lookup). popcnt_low(i) == some constant calculated from k, popcnt_high(i), and clz(i).


3 major things were quite important for Skylake (some of them relevant for older CPUs, including avoiding taken branches for front-end reasons):

  • Avoid having a cmp/jcc touching a 32-byte boundary on Intel Skylake-derived CPUs with up-to-date microcode, because Intel mitigated the JCC erratum by disabling the uop cache for such lines: 32-byte aligned routine does not fit the uops cache

    That meant looking at the disassembly and deciding whether to make instructions longer (e.g. with lea eax, [dword -1 + edx] to force a 4-byte displacement instead of the smaller disp8), and whether to use align 32 at the top of a loop.

  • No-increment is much more common than increment, and Intel CPUs can only run taken branches at 1/clock. (But since Haswell there's a 2nd execution unit on another port that can run predicted-not-taken branches.) Change jne notk to je yesk, targeting a block below the function that jumps back. Tail-duplication of the dec ecx / jnz .loop / else fall through to a jmp print_and_exit helped a tiny amount vs. just jumping back to after the je yesk.

    It's taken so rarely (and has a consistent enough pattern) that it doesn't mispredict often, so setnz al / add ebx, eax would probably be worse.

  • Optimize the 32 - clz - popcnt == k check, taking advantage of the fact that bsr gives you 31-clz. So 31-clz - (popcnt-1) = 32-clz-popcnt.
    Since we're comparing that for == k, it can be further rearranged to popcnt-1 + k == 31-clz (the chain of equivalent checks is recapped right after this list).
    When we're using a LUT for popcount, instead of a popcnt instruction that has to run on port 1, we can afford to use a 3-component (slow) LEA like lea edx, [edx + esi - 1] to do the popcnt-1+k. Since it has 3 components (2 registers and a displacement, 2 + signs in the addressing mode), it can only run on port 1 (with 3 cycle latency), competing with bsr (and popcnt if we were using it).
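
For reference, the chain of equivalent checks (for non-zero x, where bsr(x) = 31 - clz(x)):

    32 - clz(x) - popcount(x) == k         ; the original condition (the popcnt version keeps this form)
    31 - clz(x) == popcount(x) - 1 + k     ; move popcount across and use bsr = 31-clz
    bsr(x) + (1 - k) == popcount(x)        ; what the LUT version's cmp al, dl actually compares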

Taking advantage of lea saved instructions in general, even in the popcnt version. So did counting the loop down towards 0, with a macro-fused 1-uop dec/jnz instead of inc + cmp/jne. (I haven't tried counting up to see if L1d HW prefetch works better in that direction; the popcnt version won't care but the LUT version might.)

Ported to work without io.inc, just using hard-coded N and k with printf for output. This is not "clean" code, e.g. nasty hacks like %define wordbits edi that I put in to test changing alignment of branches by using indexed addressing modes instead of [reg + disp32] for every access to the array. That happened to do the trick, getting almost all of the uops to come from DSB (the uop cache) instead of MITE, i.e. avoided the JCC erratum slowdown. The other way to do it would be making instructions longer, to push the cmp/je and dec/jnz past a 32-byte boundary. (Or to change the alignment of the start of the loop.) Uop-cache fetch happens in lines of up-to-6 uops and can be a bottleneck if you end up with a line with only a couple uops. (Skylake's loop-buffer aka LSD is also disabled by microcode to fix an earlier erratum; Intel had more big bugs with Skylake than most designs.)

%use SMARTALIGN
alignmode p6, 64

section .bss
 wordbits: resb 65536
;    n resd 1
;    k resd 1
    ans resd 1
section .rodata
  n: dd 1000000000
  k: dd 8
  print_fmt: db `ans: %d\n`, 0

section .text

global main
main:            ; no popcnt version
    push  ebp
    push  edi    ; save some call-preserved registers
    push  esi
    push  ebx

    mov   edi, wordbits
%define wordbits edi             ; dirty hack, use indexed addressing modes instead of reg+disp32.
                                 ; Avoids Skylake JCC erratum problems, and is slightly better on Core2 with good instruction scheduling
    ;fill in wordbits, ecx is wordbits array index
    mov   ecx, 1     ; leave wordbits[0] = 0
.init_loop:
    mov   eax,ecx
    xor   ebx,ebx
  .popc_loop:
      lea   edx, [eax-1]
      inc   ebx
      and   eax,edx         ; v &= v - 1; // blsr
      jnz  .popc_loop

    ;computed bits set
    mov [wordbits + ecx], bl

    inc ecx
    cmp ecx,65536
    jb .init_loop       ; bugfix: array out of bounds with jna: stores to wordbits[65536]


;    GET_DEC 4,n
;    GET_DEC 4,k
    mov   ecx, [n]      ; ecx counts from n down to 1
;    mov   esi, [k]
    xor   ebx, ebx      ; ebx = ans

    mov   esi, 1
    sub   esi, [k]      ; 1-k
align 32
.loop:
    ;popcnt eax, ecx
    movzx  eax, cx
    mov    ebp, ecx         ; using an extra register (EBP) to schedule instructions better(?) for Core2 decode
    movzx  edx, byte [wordbits + eax]
    shr    ebp, 16
;    xor eax, eax        ; break false dependency, or just let OoO exec hide it after breaking once per iter
    bsr    eax, ecx         ; eax = 31-lzcnt for non-zero ecx
;    sub    edx, esi         ; sub now avoids partial-reg stuff.  Could have just used EBX to allow BL.
    add eax, esi           ; Add to BSR result seems slightly better on Core2 than sub from popcnt
    add     dl,      [wordbits + ebp]   ; we don't read EDX, no partial-register stall even on P6-family

                        ;; want: k == 32-__builtin_clz(x)-_mm_popcnt_u32(x)
    cmp  al, dl         ; 31-clz+(1-k)  == popcount.  or  31-clz == popcnt - (1-k)
    je .yesk          ; not-taken is the more common fast path
 .done_inc:
    dec ecx
    jnz .loop         ; }while(--n >= 0U)

.print_and_exit:
    ;print ans
;    PRINT_DEC 4,ans
    push  ebx
    push  print_fmt
extern printf
    call  printf
    add   esp, 8

    pop  ebx
    pop  esi
    pop  edi
    pop  ebp
    xor  eax, eax
    ret

align 8
.yesk:
   inc  ebx
;   jmp  .done_inc           ; tail duplication is a *tiny* bit faster
   dec  ecx
   jnz  .loop
   jmp  .print_and_exit

This is version 3, updated to avoid partial-register penalties on Core 2 (Conroe). It runs in 1.69s there (previously 1.78s), vs. 3.18s before those fixes. Still sometimes as fast on Skylake, but more often 610ms instead of 594ms. I don't have perf counter access on my Core 2; it's too old for perf to fully support, and I don't have a perf build for the kernel it booted last.

(disassembly and perf results for version 1 on Godbolt: https://godbolt.org/z/ox7e8G)

On my Linux desktop, i7-6700k at 3.9GHz. (EPP = balance_performance, not full performance, so it doesn't want to turbo to 4.2GHz apparently.) I don't need sudo to use perf because I set /proc/sys/kernel/perf_event_paranoid = 0. I use taskset -c 3 just to avoid CPU migrations for single-threaded workloads.

# Results from version 1, not the Core2-friendly version.
# Version 3 sometimes runs this fast, but more often ~610ms
# Event counts are near identical for both, except cycles, but uops_issue and executed are mysteriously lower, like 9,090,858,203 executed.
$ nasm -felf32 foo.asm -l/dev/stdout &&
    gcc -m32 -no-pie -fno-pie -fno-plt foo.o 
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread -r 2 ./a.out 
ans: 12509316
ans: 12509316

 Performance counter stats for './a.out' (2 runs):

            597.78 msec task-clock                #    0.999 CPUs utilized            ( +-  0.12% )
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                62      page-faults               #    0.103 K/sec                    ( +-  0.81% )
     2,328,038,096      cycles                    #    3.894 GHz                      ( +-  0.12% )
     2,000,637,322      branches                  # 3346.789 M/sec                    ( +-  0.00% )
         1,719,724      branch-misses             #    0.09% of all branches          ( +-  0.02% )
    11,015,217,584      instructions              #    4.73  insn per cycle           ( +-  0.00% )
     9,148,164,159      uops_issued.any           # 15303.609 M/sec                   ( +-  0.00% )
     9,102,818,982      uops_executed.thread      # 15227.753 M/sec                   ( +-  0.00% )

   (from a separate run):
      9,204,430,548      idq.dsb_uops              # 15513.249 M/sec                   ( +-  0.00% )
         1,008,922      idq.mite_uops             #    1.700 M/sec                    ( +- 20.51% )


          0.598156 +- 0.000760 seconds time elapsed  ( +-  0.13% )

This is about 3.93 fused-domain (front-end) uops/clock. So we're pretty close to the 4/clock front-end width.

With popcnt:

Your original (with GET_DEC replaced by loading a constant) ran in 1.3 sec on my desktop, for k=8 N=1000000000. This version runs in about 0.54 sec. My version of your original wasn't even the worst possible case for alignment of branches (another version was about 1.6 sec), although since I did have to change things it could be different from your machine.

I used mostly the same optimizations as above to save uops and help out the front-end inside the loop. (But I did this first, so it's missing some optimizations.)

align 32
.loop:
    mov    eax, ecx
    popcnt eax,eax
    lea    edx, [dword eax - 32 + 31]  ; popcnt - 32  =  -(bits not set)
                   ; dword displacement pads the cmp/jnz location to avoid the JCC erratum penalty on Intel

;    xor eax, eax         ; break false dependency, or just let OoO exec hide it after breaking once per iter
    bsr eax, ecx         ; eax = 31-lzcnt
;    xor eax, 0x1f        ; eax = lzcnt (for non-zero x)
    ; want:  32-__builtin_clz(x)-_mm_popcnt_u32(x)  = (31-clz) + 1-popcnt = (31-clz) - (popcnt-1)
    sub eax, edx

    cmp eax, esi  ;is there k non-leading bits in ecx?
%if 0
    jnz .notk
    inc ebx       ;if so, then increment ans
.notk:
%else
    jz .yesk      ; not-taken is the more common fast path
 .done_inc:
%endif
    dec ecx
    jnz .loop   ; }while(--n >= 0U)
    
    ;print ans
;    PRINT_DEC 4,ans
    push  ebx
    push  print_fmt
extern printf
    call  printf
    add   esp, 8

    pop  ebx
    pop  esi
    xor  eax, eax
    ret

.yesk:
   inc  ebx
   jmp  .done_inc         ;; TODO: tail duplication

(Unfinished) Inner loop with invariant clz(x) and high-half popcount

This version runs in only 0.58 sec on my 2.4GHz Core 2 Duo E6600 (Conroe), same microarchitecture as your Xeon 3050 2.13GHz.
And in 210ms on my Skylake.

It does most of the work, only missing cleanup for N < 65536 (or the low 65536 of a larger N, where the MSB is in the low half), and maybe missing handling a couple other corner cases in the outer loop. But the inner loop totally dominates the run-time, and it wouldn't have to run any more iterations in a finished version, so these times should be realistic.

It still brute-force checks every single number, but most of the per-number work that depends on the high half is loop-invariant and hoisted out. That assumes non-zero high halves, but only 2^16 numbers have their MSB in the low 16 bits. And narrowing to only the low 12 or 14 bits means less cleanup, as well as a smaller part of the LUT to loop over that can stay hot in L1d.

%use SMARTALIGN
alignmode p6, 64

section .bss
align 4096
 wordbits: resb 65536
;    n resd 1
;    k resd 1
;    ans resd 1
section .rodata
  ;n: dd 0x40000000        ; low half zero, maybe useful to test correctness for a version that doesn't handle that.
  n:  dd 1000000000 ; = 0x3b9aca00
  k: dd 8
  print_fmt: db `ans: %d\n`, 0

section .text
global main

align 16
main:
main_1lookup:
    push  ebp
    push  edi    ; save some call-preserved registers
    push  esi
    push  ebx

    mov   edi, wordbits
;%define wordbits edi             ; dirty hack, use indexed addressing modes instead of reg+disp32.
                                 ; actually slightly worse on Skylake: causes un-lamination of cmp bl, [reg+reg],
                                 ; although the front-end isn't much of a bottleneck anymore
                                 ; also seems pretty much neutral to use disp32+reg on Core 2, maybe reg-read stalls or just not a front-end bottleneck
    ;fill in wordbits, ecx is wordbits array index
    mov   ecx, 1     ; leave wordbits[0] = 0
.init_loop:
    mov   eax,ecx
    xor   ebx,ebx
  .popc_loop:
      lea   edx, [eax-1]
      inc   ebx
      and   eax,edx         ; v &= v - 1; // blsr
      jnz  .popc_loop

    ;computed bits set
    mov [wordbits + ecx], bl

    inc ecx
    cmp ecx,65536
    jb .init_loop


;    GET_DEC 4,n
;    GET_DEC 4,k
    mov   ecx, [n]      ; ecx counts from n down to 1
;    mov   esi, [k]
    xor   esi, esi      ; ans

    mov   ebp, 1
    sub   ebp, [k]      ; 1-k
align 32
.outer:
    mov    eax, ecx         ; using an extra register (EBP) to schedule instructions better(?) for Core2 decode
    shr    eax, 16
;    xor eax, eax        ; break false dependency, or just let OoO exec hide it after breaking once per iter
    bsr    ebx, ecx         ; eax = 31-lzcnt for non-zero ecx
         ;; want: k == 32-__builtin_clz(x)-_mm_popcnt_u32(x)
         ; 31-clz+(1-k)  == popcount.  or  31-clz == popcnt - (1-k)
         ; 31-clz+(1-k) - popcount(hi(x)) == popcount(lo(x))
    add    ebx, ebp
    sub     bl, byte [wordbits + eax]

    ;movzx  edx, cx
    lea    edx, [ecx - 4]   ; TODO: handle cx < 4 making this wrap
    movzx  edx, dx
    and    ecx, -65536      ; clear low 16 bits, which we're processing with the inner loop.
align 16
  .low16:
    cmp   bl, [wordbits + edx + 0]
    je    .yesk0
  .done_inc0:
    cmp   bl, [wordbits + edx + 1]
    je    .yesk1
  .done_inc1:
    cmp   bl, [wordbits + edx + 2]
    je    .yesk2
  .done_inc2:
    cmp   bl, [wordbits + edx + 3]
    je    .yesk3
  .done_inc3:

; TODO: vectorize with pcmpeqb / psubb / psadbw!!
; perhaps over fewer low bits to only use 16kiB of L1d cache
    
    sub  edx, 4
    jae  .low16        ; }while(lowhalf-=4 doesn't wrap)

   sub   ecx, 65536
   ja   .outer
; TODO: handle ECX < 65536 initially or after handling leading bits.  Probably with BSR in the inner loop


.print_and_exit:
    ;print ans
;    PRINT_DEC 4,ans
    push  esi
    push  print_fmt
extern printf
    call  printf
    add   esp, 8

    pop  ebx
    pop  esi
    pop  edi
    pop  ebp
    xor  eax, eax
    ret

align 16
%assign i 0
%rep 4
;align 4
.yesk%+i:
   inc  esi
   jmp  .done_inc%+i
%assign i  i+1
%endrep
  ; could use a similar %rep block for the inner loop

     ; attempt tail duplication?
     ; TODO: skip the next cmp/jcc when jumping back.
     ; Two in a row will never both be equal

;   dec  ecx
;   jnz  .loop
;   jmp  .print_and_exit

Skylake perf results:

(update after outer-loop over-count on first iter bugfix, ans: 12497876)

ans: 12498239        # This is too low by a bit vs. 12509316
                     # looks reasonable given skipping cleanup

            209.46 msec task-clock                #    0.992 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                62      page-faults               #    0.296 K/sec                  
       813,311,333      cycles                    #    3.883 GHz                    
     1,263,086,089      branches                  # 6030.123 M/sec                  
           824,103      branch-misses             #    0.07% of all branches        
     2,527,743,287      instructions              #    3.11  insn per cycle         
     1,300,567,770      uops_issued.any           # 6209.065 M/sec                  
     2,299,321,355      uops_executed.thread      # 10977.234 M/sec                 

(from another run)
        37,150,918      idq.dsb_uops              #  174.330 M/sec                  
     1,266,487,977      idq.mite_uops             # 5942.976 M/sec

       0.211235157 seconds time elapsed

       0.209838000 seconds user
       0.000000000 seconds sys

Note that uops_issued.any is about the same as idq.DSB_uops + idq.MITE_uops - if we'd used EDI as a pointer to save code-size, uops_issued.any would be much higher because of unlamination of the indexed addressing modes from micro + macro-fused cmp+jcc.

Also interesting that branch misses is even lower; perhaps the unrolling helped distribute the history better in the IT-TAGE predictor table.


SSE2 SIMD

Also unfinished, not handling corner cases or cleanup, but I think doing approximately the right amount of work.

Unlike in How to count character occurrences using SIMD, the array we're matching against has known limits on how often matches can occur, so it happens to be (mostly?) safe to not do nested loops, just a 2^14 (16384) iteration loop unrolled by 2 before widening the byte counters out to dword. At least for k=8.

This gets a total count of 12507677, just slightly lower than 12509316 (correct for N=1000000000, k=8), but I haven't checked if that's all due to not doing 1..16384, or if I'm losing any counts anywhere.

You could unroll over outer loop iterations to make use of each XMM vector twice or 4 times for each load. (With sequential access to an array in L1d cache, that could possibly let us go slightly faster by doing more ALU work per load, but not much faster.) By setting up 2 or 4 vectors to match against for 2 or 4 different high halves, you can spend longer in the inner loop. Possibly we could go a bit faster than 1 compare/accumulate per clock. That might run into (cold) register-read bottlenecks on Core 2, though.

The version below just does a standard unroll.

;;;;;  Just the loop from main_SSE2, same init stuff and print as main_1lookup
align 32
.outer:
    mov    eax, ecx         ; using an extra register (EBP) to schedule instructions better(?) for Core2 decode
    shr    eax, 16-2

;    xor eax, eax        ; break false dependency, or just let OoO exec hide it after breaking once per iter
    bsr    ebx, ecx         ; eax = 31-lzcnt for non-zero ecx
         ;; want: k == 32-__builtin_clz(x)-_mm_popcnt_u32(x)
         ; 31-clz+(1-k)  == popcount.  or  31-clz == popcnt - (1-k)
         ; 31-clz+(1-k) - popcount(hi(x)) == popcount(lo(x))
    add    ebx, ebp
    movzx  edx, al
;    movzx  edx, byte [wordbits + edx]
    sub    bl, byte [wordbits + edx]
    shr    eax, 8            ; high part is more than 16 bits if low is 14, needs to be broken up
    sub    bl, byte [wordbits + eax]
;    movzx  eax, byte [wordbits + eax]
;    add    eax, edx
;    sub    ebx, eax

    movzx  eax,  bl
    movd   xmm7, eax
    pxor   xmm0, xmm0
    pxor   xmm1, xmm1    ; 2 accumulators
    pshufb xmm7, xmm0    ; broadcast byte to search for.
      ;;  Actually SSSE3, but it only takes a few more insns to broadcast a byte with just SSE2.  
      ;; e.g. imul eax, 0x01010101 / movd / pshufd

    ;movzx  edx, cx
;    lea    edx, [ecx - 4]   ; TODO: handle cx < 4 making this wrap
;    movzx  edx, dx
    and    ecx, -16384      ; clear low bits, which we're processing with the inner loop.

    mov    edx, wordbits     ; quick and dirty, just loop forward over the array
     ;; FIXME: handle non-zero CX on first outer loop iteration, maybe loop backwards so we can go downwards toward 0,
     ;; or calculate an end-pointer if we can use that without register-read stalls on Core 2.
     ;; Also need to handle the leftover part not being a multiple of 32 in size
     ;; So maybe just make a more-flexible copy of this loop and peel the first outer iteration (containing that inner loop)
     ;;  if the cleanup for that slows down the common case of doing exactly 16K 
align 16
  .low14:
    movdqa  xmm2, [edx]
    movdqa  xmm3, [edx + 16]
 ds   pcmpeqb xmm2, xmm7           ; extra prefixes for padding for Skylake JCC erratum: 18ms vs. 25ms
 ds   psubb   xmm0, xmm2
    ds add     edx, 32
 cs   pcmpeqb xmm3, xmm7
 cs   psubb   xmm1, xmm3

  ; hits are rare enough to not wrap counters?
  ; TODO: may need an inner loop to accumulate after 256 steps if every other 32nd element is a match overflowing some SIMD element
    cmp    edx, wordbits + 16384
    jb   .low14

   pxor   xmm7, xmm7
   psadbw xmm0, xmm7
   psadbw xmm1, xmm7       ; byte -> qword horizontal sum
   paddd  xmm0, xmm1       ; reduce to 1 vector
   movhlps xmm1, xmm0
   paddd  xmm0, xmm1       ; hsum the low/high counts
   movd   eax, xmm0
   add    esi, eax         ; sum in scalar (could sink this out)

   sub   ecx, 16384
   ja   .outer
; TODO: handle ECX < 65536 initially or after handling leading bits.  Probably with BSR in the inner loop

Probably can just PADDD into a vector accumulator and only hsum to scalar outside the loop, but we might want more free vector regs?

Casual answered 8/3, 2021 at 15:0 Comment(28)
The 2-access LUT version is getting about 3.93 fused-domain uops/clock through the front-end, out of a theoretical max of 4 on SKL. (uops_issued.any). So cache loads aren't a bottleneck, and neither is the front-end except for branch mispredicts. Further improvement would require hoisting more work out of the loop, getting closer to the back-end limit of 2 branches/clock (which we're already within ~15% of).Casual
For the popcnt version, setting all the upper bits to 1 might be workable, to reduce it to popcnt(x) == 32-k without needing BSF / CLZ. For a range of numbers with the same MSB position, we can just increment until that produces a carry-out all the way to CF. Or maybe we can count down from -2 .. -(N+1) or something like that, if that will cover the same range of bit-patterns but with the high bits all set instead of clear. Actually the bit patterns are inverted, so we're looking for a number of ones... but then we need to ignore high ones, and we have the original problem. Hmm.Casual
I don't even know how slow the CPU in the server is, but the non-popcnt version gets "Time-limit exceeded".Crag
@Anonymix321: not totally surprising, probably slower than 4GHz (like 3GHz or under), and maybe even sharing a physical core with another hyperthread; pretty clearly you're intended to do something smarter than brute force. Or if the server is truly too old for popcnt, then it's a Core2 or older, or old AMD, and tuning for Skylake was pointless. (And its front-end bottlenecks will be severe compared to Skylake hitting nearly 4 uops / clock; no uop cache. And also back-end bottlenecks on branch throughput. All kinds of bottlenecks, see agner.org/optimize)Casual
I guess that server is truly too old for popcnt, as it shows a runtime error. Would it be simpler to just use multiple threads (how?) or use the fact that the last 2 bits repeat every 4 numbers?Crag
@Anonymix321: If it would let you start multiple threads, you could do that. But the clever option would be to look for algorithmic optimizations by generalizing that low-2-bits idea to more bits, like probably some kind of combinatorics formula for each possible MSB position for the k you want, and then maybe it gets trickier for the last bit position, with N not being a power of 2.Casual
I thought of computing the first number of each quadruple (or 2^i-tuple if that isn't fast enough) the LUT way, and for the other 3 something like this: #2=#1-1, #3=#2, #4=#3-1 (#i means the i-th number, counting from 1 to 4)Crag
@Anonymix321: updated a bit with some ideas about the fact that a CPU without popcnt won't be Skylake (looks like Core 2), so some of the tuning choices are irrelevant, and partial-flag stalls are going to be a big problem along with other stalls. Probably still need algorithmic improvements, though, not just brute force; you likely won't be able to get close to 4 uops / clock on Core2. When I first wrote the answer, I was more interested in just optimizing it for Skylake.Casual
@Crag if you're doing competitive programming then you can get information about the running platform with the OS or compiler tools, for example from /proc/cpuinfo on Linux. That'll make optimizing the code easier.Pelf
@Pelf I tried, but I am not allowed to read this file on the server. If I try to launch cat /proc/cpuinfo I get a security error. And directly reading from this file is not possible - fopen returns NULL. But __builtin_cpu_is("core2") returns true (that's about the only info I could get from the server).Crag
@Anonymix321: You could of course write your own CPUID dumping code; if you can execute asm and see your program's output (not just a judge pass/fail/timeout), you can read the CPU model string straight from the CPU with the CPUID instruction (en.wikipedia.org/wiki/… shows a C equivalent). Not that you need to; __builtin_cpu_is is sufficient confirmation of what I'd already guessed based on feature set. But the model string would include the stock clock speed, e.g. Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz on mineCasual
It's a Xeon 3050 @ 2.12 GHz and the program runs for about 1.1 secCrag
@Anonymix321: Ok, ark.intel.com/content/www/us/en/ark/products/27204/… is a dual-core Conroe, exactly what I have except lower clock speed. (And only 2M L2 cache)Casual
So I have to think of a smarter algorithm (if I count only for numbers of the form i=4k, then there will be only 3 different values for 4k, 4k-1, 4k-3. And for i=8k there will be only 4 different values) and it's not possible to fasten this program?Crag
@Anonymix321: I was just working on an update. It seems hoisting the high-half / MSB-position-dependent work can let us go fast enough with just scalar code on a Core 2. But SSE2 could check 16 LUT entries at a time for matches, probably in the ballpark of 8x faster than scalar. Maybe even 16. So for this problem, it is possible to basically brute-force, without doing any significant math / combinatorics type stuff.Casual
@Anonymix321: (English usage not: "fasten" = attach things together like a bolt or glue! "quicken" would work, although normally we'd say "and it's not possible to speed up this program?". There isn't a form of "fast" that means "to make faster / to speed up" because "fasten" already exists and means something else.)Casual
English is not my native language, so these incorrect word usages happen. But thanks for correcting me ;).Crag
Will you try writing an SSE2 program for this? It's interesting how fast it could be using these instructions.Crag
@Anonymix321: Yeah, English is weird and sometimes interesting; native speakers realize this and don't have a problem as long as we can still figure out what someone is trying to say. I hadn't ever really thought about the fact that "fasten" can't mean "make faster", although now that I think some more, there's no word like "highen" for "more high".Casual
@Anonymix321: Like I said in my answer, it's just an application of How to count character occurrences using SIMD. To get a sense of the possible inner-loop speed, you just have to port that to __m128i XMM instead of __m256i AVX2; all the necessary instructions are in SSE2. An inner-loop count of 128 would maintain the power-of-2 stuff so you can avoid doing any SIMD cleanup. I expect Core2 could check pretty close to 16 bytes per clock cycle, with movdqa load / pcmpeqb / psubb leaving some front-end bandwidth for overhead of an unrolled loop.Casual
@Anonymix321: you could unroll over outer loop iterations to make use of each XMM vector twice or 4 times for each load. (Although with sequential access to an array in L1d cache, that's not much of a factor.) By setting up 2 or 4 vectors for 2 or 4 different high halves, you can spend longer in the inner loop. Not sure if the SIMD ALU throughput can let you get more than 1 compare/accumulate done per clock. That might run into (cold) register read bottlenecks on Core 2, though. Try it yourself if you're interested; good example / use-case to get started playing with SIMD.Casual
@Anonymix321: Oh, this reveals a huge algorithmic optimization: histogram the wordbits table (perhaps as you generate it) so instead of searching 64kiB to find matching bytes, just look up the answer!Casual
@Anonymix321: You got me curious and tempted me into seeing how fast Core2 would run the SSE2 version. As expected it's just about 16x faster, although at this point startup overhead and printf is becoming a significant part of the total time! (like 36 or 18 ms, core2 vs. Skylake). Even if we end up needing a nested loop for correctness, it should be pretty similar speed. Brute force is fun: this is still actually doing a compare/add for every single number in the range (except for skipping cleanup). Of course it could be going thousands of times faster than this, but still :PCasual
What do you mean by "histogram the wordbits"? I am also curious whether it is possible to generate the LUT at compile time (and is it worth it?). And I checked the SSE2 version (against the popcnt and original LUT versions): from 1 to 16384 there are 2002 appropriate numbers, from 1 to 65536 - 11440; 12507677+2002=12509679>12509316. And if n=16384, k=8 is used then it gives 3003. n=16384 and k=0 gives 1. n=256, k=0 => 2002. The non-SSE2 version is strange too: 12498239+11440=12509679 (familiar number, right?). Is it supposed to be like that, or should it be considered an error?Crag
@Anonymix321: Oh, the SSE2 version over-counts in the first partial chunk: it counts all 16k numbers above N&-16384 (round down to a multiple of 16k), not just the N .. N&-16384 range. I thought I was getting that right for the non-SSE2 version, but I see now I rounded ECX down before LEA/MOVZX, creating the same over-count problem at the top, because 1000000000 isn't a multiple of 64Ki. So the pieces of the idea were there, I just put them together wrong. :P Thanks for checking.Casual
@Anonymix321: A histogram is counts[ arr[i] ]++. It would be a lookup table that can replace the inner loop like I said, e.g. instead of linear searching for matches for bl=12 for example, you'd just look up the answer from a 0..31 lookup table of dwords. Pre-computing that and/or the LUT would be possible, but probably not worth it. If that time is getting expensive, use SSSE3 pshufb to popcount like 0x80.pl/articles/sse-popcount.html to init 16 elements of the LUT in parallel. (Maybe with inner/outer loops.)Casual
@Anonymix321: Fully pre-computed is ok if the executable is already hot in the pagecache and can just get mapped with a soft pagefault, but in general we tune programs to start up efficiently when they might have to get loaded from disk. Loading an extra 64k from disk may not be faster than we can compute it. Or shrink the LUT to 16k or 4k so it's faster to compute (or load, though); the outer loop will have to run slightly more often. To actually do the precompute, you'd probably write a separate program to create a binary and do incbin foo.bin, rather than NASM %rep / macros.Casual
@Anonymix321: updated with the bugfix for the outer loop over-count on the first iteration; now ans: 12497876 for the non-SIMD version, which is exactly correct for the 64K .. N range: 12497876 + 11440 = 12509316. The SIMD loop could get rewritten to loop downwards or to a variable end-point, but will itself need cleanup to handle a non-multiple-of-32 part. So worst case you'd peel that first outer iteration and make a dedicated partial-range inner loop for it, if the extra flexibility stops it from running as fast for each of the later outer-loop iterations that always do a full 16KCasual

Here's an attempt at algorithmic optimization.

I. Number of desired integers within the range [0; 2 ** floor(log2(N)))

All of these integers are less than N, therefore we only need to check how many of them have exactly K zero-bits below the leading one bit.

For an integer of bit-length n, there are n - 1 possible positions to place our zeros (the bits below the leading one bit). Therefore the number of desired integers of bit-length n is the number of ways to pick k zeros out of n - 1 places (without repetition, unordered). We can compute that using the binomial coefficient formula:

n! / (k! * (n - k)!)

If we're using 32-bit integers, then the max possible value of n is 31 (and the same for k). The factorial of 31 is still huge and won't fit even in a 64-bit number, so we have to perform repeated division instead (this can be constexpr-precomputed at compile time).

To get the total number of integers, we compute the binomial coefficient for each n from 1 up to floor(log2(N)) and sum them up.
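
For example, with N = 10^9 (so floor(log2(N)) = 29) and K = 8, this part sums C(n-1, 8) for n = 1..29, which collapses to C(29, 9) = 10,015,005 by the hockey-stick identity; part II below then accounts for the remaining 2,494,311 of the final 12,509,316 (assuming I've drawn the range boundaries the same way the code below does).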

II. Number of desired integers within the range [2 ** floor(log2(N)); N]

Start with the bit right after the leading one bit, and apply the following algorithm:

  • If the current bit is zero, then we can't do anything about that bit (it has to be zero; if we changed it to one, the integer value would become larger than N), so we simply decrement our zero-bits budget K and move to the next bit.

  • If the current bit is one, then we can pretend that it is zero. Now any combination of the remaining lower-significance bits will stay in the range below N. Fetch the binomial coefficient value to figure out how many ways there are to pick the remaining number of zeros from the remaining number of bits, and add that to the total.

The algorithm stops once we run out of bits or K becomes zero. At this point, if K equals the remaining number of bits, that means we can zero them out to get a desired integer, so we increment the total count by one (this counts N itself towards the total). Or if K is zero and all the remaining bits are one, then we can also count N towards the total.

Code:

#include <stdio.h>
#include <chrono>

template<typename T>
struct Coefficients {
  static constexpr unsigned size_v = sizeof(T) * 8;

  // Zero-initialize.
  // Indexed by [number_of_zeros][number_of_bits]
  T value[size_v][size_v] = {};

  constexpr Coefficients() {
    // How many different ways we can choose k items from n items
    // without order and without repetition.
    //
    // n! / k! (n - k)!

    value[0][0] = 1;
    value[0][1] = 1;
    value[1][1] = 1;

    for(unsigned i = 2; i < size_v; ++i) {
      value[0][i] = 1;
      value[1][i] = i;

      T r = i;

      for(unsigned j = 2; j < i; ++j) {
        r = (r * (i - j + 1)) / j;
        value[j][i] = r;
      }

      value[i][i] = 1;
    }
  }
};


template<typename T>
__attribute__((noinline)) // To make it easier to benchmark
T count_combinations(T max_value, T zero_bits) {
  if( max_value == 0 )
    return 0;

  constexpr int size = sizeof(T) * 8;
  constexpr Coefficients<T> coefs;
  // assert(zeros_bits < size)

  int bits = size - __builtin_clz(max_value);

  T total = 0;

  // Count all-ones count.
#pragma clang loop vectorize(disable)
  for(int i = 0; i < bits - 1; ++i) {
    total += coefs.value[zero_bits][i];
  }

  // Count interval [2**bits, max_value]
  bits -= 1;
  T mask = T(1) << bits;
  max_value &= ~mask;      // Remove leading bit
  mask = mask >> 1;

#pragma clang loop vectorize(disable)
  while( zero_bits && zero_bits < bits ) {
    if( max_value & mask ) {
      // If current bit is one, then we can pretend that it is zero
      // (which would only make the value smaller, which means that
      // it would still be < max_value) and grab all combinations of
      // zeros within the remaining bits.
      total += coefs.value[zero_bits - 1][bits - 1];

      // And then stop pretending it's zero and continue as normal.
    } else {
      // If current bit is zero, we can't do anything about it, just
      // have to spend a zero from our budget.

      zero_bits--;
    }

    max_value &= ~mask;
    mask = mask >> 1;
    bits--;
  }

  // At this point we don't have any more zero bits, or we don't
  // have any more bits at all.

  if( (zero_bits == bits) ||
      (zero_bits == 0 && max_value == ((mask << 1) - 1)) ) {
    total++;
  }

  return total;
}

int main() {
  using namespace std::chrono;

  unsigned count = 0;
  time_point t0 = high_resolution_clock::now();

  for(int i = 0; i < 1000; ++i) {
    count |= count_combinations<unsigned>(1'000'000'000, 8);
  }
  time_point t1 = high_resolution_clock::now();

  auto duration = duration_cast<nanoseconds>(t1 - t0).count();

  printf("result = %u, time = %lld ns\n", count, duration / 1000);

  return 0;
}

Results (for N=1'000'000'000, K=8, running on i7-9750H):

result = 12509316, time = 35 ns

If the binomial coefficients are computed at runtime instead, it takes ~3.2 µs.

Musjid answered 19/4, 2021 at 23:44 Comment(0)
