AVX-512 and Branching

I'm confused as to what masking can do in theory in relation to branches. Let's say I have a Skylake-SP (ha, I wish..), and we're ignoring compiler capabilities; just what's possible in theory:

If a branch conditional is dependent on a static flag, and all branches set an array to a computational result, assuming the compiler does not optimize this to two separate loops anyways, can it vectorize?

do i = 1, nx
  if (my_flag .eq. 0) then
    a(i) = b(i) ** 2
  else
    a(i) = b(i) ** 3
  end if
end do

If only a subset of the branches sets the value in question, can it vectorize?

do i = 1, nx
  if (my_flag .eq. 0) then
    a(i) = b(i) ** 2
  end if
end do

If a branch conditional is in itself dependent on vector data, can it vectorize?

do i = 1, nx
  if (c(i) > 0) then
    a(i) = b(i) ** 2
  else
    a(i) = b(i) ** 3
  end if
end do
Capelin answered 25/11, 2017 at 1:6 Comment(4)
It's pretty hard to generalize for these kinds of things, but here's my two cents. If the compiler is certain a flag has a certain value, the branch can be removed. So if my_flag was a parameter, I expect example 1&2 would vectorize. Also, compilers tend to 'hoist' conditionals out of loops.Persecute
I'm aware of that, however what I meant by my_flag is something decided at runtime that's specifically not hoisted (for whatever reason, maybe it's not possible due to some side effects within the branch). AVX-512's new mask intrinsics are supposed to help in some of these cases, but I can't find any documentation about what kind of branches can be vectorized with those.Branton
@Ross: Not that hard to generalize for the 3rd case (a per-element condition), as long as both sides of the branch are cheap enough to evaluate in parallel. Agreed that the first 2 cases don't make sense because a good compiler will hoist the condition, or at least have no trouble doing it in the loop if it's independent of the array data.Targett
The first two examples really don't make a lot of sense. You can't just handwave away the loop hoisting part. If the condition couldn't be hoisted because it had side-effects, then that would be the thing preventing vectorization. So it's hard to envision a world where the compiler couldn't hoist the check, but still wanted to vectorize, and yet the check doesn't depend on i or the vector data. What would that look like? This isn't just being pedantic: this detail can easily change the answer from "yes" to "no".Kirghiz

Note: This answer mostly discusses a very specific memory-access issue when it comes to vectorization and it applies mostly at a conceptual level to transforming a series of scalar accesses to arrays into vectorized accesses without assuming anything about what portions of the underlying arrays are mapped. In languages like Fortran, the semantics of the language itself may guarantee that arrays are contiguously mapped, or bounds-checks before entering the loop might be enough to avoid the problem mentioned below.

This answer shouldn't be seen as a good treatment of vectorization in general and certainly not in Fortran specifically. A more comprehensive treatment of vectorization issues appears in another answer, which also specifically addresses AVX-512.


One often overlooked issue with vectorizing conditions is that compilers can vectorize conditional loops of the type you are interested in, via blending or other element-wise predication techniques, only if they can prove that the vectorization accesses the same elements as are accessed in the scalar element-by-element implementation. If the instruction set doesn't offer an element-wise way to do vector loads respecting this condition, or if the compiler is unable to use such loads, this can effectively block vectorization.

Said another way, compilers can generally only fully vectorize with plain vector loads if all paths through the loop body access the same elements.

The underlying reason is that the compiled code must not access elements that aren't accessed by the semantics of the original code, even if they are later "blended away" since doing so might cause a fault! If the instruction set doesn't provide instructions to conditionally access elements in memory and suppress faults from not-selected elements, this is a significant barrier to optimization.

In the examples you gave this means that (1) and (3) could be vectorized "without hoisting the condition" while (2) could not, since (2) accesses a[i] and b[i] only in the if body, but not if the if isn't executed. Of course, a real compiler would just hoist a trivial flag check out of the loop and just not execute the loop at all in the myflag == false case, so it's not really a good example.

Let's just look at a couple of cases that subsume all your examples. First, we need a flag that cannot be hoisted - let's just use an array of bool values. So an interesting, somewhat general loop with an output array a, two input arrays b and c and a flag array f might look something like:

do i = 1, nx
  if (f(i) > 0) then
    a(i) = g(b(i), c(i));
  else
    a(i) = h(b(i), c(i));
  end if
end do

Depending on the flag f(i) corresponding to each element, we apply either the function g or h to the input elements b(i) and c(i). Per my condition above we can vectorize only if both g and h actually access the same elements of b and c.

Let's move on to two real working examples of the above:

void example1(bool* f, int* __restrict__ a, int* __restrict__ b, int* __restrict__ c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (f[i]) {
            a[i] = b[i];
        } else {
            a[i] = c[i];
        }
    }
}

void example2(bool* f, int* __restrict__ a, int* __restrict__ b, int* __restrict__ c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (f[i]) {
            a[i] = b[i] + c[i] ;
        } else {
            a[i] = b[i] - c[i] * 2 + 1 ;
        }
    }
}

Both have the same basic form, but which is harder to vectorize? The first is a simple direct assignment of either b[i] or c[i] depending on the flag. The second is a more complex function of both b[i] and c[i] which are significantly different on both paths.

Well, the second is much easier to vectorize since it accesses b[i] and c[i] unconditionally. In fact, gcc doesn't manage to vectorize either one for some reason. clang only vectorizes the second. Somewhat surprisingly, icc manages to vectorize both, since it is smart enough to use vpmaskmovd, a masked load that suppresses faults for unloaded elements.

You can examine the generated assembly on godbolt.

I had originally started this answer with the idea that accessing different array elements is currently an insurmountable barrier to vectorization for current compilers, but that's because I usually don't check icc. It's actually news to me that icc uses masked moves in this way. So the barrier is there, but at least some compilers can fault over it2.
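
To make that concrete, here is a hand-written sketch (my own, not icc's actual output) of vectorizing example1 with AVX2 masked loads: lanes whose mask is zero are simply not loaded, so they can't fault. It assumes n is a multiple of 8 and that each bool holds 0 or 1; the function name is made up.

#include <immintrin.h>
#include <stdbool.h>
#include <stddef.h>

void example1_avx2(bool* f, int* __restrict__ a, int* __restrict__ b, int* __restrict__ c, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        // Widen 8 one-byte flags to 8 dwords and turn nonzero into all-ones.
        __m128i f8   = _mm_loadl_epi64((const __m128i*)&f[i]);
        __m256i f32  = _mm256_cvtepu8_epi32(f8);
        __m256i mask = _mm256_cmpgt_epi32(f32, _mm256_setzero_si256());

        // vpmaskmovd loads: lanes with a zero mask are not accessed, so they can't fault.
        __m256i notmask = _mm256_andnot_si256(mask, _mm256_set1_epi32(-1));
        __m256i vb = _mm256_maskload_epi32(&b[i], mask);     // only where f[i] is true
        __m256i vc = _mm256_maskload_epi32(&c[i], notmask);  // only where f[i] is false

        // a[i] is written on both paths, so a plain store of the blend is fine.
        __m256i va = _mm256_blendv_epi8(vc, vb, mask);
        _mm256_storeu_si256((__m256i*)&a[i], va);
    }
}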

As the developer, you usually know that both arrays are fully accessible, such that it is safe to access all elements of b and c in the range [0, n), and it would be nice to communicate that to the compiler. I've tried adding unconditional dummy statements like b[i] = b[i]; c[i] = c[i]; or ... + c[i] * 0 which should compile to nothing but at least allow the compiler to see that semantically all elements are accessed. They do indeed "compile away", but the code generation is not improved: additional vectorization doesn't occur. Probably they are eliminated early in the compilation process, before the vectorization analysis is done, so that information is lost to the vectorizer.

Other than the masked-move instructions, which aren't free and are not fully general, are there any other ways this situation could be improved? Well, a compiler could take advantage of its knowledge of the platform memory-protection model. For example, once any byte in a 4K page on x86 has been accessed, it is free to read all other bytes on that page. One could imagine a complicated implementation that started out in safe scalar code, but as soon as an access to both arrays was "noticed", switched over to a vectorized loop for the remainder of the page.

Similar tricks could be played if the array accesses were aligned: the vectorized loop could check whether the flag array was uniformly 0 or uniformly 1; if not, it is safe to use the straightforward unconditional unmasked-read implementation, otherwise it would fall back to the more careful implementation. Such a transformation would evidently only be profitable if the masks were rarely uniform, or almost always uniform3, and so is probably unlikely to be implemented in practice.
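
As a rough illustration of that idea (and of note 3 below), here is a scalar-control sketch of my own for example1: classify each block of 8 flags as all-zero, all-one or mixed, and run a specialized body for each case. It uses a switch per block rather than the jump-threaded loops the note describes, ignores alignment details, and all names are made up.

#include <stdbool.h>
#include <stddef.h>

enum { ALL_ZERO, ALL_ONE, MIXED };

static int classify8(const bool* f) {          // inspect one 8-flag block
    int ones = 0;
    for (int j = 0; j < 8; j++) ones += f[j];
    return ones == 0 ? ALL_ZERO : (ones == 8 ? ALL_ONE : MIXED);
}

void example1_blocks(const bool* f, int* __restrict__ a, const int* __restrict__ b, const int* __restrict__ c, size_t n) {
    for (size_t i = 0; i < n; i += 8) {        // n assumed a multiple of 8
        switch (classify8(&f[i])) {
        case ALL_ONE:                          // only b is touched: a plain, vectorizable copy
            for (int j = 0; j < 8; j++) a[i + j] = b[i + j];
            break;
        case ALL_ZERO:                         // only c is touched
            for (int j = 0; j < 8; j++) a[i + j] = c[i + j];
            break;
        default:                               // mixed: both arrays are provably touched in this block,
            for (int j = 0; j < 8; j++)        // so unmasked full-width loads of b and c would be safe
                a[i + j] = f[i + j] ? b[i + j] : c[i + j];
        }
    }
}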


2 At least if AVX is available: icc will still fail to vectorize the first example if you restrict it to pre-AVX instructions, since vmaskmovps/pd were introduced with AVX (and vpmaskmovd/q with AVX2).

3 Since in that case, if you've already determined the mask is uniform, you can implement the operation unconditionally by just doing the selected side of the if without any masking/blending, based on whether it was uniform-0 or uniform-1. So you end up with three loops that internally implement the all-zeros flag case, the all-ones flag case, and the mixed flag case, with jumps between them when the next vector of flags isn't the same case as the current loop.

Kirghiz answered 25/11, 2017 at 22:27 Comment(1)
Comments are not for extended discussion; this conversation has been moved to chat.Abrade

Yes, an efficient asm implementation is possible with any of SSE2 / SSE4.1 (for blendps) / AVX / AVX-512, for all of your loops, and compilers do auto-vectorize in practice, but gcc7.2 / clang5.0 / ICC18 all have missed optimizations.

According to static analysis for Skylake-AVX512 (see below), an efficient unrolled implementation of your final loop can run at one 64-byte vector of results per 1.25 clock cycles (plus loop overhead depending on how much you unroll). In practice, 1.33 or 1.5 clock cycles per vector is probably achievable, if your data is hot in L1D cache. Otherwise you easily bottleneck on L2 bandwidth, because you load 2x 64B per 64B vector store.

For a C version of your loop, gcc, clang, and ICC all auto-vectorize more or less like I did by hand: See source + asm on the Godbolt compiler explorer.
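
For reference, a C version of the 3rd loop probably looks something like this (a reconstruction, not a copy of the exact source behind the Godbolt link):

#include <stddef.h>

// Reconstruction of a C version of the 3rd loop; the source on the
// Godbolt link may differ in details (types, restrict qualifiers, etc.).
void square_or_cube_c(float* __restrict__ a, const float* __restrict__ b, const float* __restrict__ c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float tmp = b[i];
        if (c[i] > 0.0f)
            a[i] = tmp * tmp;        // b(i)**2
        else
            a[i] = tmp * tmp * tmp;  // b(i)**3
    }
}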

I had to use -ffast-math with gcc for it to auto-vectorize. IDK why it doesn't realize it can safely auto-vectorize without breaking strict FP rules.

Clang seems to be evaluating tmp*tmp and tmp*tmp*tmp separately, and blending those two results instead of conditionally doing the 2nd multiply.

gcc does both multiplies and uses a separate movaps to merge the other way because it doesn't figure out how to invert the condition.

ICC uses KNOTW to invert the condition but then does the 2nd multiply with merge-masking exactly like I do.

Changing the code to do the extra multiply (**3 instead of **2) in the if branch instead of the else branch made all 3 compilers generate better code without each of their missed-optimizations from branching the other way. (There are still missed optimizations for gcc, but ICC and clang are looking solid, both essentially doing the same thing my hand-written code does.)

ICC chooses to only auto-vectorize this with 256b vectors. Maybe it does that by default to avoid lowering the max turbo clock speed? Maybe there's an option to use full-width vectors? gcc 8.0 snapshot also does that, but gcc7.2 uses ZMM vectors.


AVX-512 mask registers and merge-masking make it even more efficient, but doing both ways and then blending has been a thing with SIMD (or even non-SIMD branchless code) for a long time, e.g. to conditionally add based on a vector compare result, use that vector compare result as an AND mask to leave some elements untouched and make other elements zero.

0 is the additive identity: x + 0 = x. So x + (y&mask) is a no-op if the mask is all-zero, or it's x+y if the mask is all-one. See How to use if condition in intrinsics. (Fun trick: use a packed-compare result as an integer -1 or 0, so you can count matches by subtracting the compare-mask.)
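
For instance, a minimal SSE2 sketch of that conditional-add idiom (my example, with made-up names; n assumed a multiple of 4) might look like:

#include <emmintrin.h>
#include <stddef.h>

// x[i] += y[i] only where c[i] > 0, without branching: AND the addend with
// the compare mask so not-selected elements add 0.0 (the additive identity).
void cond_add_sse2(float* __restrict__ x, const float* __restrict__ y, const float* __restrict__ c, size_t n) {
    const __m128 zero = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4) {
        __m128 mask = _mm_cmpgt_ps(_mm_loadu_ps(&c[i]), zero); // all-ones where c[i] > 0
        __m128 add  = _mm_and_ps(_mm_loadu_ps(&y[i]), mask);   // y[i] or 0.0
        _mm_storeu_ps(&x[i], _mm_add_ps(_mm_loadu_ps(&x[i]), add));
    }
}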

It's less simple for multiply because 1 is the multiplicative identity, but you can solve that by blending.

assuming the compiler does not optimize this to two separate loops anyways, can it vectorize?

In that first case, you should be unhappy with your compiler if it doesn't hoist the condition out of the loop and make two loops. Especially in the 2nd case, where it only needs one loop, because if the condition is false the array isn't modified.


Let's just talk about the 3rd case, because it's the only one where the compiler shouldn't just hoist the condition. (And if your compiler is feeling dumb, it can use this version with a loop-invariant mask of all-zero or all-one for the other versions.)

if (c(i) > 0)

So we need to load a vector of elements from c and compare against zero. AVX512 can do this for a vector of 16 single-precision floats with one instruction, with a mask-register destination and a memory source operand.

; with zmm0 = 0.0 in all elements, from vxorps xmm0,xmm0,xmm0 outside the loop.
vcmpps    k1, zmm0, [rdx],  _CMP_NLT_UQ     ; !(0 < c(i))

I know (from writing the next part already) that I'm going to want k1 to be true for elements where the c(i) > 0 condition is false. Only the 2nd vector operand can be memory instead of a register, so I had to reverse it and use not-less-than instead of not-greater-than. (And I can't just use >= instead of <, because that would put the unordered case (one or both NaN) in the wrong category. FP compares have 4 possible results: above/below/equal/unordered, so you have to pick a predicate that does what you want (i.e. what the source says, if you're a compiler) for all 4 cases.) If you compile with -ffast-math, the compiler is allowed to ignore the possibility of NaN.

If you need to chain two conditions together, AVX512 compare-into-mask instructions can mask the operation of writing into the mask, with zero-masking or merge-masking.

vcmpltps    k1,        zmm1, zmm2       ; k1 = zmm1<zmm2
vcmpltps    k2{k1}{z}, zmm3, zmm4       ; k2 = (zmm3<zmm4) & (zmm1<zmm2)

k2 is 0 everywhere that k1 was zero, because we used k1 as a zero-mask.


  if (c(i) > 0) then
    a(i) = b(i) ** 2
  else
    a(i) = b(i) ** 3
  end if

The common subexpression here is b(i) * b(i). We can get b(i)**3 from that by multiplying by b(i) one extra time.

vmovups    zmm1, [rsi]       ; load a vector from b(i)
vmulps     zmm2, zmm1, zmm1  ; zmm2 = zmm1*zmm1 = b(i)**2

AVX-512 can merge based on a mask as part of (almost) any other instruction.

vmulps     zmm2{k1}, zmm2, zmm1  ; zmm2 *= zmm1   for elements where k1 is true

vmovups    [rdi], zmm2           ; store all 16 elements into a(i)

BTW, AVX512 has merge-masking for stores. Previous SIMD instruction sets would have to load from [rdi], blend, then store back into [rdi]. This means you can implement your 2nd loop (sometimes leave a(i) unmodified) with a per-element condition more efficiently than with AVX1 / AVX2.
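
As a sketch of that (mine, in intrinsics rather than asm, combining the 2nd loop's store pattern with the 3rd loop's per-element condition; n assumed a multiple of 16, names made up):

#include <immintrin.h>
#include <stddef.h>

// a[i] = b[i]**2 only where c[i] > 0; other elements of a[] are never written,
// thanks to the AVX-512 merge-masked store (no load+blend+store of a[] needed).
void square_if_positive(float* a, const float* __restrict__ b, const float* __restrict__ c, size_t n) {
    for (size_t i = 0; i < n; i += 16) {
        __m512 vb = _mm512_loadu_ps(&b[i]);
        __m512 vc = _mm512_loadu_ps(&c[i]);
        __mmask16 k = _mm512_cmp_ps_mask(vc, _mm512_setzero_ps(), _CMP_GT_OQ); // c(i) > 0
        _mm512_mask_storeu_ps(&a[i], k, _mm512_mul_ps(vb, vb));
    }
}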


Putting this all together: (NASM syntax)

 ; x86-64 System V calling convention
 ; args: rdi = a() output array.
 ;       rsi = b() input array
 ;       rdx = c() array to be tested for positive numbers
 ;       rcx = count (in elements)
 ; preferably all 64-byte aligned, but will work slowly if some aren't
 ; rcx must be >= 16, and a multiple of 16, because I didn't write any cleanup code

global square_or_cube
square_or_cube: 

    vxorps     xmm0,  xmm0,xmm0

 .loop:                          ; do {
    vcmpps     k1, zmm0, [rdx], 21    ; _CMP_NLT_UQ  ; !(0 < c(i))

    vmovups    zmm1, [rsi]            ; load a vector from b(i)
    vmulps     zmm2,     zmm1, zmm1   ; zmm2 = zmm1*zmm1 = b(i)**2

    vmulps     zmm2{k1}, zmm2, zmm1   ; zmm2 *= zmm1   for elements where k1 is true, otherwise unmodified.
    vmovups    [rdi], zmm2            ; store all 16 elements into a(i)

    ; TODO: unroll some and/or use indexed addressing mode tricks to save instructions
    add         rdi, 64      ; pointer increments
    add         rsi, 64
    add         rdx, 64

    sub         rcx, 16         ;  count -= 16 
    ja        .loop             ; } while(count>0);

    ret

I analyzed this with IACA (omitting the pointer-increment instructions to simulate unrolling and more clever asm tricks). According to IACA, even the merge-masking vmulps is a single uop, and the memory-source instructions micro-fuse into a single uop for the front-end. (So does the store.) This is what I was hoping, and IACA's output looks correct for this case, although I don't have access to performance counters on SKL-SP hardware to check that.

$ iaca.sh -arch SKX avx512-conditional
Intel(R) Architecture Code Analyzer Version - 2.3 build:246dfea (Thu, 6 Jul 2017 13:38:05 +0300)
Analyzed File - avx512-conditional
Binary Format - 64Bit
Architecture  - SKX
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.50 Cycles       Throughput Bottleneck: FrontEnd

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.5    0.0  | 0.0  | 1.0    1.0  | 1.0    1.0  | 1.0  | 1.5  | 1.0  | 1.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   2^   |           |     | 1.0   1.0 |           |     | 1.0 |     |     | CP | vcmpps k1, zmm0, zmmword ptr [rdx], 0x15
|   1    |           |     |           | 1.0   1.0 |     |     |     |     |    | vmovups zmm1, zmmword ptr [rsi]
|   1    | 1.0       |     |           |           |     |     |     |     | CP | vmulps zmm2, zmm1, zmm1
|   1    | 0.5       |     |           |           |     | 0.5 |     |     | CP | vmulps zmm2{k1}, zmm2, zmm1
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovups zmmword ptr [rdi], zmm2
|   1    |           |     |           |           |     |     | 1.0 |     |    | sub rcx, 0x10
|   0F   |           |     |           |           |     |     |     |     |    | jnbe 0xffffffffffffffdd
Total Num Of Uops: 8

AVX-512 actually has vfpclassps (C/C++ intrinsic _mm512_fpclass_ps_mask; the asm documentation has a table in the entry for the related vfpclasspd (packed double)) to classify FP values according to your choice of predicates. It may be slightly more efficient than using a full comparison against another register which happens to be zero.
(Actually, according to IACA, it isn't. Both are listed as 3 cycle latency by the InstLatx64 spreadsheet. Agner Fog's measurement for AVX2 cmpps on Skylake-S (non-AVX512 desktop chips) shows 4 cycles, so it's strange that the AVX512 version is lower latency when producing a mask-register result instead of a vector.)

I want the result to be false only for positive numbers, and I think vfpclassps can do that by setting almost all the predicate bits to get -Inf, finite negative, quiet and signalling NaN, -0.0, and +0.0.

vfpclassps    k1, [rdx], 0x1 | 0x2 | 0x4 | 0x10 | 0x40 | 0x80     ; QNaN | -0.0 | +0.0 | -Infinity | Negative (finite) | SNaN
; k1 = a 16-bit bitmap of which elements (from memory at [rdx]) need an extra multiply

vfpclassps is interesting because it lets you differentiate between +0.0 and -0.0, as you could by checking the sign bit in the binary representation (e.g. with AVX2 vblendps, using the sign bit as a blend-control without doing a comparison first).

Also, in this case, it saves one instruction outside the loop setting up a register of all-zeros.
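
In intrinsics, that test might look something like this (a sketch of mine; vfpclassps needs AVX512DQ, and the immediate is just the OR of the category bits from the asm comment above):

#include <immintrin.h>

// Mask of elements that need the extra multiply: everything that is NOT a
// positive number (QNaN | -0.0 | +0.0 | -Infinity | Negative (finite) | SNaN).
__mmask16 needs_cube_mask(const float* c) {
    return _mm512_fpclass_ps_mask(_mm512_loadu_ps(c),
                                  0x1 | 0x2 | 0x4 | 0x10 | 0x40 | 0x80);
}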


related: AVX512 has instructions to multiply by 2**floor(x) (vscalefpd), but not to raise a number to an arbitrary power (integer or otherwise). Xeon Phi has AVX512ER, which gives you fast approximations for 2**x (without flooring x), but we can't directly use an exponential function here either, and SKL-SP doesn't have AVX512ER anyway.


NASM macros for IACA_start / end:

I wrote these based on the iaca_marks.h C/C++ header.

%if 1
%macro  IACA_start 0
     mov ebx, 111
     db 0x64, 0x67, 0x90
%endmacro
%macro  IACA_end 0
     mov ebx, 222
     db 0x64, 0x67, 0x90
%endmacro
%else
%define IACA_start
%define IACA_end
%endif

Wrap them around any code you want to analyze.


Conditional branch on a loop-invariant condition inside the loop

A compiler could branch inside the loop. IDK if any would make code like this, but they certainly could.

; rdi = destination
; rsi = source
; edx = condition
; rcx = element count
global square_or_cube
square_or_cube: 

 .loop:                          ; do {
    vmovups    zmm1, [rsi]            ; load a vector from b(i)
    vmulps     zmm2, zmm1, zmm1   ; zmm2 = zmm1*zmm1 = b(i)**2

    test       edx,edx
    jz        .only_square        ; test-and-branch to conditionally skip the 2nd multiply
    vmulps     zmm2, zmm2, zmm1   ; zmm2 *= zmm1
   .only_square:

    vmovups    [rdi], zmm2        ; store all 16 elements into a(i)

    add         rdi, 64      ; pointer increments
    add         rsi, 64

    sub         rcx, 16         ;  count -= 16 
    ja        .loop             ; } while(count>0);

    ret
Targett answered 25/11, 2017 at 11:37 Comment(22)
BTW, sorry this answer isn't very Fortranish (e.g. C syntax in the asm comments). I barely know Fortran, so I think in asm and C. I'm here because of the [simd] and [avx512] tags, in case that's not obvious :PTargett
One might think the use of Fortran might be conducive to WHERE..ELSEWHERE but I don't expect this to add anything useful to Peter's excellent discussion. I might expect better results with unconditional assignment of b(i)**2 followed by conditional multiplication for the **3 case. As Peter said, the objective would be to generate blend instructions, which should be more efficient than masked stores (assuming the compiler covers the possibility of non-finite operands, which are ruled out by gfortran -ffast-math). The Intel directive !dir$ vector [always|aligned] also promotes vectorization of conditionals.Fransen
gfortran tends to prefer merge() intrinsic for vectorization, just as gcc prefers ? : (both assuming -ffast-math)Fransen
@tim18: With gcc, in this case I got the same code from (c[i] > 0) ? tmp2 : tmp2*tmp; as for the if(). In scalar integer code, you're more likely to get a branchless cmov with the ternary operator than with an if(), but auto-vectorization requires if-conversion. And re: -ffast-math, I'm not sure what gcc is worried about, because gcc and clang both vectorize without it. I don't think gcc is worried about trapping math, because you need a special option to tell the compiler that FP exceptions might be unmasked.Targett
One thing not covered above is that vectorization is quite a bit tougher when the different paths through the function access different parts of memory (e.g., different array elements or different arrays), like the OP's example 2 (in theory, but not in reality because of hoisting). I wrote an answer starting with the concept that on current compilers it effectively disabled vectorization because they weren't smart enough to get around it, but actually icc uses maskmov to get around this if you enable AVX, and that instruction is actually pretty efficient.Kirghiz
My comment above may not apply in Fortran, however. For example, arrays carry bounds in fortran, right? So the compiler could just check the bounds on the arrays before entering the loop and be assured that the entire loop is in-bounds in all arrays.Kirghiz
@Kirghiz Do you have any idea if masked memory accesses can have "false cache misses" if a disabled lane misses the cache or even page faults? This is something I've been wondering for a while, but I haven't gotten around to testing it. If a disabled lane does "access" the memory, then it could have severe performance consequences if misused. I imagine the answer could be different between generations - especially with AVX512 where it's a first-class citizen.Easel
@Mysticial: Intel's optimization manual has a few details for VMASKMOVPS which don't directly answer the question: If the mask is not all 1 or all 0, loads that depend on the masked store have to wait until the store data is written to the cache. If the mask is all 1 the data can be forwarded from the masked store to the dependent loads. If the mask is all 0 the loads do not depend on the masked store.Targett
And for Skylake: Loads that follow a masked store are no longer blocked until the mask value is known. Most of the discussion is about multi-hundred cycle assists if the instruction includes some illegal addresses but the fault is suppressed by the mask being 0 for those elements. Apparently a load with a [base+idx] addressing mode and an all-zero mask also triggers an assist! (even on Skylake). So yes, there are huge penalties for illegal addresses that would fault if not for the mask, but IDK about cache behaviour.Targett
@PeterCordes I'd imagine things could also be different for gather/scatters as well. The vast majority of masked gather/scatters will probably have invalid addresses on the disabled lanes unless it's a gather to some lookup table or something. So false-cache-missing on disabled lanes for gather/scatter sounds like a terrible idea from hardware design standpoint.Easel
@Mysticial: the difference is that each cache access in a gather/scatter is done separately anyway, so it's easier to squash them on a per-element basis, I assume. But vmaskmovps is contiguous and "wants" to run like vmovups with a single TLB check.Targett
@Easel - I am pretty sure the load form of masked moves will load all the cache lines (1 or 2) touched by the full-width load irrespective of the mask . I think the load form is simply implemented as a load + blend operation, with special case handling if a fault is detected to suppress the failure in the case the fault occurs in a page which is excluded by the mask. so yeah, I don't think you can use it to avoid false sharing for example: a differently-aligned mov is better for that.Kirghiz
Stores are more complicated since the mask has to be propagated to the store buffer, and then you have the complexity of forwarding from masked loads and have to consider memory model issues as well. That results in the behavior Peter mentions above: it seems current implementation basically special case the all-zeros store case (essentially as a no-op), the all-ones case (as essentially a plain store) and the mixed case. It also seems to want to bring in all lines regardless of the mask, so probably suffers "false sharing" except in the all-zeros case too.Kirghiz
Eek... If this is true, then this falls along the same line as prefetching. For certain loops, you have to find a way to suppress the tails or the prefetch faults will kill performance worse than just wasting bandwidth. (Either by peeling the last iteration or replacing the prefetch pointer with a "safe" memory block.) On that topic, if there were two improvements I'd like to see in prefetching, it would be 1) Block Prefetch (prefetch an entire range), 2) Predicated Prefetch /cc @KirghizEasel
@Easel - sure, but it depends on the use case. For the above kind of loop, you can be pretty sure that the b and c arrays have at least length n and in the case of a reasonably mixed flag array you are going to be fetching every cache line (semantically) anyways, at most you are spuriously fetching some extra lines from the array where the elements aren't going to be used. You don't have to worry about suppressing page-faults or anything like that. The scenario where you might really be hitting the fault case seem more like ones where you are using maskmov to ...Kirghiz
... safely access in an unaligned way near the edge of a memory area. In that case yeah you may pay a big penalty if the next page is unmapped, and a smaller penalty in other cases for unnecessarily bringing in the next line. A differently aligned load and shift might be better. BTW, do you know if there are big penalties for prefetching hitting an page that would fault? Are you talking software or hardware prefetch? I asked recently if software prefetch would take lines in the E or M state on another core, since it could lead to "hidden" false sharing even when the code did the right thing.Kirghiz
@Kirghiz That second case (edge of memory area) is precisely the scenario that I hit. And even when it isn't the edge, it still hurts since the application/environment is memory bound and you don't want to pull in unnecessary cache lines. (Intel's advice is to never prefetch when you're memory-bound. But I've found that it's possible to get large speedups anyway if you're careful enough.) Prefetching unmapped pages is as bad as a cache miss since (IIRC) it walks the TLB. Skylake has a special case that will NOP out nullptr prefetches. But that's actually not useful for streaming loops.Easel
I'm talking about software prefetching. I'm unsure about hardware prefetches. In the past, hardware prefetching never crossed page boundaries for that reason. But they do now, and I'm not sure how they resolve it or if they just take the TLB walk penalty. On the topic of the states, I've done some testing that seems to suggest that software prefetching will pull in dirty cachelines from another core. But I'm not sure what state the prefetched line goes into (S or E). I'd assume S since there's a separate prefetch for writes.Easel
Erg, I actually meant "hardware prefetches" when talking about the cache states. It seems reasonable that software prefetches would do it since there is an explicit directive that it is what is wanted, and these are used anyway usually in loops over large regions (or in pointer-chasing scenarios where it does the right thing). Having hardware prefetch grab more lines out from under other cores seems bad though: imagine 128-byte structures, carefully aligned to avoid false sharing, where different threads access different structures. If one thread hitting the two lines sequentially ...Kirghiz
... in "its" structure went ahead and hardware PFed the next structure, it could really suck, so I wondered if that case was specifically suppressed. About hardware prefetch and 4k boundaries, I don't think they just transparently prefetch across the boundaries. At least based on some patents and other stuff it seemed like maybe they just implemented a "fast start" mode where they still stop at the page boundary, but when user code issues the first read in the new page they enqueue many PF requests right away. Maybe it's better now and they fetch if it doesn't need a page walk.Kirghiz
@BeeOnRope: One of IvB's features was supposed to be next-page prefetching, including speculative TLB loading even if it needs a page-walk. I think this is part of why SKL has two page-walk units (although scatter/gather might be the other part). I don't think it's fully transparent, but are you sure what you were reading wasn't only for SnB and earlier?Targett
@PeterCordes - what I was reading was for the NPP introduced in IvB, but I never really found a satisfying answer on all the details (e.g., patent filings may not reflect the actual implementation). It is clear that even with the NPP, adding software prefetches only at the page boundaries still has a significant positive effect on throughput, so the hardware NPP is not as aggressive and/or still doesn't result in the page boundaries being treated "transparently" by the PF hardware. Two page-walk units have a significant benefit across a variety of code with high MLP.Kirghiz