Micro Optimization of a 4-bucket histogram of a large array or list

I have a special question. I will try to describe this as accurately as possible.

I am doing a very important "micro-optimization": a loop that runs for days at a time. If I can cut the time this loop takes in half, 10 days would decrease to only 5 days, etc.

The loop I have now is in the function "testbenchmark00".

I have 4 indexes that I need to increase in a loop like this. But accessing an index of a list or array takes some extra time, as I have noticed. This is what I am trying to see if there is another solution to.

indexes[n]++; //increase correct index

Complete code for "testbenchmark1" which takes 122 ms:

void testbenchmark00()
{
    Random random = new Random();
    List<int> indexers = new List<int>();
    for (int i = 0; i < 9256408; i++)
    {
        indexers.Add(random.Next(0, 4));
    }
    int[] valueLIST = indexers.ToArray();


    Stopwatch stopWatch = new Stopwatch();
    stopWatch.Start();

    int[] indexes = { 0, 0, 0, 0 };
    foreach (int n in valueLIST) //Takes 122 ms
    {
        indexes[n]++; //increase correct index
    }

    stopWatch.Stop();
    MessageBox.Show("stopWatch: " + stopWatch.ElapsedMilliseconds.ToString() + " milliseconds");
}

Now the below "testbenchmark2" code is just experimental and I know it is not correct, but I wonder if there is any similar way to use numbers like "1_00_00_00_00", and whether it would be possible to see the "00_00_00_00" as four different integer numbers. For example, if I did a summation of 1_00_00_00_00 + 1_00_01_00_00 = 2_00_01_00_00, then in the end one could extract each of the four numbers like this: 00, 01, 00, 00.

But I don't know if this is possible in any way, even using binary numbers. Yes, any kind of solution, just to add numbers like this. As a test, that loop took only 59 ms, which is half the time of 122 ms. So I am interested to see if there is any idea for this.

double num3 = 1_00_00_00_00;
double num4 = 1_00_01_00_00;
for (int i = 0; i < valueLIST.Count; i++) //Takes 59 ms
{
    num3 += num4;
}

Complete code for "testbenchmark2" which takes 59 ms:

void testbenchmark2()
{
    List<String> valueLIST = new List<String>(); 
    for (int i = 0; i < 9256408; i++) //56
    {
        valueLIST.Add(i.ToString());
    }

    //https://www.geeksforgeeks.org/binary-literals-and-digit-separators-in-c-sharp/
    double num3 = 1_00_00_00_00;
    double num4 = 1_00_01_00_00;

    Stopwatch stopWatch = new Stopwatch();
    stopWatch.Start();
    for (int i = 0; i < valueLIST.Count; i++) //Takes 59 ms
    {
        num3 += num4;
    }
    stopWatch.Stop();
    MessageBox.Show("stopWatch: " + stopWatch.ElapsedMilliseconds.ToString() + " milliseconds\n\n" + num3);
}

EDIT
The below is cleaner code showing exactly what I am trying to do!
The code below will probably not be correct, or the solution, but it shows what I am trying to do, I believe.

void newtest()
{
    double num1 = 1_00_00_00_00;
    double num2 = 1_00_01_00_00;
    double num3 = 1_00_01_01_00;

    List<double> testnumbers = new List<double>();
    testnumbers.Add(num1);
    testnumbers.Add(num2);
    testnumbers.Add(num3);

    double SUM = 0;
    for (int i = 0; i < testnumbers.Count; i++)
    {
        SUM += testnumbers[i];
    }

    //The result is
    //300020100

    //Would it be possible to extract the "four buckets" that I am interested in somehow?
    //00_02_01_00
}
Laurilaurianne answered 9/4, 2020 at 13:21 Comment(7)
So your first problem is a histogram into 4 buckets? You can unroll with multiple arrays of counts and combine at the end, reducing store-forwarding bottlenecks for incrementing the same counter multiple times. Or 4 buckets is so few you could consider just doing 4x SIMD compares for equality. You're running this on x86-64, right, so you can presumably use at least AVX2?Etui
@Peter Yes it is 4 buckets so to speak, I believe. It sounds interesting what you talk about, but I am not sure I have heard about those terms before. I am not sure how this could be done? I am running this on 64 bits. I believe I want to avoid increasing an index for a list or array as it seems to consume a lot of time?Laurilaurianne
No, iterating through an array can compile efficiently into asm. The slow part is the dependency chain through memory for incrementing a variable index of the array. Methods to vectorise histogram in SIMD?Etui
If you are going to run this for loop for days at a time, go with the first option. I did a benchmark of both loops running 100 times and the first function took 26.27 seconds while the second function took 155.16 seconds. The second function is significantly slower when run constantly and it's a massive resource hog (almost using a gigabyte of ram).Theriot
One reason for the second one being faster is that for loops are generally a lot faster than foreach loops.Fireproof
@Peter this is very interesting. I have never heard about SIMD before. I have read the link but I am still not sure how this would be coded in practice. I have EDITED my original post with code at the bottom for a cleaner picture of what I am trying to do.Laurilaurianne
In C++ you can use SIMD to count matching bytes in a string like this: How to count character occurrences using SIMD. Counting matching integers is easier; they're already wide enough that you can use counters that width without overflow unless lists are billions of elements long. There are a couple ways to use SIMD in C#, apparently. If this is a huge performance issue, consider hiring a freelance expert to implement this function with SIMD intrinsics for you. (Unfortunately not me; I don't have a C# development setup.)Etui

This should be possible at about 8 elements (1 AVX2 vector) per 2.5 clock cycles or so (per core) on a modern x86-64 like Skylake or Zen 2, using AVX2. Or per 2 clocks with unrolling. Or on your Piledriver CPU, maybe 1x 16-byte vector of indexes per 3 clocks with AVX1 _mm_cmpeq_epi32.

The general strategy works with 2 to 8 buckets. And for byte, 16-bit, or 32-bit elements. (So byte elements gives you 32 elements histogrammed per 2 clock cycles best case, with a bit of outer-loop overhead to collect byte counters before they overflow.)

Update: or mapping an int to 1UL << (array[i]*8) to increment one of 4 bytes of a counter with SIMD / SWAR addition, we can go close to 1 clock per vector of 8 int on SKL, or per 2 clocks on Zen2. (This is even more specific to 4 or fewer buckets, and int input, and doesn't scale down to SSE2. It needs variable-shifts or at least AVX1 variable-shuffles.) Using byte elements with the first strategy is probably still better in terms of elements per cycle.

As @JonasH points out, you could have different cores working on different parts of the input array. A single core can come close to saturating memory bandwidth on typical desktops, but many-core Xeons have lower per-core memory bandwidth and higher aggregate, and need more cores to saturate L3 or DRAM bandwidth. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?


A loop that runs for days at a time.

On a single input list that's very very slow to iterate, so it still doesn't overflow int counters? Or repeated calls with different large Lists (like your ~9M-element test array)?

I believe I want to avoid increasing an index for a list or array as it seems to consume a lot of time?

That's probably because you were benchmarking with optimization disabled. Don't do that, it's not meaningful at all; different code is slowed down different amounts by disabling optimization. More explicit steps and tmp vars can often make slower debug-mode code because there are more things that have to be there to look at with a debugger. But they can just optimize into a normal pointer-increment loop when you compile with normal optimization.

Iterating through an array can compile efficiently into asm.

The slow part is the dependency chain through memory for incrementing a variable index of the array. For example on a Skylake CPU, memory-destination add with the same address repeatedly bottlenecks at about one increment per 6 clock cycles because the next add has to wait to load the value stored by the previous one. (Store-forwarding from the store buffer means it doesn't have to wait for it to commit to cache first, but it's still much slower than add into a register.) See also Agner Fog's optimization guides: https://agner.org/optimize/

With counts only distributed across 4 buckets, you'll have a lot of cases where instructions are waiting to reload the data stored by another recent instruction, so you can't even achieve the nearly 1 element per clock cycle you might if counts were well distributed over more counters that still were all hot in L1d cache.

One good solution to this problem is unrolling the loop with multiple arrays of counters. Methods to vectorise histogram in SIMD?. Like instead of int[] indexes = { 0, 0, 0, 0 }; you can make it a 2D array of four counters each. You'd have to manually unroll the loop in the source to iterate over the input array, and handle the last 0..3 left over elements after the unrolled part.

This is a good technique for small to medium arrays of counts, but becomes bad if replicating the counters starts to lead to cache misses.
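For instance, a rough C# sketch of that unrolling idea (untested; assumes the int[] valueLIST of 0..3 values from the question):

int[] c0 = new int[4], c1 = new int[4], c2 = new int[4], c3 = new int[4];
int i = 0;
for (; i <= valueLIST.Length - 4; i += 4)
{
    c0[valueLIST[i + 0]]++;   // four independent count arrays, so repeated values
    c1[valueLIST[i + 1]]++;   // don't all hammer the same memory location
    c2[valueLIST[i + 2]]++;
    c3[valueLIST[i + 3]]++;
}
for (; i < valueLIST.Length; i++)   // handle the last 0..3 leftover elements
    c0[valueLIST[i]]++;

int[] indexes = new int[4];
for (int b = 0; b < 4; b++)
    indexes[b] = c0[b] + c1[b] + c2[b] + c3[b];   // combine the counters at the end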


Use narrow integers to save cache footprint / mem bandwidth.

Another thing you can/should do is use as narrow a type as possible for your arrays of 0..3 values: each number can fit in a byte so using 8-bit integers would save you a factor of 4 cache footprint / memory bandwidth.

x86 can efficiently load/store bytes to/from full registers. With SSE4.1 you also have SIMD pmovzxbd to make it more efficient to auto-vectorize when you have a byte_array[i] used with an int_array[i] in a loop.

(When I say x86 I mean including x86-64, as opposed to ARM or PowerPC. Of course you don't actually want to compile 32-bit code, what Microsoft calls "x86")
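As a trivial C# sketch of that narrowing (assuming the values really are 0..3, and reusing the valueLIST name from the question):

byte[] valueBytes = new byte[valueLIST.Length];
for (int i = 0; i < valueLIST.Length; i++)
    valueBytes[i] = (byte)valueLIST[i];   // 1 byte per element instead of 4
// ...then build the histogram from valueBytes instead of valueLIST

(Better still, generate the data as bytes in the first place and skip the conversion pass.)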


With very small numbers of buckets, like 4

This looks like a job for SIMD compares. With x86 SSE2 the number of int elements per 16-byte vector of data is equal to your number of histogram bins.

You already had a SIMD sort of idea with trying to treat a number as four separate byte elements. See https://en.wikipedia.org/wiki/SIMD#Software

But digit-group notation like 1_00_01_00_00 is just source-level syntax for human-readable separators in numbers, and double is a floating-point type whose internal representation isn't the same as for integers. And you definitely don't want to use strings; SIMD lets you do stuff like operating on 4 elements of an integer array at once.

The best way I can see to approach this is to separately count matches for each of the 4 values, rather than map elements to counters. We want to process multiple elements in parallel but mapping them to counters can have collisions when there are repeated values in one vector of elements. You'd need to increment that counter twice.

The scalar equivalent of this is:

int counts[4] = {0,0,0,0};
for (size_t i = 0 ; i < size ; i++) {
    counts[0] += (arr[i] == 0);
    counts[1] += (arr[i] == 1);
    counts[2] += (arr[i] == 2);  // count matches
  //counts[3] += (arr[i] == 3);  // we assume any that aren't 0..2 are this
}
counts[3] = size - counts[0] - counts[1] - counts[2];
// calculate count 3 from other counts

which (in C++) GCC -O3 will actually auto-vectorize exactly the way I did manually below: https://godbolt.org/z/UJfzuH. Clang even unrolls it when auto-vectorizing, so it should be better than my hand-vectorized version for int inputs. Still not as good as the alternate vpermilps strategy for that case, though.

(And you do still need to manually vectorize if you want byte elements with efficient narrow sums, only widening in an outer loop.)
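For reference, the same match-counting idea written as a C# sketch (untested; arr is assumed to be the int[] of 0..3 values):

int c0 = 0, c1 = 0, c2 = 0;
for (int i = 0; i < arr.Length; i++)
{
    c0 += arr[i] == 0 ? 1 : 0;   // count matches for each value separately
    c1 += arr[i] == 1 ? 1 : 0;   // (the JIT may or may not turn this into branchless/SIMD code)
    c2 += arr[i] == 2 ? 1 : 0;
}
int c3 = arr.Length - c0 - c1 - c2;   // anything that isn't 0..2 is assumed to be 3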


With byte elements, see How to count character occurrences using SIMD. The element size is too narrow for a counter; it would overflow after 256 counts. So you have to widen either in the inner loop, or use nested loops to do some accumulating before widening.

I don't know C#, so I could write the code in x86 assembly or in C++ with intrinsics. Perhaps C++ intrinsics is more useful for you. C# has some kind of vector extensions that should make it possible to port this.

This is C++ for x86-64, using AVX2 SIMD intrinsics. See https://stackoverflow.com/tags/sse/info for some info.

// Manually vectorized for AVX2, for int element size
// Going nearly 4x as fast should be possible for byte element size

#include <immintrin.h>
#include <vector>

void count_elements_avx2(const std::vector<int> &input,  unsigned output_counts[4])
{
    __m256i  counts[4] = { _mm256_setzero_si256() };  // 4 vectors of zeroed counters
                  // each vector holds counts for one bucket, to be hsummed at the end

    size_t size = input.size();
    for(size_t i = 0 ; i<size ; i+=8) {  // 8x 32-bit elements per vector
        __m256i v = _mm256_loadu_si256((const __m256i*)&input[i]);  // unaligned load of 8 ints
        for (int val = 0 ; val < 3; val++) {
           // C++ compilers will unroll this with 3 vector constants and no memory access
            __m256i match = _mm256_cmpeq_epi32(v, _mm256_set1_epi32(val));  // 0 or all-ones aka -1
            counts[val] = _mm256_sub_epi32(counts[val], match);   // x -= -1 or 0 conditional increment
        }
    }


    // transpose and sum 4 vectors of 8 elements down to 1 vector of 4 elements
    __m128i summed_counts = hsum_xpose(counts);   // helper function defined in Godbolt link
    _mm_storeu_si128((__m128i*)output_counts, summed_counts);

    output_counts[3] = size - output_counts[0]
                       - output_counts[1] - output_counts[2];

    // TODO: handle the last size%8 input elements; scalar would be easy
}

This compiles nicely with clang (on the Godbolt compiler explorer). Presumably you can write C# that compiles to similar machine code. If not, consider calling native code from a C++ compiler (or hand-written in asm if you can't get truly optimal code from the compiler). If your real use-case runs as many iterations as your benchmark, that could amortize the extra overhead if the input array doesn't have to get copied.

 # from an earlier version of the C++, doing all 4 compares in the inner loop
 # clang -O3 -march=skylake
.LBB0_2:                                     # do {
    vmovdqu ymm7, ymmword ptr [rcx + 4*rdx]    # v = load arr[i + 0..7]
    vpcmpeqd        ymm8, ymm7, ymm3           # compare v == 0
    vpsubd  ymm4, ymm4, ymm8                   # total0 -= cmp_result
    vpcmpeqd        ymm8, ymm7, ymm5
    vpsubd  ymm2, ymm2, ymm8
    vpcmpeqd        ymm7, ymm7, ymm6           # compare v == 2
    vpsubd  ymm1, ymm1, ymm7                   # total2 -= cmp_result
    add     rdx, 8                             # i += 8
    cmp     rdx, rax
    jb      .LBB0_2                          # }while(i < size)

Estimated best-case Skylake performance: ~2.5 cycles per vector (8 int or 32 int8_t)

Or 2 with unrolling.

Without AVX2, using only SSE2, you'd have some extra movdqa instructions and only be doing 4 elements per vector. This would still be a win vs. scalar histogram in memory, though. Even 1 element / clock is nice, and should be doable with SSE2 that can run on any x86-64 CPU.

Assuming no cache misses of course, with hardware prefetch into L1d staying ahead of the loop. This might only happen with the data already hot in L2 cache at least. I'm also assuming no stalls from memory alignment; ideally your data is aligned by 32 bytes. If it usually isn't, possibly worth processing the first unaligned part and then using aligned loads, if the array is large enough.

For byte elements, the inner-most loop will look similar (with vpcmpeqb and vpsubb), but run only at most 255 (not 256) iterations before hsumming to 64-bit counters, to avoid overflow. So throughput per vector will be the same, but with 4x as many elements per vector.

See https://agner.org/optimize/ and https://uops.info/ for performance analysis details. e.g. vpcmpeqd on uops.info

The inner loop is only 9 fused-domain uops for Haswell/Skylake, so best case front-end bottleneck of about 1 iteration per 2.25 cycles (the pipeline is 4 uops wide). Small-loop effects get in the way somewhat: Is performance reduced when executing loops whose uop count is not a multiple of processor width? - Skylake has its loop buffer disabled by a microcode update for an erratum, but even before that a 9 uop loop ended up issuing slightly worse than one iter per 2.25 cycles on average, let's say 2.5 cycles.

Skylake runs vpsubd on ports 0,1, or 5, and runs vpcmpeqd on ports 0 or 1. So the back-end bottleneck on ports 0,1,5 is 6 vector ALU uops for 3 ports, or 1 iteration per 2 cycles. So the front-end bottleneck dominates. (Ice Lake's wider front-end may let it bottleneck on the back-end even without unrolling; same back-end throughputs there unless you use AVX512...)

If clang had indexed from the end of the array and counted the index up towards zero (since it chose to use an indexed addressing mode anyway) it could have saved a uop for a total of 8 uops = one iter per 2 cycles in the front-end, matching the back-end bottleneck. (Either way, scalar add and macro-fused cmp/jcc, or add/jcc loop branch can run on port 6, and the load doesn't compete for ALU ports.) Uop replays of ALU uops dependent on the load shouldn't be a problem even on cache misses, if ALU uops are the bottleneck there will normally be plenty of older uops just waiting for an execution unit to be ready, not waiting for load data.

Unrolling by 2 would have the same benefit: amortizing that 2 uops of loop overhead. So 16 uops for 2 input vectors. That's a nice multiple of the pipeline width on SKL and IceLake, and of the single-uop pipeline width on Zen. Unrolling even more could let the front-end stay ahead of execution, but even so, any back-end delays will let the front-end build up a cushion of uops in the scheduler. This will let it execute loads early enough.

Zen2 has a wider front-end (6 uops or 5 instructions wide, IIUC). None of these instructions are multi-uop because Zen2 widened the vector ALUs to 256-bit, so that's 5 single-uop instructions. vpcmpeq* runs on FP 0,1, or 3, same as vpsubd, so the back-end bottleneck is the same as on Skylake: 1 vector per 2 cycles. But the wider front-end removes that bottleneck, leaving the critical path being the back-end even without unrolling.

Zen1 takes 2 uops per 256-bit vector operation (or more for lane-crossing shuffles, but these are simple 2-uop instructions). So presumably 12/3 = 4 cycles per vector of 8 or 32 elements, assuming it can get those uops through the front-end efficiently.

I'm assuming that the 1-cycle latency dependency chains through the count vectors are scheduled well by the back-ends and don't result in many wasted cycles. Probably not a big deal, especially if you have any memory bottlenecks in real life. (On Piledriver, SIMD-integer operations have 2 cycle latency, but 6 ALU uops for 2 vector ALU ports that can run them is 1 vector (128-bit) per 3 cycles so even without unrolling there's enough work to hide that latency.)

I didn't analyze the horizontal-sum part of this. It's outside the loop so it only has to run once per call. You did tag this micro-optimization, but we probably don't need to worry about that part.


Other numbers of buckets

The base case of this strategy is 2 buckets: count matches for one thing, count_other = size - count.

We know that every element is one of these 4 possibilities, so we can assume that any x that isn't 0, 1, or 2 is a 3 without checking. This means we don't have to count matches for 3 at all, and can get the count for that bucket from size - sum(counts[0..2]).

(See the edit history for the above perf analysis before doing this optimization. I changed the numbers after doing this optimization and updating the Godbolt link; hopefully I didn't miss anything.)


AVX512 on Skylake-Xeon

For 64-byte vectors there is no vpcmpeqd to make a vector of all-zero (0) or all-one (-1) elements. Instead you'd compare into a mask register and use that to do a merge-masked add of set1(1), like c = _mm512_mask_add_epi32(c, k, c, _mm512_set1_epi32(1)) where k is the compare-result mask.

Unfortunately it's not efficient to do a scalar popcount of the compare-result bitmasks.


Random code review: in your first benchmark:

int[] valueLIST = indexers.ToArray();

This seems pointless; according to MS's docs (https://learn.microsoft.com/en-us/dotnet/standard/collections/), a List is efficiently indexable. I think it's equivalent to C++ std::vector<T>. You can just iterate it without copying to an array.
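In other words, the first benchmark could just do (sketch, using the existing List<int> indexers):

int[] indexes = new int[4];
foreach (int n in indexers)   // iterate the List<int> directly, no ToArray() copy
    indexes[n]++;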


Alt strategy - map 0..3 to set a bit in one byte of an int

Good if you can't narrow your elements to bytes for the input to save mem bandwidth.

But speaking of which, maybe worth it to use 2x _mm256_packs_epi32 (vpackssdw) and _mm256_packs_epi16 (vpacksswb) to narrow down to 8-bit integers before counting with 3x pcmpeqb / psubb. That costs 3 uops per 4 input vectors to pack down to 1 with byte elements.

But if your input has int elements to start with, this may be best instead of packing and then comparing 3 ways.

You have 4 buckets, and an int has 4 bytes. If we can transform each int element to a 1 at the bottom of the appropriate byte, that would let us add with _mm256_add_epi8 for up to 255 inner-loop iterations before widening to 64-bit counters. (With the standard _mm256_sad_epu8 against zero trick to hsum unsigned bytes without overflow.)

There are 2 ways to do this. The first: use a shuffle as a lookup table. AVX2 vpermd works (_mm256_permutevar8x32_epi32) using the data as the index vector and a constant _mm256_set_epi32(0,0,0,0, 1UL<<24, 1UL<<16, 1UL<<8, 1UL<<0) as the data being shuffled. Or type-pun the vector to use AVX1 vpermilps as a LUT with the LUT vector having those bytes in the upper half as well.

vpermilps is better: it's fewer uops on AMD Zen 1, and lower latency everywhere because it's in-lane. (Might cause a bypass delay on some CPUs, cutting into the latency benefit, but still not worse than vpermd).

For some reason vpermilps with a vector control has 2 cycle throughput on Zen2 even though it's still a single uop. Or 4 cycles on Zen1 (for the 2 uop YMM version). It's 1 cycle on Intel. vpermd is even worse on AMD: more uops and same poor throughput.

vpermilps xmm (16-byte vector) on Piledriver has 1/clock throughput according to Agner Fog's testing, and runs in the "ivec" domain. (So it actually has extra bypass delay latency when used on the "intended" floating point operands, but not on integer).

   // Or for Piledriver, __m128 version of this

    __m256 bytepatterns = _mm256_castsi256_ps(_mm256_set_epi32(
         1<<24, 1<<16, 1<<8, 1<<0,
         1<<24, 1<<16, 1<<8, 1<<0) );
    __m256i v = _mm256_loadu_si256((const __m256i*)&input[i]);
    v = _mm256_castps_si256(_mm256_permutevar_ps(bytepatterns, v));  // vpermilps 32-bit variable shuffle
    counts = _mm256_add_epi8(counts, v);

     // after some inner iterations, separate out the 
     // set1_epi32(0x000000ff) counts, 0x0000ff00 counts, etc.

This will produce interleaved counters inside each int element. They will overflow if you don't accumulate them before 256 counts. See How to count character occurrences using SIMD for a simple version of that with a single counter.

Here we might unroll and use 2 different LUT vectors so when we want to group all the counts for 0 together, we could blend 2 vectors together and mask away the others.


Alternatively to shuffling, we can do this with AVX2 variable shifts.

sums += 1UL << (array[i]*8); where the *8 is the number of bits in a byte, also done with a shift. I wrote it as a scalar C++ expression because now's your chance to see how your bytes-in-an-integer idea can really work. As long as we don't let an individual byte overflow, it doesn't matter if SIMD bytes adds block carry between bytes or if we use 32-bit dword elements.
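Here is a plain C# sketch of exactly that scalar expression, just to show the packed-counters idea working (untested; arr is assumed to hold values 0..3, and the inner run is capped at 255 so no byte can overflow):

long[] totals = new long[4];
int i = 0;
while (i < arr.Length)
{
    ulong packed = 0;                          // four byte-wide counters packed into one integer
    int end = Math.Min(i + 255, arr.Length);   // at most 255 increments per byte
    for (; i < end; i++)
        packed += 1UL << (arr[i] * 8);         // add 1 into the byte selected by the value
    for (int b = 0; b < 4; b++)
        totals[b] += (long)((packed >> (b * 8)) & 0xFF);   // unpack before anything overflows
}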

We'd do this with AVX2 as:

__m256i v = _mm256_loadu_si256((const __m256i*)&input[i]);
v = _mm256_slli_epi32(v, 3);  // v *= 8
v = _mm256_sllv_epi32(_mm256_set1_epi32(1), v);
counts = _mm256_add_epi8(counts, v);

This is 2 shift instructions plus the vpaddb. On Skylake the variable-count shift vpsllvd is cheap: single-uop and runs on multiple ports. But on Haswell and Zen it's slower. (Same throughput as vpermilps on AMD.)

And 2 uops for 2 ports still doesn't beat 1 uop for 1 port for the shuffle version. (Unless you use both strategies alternating to distribute the work over all ALU ports on SKL.)

So either way the inner-most loop can go 1 vector per clock or maybe slightly better with careful interleaving of shift vs. shuffle methods.

But it will require some small amount of overhead amortized over 128 or 255 inner loop iterations.

That cleanup at the end might blend 2 vectors together to get a vector with counts for just 2 buckets, then vpshufb (_mm256_shuffle_epi8) to group byte counters for the same bucket into the same qwords. Then vpsadbw (_mm256_sad_epu8) against zero can horizontal sum those byte elements within each qword for _mm256_add_epi64. So outer-loop work should be 2 vpblendw, 2x vpshufb, 2x vpsadbw, 2x vpaddq then back into another 255 iterations of the inner loop. Probably also checking if you're within 255 iterations of the end of the array to set the loop bound for the inner iteration.

Etui answered 9/4, 2020 at 18:41 Comment(1)
Comments are not for extended discussion; this conversation has been moved to chat.Cappello

As mentioned by Peter Cordes, you could use SIMD to add multiple values together at a time; see Vector<T>. But it is not clear to me if this would actually help.
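A rough sketch of a Vector<T> version (untested; only worthwhile if Vector.IsHardwareAccelerated is true):

using System.Numerics;

static int[] CountWithVectorT(int[] values)
{
    var counts = new Vector<int>[4];          // one vector of per-lane counters per bucket
    int width = Vector<int>.Count;
    int i = 0;
    for (; i <= values.Length - width; i += width)
    {
        var v = new Vector<int>(values, i);
        for (int val = 0; val < 4; val++)
            counts[val] -= Vector.Equals(v, new Vector<int>(val));  // Equals gives -1 per matching lane
    }
    var result = new int[4];
    for (int val = 0; val < 4; val++)
        for (int lane = 0; lane < width; lane++)
            result[val] += counts[val][lane];  // horizontal sum of each counter vector
    for (; i < values.Length; i++)             // scalar tail
        result[values[i]]++;
    return result;
}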

Edit: If you are running .NET Core there are also SIMD intrinsics that provide lower-level access to the hardware.

As mentioned by NerualHandle it might be better to use a for-loop rather than a foreach. But when I test it there does not seem to be a significant difference. I would guess the compiler can optimize foreach in this particular case.

When I run your testbenchmark00 code it completes in ~6 ms on my computer. Some rough calculations suggest each iteration of the loop takes about 0.78 ns, or about 2-4 processor cycles; this seems to be near optimal. It seems odd that it takes ~20 times longer for you. Are you running in release mode?

You could parallelize the problem. Split the indexers array into multiple parts, build the histogram for each part on different threads, and sum the histograms for each thread at the end. See Parallel.For since this can do the partitioning etc. for you, but it requires the use of localInit and localFinally to ensure each thread writes to separate histograms to avoid concurrency issues.
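Something along these lines (untested sketch):

using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

static int[] ParallelHistogram(int[] values)
{
    var totals = new int[4];
    Parallel.ForEach(
        Partitioner.Create(0, values.Length),   // split the index range into chunks
        () => new int[4],                       // localInit: one histogram per thread
        (range, state, local) =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
                local[values[i]]++;
            return local;
        },
        local =>                                // localFinally: merge into the shared result
        {
            for (int b = 0; b < 4; b++)
                Interlocked.Add(ref totals[b], local[b]);
        });
    return totals;
}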

As always with performance optimization, the recommended order to do things is:

  1. Profile code to identify problem areas
  2. Look for algorithmic improvements
  3. Look for ways to do less work, like caching
  4. Do existing work faster
Gaudette answered 9/4, 2020 at 14:0 Comment(9)
With AVX2 and 32-bit integers, you can vpcmpeqd / vpsubd against 4 different vector constants with 8 vector ALU instructions for one vector of 8 elements. (Plus loop overhead and a load). I'd expect better than 1 clock per element on Haswell/Skylake or Zen2. That should easily come out ahead of load + memory-destination add for each scalar int even if you unroll with multiple count arrays to hide the store/reload latency. Or if the data can be packed into 8-bit integers, that reduces memory bandwidth cost by a factor of 4, and also has 4x the elements per SIMD vector.Etui
Yes I did run debug mode. Release mode took 31 ms actually. This vector/SIMD and Parallel.For seems really interesting. I am trying now to read up and see if I can understand this. More or less, I could have a list of strings like: 0,0,0,0 where I want to do a summation of those 4 "buckets" in a loop.Laurilaurianne
It seems quite complicated to understand how to code this. It is very new to me. I am not sure if it would be possible to see a code example of how to achieve this for my problem?Laurilaurianne
@Andreas: benchmarking in debug mode is useless; different code is slowed down significantly different amounts by disabling optimization so it doesn't tell you much of anything about what will be fast in normal optimized code. But yes, SIMD is not simple and for best results requires thinking about the CPU more in terms of steps that the hardware can do efficiently on a 16-byte or 32-byte block of integers, not in terms of high level language constructs like range-for loops. Basically think in assembly language and implement in C++ or C#. And no you don't want strings, int vec[4] is closer!Etui
@Peter that is great to know. I will then not use debug mode for benchmarking. That was good to know. Yes, SIMD seems like a new way of thinking but I have to understand it somehow. That was good to know about int vec[4]. Yes, as before, strings are normally slower too.Laurilaurianne
@Andreas: The stuff in the question about treating an integer as 4 separate bytes is already a SIMD idea; you just got lost somewhere along the way. See en.wikipedia.org/wiki/SIMD#Software. Or some C# SIMD tutorial; I assume there are some. The thing that makes SIMD useful here is that your histogram has as few buckets as there are elements in one SIMD vector. Larger histograms wouldn't work this way.Etui
@Peter that is great. I will start looking at SIMD tutorials as you mention to see if I can start understanding anything, to try out some code and eventually reach a goal with this loop. That was interesting, so I was on the right track without even knowing it, perhaps with the 4 separate bytes. SIMD is truly interesting :)Laurilaurianne
@JonasH: AVX2 should in theory give you 8 elements per 2.75 clock cycles on Skylake including loop overhead, more than twice as fast as the theoretical max scalar (1 element per clock, sustaining 2 loads and 1 store per clock if a memory-destination add qword [rdi + rax*8], 1 can stay micro-fused.) I was curious so I wrote an answer with a C++ version so I could see if compilers did a good job with the asm.Etui
@Andreas: Updated my answer; had another idea that should run 8 elements per ~1 cycle in an inner-most loop with AVX2 on Haswell/Skylake, hopefully sustaining close to that even though it needs to leave the inner loop every 255 iterations to accumulate to wider counters.Etui

This is an untested C# version of @PeterCordes' answer.

private static Vector128<int> HsumTranspose( ReadOnlySpan<Vector256<int>> counts )
{
    var sum01 = Avx2.HorizontalAdd( counts[ 0 ], counts[ 1 ] );
    var sum23 = Avx2.HorizontalAdd( counts[ 2 ], counts[ 3 ] );
    var sum0123 = Avx2.HorizontalAdd( sum01, sum23 );

    var sumHigh = Avx2.ExtractVector128( sum0123, 1 );
    var sumLow = Avx2.ExtractVector128( sum0123, 0 );
    return Sse2.Add( sumHigh, sumLow );
}


private unsafe static int[ ] CountElements( ReadOnlySpan<int> input )
{
    var outputCounts = new int[ 4 ];
    // Four vectors of zeroed counters each vector holds
    // counts for one bucket, to be hsummed at the end.
    Span<Vector256<int>> counts = stackalloc Vector256<int>[ 4 ]
    {
        Vector256<int>.Zero,
        Vector256<int>.Zero,
        Vector256<int>.Zero,
        Vector256<int>.Zero
    };

    unsafe
    {
        fixed ( int* fixedInput = input )
        {
            var size = input.Length;
            for ( var i = 0; i < size; i += 8 )
            {
                var v = Avx.LoadVector256( &fixedInput[ i ] );
                for ( var val = 0; val < 3; val++ )
                {
                    var match = Avx2.CompareEqual( v, Vector256.Create( val ) );
                    counts[ val ] = Avx2.Subtract( counts[ val ], match );
                }
             }

             Vector128<int> summedCounts = HsumTranspose( counts );

             fixed ( int* fixedOutputCounts = outputCounts )
                 Sse2.Store( fixedOutputCounts, summedCounts );

             outputCounts[ 3 ] = size - outputCounts[ 0 ] -
                 outputCounts[ 1 ] - outputCounts[ 2 ];

             // TODO: handle the last size%8 input elements; scalar would be easy
        }
    }

    return outputCounts;
}
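For example, calling it could look like this (a sketch; assumes Avx2.IsSupported has been checked and the input length is a multiple of 8, since the remainder handling is still a TODO):

int[] data = { 0, 1, 2, 3, 1, 1, 0, 2 };
int[] counts = CountElements(data);            // int[] converts implicitly to ReadOnlySpan<int>
Console.WriteLine(string.Join(", ", counts));  // prints: 2, 3, 2, 1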
Coquetry answered 11/4, 2020 at 2:22 Comment(1)
Comments are not for extended discussion; this conversation has been moved to chat.Cappello

I have tried to rewrite the code for Vector128<byte> and have come up with this code.

I have first created indexesToSumFirst, which is the number of elements to handle up front so that the remainder is a multiple of 16 and can be consumed exactly by the following loops.

I have created 3 loops, where an inner loop processes 16 bytes per iteration and runs few enough times (240 in the code below) that a byte counter cannot overflow. Then the "outerloop" has an exact count, calculated beforehand, to maintain this.

Outside those 3 loops, the rest, which is fewer than 240*16 elements, is summed up in its own scalar loop (that is the part handled first in the code).

When I ran a benchmark between normalCalculation and CountElements, the CountElements SIMD approach is about 7.2 times faster.

    void calc()
    { 
        //Create 16 indexes with numbers between: 0-3. The goal is to count how many of those occurences we have for the numbers: 0-3
        int times = 6250;
        int bytes = times * 16;
        byte[] v1 = new byte[bytes];
        for (int i = 0; i < times; i++)
        {
            v1[0 + (i * 16)] = 0;
            v1[1 + (i * 16)] = 1;
            v1[2 + (i * 16)] = 2;
            v1[3 + (i * 16)] = 3;

            v1[4 + (i * 16)] = 1;
            v1[5 + (i * 16)] = 1;
            v1[6 + (i * 16)] = 1;
            v1[7 + (i * 16)] = 1;

            v1[8 + (i * 16)] = 1;
            v1[9 + (i * 16)] = 0;
            v1[10 + (i * 16)] = 0;
            v1[11 + (i * 16)] = 3;

            v1[12 + (i * 16)] = 1;
            v1[13 + (i * 16)] = 1;
            v1[14 + (i * 16)] = 1;
            v1[15 + (i * 16)] = 3;
        }
        /*---------------*/

        ReadOnlySpan<byte> input = v1;

        //Call function
        //normalCalculation(input);
        CountElements(input);
    }

    void normalCalculation(ReadOnlySpan<byte> inputArray)
    {
        int[] countArray0 = new int[4];
        for (int i = 0; i < inputArray.Length; i++)
        {
            countArray0[inputArray[i]]++;
        }

    }
    private unsafe static int[] CountElements(ReadOnlySpan<byte> inputArray)
    {

        //100000 indexes (This SIMD code goes 7.2 times faster than normal C# code)
        double[] countArray = new double[4];
        double arraylength = inputArray.Length; int loops = Convert.ToInt32(arraylength);
        double loopcount = arraylength / 3840; //100000 / (240 * 16) = 26.04
        double indexesToSumFirst = loopcount - Math.Floor(loopcount); //26.04 - 26 = 0.04
        indexesToSumFirst = indexesToSumFirst * 3840; //Num of indexes to be SUMMED first
        loopcount = arraylength - indexesToSumFirst; //100000 - 153.6 = 99846.4
        int outerloop = Convert.ToInt32(loopcount / 3840); //24

        //Sum the first indexes first. So the loops after those are exactly counts of: x16
        int index = Convert.ToInt32(indexesToSumFirst);
        if (index > 0)
        {
            for (int t = 0; t < index; t++)
            {
                countArray[inputArray[t]]++;
            }
        }

        //Below starts the SIMD calculations!
        Span<Vector128<byte>> counts = stackalloc Vector128<byte>[3];
        Span<Vector128<UInt64>> sum64 = stackalloc Vector128<UInt64>[3];
        unsafe
        {
            fixed (byte* fixedInput = inputArray)
            {
                for (int i = 0; i < outerloop; i++)
                {
                    counts.Clear();
                    for (int i2 = 0; i2 < 240; i2++)
                    {
                        var v = Avx.LoadVector128(&fixedInput[index]);
                        for (byte val = 0; val < 3; val++)
                        {
                            var match = Avx.CompareEqual(v, Vector128.Create(val)); //[1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0] == [1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0]
                            counts[val] = Avx.Subtract(counts[val], match);
                        }
                        index += 16;
                    }
                    //Here sum
                    for (int i3 = 0; i3 < 3; i3++)
                    {
                        //SumAbsoluteDifferences
                        sum64[i3] = Sse2.Add(sum64[i3], Sse2.SumAbsoluteDifferences(counts[i3], Vector128<byte>.Zero).AsUInt64()); //sum64: <2,0,0,0,3,0,0,0>
                    }
                }

                //UnpackHigh and get the lower element from the Vector128<UInt64>
                if (outerloop > 0)
                {
                    for (int i3 = 0; i3 < 3; i3++)
                    {
                        Vector128<UInt64> upper = Sse2.UnpackHigh(sum64[i3], sum64[i3]).AsUInt64(); //3
                        countArray[i3] += Sse2.Add(sum64[i3], upper).ToScalar();
                    }
                }
                //Calculate the last index
                countArray[3] = loops - countArray[0] - countArray[1] - countArray[2];
            }
        }

        var outputCounts = new int[4];
        for (int i = 0; i < 4; i++)
            outputCounts[i] = (int)countArray[i]; //Copy the per-bucket totals into the result
        return outputCounts;
    }
Laurilaurianne answered 11/4, 2020 at 23:37 Comment(24)
Comments are not for extended discussion; this conversation has been moved to chat.Depart
@Peter, I created an edit of the last post with a C# solution with the loops. It is perhaps not entirely the same as in your example but perhaps it is a valid enough solution. I benchmarked it against the fastest "normal" approach in C# and it was 8 times faster. I am not sure if there is something that could make it even faster but it is very fast anyway.Laurilaurianne
Your index array is probably going to be aligned by 16 most of the time, so it would be best to do the N%16 cleanup at the end, not the start, letting the main loop load aligned vectors. Also, that would fix the bug of overwriting the scalar counts with the vector counts, in countArray[i3] = .... If you increment the counts with the cleanup loop, it can go at the very end. 8x faster sounds pretty reasonable. Did it not help to use a larger inner loop like i2 < 255? Or was 16 inner iterations already fine to bottleneck on memory bandwidth?Etui
An even more efficient way to handle N%16 != 0 is to pad your indexes so you can always safely load 16 byte vectors. Separately keep track of the real N, and pad elements beyond it as 4 or 5 or -1, or anything that won't match what your loop is looking for. Pad out to the next multiple of 16 bytes. But anyway, other than what looks like a bug in your cleanup handling, this looks correct. I didn't check every line carefully or try to verify the loop bounds, but the right vector intrinsics are in the right places.Etui
Wait a minute, you're benchmarking it against normalCalculation that uses double[] countArray0 inside its inner loop!!! Floating-point increment has another 4 cycle latency instead of 1 for integer, on K10, on top of the store/reload bottleneck from having counters in memory. So that 8x speedup is against crippled competition. It might "only" be 4x or 5x against a version using int[] like in the question.Etui
I think because the number of elements will be a random number in the real use, I believe the array will only be aligned 1/16th of the time, which means that the first cleanup will happen 15 out of 16 times. Perhaps I could just use this code still. I never tested i2 < 255, but I am not sure; by iterating 16 times and having an increment of +=16, the maximum of a count in counts[val] can then only be 256, which was the idea there if I am thinking right. So I am not sure if I can do anything faster than that. I am not sure how the memory bandwidth works exactly, to be honest.Laurilaurianne
double has zero advantage for this use-case, unless maybe you're making obsolete 32-bit builds and need more range than int32. The first non-representable double is 2^53+1 but Uint64 goes all the way up to 2^64-1.Etui
@Andreas: counts[val] is a vector of 8-bit elements with value-range 0..255. 256 would wrap to zero. But since each i2 iteration can only increment any element of counts[val] by 1, your way has a max val of 16. The whole point of Sse2.SumAbsoluteDifferences against zero is to sum those byte counters without overflow, so any value up to 255 is usable. Use a debugger to look at values if you aren't clear on how it works.Etui
Re: cleanup: oh, I looked more carefully. I assumed you were reading the first N%16 elements scalar because you do it first, but you are actually reading the last N%16 elements. So your main loop index is still a multiple of 16, so probably the actual memory addresses of your vector loads are also aligned by 16, if .Net's internal allocator aligns large allocations. A 16-byte load from a 16-byte aligned address can be more efficient, and is never split across a boundary between two 64-byte cache lines (usually ~10 cycle penalty), or across two 4k pages (huge penalty).Etui
Yes wow you are right, what a mistake it was: double[] and not an int[], which made it all a crippled competition as you said. I changed it now and put it all back for a benchmark and it is now 6.2 times faster on average when I tested it 10 times.Laurilaurianne
So the problem with loading the end of the array first is that it will cause an extra cache miss. After the end of the vectorized loop, that tail of the array will already be hot in cache because you just loaded a vector from right next to it. So do it then. And you can simply loop from index up to inputArray.Length. Don't use double for array index calculations! Use integer math. It makes no sense to use double when you're using int index, not even Uint64 or whatever C# uses for a size_t equivalent that can hold an arbitrary array size.Etui
Yes that is true I think, as you also said; as I increase by +=16 the counts[val] could in some cases then get a value of 256, if I am not missing something there.Laurilaurianne
@Andreas: read my earlier comment again where I explained that the max counts[val] is 16 in your code. Remember that each counts[val] holds 16 separate byte counters, each of which is either incremented by 1 or 0 for every i2 iteration. Horizontal summing is deferred until SumAbsoluteDifferences which does it without overflow. Vectorization distributes the 16 x 16 counts over 16-element vectors. You use index += 16; because that's the vector width of your count vectors. Seriously, use a debugger to step through your code and watch it work if you're getting lost on how it worksEtui
@Peter yes you are right. Now when I think about what you are saying, you are right: each of the 16 separate bytes in the Vector can only get a max value of 16 in my code. Yes of course. These 16-dimensional things get one confused suddenly :) Do you think I could make the loops more efficient by reorganising so this inner loop can take a max count of 256 for the counts[val]? That should mean that I will call SumAbsoluteDifferences less often.Laurilaurianne
Bytes have a value range of 0..255, so no, not 256. You need to remember that 256 & 0xff = 256 % 256 = 0 and stop thinking 256. That would let a count element overflow and wrap back to zero if every 16th input element matched that val. But yes, i2 < 255, 252, or 128 or some other convenient number would amortize that sum64[i3] update work over more inner iterations. It might or might not give an overall speedup; probably not if you bottleneck on memory bandwidth, especially with multiple cores. Your vector cleanup loop is just as efficient, so spending up to 254 iters there is ok.Etui
Yes, a debugger is not a bad idea as you say, to use to see how it aligns. In this case I was thinking in the right terms but still made a mistake there.Laurilaurianne
That's true, I am thinking 256 but know about the 255. It is a crazy habit. That is a good idea to stop at for example 252, to let the sum64[i3] update work run fewer times. Yes, perhaps it gives a small boost or not, that is true, but still interesting to change this one also. This is really interesting stuff Peter, but quite new and different from normal C# programming. You opened my eyes to this definitely :)Laurilaurianne
When you talk about memory bandwidth, is it cache memory you are talking about? Can only a better processor solve this issue, one that has a bigger amount of cache? For example I will soon buy the Opteron 6376 so I have 64 cores of those.Laurilaurianne
Spending money on an obsolete Piledriver CPU seems like a completely horrible idea. It will only be a bit faster per core than the CPUs you have. It's an obsolete architecture. A Zen2 ThreadRipper could run this more than twice as fast per core (with AVX2) as Piledriver, and have better memory bandwidth. If a larger cache lets your arrays fit entirely in L3 cache, that can help, but I meant DRAM bandwidth not cache size. Bytes transferred per clock cycle or per second.Etui
Also, your update looks like it introduces the memory-alignment performance problem I thought you had earlier. Now you do have it. Doing the first N%16 elements first leaves index % 16 != 0 for the whole main loop. That probably means you're doing unaligned vector loads every 4th load, and will suffer for at every cache-line and page boundary if you have odd N. Put your scalar cleanup loop at the end. Testing with N = 100000 won't reveal this because the 100000 % 16 = 0. Try testing with N = 99999 and see if performance gets significantly worse.Etui
Oh, and you removed the vectorized part of the cleanup so now you go pure scalar for up to 3839 elements instead of up to 15. Previously you had a separate loop that did fewer counts. You could have your outer loop calculate an iteration count for the inner loop, like outer_end = min(255, (length - index)/16); Or have the outer loop calculate an upper bound for index like endidx = min(index + 255*16, length & -16), so your inner loop can just be while (index < endidx) { ... }. (length & -16 is length rounded down to a multiple of 16)Etui
Yes, I know it is not the best to spend money on the Piledriver but I will only need to put in 250$ to get them including RAM and CPU coolers because I already have the MB etc. But in the future I will look for something better there.Laurilaurianne
The new code that is in this post now, which I changed, went from 6.2 times faster to 7.6 times faster, even though, as you say, it has up to 3839 extra indexes for the scalar part. Now I have made code that does cleanup at the end and only uses 3 different for loops, which seems optimal to my eyes. But that code is only 5.5 times faster. Also, I tried N == 99999 with no speed difference at all in the above code. It is a little confusing why my new "perfect" code that only has 15 indexes of cleanup/scalar is only 5.5 times faster. I also tried an approach with while loops which was 6 times faster.Laurilaurianne
I will add a second part to this code solution, about something I wonder if it is possible, in a completely new SO post.Laurilaurianne
