Writing a vector sum function with SIMD (System.Numerics) and making it faster than a for loop
I wrote a function to add up all the elements of a double[] array using SIMD (System.Numerics.Vector) and the performance is worse than the naïve method.

On my computer Vector<double>.Count is 4 which means I could create an accumulator of 4 values and run through the array adding up the elements by groups.

For example, with a 10-element array, a 4-element accumulator, and 2 remaining elements, I would get

//     | loop                  | remainder
acc[0] = vector[0] + vector[4] + vector[8]
acc[1] = vector[1] + vector[5] + vector[9]
acc[2] = vector[2] + vector[6] 
acc[3] = vector[3] + vector[7] 

and the result sum = acc[0]+acc[1]+acc[2]+acc[3]
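The lane-wise layout above is language-neutral, so it can be modeled without real SIMD. Below is a small Python sketch of the same data flow, assuming a vector width of 4 lanes; it only simulates what the accumulator holds, not vector instructions:

```python
def sum_lanes(values, lanes=4):
    """Model of the lane-wise accumulation: seed the accumulator with the
    remainder elements, then add each full group of `lanes` elements lane by lane."""
    groups, remain = divmod(len(values), lanes)
    acc = [0.0] * lanes
    # Seed with the tail elements (the "remainder" column in the diagram).
    for j in range(remain):
        acc[j] = values[groups * lanes + j]
    # Accumulate each complete group, one lane at a time.
    for i in range(groups):
        for j in range(lanes):
            acc[j] += values[i * lanes + j]
    # Final horizontal sum across the lanes.
    return sum(acc)

data = [float(x) for x in range(10)]
print(sum_lanes(data))  # -> 45.0, same as sum(data)
```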

The code below produces the correct results, but it is no faster than a plain loop adding up the values.

public static double SumSimd(this Span<double> a)
{
    var n = System.Numerics.Vector<double>.Count;
    var count = a.Length;
    // divide array into n=4 element groups
    // Example: 57 = 14*4 + 1
    var groups = Math.DivRem(count, n, out int remain);
    var buffer = new double[n];
    // Create buffer with remaining elements (not in groups)
    a.Slice(groups*n, remain).CopyTo(buffer);
    // Scan through all groups and accumulate
    var accumulator = new System.Numerics.Vector<double>(buffer);
    for (int i = 0; i < groups; i++)
    {
        //var next = new System.Numerics.Vector<double>(a, n * i);
        var next = new System.Numerics.Vector<double>(a.Slice(n * i, n));
        accumulator += next;
    }
    var sum = 0.0;
    // Add up the elements of the accumulator
    for (int j = 0; j < n; j++)
    {
        sum += accumulator[j];
    }
    return sum;
}

So my question is: why am I not realizing any benefit from SIMD here?


Baseline

The baseline code looks like this

public static double LinAlgSum(this ReadOnlySpan<double> span)
{
    double sum = 0;
    for (int i = 0; i < span.Length; i++)
    {
        sum += span[i];
    }
    return sum;
}

Benchmarking the SIMD code against the above, the SIMD code is 5× slower for size=7, 2.5× slower for size=144, and about the same for size=722.

I am running in release mode using BenchmarkDotNet. Here is the driver class:

[GroupBenchmarksBy(BenchmarkLogicalGroupRule.ByCategory)]
[CategoriesColumn]
public class LinearAlgebraBench
{
    [Params(7, 35, 77, 144, 195, 311, 722)]
    public int Size { get; set; }

    [IterationSetup]
    public void SetupData()
    {
        A = new LinearAlgebra.Vector(Size, (iter) => 2 * Size - iter).ToArray();
        B = new LinearAlgebra.Vector(Size, (iter) => Size/2 + 2* iter).ToArray();
    }

    public double[] A { get; set; }
    public double[] B { get; set; }

    [BenchmarkCategory("Sum"), Benchmark(Baseline = true)]
    public double BenchLinAlgSum()
    {
        return LinearAlgebra.LinearAlgebra.Sum(A.AsSpan().AsReadOnly());
    }
    [BenchmarkCategory("Sum"), Benchmark]
    public double BenchSimdSum()
    {
        return LinearAlgebra.LinearAlgebra.SumSimd(A);
    }
}
Duvall answered 19/5, 2021 at 14:57 Comment(13)
You are calling a method and there is an overhead in doing the call.Punctilio
@Punctilio - My SIMD vector add function is faster than a for loop, and it still has overhead. There is something odd about summing here, I think.Duvall
Are you running in debug or release mode?Raila
Anyhow: copying the data from the input to the Vector, or even allocating the object, might already take up more cycles than evaluating the actual instruction. How are you timing the performance? (do you have some github repo, so I can experiment on this without having to reinvent the wheel?)Raila
@Raila release mode.Duvall
Managed C# code adds a lot of overhead to calls so that if there is an exception the PC doesn't crash (blue screen).Punctilio
@Punctilio - I am not comparing C# to C++, so I don't understand how your comment is relevant here. It is all managed code, one using double as a primitive and the other using Vector<double> which is a SIMD packing vector.Duvall
@Raila I did update the post with the baseline method, but none of it is on GitHub.Duvall
Thanks, yeah, I figured that, but that's rather straightforward. It would save me/us time if we didn't have to make everything ourselves. Anyhow, what is important is: how do you measure the evaluation time?Raila
@Raila - I added the benchmark driver class.Duvall
As for me, there's no sense in testing it on arrays of fewer than 1000 elements. Could you show a more realistic test case?Reclaim
@Reclaim - The scenario involves billions of iterations over small arrays of < 100 elements.Duvall
@Reclaim - In fact, the sum() function is part of a dot() product function where SIMD is used for the element-by-element multiplication and then the sum computes the inner product.Duvall

As per @JonasH's answer:

It is also worth noting that the compiler does not seem to produce very efficient SIMD code with the generic API.

I disagree. It's only necessary to ensure that the method is properly implemented. In some cases, yes, using Intrinsics directly instead of a Numerics vector gives a serious boost, but not always.

The issue here is measuring very short operations, which BenchmarkDotNet can't do well in general. A possible solution is wrapping the target method in a loop.
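The loop-wrapping idea itself is language-neutral: a single call to a tiny function is too short to time reliably, so you time many calls and divide. A hypothetical sketch in Python (the helper name `per_call_ns` is invented for illustration, not part of any benchmarking library):

```python
import time

def tiny(xs):
    # Stand-in for the tiny method under test.
    return sum(xs)

def per_call_ns(fn, arg, reps=100_000):
    """Time `reps` calls and return the average cost per call in nanoseconds.
    The loop amortizes timer resolution and call overhead across many calls."""
    start = time.perf_counter_ns()
    for _ in range(reps):
        fn(arg)
    return (time.perf_counter_ns() - start) / reps
```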

For me, writing a representative benchmark is hard work, and I'm probably not good enough at it. But I'll try.

public class SumTest
{
    [Params(7, 35, 77, 144, 195, 311, 722)]
    public int Size { get; set; }

    [IterationSetup]
    public void SetupData()
    {
        A = Enumerable.Range(0, Size).Select(x => 1.1).ToArray();
    }

    public double[] A { get; set; }


    [BenchmarkCategory("Sum"), Benchmark(Baseline = true)]
    public double BenchScalarSum()
    {
        double result = 0;
        for (int i = 0; i < 10000; i++)
            result = SumScalar(A);
        return result;
    }

    [BenchmarkCategory("Sum"), Benchmark]
    public double BenchNumericsSum()
    {
        double result = 0;
        for (int i = 0; i < 10000; i++)
            result = SumNumerics(A);
        return result;
    }

    [BenchmarkCategory("Sum"), Benchmark]
    public double BenchIntrinsicsSum()
    {
        double result = 0;
        for (int i = 0; i < 10000; i++)
            result = SumIntrinsics(A);
        return result;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public double SumScalar(ReadOnlySpan<double> numbers)
    {
        double result = 0;
        for (int i = 0; i < numbers.Length; i++)
        {
            result += numbers[i];
        }
        return result;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public double SumNumerics(ReadOnlySpan<double> numbers)
    {
        ReadOnlySpan<Vector<double>> vectors = MemoryMarshal.Cast<double, Vector<double>>(numbers);
        Vector<double> acc = Vector<double>.Zero;
        for (int i = 0; i < vectors.Length; i++)
        {
            acc += vectors[i];
        }
        double result = Vector.Dot(acc, Vector<double>.One);
        for (int i = vectors.Length * Vector<double>.Count; i < numbers.Length; i++)
            result += numbers[i];
        return result;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public double SumIntrinsics(ReadOnlySpan<double> numbers)
    {
        ReadOnlySpan<Vector256<double>> vectors = MemoryMarshal.Cast<double, Vector256<double>>(numbers);
        Vector256<double> acc = Vector256<double>.Zero;
        for (int i = 0; i < vectors.Length; i++)
        {
            acc = Avx.Add(acc, vectors[i]);
        }
        Vector128<double> r = Sse2.Add(acc.GetUpper(), acc.GetLower());
        double result = Sse3.HorizontalAdd(r, r).GetElement(0); // I'm aware that VHADDPD is probably not efficient enough, but leaving it here for simplicity
        for (int i = vectors.Length * Vector256<double>.Count; i < numbers.Length; i++)
            result += numbers[i];
        return result;
    }
}
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-4700HQ CPU 2.40GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.203
  [Host]     : .NET Core 5.0.6 (CoreCLR 5.0.621.22011, CoreFX 5.0.621.22011), X64 RyuJIT
  Job-NQCIIR : .NET Core 5.0.6 (CoreCLR 5.0.621.22011, CoreFX 5.0.621.22011), X64 RyuJIT
Method Size Mean Error StdDev Median Ratio RatioSD
BenchScalarSum 7 53.34 us 0.056 us 0.050 us 53.30 us 1.00 0.00
BenchNumericsSum 7 48.95 us 2.262 us 6.671 us 44.95 us 0.95 0.10
BenchIntrinsicsSum 7 55.85 us 2.089 us 6.128 us 51.90 us 1.07 0.10
BenchScalarSum 35 258.46 us 2.319 us 3.541 us 257.00 us 1.00 0.00
BenchNumericsSum 35 94.14 us 1.989 us 5.705 us 91.00 us 0.36 0.02
BenchIntrinsicsSum 35 90.82 us 2.465 us 7.073 us 92.10 us 0.35 0.03
BenchScalarSum 77 541.18 us 10.401 us 11.129 us 536.95 us 1.00 0.00
BenchNumericsSum 77 161.05 us 3.171 us 7.475 us 159.30 us 0.30 0.01
BenchIntrinsicsSum 77 153.19 us 3.063 us 7.906 us 150.50 us 0.29 0.02
BenchScalarSum 144 1,166.72 us 6.945 us 5.422 us 1,166.10 us 1.00 0.00
BenchNumericsSum 144 294.72 us 5.675 us 10.520 us 292.50 us 0.26 0.01
BenchIntrinsicsSum 144 287.18 us 5.661 us 13.671 us 284.20 us 0.25 0.01
BenchScalarSum 195 1,671.83 us 32.634 us 34.918 us 1,663.30 us 1.00 0.00
BenchNumericsSum 195 443.19 us 7.916 us 11.354 us 443.10 us 0.26 0.01
BenchIntrinsicsSum 195 444.21 us 8.876 us 7.868 us 443.55 us 0.27 0.01
BenchScalarSum 311 2,742.78 us 35.797 us 29.892 us 2,745.70 us 1.00 0.00
BenchNumericsSum 311 778.00 us 34.173 us 100.759 us 719.20 us 0.30 0.04
BenchIntrinsicsSum 311 776.30 us 29.304 us 86.404 us 727.45 us 0.29 0.03
BenchScalarSum 722 6,607.72 us 79.263 us 74.143 us 6,601.20 us 1.00 0.00
BenchNumericsSum 722 1,870.81 us 43.390 us 127.936 us 1,850.30 us 0.28 0.02
BenchIntrinsicsSum 722 1,867.57 us 39.718 us 117.110 us 1,851.50 us 0.28 0.02

It looks like using Vectors is at least no less efficient than the baseline method.
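The chunk-then-tail structure of SumNumerics (lane-wise accumulation over full vector-width chunks, a dot product with a vector of ones as the horizontal reduction, then a scalar tail) can be modeled in plain Python. This is only a sketch of the data flow under an assumed width of 4 lanes, not real SIMD:

```python
def chunked_sum(numbers, width=4):
    """Model of the SumNumerics strategy: full-width chunks, dot-with-ones
    reduction, scalar remainder."""
    full = len(numbers) // width           # number of complete chunks
    acc = [0.0] * width
    for i in range(full):
        chunk = numbers[i * width:(i + 1) * width]
        acc = [a + c for a, c in zip(acc, chunk)]
    # Vector.Dot(acc, Vector.One) is just the sum of the lanes.
    result = sum(a * 1.0 for a in acc)
    # Scalar tail, mirroring the trailing loop in SumNumerics.
    for x in numbers[full * width:]:
        result += x
    return result
```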


As a bonus, let's look at the output assembly code using https://sharplab.io/ (x64)

SumTest.SumScalar(System.ReadOnlySpan`1<Double>)
    L0000: vzeroupper
    L0003: mov rax, [rdx]
    L0006: mov edx, [rdx+8]
    L0009: vxorps xmm0, xmm0, xmm0
    L000d: xor ecx, ecx
    L000f: test edx, edx
    L0011: jle short L0022
    L0013: movsxd r8, ecx
    L0016: vaddsd xmm0, xmm0, [rax+r8*8]
    L001c: inc ecx
    L001e: cmp ecx, edx
    L0020: jl short L0013
    L0022: ret

SumTest.SumNumerics(System.ReadOnlySpan`1<Double>)
    L0000: sub rsp, 0x28
    L0004: vzeroupper
    L0007: mov rax, [rdx]
    L000a: mov edx, [rdx+8]
    L000d: mov ecx, edx
    L000f: shl rcx, 3
    L0013: shr rcx, 5
    L0017: cmp rcx, 0x7fffffff
    L001e: ja short L0078
    L0020: vxorps ymm0, ymm0, ymm0
    L0024: xor r8d, r8d
    L0027: test ecx, ecx
    L0029: jle short L0040
    L002b: movsxd r9, r8d
    L002e: shl r9, 5
    L0032: vaddpd ymm0, ymm0, [rax+r9]
    L0038: inc r8d
    L003b: cmp r8d, ecx
    L003e: jl short L002b
    L0040: vmulpd ymm0, ymm0, [SumTest.SumNumerics(System.ReadOnlySpan`1<Double>)]
    L0048: vhaddpd ymm0, ymm0, ymm0
    L004c: vextractf128 xmm1, ymm0, 1
    L0052: vaddpd xmm0, xmm0, xmm1
    L0056: shl ecx, 2
    L0059: cmp ecx, edx
    L005b: jge short L0070
    L005d: cmp ecx, edx
    L005f: jae short L007e
    L0061: movsxd r8, ecx
    L0064: vaddsd xmm0, xmm0, [rax+r8*8]
    L006a: inc ecx
    L006c: cmp ecx, edx
    L006e: jl short L005d
    L0070: vzeroupper
    L0073: add rsp, 0x28
    L0077: ret
    L0078: call 0x00007ffc9de2b710
    L007d: int3
    L007e: call 0x00007ffc9de2bc70
    L0083: int3

SumTest.SumIntrinsics(System.ReadOnlySpan`1<Double>)
    L0000: sub rsp, 0x28
    L0004: vzeroupper
    L0007: mov rax, [rdx]
    L000a: mov edx, [rdx+8]
    L000d: mov ecx, edx
    L000f: shl rcx, 3
    L0013: shr rcx, 5
    L0017: cmp rcx, 0x7fffffff
    L001e: ja short L0070
    L0020: vxorps ymm0, ymm0, ymm0
    L0024: xor r8d, r8d
    L0027: test ecx, ecx
    L0029: jle short L0040
    L002b: movsxd r9, r8d
    L002e: shl r9, 5
    L0032: vaddpd ymm0, ymm0, [rax+r9]
    L0038: inc r8d
    L003b: cmp r8d, ecx
    L003e: jl short L002b
    L0040: vextractf128 xmm1, ymm0, 1
    L0046: vaddpd xmm0, xmm1, xmm0
    L004a: vhaddpd xmm0, xmm0, xmm0
    L004e: shl ecx, 2
    L0051: cmp ecx, edx
    L0053: jge short L0068
    L0055: cmp ecx, edx
    L0057: jae short L0076
    L0059: movsxd r8, ecx
    L005c: vaddsd xmm0, xmm0, [rax+r8*8]
    L0062: inc ecx
    L0064: cmp ecx, edx
    L0066: jl short L0055
    L0068: vzeroupper
    L006b: add rsp, 0x28
    L006f: ret
    L0070: call 0x00007ffc9de2b710
    L0075: int3
    L0076: call 0x00007ffc9de2bc70
    L007b: int3

Here you can see that the JIT produces almost the same code for Vector<T> as for Vector256<T>.

Reclaim answered 20/5, 2021 at 23:7 Comment(13)
Yup, microbenchmarking is hard. Especially for very short loops, especially if you want to realistically measure branch prediction effects without letting the branch predictor simply learn the pattern for doing the same size repeatedly. (Unless that is realistic). But yeah, I wouldn't be at all surprised if the results in the question are just failure to do warm-up runs etc. (Idiomatic way of performance evaluation?).Calliopsis
For reducing from 256->128, that first 128-bit add can be plain vertical add. Sse3.HorizontalAdd is a compact but inefficient way to shuffle between the halves of the lower lane, but no need to pay that penalty for two different vectors. Sse2.Add should be good in the first one.Calliopsis
If you really wanted to optimize for very small arrays, some kind of masked thing might be worth considering, although with at most 3 odd elements left maybe not, especially if you're optimizing for the size being the same so branch prediction will hit. Maybe one final 128-bit vector as part of cleanup, after reducing the main sum to 128-bit.Calliopsis
@PeterCordes Sse3.HorizontalAdd is a compact but inefficient way yes, I've left a comment in the code near it. But this answer was a bird-look not a production-ready solution. I just showed OP that it's possible to optimize it. StackOverflow contains enough content for reading related to the further optimizations of the above solution.Reclaim
Right, sorry I got too long winded there and obscured my key point. Yes, the 2nd Sse3.HorizontalAdd is fine for a quick one-off, although it's not that hard to look up the intrinsics for unpckhpd / addpd instead. The point I was trying to make was that the 1st Sse3.HorizontalAdd should be Sse2.Add, with no other changes to the surrounding code needed. You're reducing 2 vectors of 2 doubles to 1 vector, no horizontal stuff needed. (But if you changed it now you'd have to re-run the benchmarks, so maybe just update the comment.)Calliopsis
@PeterCordes good catch, but I'm away from my PC. I must update everything then: bench, source, and assembly. Will fix it later. Thank you!Reclaim
heh, was just looking at the asm and Vector.Dot(acc, Vector<double>.One); results in an actual multiply, before a shuffle/add hsum pattern that involves one vaddpd and one vhaddpd just like I was saying you should do with intrinsics. (Except Dot does the 256-bit vhaddpd first, wasting uops on Zen1 and earlier AMD instead of first reducing to 128-bit.) Again, compact source code, but not as efficient as manually skipping the multiply. (If there's a good way to get System.Numerics to horizontal sum with shuffles, instead of store/reload.)Calliopsis
Also, for larger arrays, as Soonts pointed out you'd want multiple accumulators to hide FP add latency. If you're going to take the time to re-run benchmarks at all, might want to throw in unrolled versions to show how they help with larger arrays, but may even hurt with arrays too small for one unrolled iteration.Calliopsis
Thanks for taking the time to update. :) Seems like your last benchmark run hit a lot more noise in the low-iteration-count tests. :/ Especially for 7, and for scalar 35. stddev and error are pretty significant compared to the total, about an order of magnitude worse than your earlier run. Might want to re-run that. At least the median is included, and it's still good when there aren't too many outliers slowed down by other load. Oh, and BTW, there's markdown for tables now, although IDK if that'd be more readable than the code block table.Calliopsis
@PeterCordes yes, I see that. That's because the PC had not completely started; Windows was performing a lot of background activities. But again, I'm away from that PC :) Will do it later.Reclaim
I like your Vector.Dot(acc, Vector<double>.One) idea for summing up data.Duvall
@PeterCordes updated the Benchmark.NET output.Reclaim
@Reclaim - I won't forget. I am still assessing the answers.Duvall

I would suggest you take a look at this article exploring SIMD performance in .NET.

The overall summing algorithm using regular vectorization looks identical to yours. One difference is that the multiplication can be avoided when slicing the array:

while (i < lastBlockIndex)
{
    vresult += new Vector<int>(source.Slice(i));
    i += Vector<int>.Count;
}

One multiplication should have a fairly small effect on performance, but for this kind of code it might be relevant.

It is also worth noting that the compiler does not seem to produce very efficient SIMD code with the generic API. Performance for summing 32768 items:

  • SumUnrolled - 8,979.690 ns
  • SumVectorT - 6,689.829 ns
  • SumIntrinsics - 2,200.996 ns

So, the generic version of SIMD only gains ~30% performance, while the intrinsics version gains ~400%, near the theoretical maximum.

Salmi answered 19/5, 2021 at 18:28 Comment(1)
Great reference. I do see similarities between my implementation and the vectorized version, but I do not see the performance boost they do. This gives me something to investigate for now.Duvall

Try this version. It uses four independent accumulators to try to hide the latency of the vaddpd instruction, which is 3-4 cycles on modern AVX CPUs. Untested.

public static double vectorSum( this ReadOnlySpan<double> span )
{
    int vs = Vector<double>.Count;
    int end = span.Length;
    int endVectors = ( end / vs ) * vs;

    // Using 4 independent accumulators because on modern CPUs the latency of `vaddpd` is 3-4 cycles.
    // One batch consumes 4 vectors.
    int endBatches = ( endVectors / ( vs * 4 ) ) * ( vs * 4 );

    Vector<double> a0 = Vector<double>.Zero;
    Vector<double> a1 = Vector<double>.Zero;
    Vector<double> a2 = Vector<double>.Zero;
    Vector<double> a3 = Vector<double>.Zero;

    // Handle majority of data unrolling by 4 vectors (e.g. 16 scalars with AVX)
    int i;
    for( i = 0; i < endBatches; i += vs * 4 )
    {
        a0 += new Vector<double>( span.Slice( i, vs ) );
        a1 += new Vector<double>( span.Slice( i + vs, vs ) );
        a2 += new Vector<double>( span.Slice( i + vs * 2, vs ) );
        a3 += new Vector<double>( span.Slice( i + vs * 3, vs ) );
    }

    // Handle a few remaining complete vectors
    for( ; i < endVectors; i += vs )
        a0 += new Vector<double>( span.Slice( i, vs ) );

    // Add the accumulators together
    a0 += a1;
    a2 += a3;
    a0 += a2;

    // Compute horizontal sum of a0
    double sum = 0;
    for( int j = 0; j < vs; j++ )
        sum += a0[ j ];

    // Add the remaining few scalars
    for( ; i < end; i++ )
        sum += span[ i ];
    return sum;
}

I have not benchmarked it, but I looked at the disassembly on sharplab.io. While not as good as equivalent C++ would be (too much scalar code in the loop), it does not look terribly bad: it uses AVX and has no function calls or unneeded loads/stores in the main loop.
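The bookkeeping in the four-accumulator approach (batches of 4 vectors, leftover complete vectors, scalar tail) is easy to get wrong, so here is a Python model of that index arithmetic, with the batch bound computed in whole batches of `vs * 4` scalars. It only simulates the partial sums, not vector instructions:

```python
def unrolled_sum(span, vs=4):
    """Model of the four-accumulator unrolled sum: independent partial sums
    break the dependency chain on a single accumulator."""
    end = len(span)
    end_vectors = (end // vs) * vs                       # complete vectors
    end_batches = (end_vectors // (vs * 4)) * (vs * 4)   # complete 4-vector batches
    a = [0.0, 0.0, 0.0, 0.0]
    i = 0
    while i < end_batches:                  # main unrolled loop, 4 vectors per pass
        for k in range(4):
            a[k] += sum(span[i + k * vs : i + (k + 1) * vs])
        i += vs * 4
    while i < end_vectors:                  # a few remaining complete vectors
        a[0] += sum(span[i : i + vs])
        i += vs
    total = a[0] + a[1] + a[2] + a[3]       # combine the accumulators
    for j in range(i, end):                 # remaining few scalars
        total += span[j]
    return total
```

Note that computing the batch bound as a multiple of `vs * 4` scalars (rather than a multiple of 4) is what keeps the unrolled loop in bounds for short inputs.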

Faith answered 20/5, 2021 at 16:7 Comment(2)
The question seemed to be interested in performance on very small arrays, like 7 or 10 elements. (At that point the naive hsum might be a problem vs. shuffle/add.)Calliopsis
The vectorized unrolled loop above does indeed perform very well from the get-go, but there is a bug preventing it from executing for small arrays. I am guessing there is an assumption that the array has > 16 elements (4 accumulators times 4 packed values).Duvall
