Compute the sum of array values in parallel with Metal and Swift

I am trying to compute the sum of a large array in parallel with Metal and Swift.

Is there a good way to do it?

My plan is to divide the array into sub-arrays, compute the sum of each sub-array in parallel, and then, when the parallel computation is finished, compute the sum of the sub-sums.

For example, if I have

array = [a_0, ..., a_n] 

I divide the array into sub-arrays:

array_1   = [a_0, ..., a_i],
array_2   = [a_(i+1), ..., a_2i],
...
array_n/i = [a_(n-i+1), ..., a_n]

The sums for these sub-arrays are computed in parallel, and I get

sum_1, sum_2, sum_3, ..., sum_(n/i)

At the end, I just compute the sum of the sub-sums.
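
To make the plan concrete, here is a minimal CPU-only sketch of the same idea (the chunk size and the use of DispatchQueue.concurrentPerform are just illustrative choices on my part, not something Metal requires):

import Foundation

// Sum `values` by splitting them into chunks, summing each chunk on its own
// iteration of concurrentPerform, and finally summing the partial sums.
func chunkedSum(_ values: [Float], chunkSize: Int = 10_000) -> Float {
    let chunkCount = (values.count + chunkSize - 1) / chunkSize
    var partialSums = [Float](repeating: 0, count: chunkCount)

    partialSums.withUnsafeMutableBufferPointer { partials in
        values.withUnsafeBufferPointer { buffer in
            DispatchQueue.concurrentPerform(iterations: chunkCount) { chunk in
                let start = chunk * chunkSize
                let end = min(start + chunkSize, buffer.count)
                var sum: Float = 0
                for i in start..<end { sum += buffer[i] }
                partials[chunk] = sum   // each chunk writes only its own slot
            }
        }
    }

    // At the end, the sum of the sub-sums.
    return partialSums.reduce(0, +)
}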

I created an application which runs my Metal shader, but there are some things I don't quite understand.

        var array:[[Float]] = [[1,2,3], [4,5,6], [7,8,9]]

        // get device
        let device: MTLDevice! = MTLCreateSystemDefaultDevice()

        // get library
        let defaultLibrary:MTLLibrary! = device.newDefaultLibrary()

        // queue
        let commandQueue:MTLCommandQueue! = device.newCommandQueue()

        // function
        let kernelFunction: MTLFunction! = defaultLibrary.newFunctionWithName("calculateSum")

        // pipeline with function
        let pipelineState: MTLComputePipelineState! = try device.newComputePipelineStateWithFunction(kernelFunction)

        // buffer for function
        let commandBuffer:MTLCommandBuffer! = commandQueue.commandBuffer()

        // encode function
        let commandEncoder:MTLComputeCommandEncoder = commandBuffer.computeCommandEncoder()

        // add function to encode
        commandEncoder.setComputePipelineState(pipelineState)

        // options
        let resourceOption = MTLResourceOptions()

        let arrayByteLength = array.count * array[0].count * sizeofValue(array[0][0])

        let arrayBuffer = device.newBufferWithBytes(&array, length: arrayByteLength, options: resourceOption)

        commandEncoder.setBuffer(arrayBuffer, offset: 0, atIndex: 0)

        var result:[Float] = [0,0,0]

        let resultByteLength = result.count * sizeofValue(result[0])

        let resultBuffer = device.newBufferWithBytes(&result, length: resultByteLength, options: resourceOption)

        commandEncoder.setBuffer(resultBuffer, offset: 0, atIndex: 1)

        let threadGroupSize = MTLSize(width: 1, height: 1, depth: 1)

        let threadGroups = MTLSize(width: (array.count), height: 1, depth: 1)

        commandEncoder.dispatchThreadgroups(threadGroups, threadsPerThreadgroup: threadGroupSize)

        commandEncoder.endEncoding()

        commandBuffer.commit()

        commandBuffer.waitUntilCompleted()

        let data = NSData(bytesNoCopy: resultBuffer.contents(), length: result.count * sizeof(Float), freeWhenDone: false)

        data.getBytes(&result, length: result.count * sizeof(Float))

        print(result)

This is my Swift code,

and my shader is:

kernel void calculateSum(const device float *inFloat [[ buffer(0) ]],
                         device float *result [[ buffer(1) ]],
                         uint id [[ thread_position_in_grid ]]) {

    // Each thread sums one row of three floats and writes it to its own slot.
    const device float *f = inFloat + id * 3;
    float sum = 0;
    for (int i = 0; i < 3; ++i) {
        sum = sum + f[i];
    }

    result[id] = sum;
}

I don't know how to define that inFloat is an array of arrays. I don't know exactly what threadGroupSize and threadGroups are. And I don't know what device and uint mean in the shader parameters.

Is this the right approach?

Nutbrown answered 2/7, 2016 at 21:29 Comment(5)
Why do you choose Metal for this?Herold
Firstly, to see how it is done; secondly, because the GPU can run a lot of computations in parallel, while the CPU has only 2 or 4 cores.Nutbrown
How big is your "large array"? And are you planning to run this on macOS or iOS?Herold
The array will be 200 000 000 or 2 million elements big. I will run it on iOS.Nutbrown
Sorry, typo: 200 000 or 2 000 000.Nutbrown

I took the time to create a fully working example of this problem with Metal. The explanation is in the comments:

let count = 10_000_000
let elementsPerSum = 10_000

// Data type, has to be the same as in the shader
typealias DataType = CInt

let device = MTLCreateSystemDefaultDevice()!
let library = self.library(device: device)
let parsum = library.makeFunction(name: "parsum")!
let pipeline = try! device.makeComputePipelineState(function: parsum)

// Our data, randomly generated:
var data = (0..<count).map{ _ in DataType(arc4random_uniform(100)) }
var dataCount = CUnsignedInt(count)
var elementsPerSumC = CUnsignedInt(elementsPerSum)
// Number of individual results = count / elementsPerSum (rounded up):
let resultsCount = (count + elementsPerSum - 1) / elementsPerSum

// Our data in a buffer (copied):
let dataBuffer = device.makeBuffer(bytes: &data, length: MemoryLayout<DataType>.stride * count, options: [])!
// A buffer for individual results (zero initialized)
let resultsBuffer = device.makeBuffer(length: MemoryLayout<DataType>.stride * resultsCount, options: [])!
// Our results in convenient form to compute the actual result later:
let pointer = resultsBuffer.contents().bindMemory(to: DataType.self, capacity: resultsCount)
let results = UnsafeBufferPointer<DataType>(start: pointer, count: resultsCount)

let queue = device.makeCommandQueue()!
let cmds = queue.makeCommandBuffer()!
let encoder = cmds.makeComputeCommandEncoder()!

encoder.setComputePipelineState(pipeline)

encoder.setBuffer(dataBuffer, offset: 0, index: 0)

encoder.setBytes(&dataCount, length: MemoryLayout<CUnsignedInt>.size, index: 1)
encoder.setBuffer(resultsBuffer, offset: 0, index: 2)
encoder.setBytes(&elementsPerSumC, length: MemoryLayout<CUnsignedInt>.size, index: 3)

// We have to calculate the sum `resultCount` times => amount of threadgroups is `resultsCount` / `threadExecutionWidth` (rounded up) because each threadgroup will process `threadExecutionWidth` threads
let threadgroupsPerGrid = MTLSize(width: (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth, height: 1, depth: 1)

// Here we set that each threadgroup should process `threadExecutionWidth` threads, the only important thing for performance is that this number is a multiple of `threadExecutionWidth` (here 1 times)
let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)

encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
encoder.endEncoding()

var start, end : UInt64
var result : DataType = 0

start = mach_absolute_time()
cmds.commit()
cmds.waitUntilCompleted()
for elem in results {
    result += elem
}

end = mach_absolute_time()

print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
result = 0

start = mach_absolute_time()
data.withUnsafeBufferPointer { buffer in
    for elem in buffer {
        result += elem
    }
}
end = mach_absolute_time()

print("CPU result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")

I used my Mac to test it, but it should work just fine on iOS.

Output:

Metal result: 494936505, time: 0.024611456
CPU result: 494936505, time: 0.163341018

The Metal version is about 7 times faster. I'm sure you can get more speed if you implement something like divide-and-conquer with cutoff or whatever.
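
For what it's worth, "divide-and-conquer with cutoff" on the CPU side could look roughly like the sketch below (my own illustration, not code from this answer): ranges below the cutoff are summed directly, larger ranges are split in half and the halves are summed concurrently.

// Assumes `import Foundation` and the same DataType = CInt as above.
func parallelSum(_ values: UnsafeBufferPointer<CInt>,
                 range: Range<Int>,
                 cutoff: Int = 100_000) -> CInt {
    // Small ranges: sum sequentially; the cutoff keeps task overhead in check.
    if range.count <= cutoff {
        var sum: CInt = 0
        for i in range { sum += values[i] }
        return sum
    }
    // Large ranges: split in half and sum the halves concurrently.
    let mid = range.lowerBound + range.count / 2
    var left: CInt = 0
    var right: CInt = 0
    DispatchQueue.concurrentPerform(iterations: 2) { half in
        if half == 0 {
            left = parallelSum(values, range: range.lowerBound..<mid, cutoff: cutoff)
        } else {
            right = parallelSum(values, range: mid..<range.upperBound, cutoff: cutoff)
        }
    }
    return left + right
}

// Usage with the `data` array from above:
// let total = data.withUnsafeBufferPointer { parallelSum($0, range: 0..<$0.count) }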

Serai answered 3/7, 2016 at 0:47 Comment(11)
Thanks, this is really great. Only one question: if I understand this shader correctly, will the sub-sum for each resultIndex be calculated 1000 times (with the same result)? Or do I misunderstand how the shader is called?Nutbrown
@MarkoZadravec 10.000 is the number of elements each individual task sums up. This means if the data size is 1.000.000 there are going to be 100 tasks. The results array holds the sums from these 100 tasks. The shader only knows its task number (resultIndex) and how many items it needs to sum up (elementsPerSum); the task number says where in the results buffer it needs to store the sum, and by multiplying those two numbers it gets the first of the 10.000 elements it needs to sum up. I know this is not the best explanation, maybe you can read some articles to find out more.Serai
No, I understand the arithmetic of how you get the start and end position (and I mistook 100 for 10000). My question is: each block (thread) is of size 10.000, and inside the block I calculate the start and end position and compute the sum. But is the shader called only once per block, or is it called for each element in the block? Because in the latter case we would calculate one sub-sum 10.000 times (the result would always be the same, because we calculated the same start and end position). If this is true, wouldn't it be better to have, instead of an array of integers, an array of arrays of integers?Nutbrown
@MarkoZadravec the shader is called once per sub-sum/block with its corresponding index. This happens with the dispatch method, which dispatches threadgroups * threadsPerThreadgroup threads in total, giving each of them its threadgroup index and thread_in_threadgroup index. An array of arrays is the same as a single array, since memory is just linear, not 2-dimensional. Also, in Swift an array is just a reference, so this wouldn't even work.Serai
Ok thanks, I was confused because you get the whole array and indexes (and because you can't print to the log in a Metal shader), and I thought that it is called n times.Nutbrown
Is there a reason that you use CUnsignedInt? If I try to change it to either Int or CInt I get the wrong sum from the GPU. Is there a reason for this?Nutbrown
@MarkoZadravec DataType has to be the same type as the shader uses => If you change it to e.g. CInt, you need to change it to int in the shader as well. Remember to always use the Swift equivalents with C prefix, as Metal uses C data types. I updated my answer with CInt and fixed the types, have a look at the changesSerai
Kametrixom, thank you for your answer. I have some additional questions regarding this : #38233140.Nutbrown
You might get better performance if you change the line sums[resultIndex] += data[dataIndex]; to sum the values in a local variable and then write that sum to sums in only one write operation. You would have less memory accesses.Abdulabdulla
Using a multithreaded vector sum with an 8-thread CPU, my timings are: Metal lap time: 177.420020, CPU MT lap time: 42.018056, i.e. the CPU is about 4 times faster. But using GPU matrix arithmetic, the GPU is about 20 times faster in an equivalent operation.Hendeca
Interesting. When I run this example on my M1 MacBook Pro, and make sure to compile the app with Release build (i.e. full optimisation) the CPU version is actually slightly faster than the Metal version. With Debug build the CPU version is much slower but the Metal version remains the sameVerney

The accepted answer is annoyingly missing the kernel that was written for it. The source is here, but here is the full program and shader that can be run as a Swift command-line application.

/*
 * Command line Metal Compute Shader for data processing
 */

import Metal
import Foundation
//------------------------------------------------------------------------------
let count = 10_000_000
let elementsPerSum = 10_000
//------------------------------------------------------------------------------
typealias DataType = CInt // Data type, has to be the same as in the shader
//------------------------------------------------------------------------------
let device = MTLCreateSystemDefaultDevice()!
let library = device.makeDefaultLibrary()!
let parsum = library.makeFunction(name: "parsum")!
let pipeline = try! device.makeComputePipelineState(function: parsum)
//------------------------------------------------------------------------------
// Our data, randomly generated:
var data = (0..<count).map{ _ in DataType(arc4random_uniform(100)) }
var dataCount = CUnsignedInt(count)
var elementsPerSumC = CUnsignedInt(elementsPerSum)
// Number of individual results = count / elementsPerSum (rounded up):
let resultsCount = (count + elementsPerSum - 1) / elementsPerSum
//------------------------------------------------------------------------------
// Our data in a buffer (copied):
let dataBuffer = device.makeBuffer(bytes: &data, length: MemoryLayout<DataType>.stride * count, options: [])!
// A buffer for individual results (zero initialized)
let resultsBuffer = device.makeBuffer(length: MemoryLayout<DataType>.stride * resultsCount, options: [])!
// Our results in convenient form to compute the actual result later:
let pointer = resultsBuffer.contents().bindMemory(to: DataType.self, capacity: resultsCount)
let results = UnsafeBufferPointer<DataType>(start: pointer, count: resultsCount)
//------------------------------------------------------------------------------
let queue = device.makeCommandQueue()!
let cmds = queue.makeCommandBuffer()!
let encoder = cmds.makeComputeCommandEncoder()!
//------------------------------------------------------------------------------
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(dataBuffer, offset: 0, index: 0)
encoder.setBytes(&dataCount, length: MemoryLayout<CUnsignedInt>.size, index: 1)
encoder.setBuffer(resultsBuffer, offset: 0, index: 2)
encoder.setBytes(&elementsPerSumC, length: MemoryLayout<CUnsignedInt>.size, index: 3)
//------------------------------------------------------------------------------
// We have to calculate the sum `resultCount` times => amount of threadgroups is `resultsCount` / `threadExecutionWidth` (rounded up) because each threadgroup will process `threadExecutionWidth` threads
let threadgroupsPerGrid = MTLSize(width: (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth, height: 1, depth: 1)

// Here we set that each threadgroup should process `threadExecutionWidth` threads, the only important thing for performance is that this number is a multiple of `threadExecutionWidth` (here 1 times)
let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)
//------------------------------------------------------------------------------
encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
encoder.endEncoding()
//------------------------------------------------------------------------------
var start, end : UInt64
var result : DataType = 0
//------------------------------------------------------------------------------
start = mach_absolute_time()
cmds.commit()
cmds.waitUntilCompleted()
for elem in results {
    result += elem
}

end = mach_absolute_time()
//------------------------------------------------------------------------------
print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
//------------------------------------------------------------------------------
result = 0

start = mach_absolute_time()
data.withUnsafeBufferPointer { buffer in
    for elem in buffer {
        result += elem
    }
}
end = mach_absolute_time()

print("CPU result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
//------------------------------------------------------------------------------
The shader (in a separate .metal file):

#include <metal_stdlib>
using namespace metal;

typedef unsigned int uint;
typedef int DataType;

kernel void parsum(const device DataType* data [[ buffer(0) ]],
                   const device uint& dataLength [[ buffer(1) ]],
                   device DataType* sums [[ buffer(2) ]],
                   const device uint& elementsPerSum [[ buffer(3) ]],
                   
                   const uint tgPos [[ threadgroup_position_in_grid ]],
                   const uint tPerTg [[ threads_per_threadgroup ]],
                   const uint tPos [[ thread_position_in_threadgroup ]]) {
    
    uint resultIndex = tgPos * tPerTg + tPos;
    
    uint dataIndex = resultIndex * elementsPerSum; // Where the summation should begin
    uint endIndex = dataIndex + elementsPerSum < dataLength ? dataIndex + elementsPerSum : dataLength; // The index where summation should end
    
    for (; dataIndex < endIndex; dataIndex++)
        sums[resultIndex] += data[dataIndex];
}

Objective-C

The same command-line programme, written in Objective-C:

#import <Foundation/Foundation.h>
#import <Metal/Metal.h>
#import <mach/mach_time.h>

typedef int DataType;

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        unsigned int count = 10000000;
        unsigned int elementsPerSum = 10000;
        //----------------------------------------------------------------------
        id<MTLDevice> device  = MTLCreateSystemDefaultDevice();
        id<MTLLibrary>library = [device newDefaultLibrary];
        
        id<MTLFunction>parsum = [library newFunctionWithName:@"parsum"];
        id<MTLComputePipelineState> pipeline = [device newComputePipelineStateWithFunction:parsum error:nil];
        //----------------------------------------------------------------------
        DataType* data = (DataType*) malloc(sizeof(DataType) * count);
        for (int i = 0; i < count; i++){
            data[i] = arc4random_uniform(100);
        }
        unsigned int dataCount = count;
        unsigned int elementsPerSumC = elementsPerSum;
        unsigned int resultsCount = (count + elementsPerSum - 1) / elementsPerSum;
        //------------------------------------------------------------------------------
        id<MTLBuffer>dataBuffer = [device newBufferWithBytes:data
                                                      length:(sizeof(int) * count)
                                                     options:MTLResourceStorageModeManaged];
        
        id<MTLBuffer>resultsBuffer = [device newBufferWithLength:(sizeof(DataType) * resultsCount)
                                                         options:0];
        
        DataType* results = resultsBuffer.contents;
        //----------------------------------------------------------------------
        id<MTLCommandQueue>queue = [device newCommandQueue];
        id<MTLCommandBuffer>cmds = [queue commandBuffer];
        id<MTLComputeCommandEncoder> encoder = [cmds computeCommandEncoder];
        //----------------------------------------------------------------------
        [encoder setComputePipelineState:pipeline];
        [encoder setBuffer:dataBuffer offset:0 atIndex:0];
        [encoder setBytes:&dataCount length:sizeof(unsigned int) atIndex:1];
        [encoder setBuffer:resultsBuffer offset:0 atIndex:2];
        [encoder setBytes:&elementsPerSumC length:sizeof(unsigned int) atIndex:3];
        //----------------------------------------------------------------------
        MTLSize threadgroupsPerGrid =
        {
            (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth,
            1,
            1
        };
        
        MTLSize threadsPerThreadgroup =
        {
            pipeline.threadExecutionWidth,
            1,
            1
        };
        //----------------------------------------------------------------------
        [encoder dispatchThreadgroups:threadgroupsPerGrid threadsPerThreadgroup:threadsPerThreadgroup];
        [encoder endEncoding];
        //----------------------------------------------------------------------
        uint64_t start, end;
        DataType result = 0;
        
        start = mach_absolute_time();
        [cmds commit];
        [cmds waitUntilCompleted];

        for (int i = 0; i < resultsCount; i++){
            result += results[i];
        }

        end = mach_absolute_time();

        NSLog(@"Metal Result %d. time %f", result, (float)(end - start)/(float)(NSEC_PER_SEC));
        //----------------------------------------------------------------------
        result = 0;

        start = mach_absolute_time();

        for (int i = 0; i < count; i++){
            result += data[i];
        }

        end = mach_absolute_time();
        NSLog(@"Metal Result %d. time %f", result, (float)(end - start)/(float)(NSEC_PER_SEC));

        //------------------------------------------------------------------------------
        free(data);
    }
    return 0;
}

Koller answered 23/8, 2019 at 10:54 Comment(1)
Thank you for this! I found that you now need to use let device = MTLCopyAllDevices()[0] when setting the device because the compiler complains that the program is non-interactive now when using MTLCreateSystemDefaultDevice().Adele

I've been running the app on a GT 740 (384 cores) vs. an i7-4790 with a multithreaded vector sum implementation, and here are my figures:

Metal lap time: 19.959092
cpu MT lap time: 4.353881

That's a 5/1 ratio in favour of the CPU, so unless you have a powerful GPU, using shaders is not worth it.

I've been testing the same code on an i7-3610QM with its iGPU (Intel HD 4000) and, surprisingly, results are much better for Metal: 2/1.

Edited: after tweaking the thread parameters I've finally improved GPU performance; now it's up to 16x the CPU.
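
The thread-parameter tweak isn't shown; one plausible adjustment (an assumption on my part, not necessarily what was done here) is to use the widest threadgroup the pipeline supports and dispatch the exact grid size, which recent GPUs allow via dispatchThreads:

// Reuses `pipeline`, `encoder` and `resultsCount` from the answers above.
// dispatchThreads(_:threadsPerThreadgroup:) needs non-uniform threadgroup
// support (iOS 11+/macOS 10.13+ on capable GPUs) and avoids rounding the grid
// up to a multiple of the threadgroup width.
let width = min(pipeline.maxTotalThreadsPerThreadgroup, resultsCount)
let threadsPerThreadgroup = MTLSize(width: width, height: 1, depth: 1)
let gridSize = MTLSize(width: resultsCount, height: 1, depth: 1)
encoder.dispatchThreads(gridSize, threadsPerThreadgroup: threadsPerThreadgroup)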

Hendeca answered 29/3, 2019 at 14:54 Comment(1)
Can you post your updated solution using matrix calculations with height and depth?Tangelatangelo
