Compute the sum of array values in parallel with Metal and Swift

I am trying to compute the sum of a large array in parallel with Metal and Swift.

Is there a good way to do it?

My plan is to divide the array into sub-arrays, compute the sum of each sub-array in parallel, and then, when the parallel computation is finished, compute the sum of the sub-sums.

For example, if I have

array = [a_0, ..., a_n] 

I divide the array into sub-arrays:

array_1   = [a_0, ..., a_i],
array_2   = [a_(i+1), ..., a_2i],
...
array_n/i = [a_(n-i+1), ..., a_n]

The sums for these sub-arrays are computed in parallel, and I get

sum_1, sum_2, sum_3, ..., sum_(n/i)

At the end, I just compute the sum of the sub-sums.
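
To make the plan concrete, here is a minimal CPU-only sketch of the same idea (the chunk size and the use of DispatchQueue.concurrentPerform are just illustrative choices on my part, not something Metal requires):

import Foundation

// Sum `values` by splitting them into chunks, summing each chunk on its own
// iteration of concurrentPerform, and finally summing the partial sums.
func chunkedSum(_ values: [Float], chunkSize: Int = 10_000) -> Float {
    let chunkCount = (values.count + chunkSize - 1) / chunkSize
    var partialSums = [Float](repeating: 0, count: chunkCount)

    partialSums.withUnsafeMutableBufferPointer { partials in
        values.withUnsafeBufferPointer { buffer in
            DispatchQueue.concurrentPerform(iterations: chunkCount) { chunk in
                let start = chunk * chunkSize
                let end = min(start + chunkSize, buffer.count)
                var sum: Float = 0
                for i in start..<end { sum += buffer[i] }
                partials[chunk] = sum   // each chunk writes only its own slot
            }
        }
    }

    // At the end, the sum of the sub-sums.
    return partialSums.reduce(0, +)
}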

I created an application which runs my Metal shader, but there are some things I don't quite understand.

        var array:[[Float]] = [[1,2,3], [4,5,6], [7,8,9]]

        // get device
        let device: MTLDevice! = MTLCreateSystemDefaultDevice()

        // get library
        let defaultLibrary:MTLLibrary! = device.newDefaultLibrary()

        // queue
        let commandQueue:MTLCommandQueue! = device.newCommandQueue()

        // function
        let kernelFunction: MTLFunction! = defaultLibrary.newFunctionWithName("calculateSum")

        // pipeline with function
        let pipelineState: MTLComputePipelineState! = try device.newComputePipelineStateWithFunction(kernelFunction)

        // buffer for function
        let commandBuffer:MTLCommandBuffer! = commandQueue.commandBuffer()

        // encode function
        let commandEncoder:MTLComputeCommandEncoder = commandBuffer.computeCommandEncoder()

        // add function to encode
        commandEncoder.setComputePipelineState(pipelineState)

        // options
        let resourceOption = MTLResourceOptions()

        let arrayByteLength = array.count * array[0].count * sizeofValue(array[0][0])

        let arrayBuffer = device.newBufferWithBytes(&array, length: arrayByteLength, options: resourceOption)

        commandEncoder.setBuffer(arrayBuffer, offset: 0, atIndex: 0)

        var result:[Float] = [0,0,0]

        let resultByteLength = result.count * sizeofValue(result[0])

        let resultBuffer = device.newBufferWithBytes(&result, length: resultByteLength, options: resourceOption)

        commandEncoder.setBuffer(resultBuffer, offset: 0, atIndex: 1)

        let threadGroupSize = MTLSize(width: 1, height: 1, depth: 1)

        let threadGroups = MTLSize(width: (array.count), height: 1, depth: 1)

        commandEncoder.dispatchThreadgroups(threadGroups, threadsPerThreadgroup: threadGroupSize)

        commandEncoder.endEncoding()

        commandBuffer.commit()

        commandBuffer.waitUntilCompleted()

        let data = NSData(bytesNoCopy: resultBuffer.contents(), length: result.count * sizeof(Float), freeWhenDone: false)

        data.getBytes(&result, length: result.count * sizeof(Float))

        print(result)

This is my Swift code,

and my shader is:

kernel void calculateSum(const device float *inFloat [[ buffer(0) ]],
                         device float *result [[ buffer(1) ]],
                         uint id [[ thread_position_in_grid ]]) {

    // Each thread sums one row of three floats and writes it to its own slot.
    const device float *f = inFloat + id * 3;
    float sum = 0;
    for (int i = 0; i < 3; ++i) {
        sum = sum + f[i];
    }

    result[id] = sum;
}

I don't know how to define that inFloat is an array of arrays. I don't know exactly what threadGroupSize and threadGroups are. And I don't know what device and uint mean in the shader parameters.

Is this the right approach?

Nutbrown answered 2/7, 2016 at 21:29 Comment(5)
Why do you choose Metal for this?Herold
Firstly, to see how it is done; secondly, because the GPU can run a lot of computations in parallel, while the CPU has only 2 or 4 cores.Nutbrown
How big is your "large array"? And are you planning to run this on macOS or iOS?Herold
The array will be 200 000 000 or 2 million elements big. I will run it on iOS.Nutbrown
Sorry, typo: 200 000 or 2 000 000.Nutbrown

I took the time to create a fully working example of this problem with Metal. The explanation is in the comments:

let count = 10_000_000
let elementsPerSum = 10_000

// Data type, has to be the same as in the shader
typealias DataType = CInt

let device = MTLCreateSystemDefaultDevice()!
let library = self.library(device: device)
let parsum = library.makeFunction(name: "parsum")!
let pipeline = try! device.makeComputePipelineState(function: parsum)

// Our data, randomly generated:
var data = (0..<count).map{ _ in DataType(arc4random_uniform(100)) }
var dataCount = CUnsignedInt(count)
var elementsPerSumC = CUnsignedInt(elementsPerSum)
// Number of individual results = count / elementsPerSum (rounded up):
let resultsCount = (count + elementsPerSum - 1) / elementsPerSum

// Our data in a buffer (copied):
let dataBuffer = device.makeBuffer(bytes: &data, length: MemoryLayout<DataType>.stride * count, options: [])!
// A buffer for individual results (zero initialized)
let resultsBuffer = device.makeBuffer(length: MemoryLayout<DataType>.stride * resultsCount, options: [])!
// Our results in convenient form to compute the actual result later:
let pointer = resultsBuffer.contents().bindMemory(to: DataType.self, capacity: resultsCount)
let results = UnsafeBufferPointer<DataType>(start: pointer, count: resultsCount)

let queue = device.makeCommandQueue()!
let cmds = queue.makeCommandBuffer()!
let encoder = cmds.makeComputeCommandEncoder()!

encoder.setComputePipelineState(pipeline)

encoder.setBuffer(dataBuffer, offset: 0, index: 0)

encoder.setBytes(&dataCount, length: MemoryLayout<CUnsignedInt>.size, index: 1)
encoder.setBuffer(resultsBuffer, offset: 0, index: 2)
encoder.setBytes(&elementsPerSumC, length: MemoryLayout<CUnsignedInt>.size, index: 3)

// We have to calculate the sum `resultCount` times => amount of threadgroups is `resultsCount` / `threadExecutionWidth` (rounded up) because each threadgroup will process `threadExecutionWidth` threads
let threadgroupsPerGrid = MTLSize(width: (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth, height: 1, depth: 1)

// Here we set that each threadgroup should process `threadExecutionWidth` threads, the only important thing for performance is that this number is a multiple of `threadExecutionWidth` (here 1 times)
let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)

encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
encoder.endEncoding()

var start, end : UInt64
var result : DataType = 0

start = mach_absolute_time()
cmds.commit()
cmds.waitUntilCompleted()
for elem in results {
    result += elem
}

end = mach_absolute_time()

print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
result = 0

start = mach_absolute_time()
data.withUnsafeBufferPointer { buffer in
    for elem in buffer {
        result += elem
    }
}
end = mach_absolute_time()

print("CPU result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")

I used my Mac to test it, but it should work just fine on iOS.

Output:

Metal result: 494936505, time: 0.024611456
CPU result: 494936505, time: 0.163341018

The Metal version is about 7 times faster. I'm sure you can get more speed if you implement something like divide-and-conquer with cutoff or whatever.
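
For what it's worth, "divide-and-conquer with cutoff" on the CPU side could look roughly like the sketch below (my own illustration, not code from this answer): ranges below the cutoff are summed directly, larger ranges are split in half and the halves are summed concurrently.

// Assumes `import Foundation` and the same DataType = CInt as above.
func parallelSum(_ values: UnsafeBufferPointer<CInt>,
                 range: Range<Int>,
                 cutoff: Int = 100_000) -> CInt {
    // Small ranges: sum sequentially; the cutoff keeps task overhead in check.
    if range.count <= cutoff {
        var sum: CInt = 0
        for i in range { sum += values[i] }
        return sum
    }
    // Large ranges: split in half and sum the halves concurrently.
    let mid = range.lowerBound + range.count / 2
    var left: CInt = 0
    var right: CInt = 0
    DispatchQueue.concurrentPerform(iterations: 2) { half in
        if half == 0 {
            left = parallelSum(values, range: range.lowerBound..<mid, cutoff: cutoff)
        } else {
            right = parallelSum(values, range: mid..<range.upperBound, cutoff: cutoff)
        }
    }
    return left + right
}

// Usage with the `data` array from above:
// let total = data.withUnsafeBufferPointer { parallelSum($0, range: 0..<$0.count) }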

Serai answered 3/7, 2016 at 0:47 Comment(11)
Thanks, this is really great. Only one question: if I understand this shader correctly, will the sub-sum for each resultIndex be calculated 1000 times (with the same result)? Or do I misunderstand how the shader is called?Nutbrown
@MarkoZadravec 10.000 is the number of elements each individual task sums up. This means if the data size is 1.000.000 there are going to be 100 tasks. The results array holds the sums from these 100 tasks. The shader only knows its task number (resultIndex) and how many items it needs to sum up (elementsPerSum); the task number says where in the results buffer it needs to store the sum, and by multiplying those two numbers it gets the first of the 10.000 elements it needs to sum up. I know this is not the best explanation, maybe you can read some articles to find out more.Serai
No, I understand the arithmetic of how you get the start and end position (and I mistook 100 for 10000). My question is: each block (thread) is of size 10.000, and inside the block I calculate the start and end position and compute the sum. But is the shader called only once per block, or is it called for each element in the block? Because in the latter case we would calculate one sub-sum 10.000 times (the result would always be the same, because we calculated the same start and end position). If this is true, wouldn't it be better to have, instead of an array of integers, an array of arrays of integers?Nutbrown
@MarkoZadravec the shader is called once per sub-sum/block with its corresponding index. This happens with the dispatch method, which dispatches threadgroups * threadsPerThreadgroup threads in total, giving each of them its threadgroup index and thread_in_threadgroup index. An array of arrays is the same as a single array, since memory is just linear, not 2-dimensional. Also, in Swift an array is just a reference, so this wouldn't even work.Serai
Ok thanks, I was confused because you get the whole array and indexes (and because you can't print to the log in a Metal shader), and I thought that it is called n times.Nutbrown
Is there a reason that you use CUnsignedInt? If I try to change it to either Int or CInt I get the wrong sum from the GPU. Is there a reason for this?Nutbrown
@MarkoZadravec DataType has to be the same type as the shader uses => If you change it to e.g. CInt, you need to change it to int in the shader as well. Remember to always use the Swift equivalents with C prefix, as Metal uses C data types. I updated my answer with CInt and fixed the types, have a look at the changesSerai
Kametrixom, thank you for your answer. I have some additional questions regarding this : #38233140.Nutbrown
You might get better performance if you change the line sums[resultIndex] += data[dataIndex]; to sum the values in a local variable and then write that sum to sums in only one write operation. You would have less memory accesses.Abdulabdulla
Using a multithreaded vector sum with an 8-thread CPU, my timings are: Metal lap time: 177.420020, CPU MT lap time: 42.018056, i.e. the CPU is about 4 times faster. But using GPU matrix arithmetic, the GPU is about 20 times faster in an equivalent operation.Hendeca
Interesting. When I run this example on my M1 MacBook Pro, and make sure to compile the app with Release build (i.e. full optimisation) the CPU version is actually slightly faster than the Metal version. With Debug build the CPU version is much slower but the Metal version remains the sameVerney

The accepted answer is annoyingly missing the kernel that was written for it. The source is here, but here is the full program and shader that can be run as a Swift command-line application.

/*
 * Command line Metal Compute Shader for data processing
 */

import Metal
import Foundation
//------------------------------------------------------------------------------
let count = 10_000_000
let elementsPerSum = 10_000
//------------------------------------------------------------------------------
typealias DataType = CInt // Data type, has to be the same as in the shader
//------------------------------------------------------------------------------
let device = MTLCreateSystemDefaultDevice()!
let library = device.makeDefaultLibrary()!
let parsum = library.makeFunction(name: "parsum")!
let pipeline = try! device.makeComputePipelineState(function: parsum)
//------------------------------------------------------------------------------
// Our data, randomly generated:
var data = (0..<count).map{ _ in DataType(arc4random_uniform(100)) }
var dataCount = CUnsignedInt(count)
var elementsPerSumC = CUnsignedInt(elementsPerSum)
// Number of individual results = count / elementsPerSum (rounded up):
let resultsCount = (count + elementsPerSum - 1) / elementsPerSum
//------------------------------------------------------------------------------
// Our data in a buffer (copied):
let dataBuffer = device.makeBuffer(bytes: &data, length: MemoryLayout<DataType>.stride * count, options: [])!
// A buffer for individual results (zero initialized)
let resultsBuffer = device.makeBuffer(length: MemoryLayout<DataType>.stride * resultsCount, options: [])!
// Our results in convenient form to compute the actual result later:
let pointer = resultsBuffer.contents().bindMemory(to: DataType.self, capacity: resultsCount)
let results = UnsafeBufferPointer<DataType>(start: pointer, count: resultsCount)
//------------------------------------------------------------------------------
let queue = device.makeCommandQueue()!
let cmds = queue.makeCommandBuffer()!
let encoder = cmds.makeComputeCommandEncoder()!
//------------------------------------------------------------------------------
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(dataBuffer, offset: 0, index: 0)
encoder.setBytes(&dataCount, length: MemoryLayout<CUnsignedInt>.size, index: 1)
encoder.setBuffer(resultsBuffer, offset: 0, index: 2)
encoder.setBytes(&elementsPerSumC, length: MemoryLayout<CUnsignedInt>.size, index: 3)
//------------------------------------------------------------------------------
// We have to calculate the sum `resultCount` times => amount of threadgroups is `resultsCount` / `threadExecutionWidth` (rounded up) because each threadgroup will process `threadExecutionWidth` threads
let threadgroupsPerGrid = MTLSize(width: (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth, height: 1, depth: 1)

// Here we set that each threadgroup should process `threadExecutionWidth` threads, the only important thing for performance is that this number is a multiple of `threadExecutionWidth` (here 1 times)
let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)
//------------------------------------------------------------------------------
encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
encoder.endEncoding()
//------------------------------------------------------------------------------
var start, end : UInt64
var result : DataType = 0
//------------------------------------------------------------------------------
start = mach_absolute_time()
cmds.commit()
cmds.waitUntilCompleted()
for elem in results {
    result += elem
}

end = mach_absolute_time()
//------------------------------------------------------------------------------
print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
//------------------------------------------------------------------------------
result = 0

start = mach_absolute_time()
data.withUnsafeBufferPointer { buffer in
    for elem in buffer {
        result += elem
    }
}
end = mach_absolute_time()

print("CPU result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
//------------------------------------------------------------------------------
The shader (in a separate .metal file):

#include <metal_stdlib>
using namespace metal;

typedef unsigned int uint;
typedef int DataType;

kernel void parsum(const device DataType* data [[ buffer(0) ]],
                   const device uint& dataLength [[ buffer(1) ]],
                   device DataType* sums [[ buffer(2) ]],
                   const device uint& elementsPerSum [[ buffer(3) ]],
                   
                   const uint tgPos [[ threadgroup_position_in_grid ]],
                   const uint tPerTg [[ threads_per_threadgroup ]],
                   const uint tPos [[ thread_position_in_threadgroup ]]) {
    
    uint resultIndex = tgPos * tPerTg + tPos;
    
    uint dataIndex = resultIndex * elementsPerSum; // Where the summation should begin
    uint endIndex = dataIndex + elementsPerSum < dataLength ? dataIndex + elementsPerSum : dataLength; // The index where summation should end
    
    for (; dataIndex < endIndex; dataIndex++)
        sums[resultIndex] += data[dataIndex];
}

Objective-C

The same command-line programme, written in Objective-C:

#import <Foundation/Foundation.h>
#import <Metal/Metal.h>
#import <mach/mach_time.h>

typedef int DataType;

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        unsigned int count = 10000000;
        unsigned int elementsPerSum = 10000;
        //----------------------------------------------------------------------
        id<MTLDevice> device  = MTLCreateSystemDefaultDevice();
        id<MTLLibrary>library = [device newDefaultLibrary];
        
        id<MTLFunction>parsum = [library newFunctionWithName:@"parsum"];
        id<MTLComputePipelineState> pipeline = [device newComputePipelineStateWithFunction:parsum error:nil];
        //----------------------------------------------------------------------
        DataType* data = (DataType*) malloc(sizeof(DataType) * count);
        for (int i = 0; i < count; i++){
            data[i] = arc4random_uniform(100);
        }
        unsigned int dataCount = count;
        unsigned int elementsPerSumC = elementsPerSum;
        unsigned int resultsCount = (count + elementsPerSum - 1) / elementsPerSum;
        //------------------------------------------------------------------------------
        id<MTLBuffer>dataBuffer = [device newBufferWithBytes:data
                                                      length:(sizeof(int) * count)
                                                     options:MTLResourceStorageModeManaged];
        
        id<MTLBuffer>resultsBuffer = [device newBufferWithLength:(sizeof(DataType) * resultsCount)
                                                         options:0];
        
        DataType* results = resultsBuffer.contents;
        //----------------------------------------------------------------------
        id<MTLCommandQueue>queue = [device newCommandQueue];
        id<MTLCommandBuffer>cmds = [queue commandBuffer];
        id<MTLComputeCommandEncoder> encoder = [cmds computeCommandEncoder];
        //----------------------------------------------------------------------
        [encoder setComputePipelineState:pipeline];
        [encoder setBuffer:dataBuffer offset:0 atIndex:0];
        [encoder setBytes:&dataCount length:sizeof(unsigned int) atIndex:1];
        [encoder setBuffer:resultsBuffer offset:0 atIndex:2];
        [encoder setBytes:&elementsPerSumC length:sizeof(unsigned int) atIndex:3];
        //----------------------------------------------------------------------
        MTLSize threadgroupsPerGrid =
        {
            (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth,
            1,
            1
        };
        
        MTLSize threadsPerThreadgroup =
        {
            pipeline.threadExecutionWidth,
            1,
            1
        };
        //----------------------------------------------------------------------
        [encoder dispatchThreadgroups:threadgroupsPerGrid threadsPerThreadgroup:threadsPerThreadgroup];
        [encoder endEncoding];
        //----------------------------------------------------------------------
        uint64_t start, end;
        DataType result = 0;
        
        start = mach_absolute_time();
        [cmds commit];
        [cmds waitUntilCompleted];

        for (int i = 0; i < resultsCount; i++){
            result += results[i];
        }

        end = mach_absolute_time();

        NSLog(@"Metal Result %d. time %f", result, (float)(end - start)/(float)(NSEC_PER_SEC));
        //----------------------------------------------------------------------
        result = 0;

        start = mach_absolute_time();

        for (int i = 0; i < count; i++){
            result += data[i];
        }

        end = mach_absolute_time();
        NSLog(@"Metal Result %d. time %f", result, (float)(end - start)/(float)(NSEC_PER_SEC));

        //------------------------------------------------------------------------------
        free(data);
    }
    return 0;
}

Koller answered 23/8, 2019 at 10:54 Comment(1)
Thank you for this! I found that you now need to use let device = MTLCopyAllDevices()[0] when setting the device because the compiler complains that the program is non-interactive now when using MTLCreateSystemDefaultDevice().Adele

I've been running the app on a GT 740 (384 cores) vs. an i7-4790 with a multithreaded vector sum implementation, and here are my figures:

Metal lap time: 19.959092
cpu MT lap time: 4.353881

That's a 5/1 ratio in favour of the CPU, so unless you have a powerful GPU, using shaders is not worth it.

I've been testing the same code on an i7-3610QM with its iGPU (Intel HD 4000) and, surprisingly, results are much better for Metal: 2/1.

Edited: after tweaking the thread parameters I've finally improved GPU performance; now it's up to 16x the CPU.
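
The thread-parameter tweak isn't shown; one plausible adjustment (an assumption on my part, not necessarily what was done here) is to use the widest threadgroup the pipeline supports and dispatch the exact grid size, which recent GPUs allow via dispatchThreads:

// Reuses `pipeline`, `encoder` and `resultsCount` from the answers above.
// dispatchThreads(_:threadsPerThreadgroup:) needs non-uniform threadgroup
// support (iOS 11+/macOS 10.13+ on capable GPUs) and avoids rounding the grid
// up to a multiple of the threadgroup width.
let width = min(pipeline.maxTotalThreadsPerThreadgroup, resultsCount)
let threadsPerThreadgroup = MTLSize(width: width, height: 1, depth: 1)
let gridSize = MTLSize(width: resultsCount, height: 1, depth: 1)
encoder.dispatchThreads(gridSize, threadsPerThreadgroup: threadsPerThreadgroup)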

Hendeca answered 29/3, 2019 at 14:54 Comment(1)
Can you post your updated solution using matrix calculations with height and depth?Tangelatangelo
