Memory write performance - GPU CPU Shared Memory

I'm allocating both the input and output MTLBuffers using posix_memalign, following the shared GPU/CPU memory documentation provided by memkite.

Aside: it is easier to just use the latest API than to muck around with posix_memalign:

let metalBuffer = self.metalDevice.newBufferWithLength(byteCount, options: .StorageModeShared)
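
For comparison, a minimal sketch of the posix_memalign route (iOS 9-era Swift to match the call above; byteCount is assumed to be defined, and the no-copy initializer requires page-aligned memory whose length is a multiple of the page size):

// Allocate page-aligned memory ourselves, then wrap it in a no-copy MTLBuffer.
// Pages are not always 4 KB on iOS, so query the size instead of hard-coding it,
// and check the posix_memalign return value in real code.
var memory: UnsafeMutablePointer<Void> = nil
let pageSize = Int(getpagesize())
let alignedByteCount = (byteCount + pageSize - 1) / pageSize * pageSize
posix_memalign(&memory, pageSize, alignedByteCount)
let noCopyBuffer = self.metalDevice.newBufferWithBytesNoCopy(memory,
                                                             length: alignedByteCount,
                                                             options: .StorageModeShared,
                                                             deallocator: { pointer, _ in free(pointer) })

Either way the memory ends up in the same shared storage; the no-copy variant just wraps memory you allocated yourself (see the comments below).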

My kernel function operates on roughly 16 million complex value structs and writes out an equal number of complex value structs to memory.

I've performed some experiments, and my Metal kernel's 'complex math section' executes in 0.003 seconds (yes!), but writing the result out to the buffer takes more than 0.05 seconds (no!). In one experiment I commented out the math part and just assigned zero to memory: 0.05 seconds. With the assignment commented out and the math added back: 0.003 seconds.

Is the shared memory slow in this case, or is there some other tip or trick I might try?

Additional detail

Test platforms

  • iPhone 6S - ~0.039 seconds per frame
  • iPad Air 2 - ~0.130 seconds per frame

The streaming data

Each update to the shader receives approximately 50000 complex numbers, each represented as a pair of floats in a struct.

struct ComplexNumber {
    float real;
    float imaginary;
};
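
On the CPU side the same layout can be mirrored one-for-one; a minimal Swift sketch, assuming (as here) two tightly packed 32-bit floats:

// Swift-side mirror of the shader struct: two 32-bit floats, 8 bytes, no padding,
// so raw byte copies between Swift memory and the shared MTLBuffer are layout-safe.
struct ComplexNumber {
    var real: Float = 0
    var imaginary: Float = 0
}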

Kernel signature

kernel void processChannelData(const device Parameters *parameters [[ buffer(0) ]],
                               const device ComplexNumber *inputSampleData [[ buffer(1) ]],
                               const device ComplexNumber *partAs [[ buffer(2) ]],
                               const device float *partBs [[ buffer(3) ]],
                               const device int *lookups [[ buffer(4) ]],
                               device float *outputImageData [[ buffer(5) ]],
                               uint threadIdentifier [[ thread_position_in_grid ]]);

All of the buffers contain (currently) unchanging data, except for inputSampleData, which receives the 50000 samples I'll be operating on. The other buffers each contain roughly 16 million values (128 channels x 130000 pixels). I perform some operations on each 'pixel', sum the complex result across channels, and finally take the absolute value of the complex number and assign the resulting float to outputImageData.

Dispatch

commandEncoder.setComputePipelineState(pipelineState)

commandEncoder.setBuffer(parametersMetalBuffer, offset: 0, atIndex: 0)
commandEncoder.setBuffer(inputSampleDataMetalBuffer, offset: 0, atIndex: 1)
commandEncoder.setBuffer(partAsMetalBuffer, offset: 0, atIndex: 2)
commandEncoder.setBuffer(partBsMetalBuffer, offset: 0, atIndex: 3)
commandEncoder.setBuffer(lookupsMetalBuffer, offset: 0, atIndex: 4)
commandEncoder.setBuffer(outputImageDataMetalBuffer, offset: 0, atIndex: 5)

let threadExecutionWidth = pipelineState.threadExecutionWidth
let threadsPerThreadgroup = MTLSize(width: threadExecutionWidth, height: 1, depth: 1)
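// Note: integer division drops any remainder, so this assumes numberOfPixels is an
// exact multiple of threadExecutionWidth; any leftover pixels would never be dispatched.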
let threadGroups = MTLSize(width: self.numberOfPixels / threadsPerThreadgroup.width, height: 1, depth: 1)

commandEncoder.dispatchThreadgroups(threadGroups, threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.endEncoding()
metalCommandBuffer.commit()
metalCommandBuffer.waitUntilCompleted()
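
One aside on the dispatch above: waitUntilCompleted() blocks the CPU for the entire GPU pass. A sketch of a non-blocking alternative, assuming a per-frame completion callback is acceptable:

import QuartzCore   // CACurrentMediaTime()

// Time the pass and collect the result asynchronously instead of blocking, so the
// CPU can begin staging the next frame's samples immediately.
let startTime = CACurrentMediaTime()
metalCommandBuffer.addCompletedHandler { _ in
    let seconds = CACurrentMediaTime() - startTime   // includes scheduling overhead
    print("frame took \(seconds) s")
    // Read outputImageDataMetalBuffer.contents() here, then kick off the next frame.
}
metalCommandBuffer.commit()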

GitHub example

I've written an example called Slow and put it up on GitHub. It seems the bottleneck is the write of the values into the input buffer, so I guess the question becomes: how do I avoid that bottleneck?
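
One candidate fix, since the storage is shared: skip the intermediate Swift Array entirely and bulk-copy the incoming samples straight into the MTLBuffer's memory. A sketch, where sourcePointer and sampleByteCount are hypothetical stand-ins for wherever the 50000 samples per frame actually come from:

// Copy the incoming frame into the shared input buffer in a single bulk copy
// rather than element-by-element writes through a Swift Array.
memcpy(inputSampleDataMetalBuffer.contents(), sourcePointer, sampleByteCount)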

Memory copy

I wrote up a quick test to compare the performance of various byte copying methods.
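
That test isn't reproduced here, but a minimal sketch of the comparison (timing memcpy against cblas_scopy from Accelerate; source and metalBuffer are assumed to exist, and a real test should average over many iterations):

import Accelerate
import QuartzCore

// source: a [Float] already holding one frame of data;
// metalBuffer: a shared MTLBuffer at least as large as source.
let count = source.count
let destination = UnsafeMutablePointer<Float>(metalBuffer.contents())

var start = CACurrentMediaTime()
memcpy(destination, source, count * sizeof(Float))
print("memcpy:      \(CACurrentMediaTime() - start) s")

start = CACurrentMediaTime()
cblas_scopy(Int32(count), source, 1, destination, 1)
print("cblas_scopy: \(CACurrentMediaTime() - start) s")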

Current Status

I've reduced the execution time to roughly 0.02 seconds, which doesn't sound like a lot, but it makes a big difference in the number of frames per second. Currently the biggest improvements are the result of switching to cblas_scopy().

Mexican answered 8/3, 2016 at 6:54 – Comments (19)
Can you characterize your memory bandwidth in GB/s? It's not obvious exactly how many bytes you're writing per frame. I would expect that a trivial kernel could write between 0.5 GB/s and 1.5 GB/s on an iPhone 6, and about double that on an iPhone 6s, based on empirical data. Folks might be able to help more if you mention which device you're testing on and what your target performance characteristics are. – Sylvia
@Sylvia I've added detail. Input: 128 * 51200 * 2 * sizeof(float), output: 130806 * sizeof(float) per frame. – Mexican
I wonder if using mmap + mlock would help. – Forty
@Forty I should point out that I have also tried let metalBuffer = self.metalDevice.newBufferWithLength(byteCount, options: .StorageModeShared). This makes no difference to performance. One would think that at least the Apple API call would know what to do for best performance. – Mexican
In a shared-memory architecture like iOS, there is actually no difference in the type or location of the memory you get in the two cases. The no-copy variant simply uses the memory you've allocated (which must be page-aligned) rather than doing the allocation internally. – Sylvia
@CameronLowellPalmer I am not a Metal expert. The Metal allocator would be my choice, but since there's no performance difference... – Forty
@Sylvia Agreed. I'm simply pointing out that I've tried both the no-copy posix_memalign route and newBufferWithLength. It should be just fine as long as it is 4k aligned. – Mexican
Fair enough. Also, not to be too pedantic, but pages aren't always 4k on iOS. – Sylvia
@Sylvia Fair enough. :) – Mexican
Your sample spends ~93% of its time populating the array. If you move the initialization step out of the block whose execution time you measure, the kernel itself executes in ~5 ms on an iPhone 6s. – Sylvia
Right, but the point is that I have to populate the array for each frame. – Mexican
That makes this sound more like a Swift performance problem than a Metal performance problem to me. What am I not understanding? – Sylvia
@Sylvia Maybe that is a correct assessment. As stated, the computation itself seems to be more or less instantaneous; it is memory reading and writing that seems to be my greatest performance bottleneck. Let me hardcode some data in the app and see if performance goes up. – Mexican
Also, I've thought about reducing the data volume by switching to the half data type, although that isn't directly supported by Swift. – Mexican
Have you looked at vDSP? – Chemisette
@Helium3 As in, did I try using vDSP for the calculations? Yes, that was my original attempt. The performance was much, much worse: at least 30x slower. – Mexican
@Sylvia I have been looking at faster ways of doing the memcpy, and it seems cblas_scopy() from Accelerate is 2x faster in a Playground. I should see how it performs on iOS. – Mexican
This might or might not be related: a long time ago I had to deal with a performance regression on OS X (desktop, circa 10.5) that was triggered by a library reading from video memory. Pumping data in and executing was fast, as expected, but reading back was simply not an important use case for the hardware/OS. Could something like that be at play here? – Gimcrack
@Gimcrack Hmmm. These days iDevices have unified memory, unlike desktops, sharing video and CPU memory; the caveat is that it needs to be set up correctly. So yes, the read-from-video-memory bandwidth problem is a real thing, but I don't believe it applies here. – Mexican

Reduce the size of the type

Originally, I was pre-converting the signed 16-bit integers to 32-bit floats, since ultimately that is how they'll be used. This is a case where performance forces you to store the values as 16 bits and cut your data size in half.
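
Swift has no native half type (as noted in the comments above), but if the 16-bit conversion happens on the CPU, Accelerate's vImage can convert planar 32-bit floats to 16-bit half floats in bulk. A sketch, with floats and metalBuffer as assumed names:

import Accelerate

// Convert a frame of 32-bit floats to 16-bit half floats in one vImage call,
// halving the bytes that have to reach the GPU; the kernel then reads half values.
var floats = [Float](count: 50000, repeatedValue: 0)   // example frame of samples
let count = floats.count

var source = vImage_Buffer(data: &floats,
                           height: 1,
                           width: vImagePixelCount(count),
                           rowBytes: count * sizeof(Float))
var destination = vImage_Buffer(data: metalBuffer.contents(),
                                height: 1,
                                width: vImagePixelCount(count),
                                rowBytes: count * sizeof(UInt16))
vImageConvert_PlanarFtoPlanar16F(&source, &destination, vImage_Flags(kvImageNoFlags))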

Objective-C over Swift

For the code dealing with the movement of data, you might choose Objective-C over Swift (a Warren Moore recommendation); the performance of Swift in these particular situations still isn't up to scratch. You can also try calling out to memcpy or similar functions. I've seen a couple of examples that looped over buffer pointers, and in my experiments this performed slowly.

Difficulty of testing

I really wanted to run some of the experiments comparing the various copying methods in a playground on my machine, but unfortunately that proved useless: the iOS device versions of the same experiments performed completely differently. One might think the relative performance would at least be similar, but I found that to be an invalid assumption as well. It would be really convenient if a playground could use the iOS device as its interpreter.

Mexican answered 3/4, 2016 at 8:37 – Comments (2)
If you're not only interested in iOS devices, you can still run your tests in a playground, but with an OS X target. – Crippling
@Marius I'm only interested in iOS, but good point nonetheless. – Mexican

You might get a large speedup by encoding your data as Huffman codes and decoding it on the GPU; see MetalHuffman. It depends on your data, though.

Arrhenius answered 30/12, 2017 at 5:35 – Comments (2)
That is interesting. Thanks for the tip! – Mexican
If you liked the Huffman GPU decoder: I just uploaded the source code for a Rice-based GPU decoder for Metal. It gets significantly better compression and executes nearly 2x as fast as the Huffman version. On GitHub: github.com/mdejong/MetalRice – Arrhenius
