Fastest way to compare two byte arrays in F#
Asked Answered
S

3

7

I am writing an application that uses the Kinect photo data and compares two frames against each other in F#. I am using Rob Miles Learn the Kinect Api page 74 as a guidepost but I am not using pointers and the performance is suffering. The byte from the Kinect frame is 1,228,800 bytes. I wrote the comparison like this:

member this.differenceCount(currentImageBytes) =
    if  previousImageBytes |> Seq.length = 0 then
        previousImageBytes <- currentImageBytes
        0
    else
        let bytes = Seq.zip previousImageBytes currentImageBytes
        let differenceCount = bytes |> Seq.mapi(fun i e -> i, e)
                                    |> Seq.filter(fun (i,e) -> i % 4 <> 0 )
                                    |> Seq.map snd
                                    |> Seq.filter(fun (p,c) -> p <> c)
                                    |> Seq.length
        previousImageBytes <- currentImageBytes
        differenceCount

When I run it, the screen lags because (I think) it is taking too long to process the arrays. In addition, the error rate is approaching 50%.

1) Am I approaching the problem wrong? 2) Is there a way to optimize my code to speed things up?

Scrope answered 28/10, 2014 at 10:4 Comment(9)
hi - this problem is not F# specific (see #43789) - indeed you can translate all of the answers ther e - but if you really need speed (and are on Windwos) than grap you the PInvoke answer from there (you can put it into a C# dll if you want) and use it - you'll never get something as fast inside .netConfident
I think that is the right track. I have the PInvoke working with memcmp but it is returning 0, 1 or -1. I need the total number of differences. Is there a native method that does that?Scrope
oh - I guess I have missed this - I don't know a native version for this no - it's a wild guess but I would bet that the best way to do something like this is to move it to GPU shaders (or write a low-level C/C++/Asm implementation)Confident
:-( I am not sure what a GPU shader is. I'll see if someone has written a C/C++ implementation that I can use like this: #15314146Scrope
A shader is a simple programm that runs on one of the shader units in a GPU - and as a GPU has very many of those you can get massive parallel processing power, and of course this is used to grank out your newest game - but I meant it as a kind of half-joke - right now I would look for a native implementation or go with what John proposedConfident
John's answer is too slow so I am looking for a native implementation. I can also use the unsafe C# code in the book (which I know works) alsoScrope
You could investigate writing an SSE2 algorithm in C++/CLI and then consuming that from F#Aguascalientes
Using a loop will avoid all of the extra allocations. The GC pressure by this will likely be a huge culprit...Uball
For further information re. GPU processing in F# (as mentioned by @Carsten), see the F# Foundation website: "GPU execution is a technique for high-performance financial, image processing and other data-parallel numerical programming."Teirtza
T
10

Your sequence mapping/filtering via tuples causes a lot of boxing overhead. The following example avoids the boxing and works in parallel, which (on my machine) is 50 times faster.

let parallelRanges =
    let kinectSize = 1228800
    let subSize = kinectSize / Environment.ProcessorCount
    let iMax = kinectSize - 1
    let steps = [| -1 .. subSize .. iMax |]
    steps.[steps.Length - 1] <- iMax
    steps |> Seq.pairwise |> Seq.toArray |> Array.map (fun (x, y) -> x + 1, y)

let countDiffs (prevBytes:byte[]) (curBytes:_[]) =
    let count (fromI, toI) =
        let rec aux i acc =
            if i > toI then acc
            elif i % 4 <> 0 && prevBytes.[i] <> curBytes.[i] then 
                aux (i + 1) (acc + 1)
            else aux (i + 1) acc
        aux fromI 0

    parallelRanges
    |> Array.Parallel.map count
    |> Array.sum
Teirtza answered 28/10, 2014 at 18:44 Comment(2)
Curious - how much faster is this without running in parallel? I'd expect it to still be quicker than the previous one.Uball
@Reed: When using only one core, it's still ca. 20x faster than the original. Using four cores, it's about 50x faster.Teirtza
H
3

Without some test data to profile and compare I would do something like

let len = Array.length previousImageBytes
let mutable count = 0
for i in 0 .. 4 .. (len-1) do
    if previousImageBytes.[i] <> currentImageBytes.[i] then
        count <- count+1

which has one pass over the data rather than 5 and avoids the seq functions which are slow.

Honestly answered 28/10, 2014 at 10:18 Comment(4)
One small change: for i in 0 .. 4 .. len - 1 doScrope
But still not fast enough. I might have to use another thread and just sample the frames.Scrope
This is very hard without some data to benchmark against. and fixed the first bugHonestly
FYI - The original only had 2 passes over the data (Seq.length in the if and the maps/filters, which happen in one pass)Uball
P
2

I have experimented with several different implementations of byte array compare in F# for a project I've been working on.

Broadly speaking, for this particular problem, I found that "more idiomatic F#" == "slower". So I ended up with this:

// this code is very non-F#-ish.  but it's much faster than the
// idiomatic version which preceded it.

let Compare (x:byte[]) (y:byte[]) =
    let xlen = x.Length
    let ylen = y.Length
    let len = if xlen<ylen then xlen else ylen
    let mutable i = 0
    let mutable result = 0
    while i<len do
        let c = (int (x.[i])) - int (y.[i])
        if c <> 0 then
            i <- len+1 // breaks out of the loop, and signals that result is valid
            result <- c
        else
            i <- i + 1
    if i>len then result else (xlen - ylen)

I've violated several guidelines of good functional programming. I've used if-then-else instead of match. I've got mutables. And I faked a break statement with a cheesy change to a loop index variable.

Nonetheless, this ended up being dramatically faster than anything more idiomatic. Even something as simple as a closure with tail recursion was slower. (The profiler indicates that a closure involves a memory allocation on the heap.)

The project containing this code is on GitHub:

https://github.com/ericsink/LSM

The commit history of fs/lsm.fs should contain a record of my other failed attempts.

Perpetuity answered 2/11, 2014 at 12:32 Comment(2)
Why don't you use recursion, breaking out of the loop by just returning?Zephan
Note that in this example, we want to find the number of elements that are not equal, not just the index of the first oneHonestly

© 2022 - 2024 — McMap. All rights reserved.