Fastest way to check if a byte array is all zeros

I have a byte[4096] and was wondering what the fastest way is to check if all values are zero?

Is there any way faster than doing:

byte[] b = new byte[4096];
b[4095] = 1;
for(int i=0;i<b.length;i++)
    if(b[i] != 0)
        return false; // Not Empty
Gurrola answered 23/5, 2014 at 8:25 Comment(18)
Probably not, but do you think this way is very slow? It checks 4k of memory, and who knows what it gets compiled to. Unless you're dealing with a lot of huge arrays, this is probably not a bottleneck.Korykorzybski
Besides multithreading (which almost certainly won't help here), no.Hexarchy
I don't know any Java, but can you point an integer pointer or better still a long long pointer to its start address and check 4 or 8 bytes in one go rather than 1 byte at a time?Tensible
@user432 Sorting is O(n lg n) best case, and a linear search is O(n), and sorting has additional complexity overhead. Sorting is almost certain to make things slower.Hexarchy
@MarkSetchell Unfortunately you can't combine elements in an array like that; each element has to be dereferenced separately. I'm guessing you're from a C/C++ background?Hexarchy
I am from a C background :-) Another option would be to add all the elements up and see if the total is zero because branches and tests for zero at each element really slow down modern CPUs - but that only works if the byte type can only store positive and not negative numbers...Tensible
@MarkSetchell Yeah, unfortunately Java doesn't support unsigned byte values... Wouldn't the branch predictor be able to negate most of the branching penalty though? Either you get an early return or you return as soon as the branch predictor messes up after finding the initial "pattern" of 0 values?Hexarchy
Benchmark the adding up method. If it is faster, you could use it to weed out all the arrays that are definitely non-zero in a quick time and then recheck more closely those that could be non-zero despite adding up to zero.Tensible
I don't suppose Java has a fast memcmp() function for comparing memory that you could compare your array with a pre-created zero 4k array? Ok, I'll shut up now!Tensible
@MarkSetchell I'm actually pretty curious about whether the if checks actually impose that much of a penalty, since the branch isn't taken most of the time. And unfortunately, no such function exists in Java. I suppose you could always make a JNI/JNA call (or maybe there's a method in sun.misc.UNSAFE?), but the overhead for such a call is probably too high to make it worth it.Hexarchy
@MarkSetchell Ran some tests on the adding method and it showed 10x faster with 1GB of data, but 2x slower with 4KBGurrola
Interesting and curious! Thank you for feeding back.Tensible
@MarkSetchell - You can't alias stuff in Java like you can in C. At best there are some weird NIO interfaces that allow odd operations on byte buffers, but they're impossible for mortal programmers to grok.Paulapauldron
@MarkSetchell: adding the elements doesn't sound like a good idea to me. You could get overflow. This is 4K of bytes .... unless you want to use a 4K long 'integer' you will have a false negative when the integer overflows (even with unsigned bytes). And if you do have a 4K long 'integer' then how do you compare that against 0 ..... isn't that where we started?Knowlton
@Knowlton I was thinking that I could quite comfortably add 4,096 values of up to 127 max each (making for a max total of 520,192) within an int which can hold up to 2,147,483,647.Tensible
@MarkSetchell adding the whole 4k array while the array has non-empty elements at the beginning takes much more time than early breakingCanuck
@LưuVĩnhPhúc I think the idea was to avoid branching, which could actually be faster than early breaking, depending on the code. I don't know enough about C/C++ to know whether cleverly written branchless code could outperform early-exit code, but in Java it seems that adding/ORing every element isn't the way to goHexarchy
@HotLicks, NIO interfaces literally CANNOT operate on byte[]/int[] or anything javaish (DirectBuffers are closer to C than Java). The closest hack would be Unsafe.getLong(byte[], offset) and masking out the last few bytes and that's nothing to do with NIO... and that would be if the JIT really fails on byte array access aside unroll.Lycian

I have rewritten this answer: I was originally summing all bytes, but that is incorrect since Java's bytes are signed, so I need to OR them instead. I have also changed the JVM warmup to be correct now.

Your best bet really is to simply loop over all values.

I suppose you have three major options available:

  1. OR all elements together and check the result.
  2. Do branchless comparisons.
  3. Do comparisons with a branch.

I don't know how good Java's low-level performance is when ORing bytes, but I do know that branched comparisons benefit from the CPU's branch predictor.

Therefore I expect the following to happen for:

byte[] array = new byte[4096];
for (byte b : array) {
    if (b != 0) {
        return false;
    }
}
  1. Relatively slow comparison in the first few iterations when the branch predictor is still seeding itself.
  2. Very fast branch comparisons due to branch prediction as every value should be zero anyway.

If it hits a non-zero value, then the branch predictor fails, causing a slow comparison, but at that point you are also at the end of your computation, as you want to return false either way. I think the cost of one failed branch prediction is an order of magnitude smaller than the cost of continuing to iterate over the array.

I furthermore believe that for (byte b : array) should be fine here: as far as I know it compiles directly into indexed array iteration, so there is no such thing as a PrimitiveArrayIterator causing extra method calls (as with iterating over a list) before the code gets inlined.
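For reference, the enhanced for loop should be equivalent to the following indexed form, so no iterator object is involved (this is my own sketch; the class and method names are made up for illustration):

```java
// Sketch: the indexed loop the enhanced for should compile down to.
// `ZeroCheck`/`isAllZero` are my own names, not from the question.
public class ZeroCheck {
    static boolean isAllZero(byte[] array) {
        for (int i = 0; i < array.length; i++) {
            if (array[i] != 0) {
                return false; // early exit on the first non-zero byte
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] b = new byte[4096];
        System.out.println(isAllZero(b)); // true
        b[4095] = 1;
        System.out.println(isAllZero(b)); // false
    }
}
```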

Update

I wrote my own benchmarks which give some interesting results... Unfortunately I couldn't use any of the existing benchmark tools as they are pretty hard to get installed correctly.

I also decided to group options 1 and 2 together, as they are effectively the same: in branchless code you usually OR everything together (dropping the per-element condition) and then check the final result. And since the condition here is a comparison against zero, ORing in a zero is a no-op anyway.

The code:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Consumer;
import java.util.stream.IntStream;

public class Benchmark {
    private void start() {
        //setup byte arrays
        List<byte[]> arrays = createByteArrays(700_000);

        //warmup and benchmark repeated
        arrays.forEach(this::byteArrayCheck12);
        benchmark(arrays, this::byteArrayCheck12, "byteArrayCheck12");

        arrays.forEach(this::byteArrayCheck3);
        benchmark(arrays, this::byteArrayCheck3, "byteArrayCheck3");

        arrays.forEach(this::byteArrayCheck4);
        benchmark(arrays, this::byteArrayCheck4, "byteArrayCheck4");

        arrays.forEach(this::byteArrayCheck5);
        benchmark(arrays, this::byteArrayCheck5, "byteArrayCheck5");
    }

    private void benchmark(final List<byte[]> arrays, final Consumer<byte[]> method, final String name) {
        long start = System.nanoTime();
        arrays.forEach(method);
        long end = System.nanoTime();
        double nanosecondsPerIteration = (end - start) * 1d / arrays.size();
        System.out.println("Benchmark: " + name + " / iterations: " + arrays.size() + " / time per iteration: " + nanosecondsPerIteration + "ns");
    }

    private List<byte[]> createByteArrays(final int amount) {
        Random random = new Random();
        List<byte[]> resultList = new ArrayList<>();
        for (int i = 0; i < amount; i++) {
            byte[] byteArray = new byte[4096];
            byteArray[random.nextInt(4096)] = 1;
            resultList.add(byteArray);
        }
        return resultList;
    }

    private boolean byteArrayCheck12(final byte[] array) {
        int sum = 0;
        for (byte b : array) {
            sum |= b;
        }
        return (sum == 0);
    }

    private boolean byteArrayCheck3(final byte[] array) {
        for (byte b : array) {
            if (b != 0) {
                return false;
            }
        }
        return true;
    }

    private boolean byteArrayCheck4(final byte[] array) {
        // the ORed result must equal 0 for an all-zero array
        return (IntStream.range(0, array.length).map(i -> array[i]).reduce(0, (a, b) -> a | b) == 0);
    }

    private boolean byteArrayCheck5(final byte[] array) {
        return IntStream.range(0, array.length).map(i -> array[i]).anyMatch(i -> i != 0);
    }

    public static void main(String[] args) {
        new Benchmark().start();
    }
}

The surprising results:

Benchmark: byteArrayCheck12 / iterations: 700000 / time per iteration: 50.18817142857143ns
Benchmark: byteArrayCheck3 / iterations: 700000 / time per iteration: 767.7371985714286ns
Benchmark: byteArrayCheck4 / iterations: 700000 / time per iteration: 21145.03219857143ns
Benchmark: byteArrayCheck5 / iterations: 700000 / time per iteration: 10376.119144285714ns

This shows that ORing is a whole lot faster than relying on the branch predictor, which is rather surprising, so I assume some low-level optimizations are being done.

As an extra I've included the stream variants, which I did not expect to be that fast anyhow.

Ran on a stock-clocked Intel i7-3770, 16GB 1600MHz RAM.

So I think the final answer is: it depends. It depends on how many times you are going to check the array consecutively. The "byteArrayCheck3" solution is always steady at 700~800ns.

Follow up update

Things actually took another interesting turn: it turns out the JIT was optimizing almost all of the calculations away because the resulting values were never used.

Thus I have the following new benchmark method:

private void benchmark(final List<byte[]> arrays, final Predicate<byte[]> method, final String name) {
    long start = System.nanoTime();
    boolean someUnrelatedResult = false;
    for (byte[] array : arrays) {
        someUnrelatedResult |= method.test(array);
    }
    long end = System.nanoTime();
    double nanosecondsPerIteration = (end - start) * 1d / arrays.size();
    System.out.println("Result: " + someUnrelatedResult);
    System.out.println("Benchmark: " + name + " / iterations: " + arrays.size() + " / time per iteration: " + nanosecondsPerIteration + "ns");
}

This ensures that the result of the benchmark cannot be optimized away. The major issue before was that byteArrayCheck12 was effectively dead code: the JIT noticed that the (sum == 0) result was never used and optimized the entire method away.

Thus we have the following new result (omitted the result prints for clarity):

Benchmark: byteArrayCheck12 / iterations: 700000 / time per iteration: 1370.6987942857143ns
Benchmark: byteArrayCheck3 / iterations: 700000 / time per iteration: 736.1096242857143ns
Benchmark: byteArrayCheck4 / iterations: 700000 / time per iteration: 20671.230327142857ns
Benchmark: byteArrayCheck5 / iterations: 700000 / time per iteration: 9845.388841428572ns

Hence we might finally conclude that branch prediction wins. It could, however, also be due to the early returns, as on average the offending byte is in the middle of the byte array. So it is time for another method that does not return early:

private boolean byteArrayCheck3b(final byte[] array) {
    int hits = 0;
    for (byte b : array) {
        if (b != 0) {
            hits++;
        }
    }
    return (hits == 0);
}

In this way we still benefit from the branch prediction, however we make sure that we cannot return early.

Which in turn gives us more interesting results again!

Benchmark: byteArrayCheck12 / iterations: 700000 / time per iteration: 1327.2817714285713ns
Benchmark: byteArrayCheck3 / iterations: 700000 / time per iteration: 753.31376ns
Benchmark: byteArrayCheck3b / iterations: 700000 / time per iteration: 1506.6772842857142ns
Benchmark: byteArrayCheck4 / iterations: 700000 / time per iteration: 21655.950115714284ns
Benchmark: byteArrayCheck5 / iterations: 700000 / time per iteration: 10608.70917857143ns

I think we can finally conclude that the fastest way is to combine early return with branch prediction, followed by ORing, followed by a full pass relying purely on branch prediction. I suspect that all of these operations are highly optimized in native code.
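One of the comments suggests a hybrid: OR bytes together in fixed-size blocks and branch only once per block, combining most of the branchless throughput with an early exit. A sketch of that idea (my own, unbenchmarked; `BLOCK_SIZE` and the names are made up):

```java
// Unbenchmarked sketch of the hybrid suggested in the comments:
// accumulate with OR inside each block, test once per block.
public class HybridCheck {
    private static final int BLOCK_SIZE = 128; // arbitrary choice

    static boolean isAllZeroHybrid(byte[] array) {
        int i = 0;
        while (i < array.length) {
            int end = Math.min(i + BLOCK_SIZE, array.length);
            int acc = 0;
            for (; i < end; i++) {
                acc |= array[i];  // branchless accumulation
            }
            if (acc != 0) {
                return false;     // only one branch per block
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] b = new byte[4096];
        System.out.println(isAllZeroHybrid(b)); // true
        b[300] = 1;
        System.out.println(isAllZeroHybrid(b)); // false
    }
}
```

Whether this beats the plain early-return loop would need measuring; it only skips work in multiples of `BLOCK_SIZE`.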

Update, some additional benchmarking using long and int arrays.

After seeing suggestions to use long[] and int[] I decided it was worth investigating. These attempts are not fully in line with the original question anymore, but they may still be interesting.

Firstly, I changed the benchmark method to use generics:

private <T> void benchmark(final List<T> arrays, final Predicate<T> method, final String name) {
    long start = System.nanoTime();
    boolean someUnrelatedResult = false;
    for (T array : arrays) {
        someUnrelatedResult |= method.test(array);
    }
    long end = System.nanoTime();
    double nanosecondsPerIteration = (end - start) * 1d / arrays.size();
    System.out.println("Result: " + someUnrelatedResult);
    System.out.println("Benchmark: " + name + " / iterations: " + arrays.size() + " / time per iteration: " + nanosecondsPerIteration + "ns");
}

Then I performed conversions from byte[] to long[] and int[] respectively before the benchmarks; it was also necessary to raise the maximum heap size to 10 GB.

List<long[]> longArrays = arrays.stream().map(byteArray -> {
    long[] longArray = new long[4096 / 8];
    ByteBuffer.wrap(byteArray).asLongBuffer().get(longArray);
    return longArray;
}).collect(Collectors.toList());
longArrays.forEach(this::byteArrayCheck8);
benchmark(longArrays, this::byteArrayCheck8, "byteArrayCheck8");

List<int[]> intArrays = arrays.stream().map(byteArray -> {
    int[] intArray = new int[4096 / 4];
    ByteBuffer.wrap(byteArray).asIntBuffer().get(intArray);
    return intArray;
}).collect(Collectors.toList());
intArrays.forEach(this::byteArrayCheck9);
benchmark(intArrays, this::byteArrayCheck9, "byteArrayCheck9");

private boolean byteArrayCheck8(final long[] array) {
    for (long l : array) {
        if (l != 0) {
            return false;
        }
    }
    return true;
}

private boolean byteArrayCheck9(final int[] array) {
    for (int i : array) {
        if (i != 0) {
            return false;
        }
    }
    return true;
}

Which gave the following results:

Benchmark: byteArrayCheck8 / iterations: 700000 / time per iteration: 259.8157614285714ns
Benchmark: byteArrayCheck9 / iterations: 700000 / time per iteration: 266.38013714285717ns

This path may be worth exploring if it is possible to get the bytes in such a format to begin with. However, when doing the transformation inside the benchmarked method, the times were around 2000 nanoseconds per iteration, so it is not worth it when you need to do the conversion yourself.

Virtually answered 23/5, 2014 at 8:47 Comment(31)
+1 for analysis. Math operations are actually only overloaded to operate on int and long values; any other types are promoted to ints, so byte addition would be just as fast as int addition. And you are correct that a for-each loop would get compiled to a regular loop.Hexarchy
@user3580294 Updated it with more analysis information which is actually quite surprising!Virtually
Interesting indeed! Looks like the branch predictor wasn't as effective as I thought. Got similar results on my machine. Also interesting is how the stream seems to reverse the results...Hexarchy
Also, possible optimization for the adding code: maybe OR the bytes together? Because as is, having positive and negative bytes can result in a false positive. And maybe OR is faster than adding...Hexarchy
@user3580294 Indeed, I forgot that the bytes in Java are signed... Will fix it later today once I have time.Virtually
Hmmm, seems that ORing is a bit (haha, bit) slower -- ~190 ns for 100000 iterations. Wonder why...Hexarchy
And shouldn't each warmup phase go immediately before the corresponding test? Tried moving things to do this, and the running time for byteArrayCheck12 roughly doubled. Nothing else really changed. Also, the position of the first non-zero byte value seriously affects running time -- putting a single non-zero byte at the end of the array caused the branch code to take ~1500ns, but having it at the beginning caused the branch code to take ~50ns. Having it in the middle produced pretty much the same as randomly placing data, which is expected.Hexarchy
@user3580294 I updated the answer with both considerations. It still shows that xorring is loads faster than the branch predictor.Virtually
XOR is ^. You're using regular bitwise OR (as you should, because XOR would give false positives). It's worth noting that the OR-based solution won't quit early if a nonzero element appears early, which could put a major dent in its apparent speed advantage. A hybrid approach may be worth considering, where you OR the bytes together and check the value every hundred iterations or so.Globule
Hmm, you know what... Could it be that the branch predictor limits performance? Integer addition/bit manipulation are superscalar instructions (is that the right term?) on modern processors, so the processor can work through many operations a second. Perhaps the processor can't check conditions fast enough to keep its pipelines full? I thought modern processors had a hardware instruction for comparison to zero, but I suppose it's still slower... Also, interesting question -- I'll be watching closely. It's certainly quite odd behavior...Hexarchy
@Globule How would you check every hundred iterations though? As while you might limit the number of checks, you would still every iteration need to check whether you are at the hundred iteration. I'll fix up the xor vs or.Virtually
@skiwi: You're already checking every iteration whether you're at the end of the array. Checking whether 100 iterations have passed would just be another counter to increment and test.Globule
@Globule I don't think I'm checking whether I'm at the end of the array currently though... Could you elaborate more on that?Virtually
@Globule another counter to increment and test is another addition instruction to execute, as well as another branch for the test. I'm not sure it would help speed that much unless the advantage from bailing out early overcomes the potential performance hit...Hexarchy
@skiwi: It's hidden in the for-each syntax, which translates to for (int i = 0; i < array.length; i++) {byte b = array[i]; ...}. For-each doesn't let you skip that i < array.length check.Globule
Why are you using IntStream.mapToLong(…) for processing bytes? Why aren’t you using IntStream.map(…)?Northward
@Northward I want both cases to be identical. As I am working with a long in the non-stream code, I'm also doing it in the stream code. Although I am not sure whether overflows actually affect the results when ORring data together.Virtually
Of course, using long for oring bytes in the loop version makes no sense as well.Northward
@Northward It's actually a remainder of the fact that I was using addition before, I'll fix it up.Virtually
@user3580294 A new versions is out! Which gets rid of the JIT "optimization".Virtually
I'm tempted to look at the code to see if the JIT does the clever optimizations of combining the byte accesses into larger ones or not. If not a JNI solution could be faster. byteArrayCheck3b is btw almost certainly changed to some conditional instructions so actually doesn't need branch prediction.Tala
You know, if I could vote up multiple times I'd spend the next hour or so doing just that. The amount of effort you put into this question is insane, and just voting up wouldn't show how much I (and probably the other readers) appreciate your effort. As for the results, it's quite interesting how the results are reversed now -- seems like the JIT compiler really knows how to do its thing. I suppose the branch predictor is some form of black magic after all. Also interesting is how even if the entire array is looped over, the additional OR instruction slows down evaluation that much.Hexarchy
Also, I'd consider reposting your other question with the updated benchmarking code, since as it is your update might not get much attention due to the presence of multiple upvoted answers. I'm not sure whether this would be allowed though, but considering the answer could very well be different given the new benchmarking code, I wouldn't be surprised if it was. In any case, these questions just seem like an endless rabbit hole that one could spend hours working on (not that that would be a waste of time at all).Hexarchy
@user3580294 You can offer a bounty for an exceptionally meritorious answer.Erick
Hmmm... I turned on JIT compilation statements, and it seems that the JIT compiler is still doing some optimization during the benchmarking phase. I tried replacing your warmup with just another call to benchmark(), ignoring the first call, but this didn't seem to totally eliminate JIT factors; however, there did seem to be far fewer compilations for the actual timing phase.Hexarchy
Results I got were: byteArrayCheck12(): warmup(1319.527142857143ns) actual(1339.98ns). byteArrayCheck3(): warmup(833.4471428571428ns) actual(824.3428571428572ns). byteArrayCheck3b(): warmup(1643.1328571428571ns) actual(1617.87ns). byteArrayCheck4(): warmup(26166.987142857142ns) actual(1856.6257142857144ns) (!!!!!!) byteArrayCheck5(): warmup(9498.395714285714ns) actual(9293.211428571429ns)Hexarchy
So in other words, your conclusion seems to mostly hold, as the streams swapped places. branch < branchless < forced-complete-branch < branchless-stream < branch-stream. I did notice that the branchless stream variant could be multithreaded. There was a portion where the JIT compiler ate up 6 or 7 cores, and while the warmup seemed to be single-threaded, the actual run seems to have a spike of some sort the second time around. Not sure what's going on.... Also not very sure if the JIT can store results from a prior run. Maybe if I have enough time to kill I can try a benchmarking framework...Hexarchy
@user3580294 It seems like most of my code is actually using some form of multithreading, which is rather interesting, as nowhere have I indicated it should use multithreading.Virtually
@Virtually From the testing I've done it appears the compiler is multithreaded, so depending on how much that runs that could be it; for example, I noticed a pretty big spike right before byteArrayCheck4() at the very least (that was where 6 or 7 cores were busy), and that was before any of your code was executed, so I'm guessing that that was the compiler.... It could be possible that streams are parallelized where possible even when the user doesn't explicitly say so, but I haven't heard of such behavior. Not too sure what's going on...Hexarchy
@Virtually jmh is not that difficult to use and would have saved you some trouble!Mcmillon
few notes: ByteBuffer.wrap(byteArray).asIntBuffer() is actually slow, the directBuffers.asXXX are fast. Using unseeded random for test is a bad idea as the data is different each time (and can skew the tests). Try using jmh for benchmarks instead.Lycian

This may not be the fastest or most memory performant solution but it's a one liner:

byte[] arr = randomByteArray();
assert Arrays.equals(arr, new byte[arr.length]);
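Building on the idea in the comment below, the all-zeros comparison array can be cached so it is not reallocated on each call. A sketch (the class/field names are my own, not from the answer):

```java
import java.util.Arrays;

// Sketch: cache the all-zero comparison array instead of allocating
// a fresh one per call; `ZEROS_4K` is a made-up name.
public class CachedZeroCompare {
    private static final byte[] ZEROS_4K = new byte[4096];

    static boolean isAllZero(byte[] arr) {
        // Use the cached array for the common 4096 case,
        // fall back to an allocation for other lengths.
        byte[] zeros = (arr.length == ZEROS_4K.length) ? ZEROS_4K : new byte[arr.length];
        return Arrays.equals(arr, zeros);
    }

    public static void main(String[] args) {
        byte[] a = new byte[4096];
        System.out.println(isAllZero(a)); // true
        a[0] = 5;
        System.out.println(isAllZero(a)); // false
    }
}
```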
Crap answered 22/10, 2015 at 22:51 Comment(2)
In fact, this may be the fastest solution since you can cache the all-zeros array you use for comparison, so you don't need to create it on each call. Unfortunately it seems that current JVMs (JDK 8 r111) don't implement Arrays.equals as an intrinsic. I measured times of about 0.7 cycles/element for the simple "loop with if check" version and 1.1 cycles/iteration for the Arrays.equals version. Both are pretty fast - it means that absent some type of vectorization, the loop version is averaging about ~1.5 loads per cycle, pretty close to the theoretical max of 2.Dorena
In fact, it looks like only Arrays.equals(char[], char[]) is an intrinsic in JDK8, but in JDK9, both byte and char versions are intrinsic.Dorena

For Java 8, you can simply use this:

public static boolean isEmpty(final byte[] data){
    return IntStream.range(0, data.length).parallel().allMatch(i -> data[i] == 0);
}
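Note that parallel() carries fork/join overhead, which for a 4 KB array may well exceed the scan itself; for small arrays a sequential stream is likely the better fit (my assumption, not benchmarked here):

```java
import java.util.stream.IntStream;

public class StreamCheck {
    // Sequential variant of the answer's method; for small arrays this
    // avoids fork/join overhead (assumption, not measured here).
    public static boolean isEmpty(final byte[] data) {
        return IntStream.range(0, data.length).allMatch(i -> data[i] == 0);
    }

    public static void main(String[] args) {
        byte[] data = new byte[4096];
        System.out.println(isEmpty(data)); // true
        data[123] = 7;
        System.out.println(isEmpty(data)); // false
    }
}
```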
Chambray answered 29/10, 2015 at 12:35 Comment(0)

I think that theoretically your way is the fastest way; in practice you might be able to make use of larger comparisons as suggested by one of the commenters (a 1-byte comparison takes one instruction, but so does an 8-byte comparison on a 64-bit system).

Also, in languages closer to the hardware (C and its variants) you can make use of something called vectorization, where a number of the comparisons/additions are performed simultaneously. Java still doesn't have native support for it, but based on this answer you might be able to get some use out of it.

Also, in line with the other comments, I would say that with a 4k buffer it's probably not worth the time to try to optimize it (unless it is called very often).

Hem answered 23/5, 2014 at 19:24 Comment(0)

Someone suggested checking 4 or 8 bytes at a time. You actually can do this in Java:

LongBuffer longBuffer = ByteBuffer.wrap(b).asLongBuffer();
while (longBuffer.hasRemaining()) {
    if (longBuffer.get() != 0) {
        return false;
    }
}
return true;

Whether this is faster than checking byte values is uncertain, since there is so much potential for optimization.
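One caveat (mine, not from the answer): asLongBuffer() silently drops trailing bytes when the array length is not a multiple of 8, so a general-purpose version needs to check the remainder separately. A sketch (the class/method names are made up):

```java
import java.nio.ByteBuffer;
import java.nio.LongBuffer;

// Sketch handling arrays whose length is not a multiple of 8.
// For a 4096-byte array the tail loop never runs.
public class LongViewCheck {
    static boolean isAllZero(byte[] b) {
        LongBuffer longBuffer = ByteBuffer.wrap(b).asLongBuffer();
        while (longBuffer.hasRemaining()) {
            if (longBuffer.get() != 0) {
                return false;
            }
        }
        // asLongBuffer() covers only (b.length / 8) * 8 bytes;
        // check any remaining tail bytes one by one.
        for (int i = (b.length / 8) * 8; i < b.length; i++) {
            if (b[i] != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] b = new byte[4099];        // not a multiple of 8
        b[4098] = 1;                      // non-zero byte in the tail
        System.out.println(isAllZero(b)); // false
    }
}
```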

Poucher answered 24/5, 2014 at 2:52 Comment(2)
I tried some benchmarking on this and can conclude that it is highly performant, but it cannot beat the byteArrayCheck3b code. It seems that ByteBuffer, etc. are not directly mapped to machine instructions in the JVM, hence it does not win here. Then again, I haven't tested this kind of code in C or C++ either.Virtually
Moreover, using an IntBuffer is actually faster than the LongBuffer.Virtually
