What is the fastest way to XOR A LOT of binary arrays in python?

Asked 31/5, 2018 at 1:3 Answered 31/5, 2018 at 1:43

I am tasked with calculating hamming distances between 1D binary arrays in two groups - a group of 3000 arrays and a group of 10000 arrays, and every array is 100 items(bits) long. So thats 3000x10000 HD calculations on 100 bit long objects.And all that must be done in at most a dozen minutes

Here's the best of what I came up with

#X - 3000 by 100 bool np.array 
#Y - 10000 by 100 bool np.array
hd = []
i=1
for x in X:
    print("object nr " + str(i) + "/" + str(len(X)))
    arr = np.array([x] * len(Y)) 
    C = Y^arr # just xor this array by all the arrays in the other group simultainously
    hd.append([sum(c) for c in C]) #add up all the bits to get the hamming distance
    i+=1

return np.array(hd)

And it's still going to take 1-1.5 hours for it to finish. How do I go about making this faster?

Exhibitionist answered 31/5, 2018 at 1:3 Comment(1)

Do you have to use Python? – Earwitness 31/5, 2018 at 1:7

You should be able to dramatically improve the summing speed by using numpy to perform it, rather than using a list comprehension and the built-in sum function (that takes no advantage of numpy vectorized operations).

Just replace:

hd.append([sum(c) for c in C])

with:

# Explicitly use uint16 to reduce memory cost; if array sizes might increase
# you can use uint32 to leave some wiggle room
hd.append(C.sum(1, dtype=np.uint16))

which, for a 2D array, will return a new 1D array where each value is the sum of the corresponding row (thanks to specifying it should operate on axis 1). For example:

>>> arr = np.array([[True,False,True], [False,False,True], [True, True,True]], dtype=np.bool)
>>> arr.sum(1, np.uint16)
array([ 2, 1, 3], dtype=uint16)

Since it performs all the work at the C layer in a single operation without type conversions (instead of your original approach that requires a Python level loop that operates on each row, then an implicit loop that, while at the C layer, must still implicitly convert each numpy value one by one from np.bool to Python level ints just to sum them), this should run substantially faster for the array scales you're describing.

Side-note: While not the source of your performance problems, there is no reason to manually maintain your index value; enumerate can do that more quickly and easily. Simply replace:

i=1
for x in X:
    ... rest of loop ...
    i+=1

with:

for i, x in enumerate(X, 1):
    ... rest of loop ...

and you'll get the same behavior, but slightly faster, more concise and cleaner in general.

Blynn answered 31/5, 2018 at 1:21 Comment(2)

Performance-wise, on an input array of 100 rows of 10000 elements each, [sum(c) for c in C)] took about 2 seconds to run on my machine; C.sum(1, dtype=np.uint16) took 737 µs, or about 0.037% of the original runtime (so 1-1.5 hours would reasonably be expected to drop to the 1-2 second range on this change alone). For 10000 rows of 100 elements each, the timings are longer, but still similar; the listcomp+sum solution took 2.11 seconds, the numpy solution took 914 µs. Either way, it should dramatically reduce runtime. – Blynn 31/5, 2018 at 1:31

Holy Guacamole, you Sir are a genius! It runs so fast I honestly didn't think it was even possible! – Exhibitionist 31/5, 2018 at 13:13

IIUC, you can use np.logical_xor and list comprehension:

result = np.array([[np.logical_xor(X[a], Y[b].T).sum() for b in range(len(Y))] 
                                                       for a in range(len(X))])

The whole operation runs in 7 seconds in my machine.

0:00:07.226645

Brave answered 31/5, 2018 at 1:36 Comment(0)

Just in case you are not limited to using Python, this is a solution in C++ using bitset:

#include <iostream>
#include <bitset>
#include <vector>
#include <random>
#include <chrono>

using real = double;

std::mt19937_64 rng;
std::uniform_real_distribution<real> bitset_dist(0, 1);
real prob(0.75);

std::bitset<100> rand_bitset()
{
    std::bitset<100> bitset;
    for (size_t idx = 0; idx < bitset.size(); ++idx)
    {
         bitset[idx] = (bitset_dist(rng) < prob) ? true : false;
    }
    return std::move(bitset);
}

int main()
{
    rng.seed(std::chrono::high_resolution_clock::now().time_since_epoch().count());

    size_t v1_size(3000);
    size_t v2_size(10000);

    std::vector<size_t> hd;
    std::vector<std::bitset<100>> vec1;
    std::vector<std::bitset<100>> vec2;
    vec1.reserve(v1_size);
    vec2.reserve(v2_size);
    hd.reserve(v1_size * v2_size); /// Edited from hd.reserve(v1_size);

    for (size_t i = 0; i < v1_size; ++i)
    {
        vec1.emplace_back(rand_bitset());
    }
    for (size_t i = 0; i < v2_size; ++i)
    {
        vec2.emplace_back(rand_bitset());
    }

    std::cout << "vec1 size: " << vec1.size() << '\n'
              << "vec2 size: " << vec2.size() << '\n';

    auto start(std::chrono::high_resolution_clock::now());
    for (const auto& b1 : vec1)
    {
        for (const auto& b2 : vec2)
        {
            /// Count only the bits that are set and store them
            hd.emplace_back((b1 ^ b2).count());
        }
    }
    auto time(std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start).count());

    std::cout << vec1.size() << " x " << vec2.size()
              << " xor operations on 100 bits took " << time << " ms\n";
    return 0;
}

On my machine, the whole operation (3000 x 10000) takes about 300 ms.

You could put this into a function, compile it into a library and call it from Python. Another option is to store the distances to a file and then read them in Python.

EDIT: I had the wrong size for the hd vector. Reserving the proper amount of memory reduces the operation to about 190 ms because relocations are avoided.

Earwitness answered 31/5, 2018 at 1:43 Comment(0)

Recommended topics

Hot tags