Simple Python Challenge: Fastest Bitwise XOR on Data Buffers

Challenge:

Perform a bitwise XOR on two equal-sized buffers. The buffers are required to be the Python str type, since this is traditionally the type for data buffers in Python. Return the result as a str. Do this as fast as possible.

The inputs are two 1-megabyte (2**20-byte) strings.

The challenge is to substantially beat my inefficient algorithm using Python or existing third-party Python modules (relaxed rules: or create your own module). Marginal increases are useless.

from os import urandom
from numpy import frombuffer,bitwise_xor,byte

def slow_xor(aa,bb):
    a=frombuffer(aa,dtype=byte)   # zero-copy, read-only view of the input str
    b=frombuffer(bb,dtype=byte)
    c=bitwise_xor(a,b)            # allocates a new result array
    r=c.tostring()                # copies the result into a new str
    return r

aa=urandom(2**20)
bb=urandom(2**20)

def test_it():
    for x in xrange(1000):
        slow_xor(aa,bb)
Kisner answered 22/1, 2010 at 19:8 Comment(10)
It sounds like Python maybe isn't the best language for whatever problem you're trying to solve. – Strove
I can assure you, it is. Python makes me curse the least. – Kisner
If you want speed in bitwise operations, the lower you go the better. You can do an XOR over an array in C in a few lines, and it will beat any Python implementation. – Cheremkhovo
Do you have the supporting code to make this module? – Kisner
@Max S. What level is the use of NumPy, in your opinion? – Bibliography
Why aren't you doing this in C, assembly, or GPGPU? – Hurtado
I can't believe this doesn't have any close (Not a question) votes... – Bicentennial
@Max S. A naive C implementation will do very poorly if the compiler doesn't auto-vectorize it, as we have seen in several examples here. – Kisner
It might be worth checking out what gcc 4.4 does when forcing loop unrolling and -O3 (which includes vectorization), or icc, or clang for that matter. Optimizing a "normal" loop into a vectorized one is nontrivial, though, because both misalignment and trailing elements (i.e. being unable to fill the last 128 bits) must be handled properly, and for small arrays the overhead of that is going to outweigh the benefit. Optimizations like using MOVNTDQ instead of MOVDQA are even more difficult (if not impossible, in the general case). – Halting
I'm voting to close this question as off-topic because it's a coding challenge and not a real question. – Abortion

First Try

Using scipy.weave and SSE2 intrinsics gives a first, marginal improvement. The first invocation is a bit slower, since the code needs to be compiled and loaded from disk; subsequent invocations are faster:

import numpy
import time
from os import urandom
from scipy import weave

SIZE = 2**20

def faster_slow_xor(aa,bb):
    b = numpy.fromstring(bb, dtype=numpy.uint64)  # fromstring copies, so b is writable
    numpy.bitwise_xor(numpy.frombuffer(aa,dtype=numpy.uint64), b, b)  # xor in place into b
    return b.tostring()

code = """
const __m128i* pa = (__m128i*)a;
const __m128i* pend = (__m128i*)(a + arr_size);
__m128i* pb = (__m128i*)b;
__m128i xmm1, xmm2;
while (pa < pend) {
  xmm1 = _mm_loadu_si128(pa); // must use unaligned access 
  xmm2 = _mm_load_si128(pb); // numpy will align at 16 byte boundaries
  _mm_store_si128(pb, _mm_xor_si128(xmm1, xmm2));
  ++pa;
  ++pb;
}
"""

def inline_xor(aa, bb):
    a = numpy.frombuffer(aa, dtype=numpy.uint64)
    b = numpy.fromstring(bb, dtype=numpy.uint64)
    arr_size = a.shape[0]
    weave.inline(code, ["a", "b", "arr_size"], headers = ['"emmintrin.h"'])
    return b.tostring()

Second Try

Taking the comments into account, I revisited the code to find out whether the copying could be avoided. It turns out I had read the documentation of the string object wrong, so here goes my second try:

support = """
#define ALIGNMENT 16
static void memxor(const char* in1, const char* in2, char* out, ssize_t n) {
    const char* end = in1 + n;
    while (in1 < end) {
       *out = *in1 ^ *in2;
       ++in1; 
       ++in2;
       ++out;
    }
}
"""

code2 = """
PyObject* res = PyString_FromStringAndSize(NULL, real_size);

const ssize_t tail = (ssize_t)PyString_AS_STRING(res) % ALIGNMENT;
const ssize_t head = (ALIGNMENT - tail) % ALIGNMENT;

memxor((const char*)a, (const char*)b, PyString_AS_STRING(res), head);

const __m128i* pa = (__m128i*)((char*)a + head);
const __m128i* pend = (__m128i*)((char*)a + real_size - tail);
const __m128i* pb = (__m128i*)((char*)b + head);
__m128i xmm1, xmm2;
__m128i* pc = (__m128i*)(PyString_AS_STRING(res) + head);
while (pa < pend) {
    xmm1 = _mm_loadu_si128(pa);
    xmm2 = _mm_loadu_si128(pb);
    _mm_stream_si128(pc, _mm_xor_si128(xmm1, xmm2));
    ++pa;
    ++pb;
    ++pc;
}
memxor((const char*)pa, (const char*)pb, (char*)pc, tail);
return_val = res;
Py_DECREF(res);
"""

def inline_xor_nocopy(aa, bb):
    real_size = len(aa)
    a = numpy.frombuffer(aa, dtype=numpy.uint64)
    b = numpy.frombuffer(bb, dtype=numpy.uint64)
    return weave.inline(code2, ["a", "b", "real_size"], 
                        headers = ['"emmintrin.h"'], 
                        support_code = support)

The difference is that the string is allocated inside the C code. There is no way to have it aligned at a 16-byte boundary as required by the SSE2 instructions, so the unaligned memory regions at the beginning and the end are XORed using byte-wise access (the memxor helper above).

The input data is handed in as numpy arrays in any case, because weave insists on copying Python str objects to std::strings. frombuffer doesn't copy, so this is fine, but the memory is not aligned at a 16-byte boundary, so we need to use _mm_loadu_si128 instead of the faster _mm_load_si128.

Instead of using _mm_store_si128, we use _mm_stream_si128, which makes sure that any writes are streamed to main memory as soon as possible; this way, the output array does not use up valuable cache lines.
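
For illustration, here is the head/tail arithmetic of the code above as a toy Python model (my sketch; addr stands in for the address returned by PyString_AS_STRING(res)):

ALIGNMENT = 16

def split_for_sse(addr, size):
    # 'tail' is derived from the *start* address; because size is a
    # multiple of 16, the end address has the same misalignment, so
    # exactly 'tail' bytes remain after the last full 16-byte block.
    tail = addr % ALIGNMENT
    head = (ALIGNMENT - tail) % ALIGNMENT   # bytes until the first aligned block
    middle = size - head - tail             # handled by the SSE2 loop
    assert middle % ALIGNMENT == 0
    return head, middle, tail

print(split_for_sse(0x7f3a19, 2**20))  # -> (7, 1048560, 9)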

Timings

As for the timings: the slow_xor entry in the first edit of this answer referred to my improved version (inline bitwise xor, uint64); I removed that confusion. slow_xor now refers to the code from the original question. All timings are done for 1000 runs.

  • slow_xor: 1.85s (1x)
  • faster_slow_xor: 1.25s (1.48x)
  • inline_xor: 0.95s (1.95x)
  • inline_xor_nocopy: 0.32s (5.78x)

The code was compiled using gcc 4.4.3 and I've verified that the compiler actually uses the SSE instructions.

Halting answered 1/2, 2010 at 20:7 Comment(7)
Thanks! It's probably possible to speed it up a little bit by using prefetching (the _mm_prefetch intrinsic), but I wasn't able to produce any spectacular results with it. – Halting
I've found that for linear walks through an array, prefetch really doesn't help. Intel processors are already smart about prefetching for linear walks. – Smyth
Is there a simple way to eliminate the frombuffer and tostring calls? Those seem to be the largest bottleneck now. Promising approach, though. – Kisner
That code really bothered me, so I gave it another try ;-) It doesn't really have much to do with Python anymore, though :-( – Halting
I just did an XOR over an array in C, and my implementation did 1000 runs in 0.03s... an order of magnitude faster. – Bodine
The ratio slow_xor/inline_xor_nocopy is 11 (2020 usec vs. 172 usec per iteration). The ratio slow_xor/inline_xor is 1.6; slow_xor/faster_slow_xor is 1.5 (Python 2.6.4, x86_64 GNU/Linux). – Curlpaper
I've posted the results of a comparison of all presented approaches: #2120261 – Curlpaper

Performance comparison: numpy vs. Cython vs. C vs. Fortran vs. Boost.Python (pyublas)

| function               | time, usec | ratio | type         |
|------------------------+------------+-------+--------------|
| slow_xor               |       2020 |   1.0 | numpy        |
| xorf_int16             |       1570 |   1.3 | fortran      |
| xorf_int32             |       1530 |   1.3 | fortran      |
| xorf_int64             |       1420 |   1.4 | fortran      |
| faster_slow_xor        |       1360 |   1.5 | numpy        |
| inline_xor             |       1280 |   1.6 | C            |
| cython_xor             |       1290 |   1.6 | cython       |
| xorcpp_inplace (int32) |        440 |   4.6 | pyublas      |
| cython_xor_vectorised  |        325 |   6.2 | cython       |
| inline_xor_nocopy      |        172 |  11.7 | C            |
| xorcpp                 |        144 |  14.0 | boost.python |
| xorcpp_inplace         |        122 |  16.6 | boost.python |

To reproduce the results, download http://gist.github.com/353005 and type make (to install the dependencies, type: sudo apt-get install build-essential python-numpy python-scipy cython gfortran; the dependencies for Boost.Python and pyublas are not included, because they require manual intervention to work).

The Fortran xor_$type_sig() subroutines are generated from this template:

! xorf.f90.template
subroutine xor_$type_sig(a, b, n, out)
  implicit none
  integer, intent(in)             :: n
  $type, intent(in), dimension(n) :: a
  $type, intent(in), dimension(n) :: b
  $type, intent(out), dimension(n) :: out

  integer i
  forall(i=1:n) out(i) = ieor(a(i), b(i))

end subroutine xor_$type_sig

It is used from Python as follows:

import xorf # extension module generated from xorf.f90.template
import numpy as np

def xor_strings(a, b, type_sig='int64'):
    assert len(a) == len(b)
    a = np.frombuffer(a, dtype=np.dtype(type_sig))
    b = np.frombuffer(b, dtype=np.dtype(type_sig))
    return getattr(xorf, 'xor_'+type_sig)(a, b).tostring()

xorcpp_inplace() (Boost.Python, pyublas):

xor.cpp:

#include <inttypes.h>
#include <algorithm>
#include <boost/lambda/lambda.hpp>
#include <boost/python.hpp>
#include <pyublas/numpy.hpp>

namespace { 
  namespace py = boost::python;

  template<class InputIterator, class InputIterator2, class OutputIterator>
  void
  xor_(InputIterator first, InputIterator last, 
       InputIterator2 first2, OutputIterator result) {
    // `result` may alias `first`, but not any of the other input iterators
    namespace ll = boost::lambda;
    (void)std::transform(first, last, first2, result, ll::_1 ^ ll::_2);
  }

  template<class T>
  py::str 
  xorcpp_str_inplace(const py::str& a, py::str& b) {
    const size_t alignment = std::max(sizeof(T), 16ul);
    const size_t n         = py::len(b);
    const char* ai         = py::extract<const char*>(a);
    char* bi         = py::extract<char*>(b);
    char* end        = bi + n;

    if (n < 2*alignment) 
      xor_(bi, end, ai, bi);
    else {
      assert(n >= 2*alignment);

      // applying Marek's algorithm to align
      const ptrdiff_t head = (alignment - ((size_t)bi % alignment))% alignment;
      const ptrdiff_t tail = (size_t) end % alignment;
      xor_(bi, bi + head, ai, bi);
      xor_((const T*)(bi + head), (const T*)(end - tail), 
           (const T*)(ai + head),
           (T*)(bi + head));
      if (tail > 0) xor_(end - tail, end, ai + (n - tail), end - tail);
    }
    return b;
  }

  template<class Int>
  pyublas::numpy_vector<Int> 
  xorcpp_pyublas_inplace(pyublas::numpy_vector<Int> a, 
                         pyublas::numpy_vector<Int> b) {
    xor_(b.begin(), b.end(), a.begin(), b.begin());
    return b;
  }
}

BOOST_PYTHON_MODULE(xorcpp)
{
  py::def("xorcpp_inplace", xorcpp_str_inplace<int64_t>);     // for strings
  py::def("xorcpp_inplace", xorcpp_pyublas_inplace<int32_t>); // for numpy
}

It is used from Python as follows:

import os
import xorcpp

a = os.urandom(2**20)
b = os.urandom(2**20)
c = xorcpp.xorcpp_inplace(a, b) # it calls xorcpp_str_inplace()
Curlpaper answered 2/4, 2010 at 10:15 Comment(0)

Here are my results for Cython:

slow_xor   0.456888198853
faster_xor 0.400228977203
cython_xor 0.232881069183
cython_xor_vectorised 0.171468019485

Vectorising in Cython shaves about 25% off the for loop on my computer. However, more than half the time is spent building the Python string (the return statement); I don't think that extra copy can be avoided (legally), since the array may contain null bytes.

The illegal way would be to pass in a Python string and mutate it in place, which would double the speed of the function.
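
A legal middle ground (my sketch, not part of the timings above; it is essentially the out-parameter idea used in another answer below) is to have the caller pass in a writable bytearray, so no new str needs to be built:

from numpy import frombuffer, bitwise_xor, uint64

def xor_into(aa, bb, out):
    # out is a caller-owned bytearray of len(aa) bytes; mutating it is
    # legal, unlike mutating an immutable str
    a = frombuffer(aa, dtype=uint64)
    b = frombuffer(bb, dtype=uint64)
    c = frombuffer(out, dtype=uint64)  # writable view into the bytearray
    bitwise_xor(a, b, c)
    return out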

xor.py

from time import time
from os import urandom
from numpy import frombuffer,bitwise_xor,byte,uint64
import pyximport; pyximport.install()
import xor_

def slow_xor(aa,bb):
    a=frombuffer(aa,dtype=byte)
    b=frombuffer(bb,dtype=byte)
    c=bitwise_xor(a,b)
    r=c.tostring()
    return r

def faster_xor(aa,bb):
    a=frombuffer(aa,dtype=uint64)
    b=frombuffer(bb,dtype=uint64)
    c=bitwise_xor(a,b)
    r=c.tostring()
    return r

aa=urandom(2**20)
bb=urandom(2**20)

def test_it():
    t=time()
    for x in xrange(100):
        slow_xor(aa,bb)
    print "slow_xor  ",time()-t
    t=time()
    for x in xrange(100):
        faster_xor(aa,bb)
    print "faster_xor",time()-t
    t=time()
    for x in xrange(100):
        xor_.cython_xor(aa,bb)
    print "cython_xor",time()-t
    t=time()
    for x in xrange(100):
        xor_.cython_xor_vectorised(aa,bb)
    print "cython_xor_vectorised",time()-t

if __name__=="__main__":
    test_it()

xor_.pyx

cdef char c[1048576]
def cython_xor(char *a,char *b):
    cdef int i
    for i in range(1048576):
        c[i]=a[i]^b[i]
    return c[:1048576]

def cython_xor_vectorised(char *a,char *b):
    cdef int i
    for i in range(131072):   # 1048576 bytes = 131072 8-byte words
        (<unsigned long long *>c)[i]=(<unsigned long long *>a)[i]^(<unsigned long long *>b)[i]
    return c[:1048576]
Apoenzyme answered 1/2, 2010 at 0:22 Comment(4)
Somewhere between Cython and the C compiler, there is a failure to vectorize into SIMD instructions. A shame. Still, it's a good demonstration of a very simple optimization, and it was the first answer (at the time of posting) to remove the expensive buffer type conversion operations. – Kisner
@user213060, I'm sure there would be a decent speedup from casting to 64- or 128-bit types for the xor. I don't know enough Cython to do that, though. – Apoenzyme
The ratio slow_xor/cython_xor_vectorised is 6.2 (2020 usec vs. 325 usec for 2**20 size). The ratio slow_xor/cython_xor is 1.6 (Python 2.6.4, x86_64 GNU/Linux). – Curlpaper
I've posted the results of a comparison of all presented approaches: #2120261 – Curlpaper

An easy speedup is to use a larger 'chunk':

def faster_xor(aa,bb):
    a=frombuffer(aa,dtype=uint64)
    b=frombuffer(bb,dtype=uint64)
    c=bitwise_xor(a,b)
    r=c.tostring()
    return r

with uint64 also imported from numpy, of course. Using timeit, I measure this at 4 milliseconds per call, vs. 6 milliseconds for the byte version.
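
For reference, a minimal way to reproduce that measurement (a sketch, assuming the aa/bb setup from the question):

from timeit import timeit

# average seconds per call, over 100 runs
print(timeit(lambda: faster_xor(aa, bb), number=100) / 100)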

Babism answered 22/1, 2010 at 19:39 Comment(3)
A marginal improvement, though, especially for smaller buffers. – Kisner
+1 for a good suggestion. I timed it and it's much faster than the original; see my comment on Ira Baxter's answer below. – Bibliography
This requires the buffer length to be a multiple of 8, but the challenge is for 2**20 bytes, so no special handling of other cases is required. – Apoenzyme

Your problem isn't the speed of NumPy's XOR, but rather all of the buffering and data-type conversions. Personally, I suspect the point of this post may really have been to brag about Python, because what you are doing here is processing three gigabytes of data in timeframes on par with non-interpreted languages, which are inherently faster.

The code below shows that even on my humble computer, Python can XOR "aa" (1 MB) and "bb" (1 MB) into "c" (1 MB) one thousand times (3 GB in total) in under two seconds. Seriously, how much more improvement do you want? Especially from an interpreted language! 80% of the time was spent calling "frombuffer" and "tostring". The actual XOR-ing is completed in the other 20% of the time. At 3 GB in 2 seconds, you would be hard-pressed to improve on that substantially even using memcpy in C.

In case this was a real question, and not just covert bragging about Python, the answer is to code so as to minimize the number, amount, and frequency of your type conversions, such as "frombuffer" and "tostring". The actual XOR-ing is lightning fast already.

from os import urandom
from numpy import frombuffer,bitwise_xor,byte,uint64

def slow_xor(aa,bb):
    a=frombuffer(aa,dtype=byte)
    b=frombuffer(bb,dtype=byte)
    c=bitwise_xor(a,b)
    r=c.tostring()
    return r

bb=urandom(2**20)
aa=urandom(2**20)

def test_it():
    for x in xrange(1000):
        slow_xor(aa,bb)

def test_it2():
    a=frombuffer(aa,dtype=uint64)
    b=frombuffer(bb,dtype=uint64)
    for x in xrange(1000):
        c=bitwise_xor(a,b)
    r=c.tostring()  # note: converted back to a str only once, after the loop

test_it()
print 'Slow Complete.'
#6 seconds
test_it2()
print 'Fast Complete.'
#under 2 seconds

Anyway, the "test_it2" above accomplishes exactly the same amount of XOR-ing as "test_it" does, but in 1/5 the time. A 5x speed improvement should qualify as "substantial", no?

Squad answered 31/1, 2010 at 21:6 Comment(1)
Isn't it perhaps because test_it2 runs c.tostring once, while test_it runs it a thousand times? – Orpington

The fastest bitwise XOR is "^". I can type that much quicker than "bitwise_xor" ;-)

Bloodyminded answered 2/2, 2010 at 23:44 Comment(1)
Perl has ^ for strings! – Leong

Python 3 has int.from_bytes and int.to_bytes, thus:

x = int.from_bytes(b"a" * (1024*1024), "big")
y = int.from_bytes(b"b" * (1024*1024), "big")
(x ^ y).to_bytes(1024*1024, "big")

It's faster than the surrounding I/O, so it's kinda hard to test just how fast it is; it looks like 0.018..0.020 s on my machine. Strangely, "little"-endian conversion is a little faster.
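
A rough way to pin that number down (my sketch, Python 3 assumed):

import os
from timeit import timeit

a = os.urandom(2**20)
b = os.urandom(2**20)

def int_xor(a=a, b=b, n=2**20):
    # round-trip: bytes -> int, xor, int -> bytes
    return (int.from_bytes(a, "little") ^ int.from_bytes(b, "little")).to_bytes(n, "little")

print(timeit(int_xor, number=100) / 100)  # seconds per call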

CPython 2.x has the underlying function _PyLong_FromByteArray; it is not exported, but it is accessible via ctypes:

In [1]: import ctypes
In [2]: p = ctypes.CDLL(None)
In [3]: p["_PyLong_FromByteArray"]
Out[3]: <_FuncPtr object at 0x2cc6e20>

Python 2 details are left as an exercise for the reader.

Beefwood answered 6/5, 2014 at 12:55 Comment(0)

If you want to do fast operations on array data types, then you should try Cython (cython.org). If you give it the right declarations, it should be able to compile down to pure C code.

Retroflexion answered 23/1, 2010 at 21:22 Comment(1)
It compiles to machine code. It always gets converted to C code before compiling. – Reld

How badly do you need the answer to be a string? Note that the c.tostring() method has to copy the data in c to a new string, since Python strings are immutable (and c is mutable). Python 2.6 and 3.1 have a bytearray type, which acts like str (bytes in Python 3.x) except for being mutable.

Another optimization is using the out parameter to bitwise_xor to specify where to store the result.

On my machine I get

slow_xor (int8): 5.293521 (100.0%)
outparam_xor (int8): 4.378633 (82.7%)
slow_xor (uint64): 2.192234 (41.4%)
outparam_xor (uint64): 1.087392 (20.5%)

with the code at the end of this post. Notice in particular that the method using a preallocated buffer is twice as fast as creating a new object (when operating on 8-byte (uint64) chunks). This is consistent with the slower method doing two operations per chunk (xor + copy) versus the faster method's one (just the xor).

Also, FWIW, a ^ b is equivalent to bitwise_xor(a,b), and a ^= b is equivalent to bitwise_xor(a, b, a).
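
A quick illustration of those equivalences (my example; note that the in-place form needs a writable array, e.g. one built on a bytearray):

from numpy import frombuffer, bitwise_xor, uint8

a = frombuffer(bytearray(b"abcd1234"), dtype=uint8)  # writable view
b = frombuffer(b"wxyz5678", dtype=uint8)             # read-only view

assert (bitwise_xor(a, b) == (a ^ b)).all()  # same result, both allocate

bitwise_xor(a, b, a)  # stores the result in a, like a ^= b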

So, 5x speedup without writing any external modules :)

from time import time
from os import urandom
from numpy import frombuffer,bitwise_xor,byte,uint64

def slow_xor(aa, bb, ignore, dtype=byte):
    a=frombuffer(aa, dtype=dtype)
    b=frombuffer(bb, dtype=dtype)
    c=bitwise_xor(a, b)
    r=c.tostring()
    return r

def outparam_xor(aa, bb, out, dtype=byte):
    a=frombuffer(aa, dtype=dtype)
    b=frombuffer(bb, dtype=dtype)
    c=frombuffer(out, dtype=dtype)
    assert c.flags.writeable
    return bitwise_xor(a, b, c)

aa=urandom(2**20)
bb=urandom(2**20)
cc=bytearray(2**20)

def time_routine(routine, dtype, base=None, ntimes = 1000):
    t = time()
    for x in xrange(ntimes):
        routine(aa, bb, cc, dtype=dtype)
    et = time() - t
    if base is None:
        base = et
    print "%s (%s): %f (%.1f%%)" % (routine.__name__, dtype.__name__, et,
        (et/base)*100)
    return et

def test_it(ntimes = 1000):
    base = time_routine(slow_xor, byte, ntimes=ntimes)
    time_routine(outparam_xor, byte, base, ntimes=ntimes)
    time_routine(slow_xor, uint64, base, ntimes=ntimes)
    time_routine(outparam_xor, uint64, base, ntimes=ntimes)
Preinstruct answered 2/4, 2010 at 20:55 Comment(0)

You could try the symmetric difference operation of Sage's bitsets.

http://www.sagemath.org/doc/reference/sage/misc/bitset.html

Burkitt answered 22/1, 2010 at 19:51 Comment(0)

The fastest way (speed-wise) is to do what Max S. recommended: implement it in C.

The supporting code for this task should be rather simple to write: it is just one function in a module, creating a new string and doing the xor. That's all. Once you have implemented one module like that, it is easy to reuse the code as a template. Or you can take an extension module somebody else has written as a simple example and throw out everything not needed for your task.

The only really complicated part is getting the reference counting right. But once you see how it works, it is manageable, especially since the task at hand is really simple (allocate some memory and return it; the parameters are not to be touched, refcount-wise).

Lapsus answered 29/1, 2010 at 22:1 Comment(0)
