Parallelizing numpy.sort

I need to sort uint64 arrays of length 1e8-1e9, which is one of the performance bottlenecks in my current project. I recently updated to numpy v2.0, in which the sorting algorithm is significantly optimized; testing on my hardware, it is about 5x faster than numpy v1.26. But numpy's sorting algorithm currently cannot utilize multi-core CPUs, even though it uses SIMD.

I tried to parallelize it and sort multiple np.arrays at the same time. One possible approach is numba prange, but numba has always had poor support for numpy sorting: numba.jit even slows np.sort down, and numba v0.60.0 fails to follow up on numpy v2.0's sorting optimizations (https://github.com/numba/numba/issues/9611). The alternative is cython prange, but cython does not allow creating Python objects under nogil. Is there a way to sort a numpy.array in parallel using cython or otherwise? If I used a C++ parallel sorting library, would it be faster than numpy's own sort, taking into account the overhead of data type conversions?

import numpy as np

arr = np.random.randint(0, 2**64, int(3e8), dtype='uint64')
sorted_arr = np.sort(arr)  # single-threaded np.sort takes 4 seconds (numpy v2.0.0)
Celom answered 21/6 at 17:19 Comment(17)
Cython prange will probably work. You can reacquire the gil with with gil:. Assuming that numpy.sort releases the GIL internally, this will probably work reasonably efficiently.Tungsten
How exactly are you proposing to use prange or numba to speed up np.sort? The only way I can imagine is re-implementing the sort with some parallel sorting algorithm by hand. Is that what you are asking?Candida
@Candida The method is to split the array into multiple parts and sort them separately, then merge the results (O(n)). The merging part is easy to implement efficiently using numba with numpy. In fact, I can store the data to be sorted in different arrays when I generate them, so I just need to be able to sort them in parallel.Celom
I do not think it is possible to call Numpy's sort in parallel with Cython/Numba because of the GIL. Indeed, Numba parallel loops need the GIL to be released, and no object can be manipulated inside them. Numba's sort does not actually call Numpy functions but its own implementation, which does not use the GIL nor create any Python object (which would require the GIL to be held). The Numba sequential implementation is inefficient anyway. AFAIK, prange in Cython requires the GIL to be disabled, so creating Python objects is not possible in the parallel loop, preventing np.sort from being called...Wheresoever
I tried tricks like using views and calling the in-place view.sort() so as not to create Python objects, but Cython explicitly states: "Memoryview slices can only be shared in parallel sections" (when copying views) and "Converting to Python object not allowed without gil" (when calling view.sort() on a shared view).Wheresoever
Given (cit.) "which is one of the performance bottlenecks in my current project" - what is your expectation of speedup? Best be quantitative. As you claim 4 [s] are the bottleneck, what amount of seconds is not? Amdahl's Law (in its overhead-respecting formulation) is rather clear on this - you simply cannot pay more (in add-on costs) than you will ever get in return - speedups << 1.0 are then actually slowdowns. Given your computing fabric probably has no more than 2~3 memory-I/O channels, do not expect silicon to move data faster across this principal blocker, whatever the amount of CPU cores.Tomlin
I'm trying to fully parallelize a certain part of my project, where the data generation phase can be accelerated using numba prange. And the sorting phase takes up about 50% of the total time, so it would be very helpful if the data sorting could be made twice as fast through parallelization. If sorting is not parallelizable, then it becomes more obvious as a performance bottleneck.Celom
@Tungsten I tried using cython prange to call numpy sort in parallel with the GIL reacquired, and it proved not to be faster; numpy sort does not automatically release the GIL.Celom
I succeeded in writing an implementation which is faster in parallel, by writing the code in C++ and calling it from Python. The merge idea is really not efficient (1.5x faster on 6 cores), mainly because the merge is very slow (especially the last one, which is sequential). With a SIMD parallel quick-sort, I get 2x faster code, which seems not great, but the operation is memory-bound once parallelized. An optimal parallelization might theoretically reach a 3.5x speed-up, but no more (on my machine).Wheresoever
I barely know how to use C++, so implementing a parallel algorithm of my own in C++ might not work very well @JérômeRichardCelom
@JérômeRichard It's true that merging may not be fast. But for my project, the data has a large percentage of duplicates, and I need to de-duplicate after sorting, which will reduce the array length to 10-20%, at which point the merge will be much fasterCelom
Is the sort only for removing duplicates? A sort may not be optimal for this, especially for such large arrays. See The most efficient way rather than using np.setdiff1d and np.in1d, to remove common values of 1D arrays with unique values. It is for 32-bit ints, but can certainly be adapted for 64-bit ones (though it will likely not be as good as for 32-bit ones). Such a strategy can be used to avoid expensive merges. Filtering numbers in each chunk before merging can indeed also help, though not so much if duplicates are uniformly spread across chunks.Wheresoever
Sorting and de-duplication are both necessary. Sorting is for easy searching later @JérômeRichardCelom
If you only search items by value and do not care about the order, then hash-maps are often better. There are no hash-maps in Numpy, unfortunately. Python's built-in one would be so expensive in your case (huge memory overhead + the GC one) that it is not even worth trying. If the order matters, then sorting is certainly mandatory.Wheresoever
I succeeded in reaching a very good 5x speed-up on 6 cores using a sort written in C++, combining an optimized radix-sort with a custom data structure and the SIMD sorting library used internally by Numpy. I think the implementation is close to optimal, but it is quite complex. There is probably no chance for this to be feasible in Cython (it is not possible in Numba without huge dirty hacks). The radix idea can certainly be used to write a less efficient version somehow in Numba/Cython, but I expect it to be pretty complex anyway.Wheresoever
I will save many of these data files to disk and look them up later. To avoid reading the entire file into memory each time, a binary search is currently used to read only a very small amount of data @JérômeRichardCelom
The 5x is amazingly fast; I would love to learn from your C++ implementation. Maybe you can write an answer @JérômeRichardCelom

This answer shows why a pure-Python, Numba, or Cython implementation certainly cannot be used to write a (reasonably simple) efficient implementation (this summarizes the comments). It then provides a C++ version which can be called from CPython. The provided version is fast independently of the Numpy version used (so Numpy 2.0 is not required).


Why it is certainly not possible directly with Numba/Cython/pure Python

I do not think it is possible to call Numpy's sort in parallel from Cython/Numba because of the GIL and many additional issues.

Regarding Numba, parallel loops need the GIL to be released, and no Python object can be manipulated inside them. The Numba sorting function does not actually call Numpy functions but its own implementation, which does not use the GIL nor create any Python objects (both of which require the GIL to be held). The Numba sequential implementation is inefficient anyway. While one can try to re-implement a parallel sort from scratch, Numba's parallel features are too limited for the resulting implementation to be really fast or reasonably simple (let alone both). Indeed, it is limited to a simple parallel for loop called prange (no atomics, critical sections, barriers, TLS storage, etc.).

Regarding Cython, prange requires the GIL to be disabled, so creating Python objects is not possible in the parallel loop, preventing np.sort from being called... Cython provides more parallel features than Numba, so re-implementing a parallel sort from scratch seems possible at first glance. However, in practice, it is really complicated (if even possible) to write a fast implementation because of many issues and open/unknown bugs. Here are the issues I found while trying to write such a code:

  • OpenMP barriers are not yet available and there is no sane (portable) replacement;
  • critical sections are also not yet available, so one needs to use manual locks (instead of just #pragma omp critical);
  • arrays must be allocated and freed manually using malloc and free in parallel sections (bug-prone and resulting in more complex code);
  • it is not possible to create views in parallel sections (only outside);
  • Cython does not yet seem to support Numpy 2.0 well, causing many compilation errors and also runtime ones (see this post, which seems related);
  • the documentation of the OpenMP functions is rather limited (parts are simply missing);
  • variables of a prange-based loop cannot be reused in a range-based loop outside the prange loop.

I also tried to use a ThreadPoolExecutor to call some optimized Cython/Numba functions and circumvent the aforementioned limitations of the two, but it resulted in a very slow implementation (slower than just calling np.sort), mainly because of the GIL (nearly no speed-up) and Numpy overhead (mainly temporary arrays, and more precisely page faults). A minimal sketch of this approach follows.
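
For reference, here is a minimal sketch of such a ThreadPoolExecutor attempt (illustrative only: the chunk count and the re-sort-based merge are assumptions, and it is expected to be slow for the reasons given above):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def threaded_sort(arr, workers=6):
    # Sort chunks in threads: np.sort holds the GIL, so the chunks end up
    # being sorted mostly one after another and there is almost no speed-up.
    chunks = np.array_split(arr, workers)
    with ThreadPoolExecutor(workers) as pool:
        sorted_chunks = list(pool.map(np.sort, chunks))
    # Final merge: re-sorting the concatenation is simple but costly;
    # a true k-way merge would be O(n) but is slow in pure Python.
    return np.sort(np.concatenate(sorted_chunks))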


Efficient parallel C++ solution

We can write an efficient parallel C++ code performing the following steps (a small sequential prototype of the idea is sketched just after the list):

  • split the input array into N slices
  • perform a bucket sort on each slice in parallel, so we get M buckets for each slice
  • merge the resulting buckets so as to get M global buckets from the M x N partial ones
  • sort the M buckets in parallel using a SIMD-optimized sort -- this can be done with the x86simdsort C++ library (used internally by Numpy), though it only works on x86-64 CPUs
  • concatenate the M sorted buckets to get the final array (they are already in global order, since the buckets partition the value range)
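
To make the strategy concrete, here is a small sequential NumPy prototype of the same partition-then-sort idea (illustrative only; the C++ code below performs the partitioning and per-bucket sorting in parallel and in place):

import numpy as np

def bucket_sort_prototype(arr, radix_bits=4):
    # Partition by the top radix_bits bits, sort each bucket independently,
    # then concatenate: the buckets are already in global order.
    top = arr >> np.uint64(64 - radix_bits)
    buckets = [np.sort(arr[top == b]) for b in range(1 << radix_bits)]
    return np.concatenate(buckets)

arr = np.random.randint(0, 2**64, 1_000_000, dtype='uint64')
assert np.array_equal(bucket_sort_prototype(arr), np.sort(arr))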

We need to write a BucketList data structure to append numbers to a variable-size container. This is basically a linked list of fixed-size chunks. Note that a growing std::vector is not efficient because each resize puts too much pressure on memory (and std::deque operations are so slow that it is even slower).

Here is the resulting C++ code:

// File: wrapper.cpp
// Assume x86-simd-sort has been cloned in the same directory and built
#include "x86-simd-sort/lib/x86simdsort.h"
#include <array>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <forward_list>
#include <mutex>
#include <omp.h>


template <typename T, size_t bucketMaxSize>
struct BucketList
{
    using Bucket = std::array<T, bucketMaxSize>;

    std::forward_list<Bucket> buckets;
    uint32_t bucketCount;
    uint32_t lastBucketSize;

    BucketList() : bucketCount(1), lastBucketSize(0)
    {
        buckets.emplace_front();
    }

    void push_back(const T& value)
    {
        if (lastBucketSize == bucketMaxSize)
        {
            buckets.emplace_front();
            lastBucketSize = 0;
            bucketCount++;
        }

        Bucket* lastBucket = &*buckets.begin();
        (*lastBucket)[lastBucketSize] = value;
        lastBucketSize++;
    }

    size_t size() const
    {
        return (size_t(bucketCount) - 1lu) * bucketMaxSize + lastBucketSize;
    }

    size_t bucketSize(size_t idx) const
    {
        return idx == 0 ? lastBucketSize : bucketMaxSize;
    }
};


extern "C" void parallel_sort(int64_t* arr, size_t size)
{
    static const size_t bucketSize = 2048;
    static const size_t radixBits = 11;
    static const size_t bucketCount = 1 << radixBits;

    struct alignas(64) Slice
    {
        int64_t* data = nullptr;
        size_t size = 0;
        size_t global_offset = 0;
        size_t local_offset = 0;
        std::mutex mutex;
    };

    std::array<Slice, bucketCount> slices;

    #pragma omp parallel
    {
        // Thread-local buckets: each thread scatters its share of the input
        // into one bucket list per top-bits value.
        std::array<BucketList<int64_t, bucketSize>, bucketCount> tlsBuckets;

        #pragma omp for nowait
        for (size_t i = 0; i < size; ++i)
        {
            constexpr uint64_t signBit = uint64_t(1) << uint64_t(63);
            const uint64_t idx = (uint64_t(arr[i]) ^ signBit) >> (64 - radixBits);
            tlsBuckets[idx].push_back(arr[i]);
        }

        // Accumulate the per-bucket sizes across threads (serialized).
        #pragma omp critical
        for (size_t i = 0; i < bucketCount; ++i)
            slices[i].size += tlsBuckets[i].size();

        #pragma omp barrier

        // One thread computes the global offset of each bucket slice.
        #pragma omp single
        {
            size_t offset = 0;

            for (size_t i = 0; i < bucketCount; ++i)
            {
                Slice& slice = slices[i];
                slice.data = &arr[offset];
                slice.global_offset = offset;
                offset += slice.size;
            }
        }

        // Copy the thread-local buckets into their global slices.
        for (size_t i = 0; i < bucketCount; ++i)
        {
            Slice& slice = slices[i];
            size_t local_offset;
            size_t local_offset_end;

            {
                std::scoped_lock lock(slice.mutex);
                local_offset = slice.local_offset;
                slice.local_offset += tlsBuckets[i].size();
                local_offset_end = slice.local_offset;
            }

            uint32_t bucketListId = 0;

            for(const auto& kv : tlsBuckets[i].buckets)
            {
                const size_t actualBucketSize = tlsBuckets[i].bucketSize(bucketListId);
                memcpy(&slice.data[local_offset], &kv[0], sizeof(int64_t) * actualBucketSize);
                local_offset += actualBucketSize;
                bucketListId++;
            }
        }

        #pragma omp barrier

        // Sort each contiguous bucket slice in parallel with the SIMD sort.
        #pragma omp for schedule(dynamic)
        for (size_t i = 0; i < bucketCount; ++i)
            x86simdsort::qsort(&slices[i].data[0], slices[i].size);
    }
}

A simple header can be written if you want to call this implementation from Cython (though it can be complicated due to the aforementioned Cython/Numpy-2.0 compatibility issue). Here is an example:

// File: wrapper.h
#include <stdlib.h>
#include <stdint.h>
void parallel_sort(int64_t* arr, size_t size);

You can compile the code with Clang using the following command lines on Linux:

clang++ -O3 -fopenmp -c wrapper.cpp -fPIC -g
clang wrapper.o -o wrapper.so -fopenmp --shared -Lx86-simd-sort/build -lx86simdsortcpp

The following command may also be needed so that the x86-simd-sort library can be found at runtime once cloned and built:

export LD_LIBRARY_PATH=x86-simd-sort/build:$LD_LIBRARY_PATH

You can finally use the fast sorting function from Python code. I personally use ctypes because it worked directly with no issues (except when the code is compiled with GCC, for strange unknown reasons). Here is an example:

import numpy as np

import ctypes
lib = ctypes.CDLL('./wrapper.so')
parallel_sort = lib.parallel_sort
parallel_sort.argtypes = [ctypes.c_voidp, ctypes.c_size_t]
parallel_sort.restype = None

fullCheck = False

print('Generating...')
a = np.random.randint(0, (1<<63) - 1, 1024*1024**2)
if fullCheck: b = a.copy()

print('Benchmark...')
#%time a.sort()
%time parallel_sort(a.ctypes.data, a.size)  # %time is an IPython magic; use time.perf_counter() in a plain script

print('Full check...' if fullCheck else 'Check...')
if fullCheck:
    b.sort()
    assert np.array_equal(b, a)
else:
    assert np.all(np.diff(a) >= 0)

Notes and performance results

Note that this requires a lot of memory to run the test, especially if fullCheck is set to True.

Note that the C++ code is optimized for sorting huge arrays (>1e8 items). For smaller arrays, the memory consumption will be significant compared to their size, and the current code will even be slow for small arrays (<1e5 items). You can tune the constants/parameters to your needs. For tiny arrays, you can directly call the x86-simd-sort library. Once tuned properly, it should be faster than np.sort for arrays of any size. I strongly advise you to tune the parameters for your specific input and target CPU, especially radixBits. The current code/parameters are optimized for mainstream Intel CPUs (not recent big-little Intel ones nor AMD ones) and positive numbers. If you know there are only positive numbers in the input, you can skip the most-significant bit (the sign bit).

Here are the resulting timings on my 6-core i5-9600KF CPU (with Numpy 2.0):

np.sort:             19.3 s
Proposed C++ code:    4.3 s

The C++ parallel implementation is 4.5 times faster than the sequential optimized Numpy one.

Note that I did not test the code extensively, but basic checks like the one in the provided Python script have reported no error so far (even on negative numbers, apparently).

Note that this sort is efficient only if the highest bits of the sorted numbers actually differ (this is the downside of bucket/radix sorts). Ideally, numbers should be uniformly distributed and use all the highest bits. If this is not the case, the buckets will be unbalanced, resulting in lower scalability. In the worst case, only one bucket is used, resulting in a serial implementation. You can track the highest set bit to mitigate this issue. More complex approaches are required when there are rare big outliers (e.g. an order-preserving remapping). A quick way to check this balance is sketched below.
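
For instance, here is a quick NumPy sketch to estimate how balanced the buckets would be for a given input (illustrative; radix_bits mirrors the radixBits constant in the C++ code above, and the check assumes uint64 inputs):

import numpy as np

def bucket_imbalance(arr, radix_bits=11):
    # Histogram the top radix_bits bits: a ratio close to 1.0 means the
    # buckets are well balanced; much larger values mean poor scalability.
    idx = (arr >> np.uint64(64 - radix_bits)).astype(np.intp)
    counts = np.bincount(idx, minlength=1 << radix_bits)
    return counts.max() / counts.mean()

arr = np.random.randint(0, 2**64, 10_000_000, dtype='uint64')
print(bucket_imbalance(arr))  # ~1.0 for uniformly distributed inputs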

Elseelset answered 23/6 at 17:44 Comment(8)
Thank you so much for your help! I'm thrilled to report that I was able to achieve nearly 4x speedup on my machine, which is fantastic. However it seems that the sorting performance is somewhat unstable, fluctuating significantly based on parameters and array length. May need more debugging. Your assistance has been invaluable!Celom
You are welcome. What CPU do you use out of curiosity? To track which part is less stable, you can add #pragma omp barrier and add timers (omp_get_wtime) in #pragma omp master sections.Wheresoever
Thank you for the suggestion! my CPU is a 4-core i7-1165G7 on my laptop, it gets 2-3x faster on average, with the best being close to 4x at one timeCelom
Additionally I find that algorithms such as sorting or de-duplication that are implemented by myself suffer a sharp drop in execution efficiency when they hit a RAM bottleneck. But numpy implementations suffer relatively much less. Do you know anything about this? What are some directions to optimize the performance of one's implementation when RAM is constrained?Celom
Bucket sorts tend to be less stable because of cache effects and latency (including the RAM one). This part does not scale well since it is memory-bound. However, on my machine it is faster, even sequentially, for large arrays. The SIMD algorithm used in Numpy 2.0 (i.e. the one of the x86-simd-sort library) tends to put more pressure on the memory hierarchy, making it slower. More specifically, it tends to be bound by the memory bandwidth. Meanwhile, I think memory latency is the issue for bucket sorts (at least with a large number of buckets). It is quite hard for CPUs to hide this latency.Wheresoever
I expect that reducing the number of buckets and using multiple steps can help make the computation more stable (eg. by fitting in the L1 cache, which has a lower latency) but also more bandwidth-bound (due to more steps). I think one can mitigate the impact on bandwidth by applying the multiple steps in a hierarchical, cache-friendly way (like in quick-sort), but it is pretty complex to implement. This is why I chose this approach: it is a good trade-off between simplicity and performance, though it is a bit aggressive for the L1/L2 cache and rather brittle performance-wise.Wheresoever
Your CPU architecture is not so different from mine. You have a wider L2 cache, so I think it might be faster to increase radixBits to fit the L2 size (the L2 latency is significantly higher, which is bad for the bucket sort, but the L2 is much wider, which is good for making the bucket sort faster -- IDK which one wins, so the best is to test it). It has a limited power budget and turbo boost, which can impact the stability of performance (especially for parallel codes). I wonder if you have LPDDR RAM. AFAIK, it has a higher latency and it is harder for a CPU to saturate it (than classical DDR).Wheresoever
Put shortly: you can tune the parameters (easy), try to write a hierarchical cache-friendly 2-step bucket sort (hard), check if the work is well balanced between buckets and threads (medium), and possibly optimize the scheduling/partitioning based on this.Wheresoever

You can use the algorithms from the C++ standard library [Microsoft standard library, thx @DavidW hahaha] in Cython:

import cython
cimport cython

ctypedef fused real: # overloading, very beautiful how Cython does that 
    cython.short
    cython.ushort
    cython.int
    cython.uint
    cython.long
    cython.ulong
    cython.longlong
    cython.ulonglong
    cython.float
    cython.double


cdef extern from "<ppl.h>" namespace "concurrency":
    cdef void parallel_sort[T](T first, T last) nogil # import has to be nogil

@cython.boundscheck(False) # much faster, but be careful! Don't go beyond the last index!
@cython.wraparound(False) # much faster, but negative index == UB
def parallelsort(real[:] a):
    parallel_sort(&a[0], &a[0] + a.shape[0])  # pointer one past the end, without an out-of-bounds index
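
A hypothetical usage example (the module name parsort is an assumption for the compiled Cython code above; ppl.h ships with MSVC, so this is Windows-only):

import numpy as np
from parsort import parallelsort  # hypothetical name for the compiled Cython module

a = np.random.randint(0, 2**64, 10_000_000, dtype='uint64')
parallelsort(a)  # sorts in place via concurrency::parallel_sort
assert np.all(a[:-1] <= a[1:])  # verify ascending order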

There are a lot more, like parallel_radixsort (40 times faster than NumPy, but no floats), parallel_buffered_sort and so on.
Microsoft has some comprehensible tutorials online: https://learn.microsoft.com/en-us/cpp/parallel/concrt/parallel-algorithms?view=msvc-170 They can all easily be used in Cython.

By the way, Cython already supports OpenMP with native locks - the real deal, not that crappy "with gil" thing. However, it is not documented; I found it by coincidence. Here is an example: Parallelising a 'for' loop in Cython: beyond prange

"with (no)gil:" is only worth it when you know that you have only a few results. Acquiring the GIL is very, very expensive - so expensive that it sometimes is much slower than a single CPU!! Native OpenMP is much better; you can create your own algorithm. It is not hard at all.

Another option is using some old-school C - qsort in Cython. Also very fast, but not parallel: https://mcmap.net/q/1778847/-using-qsort-in-cython-to-get-a-sorting-index-permutation

Roadstead answered 22/7 at 7:56 Comment(10)
That parallel sort is a Microsoft-specific extension and not part of the standard library. It's a good solution, but it'll be Windows-only.Tungsten
Thank you, @Tungsten . To be honest, I didn't know that. But there is lots of stuff on GitHub that works on Linux as well: github.com/DragonSpit/ParallelAlgorithms It shouldn't be hard to make a wrapper in Cython.Roadstead
I tried wrapping concurrency::parallel_sort and concurrency::parallel_radixsort as dynamic link libraries and using ctypes to call them in Python. I observed that the CPU usage reached 100%, proving that they fully utilized parallelism. Unfortunately, however, even the faster parallel_sort only achieved one-tenth of the speed of numpy.sort in numpy 2.0Celom
Nice! If NumPy sorts faster than C++ then the problem is solved. I have to try NumPy 2.0. I am still a little afraid to break all my code. Lately, I have been experimenting with ZIG - awesome! It's also possible to wrap its dlls in Python!Roadstead
It looks like my previous test didn't make sufficient use of compiler optimisations. My latest test results are that calling concurrency::parallel_radixsort in python is about 30% faster than numpy.sort @RoadsteadCelom
However, concurrency::parallel_radixsort requires extra O(n) space and does not perform as well as numpy sort when memory is limited. parallel_sort, which does not require extra space, is always slower than numpy sort.Celom
But it is still a great result for Numpy. I have just found my first experiments with radixsort, C++ and a jerry-rigged overloaded function in Python: github.com/hansalemaos/cppradixsort Back then, I didn't use Cython; I exported a DLL. It was 4.96 ms vs. 63.8 msRoadstead
I tried your library and it was about the same speed as numpy 2.0 sort for 1e8 uint64 sortingCelom
I brought up a new question for the update on my issue: #78844196Celom
Thanks for calling it a "library" hahaha It was my first experiment with C++. It's crazy how Numpy has improved. I don't know if you can make it any faster. Maybe Zig? The first time I tested it (RGB color search), it was between 2 times (one color) and 12 times (9 colors) faster than C. But I have no idea if that is the case when sorting.Roadstead
