What is the fastest way to map group names of numpy array to indices?

I'm working with a 3D point cloud from Lidar. The points are given as a numpy array that looks like this:

points = np.array([[61651921, 416326074, 39805], [61605255, 416360555, 41124], [61664810, 416313743, 39900], [61664837, 416313749, 39910], [61674456, 416316663, 39503], [61651933, 416326074, 39802], [61679969, 416318049, 39500], [61674494, 416316677, 39508], [61651908, 416326079, 39800], [61651908, 416326087, 39802], [61664845, 416313738, 39913], [61674480, 416316668, 39503], [61679996, 416318047, 39510], [61605290, 416360572, 41118], [61605270, 416360565, 41122], [61683939, 416313004, 41052], [61683936, 416313033, 41060], [61679976, 416318044, 39509], [61605279, 416360555, 41109], [61664837, 416313739, 39915], [61674487, 416316666, 39505], [61679961, 416318035, 39503], [61683943, 416313004, 41054], [61683930, 416313042, 41059]])

I'd like to keep my data grouped into cubes of size 50*50*50, so that every cube preserves some hashable index and the numpy indices of the points it contains. To obtain the split, I assign cubes = points // 50, which outputs:

cubes = np.array([[1233038, 8326521, 796], [1232105, 8327211, 822], [1233296, 8326274, 798], [1233296, 8326274, 798], [1233489, 8326333, 790], [1233038, 8326521, 796], [1233599, 8326360, 790], [1233489, 8326333, 790], [1233038, 8326521, 796], [1233038, 8326521, 796], [1233296, 8326274, 798], [1233489, 8326333, 790], [1233599, 8326360, 790], [1232105, 8327211, 822], [1232105, 8327211, 822], [1233678, 8326260, 821], [1233678, 8326260, 821], [1233599, 8326360, 790], [1232105, 8327211, 822], [1233296, 8326274, 798], [1233489, 8326333, 790], [1233599, 8326360, 790], [1233678, 8326260, 821], [1233678, 8326260, 821]])

My desired output looks like this:

{(1232105, 8327211, 822): [1, 13, 14, 18], 
(1233038, 8326521, 796): [0, 5, 8, 9], 
(1233296, 8326274, 798): [2, 3, 10, 19], 
(1233489, 8326333, 790): [4, 7, 11, 20], 
(1233599, 8326360, 790): [6, 12, 17, 21], 
(1233678, 8326260, 821): [15, 16, 22, 23]}
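
For reference, here is a minimal (and deliberately unoptimized) sketch of how this mapping can be built on the small example above with np.unique; the keys are converted to plain Python ints for readability:

import numpy as np
from collections import defaultdict

# reference sketch only -- not tuned for speed
uniq, inv = np.unique(cubes, axis=0, return_inverse=True)
expected = defaultdict(list)
for row_idx, grp_idx in enumerate(inv.ravel()):  # ravel() guards against inverse-shape differences across numpy versions
    expected[tuple(uniq[grp_idx].tolist())].append(row_idx)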

My real point cloud contains up to a few hundred million 3D points. What is the fastest way to do this kind of grouping?

I've tried a variety of solutions. Here is a comparison of time consumption, assuming the number of points is around 20 million and the number of distinct cubes is around 1 million:

Pandas [tuple(elem) -> np.array(dtype=int64)]

import pandas as pd
print(pd.DataFrame(cubes).groupby([0,1,2]).indices)
#takes 9sec

Defaultdict [elem.tobytes() or tuple -> list]

# thanks @abc:
from collections import defaultdict

result = defaultdict(list)
for idx, elem in enumerate(cubes):
    result[elem.tobytes()].append(idx) # takes 20.5sec
    # result[elem[0], elem[1], elem[2]].append(idx) #takes 27sec
    # result[tuple(elem)].append(idx) # takes 50sec

numpy_indexed [int -> np.array]

# thanks @Eelco Hoogendoorn for his library
import numpy_indexed as npi

values = npi.group_by(cubes).split(np.arange(len(cubes)))
result = dict(enumerate(values))
# takes 9.8sec

Pandas + dimensionality reduction [int -> np.array(dtype=int64)]

# thanks @Divakar for showing numexpr library:
import numexpr as ne
def dimensionality_reduction(cubes):
    #cubes = cubes - np.min(cubes, axis=0) #in case some coords are negative 
    cubes = cubes.astype(np.int64)
    s0, s1 = cubes[:,0].max()+1, cubes[:,1].max()+1
    d = {'s0':s0,'s1':s1,'c0':cubes[:,0],'c1':cubes[:,1],'c2':cubes[:,2]}
    c1D = ne.evaluate('c0+c1*s0+c2*s0*s1',d)
    return c1D
cubes = dimensionality_reduction(cubes)
result = pd.DataFrame(cubes).groupby([0]).indices
# takes 2.5 seconds
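
If the original 3D cube coordinates are ever needed back from these 1D keys, the mixed-radix encoding can be reversed. A sketch (the helper name dimensionality_expansion is mine, and it assumes non-negative coordinates and that the s0, s1 extents computed inside dimensionality_reduction are returned or recomputed from the same data):

def dimensionality_expansion(c1D, s0, s1):
    # inverse of c0 + c1*s0 + c2*s0*s1
    c0 = c1D % s0
    c1 = (c1D // s0) % s1
    c2 = c1D // (s0 * s1)
    return np.stack([c0, c1, c2], axis=-1)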

It's possible to download the cubes.npz file here and use the command

cubes = np.load('cubes.npz')['array']

to check performance time.

Olsewski asked 8/12, 2019 at 21:2 Comment(12)
Do you always have the same number of indices in each list in your result? – Bello
Yes, it is always the same: 983234 distinct cubes for all the solutions mentioned above. – Olsewski
It is unlikely that such a simple Pandas solution would be beaten by another simple approach, as a great deal of effort has been spent on optimizing it. A Cython-based approach could probably come close to it, but I doubt it would outperform it. – Carnelian
@Olsewski Do you have to have the final output as a dictionary, or would it be okay to have the groups and their indices as two outputs? – Vocalic
@Carnelian numpy_indexed only gets close to it too. I guess that's right. I currently use pandas for my classification processes. – Olsewski
@Vocalic Good point! I'd prefer an output as a dictionary where the keys are tuples of 3 coordinates, but it's not necessary. Let it be any dictionary that keeps the second output as values. Btw, I consider the command dict(enumerate(values)) essentially free because it takes only 0.15 seconds on my laptop. – Olsewski
@Olsewski That was the point Mykola Zotko was asking about: whether every group/cube would have exactly the same number of indices, which is 4 in your small dataset. But apparently in your np.load('cubes.npz')['array'] case, using np.unique reveals that the number is not constant but varies. – Vocalic
@Olsewski Also, I am not sure what "values" means for you. In the expected dictionary output, I see the keys as the 3D coordinates and the indices as the values. Can you clarify what you said earlier about it not being necessary? Is there any other output format, dictionary or non-dictionary, that could work for you? From the timings, I see the conversion to a dictionary being the bottleneck, so clarification on this could influence the solutions in a big way. – Vocalic
I mean it's not necessary that the keys are tuples of 3 coordinates. They can be anything that is hashable. – Olsewski
@Olsewski Appreciate the bounty! So, did either of the posted solutions solve the question for you? – Vocalic
I haven't had time to review all the answers in detail yet, but I'm going to post a summary in a week. I appreciate the variety of new methods and packages used here that I hadn't met before, even more than the efficiency of the code. It will definitely help me solve similar problems in the future. Many thanks to you. – Olsewski
My current solution uses the dimensionality reduction achieved with the 'c0+c1*s0+c2*s0*s1'-like expression and the groupby method of Pandas, so it takes around 5 sec instead of the 9 sec in my post. I need to check all the answers to see if it can be sped up even more. – Olsewski

Constant number of indices per group

Approach #1

We can perform dimensionality-reduction to reduce cubes to a 1D array. This is based on mapping the given cubes data onto an n-dim grid to compute the linear-index equivalents, discussed in detail here. Then, based on the uniqueness of those linear indices, we can segregate unique groups and their corresponding indices. Hence, following those strategies, we would have one solution, like so -

N = 4 # number of indices per group
c1D = np.ravel_multi_index(cubes.T, cubes.max(0)+1)
sidx = c1D.argsort()
indices = sidx.reshape(-1,N)
unq_groups = cubes[indices[:,0]]

# If you need in a zipped dictionary format
out = dict(zip(map(tuple,unq_groups), indices))

Alternative #1 : If the integer values in cubes are too large, we might want to do the dimensionality-reduction such that the dimensions with shorter extents are chosen as the primary axes. Hence, for those cases, we can modify the reduction step to get c1D, like so -

s1,s2 = cubes[:,:2].max(0)+1
s = np.r_[s2,1,s1*s2]
c1D = cubes.dot(s)
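
As an optional sanity check (a sketch, best done on a small sample since np.unique over rows is itself not cheap), the dot-product reduction should yield exactly one distinct 1D key per distinct cube:

# collision check: one distinct 1D key per distinct cube
# (keep cubes in int64 so the dot product does not overflow for large coordinates)
assert len(np.unique(c1D)) == len(np.unique(cubes, axis=0))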

Approach #2

Next up, we can use a Cython-powered kd-tree for quick nearest-neighbour lookup to get the nearest neighbouring indices; since all cubes within a group share identical coordinates, a point's N nearest neighbours are exactly its own group. Hence we can solve our case like so -

from scipy.spatial import cKDTree

idx = cKDTree(cubes).query(cubes, k=N)[1] # N = 4 as discussed earlier
I = idx[:,0].argsort().reshape(-1,N)[:,0]
unq_groups,indices = cubes[I],idx[I]

Generic case : Variable number of indices per group

We will extend the argsort based method with some splitting to get our desired output, like so -

c1D = np.ravel_multi_index(cubes.T, cubes.max(0)+1)

sidx = c1D.argsort()
c1Ds = c1D[sidx]
split_idx = np.flatnonzero(np.r_[True,c1Ds[:-1]!=c1Ds[1:],True])
grps = cubes[sidx[split_idx[:-1]]]

indices = [sidx[i:j] for (i,j) in zip(split_idx[:-1],split_idx[1:])]
# If needed as dict o/p
out = dict(zip(map(tuple,grps), indices))
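
On the small example from the question, this can be cross-checked against the pandas groupby result (a sketch; each index list is sorted before comparison since the order within a group is not guaranteed):

import pandas as pd

ref = pd.DataFrame(cubes).groupby([0, 1, 2]).indices
assert len(out) == len(ref)
assert all(np.array_equal(np.sort(out[k]), np.sort(ref[k])) for k in ref)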

Using 1D versions of groups of cubes as keys

We will extend the earlier listed method by using the 1D versions of the groups of cubes as keys, to simplify the process of dictionary creation and make it more efficient, like so -

def numpy1(cubes):
    c1D = np.ravel_multi_index(cubes.T, cubes.max(0)+1)        
    sidx = c1D.argsort()
    c1Ds = c1D[sidx]
    mask = np.r_[True,c1Ds[:-1]!=c1Ds[1:],True]
    split_idx = np.flatnonzero(mask)
    indices = [sidx[i:j] for (i,j) in zip(split_idx[:-1],split_idx[1:])]
    out = dict(zip(c1Ds[mask[:-1]],indices))
    return out
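
Since the keys are now 1D linear indices, the 3D cube coordinates can be recovered on demand with np.unravel_index, the inverse of np.ravel_multi_index, as long as the same grid shape is used (a sketch):

dims = tuple(cubes.max(0) + 1)              # same grid shape as inside numpy1
out = numpy1(cubes)
some_key = next(iter(out))
coords = np.unravel_index(some_key, dims)   # -> (x, y, z) of that cube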

Next up, we will make use of the numba package to iterate and get to the final hashable dictionary output. Going with it, there would be two solutions: one that gets the keys and values separately using numba, with the main caller zipping and converting them to a dict, and another that creates a numba-supported dict type, so no extra work is required by the main calling function.

Thus, we would have first numba solution :

from numba import njit

@njit
def _numba1(sidx, c1D):
    out = []
    n = len(sidx)
    start = 0
    grpID = []
    for i in range(1,n):
        if c1D[sidx[i]]!=c1D[sidx[i-1]]:
            out.append(sidx[start:i])
            grpID.append(c1D[sidx[start]])
            start = i
    out.append(sidx[start:])
    grpID.append(c1D[sidx[start]])
    return grpID,out

def numba1(cubes):
    c1D = np.ravel_multi_index(cubes.T, cubes.max(0)+1)
    sidx = c1D.argsort()
    out = dict(zip(*_numba1(sidx, c1D)))
    return out

And second numba solution as :

from numba import types
from numba.typed import Dict

int_array = types.int64[:]

@njit
def _numba2(sidx, c1D):
    n = len(sidx)
    start = 0
    outt = Dict.empty(
        key_type=types.int64,
        value_type=int_array,
    )
    for i in range(1,n):
        if c1D[sidx[i]]!=c1D[sidx[i-1]]:
            outt[c1D[sidx[start]]] = sidx[start:i]
            start = i
    outt[c1D[sidx[start]]] = sidx[start:]
    return outt

def numba2(cubes):
    c1D = np.ravel_multi_index(cubes.T, cubes.max(0)+1)    
    sidx = c1D.argsort()
    out = _numba2(sidx, c1D)
    return out

Timings with cubes.npz data -

In [4]: cubes = np.load('cubes.npz')['array']

In [5]: %timeit numpy1(cubes)
   ...: %timeit numba1(cubes)
   ...: %timeit numba2(cubes)
2.38 s ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.13 s ± 25.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.8 s ± 5.95 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Alternative #1 : We can achieve further speedup with numexpr for large arrays to compute c1D, like so -

import numexpr as ne

s0,s1 = cubes[:,0].max()+1,cubes[:,1].max()+1
d = {'s0':s0,'s1':s1,'c0':cubes[:,0],'c1':cubes[:,1],'c2':cubes[:,2]}
c1D = ne.evaluate('c0+c1*s0+c2*s0*s1',d)

This would be applicable at all places that require c1D.

Vocalic answered 14/12, 2019 at 15:7 Comment(9)
Thanks a lot for the response! I didn't expect that using cKDTree was possible here. However, there are still some problems with your Approach #1: the length of the output is only 915791. I guess this is some kind of conflict between dtypes int32 and int64. – Olsewski
@Olsewski I am assuming the number of indices per group would be a constant number, which I gathered from the comments. Would that be a safe assumption? Also, are you testing cubes.npz for the output of 915791? – Vocalic
Yes, I am. I didn't test the number of indices per group because the order of group names may differ. I only tested the length of the output dictionary from cubes.npz, and it was 983234 for the other approaches I suggested. – Olsewski
@Olsewski Check out Approach #3 for that generic case of a variable number of indices. – Vocalic
Thank you very much for the new approach. – Olsewski
I am impressed by the numbers you get with Numba. However, in my version of numba (0.40.1) there is no submodule named numba.typed, so I could not test one variant. – Carnelian
@Carnelian Yeah, that's a recent addition, so one needs to update numba. – Vocalic
@Vocalic I've tested the numexpr-based solution, and it's brilliant! Only two issues are worth attention: 1) the product of the maximum coordinates exceeds the bounds of int32, so we need to change the type to int64; 2) s0, s1 can't just be max+1 if coordinates are negative. In that case I would change cubes to cubes - np.min(cubes, axis=0). – Olsewski
@Olsewski Yup, that offsetting is generally needed if the minimum is less than 0. Good catch on the precision! – Vocalic

You might just iterate and add the index of each element to the corresponding list.

from collections import defaultdict

res = defaultdict(list)

for idx, elem in enumerate(cubes):
    #res[tuple(elem)].append(idx)
    res[elem.tobytes()].append(idx)

Runtime can be further improved by using tobytes() instead of converting the key to a tuple.
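
If the cube coordinates are ever needed back from a tobytes() key, they can be recovered with np.frombuffer, assuming the same dtype as the cubes array (a sketch):

import numpy as np

key = next(iter(res))
coords = np.frombuffer(key, dtype=cubes.dtype)  # -> array([x, y, z])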

Coroner answered 8/12, 2019 at 21:53 Comment(2)
I'm reviewing the performance times at the moment (for 20M points). It seems that my solution is more efficient in terms of time because iteration is avoided. I agree, the memory consumption is enormous. – Olsewski
Another variant: res[tuple(elem)].append(idx) took 50 seconds vs. res[elem[0], elem[1], elem[2]].append(idx), which took 30 seconds. – Olsewski

You could use Cython:

%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True

import math
import cython as cy

cimport numpy as cnp


cpdef groupby_index_dict_cy(cnp.int32_t[:, :] arr):
    cdef cy.size_t size = len(arr)
    result = {}
    for i in range(size):
        key = arr[i, 0], arr[i, 1], arr[i, 2]
        if key in result:
            result[key].append(i)
        else:
            result[key] = [i]
    return result

but it will not make you faster than what Pandas does; still, it is the fastest option after that (and perhaps after the numpy_indexed based solution), and it does not come with Pandas' memory penalty. A collection of what has been proposed so far is here.

On OP's machine that should get close to ~12 sec of execution time.
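
For completeness, a hypothetical usage sketch from a Jupyter session (run %load_ext cython before the %%cython cell above); note that the signature expects an int32 array, so the cubes from the question need a cast first:

import numpy as np

cubes32 = np.ascontiguousarray(cubes, dtype=np.int32)  # signature is cnp.int32_t[:, :]
result = groupby_index_dict_cy(cubes32)                # {(x, y, z): [row indices, ...]}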

Carnelian answered 14/12, 2019 at 16:19 Comment(1)
Thanks a lot, I'll test it later. – Olsewski
