Query points on the vertices of a Hamming cube
Asked Answered
O

1

0

I have N points that lie only on the vertices of a cube, of dimension D, where D is something like 3.

A vertex may not contain any point. So every point has coordinates in {0, 1}D. I am only interested in query time, as long as the memory cost is reasonable ( not exponential in N for example :) ).

Given a query that lies on one of the cube's vertices and an input parameter r, find all the vertices (thus points) that have hamming distance <= r with the query.

What's the way to go in a environment?


I am thinking of a kd-tree, but I am not sure and want help, any input, even approximative, would be appreciated! Since hamming distance comes into play, bitwise manipulations should help (e.g. XOR).

Overstreet answered 23/11, 2016 at 15:37 Comment(12)
A downvote, why? :/ Please Mr. Downvoter, suggest how I can improve my question - you didn't even cast a downvote to help me...!Overstreet
Not my -1, but as I understand "the curse of dimensionality", it means exactly that for D >> 1, no one knows a good way to solve this problem. As D increases, the dumbest-possible algorithm (check each of your N points to see if it's within r of the query point) quickly becomes the best-known algorithm.Zeph
@j_random_hacker, as one of the guys that created kd-geraf, I can say that we don't have to do with the curse of dimensionality that much here, I mean D will never be 10.000 in my case, but thanks for the tip, I will fix it! ;)Overstreet
Using some bitmath tricks you can simply enumerate all vertices that are at distance exactly k, then just loop k up to r. I've answered this before, let me just go find it..Habited
Yes @harold, you commented at the same time I commented that bitwise manipulations should help, I am looking for something like it. Great, I will wait!Overstreet
I found it, but the context was a quite different and it was Java.. still, the same idea appliesHabited
@harold interesting, I would suggest that you modify it a bit to match my question and post an answer, please. :)Overstreet
"N points that lie only on the vertices of a cube" and "A vertex may not contain any point": hem, isn't that contradictory ? What I am missing ?Elegist
"D is something like 3": why worry then ?Elegist
@YvesDaoust I am sorry to see you downvoting. In my real applications I am mapping points to a hypercube (points are being assigned to a hypercube), but during that mapping process some vertices might not get any point! Well D = 3 is a super strong assumption, I just had to make to survive the -2 at that time. The idea is to get an answer for that and then generalize it myself!Overstreet
"to survive the -2": ??? And you didn't answer my question.Elegist
Yes @YvesDaoust, I was about to delete the question since it was too broad (see comments above), so I decided to make some strong assumptions. I thought I did, what didn't I cover? :/Overstreet
H
4

There is a nice bithack to go from one bitmask with k bits set to the lexicographically next permutation, which means it's fairly simple to loop through all masks with k bits set. XORing these masks with an initial value gives all the values at hamming distance exactly k away from it.

So for D dimensions, where D is less than 32 (otherwise change the types),

uint32_t limit = (1u << D) - 1;
for (int k = 1; k <= r; k++) {
    uint32_t diff = (1u << k) - 1;
    while (diff <= limit) {
        // v is the input vertex
        uint32_t vertex = v ^ diff;
        // use it
        diff = nextBitPermutation(diff);
    }
}

Where nextBitPermutation may be implemented in C++ as something like (if you have __builtin_ctz)

uint32_t nextBitPermutation(uint32_t v) {
    // see https://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation
    uint32_t t = v | (v - 1);
    return (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
}

Or for MSVC (not tested)

uint32_t nextBitPermutation(uint32_t v) {
    // see https://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation
    uint32_t t = v | (v - 1);
    unsigned long tzc;
    _BitScanForward(&tzc, v); // v != 0 so the return value doesn't matter
    return (t + 1) | (((~t & -~t) - 1) >> (tzc + 1));
}

If D is really low, 4 or lower, the old popcnt-with-pshufb works really well and generally everything just lines up well, like this:

uint16_t query(int vertex, int r, int8_t* validmask)
{
    // validmask should be array of 16 int8_t's,
    // 0 for a vertex that doesn't exist, -1 if it does
    __m128i valid = _mm_loadu_si128((__m128i*)validmask);
    __m128i t0 = _mm_set1_epi8(vertex);
    __m128i r0 = _mm_set1_epi8(r + 1);
    __m128i all =        _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
    __m128i popcnt_lut = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2,  2,  3,  2,  3,  3,  4);
    __m128i dist = _mm_shuffle_epi8(popcnt_lut, _mm_xor_si128(t0, all));
    __m128i close_enough = _mm_cmpgt_epi8(r0, dist);
    __m128i result = _mm_and_si128(close_enough, valid);
    return _mm_movemask_epi8(result);
}

This should be fairly fast; fast compared to the bithack above (nextBitPermutation, which is fairly heavy, is used a lot there) and also compared to looping over all vertices and testing whether they are in range (even with builtin popcnt, that automatically takes at least 16 cycles and the above shouldn't, assuming everything is cached or even permanently in a register). The downside is the result is annoying to work with, since it's a mask of which vertices both exist and are in range of the queried point, not a list of them. It would combine well with doing some processing on data associated with the points though.

This also scales down to D=3 of course, just make none of the points >= 8 valid. D>4 can be done similarly but it takes more code then, and since this is really a brute force solution that is only fast due to parallelism it fundamentally gets slower exponentially in D.

Habited answered 23/11, 2016 at 16:16 Comment(9)
I can use unit8_t, since D won't be more than this, right? But the problem here is that I don't understand which is the query, is it v? And where are my points? Some (many?) vertices of the cube might be empty, so not all permutations are pointing to points.Overstreet
@Overstreet yes, v. And yes, this stops being a good idea if N is much less than 2^D, in which case this spends too much time hitting empty space. Are you in a situation like that?Habited
Actually for D < 9 I have an other idea, just have to work that out a bitHabited
No, D should be something like log(N)/2. So what you are doing, in every iteration, is that based on the query v, you produce ALL the possible permutations with hamming distance k? For example v = 111, k = 1 and D = 3, then you produce 011, 101, 110? Maybe I should make a minimal example to check this. | About the new idea: Oh cool, maybe you can post it as a second answer. :)Overstreet
@Overstreet the new idea isn't working well yet, the details would depend a lot on what you would actually do with the results.Habited
I have N points and one query, right? I want to find all the points that satisfy this: hamming_dist(query, current_point) <= r. I want to report these points, that's it, that's my goal, is that clear?Overstreet
Ok that's just a bit more annoying than, say, counting them or summing corresponding values or something like that. Anyway, is D=4 of interest? It works out particularly well.Habited
Thank you for the updated answer. To be honest I am yet sure on how to report the points that satisfy hamming_dist(query, current_point) <= r, yet so I am trying to understand on how to use your query(), can you please give an intuitive example (not a minimal one)? :)Overstreet
Let us continue this discussion in chat.Habited

© 2022 - 2024 — McMap. All rights reserved.