Search for bitstring most unlike a set of bitstrings

Asked 7/4, 2019 at 0:16 Answered 10/4, 2019 at 4:47

Solved python python-3.x search bit hamming-distance

I have a set of bitstrings: {'0011', '1100', '1110'} (all bitstrings within a set are of same length).

I want to quickly find the bitstring of same length that has the smallest max-similarity to the set. Max-similarity can be computed as such:

def max_similarity(bitstring, set):
    max = 0
    for item in set:
        temp = 0
        for i in range(len(bitstring)):
            if bitstring[i] == item[i]:
                temp += 1
        if temp > max:
            max = temp
    return max

I know that I can iterate through all the possible bitstrings of that length, compute the max-similarity for each one and finally keep the smallest of these iterations. But this solves the problem in O(2^n). I would like to know if someone sees any quicker alternatives.

I've been playing around with Pythons XOR:

def int2bin(integer, digits):
    if integer >= 0:
        return bin(integer)[2:].zfill(digits)
    else:
        return bin(2**digits + integer)[2:]


def XOR(bitset):  
    intset = [int('{}'.format(bitstring), 2) for bitstring in bitset]

    digits = len(bitset.pop())

    if len(intset) == 1:
        return int2bin(~intset.pop(), digits)        
    else:
        curr = 0    
        while intset:
            curr = curr ^ intset.pop()

        return int2bin(curr, digits)

if __name__ == '__main__':
    bitset1 = {'0011', '1100', '1110'}
    bitset2 = {'01001', '11100', '10111'}
    print(XOR(bitset1))
    print(XOR(bitset2))

>>> python test.py
0001
00010

(Function int2bin stolen from here)

But I've found it works for some input, and not for others. In the test above it finds the correct solution for bitset2, but not for bitset1. Any solutions below O(2^n) available?

Achromic answered 7/4, 2019 at 0:16 Comment(15)

The brute force algorithm is O(2^n), not O(n^2)! – Mesoglea 7/4, 2019 at 17:14

I'm voting to close this question as off-topic because it is about an abstract algorithm, not programming. You might try asking on the CS StackExchange. – Mesoglea 7/4, 2019 at 17:25

Thank you for the correction. It seems my question is "off topic" everywhere. It is a concrete problem though, it's not abstract. – Achromic 7/4, 2019 at 18:37

It’s abstract in that it has no dependence on any programming language or on anything in your existing implementation. – Mesoglea 7/4, 2019 at 18:44

If I included my complete implementation in the question, would that help? I do have a complete implementation, of course. I tried to only extract what is most relevant. – Achromic 7/4, 2019 at 18:48

Only if you believe that there is an error in your code, rather than just “I haven’t found an efficient algorithm to implement yet”. I think the XOR is clearly wrong: having duplicate bitstrings to avoid doesn’t change the answer, for instance. – Mesoglea 7/4, 2019 at 18:57

Right. No, there's no error in my working code. It is just not efficient enough, so I guess it doesn't help then. And yes, I agree the XOR is clearly wrong. Duplicates is handled by the fact that all strings are put into a set. There are no duplicates. – Achromic 7/4, 2019 at 19:12

Of course exact duplicates can be discarded, but if two strings are duplicates in all but one bit, should XOR all but ignore them? – Mesoglea 7/4, 2019 at 19:14

I'm so stupid. Your last comment gave me an idea that I think will fix my full implementation. Thank you! – Achromic 7/4, 2019 at 19:17

Er, great! (It didn’t suggest an algorithm to me.) If the question doesn’t get closed, you should answer it. – Mesoglea 7/4, 2019 at 19:29

Turns out the metric is called hamming distance. See this question – Crispation 9/4, 2019 at 0:56

#6390341 – Crispation 9/4, 2019 at 2:7

Naming the function XOR is a bit of a misnomer, as it's not actually computing the exclusive or value - that's the job of the ^ operator used in the example. I went with closest() in the answer below, but open to suggestions. Also, XOR as a function name goes against naming conventions, capitals (let alone all caps) shouldn't be used for function names. – Enrapture 9/4, 2019 at 4:14

Additionally, it appears that the XOR() in the example frequently returns sub-optimal answers? I haven't looked into the why, but perhaps someone has an idea what's wrong with it...? – Enrapture 9/4, 2019 at 4:42

Yes, I had to make some changes to the actual XOR (^) as it does not compute what I wanted. Therefore the "I played around with". Nevertheless, it was a bad turn. Yesterday I was looking into solving it by splicing the string into smaller and smaller parts, and solving the minimum max-similarity for the smaller parts – I was thinking length 2 to 3 – and then traverse it as a graph. But there has been several great answers recently, from you as well. I will review to see if it should be accepted. – Achromic 9/4, 2019 at 6:35

This question is in part algorithmic (what's the best algorithm to get to the solution) and in part a Python question (on what parts of Python to use to then efficiently implement that best algorithm).

On the algorithm: you define the max distance for a bit string to a set of (same size) bit strings to be the largest number of bits the target bit string differs on with any of the strings in the set. The goal of the algorithm is to find a new bit string with the same length as the strings in the set that has the lowest max distance.

It is assumed all the starting bit strings are different (as it is defined as a set, not a list). The distance you're computing is known as the Hamming distance, so you're looking for a new bit string with minimum Hamming distance to a starting set of strings.

Generating all possible bit strings of the right length and calculating the max distance to each starting string is brute forcing the problem, which can be optimised(*) using backtracking (discarding a result as soon as the lowest current max is exceeded for a candidate bit string).

(*: for people looking to correct my spelling, please consider the fact that I'm using UK English, not US English - feel free to propose improvements with that in mind)

However, the problem can also be viewed as follows.

For bit strings of length 1, the entire space of strings has only two options, {'0', '1'}. This can be visualised as '0' and '1' sitting at either end of a line segment of length 1, both a distance of 1 away from each other.

For bit strings of length 2, the entire space of strings has 4 options, namely the bit representations of 0 through 3 {'00', '01', '10', '11'} 0 is distance 1 away from 1 and 2, both of which are distance 1 away from 3. When visualised, they form the four corners of a square, none of them ever more than 2 steps from any other.

For bit strings of length 3, the entire space has 8 options, namely the bit representations of 0 through 7. When visualised, the form the 8 corners of a cube, none of them ever more than 3 steps from any other.

This pattern continues (into a 4D hypercube, 5D, etc.) and finding the answer to the problem effectively translates into: given a set of corners on one of these graphs, find the point with the lowest maximum distance to any of them.

An algorithm to find such a point, given a graph like that would be to:

Start with a list of the points in a set by themselves; if there's only one, that's the trivial answer.
Set the current distance to 1.
For all of the sets, add to it any point only one step away from points already in the set.
Intersect all of the resulting sets; if the intersection is not empty, these are all of the points that are the current distance (or less) away from the starting set of points; otherwise, increase the current distance by 1 and go to step 3.

This could probably be optimised further by keeping track of points visited as they are added to sets (for long bit strings), to avoid adding the same points over and over, quickly slowing down the given algorithm. I.e. instead of turning {'001'} into {'001', '101', '011', '000'}, you could go from [{'001'}] to [{'001'}, {'101', '011', '000'}] - the union of the sets still gets you all of the points reachable within 1 or fewer steps, but the next step in the series would be easier to compute, by finding all points that are one step farther out, but excluding points in the previous direction.

Finding points one step out is actually quite simple, if you turn the strings into the numbers the represent and calculate the exclusive bit-wise or of the numbers with all of the single '1'-bit numbers with the same bit string length, i.e. to find all points one step away from '001', you can xor 1 with 4, 2 and 1, yielding {5, 3, 0}, matching the correct points.

Putting all of that together in a tight bit of Python (without the optimisation for longer strings):

def closest(strings):
    if len(strings) == 1:
        return next(iter(strings))

    size = len(next(iter(strings)))
    points = [{int(s, 2)} for s in strings]
    powers = {1 << n for n in range(size)}

    d = 0
    while True:
        d += 1
        points = [{n ^ p for p in powers for n in nums} | nums for nums in points]
        intersection = set.intersection(*points)
        if len(intersection) > 0:
            return d, {f"{n:b}".zfill(size) for n in intersection}


print(closest({'1000', '0001', '0011'}))

Note that closest returns the actual distance and all of the optimal answers, not just one. Output:

(2, {'0000', '0010', '1001', '0001', '1011'})

Adding the discussed optimisation to closest:

def closest_optimised(strings):
    if len(strings) == 1:
        return next(iter(strings))

    size = len(next(iter(strings)))
    points = [({int(s, 2)}, {int(s, 2)}) for s in strings]
    powers = {1 << n for n in range(size)}

    d = 0
    while True:
        d += 1
        new_points = [{n ^ p for p in powers for n in rp} - ap for ap, rp in points]
        points = [(ap | np, np) for (ap, _), np in zip(points, new_points)]
        intersection = set.intersection(*[ap for ap, _ in points])
        if len(intersection) > 0:
            return d, {f"{n:b}".zfill(size) for n in intersection}

Note that running this through a profiler has the optimised code running in about half the time on average for these settings:

from random import randint

s = 10
x = 500
numbers = [randint(0, 2**s-1) for _ in range(x)]
number_strings = {f"{n:b}".zfill(s) for n in numbers}
print(number_strings)
print(closest_optimised(number_strings))
print(closest(number_strings))

Edit: Out of curiosity, I ran my example against the original given in the question and found that it frequently returns a far from optimal result. I haven't tried to find out why, but I figured it bears mentioning.

Someone pointed out that the OP may actually want the point with the maximum Hamming distance to all provided bit strings. With a similar approach:

def farthest(strings):
    s = next(iter(strings))
    size = len(s)
    if len(strings) == 1:
        return ''.join(['0' if c == '1' else '1' for c in s])

    all_visited = {int(s, 2) for s in strings}
    visited = [set(), all_visited]
    powers = {1 << n for n in range(size)}

    d = 0
    while True:
        d += 1
        visited.append({n ^ p for p in powers for n in visited[-1]} - all_visited)
        all_visited = all_visited | visited[-1]
        if len(all_visited) == 2**size:
            return d, {f"{n:b}".zfill(size) for n in visited[-1]}

Enrapture answered 9/4, 2019 at 3:5 Comment(8)

Note that I'm assuming the implementation Python's set.intersection() is optimal, i.e. it does not keep trying to intersect sets when the intersection of the first few sets is already empty - otherwise, this is another spot where you could optimise by brewing your own set.intersection(). – Enrapture 9/4, 2019 at 3:8

I think the OP wants to find the largest hamming distance since that would be the most dissimilar. Not the minimum hamming distance. – Crispation 9/4, 2019 at 4:51

Hmm, you may be right "find the bitstring of same length that has the smallest max-similarity to the set" - a similar approach would work, but you'd want the last point that shows up in the set (or rather, all the points that show up in the final iteration of increasing the distance). – Enrapture 9/4, 2019 at 4:59

Updated accordingly, having both closest and farthest Hamming distances. – Enrapture 9/4, 2019 at 6:34

I've tested this solution against my "slow but working" implementation, and it is indeed correct. So I've set it as the accepted solution. @Crispation 's solution seems to do a majority-vote on each "bit", which is incorrect. But the farthest-function Grismar provided here is exactly what I was looking for! – Achromic 9/4, 2019 at 6:52

There are small issues, like for instance it doesn't recognize that the solution to a set of only zeroes as only 1s. But that's an easy fix. – Achromic 9/4, 2019 at 7:15

The fix is a typo in return ''.join(['0' if c == '1' else '0' for c in s]), it should be return ''.join(['0' if c == '1' else '1' for c in s]). If you want to change that. I can't make edit on less than 6 characters, apparently. – Achromic 9/4, 2019 at 7:24

Switching from minimum to maximum Hamming distance is fortunately trivial: just invert all the strings to avoid. – Mesoglea 10/4, 2019 at 13:9

Here's an algorithm with cost O(n * b) where n is the size of the set and b is the fixed bit length.

The intuition with this algorithm is to check what the majority bit position is for each bit index (0 or 1) and score accordingly.

A higher score means the given bitstring had a bit position that went against the majority the most times. Although, I haven't handled ties.

import operator

def max_hamming(bitstrings):
    n_bits = len(bitstrings[0])
    # Track bit set/unset for each bit position
    scores = {
        n: {'0': [], '1': []} for n in range(n_bits)
    }
    # Increment on each bit position if not with the majority
    total = {b: 0 for b in bitstrings}

    # O(b * n)
    for n in range(n_bits):
        n_set = 0
        for b in bitstrings:
            is_set = b[n]
            scores[n][is_set].append(b)
            if is_set:
                n_set += 1

        # If majority have this bit set, give a point to those with unset or vice versa
        outliers = scores[n]['0'] if n_set > len(bitstrings) else scores[n]['1']
        for s in outliers:
            total[s] += 1

    return max(total.items(), key=operator.itemgetter(1))[0]

Also side note, I'm passing this a list instead of a set because python sets are non-deterministic with their ordering.

Usage:

bitset1 = [
    '0011',
    '1100',
    '1110'
]
bitset2 = [
    '01001',
    '11100',
    '10111'
]
print(max_hamming(bitset1))
print(max_hamming(bitset2))

Crispation answered 9/4, 2019 at 6:13 Comment(1)

I might have formulated the goal badly in the OP, so for that I am sorry. But essentially I was looking for the same output as from given all_possible_bitstrings_of_len_n (APBN), for bitstring in APBN: current = max(similarity(bitstring, bstring) for bstring in bitstrings), where similarity is as in OP, and then minimize current. This would always yield the wanted result. But it was really slow. – Achromic 9/4, 2019 at 7:3

Can I use numpy or is this supposed to be an algorithm? Let's assume everything is a bitstring like you have it.

import numpy as np

def bitstring2np(bitstring):
    """
    Convert a bitstring to np.array
    i.e. '0011' to np.array([0, 0, 1, 1])
    """
    return np.array([int(bit) for bit in bitstring], dtype=int)

def unlike(bitset):
    """
    Gets the most 'unlike' string between a bitset.
    Accomplishes this by creating a 2D array from the bitsets,
    figuring out the number of 1s in a column, and if that
    number of 1s is >=50%, then gives it a 0 in that place, otherwise
    gives it a 1.
    """
    bset = list(bitset)
    # Create an empty 2D array to store the bitsets into
    arr = np.empty((len(bset), len(bset[0])), dtype=int)
    for idx in range(len(bset)):
        # Store that bitset into the row of our array
        arr[idx,:] = bitstring2np(bset[idx])

    # Count the number of 1's in each column
    nonzero = np.count_nonzero(arr, axis=0)
    total = len(bset) # how many to compare against
    # Since you want the most unlike and since we are counting
    # number of 1s in a column, if the rate is >=.5 give it a 0, otherwise 
    # 1
    most_unlike = ''.join('0' if count/total >=.5 else '1' for count in nonzero)

    return most_unlike


>>> print(unlike(bitset1))
0001
>>> print(unlike(bitset2))  
00010

Now I know you said 0001 is not the correct solution for bitset, but I'm pretty sure it is unless I'm not understanding the question correctly.

Footplate answered 10/4, 2019 at 4:47 Comment(5)

Like @stacksonstacks, you’re computing the string with the greatest average Hamming distance, not the greatest minimum distance. – Mesoglea 10/4, 2019 at 13:11

@DavisHerring what should the answers be for the two provided bitset examples then? – Footplate 10/4, 2019 at 17:21

For 0011/1100/1110, the answer is any of 0000, 0101, or 1001. For 01001/11100/10111, one best answer is 00010; I haven’t checked for ties with it. – Mesoglea 10/4, 2019 at 22:25

That’s the unique best answer for the second case, as it turns out. – Mesoglea 10/4, 2019 at 22:49

Oh, I now understand the question. I was creating a bitstring to be least similar to the entire bitset, not to a single one. I'll have to think about that then.... – Footplate 10/4, 2019 at 22:58

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags