Finding a number of maximally different binary vectors from a set
Asked Answered
A

3

15

Consider the set, S, of all binary vectors of length n where each contains exactly m ones; so there are n-m zeros in each vector.
My goal is to construct a number, k, of vectors from S such that these vectors are as different as possible from each other.

As a simple example, take n=4, m=2 and k=2, then a possible solution is: [1,1,0,0] and [0,0,1,1].

It seems that this is an open problem in the coding theory literature (?).

Is there any way (i.e. algorithm) to find a suboptimal yet good solution ?
Is Hamming distance the right performance measure to use in this case ?

Some thoughts:
In this paper, the authors propose a couple of algorithms to find the subset of vectors such that the pairwise Hamming distance is >= a certain value, d.
I have implemented the Random approach as follows: take a set SS, which is initialized by any vector from S. Then, I consider the remaining vectors in S. For each of these vectors, I check if this vector has at least a distance d with respect to each vector in SS. If so, then it is added to SS.
By taking the maximal possible d, if the size of SS is >= k, then I consider SS as an optimal solution, and I choose any subset of k vectors from SS. Using this approach, I think that the resulting SS will depend on the identity of the initial vector in SS; i.e. there are multiple solutions(?).
But how to proceed if the size of SS is < k ?
From the proposed algorithms in the paper, I have only understood the Random one. I am interested in the Binary lexicographic search (section 2.3) but I don't know how to implement it (?).

Aerometry answered 11/5, 2018 at 12:37 Comment(6)
the title is misleading, when I read it i thought that m = n/2Prosperous
@WalterTross I agree. I have edited the title.Aerometry
thanks @din. Now, is this a real problem? If it is, what are the orders of magnitude of n, m and k?Prosperous
Let's say: n<25, 1<m<10, 10<k<100 (or 10<k<50).Aerometry
Can you define what you are measuring difference between two vectors? As an example, given the vectors 1001, 0101, 0110, and 1010, would they all be the same "distance" away when compared to 1100 as they are all different in two of the bits?Distichous
@JosephWood Yes, they all have the same distance from 1100.Aerometry
O
5

Maybe you find this paper useful (I wrote it). It contains algorithms that efficiently create permutations of bitstrings.

For example, the inc() algorithm:

long  inc(long h_in , long m0 , long m1) {
    long  h_out = h_in | (~m1); //pre -mask
    h_out ++;
    // increment
    h_out = (h_out & m1) | m0; //post -mask
    return  h_out;
}

It takes an input h_in and return the next higher value that is at least 1 larger than h_in and 'matches' the boundaries m0 and m1. 'Matching' means: the result has a 1 whereever m0 has a 1, and the result has a 0 whereever m1 has a 0. Not that h_in MUST BE a valid value with regards to mo and m1! Also, note that m0 has to be bitwise smaller than m1, which means that m0 cannot have a 1 in a position where m1 has a 0.

This could be used to generate permutations with a minimum edit distance to a given input string:

Let's assume you have 0110, you first NEGATE it to 1001 (edit distance = k). Set 'm0=1001' and 'm1=1001'. Using this would result only on '1001' itself.

Now to get all values with edit distance k-1, you can do the following, simply flip one of the bits of m0 or m1, then inc() will return an ordered series of all bitstring that have a difference of k or k-1.

I know, not very interesting yet, but you can modify up to k bits, and inc() will always return all permutations with the maximum allowed edit difference with regard to m0 and m1.

Now, to get all permutations, you would have to re-run the algorithm with all possibly combinations of m0 and m1.

Example: To get all possible permutations of 0110 with edit distance 2, you would have to run inc() with the following permutations of m0=0110 and m1=0110 (to get permutations, a bit position has to be expanded, meaning that m0 is set to 0 and m1 is set to 1:

  • Bit 0 and 1 expanded: m0=0010 and m1=1110
  • Bit 0 and 2 expanded: m0=0100 and m1=1110
  • Bit 0 and 3 expanded: m0=0110 and m1=1111
  • Bit 1 and 2 expanded: m0=0000 and m1=0110
  • Bit 1 and 3 expanded: m0=0010 and m1=0111
  • Bit 2 and 3 expanded: m0=0100 and m1=0111

As starting value for h_0 I suggest to use simply m0. Iteration can be aborted once inc() returns m1.

Summary The above algorithm generates in O(x) all x binary vectors that differ in at least y bits (configurable) from a given vector v.

Using your definition of n=number of bits in a vector v, setting y=n generates exactly 1 vector which is the exact opposite of the input vector v. For y=n-1, it will generate n+1 vectors: n vectors which differ in all but one bits and 1 vector that differs in all bits. And so on different values of y.

**EDIT: Added summary and replaced erroneous 'XOR' with 'NEGATE' in the text above.

Opportina answered 12/5, 2018 at 13:19 Comment(3)
Sorry but I am not sure I have understood all the ideas in your answer. Could you please describe in simple words what your algorithm is able to find? I mean, does it find a subset of e.g. k vectors that are maximally different from each other?Aerometry
I added a summary and a fixed a word. To answer your question, no, it does not find a subset of e.g. k vectors that are maximally different from each other. Instead it finds all vectors that have a (configurable) minimum Hamming distance (== different) from a single given vector (but not from each other). Sorry if I misunderstood your request...?Opportina
Actually, as mentioned in the question, I am looking for a way to find a subset of e.g. k vectors that are maximally different (i.e. distant) from each other. Your algorithm is a low computational-complexity approach to find all the vectors that have at least some distance d from a given vector; this is an interesting approach, thank you. In addition to your alg., one can use an exhaustive search to find these vectors. However, the problem is how to find the k vectors that are maximally different from each other. Maybe I am missing something here..Aerometry
P
4

I don't know if maximizing the sum of the Hamming distances is the best criterion to obtain a set of "maximally different" binary vectors, but I strongly suspect it is. Furthermore I strongly suspect that the algorithm that I'm going to present yields exactly a set of k vectors that maximizes the sum of Hamming distances for vectors of n bits of with m ones and n - m zeroes. Unfortunately I don't have the time to prove it (and, of course, I might be wrong – in which case you would be left with a “suboptimal yet good” solution, as per your request.)

Warning: In the following I'm assuming that, as a further condition, the result set may not contain the same vector twice.

The algorithm I propose is the following:

Starting from a result set with just one vector, repeatedly add one of those remaining vectors that have the maximum sum of Hamming distances from all the vectors that are already in the result set. Stop when the result set contains k vectors or all available vectors have been added.

Please note that the sum of Hamming distances of the result set does not depend on the choice of the first or any subsequent vector.

I found a “brute force” approach to be viable, given the constraints you mentioned in a comment:

n<25, 1<m<10, 10<k<100 (or 10<k<50)

The “brute force” consists in precalculating all vectors in “lexicographical” order in an array, and also keeping up-to-date an array of the same size that contains, for each vector with the same index, the total Hamming distance of that vector to all the vectors that are in the result set. At each iteration the total Hamming distances are updated, and the first (in “lexicographical” order) of all vectors that have the maximum total Hamming distance from the current result set is chosen. The chosen vector is added to the result set, and the arrays are shifted in order to fill in its place, effectively decreasing their size.

Here is my solution in Java. It's meant to be easily translatable to any procedural language, if needed. The part that calculates the combinations of m items out of n can be replaced by a library call, if one is available. The following Java methods have a corresponding C/C++ macro that uses fast specialized processor instructions on modern CPUs: Long.numberOfTrailingZeros__builtin_ctzl, Long.bitCount__builtin_popcountl.

package waltertross.bits;

public class BitsMain {

    private static final String USAGE =
        "USAGE: java -jar <thisJar> n m k (1<n<64, 0<m<n, 0<k)";

    public static void main (String[] args) {

        if (args.length != 3) {
            throw new IllegalArgumentException(USAGE);
        }
        int n = parseIntArg(args[0]); // number of bits
        int m = parseIntArg(args[1]); // number of ones
        int k = parseIntArg(args[2]); // max size of result set
        if (n < 2 || n > 63 || m < 1 || m >= n || k < 1) {
            throw new IllegalArgumentException(USAGE);
        }
        // calculate the total number of available bit vectors
        int c = combinations(n, m);
        // truncate k to the above number
        if (k > c) {
            k = c;
        }
        long[] result   = new long[k]; // the result set (actually an array)
        long[] vectors  = new long[c - 1]; // all remaining candidate vectors
        long[] hammingD = new long[c - 1]; // their total Hamming distance to the result set
        long firstVector = (1L << m) - 1; // m ones in the least significant bits
        long lastVector  = firstVector << (n - m); // m ones in the most significant bits
        result[0] = firstVector; // initialize the result set
        // generate the remaining candidate vectors in "lexicographical" order
        int size = 0;
        for (long v = firstVector; v != lastVector; ) {
            // See http://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation
            long t = v | (v - 1); // t gets v's least significant 0 bits set to 1
            // Next set to 1 the most significant bit to change,
            // set to 0 the least significant ones, and add the necessary 1 bits.
            v = (t + 1) | (((~t & -~t) - 1) >>> (Long.numberOfTrailingZeros(v) + 1));
            vectors[size++] = v;
        }
        assert(size == c - 1);
        // chosenVector is always the last vector added to the result set
        long chosenVector = firstVector;
        // do until the result set is filled with k vectors
        for (int r = 1; r < k; r++) {
            // find the index of the new chosen vector starting from the first
            int chosen = 0;
            // add the distance to the old chosenVector to the total distance of the first
            hammingD[0] += Long.bitCount(vectors[0] ^ chosenVector);
            // initialize the maximum total Hamming distance to that of the first
            long maxHammingD = hammingD[0];
            // for all the remaining vectors
            for (int i = 1; i < size; i++) {
                // add the distance to the old chosenVector to their total distance
                hammingD[i] += Long.bitCount(vectors[i] ^ chosenVector);
                // whenever the calculated distance is greater than the max,
                // update the max and the index of the new chosen vector
                if (maxHammingD < hammingD[i]) {
                    maxHammingD = hammingD[i];
                    chosen = i;
                }
            }
            // set the new chosenVector to the one with the maximum total distance
            chosenVector = vectors[chosen];
            // add the chosenVector to the result set
            result[r] = chosenVector;
            // fill in the hole left by the chosenVector by moving all vectors
            // that follow it down by 1 (keeping vectors and total distances in sync)
            System.arraycopy(vectors,  chosen + 1, vectors,  chosen, size - chosen - 1);
            System.arraycopy(hammingD, chosen + 1, hammingD, chosen, size - chosen - 1);
            size--;
        }
        // dump the result set
        for (int r = 0; r < k; r++) {
            dumpBits(result[r], n);
        }
    }

    private static int parseIntArg(String arg) {
        try {
            return Integer.parseInt(arg);
        } catch (NumberFormatException ex) {
            throw new IllegalArgumentException(USAGE);
        }
    }

    private static int combinations(int n, int m) {
        // calculate n over m = n! / (m! (n - m)!)
        // without using arbitrary precision numbers
        if (n <= 0 || m <= 0 || m > n) {
            throw new IllegalArgumentException();
        }
        // possibly avoid unnecessary calculations by swapping m and n - m
        if (m * 2 < n) {
            m = n - m;
        }
        if (n == m) {
            return 1;
        }
        // primeFactors[p] contains the power of the prime number p
        // in the prime factorization of the result
        int[] primeFactors = new int[n + 1];
        // collect prime factors of each term of n! / m! with a power of 1
        for (int term = n; term > m; term--) {
            collectPrimeFactors(term, primeFactors, 1);
        }
        // collect prime factors of each term of (n - m)! with a power of -1
        for (int term = n - m; term > 1; term--) {
            collectPrimeFactors(term, primeFactors, -1);
        }
        // multiply the collected prime factors, checking for overflow
        int combinations = 1;
        for (int f = 2; f <= n; f += (f == 2) ? 1 : 2) {
            // multiply as many times as requested by the stored power
            for (int i = primeFactors[f]; i > 0; i--) {
                int before = combinations;
                combinations *= f;
                // check for overflow
                if (combinations / f != before) {
                    String msg = "combinations("+n+", "+m+") > "+Integer.MAX_VALUE;
                    throw new IllegalArgumentException(msg);
                }
            }
        }
        return combinations;
    }

    private static void collectPrimeFactors(int n, int[] primeFactors, int power) {
        // for each candidate prime that fits in the remaining n
        // (note that non-primes will have been preceded by their component primes)
        for (int i = 2; i <= n; i += (i == 2) ? 1 : 2) {
            while (n % i == 0) {
                primeFactors[i] += power;
                n /= i;
            }
        }
    }

    private static void dumpBits(Long bits, int nBits) {
        String binary = Long.toBinaryString(bits);
        System.out.println(String.format("%"+nBits+"s", binary).replace(' ', '0'));
    }
}

The algorithm's data for n=5, m=2, k=4:

result
00011   00101 00110 01001 01010 01100 10001 10010 10100 11000 vectors
          0→2   0→2   0→2   0→2   0→4   0→2   0→2   0→4   0→4 hammingD
                                    ^                         chosen
00011   00101 00110 01001 01010 10001 10010 10100 11000
01100     2→4   2→4   2→4   2→4   2→6   2→6   4→6   4→6
                                    ^
00011   00101 00110 01001 01010 10010 10100 11000
01100     4→6   4→8   4→6   4→8   6→8   6→8   6→8
10001             ^

00011   00101 01001 01010 10010 10100 11000
01100       6     6     8     8     8     8
10001
00110

Sample output (n=24, m=9, k=20):

[wtross ~/Dropbox/bits]$ time java -jar bits-1.0-SNAPSHOT.jar 24 9 20
000000000000000111111111
000000111111111000000000
111111000000000000000111
000000000000111111111000
000111111111000000000000
111000000000000000111111
000000000111111111000000
111111111000000000000000
000000000000001011111111
000000111111110100000000
111111000000000000001011
000000000000111111110100
001011111111000000000000
110100000000000000111111
000000001011111111000000
111111110100000000000000
000000000000001101111111
000000111111110010000000
111111000000000000001101
000000000000111111110010

real    0m0.269s
user    0m0.244s
sys     0m0.046s

The toughest case within your constraints (n=24, m=9, k=99) takes ~550 ms on my Mac.

The algorithm could be made even faster by some optimization, e.g., by shifting shorter array chunks. Remarkably, in Java I found shifting "up" to be considerably slower than shifting "down".

Prosperous answered 20/5, 2018 at 22:9 Comment(7)
Would you please explain how you intialize your 'result set'? I suppose that using your solution, the resulting set depends on the vector you initialize the set with (?). Since I am not fimiliar with java, I would appreciate if you could add a description for each part of your code; or please just provide the algorithm as a pseudo-code.Aerometry
the result set is initialized with the first vector in “lexicographical order”. The other vectors are added taking the first (in “lexicographical order”) among those that have the maximum total Hamming distance from those that already are in the result set. The sample output that I provided helps in visualizing that. The output sequence is fixed, and truncated at the k-th vector. The code I wrote is really very similar to C code in its main parts (BTW I literally pasted code lines from the bithacks URL that I provided). Anyway, I'll improve the answer as soon as I can.Prosperous
@Aerometry I tried to clarify the algorithm, and also added many comments in the code. The part that calculates C(n, m) = n! / (m! (n-m)!) might be completely redundant if a library function that does the same thing is available, so I consider it interesting for itself, but not for your problem. Please let me know if you need any further clarification.Prosperous
If I understand correctly, you are ordering the vectors in lexicographical order and choosing the first vector (in such order) that maximizes the sum of hamming distance from the vectors in the result set. Is this ordering important ? What if I use a random ordering and choose the vector (if there are multiple vectors, choose one at random) that yields the maximum hamming distance from the result set ?Aerometry
no, the ordering is not important at all. You can use a random order and choose a vector at random as long as it has the maximum Hamming distance from the result set. The advantage of the lexicographical order is that it easy to generate and easy to follow when debugging.Prosperous
I think this indeed finds the maximum overall Hamming distance; at least I'm getting the same results for simple examples with a very different method. But looking at the output, I think the method for generating all solutions (or a random solution) can be further simplified.Kilan
@m69 I think you are right. Unfortunately I don't have the time now to improve on this solution, which remains a “brute force” one.Prosperous
G
3

UPDATED ANSWER

Looking at the example output of Walter Tross's code, I think that generating a random solution can be simplified to this:

Take any vector to start with, e.g. for n=8, m=3, k=5:

A:   01001100  

After every step, sum the vectors to get the number of times each position has been used:

SUM: 01001100

Then, for the next vector, place the ones at positions that have been used least (in this case zero times), e.g.:

B:   00110001

to get:

A:   01001100  
B:   00110001
SUM: 01111101  

Then, there are 2 least-used positions left, so for the 3 ones in the next vector, use those 2 positions, and then put the third one anywhere:

C:   10010010

to get:

A:   01001100  
B:   00110001
C:   10010010
SUM: 11121111  (or reset to 00010000 at this point)  

Then for the next vector, you have 7 least-used positions (the ones in the sum), so choose any 3, e.g.:

D:   10100010

to get:

A:   01001100  
B:   00110001
C:   10010010
D:   10100010
SUM: 21221121  

And for the final vector, choose any of the 4 least-used positions, e.g.:

E:   01000101

To generate all solutions, simply generate every possible vector in each step:

A:   11100000, 11010000, 11001000, ... 00000111

Then, e.g. when A and SUM are 11100000:

B:   00011100, 00011010, 00011001, ... 00000111

Then, e.g. when B is 00011100 and SUM is 11111100:

C:   10000011, 01000011, 00100011, 00010011, 00001011, 00000111

Then, e.g. when C is 10000011 and SUM is 21111111:

D:   01110000, 01101000, 01100100, ... 00000111

And finally, e.g. when D is 01110000 and SUM is 22221111:

E:   00001110, 00001101, 00001011, 00000111

This would result in C(8,3) × C(5,3) × C(8,1) × C(7,3) × C(4,3) = 56 × 10 × 8 × 35 × 4 = 627,200 solutions for n=8, m=3, k=5.


Actually, you need to add a method to avoid repeating the same vector, and avoid painting yourself into a corner; so I don't think this will be simpler than Walter's answer.


INITIAL ANSWER - HAS MAJOR ISSUES

(I will assume than m is not greater than n/2, i.e. the number of ones is not greater than the number of zeros. Otherwise, use a symmetrical approach.)

When k×m is not greater than n, there obviously are optimal solutions, e.g.:

n=10, m=3, k=3:  
A: 1110000000  
B: 0001110000  
C: 0000001110  

where the Hamming distances are all 2×m:

|AB|=6, |AC|=6, |BC|=6, total=18

When k×m is greater than n, solutions where the difference in Hamming distances between consecutive vectors are minimized offer the greatest total distance:

n=8, m=3, k=4:
A: 11100000
B: 00111000
C: 00001110
D: 10000011
|AB|=4, |AC|=6, |AD|=4, |BC|=4, |BD|=6, |CD|=4, total=28  
n=8, m=3, k=4:
A: 11100000
B: 00011100
C: 00001110
D: 00000111
|AB|=6, |AC|=6, |AD|=6, |BC|=2, |BD|=4, |CD|=2, total=26  

So, practically, you take m×k and see how much greater it is than n, let's call it x = m×k−n, and this x is the number of overlaps, i.e. how often a vector will have a one in the same position as the previous vector. You then spread out the overlap over the different vectors as evenly as possible to maximize the total distance.

In the example above, x = 3×4−8 = 4 and we have 4 vectors, so we can spread out the overlap evenly and every vector has 1 one in the same position as the previous vector.


To generate all unique solutions, you could:

  • Calculate x = m×k−n and generate all partitions of x into k parts, with the lowest possible maximum value:
n=8, m=3, k=5  ->  x=7  
22111, 21211, 21121, 21112, 12211, 12121, 12112, 11221, 11212, 11122  
(discard partitions with value 3)  
  • Generate all vectors to be used as vector A, e.g.:
A: 11100000, 11010000, 11001000, 11000100, ... 00000111
  • For each of these, generate all vectors B, which are lexicographically smaller than vector A, and have the correct number of overlapping ones with vector A (in the example that is 1 and 2), e.g.:
A: 10100100
overlap=1:  
B: 10011000, 10010010, 10010001, 10001010, 10001001, 10000011, 01110000, ... 00000111
overlap=2:  
B: 10100010, 10100001, 10010100, 10001100, 10000110, 10000101, 01100100, ... 00100101  
  • For each of these, generate all vectors C, and so on, until you have sets of k vectors. When generating the last vector, you have to take into account the overlapping with the previous as well as the next (i.e. first) vector.

I assume it's best to treat the partitions of x into k as a binary tree:

                   1                                      2
      11                      12                    21         22
111        112           121       122        211       212    221
1112   1121   1122   1211   1212   1221   2111   2112   2121   2211
11122  11212  11221  12112  12121  12211  21112  21121  21211  22111

and traverse this tree while creating solutions, so that each vector only needs to be generated once.


I think this method only works for some values of n, m and k; I'm not sure it can be made to work for the general case.

Gard answered 24/5, 2018 at 0:29 Comment(6)
I think this gives the same results as Walter's method, which looks easier to code. But looking at his example output, I think you can simply add up all the vectors, and then put the 1's in the least-used positions to generate the next vector in a solution.Kilan
@WalterTross As far as I can tell, that's not a problem, e.g. for n=4, m=2, k=3, both [1100, 0110, 0011] and [1100, 0011, 1100] have a total Hamming distance of 8.Kilan
@m69 I am interested in what you say about the minimum distance. However, in your example you consider 1100 and 0011, then you take 1100 and say that the minimum hamming distance is 0. But, in this case, you are considering the same vector (1100) twice, which is not correct. I mean once you have a vector (e.g.1100) in the set, you cannot consider it again. I have not succeeded to find an example for which your statement holds, i.e. the minimum hamming distance is not the same for two vectors for which the total hamming distance from the set is maximal and the same.Aerometry
@Aerometry Ah, I overlooked the fact that a result cannot contain identical vectors. Anyway, I saw your question quite late, and I'm still digesting all the details of the problem. I'll probably update my answer further in the coming days. Are you interested in generating all solutions or just one random solution at a time, and is uniform distribution important?Kilan
@m69 One solution is enough as long as there is no better solution than it; maybe that's what you mean by 'uniform distribution important' (?).Aerometry
@Aerometry In that case, Walter's solution is probably best. The more I look at my initial answer, the more I think it only works for certain values of n, m and k.Kilan

© 2022 - 2024 — McMap. All rights reserved.