Generate first N combinations of length L from weighted set in order of weight

I have a set of letters with weights, which gives their probability of appearing in a string:

a - 0.7
b - 0.1
c - 0.3
...
z - 0.01

As such, the word aaaa has a probability of 0.7*0.7*0.7*0.7 = 0.24. The word aaac would have probability 0.7*0.7*0.7*0.3 = 0.10. All permutations of the same word have the same probability, so we only need to worry about combinations.

I would like to generate the first N unique strings of length L in order of probability (for example here, with 4 letters and length 4, that should be aaaa, aaac, aacc, aaab, accc, aabc, cccc, etc.).

Assume that the brute force approach of generating all combinations with their probabilities and sorting by weight is not possible here. The algorithm, should it exist, must be able to work for any set size and any string length (e.g. all 256 bytes with weighted probabilities, strings of length 1024, generate the first trillion).
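
For reference, the brute force being ruled out here would be something like the sketch below (it just uses the example weights above; it is fine for sanity-checking the answers on tiny cases, but hopeless at the scales just described):

from itertools import combinations_with_replacement
from math import prod

weights = {'a': 0.7, 'b': 0.1, 'c': 0.3, 'z': 0.01}
L = 4

# Enumerate every multiset of L letters, score each by the product of its
# letters' weights, and sort -- exponential work and memory in general.
letters = sorted(weights, key=weights.get, reverse=True)
combos = combinations_with_replacement(letters, L)
ranked = sorted(combos, key=lambda c: prod(weights[x] for x in c), reverse=True)
print([''.join(c) for c in ranked[:7]])
# ['aaaa', 'aaac', 'aacc', 'aaab', 'accc', 'aacb', 'cccc']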

Schooner asked 4/5, 2022 at 23:14 Comment(3)
I think the solution is a priority queue. The initial input to the queue is a string with the highest probability letter (aaaa). After reading a string from the queue, replace a letter with the next lower probability letter, and add that new string to the queue. So after reading aaaa, write aaac. After reading aaac, write aacc (changing an a to c) and aaab (changing a c to a b).Shinleaf
You're right about the order, I've fixed it.Schooner
@Shinleaf Hope you don't mind, I took the liberty of implementing your commentGrout

Below is some enumeration code that uses a heap. The implementation principle is slightly different from what user3386109 proposes in their comment.

Order the symbols by decreasing probability. There’s a constructive one-to-one correspondence between length-L combinations of S symbols and binary strings of length S + L − 1 containing S − 1 zeros (write each symbol’s count in unary ones, with S − 1 zeros as delimiters between the symbols). We can then enumerate the possibilities for the latter one bit at a time.
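
To make the correspondence concrete (this helper is mine, not part of the answer): with the example’s symbols written in decreasing-probability order a, c, b, z, a word maps to the unary counts of each symbol separated by S − 1 = 3 zeros.

from math import comb

def encode(word, symbols="acbz"):  # symbols in decreasing-probability order
    counts = [word.count(s) for s in symbols]
    return "0".join("1" * c for c in counts)

print(encode("aacb"))        # 1101010 -> length S + L - 1 = 7, with 3 zeros
print(comb(4 + 4 - 1, 4))    # 35, the number of combinations (and of lines printed below)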

The part that saves us from having to enumerate every single combination is that, for each binary prefix, the most probable completed word is obtained by filling all remaining positions with the most probable letter still available; this gives a cheap lower bound on the loss. By storing the prefixes in a heap keyed on that bound, we only ever expand the ones that can still appear in the top N.
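
For a concrete instance of that bound (numbers worked out from the code’s weights): if one symbol has been placed and we then rule out 'a' for the remaining 3 positions, the bound grows by 3·((−log 0.3) − (−log 0.7)) ≈ 3·0.85 ≈ 2.5, since each remaining slot now costs at least a 'c' instead of an 'a'.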

Note that this uses memory proportional to the number of enumerated combinations. This may still be too much, in which case you probably want something like iterative deepening depth-first search.

import heapq
import math

symbol_probability_dict = {"a": 0.7, "b": 0.1, "c": 0.3, "z": 0.01}
L = 4

# Sort symbols by increasing loss (-log p), i.e. by decreasing probability.
loss_symbol_pairs = [(-math.log(p), c) for (c, p) in symbol_probability_dict.items()]
loss_symbol_pairs.sort()

# Heap entries: (lower bound on the finished word's loss, up to a constant
# offset of L times the best symbol's loss; index of the most probable symbol
# still allowed; prefix built so far).
heap = [(0, 0, "")]
while heap:
    min_loss, i, s = heapq.heappop(heap)
    if len(s) < L:
        # Either extend the prefix with the cheapest allowed symbol (bound unchanged)...
        heapq.heappush(heap, (min_loss, i, s + loss_symbol_pairs[i][1]))
        if i + 1 < len(loss_symbol_pairs):
            # ...or rule that symbol out for the rest of the word, so every
            # remaining position costs at least the next symbol's loss.
            heapq.heappush(
                heap,
                (
                    min_loss
                    + (L - len(s))
                    * (loss_symbol_pairs[i + 1][0] - loss_symbol_pairs[i][0]),
                    i + 1,
                    s,
                ),
            )
    else:
        print(s)

Output:

aaaa
aaac
aacc
aaab
accc
aacb
cccc
accb
aabb
aaaz
cccb
acbb
aacz
ccbb
abbb
accz
aabz
cbbb
cccz
acbz
bbbb
ccbz
abbz
aazz
cbbz
aczz
bbbz
cczz
abzz
cbzz
bbzz
azzz
czzz
bzzz
zzzz
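
As a side note on the iterative-deepening remark above, here is a rough sketch of the bounded depth-first pass such an approach could be built on; the helper name words_below and the fixed loss bound per pass are my own assumptions, not part of the code above. Each pass needs only O(L) extra memory, but within a pass the words come out in depth-first order rather than probability order, so a driver would still have to sort each batch (or tighten the bound gradually).

import math

def words_below(losses, L, bound, prefix=(), start=0, loss=0.0):
    # losses: list of (loss, symbol) pairs sorted by increasing loss.
    if len(prefix) == L:
        yield "".join(prefix)
        return
    for i in range(start, len(losses)):
        step, sym = losses[i]
        # The cheapest completion from here repeats symbol i; prune when even
        # that exceeds the bound (later symbols only cost more).
        if loss + step * (L - len(prefix)) > bound:
            break
        yield from words_below(losses, L, bound, prefix + (sym,), i, loss + step)

losses = sorted((-math.log(p), c) for c, p in {"a": 0.7, "b": 0.1, "c": 0.3, "z": 0.01}.items())
print(list(words_below(losses, 4, bound=-math.log(0.03))))
# ['aaaa', 'aaac', 'aaab', 'aacc']  -- the words with probability >= 0.03, in DFS order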
Kittiekittiwake answered 5/5, 2022 at 1:21 Comment(0)

This answer provides an implementation of @user3386109's comment:

I think the solution is a priority queue. The initial input to the queue is a string with the highest probability letter (aaaa). After reading a string from the queue, replace a letter with the next lower probability letter, and add that new string to the queue. So after reading aaaa, write aaac. After reading aaac, write aacc (changing an a to c) and aaab (changing a c to a b).

I wrote a generator function, following these exact steps:

  • Define a helper function prio(word), which returns the priority of a word (as a negative number, because Python heaps are min-heaps);
  • Define a helper dict next_letter, where next_letter.get(letter) is the next lower-probability letter after letter, if any (see the small example after this list);
  • Initialise a heapq queue with the first word aaaa and its corresponding priority;
  • When popping words from the queue, avoid possible duplicates by comparing the current word with the previous word;
  • After popping a word from the queue, if it is not a duplicate, yield this word, and push the words obtained by replacing one letter with the next lower-probability letter.
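
For instance, with the example weights used in the test below, that next_letter dict works out as follows (a small check of my own, not part of the answer's code):

from itertools import pairwise  # Python 3.10+

weights = {'a': 0.7, 'b': 0.1, 'c': 0.3, 'z': 0.01}
print(dict(pairwise(sorted(weights, key=weights.get, reverse=True))))
# {'a': 'c', 'c': 'b', 'b': 'z'}  -- each letter maps to the next less probable one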

Since it is a generator function, it is lazy: you can get the first N words without computing all words. However, some extra words will still be computed, since the whole idea of the priority queue is that we don't know the exact order in advance.

import heapq
from math import prod
from itertools import pairwise  # Python 3.10+

def gen_combs(weights, r=4):
    # Map each letter to the next lower-probability letter, e.g. {'a': 'c', 'c': 'b', 'b': 'z'}.
    next_letter = dict(pairwise(sorted(weights, key=weights.get, reverse=True)))
    # Negative product of weights, because heapq implements a min-heap.
    def prio(word, weights=weights):
        return -prod(map(weights.get, word))
    first_word = max(weights, key=weights.get) * r
    queue = [(prio(first_word), first_word)]
    prev_word = None
    while queue:
        w, word = heapq.heappop(queue)
        if word != prev_word:  # identical words pop consecutively, so one comparison suffices
            yield word
            prev_word = word
            seen_letters = set()
            for i, letter in enumerate(word):
                # Replace one occurrence of each distinct letter with the next
                # lower-probability letter, and push the resulting word.
                if letter not in seen_letters:
                    new_letter = next_letter.get(letter)
                    if new_letter:
                        new_word = ''.join((word[:i], new_letter, word[i+1:]))
                        heapq.heappush(queue, (prio(new_word), new_word))
                        seen_letters.add(letter)

# TESTING

weights = {'a': 0.7, 'b': 0.1, 'c': 0.3, 'z': 0.01}

# print all words
print(list(gen_combs(weights)))
# ['aaaa', 'caaa', 'ccaa', 'baaa', 'ccca', 'bcaa', 'cccc',
#  'bcca', 'bbaa', 'zaaa', 'bccc', 'bbca', 'zcaa', 'bbcc',
#  'bbba', 'zcca', 'zbaa', 'bbbc', 'zccc', 'zbca', 'bbbb',
#  'zbcc', 'zbba', 'zzaa', 'zbbc', 'zzca', 'zbbb', 'zzcc',
#  'zzba', 'zzbc', 'zzbb', 'zzza', 'zzzc', 'zzzb', 'zzzz']

n = 10

# print first n words, one per line
for _,word in zip(range(n), gen_combs(weights)):
    print(word)

# print first n words
from itertools import islice
print(list(islice(gen_combs(weights), n)))
# ['aaaa', 'caaa', 'ccaa', 'baaa', 'ccca', 'bcaa', 'cccc', 'bcca', 'bbaa', 'zaaa']
Grout answered 5/5, 2022 at 9:42 Comment(0)

First, pardon me for pointing out that your probabilities do not sum to 1, which instantly makes me feel something is wrong.

I would like to share my version (credit to Stef, David, and user3386109), which is easier to prove to be O(NL) in space and O(N·max(L, log N)) in time. Because you need O(NL) just to store/print the result, plus a log N factor for the priority queue, it is hard to improve on this much further. I don't think the other answers here have been shown to be this efficient, complexity-wise. P.S. Making it a generator won't help, since the space is mostly taken by the heap itself. P.S.2: For large L, N itself can be astronomical.

Complexity analysis: the heap starts with 1 element, and each pop (which yields one of the top N) pushes at most 2 elements back in. Therefore the total space is O(N · 2 · item size) = O(NL). The time cost is O(N·max(L, log N)), because each push/pop costs O(log N) while several steps in the inner loop cost O(L).

You can see it prints the same 35 results as Stef's answer. The core of this algorithm is how to choose the (at most) 2 elements to push after each pop. First consider the opposite question: how to find the parent of each child, so that (1) all possible combos form a partial order (the maximal element is 'aaaa', and no, we don't need Zorn's lemma here), (2) we never miss anything larger than the child but smaller than the parent (not an entirely correct statement when two letters share the same probability, but in practice I found it does not matter), and (3) there are no repeats, so we don't have to cache the visited strings. My way of choosing the parent is to keep lowering the letters at the front of the string until they reach 'a'. E.g., the parent of 'cbbz' is 'abbz' rather than 'ccbz'. Conversely, we find the children of a given parent, and it is nice to see there are at most 2 children in this scheme: raise the first non-'a' letter until it equals the next letter ('acbb' => 'abbb', while 'accb' has no child in this direction), or raise the last 'a' ('acbb' => 'ccbb', 'accb' => 'cccb').

import heapq, math

# Build [loss, character] pairs, where loss = -log(probability), then sort so
# that the most probable letter comes first.
letters = [[-math.log(x[1]), x[0]] for x in zip('abcz', [0.7, 0.1, 0.3, 0.01])]
N = 35  # something big to force a full print in testing; here 35 (the total number of combinations) can be calculated by DP or Pascal's triangle
L = 4
letters.sort()
letters = [[i, x[0], x[1]] for i, x in enumerate(letters)]  # [rank, loss, character]

# Heap items: [total loss, list of letter ranks, childPtr].
# Start from the most probable word, i.e. the best letter repeated L times.
heap = [[letters[0][1] * L, [0] * L, L - 1]]

while heap and N > 0:
    N -= 1
    p, nextLargest, childPtr = heapq.heappop(heap)
    print(''.join([letters[x][2] for x in nextLargest]))
    # First child: raise the letter just after childPtr, provided it stays no
    # larger than its right neighbour (or it is the last position).
    if childPtr < L - 1 and nextLargest[childPtr + 1] < len(letters) - 1:
        child = list(nextLargest)
        child[childPtr + 1] += 1
        if childPtr + 2 == L or child[childPtr + 1] <= child[childPtr + 2]:
            prob = sum([letters[x][1] for x in child])  # this can be improved by using a delta, but won't change the overall complexity
            heapq.heappush(heap, [prob, child, childPtr])
    # Second child: raise the letter at childPtr itself and move childPtr one step left.
    if childPtr >= 0:
        child = list(nextLargest)
        child[childPtr] += 1
        prob = sum([letters[x][1] for x in child])
        heapq.heappush(heap, [prob, child, childPtr - 1])
Disjunction answered 6/5, 2022 at 1:0 Comment(1)
"First pardon me for pointing out your probabilities do not sum to 1 which instantly makes me feel something is wrong." If the weight ofa given letter a is the probability that it will appear in a word, there is no reason that the weights of the 26 letter sum to 1; in fact, the sum should likely be higher than 1, since there are usually more than one letter per word.Grout
