This answer provides an implementation of @user3386109's comment:
I think the solution is a priority queue. The initial input to the queue is a string with the highest probability letter (aaaa
). After reading a string from the queue, replace a letter with the next lower probability letter, and add that new string to the queue. So after reading aaaa
, write aaac
. After reading aaac
, write aacc
(changing an a
to c
) and aaab
(changing a c
to a b
).
I wrote a generator function, following these exact steps:
- Define a helper functions:
prio(word)
which returns the priority of a word (as a negative number because python heaps are min-heaps);
- Define a helper dict:
next_letter.get(letter)
is the next higher-priority letter after letter
, if any;
- Initialise a
heapq
queue with first word aaaa
, and its corresponding priority;
- When popping words from the queue, avoid possible duplicates by comparing the current word with the previous word;
- After popping a word from the queue, if it is not a duplicate, then
yield
this word, and push the words obtained by replacing a letter by the next probability letter;
Since it is a generator function, it is lazy: you can get the first N
words without computing all words. However, some extra words will still be computed, since the whole idea of the priority queue is that we don't know the exact order in advance.
import heapq
from math import prod
from itertools import pairwise
def gen_combs(weights, r = 4):
next_letter = dict(pairwise(sorted(weights, key=weights.get, reverse=True)))
def prio(word, weights=weights):
return -prod(map(weights.get, word))
first_word = max(weights, key=weights.get) * r
queue = [(prio(first_word), first_word)]
prev_word = None
while queue:
w, word = heapq.heappop(queue)
if word != prev_word:
yield word
prev_word = word
seen_letters = set()
for i, letter in enumerate(word):
if letter not in seen_letters:
new_letter = next_letter.get(letter)
if new_letter:
new_word = ''.join((word[:i], new_letter, word[i+1:]))
heapq.heappush(queue, (prio(new_word), new_word))
seen_letters.add(letter)
# TESTING
weights = {'a': 0.7, 'b': 0.1, 'c': 0.3, 'z': 0.01}
# print all words
print(list(gen_combs(weights)))
# ['aaaa', 'caaa', 'ccaa', 'baaa', 'ccca', 'bcaa', 'cccc',
# 'bcca', 'bbaa', 'zaaa', 'bccc', 'bbca', 'zcaa', 'bbcc',
# 'bbba', 'zcca', 'zbaa', 'bbbc', 'zccc', 'zbca', 'bbbb',
# 'zbcc', 'zbba', 'zzaa', 'zbbc', 'zzca', 'zbbb', 'zzcc',
# 'zzba', 'zzbc', 'zzbb', 'zzza', 'zzzc', 'zzzb', 'zzzz']
n = 10
# print first n words, one per line
for _,word in zip(range(n), gen_combs(weights)):
print(word)
# print first n words
from itertools import islice
print(list(islice(gen_combs(weights), n)))
# ['aaaa', 'caaa', 'ccaa', 'baaa', 'ccca', 'bcaa', 'cccc', 'bcca', 'bbaa', 'zaaa']
aaaa
). After reading a string from the queue, replace a letter with the next lower probability letter, and add that new string to the queue. So after readingaaaa
, writeaaac
. After readingaaac
, writeaacc
(changing ana
toc
) andaaab
(changing ac
to ab
). – Shinleaf