Efficiently build a graph of words with given Hamming distance

I want to build a graph from a list of words with Hamming distance of (say) 1; or, to put it differently, two words are connected if they differ by only one letter (lol -> lot).

So that, given

words = [ lol, lot, bot ]

the graph would be

{
  'lol' : [ 'lot' ],
  'lot' : [ 'lol', 'bot' ],
  'bot' : [ 'lot' ]
}

The easy way is to compare every word in the list with every other word and count the differing chars; sadly, that is an O(N^2) algorithm.
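
For reference, a minimal sketch of what I mean by the easy way (build_graph is just an illustrative name):

from itertools import combinations

def build_graph(words):
    # Compare every pair of words: O(N^2) comparisons.
    graph = {w: [] for w in words}
    for w1, w2 in combinations(words, 2):
        if sum(c1 != c2 for c1, c2 in zip(w1, w2)) == 1:
            graph[w1].append(w2)
            graph[w2].append(w1)
    return graph

print(build_graph(['lol', 'lot', 'bot']))
# {'lol': ['lot'], 'lot': ['lol', 'bot'], 'bot': ['lot']}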

Which algo/ds/strategy can I use to achieve better performance?

Also, let's assume only Latin chars, and that all the words have the same length.

Droll answered 28/6, 2015 at 13:57 Comment(11)
Do you happen to have any good input data? It would be easier to explore the problem if you gave us (or at least me) some nice data to work with.Parvenu
What exactly do you mean by "with Hamming distance of (say) 1"? For every word, is there at least one other word with that distance?Scenography
@shapiro.yaacov This is very true. So, in an inductive sort of way, your graph should contain every word. Maybe this is what's taking so long?Parvenu
@gragas ''.join(random.choice(string.ascii_letters) for x in range(y)) will do the trick if you need some input data ;)Droll
How long are the words actually (in relation to the number of words)? What if you generated every word that is 1-hamming-away and made an edge to it (if it exists)?Scenography
Since the number of edges is potentially O(N^2), I don't think you can come up with an algorithm that is asymptotically better -- which isn't to say that you can't come up with an algorithm which works better in practice.Ingeborg
@JohnColeman I figured I could get some sort of O(N) algorithm if I save all the words as keys in a dict, and then go through the list trying to match partial keys - which could or could not break the O(1) lookup though, I'm not sure at all.Droll
only latin characters?Endemic
O(LRN) could be achieved (L - length of word, R - radix, N - count of words). L*R is < 1000Endemic
@Droll If the length of the words is bounded by k then you should be able to come up with an O(k*N) algorithm -- which is linear in N if k is fixed. Also -- it is important to remember that not all quadratic algorithms are equally poor. After all, quicksort is quadratic in the worst case but is nevertheless often the sorting algorithm of choice.Ingeborg
@shapiro.yaacov I don't think generating all the 1-hamming-away words is asymptotically better than just checking all the words in the list :(Droll

Assuming you store your dictionary in a set(), lookup is O(1) on average (O(N) in the worst case).

You can generate all the valid words at Hamming distance 1 from a word:

>>> import string
>>> def neighbours(word):
...     for j in range(len(word)):
...         for d in string.ascii_lowercase:
...             word1 = ''.join(d if i==j else c for i,c in enumerate(word))
...             if word1 != word and word1 in words: yield word1
...
>>> {word: list(neighbours(word)) for word in words}
{'bot': ['lot'], 'lol': ['lot'], 'lot': ['bot', 'lol']}

If M is the length of a word and L the size of the alphabet (i.e. 26), the worst-case time complexity of finding neighbouring words with this approach is O(L*M*N).

The time complexity of the "easy way" approach is O(N^2).

When is this approach better? When L*M < N, i.e. considering only lowercase letters, when M < N/26. (I considered only the worst case here.)

Note: the average length of an English word is 5.1 letters, so you should consider this approach if your dictionary size is bigger than 132 words (26 * 5.1 ≈ 132.6).

It is probably possible to achieve better performance than this, but this approach is really simple to implement.

Experimental benchmark:

The "easy way" algorithm (A1):

from itertools import zip_longest
def hammingdist(w1,w2): return sum(1 if c1!=c2 else 0 for c1,c2 in zip_longest(w1,w2))
def graph1(words): return {word: [n for n in words if hammingdist(word,n) == 1] for word in words}

This algorithm (A2):

def graph2(words): return {word: list(neighbours(word)) for word in words}

Benchmarking code:

import random
import string
from timeit import Timer

for dict_size in range(100, 6000, 100):
    words = set(''.join(random.choice(string.ascii_lowercase) for x in range(3)) for _ in range(dict_size))
    t1 = Timer(lambda: graph1(words)).timeit(10)
    t2 = Timer(lambda: graph2(words)).timeit(10)
    print('%d,%f,%f' % (dict_size, t1, t2))

Output:

100,0.119276,0.136940
200,0.459325,0.233766
300,0.958735,0.325848
400,1.706914,0.446965
500,2.744136,0.545569
600,3.748029,0.682245
700,5.443656,0.773449
800,6.773326,0.874296
900,8.535195,0.996929
1000,10.445875,1.126241
1100,12.510936,1.179570
...

(data plot: runtime of A1 and A2 vs. dictionary size)

I ran another benchmark with smaller steps of N to look at it more closely:

10,0.002243,0.026343
20,0.010982,0.070572
30,0.023949,0.073169
40,0.035697,0.090908
50,0.057658,0.114725
60,0.079863,0.135462
70,0.107428,0.159410
80,0.142211,0.176512
90,0.182526,0.210243
100,0.217721,0.218544
110,0.268710,0.256711
120,0.334201,0.268040
130,0.383052,0.291999
140,0.427078,0.312975
150,0.501833,0.338531
160,0.637434,0.355136
170,0.635296,0.369626
180,0.698631,0.400146
190,0.904568,0.444710
200,1.024610,0.486549
210,1.008412,0.459280
220,1.056356,0.501408
...

(data plot 2: the same benchmark with smaller steps of N)

You can see the crossover point is very low (around 100 words for a dictionary of length-3 words). For small dictionaries the O(N^2) algorithm performs slightly better, but it is easily beaten by the O(LMN) algorithm as N grows.

For dictionaries with longer words, the O(LMN) algorithm remains linear in N, it just has a different slope, so the crossover point moves slightly to the right (around 130 words for length=5).

Levinson answered 28/6, 2015 at 15:12 Comment(3)
This is a fantastic answer, I would accept it twice if I could. Thanks :)Droll
Pardon me if I'm wrong, but isn't the complexity O(L*M^2*N), since creating a string of length M takes O(M) time?Nipissing
I think @Nipissing is correct here.Preferment

There's no need to take a dependency on the alphabet size. Given a word bot, for example, insert it into a dictionary of word lists under the keys ?ot, b?t, bo?. Then, for each word list, connect all pairs.

import collections


# Bucket words by wildcard patterns: 'bot' is filed under ' ot', 'b t', 'bo '.
# Words that share a bucket differ in at most one position.
d = collections.defaultdict(list)
with open('/usr/share/dict/words') as f:
    for line in f:
        for word in line.split():
            if len(word) == 6:
                for i in range(len(word)):
                    d[word[:i] + ' ' + word[i + 1:]].append(word)
pairs = [(word1, word2) for s in d.values() for word1 in s for word2 in s if word1 < word2]
print(len(pairs))
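
If you want the adjacency-list dictionary from the question rather than just the pair count, the same buckets can be folded into one directly; a minimal sketch reusing the d built above (graph is my own name):

graph = collections.defaultdict(set)
for bucket in d.values():
    for word1 in bucket:
        for word2 in bucket:
            if word1 != word2:
                graph[word1].add(word2)

Each distance-1 pair shares exactly one bucket (the position where the two words differ), so no edge is produced twice; the set just guards against duplicate entries in the word list.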
Rhone answered 28/6, 2015 at 15:57 Comment(2)
Just to check if I got what you mean, that's basically what @mescalinum proposed in his answer, correct?Droll
@Droll Nope, that answer is generating all substitutions explicitly (for d in string.ascii_lowercase).Rhone

A Ternary Search Trie supports near-neighbor searching pretty well.

If your dictionary is stored as a TST then, I believe, the average complexity of the lookups while building your graph would be close to O(N*log(N)) on real-world word dictionaries.

Also check the article Efficient auto-complete with a ternary search tree.
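
For illustration, here is a minimal sketch of such a near-neighbour search over a TST, assuming fixed-length lowercase words and substitutions only (the TSTNode/TST names and the near method are my own, not taken from the article):

class TSTNode:
    __slots__ = ('char', 'left', 'mid', 'right', 'is_word')

    def __init__(self, char):
        self.char = char
        self.left = self.mid = self.right = None
        self.is_word = False


class TST:
    def __init__(self):
        self.root = None

    def insert(self, word):
        self.root = self._insert(self.root, word, 0)

    def _insert(self, node, word, i):
        c = word[i]
        if node is None:
            node = TSTNode(c)
        if c < node.char:
            node.left = self._insert(node.left, word, i)
        elif c > node.char:
            node.right = self._insert(node.right, word, i)
        elif i < len(word) - 1:
            node.mid = self._insert(node.mid, word, i + 1)
        else:
            node.is_word = True
        return node

    def near(self, word, dist):
        # Yield stored words of the same length within Hamming distance `dist`.
        return self._near(self.root, word, 0, dist, '')

    def _near(self, node, word, i, dist, prefix):
        if node is None or dist < 0:
            return
        c = word[i]
        # Left/right siblings hold alternative characters for position i;
        # visit them only if an exact match or a spare edit allows it.
        if c < node.char or dist > 0:
            yield from self._near(node.left, word, i, dist, prefix)
        cost = 0 if c == node.char else 1
        if dist - cost >= 0:
            if i == len(word) - 1:
                if node.is_word:
                    yield prefix + node.char
            else:
                yield from self._near(node.mid, word, i + 1, dist - cost, prefix + node.char)
        if c > node.char or dist > 0:
            yield from self._near(node.right, word, i, dist, prefix)

Usage on the question's example:

words = ['lol', 'lot', 'bot']
tst = TST()
for w in words:
    tst.insert(w)

graph = {w: [n for n in tst.near(w, 1) if n != w] for w in words}
# {'lol': ['lot'], 'lot': ['bot', 'lol'], 'bot': ['lot']}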

Swinson answered 28/6, 2015 at 16:31 Comment(2)
Is the definition of "near" for this data structure really compatible with the result we're trying to achieve?Forepleasure
@Forepleasure Not by default, but it could be made to work by allowing it to go one step further than a trie would by default.Rendezvous

Here is a linear O(N) algorithm, but with a big constant factor (R * L * 2). R is the radix (26 for the Latin alphabet). L is the average word length. 2 is the factor for adding/replacing a wildcard character: abc -> aac (replace) and abc -> abca (add) are the two kinds of ops, each giving a distance of 1.

It is written in Ruby. For 240k words it takes ~250 MB of RAM and 136 seconds on average hardware.

Blueprint of graph implementation

class Node
  attr_reader :val, :edges

  def initialize(val)
    @val = val
    @edges = {}
  end

  def <<(node)
    @edges[node.val] ||= true
  end

  def connected?(node)
    @edges[node.val]
  end

  def inspect
    "Val: #{@val}, edges: #{@edges.keys * ', '}"
  end
end

class Graph
  attr_reader :vertices
  def initialize
    @vertices = {}
  end

  def <<(val)
    @vertices[val] = Node.new(val)
  end

  def connect(node1, node2)
    # print "connecting #{size} #{node1.val}, #{node2.val}\r"
    node1 << node2
    node2 << node1
  end

  def each
    @vertices.each do |val, node|
      yield [val, node]
    end
  end

  def get(val)
    @vertices[val]
  end
end

The algorithm itself

CHARACTERS = ('a'..'z').to_a
graph = Graph.new

# ~ 240 000 words
File.read("/usr/share/dict/words").each_line do |word|
  word = word.chomp
  graph << word.downcase
end

graph.each do |val, node|
  CHARACTERS.each do |char|
    i = 0
    while i <= val.size
      # insert `char` at position i (links words whose length differs by 1)
      node2 = graph.get(val[0, i] + char + val[i..-1])
      graph.connect(node, node2) if node2
      if i < val.size
        # replace the character at position i with `char`
        node2 = graph.get(val[0, i] + char + val[i+1..-1])
        graph.connect(node, node2) if node2 && node2 != node # skip self-loop when char == val[i]
      end
      i += 1
    end
  end
end
Endemic answered 28/6, 2015 at 15:40 Comment(0)
