How do I balance a BK-Tree and is it necessary?
Asked Answered
A

1

8

I am looking into using an Edit Distance algorithm to implement a fuzzy search in a name database.

I've found a data structure that will supposedly help speed this up through a divide and conquer approach - Burkhard-Keller Trees. The problem is that I can't find very much information on this particular type of tree.

If I populate my BK-tree with arbitrary nodes, how likely am I to have a balance problem?

If it is possibly or likely for me to have a balance problem with BK-Trees, is there any way to balance such a tree after it has been constructed?

What would the algorithm look like to properly balance a BK-tree?

My thinking so far:

It seems that child nodes are distinct on distance, so I can't simply rotate a given node in the tree without re-calibrating the entire tree under it. However, if I can find an optimal new root node this might be precisely what I should do. I'm not sure how I'd go about finding an optimal new root node though.

I'm also going to try a few methods to see if I can get a fairly balanced tree by starting with an empty tree, and inserting pre-distributed data.

  • Start with an alphabetically sorted list, then queue from the middle. (I'm not sure this is a great idea because alphabetizing is not the same as sorting on edit distance).
  • Completely shuffled data. (This relies heavily on luck to pick a "not so terrible" root by chance. It might fail badly and might be probabilistically guaranteed to be sub-optimal).
  • Start with an arbitrary word in the list and sort the rest of the items by their edit distance from that item. Then queue from the middle. (I feel this is going to be expensive, and still do poorly as it won't calculate metric space connectivity between all words - just each word and a single reference word).
  • Build an initial tree with any method, flatten it (basically like a pre-order traversal), and queue from the middle for a new tree. (This is also going to be expensive, and I think it may still do poorly as it won't calculate metric space connectivity between all words ahead of time, and will simply get a different and still uneven distribution).
  • Order by name frequency, insert the most popular first, and ditch the concept of a balanced tree. (This might make the most sense, as my data is not evenly distributed and I won't have pure random words coming in).

FYI, I am not currently worrying about the name-synonym problem (Bill vs William). I'll handle that separately, and I think completely different strategies would apply.

Ambrosio answered 31/12, 2012 at 11:10 Comment(2)
Did you find answers to your questions maybe?Lynd
I thought order by name frequency (insert most popular first) would be the fastest. However in my tries I measured that reversed frequency (inserting most popular last) outperformed my all tries. I couln't understand why, I was expecting the opposite.Basic
D
0

There is a lisp example in the article: http://cliki.net/bk-tree. About unbalancing the tree I think the data structure and the method seems to be complicated enough and also the author didn't say anything about unbalanced tree. When you experience unbalanced tree maybe it's not for you?

Dejesus answered 31/12, 2012 at 12:48 Comment(4)
Thanks for the link, but I am not having problems with the base algorithm for building a BK-tree. The lisp example is how to use their library, and says nothing about tree balance. "When you experience unbalanced tree maybe it's not for you?" - can you expand on this? Which other options do I have? For example, is there some specific Vantage Point Tree derivative I could use instead?Ambrosio
I'm not sure if the BK-tree is any good. For example a trie or a kart-tire can also solve your problem. Of course in 2d euklidian space you can have shortcut. Read about triangle inequaliy.Dejesus
Tries (radix trees) are helpful for auto-complete (which is not what I'm trying to implement), but not nearly as much for typos. I imagine they could be modified to help speed up Levinshtein calculations, but they wouldn't give me a fuzzy match set based on edit distance/metric space. "Of course in 2d euklidian space you can have shortcut" - that's what BK-trees are for... they're just a metric space tree.Ambrosio
Yes, but you can implement a wildcard search: phpir.com/tries-and-wildcards.Dejesus

© 2022 - 2024 — McMap. All rights reserved.