How to scale a trie across multiple servers
Asked Answered
S

1

11

Does anyone know how I might scale a Trie across multiple machines? Say the first machine runs out of space and I need to add more words from a very large dictionary, what might I do to add more words? (I am a Java thinker, but I believe the answer can be language agnostic). I have already realized that I cannot just say one machine for each first character, but that doesn't really scale.

Sympathetic answered 16/5, 2015 at 21:37 Comment(0)
L
7

Ok, given the assumption, that both of your machines have the same resources available, let’s first look at a simpler example:

how would you scale a binary tree? Or even better - an AVL tree? There are several examples to do this:

  1. If there would be only 2 machines and storage is your problem, I’d keep the root and the left subtree on one machine, and send the right subtree to the other.
  2. If you’d have 3 machines and would also want to have a load balancer, the root would stay on one machine and the left and right subtree would be split accross the other 2 machines. If you have 5 you keep the root and first level of children on the load balancer and split the rest of the tree.

(note that balancing such a distributed tree will be much more complicated, as you’ll need to communicate with other machines and do it possibly inside a distributed transaction, to be able to answer all requests concurrently)

So, now a trie, which - AFAIR - is a tree / letter. If the letters in your words would be distributed evenly, you could have A-M on one machine and N-Z on the other. This will probably not work, but you’ll for sure be able to split it more or less 50/50 like this.

If you now want to add more and more machines, I’d keep a main node which would work as a load balancer and distribute it to the child nodes, which would only take care of few letters. For instance you could have nodes

  • A-F
  • G-M
  • N-R
  • S
  • T-Z

Assuming, you have roughly as much data for the letters A-F as you have for the letter S. (There actually might be a language, where this would be at least close to the most optimal distribution)

Now if you get too many letters in A-F you can just split it into A-D and E-F for instance, nothing really changes there. The problem will be if you’d get too many letters in S. Now you’d have 3 possibilities:

  1. You make another load balancer for the letter S - this will be for sure easy, as you already have a load balancer implemented and you can use the same functionality on any level
  2. You keep letters SA-SM (for instance) in one node, which will be the main node, store SN-SZ on a separate node. So if you get SP.. the first load balancer would send it to your SA-SM node and that one would forward it to SN-SZ
  3. You modify the load root load balancer to be able to specify more complex boundaries between nodes, such as you’d have now the nodes

    • A-F
    • G-M
    • N-R
    • SA-SM
    • SN-SZ
    • T-Z

Here number 1 is probably the easiest and cleanest solution, but might have some unused hardware. In case you can use different resources for nodes, option 1 with a small load balancer for the letter S would be probably the way to go. Option 2 is a dirty mix, and option 3 might be the nicest way to go, but it makes the load balancer potentially complicated and error prone.

Hope this ideas help you.

Litta answered 17/5, 2015 at 9:16 Comment(4)
Thank you so much!!! I have marked your response as it answers the question. But now I have a crazy case to ask about. What if one word -- just one word -- is simply too big to store on one server. I know it's a crazy thing to ask. But how do you handle that case?Sympathetic
how do you create the trie? Because the way I've learned it, you take a word / sentence / <what ever you want to index> and put at first every existing character into the first level, and each subsequent level would have the next character in sequence, that's basically the way to find if a sequence of chracters exists in the text or not. This way, if the text (word) doesn't fit, you would just need to recognise it already during the indexing process and split the trie up, according to one of the 3 methods I've described.Litta
If, on the other hand, you're only indexing words for starting with a certain sequence of characters, than your trie will be much smaller, but you're right, splitting up a word/text which wouldn't fit, would be much more difficult. I guess the only way to go about that, would be, splitting away some middle layer, than if you access that layer, it will actually forward the call to a different machine, where the rest of the data is stored. But to figure out, where to do that cut, is of course much more complicated.Litta
This may seem very naive question, but I wanted to ask, like how a trie (or part of it) is stored in a machine. Trie is data structure, which mean it can reside in RAM. However, for DB, we can have Nosql or Sql. how will we map trie to DB.Distinctly

© 2022 - 2024 — McMap. All rights reserved.