X-Y Fast Trie in real world applications

I am trying to understand the X-fast and Y-fast trie data structures, and it is not clear to me why they are not used in large databases, since their asymptotic complexity is less than log(N). When a database holds terabytes of data, wouldn't it be better to use a Y-fast trie than, for example, a B-tree?

Werewolf asked 13/10, 2014 at 17:34 Comment(3)
Which database implementation are you talking about? It would seem that if the states were high enough, as they might be in a flagship product, the vendor could simply try one, then try the other and use the best!Dragline
I meant to say 'if the stakes were high enough' ... but S.O. won't let me edit it...Dragline
I am looking for a data structure to handle potentially terabytes of data, and I found, for example, that B-trees and their derivatives are used to index data blocks or to implement filesystems. So my question is: why not use something that is asymptotically better? I didn't find any studies about the performance of Y-fast tries in a real-world context. I understood that a Y-fast trie may not be efficient when only a few elements are stored, or am I wrong? Thank youWerewolf

There are a few reasons that X-fast or Y-fast tries might not be useful in practice. Here are a few:

  1. X-fast tries internally require several linked structures, including a bitwise trie and a doubly-linked list of elements. These don't perform well in database environments where elements are stored on disks and following a pointer can require a disk seek. (For similar reasons, databases often use B-trees over binary search trees). Additionally, they require the use of balanced binary search trees augmented with information to perform a split or join, which adds in extra space and introduces even more pointers to follow.

  2. X-fast tries internally require the use of hash tables with worst-case O(1) lookups. Hash tables with these requirements usually require a variety of hash functions to be applied to look up an element and (generally speaking) don't have the best locality compared to, say, a linear-probing hash table, so lookups are a bit slower. (A rough sketch after this list illustrates the lookup path that points 1 and 2 imply.)

  3. Because of the hash tables in X-fast tries and the use of splitting and joining of BSTs in Y-fast tries, these two data structures are only amortized efficient rather than worst-case efficient. In some cases, this is unacceptable - it would be bad if, periodically, a database query ends up taking 100x or 1000x normal time even if on average everything works out quite well.

  4. The use of hash tables in X-fast tries and Y-fast tries means that there is an element of randomness involved in the runtimes of the data structures. On expectation they're efficient, but it's possible that due to bad luck, the runtimes of the data structures might be quite high. Specifically, the cost of doing a rehash in an internal hash table or doing a split or join on a tree can be quite high. In a database implementation, reliability is important, so this randomness might hurt.

  5. Due to all the reasons outlined above, the constant factors buried in the runtimes of X-fast and Y-fast tries are quite large. In the long run, they should end up being faster than other data structures, but "the long run" might require inputs that are vastly larger than the sorts of data sets that could feasibly fit into a database.
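
To make points 1, 2, and 5 a bit more concrete, here is a rough in-memory sketch of the lookup path an X-fast trie implies. It is not taken from any real implementation (the `XFastSketch` and `Leaf` names are made up, deletions are omitted, and the key width is tiny); the point is simply to show how many dependent hash probes and pointer hops a single query needs: one hash table per prefix length, a binary search over those levels, and a final hop in a sorted doubly-linked list.

```
W = 16  # bits in the key universe (illustrative; real keys are often 64-bit)

class Leaf:
    """A stored key in the sorted doubly-linked list at the bottom level."""
    __slots__ = ("key", "prev", "next")
    def __init__(self, key):
        self.key, self.prev, self.next = key, None, None

class XFastSketch:
    def __init__(self):
        # One hash table per prefix length: level[i] maps an i-bit prefix of a
        # stored key to the (min leaf, max leaf) stored below that prefix.
        self.level = [dict() for _ in range(W + 1)]
        self.head = None  # smallest leaf in the doubly-linked list

    def insert(self, key):
        leaf = Leaf(key)
        # Splice into the sorted list (a linear scan keeps the sketch short; a
        # real X-fast trie would locate the spot via predecessor()).
        prev, cur = None, self.head
        while cur is not None and cur.key < key:
            prev, cur = cur, cur.next
        leaf.prev, leaf.next = prev, cur
        if prev is not None:
            prev.next = leaf
        else:
            self.head = leaf
        if cur is not None:
            cur.prev = leaf
        # Update the (min, max) descendant pointers on every prefix level:
        # W + 1 hash-table writes per insert.
        for i in range(W + 1):
            prefix = key >> (W - i)
            lo, hi = self.level[i].get(prefix, (leaf, leaf))
            self.level[i][prefix] = (leaf if key <= lo.key else lo,
                                     leaf if key >= hi.key else hi)

    def predecessor(self, key):
        """Largest stored key <= key, or None if there is none."""
        if self.head is None:
            return None
        # Binary search for the longest stored prefix of key: O(log W) steps,
        # each one a lookup in a *different* hash table.
        lo_len, hi_len = 0, W
        while lo_len < hi_len:
            mid = (lo_len + hi_len + 1) // 2
            if (key >> (W - mid)) in self.level[mid]:
                lo_len = mid
            else:
                hi_len = mid - 1
        mn, mx = self.level[lo_len][key >> (W - lo_len)]
        # Finish with at most one hop in the linked list: one more pointer
        # dereference, i.e. one more potential cache miss or disk seek.
        if mx.key <= key:
            return mx.key
        return mn.prev.key if mn.prev is not None else None

# t = XFastSketch()
# for k in (3, 9, 20):
#     t.insert(k)
# t.predecessor(15)  # -> 9
```

Even in this toy form, one predecessor query performs about log W dependent hash lookups plus a linked-list hop, and every insert writes to all W + 1 levels. In RAM each of those is a potential cache miss; on disk, a potential seek, which is exactly the locality and constant-factor concern described above.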

Hope this helps!

Nystagmus answered 15/10, 2014 at 20:55 Comment(2)
Very helpful! One last question: can you suggest an article, book, or anything else that backs up what you wrote with experiments? I couldn't find anything about the behaviour of these tries in databases, but I need some formal study. Thank you very much!Werewolf
I don't know of any specific tests that confirm this - this is based on my knowledge of general data structures, database implementation, and x/y-fast tries. I taught a class on data structures last spring and spent a lecture on these data structures, so perhaps the slides might help a bit. If you're curious, you can see them here: web.stanford.edu/class/archive/cs/cs166/cs166.1146/lectures/15/…Nystagmus

While templatetypedef seems to have given a detailed answer, most points in that answer are simply not true. Let me refute them one by one:

  1. X-fast tries internally require several linked structures...don't perform well in database environments...

X-fast tries are essentially binary search trees defined over a metric space. There are techniques for organizing a binary tree into reasonable pages. The doubly linked list seems bad, but its pointers are not followed until a successor has been found. That's only 1 disk seek.

  2. X-fast tries internally require the use of perfect hash tables...

That's not true. You can use whatever hashing technique you like. To guarantee constant-time lookup, you can use cuckoo hashing instead of perfect hashing.
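
To illustrate, here is a rough sketch of two-table cuckoo hashing (the `CuckooTable` class, its seeds, and the kick limit are all made up for illustration, not taken from any particular library). The property being relied on here is visible in `get`: a lookup probes exactly two fixed slots, so queries are worst-case constant time, while `put` is where the kicking and the occasional rehash happen.

```
import random

class CuckooTable:
    """Toy two-table cuckoo hash map: lookups probe at most two slots."""

    def __init__(self, capacity=16):
        self.cap = capacity
        self.t1 = [None] * capacity
        self.t2 = [None] * capacity
        self._reseed()

    def _reseed(self):
        # Two independent hash functions, derived from random seeds.
        self.s1 = random.randrange(1, 1 << 30)
        self.s2 = random.randrange(1, 1 << 30)

    def _h1(self, key):
        return hash((self.s1, key)) % self.cap

    def _h2(self, key):
        return hash((self.s2, key)) % self.cap

    def get(self, key):
        # Worst-case constant time: this is the entire lookup path.
        for slot in (self.t1[self._h1(key)], self.t2[self._h2(key)]):
            if slot is not None and slot[0] == key:
                return slot[1]
        return None

    def put(self, key, value):
        # Kick items back and forth between the two tables; if that loops too
        # long (the bound is arbitrary), grow the tables and rehash everything
        # with fresh seeds. This is where the amortized, occasionally spiky,
        # insert cost lives. (Toy: re-inserting a key is not deduplicated.)
        item = (key, value)
        for _ in range(8 * self.cap):
            i = self._h1(item[0])
            item, self.t1[i] = self.t1[i], item
            if item is None:
                return
            i = self._h2(item[0])
            item, self.t2[i] = self.t2[i], item
            if item is None:
                return
        self._grow_and_rehash(item)

    def _grow_and_rehash(self, pending):
        entries = [e for e in self.t1 + self.t2 if e is not None] + [pending]
        self.cap *= 2
        self.t1, self.t2 = [None] * self.cap, [None] * self.cap
        self._reseed()
        for k, v in entries:
            self.put(k, v)

# table = CuckooTable()
# table.put("key", 42)
# table.get("key")  # -> 42, after probing at most two slots
```

Reads stay predictable; the cost is pushed onto writes, which is exactly the trade-off the next points are about.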

  3. Because of the hash tables in X-fast tries...only amortized efficient rather than worst-case efficient...

That's not true either. If you use perfect hashing or cuckoo hashing, constant lookup time is guaranteed. The insert/delete time is amortized, but that's not too bad for a search-intensive system (e.g. Lucene). Most modern search engines are built on the idea of sacrificing write time to boost search speed.

  4. The use of hash tables in X-fast tries and Y-fast tries means that there is an element of randomness involved in the runtimes of the data structures...

That depends on the goal of the system. The above statement matters only if you want to support reliable and efficient writes. But for the majority of search-oriented applications, your priority is to boost search speed.

  5. Due to all the reasons outlined above, the constant factors buried in the runtimes of X-fast and Y-fast tries are quite large...

It is true that the constant factor is not trivial - but I would be surprised to see anything more than 4, and considering the log log M term (e.g., 6 for a 64-bit universe), that's nothing.

So what's the real reason there are so few industrial applications of Y-fast tries? The database industry is a large, lucrative business that tends to be slow in adopting new techniques. The R-tree did not see much adoption until the late 1990s. Object-oriented concepts were consistently rejected by the industry despite their theoretical maturity. The DB people also kept rejecting anything other than B-trees until open-source search engines such as Lucene beat RDBMSs on almost every front of the search-related business. Just last month a senior Oracle guy told me a trie/hash-table-based system could never be real-time - until I showed him how to do it with in-memory caching/merging.

Manda answered 19/11, 2015 at 17:31 Comment(6)
But I have a doubt: I know that hash tables generally use a load factor to decide the new size of the table when it is rebuilt. Since each level of an X/Y-fast trie holds O(n) elements in its hash tables, a rebuild costs O(n). Why do insert and remove operations still take O(log log n)? Is it because the hash table rebuild is amortized?Werewolf
No, using techniques such as cuckoo hashing, you always have constant query time (the constant factor is only 2 for cuckoo hashing, for example). The rehashing time is amortized during INSERT/UPDATE. That means you may have some prolonged latency when updating the data, but the search time is always fast.Manda
Thank you for your answers!Werewolf
@Manda I believe templatetypedef's answer has been updated with respect to (2), but cuckoo hashing tends to have poor locality of reference compared to probing. From what I understand, in practice the performance losses to cache misses eat the efficiency gains from the better algorithm.Craniology
@Craniology That's not entirely accurate. I have implemented several cuckoo hash tables in practice, and they are very fast. Yes, memory locality has some negative effect on both read and write speed, but the effect is not that big. In fact, Cuckoo-2 (two tables) seems second only to linear probing in terms of speed, on par with quadratic probing and faster than double hashing. The real limitations of cuckoo hash tables in practice are 1) unpredictable write time with the possibility of large spikes in latency, and 2) space usage. But I don't think those are show-stoppers for a search-heavy use case.Manda
@Manda Good to know!Craniology

The thing about B-trees is that they often appear in filesystem implementations and other applications with relatively large data blocks / smallest addressable units.

A regular hard disk drive with a block size of 4 KB can store up to 512 64-bit pointers in each block, which makes for B-trees only a handful of levels deep (4 or 5, even for N approaching 2^40), since depth = log(N)/log(512).
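
As a quick sanity check of that arithmetic (the figures below are illustrative and just restate the 4 KB block / 64-bit pointer assumption above):

```
import math

block_bytes = 4096                       # one disk block
pointer_bytes = 8                        # one 64-bit child pointer
fanout = block_bytes // pointer_bytes    # 512 children per node

n = 2 ** 40                              # on the order of a trillion keys
btree_depth = math.ceil(math.log(n, fanout))   # 40 / 9 ~= 4.4 -> 5 levels
loglog_64 = math.ceil(math.log2(64))           # log log of a 64-bit universe -> 6

print(btree_depth, loglog_64)            # 5 6
```

With a fan-out that large, the B-tree's depth is already in the same ballpark as the log log M bound of a van Emde Boas-style structure, before any constant factors are considered.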

Now of course there are other practical uses for X/Y-fast tries, but filesystems and databases on the order of terabytes or petabytes are not among them.

Ludwick answered 1/12, 2023 at 18:14 Comment(0)

A bit late but I think the other answers are wrong. The reason no one uses x/y-fast tries is that there's a better, faster alternative.

BSTs can answer the same queries as x/y-fast tries, but with much smaller constants and an O(log n) worst case. Most of the time, for really big data structures, you cannot afford an O(n log u) or O(n + log u) worst case (as in the x/y-fast tries), because n can be huge.

Also, the BST is almost certainly going to be faster no matter how big n gets, because the log is very small (even if n were the number of atoms in the observable universe, about 10^80, the log would only be about 266).
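
(A quick check of that figure, assuming roughly 10^80 atoms in the observable universe:)

```
import math

atoms = 10 ** 80          # rough count of atoms in the observable universe
print(math.log2(atoms))   # ~265.75: a balanced BST would be about 266 levels deep
```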

So the constants matter most, not the asymptotic complexity. And even if the BST were slower, it would only be slightly slower, and only when n is really huge.

Also, @jzl106 said that the constants are no more than 4, but that is simply not credible. If the constants are measured as they should be, as the number of processor clock ticks divided by log log M, you get much, much bigger constants.

So BSTs are probably going to be faster, they have a small worst-case time, and they are much easier to implement. So there isn't any practical reason to use x/y-fast tries (at least not for their regular queries of insert, delete, find, predecessor, and successor).

Musa answered 13/9, 2019 at 19:0 Comment(0)
