Are there real-world reasons to employ a Binary Search Tree over a Binary Search of a semi-contiguous list?

I'm watching university lectures on algorithms, and it seems that so many of them rely almost entirely on binary search trees of some particular sort for querying/database/search tasks.

I don't understand this obsession with Binary Search Trees. It seems like in the vast majority of scenarios, a BST could be replaced with a sorted array in the case of static data, or with a sorted bucketed list if insertions occur dynamically, and then a Binary Search could be employed over them.

With this approach, you get the same algorithmic complexity (for querying at least) as a BST, way better cache coherency, way less memory fragmentation (and fewer GC allocs depending on what language you're in), and it's likely much simpler to write.

The fundamental issue is that BSTs are completely memory naïve -- their focus is entirely on asymptotic (big-O) complexity, and they ignore the very real performance considerations of memory fragmentation and cache coherency... Am I missing something?

Goles answered 19/6, 2021 at 1:59

Binary search trees (BSTs) are not totally equivalent to the proposed data structure. Their asymptotic complexity is better when it comes to inserting and removing values dynamically while keeping them sorted (assuming the tree is balanced correctly). For example, when you want to build an index of the top-k values dynamically:

while not end_of_stream(stream):
    value <- stream.pop_value()
    tree.insert(value)
    if tree.size() > k:
        tree.remove_max()

Sorted arrays are not efficient in this case because of the linear-time insertion. The asymptotic complexity of bucketed lists is not better than that of plain lists, and they also suffer from a linear-time search. Note that a heap could be used in this case, and in fact it is probably better to use a heap here, although heaps and BSTs are not always interchangeable.
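
For instance, a minimal Python sketch of the heap-based alternative could look like this (stream is just any iterable of values and k the number of values to keep; it keeps the k smallest values seen so far, matching the remove_max pseudocode above):

import heapq

def k_smallest(stream, k):
    # heapq is a min-heap, so store negated values to be able to pop
    # the current maximum of the kept values in O(log k).
    heap = []
    for value in stream:
        heapq.heappush(heap, -value)
        if len(heap) > k:
            heapq.heappop(heap)  # drop the largest of the k+1 kept values
    return sorted(-v for v in heap)

print(k_smallest(iter([5, 1, 9, 3, 7, 2]), 3))  # [1, 2, 3]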

That being said, you are right: BSTs are slow, cause a lot of cache misses and fragmentation, etc. Thus, they are often replaced by more compact variants like B-trees. A B-tree stores a sorted array of keys inside each node, which reduces the number of node-to-node jumps and makes the data structure much more compact. They can be mixed with some 4-byte pointer optimizations to make them even more compact. B-trees are to BSTs what bucketed linked lists are to plain linked lists. B-trees are very good for building a dynamic database index over a huge dataset stored on a slow storage device (because of its size): they enable applications to fetch the values associated with a key using very few storage-device lookups (which are very slow on an HDD, for example). Another real-world use case is interval trees.
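
As a rough, search-only illustration of the "sorted array of keys inside each node" idea (this is not a full B-tree -- insertion and node splitting are omitted, and the node layout and names are simplified for the example):

import bisect

class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys          # keys kept in one sorted, contiguous array
        self.children = children  # None for a leaf, else len(keys) + 1 children

def btree_contains(node, key):
    # Pointer chasing only happens between nodes; inside a node the lookup
    # is a binary search over a small, cache-friendly array.
    while node is not None:
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True
        node = node.children[i] if node.children else None
    return False

root = BTreeNode([10, 20], [BTreeNode([1, 5]), BTreeNode([12, 17]), BTreeNode([25, 40])])
print(btree_contains(root, 17), btree_contains(root, 13))  # True False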

Note that memory fragmentation can be reduced using compaction methods. For BSTs/B-trees, one can store the nodes contiguously in array order, like a heap does. However, compaction is not always easy to apply, especially in native languages with raw pointers like C/C++, although some very clever methods exist to do so.

Keep in mind that B-trees shine only on big datasets (especially ones that do not fit in cache). On relatively small ones, using just a plain array or even a sorted array is often a very good solution.
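
For example, the sorted-array approach from the question could look like this in Python with the bisect module (just a sketch to show the trade-off): lookups are O(log n) over contiguous memory, while each insertion stays O(n) because the tail of the array has to be shifted.

import bisect

data = []
for v in [42, 7, 19, 3]:
    bisect.insort(data, v)   # O(log n) search + O(n) shift of the tail

def contains(sorted_arr, key):
    i = bisect.bisect_left(sorted_arr, key)  # binary search, cache-friendly
    return i < len(sorted_arr) and sorted_arr[i] == key

print(data, contains(data, 19), contains(data, 8))  # [3, 7, 19, 42] True False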

Troup answered 19/6, 2021 at 10:4
You mention that a bucketed list is not better than a plain list when it comes to insertion -- why would this be the case? With a bucketed list (from my understanding: a bunch of fixed-size arrays linked together), you would only have to reorder and/or realloc the fixed-size array you're inserting into, not the entire data structure. Also, searching wouldn't be linear time either, as a MoveNext/enumeration would move over many elements (depending on your bucket size). Therefore, with a bucketed list, you could tweak the querying/insertion-deletion performance by varying the size of your buckets. – Goles
Insertion in a bucketed list can be O(1), like in plain lists. But in this case, you need to perform a dichotomic (binary) search to find the item to insert/delete before performing the operation. That search is O(n) on a plain list because you need to iterate over the list nodes (random access to nodes is not possible). The same thing applies to bucketed lists: while random accesses are possible within buckets, this is not the case for the nodes. Thus, a node traversal is still required in this case, hence the O(n) complexity (buckets just reduce the hidden constant factor before n). – Catton
Note that I assumed in the answer that the size of the buckets is bounded (which is often the case in implementations). Thus, when you say "move over many elements", keep in mind that the number of elements skipped is bounded too. If you do not bound the bucket size, it causes other problems, and the resulting complexity gets close to that of a sorted array -- especially insertion, which becomes linear (because the size is not bounded anymore). In the end there is a trade-off: small buckets are not great, and neither are big ones. – Catton
Finally, note that you can design a bucketed list with buckets of size sqrt(n). This is hard to do since insertions/deletions can always occur in the same location. Resizing buckets takes linear time, but this cost could be amortized (like a C++ vector does), so I think we can ignore it. Anyway, the optimal complexity achieved would probably be something like O(sqrt(n)), which is far from the O(log(n)) complexity of using trees. Note that the memory fragmentation caused by buckets in this implementation would likely be an issue (I think even more so than with trees). – Catton
I also assumed bucket size was fixed/bounded. I would be very careful with dynamic or heterogeneous bucket sizes because of fragmentation; it would also likely increase code complexity. I'll be interested to look into B-trees! – Goles
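
To make the bucketed-list discussion in these comments concrete, a rough Python sketch of a bounded-bucket sorted list could look like this (the class, bucket size, and method names are purely illustrative). The binary search inside a bucket is cheap, but locating the right bucket still walks the bucket chain linearly, which is where the O(n) term mentioned above comes from.

import bisect

BUCKET_SIZE = 16  # illustrative bound on the bucket size

class BucketedSortedList:
    def __init__(self):
        self.buckets = [[]]  # a chain of small sorted arrays

    def insert(self, value):
        # Linear walk over the bucket chain to find the target bucket:
        # O(n / BUCKET_SIZE) node hops, i.e. still linear in n.
        for idx, bucket in enumerate(self.buckets):
            if not bucket or value <= bucket[-1] or idx == len(self.buckets) - 1:
                bisect.insort(bucket, value)       # cheap: the bucket is small
                if len(bucket) > BUCKET_SIZE:      # split a full bucket in two
                    half = len(bucket) // 2
                    self.buckets.insert(idx + 1, bucket[half:])
                    del bucket[half:]
                return

    def __contains__(self, value):
        for bucket in self.buckets:                    # linear bucket walk again
            if bucket and value <= bucket[-1]:
                i = bisect.bisect_left(bucket, value)  # binary search in the bucket
                return i < len(bucket) and bucket[i] == value
        return False

bl = BucketedSortedList()
for v in [5, 1, 9, 3]:
    bl.insert(v)
print(3 in bl, 4 in bl)  # True False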
