Curious how indexing works in Databricks. Can partitioning be seen as indexing, since it effectively organizes the data into grouped subcategories?
Yes, partitioning could be seen as a kind of index: it allows you to jump directly to the necessary data without reading the whole dataset.
Databricks Delta has another feature: Data Skipping. When writing data to Delta, the writer collects statistics (for example, min & max values) for the first N columns (32 by default) and writes those statistics into the Delta log, so when we filter data by an indexed column, we know whether a given file may contain the requested data or not. Another indexing technique for Databricks Delta is the bloom filter, which shows whether a specific value is definitely not in a file, or could be in it.
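A minimal sketch of how both mechanisms are configured, assuming a PySpark session on Databricks; the table name `events`, the column names, and the option values are illustrative, not prescriptive:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Data skipping: Delta collects min/max statistics for the first N
# columns (32 by default). Adjust N via a table property so that the
# columns you actually filter on are covered.
spark.sql("""
    ALTER TABLE events
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '8')
""")

# Bloom filter index: tells the reader a value is definitely absent
# from a file, or only *possibly* present (false positives allowed).
# fpp (false positive probability) and numItems are example values.
spark.sql("""
    CREATE BLOOMFILTER INDEX ON TABLE events
    FOR COLUMNS (user_id OPTIONS (fpp = 0.1, numItems = 1000000))
""")

# A selective filter can now skip whole files whose min/max range or
# bloom filter rules out the requested value.
spark.sql("SELECT * FROM events WHERE user_id = '42'").show()
```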
Update 14.04.2022: Data Skipping is also available in OSS Delta, starting with version 1.2.0
This is a topic of much interest to me. If I extend the word "index" in the question to mean a relational "non-clustered index", then this is the definition of an index:
A. A data structure separate from the actual data we are searching (like a bloom filter index)
B. Is used automatically and transparently by the query planner (like a bloom filter index)
C. The leaf edges of the non-clustered index point at specific rows in the data we are searching
As I understand it, the thing Databricks doesn't do is C, as there is no concept of an absolute row identifier across a dataset. Which is fine, because it gains all kinds of other performance and maintenance improvements by not having to define and maintain one.
It seems that all the Databricks performance features focus on data skipping (known as pruning in the database world). In other words, they don't support finding specific rows of data; they support ignoring data (files) that don't contain the data you need.
Partitioning and Z-ORDER both focus on getting known attributes into known files, so we know which files to ignore (a sketch follows this list)
The bloom filter (explained above) is again a way of understanding which files we can ignore and which ones might have the data we need
CLUSTER BY... this one I'm struggling to find background on how Spark utilises, but I assume it once again lets us know the range of data in each file and therefore which ones to ignore
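A sketch of the first two techniques, assuming PySpark writing a Delta table; the source table `raw_events`, the target table names, and the columns are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("raw_events")  # illustrative source table

# Partitioning: one directory per distinct event_date, so a filter on
# event_date prunes whole directories without opening a single file.
(df.write.format("delta")
   .partitionBy("event_date")
   .saveAsTable("events_partitioned"))

# Z-ordering: within each partition, co-locate rows with similar
# user_id values in the same files, tightening each file's min/max
# range so data skipping can ignore more files (Databricks SQL syntax).
spark.sql("OPTIMIZE events_partitioned ZORDER BY (user_id)")
```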
EDIT: Time marches on: the latest performance feature from Databricks is liquid clustering, which appears to cover all of the features mentioned above
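For completeness, a sketch of liquid clustering, assuming a Databricks Runtime where the feature is available; the table and columns are again illustrative. Instead of fixed partitions, you declare clustering columns with `CLUSTER BY`, and you can change them later without rewriting the table:

```python
# Create a Delta table with liquid clustering on two columns.
spark.sql("""
    CREATE TABLE events_clustered (
        event_date DATE,
        user_id    STRING,
        payload    STRING
    )
    USING DELTA
    CLUSTER BY (event_date, user_id)
""")

# Clustering keys can be changed later without rewriting the data,
# which is part of why it supersedes static partitioning and ZORDER.
spark.sql("ALTER TABLE events_clustered CLUSTER BY (user_id)")
```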