What is a difference between table distribution and table partition in sql?
Asked Answered
P

3

12

I am still struggling with identifying how the concept of table distribution in azure sql data warehouse differs from concept of table partition in Sql server?

Definition of both seems to be achieving same results.

Prindle answered 3/8, 2018 at 17:26 Comment(0)
S
20

Azure DW has up to 60 computing nodes as part of it's MPP architecture. When you store a table on Azure DW you are storing it amongst those nodes. Your tables data is distributed across these nodes (using Hash distribution or Round Robin distribution depending on your needs). You can also choose to have your table (preferably a very small table) replicated across these nodes.

enter image description here

That is distribution. Each node has its own distinct records that only that node worries about when interacting with the data. It's a shared-nothing architecture.

enter image description here


Partitioning is completely divorced from this concept of distribution. When we partition a table we decide which rows belong into which partitions based on some scheme (like partitioning an order table by the order.create_date for instance). A chunk of records for each create_date then gets stored in its own table separate from any other create_date set of records (invisibly behind the scenes).

Note that the rows in that partition table are then hashed and distributed across the available nodes as described above.

Partitioning is nice because you may find that you only want to select 10 days worth of orders from your table, so you only need to read against 10 smaller tables, instead of having to scan across years of order data to find the 10 days you are after.

Here's an example from the Microsoft website where horizontal partitioning is done on the name column with two "shards" based on the names alphabetical order:

enter image description here


Table distribution is a concept that is only available on MPP type RDBMSs like Azure DW or Teradata. It's easiest to think of it as a hardware concept that is somewhat divorced (to a degree) from the data. Azure gives you a lot of control here where other MPP databases base distribution on primary keys. Partitioning is available on nearly every RDBMS (MPP or not) and it's easiest to think of it as a storage/software concept that is defined by and dependent on the data in the table.

In the end, they do both work to solve the same problem. But... nearly every RDBMS concept (indexing, disk storage, optimization, partition, distribution, etc) are there to solve the same problem. Namely: "How do I get the exact data I need out as quickly as possible?" When you combine these concepts together to match your data retrieval needs you make your SQL requests CRAZY fast even against monstrously huge data.

Stenson answered 3/8, 2018 at 17:37 Comment(3)
Partitioning happens within the individual distributions. Conceptually you could have different partition strategies in each distribution, although that's not supported in Azure SQL Data Warehouse.Conviction
@DavidBrowne-Microsoft I have removed my sentences suggesting that a partition is distributed. That's very interesting. I need to read up more on the particulars of Azure's MPP implementation. I'm used to working in Teradata which is similar, but also not at all.Stenson
A distribution or distribution DB in SQL DW parlance really refers to one of the 60 underlying databases that make up each SQL DW instance. An instance of SQL DW can be powered by 1 - 60 compute nodes and actual node count has to be divisible by 60 so each node will power the same number of distribution databases. There are presentations on channel9 and youtube about SQL DW best practices that cover partitions good, bad and ugly that is worth reviewing. In short, you need to have a LOT of data to make table partitioning necessary in SQL DW.Pinnati
A
15

Just for fun, allow me to explain it with an analogy.

Suppose there exists one massive book about all history of the world. It has the size of a 42 story building.

Now what if the librarian splits that book into 1 book per year. That makes it much easier to find all information you need for some specific years. Because you can just keep the other books on the shelves.
A small book is easier to carry too.

That's what table partitioning is about. (Reference: Data Partitioning in Azure)
Keeping chunks of data together, based on a key (or set of columns) that is usefull for the majority of the queries and has a nice average distribution.
This can reduce IO because only the relevant chunks need to be accessed.

Now what if the chief librarian unbinds that book. And sends sets of pages to many different libraries. When we then need certain information, we ask each library to send us copies of the pages we need.
Even better, those librarians could already summarize the information of their pages and then just send only their summaries to one library that collects them for you.

That's what the table distribution is about. (Reference: Table Distribution Guidance in Azure)
To spread out the data over the different nodes.

Anticholinergic answered 3/8, 2018 at 18:18 Comment(2)
That is a good analogy. Asking each library to send any pages we need is slower then just going to the big book, if the big book is relatively small, but if it's crazy huge and would literally take weeks to sort through the whole thing, then making 60 librarians search their particular section (shared nothing architecture) and then send what they find, even by snail-mail, it's still faster.Stenson
Thanks for this amazing analogy.Prindle
D
0

Conceptually they are the same. The basic idea is that the data will be split across multiple stores. However, the implementation is radically different. Under the covers, Azure SQL Data Warehouse manages and maintains the 70 databases that each table you define is created within. You do nothing beyond define the keys. The distribution is taken care of. For partitioning, you have to define and maintain pretty much everything to get it to work. There's even more to it, but you get the core idea. These are different processes and mechanisms that are, at the macro level, arriving at a similar end point. However, the processes these things support are very different. The distribution assists in increased performance while partitioning is primarily a means of improved data management (rolling windows, etc.). These are very different things with different intents even as they are similar.

Dode answered 3/8, 2018 at 17:36 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.