When to use horizontal partitioning and when to use database sharding?

I'm reading this article on Wikipedia: http://en.wikipedia.org/wiki/Shard_(database_architecture) trying to find the major difference between these 2 techniques. Here is what I found:

Horizontal partitioning splits one or more tables by row, usually within a single instance of a schema and a database server. It may offer an advantage by reducing index size (and thus search effort) provided that there is some obvious, robust, implicit way to identify in which table a particular row will be found, without first needing to search the index, e.g., the classic example of the 'CustomersEast' and 'CustomersWest' tables, where their zip code already indicates where they will be found.

Sharding goes beyond this: it partitions the problematic table(s) in the same way, but it does this across potentially multiple instances of the schema. The obvious advantage would be that search load for the large partitioned table can now be split across multiple servers (logical or physical), not just multiple indexes on the same logical server.

As I understand it, horizontal partitioning is more applicable to a single instance (single-node environment), whereas sharding is used in multi-node / multi-data-center environments. Is this correct? Or are there different usage scenarios?

Extra question: for a huge table (millions of rows) with a simple schema (about 4-5 columns), what is the best technique to improve read/write performance on this table?

Kuibyshev answered 1/6, 2014 at 8:1 Comment(1)
Huge tables are not millions of rows. Huge tables are now at least 50 billion rows. You cannot choose a single best technique for improving read/write performance without knowing the data and the patterns of querying. Anyone who claims to tell you how to do this will be faced with the embarrassing problem of having wildly different performance for two databases of the same "size." - Hull

You are correct: horizontal partitioning (supported, for example, in MySQL and PostgreSQL) splits a table up within a single server. This can improve performance because data and indexes can be spread across many disk volumes, improving I/O. It is usually done by key range.
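To make the key-range idea concrete, here is a minimal sketch of how a row's key maps to a partition. The partition names and boundaries are made up for illustration; in MySQL or PostgreSQL the server does this mapping for you once the table is declared as range-partitioned.

```python
from bisect import bisect_right

# Hypothetical layout: each partition holds keys below its (exclusive) upper bound,
# and the last partition catches everything above the highest bound.
UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]
PARTITIONS = ["customers_p0", "customers_p1", "customers_p2", "customers_p3"]


def partition_for(customer_id: int) -> str:
    """Return the name of the partition whose key range contains customer_id."""
    return PARTITIONS[bisect_right(UPPER_BOUNDS, customer_id)]


print(partition_for(999_999))    # customers_p0
print(partition_for(1_000_000))  # customers_p1
print(partition_for(5_000_000))  # customers_p3
```

Because the key alone identifies the partition, a lookup only has to touch one smaller index instead of one large one.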

With database sharding, you are dividing the data across multiple servers, not just within a single server. In this case you use a shard key to partition the data, typically with some sort of hashing algorithm. You can get a white paper on this subject here (offered by our company, it is not specific to any product, it explains the technology): http://www.codefutures.com/database-sharding-white-paper/
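As a rough sketch of hash-based routing on a shard key (the server names and the key format below are placeholders, not taken from any particular product):

```python
import hashlib

# Hypothetical shard servers; these names are placeholders, not real hosts.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]


def shard_for(shard_key: str) -> str:
    """Map a shard key to one server using a hash that is stable across runs."""
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and reduce it modulo the shard count.
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]


print(shard_for("customer:42"))
```

Note that with plain modulo hashing, changing the number of shards remaps most keys; consistent hashing or a lookup directory is a common way to limit that, at the cost of extra complexity.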

The advantage of DBMS single server partitioning is that it is relatively simple to set up and manage. The disadvantage is ultimately you are limited by what a single server can do. This is particularly the case when it comes to heavy write contention, database locking and heavy queries.

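Database sharding takes more work, but it has the advantage of being a shared-nothing approach: each server owns its own slice of the data, so capacity can grow by adding servers instead of being capped by a single machine.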

A clear indicator that database sharding is needed is when a single server cannot keep up with write volume. If you have many heavy queries, this also can drive the need for this type of solution.

Having said all of this, if you are talking about "millions" of rows with 4-5 columns, and your reads can be well-indexed for fast access, it is doubtful that you need to implement either of these options. When you are talking about 100s of millions or billions of rows, with 1000s of users, that is where database scalability becomes critical.
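One way to check whether a read is "well-indexed" is to ask the database for its query plan. The sketch below uses SQLite from Python's standard library only because it needs no setup; the table, column, and index names are invented for the example, and other databases have their own EXPLAIN output.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, kind TEXT, created_at TEXT)"
)
db.executemany(
    "INSERT INTO events (user_id, kind, created_at) VALUES (?, ?, ?)",
    [(i % 1000, "click", "2014-06-01") for i in range(100_000)],
)
# Index the column that the read path filters on.
db.execute("CREATE INDEX idx_events_user ON events (user_id)")


def query_plan(sql: str, params: tuple = ()) -> str:
    """Return SQLite's plan description for a query (last column of each plan row)."""
    rows = db.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


plan = query_plan("SELECT * FROM events WHERE user_id = ?", (42,))
print(plan)  # the plan should mention idx_events_user rather than a full table scan
```

If the plan shows an index search instead of a scan, a table with a few million rows is usually comfortable on a single server.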

There is also an informational Web site I am working on regarding database scalability: www.bigdatascalability.com. It includes links to various articles, and will have new content added over time.

Leclerc answered 1/6, 2014 at 16:20 Comment(0)

You are correct in your distinction between partitioning and sharding. I recommend you read a post I wrote on exactly this subject: Scale Up, Partitioning, Scale Out

Another good post can be found here: "MySQL Partitioning: A Stopgap Measure" (disclaimer: I work for ScaleBase)

Partitioning helps with some of the size challenges and with reads from tables, but sharding is the only way to really address all aspects of big databases, including reads, writes, concurrency, and maintenance of the database instance (backups, replication, etc.).

While modern databases (often NoSQL ones) such as MongoDB offer this capability right out of the box, in MySQL it used to be a "go shard yourself" affair. ScaleBase makes a complete scale-out solution, an "automatic sharding machine" if you like. ScaleBase analyzes your data and SQL stream, splits the data across DB nodes, routes commands, and aggregates results at runtime, so you won’t have to!
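For readers who want to see what "route commands and aggregate results" means in general, here is a generic toy sketch. It is not how ScaleBase or any other product is implemented: three in-memory SQLite databases stand in for separate servers, and the table and function names are invented for the example.

```python
import sqlite3
import zlib

# Three in-memory SQLite databases stand in for separate shard servers.
shards = [sqlite3.connect(":memory:") for _ in range(3)]
for db in shards:
    db.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")


def shard_for(customer: str) -> sqlite3.Connection:
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return shards[zlib.crc32(customer.encode("utf-8")) % len(shards)]


def insert_order(customer: str, amount: int) -> None:
    # Routed write: only the shard that owns this customer is touched.
    shard_for(customer).execute("INSERT INTO orders VALUES (?, ?)", (customer, amount))


def total_amount() -> int:
    # Scatter-gather read: run the same aggregate on every shard, then combine.
    partials = [
        db.execute("SELECT COALESCE(SUM(amount), 0) FROM orders").fetchone()[0]
        for db in shards
    ]
    return sum(partials)


for name, amount in [("alice", 30), ("bob", 20), ("carol", 50), ("alice", 5)]:
    insert_order(name, amount)

print(total_amount())  # 105
```

Writes for a single customer hit one shard, while a query that does not filter on the shard key has to visit every shard and merge the partial results.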

Hope that helped!

Doron

Markowitz answered 2/6, 2014 at 14:41 Comment(0)
