Why NoSQL databases can be easily horizontally scaled than SQL ones? I have been trying to figure out why people keep saying this. I came across many articles which only confused me with their not-industry familiar terminologies and vague assumptions. I will suggest you read Designing Data-intensive applications by Martin Kleppman. Also, I will share some of my understanding of this subject.
JOINS - in the case of many-to-one or many-to-many relationships there is no way that any database invented till now can keep the data together in one table or document so if the data is sharded(or partitioned), either it is SQL or NoSQL, the latency will be same, the database has to look for both the documents. NoSQL seems to dominate only in the case of one to many relationships. For example:
NoSql
Student
{
"name": "manvendra",
"education": [
{
"id": 1,
"Degree": "High School"
},
{
"id": 2,
"Degree": "B.Tech"
}
]
}
Eduction Institute collection
[
{
"id": "1",
"name": "army public school"
},
{
"id": "2",
"name": "ABES Engineering College"
}
]
Sql
Student Table
id | name
1 | Manvendra
Education Institute
id | Name
1 | Army public school
2 | ABES Engineering college
Studies Table
student | education institute | degree
1 | 1 | high school
1 | 2 | B.tech
Now suppose in the case of NoSql if both collection's data is on different nodes there will some extra time required to resolve the ids of the education institute and this situation is similar in the case of SQL databases so where is the benefit? I can't think of any.
Also, you must be thinking why can't we store the education institute info also in the same student collection, then it will be like:
{
"name": "manvendra",
"education": [
{
"name": "Army public school",
"Degree": "High School"
},
{
"name": "ABES Engineering College",
"Degree": "B.Tech"
}
]
}
which is really a bad design because there is a many-to-many relationship between student and education institute, many students might have studied from the same institute so tomorrow if there is a change in name or any information of the institute it will be really a very difficult challenge to change at all places.
However, in the case of a one-to-many relationship, we can club all the info together for example:
Consider a customer and an order relationship
{
"name": "manvendra",
"order": [
{
"item": "kindle",
"price": "7999"
},
{
"item":"iphone 12",
"price":"too much"
}
]
}
Since an order only belongs to one customer it makes sense to store order info in one place however storing item id or name is another choice anyway, if we use SQL database here, there will be two tables with orders and customers which will not give good results to queries if data is not stored in the same node.
So saying joins in an argument as to why the NoSql database is easier to scale horizontally does not make sense.
TRANSACTIONS
Both SQL(Postgres, MySQL, etc) and NoSQL(MongoDB, Amazon's DynamoDB, etc) support transactions so there is nothing left to discuss on that.
ACID
ACID is overused just like CAP actually it is all about showing a single copy of data to the client instead actually there might be multiple copies of data(to enhance availability, fault-tolerance, etc) and what strategies the database uses to do that. For example in Postgres in the case of a master-slave distributed system, one can opt for synchronous or asynchronous replication and the replication is made possible with WAL(Write ahead logs) and same is the case in MongoDB, only in place of WAL it has oplog(Operations Log), both support streaming replication and failovers.
Then where is the difference? Actually, I can't find a very strong reason that why NoSql databases can be scaled easily. What I can say is NoSql is the latest so databases come with ready-made support for horizontal scaling for example consider Mongos in MongoDB, they do all the dirty work of sharding documents, routing requests to the specific shard, etc. So tomorrow if Postgres or MySQL come up with some mechanism of intelligently sharding tables so all the related data is mostly kept in one node then it may put an end to this debate because there is nothing intrinsic in a relational database that prevents it from horizontal scaling.
On an optimistic note, I believe in the near future it will all be about the strategies. How you are planning to scale and those strategies will be independent of how you are storing data either in tables or documents. For example in Amazon's DocumentDB, there is a concept of auto-scaling in and out but if you want to achieve this with sharding it will be a burden to copy data each time you are scaling in and out. In DocumentDB this is taken care of as a shared cluster volume(data storage is separated from computing) which is nothing but a shared disk to all the instances(primary or secondary) and to escape from the risk of the shared disk failure DocumentDB replicates data of the shared disk to six other disks in different availability zones. So point to be noted here is DocumentDB mixed the concept of the shared disk and standard replication strategy to achieve its goal. So it is all about the strategy you are using in your database which is what matters