What is Sharding?
In literal terms, dividing larger parts into smaller parts is called ‘sharding’. In the world of distributed systems and databases, Sharding means splitting a single logical dataset and storing in multiple dataset.
Sharding is a method for distributing data across multiple machines.
Why we need Sharding?
Sharding is a booster when it comes to horizontal scaling of data. In a replica set, where we are having a master-slave architecture for fault tolerance (yes, we are referring to database replication), each server/node needs to contain the entire dataset. With time, as the dataset grows to a point, where it becomes tedious to serve clients, architects start thinking in terms of scalability.
One option could be to increase the capacity of individual machines so that they acquire more RAM or disk space or may be even a powerful CPU. In short, we are talking about Vertical Scaling. However, this growth can be potentially costly operation over time and at some point, of time, it will meet a dead end. Especially, when it comes to cloud-based services, Vertical Scaling forever is not an option at all since there is limit to hardware configuration as well as storage facility.
Horizontal scaling comes to the rescue. Instead of scaling up our machines, more machines are added, and dataset is distributed among them. Here, entire dataset is not stored on one server. We can create as many shards of our dataset, which will eventually make up the formation of Sharded Cluster.
However, we still have one problem to solve. What if, one of the shards fail to serve data? Well, we can always create a replica set in one shard.
Basically, each shard can be deployed as a replica set to ensure a level of fault tolerance.
Sharding Architecture Overview
Till now, we are clear about the part that we are distributing our data into multiple shards and maintaining a replica set in each shard. The main challenge would be to serve data to client application from multiple shards, as querying becomes trickier.
If client application is looking for a specific document, how to figure out where to look for the document in the sharded cluster?
Well, necessity is mother of all solutions, so architects introduced a ‘router process’ between Sharded Cluster and respective clients, whose job is to accept queries from clients and figure out which shard should be processing the query. This router process maintains a metadata related to which shards are maintaining what data.
Now, this metadata itself can be big, hence, metadata is stored separately in Config Servers. To maintain, high availability of metadata for the router process, Config servers are also deployed as Config replica set.
To distribute data across shards, architects might use the strategy of using a unique identifier (Shard key) which consists of a field or fields that exist in every chunk of data across shards. There can be different sharding strategies to shard data across sharded cluster. Some common strategies are Hashed Sharding, Ranged Sharding, etc.
It is the job of Config Servers to maintain uniform distribution of data across shards. It is not an ideal scenario to have disproportionate data across shards in a cluster.
In cases where, we must serve data from multiple shards across a cluster, the router process will send the query to respective shards via Config Server and gather the results and merge them in a single result and serve it back to the client application.
When to go for Sharding?
So, the real question persists, when do we go for Sharding?
It is not always a good idea to just scale your architecture vertically right from the word Go. A good architect must compute the requirements and look for indicators before deciding to split up the system.
Let us go through that flow of decision making then.
First check to do would be to estimate whether it is still economically viable to vertically scale our system. Vertical scaling would require scaling up one or more of the vertical resources like RAM, CPU or disk space. And if vertical scaling of any of the resource would be less costly than adding up a new server altogether, then why not go for it.
However, in course of time, vertical scaling will reach to a deadline, after which, it would not be able to scale up any further.
Consider a scenario, where the current architecture is using 3 servers costing 200$ each, working together in a master-slave pattern, with a total cost of 600$. Now, to meet up the performance, we want to scale up to the next level of servers, which cost 800$ each, however, performance increase would be just 2 times the previous set up, while total cost would be 2400$.
Now, consider, setting up another replica set with the original server type costing 200$ each. So, two replica set operating together will involve 6 servers, with each costing 200$, which leads to the final cost estimation of 1200$.
So, by splitting the system horizontally, we are getting the same performance at half the cost i.e. 1200$ instead of scaling up, which is going to cost 2400$.
Also, we need to consider, some scenarios like, while vertically scaling up, if we are increasing size of the disk, we might also need to increase the RAM to return back the same level of performance, which is a classic case of one expense leading to other.
There are use cases like Netflix, which stores data in different Geo-locations and have clients of respected Geo-locations attached to them. It uses the concept of zone-sharding to distribute data sets geographically.