Skip to main content

Command Palette

Search for a command to run...

Hashing

Published
3 min read

Hashing, in system design, is a way to store data points in a database or distribute requests across servers. It is used to divide data or requests across multiple resources so that no single resource gets overloaded and the load is distributed efficiently. The main idea of hashing is to convert a key value into a fixed-size output using a hash function, which maps every key to a specific position. This output is then used to decide where the data should be stored or which server should handle the request. A good hashing function tries to distribute data uniformly across all resources to reduce hotspots and improve performance.


Collision in Hashing:

When we use hashing, two different keys can sometimes produce the same hash output. This is called a collision. Good hashing algorithms try to distribute outputs uniformly to reduce collisions, but since the number of possible inputs is much larger than the number of possible hash outputs, collisions are always possible.

There are several ways to deal with collisions in hashing, including using a larger hash output size, using better hashing algorithms, or using techniques like chaining or open addressing. In some cases, salting is used mainly in security hashing to reduce predictable collisions. Despite these measures, it is generally recommended to assume that collisions are possible and design systems keeping this in mind.

If we somehow manage collision handling, we still need to think about the effort required to rehash data when we add or remove nodes. In traditional hashing, any change in the number of nodes can trigger large-scale rehashing.

If one node goes down, we may need to rehash most of the data inputs across the remaining resources. This increases the workload on the system along with handling normal requests. To solve this problem, we use consistent hashing, which reduces the amount of data that needs to be rehashed when nodes are added or removed.

Consistent Hashing:

Consistent hashing is a distributed hashing scheme that works independently of the number of servers or objects in a distributed system by assigning them a position on an abstract circle, also called a hash ring. This allows servers and data to scale dynamically without significantly affecting the overall system.

In general, only about K/N number of keys need to be remapped when a server is added or removed, where K is the number of keys and N is the number of servers (more precisely, the maximum of the initial and final number of servers). This is much more efficient compared to traditional hashing, where most keys may need to be reassigned.

If we want to improve load balancing even further, we can use multiple virtual nodes for each physical server. Instead of placing one server at one position on the hash ring, we place multiple virtual nodes of the same server at different positions on the ring. This helps distribute the data more evenly across all servers and reduces the chances of one server getting overloaded.

Virtual nodes help solve the problem of uneven distribution that can happen when only a few physical servers are present. If a server has multiple virtual nodes, it will receive data from multiple parts of the ring instead of a single continuous section. This improves load balancing and makes the system more stable.

Another advantage of virtual nodes is better fault tolerance. If one physical server goes down, only the virtual nodes mapped to that server are affected, and their data can be redistributed among remaining servers. This reduces sudden load spikes on any single server.