Consistent Hashing and Caching

Consistent hashing is well suited to distributing load among a set of caching servers. That is, given the data to be cached, consistent hashing uses the cache key to determine the server that owns the cached data. To understand this in detail start with a cluster of three cache servers: S₁, S₂, and S₃.

**Three initial servers**: S₁, S₂, and S₃.

The servers are hashed to define the bins, then the keys are hashed to determine which bin, and hence which server, owns the key.

When a cluster first starts, the initial set of servers and their weights are known, perhaps from a configuration file, or from a database. For this case, assume that each server has a weight of three. Three distinct variations of each server are generated and then hashed. For example the first entry for a server may be 1 followed by the server address and port, the next 2 followed by the address and port, etc. If we call the hash of the n^th variation of the m^th server S_mⁿ then the first hash of the first server is S₁¹, the second hash of the first server is S₁² ... the third hash of the third server is S₃³.

The next step in understanding consistent hashing is to lay the hashed server values out along a number line bounded above by Integer.MAX_VALUE (2³¹-1) and below by Integer.MIN_VALUE (-2³¹)

To find the server that owns a particular cache key-value pair, hash the key and locate the first server along the number line that is equal to or greater than the hash of the key. So that any key that hashes to a value between S₂² and S₁³ will belong to S₁.

We can apply this logic to the entire number line, and assign each point to the first server that is equal to or greater than the point. This is represented below where keys that hash into the regions in green will belong to S₁, keys that hash into a blue region will belong to S₂, and keys that hash into a red region will belong to S₃.

The hashed servers form bins by mapping each point along the line to the next server to the left.

The real value of consistent hashing only becomes evident when we examine what happens when the number of servers changes.

If S₂ is removed, we can follow the same process to generate a cluster with only S₁ and S₃. Here the green region represents hash values for keys that will be mapped to S₁ and the red region represents hash values for keys that will be mapped to S₃.

The hashing with different numbers of servers is similar.

The two principal advantages of consistent hashing are seen by examining these last two figures. First, notice that any key that was mapped to S₁ or S₃ in the first case remains associated with that server when the hashing is recalculated without S₂. If a server is present in one consistent hash, and also present in a later consistent hash with fewer servers, then keys will remain with the servers that that are in both hashes. Only keys that were owned by servers that have been removed from the consistent hash will be moved.

Which brings up the second point. Those keys that are moved are distributed among the remaining servers, the additional load does not fall on a single server.

We also see value in consistent hashing when a server is added.

When a new server is added the redistributed keys can be found easily.

If S₂ is added back into the set, we can quickly see the keys that are moved from S₃ to S₂ are those where S₃¹ < H(K) ≤ S₂¹. This makes it comparatively easy to scan the key set from S₃ and load any that meet this condition into S₂ as part of the process of bringing S₂ online.

Examples

A generic consistent hash
The Hasher interface, for hashing keys.
ReHash, the default Hasher.
A wrapper that uses the generic consistent hash to impliment the Infinispan default hash.