Consistent Hashing and Caching
Consistent hashing is well suited to distributing load among a set of caching servers. That is, given the data to be cached, consistent hashing uses the cache key to determine the server that owns the cached data. To understand this in detail start with a cluster of three cache servers: S1, S2, and S3.
The servers are hashed to define the bins, then the keys are hashed to determine which bin, and hence which server, owns the key.
When a cluster first starts, the initial set of servers and their weights are known, perhaps from a configuration file, or from a database. For this case, assume that each server has a weight of three. Three distinct variations of each server are generated and then hashed. For example the first entry for a server may be 1 followed by the server address and port, the next 2 followed by the address and port, etc. If we call the hash of the nth variation of the mth server Smn then the first hash of the first server is S11, the second hash of the first server is S12 ... the third hash of the third server is S33.
The next step in understanding consistent hashing is to lay the hashed server
values out along a number line bounded above by
Integer.MAX_VALUE (231-1) and below by
To find the server that owns a particular cache key-value pair, hash the key and locate the first server along the number line that is equal to or greater than the hash of the key. So that any key that hashes to a value between S22 and S13 will belong to S1.
We can apply this logic to the entire number line, and assign each point to the first server that is equal to or greater than the point. This is represented below where keys that hash into the regions in green will belong to S1, keys that hash into a blue region will belong to S2, and keys that hash into a red region will belong to S3.
The real value of consistent hashing only becomes evident when we examine what happens when the number of servers changes.
If S2 is removed, we can follow the same process to generate a cluster with only S1 and S3. Here the green region represents hash values for keys that will be mapped to S1 and the red region represents hash values for keys that will be mapped to S3.
The two principal advantages of consistent hashing are seen by examining these last two figures. First, notice that any key that was mapped to S1 or S3 in the first case remains associated with that server when the hashing is recalculated without S2. If a server is present in one consistent hash, and also present in a later consistent hash with fewer servers, then keys will remain with the servers that that are in both hashes. Only keys that were owned by servers that have been removed from the consistent hash will be moved.
Which brings up the second point. Those keys that are moved are distributed among the remaining servers, the additional load does not fall on a single server.
We also see value in consistent hashing when a server is added.
If S2 is added back into the set, we can quickly see the keys that are moved from S3 to S2 are those where S31 < H(K) ≤ S21. This makes it comparatively easy to scan the key set from S3 and load any that meet this condition into S2 as part of the process of bringing S2 online.
- A generic consistent hash
- The Hasher interface, for hashing keys.
- ReHash, the default Hasher.
- A wrapper that uses the generic consistent hash to impliment the Infinispan default hash.