One of the key things about elaticity is the fact about being very optimistic about failure of systems. That is, not being bothered about it too much. The ability to say, “So what?”. Architecting such a system is so important. Many very good systems do not have this feature built into it, what we need is to use other systems to sympbiotically work to enrich the core system.
One such system of interest to me is the Redis, a high performance- in memory- persistent- key-value-store. It is a brilliant Key-value database, I love it, and have used it. Even AWS recently announced support for it underneath their ElasticCache. But a fault tolerant and masterless Redis Cluster is still not a give away; we need to use multiple systems to get that working. But, can all these components come together to get a fault tolerant and masterless Redis Cluster going? I choose to address this Utopian Redis Cluster in this write.
The key building a cluster that store’s data is to have an effective mechanism of storing and replicating the data. I hope to follow a proven methodology to address building a data cluster, in which you could add or remove a Redis node at will; and yet all your data will stick, and not vanish. The method is called Consistent Hashing.
I will take sometime to explain it, as it is not a very obvious concept. To understand consistent hashing you think of function f(x), that given any x will always give a result that is between 1 and 60(why 60? You will know, wait!). And for a unique x, there will be a always unique result for f(x) as well. These 1 to 60 values will form a ring arranged in a clockwise direction.
Now every node of the cluster will need to have a unique name, right? So if you pass this name to the f(’<redis_node_name>’) it will give a number between 1 to 60(inclusive, so [1,60]), that will be the node’s placement on the ring. It’s just a logical(or recorded) placement of the node. So, you get node, put it through the hash function, get the output and place it on the ring. Simple? So every node gets it place on the ring. For example if there are five Redis nodes named ‘a’, 'b’, 'c’, ’d’, and 'e’. So each would go through the hash function f(x) and get placed on the ring. So f('a’) = 21, f('b’) = 49, f('c’) = 11, f(’d’) = 40 and f('e’) = 57. Always remember these are logical placements.
So why do we place the nodes on the ring? The idea of placing these nodes on the define which node owns what hash space. So in the diagram the node ’d’ owns all the hash space between f('a’)(which is 21), and f(’d’)(which is 40) inclusive, i.e., (21,40]. That is the data that ’d’ owns will be any key x for which the value of f(x) is (21,40]. For example if a key 'apple’ results in f('apple’) = 35, then 'apple’ will be stored on ’d’. Likewise every key that gets stored on the Cluster would go through the hash function, and gets appropriately stored on the closest node in clockwise direction on that ring.
Though consistent hashing ends there; in most cases, this kind of system is built with high availabilty in mind. So inorder to address this availability need of data, it needs to be replicated with some factor, called the Replication Factor. Let us assume the replication factor for our cluster is 2, then the data owned by ’d’ is replicated to the 2 closest adjacent nodes in the clockwise direction, which are 'b’ and 'e’. This ensure even if node ’d’ fails, data could still be fetched from 'b’ or 'e’.
Not only that since the keys are stored based on a conistent hash, it make easy to cover up for node failure and yet consistently keep the replication factor intact. For example if node ’d’ fails, 'b’ takes the ownership of ’d’ hash space, and the hash space of ’d’ can be easily replicated to the 'c’.
Bad News. Good News.
The bad news is that entire concept, replication, failure remediation, and cluster scaling, as discussed is not available out of the box for Redis at the moment. Consistent hashing only theoretically addresses the mapping of nodes on a hash ring, and data ownership of those hashed data; nevertheless an excellent starting point to building resilient and scaling systems.
The good news is that, there are disparate tools that help to, apply consistent hashing on a redis cluster, notify of node failures, and new node addition. Though all this functionality is not part of one tool, we shall see how we can put to use multiple systems to get a Utopian Redis Cluster going.
Twemproxy aka Nutcracker
Twemproxy is an Opensource tool, which acts as a fast and lightweight proxy for memcached and redis protocol. So essentially what that means is that if you have a fleet of redis servers running, and wish to create a cluster with them; you put twemproxy in front of these servers, and proxy all your redis traffic through it.
Twemproxy apart from just proxying redis traffic can also do the consistent hashing when it stores data on the redis servers. This ensures data is distributed based on some consistent hash algorithm across a fleet of redis nodes.
But the problem is twemproxy does not have a HA built into it for a Redis cluster. The easiest way to get around this is to have a slave(or replica) for every Redis node on the cluster and promote it to master whenever the master fails. Configuring a redis slave is pretty simple.
The drawbacks of this model are obvious; need to run another server for each and every node on the Redis Cluster. But node failures are also obvious and more vicious, so how do we know about them and tackle them.
Gossip on Serf
Gossip is standard a mechanism through which nodes on a cluster keep themselves updated about their members. This way every one of the nodes in the cluster will know who joins and leaves the clusters.
Serf aids in this regard by implementing Gossip. Serf is an agent-based mechanism that implements the gossip protocol to exchanges messages about the node memberships(leave and join). Serf’s buck does not stop there, Serf is also capable of generating custom event.
Now take our node cluster for example, if every Redis node also had a serf agent running on it, hence creating members that gossip'ed with each other publishing details about themselves to others. Now, every node in the cluster knows of the existences of the other node and their state.
This is not enough, what we need for high availability is for twemproxy to know when a node fails and hence modify its configuration accordingly. Serf, as mentioned earlier, can do just that using custom actions based on some gossip triggered events, so when a Redis node in cluster dies for some reason, another node can notify this event of a member leaving unexpectedly to any given endpoint, in our case the twemproxy server.
Thats not all
Now that we have Redis Cluster layered on Consistent Hash Ring, twemproxy storing data accordingly(consisten hashing), Serf using gossip to check on Redis Cluster member failure, and a notification on failure sent to twemproxy; but we are not done yet with our Utopia Redis Cluster setup.
Though Serf can notify of member-leave or member-join to any endpoint. Yet twemproxy does not have a mechanism to listen for such events. So we need to have a custom listener, same on the lines of Redis-Twenproxy Agent, to do the following.
- Listen for Serf notifications
- Update the nutcraker.yml to reflect new topology
- Restart twemproxy
This notification listner could be a small http server; that on a POST of data does the above listed actions on the twemproxy. One thing to keep in mind this notification should be an atomic operation; since when a node failes(or leaves unexpectedly) all the live nodes are capable of notifying will notify to the listener, about the failure; but the listener should act only once.
In “Consitent Hashing” above, I mentioned the replication factor for the Redis Cluster. This too is not a feature of the Twemproxy; which only takes care of storing just one copy with a consistent hash. So in our pursuit to build a Utopian Redis Cluster we need to build this replication capability into twemproxy or redis itself.
To build replication into Twenproxy, it will need to accept replication factor as configuration item and store the data onto the neighboring redis nodes on the cluster(based on the replication factor). Since twemproxy is aware of the node placement, this would make a great feature to have on twemproxy.
Since twemproxy is just the mere proxy, it functionality’s simplicity is it strength, building replication management into it would bloat the proxy.
Redis Master-Slave Ring
One thing that came to my mind while thinking of how this needs to work is why not set up each node to be a replica, or a slave, of another node hence forming the a Master-Slave Ring.
This way if a node fails, the data on the failed node is still available on the adjacent nodes on the ring. And that node, which holds the replica, will be slave to node that its serves to hold the replica. It is an all-master-all-slave ring. Serf will act as an agent to propogate the node failures as usual. But this time, our custom, listener on twenproxy will not only update the failure on the twenproxy but also have the set the redis servers to adjust to the changes.
There is quite an obvious, and also a technical, flaw in the this ring. The obvious flaw is that ring could get vicious as the slave of slave cannot decide which is its master’s data and which is it’s master’s master’s data. Hence passing all the data across in circles.
Also technically once redis syncs the master’s data across to the slave, it wipes the data of the slave clean; hence deleting all the data that was written to the slave. This Master-Slave ring obviously is not going to work, without modifying the master-slave replication mechanism to suit our need. This way the slave of every master does not sync it’s master’s data across to it’s slave. This is only possible if every node is can make the distinction between its key-space, and its master’s keyspace; so it does not sync it’s master’s data across to it’s slave.
And so, when a nodes fails four actions need to be performed. One, make the failed node’s slave the owner of it’s keyspace. Two, propogate those keys across to failed node’s slave’s slave, for replication. Three, make failed node’s slave the slave of it’s master. Then, reset the twemproxy to the new topology.
There truly is no Redis Cluster with Cosistent Hashing, High Availabilty and Partition tolerance. Hence the last picture depicts a Redis Cluster that is Utopian; yet not impossible. Taking stack on what would make this a really really work.
It is essential to have a twemproxy that is transparent about it’s placement of the Redis Nodes on the Hash Ring. This will enable for each Redis node know its place and its neighbours. The knowledge of this is essential to replication, and node failure remediation. Since twemproxy is Open Source the placement knowledge could be hacked, and extended.
Redis Data Ownership
This is the tricky part, each redis node should keep track of what data is its own, and what data is it’s master’s. At the moment this kind of segregation is not possible. This too needs a hack on the redis code base, so the node is of aware what to sync with the slave and what not to.
So that sums it up. Hacking on these two component is needed to get make our Utopian Redis Cluster a reality. Since, both are very much industry grade, and used in production, it is worth anybody’s while to hack on it.