
Creating a Redis-compatible highly distributed database

Redis-Compatible Distributed Databases

So, I recently set out toward a goal: recreate CloudFlare’s Quicksilver. Quicksilver is a highly replicated database that allows for exceedingly fast key-value lookups. It is designed to store all of CloudFlare’s configuration details, powering their services at blazing speed.

The Problem with Quicksilver

Quicksilver, as CloudFlare last described it, is not distributed. It is a single database that is replicated in thousands of places globally. Personally, I find that unacceptable when discussing distributed systems and infrastructure cost: you pay for a thousand servers to hold a single dataset, with no “combined power” whatsoever.

When I look at a distributed system, I expect it to deliver the benefits of being distributed, including gains in speed, storage space, memory, and so on, rather than just being a replicated system.

In fact, when I describe Quicksilver, I don’t find it to be a distributed database at all. It is replicated, and only distributed in the sense that the replication exists in multiple locations globally.

My Answer

So, I came up with an alternative. Distribute the data via a sharding scheme, and create clusters of servers, each cluster being a replica of every other cluster.

Architecture

The architecture of my database - called Fastgold - is split into four parts: the global level, the datacenter level, the cluster level, and the disk level.

The Global Level

All of the servers, working in tandem, must be able to serve any key, no matter which server you call. The data has to be available to any client globally.

The Datacenter Level

Each datacenter is an identical replica of every other datacenter. This way, the geographic distance any client request has to travel stays small, allowing for much faster lookups given the speed-of-light limitations imposed by distance.

Each datacenter is composed of a cluster of shards, where each shard holds a subset of the data, and a deterministic algorithm determines which shard holds the value for a given key.

The Disk Level

Each individual shard contains a disk-backed key-value store (a concept I stole from memcached). This store uses RocksDB internally, a key-value store designed for exceedingly fast lookups. To be technical, it is a data-to-data store rather than a key-value store, since I can use anything as a key, including an image of pineapples.

[Image: a pineapple]

What can be the value of this “key,” you may ask? Yes. It can also be an image of pineapples.
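
For illustration, here is a minimal sketch of what a disk-backed shard store might look like on top of RocksDB, using the python-rocksdb bindings. The ShardStore class and the path are placeholders of my own choosing, not Fastgold’s actual internals.

```python
# Minimal sketch of a disk-backed shard store on top of RocksDB, using the
# python-rocksdb bindings. ShardStore and the path below are illustrative
# placeholders, not Fastgold's actual internals.
import rocksdb

class ShardStore:
    def __init__(self, path: str):
        self.db = rocksdb.DB(path, rocksdb.Options(create_if_missing=True))

    def put(self, key: bytes, value: bytes) -> None:
        # Keys and values are arbitrary bytes, so an image of pineapples is fine.
        self.db.put(key, value)

    def get(self, key: bytes):
        return self.db.get(key)  # returns None if the key is absent

# store = ShardStore("/tmp/fastgold-shard-0")
# store.put(b"greeting", b"hello")
# print(store.get(b"greeting"))  # b'hello'
```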

Performance

I wrote this database to be as fast as possible. In fact, in testing, it came out to be as fast as, if not faster than, Redis. On a MacBook Pro, it is able to run ~90k operations per second, or about 11µs per write. Read times for RocksDB are in the sub-microsecond range, which tells me I picked a good disk backing.
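
If you want to reproduce a rough number like that against any Redis-protocol server, a quick sketch with the redis-py client looks like this. It assumes a server on localhost:6379, and the Python client’s own overhead is included, so treat the result as a ballpark.

```python
# Rough write-latency check against any Redis-protocol server (Fastgold,
# Redis, ...). Assumes a server on localhost:6379; the Python client's own
# overhead is included, so treat the numbers as a ballpark.
import time
import redis

r = redis.Redis(host="localhost", port=6379)

N = 100_000
start = time.perf_counter()
for i in range(N):
    r.set(f"bench:{i}", b"x")
elapsed = time.perf_counter() - start

print(f"{N / elapsed:,.0f} ops/sec, {elapsed / N * 1e6:.1f} µs per write")
```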

Why not write a custom disk layer?

Well, let me introduce a concept I only recently learned, and have started to live by.

If it isn’t your product, don’t build it from scratch

Very simply - it is nearly guaranteed that no matter the problem, somebody else has solved it at least adequately. Use their product, unless it is the main thing you sell. A disk store is a major product in itself, so I couldn’t be bothered to write one myself. Not to mention, anything I write of the sort will likely be slower than its competitors.

Distribution Algorithm

Let’s start with an example: if you have a global cluster with five datacenters, ten nodes per datacenter, and a replication factor of three, you have five locations globally that each hold three copies of the data, totaling fifteen copies (rather than fifty).

Now, when I ask to retrieve data from the cluster at datacenter #2, it has to know which shard to ask for the data. Why not send the query to all nodes? Well, that would mean ten times the network utilization for every query, capping usable throughput at 1/10th of the link-layer limit.

So I used a cascading-hash algorithm

For this algorithm, I simply take the hash of the key - xxhash64(key) - and that hash value, modulo the number of nodes in the cluster, selects the node number, called the node ID. Then, to select the second node, I run xxhash64(xxhash64(key)) - that is, I hash the previous hash - and take that modulo the number of nodes.

This can result in some funny behavior, though. If you only have two nodes, and a replication factor of two, it is possible for the hash to select node 0, then select node 0 again.

Therefore, it is imperative to maintain a list of the nodes already selected, and reject a new selection if it falls within that list.

But what happens if you have two nodes per cluster and a replication factor of three?

Don’t do that.
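
To make the selection concrete, here is a minimal sketch of the cascading-hash selection in Python, using the xxhash package. The function name and the integer node IDs are illustrative; Fastgold’s real implementation may differ.

```python
# Sketch of the cascading hash described above: hash the key, take it modulo
# the node count for the first replica, then keep hashing the previous hash
# for each further replica, rejecting nodes that were already selected.
# Uses the xxhash package; node IDs are plain integers here.
import xxhash

def select_nodes(key: bytes, num_nodes: int, replication_factor: int) -> list:
    if replication_factor > num_nodes:
        raise ValueError("replication factor cannot exceed the number of nodes")

    chosen = []
    h = xxhash.xxh64(key).intdigest()
    while len(chosen) < replication_factor:
        node_id = h % num_nodes
        if node_id not in chosen:  # reject an already-selected node
            chosen.append(node_id)
        # cascade: hash the previous hash to produce the next candidate
        h = xxhash.xxh64(h.to_bytes(8, "little")).intdigest()
    return chosen

# select_nodes(b"some-key", num_nodes=10, replication_factor=3)
# -> e.g. [4, 9, 1]; the same three node IDs in every datacenter.
```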

Auto-Healing and Dynamic Scaling

What’s even better is that when a new node joins the cluster, it will automatically calculate what data needs to be replicated to it, using the aforementioned data distribution algorithm. Granted, this carries the penalty of having to re-shard the entire dataset across all of the nodes in the cluster, which is imperfect at best. However, coupled with a distributed lock, the new node will not be sent any data until all of the data has been properly re-sharded. This also means that if a node “dies,” the data will not be lost, and will remain accessible in other locations.
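
As a sketch of that re-shard step, the joining logic could recompute each key’s owners under the new node count and diff against the old assignment. The helper below builds on select_nodes() from the earlier sketch; the key iteration and the actual data transfer are hypothetical placeholders.

```python
# Sketch of the re-shard step when the cluster grows from old_n to new_n nodes:
# recompute each key's owners under the new node count and flag anything whose
# ownership changed. Builds on select_nodes() from the earlier sketch; the key
# iteration and the actual data transfer are hypothetical placeholders.
def keys_to_move(keys, old_n: int, new_n: int, rf: int):
    moves = []
    for key in keys:
        old_owners = set(select_nodes(key, old_n, rf))
        new_owners = set(select_nodes(key, new_n, rf))
        if new_owners != old_owners:
            # nodes that now need a copy they don't yet have
            moves.append((key, new_owners - old_owners))
    return moves
```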

The Reason for Redis

I used the Redis protocol for one simple reason - usability. There are Redis clients written in practically every language on Earth, and all of them can interface with it. If I wrote my own protocol, it would slow adoption, because I would need to write clients in every language imaginable, which is very plainly not an option. So I designed the database to be compatible with Redis, allowing for plug-and-play adoption.
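
In practice, that means any off-the-shelf Redis client should just work. For example, with redis-py (the hostname below is a placeholder, not a real endpoint):

```python
# Any off-the-shelf Redis client can talk to a Redis-protocol server.
# "fastgold.example.com" is a placeholder hostname, not a real endpoint.
import redis

client = redis.Redis(host="fastgold.example.com", port=6379)
client.set("config:feature-x", "enabled")
print(client.get("config:feature-x"))  # b'enabled'
```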

The Reason Not to Use Redis

Redis has a cluster mode, as well. However, it is not described as production-ready, and it does not shard in the way that Fastgold does. It simply replicates all of the data between all of the nodes, instead of giving each node a subset of the data.

Altogether, this makes for a fast, reliable, fault-tolerant, distributed database.

Hope you enjoyed the read. If you’re interested in more, come find me on GitLab, or reach out to me at my email linked on this site.

–E

This post is licensed under CC BY 4.0 by the author.