Recently I’ve been unlucky enough to contract wrist tendonitis in both arms. I guess it was just a matter of time, always working on laptops in awkward positions or on the couch. I’m getting much better now but the point is that I’ve had to avoid keyboards as much as possible outside work for a while. So I started to read a few papers instead and somehow ran into distributed hash tables (usually abbreviated DHTs). Recovery was slow so I ended up reading quite a few papers covering algorithms, reports, studies, simulations and related optimizations. Got used to reading PDFs somehow. The Wikipedia page is kind of light and stays a bit too abstract, so I thought I’d try to write a more concrete summary. It will help me remember some details too.

So DHTs basically provide the same interface as a normal hash table (put, get) but over a distributed network including a potentially very large number of nodes (think millions) and a very, very large number of key / value pairs. Distribution provides reliability (through redundancy) and the ability to store a quantity of data far beyond what any single machine could handle. The most common use is peer-to-peer, with implementations like Kad (in eDonkey) or Azureus, but there are a few others like distributed file systems, content distribution or overlays for other applications (like chat). DHTs are very interesting in the sense that they mix distribution, resilience (to churn, mostly) and reliability. A nice mix.

Most “classic” DHT algorithms I will talk about here are also optimized for high levels of churn (more on that later). There’s also another class of DHTs, which strictly speaking aren’t DHTs at all, that are optimized for lookups. Dynamo, Amazon’s storage system, is one of them. Here the knowledge of the nodes isn’t distributed: every node knows about all the others. Only the values are distributed. It’s a different tradeoff: on the plus side, you reduce the number of hops since all nodes are known; on the minus side, you can’t handle as many nodes without generating much more network traffic. That last sentence might not be completely clear yet but hopefully it will make more sense later. Just keep in mind that I’m only going to consider “traditional” DHTs here.

Consistent Hashing

The common ground between all DHTs is the reliance on consistent hashing. Let’s say I come up with a naive DHT implementation based on modulo. We have 8 nodes and 100 keys; to assign a key to a node I just take the modulo of the key number by the node count. So key 18 will end up on node 2 and key 63 on node 7. Simple and effective so far. Until a node leaves or a new node arrives. By modifying the node count, the assignment of nearly all the keys changes. Nearly all the stored key / value pairs have to be moved around, leading to a fine and inefficient mess. Ugly pipe waste. Consistent hashing solves that problem.
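
To make the damage concrete, here’s a tiny Python sketch of that modulo scheme (same numbers as above, with one extra node joining):

```python
# Naive modulo-based assignment, with plain integer keys 0..99.
# Count how many keys change nodes when the 8-node network grows to 9.
keys = range(100)

before = {k: k % 8 for k in keys}  # assignment with 8 nodes
after = {k: k % 9 for k in keys}   # assignment after one node joins

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} keys out of {len(keys)} have to move")  # 84 out of 100
```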

The only drawback of consistent hashing is that you really need a drawing to explain it and I suck at drawing, even on a computer. So I’ve borrowed the following image from this Japanese blog post, which actually looks very interesting.

[Image: the consistent hashing ring]

At the start there’s a hash space, which usually corresponds to the whole space of SHA-1 keys (all the way up to 2^160). Keys are expressed in this space by calculating their SHA-1 hash (see the cute little HASH! explosion above?). And the space is usually represented as a circle that you follow clockwise, so when you get past the maximum, you wrap back to zero. The secret of consistent hashing is to also give nodes a hash in that same space and to say that key / value pairs get attributed to the node closest to the key, going clockwise again. We’ll see that the idea of “closeness” can change but this is the easiest to understand.

Say we have two nodes, one at position 50 and another at position 200. If we want to store a key / value pair in the DHT and the key hash is 100, we’ll give it to node 200. Another key with a hash of 30 would go to node 50. Makes sense so far? Good. Otherwise there’s a slightly more extensive explanation here.

It’s easy to see that consistent hashing is actually much better than my moronic modulo-based attribution. If a node leaves the network, only its key / value pairs need to be re-attributed (to its successor). If a node joins the network, only a fraction of its successor’s key / value pairs need to be moved to it. The rest of the hash space stays untouched.
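
Here’s a minimal Python sketch of such a ring, reusing the small integer positions from the example above instead of real SHA-1 hashes (the Ring class and its names are just mine, not taken from any particular implementation):

```python
import bisect

RING_SIZE = 256  # toy identifier space; a real DHT would use 2**160

class Ring:
    """Minimal consistent hashing ring: keys go to their clockwise successor."""

    def __init__(self, node_positions):
        self.nodes = sorted(node_positions)

    def successor(self, key_hash):
        """First node at or after key_hash, wrapping around the ring."""
        i = bisect.bisect_left(self.nodes, key_hash % RING_SIZE)
        return self.nodes[i % len(self.nodes)]

    def add_node(self, position):
        bisect.insort(self.nodes, position)

ring = Ring([50, 200])
print(ring.successor(100))  # 200, as in the example above
print(ring.successor(30))   # 50

# A node joining only affects the arc between its predecessor and itself.
ring.add_node(120)
print(ring.successor(100))  # now 120: keys in (50, 120] move off node 200
print(ring.successor(30))   # still 50, untouched
```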

Chord

With that out of the way, let’s start actually talking about DHTs. The canonical modern DHT algorithm (introduced in 2001 along with the other three: CAN, Pastry and Tapestry) is Chord, and it definitely contributed to popularizing consistent hashing. It’s the one you’ll see most often mentioned, as it’s probably the simplest, in an elegant way. Which makes it interesting to study.

The basic idea is that once you have your nodes on the hash ring, the only thing a node really needs is to know its successor. In the picture above, a key lookup starting at node (a), assuming it doesn’t have the key, would just forward the lookup to (b). If (b) doesn’t know about the key either, it will forward to (c) and so on. Sooner or later, you’ll hit a node that knows about your key. The good thing about this method is that it only requires a node to know about a single other node. The bad thing is that, in practice, you actually have enough memory to store more routing information and enough network bandwidth to maintain more contacts, and that the lookup time with this method is linear in the number n of nodes (so O(n)). It’s also not very resilient to nodes disappearing.
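
As a rough Python sketch (names are made up), the successor-only lookup is just a walk around the ring:

```python
class Node:
    def __init__(self, position):
        self.position = position
        self.successor = None   # the only link each node maintains
        self.store = {}         # key / value pairs this node is responsible for

def lookup(start, key_hash):
    """Walk the ring clockwise until a node has the key: O(n) hops."""
    node, hops = start, 0
    while key_hash not in node.store:
        node, hops = node.successor, hops + 1
        if node is start:       # went all the way around: nobody has it
            return None, hops
    return node.store[key_hash], hops

# Tiny ring: nodes at 50, 120 and 200; key 100 lives on its successor, 120.
a, b, c = Node(50), Node(120), Node(200)
a.successor, b.successor, c.successor = b, c, a
b.store[100] = "some value"

print(lookup(c, 100))  # ('some value', 2): the lookup walked c -> a -> b
```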

To improve the lookup, Chord extends that idea by introducing a routing table containing O(log n) elements (at most 160, and that’s with a shitload of nodes). The first node in this table is the closest successor of our node’s position (say p) plus 2. The second element is the closest successor of key (p + 4). The third of key (p + 8). I think you get my drift. Generally speaking, the element at index i in this table is located at successor(p + 2^i), with i starting at 1. Obviously some nodes will be duplicated in this table (it’s fairly unlikely you’ll have a node between p + 2 and p + 4), so the routing information to maintain stays decently low. That’s important because maintaining routing information requires some constant traffic to make sure known nodes are still alive (otherwise a new node will need to be looked up to replace a missing one). With this scheme, the area “close” to a node on the ring is very well known by that node. The further you go, the sparser its knowledge gets.
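
Here’s a small Python sketch of that routing table, following the same indexing as above (successor(p + 2^i), i starting at 1) on a toy 8-bit identifier space; the node positions are made up:

```python
import bisect

RING_SIZE = 2 ** 8   # toy 8-bit space; Chord uses 160-bit SHA-1 hashes
BITS = 8

def successor(nodes, position):
    """First node at or after position on the ring, wrapping around."""
    nodes = sorted(nodes)
    i = bisect.bisect_left(nodes, position % RING_SIZE)
    return nodes[i % len(nodes)]

def finger_table(nodes, p):
    """Routing table of node p: entry i points at successor(p + 2**i)."""
    return [successor(nodes, p + 2 ** i) for i in range(1, BITS)]

nodes = [10, 50, 120, 200, 230]
print(finger_table(nodes, 50))
# [120, 120, 120, 120, 120, 120, 200]: dense knowledge close by,
# sparse further away, and lots of duplicates that are cheap to maintain
```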

So with that whole scheme, it’s proven that a lookup will take, with high probability, O(log n) hops. Which is actually pretty good in classic P2P: in a network of one million nodes, you shouldn’t need much more than 20 hops to find a value.

Additionally, to improve reliability and allow replication, Chord also introduces a successor table of size O(log n). It simply contains the successors of a given node. Chord is fairly sensitive to the disappearance of its immediate successors (as the routing table jumps further and further away fairly quickly), so this fixes it: when a close node disappears from its routing table, it has enough backups.
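
A quick Python sketch of that successor list, assuming for simplicity that we have the full sorted list of ring positions at hand (a real node obviously doesn’t and builds its list through lookups):

```python
def successor_list(sorted_nodes, p, size):
    """The `size` nodes that follow position p clockwise on the ring."""
    i = sorted_nodes.index(p)
    return [sorted_nodes[(i + k) % len(sorted_nodes)] for k in range(1, size + 1)]

def next_alive(successors, alive):
    """First backup successor that still answers."""
    return next((s for s in successors if s in alive), None)

nodes = [10, 50, 120, 200, 230]
succs = successor_list(nodes, 50, 3)
print(succs)                                # [120, 200, 230]
print(next_alive(succs, alive={200, 230}))  # 200: node 120 left, 200 takes over
```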

Finally, I should mention that the authors of Chord have created another DHT named Koorde. Koorde is interesting because it allows lookups with only O(log n / log log n) hops, which has been demonstrated to be optimal for a DHT of n nodes with an O(log n)-sized routing table. However Koorde doesn’t handle churn as well, and as we’ll see in the next paragraph, that’s an important drawback (and the reason why I won’t detail Koorde).

Churn and Latency

At this point it should be reasonably clear that a major challenge for a DHT implementation in the real world is to cope well with high levels of churn. In file sharing applications for example, it’s pretty common for nodes to pop up in the network, do their thing (like download a couple of songs) and quit after 20 minutes or even less. Under those circumstances, the network never reaches an equilibrium where all routing tables are accurate and all nodes are reachable. It’s in constant evolution, with large numbers of nodes joining and leaving at any given moment. The job of the DHT then becomes being as resilient as possible, avoiding pathological cases (like parts of the network becoming totally isolated from the rest, forming islands) and trying its best to maintain good connectivity.

There have been several studies comparing DHT implementations and their respective resistance to various levels of churn. Chord performs decently well but another DHT consistently comes out ahead of the pack: Kademlia.

Another interesting property is to have an algorithm that’s conducive to some level of affinity between nodes based on network proximity. For example, during a lookup, if the routing table tells us that two nodes A and B are both close enough to the key / value pair we’re targeting but that contacting B is much quicker, we should preferably contact B. And there are algorithms, like Vivaldi, that can calculate the coordinates of a node in the network based on its distance to other nodes. Unfortunately Chord’s routing table doesn’t allow that type of smart routing, but other DHTs do, like Kademlia again.
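
As a rough Python sketch of what that buys you, assume each candidate next hop already carries the kind of synthetic coordinates Vivaldi computes, where the Euclidean distance between two coordinates approximates the round-trip time; the nodes and coordinates below are made up:

```python
import math

def predicted_rtt(a, b):
    """Estimated latency between two synthetic coordinate pairs."""
    return math.dist(a, b)

def pick_next_hop(candidates, my_coords):
    """Among next hops that are equally good for the key, contact the one
    that looks cheapest to reach."""
    return min(candidates, key=lambda c: predicted_rtt(my_coords, c["coords"]))

# Both A and B are close enough to the target key; B is much closer to us
# in latency space, so it gets the lookup.
candidates = [
    {"id": "A", "coords": (120.0, 40.0)},
    {"id": "B", "coords": (10.0, 5.0)},
]
print(pick_next_hop(candidates, my_coords=(0.0, 0.0))["id"])  # B
```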

For these two reasons, in the second part of this post, I’ll detail Kademlia a bit more. It’s by far the most widely deployed DHT and it definitely deserves some attention.