This is the second (and last) installment of a small personal study of distributed hash tables. The first one focused on consistent hashing and the Chord DHT; check it out for more context.
Kademlia scales very well, mostly because its design taps into the lessons learned by existing peer-to-peer networks. That also makes it more complex and a little less elegant than Chord, feeling more like a set of heuristics than a single nice algorithm. But that's often the price to pay to behave well under real-world circumstances.
Probably the most important observation Kademlia relies on is that a node that has been in the network for a while is very likely to stay a while longer. Conversely, a node that recently appeared is likely to leave soon. As a result, Kademlia's routing table practically never removes an "old" node.
Getting a bit more into the details: like most DHTs, Kademlia relies on consistent hashing to locate nodes and key / value pairs. But it shows some originality by using a slightly different notion of distance. Where Chord uses the traditional definition of how far two points are from each other on the circle (the clockwise distance, d(a,b) = (b - a) mod 2^m), Kademlia relies on an XOR metric (d(a,b) = a xor b). It's a valid metric, as it follows the necessary rules:
- d(x,x) = 0
- d(x,y) > 0 if x != y
- d(x,y) = d(y,x)
- d(x,z) <= d(x,y) + d(y,z) (the triangle inequality, which XOR satisfies too)
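These properties are easy to check in code. A minimal sketch (the function name is mine, not from the paper):

```python
def xor_distance(a: int, b: int) -> int:
    """Kademlia's distance between two ids: their bitwise XOR."""
    return a ^ b

# d(x, x) = 0
assert xor_distance(12, 12) == 0
# d(x, y) > 0 if x != y  (here 1100 xor 0101 = 1001, i.e. 9)
assert xor_distance(12, 5) == 9
# d(x, y) = d(y, x): XOR is commutative
assert xor_distance(12, 5) == xor_distance(5, 12)
```

A handy extra property of XOR is unidirectionality: for any id x and distance d there is exactly one id y with d(x,y) = d, so lookups for the same key converge along the same path.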
So why use this crazy XOR metric instead of the classic distance, you ask? Because of the structure of the routing table that Kademlia puts in place. Once again, a picture is worth a thousand words, so I'll give you one (coming from the paper) plus a few hundred words to boot.
Ah, I hear you think: a cool binary tree. Yes it is. The binary tree represents all the numbers in the hash space, so if your node's id, expressed in binary, is 1100 (12 in decimal), it will be placed under the leaf found by going twice through the left branch and then twice through the right branch. Simple, isn't it? Except that the rules to build that tree aren't that simple, but I'll try to explain anyway.
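To make the placement concrete, here is a tiny sketch assuming 4-bit ids and the convention that a 1 bit means taking the left branch (`tree_path` is a hypothetical helper, not something from the paper):

```python
def tree_path(node_id: int, bits: int = 4) -> list:
    """Read an id's path in the tree from its bits, high bit first."""
    return [(node_id >> i) & 1 for i in range(bits - 1, -1, -1)]

path = tree_path(0b1100)  # id 12
branches = ["left" if bit else "right" for bit in path]
# 1100 -> left, left, right, right
```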
Each leaf in the tree contains what's called a k-bucket, basically a list of k elements (in practice, k=20 seems to work well). When a new node enters the Kademlia network, it doesn't have much of a tree: it just has a single empty k-bucket. As it starts learning about the existence of other nodes, it inserts them into that bucket until it reaches its maximum length of k. When a (k+1)th node needs to be inserted, the bucket gets split. All the nodes whose hash's binary representation starts with 1 get assigned to a new k-bucket on the left of the root node, and the ones starting with 0 go to one on the right. Here is our one-level tree. Then our node keeps learning about other nodes and, say, the right k-bucket (with node hashes starting with 0) gets full again. The bucket is split again: all nodes starting with 00 end up in the right-right position (from the top of the tree) and all the ones starting with 01 get reassigned to the right-left position.
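The split step can be sketched roughly as follows, assuming 4-bit ids and k = 2 to keep the example short (real deployments use k = 20); `split_bucket` is my own hypothetical helper, not an API from the paper:

```python
K = 2  # illustration only; Kademlia typically uses k = 20

def split_bucket(bucket: list, bit: int):
    """Partition a full bucket's contacts on bit `bit` of their 4-bit id
    (bit 0 being the most significant). Ids with a 1 there go to the
    left child bucket, ids with a 0 to the right one."""
    ones = [n for n in bucket if (n >> (3 - bit)) & 1]
    zeros = [n for n in bucket if not (n >> (3 - bit)) & 1]
    return ones, zeros

# A full root bucket splits on the first bit: 1xxx goes left, 0xxx right.
left, right = split_bucket([0b1010, 0b0011], bit=0)
```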
Now if we continued this process on and on, we could end up with a routing table as wide as the network: n entries. But we already saw that an overly large routing table is expensive to maintain, which is why Kademlia's routing table only includes O(log n) elements. How? Simply by not storing too many nodes that are far away under the XOR metric: a k-bucket that isn't on the same side of a branch as the current node's hash never gets split.
In the above picture, the current node hash starts with 0011. Therefore it will only have one k-bucket on the left side of the tree for all nodes starting with 1 (represented by the big gray circle). That bucket will never be split. It will also have one single bucket under the right-left path (represented by the medium gray circle), which similarly will never get split. Neat, isn't it? This guarantees that our routing table only maintains O(log n) contacts, and also that a node knows its surroundings well without knowing too much about what's far away. Heard that already? Yep, Chord does the same.
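One way to see why this yields O(log n) contacts: the bucket a contact belongs to is entirely determined by the first bit where its id differs from ours, i.e. by the highest set bit of the XOR distance. A sketch, again assuming 4-bit ids (the helper name is mine):

```python
def bucket_index(own_id: int, other_id: int, bits: int = 4) -> int:
    """Index of the bucket holding `other_id` in `own_id`'s table:
    the position of the most significant differing bit. All far-away
    ids, which differ early, collapse into the same low buckets."""
    distance = own_id ^ other_id
    return bits - distance.bit_length()  # 0 = the "far half" of the tree

own = 0b0011
# every id starting with 1 (the big gray circle) lands in bucket 0...
far = bucket_index(own, 0b1000)
# ...while nearby ids spread over their own buckets
near = bucket_index(own, 0b0010)
```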
But Kademlia's scheme uses k-buckets, with k nodes each, and that's where its main advantage lies: even in a full k-bucket that can't be split, nodes can still be reordered. Each bucket is a list sorted by how recently each node was seen, with the oldest, most persistent contacts kept toward the tail. Those are never dropped unless they stop responding: when a full bucket learns about a new node, the least-recently seen contact is pinged, and the newcomer only replaces it if the ping goes unanswered. Since older nodes are also the most likely to stick around, the network organizes itself around its most reliable points.
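That eviction policy can be sketched like this, with k = 2 for brevity and a hypothetical `is_alive` callback standing in for a real ping RPC:

```python
from collections import deque

K = 2  # illustration only; the paper suggests k = 20

def update_bucket(bucket: deque, node, is_alive) -> None:
    """Least-recently seen contact on the left, most recent on the right.
    A full bucket only drops its oldest contact if that contact is dead."""
    if node in bucket:
        bucket.remove(node)
        bucket.append(node)  # seen again: move to the tail
    elif len(bucket) < K:
        bucket.append(node)
    else:
        oldest = bucket[0]
        if is_alive(oldest):  # ping the least-recently seen contact
            pass              # it answered: keep it, discard the newcomer
        else:
            bucket.popleft()
            bucket.append(node)
```

Note how a long-lived, responsive contact can never be pushed out by a flood of newcomers, which is exactly what makes the routing table resilient.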
I won't detail the lookup algorithm too much, since it comes almost naturally from the routing table's existence. In a few words: you ask the closest nodes you know about in your routing table which nodes they know about that would be closer, and you continue doing that until you don't learn about any new closer nodes. Another very important property of Kademlia lies right here: the lookup always considers k nodes but only queries a subset of them at a time. The freedom in selecting that subset (of size alpha, usually 3) out of k nodes leaves a lot of room for latency optimization.
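A toy, fully synchronous version of the lookup could look like this; `neighbours(node)` stands in for asking a node over the network which contacts it knows, and alpha and k follow the paper's parameter names:

```python
def lookup(target: int, seeds, neighbours, alpha: int = 3, k: int = 20):
    """Query the alpha closest not-yet-queried nodes, under the XOR
    metric, until the k closest nodes we know of have all been asked."""
    known, queried = set(seeds), set()
    while True:
        closest = sorted(known, key=lambda n: n ^ target)[:k]
        pending = [n for n in closest if n not in queried][:alpha]
        if not pending:
            return closest  # nothing closer left to learn from
        for node in pending:
            queried.add(node)
            known.update(neighbours(node))
```

The real protocol sends the alpha queries in parallel and handles timeouts, which this sketch glosses over; that concurrency is precisely where the latency optimization happens.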
That's about it! There are a few additional refinements, especially around maintaining the network by having nodes refresh each other's information, but the most important parts are there. And that's the most deployed DHT algorithm on the Internet.
Improvements and Shortcomings
I've mentioned that having a choice of which nodes to contact is a big advantage of Kademlia. Why? Because if you can find your way around in more than one way, you can start optimizing for the "best" way. On a network, that means the fastest way, which is also often the cheapest. That's important for you, because you don't like to wait, and it's important for ISPs, because they have strong interests in optimizing the routing. An interesting direction for those optimizations is network coordinate systems, like Vivaldi or Pharos, which map network distance onto synthetic coordinates. Using such a system, a given node can easily analyze the surrounding topology.
It's also been observed in the wild, in Kademlia implementations, that strangely enough the distribution of nodes over the hash space wasn't even. Since nodes pick their ids in the hash space at random, you would normally expect them to be more or less evenly distributed. However, that's not what has been observed. How come? Because an uneven distribution is a relatively easy way to game the system. Suppose there's a given key / value pair stored on the network that you would like to censor or swap for another. You could just create a lot of "fake" nodes advertising ids very close to the chosen key. Anybody on the network looking up that key would end up asking one of your nodes, and you can then just return bogus values.
The problem is actually not as bad as it sounds. While it's relatively easy to "hide" a specific key / value pair, given the number of nodes and values present in the network, large-scale censorship is almost impossible to achieve. It's also not trivial to come up with a scheme that would protect the network against that sort of attack. The naive way would be to have a node's neighbors assign it an id based on its IP address. Unfortunately, many nodes can share the same public IP, and there are privacy issues associated with using your IP directly.
I hope you enjoyed this journey in DHT land; it was certainly enjoyable for me. Looking at the bigger picture, it seems that peer-to-peer networks are still fairly underused, and I'm sure there are new and unexpected applications out there. The web is actually a very centralized model, and that has worked really, really well. But it might not always be this way, and there may come a day when another model emerges. I'm not saying P2P will be it, but knowing the alternative network layouts, how they work, and their advantages and shortcomings will surely be useful.
Image by jef_safi.