Archive for August, 2009

Distributed Hash Tables (part 2)

This is the second (and last) installment of a small personal study of distributed hash tables. The first one was focused on consistent hashing and the Chord DHT, check it out for more context.

Kademlia

Kademlia scales very well mostly because its design taps into the lessons learned by existing peer to peer networks. Which also makes it more complex and a little less elegant than Chord, feeling more like a set of heuristics rather than a single nice algorithm. But that’s often the price to pay to behave well under real-world circumstances.

Probably the most important observation that Kademlia relies on is that a node that has been in the network for a while is very likely to stay a little more. Conversely, a node that recently appeared is likely to leave soon. Kademlia’s routing table practically never removes an “old” node.

Getting a bit more into the details, like most DHTs Kademlia relies on consistent hashing to locate nodes and key / value pairs. But it shows some originality by using a slightly different notion of distance. Where Chord simply uses the traditional definition of how far two points are from each other on the circle (d(a,b) = abs(b-a)), Kademlia relies on an XOR metric (d(a,b)=a xor b). It’s a valid metric as it follows the necessary rules:

  • d(x,x) = 0
  • d(x,y) > 0 if x != y
  • d(x,y) = d(y,x)

So why using this crazy XOR metric instead of the classic distance do you ask? Because of the structure of the routing table that Kademlia puts in place. Once again a picture is worse a thousand words so I’ll give you one (coming from the paper) plus a few hundred words to boot.

Kademlia routing table

Kademlia routing table

Ah I hear you think, a cool binary tree. Yes it is. The binary tree represents all the numbers in the hash space, so if your node, when expressed in binary is 1100 (12 in decimal) it will be placed under the node found by going twice through the left branch and twice through the right branch. Simple, isn’t it? Except that the rules to build that tree aren’t that simple but I’ll try to explain anyway.

Each leaf in the tree contains what’s called a k-bucket, basically a list of k elements (and in practice, k=20 seems to work well). When a new node enters the Kademlia network it doesn’t have much of a tree, it just has a single empty k-bucket. When it starts learning about the existence of other nodes, it will insert them into that bucket until it reaches its maximum length of k. When the k+1 node needs to be inserted, the bucket gets split. All the nodes that have a hash binary representation starting with 1 get assigned to the new k-bucket on the left of the root node, the ones starting with 0 go to the one on the right. Here is our one level tree. Then our node starts learning about more other nodes and say the right k-bucket (with node hashes starting with 0) gets full again. The bucket is split and all nodes starting with 00 will be in the right-right position (from the top of the tree) and all the ones starting with 01 will be re-attributed to the right-left position.

Now if we were to continue this process on and on, we could end up with a routing table having the same width as the network: n. But we already saw that a too large routing table is expensive to maintain, which is why Kademlia’s routing table only includes O(log n) elements. How? Simply by not storing too many nodes in the “far away” distance using the XOR metric. It never splits k-buckets that aren’t on the same side of the tree as the current node hash.

In the above picture, the current node hash starts with 0011. Therefore it will only have one k-bucket on the left side of the three for all nodes starting with 1 (represented by the big gray circle). That bucket will never be split. It will also have one single bucket under the right-left path (represented by the medium gray circle), which similarly will never get split. Neat isn’t it? This guarantees that our routing table only maintains O(log n) contacts and also that a node knows well its surroundings without knowing too much about what’s far. Heard that already? Yep, Chord does the same.

But Kademlia’s scheme uses k-buckets, with k nodes. And that’s where its main advantage lies, on a full k-bucket that can’t be split, it can still reorder nodes. The buckets being a list, it orders oldest nodes toward the end. And those will never be dropped unless they stop responding. As older nodes are also the most persistent in the routing table, the network organizes itself around the most reliable points.

I won’t detail the lookup algorithm too much, it almost naturally comes as a result of the routing table’s existence. In a few words, you ask the closest nodes you know about in your routing table which nodes they know about that would be close. And you continue doing that until you don’t learn about any new closer nodes anymore. Another very important property of Kademlia lies right here. The lookup always considers k nodes but only uses a subset of those. This choice in selecting the subset (usually of 3) out of k nodes leaves a lot of space for latency optimization.

That’s about it! There are a few additional refinements, especially around maintaining the network by having nodes refresh each others’ information but the most important part is there. And that’s the most deployed DHT algorithm on Internet.

Improvements and Shortcomings

I’ve mentioned that a big advantage of Kademlia that having a choice in which nodes to contact in Kademlia was a big advantage. Why? Because if you can find your way around in more than one way, you can start optimizing the “best” way. On a network that would mean the fastest which is also often the cheapest. That’s important for you, because you don’t like to wait and that’s important for ISPs because they have strong interests in optimizing the routing. An interesting direction for those optimizations are network coordinates, like Vivaldi or Pharos, that map network distance with real coordinates. Using such a system, a given node can easily analyze the surrounding topology.

It’s also been observed in the wild, on Kademlia’s implementations, that strangely the distribution of nodes on the hash space wasn’t even. When you give a node an id in the hash space, you would normally expect they will be more or less evenly distributed, especially if the id attribution is random. However it’s not what has been observed. How come? Because it’s a relatively easy way to game the system. Suppose there’s a given key / value stored on the network that you would like to censor or swap for another. You could just create a lot of “fake” nodes which would advertise their id as being very close to the chosen key / value. Anybody on the network would need to ask one of your nodes and then you can just return bogus values.

The problem is actually not as bad as it sounds. If it’s relatively easy to “hide” a specific key / value pairs, given the number of nodes and values present in the network, large scale censorship is almost impossible to achieve. And it’s also not trivial to come up with a scheme that would protect the network against that sort of attack. The naive way would be to have neighbors of a node give it a value based on its IP address. Unfortunately it’s possible for many nodes to share the same public IPs and there are privacy issues associated with using your IP directly.

Conclusion

I hope you enjoyed this journey in DHT land, it was certainly enjoyable to me. Looking at the bigger picture, it seems that peer to peer networks are still fairly underused, I’m sure there are new and unexpected applications. The web is actually a very centralized model and that worked really, really well. But it might not always be this way and there may be a day where another model will emerge. I’m not saying P2P will be it but knowing the alternatives in the network layout, knowing how they work, their advantages and shortcomings will surely be useful.

Image by jef_safi.

1 Comment »

Matthieu Riou on August 10th 2009 in Uncategorized

Messing With Javascript

I’ve had this long unfinished rant about Javascript in my drafts for almost a year now. But somehow it felt unfinished, like a useless rant. And it dawned on me recently: “if you’re unhappy about the state of Javascript, just fix it dumb sucker”. That little inner voice had a point. So this a lightweight rant with patches.

See, I actually like Javascript. It’s minimal, functional, now has good implementations (getting better and better), flexible and extensible without magic. Of course it’s not as full-featured as languages with very large standard libraries but it’s not meant to be one anyway and it’s a pretty damn good language to have in your toolbox. And I’m talking about Javascript the language, not the browser DOM which is a library (which, non-arguably, sucks).

The problem though, is that mostly for historical reason, it has some very annoying quirks, much more than most other mainstream languages (one huge quirk being its name). And for some reason, nobody involved in the specs has ever bothered fixing those, even when they’re pretty obviously broken “features”. I guess the main reason is the Holly Backward Compatibility but there are ways to address that.

For most pain posts below I have patches for Rhino, the Javascript implementation I know best. They’re all published on github, applied on the current Rhino trunk, fork away!

Function is 8 characters, that’s looooong

As I already mentioned, Javascript is functional and that’s one of the sweetest thing about it. It allows you to do things like:

[js light="true"]
var total = [0, 1, 2, 3].reduce(function(a, b) { return a + b; });
var flattened = [[0,1],[2,3],[4,5]].reduce(function(a,b) { return a.concat(b); },[]);
[/js]

Sweet no? Well, kind of. There’s actually quite a bit of visual pollution in here, making code longer to write and harder to read, even for very simple examples like these. More complex programs lose elegance.

Javascript 1.8 has improved things a bit with “expression closures” (a very bad name as it’s not necessarily a closure) which are shorter lambda notations. The first anonymous function above could be rewritten “function(a,b) a+b”. It’s better, but just the keyword function is still taking more visual space than the actual implementation.

So I’ve introduced two new symbols. The first one is -> and is strictly identical to the function keyword, it can be used anywhere function can be used. So you get:

[js light="true"]
var flattened = [[0,1],[2,3],[4,5]].reduce(->(a,b) a.concat(b), []);
[/js]

Sweet no? And then there’s one more candy for you: unary lambdas. The ~> symbol is used for a function accepting a single parameter (or even 0 because in Javascript if you can do one you can do zero) that will be bound to the implicit underscore variable. Functions with a single argument are usually pretty common, so I thought it was worth sugaring them a bit. With ~> you can do that:

[js light="true"]
var bigger = [0, 1, 2, 3].map(~> _ + 5);
[/js]

Which will produce a new array with 5 added to each element.

The patch adding -> is here (and very simple), the one adding ~> here. Feedback welcome.

Zero and the empty string are falsy

Truthiness is a pretty useful thing, it makes most conditional testing or assertions shorter to write and more compact. Just in case you don’t know, truthiness for some values is the fact that they can be coerced (or converted) to booleans automatically, even if they’re not. Like this:

[js light="true"]
var m = "a string";
if (m) print("it’s true!");
[/js]

The problem is that Javascript makes undefined, null, false (of course), 0 and the empty string falsy, which is sort of reminiscent of much older languages. Now I can’t recall any program when I had to test values and was considering all null, false, 0 and the empty string as failure cases. Null and false yes (mostly null actually). Null and the empty string, sometimes, but I usually have a utility function to test that as it doesn’t occur that often. Null, false, 0 and the empty string, never. And when you get an array passed to a function, it’s not like you can know easily in advance if there are going to be zeros or empty strings in them

It’s pretty annoying because it’s easy to miss unintended bugs like this:

[js light="true"]
var a = [-1,0,1,2,"","a"];
for (var i = 0, item; item = a[i]; i++) { print(item); }
[/js]

This is a fairly useful Javascript idiom but it won’t print the whole array, the loop will stop at the 0 element. And it’s not like I’m the only one complaining. One could argue that now functions like map shield you against this but you still have the occasional if where you just forgot and for loops still have some usefulness.

So I’ve changed that, making 0 and the empty string truthy and only null and false falsy. The patch is here, feel free to review and tweak.

Undefined considered harmful

So theoretically, you might think having undefined is a good idea. So you have a difference between uninitialized variables (which have for value undefined) and null values (which are initialized with a null value). Practically though, it’s more of a headache because you now have to worry about two different types of things not-really-having-a-real-value and test for both in your programs. Additionally:

[js light="true"]
js> foo != undefined
js: "<stdin>", line 11: uncaught JavaScript runtime exception: ReferenceError: "foo" is not defined.
at <stdin>:11
js> typeof foo == "undefined"
true
js> null == undefined
true
js> null === undefined
false
js> typeof null
object
js> typeof undefined
undefined
[/js]

Kind of messed up. Practically speaking it’s a lot of unnecessary accidental complexity. So my next patch will be to remove undefined, I don’t have that completely ready but stay tuned, I’ll update this post when it’s fully cooked and in the meantime you can watch the Github repository.

Objects, Types and Equality

There’s a duality between primitive types and object types in Javascript that sort of remind the brokenness of Java in that area (which has been somewhat alleviated by boxing/unboxing). The thing is, Javascript is dynamically typed so that bites even more: when you implement a function receiving a string, are you going to get a “abc” or new String(“abc”)? Here is a a Javascript session summing up the problem:

[js light="true"]
js> new String("abc") == new String("abc")
false
js> new String("abc") == "abc"
true
js> typeof "abc"
string
js> typeof new String("abc")
object
js> [] instanceof Array
true
js> new Array() instanceof Array
true
js> var str = "abc"
js> str.foo = true
true
js> str.foo
js> str = new String("abc")
abc
js> str.foo = true
true
js> str.foo
true
[/js]

There are ways to work around the discrepancy in behavior between object-wrapped types and primitive types but again, that’s a lot of accidental complexity. I haven’t completely thought out how to fix that yet, it’s a more ambitious project than the others. So keep watching and help is definitely welcome.

Conclusion

My aim by publishing these patches is not to break compatibility or even fork the language or implementations. It’s more to foster discussion and experimentation. Again, I like Javascript but some of its quirks need fixing. So which one would you fix first?

[youtube]http://www.youtube.com/watch?v=ZbbxA8a_M_s[/youtube]

4 Comments »

Matthieu Riou on August 5th 2009 in Uncategorized