V8 under the hood

I think I can add one more option to my last post about VMs. With Chrome, Google released a new virtual machine for Javascript called V8. I’ve spent some sweet time playing with it and browsing through the code in the past two days and thought some would be interested in my findings.

First it’s not really a “classic” VM. There’s no usable intermediate representation or higher level opcode that you can program at the VM level. The only thing V8 understands is Javascript and the only target representation is native assembler (Intel and ARM for now). So in some respect it’s closer to a compiler than a traditional VM even though the line is blurry for most modern VMs. There’s no interpretation whatsoever.

But it’s also more than just a compiler. It includes a generational, accurate garbage collector and the assembler generation is reworked at runtime depending on execution paths. What they’re calling hidden classes allows the generalization of some calls to optimize them as much as possible. Once you know for sure the address that’s being called, there’s a lot of clever things to make that very fast. As Javascript is fairly dynamic as class structures go (classes are basically hashes, you can add, remove, or update functions whenever you feel like), those hidden classes can also be recalculated which triggers new assembler generation. A fairly cute machinery.

So for those who would want to reuse the VM only to implement their own languages on top of it, they’re out of luck. V8 isn’t like the JVM or Mono, where you can generate intermedate bytecode, it’s straight Javascript to assembler. On the other hand, it makes cross-compiling to Javascript an interesting option. Theoretically and with enough optimizations, this thing could be as fast as C at least on some benchmarks. It’s only going to take time to get to the level of maturity of gcc.

Other than that, as C++ code goes it’s as clean as you get. The API is pretty nice, with specialized higher level classes to wrap Javascript datatypes like String or Number all wrapped in handles for garbage collection. If you wanted to provide a readfile(f) function for example here’s what the code would look like:

Ah, a couple more things. Some standard libraries are written in Javascript. So to package everything in one executable, they translate them to big C arrays containing each char and compile that. Sounds to me like something others could reuse to build their own executables from Javascript, all bundled with V8. Another goodie is snapshotting. When you build V8 from source, you can have it generate a snapshot of the memory state once the libraries are loaded and this gets packaged in the executable. It makes it slightly bigger but the load time is blazingly fast. Pretty sweet.

I think I covered most of what was interesting. So what do you think? For me V8 is a keeper. We’ll see how it evolves and where the benchmark war leads it but the bases are definitely sound. After all, it’s just a first release.

14 Comments »

Matthieu Riou on September 3rd 2008 in Uncategorized

Which VM?

The question is the following: if you had to develop a new dynamic language, which Virtual Machine would you pick? It’s interesting because the VM market is probably going to consolidate in the coming years. Right now, there are several possible choices.

  • The Java(tm) VM. Huge ecosystem, outstanding tooling (think parsers, bytecode manipulation) and Sun seems to be very motivated to improve it for dynamic languages and the next big thing. But I have mixed feelings. Strong corporate backing is a two sided sword. Things nowadays happen outside the Enterprise. There’s a big memory price which shows in the hosting market (compare to the unlimited possibilities of PHP hosting). And there’s a startup cost which makes scripting usage delicate.
  • The .Net DLR on Mono. Another very good option. Except that betting big on Microsoft technology is very risky. And outside of C#, there aren’t that many robust languages implementations running on Mono.
  • Tamarin (ex AVM2, think Flash player). An already fairly good VM that’s getting quickly even better. I would bet that most of the research development to optimize dynamic languages execution will happen there. Pushed by Mozilla which is definitely a plus. But here the problem is the ecosystem, outside of the browser there’s nothing. A whole purpose language would need to develop (or integrate) the whole backend (files, sockets, HTTP and XML stacks, …). That’s a lot of work.
  • The Parrot VM. On paper, it’s the ideal VM. Designed to be multi-language from the ground up, a powerful set of tools eases development (parser, AST, compiler), the libraries come from Perl, it’s fairly fast and it’s a register-based VM. Problem is, it’s been around for a while and they haven’t proven themselves so far. Even Perl 6, which was supposed to be the killer language, isn’t ready. They seem to be getting there though, they’ve progressed pretty nicely in the past year or two.

There are many other VMs out there but they’re often too tied to their hosted language or to their environment, like the Python VM or SquirrelFish (another register VM by the way). So I think those 4 are the main options.

My reason tells me JVM, my heart tells me Tamarin and Parrot falls in the middle. So what’s your choice?

6 Comments »

Matthieu Riou on August 27th 2008 in Uncategorized

An Interview of Guy Steele

For those who never heard his name, you’re probably writing code right now in a programming language that has been partly designed by Guy Steele or that has been inspired by a language Guy Steel designed. He started with Scheme, together with Gerald-Jay Sussman. He’s still very much involved with Common Lisp. He’s also one of the designers behind Java and co-edited the first version of ECMAScript (aka Javascript). And he’s now working on Fortress. So he has a few interesting things to say about programming languages.

InfoQ just published an interview with him, so if you’re interested in the tools you spend most time using for programming everyday, go check it out. It’s a pretty good overview of the landscape.

No Comments »

Matthieu Riou on July 31st 2008 in Uncategorized

Featuritis in languages

Programming languages should be designed not by piling feature on top of feature, but by removing the weaknesses and restrictions that make additional features appear necessary.
- R5RS

The first snarky remark is that, last time I checked, Scheme (and any Lisp for that matter), wasn’t much used at all, so maybe a few additional features wouldn’t hurt. Or would it? A rather painful realization is that Scheme packs more power in a 50 pages specification than all other languages I know. Unless you’re cultivating a type-system fetish, maybe.

2 Comments »

Matthieu Riou on July 13th 2008 in Uncategorized

Apache board elections

After 2 days of voting, today was the end of my first member meeting for the ASF. I acted as a monitor for the vote (to make sure that nobody was cheating) so I got the results early. Was pretty fun. So if you didn’t read it somewhere else already, here is the composition of our new board:

Bertrand Delacretaz
Justin Erenkrantz
J Aaron Faar
Jim Jagielski
Geir Magnusson Jr.
William A. Rowe, Jr.
Sam Ruby
Henning Schmiedehausen
Greg Stein

Just thought you’d want more elections news now that the Obama vs. Hillary thing is over.

No Comments »

Matthieu Riou on June 5th 2008 in Uncategorized

The Problem With Assignment

Continuing on my functional musings, I want to illustrate how introducing assignment in an otherwise purely functional language basically fucks up everything (excuse my French). The start with, let’s say that we have the embryo of a programming language with only a simple set of primitives and most importantly lambdas. Our goal is to build higher level data structures on top of that.

Once again (please bear with me for a moment even if this sounds like my last post), a good start would be a purely functional definition of pairs and the corresponding extraction functions that get the head and the tail elements. The technique we’re going to use is even more functional than last time, only relying on proper function definitions. So assuming that the base language is a bare-bone implementation of Ruby, Javascript or Arc, we could add the following definitions:

def pair(x,y); lambda { |m| m[x,y] }; end
def head(p); p[lambda { |x,y| x }]; end
def tail(p); p[lambda { |x,y| y }]; end

var pair = function(x,y) { return function(m) { return m(x, y); }; };
var head = function(p) { return p( function(x,y) { return x; } ) };
var tail = function(p) { return p( function(x,y) { return y; } ) };

(def make_pair (a b) (fn (m) (m a b)))
(def head (p) (p (fn (a b) a)))
(def tail (p) (p (fn (a b) b)))

It could seem a bit cryptic at first so let me show you how it gets reduced by using an example based on the Arc code which I tend to find most readable (I hate Ruby’s [] syntax for lambdas invocation, tell me what’s the difference between invoking a named or an anonymous function?):

(head (make_pair (23, 7) )
≡ ( (fn (p) (p (fn (a b) a))) (fn (23 7) (fn (m) (m a b)) ) )
≡ ( (fn (p) (p (fn (a b) a)))  (fn (m) (m 23 7)) )
≡ ( (fn (m) (m 23 7)) (fn (a b) a) )
≡ ( (fn (a b) a) 23 7) )
≡ 23

Okay, maybe it’s not that readable. What I’m doing here is replacing the parameters by what they stand for at each line, reducing the expression little by little. As an exercise you can try it with Ruby or Javascript, you should get the same result (unless I have an error in my code).

By the way I didn’t make that up myself, this purely functional representation has been found by Alonzo Church, the father of lambda calculus, and is called the Church encoding of pairs (or Church pairs). Most of what’s really interesting nowadays in computer language comes from Church (Alan Turing was one of his students).

So now we have our very small, almost axiomatic language for which we introduced purely functional data structures. It’s all nice and rosy because there are many assumptions we can take from this point. We can always replace a function by its expression like I did just above. everything is purely deterministic, no uncertainty on results. A given function, provided the same input parameters, will always produce the same result. It’s really convenient.

But it seems that we’re missing something. Wouldn’t it be nice to be able to change the value of things? Even if recursion can go a long way, it’s sometimes convenient to have incremented values or global variables that can change over time. So let’s introduce assignment in our small language and see where it takes us. To be clear, by introducing assignment, I don’t mean giving a value to something. That we could already do (the head or tail functions were given a value in the form of the code they represent). What I’m going to add here is the ability to change that value. So now we can do:

a = 2
a += 5

Sweet. And really this really can’t harm anybody. Or can it? Well, let’s see what are the implications for the rest of our language, does it influence other constructions? The suspense won’t be long: it does. I can change the above purely functional definition of a pair to a mutable one just with a little modification. Let me show you.

def pair(x,y); lambda { |m| m[x,y, lambda { |n| x = n}, lambda { |n| y = n}] }; end
def head(p); p[lambda { |x,y| x }]; end
def tail(p); p[lambda { |x,y| y }]; end
def set-head(p, v); p[lambda { |x,y,mx,my| mx(v) }]; end
def set-tail(p, v); p[lambda { |x,y,mx,my| my(v) }]; end

var pair = function(x,y) { return function(m) { return m(x, y, function(n) { x = n}, function(n) { y = n}); }; };
var head = function(p) { return p( function(x,y, mx, my) { return x; } ) };
var tail = function(p) { return p( function(x,y, mx, my) { return y; } ) };
var setHead = function(p, v) { p( function(x,y, mx, my) { mx(v); } ) };
var setTail = function(p, v) { p( function(x,y, mx, my) { my(v); } ) };

(def make_pair (a b) (fn (m) (m a b (fn (n) (= a n) (fn (n) (= b n) )) )
(def head (p) (p (fn (a b ma mb) a)))
(def tail (p) (p (fn (a b ma mb) b)))
(def set-head (p v) (p (fn (a b ma mb) (ma v) )) )
(def set-tail (p v) (p (fn (a b ma mb) (mb v) )) )

Let’s see how that works with SpiderMonkey:

s> var mypair = pair(31, 9);
js> tail(mypair);
9
js> setTail(mypair, 11);
js> tail(mypair);
11

What we’ve done here is introducing mutator functions in the definition of the pair. Said differently, a pair knows how to modify itself. The set-head and set-tail functions just use what’s in pair.

By simply adding assignment in my little language, I’m now seeing it bubbling up to my pair definition. Once I provide assignment as part of my language, anybody can redefine my nice purely functional pair in something slightly different. Something that introduces side effects. And any code using this mutable pair definition will also be exposed to side effects. And so on and so forth all the way up the software stack. Assignment is viral.

Now, think of it, how many of the non-trivial bugs you’ve been confronted with in your programs can be tracked back to assignment? You know, when you modify something and it happens to leak to something else. When variables don’t really hold the value you expected them to have. Or when some part of a program produces an unintended state change by calling another. If you really think of it, you’ll realize it’s a very significant portion of your debugging life. A whole class of unintended consequences start to appear with the introduction of assignable variables.

So is it worth it? Shouldn’t we just all forget about assignment altogether? I tend to think the answer is yes, even though I have my doubts (I know, it’s ambiguous). We, human beings, are a messy bunch. Assignment reflects that. Besides, you can’t do much with no form of variability at all. Without I/Os programming gets hard and users are an important kind of I/O.

So what’s my point? If we really can’t do without assignment and if assignment is that dangerous, how come almost all programming languages handle it in such a flimsy way? Imperative languages tend to make it as easy as addition without even warning you or reflecting its inherent virality. Functional languages are not that much better. The only one I know of that seems to handle the problem in a reasonable way is Haskell with the monad theory (and monads bubble up). But given the sheer number of articles and blog posts trying to explain what monads are and how they work, I’m guessing it’s far from being the final answer.

There’s something lacking here and while I can’t really think of a proper way to handle the problem for now,  most languages have failed in helping us dealing with it. So the next time you press the = key, think of the bugs you’re probably introducing.

Pictures by DerrickT and latenightowl.

No Comments »

Matthieu Riou on April 21st 2008 in Uncategorized

Ruby In Javascript

Ruby is hot indeed. This Ruby VM written in Javascript runs bytecode compiled by YARV (Ruby 1.9 sort of) in your browser. Yes, you read that right. Right now the Ruby code, when found by the browser, is sent to the server, compiled using YARV, brought back to the browser as bytecode and finally executed by HotRuby. A bit unwieldy but I’m sure it will quickly improve. And the first micro-benchmarks look promising. Interesting.

No Comments »

Matthieu Riou on April 8th 2008 in Uncategorized

License Proliferation: The Bane of Open Source

Henry just wrote a post I’ve been meaning to write for quite a while. It’s fortunate as his argumentation is much clearer than mine would have been and I pretty much agree with his short list.

  • MIT: minimal, sweet and very permissive, do what the fuck you want, only more politically correct.
  • AL 2.0: BSD only better because more generic and including patent termination.
  • MPL: weak copyleft (or weak viral), way clearer than LGPL and still has a strong supporting community.
  • GPL 3.0: strong copyleft (or stong viral) because there is no other choice.
  • AGPL 3.0: strong copyleft, even hosted.

I would even go as far as arguing that most people who like GPL 2.0 and dislike GPL 3.0 (like Linus) should really be looking at MPL instead.

License proliferation makes the use of any open source software harder. It’s hurting open source because anytime a package is used you have to figure out what the license is and play the compatibility game. Is it compatible with my license? Will it “pollute” it? With just 5 licenses it’s already is a bit of a headache, manageable but sometimes painful. With too many licenses, some more documented than others, it gets close to impossible to release something sane. So choose well and think of your users.

Considering which license to use when you start a new project is really important. It determines where and how your software can be used, how freely and with which restrictions. It’s a balance between protecting your code and maintaining you ownership on one side and making it available with as few strings attached as possible on the other. You can’t achieve both so you have to think of where exactly you stand. Don’t just use the license du jour because that’s what everybody does, all licenses haven’t been created equal and some of them are really clunky.

As a result of your choice, some people who would like to use your software won’t be able to (like Apache if you choose GPL). Is that a problem? As a result of your choice, some people you may not want to give your software for free to, will use it for free (some BigCos if you use MIT or AL 2.0). Is that a problem?

Open source is all about freedom, the one you keep for yourself, the one you give to others, as a gift. Licenses are at the core of this freedom. Pick one carefully.

No Comments »

Matthieu Riou on March 21st 2008 in Uncategorized

Functional Data Structures Out Of Nowhere

I’ve been watching the Structure and Interpretation of Computer Programs videos recently while riding the Caltrain and enjoyed it quite a bit. For some reason I can’t quite grasp, I find these fun. Maybe it’s the fact that these are 20 years old now and still terribly relevant (especially for functional programming), maybe it’s the look of the attendance, very eighties, or maybe the obvious delectation Hal Abelson and Gerald Jay Sussman have teaching. Anyways, pretty intesting stuff.

One of the thing they emphasize a lot during their lessons is the blurring line between data structures and functions when programming in a functional style extensively. The Javascript accessors overriding technique in my last post was a pretty good example of it but there’s a much better one in the video: constructing a Pair data structure out of pure nothingness. I’ve adapted it in 3 different functional languages below, just take your pick.

Javascript

Ruby

Arc

It’s a very good illustration of how to introduce data structures in any functional language. Once you have a pair, you have a list. Once you have a list, you have a hash. I won’t argue that those would probably be a few order of magnitude slower than data structures mapped directly in memory but still, it sort of turns the world upside down. Instead of having objects that are bags of functions, you have functions that operate on something that feels like an object.

The implementation is of course a bit awkward, using this little pick parameter to select the right value in the closure. But that’s not the point, if I were to tell you that I’ve made a Pair API for you with a way to build a pair, get its head and its tail, the way it’s programmed would be irrelevant to you. I can change my implementation anytime without you even noticing. The abstraction is complete.

Photo (and painting) by Rob Lee

11 Comments »

Matthieu Riou on March 3rd 2008 in Uncategorized

Javascript Meta-Programming Techniques

A pretty good presentation from Adam McCrea already explains how to use meta-programming to build your own Javascript libraries. However I’d argue that it’s more about clever APIs than strictly meta-programming. For the best or the worst, meta-programming involves magic. Like dynamic finders in Rails or should() and have() in RSpec.  It’s mostly about dynamically catching method calls, adding methods on the fly or proxying them. Like reprogramming while executing. Hence the name.

So I’m going to show you some techniques available in Javascript to achieve the same results. I’ll start with fairly trivial stuff to move into the nitty-gritty details. Be aware that it’s very easy to shoot yourself in the foot if you abuse these tricks, guaranteeing endless hours of pain for you and your users. You know, with great powers come great responsibilities and such…

A last warning. Some of the techniques are simple plain Javascript but most of them use features only available in SpiderMonkey and Rhino. So if you’re thinking browser, forget about IE  and most non-Mozilla stuff. But I’m thinking server anyway. The good news is that ECMAScript 4 (aka Javascript 2.0) will include some variations of these as part of the standard language (which doesn’t mean the Redmond people will give a shit).

Altering Built-in Prototypes

Javascript is prototype based, so if you want to alter the behavior of all objects of a given “type” you have to alter its prototype. However it’s possible to alter the built-in prototypes in any Javascript engine and make them do whatever suits your fancy. A fairly common way to use this is to add utility functions that make your code nicer, like array.first() for example. Another far more dangerous way is to alter the behavior of existing functions. Needless to say, it’s fairly easy to break a lot of code this way.

For example, imagine we’d like all string concatenation to be separated by a dash:

As demonstrated above, cleaning up your mess by restoring the original function is highly recommended.

Accessors

A set of methods have been defined in SpiderMonkey and Rhino to let you mess with accessors. These are the sweetly named __defineGetter__, __defineSetter__, __lookupGetter__ and __lookupSetter__. That’s a lot of underscores. The idea is you can get an object, override one of its accessors with a function of yours and do all sort of neat things with it, like proxying calls for example. Here is an example that will let you do just that:

This function will watch property changes, setting valueChanged to true if it did change. Additionally, if functions are provided for the onSet and onGet attributes, these will get invoked to eventually intercept the call and change its result.

There’s a little gotcha to be aware of here, it’s not directly related to meta-programming but is also interesting. This function call will create three closures, referencing the value and the valueChanged local variables. So when the setter is invoked, the variable that will be set is value, not the underlying accessor. Similarly, the getter will return value, not what the accessor contains. Why doing it that way? Simply because using the accessor directly would result in an endless loop.

One more little detail. These accessor methods also work for array indexes. After all, an array is just an object whose properties are numbers. So if you really want an array to have a constant value at the 0 position:

No Missed Calls

In Ruby, magic is very often implemented with method_missing. It’s a method that’s used as a callback by the interpreter when it can’t find a method being invoked on an object. Add a method_missing method to your classes and all of a sudden you can intercept any missed call.

It turns out that Rhino and SpiderMonkey also have their own  callback method: __noSuchMethod__. Let’s see a trivial example:

Neat, isn’t it? Let’s see how a Rails style dynamic finder could be implemented.

There are quite a few utility methods used here but their code isn’t really needed to understand the example. The name of the function being called is checked to see if it starts with “findBy”. Whatever comes after is used as parameters to generate a query dynamically.  The most important part of this code is actually the exception thrown. When you’re playing with that sort of tricks, it’s very important to handle everything you can handle properly and fail as soon as something goes through the cracks. Otherwise you’re up for looong debugging sessions.

This code could also be further optimized. Always going through the __noSuchMethod__ callback has a performance penalty. But if findByName is called once, you can safely assume that it’s going to be called at least a few more times. A better implementation would add the findByName method in the called object so that any subsequent call would find its destination directly. Yep, you can also do that in Javascript.

Conclusion

I’ve detailed the 3 main techniques of meta-programming I’ve seen used in libraries in the wild. The magic bit. Javascript is already a very dynamic language, allowing a lot. But sometimes, having the possibility to do this one more thing makes all the difference between a good framework (JUnit) and a wonderful framework (RSpec). Some would call them hacks. I call them powerful tools that I keep in a special place in my toolbox and try to use wisely.

2 Comments »

Matthieu Riou on February 19th 2008 in Uncategorized