My next field of study is neural networks, I’ve been reading about them for some time. Interestingly, it’s not really easy to find resources that aren’t full of not-so-trivial math, so I had to get a calculus refresher as well. By the way if you’re interested in learning algorithms strictly from a programming standpoint, I can’t recommend enough Programming Collective Intelligence, very good practical book.

Before we get into the details, neural networks are used in supervised machine learning to solve classification problems and more generally build non-linear models out of raw data. One of their main strength is that they’re good at generalizing (responding sensibly to data that hasn’t been seen before). They were very popular in the 70s, many thought they were the early stage of a new kind of intelligence. That didn’t really happen so they fell in disgrace. To make a comeback in the mid-80s when back-propagation was invented. Then new machine learning techniques came about more recently, like Support Vector Machines, showing somewhat better results (with somewhat complex math techniques). But neural nets came back with a vengeance! With techniques still evolving, deep belief nets or LSTMs are now showing really good results on some categories of problems (on that last subject, a couple of Google tech talks from Hinton are worth the time).

So the simplest form of neural network is Rosenblatt’s Perceptron, which really isn’t much of a network given that it’s usually represented as a single neuron. Behold the Wikipedia illustration ripoff:

Basically, you have a series of input signals that stimulate the neuron into producing an output signal. Each input has a weight and the output is the sum of all inputs time their respective weight, passed through some normalization function. The role of the function is simply to normalize the sum.

The f function can be as simple as a step function (i.e. 0 for x < 0 and 1 for x > 0) although most problems are better handled with more smoothness (like an hyperbolic tangent). At this point you may start to wonder: but where is the learning? It’s all in the weight, brother. For example, say you want to predict the chances of someone having a cancer using a set of basic physiological measurements before recommending a more thorough and expensive examination. Each of theses measurements (blood pressure, quantity of white cells, …) is going to be one of the X input in the above picture, the expected output would ideally be a number between 0 to 1 telling you how sure the network is. To get to that you need to adjust the weights of each connection by training the network. Enters supervised learning.

In supervised learning, the network is fed a data set for which the answers are already known. The difference between the expected output and the one produced by the network is the error and can be used directly to adjust the weights:

In plain English, the new weight for a given input is the old weight added to the product of a learning factor (usually in the 0.1 to 0.01 ballpark, depending on how fast or precise you want the learning to happen), the error and the input. Pretty easy, no?

But there’s a catch, with Rosenblatt’s Perceptron you can only model a linear problem (the solutions can neatly be separated by a straight line). Sometimes that’s enough but often, like in my cancer detector example, it’s not. So what can we do about it? Just stack the neurons on top of each other.

Here we have 3 layers, neurons on each layer are connected to all the others neurons of the neighboring layers (the illustration above is slightly wrong, missing connections between the hidden and the output). Middle layers (there’s only one in the illustration) are called hidden layers, as they’re invisible to the outside world. The network can be as large as needed, with many hidden layers composed of many neurons. The thing is, depending on the problem, a network that would be too large can actually perform poorly (over-fitting) and will be long to train. That’s where Perceptrons can be annoying: in theory, they can model any function, in practice several educated guesses need to be made as to the size and structure of the network to get good results. That being said, people have been using them quite successfully for some time.

Multi-layers perceptrons were first described in 1969 but it took a while to figure out how to train them. 17 years actually. The trick is to propagate the error calculated at the output (just like for Rosenblatt’s perceptron) back through the hidden layers, all the way to the input layer. The algorithm is fairly well explained on Wikipedia so I won’t go into the details. Plus you’re going to get some code anyway.

Multi-layers perceptrons are still fairly basic as far as neural networks go nowadays but it’s necessary to understand them well to understand other networks. They also perform decently well and are easy to implement, so why not do that, huh? I’ve implemented one in Scala, it gave me a good occasion to try the language a little more. At the beginning I wanted the implementation to be purely functional but somewhere along the way it felt more natural to start mutating the objects. I’m not sure whether that’s because of Scala, because of object-orientation or because of me. My Scala is probably still a little rough, so let me know if there are more idiomatic ways.

To make it easier to try, instead of using a data set that you’d have to download, I’m training the network on the XOR function. It’s non-linear, so it’s a good illustration of the capabilities of the network, even though the result doesn’t have as much wow factor.