Tuesday, April 28, 2015

Neural Networks -- You Failed Me!

Kaggle have a competition open at the moment called the "Otto Group Product Classification Challenge". It's a good place to get started, if for no other reason than that the data files are quite small. I'd been finding that even on moderate-sized data sets I was struggling to make much progress, simply because of the time it took to run a single learning experiment.

I copied a script called "Benchmark" from the Kaggle site, ran it myself, and achieved a score of 0.57, which put me low on the leaderboard (but not at the bottom). It took about 90 seconds. Great! I had a starting point.
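
For anyone playing along at home, the benchmark boils down to something like this. This is my own minimal sketch in Python with scikit-learn rather than the exact script from Kaggle, and the file name, the 80/20 split and the forest size are assumptions on my part:

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import log_loss

    # Otto training data: an id column, 93 numeric features, and a 'target' class label.
    train = pd.read_csv("train.csv")
    X = train.drop(["id", "target"], axis=1).values
    y = train["target"].values

    # Hold out 20% so we can score locally with the competition metric,
    # multiclass log loss (lower is better).
    rng = np.random.RandomState(0)
    idx = rng.permutation(len(X))
    cut = int(0.8 * len(X))
    tr, val = idx[:cut], idx[cut:]

    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
    clf.fit(X[tr], y[tr])

    print("validation log loss:",
          log_loss(y[val], clf.predict_proba(X[val]), labels=clf.classes_))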

I then tried to deploy a neural network to solve the problem (the benchmark script uses a random forest). The data consist of about 62 thousand examples, each composed of 93 input features, and the training goal is to assign each example to one of 9 classes. It's a basic categorisation problem.

I tried a simple network configuration -- 93 input nodes (by definition), 9 output nodes (by definition) and a hidden layer of 45 nodes. I gave the hidden layer a sigmoid activation function and the output layer a softmax activation function. I don't really know if that was the right thing to do -- I'm just going to confess that I picked them as much by hunch as by prior knowledge.
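
In code, the configuration was roughly this. I'm sketching it in Keras here, which isn't necessarily the stack I was using, and the optimiser, batch size and epoch count are illustrative; the layer sizes and activations are the ones described above:

    from keras.models import Sequential
    from keras.layers import Dense
    from keras.utils import to_categorical

    # 93 input features -> one hidden layer of 45 sigmoid units -> 9 softmax outputs.
    model = Sequential()
    model.add(Dense(45, input_dim=93, activation="sigmoid"))
    model.add(Dense(9, activation="softmax"))

    # Softmax outputs plus one-hot targets pair naturally with categorical cross-entropy.
    model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])

    # X is the (n_samples, 93) feature matrix from the benchmark sketch above; the targets
    # must be one-hot rows of length 9, so map "Class_1".."Class_9" to 0..8 first:
    # y_int = np.searchsorted(np.unique(y), y)
    # model.fit(X, to_categorical(y_int, num_classes=9), epochs=100, batch_size=128)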

It was a bit late at night at this stage, and I actually spent probably an hour just trying to make my technology stack work, despite the simple nature of the problem. The error messages coming back were all about failures to broadcast array shapes, rather than anything that immediately pointed out my silly mistake, so it took me a while. C'est la vie.
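
For the record, here's the classic kind of mismatch that produces those broadcasting complaints. I'm not claiming this was exactly the mistake I made, just illustrating why the error points at shapes rather than at the real problem:

    import numpy as np

    n = 4
    predictions = np.random.rand(n, 9)   # softmax output: one row of 9 probabilities per example
    labels = np.array([2, 0, 7, 5])      # integer class labels, shape (n,)

    # Element-wise arithmetic between shapes (4, 9) and (4,) can't be broadcast,
    # so numpy complains about shapes instead of saying "your labels aren't one-hot".
    try:
        error = predictions - labels
    except ValueError as e:
        print(e)   # operands could not be broadcast together with shapes (4,9) (4,)

    # One-hot encoding the labels lines the shapes up: (4, 9) minus (4, 9).
    error = predictions - np.eye(9)[labels]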

Eventually, I got it running. And I got much, much worse results than the benchmark random forest. What? Aren't neural networks the latest thing? Aren't they supposed to be better than everything else, and, better yet, don't they just work? Isn't the point of machine learning to remove the need to develop a careful solution to the problem? Apparently not...

I googled around and found that someone had put together a similar approach with just three nodes in the hidden layer. So I tried that.

I got similar results. Wait, what? That means 42 of my 45 hidden nodes were just dead weight! I guess there must be something like three main abstractions in the data that explain a good chunk of the results.

I tried more nodes. More hidden layers. More training iterations. More iterations helped a bit, but not quite enough. I swept from 15 to 800 iterations: I needed at least 100 or so before returns started to diminish, and I kept seeing small gains all the way up to 800. But I never came close to the basic random forest approach.
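
The sweep itself was nothing clever: just training in stages and scoring a hold-out set as the iteration count grew. A rough sketch of what that looks like, assuming the Keras model above plus X_tr/X_val feature splits and one-hot label splits Y_tr/Y_val (names of my own invention, not from the real script):

    from sklearn.metrics import log_loss

    # Train incrementally: each fit() call continues from the current weights,
    # so the checkpoints cover the 15-to-800 iteration range cumulatively.
    checkpoints = [15, 50, 100, 200, 400, 800]
    done = 0
    for total in checkpoints:
        model.fit(X_tr, Y_tr, epochs=total - done, batch_size=128, verbose=0)
        done = total
        probs = model.predict(X_val, verbose=0)
        score = log_loss(Y_val.argmax(axis=1), probs, labels=list(range(9)))
        print(done, "iterations -> validation log loss", round(score, 4))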

I have little doubt that the eventual winning result will use a neural network. The question is -- what else do I need to try? I will post follow-ups as I wrangle with this data set. However, I really wanted to share this interesting intermediate result, because it speaks to the process of problem-solving. Most success comes only after repeated failure, if at all.