Wednesday, April 29, 2015

Neural Net Tuning: The Saga Begins

So, it has become abundantly clear that a totally naive approach to neural net building doesn't cut the mustard. A totally naive approach to random forests does cut a certain amount of mustard -- which is good to know!

First lesson: benchmark everything against the performance of a random forest approach.
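For concreteness, here's roughly what that benchmark looks like in scikit-learn. This is a sketch only: it assumes the usual Otto train.csv layout (id, target, 93 feature columns), and the parameters are illustrative rather than anything I've tuned.

```python
# Baseline sketch: random forest on Otto-style data, scored with the
# competition's multi-class log loss. Parameters are illustrative.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")                    # assumed path to the Otto training set
X = train.drop(["id", "target"], axis=1).values     # the 93 numeric feature columns
y = train["target"].values                          # class labels Class_1 .. Class_9

rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)

# scikit-learn negates log loss so that bigger is better; flip the sign
# to compare against leaderboard-style numbers.
scores = cross_val_score(rf, X, y, cv=5, scoring="neg_log_loss")
print("Random forest log loss: %.3f" % -scores.mean())
```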

Let's talk about levelling up, both in terms of your own skill, and in terms of NN performance. And when I say your skill, that's a thinly-veiled reference to my own skill, just put in the second person -- yet maybe you'll benefit from watching me go through the pain of getting there :).

In this particular case, I discovered two things yesterday: a new favourite NN library, and a solution to the Otto challenge which performs better than mine. I recommend them both!

Keras (http://keras.io/) is a machine learning library which is well-documented (!), utilises the GPU when available (!) and includes tutorials and worked examples (!). I'm very excited. It also matched my own programming and abstraction style very well, so I'll even call it Pythonic. (Note: that is a joke about subjective responses, the No True Scotsman fallacy and whether there is even such a thing as Pythonic style.) (Subnote: I still think it's the best thing out there.)
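To give a flavour of the style, here's a minimal sketch of a small three-layer net in the Keras Sequential style. The data is random filler and the layer sizes are made up (apart from the 93 features and 9 classes of the Otto problem), so treat it as an illustration rather than a solution.

```python
# Minimal Keras sketch: a small feed-forward classifier on dummy data.
# Layer sizes and training settings are illustrative only.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X_train = np.random.rand(1000, 93).astype("float32")   # stand-in for the 93 Otto features
labels = np.random.randint(0, 9, size=1000)             # stand-in for the 9 product classes
y_train = np.eye(9)[labels]                             # one-hot encode the labels

model = Sequential()
model.add(Dense(64, activation="relu", input_dim=93))   # single hidden layer
model.add(Dense(9, activation="softmax"))               # output: class probabilities

# Standard multi-class setup: cross-entropy loss, plain SGD.
model.compile(loss="categorical_crossentropy", optimizer="sgd")
model.fit(X_train, y_train, batch_size=128, epochs=5)
```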

https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/13632/achieve-0-48-in-5-min-with-a-deep-net-feat-batchnorm-prelu is a Kaggle forum post describing its author's NN implementation, built on the Keras library.

https://github.com/fchollet/keras/blob/master/examples/kaggle_otto_nn.py is the Python source code for their worked solution.

Their neural net is substantially different to the simple, naive three-layer solution I started with and talked about in the last blog post. I plan to look at the differences between our two networks, and comment on the impact and value of each change to the initial network design as I make it, including whether that change is generally applicable, or only offers specific advantages to the problem at hand. The overall goal is to come up with general best design practices, as well as an understanding of how much work is required to hand-tool an optimal network for a new problem.
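Going purely by the title of that forum post, the general shape of their network is blocks of Dense, PReLU activation, batch normalisation and dropout, stacked a few deep. The sketch below is my guess at what such a stack looks like in Keras -- widths, depth and dropout rates are invented, so refer to the linked source for the real network.

```python
# Sketch of a deeper net in the style of the forum post:
# repeated Dense -> PReLU -> BatchNorm -> Dropout blocks.
# Layer widths, depth and dropout rates are guesses, not the posted solution.
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization, PReLU

model = Sequential()

model.add(Dense(512, input_dim=93))    # first wide hidden layer
model.add(PReLU())                     # learned, per-unit leaky activation
model.add(BatchNormalization())        # keep activations well-scaled between layers
model.add(Dropout(0.5))                # regularise the wide layer

model.add(Dense(512))                  # second hidden block, same pattern
model.add(PReLU())
model.add(BatchNormalization())
model.add(Dropout(0.5))

model.add(Dense(9, activation="softmax"))   # 9 Otto classes

model.compile(loss="categorical_crossentropy", optimizer="adam")
```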

My next post will be on the impact of the first change: randomising the order of your input examples (I may add some other tweaks to the post). Spoiler alert! The result I got was faster network training, but no significant improvement in final performance. Down the track, I'll also experiment with removing this change from a well-performing solution and discuss things from the other side. It may be that randomising the order is useful but not necessary, or it may be integral to the whole approach. We shall find out!
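For reference, the change itself is tiny -- something along the lines of the sketch below, which assumes the X_train / y_train arrays and model from the earlier sketch. You can shuffle the arrays yourself with numpy, or lean on the shuffle flag in Keras's fit().

```python
# Randomise the order of training examples before training.
import numpy as np

perm = np.random.permutation(len(X_train))        # one shuffled index per example
X_shuf, y_shuf = X_train[perm], y_train[perm]     # apply the same permutation to both

# fit() can also reshuffle between epochs for you:
model.fit(X_shuf, y_shuf, batch_size=128, epochs=5, shuffle=True)
```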