Tuesday, March 24, 2015

First steps to Kaggling: Making Any Improvement At All

Practical Objective:

Update: I realised it might have been better to precede this post with an outline of how to get set up technically for running the examples. I'll post that one tomorrow! Experienced developers should have no trouble getting by, and for everyone else the how-to on getting set up is only a short wait away.

The first goal is to try to make a single improvement, of any degree, to the implementation provided in the Kaggle tutorial for the "ocean" data set (see below). This will exercise a number of key skills:
  • Downloading and working with data
  • Preparing and uploading submissions to Kaggle
  • Working with images in memory
  • Applying the Random Forest technique to images
  • Analysing and improving the performance of an implementation

Discussion and Background

Machine learning and data science are replete with articles about the power, elegance and ease with which tools can be applied to complex, novel problems. Domain knowledge can be discarded in a single afternoon, and you can just point your GPU at a neural network and go home early.

Well, actually, it's darned fiddly work, and not everyone finds it so easy. For an expert with a well-equipped computer lab, perhaps a few fellow hackers and plenty of spare time, it probably isn't such a stretch to come up with a very effective response to a Kaggle competition in the time frame of about a week. For those of us working at home on laptops and a few stolen hours late in the evening or while the family is out doing other things, it's more challenging.

This blog is about that journey. It's about finding it actually hard to download the data because for some reason the download rate from the Kaggle servers to Melbourne, Australia is woeful, and so it actually takes a week to get the files onto a USB drive, which by the way only supports USB 2.0 transfer rates and isn't really a good choice for a processing cache...

So this post will be humbly dedicated to Making Any Difference At All. Many Kaggle competitions are accompanied by a simple tutorial that will get you going. We'll use the (now completed) competition on identifying plankton types. I'll refer to it as the "ocean" competition for the sake of brevity.

Back to the Problem

I would recommend working in the IPython Notebook environment. That's what I'll be doing for the examples contained in this blog, and it's a useful way to organise your work. You can take a short-cut and clone the repository for this blog from https://github.com/tleeuwenburg/stml.

Next, download the ocean data, and grab a copy of the tutorial code: https://www.kaggle.com/c/datasciencebowl/details/tutorial

It's a nice tutorial, although a couple of key steps are omitted, such as how to prepare your submission files. It implements a random forest approach, which is a highly robust way to get reasonably good solutions fast. It's fast in every way: it works well enough on a CPU, it doesn't need much wall-clock time even on a modest-spec laptop, it doesn't take much time to code up, and it doesn't require a lot of specialist knowledge.
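Since submission preparation is one of the steps the tutorial skips, here is a minimal sketch of how it can be done. The ocean competition expects a CSV with one row of per-class probabilities for each test image; `write_submission`, the column layout, and the argument names here are my own illustrative choices, not the tutorial's code.

```python
import pandas as pd

def write_submission(clf, X_test, test_filenames, class_names,
                     out_path="submission.csv"):
    """Write one row of class probabilities per test image.

    Assumes `clf` is a fitted scikit-learn classifier whose output
    classes line up with `class_names` (the plankton class directories).
    """
    probs = clf.predict_proba(X_test)       # shape: (n_images, n_classes)
    df = pd.DataFrame(probs, columns=class_names)
    df.insert(0, "image", test_filenames)   # first column: image filename
    df.to_csv(out_path, index=False)
```

Double-check the exact header names against the sample submission file on the competition page before uploading; an unrecognised header will cause a rejected submission.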

There are a lot of tutorials on how to use random forests, so I won't repeat that material here. You can work through the one linked above in advance, or just try to get the notebooks I've shared working in your environment.
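For orientation, the core of the approach fits in a few lines: resize every image to a fixed square, flatten it so each pixel becomes one feature, and hand the result to a random forest. The random arrays below stand in for the real plankton images; `images_to_features` is my own illustrative helper, not a name from the tutorial.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def images_to_features(images, side=25):
    """Flatten a list of (side, side) arrays into a feature matrix."""
    return np.array([img.reshape(side * side) for img in images])

# Illustrative only: random "images" stand in for the plankton data.
rng = np.random.RandomState(0)
train_imgs = [rng.rand(25, 25) for _ in range(40)]
labels = rng.randint(0, 4, size=40)

X = images_to_features(train_imgs)        # shape: (40, 625)
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X, labels)
```

The only image-specific work is the flattening step; from the forest's point of view each pixel is just another numeric column.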

This post is about how to take the next step -- going past the tutorial into your own work. First -- just spend some time thinking about the problem. Here are a few things about it which took me by surprise when I reflected on them:

  • The random forest is actually using every pixel as a potential decision point. Every pixel is a thresholded yes/no point on a (forest of) decision tree(s).
  • A decision tree can be thought of as a series of logical questions; equivalently, as a series of binary segmentations of the data. I've only seen binary decision trees; presumably multi-way splits exist in theory, but I'm just going to leave that thought hanging.
  • The algorithm doesn't distinguish between image content (intensity values) and semantic content (e.g. aspect ratio, image size)
  • Machine learning is all about the data, not the algorithm. While there is some intricacy, you can basically just point the algorithm at the data, specify your basic configuration parameters, and let it rip. So far, I've found much more work to be in the data handling rather than the algorithm tuning. I suspect that while that's true now, eventually the data loading will become more routine and my library files more sophisticated. For new starters, however, expect there to be a lot of time working with data and manually identifying features.
  • It's entirely possible to run your machine out of resources, or to have it sit there and compute for longer than your patience allows. Early stopping and the use of simple algorithms let you explore the data set much more effectively than training every algorithm until it plateaus.
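The last point above is worth making concrete. A sketch of the kind of deliberately cheap settings I mean -- the specific numbers here are illustrative choices for fast iteration, not tuned values:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Cheap settings for exploration: a few shallow trees finish in seconds
# and still tell you whether a new feature helps at all.
quick_clf = RandomForestClassifier(
    n_estimators=10,   # a "serious" run might use 100+
    max_depth=8,       # shallow trees train fast and stay within memory
    n_jobs=-1,         # use every core available
    random_state=0,
)

# Illustrative stand-in data.
rng = np.random.RandomState(0)
X, y = rng.rand(200, 50), rng.randint(0, 5, 200)
quick_clf.fit(X, y)
```

Once a cheap run suggests an idea has merit, you can rerun with more and deeper trees to see how far it goes.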

I have created a version of the tutorial which includes a lot of comments and additional discussion of what's going on. If there's something I haven't understood, I've also made that clear. This blog is *not* a master-to-student blog, it's just one hacker to another. I would suggest reading through the notebook in order to learn more about what's going on at each point in the tutorial.

You don't need to run it yourself; I have made a copy available at:

This first notebook is simply a cut-down and annotated version of the original tutorial. It contains my comments and notes on what the original author is doing.

The next step is to try to make any improvement at all. I would suggest spending a little time considering what additional steps you would like to try. If you have this running in a notebook on your own machine, you could try varying some of the parameters in the existing algorithm, such as the image size and the number of estimators. Increasing both of these could well lead to some improvement. I haven't tried either yet, so please post back your results if you do.
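If you want to try the parameter variations suggested above systematically, a small sweep helper makes it easy to compare. This is a hypothetical sketch: `sweep` and `load_images` are my own names, where `load_images(side)` stands in for whatever routine you use to resize and flatten the training images to `side` x `side` pixels.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def sweep(load_images, labels, sides=(25, 35), tree_counts=(100, 300)):
    """Try each (image size, forest size) pair; report CV accuracy."""
    results = []
    for side in sides:
        X = load_images(side)
        for n_trees in tree_counts:
            clf = RandomForestClassifier(n_estimators=n_trees,
                                         n_jobs=-1, random_state=0)
            score = cross_val_score(clf, X, labels, cv=3).mean()
            results.append((side, n_trees, score))
    return results
```

Keep the grid small at first -- each extra cell multiplies the run time, and on a laptop that adds up quickly.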

I tried the approach of including the size of the original image as an additional variable at the input layer, computed as a simple count of the number of pixels. An interesting feature of this data set is that the image size is matched to the actual size of the represented organism; normally, you'd have to take account of the fact that distance from the camera affects the size of the object within the image. Only three lines changed -- you should try to find them within create_X_and_Y().
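The idea can be sketched in a few lines. The actual changed lines live in create_X_and_Y() in the notebook; `add_size_feature` below is my own illustrative helper showing the same trick of appending the pixel count as one extra feature column.

```python
import numpy as np

def add_size_feature(X_pixels, original_images):
    """Append each image's original pixel count as one extra column.

    `X_pixels` holds the flattened, resized pixel features; the count is
    taken from each image *before* resizing, since that is where the
    size signal lives in this data set.
    """
    sizes = np.array([img.shape[0] * img.shape[1]
                      for img in original_images])
    return np.column_stack([X_pixels, sizes])
```

Because the forest treats every column uniformly, nothing else needs to change: the size column simply becomes one more candidate split point.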

A web-visible copy is available at: http://nbviewer.ipython.org/github/tleeuwenburg/stml/blob/master/kaggle_ocean/Random%20Forest%20-%20Image%20Size%20Experiment.ipynb

This made some positive difference -- a 0.19 point improvement in the log loss score (as measured by the test/train data provided).
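For context, log loss is the competition metric, and it punishes confident wrong answers far more heavily than hedged ones -- which is why even modest score movements matter. A small worked example using scikit-learn (the probability values are made up for illustration):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = [0, 1, 2]

# Confident and correct: low log loss.
confident = np.array([[0.9, 0.05, 0.05],
                      [0.05, 0.9, 0.05],
                      [0.05, 0.05, 0.9]])

# Maximally hedged: every class gets probability 1/3.
hedged = np.full((3, 3), 1 / 3)

print(log_loss(y_true, confident))  # ~0.105
print(log_loss(y_true, hedged))     # ~1.099, i.e. ln(3)
```

Lower is better, so the 0.19 reduction above is a genuine step in the right direction.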

Let's check out the Kaggle leaderboard to see what this means in practical terms (http://www.kaggle.com/c/datasciencebowl/leaderboard/public). Firstly, it's worth noting that simply implementing the tutorial would have put us better than last -- much better. We could have scored around position 723, out of a total of 1049 submissions.

Our improved score would place us at 703, 20 places further up.

The difference between the leader and the second-place getter was ~0.015, so our improvement was more than ten times that gap! On the other hand, we still have a very long way to go before achieving a competitive result.

Since the competition is now closed, the winner has posted a comprehensive write-up of their techniques. We could go straight there, read through their findings, and take that away immediately. However, I'm more interested in exploring the investigative process than I am in re-using final results. I want to practise the process of discovery and exploration, not refine the final knowledge. The skill to be learned is not knowledge of which techniques eventually won (which is relevant) but the development of investigative skill (which is primary).

We should also take note of the number of entries the winners submitted. The top three competitors submitted 96 times, 150 times and 134 times respectively. They also used more compute-heavy techniques, which means they really sank some time into getting to their goal. The winning team was a collaboration between 7 people. I would expect each submission to have another one to five local experiments behind it, which suggests people are trying hundreds of local hypotheses over the course of a competition. Taken that way, a single-attempt improvement of 0.19 doesn't seem so bad!

The next several posts will continue the theme of investigative skill, developing a personal library of code, and methods for comparing techniques. We'll explore how to debug networks and how to cope with large data.

Until next time, happy coding! :)
-Tennessee Leeuwenburg