Wednesday, April 1, 2015

Quick Look: MNIST Data

Project Theme: STML-MNIST

I like to do a lot of different things at once. I like exploring things from both a theoretical perspective and a practical perspective in an interleaved way. I'm thinking that I will present the blog in similar fashion: multiple interleaved themes. I'll have practical projects which readers can choose to repeat themselves, and which will push me to explore new territory firsthand. At the same time, I will share general thoughts and observations. If I can manage it, I will release one of each per week, and each project post should take a reader no more than a week to comfortably work through. By that I mean there should be somewhere between 30 minutes and 3 hours of work required to repeat a project activity.

Readers, please let me know if this works for you.

Right, so carrying on... a couple of posts ago, prior to being derailed into a discussion on analytics, I ran through how to take a pretty robust algorithm called a Random Forest and apply it to a Kaggle competition data set. The training time was about 5 minutes, which is fast enough to walk through, but a bit slow for really rapid investigation. I'm going to turn now to another data set called the MNIST database. It contains images of handwritten digits from 0 to 9. It has been done to death "in the literature", but it's still pretty interesting if you haven't come across it before.

This dataset will let us apply the same algorithm we used before (Random Forest) and see how it goes on another dataset. Consider this a bit of a practice or warmup exercise for improving our ability to apply random forests and other techniques to images. I'm going to point you to another blog post for the actual walkthrough (why reinvent the wheel?), but I'll still supply my working notebook for those who want a leg up or a shortcut. But be warned: there are no real shortcuts to learning! (Why do you think I'm writing this blog, for example?)

Theme: Image Categorisation

The MNIST database is an incredibly common "first dataset", but that doesn't mean it's not significant. The great thing about this database is that it is actually very meaningful. The task definition is to categorise handwritten digits correctly. The problem is essentially solved, and the best technique for solving it is a neural network.

It is not, however, what is known as a 'toy problem'. Categorising digits is how automation of mail distribution based on postcodes works. Categorising handwriting in general is used for optical character recognition (OCR, digitisation of records) and security (captchas). Categorisation of images in general has very wide application and has many unsolved aspects.

Working on a problem with a "known good" solution allows us to focus on the software engineering and problem-solving aspects without the concern that we might be just wandering down a path that goes absolutely nowhere.

The scale of this task is also appropriate for the level of computing resources available to large numbers of people. 

The STML-MNIST series of posts will involve the application of multiple machine learning strategies to this particular task. This will allow us to cross-compare the various strategies in terms of performance, complexity and challenge. While I have worked through tutorials on the neural network solution before, I have not done a deep dive. This series of posts will be a deep dive.

A fantastic tutorial to work through, solving this very problem, is available at The linked tutorial uses a neural network approach, rather than a random forest approach. You could go down that path if you liked, or follow along with my random forest approach.

Whether you choose to replicate the result using a neural net or a random forest, I would recommend completing that on your own before proceeding much further in this post. For the moment, just focus on getting your own code working, rather than on understanding each step in depth. If you have replicated my setup, you can use my IPython Notebook as a shortcut to a working version. The notebook can be read online at or accessed via the git repository

... time passes ...

Okay, welcome back! I assume that it's been a couple of days since you diverted into the linked tutorial, and that you managed to successfully train a neural network to solve the problem to a high standard of quality. Great work! The neural net really solves the problem amazingly well, whereas my random forest approach reached about 90% success (still not bad, given there was no attempt to adapt the algorithm to the problem). I deliberately chose a set of RF parameters similar to those chosen for the Kaggle ocean dataset.

The MNIST dataset was faster to process: it took around a minute to train on, rather than 5 minutes. This makes it slightly more amenable to a tighter experimentation loop, which is great. Later on we can get into the mind-set of batch-mode experiments, setting off dozens at once to run overnight or while we do something else for a while.
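
As a rough idea of what a batch of experiments might look like, here is a minimal sketch using only the Python standard library. Everything named here is an assumption for illustration: `run_grid`, `placeholder_train_and_score`, and the parameter names `n_trees` and `max_depth` are hypothetical and are not taken from my notebook. The placeholder trains nothing and returns no score; it just burns a little CPU time so the timings differ.

```python
import itertools
import time

def run_grid(train_and_score, param_grid):
    """Call train_and_score once per parameter combination, timing each run."""
    names = list(param_grid)
    results = []
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        start = time.perf_counter()
        score = train_and_score(**params)
        elapsed = time.perf_counter() - start
        results.append({**params, "score": score, "seconds": elapsed})
    return results

def placeholder_train_and_score(n_trees, max_depth):
    """Hypothetical stand-in: trains nothing and returns no score.

    It only burns CPU time roughly in proportion to n_trees so that the
    timings differ. Swap in real training and evaluation code here.
    """
    total = 0
    for i in range(n_trees * 1000):
        total += i % (max_depth + 1)
    return None

# Hypothetical parameter names and values, chosen only for illustration.
param_grid = {"n_trees": [10, 50], "max_depth": [5, 10]}
results = run_grid(placeholder_train_and_score, param_grid)

# Slowest runs first, to see which settings dominate the experiment loop.
for row in sorted(results, key=lambda r: r["seconds"], reverse=True):
    print(row)
```

Keeping the timing next to the score for every run makes it easier to decide which settings are cheap enough for interactive use and which are better left for an overnight batch.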

What we've found so far shows two things. First, a Random Forest approach is mind-bogglingly flexible; it's probably not going to get beaten as a general-purpose, proof-of-concept algorithm. Second, it appears that it only goes so far, and that the next level is probably a neural net.

Each new problem will require its own investigative process to arrive at a good solution, and it is that investigative process that is of interest here.

I propose the following process for exploring this dataset further:
  • Comparisons of alternative neural network designs
  • Visualisations of intermediate results
  • Comparison against alternative machine learning methodologies
Techniques which could be applied here include:
  • Random forest
  • Decision tree optimisation
  • Comparison against reference images (error/energy minimisation)
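
To give a flavour of the last item in that list, here is a minimal sketch of comparison against reference images, assuming one simple reading of the idea: average the training images for each label into a single reference image, then assign a new image to the label whose reference gives the lowest sum of squared pixel differences. The function names and the tiny 3x3 toy images are made up for illustration; this is not MNIST data and says nothing about how well the approach does on real digits.

```python
import random

def mean_image(images):
    """Average a list of equally sized flat pixel lists into one reference image."""
    n = len(images)
    return [sum(pixels) / n for pixels in zip(*images)]

def squared_error(a, b):
    """Sum of squared pixel differences: the error (energy) to minimise."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit_references(images, labels):
    """Build one reference image per label by averaging its training examples."""
    by_label = {}
    for image, label in zip(images, labels):
        by_label.setdefault(label, []).append(image)
    return {label: mean_image(group) for label, group in by_label.items()}

def classify(image, references):
    """Return the label whose reference image has the lowest squared error."""
    return min(references, key=lambda label: squared_error(image, references[label]))

# Toy stand-in data (NOT MNIST): two 3x3 "digits", flattened row by row.
ZERO = [1, 1, 1,
        1, 0, 1,
        1, 1, 1]
ONE = [0, 1, 0,
       0, 1, 0,
       0, 1, 0]

def noisy(image, rng, amount=0.2):
    """Return a copy of the image with uniform noise added to every pixel."""
    return [p + rng.uniform(-amount, amount) for p in image]

rng = random.Random(0)
train_images = [noisy(ZERO, rng) for _ in range(20)] + [noisy(ONE, rng) for _ in range(20)]
train_labels = [0] * 20 + [1] * 20
references = fit_references(train_images, train_labels)

test_images = [noisy(ZERO, rng) for _ in range(10)] + [noisy(ONE, rng) for _ in range(10)]
test_labels = [0] * 10 + [1] * 10
predictions = [classify(img, references) for img in test_images]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(f"accuracy on toy data: {accuracy:.2f}")
```

Real handwritten digits vary far more than these toy patterns, so a single averaged reference per digit is best thought of as a baseline to compare other techniques against.
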
Now, I would like each of my posts to be pretty easy to consume. For that reason, I'm not going to do all of that right now. Also, I don't think I'm a fast enough worker to get all the way through that process in just one week. Rather, following the referenced tutorial is the first week's activity for this project theme. I will undertake to apply one new technique to the dataset, on a roughly weekly schedule, and report back...

In the meantime, happy coding!