Project Theme: STML-MNIST
Right, so carrying on... a couple of posts ago, prior to being derailed into a discussion on analytics, I ran through how to take a pretty robust algorithm called a Random Forest and apply it to a Kaggle competition data set. The training time was about 5 minutes, which is fast enough to walk through, but a bit slow for really rapid investigation. I'm going to turn now to another data set called the MNIST database. It contains images of handwritten digits from 0 to 9. It has been done to death "in the literature", but it's still pretty interesting if you haven't come across it before.
This dataset will let us apply the same algorithm we used before (Random Forest) and see how it goes on different data. Consider this a bit of a practice or warm-up exercise for improving our ability to apply random forests and other techniques to images. I'm going to point you to another blog post for the actual walkthrough (why re-invent the wheel?), but I'll still supply my working notebook for those who want a leg up or a shortcut. But be warned: there are no real shortcuts to learning! (Why do you think I'm writing this blog, for example?)
Theme: Image Categorisation
Whether you choose to replicate the result using a neural net or a random forest, I would recommend completing that on your own before proceeding much further in this post. For the moment, just focus on getting your own code working, rather than understanding each step in depth. If you have replicated my setup, you can use my IPython Notebook as a shortcut to a working version. The notebook can be read online at http://nbviewer.ipython.org/github/tleeuwenburg/stml/blob/master/mnist/MNIST%20Random%20Forest.ipynb or accessed via the git repository https://github.com/tleeuwenburg/stml
The MNIST dataset was faster to process: it took around a minute to train on, rather than 5 minutes. This makes it slightly more amenable to a tighter experimentation loop, which is great. Later on we can get into the mindset of batch-mode experiments, setting off dozens at once to run overnight or while we do something else for a while.
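To give a feel for what a tight experimentation loop looks like, here is a minimal sketch rather than the contents of my notebook. It assumes scikit-learn is installed, and it uses the small 8x8 digits set bundled with scikit-learn as a stand-in for the full MNIST data so that it runs in seconds. The helper name time_forest and the tree counts are made up for illustration.

```python
import time

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def time_forest(n_estimators, X_train, y_train, X_test, y_test):
    """Train one forest and return (seconds taken to fit, test accuracy)."""
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    return elapsed, model.score(X_test, y_test)


# Assumption: scikit-learn's bundled 8x8 digits stand in for the real MNIST images.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# A very small "batch" of experiments: same data, different forest sizes.
results = {}
for n in (10, 50, 100):
    results[n] = time_forest(n, X_train, y_train, X_test, y_test)
    seconds, accuracy = results[n]
    print(f"{n:>4} trees: {seconds:.2f}s to train, accuracy {accuracy:.3f}")
```

The same loop scales up to the overnight batch runs mentioned above: swap in the full dataset, widen the list of settings, and write the results to a file instead of printing them.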
Each new problem will require its own investigative process to arrive at a good solution, and it is that investigative process that is of interest here.
- Comparisons of alternative neural network designs
- Visualisations of intermediate results
- Comparison against alternative machine learning methodologies
  - Random forest
  - Decision tree optimisation
- Comparison against reference images (error/energy minimisation)
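As a taste of the last item, here is one possible reading of "comparison against reference images", sketched under my own assumptions: build one reference image per digit by averaging the training images of that digit, then label a new image with whichever reference gives the smallest sum of squared pixel differences. The function names and the toy 3x3 "images" are invented for illustration and are not real MNIST data.

```python
def squared_error(image, reference):
    """Sum of squared pixel differences between two equally sized images."""
    return sum((a - b) ** 2 for a, b in zip(image, reference))


def mean_image(images):
    """Pixel-wise mean of a list of flattened images."""
    count = len(images)
    return [sum(pixels) / count for pixels in zip(*images)]


def build_references(images, labels):
    """Return {label: mean image}, i.e. one reference image per class."""
    grouped = {}
    for image, label in zip(images, labels):
        grouped.setdefault(label, []).append(image)
    return {label: mean_image(group) for label, group in grouped.items()}


def classify(image, references):
    """Pick the label whose reference image gives the smallest error."""
    return min(references, key=lambda label: squared_error(image, references[label]))


# Toy flattened 3x3 images, made up for illustration:
# a vertical bar standing in for "1" and a ring standing in for "0".
train_images = [
    [0, 1, 0, 0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0.8, 0],
    [1, 1, 1, 1, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1, 0.9, 1, 1],
]
train_labels = [1, 1, 0, 0]

references = build_references(train_images, train_labels)
noisy_one = [0.1, 0.9, 0, 0, 1, 0.1, 0, 1, 0]
print(classify(noisy_one, references))
```

This is about the simplest error-minimising classifier there is, which makes it a handy baseline to measure a random forest or neural network against.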