Monday, March 16, 2015

The basic model for this blog

How Things Will Proceed

Hi there! I've been busy since the last post. I've been thinking mainly about the following areas:

  • What differentiates this blog for my readers?
  • What is the best way, for me, of developing my knowledge and mastery of these techniques?
  • What pathway is also going to work for people who are either reading casually, or interested in working through problems at a similar pace?
  • Preparing examples and potential future blog posts...

[Image: a scam drug that's guaranteed to work instantly. Source: Wikipedia article on Placebo drugs; license: public domain.]

I think I have zeroed in on something that is workable. I believe in an integrative approach to learning -- namely, that incorporating information from multiple disparate areas results in insights which aren't possible when considering only a niche viewpoint. At the same time, I also believe it's essentially impossible to learn effectively from a ground-up, broad-based theoretical presentation of concepts. The path to broad knowledge is to start somewhere accessible, and then fold in additional elements from other areas.

I will, therefore, start where I already am: applying machine learning to the categorisation of images. At some point, other areas will be examined, such as language processing, game playing, search and prediction. However, for now, I'm going to "focus". That's in inverted commas (quotes) because it's still an incredibly broad area for study.

The starting point for most machine learning exercises is the data. I'm going to explain the data sets that you'll need to follow along. All of these should be readily downloadable, although some are very large. I would consider purchasing a dedicated external drive for this if you are short on space: disk space requirements may reach several hundred gigabytes, particularly if you want to store your intermediate results.
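Before downloading anything, it may be worth checking how much room you actually have. Here is a minimal sketch using only the Python standard library. The function names and the 300 GB figure are my own placeholders for "several hundred gigabytes", not official requirements of any of the data sets.

```python
import shutil


def free_gigabytes(path="."):
    """Return the free space, in gigabytes (10**9 bytes), on the drive holding `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_room_for(required_gb, path="."):
    """Return True if the drive holding `path` has at least `required_gb` free."""
    return free_gigabytes(path) >= required_gb


if __name__ == "__main__":
    # Hypothetical budget: adjust to the data sets you plan to keep.
    required_gb = 300
    print("Free space: %.1f GB" % free_gigabytes("."))
    print("Enough room for %d GB: %s" % (required_gb, has_room_for(required_gb)))
```

Point `path` at the external drive's mount point (or drive letter) to check that drive rather than the current directory.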

The data sets you will want are:
  • The MNIST database. It's included in this code repository, which we will also be referring to later when looking at deep learning / neural networks: https://github.com/mnielsen/neural-networks-and-deep-learning
  • The Kaggle "National Data Science Bowl" dataset: http://www.kaggle.com/c/datasciencebowl
  • The Kaggle "Diabetic Retinopathy" dataset: http://www.kaggle.com/c/diabetic-retinopathy-detection
  • Maybe also try a custom image-based data set of your own choosing. It's important to pick something which isn't already covered by existing tutorials, so that you are effectively forced to experiment with alternative techniques, but which can still be treated as a categorisation problem, so that similar approaches should be effective. You don't need to do this, but it's a fun idea. You could use an export of your photo album, the results of Google image searches or another dataset you create yourself. Put each class of images into its own subdirectory on disk.
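If you do build your own data set, the one-subdirectory-per-class layout is easy to sanity-check. The sketch below is only an illustration: the function name and the list of image extensions are my own assumptions, so adjust them to whatever files you actually have.

```python
from pathlib import Path

# Assumption: the file extensions that count as images in your data set.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def count_images_per_class(root):
    """Map each class subdirectory name under `root` to its number of image files."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # Ignore stray files sitting next to the class folders.
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts
```

Calling count_images_per_class on the top-level folder of your data set returns a dictionary keyed by class name. A class with a count of zero, or with far fewer images than the others, is worth fixing before any training starts.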
For downloading data, I recommend Firefox over Chrome, since it is much better at resuming interrupted downloads. Many of these files are large, and you may genuinely have trouble completing them in one go. Pay attention to your internet plan's download limits if you have only a basic plan.

The next post will cover the technology setup I am using, including my choice of programming language, libraries and hardware. Experienced Python developers will be able to go through this very quickly, but modern hardware does have limitations when applying machine learning algorithms, and it is useful to understand what those are at the outset.

Following that will be the first in a series of practical exercises aimed at building a basic ability to deploy common algorithms on image-based problems. We will start by applying multiple approaches to the MNIST dataset, which is the easiest starting point. The processing requirements are relatively low, as are the data volumes. Tutorials for solving this problem already exist online. This is particularly useful to start with, since it gives you ready-made benchmarks for comparison, and also allows easy cross-comparison of techniques.

I'd really like it if readers could reply with their own experiences along the way. Try downloading the data sets -- let me know how you go! I'll help if I can. I expect that things will get more interesting when we come to sharing the experimental results.

Happy coding,
-Tennessee