Wednesday, March 25, 2015

My Data Science / Machine Learning Technology Stack

The Setup

Let's talk briefly about my technology stack. I work in Python, for a number of reasons. Focusing allows me to learn more about the nuances of a single language. Many of the techniques here are well-supported in other languages, but I trip over my feet more when using them. I'm better off going with my strengths, even if that means forgoing some great tools.

So first up, here's the list:
  • A 2011 Macbook Air laptop for my main computing system
  • Python 2.7, using pandas, numpy, sqlite3 and theano (not a full list)
  • IPython Notebook for putting together walkthroughs and experiments
  • Access to a machine with a CUDA-enabled NVidia GPU for performance work
  • An external two-terabyte hard drive for much needed disk space
  • A copy of several significant machine learning databases, including
    • MNIST data
    • Two kaggle competition datasets (ocean plankton and diabetic retinopathy)
    • The Enron mail archive
Side note -- If you want to get your local environment set up, a great goal is to follow along with the practical posts on this blog. That will exercise your setup using most of the tools and libraries you will be interacting with.

So far, this has proved to be a tolerably effective setup. By far the two hardest aspects of the problem are leveraging the GPU and getting access to disk space at a reasonable speed. GPU-based libraries are quite disconnected from general-purpose libraries, and require a different approach. There are compatibility elements (such as conversion to and from numpy arrays), but essentially the GPU is a different beast from the CPU and can't simply be abstracted away ... yet.
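Much of that compatibility work comes down to dtype discipline: theano's GPU back-end (like most at the time) computes only in float32, while numpy defaults to float64, so converting up front avoids silent CPU fallbacks. A minimal sketch of the conversion step, using plain numpy:

```python
import numpy as np

# GPU back-ends generally want contiguous float32 arrays;
# numpy's default dtype for random data is float64.
x = np.random.rand(1000, 784)                            # dtype: float64
x_gpu_ready = np.ascontiguousarray(x, dtype=np.float32)  # GPU-friendly copy

print(x.dtype, x_gpu_ready.dtype)  # float64 float32
```

Doing this once at load time keeps the GPU library from having to convert on every call.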

The Macbook Air is limited in its amount of RAM and on-board storage. That said, almost any laptop is going to be very limited compared to almost any desktop for the purposes of machine learning. Basic algorithms such as linear regression are easy enough that the laptop is no problem; intensive algorithms such as deep learning are hard enough that you will want to farm the processing out to a dedicated machine. You'll basically want a gaming setup.

If I'm lucky, I may be upgrading to a 13" 2015 Macbook Pro shortly. While this will ease the constraints, the same fundamental dynamics will remain unchanged: it is still not sufficient for deep learning purposes, and won't really change the balance all that much. Unless you have a beast of a laptop with an on-board NVidia GPU, you'll be looking at a multi-machine setup. Note - the 15" MBP does have an NVidia GPU, but I don't want the size.

Working on Multiple Machines

Some people will have easy access to a higher-end machine, some won't. Some people will be dismayed not to be able to work easily on a single machine; others will be fine with it. Unlike many, I won't be recommending AWS or another cloud provider. They are fine if the charges represent small change to you, but because of the disk and GPU requirements of these machines, you're not going to be looking at entry-level prices. You would be much better off spending $200 on the NVidia Jetson TK1 and using that as a modular processing engine for your laptop.

For small data sets like MNIST, your laptop will be totally fine. For medium-size sets, like the ocean plankton, you will be completely unable to complete the tasks on a CPU in a tolerable time-frame.

For large data sets, you may well be unable to fit the data onto the onboard storage of a single machine, especially a laptop. For most of us, that will mean using an external USB drive.
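One way to make an external drive workable as more than dumb storage is numpy's memmap, which addresses an on-disk array as if it were in memory, pulling in only the slices you touch. A sketch (the path here is a stand-in for a location on your external drive):

```python
import os
import tempfile
import numpy as np

# Stand-in path; in practice, point this at your external drive,
# e.g. /Volumes/External/features.dat
path = os.path.join(tempfile.gettempdir(), "features.dat")

n_rows, n_cols = 10000, 784
fp = np.memmap(path, dtype=np.float32, mode="w+", shape=(n_rows, n_cols))
fp[:100] = 1.0   # write one batch; it goes to disk, not resident RAM
fp.flush()

# Re-open read-only later, e.g. from a separate training script
ro = np.memmap(path, dtype=np.float32, mode="r", shape=(n_rows, n_cols))
print(ro[:100].mean())  # 1.0
```

Only the rows you actually slice are paged in, so the working set stays far smaller than the file.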

There are many ways to effectively utilise multiple machines. Some are based on making those machines available as "slaves" to a master or controller system; others are based on remote code execution, perhaps through shell scripts. Future blog posts will cover the how and why of this. For the moment, the important thing to take away is that you'll either be working on small data sets or on multiple machines.
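As a taste of the remote-execution flavour, here's a hypothetical helper that farms a command out over ssh from Python. The host name and training script are invented for illustration; the dry_run flag lets you inspect the command without a network round-trip.

```python
import subprocess

def run_remote(host, command, dry_run=False):
    """Run a shell command on a remote machine over ssh.

    With dry_run=True, return the ssh invocation instead of executing it.
    """
    ssh_cmd = ["ssh", host, command]
    if dry_run:
        return ssh_cmd
    return subprocess.run(ssh_cmd, capture_output=True, text=True)

# Hypothetical host and script names, shown as a dry run:
print(run_remote("gpu-box", "python train.py --epochs 10", dry_run=True))
```

The same pattern scales up to dispatching a batch of experiments to several machines and collecting their result files afterwards.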

Why NVidia? Isn't that favouritism?

Short version: It's what the library ecosystem supports best right now. NVidia GPUs support CUDA, which is the only GPU controller language the best Python libraries support properly. If you want to work on machine learning problems, rather than on adapting libraries to new GPU platforms, you'll need to go down the proprietary path. In a couple of years, I expect that the software world will have caught up to more architectures. If you're working on problems which can be reasonably solved on a CPU, then no problem. If you're an amazing software wizard who can comfortably modify the back-end architecture, then you're probably reading the wrong blog anyway.

Tuesday, March 24, 2015

First steps to Kaggling: Making Any Improvement At All

Practical Objective:

Update: I realised it might have been better to precede this post with an outline of how to get set up technically for running the examples. I'll post that one tomorrow! Experienced developers should have no trouble getting by, and for everyone else it won't be a long wait before a how-to on getting set up.

The first goal is to try to make a single improvement, of any degree, to the implementation provided in the Kaggle tutorial for the "ocean" data set (see below). This will exercise a number of key skills:
  • Downloading and working with data
  • Preparing and uploading submissions to Kaggle
  • Working with images in memory
  • Applying the Random Forest technique to images
  • Analysing and improving the performance of an implementation

Discussion and Background

Machine learning and data science are replete with articles which talk about the power, elegance and ease with which tools can be applied to complex new problems. Domain knowledge can be discarded in a single afternoon, and you can just point your GPU at a neural network and go home early.

Well, actually, it's darned fiddly work, and not everyone finds it so easy. For an expert with a well-equipped computer lab, perhaps a few fellow hackers and plenty of spare time, it probably isn't such a stretch to come up with a very effective response to a Kaggle competition in the time frame of about a week. For those of us working at home on laptops and a few stolen hours late in the evening or while the family is out doing other things, it's more challenging.

This blog is about that journey. It's about finding it actually hard to download the data because for some reason the download rate from the Kaggle servers to Melbourne, Australia is woeful, and so it actually takes a week to get the files onto a USB drive, which by the way only supports USB 2.0 transfer rates and isn't really a good choice for a processing cache...

So this post will be humbly dedicated to Making Any Difference At All. Many Kaggle competitions are accompanied by a simple tutorial that will get you going. We'll use the (now completed) competition on identifying plankton types. I'll refer to it as the "ocean" competition for the sake of brevity.

Back to the Problem

I would recommend working in the IPython Notebook environment. That's what I'll be doing for the examples in this blog, and it's a useful way to organise your work. You can take a short-cut and clone the repository for this blog.

Next, download the ocean data, and grab a copy of the tutorial code:

It's a nice tutorial, although a couple of key steps are omitted, such as how to prepare your submission files. It implements a technique called the random forest. This is a highly robust way to get reasonably good solutions fast. And it's fast in every way: it works well enough on a CPU, it doesn't need a lot of wall clock time even on a modest-spec laptop, it doesn't take much time to code up, and it doesn't require a lot of specialist knowledge.
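To make the recipe concrete, here's a minimal random forest run in scikit-learn on its built-in 8x8 digits set, standing in for the plankton images. The shape of the workflow (flatten each image to a row of pixels, fit, score) is the same as the tutorial's, though the names and parameters here are my own sketch:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Small built-in image set: 1797 images of 8x8 handwritten digits.
digits = load_digits()
X = digits.images.reshape(len(digits.images), -1)  # flatten images to pixel rows

X_train, X_test, y_train, y_test = train_test_split(
    X, digits.target, test_size=0.25, random_state=0)

# 100 trees, using all CPU cores; finishes in seconds on a laptop.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```

Even with no tuning at all, this sort of run lands well above 90% accuracy on the digits set, which is what makes random forests such a good first baseline.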

There are a lot of tutorials on how to use random forests, so I won't repeat the basics here; the one linked above is a good example. You can work through it in advance, or just try to get the notebooks I've shared working in your environment.

This post is about how to take the next step -- going past the tutorial into your own work. First, just spend some time thinking about the problem. Here are a few key things which took me by surprise when I reflected on it:

  • The random forest is actually using every pixel as a potential decision point. Every pixel is a thresholded yes/no point on a (forest of) decision tree(s).
  • A decision tree can be thought of as a series of logical questions; it can also be thought of as a series of segmentations of the data into two classes. I've only seen binary decision trees; I suppose there must be theoretical decision trees which use a multi-way split, but I'm just going to leave that thought hanging.
  • The algorithm doesn't distinguish between image content (intensity values) and semantic content (e.g. aspect ratio, image size)
  • Machine learning is all about the data, not the algorithm. While there is some intricacy, you can basically just point the algorithm at the data, specify your basic configuration parameters, and let it rip. So far, I've found much more work to be in the data handling rather than the algorithm tuning. I suspect that while that's true now, eventually the data loading will become more routine and my library files more sophisticated. For new starters, however, expect there to be a lot of time working with data and manually identifying features.
  • It's entirely possible to run your machine out of resources and/or have it just sit there and compute for longer than my patience allows. Early stopping and the use of simple algorithms let you explore the data set much more effectively than training the algorithms until they plateau.
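The first bullet is easy to verify for yourself: scikit-learn exposes the pixel index and threshold at every node of a fitted tree. A small sketch on the digits set (kept shallow so the structure is inspectable):

```python
from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X = digits.images.reshape(len(digits.images), -1)

# A shallow tree: every internal node is a yes/no threshold on one pixel.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, digits.target)

tree = clf.tree_
root_pixel, root_threshold = tree.feature[0], tree.threshold[0]
print("root split: pixel %d <= %.1f ?" % (root_pixel, root_threshold))
```

Every question the forest can ever ask is of exactly this "pixel i <= t" form, which is why feeding it raw pixels works at all.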

I have created a version of the tutorial which includes a lot of comments and additional discussion of what's going on. If there's something I haven't understood, I've also made that clear. This blog is *not* a master-to-student blog, it's just one hacker to another. I would suggest reading through the notebook in order to learn more about what's going on at each point in the tutorial.

You don't need to run it yourself; I have made a copy available at:

This first notebook is simply a cut-down and annotated version of the original tutorial. It contains my comments and notes on what the original author is doing.

The next step is to try to make any improvement at all. I would suggest spending a little time considering what additional steps you would like to try. If you have this running in a notebook on your own machine, you could try varying some of the parameters in the existing algorithm, such as the image size and the number of estimators. Increasing both of these could well lead to some improvement. I haven't tried either yet, so please post back your results if you do.

I tried including the size of the original image as an additional input feature, using a simple count of the number of pixels. An interesting feature of this data set is that the image size is matched to the actual size of the represented organism; normally, you'd have to account for the fact that distance from the camera affects the size of the object within the image. Only three lines changed -- you should try to find them within create_X_and_Y().
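I won't reproduce the tutorial's code here, but the shape of the change looks something like this sketch. build_features is a stand-in name for the tutorial's create_X_and_Y, and np.resize is a crude stand-in for its proper image rescaling:

```python
import numpy as np

def build_features(images, resized_size=25):
    """Flatten resized pixels and append the original pixel count as
    one extra feature column."""
    rows = []
    for img in images:
        pixel_count = img.shape[0] * img.shape[1]  # size of the ORIGINAL image
        resized = np.resize(img, (resized_size, resized_size)).ravel()
        rows.append(np.hstack([resized, pixel_count]))
    return np.array(rows)

# Two dummy "plankton" images of different original sizes
imgs = [np.ones((40, 30)), np.ones((10, 10))]
X = build_features(imgs)
print(X.shape)  # (2, 626): 25*25 pixels + 1 size feature
```

The forest then treats the size column like any other thresholdable feature, which is all it needs to exploit the size/organism correlation.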

A web-visible copy is available at:

This made some positive difference -- a 0.19 point improvement in the log loss score (as measured by the test/train data provided).
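For context on the metric: log loss punishes confidently wrong probabilities far harder than merely hedged ones, so a 0.19 shift reflects meaningfully better-calibrated predictions, not just a few extra correct labels. A toy comparison using scikit-learn's implementation:

```python
from sklearn.metrics import log_loss

# Three samples, two classes; each row is [P(class 0), P(class 1)].
y_true = [0, 1, 1]
confident = [[0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]  # right, and sure of it
hedged    = [[0.6, 0.4], [0.4, 0.6], [0.4, 0.6]]  # right, but wishy-washy

print(log_loss(y_true, confident))  # ~0.105
print(log_loss(y_true, hedged))     # ~0.511
```

Both classifiers get every label right, yet the hedged one scores roughly five times worse, which is the behaviour the competition metric is built around.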

Let's check out the Kaggle leaderboard to see what this means in practical terms. Firstly, it's worth noting that simply implementing the tutorial would have put us better than last -- much better. We could have scored around position 723, out of a total of 1049 submissions.

Our improved score would place us at 703, 20 places further up.

The difference between the leader and the second-place getter was ~0.015. Our improvement was considerably bigger than that! On the other hand, we still have a very long way to go before achieving a competitive result.

Since the competition is now closed, the winner has posted a comprehensive write-up of their techniques. We could go straight there, read through their findings, and take them away immediately. However, I'm more interested in exploring the investigative process than in re-using final results. I want to practise the process of discovery and exploration, not refine the final knowledge. The skill to be learned is not a knowledge of which techniques eventually won (which is relevant) but the development of investigative skill (which is primary).

We should also take note of the number of entries the winners submitted. The top three competitors submitted 96 times, 150 times and 134 times respectively. They also used more compute-heavy techniques, which means they really sank some time into getting to their goal. The winning team was a collaboration of 7 people. I would expect each submission to have another 1-5 local experiments tried beforehand. This means that people are trying perhaps 300-500 local hypotheses over the course of a competition. Taken that way, a single-attempt improvement of 0.19 doesn't seem so bad!

The next several posts will continue the theme of investigative skill, developing a personal library of code, and methods for comparing techniques. We'll explore how to debug networks and how to cope with large data.

Until next time, happy coding! :)
-Tennessee Leeuwenburg

Don't Click Me Results: Even More Linkbait

Okay, so I should have seen this coming. My last post, "Don't click me if you're human", generated over double my ordinary rate of traffic. My sincere thanks go out to those 22 souls who left comments so that I could discount them from the accounting.

I find it a bit hard to get the data I want at a glance from the blogger interface. I categorise my blog's traffic into the following categories:

  1. Number of visits while dormant (not really posting much)
  2. Number of visits on days I post
  3. Number of visits the day after that
  4. Number of visits during non-posting days, but not dormant

What I'd really like to get at is: how many humans do I actually provide some value to, as a proportion of my visits? I'm not really an analytics expert...

Feedback comments are relatively rare. That's fine, but I do still hope that people are enjoying the posts.

The number of visits when dormant seems to be about 20-30 per day. I'm assuming that's robots. However, it's logical to assume that, like with humans, there will be more visits on posting days than on non-posting days.

My first two posts on this blog attracted something like 800 page views. Yay, great! I'm guessing that's maybe 400 actual people, and maybe 50 of those read the whole post, and maybe 35 of them really got something out of it. That's pretty good, it's a rare day that I manage to actually help 35 people in the rest of my life.

My "don't click me" post got 2100 page views. *facepalm*.

Now, presumably all (2100 - 800 = 1300) of those extra views are actual people.

I don't really know what it all means, except that the observer effect is very real, and linkbait works like a charm.

So, stay tuned for my next amazing post: "I'm going to waste all your time and you'll hate me for it".

Until next time, happy coding!

Friday, March 20, 2015

If you are a human: don't read me or click me. I'm trying to estimate robot traffic.

If you are a human, and you got here by clicking a link, please leave a comment below. I'm attempting to estimate the amount of non-human traffic which is triggered when I make a blog post. Please help by either not clicking through to this article, or by leaving a comment if you did.

Monday, March 16, 2015

The basic model for this blog

How Things Will Proceed

Hi there! I've been busy since the last post. I've been thinking mainly about the following areas:

[Image: a scam drug that's "guaranteed to work instantly". Source: Wikipedia article on Placebo, public domain]
  • What differentiates this blog for my readers?
  • What is the best way, for me, of developing my knowledge and mastery of these techniques?
  • What pathway is also going to work for people who are either reading casually, or interested in working through problems at a similar pace?
  • Preparing examples and potential future blog posts...
I think I have zeroed in on something that is workable. I believe in an integrative approach to learning -- namely, that incorporating information from multiple disparate areas results in insights and information which aren't possible when considering only a niche viewpoint. At the same time, I also believe it's essentially impossible to learn effectively from a ground-up, broad-based theoretical presentation of concepts. The path to broad knowledge is to start somewhere accessible, and then fold in additional elements from other areas.

I will, therefore, start where I already am: applying machine learning for categorisation of images. At some point, other areas will be examined, such as language processing, game playing, search and prediction. However, for now, I'm going to "focus". That's in inverted commas (quotes) because it's still an incredibly broad area for study. 

The starting point for most machine learning exercises is the data. I'm going to explain the data sets that you'll need to follow along. All of these should be readily downloadable, although some are very large. I would consider purchasing a dedicated external drive for this if you have the space: disk space requirements may reach several hundred gigabytes, particularly if you want to store your intermediate results.

The data sets you will want are:
  • The MNIST database. It's included in this code repository, which we will also be referring to later when looking at deep learning / neural networks:
  • The Kaggle "National Data Science Bowl" dataset:
  • The Kaggle "Diabetic Retinopathy" dataset:
  • Maybe also try a custom image-based data set of your own choosing. It's important to pick something which isn't already covered by existing tutorials, so that you are effectively forced into the process of experimentation with alternative techniques, but which can be considered a categorisation problem so that similar approaches should be effective. You don't need to do this, but it's a fun idea. You could use an export of your photo album, the results of google image searches or another dataset you create yourself. Put each class of images into its own subdirectory on disk.
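For that last, custom data set, the one-subdirectory-per-class layout can be turned into labelled file lists with nothing but the standard library. Image decoding itself (via PIL or similar) is deliberately left out of this sketch so it stays dependency-free:

```python
import os
import tempfile

def list_labelled_images(root):
    """Walk a dataset/<class_name>/<image files> layout and return
    (filepath, label) pairs, with the subdirectory name as the label."""
    pairs = []
    for label in sorted(os.listdir(root)):
        class_dir = os.path.join(root, label)
        if not os.path.isdir(class_dir):
            continue
        for fname in sorted(os.listdir(class_dir)):
            pairs.append((os.path.join(class_dir, fname), label))
    return pairs

# Tiny demo layout with two classes and one file each
root = tempfile.mkdtemp()
for cls in ("cats", "dogs"):
    os.makedirs(os.path.join(root, cls))
    open(os.path.join(root, cls, "img0.jpg"), "w").close()
print(list_labelled_images(root))
```

From there, a loading step can read each file into a pixel array and the label list becomes your target vector.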
For downloading data, I recommend Firefox over Chrome, since it is much more capable at resuming interrupted downloads. Many of these files are large, and you may genuinely have trouble. Pay attention to your internet plan's download limits if you have only a basic plan.

The next post will cover the technology setup I am using, including my choice of programming language, libraries and hardware. Experienced Python developers will be able to go through this very fast, but modern hardware does have limitations when applying machine learning algorithms, and it is useful to understand what those are at the outset.

Following that will be the first in a series of practical exercises aimed at building a basic ability to deploy common algorithms on image-based problems. We will start by applying multiple approaches to the MNIST dataset, which is the easiest starting point. The processing requirements are relatively low, as are the data volumes, and tutorials for solving this problem already exist online. This makes it a particularly useful starting point, since it gives you ready-made benchmarks for comparison, and also allows easy cross-comparison of techniques.

I'd really like it if readers could reply with their own experiences along the way. Try downloading the data sets -- let me know how you go! I'll help if I can. I expect that things will get more interesting when we come to sharing the experimental results.

Happy coding,

Tuesday, March 10, 2015

Struggling Through Machine Learning 1: Declaration of Intent

A Year of Learning

I have been struggling with gaining proper control over machine learning techniques. I've read online articles, reproduced walkthrough results, even done some online coursework, but I've continued to struggle when adapting these techniques to new problems.

I feel a little tiny bit hopeless.

This isn't a particularly unfamiliar feeling for me -- this is the great monster to be overcome whenever learning something new and challenging. It's just a feeling to remind me that it's worth keeping going, to prove once again that I've still got it. This time around, I reckon it's going to take a little longer than normal.

For that reason I've decided that I need a specific schedule, a set of clear objectives and something at stake. Let's work backwards. There's little objectively at stake here. I'm not really a gambling man, I don't need a new job, and this is really a personal exercise. What's really at stake is just my own sense of achievement. I thought I'd try to raise those stakes through this blog, by committing my progress to written form and making it public.

Let's talk about the objectives. In no particular order...
  • Be able to competently apply deep learning techniques to new image-based problems
  • Give a presentation at PyCon AU 2015 on the topic of deep learning in Python
  • Write a blog post per week on this topic for a year, or until I feel I have achieved the other objectives on the list
  • Enter a kaggle competition with a result in the top 50%
  • Enter a kaggle competition with a result in the top 25%
  • Enter a kaggle competition with a result in the top 10%
  • Write up all my code, results and learnings on this blog for the benefit of others
  • Publish my code into a public repository and have at least one other person actually make use of it
If I could do all of those, I'd really feel like I did something worthwhile.

I'd really like to hear from anyone who might be reading this. Is there anything in particular I can expand on? What about this story is interesting for you?