Friday, July 17, 2015

Shareable Datasets -- A Functional Design

So I was inspired by reading a blog post on "Truly Open Data" by Bill Mills:

We engaged in a bit of back-and-forth, and he encouraged me to set my ideas out a bit more clearly. I got somewhat inspired, and created an imaginary tool called "odit" -- the Open Data Integration Tool. I mocked out a functional design, and wrote a user guide outlining the intended functionality.

Here's the result:

I'd appreciate comments and feedback. Is this something you'd consider valuable enough to use? Is it worth building?

Here's a teaser of what you'll find over at readthedocs....

Command Summary

odit fetch -- Create a new project
odit share -- Share the dataset online
odit append -- Append to a local dataset
odit update -- Revise the content of a dataset
odit set-licence -- Specify the licence for a dataset

Wednesday, July 8, 2015

Can you make a BitTorrent 'channel' for just some files?

I have a problem I'd like to solve with BitTorrent -- I think. BT is great for two things: moving large files around quickly, and distributing storage capacity. Those are two things which data scientists badly, badly need. The only real alternative is for large data storage to be bankrolled by a large company acting in the public good. That happens sometimes, but there's something satisfying about the concept of a truly public infrastructure.

The downside, for me, with BT, is twofold. One is the 'negative branding' -- the association (merited or otherwise) with various kinds of dubious content including piracy. The second is just providing a high quality set of data which isn't swamped by irrelevant content. You don't want to get a whole pile of television shows when you're actually trying to get some engineering-quality data.

There's a third niggle, which is how to handle realtime or streaming data.

So here's my question for the internet: how do I create a channel for scientific data using BitTorrent? I'm happy to get my hands dirty -- I'm a software dev after all.

Sunday, July 5, 2015

Quick example: A heat map of pedestrian counts

It might not look like it, but I have been super-busy lately working on data science and machine learning tech. I've been going on a bit of a vision quest trying to wrap my head around the whole thing. You know what -- I'm pretty lost. I've learned a lot of things, but I can also see how much deeper the rabbit-hole goes.

While that will bear fruit in time, I've decided to add a series of 'shorts' to the blog. Things which I can genuinely do more easily, and never mind if that risks being too simple to be of wider interest. The point here isn't to blaze a trail, but rather to keep up my exercise.

The plot above was generated by this code (link goes to a notebook).

The City of Melbourne provides quite fine-grained pedestrian count information for major locations in my home town. I really applaud this effort. I'm very excited about anything which reflects the physical world into the digital. This data updates in near-real-time as well, which is just wonderful.

Down the road I hope to use this to do some interesting prediction software, but for now I just want to explore the data. I'm also learning how to plot things.

Python has a number of libraries for this. My favourite in terms of API design is without doubt Seaborn, but it's slowwww. For speed, I recommend Bokeh, but I find it much clumsier to use. I'm also not a fan of its interactive javascript tools, because I think it's too easy to accidentally scroll away from the data entirely or otherwise misnavigate the chart. Please share your views on plotting, I'd really like to build up some more knowledge about the range of opinions on this tool.
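To make the plotting step concrete, here's a minimal sketch of shaping hourly counts into a matrix that a heat map function can render. The sensor names and counts are invented for illustration; the real data comes from the City of Melbourne feed.

```python
import numpy as np
import pandas as pd

# Hypothetical pedestrian counts: 24 hours x 3 sensors.
rng = np.random.RandomState(0)
records = pd.DataFrame({
    "hour": np.tile(np.arange(24), 3),
    "sensor": np.repeat(["Town Hall", "Flinders St", "Southbank"], 24),
    "count": rng.randint(0, 500, size=72),
})

# Reshape to a matrix: rows are hours, columns are sensors.
pivot = records.pivot(index="hour", columns="sensor", values="count")
print(pivot.shape)  # (24, 3)

# With seaborn installed, rendering is then one line:
#   import seaborn as sns; sns.heatmap(pivot)
```

The pivot step is the real work; after that, any of the plotting libraries mentioned above can draw the matrix.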

Thursday, June 25, 2015

Setup for PyCon AU Tutorial

This is an attempt to provide attendees of PyCon AU 2015 with a guide to getting set up ahead of the tutorial. Getting set up in advance will help greatly in getting the most out of the tutorial. It will let attendees focus on the slides and the problem examples rather than on struggling through an installation process.

What it's like installing software during a tutorial session
There will be USB keys on the day with the data sets and some of the software libraries included, in case the network breaks. However, things will go more smoothly for everyone if some of these hurdles can be cleared out of the way in advance.

The Software You Will Need

  1. Python 3.4, with Numpy, Scipy, Scikit-Learn, Pandas, Xray, pillow -- install via anaconda
  2. Ipython Notebook, Matplotlib, Seaborn -- install via anaconda
  3. Theano, Keras -- install via pip
  4. Word2Vec -- avoid pip, install from source
  5. -- just drop in the notebook folder
  6. -- install via pip
I recommend using Anaconda as it ships with prebuilt binaries for O/S dependencies, for a variety of platforms. It's possible to get this all working with pip and your O/S package manager. It should be fine to use Windows, but OSX or Linux are likely to be easier to use. Due to the use of Ipython Notebook as the primary environment, the choice of operating system is not likely to be a major limiting factor in this case.

I have only had success installing word2vec by cloning the repository and installing locally. I went with the old-school 'python setup.py install'. For whatever reason, what's in PyPI doesn't work for me.

I've noted the easiest path for installing each package in the list above.

The Data You Will Need 

  1. MNIST: 
  2. Kaggle Otto competition data:
  3. "Text8":
  4. For a stretch, try the larger data sets from

An Overview of the Tutorial

The tutorial will include an introduction, a mini-installfest, and then three problem walkthroughs. There will be some general tips, plus time for discussion. 

Entree: Problem Walkthrough One: MNIST Digit Recognition

Compute Time: Around 3 to 5 minutes for a random forest approach

Digit recognition is most obviously used when decoding postcode numbers on envelopes. It's also relevant to general handwriting recognition, and also non-handwritten recognition such as OCR of scanned documents or license plate recognition.

Attendees will be able to run the supplied, worked solution on the spot. We'll step through the implementation stages to talk about how to apply similar solutions to other problems. If time is available, we will include alternative machine learning techniques and other data sets.

Data for this problem will be available on USB.

Main: Otto Shopping Category Challenge

Compute time: 1 minute for random forest
Compute time: 7 minutes for deep learning
Data for this problem can be downloaded only through the Kaggle site due to the terms of use.

This is a real-world, commercial problem. The "Otto Group" sell stuff, and they put that stuff into nine classes for this problem. Each thing they sell has 93 features. The sample data set has 200k individual products which have each been somehow scored against these 93 features. The problem definition is to go from 93 input numbers to a category id between 1 and 9.

{ 93 features } --> some kind of machine learning --> { number between 1 and 9 }
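The "some kind of machine learning" box above can be sketched with a random forest, as used in the walkthrough. This is a toy stand-in using synthetic data (the real set has ~200k rows and comes from Kaggle), assuming scikit-learn is installed as per the setup list:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the Otto data: 200 "products",
# 93 numeric features each, class labels 1..9.
rng = np.random.RandomState(0)
X = rng.rand(200, 93)
y = rng.randint(1, 10, size=200)

# Fit a random forest: 93 inputs -> one of nine categories.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
pred = clf.predict(X[:5])  # five class ids, each between 1 and 9
```

With real data the only changes are loading the CSV from Kaggle and holding out a validation set to score against.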

Dessert: A Twitter Memebot in Word2Vec

Compute Time: around 4 minutes for Word2Vec training, plus 2 minutes for meme generation

This is something fun based on Word2Vec. We'll scrape twitter for some text to process, then use Word2Vec to look at some of the word relationships in the timelines.

Visualisation, Plotting and Results Analysis

No data science tutorial would be complete without data visualisation and plotting of results. Rather than have a separate problem for this, we will include them in each problem. We will also be considering how to determine whether your model is 'good', and how to convince both yourself and your customers / managers of that fact!

Bring Your Own Data

If you have a data problem of your own, you can bring it along to the tutorial and work on that instead. As time allows, I'll endeavour to assist with any questions you might have about working with your own data. Alternatively, you can just come up to me during the conference and we can take a look! There's nothing more interesting than looking at data that inherently matters to you.

I hope to see you at the conference!!


Monday, June 15, 2015

A Twitter Memebot in Word2Vec

I wanted to explore some ideas with Word2Vec to see how it could potentially be applied in practise. I thought that I would take a run at a Twitter bot that would try to do something semantically interesting and create new content. New ideas, from code.

Here's an example. Word2Vec is all about finding some kind of underlying representation of the semantics of words, and allowing some kind of traversal of that semantic space in a reliable fashion. It's about other things too, but what gets me really excited is that it's an approach which seems to mirror the way that we humans tend to form word relationships.

Let's just say I was partially successful. The meme I've chosen above is one of the better results from the work, but there were many somewhat-interesting outputs. I refrained from making the twitter bot autonomous, as it had an unfortunate tendency to lock in on the most controversial tweets in my timeline, then make some hilarious but often unfortunate inferences from them, then meme them. Yeah, I'm not flicking that particular switch, thanks very much!

The libraries in use for this tutorial can be found at:

I recommend tweepy over other twitter API libraries, at least for Python 3.4, as it was the only one which worked for me first try. I didn't get round to trying the others again, because I already had a working solution.

You'll need to go get some twitter API keys. I don't remember all the steps for this, I just kind of did it on instinct. There's a Stack Overflow question on the topic if that helps, but that's not what I used. Good luck :)

This particular Twitter bot will select a random tweet from your timeline, then comment on it in the form of a meme. The relevance of those tweets is a bit hit-and-miss to be honest. This could probably be solved by using topic-modelling rather than random selection to find the most relevant keywords from the tweet.
public_tweets = api.home_timeline() 
will fetch the most recent tweets from the current user's timeline. The code then chooses a random tweet, and focuses on words that are longer than 3 characters (a proxy for 'interesting' or 'relevant'). From this, we extract four words (if available). The goal is to produce a meme of the form "A is to B as C is to D". A, B and C are random words chosen from the tweet. D is a word found using word2vec. The fourth word is used to choose the image background via a Flickr search.
indexes, metrics = model.analogy(pos=[ws[0], ws[1]], neg=[ws[2]], n=10)
That call returns a list of candidate words for the end of our analogy; the code then simply picks the first one.

For example, a human might approach this as follows. Suppose the tweet is:

"I really love unit testing. It makes my job so much easier when doing deployments."

The word selection might result in "Testing, Easier, Deployments, Job". The goal would be to come up with a word for "Testing is to easier as Deployments is to X" (over an image of a job). I might come up with the word "automatic". Who knows -- it's kind of hard to relate all of those things. 
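The word-selection step described above can be sketched in a few lines of pure Python. This is a best-guess reconstruction, not the bot's exact code; the analogy step would then use the `model.analogy` call shown earlier, which needs a trained word2vec model:

```python
import random
import re

def pick_analogy_words(tweet_text, n=4):
    """Pick up to n 'interesting' words from a tweet: longer than
    3 characters, as a crude proxy for relevance."""
    words = re.findall(r"[A-Za-z']+", tweet_text)
    candidates = [w.lower() for w in words if len(w) > 3]
    random.shuffle(candidates)
    return candidates[:n]

tweet = ("I really love unit testing. It makes my job so much "
         "easier when doing deployments.")
ws = pick_analogy_words(tweet)
# With a trained model, the fourth analogy word would come from:
#   indexes, metrics = model.analogy(pos=[ws[0], ws[1]], neg=[ws[2]], n=10)
print(ws)
```

As noted above, topic modelling would be a better filter than word length, but this is enough to produce the hit-and-miss results described.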

Here's an image showing another dubious set of relationships.

There's some sense in it -- it certainly seems that breaking things can be trivial, that flicking a switch is easy, and that questioning is a bit like testing. The background evokes both randomness and goal-seeking. However, getting any more precise than that is drawing a long bow, and a lot of those relationships came up pretty randomly.

I could imagine this approach being used to suggest memes to a human reviewer, in a supervised-system approach. However, it's not really ready for autonomous use, being inadequate in both semantic meaning and sensitivity to content. I do think it shows, though, that it's pretty easy to pick up these technologies for basic use.

There are a bunch of potential improvements I can think of which should result in better output. Focusing the word search towards the topic of the tweet is one. Selecting words for the analogy which are reasonably closely related would be another, and quite doable using the word2vec approach.

Understanding every step of this system requires a more involved explanation of what's going on, so I think the next few posts might be targeted at the intermediate steps and how they were arrived at, plus a walkthrough of each part of the code (which will be made available at that point).

Until next time, happy coding!

Tuesday, June 2, 2015

Getting deep learning going with Python 3.4 and Anaconda

I wanted to test out how hard (or easy) it would be to re-create prior results using two technologies I've been itching to try -- Python 3.4 and Anaconda. Python 3.4 is, obviously, where things are headed. To date, I had never succeeded in getting all the relevant packages installed that I would like to use.

Anaconda is an alternative Python distribution produced by Continuum Analytics. They provide various commercial products, but that's okay. They make something free and super-useful for developers, and their commercial products solve enterprise-relevant problems.

The 'big sell' of Anaconda as opposed to using the standard distribution is the ease of installation of scientific packages on a variety of platforms. Spending a day trudging through getting the relevant base OS packages installed and the Python libraries effectively using them all is pretty dull work.

I set out to install Keras and Ipython notebook. That is pretty much the end-game, so if that works, there's a valid path. Short answer: it worked out well, with only a few stumbles.

There are two operating system packages to install. Anaconda itself, obviously. OpenBLAS was the one remaining (or, I think, any other BLAS installation). There were still some imperfections, but everything went far, far better than the same process went for the standard Python approach.

Achieving success depended, somewhat strangely, on the order of installation of the packages. My end game was to have the Keras library up and running. That's not in the Anaconda world, so you need to use pip to get the job done. A simple 'pip install keras' didn't work for me -- there were various complaints; I think it said there was no cython. Let's Redo From Start:

Take One
conda create -p ./new numpy
source activate ./new 
python setup.py install (yes I know I should use pip but :P )
... much compiling ...
warning: no previously-included files matching '*.pyo' found anywhere in distribution
Could not locate executable gfortran
Could not locate executable f95
Could not locate executable f90
Could not locate executable f77
Could not locate executable xlf90
Could not locate executable xlf
Could not locate executable ifort
Could not locate executable ifc
Could not locate executable g77
Could not locate executable g95
Could not locate executable pgfortran
don't know how to compile Fortran code on platform 'posix'
error: Setup script exited with error: library dfftpack has Fortran sources but no Fortran compiler found

Take Two

pip install theano
... much compiling ...
"object of type 'type' has no len()" in evaluating 'len(list)' (available names: [])
    error: library dfftpack has Fortran sources but no Fortran compiler found 
Take Three 

conda create -p ./new scipy   <-- finally I realised scipy was the stumbling block
source activate ./new 
python setup.py install
... much compiling ... 
SUCCESS!!! W00t!

The same route on pure Python, last time I tried it, was much more involved. I found that scipy didn't necessarily install with the relevant Fortran support, which a lot of science packages depend on. Getting the base libraries up and running, and finding out what they even were, was a real mission.

Now, I'm not 100% sure anyone is to blame here. There will be reasons for each of the various packaging decisions along the way, and also I haven't necessarily taken the time to understand my own environment properly. I'm just doing what every practically-minded person does: try to just do an install of the thing and see what pops.

Fewer things pop with Anaconda. I now have a functional Python 3.4 environment with all the latest and greatest machine learning tech that Python has to offer, and that is awesome.

Also, I haven't included the bit where I discovered I had to install OpenBLAS through macports rather than through Anaconda. I've saved the reader from that.

Happy hacking!

Thursday, May 28, 2015

An environment to share...

I'm a terrible blogger. I just ground to a halt and got overwhelmed by real life. I ran out of good ideas. I hadn't finished things I was in the middle of and had no results. AAAAAGH. Here's something vague about repeatable data science environments for tutorials...

I am working on the background material to support tutorial sessions. If there's one hard thing about giving a tutorial, it's getting everyone on the same page without anyone being left behind. All of a sudden, you need to think about *everyone's* environment. Not just yours -- not even just 'most people's', but everyone's.

There are a few technologies for setting up environments, plus some entirely different approaches. My goal is to present people with multiple paths to success, without having to think of everything.

I'll be looking at:
  -- Virtualenv and Python packages
  -- Virtual machine images
  -- Vagrant automatic VM provisioning
  -- Alternative Python distributions
  -- Using web-based environments rather than your own installation

Why is this all so complicated? Well, without pointing any fingers, Python packaging alone won't get the job done for scientific packages. There's no getting around the fact that you will need to install some packages into the base operating system, and there is no good, well-supported path to make that easy -- particularly if you would prefer to do this without modifying the base system. Then there's the layer of being able to help a room full of people all get the job done in about twenty minutes. Even with a room full of developers, it's going to be a challenge.

Let's take a tour of the options.

One -- Virtualenv and Python Packages

This option is the most 'pythonic' but also, by far, the least likely to get the job done. The reason is the dependency on complex scientific libraries, which the user is then going to have to hand-install by following a list of instructions. It's doable, but I won't know up front what the differences between, say, yum and apt are going to be, let alone what the differences between operating system versions might be. Then, there will be some users on OSX (hopefully using either macports or brew) and potentially some on Windows. In my experience, package names differ across those systems, and at times there may be critical gaps or versioning differences. Furthermore, the relevant Python 3 vs 2.7 packages may differ. It is basically just too hard to use Python's inbuilt packaging mechanism to handle a whole room full of individual development platform differences.

Two -- Virtual Machine Images

This approach is fairly reliable, but feels kind of clumsy to me, and isn't necessarily very repeatable. While not everyone is going to have Virtualbox (or any other major virtualiser) installed, most people will be able to use this technology on their systems. There may be a few who will need to install Virtualbox, but from there it really should 'just work'.

VM files can be shared with USB keys or over the network. So long as you bring along a good number of keys it should be mostly okay. A good tip though -- bring along keys of a couple of different brands. I have had firsthand experience of specific brands of USB key and computer just not getting along.

The downside is that while this will work in a tutorial setting, virtual machines can be slow, and don't necessarily set up the attendees with the technology they should be using going forward. They may find themselves left short of being able to work effectively in their own environments later.

Three -- Vagrant Automatic VM Provisioning

The next level up from supplying a base VM is using Vagrant. It allows you to specify the configuration of the base machine and its packages through a configuration file, so the only thing you need to share with people is a simple file. Rather than having to share large virtual machine files, which are also hard to version, you share a simple configuration file that can be easily versioned and is lightweight to send around. The only downside is that each attendee will need to download the base VM image through the Vagrant system, which will hit the local network. Running a tutorial is an exercise in digital survivalism -- it's best not to rely on any aspect of supporting technology.

I have also had a lot of trouble trying to install Vagrant boxes. I'm not really sure what the source of the issues was. I'm not really sure why it started working either. I just know I'm not going to trust it in a live tutorial environment. Crossing it off for now.

Four -- Alternative Python Distributions

This could be a really good option. The two main distributions that I'm aware of are Python(x,y) and Anaconda. Both seem solid, but Anaconda probably has more mindshare, particularly for scientific packages. For the purposes of machine learning and data science, that is going to be very relevant. Many people support using the Anaconda distribution by default, but that's not my first option.

I would recommend Anaconda in corporate environments, where it's useful to have a singular, multi-capable installation which is going to be installable onto enterprise linux distributions but still have all the relevant scientific libraries. I would recommend against it on your own machine, because despite its general excellence, I have found occasional issues when trying to install niche packages.

Anaconda would probably also be good in the data centre or in a cloud environment, where things can also get a little wild when installing software. It's probably a good choice for system administrators as well. Firstly, installed packages won't interfere with the system Python. Secondly, it allows users to create user-space deployments with their own libraries that are isolated from the central libraries. This helps with managing robustness. Standard Python will do this with the 'virtualenv' package, so there are multiple ways to achieve this goal.

Five -- Using Web-Based Environments Rather Than Your Own Installation

This is really about side-stepping the issue, rather than fixing it as such. It's not free from pitfalls, because there are still browser incompatibilities to consider -- though if your audience can't manage to have an up-to-date version of either Firefox or Chrome installed, then things are likely to be tricky anyway. Up-to-date versions of Internet Explorer are also likely to work, though I haven't tested that to any degree. You'll also need to understand the local networking environment to make sure you can run a server and that attendees are going to be able to access it. You could host an ad-hoc network on your own hardware, but I'm a bit nervous about that approach.

Perhaps, if I have some spare hardware, I'll expose something through a web server. Another alternative is to demonstrate (for example) the Kaggle scripting environment.


I think I have talked myself around to providing a virtual machine image via USB keys. I can build the environment on my own machine, verify exactly how it is set up, then provide something controlled to participants. 

In addition, I'll supply the list of packages that are in use, so that people can install them directly onto their own system if desired. This will be particularly relevant to those looking to exploit their GPUs to the maximum. 

Finally, I'll include a demo of the Kaggle scripting environment for those who don't really have an appropriate platform themselves.

I'd appreciate any comments from anyone who has run or attended tutorials who has any opinion about how best to get everyone up and running...

Thursday, May 7, 2015

On Reading and Understanding Academic Papers: Batch Normalisation

So, Keras has a link in its documentation. In the section on the "Batch Normalisation" layer, there is a hyperlink to a PDF of an academic paper on the use and effectiveness of this approach. Tim Berners-Lee would be proud.

I followed that link. I am soooo ignorant when it comes to understanding maths properly. Don't get me wrong, I'm not completely foreign to it -- I work with numerical data all the time, and have a printout of the unit circle and associated trig functions on my desk. I use it multiple times per year. I remember nothing of it between those times.

Reading a paper requires me to have a degree of easy familiarity with mathematical concepts which I just don't have. Let me quote from the introduction of the paper. I needed to take an image snapshot to deal with the mathematical notation (sorry Tim!)

Now, as far as I can tell, this paper is a top piece of research and a wonderful presentation of a highly useful and relevant concept. Parts of this I can understand, parts I don't. SGD was already defined as stochastic gradient descent, by the way.

The first symbol is a theta. I know from prior experience plus the description in the text it refers not to a number, but to some kind of collection of numbers being the parameters of the network. I'm not sure if it's a matrix, or several numbers, or what parameters it means exactly. Arg min, I think, means a magical function which returns "The value of theta that minimises the result of the following expression". I'm reading this like a programmer, see.

Okay, then the 1/n multiplied by the sum of a function between 1 and N. This is otherwise known as "the average". I think N refers to the number of training examples, and i is an index parameter into the current example.

I have no clue what the 'l' function is. I'm going to guess from the text it means 'loss function'.

So, unpacked, this means "The network parameters which minimise the mean average loss across the training data".
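Reconstructed from the description above (this is my reading of the symbols, so treat it as a best guess rather than a faithful transcription of the paper):

```latex
\Theta = \arg\min_{\Theta} \frac{1}{N} \sum_{i=1}^{N} \ell(x_i, \Theta)
```

where N is the number of training examples, x_i is the i-th example, Theta is the collection of network parameters, and the function l is the loss.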

What's unclear to me is how the mathematical notation actually helps here. Surely the statement "stochastic gradient descent minimises the mean average loss over the training data" is actually more instructive to both the mathematical and casual reader than this function notation?

Now, I can eventually unpack most parts of the paper, slowly, and one-by-one. Writing this post genuinely helped me grok what was going on better. I haven't actually gotten to the section on batch normalisation yet, of course. I'll read on ... casual readers can tune out now as the rest of this post is going to be an extended exposition of my confusion, rather than anything of general interest.

The next paragraph refers to the principle behind mini-batching. There is something slightly non-obvious being said here. They state that "The mini-batch is used to approximate the gradient of the loss function with respect to the parameters...". What they are saying here is that the mini-batch, if it's a representative sample, approximates the overall network. Calculating the loss of the mini-batch is an estimator of the loss of the whole training set. It's smaller and easier to work with than the entire training set, but more representative than looking just at each example. The larger the mini-batch, the more representative it is of the whole, but at the same time it is more unwieldy. Picking the optimal batch size isn't covered, although they do mention that its efficiency also relates to being able to parallelise the calculation of the loss for each example.
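In symbols, the claim (again, my reconstruction rather than a quote) is that the mini-batch gradient approximates the full-data gradient, with m the mini-batch size:

```latex
\frac{1}{m} \sum_{i=1}^{m} \frac{\partial \ell(x_i, \Theta)}{\partial \Theta}
\approx
\frac{1}{N} \sum_{i=1}^{N} \frac{\partial \ell(x_i, \Theta)}{\partial \Theta}
```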

The reason I think it's mentioned is that the purpose of mini-batching is similar to the purpose of batch normalisation. They are saying that mini-batching improves generalisation, because the learning is related to average loss across the mini-batch, rather than learning each specific example. That is to say -- it makes the network less sensitive to spurious details.

As I understand it, batch normalisation also achieves that end, as well as reducing the sensitivity of the network to the tuning of its meta-parameters (the latter being the prime purpose of batch normalisation).

They make the point that in a deep network, the effect of tuning parameters is magnified. For example, if picking a learning rate of 0.1 has some effect on a single-layer network, it could have double that effect on a two-layer network. Potentially. I think this point is a little shaky myself, because having multiple layers could allow each layer to cancel out overshoots at previous layers. However, this might be a strong case for a more intricate layer design based on capturing effects at different scales. For example, having an architecture with a 'fine detail' and a 'coarse detail' layer might be better than two fine-scale layers. Another approach (for example on images) might be to actually train off smoothed data plus the detail data. Food for thought.

They then move on to what I think is the main game: reducing competition between the layers. As I interpret what they are saying, the learning step affects all layers. As a result, in a sense, layers towards the top of the network are experiencing a 'moving goalposts' situation while the layers underneath them learn and respond differently to the input data. This is basically where I no longer understand things completely. They are referring to the shifting nature of layer outputs as "Internal covariate shift". I interpret this as meaning that higher layers need to re-adjust to the new outputs of lower-layers. I think of it as being like input scaling, except at the layer level, updated through mini-batch training.
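My mental model of the normalisation itself, sketched in numpy. This is a best-guess reconstruction of the core transform rather than the paper's full algorithm (the real thing learns gamma and beta during training, and tracks running statistics for inference):

```python
import numpy as np

def batch_normalise(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise a mini-batch of layer activations to zero mean and
    unit variance per feature, then apply a scale (gamma) and
    shift (beta)."""
    mean = x.mean(axis=0)          # per-feature mean over the batch
    var = x.var(axis=0)            # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Two features on wildly different scales, like unscaled layer outputs.
batch = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])
out = batch_normalise(batch)
```

The point of the sketch: whatever the layer below starts producing, the layer above always sees inputs on the same scale, which is how I read the fix for "internal covariate shift".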

They then point to their big result: reduction in training time and improvements to accuracy. They took a well-performing network, and matched its results in 7% of the iterations. That's a big reduction. They also state they can improve final accuracy, but do not say by how much.

Now for the details. The paper now talks about their detailed methodology and approach. I'm literally just 1.5 pages into an 8-page document, and my mind is experiencing that pressure you get when you just don't think you can grok any more. I don't think I can reasonably burden my readers any more with my thoughts, nor can I process much more today.

I'm going to have to break up understanding this paper into more sessions, I think, coming back after internalising every page or two. It's probably worth my getting there, because the end of the paper does mention alternative means to the same ends and talks about the limits of the technique. Perhaps I will make further posts on the topic, perhaps I won't. We shall see.

If any readers here are more advanced in their understanding than I am, I would very much appreciate if you could point out anything I've gotten wrong!!!

Tuesday, May 5, 2015

Network Configuration Exploration

Let's take stock. We have a primitive model (A), and a best-performing model (B). We are undertaking a process of breaking down the differences to understand how each difference between the two models contributes to the observed performance gains. The hope is to learn a standard best practise, and thereby start from a higher base in future. However, we are also hoping to learn whether there is anything we could do to take model (B) and extend its performance further.

We have dealt with the input-side differences -- shuffling and scaling. We now move on to the network's internals -- the network configuration. This means its layers, activation functions and connections. Let's call the 'upgraded' version Model A2.

Model (A) has a single hidden layer. Model (B) is far deeper, with three sets of three layers, plus the input layer, plus the output layer. Model A uses non-linear activation functions for the hidden and the output layer. Model B uses a non-linear activation function for the output layer, but uses mostly linear processes to get its work done.

I got a bit less scientific at this point ... not knowing where to start, I took Model B and started fiddling with its internals, based on nothing more than curiosity and interest. I removed one of the layer-sets entirely, and replaced the activation function of another with 'softmax'. That network began its training more quickly, but finished with identical performance in due course.

So I removed that softmax layer set, leaving a simpler configuration with just the input layer, a linear rectifier layer, and the final softmax activation layer. This was worse, but also interesting. For the first time, the validation loss was worse than the training loss. To my mind, this is the network 'memorising' rather than 'generalising'. The final loss was 0.53, which was better than Model A, better than a Random Forest, but much worse than Model B. This gives us some more to go on. Let's call this new model B1.

There are still some key differences between Model A2 and Model B1. B1 actually uses simpler activations, but includes both Batch Normalisation and Dropout stages in processing, which we haven't talked about before. Which of those differences are important to the improvement in performance?

This gives us the direction for the next steps. My feeling is that both batch normalisation and dropout are worth examining independently. The next posts will focus on what those things do, and what their impacts are when added to a more basic model or removed from a more sophisticated one.

Saturday, May 2, 2015

What is the impact of scaling your inputs?

Last post we looked at the "input shuffling" technique. This time we're looking at input scaling. Input scaling is just one of several input modification strategies that can be applied. Unlike input shuffling, which makes intuitive sense to me, input scaling does not. Absolute value can in fact be important, and it feels like input scaling is actually removing information.
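For concreteness, the kind of scaling I mean is standardisation: shifting each input feature to zero mean and unit variance. This is a generic numpy sketch (function name mine), not the exact transform from any particular solution:

```python
import numpy as np

def scale_inputs(X):
    """Standardise each feature (column) to zero mean and unit variance.
    Constant features are left at zero rather than dividing by zero."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0          # avoid division by zero on constant columns
    return (X - mean) / std

# two features on wildly different absolute scales
X = np.array([[1., 100.],
              [2., 200.],
              [3., 300.]])
X_scaled = scale_inputs(X)
```

After scaling, both columns occupy the same numeric range, which is the point: no single feature dominates the early gradient updates purely by virtue of its units.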

Let's take the same approach as last time: add it to a basic network, and remove it from a well-performing one. This time round we also have an extra question -- are the benefits of input scaling independent of the benefits from input shuffling?

The first thing I did was add input scaling to the network design as we had it at the end of the last post. I ran it for 20 iterations. This is a shuffled, scaled, three-layer architecture. The performance here is much, much better. After 20 iterations, we reach a valid loss of 0.59, and a train loss of 0.56. That's not as good as our best network, but it's a big step up. If I run it out to 150 iterations, I get a valid loss of 0.558. As a reminder, our current best network hits 0.5054 after 20 iterations.

Let's review:
      Design Zero achieved valid loss of about 0.88 -- much worse than Random Forest.
      Design Zero eventually cranked its way down to 0.6 or so after 800 cycles.
      Design Zero + shuffling hits 0.72 after 13 iterations
      Design Zero + shuffling + scaling hits 0.59 after 20 iterations (slightly worse than RF)
      Design Zero + shuffling + scaling wanders off base and degrades to 0.94 after 800 iterations

Interesting. Performance has improved greatly, but we have introduced an early-stopping problem, where the optimum result appears part-way through the training cycle. Early stopping is basically the inverse of the halting problem: not knowing whether you can safely stop. For the time being, I don't want to go down that rabbit hole, so I'll just stop at 20 iterations and concentrate on comparing performance there.
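A crude way to act on that observation, sketched in plain Python with a made-up loss history, is to keep the best validation loss seen so far and stop once it fails to improve for a few epochs (the 'patience' idea):

```python
def train_with_early_stopping(losses_per_epoch, patience=5):
    """Stop when the validation loss hasn't improved for `patience`
    epochs, and report the best epoch seen. The 'training' here is
    simulated by a pre-computed list of validation losses."""
    best_loss = float('inf')
    best_epoch = 0
    for epoch, loss in enumerate(losses_per_epoch):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: give up
    return best_epoch, best_loss

# losses that improve, bottom out, then degrade (like the scaled network)
history = [0.9, 0.8, 0.7, 0.62, 0.59, 0.60, 0.63, 0.70, 0.80, 0.90, 0.94]
epoch, loss = train_with_early_stopping(history, patience=3)
```

This doesn't solve the underlying problem (you never know a better epoch wasn't coming), but it's the standard pragmatic compromise.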

Our current design is performing at 0.59, vs 0.50 for the best network. The performance of Random Forest was 0.57, so we're getting close to reaching the benchmark level now using a pretty basic network design and the most basic input techniques.

Let's see what happens when we pull scaling out of our best design. After 20 iterations, without scaling, the performance is at 0.5058. That's a tiny bit worse than its best performance, but you'd be hard-pressed to call it significant. Whatever benefit is provided by scaling has, presumably, largely been accounted for in the network design, perhaps implicitly being reflected into the weights and biases of the first layer of the network. In fact, the performance at each epoch is virtually identical as well -- there is no change to early training results either.

The big relief is that it didn't cost us anything. The verdict on scaling is that it can be very beneficial in some circumstances, and doesn't seem to count against us. I still have a nagging philosophical feeling that I'm 'hiding' something from the NN by pre-scaling, but until I can think of a better idea, it looks like scaling is another one of those "every time" techniques.

Thursday, April 30, 2015

The Effect of Randomising the Order of Your Inputs

This post focuses on the impact of randomising the order of your inputs (training examples) in one particular problem. It will look at the effect of adding input shuffling to a poorly-performing network, and also the effect of removing it from a well-performing network.

Let's jump straight to the punch: always randomise the order. The network will start effective training much sooner. Failure to do so may jeopardise the final result. There appears to be no downside. (If anyone has an example where it actually makes things worse, please let me know.)

Here's a screenshot from my IPython notebook of the initial, poorly-performing network. Sorry for using an image format, but I think it adds a certain amount of flavour to show the results as they appear during development. This table shows that after 13 iterations (picked because it fits in my screenshot easily), the training loss is 1.44, and the accuracy is 11.82%. This network eventually trains to hit a loss of about 0.6, but only after a large (800-odd) number of iterations.

It takes 100 iterations or so to feel like the network is really progressing anywhere, and it's slow, steady progress the whole way through. I haven't run this network out to >1k iterations to see where it eventually tops out, but that's just a curiosity for me. Alternative techniques provide more satisfying results much faster.

Here's the improved result. I added a single line to the code to achieve this effect:

train = shuffle(train)
Adding the shuffling step wasn't a big deal -- it's fast, and conceptually easy as well. It's so effective that I honestly think it would be a good thing for NN libraries to simply do by default rather than leave to the user. We see here that by iteration 13, the valid loss is 0.72 as opposed to 2.53 in the first example. That's pretty dramatic.
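That `shuffle` call came from my library's helpers; if you wanted to roll your own, a numpy version might look like the sketch below (function name is mine). The key subtlety is permuting features and labels with the same permutation:

```python
import numpy as np

def shuffle_together(X, y, seed=None):
    """Shuffle training examples and their labels with one shared
    permutation, so each example keeps its own label."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(X))
    return X[order], y[order]

X = np.arange(10).reshape(5, 2)   # 5 examples, 2 features each
y = np.array([0, 1, 2, 3, 4])     # label i belongs to row i
X_shuf, y_shuf = shuffle_together(X, y, seed=0)
```

Passing a seed keeps runs reproducible, which matters when you're comparing designs across training runs.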

If anyone knows of any examples where it is better not to randomise the input examples, please let me know!

For the time being, the benefits of this result are so clear that a deeper investigation is not really called for. I'm just going to add it to my 'standard technique' for building NNs going forward, and consider changing this step only if I am struggling with a particular problem-at-hand. What's more likely is that more sophisticated approaches to input modification will be important, rather than avoiding the step entirely. I'm aware that many high-performing results have been achieved by transforming input variables to provide additional synthetic learning data in the training set. Examples of this include image modifications such as skew, colour filtering and other similar techniques. I would be very interested to learn more about other kinds of image modification preprocessing, like edge-finding algorithms, blurring and other standard algorithms.
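As a taste of what synthetic input generation looks like, here's a minimal numpy sketch that doubles a dataset by flipping each image horizontally. Note this is only valid when the categories are symmetric under flipping (it would be wrong for digits like 2 or 7); real pipelines use far richer transforms:

```python
import numpy as np

def augment_with_flips(images, labels):
    """Double an image training set by adding a horizontally flipped
    copy of every example. Only valid when class membership survives
    a left-right flip."""
    flipped = images[:, :, ::-1]          # flip each image left-right
    return (np.concatenate([images, flipped]),
            np.concatenate([labels, labels]))

imgs = np.zeros((4, 8, 8), dtype='float32')   # 4 dummy 8x8 images
imgs[:, :, 0] = 1.0                           # bright left edge
lbls = np.array([0, 1, 0, 1])
aug_imgs, aug_lbls = augment_with_flips(imgs, lbls)
```

The appeal of this family of tricks is that you get more training data for free, at the cost of needing to think carefully about which transforms preserve the label.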

Interestingly, the final results for these two networks are not very different. Both of them seem to train up to around the maximum capability of the architecture of the network. Neither result arising from this tweak alone approaches the performance provided by the best known alternative approach that I have copied from.

I wondered whether shuffling the inputs was a truly necessary step, or just a convenient / beneficial one. If you don't care about letting the machine rip through more iterations, then is this step really relevant? It turns out the answer is a resounding "Yes", at least sometimes.

To address this I took the best-performing code (as discussed in the last post) and removed the shuffle step. The result was a network which trained far more slowly, and moreover did not reach the optimal solution nor even approach it. The well-performing network achieved a valid loss of 0.5054, which looks pretty good compared to random forest.

Here is the well-performing network with input shuffling removed. You can see here that the training starts off badly, and gets worse. Note the "valid loss" is the key characteristic to monitor. The simple "loss" improves. This shows the network is remembering prior examples well, but extrapolating badly.

After 19 rounds (the same number of iterations taken by the most successful network design), avoiding the shuffling step results in a network that just wanders further and further off base. We end up with a valid loss of 13+, which is just way off.

As an interesting aside, there is a halting problem here. How do we know, for sure, that after sufficient training this network isn't suddenly going to 'figure it out' and train up to the required standard? After all, we know for a fact that the network nodes are capable of storing the information of a well-performing predictive system. It's clearly not ridiculous to suggest that it's possible. How do we know that the network is permanently stuck? Obviously indications aren't good, and just as obviously we should just use the most promising approach. However, this heuristic is not actually knowledge.

Question for the audience -- does anyone know if a paper has already been published analysing the impacts of input randomisation across many examples (image and non-image) and many network architectures? What about alternative input handling techniques and synthetic input approaches?

Also, I haven't bothered uploading the code I used for these examples to the github site, since they are really quite simple and I think not of great community value. Let me know if you'd actually like to see them.

Wednesday, April 29, 2015

Neural Net Tuning: The Saga Begins

So, it has become abundantly clear that a totally naive approach to neural net building doesn't cut the mustard. A totally naive approach to random forests does cut a certain amount of mustard -- which is good to know!

First lesson: benchmark everything against the performance of a random forest approach.

Let's talk about levelling up, both in terms of your own skill, and in terms of NN performance. And when I say your skill, that's a thinly-veiled reference to my skill, put in the third person -- yet maybe you'll benefit from watching me go through the pain of getting there :).

In this particular case, I discovered two things yesterday: a new favourite NN library, and a solution to the Otto challenge which performs better. I recommend them both! The library is well-documented (!), utilises the GPU when available (!) and includes tutorials and worked examples (!). I'm very excited. It also matched my own programming and abstraction style very well, so I'll even call it Pythonic. (Note, that is a joke about subjective responses, the No True Scotsman fallacy and whether there is even such a thing as Pythonic style.) (Subnote -- I still think it's the best thing out there.) The solution comes from a Kaggle forum post on that author's NN implementation, based on the Keras library, along with the Python source code for their worked solution.

Their neural net is substantially different to the simple, naive three-layer solution I started with and talked about in the last blog post. I plan to look at the differences between our two networks, and comment on the impact and value of each change to the initial network design as I make it, including discussion of whether that change is generally applicable, or only offers specific advantages to the problem at hand. The overall goal is to come up with general best design practices, as well as an understanding of how much work is required to hand-tool an optimal network for a new problem.

My next post will be on the impacts of the first change: randomising the order of your input examples (I may add some other tweaks to the post). Spoiler alert! The result I got was faster network training, but not a significant improvement in the final performance. Down the track, I'll also experiment on the effects of removing this from a well-performing solution and discussing things from the other side. It may be that randomising order is useful, but not necessary, or it may be integral to the whole approach. We shall find out!

Tuesday, April 28, 2015

Neural Networks -- You Failed Me!

Kaggle have a competition open at the moment called the "Otto Group Product Classification Challenge". It's a good place to get started if for no other reason than the data files are quite small. I was finding that even on moderate sized data sets, I was struggling to make a lot of progress just because of the time taken to run a simple learning experiment.

I copied a script called "Benchmark" from the Kaggle site, ran it myself, and achieved a score of 0.57, which is low on the leaderboard (but not at the bottom). It took about 90 seconds. Great! Started!

I then tried to deploy a neural network to solve the problem (the benchmark script uses a random forest). The data consist of about 62 thousand examples, each example composed of 93 input features. The training goal is to divide these examples into 9 different classes. It's a basic categorisation problem.

I tried a simple network configuration -- 93 input nodes (by definition), 9 output nodes (by definition) and a hidden layer of 45 nodes. I gave the hidden layer a sigmoid activation function and the output layer a softmax activation function. I don't really know if that was the right thing to do -- I'm just going to confess that I picked them as much by hunch as by prior knowledge.

It was a bit late at night at this stage, and I actually spent probably an hour just trying to make my technology stack work, despite the simple nature of the problem. The error messages coming back were all about the failure to broadcast array shapes rather than something that immediately pointed out my silly mistake, so it took me a while. C'est la vie.

Eventually, I got it running. I got much, much worse results. What? Aren't neural networks the latest thing? Aren't they supposed to be better than everything else, and better yet don't they just work? Isn't the point of machine learning to remove the need to develop a careful solution to the problem? Apparently not...

I googled around, and found someone had put together a similar approach with just three nodes in the hidden layer. So I tried that.

I got similar results. Wait, what? That means that 42 of my hidden nodes were just dead weight! There must be three main abstractions, I guess, which explain a good chunk of the results.

I tried more nodes. More hidden layers. More training iterations. More iterations helped a bit, but not quite enough. I tried from 15 to 800 iterations. I found I needed at least 100 or so to start approaching diminishing returns, but I kept getting benefits all the way up to 800. But I never even came close to the basic random forest approach.

I have little doubt that the eventual winning result will use a neural network. The question is -- what else do I need to try? I will post follow-ups to this post as I wrangle with this data set. However, I really wanted to share this interesting intermediate result, because it speaks to the process of problem-solving. Most success comes only after repeated failure, if at all.

Monday, April 27, 2015

Interlude: Kaggle Scripting

I don't like to support specific companies, in general I'm more interested in how people can be productive as independently and autonomously as possible.

However, Kaggle just did something pretty amazing.

They've launched a free online scripting environment for machine learning, with access to their competition data sets. That means that everything I've been writing about so far -- loading data, setting up your environment, using notebooks, can be easily sidestepped for instant results. The best thing -- it supports Python!

I haven't yet determined where the limitations are; I've simply forked one of the sample scripts that I found, but it worked wonderfully. I found that the execution time of the script was similar to running it locally on my 2011 Macbook Air, but that's not bad for something totally free that runs on someone else's infrastructure. I was able to sidestep data downloading entirely, and "leap into the middle" by forking an existing script.

Here's a direct link to one:

Note, I'm logged into Kaggle, so I'm not sure how this will go for people who aren't signed in.

My expectation is that for solutions which demand a lot of computational resources (or rely on the GPU for reasonable performance) or use unusual libraries, a self-service approach to computing will still win out, but this is an amazing way to get started.

Thursday, April 9, 2015

STML: Loading and working with data

Loading Data Into Memory

I've had some Real Life Interruptions preventing me from taking the MNIST data further. Here's a generic post on working with data that I prepared earlier.

It may sound a little ridiculous, but sometimes even loading data into memory can be a challenge. I've come across a few challenges when loading data (and subsequently working with it).
  1. Picking an appropriate in-memory structure
  2. Dealing with very large data sets
  3. Dealing with a very large number of files
  4. Picking data structures that are appropriate for your software libraries
Some of these questions have obvious answers; some do not. Most of them will change depending on the technology stack you choose. Many of them will be different between Python 2.7 and Python 3.x. Let's take it step-by-step, in detail.

Update -- after beginning this post intending to address all four points, I found that I exceeded my length limit with even a brief statement on the various categories of input data with respect to choosing an appropriate in-memory data structure. I will deal with all these topics in detail later, when we actually focus on a specific machine learning problem. For now, this blog post is just a brief statement on picking appropriate in-memory data structures.

Picking an appropriate in-memory structure

Obviously, this will depend very much on your data set. Very common data formats are:
  1. Images, either grayscale or RGB, varying in size from icons to multi-megapixel photos
  2. Plain text (with many kinds of internal structure)
  3. Semantic text (xml, json, markdown, html...)
  4. Graphs (social graphs, networks, object-oriented databases)
  5. Relational databases
  6. Spatial data / geographically-aware data
Two additional complicating factors are dealing with time-series of data, and dealing with streaming data.

This isn't the place to exhaustively cover this topic. Simply covering how to load these files into memory could comfortably take a couple of posts per data type, and this is without going into cleaning data, scraping data or otherwise getting data in fit shape. Nonetheless, this is a topic often skipped over by machine learning tutorials and has been the cause of many a stubbed toe on my walk through a walkthrough.

Loading Images into Memory

Python has quite a few options when it comes to loading image data. When it comes to machine learning, the first step is typically to deal with image intensity (e.g. grayscale) rather than utilising the colour values. Further, the data is typically treated as normalised values between zero and one. Image coordinates are typically based at the top left, array coordinates vary, graphs are bottom-left, screen coordinates vary, blah blah blah. Within a single data set, this probably doesn't matter too much, as the algorithms don't tend to care about whether images are flipped upside down, just so long as things are all internally consistent. Where it matters is when you bring together disparate data sources.

The super-super short version is:
# open a random image small enough for memory
from PIL import Image
import numpy

img ='filename.jpg')
img = numpy.asarray(img, dtype='float32') / 256.
Note the use of float32, the division by 256, the decimal point on the divisor to force floating-point division. You'll need to install numpy and something called "Pillow". Pillow is the new PIL library and is compatible with any tutorials you're likely to find.

Loading Plain Text Into Memory

Plain text is usually pretty easy to read. There are some minor platform differences between Windows and the rest of the known universe, but by and large getting text into memory isn't a problem. The main issue is dealing with it after that point.

A simple
thestring = open('filename.txt').read()
will typically do this just fine for now. The challenge will come when you need to understand its contents. Breaking up the raw text is called tokenization, and most language processing will use something called 'parts of speech tagging' to start building up a representation of the underlying language concepts. A library called NLTK is the current best place to start with this, although Word2Vec is also a very interesting starting point for language processing. Don't mention unicode -- that's for later.

Loading Semantic Text into Memory

Let's start with HTML data, because I have a gripe about this. There's a library out there called Beautiful Soup, which has the explicit goal of being able to process pretty much any HTML page you can throw at it. However, for some incomprehensible reason, I have found it is actually less tolerant than the lxml library, which is better able to handle nonconformant HTML input (when placed into compatibility mode). First up, I really don't work with HTML much. As a result, you're going to be mostly on your own when it comes to parsing the underlying document, but you'll be using one of these two libraries depending on your needs and your source data.
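If you want to see what 'parsing HTML' amounts to without installing anything, Python 3's standard library ships a bare-bones parser in html.parser. A minimal text extractor might look like this (class name is mine, and it's far less robust than either library above):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, ignoring tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        return ''.join(self.chunks)

parser = TextExtractor()
parser.feed('<p>Hello <b>world</b>, even with <i>unclosed markup')
extracted = parser.text()
```

The event-driven design (a callback per tag or text run) is also roughly how the tolerant parsers cope with broken input: they just keep firing events for whatever they can make sense of.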

Loading Graphs Into Memory

I also haven't worked all that much with graph structures. Figuring out how to load and work with this kind of data is a challenge. Firstly, there appear to be few commonly-accepted graph serialisation formats. Secondly, the term "graph" is massively overloaded, so you'll find a lot of search results for libraries that are actually pretty different. There are three broad categories of graph library: graphical tools for plotting charts (e.g. X vs Y plots à la matplotlib and graphite); tools for plotting graph structures (e.g. dot, graphviz) and tools for actually working with graph structures in memory (like NetworkX). There is in fact some overlap between all these tools, plus there are sure to be more out there. It is also possible to represent graphs through Python dictionaries and custom data structures.
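The dictionary representation is worth a quick sketch, because it shows how little machinery you need before reaching for a library: a node-to-neighbours mapping plus a breadth-first search already answers reachability questions (names are made up):

```python
from collections import deque

# a small social graph as a Python dictionary: node -> set of neighbours
graph = {
    'alice': {'bob', 'carol'},
    'bob': {'alice', 'dave'},
    'carol': {'alice'},
    'dave': {'bob'},
    'erin': set(),               # an isolated node
}

def reachable_from(graph, start):
    """Breadth-first search: return every node reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

component = reachable_from(graph, 'alice')
```

Once graphs get large or you need algorithms beyond traversal (centrality, shortest paths), that's the point where a library like NetworkX earns its keep.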

Working with Relational Data

Relational database queries are absolutely fantastic tools. Don't underestimate the value of Ye Olde Database structure. They are also a wonderful way to work with data locally. The two magic libraries here are sqlite3 and pandas. Here's the recipe for reading out from a local sqlite3 database:
import sqlite3
import pandas

conn = sqlite3.connect('../database/demo.db')
query = 'select * from TABLE'
df = pandas.read_sql(query, conn)
Connecting to a remote database simply involves using the appropriate library to create the connection object. Pandas is the best way to deal with the data once in memory, bar none. It is also the best way to read CSV files, and is readily usable with other scientific libraries due to the ability to retrieve data as numpy arrays. Many, many interesting machine learning problems can and should be approached with these tools.
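If you want to try the recipe without a database file on disk, sqlite3 has an in-memory mode; the table and values below are invented for illustration:

```python
import sqlite3

# an in-memory database with a made-up table, standing in for demo.db
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE measurements (sensor TEXT, value REAL)')
conn.executemany('INSERT INTO measurements VALUES (?, ?)',
                 [('a', 1.5), ('a', 2.5), ('b', 9.0)])

# the same kind of query you would hand to pandas.read_sql
rows = conn.execute(
    'SELECT sensor, AVG(value) FROM measurements GROUP BY sensor'
).fetchall()
```

Being able to push aggregation (GROUP BY, AVG) into the database before the data ever reaches Python is a big part of why the sqlite3 + pandas combination scales so well locally.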

More recently, I have found it more effective to use HDF5 as the local data storage format, even for relational data. Go figure.

Spatial Data / GIS Data

Abandon all hope, ye who enter here.

Spatial / GIS data is a minefield of complexity seemingly designed to entrap, confuse and beguile even the experts. I'm not going to even try to talk about it in a single paragraph. Working with spatial data is incredibly powerful, but it will require an enormous rant-length blog post all of its own to actually treat properly.

Dealing with Time Series

Machine learning problems frequently ignore the time dimension of our reality. This can be a very useful simplification, but feels unsatisfying. I haven't seen any good libraries which really nail this. Let's talk about some common approaches:

Markov Modelling. The time dimension is treated as a sequence of causally-linked events. Frequency counting is used to infer the probability of an outcome given a small number of antecedent conditions. The key concept here is the n-gram, which is an n-length sequence which is counted.
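A frequency-counting sketch in plain Python: count every n-length window in a sequence of events, which is the raw material for estimating those transition probabilities (example data invented):

```python
from collections import Counter

def ngram_counts(sequence, n):
    """Count every n-length window in a sequence of events; the counts
    estimate transition probabilities in a simple Markov model."""
    return Counter(tuple(sequence[i:i + n])
                   for i in range(len(sequence) - n + 1))

events = ['sun', 'rain', 'sun', 'sun', 'rain', 'sun', 'sun']
bigrams = ngram_counts(events, 2)
```

Dividing each count by the total for its first element turns the table into conditional probabilities, e.g. P(next event | current event).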

Four-Dimensional Arrays. Time can be treated as a sequence of events with either an equal or an irregular interval. Arrays can be sparse or dense.

Date-Time Indexing. This is where, for example, a 2d matrix or SQL table has date and time references recorded with the entries.

Animation (including video). Video and sound formats both include time as a very significant concept. While most operating systems don't provide true realtime guarantees around operation, their components (sound cards and video cards) provide the ability to buffer data and then write it out in pretty-true realtime. These will in general be based on an implicit time interval between data values.

Dealing with Streaming Data

Algorithms which deal with streaming data are typically referred to as "online" algorithms. In development mode, your data will mostly be stored on a disk or something where you can re-process the data easily and at will. However, not all data sources can be so readily handled. Examples where you might use an online algorithm include video processing, data firehose scenarios (satellite data,  twitter, sound processing). Such data sources are often only of ephemeral value, not worth permanently storing; you may be working on a lower-capability compute platform; or you just might not have that much disk free.

You can think of these data streams as something where you store the output of your algorithm, but throw out the source data. An easy example might include feature detection on a network router. You might monitor the network stream for suspicious packets, but not store the full network data of every packet going across your network. Another might be small robots or drones which have limited on-board capacity for storage, but can run relevant algorithms on the data in realtime.
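The hallmark of an online algorithm is that it updates a small summary per datum and then throws the datum away. A minimal example is a running mean (a sketch of the idea, not any particular library's API):

```python
def running_mean():
    """An online (streaming) statistic: consume values one at a time,
    keeping only the count and the mean, never the raw data."""
    count, mean = 0, 0.0

    def update(x):
        nonlocal count, mean
        count += 1
        mean += (x - mean) / count    # incremental mean update
        return mean

    return update

# e.g. summarising packet sizes on a router without storing packets
update = running_mean()
for packet_size in [100, 300, 200]:
    current = update(packet_size)
```

The same shape of solution (tiny mutable state, one update per observation) extends to variances, histograms and even online learning algorithms like stochastic gradient descent.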

Next Steps

I am not sure exactly what to make of this blog post. The commentary is only slightly useful for people with a specific problem to solve, but perhaps it serves to explain just how complex an area machine learning can be. The only practical way forward that I can think of is to get into actual problem-solving on non-trivial problems.

I think the main thing to take away from this is: Don't Feel Stupid If You Find It Hard. There really really are a lot of factors which go into solving a machine learning problem, and you're going to have to contend with them all. These problems haven't all been solved, and they don't have universal solutions. Your libraries will do much of the heavy lifting, but then leave you without a safety net when you suddenly find yourself leaving the beaten path.

Wednesday, April 1, 2015

Quick Look: MNIST Data

Project Theme: STML-MNIST

I like to do a lot of different things at once. I like exploring things from both a theoretical perspective and a practical perspective in an interleaved way. I'm thinking that I will present the blog in similar fashion: multiple interleaved themes. I'll have practical projects which readers can choose to repeat themselves, and which will push me to exploring new territory firsthand. At the same time, I will share general thoughts and observations. If I can manage it, I will release one of each per week, and I will expect each project post should not take a reader more than a week to comfortably work through. By that I mean there should be somewhere between 30 minutes and 3 hours work required to repeat a project activity.

Readers, please let me know if this works for you.

Right, so carrying on... a couple of posts ago, prior to being derailed into a discussion on analytics, I ran through how to take a pretty robust algorithm called a Random Forest and apply it to a Kaggle competition data set. The training time was about 5 minutes, which is fast enough to walk through, but a bit slow for really rapid investigation. I'm going to turn now to another data set called the MNIST database. It contains images of handwritten digits between 0 and 9. It has been done to death "in the literature", but it's still pretty interesting if you haven't come across it before.

This dataset will let us apply the same algorithm we used before (Random Forest) and see how it goes on another dataset. Consider this a bit of a practice or warmup exercise for improving our ability to apply random forests and other techniques to images. I'm going to point you out at another blog post for the actual walkthrough (why re-invent the wheel?), but I'll still supply my working notebook for those who want a leg up or a shortcut. But be warned: there are no real shortcuts to learning! (Why do you think I'm writing this blog, for example?)

Theme: Image Categorisation

The MNIST database is an incredibly common "first dataset", but that doesn't mean it's not significant. The great thing about this database is that it is actually very meaningful. The task definition is to categorise handwritten digits correctly. The problem is essentially solved, and the technique for best solving it is a neural network. 

It is not, however, what is known as a 'toy problem'. Categorising digits is how automation of mail distribution based on postcodes works. Categorising handwriting in general is used for optical character recognition (OCR, digitisation of records) and security (captchas). Categorisation of images in general has very wide application and has many unsolved aspects.

Working on a problem with a "known good" solution allows us to focus on the software engineering and problem-solving aspects without the concern that we might be just wandering down a path that goes absolutely nowhere.

The scale of this task is also appropriate for the level of computing resources available to large numbers of people. 

The STML-MNIST series of posts will involve the application of multiple machine learning strategies to this particular task. This will allow us to cross-compare the various strategies in terms of performance, complexity and challenge. While I have worked through tutorials on the neural network solution before, I have not done a deep dive. This series of posts will be a deep dive.

A fantastic tutorial to work through, solving this very problem, is available online. The linked tutorial uses a neural network approach, rather than a random forest approach. You could go down that path if you liked, or follow along with my random forest approach.

Whether you choose to replicate the result using a neural net or a random forest, I would recommend completing that on your own before proceeding much further in this post. For the moment, just try to get your own code working, rather than concentrating on understanding each step in depth. If you have replicated my setup, you can use my IPython Notebook as a short-cut to a working version. The notebook can be read online, or accessed via the git repository.

... time passes ...

Okay, welcome back! I assume that it's been a couple of days since you diverted into the linked tutorial, and that you managed to successfully train a neural network to solve the problem to a high standard of quality. Great work! The neural net really solves the problem amazingly well, whereas my random forest approach reached about 90% success (still not bad, given there was no attempt to adapt the algorithm to the problem). I deliberately chose a set of RF parameters similar to those chosen for the Kaggle ocean dataset.
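As a rough sketch of this kind of random forest baseline -- using scikit-learn's small bundled 8x8 digits set as a convenient stand-in for the full MNIST data, and with illustrative rather than tuned parameters:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# The small bundled digits set (8x8 images); the full MNIST data works
# the same way through this API, just with 28x28 images and longer training.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=42)

# Parameters are illustrative, not tuned to the problem.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Fraction of held-out digits categorised correctly.
score = clf.score(X_test, y_test)
print(score)
```

On this small dataset the untuned forest already scores well; the point is how little problem-specific work was needed to get there.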

The MNIST dataset was faster to process -- it took around a minute to train on, rather than 5 minutes. This makes it slightly more amenable to a tighter experimentation loop, which is great. Later on we can get into the mind-set of batch-mode experiments, setting off dozens at once to run overnight or while we do something else for a while. 

What we've found so far shows two things. First, a random forest approach is mind-bogglingly flexible; it's probably not going to get beaten as a general-purpose, proof-of-concept algorithm. Second, it only goes so far, and the next level is probably a neural net.

Each new problem will require its own investigative process to arrive at a good solution, and it is that investigative process that is of interest here.

I propose the following process for exploring this dataset further:
  • Comparisons of alternative neural network designs
  • Visualisations of intermediate results
  • Comparison against alternative machine learning methodologies
A list of techniques which could be applied here include:
  • Random forest
  • Decision tree optimisation
  • Comparison against reference images (error/energy minimisation)
Now, I would like each of my posts to be pretty easy to consume. For that reason, I'm not going to do all of that right now. Also, I don't think I'm a fast enough worker to get all the way through that process in just one week. Rather, following the referenced tutorial is the first week's activity for this project theme. I will undertake to apply one new technique to the dataset, on a roughly weekly schedule, and report back...

In the meantime, happy coding!


Wednesday, March 25, 2015

My Data Science / Machine Learning Technology Stack

The Setup

Let's talk briefly about my technology stack. I work in Python, for a number of reasons. Focusing on a single language allows me to learn more about its nuances. Many of the techniques here are well-supported in other languages, but I trip over my feet more when using them. I'm better off going with my strengths, even if that means foregoing some great tools.

So first up, here's the list:
  • A 2011 Macbook Air laptop for my main computing system
  • Python 2.7, using pandas, numpy, sqlite3 and theano (not a full list)
  • IPython Notebook for putting together walkthroughs and experiments
  • Access to a CUDA-enabled NVidia GPU-enabled machine for performance work
  • An external two-terabyte hard drive for much needed disk space
  • A copy of several significant machine learning databases, including
    • MNIST data
    • Two kaggle competition datasets (ocean plankton and diabetic retinopathy)
    • The Enron mail archive
Side note -- If you want to get your local environment set up, a great goal will be to follow along with This will exercise your setup using most of the tools and libraries that you will be interacting with.

So far, this has proved to be a tolerably effective setup. By far, the two hardest aspects of the problem are leveraging the GPU and accessing disk space at a reasonable speed. GPU-based libraries are quite disconnected from general libraries, and require a different approach. There are compatibility elements (such as conversion to and from numpy arrays), but essentially the GPU is a different beast to the CPU and can't simply be abstracted away ... yet.

The Macbook Air is limited in its amount of RAM and on-board storage. That said, almost any laptop is going to be very limited compared to almost any desktop for the purposes of machine learning. Basic algorithms such as linear regression are easy enough that the laptop is no problem; intensive algorithms such as deep learning are hard enough that you will want to farm the processing out to a dedicated machine. You'll basically want a gaming setup.

If I'm lucky, I may be upgrading to a 13" 2015 Macbook Pro shortly. While this will ease the constraints, the same fundamental dynamics will remain unchanged. It is still not sufficient for deep learning purposes, and won't really change the balance all that much. Unless you have a beast of a laptop with an on-board NVidia GPU, you'll be looking at a multi-machine setup. Note - the 15" MBP does have an Nvidia GPU, but I don't want the size.

Working on Multiple Machines

Some people will have easy access to a higher-end machine, some won't. Some people will be dismayed not to be able to work easily on a single machine, others will be fine with it. Unlike many, I won't be recommending using AWS or another cloud provider. They are fine if the charges represent small change to you, but because of the disk and GPU requirements of the machines, you're not going to be looking at entry-level prices. You would be much better off spending $200 on the Nvidia Jetson TK1 and using that as a modular processing engine for your laptop.

For small data sets like MNIST, your laptop will be totally fine. For medium-size sets, like the ocean plankton, you will be completely unable to complete the tasks on a CPU in a tolerable time-frame.

For large data sets, you may well be unable to fit the data onto the onboard storage of a single machine, especially a laptop. For most of us, that will mean using an external USB drive.

There are many ways to effectively utilise multiple machines. Some are based on making those machines available as "slaves" to a master or controller system, others will be based on remote code execution, perhaps through shell scripts. Future blog posts will cover the how and why of this. For the moment, the important thing to take away from this is that you'll either be working on small data sets or on multiple machines.
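As a sketch of the remote-execution style, driving a worker machine over ssh from Python (the host name and script path below are placeholders, not a real setup):

```python
import subprocess

def build_remote_command(host, script, args=()):
    """Build an ssh command line that runs a training script on a
    remote worker. Host and script names here are hypothetical."""
    return ["ssh", host, "python", script] + list(args)

def run_remote(host, script, args=()):
    # check=True raises CalledProcessError if the remote job fails,
    # so broken runs don't pass silently.
    return subprocess.run(build_remote_command(host, script, args),
                          check=True)

# Example: what gets executed for a hypothetical GPU box.
cmd = build_remote_command("gpu-box.local", "train.py", ["--epochs", "10"])
print(cmd)
```

A master/slave framework does the same thing with more machinery; for a handful of machines, plain ssh invocations like this go a long way.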

Why NVidia? Isn't that favouritism?

Short version: It's what the library ecosystem supports best right now. NVidia GPUs support CUDA, which is the only GPU controller language the best Python libraries support properly. If you want to work on machine learning problems, rather than on adapting libraries to new GPU platforms, you'll need to go down the proprietary path. In a couple of years, I expect that the software world will have caught up to more architectures. If you're working on problems which can be reasonably solved on a CPU, then no problem. If you're an amazing software wizard who can comfortably modify the back-end architecture, then you're probably reading the wrong blog anyway.

Tuesday, March 24, 2015

First steps to Kaggling: Making Any Improvement At All

Practical Objective:

Update: I realised it might have been better to precede this post with an outline of how to get set up technically for running the examples. I'll post that one tomorrow! Experienced developers should have no trouble getting by, but for everyone else it won't be a long wait before a how-to on getting set up.

The first goal is to try to make a single improvement, of any degree, to the implementation provided in the Kaggle tutorial for the "ocean" data set (see below). This will exercise a number of key skills:
  • Downloading and working with data
  • Preparing and uploading submissions to Kaggle
  • Working with images in memory
  • Applying the Random Forest technique to images
  • Analysing and improving the performance of an implementation

Discussion and Background

Machine learning and data science is replete with articles which talk about the power, elegance and ease by which tools can be put to complex and new problems. Domain knowledge can be discarded in a single afternoon, and you can just point your GPU at a neural network and go home early.

Well, actually, it's darned fiddly work, and not everyone finds it so easy. For an expert with a well-equipped computer lab, perhaps a few fellow hackers and plenty of spare time, it probably isn't such a stretch to come up with a very effective response to a Kaggle competition in the time frame of about a week. For those of us working at home on laptops and a few stolen hours late in the evening or while the family is out doing other things, it's more challenging.

This blog is about that journey. It's about finding it actually hard to download the data because for some reason the download rate from the Kaggle servers to Melbourne, Australia is woeful, and so it actually takes a week to get the files onto a USB drive, which by the way only supports USB 2.0 transfer rates and isn't really a good choice for a processing cache...

So this post will be humbly dedicated to Making Any Difference At All. Many Kaggle competitions are accompanied with a simple tutorial that will get you going. We'll use the (now completed) competition into identifying Plankton types. I'll refer to it as the "ocean" competition for the sake of brevity.

Back to the Problem

I would recommend working in the IPython Notebook environment. That's what I'll be doing for the examples contained in this blog, and it's a useful way to organise your work. You can take a short-cut and clone the repository for this blog from

Next, download the ocean data, and grab a copy of the tutorial code:

It's a nice tutorial, though a couple of key steps are omitted, such as how to prepare your submission files. It implements a technique called a random forest. This is a highly robust way to get reasonably good solutions fast. It's fast in every way: it works well enough on a CPU, it doesn't need a lot of wall-clock time even on a modest-spec laptop, it doesn't take much time to code up, and it also doesn't require a lot of specialist knowledge.
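Since preparing submission files is one of the omitted steps, here is a sketch of what it involves: a CSV with one row per test image and one probability column per class. The image names, class names and probabilities below are made up for illustration -- check the competition page for the exact required header.

```python
import numpy as np
import pandas as pd

# Hypothetical predicted class probabilities: one row per test image,
# one column per plankton class (the real set has ~121 classes).
image_names = ["img_1.jpg", "img_2.jpg"]
class_names = ["acantharia_protist", "copepod"]  # truncated for illustration
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])

# Assemble the frame with the image name as the first column.
submission = pd.DataFrame(probs, columns=class_names)
submission.insert(0, "image", image_names)
submission.to_csv("submission.csv", index=False)
print(submission.shape)
```

Each row's probabilities should sum to one; Kaggle's log loss scoring punishes confident wrong answers hard, so clipping extreme probabilities is a common precaution.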

There are a lot of tutorials on how to use random forests, so I won't repeat myself here. The tutorial linked above is a good example of a tutorial. You can work through it in advance, or just try to get the notebooks I've shared working in your environment.

This post is about how to take the next step -- going past the tutorial into your own work. First -- just spend some time thinking about the problem. Here are a few key things about this post which took me by surprise when I reflected on it:

  • The random forest is actually using every pixel as a potential decision point. Every pixel is a thresholded yes/no point on a (forest of) decision tree(s).
  • A decision tree can be thought about as a series of logical questions; it can also be thought about as a series of data segmentations into two classes. I've only seen binary decision trees; I suppose there must be theoretical decision trees which use a multi-class approach, but I'm just going to leave that thought hanging.
  • The algorithm doesn't distinguish between image content (intensity values) and semantic content (e.g. aspect ratio, image size)
  • Machine learning is all about the data, not the algorithm. While there is some intricacy, you can basically just point the algorithm at the data, specify your basic configuration parameters, and let it rip. So far, I've found much more work to be in the data handling rather than the algorithm tuning. I suspect that while that's true now, eventually the data loading will become more routine and my library files more sophisticated. For new starters, however, expect there to be a lot of time working with data and manually identifying features.
  • It's entirely possible to run your machine out of resource and/or have it just sit there and compute for longer than my patience allows. Early stopping and the use of simple algorithms lets you explore the data set much more effectively than training the algorithms until they plateau.
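The first bullet above -- every pixel as a potential decision point -- is simply a consequence of flattening each image into a feature vector before handing it to the forest. A minimal sketch (tiny random images standing in for the resized plankton images):

```python
import numpy as np

# Four tiny 5x5 "images" as a stand-in for the resized plankton images.
images = np.random.rand(4, 5, 5)

# Flatten each image: every pixel becomes one feature column, and the
# forest is then free to threshold any pixel at any tree node.
X = images.reshape(len(images), -1)
print(X.shape)  # (4, 25)
```

Nothing tells the algorithm that column 12 is a pixel and not, say, an aspect ratio -- which is exactly the third bullet's point about image content versus semantic content.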

I have created a version of the tutorial which includes a lot of comments and additional discussion of what's going on. If there's something I haven't understood, I've also made that clear. This blog is *not* a master-to-student blog, it's just one hacker to another. I would suggest reading through the notebook in order to learn more about what's going on at each point in the tutorial.

You don't need to run it yourself; I have made a copy available at:

This first notebook is simply a cut-down and annotated version of the original tutorial. It contains my comments and notes on what the original author is doing.

The next step is to try to make any improvement at all. I would suggest spending a little time considering what additional steps you would like to try. If you have this running in a notebook on your own machine, you could try varying some of the parameters in the existing algorithm, such as the image size and the number of estimators. Increasing both of these could well lead to some improvement. I haven't tried either yet, so please post back your results if you do.

I tried the approach of including the size of the original image as an additional variable at the input layer, via a simple count of the number of pixels. An interesting feature of this data set is that the image size is matched to the actual size of the represented organism; normally, you'd have to take account of the fact that distance from the camera will affect the size of the object within the image. Only three lines changed -- you should try to find them within create_X_and_Y().
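In rough outline, the change looks like this -- a simplified sketch of the idea, not the exact lines from create_X_and_Y():

```python
import numpy as np

def add_size_feature(X, original_images):
    """Append each original image's pixel count as one extra feature column.

    X is the flattened, resized feature matrix; original_images holds the
    images at their pre-resize dimensions, which is where the organism-size
    signal lives in this data set.
    """
    sizes = np.array([img.shape[0] * img.shape[1] for img in original_images])
    return np.hstack([X, sizes[:, np.newaxis]])

# Two hypothetical originals of different sizes, already resized down
# to a 3-feature vector each for the sake of the example.
originals = [np.zeros((10, 8)), np.zeros((25, 30))]
X = np.zeros((2, 3))
X2 = add_size_feature(X, originals)
print(X2[:, -1])  # pixel counts: 80 and 750
```

The forest then treats the size column like any other feature, and can split on it wherever it is informative.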

A web-visible copy is available at:

This made some positive difference -- a 0.19 point improvement in the log loss score (as measured by the test/train data provided).
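For reference, the log loss metric itself is easy to compute locally on a held-out split with scikit-learn. The labels and probabilities below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true labels and predicted probabilities for three classes.
y_true = [0, 1, 2, 1]
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.2, 0.2, 0.6],
                   [0.3, 0.5, 0.2]])

# Mean negative log of the probability assigned to the true class.
# Lower is better.
loss = log_loss(y_true, y_prob)
print(loss)
```

Measuring this way on your own test/train split is what lets you see an improvement before spending one of your Kaggle submissions on it.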

Let's check out the Kaggle leaderboard to see what this means in practical terms. Firstly, it's worth noting that simply implementing the tutorial would have put us better than last -- much better. We could have scored around position 723, out of a total of 1049 submissions.

Our improved score would place us at 703, 20 places further up.

The difference between the leader and the second-place getter was ~0.015. Our improvement was at least bigger than that! On the other hand, we still have a very long way to go before achieving a competitive result.

Since the competition is now closed, the winner has posted a comprehensive write-up of their techniques. We could go straight there, read through their findings, and take that away immediately. However, I'm more interested in exploring the investigative process than I am in re-using final results. I want to test the process of discovery and exploration, not refine the final knowledge. The skill to be learned is not a knowledge of what techniques eventually won (which is relevant) but the development of investigative skill (which is primary).

We should also take note of the number of entries the winners submitted. The top three competitors submitted 96 times, 150 times and 134 times respectively. They also used more compute-heavy techniques, which means they really sank some time into getting to their goal. The winning team was a collaboration between 7 people. I would expect each submission would have another 1-5 local experiments tried beforehand. This means that people are trying 300-500 (roughly) local hypotheses before making a submission. Taken that way, a single-attempt improvement of 0.19 doesn't seem so bad!

The next several posts will continue the theme of investigative skill, developing a personal library of code, and methods for comparing techniques. We'll explore how to debug networks and how to cope with large data.

Until next time, happy coding! :)
-Tennessee Leeuwenburg

Don't Click Me Results: Even More Linkbait

Okay, so I should have seen this coming. My last post, "Don't click me if you're human", generated over double my ordinary rate of traffic. My sincere thanks go out to those 22 souls who left comments so that I could discount them from the accounting.

I find it a bit hard to get the data I want at a glance from the blogger interface. I categorise my blog's traffic into the following categories:

  1. Number of visits while dormant (not really posting much)
  2. Number of visits on days I post
  3. Number of visits the day after that
  4. Number of visits during non-posting days, but not dormant

What I'd really like to get at is: how many humans do I actually provide some value to, as a proportion of my visits? I'm not really an analytics expert...

Feedback comments are relatively rare. That's fine, but I do still hope that people are enjoying the posts.

The number of visits when dormant seems to be about 20-30 per day. I'm assuming that's robots. However, it's logical to assume that, like with humans, there will be more visits on posting days than on non-posting days.

My first two posts on this blog attracted something like 800 page views. Yay, great! I'm guessing that's maybe 400 actual people, and maybe 50 of those read the whole post, and maybe 35 of them really got something out of it. That's pretty good, it's a rare day that I manage to actually help 35 people in the rest of my life.

My "don't click me" post got 2100 page views. *facepalm*.

Now, presumably all (2100 - 800) of those are actual people.

I don't really know what it all means, except that the observer effect is very real, and linkbait works like a charm.

So, stay tuned for my next amazing post: "I'm going to waste all your time and you'll hate me for it".

Until next time, happy coding!

Friday, March 20, 2015

If you are a human: don't read me or click me. I'm trying to estimate robot traffic.

If you are a human, and you got here by clicking a link, please leave a comment below. I'm attempting to estimate the amount of non-human traffic which is triggered when I make a blog post. Please help by either not clicking through to this article, or by leaving a comment if you did.

Monday, March 16, 2015

The basic model for this blog

How Things Will Proceed

Hi there! I've been busy since the last post. I've been thinking mainly about the following areas:

[Image: a scam drug that's guaranteed to work instantly. Source: Wikipedia article on Placebo drugs; license: public domain]
  • What differentiates this blog for my readers?
  • What is the best way, for me, of developing my knowledge and mastery of these techniques?
  • What pathway is also going to work for people who are either reading casually, or interested in working through problems at a similar pace?
  • Preparing examples and potential future blog posts...
I think I have zeroed in on something that is workable. I believe in an integrative approach to learning -- namely that incorporating information from multiple disparate areas results in insights which aren't possible when considering only a niche viewpoint. At the same time, I also believe it's essentially impossible to effectively learn from a ground-up, broad-based theoretical presentation of concepts. The path to broad knowledge is to start somewhere accessible, and then fold in additional elements from alternative areas.

I will, therefore, start where I already am: applying machine learning for categorisation of images. At some point, other areas will be examined, such as language processing, game playing, search and prediction. However, for now, I'm going to "focus". That's in inverted commas (quotes) because it's still an incredibly broad area for study. 

The starting point for most machine learning exercises is with the data. I'm going to explain the data sets that you'll need to follow along. All of these should be readily downloadable, although some are very large. I would consider purchasing a dedicated external drive for this if you can: disk space requirements may reach several hundred gigabytes, particularly if you want to store your intermediate results.

The data sets you will want are:
  • The MNIST database. It's included in this code repository which we will also be referring to later when looking at deep learning / neural networks:
  • The Kaggle "National Data Science Bowl" dataset:
  • The Kaggle "Diabetic Retinopathy" dataset:
  • Maybe also try a custom image-based data set of your own choosing. It's important to pick something which isn't already covered by existing tutorials, so that you are effectively forced into the process of experimentation with alternative techniques, but which can be considered a categorisation problem so that similar approaches should be effective. You don't need to do this, but it's a fun idea. You could use an export of your photo album, the results of google image searches or another dataset you create yourself. Put each class of images into its own subdirectory on disk.
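If you do build a one-class-per-subdirectory dataset as suggested above, pairing each file with its label is just a directory walk. A minimal sketch (paths are hypothetical):

```python
import os

def list_labelled_images(root):
    """Yield (filepath, class_name) pairs for a directory laid out as
    root/<class_name>/<image files>, one subdirectory per class."""
    for class_name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, class_name)
        if not os.path.isdir(class_dir):
            continue  # skip stray files sitting next to the class folders
        for fname in sorted(os.listdir(class_dir)):
            yield os.path.join(class_dir, fname), class_name
```

From there you can load the images with whatever library you prefer and feed the class names in as your target labels.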
For downloading data, I recommend Firefox over Chrome, since it is much more capable at resuming interrupted downloads. Many of these files are large, and you may genuinely have trouble. Pay attention to your internet plan's download limits if you have only a basic plan.

The next post will cover the technology setup I am using, including my choice of programming language, libraries and hardware. Experienced Python developers will be able to go through this very fast, but modern hardware does have limitations when applying machine learning algorithms, and it is useful to understand what those are at the outset.

Following that will be the first in a series of practical exercises aimed to obtain a basic ability to deploy common algorithms on image-based problems. We will start by applying multiple approaches to the MNIST dataset, which is the easiest starting point. The processing requirements are relatively low, as are the data volumes. Tutorials for solving this problem already exist online. This is particularly useful to start with, since it gives you ready-made benchmarks for comparison, and also allows easy cross-comparison of techniques.

I'd really like it if readers could reply with their own experiences along the way. Try downloading the data sets -- let me know how you go! I'll help if I can. I expect that things will get more interesting when we come to sharing the experimental results.

Happy coding,