Thursday, April 30, 2015

The Effect of Randomising the Order of Your Inputs

This post is going to focus on the impacts on a single example of randomising the order of your inputs (training examples) in one particular problem. This will look at the effect of adding input shuffling to a poorly-performing network, and also the effect of removing it from a well-performing network.

Let's just straight to the punch: Always randomise the order. The network will start effective training much sooner. Failure to do so may jeopardise the final result. There appears to be no downside. (if anyone has an example where it actually makes things worse, please let me know)

Here's a screenshot from my Ipython notebook of the initial, poorly-performing network. Sorry for using an image format, but I think it adds a certain amount of flavour to show the results as they appear during development. This table shows that after 13 iterations (picked because it fits in my screenshot easily), the training loss is 1.44, and the % accuracy is 11.82%. This network eventually trains to hit a loss of about 0.6, but only after a large (800-odd) number of iterations.

It takes 100 iterations or so to feel like the network is really progressing anywhere, and it's slow, steady progress the whole way through. I haven't run this network out to >1k iterations to see where it eventually tops out, but that's just a curiosity for me. Alternative techniques provide more satisfying results much faster.

Here's the improved result. I added a single line to the code to achieve this effect:

train = shuffle(train)
Adding the shuffling step wasn't a big deal -- it's fast, and conceptually easy as well. It's so effective, I honestly think it would be a good thing for NN libraries to simply to by default rather than leave to the user. We see here that by iteration 13, the valid loss is 0.72 as opposed to 2.53 in the first example. That's pretty dramatic.

If anyone knows of any examples where is it better not to randomise the input examples, please let me know!!!

For the time being, I think the benefits of this result are so clear, that a deeper investigation is not really called for in fact. I'm just going to add it to my 'standard technique' for building NNs going forward, and consider changing this step only if I am struggling with a particular problem-at-hand. What's more likely is that more sophisticated approaches to input modification will be important, rather than avoiding the step entirely. I'm aware that many high-performing results have been achieved by transforming input variables to provide additional synthetic learning data into the training set. Examples of this include image modifications such as skew, colour filtering and other similar techniques. I would be very interested to learn more about other kinds of image modification preprocessing, like edge-finding algorithms, blurring and other standard algorithms.

Interestingly, the final results for these two networks are not very different. Both of them seem to train up to around the maximum capability of the architecture of the network. Neither result arising from this tweak only approach the performance provided by the best known alternative approach that I have copied from.

I wondered whether shuffling the inputs was a truly necessary step, or just a convenient / beneficial one. If you don't care about letting the machine rip through more iterations, then is this step really relevant? It turns out the answer is a resounding "Yes", at least sometimes.

To address this I took the best-performing code (as discussed in the last post) and removed the shuffle step. The result was a network which trained far more slowly, and moreover did not reach the optimal solution nor even approach it. The well-performing network achieved a valid loss of 0.5054, which looks pretty good compared to random forest.

Here is the well-performing network with input shuffling removed. You can see here that the training starts off badly, and gets worse. Note the "valid loss" is the key characteristic to monitor. The simple "loss" improves. This shows the network is remembering prior examples well, but extrapolating badly.

After 19 rounds (the same number of iterations taking by the most successful network design), avoiding the shuffling step results a network that just wanders further and further off base. We end up with a valid loss of 13+, which is just way off.

As an interesting aside, there is a halting problem here. How do we know, for sure, that after sufficient training this network isn't suddenly going to 'figure it out' and train up to the required standard? After all, we know for a fact that the network nodes are capable of storing the information of a well-performing predictive system. It's clearly not ridiculous to suggest that it's possible. How do we know that the network is permanently stuck? Obviously indications aren't good, and just as obviously we should just use the most promising approach. However, this heuristic is not actually knowledge.

Question for the audience -- does anyone know if a paper has already been publishes analysing the impacts of input randomisation across many examples (image and non-image) and many network architectures? What about alternative input handling techniques and synthetic input approaches?

Also, I haven't bothered uploading the code I used for these examples to the github site, since they are really quite simple and I think not of great community value. Let me know if you'd actually like to see them.