Let's just straight to the punch: Always randomise the order. The network will start effective training much sooner. Failure to do so may jeopardise the final result. There appears to be no downside. (if anyone has an example where it actually makes things worse, please let me know)
Here's a screenshot from my Ipython notebook of the initial, poorly-performing network. Sorry for using an image format, but I think it adds a certain amount of flavour to show the results as they appear during development. This table shows that after 13 iterations (picked because it fits in my screenshot easily), the training loss is 1.44, and the % accuracy is 11.82%. This network eventually trains to hit a loss of about 0.6, but only after a large (800-odd) number of iterations.
It takes 100 iterations or so to feel like the network is really progressing anywhere, and it's slow, steady progress the whole way through. I haven't run this network out to >1k iterations to see where it eventually tops out, but that's just a curiosity for me. Alternative techniques provide more satisfying results much faster.
Here's the improved result. I added a single line to the code to achieve this effect:
train = shuffle(train)Adding the shuffling step wasn't a big deal -- it's fast, and conceptually easy as well. It's so effective, I honestly think it would be a good thing for NN libraries to simply to by default rather than leave to the user. We see here that by iteration 13, the valid loss is 0.72 as opposed to 2.53 in the first example. That's pretty dramatic.
If anyone knows of any examples where is it better not to randomise the input examples, please let me know!!!
For the time being, I think the benefits of this result are so clear, that a deeper investigation is not really called for in fact. I'm just going to add it to my 'standard technique' for building NNs going forward, and consider changing this step only if I am struggling with a particular problem-at-hand. What's more likely is that more sophisticated approaches to input modification will be important, rather than avoiding the step entirely. I'm aware that many high-performing results have been achieved by transforming input variables to provide additional synthetic learning data into the training set. Examples of this include image modifications such as skew, colour filtering and other similar techniques. I would be very interested to learn more about other kinds of image modification preprocessing, like edge-finding algorithms, blurring and other standard algorithms.
Interestingly, the final results for these two networks are not very different. Both of them seem to train up to around the maximum capability of the architecture of the network. Neither result arising from this tweak only approach the performance provided by the best known alternative approach that I have copied from.
I wondered whether shuffling the inputs was a truly necessary step, or just a convenient / beneficial one. If you don't care about letting the machine rip through more iterations, then is this step really relevant? It turns out the answer is a resounding "Yes", at least sometimes.
To address this I took the best-performing code (as discussed in the last post) and removed the shuffle step. The result was a network which trained far more slowly, and moreover did not reach the optimal solution nor even approach it. The well-performing network achieved a valid loss of 0.5054, which looks pretty good compared to random forest.
As an interesting aside, there is a halting problem here. How do we know, for sure, that after sufficient training this network isn't suddenly going to 'figure it out' and train up to the required standard? After all, we know for a fact that the network nodes are capable of storing the information of a well-performing predictive system. It's clearly not ridiculous to suggest that it's possible. How do we know that the network is permanently stuck? Obviously indications aren't good, and just as obviously we should just use the most promising approach. However, this heuristic is not actually knowledge.
Question for the audience -- does anyone know if a paper has already been publishes analysing the impacts of input randomisation across many examples (image and non-image) and many network architectures? What about alternative input handling techniques and synthetic input approaches?
Also, I haven't bothered uploading the code I used for these examples to the github site, since they are really quite simple and I think not of great community value. Let me know if you'd actually like to see them.