Last post we looked at the "input shuffling" technique. This time we're looking at input scaling, just one of several input-modification strategies we could put in place. Unlike input shuffling, which makes intuitive sense to me, input scaling does not: absolute values can in fact be important, and it feels like scaling is actually removing information.
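To pin down what "input scaling" means here, a minimal sketch of the standard approach: compute per-feature statistics on the training set only, then apply the same transform to validation data. (The feature values below are made up for illustration.)

```python
import numpy as np

def fit_scaler(X_train):
    """Compute per-feature mean and standard deviation from the training set only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    return mean, std

def scale(X, mean, std):
    """Standardise features to zero mean and unit variance."""
    return (X - mean) / std

# Fit on the training data, then apply the SAME transform to validation data --
# fitting separately on the validation set would leak information.
X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_valid = np.array([[2.0, 300.0]])
mean, std = fit_scaler(X_train)
X_train_s = scale(X_train, mean, std)
X_valid_s = scale(X_valid, mean, std)
```

The point of the transform is that every feature ends up on a comparable scale, so no single input dominates the first layer's weighted sums early in training.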

Let's take the same approach as last time: add it to a basic network, and remove it from a well-performing one. This time round we also have an extra question -- are the benefits of input scaling independent of the benefits from input shuffling?

The first thing I did was add input scaling to the network design as we had it at the end of the last post, and ran it for 20 iterations. This is a shuffled, scaled, three-layer architecture. The performance here is much, much better. After 20 iterations, we reach a valid loss of 0.59 and a train loss of 0.56. That's not as good as our best network, but it's a big step up. If I run it out to 150 iterations, I get a valid loss of 0.558. As a reminder, our current best network hits 0.5054 after 20 iterations.

Let's review:

Design Zero achieved a valid loss of about 0.88 -- much worse than Random Forest.

Design Zero eventually cranked its way down to 0.6 or so after 800 iterations.

Design Zero + shuffling hits 0.72 after 13 iterations.

Design Zero + shuffling + scaling hits 0.59 after 20 iterations (slightly worse than Random Forest).

Design Zero + shuffling + scaling wanders off course and degrades to 0.94 after 800 iterations.

Interesting. Performance has improved greatly, but we have introduced an early-stopping problem, where the optimum result appears part-way through the training run. Early stopping is loosely the inverse of the halting problem: the difficulty isn't knowing whether training will stop, it's knowing when it should. For the time being, I don't want to go down that rabbit hole, so I'll just stop at 20 iterations and concentrate on comparing performance there.
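For completeness, a minimal sketch of what a patience-based early stopper would look like, without going down the rabbit hole: track the best validation loss seen so far and stop once it hasn't improved for a few iterations. The loss values below are hypothetical, chosen to mimic a curve that bottoms out and then degrades.

```python
def best_iteration(losses, patience=3):
    """Return (index, loss) of the best validation loss, stopping the scan
    once the loss has failed to improve for `patience` consecutive iterations."""
    best_loss = float("inf")
    best_iter = 0
    since_best = 0
    for i, loss in enumerate(losses):
        if loss < best_loss:
            best_loss, best_iter, since_best = loss, i, 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # assume it won't recover; stop here
    return best_iter, best_loss

# A curve that improves, bottoms out, then drifts upward:
curve = [0.88, 0.72, 0.63, 0.59, 0.61, 0.64, 0.70]
best_iter, best_loss = best_iteration(curve)
```

The `patience` parameter is the usual knob: too small and you stop on noise, too large and you waste compute past the minimum.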

Our current design is performing at 0.59, vs 0.50 for the best network. Random Forest came in at 0.57, so we're now close to the benchmark level using a pretty basic network design and the most basic input techniques.

Let's see what happens when we pull scaling out of our best design. After 20 iterations, without scaling, the performance is at 0.5058. That's a tiny bit worse than the 0.5054 it achieves with scaling, but you'd be hard-pressed to call the difference significant. Whatever benefit scaling provides has, presumably, largely been absorbed into the network design, perhaps implicitly folded into the weights and biases of the first layer. In fact, the performance at each epoch is virtually identical as well -- the early training results don't change either.

The big relief is that it didn't cost us anything. The verdict on scaling is that it can be very beneficial in some circumstances, and doesn't seem to count against us. I still have a nagging philosophical feeling that I'm 'hiding' something from the NN by pre-scaling, but until I can think of a better idea, it looks like scaling is another one of those "every time" techniques.