Tuesday, May 5, 2015

Network Configuration Exploration

Let's take stock. We have a primitive model (A) and a best-performing model (B). We are breaking down the differences between the two models to understand how each difference contributes to the observed performance gains. The hope is to learn a standard best practice, and thereby start from a higher base in future. At the same time, we are hoping to learn whether there is anything we could do to take model (B) and extend its performance further.

We have dealt with the input-side differences -- shuffling and scaling. We now move on to the network's internals -- the network configuration. This means its layers, activation functions and connections. Let's call the version of Model A upgraded with those input-side fixes Model A2.

Model (A) has a single hidden layer. Model (B) is far deeper, with three sets of three layers, plus the input layer and the output layer. Model A uses a non-linear activation function for both the hidden layer and the output layer. Model B uses a non-linear activation function for the output layer, but relies mostly on linear processing internally to get its work done.
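
To make the shape of Model A concrete, here is a minimal sketch using the Keras API purely as an illustration -- the post doesn't say which library was used, and the input width, hidden width, 10-class output, sigmoid activation and optimiser are all assumptions.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Model A: one hidden layer, non-linear activations on both the hidden
# and the output layer. All sizes here are illustrative, not from the post.
model_a = Sequential()
model_a.add(Dense(64, activation="sigmoid", input_shape=(784,)))  # single hidden layer
model_a.add(Dense(10, activation="softmax"))                      # non-linear output layer
model_a.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
```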

I got a bit less scientific at this point ... not knowing where to start, I took Model B and started fiddling with its internals, based on nothing more than curiosity and interest. I removed one of the layer-sets entirely and replaced the activation function of another with 'softmax'. That network began its training more quickly, but in due course finished with identical performance.

So I removed that softmax layer-set as well, leaving a simpler configuration with just the input layer, a linear-rectifier layer, and the final softmax activation layer. This was worse, but also interesting. For the first time, the validation loss was worse than the training loss. To my mind, this is the network 'memorising' rather than 'generalising'. The final loss was 0.53, which was better than Model A, better than a Random Forest, but much worse than Model B. This maybe gives us some more to go on. Let's call this new model B1.
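
A bare-bones reading of that description might look like the sketch below (again Keras-style and purely illustrative; the batch-normalisation and dropout stages mentioned in the next paragraph are left out here and sketched separately afterwards). The layer widths, optimiser and epoch count are assumptions, and x_train / y_train are placeholders for whatever training data is in use.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Model B1: input, one linear-rectifier (ReLU) layer, softmax output.
model_b1 = Sequential()
model_b1.add(Dense(128, activation="relu", input_shape=(784,)))  # linear-rectifier layer
model_b1.add(Dense(10, activation="softmax"))                    # final softmax activation
model_b1.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Holding out a validation split makes the memorising-vs-generalising gap visible
# as a growing difference between the "loss" and "val_loss" curves:
# history = model_b1.fit(x_train, y_train, validation_split=0.2, epochs=20)
```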

There are still some key differences between Model A2 and Model B1. B1 actually uses simpler activations, but it includes both Batch Normalisation and Dropout stages in its processing, which we haven't talked about before. Which of those differences are important to the improvement in performance?
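
For orientation, here is a sketch of where those two stages might sit around the linear-rectifier layer in B1. The ordering (Dense, then BatchNormalization, then Dropout), the dropout rate of 0.5 and the widths are assumptions -- the post only says that both stages are present.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Dropout

model_b1_full = Sequential()
model_b1_full.add(Dense(128, activation="relu", input_shape=(784,)))  # linear-rectifier layer
model_b1_full.add(BatchNormalization())   # normalise activations across each mini-batch
model_b1_full.add(Dropout(0.5))           # randomly zero half the activations during training
model_b1_full.add(Dense(10, activation="softmax"))
model_b1_full.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```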

This gives us the direction for the next steps. My feeling is that batch normalisation and dropout are each worth examining independently. The next posts will focus on what those techniques do, and what their impact is when they are added to a more basic model or removed from a more sophisticated one.