Basic Convolutional Neural Network Visualization (Part 1)
Image processing and computer vision have gained a lot of popularity recently and have become some of the hottest areas for applying AI. In our modern world, these applications can help cars drive autonomously, increase productivity in agriculture by analyzing imagery of large crop fields, and detect spots in fields where attention and intervention are required. Image processing and computer vision can monitor and control traffic, and even let you shop for clothing without actually trying it on! Most computer vision applications are very straightforward to understand because they are interactive and visual. It gets a little harder, however, when you try to grasp the concepts and processes that fuel computer vision AI under the hood. This is why I would like to shed a little light on how these types of AI applications work by considering one of the most heavily used types of artificial neural networks in computer vision: convolutional networks, or simply conv nets.
Conv nets have shown outstanding performance in image classification and object detection tasks, in some cases exceeding human-level accuracy. This is achieved by incorporating convolutional layers, which perform the convolution operation on visual input data (Fig. 1), into image-processing neural network architectures. To understand the convolution operation, we must first consider an input image as a matrix of values. Each cell of the matrix represents a single pixel of the image, and the value assigned to that cell represents the intensity of that pixel. Generally speaking, the convolution operation starts by applying a small filter (also called a kernel) to a subset of the input data: we calculate the dot product between the filter and a portion of the input image. This dot product becomes the value of one cell of another matrix, which forms the output of the convolutional layer and is known as a feature map. The feature map is built by sliding the filter over the entire image with a certain step (the stride) and calculating a dot product (one feature map element) at each step. Conv nets usually use several filters per layer, so the output of a convolutional layer may contain several feature maps. It is common to illustrate input images, feature maps, and filters as cubic blocks, since these entities are three-dimensional (k x n x m), where k x n is the size of the picture, filter, or feature map, and the third dimension, m, corresponds to the number of filters or feature maps, or to the number of RGB channels in the case of input images.
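The sliding dot product described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the code behind the network in this article: the function name `convolve2d`, the 4 x 4 image, and the 2 x 2 filter are all made up for the example, and it assumes a single-channel image with no padding. As in most deep learning libraries, the filter is applied without flipping it.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    k_h, k_w = kernel.shape
    out_h = (image.shape[0] - k_h) // stride + 1
    out_w = (image.shape[1] - k_w) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + k_h, left:left + k_w]
            # One feature map element: dot product of the filter and the patch.
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Hypothetical 4x4 grayscale image and 2x2 filter, chosen only for illustration.
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])

feature_map = convolve2d(image, kernel)
print(feature_map.shape)  # (3, 3)
```

With a stride of 2, the same call yields a 2 x 2 feature map, because the filter then visits only every second position.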
The convolution operation is required for several reasons. First and foremost, it provides so-called shift invariance. In other words, we need our network to detect a certain object in an image regardless of where that object is positioned. Let’s say we want our network to recognize a certain car model in a picture; we need the network to do so regardless of where this car appears in the input image (Fig. 2).
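A small NumPy experiment can make this property concrete. The sketch below assumes a made-up 2 x 2 "object" on an empty 6 x 6 image and uses the object itself as the filter; the helper name `feature_map` is hypothetical. Strictly speaking, the convolution alone makes the response move together with the object, and it is the later pooling step (here simply taking the maximum) that makes the result independent of position.

```python
import numpy as np

def feature_map(image, kernel):
    # All kernel-sized windows of the image, shape (out_h, out_w, k_h, k_w).
    windows = np.lib.stride_tricks.sliding_window_view(image, kernel.shape)
    # Dot product of every window with the kernel.
    return np.einsum("ijkl,kl->ij", windows, kernel)

# Hypothetical 2x2 "object", placed at two different positions.
pattern = np.array([[1.0, 2.0],
                    [3.0, 4.0]])
image_a = np.zeros((6, 6))
image_a[1:3, 1:3] = pattern
image_b = np.zeros((6, 6))
image_b[3:5, 2:4] = pattern

map_a = feature_map(image_a, pattern)
map_b = feature_map(image_b, pattern)

# The strongest response follows the object to its new position...
peak_a = tuple(int(v) for v in np.unravel_index(np.argmax(map_a), map_a.shape))
peak_b = tuple(int(v) for v in np.unravel_index(np.argmax(map_b), map_b.shape))
print(peak_a, peak_b)  # (1, 1) (3, 2)

# ...while its strength, which a global max pooling would keep, is unchanged.
print(map_a.max() == map_b.max())  # True
```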
Another important reason for stacking multiple convolutional layers is that such an architecture distributes visual patterns over many layers, which makes the network more generalizable by localizing patterns in the lower layers. The convolutional setup also significantly reduces the number of trainable parameters, which would be enormous in a classical network composed only of dense layers.
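To get a feeling for the difference in parameter counts, here is a rough back-of-the-envelope calculation. All sizes are assumptions picked for illustration (a 64 x 64 RGB input, a dense layer with 128 units, a convolutional layer with 32 filters of size 3 x 3); they are not the dimensions of the network trained in this article, and the two layers do not produce outputs of the same shape.

```python
# Hypothetical sizes, chosen only to illustrate the order of magnitude.
height, width, channels = 64, 64, 3

# Dense layer: every input value connects to every unit, plus one bias per unit.
dense_units = 128
dense_params = height * width * channels * dense_units + dense_units

# Conv layer: each filter has k x k x channels weights plus one bias,
# and the same weights are reused at every position of the image.
num_filters, k = 32, 3
conv_params = k * k * channels * num_filters + num_filters

print(dense_params)  # 1572992
print(conv_params)   # 896
```

The convolutional layer's count does not depend on the image size at all, which is where most of the savings come from.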
Now, let’s get our hands dirty and try to construct and train a network that recognizes visual objects in the images we feed to it. I will use a dataset of characters from The Simpsons cartoon series to train a conv net to recognize 9 different well-known characters (Homer, Marge, Bart, etc.). The final layout of the network is shown in Fig. 3.
As the figure shows, the input image passes through a series of convolutional and pooling layers and then, after propagating through several fully connected layers, activates one of the outputs of the softmax layer, which points to a certain class within our class-labeled data. Trained on around 11,000 images of 9 classes (approx. 1,200 images per class), this network achieved a little more than 90% accuracy on previously unseen data.
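For readers unfamiliar with the softmax layer, the sketch below shows what it computes: it turns the raw scores of the last fully connected layer into probabilities that sum to 1, and the class with the highest probability is taken as the prediction. The nine scores are invented for the example and do not come from the trained network.

```python
import numpy as np

def softmax(logits):
    # Subtracting the maximum keeps the exponentials numerically stable.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical raw scores for 9 classes (one per character).
logits = np.array([0.3, 2.1, -1.0, 0.0, 4.2, 1.5, -0.7, 0.9, 0.1])

probs = softmax(logits)
predicted_class = int(np.argmax(probs))
print(predicted_class)  # 4
```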
To understand what happens to the network during training, I will use TensorBoard, a visualization tool that provides a number of insightful views, such as the distribution of network parameters across different layers, or training accuracy and loss statistics. It allows us to inspect the training process and to pinpoint and better understand problems that may appear during training. For the sake of demonstration, I will run the training twice with two different learning rates: the first will have an adequate value, while the second will be unreasonably large. In both cases, TensorBoard will gather and visualize information from the network so that we can analyze it afterwards. Fig. 4 shows examples of the visualizations we can get from TensorBoard. Here, you can see what the parameters of the second conv layer look like when the network is trained with different learning rates. It is clear from the figure that neither the parameters nor the gradients of this particular layer change much during training when the large learning rate is applied.
The picture is different with the smaller learning rate. The distributions of gradients and network weights keep moving throughout the entire training process, which indicates that the network is trying to adapt and learn patterns from the data.
As we can see from the loss and accuracy plots (Fig. 5), the network with the large learning rate trains very poorly; in fact, it’s not training at all! Its accuracy jitters around the same value and doesn’t increase during training. In contrast, training with the small learning rate proceeds smoothly: accuracy steadily increases and loss steadily drops over training time.
As the figures above show, choosing a large learning rate results in very poor and unstable accuracy.
In a nutshell, training a neural network means optimizing its parameters so that their combination yields a lower value of the error function. When a small learning rate is applied, the network descends the error function at every training step. A large learning rate, by contrast, causes the network to jump over the minimum of the error function and land on the other side of its curve (Fig. 6). As a result, the error stays at almost the same level during training, and the accuracy jitters around a certain value without improving much.
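The same effect can be reproduced on a toy problem. The sketch below runs plain gradient descent on the one-dimensional error function f(x) = x², which stands in for the real (and far more complex) error surface of the network; the function name, the starting point, and both learning rates are assumptions chosen to make the behavior easy to see.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimize f(x) = x**2 and return the error after each step."""
    x = start
    errors = []
    for _ in range(steps):
        gradient = 2.0 * x                # derivative of x**2
        x = x - learning_rate * gradient  # one training step
        errors.append(x ** 2)
    return errors

small = gradient_descent(learning_rate=0.1)  # walks down toward the minimum
large = gradient_descent(learning_rate=1.0)  # jumps from x = 5 to x = -5 and back

print(small[0], large[0])  # 16.0 25.0
print(large[-1])           # 25.0
```

With the large learning rate, every step lands on the opposite side of the curve at the same height, so the error never drops below its starting value of 25. With the small learning rate, the error shrinks at every step.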
It is also quite interesting to look at the heatmap of weights and compare how it changes over training time under the two learning rate settings. As we can see from the images below (top image), nearly all parameters of the network take on the same color (value) when the large learning rate is applied, and this picture does not change much over time. We see a completely different picture (bottom image) when the small learning rate is applied: the parameters visibly change after each training iteration.
This part of the article showed some very basic ways to visualize neural network parameters. These methods help build a preliminary understanding of what happens to the network parameters during training. In the second part, I will cover more advanced methods of visualizing the patterns that neural networks actually learn from training data. Stay tuned!
Ildar Abdrashitov, Business Intelligence Analyst, Missing Link Technologies