CNNs with PyTorch
A 2-Layer Convolutional Neural Network with Fashion MNIST dataset
Dataset Handling
During this project we’ll be working with the Fashion-MNIST dataset, a well-known dataset which conveniently comes bundled as a toy example within the PyTorch library.
The Fashion-MNIST dataset was proposed as a more challenging drop-in replacement for MNIST. It comprises 60,000 training images and 10,000 test images, each a small square 28×28 pixel grayscale picture of an item from one of 10 types of clothing, such as shoes, t-shirts, dresses, and more.
You can find the repo of this article here, in case you want to follow the comments alongside the code. As a brief note, the dataset images won’t be re-scaled, since we want to increase prediction performance at the cost of longer training times. Hence, the only transformation taking place will be the one needed to handle images as Tensor objects (matrices).
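As a minimal sketch of that setup (the variable names here are illustrative, not necessarily the ones used in the repo), loading the dataset with torchvision looks like this:

```python
from torchvision import datasets, transforms

# The only transformation: convert the PIL images to Tensors (no re-scaling).
transform = transforms.Compose([transforms.ToTensor()])

# Download the training and validation splits of Fashion-MNIST.
train_dataset = datasets.FashionMNIST(
    root="./data", train=True, download=True, transform=transform
)
validation_dataset = datasets.FashionMNIST(
    root="./data", train=False, download=True, transform=transform
)
```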
Building the Model
It’s well known that Convolutional Neural Networks (CNNs) are among the most used architectures for Computer Vision. This kind of architecture can achieve impressive results, generally in the range of 90% accuracy. Not only that, these models tend to generalize well.
A CNN is composed of several transformations, including convolutions and activations. Several layers can be piped together to enhance feature extraction (yep, I know what you’re thinking, we feed the model with raw data). Generally, we use convolutions as a way to reduce the amount of information to process while keeping the features intact. This helps us reduce the number of inputs (and neurons) in the last layer. Here is a good resource in case you want a deeper explanation: CNN Cheatsheet CS 230. For this particular case we’ll use a convolution with a kernel size of 5 and a Max Pool activation of size 2.
If you’re new to convolutions, here’s also a good video which shows, in its first minutes, how a convolution takes place. It’s a good animation which helps us visualize how the process works. Furthermore, in case you want to know more about Max Pool activation, here’s another video with extra details. It’s also important to note that the convolution kernel (or filter) weights (parameters) will be learned during training, in order to optimize the model.
One of the hardest parts of designing the model is determining the matrix dimensions needed as input parameters for the convolutions and for the last fully connected linear layer. That last layer gives us the predicted classes or labels, which in this case are the different clothing categories.
We’ll create a 2-layer CNN with a Max Pool activation function piped to the convolution result. Since we don’t want to lose the image edges, we’ll add padding to them before the convolution takes place. During the whole project we’ll be working with square matrices where m = n (rows are equal to columns). We’ll refer to the input matrix dimension as I, where in this particular case I = 28 for the raw images. In the same way, the dimension of the output matrix will be represented with the letter O.
Convolution parameters
- kernel = 5
- padding = 2
- stride = 1
- dilation = 1
Given these parameters, the new matrix dimension after the convolution process is:
O = (I + 2p - k)/s + 1
O = I + 2·2 - 5 + 1 (with s = 1, p = 2, k = 5)
O = I
where:
- p: padding
- k: kernel size
- s: stride
- I: input matrix size
- O: output matrix size
MaxPool Activation parameters
For the MaxPool activation, stride defaults to the size of the kernel. Parameters are:
- kernel = 2
- padding = 0
- stride = 2
- dilation = 1
In this case, the new matrix dimension after the Max Pool activation is:
O = (I - k)/s + 1
O = (I - 2)/2 + 1
O = I/2
If you’re interested in determining the matrix dimensions after the several filtering processes, you can also check them in the CNN Cheatsheet CS 230.
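If you’d rather verify these formulas empirically, a quick sanity check with a dummy tensor (the layer names below are just for this sketch) confirms both results:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, padding=2)
pool = nn.MaxPool2d(kernel_size=2)   # stride defaults to the kernel size

x = torch.randn(1, 1, 28, 28)        # one dummy 28x28 grayscale image
print(conv(x).shape)                 # torch.Size([1, 16, 28, 28]) -> O = I
print(pool(conv(x)).shape)           # torch.Size([1, 16, 14, 14]) -> O = I/2
```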
Actual project matrix dimensions
After the previous discussion, in this particular case, the project matrix dimensions are the following:
- After the first convolution, 16 output matrices of 28×28 px are created.
- The dimensions of the matrices after the Max Pool activation are 14×14 px.
- The 32 resultant matrices after the second convolution, with the same kernel and padding as the first one, have a dimension of 14×14 px.
- Finally, after the last Max Pool activation, the resultant matrices have a dimension of 7×7 px.
The 32 channels after the last Max Pool activation, at 7×7 px each, add up to 1568 inputs to the final fully connected layer after flattening the channels.
The following class shows the forward method, where we define how the operations will be organized inside the model. That is, this is where we design the Neural Network architecture. PyTorch offers an alternative way to do this, called the Sequential mode. You can learn more here. As you may notice, the first transformation is a convolution, followed by a ReLU activation and later a MaxPool activation/transformation. As mentioned before, the convolutions act as a feature extraction process, where predictors are preserved and the information is compressed. In this way we can train the network faster without losing input data.
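A sketch of such a class, consistent with the dimensions derived above (the class and attribute names are illustrative and may differ from the repo):

```python
import torch
import torch.nn as nn

class CNN(nn.Module):
    def __init__(self, out_1=16, out_2=32, num_classes=10):
        super().__init__()
        # First convolution: 1 input channel -> 16 feature maps, 28x28 preserved by padding.
        self.cnn1 = nn.Conv2d(in_channels=1, out_channels=out_1, kernel_size=5, padding=2)
        self.maxpool1 = nn.MaxPool2d(kernel_size=2)   # 28x28 -> 14x14
        # Second convolution: 16 -> 32 feature maps, same kernel and padding as the first.
        self.cnn2 = nn.Conv2d(in_channels=out_1, out_channels=out_2, kernel_size=5, padding=2)
        self.maxpool2 = nn.MaxPool2d(kernel_size=2)   # 14x14 -> 7x7
        # Fully connected layer: 32 * 7 * 7 = 1568 inputs, one output per clothing category.
        self.fc1 = nn.Linear(out_2 * 7 * 7, num_classes)

    def forward(self, x):
        x = self.maxpool1(torch.relu(self.cnn1(x)))   # convolution -> ReLU -> Max Pool
        x = self.maxpool2(torch.relu(self.cnn2(x)))
        x = x.view(x.size(0), -1)                     # flatten the 32 channels of 7x7
        return self.fc1(x)
```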
Determining Optimizer and Data Loader
After modelling our Neural Network, we have to determine the loss function and the optimization parameters. For this, we’ll select a Cross Entropy strategy as the loss function, which is typically chosen for non-binary categorical variables. There’s a great article to learn more about it here. To minimize the cost we’ll use a Stochastic Gradient Descent strategy, which is pretty much the plain vanilla choice in cases where our data doesn’t fit into memory. Using SGD, the loss function is run over batches and several steps, seeking at least a local minimum. For this purpose, we’ll create the train_loader and validation_loader iterators.
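In code, that setup might look like the following (the learning rate and batch sizes are illustrative choices, reusing the dataset and model objects sketched earlier):

```python
from torch.utils.data import DataLoader

model = CNN()
criterion = nn.CrossEntropyLoss()                        # suited to multi-class targets
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Batched iterators over the training and validation splits.
train_loader = DataLoader(train_dataset, batch_size=100, shuffle=True)
validation_loader = DataLoader(validation_dataset, batch_size=5000)
```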
Training the Model
As said before, we’re going to run several training iterations (epochs) through the data, in batches. Then we’ll check the accuracy of the model against the validation data, and finally we’ll repeat the process. It is important to note that optimizer.step() adjusts the model weights for the next iteration, in order to minimize the error against the true function y.
Finally, we’ll append the cost and accuracy values for each epoch and plot the final results. Analyzing the plot, we’ll see how the cost descends and the accuracy increases as the model adjusts the weights and “learns” from the training data.
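A minimal training loop along those lines (reusing the objects sketched above; the epoch count matches the six used later in the article) could be:

```python
epochs = 6
cost_list, accuracy_list = [], []

for epoch in range(epochs):
    cost = 0
    for x, y in train_loader:
        optimizer.zero_grad()            # reset gradients from the previous step
        loss = criterion(model(x), y)    # forward pass + cross-entropy cost
        loss.backward()                  # backpropagate the error
        optimizer.step()                 # adjust the model weights
        cost += loss.item()

    # Check accuracy against the validation data at the end of each epoch.
    correct = 0
    with torch.no_grad():
        for x_val, y_val in validation_loader:
            predictions = model(x_val).argmax(dim=1)
            correct += (predictions == y_val).sum().item()
    accuracy = correct / len(validation_dataset)

    cost_list.append(cost)
    accuracy_list.append(accuracy)
```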
Analyzing the Results
Below you’ll find the plot with the cost and accuracy for the model. As expected, the cost decreases and the accuracy increases while the training fine-tunes the kernel and the fully connected layer weights. In other words, the model learns through the iterations.
Checking Classifications
Here’s an image depicting the different categories in the Fashion MNIST dataset.
Finally, we’ll check some samples where the model didn’t classify the categories correctly. As you may see, sometimes it’s not easy to distinguish between a sandal and a sneaker in such a low-resolution picture, even for the human eye. Notice also the first image, where the model predicted a bag but it was actually a sneaker. It kind of looks like a bag, doesn’t it? The model also has a hard time discriminating pullovers from coats, but with that image, honestly, it’s not easy to tell.
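One way to surface those misclassified samples is to compare predictions against labels over a validation batch (a sketch; the plotting details in the notebook may differ):

```python
import matplotlib.pyplot as plt

# Fashion-MNIST class names, indexed by label.
classes = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
           "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

# Collect a few validation samples where prediction and label disagree.
misclassified = []
with torch.no_grad():
    for x_val, y_val in validation_loader:
        preds = model(x_val).argmax(dim=1)
        for img, pred, label in zip(x_val, preds, y_val):
            if pred != label and len(misclassified) < 5:
                misclassified.append((img, pred.item(), label.item()))
        break  # one validation batch is enough for a spot check

for img, pred, label in misclassified:
    plt.imshow(img.squeeze(), cmap="gray")
    plt.title(f"predicted: {classes[pred]}, actual: {classes[label]}")
    plt.show()
```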
Wrapping it up
A 2-layer CNN does an excellent job of predicting images from the Fashion MNIST dataset, with an overall accuracy of almost 90% after 6 training epochs. This is not a surprise, since this kind of neural network architecture achieves great results.
Certainly, the accuracy can be increased by reducing the convolution kernel size in order to lose less data per iteration, at the expense of longer training times. Also, normalization can be implemented after each convolution and in the final fully connected layer. This helps achieve higher accuracy in fewer epochs. You can try experimenting with it and leave some comments here with the results. There’s a good article on batch normalization you can dig into.
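If you want a starting point for that experiment, here’s one possible variation on the class sketched earlier, with batch normalization after each convolution and on the final fully connected layer:

```python
class CNNBatchNorm(nn.Module):
    def __init__(self, out_1=16, out_2=32, num_classes=10):
        super().__init__()
        self.cnn1 = nn.Conv2d(1, out_1, kernel_size=5, padding=2)
        self.bn1 = nn.BatchNorm2d(out_1)            # normalize after the first convolution
        self.maxpool1 = nn.MaxPool2d(kernel_size=2)
        self.cnn2 = nn.Conv2d(out_1, out_2, kernel_size=5, padding=2)
        self.bn2 = nn.BatchNorm2d(out_2)            # and after the second
        self.maxpool2 = nn.MaxPool2d(kernel_size=2)
        self.fc1 = nn.Linear(out_2 * 7 * 7, num_classes)
        self.bn3 = nn.BatchNorm1d(num_classes)      # normalization on the final layer

    def forward(self, x):
        x = self.maxpool1(torch.relu(self.bn1(self.cnn1(x))))
        x = self.maxpool2(torch.relu(self.bn2(self.cnn2(x))))
        x = x.view(x.size(0), -1)
        return self.bn3(self.fc1(x))
```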
You can check out the notebook in the GitHub repo. Don’t forget to follow me on Twitter. Thanks for reading this far, and special thanks to Jorge and Franco for reviewing this article.