Convolutional neural networks

Mar 8, 2021

A convolutional neural network (CNN) is a class of deep neural network commonly applied to structured arrays of data such as images. CNNs are widely used in computer vision, with applications in image and video recognition, image classification, natural language processing, and more.

A basic convolutional neural network includes an input layer, hidden layers, and an output layer.

The input is a tensor of shape (number of images) × (image height) × (image width) × (number of channels).
Hidden layers are located between the input and the output layer. They include convolutional layers, which are followed by other layers such as pooling layers, normalization layers, and fully connected layers.
The output layer is the last layer of the network and produces its final result.

In this article, we explore the basic architecture of a CNN model and apply it to the problem of recognizing the MNIST handwritten digit images.

I. Convolutional layers

The convolutional layer is an essential component of a CNN model: it enables the model to extract features from the input data. Typically, a convolutional layer consists of multiple filters, and each filter produces a separate feature map that captures a specific aspect of the input data. The depth of the stacked feature maps corresponds to the number of filters used.

In this section, we explore some operations in a convolutional layer, including cross-correlation operations, padding, and strides.

1. The cross-correlation operation.

The cross-correlation operation is applied between a set of learnable filters (a.k.a. kernels) and the input data. It is used to extract features and generate feature maps.

Let’s see how this operation works in a simple example where the input is a two-dimensional matrix of size 4 × 4 (i.e. a tensor of size 4 × 4 × 1) and the kernel window is a square matrix of size 3 × 3.

The convolutional window starts at the top-left corner of the input tensor and slides from left to right and from top to bottom. At each position, the sub-tensor of the input inside the window is multiplied element-wise with the kernel, and the products are summed to give the output element at that position. In each sum below, the first factor is an input entry and the second is the matching kernel entry:

0×0 + 1×2 + 2×1 + 3×1 + 0×3 + (-1)×2 + (-2)×2 + 0×0 + 1×(-1) = 0

1×0 + 2×2 + 4×1 + 0×1 + (-1)×3 + 5×2 + 0×2 + 1×0 + 3×(-1) = 12

3×0 + 0×2 + (-1)×1 + (-2)×1 + 0×3 + 1×2 + 1×2 + 0×0 + 9×(-1) = -8

0×0 + (-1)×2 + 5×1 + 0×1 + 1×3 + 3×2 + 0×2 + 9×0 + 0×(-1) = 12.

Suppose that the input is a matrix of size n₁ × n₂ and the kernel size is k₁ × k₂. Then the output size is (n₁ − k₁ + 1) × (n₂ − k₂ + 1). In the example above, this gives (4 − 3 + 1) × (4 − 3 + 1) = 2 × 2.
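
The sliding-window computation can be reproduced in a few lines of NumPy. This is a minimal sketch rather than the article's original code: the function name `cross_correlate2d` is hypothetical, and the 4 × 4 input and 3 × 3 kernel are read back from the factors of the four sums above, since the figure is not reproduced here. The output it returns has (n₁ − k₁ + 1) × (n₂ − k₂ + 1) entries.

```python
import numpy as np

def cross_correlate2d(x, k):
    """Valid (no padding, stride 1) 2-D cross-correlation of x with kernel k."""
    n1, n2 = x.shape
    k1, k2 = k.shape
    out = np.zeros((n1 - k1 + 1, n2 - k2 + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise product of the current window and the kernel, then sum.
            out[i, j] = np.sum(x[i:i + k1, j:j + k2] * k)
    return out

# Input and kernel inferred from the worked sums (an assumption).
x = np.array([[ 0, 1,  2, 4],
              [ 3, 0, -1, 5],
              [-2, 0,  1, 3],
              [ 1, 0,  9, 0]])
k = np.array([[0, 2,  1],
              [1, 3,  2],
              [2, 0, -1]])

y = cross_correlate2d(x, k)
print(y)        # the four sums: 0, 12 (top row) and -8, 12 (bottom row)
print(y.shape)  # (4 - 3 + 1, 4 - 3 + 1) = (2, 2)
```

The explicit double loop mirrors the description of the window moving from left to right and top to bottom; deep learning libraries compute the same quantity with much faster vectorized kernels.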

2. Padding

The padding layer is used to adjust the spatial dimensions of the input data or feature maps. It adds extra pixels around the boundaries of the input images to prevent the information loss caused by kernel filtering. Since the kernel size is usually larger than 1, the resulting output is smaller than the input, and pixels near the borders are lost. To address this limitation, the padding layer introduces extra pixels, typically set to 0, around the image boundaries. If we add p₁ rows and p₂ columns of extra pixels in total, the new size of the output is (n₁ − k₁ + p₁ + 1) × (n₂ − k₂ + p₂ + 1).

Hence, the input and output have the same size when p₁ = k₁ − 1 and p₂ = k₂ − 1.

Let’s return to the example above. If we add 2 rows (p₁ = 2) and 2 columns (p₂ = 2) of zero pixels to the input, then the output size is 4 × 4, which equals the original input size.
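
The effect of zero padding on the output size can be checked numerically. The sketch below makes two assumptions that are not stated in the text: the 2 extra rows and 2 extra columns are split evenly (one on each side), and the input and kernel are the ones read back from the worked sums in the cross-correlation example. It uses `sliding_window_view`, which needs NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.array([[ 0, 1,  2, 4],
              [ 3, 0, -1, 5],
              [-2, 0,  1, 3],
              [ 1, 0,  9, 0]])
k = np.array([[0, 2,  1],
              [1, 3,  2],
              [2, 0, -1]])

# p1 = p2 = 2 extra rows/columns in total, assumed to be one on each side.
x_padded = np.pad(x, pad_width=1, mode="constant", constant_values=0)

# All 3 x 3 windows of the padded input, stacked in an array of shape (4, 4, 3, 3).
windows = sliding_window_view(x_padded, k.shape)

# Multiply each window element-wise with the kernel and sum over the window axes.
y_padded = np.einsum("ijkl,kl->ij", windows, k)

print(x_padded.shape)  # (6, 6)
print(y_padded.shape)  # (4, 4), the same as the original input
```

The central 2 × 2 block of `y_padded` coincides with the unpadded output, because those windows do not touch the added zeros.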

3. Stride

The stride is a parameter that determines the step size with which the convolutional filters move across the input data or feature maps. It controls how much the data is downsampled during the convolution.

In the example above, we applied a stride of 1 pixel when moving the convolutional window across the input tensor. However, there are scenarios where the output needs to be compressed to a smaller size. To achieve this, we can increase the stride. For instance, in our example, if we move the window across the padded image with a stride of 3 pixels both vertically and horizontally, the resulting output size is reduced to 2 × 2.

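Let s₁ and s₂ denote the stride steps in the vertical and horizontal directions of the input. Then the output size is ⌊(n₁ − k₁ + p₁ + s₁) / s₁⌋ × ⌊(n₂ − k₂ + p₂ + s₂) / s₂⌋. For the padded example with s₁ = s₂ = 3, this gives ⌊(4 − 3 + 2 + 3) / 3⌋ = 2 in each direction.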
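
Padding and stride can be combined in one small function to check the output sizes quoted above. This is a sketch under stated assumptions: the helper names `output_size` and `conv2d` are hypothetical, the padding is applied symmetrically (`pad` zeros on every side, so p = 2 · pad), the same stride is used on both axes, and the size formula is the usual floor expression ⌊(n − k + p + s) / s⌋ per axis.

```python
import numpy as np

def output_size(n, k, p=0, s=1):
    """Output length along one axis; p is the TOTAL number of padded rows/columns."""
    return (n - k + p + s) // s

def conv2d(x, k, pad=0, stride=1):
    """Cross-correlation with `pad` zeros on every side and one stride for both axes."""
    x = np.pad(x, pad_width=pad, mode="constant", constant_values=0)
    k1, k2 = k.shape
    # Top-left corners of the windows visited by the kernel.
    rows = range(0, x.shape[0] - k1 + 1, stride)
    cols = range(0, x.shape[1] - k2 + 1, stride)
    return np.array([[np.sum(x[i:i + k1, j:j + k2] * k) for j in cols]
                     for i in rows])

x = np.arange(16).reshape(4, 4)  # any 4 x 4 input works for a shape check
k = np.ones((3, 3))

y = conv2d(x, k, pad=1, stride=3)  # pad=1 per side, i.e. p1 = p2 = 2 in total
print(y.shape)                        # (2, 2)
print(output_size(4, 3, p=2, s=3))    # 2
print(output_size(4, 3))              # 2 (no padding, stride 1)
```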

In summary, padding and stride serve different purposes. Padding adds extra zero-valued pixels around the input to avoid losing information at the borders, while striding compresses the output. The two can be combined to obtain an output of the desired size that still retains the important pixels. The following figures illustrate the outputs of convolutional layers for different combinations of padding and stride.

II. Pooling layers

A pooling layer is a building block of a convolutional neural network. Its function is to reduce the dimension of the outputs of the convolutional layer while keeping the most important information.

Like convolutional layers, pooling layers use a fixed-shape window (a.k.a. the pooling window). This window slides over all regions of the input from left to right and from top to bottom. At each location, it computes either the maximum or the average of all input elements inside the pooling window. These operators are called maximum pooling (max pooling) and average pooling. Unlike convolutional layers, however, pooling layers do not contain any parameters (there is neither a kernel nor a bias).

Let’s see how max pooling and average pooling work through the following examples:

If we apply a max-pooling window of size 3 × 3 to our input image, the output size is 2 × 2, with the values shown in the following figure:

When an average pooling window of size 2 × 2 is applied to the input with a stride of 2, the output size is also 2 × 2:
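
Both pooling examples can be sketched in NumPy. Since the figures are not reproduced here, the block assumes that "our input image" is the 4 × 4 input read back from the cross-correlation sums and that the max-pooling window moves with a stride of 1; the helper name `pool2d` is hypothetical.

```python
import numpy as np

def pool2d(x, size, stride=1, mode="max"):
    """Slide a size x size window over x and take the max or the mean of each window."""
    reduce = np.max if mode == "max" else np.mean
    rows = range(0, x.shape[0] - size + 1, stride)
    cols = range(0, x.shape[1] - size + 1, stride)
    return np.array([[reduce(x[i:i + size, j:j + size]) for j in cols]
                     for i in rows])

# Input inferred from the worked cross-correlation sums (an assumption).
x = np.array([[ 0, 1,  2, 4],
              [ 3, 0, -1, 5],
              [-2, 0,  1, 3],
              [ 1, 0,  9, 0]])

max_out = pool2d(x, size=3, stride=1, mode="max")
avg_out = pool2d(x, size=2, stride=2, mode="avg")

print(max_out)  # rows: 3, 5 and 9, 9
print(avg_out)  # rows: 1.0, 2.5 and -0.25, 3.25
```

Note that the function has no weights to learn: the window only selects or averages values, which is why pooling layers add no parameters to the model.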

III. Fully connected layer

Fully connected layers are the last stages of a convolutional neural network. There may be one or more of these layers in a network. Every neuron of the current layer is connected to every neuron in the next layer, which is why they are called fully connected layers.

The first fully connected layer takes the output of the previous layer (a convolutional or pooling layer) and flattens it into a single vector. In the output layer, the number of nodes corresponds to the number of classes (or labels), and the layer gives a probability for each class. For example, when classifying MNIST data, we have 10 labels, the digits 0 to 9, so the number of output nodes is 10. If the input is an image of the digit 2, then the probability assigned to class 2 should be the largest.
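
The flatten, fully connected, and probability steps can be sketched as follows. The text does not say how the probabilities are produced; the block assumes a softmax output, which is the usual choice for multi-class classification. The feature-map shape (5, 5, 64) and the random weights are placeholders, so the resulting probabilities carry no meaning.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Stand-in for the output of the last pooling layer (shape is an assumption).
feature_maps = rng.normal(size=(5, 5, 64))
flat = feature_maps.reshape(-1)  # flatten into a vector of length 5 * 5 * 64 = 1600

# Untrained, randomly initialised weights: only shapes and normalisation matter here.
W = rng.normal(scale=0.01, size=(flat.size, 10))
b = np.zeros(10)

probs = softmax(flat @ W + b)  # one probability per digit class
print(probs.shape)             # (10,)
print(probs.sum())             # 1.0 up to rounding
print(np.argmax(probs))        # index of the predicted class
```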

IV. Example

In this section, we are going to build a simple convolutional neural network in Keras to classify MNIST data.

1. MNIST dataset of handwritten digits

This dataset consists of 60,000 samples in the training set and 10,000 samples in the test set. The digits have been size-normalized and centered in fixed-size images of 28 × 28 pixels.

This database is available on this page. It can also be loaded from the datasets module of Keras:
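
A minimal sketch of the loading step, assuming TensorFlow's bundled Keras is installed. `keras.datasets.mnist.load_data()` is the library call; the wrapper name `load_mnist` is only a convenience introduced here.

```python
from tensorflow import keras

def load_mnist():
    """Return the MNIST splits; the files are downloaded on the first call."""
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
    return (x_train, y_train), (x_test, y_test)
```

Calling `(x_train, y_train), (x_test, y_test) = load_mnist()` gives `uint8` image arrays of shape (60000, 28, 28) and (10000, 28, 28), together with integer label vectors of length 60000 and 10000.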

Count the number of images in each class of the training set:

Visualize some random images from the training set:
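
One way to do the count is `np.bincount`. The sketch below runs on a tiny made-up label vector so that it is self-contained; with the real data, `y_train` would be passed instead. The helper name `class_counts` is hypothetical.

```python
import numpy as np

def class_counts(labels, num_classes=10):
    """Number of samples per digit class, indexed by the digit."""
    return np.bincount(labels, minlength=num_classes)

# Tiny stand-in label vector, not the real MNIST labels.
labels = np.array([0, 1, 1, 2, 9, 9, 9])

counts = class_counts(labels)
for digit, count in enumerate(counts):
    print(f"digit {digit}: {count} images")
```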

2. Preprocessing data

This task includes the following steps:

Reshape the images into the shape required by Keras
Convert the integer values into float values
Normalize the data
One-hot encode the labels
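
The four steps above can be sketched as a single function. Two details are assumptions, since the text does not spell them out: the normalization divides by 255 to map pixels to [0, 1], and the one-hot encoding is done with `np.eye` as a stand-in for `keras.utils.to_categorical`. The demo arrays are random, not real digits.

```python
import numpy as np

def preprocess(images, labels, num_classes=10):
    # 1. Reshape to (samples, height, width, channels); MNIST is grayscale, so 1 channel.
    x = images.reshape(-1, 28, 28, 1)
    # 2. Convert the integer pixel values to floats.
    x = x.astype("float32")
    # 3. Normalize from [0, 255] to [0, 1] (assumed normalization).
    x /= 255.0
    # 4. One-hot encode: row i of the identity matrix encodes class i.
    y = np.eye(num_classes, dtype="float32")[labels]
    return x, y

# Random stand-in arrays with the MNIST shape and dtype.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(8, 28, 28), dtype=np.uint8)
labels = rng.integers(0, 10, size=8)

x, y = preprocess(images, labels)
print(x.shape, x.dtype, y.shape)  # (8, 28, 28, 1) float32 (8, 10)
```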

3. Building a CNN model
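
The following is a hedged sketch of a small Keras CNN that stacks the building blocks from sections I to III. The layer sizes (32 and 64 filters, a 128-unit dense layer) are illustrative choices made here and are not necessarily the architecture used for the results reported in this article; the function name `build_model` is hypothetical.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_shape=(28, 28, 1), num_classes=10):
    """Two conv/pool stages followed by fully connected layers (illustrative sizes)."""
    return keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),  # 28x28 -> 26x26
        layers.MaxPooling2D(pool_size=(2, 2)),                     # 26x26 -> 13x13
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),  # 13x13 -> 11x11
        layers.MaxPooling2D(pool_size=(2, 2)),                     # 11x11 -> 5x5
        layers.Flatten(),                                          # 5 * 5 * 64 = 1600
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),           # one node per digit
    ])

model = build_model()
model.summary()

# Forward pass on two blank images to confirm the output shape.
probs = model.predict(np.zeros((2, 28, 28, 1), dtype="float32"), verbose=0)
print(probs.shape)  # (2, 10)
```

The shape comments follow the output-size formula from section I: without padding and with stride 1, each 3 × 3 convolution shrinks the height and width by 2.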

4. Training model
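
A minimal training sketch, assuming categorical cross-entropy as the loss (matching one-hot labels), the Adam optimizer, and a 10% validation split; none of these settings is confirmed by the text. To keep the block self-contained and fast, it runs on random stand-in data with a deliberately tiny model, so the numbers it produces are meaningless.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def train(model, x, y, epochs=5, batch_size=128, validation_split=0.1):
    """Compile and fit the model; returns the Keras History object."""
    model.compile(loss="categorical_crossentropy",
                  optimizer="adam",
                  metrics=["accuracy"])
    return model.fit(x, y, epochs=epochs, batch_size=batch_size,
                     validation_split=validation_split, verbose=0)

# Smoke run on random data (NOT MNIST) with a tiny placeholder model.
rng = np.random.default_rng(0)
x_demo = rng.random((64, 28, 28, 1)).astype("float32")
y_demo = keras.utils.to_categorical(rng.integers(0, 10, size=64), num_classes=10)

tiny_model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(4, kernel_size=(3, 3), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])

history = train(tiny_model, x_demo, y_demo, epochs=2, batch_size=16)
print(sorted(history.history))  # ['accuracy', 'loss', 'val_accuracy', 'val_loss']
```

The per-epoch lists stored in `history.history` are the values that the learning curves are drawn from.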

Visualize the learning curves (accuracy and loss values on the training and validation sets)

5. Evaluate the model on the test set

6. Confusion matrix
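
A confusion matrix can be sketched in plain NumPy. The labels below are made up for illustration; with a trained model, the predicted labels would typically come from `np.argmax(model.predict(x_test), axis=1)`. The helper name `confusion_matrix` is defined here and is not a library call.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes=10):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # accumulate one count per (true, predicted) pair
    return cm

# Hypothetical labels for illustration only.
y_true = np.array([0, 1, 2, 2, 7, 9, 9])
y_pred = np.array([0, 1, 2, 3, 7, 9, 4])

cm = confusion_matrix(y_true, y_pred)
accuracy = np.trace(cm) / cm.sum()  # correct predictions lie on the diagonal

print(cm[2])     # row for true digit 2: one correct, one mistaken for 3
print(accuracy)  # 5 correct out of 7
```

The accuracy from the evaluation step is the sum of the diagonal divided by the total count, while the off-diagonal entries show which digits are confused with each other.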

7. Visualizing some random images and their predicted labels

V. Conclusion

We have explored the architecture of a basic convolutional neural network (CNN), which includes convolutional layers, pooling layers, and fully connected layers. We have also built a simple CNN model to recognize handwritten digit images. The model achieved high accuracy (99.20% on the training set, 98.72% on the validation set, and 98.85% on the test set).

I hope this blog is helpful to you. Feel free to connect with me on the Medium page for more insightful content in my upcoming blogs.

Github code: https://github.com/KhuyenLE-maths/Medium_blogs/tree/main/CNN_MNIST

Thanks for reading!