Updated March 21, 2023

Introduction to Convolutional Neural Networks

Convolutional Neural Networks, also known as CNNs or ConvNets, are a category of artificial neural networks used mainly for image processing and recognition. They are a deep learning technique within artificial intelligence. Neural networks are hardware or software systems loosely modeled on the neurons of the human brain. A traditional fully connected neural network scales poorly with image size, so in practice it can only take images of reduced resolution as input. CNNs address that problem by arranging their neurons in a way loosely inspired by the visual cortex, where each neuron responds only to a small region of the input. CNNs need very little pre-processing compared to other image classification algorithms. Convolution, a linear mathematical operation, gives the network its name: a CNN uses convolution instead of general matrix multiplication in at least one of its layers.
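To make the difference between convolution and general matrix multiplication concrete, here is a minimal NumPy sketch. The function name conv2d_valid, the 4x4 input, and the 2x2 filter are illustrative assumptions, and the loop computes the unflipped sliding dot product (cross-correlation) that deep learning libraries usually call convolution.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            # Each output value is a dot product of the kernel with one local patch.
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)  # hypothetical 4x4 "image"
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                  # hypothetical 2x2 filter
feature_map = conv2d_valid(image, kernel)         # shape (3, 3)

# The same 4 weights are reused at every position. A dense layer mapping
# the 16 inputs to the 9 outputs would need its own weight for every pair.
shared_weights = kernel.size                      # 4
dense_weights = image.size * feature_map.size     # 16 * 9 = 144
```

The weight sharing shown in the last two lines is the reason a convolutional layer needs far fewer parameters than a fully connected layer on the same input.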

Layers in Convolutional Neural Networks

Below are the layers of convolutional neural networks:

Image Input Layer: The input layer receives the inputs (mostly images) and carries out normalization. The input size has to be specified here.
Convolutional Layer: Convolution is performed in this layer. A set of filters slides over the image, each looking at one small local receptive field at a time, and the responses are collected into feature maps, each a matrix of size m x n.
Non-Linearity Layer: Here feature maps are taken as input, and activation maps are given as output by applying an activation function element-wise. Classically, the activation function is a sigmoid or hyperbolic tangent function.
Rectification Layer: A crucial component of some CNNs, this layer speeds up training without reducing accuracy. It performs an element-wise absolute value operation on the activation maps.
Rectified Linear Units (ReLU): ReLU combines the non-linearity and rectification layers of a CNN. It performs a threshold operation in which negative values are set to zero. ReLU does not change the size of its input.
Pooling Layer: The pooling layer is also called the downsampling layer, as it is responsible for reducing the size of the activation maps. A filter and a stride of the same length are applied to the input volume. This layer discards less significant data, so image recognition is done on a smaller representation. It also helps reduce overfitting. The pooling layer has no learnable parameters of its own, but because it shrinks the activation maps, later layers need fewer parameters and less computation, so the cost is also reduced. The input is divided into rectangular pooling regions, and either the maximum or the average of each region is calculated; these are called max pooling and average pooling, respectively. Max pooling is the more popular choice.
Dropout Layer: This layer randomly sets input elements to zero with a given probability. A different set of elements is dropped on each pass. This layer also helps to reduce overfitting, because it pushes the network to learn redundant representations rather than rely on any single element. The layer has no learnable parameters. The operation is carried out only during training.
Fully Connected Layer: Activation maps, which are the output of the previous layers, are turned into class scores in this layer, which the output layer then converts into a class probability distribution. The FC layer multiplies the input by a weight matrix and adds a bias vector.
Output Layer: The FC layer is followed by softmax and classification layers. The softmax function is applied to the FC output to produce class probabilities. The classification layer computes the cross-entropy loss for classification problems.
Regression Layer: Half the mean squared error is computed in this layer, which is used for regression problems. This layer should follow the FC layer.
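The layers listed above, from the non-linearity onward, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: all function names, the 4x4 feature map, and the 3-class weight matrix are hypothetical, the dropout uses the common "inverted" variant that rescales the kept elements, and half_mse takes the mean over all elements of the prediction.

```python
import numpy as np

def relu(x):
    # Threshold operation: negative values become zero, shape is unchanged.
    return np.maximum(x, 0.0)

def max_pool2d(x, size=2):
    # Non-overlapping pooling: filter and stride share the same length `size`.
    h, w = x.shape[0] // size, x.shape[1] // size
    blocks = x[:h * size, :w * size].reshape(h, size, w, size)
    return blocks.max(axis=(1, 3))

def dropout(x, p, rng, training=True):
    # Assumption: inverted dropout, kept elements are rescaled by 1 / (1 - p).
    if not training:
        return x  # dropout is a no-op outside training
    keep = rng.random(x.shape) >= p
    return x * keep / (1.0 - p)

def fully_connected(x, weights, bias):
    # Multiply the input by a weight matrix and add the bias vector.
    return weights @ x + bias

def softmax(scores):
    exp = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return exp / exp.sum()

def cross_entropy(probs, label):
    # Classification loss: negative log-probability of the true class.
    return -np.log(probs[label])

def half_mse(pred, target):
    # Regression loss: half the mean squared error.
    return 0.5 * np.mean((pred - target) ** 2)

rng = np.random.default_rng(0)
feature_map = np.array([[ 1., -2.,  3.,  0.],
                        [-1.,  5., -3.,  2.],
                        [ 0., -4.,  2.,  1.],
                        [ 3.,  1., -1., -2.]])   # hypothetical conv output
activated = relu(feature_map)
pooled = max_pool2d(activated)                   # 4x4 -> 2x2
dropped = dropout(pooled, p=0.5, rng=rng)
weights = rng.standard_normal((3, 4))            # hypothetical 3-class FC layer
bias = np.zeros(3)
probs = softmax(fully_connected(dropped.ravel(), weights, bias))
loss = cross_entropy(probs, label=0)
```

Only fully_connected has learnable parameters here; ReLU, pooling, dropout, and softmax are fixed functions, which matches the descriptions in the list.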

Architectures of Convolutional Neural Networks

Below are some well-known architectures of convolutional neural networks:

1. LeNet

LeNet was introduced in 1998 for optical character recognition in documents. It is small, easy to grasp, and well suited to running on a CPU. It is built on three main ideas: local receptive fields, shared weights, and spatial subsampling. The network learns its internal representation directly from raw images. It has three convolutional layers, two pooling layers, one fully connected layer, and one output layer. Each pooling layer immediately follows a convolutional layer.
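The way convolution and subsampling shrink the image through such a network can be traced with simple arithmetic. The sketch below assumes the commonly cited LeNet-5 configuration (a 32x32 input, 5x5 convolutions with stride 1, and 2x2 subsampling with stride 2), none of which is stated in this article; the helper name out_size is hypothetical.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1

# (layer name, kernel size, stride), assuming the LeNet-5 layout
layers = [
    ("C1 conv 5x5", 5, 1),
    ("S2 pool 2x2", 2, 2),
    ("C3 conv 5x5", 5, 1),
    ("S4 pool 2x2", 2, 2),
    ("C5 conv 5x5", 5, 1),
]

size = 32  # assumed input width and height
sizes = []
for name, kernel, stride in layers:
    size = out_size(size, kernel, stride)
    sizes.append(size)
    print(f"{name}: {size}x{size}")
```

Under these assumptions the spatial size goes 28, 14, 10, 5, 1, so the last convolution already sees the whole remaining map and behaves like a fully connected layer.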

2. AlexNet

AlexNet was developed in 2012. This architecture popularized CNNs in computer vision. It has five convolutional and three fully connected layers, with ReLU applied after every layer except the final output. It takes advantage of both layer types: a convolutional layer has few parameters but is expensive to compute, while the opposite holds for a fully connected layer. Overfitting was greatly reduced by data augmentation and dropout. Compared with LeNet, AlexNet was deeper and bigger, and it stacked convolutional layers directly on top of each other without a pooling layer separating them.

3. ZF Net

ZF Net, developed in 2013, is a modified version of AlexNet. The middle convolutional layers were expanded, and the stride and filter size of the first convolutional layer were made smaller. Its authors identified shortcomings of AlexNet and tuned the design to address them. The overall layer structure is the same as AlexNet's; only layer parameters such as filter size and stride were adjusted, which reduced the error rate.

4. GoogLeNet

This architecture was developed in 2014. The inception module is the core concept. It covers a larger area of the image while still keeping track of fine detail, by running convolutions of several filter sizes in parallel. To improve performance, nine inception modules are used in GoogLeNet. Since such a deep and wide network is prone to overfitting, more non-linearities and fewer parameters are used here. Within each module, a max pooling branch runs alongside the convolutions, and the outputs of all branches are concatenated. This architecture has 22 layers and about 12 times fewer parameters than AlexNet.

It is more accurate than AlexNet, with a lower error rate, and faster too. An average pooling layer is used at the end in place of large fully connected layers. Computation is reduced while depth and width are increased. Many inception modules are stacked to make the architecture deeper. GoogLeNet won the ILSVRC 2014 classification challenge. Several follow-up versions of this architecture are available.
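Two of the ideas above, concatenating parallel branches and replacing a large fully connected layer with average pooling, can be illustrated with array shapes alone. In this sketch the branch outputs are random placeholders rather than real convolutions, and the channel counts (64, 128, 32, 32), the 1024x7x7 final feature maps, and the 1000-class classifier are assumptions taken from the commonly cited GoogLeNet configuration, not from this article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parallel branches of one inception module, each keeping the 28x28 spatial size.
branch_channels = {"1x1": 64, "3x3": 128, "5x5": 32, "pool_proj": 32}
branches = [rng.standard_normal((c, 28, 28)) for c in branch_channels.values()]

# The module output stacks all branch outputs along the channel axis.
module_out = np.concatenate(branches, axis=0)   # shape (256, 28, 28)

def global_average_pool(x):
    # Collapse each (H, W) feature map to its mean, leaving one value per channel.
    return x.mean(axis=(1, 2))

final_maps = rng.standard_normal((1024, 7, 7))  # assumed final feature maps
pooled = global_average_pool(final_maps)        # shape (1024,)

# Weights of a hypothetical 1000-class linear classifier on each representation.
weights_on_flattened = final_maps.size * 1000   # 50,176,000
weights_on_pooled = pooled.size * 1000          # 1,024,000
```

Averaging each feature map first cuts the classifier's weight count by a factor of 49 in this example, which is one way such a network keeps its parameter count low.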

5. VGG Net

VGG Net, developed in 2014, was an improvement over ZF Net and, in turn, over AlexNet. Its best-known variant has 16 weight layers, built from 3×3 convolutional layers, 2×2 pooling layers, and fully connected layers. This architecture adopts the simplest network structure, but it has the most parameters of the networks listed here.
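One reason for using only 3×3 convolutions is that a stack of them sees the same image region as a single larger filter while using fewer weights. The sketch below works through that arithmetic; the helper names and the channel count of 64 are illustrative assumptions, biases are ignored, and every layer is assumed to have stride 1 with the same number of input and output channels.

```python
def stacked_receptive_field(kernel, n_layers):
    # With stride 1, each extra layer widens the receptive field by (kernel - 1).
    return 1 + n_layers * (kernel - 1)

def conv_weights(kernel, channels, n_layers=1):
    # Weights in n_layers conv layers, each mapping `channels` to `channels`.
    return n_layers * kernel * kernel * channels * channels

channels = 64  # hypothetical channel count

rf_two_3x3 = stacked_receptive_field(3, 2)        # same field as one 5x5 filter
rf_three_3x3 = stacked_receptive_field(3, 3)      # same field as one 7x7 filter

weights_three_3x3 = conv_weights(3, channels, 3)  # 27 * channels**2
weights_one_7x7 = conv_weights(7, channels)       # 49 * channels**2
```

The stacked version also applies a non-linearity after each of its layers, so it gains extra non-linearities on top of the weight savings.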

6. ResNet

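The Residual Network (ResNet) architecture was developed in 2015. It makes heavy use of batch normalization and omits the large fully connected layers at the end of the network. Its deepest original variant has 152 layers, which are trainable thanks to skip connections that add a block's input to its output. ResNet-style architectures are now widely used across deep learning.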
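A skip connection simply adds a block's input back onto its output. The following is a simplified sketch, assuming small dense layers in place of the convolutions and leaving out batch normalization; the function name residual_block and the weight shapes are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    # F(x): two weight layers with a ReLU in between (dense stand-ins for convs).
    residual = w2 @ relu(w1 @ x)
    # Skip connection: the input is added to F(x) before the final ReLU.
    return relu(x + residual)

x = np.array([1.0, 2.0, 3.0])

# With all-zero weights F(x) is zero, so the block passes x through unchanged.
zero = np.zeros((3, 3))
identity_out = residual_block(x, zero, zero)

# With non-zero weights the block learns a correction on top of its input.
rng = np.random.default_rng(0)
w1 = 0.1 * rng.standard_normal((3, 3))
w2 = 0.1 * rng.standard_normal((3, 3))
out = residual_block(x, w1, w2)
```

Because a block can fall back to the identity this easily, adding more blocks need not make the network harder to train, which is the usual argument for why very deep stacks become practical.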

Conclusion

Facebook uses CNNs for image tagging, Amazon for product recommendations, and Google for searching among user photos. All of these are done with high accuracy and efficiency. Advances in deep learning made CNNs possible, and they have proved useful in many ways. As CNN architectures become more sophisticated, their accuracy and efficiency continue to improve.