Convolutional Neural Network

A Convolutional Neural Network (ConvNet/CNN) is a specialized type of neural network model designed for working with two-dimensional image data, although it can also be applied to one-dimensional and three-dimensional data.

CNNs are widely used in areas such as image recognition, image classification, object detection, and face recognition.

A CNN for image classification takes an input image, processes it, and classifies it under certain categories (e.g., Dog, Cat, Tiger, Lion). A computer sees an input image as an array of pixels whose size depends on the image resolution: h x w x d (h = height, w = width, d = depth). For example, an RGB image might be a 6 x 6 x 3 array (the 3 refers to the RGB channels), while a grayscale image might be a 4 x 4 x 1 array.
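A quick NumPy sketch makes the "image as an array of pixels" idea concrete; the array sizes below mirror the two examples above:

```python
import numpy as np

# A 6 x 6 RGB image: height x width x depth (3 color channels).
rgb_image = np.zeros((6, 6, 3), dtype=np.uint8)

# A 4 x 4 grayscale image: a single channel.
gray_image = np.zeros((4, 4, 1), dtype=np.uint8)

print(rgb_image.shape)   # (6, 6, 3)
print(gray_image.shape)  # (4, 4, 1)
```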

Each input image passes through a series of convolutional layers with filters (kernels), pooling layers, and fully connected (FC) layers; finally, a softmax function classifies the object with probabilistic values between 0 and 1.
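To illustrate the final step, here is a minimal softmax sketch. The class scores below are hypothetical, standing in for the output of the last fully connected layer:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability, then normalize
    # so the outputs are probabilities that sum to 1.
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

# Hypothetical scores for (Dog, Cat, Tiger, Lion) from the final FC layer.
scores = np.array([2.0, 1.0, 0.5, 0.1])
probs = softmax(scores)
print(probs)        # four values between 0 and 1
print(probs.sum())  # 1.0
```

The highest score (Dog, here) gets the highest probability, but every category receives a nonzero value between 0 and 1.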

The filter is smaller than the input data, and the type of multiplication applied between a filter-sized patch of the input and the filter is a dot product: an element-wise multiplication between the patch and the filter, which is then summed, always resulting in a single value. Because it results in a single value, the operation is often referred to as the "scalar product".
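This element-wise multiply-and-sum can be shown in a few lines; the patch and kernel values below are arbitrary examples:

```python
import numpy as np

# A 3 x 3 patch of the input and a 3 x 3 filter (kernel).
patch = np.array([[1, 0, 1],
                  [0, 1, 0],
                  [1, 0, 1]])
kernel = np.array([[1, 1, 1],
                   [0, 0, 0],
                   [1, 1, 1]])

# Element-wise multiplication, then summation: always a single scalar.
value = np.sum(patch * kernel)
print(value)  # 4
```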

Three main terms are central to the CNN architecture:

  • Convolutions
  • Pooling 
  • Feature Maps

Central to the convolutional neural network is the convolutional layer that gives the network its name. This layer performs an operation called a "convolution". A convolution is a linear operation that involves the multiplication of a set of weights with the input, much like a traditional neural network.

A feature map is a collection of multiple neurons each of which looks at different regions of the input with the same weights. All neurons in a feature map extract the same feature (but from different regions of the input). It is called a ‘feature map’ because it is a mapping of where a certain feature is found in the image. 
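The two ideas above can be combined: sliding one filter over every region of the input, with the same weights at each position, produces a feature map. Below is a minimal sketch of a "valid" (no padding, stride 1) convolution; the image and kernel are illustrative placeholders:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image; each position yields one
    scalar product, and together they form the feature map."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Same weights applied to every region of the input.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(36).reshape(6, 6)      # a toy 6 x 6 input
kernel = np.array([[1, 0],
                   [0, -1]])             # a simple diagonal-difference filter
feature_map = convolve2d(image, kernel)
print(feature_map.shape)  # (5, 5)
```

Every entry of the feature map records how strongly the filter's pattern appears at that location, which is exactly the "mapping of where a feature is found" described above.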

After extracting features (as feature maps), CNNs typically aggregate these features using the pooling layer. 

Pooling tries to figure out whether a particular region in the image has the feature we are interested in or not. It essentially looks at larger regions (having multiple patches) of the image and captures an aggregate statistic (max, average etc.) of each region. In other words, it makes the network invariant to local transformations.

The two most popular aggregate functions used in pooling are 'max' and 'average'. The intuition behind these is as follows:

  • Max pooling: If any one of the patches says something strongly about the presence of a certain feature, then the pooling layer counts that feature as ‘detected’.
  • Average pooling: Even if one patch responds very strongly while the others disagree, the pooling layer averages all the responses to arrive at a consensus.
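Both variants can be sketched with the same loop, differing only in the aggregate statistic applied to each region; the input feature map below is made up for illustration:

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling over size x size regions."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            region = feature_map[i:i + size, j:j + size]
            # Max keeps the strongest response; average takes the consensus.
            out[i // size, j // size] = (
                region.max() if mode == "max" else region.mean()
            )
    return out

fm = np.array([[1, 3, 2, 0],
               [5, 6, 1, 2],
               [0, 1, 9, 4],
               [2, 3, 5, 7]])
print(pool2d(fm, mode="max"))      # [[6. 2.] [3. 9.]]
print(pool2d(fm, mode="average"))  # [[3.75 1.25] [1.5  6.25]]
```

Note that a 4 x 4 feature map shrinks to 2 x 2: this is the spatial reduction discussed below.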

Pooling has the advantage of making the representation more compact by reducing the spatial size (height and width) of the feature maps, thereby reducing the number of computations and the number of parameters to be learnt in subsequent layers. On the other hand, it also loses a lot of information, which is often considered a potential disadvantage. Having said that, pooling has empirically proven to improve the performance of most deep CNNs.
