CNNs, introduced by Yann LeCun in 1989, became popular for image recognition tasks.

Historically, CNNs have worked well for both audio and image tasks.

Image tasks

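When classifying large images, CNNs are a much better choice than plain fully connected neural networks (FCNNs) because they are far more memory efficient: a convolution layer reuses the same small filters at every position, while a fully connected layer needs a separate weight for every connection between an input pixel and a neuron.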
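As a rough illustration of the memory argument, the sketch below counts the learnable parameters of one fully connected layer and one convolution layer applied to the same input. The image size, the 64 hidden units, and the 64 filters of size 3x3 are assumptions picked for the example, not values from any particular model.

```python
def fc_params(in_features, out_features):
    # One weight per (input, output) pair plus one bias per output.
    return in_features * out_features + out_features

def conv_params(in_channels, out_channels, kernel_size):
    # One k x k kernel per (input channel, output channel) pair
    # plus one bias per filter.
    return out_channels * in_channels * kernel_size * kernel_size + out_channels

# Assumed example input: a 224 x 224 RGB image.
height, width, channels = 224, 224, 3

fc = fc_params(height * width * channels, 64)   # 64 hidden units
conv = conv_params(channels, 64, 3)              # 64 filters of size 3x3

print(f"fully connected: {fc:,} parameters")
print(f"convolution:     {conv:,} parameters")
```

The convolution count does not depend on the image size at all, because the same filters are reused at every position.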

The “curse of dimensionality” is a fancy name for the difficulties that come from the fact that images live in high-dimensional vector spaces: every pixel value is its own dimension.

The ImageNet challenge shaped the evolution of CNNs, starting with the AlexNet model in 2012.

The key property of traditional CNNs is that they are equivariant with respect to translation: if the input shifts, the feature maps shift by the same amount. In practice this means that if there is a cat in an image, the CNN can recognize it no matter where it is positioned, as long as it is not rotated. This property made CNNs practical for modern image tasks such as image segmentation, object detection, and object classification, regardless of where the object is located in the image.
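Translation equivariance can be checked on a toy 1D signal. The sketch below is plain Python rather than PyTorch, and the signal and the edge-detecting kernel are made-up values; it only holds exactly here because the "object" stays away from the borders.

```python
def correlate1d(signal, kernel):
    # "Valid" cross-correlation: slide the kernel over the signal, no padding.
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def shift_right(values, steps):
    # Shift by `steps` positions, filling the vacated positions with zeros.
    return [0] * steps + values[:len(values) - steps]

kernel = [1, 0, -1]                      # assumed toy edge detector
signal = [0, 0, 1, 2, 1, 0, 0, 0, 0]     # a small "object" surrounded by zeros

out = correlate1d(signal, kernel)
out_of_shifted = correlate1d(shift_right(signal, 2), kernel)

# Equivariance: filtering the shifted input equals shifting the filtered output.
print(out_of_shifted == shift_right(out, 2))
# Invariance after a global max: the strongest response is unchanged.
print(max(out) == max(out_of_shifted))
```

The first check is equivariance (the output moves with the input); the second shows how a global maximum on top of it gives a result that does not depend on position.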

You can construct special CNNs that are also equivariant with respect to rotation. These CNNs can recognize rotated objects.

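Some sources use the term invariant instead of equivariant. Strictly speaking the two differ: equivariant means the output shifts together with the input, while invariant means the output does not change at all.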

CNNs have filters, and the filter weights are what a CNN learns. Two things are important:

  • each convolution layer has multiple filters
  • filters are shared across neurons: the same filter weights are applied at every position of the input

Mimicking the human brain

CNNs address the curse of dimensionality by mimicking how the brain processes visual input. The idea goes back to the study of the visual cortex by Hubel and Wiesel (1964).

This work was so impactful that they won the Nobel Prize for it in 1981.

CNNs are famously equivariant with respect to translation.

Should we use bias in conv2d?

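It is possible to use a bias, but it is often disabled by setting bias=False. This is because the conv layer is usually followed by a batch normalization (BN) layer, which has its own learnable bias. BN also subtracts the batch mean, which cancels any constant bias the conv layer would add.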

nn.Conv2d(1, 20, 5, bias=False)
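The reason a conv bias is redundant in front of BN can be checked numerically. The sketch below is a simplified, plain-Python stand-in for batch normalization over one channel: it leaves out the learnable scale and shift of a real BN layer, and the activation values and the bias are made-up numbers.

```python
import math

def batch_norm(values, eps=1e-5):
    # Normalize to zero mean and unit variance over the batch
    # (the learnable scale and shift of a real BN layer are left out).
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

conv_out = [0.5, -1.0, 2.0, 3.5]   # assumed conv outputs for one channel, no bias
bias = 7.0                          # assumed conv bias for that channel

without_bias = batch_norm(conv_out)
with_bias = batch_norm([v + bias for v in conv_out])

# The mean subtraction removes the constant, so both results match.
same = all(math.isclose(a, b, abs_tol=1e-9)
           for a, b in zip(without_bias, with_bias))
print(same)
```

Since the bias has no effect on the normalized output, dropping it saves parameters without changing what the layer pair can represent.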

Why do we use pooling layers in CNNs?

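One of the reasons to use pooling layers is to grow the receptive field faster: each pooling layer lowers the resolution, so the convolutions that follow see a larger part of the original image.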

Say we have a 1000 by 1000 pixel image and we only use 3x3 convolutions with ~~~ . Then it takes about 500 layers until the receptive field covers the whole image, because each 3x3 convolution at stride 1 widens it by only 2 pixels. With pooling layers we make the feature maps smaller …

If there were no pooling layers, we would just stack convolution after convolution. We use pooling to shrink the feature maps in between.
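The receptive field arithmetic can be sketched in a few lines. The helper below uses the standard recurrence for stacked layers and assumes stride-1 3x3 convolutions, 2x2 pooling with stride 2, and no dilation; the layer stacks are illustrative choices, not a recommended architecture.

```python
def receptive_field(layers):
    # `layers` is a list of (kernel_size, stride) pairs, applied in order.
    rf, jump = 1, 1   # jump = distance in input pixels between neighboring outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

conv = (3, 1)   # 3x3 convolution, stride 1
pool = (2, 2)   # 2x2 pooling, stride 2

only_convs = receptive_field([conv] * 500)
with_pooling = receptive_field([conv, pool] * 9)

print(only_convs)     # 500 convolution layers, no pooling
print(with_pooling)   # 9 convolution layers, each followed by pooling
```

Under these assumptions, nine convolution and pooling pairs already see more than 1000 pixels, where convolutions alone need about 500 layers.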

Why do we use max pooling to lower the resolution and at the same time increase the number of filters?

By increasing the number of filters while lowering the resolution with max pooling, we try to keep the number of features roughly the same instead of letting it shrink with the resolution.
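A quick count makes the trade-off concrete. The shapes below are assumed example values (64 channels on a 56 x 56 grid), not taken from a specific network.

```python
def activations(channels, height, width):
    # Number of values in one feature map tensor of shape (channels, height, width).
    return channels * height * width

# Assumed example stage: 64 channels on a 56 x 56 grid.
before = activations(64, 56, 56)
pooled_only = activations(64, 28, 28)        # 2x2 max pooling, same number of filters
pooled_doubled = activations(128, 28, 28)    # 2x2 max pooling, twice the filters

print(before, pooled_only, pooled_doubled)
print(pooled_only / before, pooled_doubled / before)
```

In this example, doubling the filters halves the activation count instead of quartering it, so the balance is approximate rather than exact.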

What do the numbers 3 and 10 mean in nn.Conv2d(3, 10, 2, 2)?

The first argument, in_channels, is 3 for images with 3 channels (color images). For grayscale images it should be 1. Some satellite images may have 4 channels.

The second argument, out_channels, is the number of convolution filters: 10. The third argument is the kernel size, so the filters are 2x2, and the fourth argument is the stride, here 2.
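To see what these four numbers do to a tensor, the sketch below computes the output shape and the parameter count by hand, using the output size formula from the PyTorch nn.Conv2d documentation. The 32 x 32 input size is an assumption for the example.

```python
def conv2d_output_size(size, kernel_size, stride=1, padding=0, dilation=1):
    # Output length along one spatial dimension, following the formula
    # given in the PyTorch nn.Conv2d documentation.
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

in_channels, out_channels, kernel_size, stride = 3, 10, 2, 2

# Assumed input: one 32 x 32 RGB image, i.e. shape (3, 32, 32).
side = conv2d_output_size(32, kernel_size, stride)
output_shape = (out_channels, side, side)

# 10 filters, each with 3 x 2 x 2 weights, plus one bias per filter.
n_params = out_channels * in_channels * kernel_size * kernel_size + out_channels

print(output_shape)
print(n_params)
```

The number of output channels equals the number of filters, and the stride of 2 halves each spatial dimension.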

What is dilation?

Dilation inserts gaps between the kernel elements, so the same number of weights covers a larger area. It is easiest to understand from these two images:
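As a numeric companion to the idea of dilation, the sketch below lists which input positions a 3-tap kernel reads for a few dilation values. It is a 1D illustration with hypothetical helper names; a 2D kernel applies the same spacing along both axes.

```python
def tap_offsets(kernel_size, dilation=1):
    # Positions (relative to the first tap) that a 1D kernel reads from.
    return [i * dilation for i in range(kernel_size)]

def effective_size(kernel_size, dilation=1):
    # Span covered by the kernel, gaps included.
    return dilation * (kernel_size - 1) + 1

for d in (1, 2, 3):
    print(d, tap_offsets(3, d), effective_size(3, d))
```

With dilation 1 the kernel is the ordinary dense one; larger values cover a wider span with the same three weights.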


Why a 3x3 filter is the best.

According to the paper by Matt Zeiler. 17.3.3346

A few more tips about convolution

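  • Convolution is translation equivariant, so it handles changes in location, but not actions.
  • In PyTorch, convolution is actually implemented as cross-correlation: the kernel is not flipped.
  • In PyTorch, nn.ConvNd and F.convNd take the channel counts in reverse order: nn.Conv2d takes (in_channels, out_channels, ...), while the weight passed to F.conv2d has shape (out_channels, in_channels, ...).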
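The difference between correlation and textbook convolution is only a flip of the kernel. The plain-Python sketch below shows it on a 1D example; the signal and the deliberately asymmetric kernel are made-up values.

```python
def cross_correlate(signal, kernel):
    # Slide the kernel as-is over the signal (no padding).
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def convolve(signal, kernel):
    # Textbook convolution: flip the kernel, then slide it.
    return cross_correlate(signal, kernel[::-1])

signal = [1, 2, 3, 4, 5]
kernel = [1, 0, -2]          # assumed asymmetric kernel, so the flip matters

print(cross_correlate(signal, kernel))
print(convolve(signal, kernel))
```

For learned filters the distinction rarely matters, since the network can just as well learn the flipped kernel.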

Bag of tricks for CONV networks

The Bag of Tricks paper presents many tricks for training convolutional neural networks, such as:

  • Large batch training
  • Low precision training
  • Decay of the learning rate
  • ResNet tweaks
  • Label smoothing
  • Mixup training
  • Transfer learning
  • Semantic segmentation