This would be the 1d convolution in PyTorch
import torch import torch.nn.functional as F # batch, in, iW (input width) inputs = torch.randn(2, 1, 2) # out, in, kW (kernel width) filters = torch.randn(1, 1, 2) print("\ninputs:", inputs) print("\nfilters:", filters) res = F.conv1d(inputs, filters, padding=0) print("\nsize:", res.size()) print(res)
inputs: tensor([[[-1.3681, 0.8410]], [[-1.1009, 0.0678]]]) filters: tensor([[[1.5199, 0.8276]]]) size: torch.Size([2, 1, 1]) tensor([[[-1.3833]], [[-1.6172]]])
This is the output worth interpreting.
At first we need to provide the input and the kernel to convolve.
Input should have the format starting with
minibatch size, followed by
input size and then followed by
input width iW.
The single dimension of input means the input will have just the width. The line:
inputs = torch.randn(2, 1, 2) means minibatch size is two, input size is just 1 and the input width is just 2.
The filters tensor should have rank 3 for
conv1d. Again we need to provide the
output size, the
input size and the
kernel width. The constraint is the kernel width
kW must allays be equal or less than input width
# 2D convolution example import torch.nn.functional as F from matplotlib import pyplot as plt img = x_train/255 img.resize_(28,28) print("img show:", img.size()) plt.imshow(img, cmap="gray") plt.show() k = torch.tensor([1.,1.,1., 0.,0.,0., -1.,-1.,-1.]).reshape(3,3).unsqueeze(0).unsqueeze(0) print("kernel:", k.size()) # see how the dimensions of img and k should match img2 = img.unsqueeze(0).unsqueeze(0) print("img before conv2d:", img2.size()) res=F.conv2d(img2, k, padding=1) print(res.size()) rest = res.squeeze() plt.imshow(rest, cmap="gray") plt.show()
As you may understand from the image, the purpose of the convolution is to extract certain image features.
Input image size was
1,1,28,28 and the meaning of these numbers are the
mini batch size,
input width iW,
input height iH.
Then we have the kernel of size
1,1,3,3, and in here the meaning of these numbers is similar as for the
First one means the
out size, then the
in size, then the
Again the same rule must be, the kernel size must be smaller of equal to the input size. Since I used a little bit of padding I got the same output shape of 28x28.
Note that in the later example I used the convolution kernel that will sum to 0.
Convolution to linear
It is not easy to understand the how we ended from
self.conv2 = nn.Conv2d(20, 50, 5) to
self.fc1 = nn.Linear(4*4*50, 500) in the next example.
class M(nn.Module): def __init__(self): super().__init__() self.conv1 = nn.Conv2d(1, 20, 5) self.conv2 = nn.Conv2d(20, 50, 5) self.fc1 = nn.Linear(4*4*50, 500) self.fc2 = nn.Linear(500, 10) def forward(self, x): x = x.view(-1, 1, 28, 28) x = F.relu(self.conv1(x)) x = F.max_pool2d(x, 2) x = F.relu(self.conv2(x)) x = F.max_pool2d(x, 2) x = x.view(x.size(0), -1) x = F.relu(self.fc1(x)) x = self.fc2(x) return x
Common sense is telling that in and out should follow the same pattern all over again. E.g.
self.conv2 takes the out from previous layer as
in=20 and outputs
self.fc2 takes the
in=500 which is the out from
fc* means fully connected
But it is not like that. The next two lines break this seemingly obvious pattern.
self.conv2 = nn.Conv2d(20, 50, 5) self.fc1 = nn.Linear(4*4*50, 500)
In fact this pattern does work for linear layers, but may not work for convolution layers.
To understand we should print
x size after every line.
The next code also has the comments explaining the dimensions.
def forward(self, x): print(x.size()) #torch.Size([64, 784]) x = x.view(-1, 1, 28, 28) print(x.size()) #torch.Size([64, 1, 28, 28]) x = F.relu(self.conv1(x)) print(x.size()) #torch.Size([64, 20, 24, 24]) x = F.max_pool2d(x, 2) print(x.size()) #torch.Size([64, 20, 12, 12]) x = F.relu(self.conv2(x)) print(x.size()) #torch.Size([64, 50, 8, 8]) x = F.max_pool2d(x, 2) print(x.size()) #torch.Size([64, 50, 4, 4]) x = x.view(x.size(0), -1) print(x.size()) #torch.Size([64, 800]) x = F.relu(self.fc1(x)) print(x.size()) #torch.Size([64, 500]) x = self.fc2(x) print(x.size()) #torch.Size([64, 10]) return x
As we can see before we enter the
self.fc1 we end in size
torch.Size([64, 50, 4, 4]) that depends from the input size.
How do we get the convolution filters?
w = model.conv1.weight b = model.conv1.bias
Why do we need max pooling?
It is great for detecting the edges, since, max operation forces to pick up the edge values. It is also great for isolating the image details. Searching for hands in the image for example.
Can we compare nn.Conv2d first two parameters with nn.Linear first two parameters?
nn.Linear creates a matrix of
NxM. If we ignore the bias parameters we have
nn.Conv2d to calculate the parameters it is little funky, since it depends on kernel size:
c = nn.Conv2d(5,10, 2,2) for _ in c.parameters(): print(_.size()) print(_.nelement()) # torch.Size([10, 5, 2, 2]) # 200 # torch.Size() ==> bias # 10
So generally we cannot compare these two.
Are the nn.Conv2d parameters the filters?
Yes, weight parameters represent the filters.
Are Conv2d kernels equivalent to conv memory?
Yes, since we learned that with GD.
What is nn.Linear FC layer memory?
The parameters of nn.Linear layer, weights and bias represent the memory.
Is convolution with stride 2 equivalent to the convolution with stride 1 and the max pooling layer of 2?
No it is not. The following example explains the output is completely different, but the dimension of the output is the same.
tt = torch.randn(1,1,6,6) # 3x3 convolution kernel kk = torch.randn(1,1,3,3) # assumption convolution with stride 2 is equal as conv with stride 1 and max pooling of 2 res1 = F.conv2d(tt, kk, stride=2) print(res1) # tensor([[[[ 2.1621, 7.2672], [-1.5460, -1.2396]]]]) res2 = F.conv2d(tt, kk, stride=1) res2 = F.max_pool2d(res2,2) print(res2) # tensor([[[[5.8472, 7.2672], [1.3216, 1.9220]]]])
Should we use bias in conv2d?
It is possible to use bias, but often it is ignored, by setting it with
bias=False. This is because we usually use the BN behind the conv layer which has bias itself.
nn.Conv2d(1, 20, 5, bias=False)
Why we have max pooling to lower the resolution and at the same time we increase the number of filters?
By increasing the number of filters and by lowering the image using max pooling we try to keep the same number of features.
What does it means nn.Conv2d(3,10, 2,2) numbers 3 and 10?
in_channels in the beginning is
3 for images with 3 channels (colored images). For images black and white it should be 1. Some satellite images may have 4 in there.
out_channels is the number of convolution filters we have:
10. The filters will be of size 2x2.
What is dilation?
To explain what dilation is you can simple understand from these two images:
Why 3x3 filter is the best.
According to the paper from Max Zeiler. 17.3.3346
Few more tips about convolution
- Convolution is position invariant and handles location, but not actions.
- In PyTorch convolution is actually implemented as correlation.
- In PyTorach nn.ConvNd and F.convNd do have reverse order of parameters.