In PyTorch, images usually need to be prepared before they are used for training. One such preparation step is called image normalization.

If we use the PIL Image class to load an image and convert it to a PyTorch Tensor, we will need:

from PIL import Image
from torchvision.transforms import ToTensor

Our Tensor images will look like this after the conversion:

tensor([[[0.1725, 0.2353, 0.2941,  ..., 1.0000, 1.0000, 0.9843],
         [0.1804, 0.2314, 0.2824,  ..., 1.0000, 1.0000, 0.9843],
         [0.1922, 0.2314, 0.2706,  ..., 1.0000, 1.0000, 0.9843],
         ...,
         [0.7098, 0.6863, 0.6627,  ..., 0.6863, 0.6863, 0.6039],
         [0.6941, 0.6824, 0.6745,  ..., 0.6118, 0.6314, 0.5569],
         [0.6863, 0.6863, 0.6941,  ..., 0.2392, 0.2902, 0.3137]], ...

Note that the min and max values of this tensor are tensor(0.) and tensor(1.) respectively, because ToTensor scales 8-bit pixel values from [0, 255] to [0., 1.]. The histogram per channel will look like this:

Let's use the following PyTorch normalization function:

import torch

def normalize(x: torch.FloatTensor, mean: torch.FloatTensor, std: torch.FloatTensor) -> torch.FloatTensor:
    "Normalize `x` with `mean` and `std`."
    # mean and std have shape (C,); the two extra axes let them broadcast over H and W
    return (x - mean[..., None, None]) / std[..., None, None]

We provide a Tensor image x together with the mean and std values for the image set we are working with. This means the per-channel mean and std were computed in advance over all the images in the set.
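How such statistics might be computed can be sketched as follows. The random batch is a hypothetical stand-in for a real image set, and a large set would normally be processed in chunks rather than held in one tensor.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for an image set: 8 RGB images of 16x16 pixels in [0, 1].
images = torch.rand(8, 3, 16, 16)

# Reduce over batch, height and width so that one value per channel remains.
mean = images.mean(dim=(0, 2, 3))
std = images.std(dim=(0, 2, 3))

print(mean.shape, std.shape)
print(mean, std)
```

Both results have shape (3,), one entry per RGB channel, which is the form the normalize function expects.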

The following are some well-known (mean, std) tuples of per-channel RGB lists for different image sets:

cifar_stats = ([0.491, 0.482, 0.447], [0.247, 0.243, 0.261])
imagenet_stats = ([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
mnist_stats = ([0.15] * 3, [0.15] * 3)

In particular, [0.491, 0.482, 0.447] is the mean for the CIFAR image set: 0.491 is the mean of the Red channel, and so on. The standard deviation for the same image set is the list [0.247, 0.243, 0.261], where 0.247 is the std of the Red channel.

Note the Ellipsis notation (...) we used inside the normalize function. Its meaning in PyTorch indexing may not be obvious, so let's check this code:

from torch import Tensor

l = Tensor([1, 2, 3])
print(l)
# keep all existing axes (...) and append two new axes of size 1
r = l[..., None, None]
print(r)

This prints:

tensor([1., 2., 3.])
tensor([[[1.]],
        [[2.]],
        [[3.]]])

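Note how x - mean[..., None, None] subtracts each channel's mean from that channel: the mean is reshaped from (3,) to (3, 1, 1), so it broadcasts over the height and width of a (3, H, W) image. The division by std[..., None, None] then works per RGB channel in the same way.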
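The per-channel behaviour can be checked on a toy example. In this sketch the image, mean and std values are made up so that the arithmetic is easy to follow by hand.

```python
import torch

def normalize(x, mean, std):
    "Same formula as the normalize function in the article."
    return (x - mean[..., None, None]) / std[..., None, None]

x = torch.ones(3, 2, 2)                 # toy image: every pixel equals 1.0
mean = torch.tensor([0.0, 0.5, 1.0])    # made-up per-channel means
std = torch.tensor([1.0, 0.5, 2.0])     # made-up per-channel stds

print(mean[..., None, None].shape)      # torch.Size([3, 1, 1])

y = normalize(x, mean, std)
# channel 0: (1 - 0.0) / 1.0 = 1
# channel 1: (1 - 0.5) / 0.5 = 1
# channel 2: (1 - 1.0) / 2.0 = 0
print(y[:, 0, 0])                       # tensor([1., 1., 0.])

# The same computation written one channel at a time.
manual = torch.stack([(x[c] - mean[c]) / std[c] for c in range(3)])
print(torch.equal(y, manual))           # True
```

Broadcasting gives the same result as the explicit loop over channels, without writing the loop.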

In the end, we get a result like this, where our pixel values are centered around 0.

You may note that before, our pixel values were inside the [0., 1.] range, and now we have positive and negative values around 0, which generally suits training a neural network better.
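The effect can be verified numerically. This sketch assumes a random batch in place of a real image set and normalizes it with statistics computed from that same batch, so each channel should end up with a mean close to 0 and a std close to 1.

```python
import torch

torch.manual_seed(0)

# Hypothetical batch of 8 RGB images with values in [0, 1].
images = torch.rand(8, 3, 16, 16)

# Per-channel statistics of this batch.
mean = images.mean(dim=(0, 2, 3))
std = images.std(dim=(0, 2, 3))

# Same formula as normalize; (3, 1, 1) broadcasts against (8, 3, 16, 16).
normalized = (images - mean[..., None, None]) / std[..., None, None]

print(images.min().item(), images.max().item())          # inside [0, 1]
print(normalized.min().item(), normalized.max().item())  # negative and positive
print(normalized.mean(dim=(0, 2, 3)))                     # close to 0 per channel
print(normalized.std(dim=(0, 2, 3)))                      # close to 1 per channel
```

With published statistics such as imagenet_stats, the result for a single image is only approximately centered, since those values describe the whole image set rather than one picture.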