LSUV

LSUV procedure is based on All you need is a good init paper.

The proposal is in two steps:

(I) pre-initialize weights of each convolution or linear layer with orthonormal matrices,

(II) from the first to the last layer normalize the mean and std for each layer to be zero and one respectively.

Here presented is the initialization algorithm even without the orthonormal matrices initialization since it works anyways.

We initialize our neural net based on what PyTorch proposes by default then we pass a batch through the model and check the mean and std outputs of the linear and convolution layers.

We then rescale the weights to have the mean of 0 and std of 1. We repeat this process for every layer.

Let we start with examination, where I used MNIST dataset and the model like this:

MJ(
(conv1): Conv2d(1, 8, kernel_size=(5, 5), stride=(2, 2), padding=(1, 1))
(relu1): ReLU()
(conv2): Conv2d(8, 16, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2))
(relu2): ReLU()
(conv3): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2))
(relu3): ReLU()
(conv4): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2))
(relu4): ReLU()
(avg): AdaptiveAvgPool2d(output_size=1)
(l1): Linear(in_features=32, out_features=10, bias=True)
)

We can check out the outputs like this:

model = MJ()
l=[] # levels

def hook_print(module, inp, outp):        
    mean,std = outp.data.mean().item(),outp.data.std().item()
    print(module.__class__.__name__,":µ",mean,":σ",std)

    
for i, submod in enumerate(model.modules()):
        if (isinstance(submod, (nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.Linear))):
            l.append(i)            
            submod.register_forward_hook(hook_print)
    
print(l) # these are the layers we examine

#get the first mini batch = data
data, label = next(iter(train_ds))
output = model(data)

#remove the hooks
for m in model.modules():
    m._forward_hooks.clear()

And by default I can see the following:

Conv2d :µ -0.00550442561507225 :σ 0.5710428357124329
Conv2d :µ 0.030391748994588852 :σ 0.17474451661109924
Conv2d :µ -0.016749830916523933 :σ 0.08557184785604477
Conv2d :µ -0.0007881399360485375 :σ 0.040719881653785706
Linear :µ -0.008650993928313255 :σ 0.10739827901124954

Note: these are very bad results for standard deviation.

If I would use He normal initialization for all these layers that would be:

Conv2d :µ -0.10771720111370087 :σ 1.046576976776123
Conv2d :µ 0.13378898799419403 :σ 0.815869152545929
Conv2d :µ -0.04592322185635567 :σ 0.7471044063568115
Conv2d :µ -0.042254965752363205 :σ 0.6714988946914673
Linear :µ 0.02825448475778103 :σ 0.4340907335281372

This is way better, but can we go even better with LSUV?

def hook_update(module, inp, outp):        
    mean,std = outp.data.mean().item(), outp.data.std().item()    
    module.bias.data =  torch.nn.Parameter( module.bias.data - mean)
    module.weight.data = torch.nn.Parameter( module.weight.data /std )
    
def set_update_hooks(model, id):        
    i = 0
    for submod in model.modules():
        if(isinstance(submod, (nn.Conv2d, nn.Linear)) and id==i):            
            submod.register_forward_hook(hook_update) 
        i=i+1    

def hook_print(module, inp, outp):        
    mean,std = outp.data.mean().item(),outp.data.std().item()
    print(module.__class__.__name__,":µ",mean,":σ",std)
    
def set_print_hooks(model):
    for i, submod in enumerate(model.modules()):    
        if (isinstance(submod, (nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.Linear))):
            submod.register_forward_hook(hook_print)
    
model = MJ()
    
def p(model, data):
    print("---")
    set_print_hooks(model)
    output = model(data)

    #remove the hooks
    for m in model.modules():
        m._forward_hooks.clear()  
    
def update(model, id):    
    set_update_hooks(model, id)
    
    output = model(data)

    #remove the hooks
    for m in model.modules():
        m._forward_hooks.clear()            

it = iter(train_ds)
data1, label = next(it)        

p(model,data1)
for _ in l:
    update(model,_)

p(model, data1) 
data2, label = next(it)
p(model, data2) 

With this simple LSUV we get:

--- without LSUV and default PyTorch init
Conv2d :µ -0.006135555449873209 :σ 0.6363536715507507
Conv2d :µ -0.028499729931354523 :σ 0.22008129954338074
Conv2d :µ -0.0009620300261303782 :σ 0.08058485388755798
Conv2d :µ 0.007920045405626297 :σ 0.038900844752788544
Linear :µ -0.004452767316251993 :σ 0.0917159914970398
--- LSUV first minibatch
Conv2d :µ 0.0037989411503076553 :σ 0.9946872591972351
Conv2d :µ -0.0737309455871582 :σ 0.9935423731803894
Conv2d :µ -0.020194055512547493 :σ 0.9971215128898621
Conv2d :µ 0.12966477870941162 :σ 1.003449559211731
Linear :µ 0.3086388409137726 :σ 0.9397833943367004
--- LSUV on second minibatch
Conv2d :µ 0.015423248521983624 :σ 1.0560718774795532
Conv2d :µ -0.09113991260528564 :σ 1.0773249864578247
Conv2d :µ -0.018735704943537712 :σ 1.0790156126022339
Conv2d :µ 0.14292550086975098 :σ 1.0948491096496582
Linear :µ 0.38046377897262573 :σ 1.0043189525604248

The very first result is without LSUV, then the next result is with LSUV for the data1 minibatch, and the last result is the LSUV but with the next data2 minibatch.

As you may understand, this initialization, or lack of initialization is what stopped machine learning from progress in the early days.

This shows LSUV works better even than He normal initialization and it is applicable to any neural network leading to improved learning.

Depending on minibatch µ and σ values may very a bit, but it is possible even to calculate the initial weights and biases taking in account multiple minibatches.

It is also possible to calculate something like ideal initial weights for the training dataset taking in account all minibatches.

You can check out the example source code.