# There is only one way, and these are the two ways | Frequentist vs. Bayesian approach

- Two ways
- Parameters and the Data
- Cardinality matters
- How do we train
- Bayes Prior as Regularizer
- Online learning

## Two ways

There are two approaches to statistics:

- the **frequentist approach**
- the **Bayesian approach**

For the **frequentist**, experiments are **random**. For the **Bayesian**, experiments are **not random**.

For the fair coin toss example, a frequentist will say there is a 50% chance of getting heads.

A Bayesian scholar will say: if I somehow knew the initial velocity of the coin and all the other initial conditions, I would be able to predict the outcome.

The Bayesian way is the *subjective* approach to statistics, while the frequentist way is the *objective* approach.

In other words, for a Bayesian person the **experiments are not random**. There is always a **cause**.

## Parameters and the Data

A frequentist will say the parameters $\theta$ are fixed and the data $X$ is random: let's evaluate the likelihood (or log-likelihood), and then maximize the log-likelihood with respect to the parameters $\theta$.

The likelihood is the probability of the data given the parameters: $P(X \mid \theta)$.

A Bayesian will say the parameters $\theta$ are random and the data $X$ is fixed.

For those who train neural networks, the Bayesian approach makes a lot of sense.

This suggests that those who work with neural networks are actually Bayesian.

## Cardinality matters

Bayesian methods work for **any** number of data points $\vert X \vert$.

$\vert X \vert$ stands for the cardinality of $X$.

Frequentist methods work **only** when the number of data points is much bigger than the number of parameters: $\vert X \vert \gg \vert \theta\vert$.

Again, the Bayesian approach suits neural networks better, since we may have millions of parameters and just thousands of data points.

## How do we train

A frequentist will use the **maximum likelihood** principle (MLE) to train:

$\begin{aligned}\widehat{\theta}=\arg \max _{\theta} P(X \mid \theta)\end{aligned}$

A Bayesian will first try to estimate or compute the posterior: the probability of the parameters given the data, $P(\theta \mid X)$.

$P(\theta \mid X) = \Large \frac{P(X \mid \theta) P(\theta)}{P(X)}$
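As a concrete sketch, Bayes' rule can be evaluated numerically by discretizing $\theta$. The coin-toss setup below (7 heads in 10 tosses, a flat prior, and a grid of candidate values) is an illustrative assumption, not something specified above:

```python
import numpy as np

theta = np.linspace(0.01, 0.99, 99)       # grid of candidate parameters θ
prior = np.ones_like(theta) / len(theta)  # flat prior P(θ)
heads, tosses = 7, 10                     # assumed observed data X
likelihood = theta**heads * (1 - theta)**(tosses - heads)  # P(X | θ)
evidence = (likelihood * prior).sum()                      # P(X)
posterior = likelihood * prior / evidence                  # P(θ | X)
print("posterior sums to:", posterior.sum())
print("most probable θ:", theta[posterior.argmax()])
```

The grid posterior sums to one (up to floating point), and its peak sits at $\theta = 0.7$, matching the observed frequency of heads under the flat prior.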

**Example:** Estimate the parameters from data samples

Here we generate samples from a Gaussian distribution $\mathcal N(\mu=3, \sigma=2)$. Once we have the samples, we compute the maximum likelihood estimates of the parameters $\mu$ and $\sigma$ from them.

```python
import numpy as np
from numpy import random

x = random.normal(loc=3, scale=2, size=3000)  # samples from N(μ=3, σ=2)
N = len(x)                                    # number of samples
mu = x.mean()                                 # MLE of the mean
print("μ =", mu)
sigma = np.sqrt(((x - mu)**2).sum() / N)      # MLE of the std (biased, 1/N)
print("σ =", sigma)
```

Out:

```
μ = 3.0343101612162453
σ = 1.9620811647405432
```

The Bayesian training principle is MAP (the maximum a posteriori estimate). The MAP estimate of an unknown quantity equals the **mode** of the posterior distribution.

The mode of a distribution is the value that occurs most frequently in a data set. The mode is usually of interest for *bigger* data sets.

We can have **unimodal**, **bimodal**, and **multimodal** distributions. For example, bimodal distributions show two peaks in their frequency diagrams.
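To make the mode concrete, here is a small sketch. The two-Gaussian mixture and the histogram-based mode estimator are illustrative assumptions, not the only way to estimate a mode:

```python
import numpy as np

rng = np.random.default_rng(0)
# Unimodal: a single Gaussian. Bimodal: a mixture of two Gaussians.
unimodal = rng.normal(loc=3, scale=2, size=5000)
bimodal = np.concatenate([rng.normal(-2, 1, 2500), rng.normal(4, 1, 2500)])

def hist_mode(x, bins=50):
    # Mode estimate: center of the most populated histogram bin.
    counts, edges = np.histogram(x, bins=bins)
    i = counts.argmax()
    return (edges[i] + edges[i + 1]) / 2

print("unimodal mode ≈", hist_mode(unimodal))  # near 3, the single peak
print("bimodal mode ≈", hist_mode(bimodal))    # near one of the two peaks
```

For the bimodal sample the single mode reported is whichever of the two peaks happens to collect slightly more samples, which is exactly why a single point estimate can be misleading for multimodal posteriors.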

**Example:** Classification in Bayesian eyes

The Bayesian simply asks: what is the probability of the parameters $\theta$ **given the data**?

"Theta given data" is the famous Bayesian mantra.

At training time, where $X_{tr}$ are the data rows and $y_{tr}$ are the targets, we have:

$P\left(\theta \mid X_{\mathrm{tr}}, y_{\mathrm{tr}}\right)= \Large \frac{P\left(y_{\mathrm{tr}} \mid X_{\mathrm{tr}}, \theta\right) P(\theta)}{P\left(y_{\mathrm{tr}} \mid X_{\mathrm{tr}}\right)}$

We fit the model by maximizing the data likelihood: the probability of the targets $y_{tr}$ given the data rows $X_{tr}$ and the parameters $\theta$.

At prediction time we also have the test set data, and we need the probability of the test labels $y_{ts}$ given $X_{ts}, X_{tr}, y_{tr}$.

To do so we use marginalization over the parameters:

$P\left(y_{\mathrm{ts}} \mid X_{\mathrm{ts}}, X_{\mathrm{tr}}, y_{\mathrm{tr}}\right)=\int P\left(y_{\mathrm{ts}} \mid X_{\mathrm{ts}}, \theta\right) P\left(\theta \mid X_{\mathrm{tr}}, y_{\mathrm{tr}}\right) d\theta$

Note, we estimated $P\left(\theta \mid X_{\mathrm{tr}}, y_{\mathrm{tr}}\right)$ during the training procedure.

Note: the Bayesian prediction is a weighted average of the model's outputs over all possible values of the parameters.
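A minimal numeric sketch of this marginalization, with a Bernoulli (coin) model standing in for a neural network. The training posterior assumed below (7 heads in 10 tosses under a flat prior) is an illustrative choice:

```python
import numpy as np

# Approximate the predictive integral by a discrete sum over θ.
theta = np.linspace(0.01, 0.99, 99)
# Posterior P(θ | X_tr, y_tr): flat prior, 7 heads in 10 training tosses.
post = theta**7 * (1 - theta)**3
post /= post.sum()
# P(y_ts = heads | θ) = θ, so the prediction is the posterior-weighted mean:
p_heads = (theta * post).sum()
print("P(next toss is heads) ≈", p_heads)  # ≈ 8/12 ≈ 0.667
```

Note the prediction is roughly $8/12 \approx 0.667$ rather than the MLE's $0.7$: averaging over all plausible $\theta$ pulls the estimate toward the prior.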

## Bayes Prior as Regularizer

A Bayesian uses the prior $P(\theta)$ as a regularizer.

$P(\theta \mid X) = \Large \frac{P(X \mid \theta) P(\theta)}{P(X)}$

If you set the coin-toss prior for heads to $P(\theta)=0.5$, that is what makes the coin tossing **fair**; any other prior makes it **not fair**.

*Frequentists simply don't have such a tool.*
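One well-known instance of the prior-as-regularizer idea: for a linear model with Gaussian noise, a zero-mean Gaussian prior on the weights makes the MAP estimate coincide with L2-regularized (ridge) regression. The toy data below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy linear model y = Xw + noise (illustrative setup).
X = rng.normal(size=(50, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=50)

lam = 1.0  # prior precision plays the role of the L2 strength λ
# MAP with Gaussian prior = ridge closed form: w = (XᵀX + λI)⁻¹ Xᵀy
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
# Plain MLE (no prior) = ordinary least squares:
w_mle = np.linalg.solve(X.T @ X, X.T @ y)
print("w_map ≈", w_map)  # shrunk toward zero relative to w_mle
print("w_mle ≈", w_mle)
```

The prior pulls the MAP weights toward its mean (zero), which is exactly what an L2 penalty does in the frequentist loss.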

## Online learning

Online learning means adapting the prior: the posterior from one observation becomes the prior for the next iteration.

$P_{k+1}(\theta)=P\left(\theta \mid x_{k}\right)= \Large \frac{P\left(x_{k} \mid \theta\right) P_{k}(\theta)}{P\left(x_{k}\right)}$

The posterior computed with the old prior $P_k(\theta)$ becomes the new prior $P_{k+1}(\theta)$ for the next iteration, and so on.

We use the new posterior as the prior for the next experiment.
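A minimal sketch of this update loop, assuming a coin with a conjugate Beta prior (an illustrative choice: conjugacy keeps the posterior in the same family, so each update is just two additions):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 1.0, 1.0   # Beta(1, 1) = flat prior P_0(θ) over the head probability
true_theta = 0.7  # assumed true coin bias, unknown to the learner

for k in range(1000):
    x_k = rng.random() < true_theta  # observe one new toss x_k
    # Posterior after x_k becomes the prior P_{k+1}(θ) for the next step:
    a, b = a + x_k, b + (1 - x_k)

print("posterior mean ≈", a / (a + b))  # close to the true bias 0.7
```

After a thousand one-at-a-time updates, the prior has absorbed all the evidence, exactly as if the data had been processed in a single batch.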

**Appendix:** One interesting long video on this topic.