# There is only one way, and these are the two ways | Frequentist vs. Bayesian approach

- Two ways
- Parameters and the Data
- Cardinality matters
- How do we train
- Bayes Prior as Regularizer
- Online learning

## Two ways

There are two approaches to statistics:

- the **frequentist approach**
- the **Bayesian approach**

For the **frequentist**, experiments are **random**. For the **Bayesian**, experiments are **not random**.

For the fair coin toss example, a frequentist will say there is a 50% chance of getting heads.

A Bayesian scholar will say: if I somehow knew the initial velocity of the coin and all the other initial conditions, I would be able to predict the outcome.

The Bayesian way is the *subjective* approach to statistics, while the frequentist way is the *objective* approach.

In other words, for a Bayesian person the **experiments are not random**. There is always a **cause**.

## Parameters and the Data

A frequentist will say the parameters $\theta$ are fixed and the data $X$ is random: let's evaluate the likelihood (or log-likelihood), and then maximize the log-likelihood with respect to the parameters $\theta$.

The likelihood is the probability of the data given the parameters: $P(X \mid \theta)$.

A Bayesian will say the parameters $\theta$ are random and the data $X$ is fixed.

For those who train neural networks, the Bayesian approach makes a lot of sense.

This suggests that those who work with neural networks are actually Bayesian.

## Cardinality matters

Bayesian methods work for **any** number of data points $\vert X \vert$.

$\vert X \vert$ stands for the cardinality of $X$.

Frequentist methods work **only** when the number of data points is much bigger than the number of parameters: $\vert X \vert \gg \vert \theta\vert$.

Again, the Bayesian approach suits neural networks better, since we may have millions of parameters and just thousands of data points.

## How do we train

A frequentist will use the **maximum likelihood** principle (MLE) to train:

$\begin{aligned}\widehat{\theta}=\arg \max _{\theta} P(X \mid \theta)\end{aligned}$

A Bayesian will first try to estimate or compute the posterior: the probability of the parameters given the data, $P(\theta \mid X)$.

$P(\theta \mid X) = \Large \frac{P(X \mid \theta) P(\theta)}{P(X)}$
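As a concrete sketch, Bayes' rule can be evaluated numerically by discretizing $\theta$. The coin-toss setup below (7 heads in 10 tosses, a flat prior, and a grid of candidate values) is an illustrative assumption, not something specified above:

```python
import numpy as np

theta = np.linspace(0.01, 0.99, 99)       # grid of candidate parameters θ
prior = np.ones_like(theta) / len(theta)  # flat prior P(θ)
heads, tosses = 7, 10                     # assumed observed data X
likelihood = theta**heads * (1 - theta)**(tosses - heads)  # P(X | θ)
evidence = (likelihood * prior).sum()                      # P(X)
posterior = likelihood * prior / evidence                  # P(θ | X)
print("posterior sums to:", posterior.sum())
print("most probable θ:", theta[posterior.argmax()])
```

The grid posterior sums to one (up to floating point), and its peak sits at $\theta = 0.7$, matching the observed frequency of heads under the flat prior.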

**Example:** Estimate the parameters from data samples

Here we generate samples from a Gaussian distribution $\mathcal N(\mu=3, \sigma=2)$. Once we have the samples, we compute the maximum likelihood estimates of the parameters $\mu$ and $\sigma$ from them.

```python
import numpy as np
from numpy import random

x = random.normal(loc=3, scale=2, size=3000)  # samples from N(μ=3, σ=2)
N = len(x)                                    # number of samples
mu = x.mean()                                 # MLE of the mean
print("μ =", mu)
sigma = np.sqrt(((x - mu)**2).sum() / N)      # MLE of the std (biased, 1/N)
print("σ =", sigma)
```

Out:

```
μ = 3.0343101612162453
σ = 1.9620811647405432
```

The Bayesian training principle is MAP (the maximum a posteriori estimate). The MAP estimate of an unknown quantity equals the **mode** of the posterior distribution.

The mode of a distribution is the value that occurs most frequently in a data set. The mode is usually of interest for *bigger* data sets.

We can have **unimodal**, **bimodal**, and **multimodal** distributions. For example, bimodal distributions show two peaks in their frequency diagrams.
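To make the mode concrete, here is a small sketch. The two-Gaussian mixture and the histogram-based mode estimator are illustrative assumptions, not the only way to estimate a mode:

```python
import numpy as np

rng = np.random.default_rng(0)
# Unimodal: a single Gaussian. Bimodal: a mixture of two Gaussians.
unimodal = rng.normal(loc=3, scale=2, size=5000)
bimodal = np.concatenate([rng.normal(-2, 1, 2500), rng.normal(4, 1, 2500)])

def hist_mode(x, bins=50):
    # Mode estimate: center of the most populated histogram bin.
    counts, edges = np.histogram(x, bins=bins)
    i = counts.argmax()
    return (edges[i] + edges[i + 1]) / 2

print("unimodal mode ≈", hist_mode(unimodal))  # near 3, the single peak
print("bimodal mode ≈", hist_mode(bimodal))    # near one of the two peaks
```

For the bimodal sample the single mode reported is whichever of the two peaks happens to collect slightly more samples, which is exactly why a single point estimate can be misleading for multimodal posteriors.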

**Example:** Classification in Bayesian eyes

The Bayesian simply asks: what is the probability of the parameters $\theta$ **given the data**?

"Theta given data" is the famous Bayesian mantra.

At training time, where $X_{tr}$ are the data rows and $y_{tr}$ are the targets, we have:

$P\left(\theta \mid X_{\mathrm{tr}}, y_{\mathrm{tr}}\right)= \Large \frac{P\left(y_{\mathrm{tr}} \mid X_{\mathrm{tr}}, \theta\right) P(\theta)}{P\left(y_{\mathrm{tr}} \mid X_{\mathrm{tr}}\right)}$

We fit the model by maximizing the data likelihood: the probability of the targets $y_{tr}$ given the data rows $X_{tr}$ and the parameters $\theta$.

At prediction time we also have the test set data, and we need the probability of the test labels $y_{ts}$ given $X_{ts}, X_{tr}, y_{tr}$.

To do so we use marginalization over the parameters:

$P\left(y_{\mathrm{ts}} \mid X_{\mathrm{ts}}, X_{\mathrm{tr}}, y_{\mathrm{tr}}\right)=\int P\left(y_{\mathrm{ts}} \mid X_{\mathrm{ts}}, \theta\right) P\left(\theta \mid X_{\mathrm{tr}}, y_{\mathrm{tr}}\right) d\theta$

Note, we estimated $P\left(\theta \mid X_{\mathrm{tr}}, y_{\mathrm{tr}}\right)$ during the training procedure.

Note: the Bayesian prediction is a weighted average of the model's outputs over all possible values of the parameters.
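A minimal numeric sketch of this marginalization, with a Bernoulli (coin) model standing in for a neural network. The training posterior assumed below (7 heads in 10 tosses under a flat prior) is an illustrative choice:

```python
import numpy as np

# Approximate the predictive integral by a discrete sum over θ.
theta = np.linspace(0.01, 0.99, 99)
# Posterior P(θ | X_tr, y_tr): flat prior, 7 heads in 10 training tosses.
post = theta**7 * (1 - theta)**3
post /= post.sum()
# P(y_ts = heads | θ) = θ, so the prediction is the posterior-weighted mean:
p_heads = (theta * post).sum()
print("P(next toss is heads) ≈", p_heads)  # ≈ 8/12 ≈ 0.667
```

Note the prediction is roughly $8/12 \approx 0.667$ rather than the MLE's $0.7$: averaging over all plausible $\theta$ pulls the estimate toward the prior.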

## Bayes Prior as Regularizer

A Bayesian uses the prior $P(\theta)$ as a regularizer.

$P(\theta \mid X) = \Large \frac{P(X \mid \theta) P(\theta)}{P(X)}$

If you set the coin-toss prior for heads to $P(\theta)=0.5$, that is what makes the coin tossing **fair**; any other prior makes it **not fair**.

*Frequentists simply don't have such a tool.*
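One well-known instance of the prior-as-regularizer idea: for a linear model with Gaussian noise, a zero-mean Gaussian prior on the weights makes the MAP estimate coincide with L2-regularized (ridge) regression. The toy data below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy linear model y = Xw + noise (illustrative setup).
X = rng.normal(size=(50, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=50)

lam = 1.0  # prior precision plays the role of the L2 strength λ
# MAP with Gaussian prior = ridge closed form: w = (XᵀX + λI)⁻¹ Xᵀy
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
# Plain MLE (no prior) = ordinary least squares:
w_mle = np.linalg.solve(X.T @ X, X.T @ y)
print("w_map ≈", w_map)  # shrunk toward zero relative to w_mle
print("w_mle ≈", w_mle)
```

The prior pulls the MAP weights toward its mean (zero), which is exactly what an L2 penalty does in the frequentist loss.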

## Online learning

Online learning means adapting the prior: the posterior from one observation becomes the prior for the next iteration.

$P_{k+1}(\theta)=P\left(\theta \mid x_{k}\right)= \Large \frac{P\left(x_{k} \mid \theta\right) P_{k}(\theta)}{P\left(x_{k}\right)}$

The posterior computed with the old prior $P_k(\theta)$ becomes the new prior $P_{k+1}(\theta)$ for the next iteration, and so on.

We use the new posterior as the prior for the next experiment.
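A minimal sketch of this update loop, assuming a coin with a conjugate Beta prior (an illustrative choice: conjugacy keeps the posterior in the same family, so each update is just two additions):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 1.0, 1.0   # Beta(1, 1) = flat prior P_0(θ) over the head probability
true_theta = 0.7  # assumed true coin bias, unknown to the learner

for k in range(1000):
    x_k = rng.random() < true_theta  # observe one new toss x_k
    # Posterior after x_k becomes the prior P_{k+1}(θ) for the next step:
    a, b = a + x_k, b + (1 - x_k)

print("posterior mean ≈", a / (a + b))  # close to the true bias 0.7
```

After a thousand one-at-a-time updates, the prior has absorbed all the evidence, exactly as if the data had been processed in a single batch.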

**Appendix:** One interesting long video on this topic.