MLE for the coin toss example
Consider the typical coin-tossing example: a coin lands heads with probability $p$ and is tossed $n$ times. Let's calculate the Maximum Likelihood Estimate (MLE) of the heads probability.
The number of heads $k$ in $n$ tosses follows a Binomial distribution, given by:
$\operatorname{Bin}(k;n,p) = \binom{n}{k}p^k(1-p)^{n-k}$
(Read: the probability of observing $k$ heads, where the distribution is parametrized by $n$ and $p$.)
We have:
$n = H + T$ is the total number of tosses, and $H = k$ is the number of heads.
Leading to:
$\operatorname{Bin}(H;H+T,p) = \binom{H+T}{H}p^H(1-p)^{T}$
$p_{\text{MLE}} = \underset{p}{\operatorname{arg\,max}} \binom{H+T}{H}p^H(1-p)^{T}$
$=\underset{p}{\operatorname{arg\,max}} \operatorname{log} \big[ \binom{H+T}{H}p^H(1-p)^{T} \big]$
$=\underset{p}{\operatorname{arg\,max}} \big[ \operatorname{log} \binom{H+T}{H} + \operatorname{log} p^H + \operatorname{log}(1-p)^{T} \big]$
$=\underset{p}{\operatorname{arg\,max}} \big[ H \operatorname{log} p + T \operatorname{log}(1-p) \big]$
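Taking the logarithm does not change the argmax, because $\operatorname{log}$ is strictly increasing. It turns the product into a sum, which is easier to differentiate and numerically more stable. We also dropped the constant term $\operatorname{log} \binom{H+T}{H}$: it does not depend on $p$, so it does not affect the argmax.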
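The claim that the logarithm and the dropped constant leave the argmax unchanged can be checked numerically. The following is a minimal sketch, not a definitive implementation: the counts (7 heads, 3 tails), the grid resolution, and the function names are all assumptions chosen for illustration.

```python
from math import comb, log


def likelihood(p: float, heads: int, tails: int) -> float:
    """Full Binomial likelihood, including the binomial coefficient."""
    return comb(heads + tails, heads) * p**heads * (1 - p) ** tails


def log_likelihood_no_const(p: float, heads: int, tails: int) -> float:
    """Log-likelihood with the p-independent constant dropped."""
    return heads * log(p) + tails * log(1 - p)


def grid_argmax(fn, heads: int, tails: int, steps: int = 1000) -> float:
    """Return the grid point in (0, 1) that maximizes fn."""
    # Interior points only, so that log(p) and log(1 - p) stay finite.
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda p: fn(p, heads, tails))


# Hypothetical counts, used only for this illustration.
heads, tails = 7, 3
p_from_likelihood = grid_argmax(likelihood, heads, tails)
p_from_log = grid_argmax(log_likelihood_no_const, heads, tails)
print(p_from_likelihood, p_from_log)  # both maximizers are the same grid point
```

A grid search is used here only because it makes no assumption about the shape of the function; it is not how one would compute the estimate in practice.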
To get the MLE, we will find where the first derivative is equal to zero:
$\large \frac{\partial [ H \operatorname{log} p + T \operatorname{log}(1-p)]}{\partial p}=\small 0$
And this is true for:
$\large \frac{H}{p} = \frac{T}{1-p}$
So:
$\large p_{\small \text{MLE}} = \frac{H}{T+H}$
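As a quick check of the algebra (a sketch, assuming $0 < p < 1$ and $H, T > 0$), solving the stationarity condition for $p$ and looking at the second derivative suggests that this point is indeed a maximum:
\[\begin{aligned} \frac{H}{p} = \frac{T}{1-p} \;&\Longrightarrow\; H(1-p) = Tp \;\Longrightarrow\; H = p\,(H+T) \;\Longrightarrow\; p = \frac{H}{T+H}, \\ \frac{\partial^2 [ H \operatorname{log} p + T \operatorname{log}(1-p)]}{\partial p^2} &= -\frac{H}{p^2} - \frac{T}{(1-p)^2} < 0 . \end{aligned}\]
Since the second derivative is negative everywhere on $(0,1)$, the log-likelihood is concave there and the stationary point is its unique maximizer.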
We could reach the same conclusion intuitively. Say we have observed the following tosses:
$\mathcal{T}=\{h, h, h, t, t, h, t, t, t, h, t \}$, where $\mathcal{T}$ is our sequence of tosses with $n = T + H = 11$ elements, of which $H = 5$ are heads. Based on this example alone:
$\large p_{\small \text{MLE}}$ is ${H \over {T+H}} = {5 \over 11}$.
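The same number can be reproduced by counting. Here is a minimal sketch, assuming the tosses are stored as a list of `"h"` and `"t"` strings; the function name `mle_heads_probability` is made up for this illustration.

```python
from fractions import Fraction

# The toss sequence from the example above.
tosses = ["h", "h", "h", "t", "t", "h", "t", "t", "t", "h", "t"]


def mle_heads_probability(tosses: list[str]) -> Fraction:
    """Return H / (H + T) as an exact fraction."""
    if not tosses:
        raise ValueError("need at least one toss to estimate p")
    heads = sum(1 for toss in tosses if toss == "h")
    return Fraction(heads, len(tosses))


p_mle = mle_heads_probability(tosses)
print(p_mle)         # 5/11
print(float(p_mle))  # the same estimate as a decimal
```

`Fraction` is used so that the result matches the closed form exactly instead of a rounded decimal.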
Addendum
Bernoulli distribution
The Bernoulli distribution is a distribution for a single binary random variable $X$ with state $x \in \{0,1\}$. It is governed by a single continuous parameter $\mu \in [0,1]$ that represents the probability of $X=1$. The Bernoulli distribution $\operatorname{Ber}(\mu)$ is defined as:
\[\begin{aligned} p(x \mid \mu) &=\mu^{x}(1-\mu)^{1-x}, \quad x \in\{0,1\}, \\ \mathbb{E}[x] &=\mu, \\ \mathbb{V}[x] &=\mu(1-\mu) \end{aligned}\]where $\mathbb{E}[x]$ and $\mathbb{V}[x]$ are the mean and variance of the binary random variable $X$.
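The mean and variance formulas can be verified directly from the probability mass function by summing over the two states. This is a small sketch; the function names and the test value $\mu = 0.3$ are assumptions for illustration.

```python
def bernoulli_pmf(x: int, mu: float) -> float:
    """p(x | mu) = mu^x * (1 - mu)^(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return mu**x * (1 - mu) ** (1 - x)


def bernoulli_moments(mu: float) -> tuple[float, float]:
    """Compute mean and variance by summing over both states."""
    mean = sum(x * bernoulli_pmf(x, mu) for x in (0, 1))
    variance = sum((x - mean) ** 2 * bernoulli_pmf(x, mu) for x in (0, 1))
    return mean, variance


mu = 0.3  # hypothetical parameter value
mean, variance = bernoulli_moments(mu)
print(mean, variance)  # compare with mu and mu * (1 - mu)
```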
Binomial distribution
The Binomial distribution is a generalization of the Bernoulli distribution.
In particular, the Binomial can be used to describe the probability of observing $m$ occurrences of $X=1$ in a set of $N$ samples (number of trials) from a Bernoulli distribution where $p(X=1)=\mu \in [0,1]$. The Binomial distribution $\operatorname{Bin}(N, \mu)$ is defined as:
\[\begin{aligned} p(m \mid N, \mu) &=\left(\begin{array}{c} N \\ m \end{array}\right) \mu^{m}(1-\mu)^{N-m} \\ \mathbb{E}[m] &=N \mu \\ \mathbb{V}[m] &=N \mu(1-\mu) \end{aligned}\]where $\mathbb{E}[m]$ and $\mathbb{V}[m]$ are the mean and variance of $m$, respectively.
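The Binomial moments can be checked the same way, by summing over all possible counts $m = 0, \dots, N$. This is a minimal sketch; the function names and the values $N = 11$, $\mu = 0.3$ are assumptions for illustration.

```python
from math import comb


def binomial_pmf(m: int, n_trials: int, mu: float) -> float:
    """Probability of m successes in n_trials Bernoulli(mu) trials."""
    return comb(n_trials, m) * mu**m * (1 - mu) ** (n_trials - m)


def binomial_moments(n_trials: int, mu: float) -> tuple[float, float]:
    """Compute mean and variance of m by summing over m = 0..n_trials."""
    support = range(n_trials + 1)
    mean = sum(m * binomial_pmf(m, n_trials, mu) for m in support)
    variance = sum(
        (m - mean) ** 2 * binomial_pmf(m, n_trials, mu) for m in support
    )
    return mean, variance


n_trials, mu = 11, 0.3  # hypothetical values
mean, variance = binomial_moments(n_trials, mu)
print(mean, variance)  # compare with N * mu and N * mu * (1 - mu)
```

With $N = 1$ the same functions reduce to the Bernoulli case, which is one way to see the generalization.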