Let’s put logistic regression in the right context and show what kind of model it represents.

LR is a discriminative model

Consider a dependent variable $Y$ and an independent variable $X$, both random, with the observed samples assumed to be IID.

To create a classifier, one can form the joint model $p(y,x)$ and then condition on $X$. This is the generative approach.

The other approach in machine learning is the discriminative approach, and this is where logistic regression fits in.

In the discriminative approach we fit a model of the posterior $p(Y \mid X)$ directly.

Binary classification model

$P(Y \mid X, W)=\mathsf{Ber}\left(Y \mid \operatorname{sigm}\left(W^{T} X\right)\right)$

Here $P$ denotes a discrete distribution, $p$ a continuous one, and $\operatorname{sigm}$ is the sigmoid function.
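As a concrete (purely illustrative) sketch of this model, the NumPy snippet below draws labels from $\mathsf{Ber}(Y \mid \operatorname{sigm}(W^{T}X))$; the weights `w` and features `X` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigm(z):
    # sigmoid function: 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

d, m = 3, 5                       # feature dimension and number of samples
w = np.array([0.5, -1.0, 2.0])    # arbitrary weight vector W
X = rng.normal(size=(m, d))       # rows are feature vectors x_i

p = sigm(X @ w)                   # P(Y = 1 | x_i, W) = sigm(W^T x_i) for each sample
y = rng.binomial(1, p)            # Y ~ Ber(sigm(W^T X))
print(p.round(3), y)
```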

Logistic Regression is a linear model

A classifier is linear if its decision boundary in the feature space is a linear function: positive and negative examples are separated by a hyperplane.

Logistic regression uses a linear decision boundary. Imagine you trained a logistic regression model and obtained the coefficients $\beta_i$. You then classify a test record $x \in \mathbb{R}^{d}$ as positive if $P(x) > 0.5$.

Your logistic regression model gives this probability as: \(P(x) = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_d x_d)}}\)

If you work out the math you see that the decision boundary $P(x) = 0.5$ is a hyperplane in the feature space which separates positive from negative examples.
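Concretely, since the sigmoid is strictly increasing and crosses $0.5$ exactly when its argument is $0$:

\[P(x) > 0.5 \iff \beta_{0}+\beta_{1} x_{1}+\dots+\beta_{d} x_{d} > 0\]

so the boundary $\beta_{0}+\beta_{1} x_{1}+\dots+\beta_{d} x_{d}=0$ is a hyperplane, and the two classes occupy the half-spaces on either side of it.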

LR model

To build the LR model we take $m$ samples \(\{x_{i}, y_{i}\}\) such that $x_{i} \in \mathbb{R}^{d}$ and $y_{i} \in \{0, 1\}$.

Here $y_i$ is our dependent variable and $x_i$ is the independent variable. You can think of $x_i$ as spanning multiple feature columns, while $y_i$ is the target column.

$d$ is the number of independent (feature) columns we have.

To define the formal model we need:

  • hypothesis function,
  • loss function

You may guess the hypothesis function is the logistic function (also known as the sigmoid, $\sigma$; its inverse is the logit):

\[h_{\omega}\left(x_{i}\right)=\sigma\left(\omega^{T} x_{i}\right)=\sigma\left(z_{i}\right)=\frac{1}{1+e^{-z_{i}}}\]

where $\omega \in \mathbb{R}^{d}$ and $z_{i}=\omega^{T} x_{i}$.

We will be computing gradients with respect to the weights $\omega$.

The loss function is then defined as: \(l(\omega)=\sum_{i=1}^{m}-\left(y_{i} \log \sigma\left(z_{i}\right)+\left(1-y_{i}\right) \log \left(1-\sigma\left(z_{i}\right)\right)\right)\)

Or in indexed notation: \(l_i(\omega)=-y_i\log\sigma(z_i)-(1-y_i)\log(1-\sigma(z_i))\)
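As a minimal sketch of these two pieces (the helper names `sigma`, `hypothesis`, and `loss` are ours, not from any library), the hypothesis and the loss can be written as:

```python
import numpy as np

def sigma(z):
    # logistic / sigmoid function
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(omega, x):
    # h(x_i) = sigma(omega^T x_i)
    return sigma(omega @ x)

def loss(omega, X, y):
    # cross-entropy loss l(omega) summed over the m samples;
    # X has shape (m, d) with rows x_i, y has shape (m,) with entries in {0, 1}
    z = X @ omega                # z_i = omega^T x_i for all i at once
    s = sigma(z)
    return np.sum(-(y * np.log(s) + (1 - y) * np.log(1 - s)))
```

For numerical stability one would normally compute $\log\sigma(z_i)$ as `-np.logaddexp(0, -z)` rather than taking the log of the sigmoid directly, but the version above mirrors the formulas one-to-one.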

Utilizing some properties of the logistic function

Derivative property:

$\frac{d}{d z} \sigma(z)= \frac{d}{d z}\left(\frac{1}{1+e^{-z}}\right)=-\frac{-e^{-z}}{\left(1+e^{-z}\right)^{2}}=\frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}}=\sigma(z)(1-\sigma(z))$

Negation property:

$\sigma(-z) =1 /\left(1+e^{z}\right) =e^{-z} /\left(1+e^{-z}\right) =1-1 /\left(1+e^{-z}\right)= 1-\sigma(z)$
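Both identities are easy to sanity-check numerically; the sketch below compares the analytic derivative $\sigma(z)(1-\sigma(z))$ against a central finite difference and checks the negation identity on a grid of points.

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
eps = 1e-6

# derivative property: d/dz sigma(z) = sigma(z) * (1 - sigma(z))
numeric = (sigma(z + eps) - sigma(z - eps)) / (2 * eps)
analytic = sigma(z) * (1 - sigma(z))
print(np.allclose(numeric, analytic, atol=1e-8))   # True

# negation property: sigma(-z) = 1 - sigma(z)
print(np.allclose(sigma(-z), 1 - sigma(z)))        # True
```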

We will also need the derivatives of $z$ with respect to $\omega$ and $\omega^{T}$, which are $x^{T}$ and $x$ respectively:

$\frac{\partial z}{\partial \omega}=\frac{\partial\left(x^{T} \omega\right)}{\partial \omega}=x^{T}$

$\frac{\partial z}{\partial \omega^{T}}=\frac{\partial \omega^{T} x}{\partial \omega^{T}}=x$

Computing Gradient and Hessian

  • Gradient $\nabla_{\omega} l(\omega)$
  • Hessian $\nabla_{\omega}^2 l(\omega)$

We start from the indexed notation of the loss:

\(l_i(\omega)=-y_i\log\sigma(z_i)-(1-y_i)\log(1-\sigma(z_i))\) and we first differentiate $\log\sigma(z_i)$ and $\log(1-\sigma(z_i))$ with respect to $\omega^{T}$.

\[\begin{aligned} \frac{\partial \log \sigma\left(z_{i}\right)}{\partial \omega^{T}} &=\frac{1}{\sigma\left(z_{i}\right)} \frac{\partial \sigma\left(z_{i}\right)}{\partial \omega^{T}}=\frac{1}{\sigma\left(z_{i}\right)} \frac{\partial \sigma\left(z_{i}\right)}{\partial z_{i}} \frac{\partial z_{i}}{\partial \omega^{T}}=\left(1-\sigma\left(z_{i}\right)\right) x_{i} \\ \frac{\partial \log \left(1-\sigma\left(z_{i}\right)\right)}{\partial \omega^{T}} &=\frac{1}{1-\sigma\left(z_{i}\right)} \frac{\partial\left(1-\sigma\left(z_{i}\right)\right)}{\partial \omega^{T}}=-\sigma\left(z_{i}\right) x_{i} \end{aligned}\]

Finally

\[\nabla_{\omega} l_{i}(\omega)=\frac{\partial l_{i}(\omega)}{\partial \omega^{T}}=-y_{i} x_{i}\left(1-\sigma\left(z_{i}\right)\right)+\left(1-y_{i}\right) x_{i} \sigma\left(z_{i}\right)=x_{i}\left(\sigma\left(z_{i}\right)-y_{i}\right)\]

Summing over all $m$ samples and writing $\mu_{i}=\sigma\left(z_{i}\right)$, the gradient of the full loss in vector notation is:

$\nabla_{\omega} l(\omega)=\sum_{i=1}^{m}\left(\mu_{i}-y_{i}\right) \boldsymbol{x}_{i}$
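A minimal NumPy sketch of this gradient, checked against a finite-difference approximation of the loss; the data is random and purely illustrative, and the helper names (`loss`, `gradient`) are ours.

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(omega, X, y):
    s = sigma(X @ omega)
    return np.sum(-(y * np.log(s) + (1 - y) * np.log(1 - s)))

def gradient(omega, X, y):
    # nabla_omega l(omega) = sum_i (sigma(z_i) - y_i) * x_i
    return X.T @ (sigma(X @ omega) - y)

rng = np.random.default_rng(0)
m, d = 20, 3
X = rng.normal(size=(m, d))        # rows are the sample vectors x_i
y = rng.integers(0, 2, size=m)     # binary labels
omega = rng.normal(size=d)

# compare with a central finite-difference approximation of the gradient
eps = 1e-6
numeric = np.array([
    (loss(omega + eps * e, X, y) - loss(omega - eps * e, X, y)) / (2 * eps)
    for e in np.eye(d)
])
print(np.allclose(numeric, gradient(omega, X, y), atol=1e-6))   # True
```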

Now to compute the Hessian:

\(\nabla_{\omega}^{2} l_{i}(\omega)=\frac{\partial^{2} l_{i}(\omega)}{\partial \omega\, \partial \omega^{T}}=x_{i} x_{i}^{T} \sigma\left(z_{i}\right)\left(1-\sigma\left(z_{i}\right)\right)\). For $m$ samples we have $\nabla_{\omega}^{2} l(\omega)=\sum_{i=1}^{m} x_{i} x_{i}^{T} \sigma\left(z_{i}\right)\left(1-\sigma\left(z_{i}\right)\right)$.

This is equivalent to concatenating column vectors $x_{i} \in \mathbb{R}^{d}$ into a matrix $X$ of size $d \times m$ such that

\[\sum_{i=1}^{m} x_{i} x_{i}^{T}=X X^{T}\]

The scaling terms are combined in a diagonal matrix $S$ such that $S_{i i}=\sigma\left(z_{i}\right)\left(1-\sigma\left(z_{i}\right)\right)$.

Finally, the Hessian of the loss function w.r.t. the weights is: \({H}(\omega)=\nabla_{\omega}^{2} l(\omega)=X S X^{T}\)
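The sketch below builds the Hessian both ways, as the explicit sum over samples and as $X S X^{T}$, and checks that they agree; note that `X` here follows the text's convention of shape $d \times m$ (columns are the $x_i$), and the data is random and purely illustrative.

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, m = 3, 20
X = rng.normal(size=(d, m))      # columns are the sample vectors x_i (d x m)
omega = rng.normal(size=d)

z = X.T @ omega                  # z_i = omega^T x_i
s = sigma(z)

# Hessian as an explicit sum over samples: sum_i x_i x_i^T sigma(z_i)(1 - sigma(z_i))
H_sum = sum(np.outer(X[:, i], X[:, i]) * s[i] * (1 - s[i]) for i in range(m))

# Hessian in matrix form: H = X S X^T with S_ii = sigma(z_i)(1 - sigma(z_i))
S = np.diag(s * (1 - s))
H_mat = X @ S @ X.T

print(np.allclose(H_sum, H_mat))   # True
```

In practice one would avoid materializing the $m \times m$ diagonal matrix $S$ and compute `(X * (s * (1 - s))) @ X.T` instead, which scales each column of `X` by its diagonal entry before the product.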