Logistic regression is classification algorithm. It assumes that the response variable only takes on two possible outcomes.

## LR is supervised method

The following diagram describes the general supervised model:

• $X$ are the input features.
• $Y$ are labels
• $\hat Y$ would be the output after the prediction.
• $\theta$ are the parameters we update.
• The cost function is used to compare the original labels $Y$ and the predicted labels.
• For Logistic Regression the Sigmoid prediction function is used.

## Positive or negative classification

We are able to define the line $W^TX=0$.

Vector notation can be written: $W^TX=w_0x_0+w_1x_1+\ … + w_dx_d$. Where $d$ is the dimension of input features, and $w_0$ is bias and $x_0$ is set to 1 (not to worry about this detail).

We define two classes $W^TX>0$ (positive) and $W^TX<0$ (negative).

We can also define the point $X_A$ whose distance from the pink line (hyperplane) is $\rho$:

$\rho = \frac{W^TX }{\|W \|_2}$

This way the $\texttt{sign}(\rho)$ indicates if the example is positive or negative. In the image the point $X_A$ is a positive example, because the $\rho >0$.

In general binary classification can go to either positive or negative class:

$\texttt{sign}(\rho) \in \{-1, 1\}$

To calculate the L2 norm you use:

$\| {W} \|_2 = \sqrt{\sum_{i=0}^d w_i^2}$

where $d$ is the dimension of the hyperplane and for $w_0$ represents the bias and is usually set to 1.

## LR probabilities

It is not sufficient to just find the class, but also to provide the info about the positive prediction.

How can we map $W^TX$ into $[0,1]$ probability space?

If the $\rho=0$ we cannot say which class do we have. If $\rho$ is infinitely large the probability should be 1. Same for infinitely small. In between the probability should be smaller than 1.

$\hat {P}_{+} = P \{ y=+1 \mid X \} \in [0,1]$

$\hat {P}_{-} = P \{ y=-1 \mid X \} \in [0,1]$

We can make odds ratio:

$\frac{\hat P_{+}}{\hat P_{-}}\in [0, +\inf)$

Then: $\texttt{log}\frac{\hat P_{+}}{\hat P_{-}} = W^TX$

Logistic regression predicts probability $\hat P_{+}$ as: $\hat P_{+}=\frac{1}{1+e^{-W^TX}}=\texttt{sigmoid}(W^TX)$

## LR is discriminative model

LR is discriminative models because it has a hyperplane between the two classes.

In discriminative approach we fit a model of the $p(Y \mid X)$ posterior directly.

## Binary classification model

$P(Y \mid X, W)=\texttt{Ber}\left(Y \mid \texttt{sigm}\left(W^{T} X\right)\right)$

$P$ denotes discrete distribution, $p$ continuous, $\operatorname{sigm}$ is sigmoid function.

## Logistic Regression is linear model

A classifier is linear if its decision boundary on the feature space is a linear function: positive and negative examples are separated by an hyperplane.

Logistic regression outcome depends on the sum of the input features and parameters. The output cannot depend on the product of features, or features are independent of each other.

Logistic regression uses linear decision boundaries. Imagine you trained a logistic regression and obtained the coefficients $w_i$. You might want to classify a test record if $P(x) > 0.5$, where $x_{i} \in \mathbb{R}^{d}$.

The probability is obtained with your logistic regression by: $P(x) = \frac{1}{1+e^{-(w_0 + w_1 x_1 + \dots + w_d x_d)}}$

If you work out the math you see that $P({x}) > 0.5$ defines a hyperplane on the feature space which separates positive from negative examples.

## Logloss

$\texttt {logloss}(x_i, y_i) = log(1+e^{-y_iW^TX_i})$

where

$y_i\in\{-1, +1\}$

In here, part: $y_iW^TX_i$ is called margin $M$.

Logloss is differentiable and monotonous. From this reason we can use this function to calculate parameters $w_i$ in optimization procedure.

If we would have loss function that is differentiable, but not monotonous we couldn’t use it for LR.

Any differentiable function would still work for gradient boosting.

## LR model

To model LR we take $m$ samples ${x_{i}, y_{i}}$ such that $x_{i} \in \mathbb{R}^{d}$ and $y_{i} \in \mathbb{R}$.

In here $y_i$ is our dependent variable and $x_i$ is independent variable. You can consider $x_i$ are multiple columns, while $y_i$ is the target column.

$d$ is how many independent columns we have.

To define the formal model we need:

• hypothesis function,
• loss function

You may guess hypothesis function is the logistic function (also known as: logit, sigmoid, $\sigma$)

$h_{\theta}\left(x_{i}\right)=\sigma\left(\omega^{T} x_{i}\right)=\sigma\left(z_{i}\right)=\frac{1}{1+e^{-z_{i}}}$

Where $\omega \in \mathbb{R}^{d}$ and $z_{i}=\omega^{T} x_{i} .$

We will be computing gradients in here w.r.t. weights $\omega$

The loss function is then defined as: $l(\omega)=\sum_{i=1}^{m}-\left(y_{i} \log \sigma\left(z_{i}\right)+\left(1-y_{i}\right) \log \left(1-\sigma\left(z_{i}\right)\right)\right)$

Or in indexed notation: $l_i(\omega)=-y_i\log\sigma(z_i)-(1-y_i)\log(1-\sigma(z_i))$

## Utilizing some properties of logistic function

Derivation property :

$\frac{d}{d z} \sigma(z)= \frac{d}{d z}\left(\frac{1}{1+e^{-z}}\right)=-\frac{-e^{-z}}{\left(1+e^{-z}\right)^{2}}=\frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}}=\sigma(z)(1-\sigma(z))$

Negative property :

$\sigma(-z) =1 /\left(1+e^{z}\right) =e^{-z} /\left(1+e^{-z}\right) =1-1 /\left(1+e^{-z}\right)= 1-\sigma(z)$

We will also need to express $x$ and $x^T$ in terms of $z$:

$\frac{\partial z}{\partial \omega}=\frac{x^{T} \omega}{\partial \omega}=x^{T}$

$\frac{\partial z}{\partial \omega^{T}}=\frac{\partial \omega^{T} x}{\partial \omega^{T}}=x$

• Gradient $\nabla_{\omega} l(\omega)$
• Hessian $\nabla_{\omega}^2 l(\omega)$

We start from indexed notation of the loss:

$l_i(\omega)=-y_i\log\sigma(z_i)-(1-y_i)\log(1-\sigma(z_i))$ and we will first express the $\log\sigma(z_i)$ and $\log(1-\sigma(z_i))$.

\begin{aligned} \frac{\partial \log \sigma\left(z_{i}\right)}{\partial \omega^{T}} &=\frac{1}{\sigma\left(z_{i}\right)} \frac{\partial \sigma\left(z_{i}\right)}{\partial \omega^{T}}=\frac{1}{\sigma\left(z_{i}\right)} \frac{\partial \sigma\left(z_{i}\right)}{\partial z_{i}} \frac{\partial z_{i}}{\partial \omega^{T}}=\left(1-\sigma\left(z_{i}\right)\right) x_{i} \\ \frac{\partial \log \left(1-\sigma\left(z_{i}\right)\right)}{\partial \omega^{T}} &=\frac{1}{1-\sigma\left(z_{i}\right)} \frac{\partial\left(1-\sigma\left(z_{i}\right)\right)}{\partial \omega^{T}}=-\sigma\left(z_{i}\right) x_{i} \end{aligned}

Finally

$\nabla_{\omega} l_{i}(\omega)=\frac{\partial l_{i}(\omega)}{\partial \omega^{T}}=-y_{i} x_{i}\left(1-\sigma\left(z_{i}\right)\right)+\left(1-y_{i}\right) x_{i} \sigma\left(z_{i}\right)=x_{i}\left(\sigma\left(z_{i}\right)-y_{i}\right)$

And in vector notation :

$\nabla_{\omega} l_{i}(\omega)=\sum_{i}\left(\mu_{i}-y_{i}\right) \boldsymbol{x}_{i}$

Now to compute the Hessian:

$\nabla_{\omega}^{2} l_{i}(\omega)=\frac{\partial l_{i}(\omega)}{\partial \omega \partial \omega^{T}}=x_{i} x_{i}^{T} \sigma\left(z_{i}\right)\left(1-\sigma\left(z_{i}\right)\right)$ For $m$ samples we have ${\nabla_{\omega}}^{2} l(\omega)=\sum_{i=1}^{m} x_{i} x_{i}^{T} \sigma\left(z_{i}\right)\left(1-\sigma\left(z_{i}\right)\right)$.

This is equivalent to concatenating column vectors $x_{i} \in \mathbb{R}^{d}$ into a matrix $X$ of size $d \times m$ such that

$\sum_{i=1}^{m} x_{i} x_{i}^{T}=X X^{T}$

The scaling terms are combined in a diagonal matrix $S$ such that $S_{i i}=\sigma\left(z_{i}\right)\left(1-\sigma\left(z_{i}\right)\right)$.

Finally, Hessian w.r.t. weights of the loss function is: ${H}(\omega)=\nabla_{\omega}^{2} l(\omega)=X S X^{T}$