In linear regression we want to find a linear relation between explanatory variables and a response value. For example, we want to find out how the house price (= response value) relates to the size of the house and its location (= explanatory variables).

The response value has to be a continuous value, like a house price. (There is a variation called logistic regression for discrete outcomes.)

Linear regression is typically the first estimation method you learn. But why do we minimize the sum of squared differences between the observed responses and the responses of the linear approximation? (This is also called least squares.)

$$\min_{c_0, c_1} \sum_i \left( (\underbrace{c_0 + c_1 \cdot x_i}_{\text{Approximation}} ) - \underbrace{y_i}_{\text{Response}} \right)^2$$
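As a minimal sketch of this minimization, the snippet below fits a line to a handful of made-up points. The data and the helper names `fit_line` and `sum_of_squares` are assumptions for illustration; the closed-form expressions are the standard least squares solution for a single explanatory variable.

```python
def fit_line(xs, ys):
    """Return (c0, c1) minimizing sum((c0 + c1*x - y)**2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    c1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    c0 = mean_y - c1 * mean_x
    return c0, c1


def sum_of_squares(c0, c1, xs, ys):
    """The least squares objective for a candidate line."""
    return sum((c0 + c1 * x - y) ** 2 for x, y in zip(xs, ys))


# Made-up data: house size (m^2) and price (in thousands).
sizes = [50.0, 70.0, 80.0, 100.0, 120.0]
prices = [150.0, 200.0, 230.0, 275.0, 340.0]

c0, c1 = fit_line(sizes, prices)
best = sum_of_squares(c0, c1, sizes, prices)
```

Moving `c0` or `c1` away from the fitted values in any direction gives a larger value of `sum_of_squares` than `best`.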

Especially outside math, computer science and econometrics courses, this is handwaved as "Two is a good compromise". Indeed it is: squaring has the property that no difference (or error) contributes a negative amount. This means that an underestimation does not cancel out an overestimation. Additionally, it ensures that many small errors are preferred to a few large errors.
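Both properties can be checked on a toy example. The residual lists below are made up for illustration, and the absolute sum is included only as a point of comparison.

```python
# Hypothetical sets of residuals (approximation minus response).
cancelling = [5.0, -5.0, 5.0, -5.0]   # large errors that cancel out
many_small = [1.0, 1.0, 1.0, 1.0]     # four small errors
one_large = [4.0, 0.0, 0.0, 0.0]      # one large error, same total size


def plain_sum(residuals):
    return sum(residuals)


def absolute_sum(residuals):
    return sum(abs(r) for r in residuals)


def squared_sum(residuals):
    return sum(r ** 2 for r in residuals)


# Without squaring, over- and underestimation cancel to a "perfect" score.
print(plain_sum(cancelling))     # 0.0
print(squared_sum(cancelling))   # 100.0

# Absolute values cannot tell these two apart, squares can.
print(absolute_sum(many_small), absolute_sum(one_large))  # 4.0 4.0
print(squared_sum(many_small), squared_sum(one_large))    # 4.0 16.0
```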

Perhaps you want to know a better answer. Then read on. Otherwise don't :) It turns out that if we take a detour through random variables, we can derive the 2.

## Rephrasing with random variables

A trick that is often used is to declare that there is a random unobserved factor in play, even though the process might be completely deterministic. You can argue that horrendous deterministic complexity looks random to a simple observer, such as a linear approximation of a high-dimensional function.

We can then rephrase the question as: find a linear relation between explanatory variables and a response value that minimizes the unexplained random factor:

$$y_i = c_0 + c_1 \cdot x_i + \epsilon_i$$

The previous approach of minimizing the sum of squared errors is the same as minimizing the sum of squared unexplained random factors $\epsilon_i$.

To work with this randomness we need to know, or assume, how the randomness behaves. Typically we assume the errors are independent and identically normally distributed (Gaussian).
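To make the model concrete, here is a small simulation sketch. The "true" parameters, the seed and the helper names are assumptions chosen for illustration only.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Assumed "true" parameters, chosen only for illustration.
true_c0, true_c1, true_sigma = 10.0, 2.5, 3.0

xs = [float(x) for x in range(20)]
noise = [rng.gauss(0.0, true_sigma) for _ in xs]   # i.i.d. normal errors
ys = [true_c0 + true_c1 * x + e for x, e in zip(xs, noise)]


def unexplained(c0, c1):
    """The epsilon_i implied by a candidate line: y_i - (c0 + c1*x_i)."""
    return [y - (c0 + c1 * x) for x, y in zip(xs, ys)]


def sum_sq(values):
    return sum(v ** 2 for v in values)


# With the true parameters the unexplained part is exactly the noise
# we injected (up to floating-point rounding).
eps_true = unexplained(true_c0, true_c1)

# A clearly wrong line leaves a much larger unexplained part.
eps_wrong = unexplained(0.0, 0.0)
```

Here `sum_sq(eps_true)` is the least squares objective evaluated at the true line, and it is far smaller than `sum_sq(eps_wrong)`.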

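This is not always reasonable, but it is a very good starting point. Independence and identical distribution are quite often true (enough). Normality is motivated by the Central Limit Theorem: a suitably normalized sum of many independent random variables with finite variance converges to a normal distribution, so an error made up of many small independent effects is approximately normal.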
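The Central Limit Theorem can be illustrated with a quick experiment. This is only a sketch: the choice of uniform random variables, the seed and the sample size are assumptions, and a simulation is of course no proof.

```python
import random
from statistics import NormalDist, mean, pstdev

rng = random.Random(1)  # fixed seed so the sketch is repeatable


def sum_of_uniforms(k):
    """Sum of k independent Uniform(0, 1) draws, shifted to mean 0."""
    return sum(rng.random() for _ in range(k)) - k / 2


# Each Uniform(0, 1) has variance 1/12, so k = 12 gives variance 1.
samples = [sum_of_uniforms(12) for _ in range(20_000)]

# Fraction of samples within one standard deviation of the mean.
within_one = sum(1 for s in samples if abs(s) < 1.0) / len(samples)

# The same fraction for an exact standard normal distribution.
normal = NormalDist(0.0, 1.0)
expected = normal.cdf(1.0) - normal.cdf(-1.0)   # about 0.683

print(round(mean(samples), 2), round(pstdev(samples), 2))
print(round(within_one, 3), round(expected, 3))
```

Although no single term is normally distributed, the sum of only twelve of them already behaves very much like a standard normal variable.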

## Maximum likelihood

Maximum likelihood is a method of estimating the parameters of an unknown random distribution. Distribution is just a word for the values a random variable can take and the chance of each of them. In our case we are interested in estimating the mean ($\mu$) and standard deviation ($\sigma$) of the random factor $\epsilon$.

The possible values and their chances (likelihood) of occurring can be described by the probability density function (PDF). The cumulative distribution function (CDF) is the cumulative version: it gives the probability that the random variable is at most a given value.

There is a slightly tricky notion: $pdf(5)$ is not the probability that the random variable has value 5, but the density at 5. The random variable is continuous and there are infinitely many values in any interval, so any single value has probability zero. However, the integral (weighted sum) $\int_4^5 pdf(x) \, dx$ is the probability that $x$ is between 4 and 5.
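The difference between density and probability shows up clearly in a small sketch. The distribution below (normal with mean 5 and standard deviation 0.1) and the helper `integrate_pdf` are assumptions picked for illustration.

```python
from statistics import NormalDist

# Hypothetical random variable: normal with mean 5, standard deviation 0.1.
dist = NormalDist(mu=5.0, sigma=0.1)

# The density at 5 is larger than 1, so it cannot be a probability.
density_at_5 = dist.pdf(5.0)     # about 3.99


def integrate_pdf(a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of pdf over [a, b]."""
    width = (b - a) / steps
    return sum(dist.pdf(a + (i + 0.5) * width) for i in range(steps)) * width


# Probability of landing between 4 and 5: the area under the density.
prob_4_to_5 = integrate_pdf(4.0, 5.0)

# The CDF gives the same probability as a difference of two values.
prob_from_cdf = dist.cdf(5.0) - dist.cdf(4.0)   # about 0.5
```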

If this confuses you just assume density is almost the probability.

The likelihood of the values $x$ occurring is defined as the product of the densities of the individual $x$ (this is where independence is used). If you read probability instead of density, you get the intuition that this quantity relates to the probability of all $x$ occurring simultaneously. Maximizing this quantity amounts to maximizing that probability.
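As a rough sketch of this definition, the snippet below computes the likelihood of a few made-up observations under a normal distribution, together with its logarithm, which is what the derivation in the next section works with. The observations and candidate parameters are assumptions for illustration.

```python
import math
from statistics import NormalDist

# Made-up observations, assumed to be independent draws.
observations = [4.8, 5.1, 5.3, 4.9, 5.0]


def likelihood(mu, sigma):
    """Product of the densities of all observations."""
    dist = NormalDist(mu, sigma)
    return math.prod(dist.pdf(x) for x in observations)


def log_likelihood(mu, sigma):
    """Sum of log-densities: the logarithm of the product above."""
    dist = NormalDist(mu, sigma)
    return sum(math.log(dist.pdf(x)) for x in observations)


# Parameters close to the data are "more likely" than parameters far away.
near = likelihood(5.0, 0.2)
far = likelihood(6.0, 0.2)
```

Because the logarithm is increasing, `log_likelihood` ranks candidate parameters in the same order as `likelihood`, while avoiding products of many tiny numbers.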

## So...how does this work out?

Using maximum likelihood we want to maximize the likelihood that $y_i$ occurs given $x_i$ with parameters $c_0, c_1$ and $\sigma$:

$$LH(y | x; c_0, c_1, \sigma) = \prod_i pdf(y_i | x_i; c_0, c_1, \sigma)$$

The PDF of the normal distribution is defined as $$pdf(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp \left[ -\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^2 \right]$$
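The density formula translates directly into code. In this sketch the function name `normal_pdf` and the test points are arbitrary choices; the result is cross-checked against `statistics.NormalDist` from the Python standard library.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution, written out from the formula."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z ** 2) / (sigma * math.sqrt(2 * math.pi))


# Cross-check against the standard library at a few arbitrary points,
# given as (x, mu, sigma).
points = [(-1.0, 0.0, 1.0), (0.3, 0.0, 2.0), (5.0, 4.0, 0.5)]
max_difference = max(
    abs(normal_pdf(x, mu, sigma) - NormalDist(mu, sigma).pdf(x))
    for x, mu, sigma in points
)
```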

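The random part of $y_i$ is the error $\epsilon_i = y_i - (c_0 + c_1 \cdot x_i)$, so we plug that error into the normal PDF with $\mu = 0$. Because the logarithm is an increasing function, maximizing the likelihood is the same as maximizing the log-likelihood $LLH$, which turns the product into a sum. We then get:

$$\begin{aligned}
LLH(y | x; c_0, c_1, \sigma) &= \sum_i \log pdf(y_i | x_i; c_0, c_1, \sigma) \\
&= \sum_i \log \left( \frac{1}{\sigma \sqrt{2\pi}} \exp \left[ -\frac{1}{2} \left( \frac{\left( c_0 + c_1\cdot x_i \right) - y_i}{\sigma} \right)^2 \right] \right) \\
&= \sum_i \left[ \log \frac{1}{\sigma \sqrt{2\pi}} - \frac{1}{2} \left( \frac{\left( c_0 + c_1\cdot x_i \right) - y_i}{\sigma} \right)^2 \right] \\
&= -n \log \left( \sigma \sqrt{2\pi} \right) - \frac{1}{2\sigma^2} \sum_i \left( \left( c_0 + c_1\cdot x_i \right) - y_i \right)^2
\end{aligned}$$

We can already see that, for any fixed $\sigma$, maximizing $LLH$ is equivalent to minimizing least squares: the first term does not depend on $c_0$ and $c_1$, and the second term is the sum of squared errors multiplied by a negative constant. In least squares we did not optimize over $\sigma$. To show that it also works out when $\sigma$ is optimized over, we show that the estimate of $\sigma$ is determined by $c_0$, $c_1$, $x$ and $y$.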
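This equivalence can be checked numerically. The sketch below uses made-up data and a deliberately naive grid search (an assumption for illustration, not an efficient fitting method) to compare the best candidate under both criteria.

```python
import math

# Made-up data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def sum_of_squares(c0, c1):
    return sum((c0 + c1 * x - y) ** 2 for x, y in zip(xs, ys))


def llh(c0, c1, sigma):
    """Log-likelihood under i.i.d. normal errors with std. deviation sigma."""
    n = len(xs)
    return (-n * math.log(sigma * math.sqrt(2 * math.pi))
            - sum_of_squares(c0, c1) / (2 * sigma ** 2))


# A coarse grid of candidate lines (brute force, step size 0.1).
grid = [(c0 / 10, c1 / 10) for c0 in range(-20, 21) for c1 in range(0, 41)]

best_least_squares = min(grid, key=lambda c: sum_of_squares(*c))

# Whatever sigma we fix, the same candidate maximizes the log-likelihood.
best_llh = {sigma: max(grid, key=lambda c: llh(c[0], c[1], sigma))
            for sigma in (0.5, 1.0, 3.0)}
```

For each of the three values of `sigma`, the entry in `best_llh` is the same grid point as `best_least_squares`.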

Feel free to skip the last step as it requires more algebra.

Note: Earlier we mentioned that the error has a mean $\mu$ and standard deviation $\sigma$. In our case $\mu$ is superfluous: if it were non-zero, we could set $c_0 := c_0 + \mu$ and $\mu := 0$, and all calculations would work out the same.

## Last step

We show here that the estimate of $\sigma$ is defined in terms of the other parameters and the variables, and is simply the sample standard deviation of the errors (with a factor $\frac{1}{n}$ rather than $\frac{1}{n-1}$):

$$\sigma^2 = \frac{1}{n} \sum_i ((c_0 + c_1 \cdot x_i) - y_i)^2$$

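The optimal set of parameters that maximizes $LLH$ has the property that the partial derivatives are zero:

$$\begin{aligned}
\frac{\partial LLH}{\partial c_0} &= -\frac{1}{\sigma^2} \sum_i (c_0 + c_1 x_i - y_i) = 0 \\
\frac{\partial LLH}{\partial c_1} &= -\frac{1}{\sigma^2} \sum_i (c_0 x_i + c_1 x_i^2 - x_i y_i) = 0
\end{aligned}$$

The factor $-\frac{1}{\sigma^2}$ does not change where these expressions are zero, so the optimal $c_0$ and $c_1$ are the same for every $\sigma$: they are exactly the least squares solution.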
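The two conditions on $c_0$ and $c_1$ form a small linear system that can be solved directly. The following sketch uses made-up data and Cramer's rule; both are assumptions for illustration.

```python
# Made-up data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_xx = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# Dropping the common factor -1/sigma^2, the two conditions read
#   n     * c0 + sum_x  * c1 = sum_y
#   sum_x * c0 + sum_xx * c1 = sum_xy
# which is a 2x2 linear system, solved here with Cramer's rule.
det = n * sum_xx - sum_x * sum_x
c0 = (sum_y * sum_xx - sum_x * sum_xy) / det
c1 = (n * sum_xy - sum_x * sum_y) / det

# Both sums from the derivatives should now vanish (up to rounding).
grad_c0 = sum(c0 + c1 * x - y for x, y in zip(xs, ys))
grad_c1 = sum(x * (c0 + c1 * x - y) for x, y in zip(xs, ys))
```

Note that $\sigma$ appears nowhere in the computation of `c0` and `c1`.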

For $\sigma$ it is easiest to take the derivative with respect to $\sigma^2$:

$$\begin{aligned}
\frac{\partial LLH}{\partial \sigma^2} &= - \frac{n}{2\sigma^2} + \left( \frac{1}{2} \sum_i \left( \left( c_0 + c_1\cdot x_i \right) - y_i \right)^2 \right) \left( \frac{1}{(\sigma^2)^2}\right) \\
&= \frac{1}{2\sigma^2} \left( \frac{1}{\sigma^2} \sum_i \left( \left( c_0 + c_1\cdot x_i \right) - y_i \right)^2 - n \right)
\end{aligned}$$

This is zero only if $$\sigma^2 = \frac{1}{n} \sum_i \left( \left( c_0 + c_1\cdot x_i \right) - y_i \right)^2$$
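As a numerical sanity check of this result, the sketch below evaluates the log-likelihood at the closed-form $\sigma$ and at slightly smaller and larger values. The data is made up, and `c0` and `c1` are set to the least squares line for that data.

```python
import math

# Made-up data and its least squares line (c0 = 0.05, c1 = 1.99).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
c0, c1 = 0.05, 1.99

residuals = [c0 + c1 * x - y for x, y in zip(xs, ys)]
n = len(residuals)

# The closed-form result: sigma^2 is the mean squared residual.
sigma_hat = math.sqrt(sum(r ** 2 for r in residuals) / n)


def llh(sigma):
    """Log-likelihood as a function of sigma, with c0 and c1 held fixed."""
    return (-n * math.log(sigma * math.sqrt(2 * math.pi))
            - sum(r ** 2 for r in residuals) / (2 * sigma ** 2))


# Nudging sigma in either direction should only lower the log-likelihood.
at_optimum = llh(sigma_hat)
slightly_smaller = llh(0.9 * sigma_hat)
slightly_larger = llh(1.1 * sigma_hat)
```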

Now you know 2.