Let's dive into the mathematical foundations of the bias-variance tradeoff. Recall from the previous example that we aim to reduce the total error between predicted values and actual values. This error is composed of three components: bias, variance, and irreducible error. We can analyze the expected squared prediction error of a model $\hat{f}(x)$ compared to the true function $f(x)$, where $\hat{f}(x)$ is learned from a training dataset $D$, $x$ is an input point, and $f$ is the true (unknown) function.
Let:
$$y = f(x) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$
This means that for the function $y = f(x) + \varepsilon$, the error (denoted by $\varepsilon$) is normally distributed with a mean of $0$ and a variance of $\sigma^2$, where $\sigma$ denotes the standard deviation of the distribution.
$\hat{f}(x)$ is the model's predicted value at input $x$.
The expectation (or mean) is taken over different training datasets $D$ and noise $\varepsilon$. The symbol $\mathbb{E}$ denotes "expectation" or "expected value," i.e., the true mean of a distribution.
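To make this setup concrete, here is a minimal sketch of the data-generating process; the specific choices $f(x) = \sin(2\pi x)$ and $\sigma = 0.3$ are illustrative assumptions, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def f(x):
    # True (normally unknown) function; sin(2*pi*x) is an assumed example
    return np.sin(2 * np.pi * x)

sigma = 0.3   # assumed noise standard deviation
n = 50        # training set size

# One training dataset D: inputs x and noisy targets y = f(x) + eps
x_train = rng.uniform(0.0, 1.0, size=n)
eps = rng.normal(loc=0.0, scale=sigma, size=n)   # eps ~ N(0, sigma^2)
y_train = f(x_train) + eps
```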
We are interested in the expected prediction error at a single point $x$:
$$\mathbb{E}_{D,\varepsilon}\big[(y - \hat{f}(x))^2\big]$$
Substitute:
$$y = f(x) + \varepsilon$$
So the expression becomes:
$$= \mathbb{E}_{D,\varepsilon}\big[(f(x) + \varepsilon - \hat{f}(x))^2\big]$$
Expanding the square:
$$= \mathbb{E}_{D,\varepsilon}\big[(f(x) - \hat{f}(x))^2 + 2(f(x) - \hat{f}(x))\varepsilon + \varepsilon^2\big]$$
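Before taking expectations, it is worth sanity-checking the algebra. The short sketch below uses sympy (an assumed tool here, not mentioned in the text) to confirm the three-term expansion symbolically.

```python
import sympy as sp

# At a fixed point x, treat f(x), f_hat(x), and the noise eps as scalar symbols
f, f_hat, eps = sp.symbols("f f_hat eps")

lhs = (f + eps - f_hat) ** 2
rhs = (f - f_hat) ** 2 + 2 * (f - f_hat) * eps + eps ** 2

# The difference simplifies to zero, so the expansion is exact
assert sp.simplify(lhs - rhs) == 0
print("expansion verified")
```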
Split the expectation using linearity (linearity of expectation is a simple algebraic property: for example, $\mathbb{E}[A + B] = \mathbb{E}[A] + \mathbb{E}[B]$):
$$= \mathbb{E}_{D}\big[(f(x) - \hat{f}(x))^2\big] + 2\,\mathbb{E}_{D,\varepsilon}\big[(f(x) - \hat{f}(x))\,\varepsilon\big] + \mathbb{E}_{\varepsilon}\big[\varepsilon^2\big]$$
Now, since the noise $\varepsilon$ is independent of the training data $D$ (and hence of $\hat{f}(x)$), the cross term factorizes into a product of expectations, and:

$$\mathbb{E}[\varepsilon] = 0 \;\Rightarrow\; \mathbb{E}_{D,\varepsilon}\big[(f(x) - \hat{f}(x))\,\varepsilon\big] = \mathbb{E}_{D}\big[f(x) - \hat{f}(x)\big]\,\mathbb{E}_{\varepsilon}[\varepsilon] = 0$$

$$\mathbb{E}_{\varepsilon}[\varepsilon^2] = \operatorname{Var}(\varepsilon) = \sigma^2$$
We get:

$$\mathbb{E}_{D}\big[(f(x) - \hat{f}(x))^2\big] + \sigma^2$$

The second term, $\sigma^2$, is the irreducible error: it comes from the noise in the data itself and cannot be removed by any model.
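To recover all three components named at the start, the remaining expectation can be split one step further by adding and subtracting $\mathbb{E}_{D}[\hat{f}(x)]$ inside the square; this is the standard bias-variance decomposition, sketched here as the natural next step:

$$\mathbb{E}_{D}\big[(f(x) - \hat{f}(x))^2\big] = \underbrace{\big(f(x) - \mathbb{E}_{D}[\hat{f}(x)]\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_{D}\big[(\hat{f}(x) - \mathbb{E}_{D}[\hat{f}(x)])^2\big]}_{\text{Variance}}$$

so the total expected error is $\text{Bias}^2 + \text{Variance} + \sigma^2$.

The identity can also be checked numerically. In the sketch below, every concrete choice (the true function $f(x) = \sin(2\pi x)$, $\sigma = 0.3$, and a cubic polynomial fit via numpy.polyfit) is an illustrative assumption, not something fixed by the derivation: we draw many training datasets $D$, refit the model on each, and compare a direct estimate of the total error at a point $x_0$ against $\text{Bias}^2 + \text{Variance} + \sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def f(x):
    return np.sin(2 * np.pi * x)   # assumed true function

sigma, n, degree = 0.3, 50, 3      # assumed noise level, sample size, model class
n_datasets = 5000                  # number of training datasets D to draw
x0 = 0.25                          # fixed query point x

preds = np.empty(n_datasets)
for i in range(n_datasets):
    # Draw one training dataset D and fit a degree-3 polynomial to it
    x_train = rng.uniform(0.0, 1.0, size=n)
    y_train = f(x_train) + rng.normal(0.0, sigma, size=n)
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    preds[i] = np.polyval(coeffs, x0)   # f_hat(x0) for this D

bias_sq = (f(x0) - preds.mean()) ** 2   # (f(x0) - E_D[f_hat(x0)])^2
variance = preds.var()                  # E_D[(f_hat(x0) - E_D[f_hat(x0)])^2]

# Direct Monte Carlo estimate of E_{D,eps}[(y - f_hat(x0))^2],
# pairing each fitted model with an independent noisy target at x0
y0 = f(x0) + rng.normal(0.0, sigma, size=n_datasets)
total = np.mean((y0 - preds) ** 2)

print(f"bias^2 + variance + sigma^2 = {bias_sq + variance + sigma**2:.4f}")
print(f"direct estimate of total    = {total:.4f}")
```

The two printed values should agree up to Monte Carlo error. Increasing the polynomial degree typically shrinks the bias term while inflating the variance term, which is the tradeoff in action.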