Put simply, overfitting is the opposite of underfitting: it occurs when a model has been overtrained or is too complex, resulting in high error rates on test data. Overfitting a model is more common than underfitting one, and underfitting often arises from efforts to avoid overfitting, for example by halting training too soon through a process called “early stopping.”
If undertraining or a lack of complexity results in underfitting, then a logical prevention strategy would be to train for longer or add more relevant input features. However, if you train the model too much or add too many features, you may overfit it, resulting in low bias but high variance (i.e. the bias-variance tradeoff). In this scenario, the statistical model fits too closely to its training data, rendering it unable to generalize well to new data points. It’s important to note that some types of models are more prone to overfitting than others, such as decision trees or KNN.
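As a rough illustration of this effect, the sketch below compares an unconstrained decision tree with a depth-limited one on a small benchmark dataset. The dataset, split ratio, and depth limit are arbitrary choices for demonstration, not part of the discussion above; the point is simply that the fully grown tree tends to score near-perfectly on its training data while doing noticeably worse on held-out test data.

```python
# Sketch: an unconstrained decision tree overfitting vs. a depth-limited one.
# Assumes scikit-learn is installed; dataset and hyperparameters are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fully grown tree: low bias, high variance (memorizes the training data)
deep_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("Deep tree    - train:", deep_tree.score(X_train, y_train),
      "test:", deep_tree.score(X_test, y_test))

# Depth-limited tree: less complexity, typically generalizes better
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
print("Shallow tree - train:", shallow_tree.score(X_train, y_train),
      "test:", shallow_tree.score(X_test, y_test))
```

Limiting the tree depth is just one way of restraining model complexity; the same gap between training and test accuracy is the usual signal that a model has overfit.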
Identifying overfitting can be more difficult than identifying underfitting because, unlike an underfitted model, an overfitted model performs with high accuracy on its training data. To assess the accuracy of an algorithm, a technique called k-fold cross-validation is typically used.
In k-fold cross-validation, the data is split into k equally sized subsets, also called “folds.” One of the k folds acts as the test set, also known as the holdout set or validation set, and the remaining folds train the model. This process repeats until each fold has acted as the holdout set. After each evaluation, a score is retained, and when all iterations have completed, the scores are averaged to assess the performance of the overall model.
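The sketch below mirrors that procedure directly, using scikit-learn's KFold splitter with an arbitrarily chosen k of 5 and a KNN classifier as a stand-in model; none of these specifics come from the description above, they are just one way to make the loop concrete.

```python
# Sketch: k-fold cross-validation, with each fold serving once as the holdout set.
# Assumes scikit-learn is installed; k, the dataset, and the model are illustrative.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, test_idx in kf.split(X):
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X[train_idx], y[train_idx])                  # train on the k-1 remaining folds
    scores.append(model.score(X[test_idx], y[test_idx]))   # evaluate on the holdout fold

print("Fold scores:  ", np.round(scores, 3))
print("Mean accuracy:", np.mean(scores))                   # averaged score for the overall model
```

In practice the same loop can be collapsed into a single call to scikit-learn's cross_val_score, which handles the splitting, scoring, and collection of per-fold results.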
The ideal scenario when fitting a model is to find the balance between overfitting and underfitting. Identifying that “sweet spot” between the two allows machine learning models to make accurate predictions.