Uncertainty comes in two primary types: data-driven uncertainty and model-driven uncertainty. In either case, it can be helpful to know how reliable a prediction is, both before and after it is made.
You can think of this as a model predicting how many times a door hinge can open and close before it fails, give or take roughly 1,000 operations. It can also show how likely it is that this particular closing of the hinge breaks it.
Sampling-based methods
Sampling-based approaches are some of the most commonly used techniques for uncertainty quantification because they can handle models of any complexity and provide an intuitive, comprehensive characterization of uncertainty. By generating many possible scenarios, sampling can build up a statistical picture of what outcomes are likely and how uncertain our predictions are when applied to real-world data. Instead of computing uncertainty analytically, these methods use statistical analysis of many sample outputs to characterize uncertainty distributions.
Monte Carlo simulation is one of the most common approaches. It runs thousands of model simulations with randomly varied inputs to see the range of possible outputs. Monte Carlo methods are especially common with parametric models, where the confidence intervals and outputs of different models are compared to see the range of possible values.
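As a rough sketch, a Monte Carlo analysis of a hypothetical hinge-fatigue model might look like the following; the model and its input distributions are made up for illustration.

```python
# Minimal Monte Carlo sketch: propagate input uncertainty through a
# hypothetical hinge-fatigue model by sampling the inputs many times.
import numpy as np

rng = np.random.default_rng(42)
n_runs = 10_000

# Hypothetical uncertain inputs: material strength and applied load
strength = rng.normal(loc=300.0, scale=15.0, size=n_runs)  # MPa
load = rng.normal(loc=120.0, scale=10.0, size=n_runs)      # N

def cycles_to_failure(strength, load):
    """Toy parametric model: cycles survived grows with the strength/load ratio."""
    return 1e5 * (strength / load) ** 2

outputs = cycles_to_failure(strength, load)

# Statistical picture of the output distribution
print(f"mean: {outputs.mean():.0f} cycles")
print(f"95% interval: {np.percentile(outputs, [2.5, 97.5])}")
```

The spread of `outputs` is the uncertainty estimate: a wide 95% interval signals that the input uncertainty strongly affects the prediction.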
Latin hypercube sampling is a more efficient variation of Monte Carlo simulation that requires fewer runs while still covering the input space well.
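For comparison, a Latin hypercube design for the same kind of input space can be drawn with scipy.stats.qmc; the input bounds below are hypothetical.

```python
# Latin hypercube sampling sketch: stratified coverage of the input space
# with far fewer samples than plain random sampling.
from scipy.stats import qmc

sampler = qmc.LatinHypercube(d=2, seed=42)  # two uncertain inputs
unit_samples = sampler.random(n=200)        # points in the unit square [0, 1)^2

# Scale the unit hypercube to the (hypothetical) physical input ranges
lower_bounds = [255.0, 90.0]   # strength (MPa), load (N)
upper_bounds = [345.0, 150.0]
samples = qmc.scale(unit_samples, lower_bounds, upper_bounds)

# Each row is one input scenario to run through the model
print(samples[:5])
```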
Monte Carlo dropout is another technique that keeps dropout active during prediction, running multiple forward passes to get a distribution of outputs.2 Dropout is primarily a regularization technique used during training: it randomly deactivates a fraction of the network's units on each pass, which helps the model optimize its loss function while avoiding overfitting or underfitting.
Monte Carlo dropout applies dropout at test time and runs multiple forward passes with different dropout masks. This causes the model to produce a distribution of predictions rather than a single point estimate, and that distribution provides insight into the model's uncertainty about its predictions. It's a computationally efficient way to get a neural network to output a distribution without training the network multiple times.
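A minimal sketch of the idea in PyTorch, assuming a simple feed-forward network with a dropout layer:

```python
# Monte Carlo dropout sketch: keep dropout active at prediction time and
# use the spread of outputs across many stochastic forward passes.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Dropout(p=0.2),  # normally disabled at inference; here we keep it on
    nn.Linear(64, 1),
)

def mc_dropout_predict(model, x, n_passes=100):
    model.train()  # train mode keeps dropout layers active
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_passes)])
    # Predictive mean and a simple uncertainty estimate (standard deviation)
    return preds.mean(dim=0), preds.std(dim=0)

x = torch.randn(5, 10)  # five hypothetical inputs
mean, std = mc_dropout_predict(model, x)
print(mean.squeeze(), std.squeeze())
```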
When running the actual model many times is too expensive, statisticians create simplified "surrogate" models by using techniques like Gaussian process regression (GPR).5 GPR is a Bayesian approach that models the uncertainty in its predictions, which makes it a valuable tool for optimization, time series forecasting and other applications. GPR is based on the concept of a 'Gaussian process': a collection of random variables, any finite set of which has a joint Gaussian distribution.
You can think of a Gaussian process as a distribution of functions. GPR places a prior distribution over functions and then uses observed data to create a posterior distribution. Using GPR to calculate uncertainty doesn’t require extra training or model runs because the output inherently expresses how certain or uncertain the model is about the estimate through the distribution. Libraries like Scikit-learn provide implementations of GPR for uncertainty analysis.
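For example, scikit-learn's GaussianProcessRegressor can return a standard deviation alongside each prediction; the toy data here is purely illustrative.

```python
# Gaussian process regression sketch: the posterior standard deviation
# quantifies how uncertain the surrogate is at each test point.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(20, 1))
y_train = np.sin(X_train).ravel() + rng.normal(0, 0.1, size=20)

gpr = GaussianProcessRegressor(kernel=RBF(), alpha=0.01, normalize_y=True)
gpr.fit(X_train, y_train)

X_test = np.linspace(0, 10, 100).reshape(-1, 1)
mean, std = gpr.predict(X_test, return_std=True)

# Uncertainty is largest far from the training points
print(std.min(), std.max())
```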
The choice of sampling method depends on what features matter most for your model and scenario. Most real-world applications combine multiple approaches.
Bayesian methods
Bayesian statistics is an approach to statistical inference that uses Bayes' theorem to combine prior beliefs with observed data and update the probability of a hypothesis. Bayesian statistics deals with uncertainty explicitly by assigning a probability distribution rather than a single fixed value. Instead of giving a single 'best' estimate for a model parameter, Bayesian methods provide a distribution over the likely values of that parameter.
Bayesian inference updates predictions as new data becomes available, which naturally incorporates uncertainty throughout the process of estimating model parameters. Markov chain Monte Carlo (MCMC) methods help implement Bayesian approaches when closed-form solutions are intractable. The MCMC approach samples from complex, high-dimensional probability distributions that cannot be sampled directly, particularly the posterior distributions that arise in Bayesian inference.
Bayesian neural networks (BNNs) are a departure from traditional neural networks in that they treat network weights as probability distributions rather than fixed point estimates. This probabilistic approach enables principled and rigorous uncertainty quantification. Instead of a single point estimate for each weight, BNNs maintain probability distributions over all network parameters. Predictions typically include
- mean and variance estimates for the predictive distribution
- samples from the predictive distribution
- credible intervals derived from the distribution
Several popular open source libraries, such as PyMC and TensorFlow Probability, can be used to implement BNNs.
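As a small sketch, here is a Bayesian linear regression in PyMC (assuming PyMC v5 and ArviZ) on made-up data; the MCMC posterior samples directly yield the kinds of outputs listed above, such as credible intervals.

```python
# Bayesian regression sketch: weights are distributions, not fixed values,
# and MCMC sampling turns the posterior into usable uncertainty estimates.
import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.2, size=50)  # toy observations

with pm.Model() as model:
    # Priors over the parameters
    slope = pm.Normal("slope", mu=0, sigma=5)
    intercept = pm.Normal("intercept", mu=0, sigma=5)
    noise = pm.HalfNormal("noise", sigma=1)

    # Likelihood of the observed data
    pm.Normal("y_obs", mu=slope * x + intercept, sigma=noise, observed=y)

    # MCMC sampling from the posterior
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

# Credible intervals derived from the posterior distribution
print(az.hdi(idata, hdi_prob=0.95))
```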
Ensemble methods
The core idea behind ensemble-based uncertainty quantification is that if multiple independently trained models disagree on a prediction, this disagreement indicates uncertainty about the correct answer.4 Conversely, when all models in the ensemble agree, this suggests higher confidence in the prediction. This intuition translates into concrete uncertainty measures through the variance or spread of ensemble predictions.
If f₁, f₂, ..., fₙ represent the predictions of the N ensemble members for input x, the uncertainty can be quantified as the ensemble variance

σ²(x) = (1/N) Σᵢ (fᵢ(x) − f̄(x))²

where f̄(x) is the ensemble mean. In practice, you train multiple diverse models (using different architectures, training data subsets or initializations) and combine their predictions. The main drawback of this approach is the computational cost: it requires training and running multiple models.
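A minimal sketch of ensemble-based uncertainty, using scikit-learn models trained on bootstrap resamples of a synthetic dataset:

```python
# Ensemble uncertainty sketch: train several models on bootstrap resamples
# and use the spread of their predictions as the uncertainty estimate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, size=200)

n_members = 10
members = []
for _ in range(n_members):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap resample
    members.append(DecisionTreeRegressor(max_depth=4).fit(X[idx], y[idx]))

X_test = np.linspace(-3, 3, 50).reshape(-1, 1)
preds = np.stack([m.predict(X_test) for m in members])  # shape (N, n_test)

ensemble_mean = preds.mean(axis=0)  # f̄(x)
ensemble_var = preds.var(axis=0)    # σ²(x), the uncertainty measure
print(ensemble_mean[:5], ensemble_var[:5])
```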
Conformal prediction
Conformal prediction is a technique for uncertainty quantification that provides a distribution-free, model-agnostic framework for creating prediction intervals (for regression scenarios) or prediction sets (for classification applications).3 It offers valid coverage guarantees under minimal assumptions about the model or data, which makes conformal prediction particularly helpful when working with black-box pretrained models.
Conformal prediction has several features that make it widely applicable. For instance, it requires only that data points are exchangeable, rather than independent and identically distributed. Conformal prediction can also be applied to any predictive model and lets you choose, in advance, how much predictive uncertainty you are willing to tolerate.
For instance, in a regression task you might want to achieve 95% coverage, meaning the model should output a range that contains the true value 95% of the time. This approach is model independent and works well with classification, linear regression, neural networks and a wide variety of time series models.
To use conformal prediction, you split your data into three sets: a training set, a baseline testing set and a calibration set. The calibration set is used to compute nonconformity scores, often denoted sᵢ. This score measures how unusual a prediction is. Given a new input, you form a prediction interval based on these scores to guarantee coverage.
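A compact sketch of split conformal regression using absolute residuals as the nonconformity scores (the model and data are placeholders):

```python
# Split conformal regression sketch: calibrate absolute residuals, then
# build prediction intervals with roughly 95% coverage.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = 3.0 * X.ravel() + rng.normal(0, 2.0, size=500)

# Split into training and calibration sets
X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Nonconformity scores s_i = |y_i - y_hat_i| on the calibration set
scores = np.abs(y_cal - model.predict(X_cal))

# Threshold q: finite-sample-corrected 95th percentile of the scores
# (np.quantile's `method` argument requires a recent NumPy version)
n = len(scores)
q = np.quantile(scores, np.ceil(0.95 * (n + 1)) / n, method="higher")

# Prediction interval for a new input
X_new = np.array([[5.0]])
y_hat = model.predict(X_new)[0]
print(f"interval: [{y_hat - q:.2f}, {y_hat + q:.2f}]")
```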
In a classification task, the nonconformity score measures how much a new instance deviates from the existing instances used to calibrate the model, which determines whether the new instance is assigned to a particular class. For multiclass classification, the score is typically 1 minus the predicted class probability for the label in question.
So, if the predicted probability of a new instance belonging to a certain class is high, the nonconformity score is low, and vice versa. A common approach is to compute the sᵢ scores for each instance in the calibration set and sort the scores from low (certain) to high (uncertain).
To reach 95% conformal coverage, compute the threshold q at or below which 95% of the calibration sᵢ scores fall. For a new test example, include a label in the prediction set if its sᵢ is no greater than the threshold q.
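Putting those steps together for a classifier (the model and data are placeholders, and the scoring rule is the 1 minus predicted-probability score described above):

```python
# Conformal prediction sets for classification: score = 1 - p(true class)
# on the calibration set, then include every label whose score is at or
# below the 95% threshold q.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.4, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Nonconformity scores: 1 minus the probability assigned to the true class
cal_probs = clf.predict_proba(X_cal)
scores = 1.0 - cal_probs[np.arange(len(y_cal)), y_cal]

# Threshold q so that ~95% of calibration scores fall at or below it
n = len(scores)
q = np.quantile(scores, np.ceil(0.95 * (n + 1)) / n, method="higher")

# Prediction set for a new example: keep labels with score <= q
new_probs = clf.predict_proba(X_cal[:1])[0]
prediction_set = [label for label, p in enumerate(new_probs) if 1.0 - p <= q]
print(prediction_set)
```

Note that the prediction set can contain more than one label when the classifier is unsure, which is exactly how conformal prediction expresses uncertainty.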
If you required a guarantee that your model had 95% conformal coverage, you would compute the sᵢ scores on the calibration set and find the threshold that contains 95% of them. You can then be assured that, on average, the prediction sets for new instances contain the true class 95% of the time, across all classes.
This is slightly different from the accuracy of the classifier, because a conformal prediction set might contain multiple classes. In a multiclass classifier, conformal prediction can also report coverage per class: you can target a coverage rate for individual classes rather than only for the data set as a whole.