Risk Managed

Thoughts on finance and risk management

Calibration testing after the fact

In statistics, the term “calibration” often refers to checking whether a model’s predicted probabilities match the observed frequencies of the response variable: for example, a logistic model predicting success in some project is well calibrated if the predicted probability for a given input category matches the observed success rate in that category.
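
As a small illustration of this check (not from the original post), the sketch below groups binary outcomes by an input category and compares the mean predicted probability with the observed success rate in each group; the function and variable names (`calibration_by_category`, `p_hat`, `y`) are my own.

```python
import numpy as np

def calibration_by_category(categories, p_hat, y):
    """Mean predicted probability vs. observed success rate, per input category."""
    categories, p_hat, y = map(np.asarray, (categories, p_hat, y))
    out = {}
    for c in np.unique(categories):
        mask = categories == c
        # A well-calibrated model has these two numbers close in every category.
        out[c] = (float(p_hat[mask].mean()), float(y[mask].mean()), int(mask.sum()))
    return out
```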

Calibration diagnostics are also used to assess whether the predictive distribution under a given model is appropriate in terms of location, scale, and skewness.

The most common diagnostics for calibration rely on the probability integral transform: for a calibration value $Y \sim F$ with continuous distribution function $F$, the transformed value $U = F(Y)$ is uniformly distributed on $[0, 1]$. Applying this to a series of calibration values $Y_i$, we can assess calibration by testing the uniformity of $(U_i)_{i=1}^n$.
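
A minimal sketch of this diagnostic, assuming a standard normal model as $F$ and a Kolmogorov-Smirnov test for uniformity (any uniformity test could be substituted):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(loc=0.0, scale=1.0, size=500)   # calibration values Y_i, here drawn from N(0, 1)
u = stats.norm.cdf(y, loc=0.0, scale=1.0)      # PIT values U_i = F(Y_i) under the candidate model

# Under a well-calibrated model the U_i are uniform on [0, 1];
# here we test this with a Kolmogorov-Smirnov test against the uniform distribution.
ks = stats.kstest(u, "uniform")
print(ks.statistic, ks.pvalue)
```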

Czado, Gneiting and Held (2009) emphasize that calibration should be measured on a held-out set of data. If the same data are used first for estimation and then for calibration, the reliability of the calibration test is compromised: the model has already “seen” the data during the estimation phase, and will therefore tend to appear well calibrated on that same sample.

If you arrive after the fact and are asked to perform calibration testing using

$$\widehat{U}_{i} = F_{\widehat{\theta}}(Y_i)$$

where $\widehat{\theta}$ is estimated from the same sample $(Y_i)_{i=1}^n$, the following parametric bootstrap procedure yields a valid test:

  1. Denote by $\widehat{T}$ the test statistic for uniformity of the $\widehat{U}_i$, computed from the original sample $(Y_i)_{i=1}^n$
  2. For $r = 1, \ldots, R$ a large number of bootstrap iterates,
    1. Draw a sample $(Y^r_i)_{i=1}^n$ from $F_{\widehat{\theta}}$
    2. Estimate the corresponding parameter $\theta^r$
    3. Set $U^r_i =F_{\theta^r}(Y^r_i)$
    4. Compute the test statistic $T_r$ for uniformity of $U^r_i$
  3. Compute the overall p-value $p = \frac{1}{R}\sum_{r=1}^R \mathbf{1}_{T_r>\widehat{T}}$
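
A sketch of the full procedure, assuming a Gaussian model fitted by maximum likelihood and the Kolmogorov-Smirnov statistic as the uniformity statistic $T$ (both choices are illustrative, not prescribed above):

```python
import numpy as np
from scipy import stats

def bootstrap_calibration_pvalue(y, R=999, seed=None):
    """Parametric-bootstrap p-value for PIT uniformity when theta is estimated in-sample."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = y.size

    # Estimate theta-hat on the observed sample and compute the observed statistic T-hat.
    mu_hat, sigma_hat = y.mean(), y.std(ddof=0)
    t_hat = stats.kstest(stats.norm.cdf(y, mu_hat, sigma_hat), "uniform").statistic

    # Steps 2.1-2.4: simulate from F_{theta-hat}, re-estimate theta on each
    # simulated sample, and recompute the uniformity statistic.
    t_boot = np.empty(R)
    for r in range(R):
        y_r = rng.normal(mu_hat, sigma_hat, size=n)
        mu_r, sigma_r = y_r.mean(), y_r.std(ddof=0)
        t_boot[r] = stats.kstest(stats.norm.cdf(y_r, mu_r, sigma_r), "uniform").statistic

    # Step 3: proportion of bootstrap statistics exceeding the observed one.
    return float(np.mean(t_boot > t_hat))
```

Because the parameters are re-estimated on each simulated sample, the bootstrap distribution of $T_r$ incorporates the in-sample optimism described above, so comparing $\widehat{T}$ against it gives a valid p-value.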