Risk Managed

Thoughts on finance and risk management

Calibration testing after the fact

In statistics, the term “calibration” often refers to checking whether a model faithfully reproduces observed probabilities on the response variables: for example, a logistic model predicting success in some project is well calibrated if the predicted probability for a given input category matches the observed success rate in that category.

Calibration diagnostics are also used to assess whether the predicted distribution under a given model is appropriate in terms of location, scale, and skewness.

The most common diagnostics for calibration rely on the probability integral transform: for a calibration value $Y \sim F$, the transformed ordinate $U = F(Y)$ is uniformly distributed. If this is done for a series of calibration values $Y_i$, we can test calibration by testing the uniformity of $(U_i)_{i=1}^n$.
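As a quick illustration, here is a sketch of a PIT-based uniformity check, assuming (purely for illustration) a normal model and a Kolmogorov–Smirnov test of uniformity; neither choice is prescribed by the argument above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
y = rng.normal(loc=1.0, scale=2.0, size=500)  # calibration sample, F = N(1, 2^2)

# PIT under the correctly specified F: U = F(Y) should be Uniform(0, 1).
u_good = stats.norm.cdf(y, loc=1.0, scale=2.0)
# PIT under a misspecified F (scale too small): U piles up near 0 and 1.
u_bad = stats.norm.cdf(y, loc=1.0, scale=1.0)

# Kolmogorov-Smirnov tests of uniformity on the PIT values.
p_good = stats.kstest(u_good, "uniform").pvalue
p_bad = stats.kstest(u_bad, "uniform").pvalue
print(p_good, p_bad)  # p_bad is very small
```

The misspecified model is flagged immediately, while the correctly specified one is not rejected.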

Czado, Gneiting and Held (2009) emphasize that calibration should be measured on a held-out set of data. If the same data are used first for estimation and then for calibration, the reliability of the calibration test is compromised: the model has already “seen” the data during estimation, and will therefore tend to appear well calibrated on that same sample.

If you arrive after the fact and are called to perform calibration testing using

$$\widehat{U}_{i} = F_{\widehat{\theta}}(Y_i)$$

where $\widehat{\theta}$ is estimated from $Y_i$, the following procedure yields valid tests:

  1. Denote by $\widehat{T}$ the uniformity test statistic computed from the $\widehat{U}_i$
  2. For $r = 1, \ldots, R$ a large number of bootstrap iterates,
    1. Draw a sample $(Y^r_i)_{i=1}^n$ from $F_{\widehat{\theta}}$
    2. Estimate the corresponding parameter $\theta^r$
    3. Set $U^r_i =F_{\theta^r}(Y^r_i)$
    4. Compute the test statistic $T_r$ for uniformity of $U^r_i$
  3. Compute the overall p-value $p = \frac{1}{R}\sum_{r=1}^R \mathbf{1}_{T_r>\widehat{T}}$
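The procedure above can be sketched as follows, assuming (for illustration) a normal model fitted by maximum likelihood and a Kolmogorov–Smirnov statistic as the uniformity test; `bootstrap_calibration_pvalue` is a hypothetical helper name, not a library function:

```python
import numpy as np
from scipy import stats

def ks_uniform(u):
    """Kolmogorov-Smirnov statistic for uniformity of the values u."""
    return stats.kstest(u, "uniform").statistic

def bootstrap_calibration_pvalue(y, R=200, seed=0):
    rng = np.random.default_rng(seed)
    # Fit the model on the observed sample and compute T_hat from the PIT values.
    mu_hat, sd_hat = y.mean(), y.std()
    t_hat = ks_uniform(stats.norm.cdf(y, mu_hat, sd_hat))
    # Parametric bootstrap: simulate from F_theta_hat, re-estimate, recompute T.
    t_boot = np.empty(R)
    for r in range(R):
        y_r = rng.normal(mu_hat, sd_hat, size=len(y))
        mu_r, sd_r = y_r.mean(), y_r.std()
        t_boot[r] = ks_uniform(stats.norm.cdf(y_r, mu_r, sd_r))
    # Overall p-value: fraction of bootstrap statistics exceeding T_hat.
    return np.mean(t_boot > t_hat)

rng = np.random.default_rng(1)
p_good = bootstrap_calibration_pvalue(rng.normal(0.5, 1.5, size=300))  # well specified
p_bad = bootstrap_calibration_pvalue(rng.exponential(1.0, size=300))   # misspecified
print(p_good, p_bad)
```

In practice the Monte Carlo p-value is often computed with the finite-sample adjustment $(1 + \sum_r \mathbf{1}_{T_r>\widehat{T}})/(R+1)$, which avoids reporting an exact zero.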

Online updating of a covariance matrix

The sample covariance matrix of a sample $\mathbf{X}_{i,j}$, $i = 1,\ldots,n$, $j=1,\ldots,p$ is given by $\Sigma_n = \frac{1}{n} (X-\bar{X})^T(X-\bar{X})$, where $\bar{X}$ denotes the matrix whose rows all equal the vector of column means.

In some applications this covariance matrix must be kept updated as each observation comes in, which is called online updating. The formulas below provide an efficient way of calculating $\Sigma_{n+1}$, the covariance matrix of the sample of size $n + 1$, from the previous covariance matrix and the new observation $X_{n+1}$ (a row vector). Note that the updated mean $\bar{X}_{n+1}$ must be computed first, since it appears in the covariance update.

$$\Sigma_{n+1} = \frac{n}{n+1} \Sigma_n + \frac{1}{n} \left(X_{n+1} - \bar{X}_{n+1}\right)^T\left(X_{n+1} - \bar{X}_{n+1}\right)$$

$$\bar{X}_{n+1} = \bar{X}_n + \frac{1}{n+1} \left(X_{n+1} - \bar{X}_n\right)$$
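A minimal sketch of these updates in Python (row-vector convention and $1/n$ normalization, as above), with the streamed result checked against the batch covariance:

```python
import numpy as np

def online_update(mean_n, cov_n, n, x_new):
    """Return (mean, cov) of the sample after appending x_new (shape (p,))."""
    # Update the mean first: it is needed in the covariance update.
    mean_next = mean_n + (x_new - mean_n) / (n + 1)
    dev = (x_new - mean_next)[:, None]  # column vector, shape (p, 1)
    cov_next = n / (n + 1) * cov_n + (dev @ dev.T) / n
    return mean_next, cov_next

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Initialize from the first two observations, then stream in the rest.
mean = X[:2].mean(axis=0)
cov = (X[:2] - mean).T @ (X[:2] - mean) / 2
n = 2
for x in X[2:]:
    mean, cov = online_update(mean, cov, n, x)
    n += 1

batch_cov = (X - X.mean(axis=0)).T @ (X - X.mean(axis=0)) / len(X)
print(np.allclose(cov, batch_cov))  # True
```

The streaming estimate agrees with the batch computation up to floating-point error.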

Heavy tails are a last resort

For a given problem, statistics offers models and techniques ranging from the simple to the very complex. Entire disciplines of science have been pioneered using the workhorses of ANOVA and linear regression, with more sophisticated techniques being developed later as refinements. The granularity of the method is only one factor among others, including data availability, ease of communication, and robustness.

A fundamental example of this relationship between granularity and variability is the case of a mixture model. Consider observations $y_i$ generated by a two-stage process: a latent parameter $\sigma$ determines the variability of the observations, and, conditionally on $\sigma$, each $y_i$ is drawn as a $N(0, \sigma^2)$ value.

The latent variation of $\sigma$ induces excess variability in the $y_i$ over repeated draws, beyond what would be expected from a Gaussian distribution with fixed variance, and will manifest as heavy-tailed behaviour when the distribution of $\sigma$ is appropriately chosen. For example, if each precision $1/\sigma^2$ is independently Gamma distributed (equivalently, each $\sigma^2$ is inverse-Gamma distributed), then the marginal distribution of $y_i$ is a Student $t$ distribution, which is heavy-tailed for low degrees of freedom.
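A quick simulation check of this mixture representation, assuming the precision $1/\sigma^2$ is drawn as Gamma with shape $\nu/2$ and rate $\nu/2$ (so $\sigma^2$ is inverse-Gamma); the marginal sample is then compared against the Student $t_\nu$ distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
nu, n = 4.0, 50_000

# sigma^2 is inverse-Gamma(nu/2, nu/2): draw Gamma(nu/2, rate nu/2) and invert.
# (numpy's gamma is parameterized by shape and scale = 1/rate.)
sigma2 = 1.0 / rng.gamma(shape=nu / 2, scale=2.0 / nu, size=n)
y = rng.normal(0.0, np.sqrt(sigma2))

# The marginal of y should be Student t with nu degrees of freedom.
ks = stats.kstest(y, stats.t(df=nu).cdf)
print(ks.statistic)  # close to zero: the sample matches t_nu
```

Despite every conditional draw being Gaussian, the marginal sample is indistinguishable from a heavy-tailed $t_4$.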

The only way to achieve a good fit for the $y_i$ without resorting to heavy tails would then be to investigate whether other observable factors are responsible for the variation in $\sigma$. If these are correctly identified and included in a model, it becomes possible to predict $\sigma$, and subsequently the $y_i$, without using heavy-tailed distributions at any stage.

A recent paper by Yi He and John Einmahl argues that heterogeneity explains at least some cases of apparent heavy-tailedness (for example, in stock returns), and that accounting for the heterogeneity allows more precise estimation of extreme-value behaviour. This means that even if you choose to go with a heavy-tail / extreme-value model, you can leverage heterogeneity to improve inference.

John H.J. Einmahl and Yi He, “Extreme Value Inference for Heterogeneous Power Law Data” (2023)