notes · 17

Resampling, Bootstrap & Model Uncertainty

A single metric is never the whole story. AUC, Gini, KS, accuracy, and Brier score all move when the sample changes. Resampling methods exist to show how stable those numbers really are.

Start with why point estimates are incomplete, then move into bootstrap distributions, confidence intervals, train-vs-test optimism, and challenger-vs-champion uncertainty. The goal is to turn model evaluation from a single-number ritual into a stability-aware judgment.


the core problem

A model metric is an estimate, not a fact

Point estimate thinking

“AUC = 0.74” sounds definitive, but it is only the value observed on one particular sample.

If the portfolio had been slightly different, the metric would also be slightly different.

Observed metric = sample-dependent estimate

Uncertainty-aware thinking

The real question is not only what the metric is, but how stable it is, how much it varies across plausible resamples, and whether model differences are large relative to sampling noise.

Good validation asks: estimate + uncertainty

Validation reality: a challenger AUC of 0.762 against a champion AUC of 0.751 is a gap of 0.011 that may or may not matter. Without uncertainty quantification, any conclusion drawn from that comparison is overconfident.

learning sequence

A useful order for learning resampling

Start with sampling variability

Every model metric changes when the sample changes. That fact should be understood before any formal testing.

Then bootstrap the metric itself

Bootstrap makes metric uncertainty visible by rebuilding the sample many times and recalculating the same statistic.

Then compare train and test behaviour

Resampling also helps reveal optimism bias: how much performance shrinks when you move away from the data used to fit or tune the model.

Then evaluate model differences probabilistically

The useful question becomes not “which number is bigger?” but “how likely is the improvement to be real and stable?”

interactive · bootstrap

Bootstrap the metric distribution

Pick a metric and a scenario, then resample the portfolio repeatedly. The histogram below shows the metric not as one number, but as a sampling distribution.

[Interactive chart: bootstrap distribution of the selected metric, with the original sample estimate marked, plus a percentile interval view. Controls: scenario, metric, bootstrap replications (400), sample size (800), default rate (8%). Readouts: original estimate, bootstrap mean, bootstrap standard deviation, 95% CI.]

Main lesson: a wide bootstrap distribution means the metric is unstable, even if the point estimate itself looks good.
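The resampling loop described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the page's own implementation: the portfolio is synthetic, the helper names (`simulate_portfolio`, `bootstrap_metric`, `percentile_interval`) are hypothetical, and the defaults simply mirror the control values shown (400 replications, 800 accounts, 8% default rate). The metric is AUC, computed as the share of default/non-default pairs ranked correctly, and the interval is a plain percentile interval.

```python
import random
import statistics


def auc(labels, scores):
    """Share of (default, non-default) pairs where the default scores higher.

    Ties count as half a win. Returns None when a class is missing.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return None
    pairs = sorted(zip(scores, labels))
    wins, neg_below, i = 0.0, 0, 0
    while i < len(pairs):
        # Collect the group of rows that share the same score.
        j, pos_tied, neg_tied = i, 0, 0
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                pos_tied += 1
            else:
                neg_tied += 1
            j += 1
        wins += pos_tied * (neg_below + 0.5 * neg_tied)
        neg_below += neg_tied
        i = j
    return wins / (n_pos * n_neg)


def simulate_portfolio(n=800, default_rate=0.08, separation=1.0, seed=1):
    """Synthetic data (an assumption): defaulters score higher on average."""
    rng = random.Random(seed)
    labels, scores = [], []
    for _ in range(n):
        y = 1 if rng.random() < default_rate else 0
        labels.append(y)
        scores.append(rng.gauss(separation * y, 1.0))
    return labels, scores


def bootstrap_metric(labels, scores, metric, n_boot=400, seed=0):
    """Resample rows with replacement and recompute the metric each time."""
    rng = random.Random(seed)
    n = len(labels)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = metric([labels[i] for i in idx], [scores[i] for i in idx])
        if value is not None:  # skip resamples with no defaults at all
            values.append(value)
    return values


def percentile_interval(values, alpha=0.05):
    """Simple percentile interval taken from the sorted bootstrap values."""
    ordered = sorted(values)
    low = ordered[int((alpha / 2) * (len(ordered) - 1))]
    high = ordered[int((1 - alpha / 2) * (len(ordered) - 1))]
    return low, high


labels, scores = simulate_portfolio()
boot = bootstrap_metric(labels, scores, auc)
low, high = percentile_interval(boot)
print(f"original estimate: {auc(labels, scores):.3f}")
print(f"bootstrap mean:    {statistics.mean(boot):.3f}")
print(f"bootstrap std:     {statistics.stdev(boot):.3f}")
print(f"95% CI:            [{low:.3f}, {high:.3f}]")
```

Passing the metric as a function keeps the loop reusable: any statistic that takes labels and scores (Gini, KS, Brier score) could be swapped in without touching the resampling code.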

interactive · sample size and interval width

Why bigger samples tighten uncertainty

This section shows the most basic and most important relationship: as effective sample size grows, the interval around the metric usually narrows, roughly in proportion to 1/√n.

[Interactive chart: CI width vs sample size. Controls: base model strength (0.70), default rate (6%). Readouts: interval width at n = 200, interval width at n = 2000, reduction, practical message.]

Low-default relevance: interval width is driven not just by portfolio size, but by event count. A large portfolio with almost no defaults can still behave like a statistically thin dataset.
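To put rough numbers on the event-count point, the sketch below uses the Hanley and McNeil (1982) approximation for the standard error of an AUC. That formula is a swapped-in analytic shortcut chosen for illustration; the page itself may compute widths by bootstrap. The function names and the three portfolios are assumptions: AUC 0.70 at a 6% default rate with 200 and 2,000 accounts, plus a hypothetical 20,000-account portfolio with only 60 defaults.

```python
import math


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of an AUC (Hanley & McNeil, 1982)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc**2 / (1 + auc)
    variance = (
        auc * (1 - auc)
        + (n_pos - 1) * (q1 - auc**2)
        + (n_neg - 1) * (q2 - auc**2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


def ci_width(auc, n_defaults, n_non_defaults, z=1.96):
    """Full width of a normal-approximation 95% interval around the AUC."""
    return 2 * z * hanley_mcneil_se(auc, n_defaults, n_non_defaults)


# AUC 0.70 and a 6% default rate at two portfolio sizes.
width_200 = ci_width(0.70, n_defaults=12, n_non_defaults=188)
width_2000 = ci_width(0.70, n_defaults=120, n_non_defaults=1880)

# Ten times more accounts than the n = 2000 case, but half as many defaults.
width_sparse = ci_width(0.70, n_defaults=60, n_non_defaults=19940)

print(f"n = 200 width:            {width_200:.3f}")
print(f"n = 2000 width:           {width_2000:.3f}")
print(f"reduction factor:         {width_200 / width_2000:.2f}")
print(f"n = 20000, 60 defaults:   {width_sparse:.3f}")
```

Under this approximation, ten times more data at the same default rate shrinks the interval by a factor close to √10, while the much larger portfolio with fewer defaults ends up with a wider interval than the n = 2000 case.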

interactive · optimism bias

Train performance is usually too optimistic

A model usually looks better on the data it was fitted or tuned on. This section shows the gap between apparent performance and out-of-sample performance.

[Interactive charts: train vs validation vs test AUC, and optimism by model complexity. Control: scenario. Readouts: train AUC, validation AUC, test AUC, optimism gap.]

Interpretation: the more model tuning and flexibility you allow, the bigger the chance that train performance becomes an overstatement.

Validation discipline: the best-looking model on development data is often not the best model on genuinely new data.
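One way to see this optimism without fitting anything complex is a selection experiment. In the sketch below, every candidate score has exactly the same true strength, so any difference between candidates on development data is pure sampling noise. Picking the best-looking candidate on development data and then scoring it on fresh data shows how selection alone inflates apparent performance. Everything here is an illustrative assumption: the synthetic data, the balanced classes, the 30 candidates, and the function names.

```python
import random
import statistics


def auc(labels, scores):
    """Share of (default, non-default) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def draw_sample(n_per_class, n_candidates, signal, rng):
    """Balanced synthetic sample; features[j][i] is candidate j's score for row i."""
    labels = [1] * n_per_class + [0] * n_per_class
    features = [
        [signal * y + rng.gauss(0, 1) for y in labels]
        for _ in range(n_candidates)
    ]
    return labels, features


def selection_optimism(n_per_class=40, n_candidates=30, signal=0.5,
                       n_trials=20, seed=7):
    """Average dev AUC of the selected candidate vs its AUC on fresh data."""
    rng = random.Random(seed)
    dev_best, test_selected = [], []
    for _ in range(n_trials):
        y_dev, x_dev = draw_sample(n_per_class, n_candidates, signal, rng)
        y_test, x_test = draw_sample(n_per_class, n_candidates, signal, rng)
        dev_aucs = [auc(y_dev, column) for column in x_dev]
        # "Tuning": keep whichever candidate looks best on development data.
        winner = max(range(n_candidates), key=lambda j: dev_aucs[j])
        dev_best.append(dev_aucs[winner])
        test_selected.append(auc(y_test, x_test[winner]))
    return statistics.mean(dev_best), statistics.mean(test_selected)


mean_dev, mean_test = selection_optimism()
print(f"development AUC of selected candidate: {mean_dev:.3f}")
print(f"test AUC of the same candidate:        {mean_test:.3f}")
print(f"optimism gap:                          {mean_dev - mean_test:.3f}")
```

The gap in this toy setup comes only from choosing among candidates. Real tuning (feature selection, hyperparameters, binning) adds further routes by which development data can flatter the chosen model.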

interactive · champion vs challenger uncertainty

Is the challenger really better?

Compare champion and challenger through bootstrap deltas rather than raw point estimates. This is closer to the real decision problem.

[Interactive charts: bootstrap delta distribution (challenger minus champion) and win probability view. Control: scenario. Readouts: champion mean, challenger mean, mean delta, 95% delta CI, P(challenger > champion), decision hint.]

Better framing: “the challenger wins in 88% of bootstrap resamples” is often more decision-useful than a bare delta of +0.007.
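A paired bootstrap is one common way to produce such a statement. The sketch below resamples accounts once per replication and scores both models on the same resampled rows, so the delta reflects the models' difference rather than two independent draws of sampling noise. The data, the noise levels, and the helper names (`simulate_models`, `paired_bootstrap_delta`) are hypothetical, and the share of resamples with a positive delta is a bootstrap frequency, not a formal posterior probability.

```python
import random
import statistics


def auc(labels, scores):
    """Share of (default, non-default) pairs ranked correctly; ties count half."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return None
    pairs = sorted(zip(scores, labels))
    wins, neg_below, i = 0.0, 0, 0
    while i < len(pairs):
        j, pos_tied, neg_tied = i, 0, 0
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                pos_tied += 1
            else:
                neg_tied += 1
            j += 1
        wins += pos_tied * (neg_below + 0.5 * neg_tied)
        neg_below += neg_tied
        i = j
    return wins / (n_pos * n_neg)


def simulate_models(n=800, seed=11):
    """Synthetic portfolio (an assumption): both models observe the same
    latent risk, the challenger with slightly less noise."""
    rng = random.Random(seed)
    labels, champion, challenger = [], [], []
    for _ in range(n):
        risk = rng.gauss(0, 1)
        labels.append(1 if risk + rng.gauss(0, 1) > 1.8 else 0)
        champion.append(risk + rng.gauss(0, 1.0))
        challenger.append(risk + rng.gauss(0, 0.8))
    return labels, champion, challenger


def paired_bootstrap_delta(labels, champion, challenger, n_boot=400, seed=0):
    """AUC(challenger) minus AUC(champion) on identical resampled rows."""
    rng = random.Random(seed)
    n = len(labels)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        a = auc(y, [champion[i] for i in idx])
        b = auc(y, [challenger[i] for i in idx])
        if a is not None and b is not None:
            deltas.append(b - a)
    return deltas


labels, champion, challenger = simulate_models()
deltas = sorted(paired_bootstrap_delta(labels, champion, challenger))
low = deltas[int(0.025 * (len(deltas) - 1))]
high = deltas[int(0.975 * (len(deltas) - 1))]
win_share = sum(d > 0 for d in deltas) / len(deltas)
print(f"mean delta:   {statistics.mean(deltas):+.4f}")
print(f"95% delta CI: [{low:+.4f}, {high:+.4f}]")
print(f"share of resamples where challenger wins: {win_share:.0%}")
```

If the delta interval comfortably excludes zero and the win share is high, the improvement is less likely to be sampling noise. An interval straddling zero suggests the two models cannot be separated on this sample.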

reference

Resampling methods compared

Method	Main use	Strength	Main limitation
Bootstrap	Metric uncertainty, intervals, delta comparisons	Very flexible	Depends on sample representativeness
k-fold CV	Estimate out-of-sample performance	Efficient use of data	Can still be optimistic if tuning leaks
Repeated CV	Reduce split randomness	More stable than one split	More computationally expensive
Train / validation / test split	Simple holdout logic	Easy to explain	High variance if split is unlucky
Jackknife	Influence / leave-one-out sensitivity	Useful for influence diagnostics	Less general for model comparison
Out-of-time validation	Temporal robustness	Closest to production reality	Needs enough history and stable labels

deeper concepts

Concepts every validator should keep

point estimates

One number hides volatility

AUC = 0.76 may look solid, but without interval context you do not know whether it is robust or fragile.

event scarcity

Low-default portfolios magnify uncertainty

When defaults are sparse, even reasonable-looking performance metrics can vary materially across resamples.

optimism

Development data flatters the model

Every tuning decision borrows information from the development sample, which is why independent testing matters.

delta focus

Model comparison needs uncertainty too

The uncertainty of the difference is often more important than the uncertainty of each model alone.

stability

Good models are repeatable

A model that wins only in a narrow subset of samples is less trustworthy than one with slightly lower but much more stable performance.

governance

Uncertainty should appear in reporting

Model governance improves when intervals, ranges, and stability statements appear next to point estimates in validation outputs.

summary

What to leave this page with

Resampling methods force a simple but important correction: performance is not a fixed fact, but a sample-dependent estimate.

The useful order is: first inspect bootstrap variability, then understand interval width and event scarcity, then compare train and test optimism, then judge challenger-vs-champion deltas with uncertainty attached.

Once that structure is clear, model validation becomes less about celebrating one number and more about assessing how much trust that number deserves.