notes · 12

Hypothesis Testing & Confidence Intervals

This is where descriptive statistics turns into decision-making. Confidence intervals tell you what values are plausible. Hypothesis tests tell you whether the observed evidence is strong enough to challenge a model, a parameter, or a claim.

Start with confidence intervals, then p-values, then Type I and Type II errors, then power, and finally PD backtesting. That is the practical path from theory to validation.


the framework

From data to statistical decisions

Confidence intervals

A confidence interval gives a range of plausible values for an unknown parameter. A 95% CI means that across many repeated samples, 95% of intervals constructed in the same way would contain the true parameter.

Width depends on three things: variability, sample size, and confidence level. More data or less noise narrows the interval; a higher confidence level widens it.

CI = point estimate ± critical value × standard error

Validation example: the observed default rate is 2.3% with a 95% CI of [1.8%, 2.8%]. If the model predicts 2.5%, that sits inside the interval, so there is no immediate evidence of miscalibration.
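The validation example can be reproduced in a few lines of Python. This is a minimal sketch, assuming a normal-approximation (Wald) interval for a proportion. The portfolio size n = 3450 is a hypothetical value that is not given above; it was chosen only because it yields an interval close to the quoted [1.8%, 2.8%]. The function name proportion_ci is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(rate, n, confidence=0.95):
    """Wald interval: point estimate +/- critical value * standard error."""
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    se = sqrt(rate * (1 - rate) / n)  # standard error of a proportion
    return rate - z_star * se, rate + z_star * se


observed_rate = 0.023  # observed default rate from the example
predicted_pd = 0.025   # model prediction from the example
n_obligors = 3450      # ASSUMPTION: hypothetical portfolio size

low, high = proportion_ci(observed_rate, n_obligors)
inside = low <= predicted_pd <= high
print(f"95% CI: [{low:.3%}, {high:.3%}]; prediction inside: {inside}")
```

With a much smaller portfolio the same observed rate would give a wider interval, so the same prediction would be even harder to challenge.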

Hypothesis testing

Start with a null hypothesis: for example, “the model is correctly calibrated.” Then compute a test statistic and ask how unusual the observed data would be if that null were actually true.

The p-value measures exactly that: the probability, computed assuming H₀ is true, of seeing data at least as extreme as what was observed. A small p-value means the data are difficult to reconcile with the null.

p-value < α → reject H₀

Validation example: observed defaults = 45 where the model expected 30. If the binomial p-value is 0.004, a result at least this extreme is unlikely under H₀, and the calibration claim is rejected at α = 0.01.

learning sequence

A useful order for learning inference

Start with interval thinking

Before “significance,” first learn how uncertainty around an estimate behaves. Confidence intervals make that visible.

Then learn the logic of H₀

A hypothesis test is not “is my model true?” It is “would this data be surprising if the null were true?”

Then separate α, β, and power

False alarms, missed problems, and detection ability are different things. Serious validation work depends on that distinction.

Only then move into regulatory tests

Traffic light rules, binomial backtests, Hosmer-Lemeshow, Wald, and LR tests all make more sense once the framework is already clear.

interactive · confidence intervals

Confidence Interval Builder

Adjust the estimate, standard deviation, sample size, and confidence level. Watch the interval expand or contract as precision changes.

Interactive chart: sampling distribution with the confidence region shaded (legend: sampling distribution, confidence region, outside region).

Parameters (default values): sample mean x̄ = 5.0; standard deviation σ = 2.0; sample size n = 30; confidence level = 95%.

Outputs: standard error (SE = σ/√n), margin of error (z* × SE), CI width, confidence interval, interpretation.

CI = x̄ ± z* · (σ / √n)

Try this: keep the mean fixed and move n from 10 to 200. The interval tightens quickly at first and then more slowly, because its width scales with 1/√n. That is the basic intuition behind why larger samples make validation judgments more stable.
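The builder's arithmetic can be sketched with the standard library alone. This is an illustrative version, assuming a z-based interval with known σ, as in the formula above. The function name z_interval is hypothetical, and the inputs are the default settings listed for the builder.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(mean, sigma, n, confidence=0.95):
    """Return (se, margin, low, high) for a z-based interval with known sigma."""
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    se = sigma / sqrt(n)
    margin = z_star * se
    return se, margin, mean - margin, mean + margin


# Default settings from the builder: mean 5.0, sigma 2.0, n 30, 95% confidence.
se, margin, low, high = z_interval(5.0, 2.0, 30)
print(f"SE={se:.4f}  margin={margin:.4f}  CI=[{low:.4f}, {high:.4f}]")

# The "try this" experiment: same mean and sigma, n moved from 10 to 200.
width_10 = 2 * z_interval(5.0, 2.0, 10)[1]
width_200 = 2 * z_interval(5.0, 2.0, 200)[1]
print(f"width at n=10: {width_10:.3f}, at n=200: {width_200:.3f}")
```

Multiplying n by 20 shrinks the width by a factor of √20, roughly 4.5, not by 20.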

interactive · p-value

What a p-value actually means

A p-value is the tail area beyond the observed statistic, under the assumption that H₀ is true. It is not the probability that H₀ is true.

Interactive chart: standard normal density under H₀, two-tailed, with the p-value area shaded (legend: p-value area, null curve).

Controls (default values): observed statistic z_obs = 1.8; significance level α = 0.05.

Outputs: p-value, decision, critical z*.

Correct reading: if p = 0.03, data at least this extreme would appear about 3% of the time if H₀ were true.

Wrong reading: p = 0.03 does not mean “there is a 3% chance the null is true.”
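The tail-area definition translates directly into code. The sketch below assumes a two-tailed test on a standard normal statistic and uses the default control values z_obs = 1.8 and α = 0.05; the helper name two_tailed_p is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def two_tailed_p(z_obs):
    """Area in both tails beyond |z_obs| under the standard normal."""
    return 2 * (1 - std_normal.cdf(abs(z_obs)))


z_obs = 1.8
alpha = 0.05

p_value = two_tailed_p(z_obs)
z_crit = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"p = {p_value:.4f}, critical z* = {z_crit:.3f}, decision: {decision}")
```

Comparing p with α and comparing |z_obs| with the critical z* are two views of the same decision rule.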

interactive · Type I / II errors & power

False alarms, missed problems, and detection power

Hypothesis testing is not only about α. It is also about β and power. A test can be strict (low α) but weak (low power), or permissive but sensitive.

Decision	H₀ true	H₀ false
Fail to reject H₀	Correct (true negative)	Type II error (β), false negative
Reject H₀	Type I error (α), false positive	Correct; power = 1 − β

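Interactive chart: H₀ and H₁ distributions with the power region shaded (legend: H₀, H₁, power region).

Parameters (default values): effect size d = 1.5; α = 0.05; σ = 1.0.

Outputs: Type I error rate (α), Type II error rate (β), power, critical z*.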

Validation trade-off: lowering α reduces false alarms, but increases the chance of missing a real problem unless the effect is large or the sample is large.
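The trade-off can be made concrete with a small calculation. The sketch below is one possible reading of the setup, not necessarily the one behind the interactive chart. It assumes a one-sided upper-tail z-test whose statistic is normal with mean 0 under H₀ and mean d under H₁, with the same standard deviation σ in both cases. The function name one_sided_power is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()


def one_sided_power(effect, sigma, alpha):
    """Power of an upper-tail z-test when the statistic's mean shifts by `effect`.

    ASSUMPTION: the statistic is normal with standard deviation `sigma`,
    centred at 0 under H0 and at `effect` under H1.
    """
    z_crit = std_normal.inv_cdf(1 - alpha)          # rejection threshold in sigma units
    beta = std_normal.cdf(z_crit - effect / sigma)  # P(fail to reject | H1 true)
    return 1 - beta, beta, z_crit


power_05, beta_05, z_05 = one_sided_power(effect=1.5, sigma=1.0, alpha=0.05)
power_01, beta_01, z_01 = one_sided_power(effect=1.5, sigma=1.0, alpha=0.01)
print(f"alpha=0.05: z*={z_05:.3f}, beta={beta_05:.3f}, power={power_05:.3f}")
print(f"alpha=0.01: z*={z_01:.3f}, beta={beta_01:.3f}, power={power_01:.3f}")
```

Under these assumptions, tightening α from 0.05 to 0.01 moves the critical value to the right and lowers power for the same effect size.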

interactive · PD backtesting

Binomial PD backtest in action

This is one of the classic validation questions: the model predicts PD = p for a pool of n exposures, but the realised defaults are d. Is that still plausible under H₀?

Interactive chart: binomial distribution under H₀, showing P(X = k) with the observed default count marked.

Parameters (default values): predicted PD = 3.0%; portfolio size n = 200; observed defaults d = 10.

Outputs: expected defaults (n × PD); observed default rate (d / n); upper-tail p-value, P(X ≥ d | H₀); traffic light; z-score (normal approximation); 95% CI for the observed default rate (ODR).

Interpretation: green if p ≥ 0.05, yellow if 0.01 ≤ p < 0.05, red if p < 0.01. This is a practical traffic-light version of calibration testing.
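The backtest outputs can be sketched as follows, using the default parameters (PD = 3%, n = 200, d = 10) and the traffic-light bands quoted above. Several choices here are assumptions: the z-score uses the standard error under H₀, the interval for the observed default rate is a Wald interval, and the function name pd_backtest is hypothetical. The interactive version may compute these differently.

```python
from math import comb, sqrt
from statistics import NormalDist


def pd_backtest(pd_pred, n, d):
    """One-sided binomial PD backtest with green / yellow / red bands."""
    # Upper-tail p-value P(X >= d | H0), computed as 1 - P(X <= d - 1).
    lower = sum(comb(n, k) * pd_pred**k * (1 - pd_pred) ** (n - k) for k in range(d))
    p_value = 1 - lower

    if p_value >= 0.05:
        light = "green"
    elif p_value >= 0.01:
        light = "yellow"
    else:
        light = "red"

    odr = d / n  # observed default rate
    # Normal approximation, with the standard error evaluated under H0.
    z = (odr - pd_pred) / sqrt(pd_pred * (1 - pd_pred) / n)
    # ASSUMPTION: Wald interval around the observed default rate.
    half = NormalDist().inv_cdf(0.975) * sqrt(odr * (1 - odr) / n)
    return {
        "expected": n * pd_pred,
        "odr": odr,
        "p_value": p_value,
        "light": light,
        "z": z,
        "ci": (odr - half, odr + half),
    }


result = pd_backtest(pd_pred=0.03, n=200, d=10)
for name, value in result.items():
    print(name, value)
```

With these defaults, ten defaults against six expected is not enough to leave the green band, which illustrates how much slack a 200-exposure pool gives the model.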

reference

Tests commonly used in validation

Test	Null Hypothesis	When used	Statistic	Validation role
Binomial test	d ~ Binom(n, PD)	PD backtesting	exact p-value	Traffic light, calibration checks
z-test (proportion)	true default rate = PD	Large-n PD test	z = (p̂ − PD) / SE	Quick approximation to binomial logic
Chi-squared / HL	Model fits grouped data	Bucket calibration	χ²	Calibration across grades or deciles
t-test	μ = μ₀ or μ₁ = μ₂	Comparing means	t	Coefficient and sample mean significance
KS test	Same distribution	Separation testing	D = max |F₁ − F₂|	Discrimination check
Wald test	β = 0	Coefficient significance	β̂ / SE(β̂)	Predictor-level inference
Likelihood Ratio	Restricted model is enough	Model comparison	−2(LL_restricted − LL_full)	Does extra complexity help?
Anderson-Darling	Data follows target distribution	Normality checks	A²	Residual or assumption checking

deeper concepts

Concepts every validator should keep

p-value traps

Small p is not the whole story

A tiny p-value can reflect a trivial effect with a huge sample. Always combine significance with effect size and materiality.
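A short calculation shows how sample size alone can drive the p-value. The sketch below runs a two-sided z-test for a proportion on entirely hypothetical numbers: a predicted PD of 2.50% against an observed rate of 2.56%, first with 10,000 exposures and then with 1,000,000. The function name proportion_z_test is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(observed_rate, pd_pred, n):
    """Two-sided z-test of H0: the true default rate equals pd_pred."""
    se = sqrt(pd_pred * (1 - pd_pred) / n)  # standard error under H0
    z = (observed_rate - pd_pred) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# ASSUMPTION: hypothetical figures; the gap is 0.06 percentage points in both cases.
z_small, p_small = proportion_z_test(0.0256, 0.025, n=10_000)
z_huge, p_huge = proportion_z_test(0.0256, 0.025, n=1_000_000)
print(f"n=10,000:    z={z_small:.2f}, p={p_small:.4f}")
print(f"n=1,000,000: z={z_huge:.2f}, p={p_huge:.6f}")
```

The effect is identical in both runs; only the sample size changes. Whether a gap of 0.06 percentage points matters is a materiality question that the p-value cannot answer.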

multiple testing

False positives accumulate

If many segments or grades are each tested at level α, the chance that at least one fails purely by chance grows with the number of tests. Apply a multiple-comparison correction and read isolated failures in context.
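The accumulation is easy to quantify if the tests are treated as independent, which is a simplifying assumption: segments of one portfolio are often correlated. The sketch below computes the family-wise error rate for a hypothetical 20 tests at α = 0.05, then applies a Bonferroni correction as one common remedy.

```python
def familywise_error_rate(alpha, m):
    """P(at least one false rejection) across m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


alpha = 0.05
n_tests = 20  # ASSUMPTION: hypothetical number of segments or grades

uncorrected = familywise_error_rate(alpha, n_tests)
bonferroni_alpha = alpha / n_tests  # per-test threshold after Bonferroni correction
corrected = familywise_error_rate(bonferroni_alpha, n_tests)

print(f"uncorrected family-wise error rate: {uncorrected:.3f}")
print(f"Bonferroni per-test alpha: {bonferroni_alpha:.4f}")
print(f"corrected family-wise error rate: {corrected:.3f}")
```

The correction keeps the overall false-alarm rate near α, at the cost of lower power for each individual test.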

one-sided vs two-sided

Direction changes the test

Many risk settings care more about underestimation than overestimation, so one-sided tests can be more aligned with the real question.

power

Weak tests can “pass” bad models

Small portfolios or mild miscalibration often lead to low power. A non-significant result does not automatically mean the model is fine.
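The point can be illustrated by computing the power of an upper-tail binomial backtest directly. The sketch below uses hypothetical figures: a predicted PD of 3%, a true PD of 4.5%, α = 0.05, and portfolios of 200 and 2000 exposures. The function names are illustrative, and the pmf is evaluated in log space so the larger portfolio does not overflow.

```python
from math import exp, lgamma, log


def binom_pmf(k, n, p):
    """Binomial pmf computed in log space to stay stable for large n."""
    log_pmf = (
        lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        + k * log(p) + (n - k) * log(1 - p)
    )
    return exp(log_pmf)


def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


def backtest_power(n, pd_pred, pd_true, alpha=0.05):
    """Return (critical count, probability of rejecting H0 when PD is pd_true)."""
    # Smallest default count whose upper-tail probability under H0 is at most alpha.
    k_crit = next(k for k in range(n + 2) if upper_tail(k, n, pd_pred) <= alpha)
    return k_crit, upper_tail(k_crit, n, pd_true)


# ASSUMPTION: hypothetical scenario, true PD is 1.5 times the predicted PD.
k_small, power_small = backtest_power(n=200, pd_pred=0.03, pd_true=0.045)
k_large, power_large = backtest_power(n=2000, pd_pred=0.03, pd_true=0.045)
print(f"n=200:  reject at {k_small}+ defaults, power={power_small:.2f}")
print(f"n=2000: reject at {k_large}+ defaults, power={power_large:.2f}")
```

Under these assumptions the small portfolio misses a 50% underestimation of PD more often than it catches it, so a green result there says little about the model.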

practical significance

Statistical and economic materiality differ

A difference can be statistically detectable and still operationally irrelevant, or materially serious and statistically hard to prove in small samples.

validation style

No single test is enough

Good validation combines interval thinking, hypothesis tests, discrimination metrics, stability checks, and judgement.

summary

What to leave this page with

Confidence intervals tell you what parameter values remain plausible. Hypothesis tests tell you whether the observed data is too difficult to reconcile with the null.

The useful order is: first learn interval uncertainty, then p-values, then Type I and Type II errors, then power, and only after that apply the tools to PD backtesting and validation tests.

Once that structure is clear, inferential statistics stops being a list of formulas and becomes a decision framework.