
Hypothesis Testing & Confidence Intervals

This is where descriptive statistics turns into decision-making. Confidence intervals tell you what values are plausible. Hypothesis tests tell you whether the observed evidence is strong enough to challenge a model, a parameter, or a claim.

Start with confidence intervals, then p-values, then Type I and Type II errors, then power, and finally probability of default (PD) backtesting. That is the practical path from theory to validation.

From data to statistical decisions

Confidence intervals

A confidence interval gives a range of plausible values for an unknown parameter. A 95% CI means that across many repeated samples, 95% of intervals constructed in the same way would contain the true parameter.

Width depends on three things: variability, sample size, and confidence level. More data or less noise means a narrower interval; a higher confidence level means a wider one.

CI = point estimate ± critical value × standard error
Validation example: the observed default rate is 2.3% with a 95% CI of [1.8%, 2.8%]. If the model predicts 2.5%, that sits inside the interval, so there is no immediate evidence of miscalibration.
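The check in this example can be sketched in a few lines. This is a minimal illustration using the simple normal-approximation (Wald) interval for a proportion, which is only one of several possible interval constructions. The portfolio size `n_obligors` is a hypothetical value, chosen so that the interval roughly matches the one quoted above; the source does not state it.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(rate, n, confidence=0.95):
    """Wald interval: rate ± z * sqrt(rate * (1 - rate) / n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    se = sqrt(rate * (1 - rate) / n)
    return rate - z * se, rate + z * se


# Hypothetical portfolio size (an assumption, not given in the example).
n_obligors = 3455
low, high = proportion_ci(0.023, n_obligors)
predicted_pd = 0.025

print(f"95% CI: [{low:.1%}, {high:.1%}]")
print("prediction inside interval:", low <= predicted_pd <= high)
```

For small portfolios or very low default rates the Wald interval can behave poorly, so an exact or score-based interval may be a better choice there.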

Hypothesis testing

Start with a null hypothesis: for example, “the model is correctly calibrated.” Then compute a test statistic and ask how unusual the observed data would be if that null were actually true.

The p-value measures exactly that: the probability, assuming H₀ is true, of a result at least as extreme as the one observed. A small p-value means the data is difficult to reconcile with the null.

p-value < α → reject H₀
Validation example: observed defaults = 45 where the model expected 30. If the binomial p-value is 0.004, a result this extreme would occur only about 4 times in 1,000 if H₀ were true, so H₀ is rejected at α = 0.05 and the calibration claim is hard to defend.

A useful order for learning inference

01. Start with interval thinking

Before “significance,” first learn how uncertainty around an estimate behaves. Confidence intervals make that visible.

02. Then learn the logic of H₀

A hypothesis test is not “is my model true?” It is “would this data be surprising if the null were true?”

03. Then separate α, β, and power

False alarms, missed problems, and detection ability are different things. Serious validation work depends on that distinction.

04. Only then move into regulatory tests

Traffic light rules, binomial backtests, Hosmer-Lemeshow, Wald, and LR tests all make more sense once the framework is already clear.

Confidence Interval Builder

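A confidence interval is built from four inputs: the estimate, the standard deviation, the sample size, and the confidence level. The interval widens with more variability or a higher confidence level and narrows as the sample size grows.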
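A minimal sketch of such a builder, assuming a normal critical value (reasonable for large samples; a t critical value would be more appropriate for small ones). The function names and the input values are illustrative assumptions, not part of the original page.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(estimate, std_dev, n, confidence=0.95):
    """CI = point estimate ± critical value × standard error."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    margin = z * std_dev / sqrt(n)                   # std_dev / sqrt(n) is the standard error
    return estimate - margin, estimate + margin


def width(interval):
    return interval[1] - interval[0]


# Illustrative inputs: change one at a time and compare the widths.
base = confidence_interval(estimate=50.0, std_dev=10.0, n=100, confidence=0.95)
more_data = confidence_interval(50.0, 10.0, n=400, confidence=0.95)
more_noise = confidence_interval(50.0, 20.0, n=100, confidence=0.95)
higher_confidence = confidence_interval(50.0, 10.0, n=100, confidence=0.99)

for label, ci in [("base", base), ("4x data", more_data),
                  ("2x noise", more_noise), ("99% level", higher_confidence)]:
    print(f"{label:10s} width = {width(ci):.2f}")
```

Quadrupling the sample size halves the width, because the standard error shrinks with the square root of n.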

What a p-value actually means

A p-value is the tail area beyond the observed statistic, under the assumption that H₀ is true. It is not the probability that H₀ is true.
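As a sketch of the tail-area idea, the snippet below converts a z-statistic into a p-value under a standard normal null distribution. The function name `p_value` and the `alternative` labels are illustrative choices; the example statistic of 1.8 is hypothetical.

```python
from statistics import NormalDist


def p_value(z, alternative="two-sided"):
    """Tail area beyond the observed z-statistic, assuming H0 is true."""
    cdf = NormalDist().cdf
    if alternative == "greater":    # upper tail only
        return 1 - cdf(z)
    if alternative == "less":       # lower tail only
        return cdf(z)
    if alternative == "two-sided":  # both tails
        return 2 * (1 - cdf(abs(z)))
    raise ValueError(f"unknown alternative: {alternative}")


z_observed = 1.8  # hypothetical test statistic
print(f"one-sided (greater): {p_value(z_observed, 'greater'):.4f}")
print(f"two-sided:           {p_value(z_observed):.4f}")
```

With the same statistic, the one-sided test falls below 0.05 while the two-sided test does not, which is why the direction of the alternative has to be fixed before looking at the data.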

False alarms, missed problems, and detection power

Hypothesis testing is not only about α. It is also about β and power. A test can be strict (low α) but weak (low power), or permissive (high α) but sensitive (high power).

                     H₀ is actually TRUE                H₀ is actually FALSE
Fail to reject H₀    Correct (true negative)            Type II Error (β): false negative
Reject H₀            Type I Error (α): false positive   Correct (Power = 1 − β)
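To make α, β, and power concrete, here is a sketch for a one-sided exact binomial test. The portfolio size, the PD under H₀ (`p0`), and the assumed true PD (`p1`) are hypothetical inputs chosen for illustration.

```python
from math import comb


def upper_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def critical_count(n, p0, alpha):
    """Smallest default count c such that P(X >= c | p0) <= alpha."""
    for c in range(n + 2):
        if upper_tail(n, p0, c) <= alpha:
            return c


# Hypothetical setting: H0 says PD = 3%, but the true PD is 5%.
n, p0, p1, alpha = 200, 0.03, 0.05, 0.05

c = critical_count(n, p0, alpha)  # reject H0 when defaults >= c
size = upper_tail(n, p0, c)       # actual Type I error rate (at most alpha)
power = upper_tail(n, p1, c)      # probability of rejecting when the true PD is p1
beta = 1 - power                  # Type II error rate

print(f"reject H0 when defaults >= {c}")
print(f"Type I error rate = {size:.3f}, beta = {beta:.3f}, power = {power:.3f}")
```

Because the default count is discrete, the achieved Type I error rate is usually somewhat below the nominal α, and in this setting the test misses the miscalibration more often than it detects it.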

Binomial PD backtest in action

This is one of the classic validation questions: the model predicts PD = p for a pool of n exposures, but the realised defaults are d. Is that still plausible under H₀?

[Chart: binomial distribution under H₀, showing P(X = k) with the observed defaults marked]

Parameters

Predicted PD (%): 3.0
Portfolio size (n): 200
Observed defaults (d): 10

Outputs

Expected defaults: n × PD
Observed default rate (ODR): d / n
Upper-tail p-value: P(X ≥ d | H₀)
Traffic light: set by the upper-tail p-value
z-score: normal approximation
95% CI for ODR
Interpretation: green if p ≥ 0.05, yellow if 0.01 ≤ p < 0.05, red if p < 0.01. This is a practical traffic-light version of calibration testing.
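The backtest can be sketched as follows, using the parameter values listed above (PD = 3.0%, n = 200, d = 10) and the traffic-light thresholds from the interpretation. The function name, the dictionary keys, and the choice of a Wald interval for the observed default rate are assumptions made for this illustration.

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_backtest(pd_predicted, n, d):
    expected = n * pd_predicted
    odr = d / n

    # Exact upper-tail p-value: P(X >= d) with X ~ Binomial(n, PD).
    p_value = sum(comb(n, k) * pd_predicted**k * (1 - pd_predicted) ** (n - k)
                  for k in range(d, n + 1))

    # Normal approximation (no continuity correction).
    z = (d - expected) / sqrt(n * pd_predicted * (1 - pd_predicted))
    p_value_normal = 1 - NormalDist().cdf(z)

    if p_value >= 0.05:
        light = "green"
    elif p_value >= 0.01:
        light = "yellow"
    else:
        light = "red"

    # 95% Wald interval for the observed default rate, clipped to [0, 1].
    half = NormalDist().inv_cdf(0.975) * sqrt(odr * (1 - odr) / n)
    ci_95 = (max(0.0, odr - half), min(1.0, odr + half))

    return {"expected_defaults": expected, "observed_default_rate": odr,
            "p_value": p_value, "p_value_normal": p_value_normal,
            "z_score": z, "traffic_light": light, "ci_95": ci_95}


result = binomial_backtest(pd_predicted=0.03, n=200, d=10)
for key, value in result.items():
    print(key, value)
```

For these inputs the model expects 6 defaults against 10 observed, and the exact p-value of about 0.08 lands in the green band. The z-score of about 1.66 gives a one-sided normal p-value just under 0.05, so at this portfolio size the approximation and the exact test can fall on different sides of the green/yellow boundary.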

Tests commonly used in validation

Test                 Null hypothesis                   When used                 Statistic       Validation role
Binomial test        d ~ Binom(n, PD)                  PD backtesting            exact p-value   Traffic light, calibration checks
z-test (proportion)  p = PD                            Large-n PD test           z = (p̂−PD)/SE  Quick approximation to binomial logic
Chi-squared / HL     Model fits grouped data           Bucket calibration        χ²              Calibration across grades or deciles
t-test               μ = μ₀ or μ₁ = μ₂                 Comparing means           t               Coefficient and sample mean significance
KS test              Same distribution                 Separation testing        D = max|F₁−F₂|  Discrimination check
Wald test            β = 0                             Coefficient significance  β̂ / SE(β̂)     Predictor-level inference
Likelihood Ratio     Restricted model is enough        Model comparison          −2ΔLL           Does extra complexity help?
Anderson-Darling     Data follows target distribution  Normality checks          A²              Residual or assumption checking
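As one example from the table, the KS statistic D = max|F₁−F₂| can be computed directly from two empirical distribution functions. The sketch below does this for model scores of defaulters and non-defaulters; the score values are invented purely for illustration.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """D = max |F_a(x) - F_b(x)| over all observed values x."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in sorted(set(a) | set(b)):
        f_a = bisect_right(a, x) / len(a)  # empirical CDF of sample_a at x
        f_b = bisect_right(b, x) / len(b)  # empirical CDF of sample_b at x
        d = max(d, abs(f_a - f_b))
    return d


# Hypothetical model scores (higher = riskier).
defaulters = [0.62, 0.71, 0.55, 0.80, 0.67]
non_defaulters = [0.12, 0.25, 0.31, 0.18, 0.58, 0.40, 0.22, 0.35]

ks = ks_statistic(defaulters, non_defaulters)
print(f"KS statistic D = {ks:.3f}")
```

D ranges from 0 (identical score distributions) to 1 (perfect separation), so a larger value indicates stronger discrimination between the two groups.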

Concepts every validator should keep

p-value traps

Small p is not the whole story

A tiny p-value can reflect a trivial effect with a huge sample. Always combine significance with effect size and materiality.
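A small sketch of this trap, assuming a one-sided z-test for a proportion: the gap between predicted and observed default rate is held at 0.1 percentage points (hypothetical figures) while only the sample size changes.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p(observed_rate, pd_predicted, n):
    """Upper-tail p-value from the normal approximation to the binomial."""
    se = sqrt(pd_predicted * (1 - pd_predicted) / n)
    z = (observed_rate - pd_predicted) / se
    return 1 - NormalDist().cdf(z)


# Same small gap (3.0% predicted vs 3.1% observed), different sample sizes.
for n in (1_000, 100_000, 1_000_000):
    print(f"n = {n:>9,}: p-value = {one_sided_p(0.031, 0.030, n):.2e}")
```

The effect size never changes, yet the p-value moves from clearly non-significant to extremely small, which is why significance has to be read alongside materiality.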

multiple testing

False positives accumulate

If many segments or grades are tested separately, one or more may fail by chance alone. Context and correction matter.
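The accumulation can be quantified under the simplifying assumption that the tests are independent. The sketch below also shows the Bonferroni adjustment, which is one common correction among several; the choice of ten rating grades is hypothetical.

```python
def familywise_error_rate(alpha, m):
    """P(at least one false positive) across m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


def bonferroni_alpha(alpha, m):
    """Per-test significance level that keeps the family-wise rate at or below alpha."""
    return alpha / m


m_grades = 10  # hypothetical number of rating grades tested separately
alpha = 0.05

uncorrected = familywise_error_rate(alpha, m_grades)
per_test = bonferroni_alpha(alpha, m_grades)
corrected = familywise_error_rate(per_test, m_grades)

print(f"uncorrected family-wise rate: {uncorrected:.3f}")
print(f"Bonferroni per-test alpha:    {per_test:.4f}")
print(f"corrected family-wise rate:   {corrected:.3f}")
```

Bonferroni is conservative and lowers the power of each individual test, so the correction should be weighed against the risk of missing real problems.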

one-sided vs two-sided

Direction changes the test

Many risk settings care more about underestimation than overestimation, so one-sided tests can be more aligned with the real question.

power

Weak tests can “pass” bad models

Small portfolios or mild miscalibration often lead to low power. A non-significant result does not automatically mean the model is fine.
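To see how portfolio size limits power, the sketch below uses the standard normal-approximation formula for the sample size of a one-sided, one-sample proportion test. The PD values, α, and the 80% power target are hypothetical, and the result is approximate rather than exact.

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_n(p0, p1, alpha=0.05, power=0.80):
    """Approximate n needed to detect a true PD of p1 when H0 says PD = p0."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    z_power = NormalDist().inv_cdf(power)
    numerator = z_alpha * sqrt(p0 * (1 - p0)) + z_power * sqrt(p1 * (1 - p1))
    return ceil((numerator / (p1 - p0)) ** 2)


# Hypothetical miscalibration scenarios against a predicted PD of 3%.
n_mild = required_n(p0=0.03, p1=0.04)    # mild underestimation
n_strong = required_n(p0=0.03, p1=0.05)  # stronger underestimation

print(f"exposures needed to detect 3% vs 4%: {n_mild}")
print(f"exposures needed to detect 3% vs 5%: {n_strong}")
```

Detecting the milder miscalibration takes several times as many exposures, so a portfolio that falls well short of these sizes can pass a backtest even when the PD is materially understated.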

practical significance

Statistical and economic materiality differ

A difference can be statistically detectable and still operationally irrelevant, or materially serious and statistically hard to prove in small samples.

validation style

No single test is enough

Good validation combines interval thinking, hypothesis tests, discrimination metrics, stability checks, and judgement.

What to leave this page with

Confidence intervals tell you what parameter values remain plausible. Hypothesis tests tell you whether the observed data is too difficult to reconcile with the null.

The useful order is: first learn interval uncertainty, then p-values, then Type I and Type II errors, then power, and only after that apply the tools to PD backtesting and validation tests.

Once that structure is clear, inferential statistics stops being a list of formulas and becomes a decision framework.