Hypothesis Testing & Confidence Intervals
This is where descriptive statistics turns into decision-making. Confidence intervals tell you what values are plausible. Hypothesis tests tell you whether the observed evidence is strong enough to challenge a model, a parameter, or a claim.
From data to statistical decisions
Confidence intervals
A confidence interval gives a range of plausible values for an unknown parameter. A 95% CI means that across many repeated samples, 95% of intervals constructed in the same way would contain the true parameter.
Width depends on three things: variability, sample size, and confidence level. Less noise or more data narrows the interval; a higher confidence level widens it.
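The repeated-sampling reading of a 95% CI can be checked with a small simulation. The sketch below is illustrative only: the function name `coverage`, the true mean, the standard deviation, and the sample size are all hypothetical, and it assumes normally distributed data with a known standard deviation so that a simple z-interval applies.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def coverage(true_mean, sigma, n, confidence, repetitions, seed=0):
    """Share of simulated z-intervals (known sigma) that contain true_mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * sigma / sqrt(n)
    hits = 0
    for _ in range(repetitions):
        # Draw a fresh sample and build the interval around its mean.
        sample_mean = fmean(rng.gauss(true_mean, sigma) for _ in range(n))
        if abs(sample_mean - true_mean) <= half_width:
            hits += 1
    return hits / repetitions


# Hypothetical settings, chosen only for illustration.
rate = coverage(true_mean=10.0, sigma=2.0, n=50, confidence=0.95, repetitions=4000)
print(f"share of intervals containing the true mean: {rate:.3f}")
```

The printed share should land close to 0.95. The 95% describes the procedure over many samples, not any single interval.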
Hypothesis testing
Start with a null hypothesis: for example, “the model is correctly calibrated.” Then compute a test statistic and ask how unusual the observed data would be if that null were actually true.
The p-value measures exactly that: the probability, assuming H₀ is true, of seeing data at least as extreme as what was actually observed. Small p-values suggest the data is difficult to reconcile with the null.
A useful order for learning inference
Start with interval thinking
Before “significance,” first learn how uncertainty around an estimate behaves. Confidence intervals make that visible.
Then learn the logic of H₀
A hypothesis test is not “is my model true?” It is “would this data be surprising if the null were true?”
Then separate α, β, and power
False alarms (Type I errors, rate α), missed problems (Type II errors, rate β), and detection ability (power, 1 − β) are different things. Serious validation work depends on that distinction.
Only then move into regulatory tests
Traffic light rules, binomial backtests, Hosmer-Lemeshow, Wald, and LR tests all make more sense once the framework is already clear.
Confidence Interval Builder
The interval is built from four inputs: the estimate, standard deviation, sample size, and confidence level. It expands or contracts as precision changes.
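A minimal sketch of such a builder, assuming a normal-approximation interval for a mean (estimate ± z · σ/√n). The function name `normal_ci` and all input values are hypothetical; a t-based interval would be the usual choice for small samples with an estimated standard deviation.

```python
from math import sqrt
from statistics import NormalDist


def normal_ci(estimate, std_dev, n, confidence=0.95):
    """Normal-approximation confidence interval for a mean.

    Assumes a known (or well-estimated) standard deviation and a sample
    large enough for the z critical value to be reasonable.
    """
    if n <= 0 or not 0.0 < confidence < 1.0:
        raise ValueError("need n > 0 and 0 < confidence < 1")
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    half_width = z * std_dev / sqrt(n)
    return estimate - half_width, estimate + half_width


# Hypothetical inputs, chosen only to show how the width reacts.
base = normal_ci(estimate=0.05, std_dev=0.02, n=100, confidence=0.95)
more_data = normal_ci(estimate=0.05, std_dev=0.02, n=400, confidence=0.95)
higher_conf = normal_ci(estimate=0.05, std_dev=0.02, n=100, confidence=0.99)

for label, (low, high) in [
    ("n=100, 95%", base),
    ("n=400, 95%", more_data),
    ("n=100, 99%", higher_conf),
]:
    print(f"{label}: [{low:.4f}, {high:.4f}]  width={high - low:.4f}")
```

Quadrupling the sample size halves the width, while moving from 95% to 99% confidence widens it.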
What a p-value actually means
A p-value is the tail area beyond the observed statistic, under the assumption that H₀ is true. It is not the probability that H₀ is true.
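As a sketch of the tail-area idea, the snippet below converts an observed z statistic into a p-value under a standard normal null. The function name `p_value_from_z`, the `alternative` labels, and the observed value are assumptions made for illustration.

```python
from statistics import NormalDist


def p_value_from_z(z, alternative="two-sided"):
    """Tail area beyond an observed z statistic under a standard normal H0."""
    cdf = NormalDist().cdf
    if alternative == "greater":   # large positive z counts against H0
        return 1.0 - cdf(z)
    if alternative == "less":      # large negative z counts against H0
        return cdf(z)
    if alternative == "two-sided":
        return 2.0 * (1.0 - cdf(abs(z)))
    raise ValueError(f"unknown alternative: {alternative!r}")


z_obs = 1.96  # hypothetical observed statistic
print(p_value_from_z(z_obs, "two-sided"))  # about 0.05
print(p_value_from_z(z_obs, "greater"))    # about 0.025
```

Both numbers are computed under the assumption that H₀ holds, which is why neither can be read as the probability that H₀ is true.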
False alarms, missed problems, and detection power
Hypothesis testing is not only about α. It is also about β and power. A test can be “strict” (low α) but weak (low power), or permissive (high α) but sensitive (high power).
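
| | Test does not reject H₀ | Test rejects H₀ |
|---|---|---|
| H₀ true | True negative (probability 1 − α) | False positive (Type I error, probability α) |
| H₀ false | False negative (Type II error, probability β) | True positive (power = 1 − β) |
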
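To make the link between α, β, and power concrete, here is a rough sketch for a one-sided z-test on a default rate. It relies on the normal approximation, and the function name `approx_power` and the rates 2% (null) and 3% (true) are hypothetical values picked for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p0, p1, n, alpha=0.05):
    """Approximate power of a one-sided z-test of H0: p = p0 when the true rate is p1."""
    std_normal = NormalDist()
    se0 = sqrt(p0 * (1.0 - p0) / n)  # spread of the observed rate if H0 holds
    se1 = sqrt(p1 * (1.0 - p1) / n)  # spread if the true rate is p1
    # Reject H0 when the observed rate exceeds this threshold.
    threshold = p0 + std_normal.inv_cdf(1.0 - alpha) * se0
    return 1.0 - std_normal.cdf((threshold - p1) / se1)


alpha = 0.05
for n in (200, 2000, 20000):
    power = approx_power(p0=0.02, p1=0.03, n=n, alpha=alpha)
    print(f"n={n:>6}  alpha={alpha:.2f}  beta={1 - power:.3f}  power={power:.3f}")
```

With α held fixed, β shrinks and power grows as the pool gets larger. A small pool can keep the false-alarm rate at 5% and still miss most real underestimation.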
Binomial PD backtest in action
This is one of the classic validation questions: the model predicts PD = p for a pool of n exposures, but d defaults are realised. Under H₀, and assuming independent defaults, d follows a Binomial(n, p) distribution. Is the observed d still plausible under that null?
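A minimal sketch of that backtest, assuming independent defaults and a one-sided alternative (the model underestimates PD). The function names and the pool figures (1,000 exposures, 2% PD, 30 defaults) are hypothetical.

```python
from math import exp, lgamma, log


def binom_pmf(k, n, p):
    """P(D = k) for D ~ Binomial(n, p), computed in log space for stability."""
    log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_comb + k * log(p) + (n - k) * log(1.0 - p))


def binomial_backtest(n, pd_model, defaults, alpha=0.05):
    """One-sided exact test of H0: the true default rate equals pd_model.

    The p-value is P(D >= defaults) under H0, so small values point to an
    underestimated PD. Assumes independent defaults and 0 < pd_model < 1.
    """
    p_value = sum(binom_pmf(k, n, pd_model) for k in range(defaults, n + 1))
    p_value = min(p_value, 1.0)  # guard against floating-point overshoot
    return p_value, p_value < alpha


# Hypothetical pool: 1,000 exposures, model PD of 2%, 30 realised defaults.
p_value, reject = binomial_backtest(n=1000, pd_model=0.02, defaults=30)
print(f"p-value = {p_value:.4f}, reject H0 at 5%: {reject}")
```

Real portfolios often show correlated defaults, which makes the independence assumption optimistic and the exact p-value too small. Treat the result as a starting point rather than a verdict.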
Tests commonly used in validation
| Test | Null Hypothesis | When used | Statistic | Validation role |
|---|---|---|---|---|
| Binomial test | d ~ Binom(n, PD) | PD backtesting | exact p-value | Traffic light, calibration checks |
| z-test (proportion) | True default rate = PD | Large-n PD test | z = (p̂−PD)/SE, with SE = √(PD(1−PD)/n) | Quick approximation to binomial logic |
| Chi-squared / HL | Model fits grouped data | Bucket calibration | χ² | Calibration across grades or deciles |
| t-test | μ = μ₀ or μ₁ = μ₂ | Comparing means | t | Coefficient and sample mean significance |
| KS test | Same distribution | Separation testing | D = max\|F₁−F₂\| | Discrimination check |
| Wald test | β = 0 | Coefficient significance | β̂ / SE(β̂) | Predictor-level inference |
| Likelihood Ratio | Restricted model is enough | Model comparison | −2ΔLL | Does extra complexity help? |
| Anderson-Darling | Data follows target distribution | Normality checks | A² | Residual or assumption checking |
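
As one worked row from the table, the z-test for a proportion can be sketched as follows. The function name `proportion_z_test` and the example pool are hypothetical, and the standard error is taken under H₀ as √(PD(1−PD)/n).

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(n, defaults, pd_model):
    """z-test of H0: the true default rate equals pd_model (normal approximation)."""
    p_hat = defaults / n
    se = sqrt(pd_model * (1.0 - pd_model) / n)  # standard error under H0
    z = (p_hat - pd_model) / se
    p_one_sided = 1.0 - NormalDist().cdf(z)     # evidence of underestimated PD
    return z, p_one_sided


# Hypothetical pool: 1,000 exposures, model PD of 2%, 30 realised defaults.
z, p = proportion_z_test(n=1000, defaults=30, pd_model=0.02)
print(f"z = {z:.3f}, one-sided p = {p:.4f}")
```

The approximation tends to be least reliable when n · PD is small, which is where an exact binomial calculation is usually preferred.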
Concepts every validator should keep
Small p is not the whole story
A tiny p-value can reflect a trivial effect with a huge sample. Always combine significance with effect size and materiality.
False positives accumulate
If many segments or grades are tested separately, one or more may fail by chance alone. Interpret isolated failures in context, and consider a multiple-testing correction such as Bonferroni.
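The accumulation is easy to quantify under the simplifying assumption that the tests are independent. The sketch below uses hypothetical helper names and shows the chance of at least one false alarm across k tests, alongside the Bonferroni per-test level as one common correction.

```python
def family_wise_error_rate(alpha, k):
    """P(at least one false alarm) across k independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** k


def bonferroni_alpha(alpha, k):
    """Per-test level that keeps the family-wise rate at or below alpha."""
    return alpha / k


alpha = 0.05
for k in (1, 5, 10, 20):
    print(
        f"{k:>2} tests: P(any false alarm) = {family_wise_error_rate(alpha, k):.3f}, "
        f"Bonferroni per-test alpha = {bonferroni_alpha(alpha, k):.4f}"
    )
```

Bonferroni is conservative and reduces power for each individual test, so it trades fewer false alarms for more missed problems.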
Direction changes the test
Many risk settings care more about underestimation than overestimation, so one-sided tests can be more aligned with the real question.
Weak tests can “pass” bad models
Small portfolios or mild miscalibration often lead to low power. A non-significant result does not automatically mean the model is fine.
Statistical and economic materiality differ
A difference can be statistically detectable and still operationally irrelevant, or materially serious and statistically hard to prove in small samples.
No single test is enough
Good validation combines interval thinking, hypothesis tests, discrimination metrics, stability checks, and judgement.
What to leave this page with
Confidence intervals tell you what parameter values remain plausible. Hypothesis tests tell you whether the observed data is too difficult to reconcile with the null.
The useful order is: first learn interval uncertainty, then p-values, then Type I and Type II errors, then power, and only after that apply the tools to PD backtesting and validation tests.
Once that structure is clear, inferential statistics stops being a list of formulas and becomes a decision framework.