Calibration, PD Scaling & Probability Alignment
A model can rank borrowers correctly and still produce the wrong probabilities. Calibration is the separate question: does a predicted PD of 3% actually behave like 3% in the real portfolio?
Good ranking is not the same as good probabilities
Discrimination
Discrimination asks whether risky borrowers are ranked above safer borrowers. ROC, AUC, Gini, and KS mostly live here.
A model can discriminate well even if every predicted PD is systematically too high or too low.
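A small illustration of this point, using made-up numbers: the `auc` helper and both lists below are hypothetical and not taken from any real portfolio. Tripling every PD leaves the ranking, and therefore the AUC, unchanged, while the average predicted PD triples.

```python
def auc(scores, labels):
    """Share of (defaulter, non-defaulter) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical data: six borrowers, three of whom defaulted.
pd_model = [0.01, 0.02, 0.03, 0.05, 0.08, 0.12]
defaulted = [0, 0, 1, 0, 1, 1]

# Same ranking, systematically inflated level.
pd_inflated = [3 * p for p in pd_model]

auc_model = auc(pd_model, defaulted)
auc_inflated = auc(pd_inflated, defaulted)
mean_model = sum(pd_model) / len(pd_model)
mean_inflated = sum(pd_inflated) / len(pd_inflated)

print(f"AUC:     {auc_model:.3f} vs {auc_inflated:.3f}")
print(f"Mean PD: {mean_model:.3f} vs {mean_inflated:.3f}")
```

Any strictly increasing transformation of the PDs would give the same result, because AUC only looks at order.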
Calibration
Calibration asks whether the predicted probability level matches realised frequency.
If a set of borrowers is assigned a 5% PD, then roughly 5% of them should default over the matching horizon and within the same segment definition.
A useful order for learning calibration
Start with grouped predicted vs observed rates
The calibration curve is the most intuitive entry point: what the model predicted, versus what the portfolio actually did.
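A grouped comparison of this kind can be sketched as follows. The equal-count grouping, the function name `calibration_table`, and the simulated portfolio are all assumptions made for illustration; the simulated defaults are drawn directly from the predicted PDs, so this portfolio is calibrated by construction.

```python
import random

def calibration_table(pds, defaults, n_bins=5):
    """Sort by predicted PD, split into equal-count groups, and return
    (count, mean predicted PD, observed default rate) per group."""
    pairs = sorted(zip(pds, defaults))
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        mean_pd = sum(p for p, _ in chunk) / len(chunk)
        observed_dr = sum(d for _, d in chunk) / len(chunk)
        rows.append((len(chunk), mean_pd, observed_dr))
    return rows

# Synthetic portfolio: each borrower defaults with probability equal to its PD.
random.seed(0)
pds = [random.uniform(0.005, 0.20) for _ in range(20000)]
defaults = [1 if random.random() < p else 0 for p in pds]

table = calibration_table(pds, defaults)
for count, mean_pd, observed_dr in table:
    print(f"n={count:5d}  predicted={mean_pd:.3f}  observed={observed_dr:.3f}")
```

Plotting the second column against the third gives the calibration curve; for this synthetic portfolio the points should sit close to the diagonal, up to sampling noise.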
Then separate level shift from spread distortion
Intercept problems shift every probability up or down together. Slope problems distort the spread: predictions are either too extreme (low PDs too low and high PDs too high) or too flat (the reverse).
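One common way to express the two effects is a linear map on the log-odds scale. The sketch below assumes that form; the function `distort` and its `pivot_pd` parameter are hypothetical names. The pivot is the PD that a pure slope change leaves untouched. With the default `pivot_pd=0.5` this is the usual `intercept + slope * logit(pd)`; a lower pivot is used in the example only so that the stretch is visible at typical credit PD levels.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

def distort(pd, intercept=0.0, slope=1.0, pivot_pd=0.5):
    """Map a PD through a shift (intercept) and a stretch (slope) in log-odds.
    pivot_pd is the PD left unchanged by a pure slope change (assumed parameter)."""
    centre = logit(pivot_pd)
    return inv_logit(intercept + centre + slope * (logit(pd) - centre))

grades = [0.01, 0.05, 0.20]

# Level shift: every PD moves up, ordering unchanged.
shifted = [distort(p, intercept=0.5) for p in grades]

# Spread distortion around a 5% pivot: low PDs fall, high PDs rise.
stretched = [distort(p, slope=1.5, pivot_pd=0.05) for p in grades]

# The reverse: predictions pulled toward the pivot.
flattened = [distort(p, slope=0.5, pivot_pd=0.05) for p in grades]

for name, values in [("shifted", shifted), ("stretched", stretched), ("flattened", flattened)]:
    print(name, [round(v, 4) for v in values])
```

In all three cases the ranking of the grades is preserved, which is why discrimination metrics do not react to these distortions.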
Then quantify the damage
Brier score, log loss, and calibration error metrics turn the visual mismatch into numbers that can be tracked and compared.
Then connect it to real PD scaling work
Once the idea is clear, through-the-cycle (TTC) to point-in-time (PIT) adjustment, central tendency alignment, and macro-conditioned scaling become easier to reason about.
Grouped predicted PD vs observed default rate
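Plot the mean predicted PD of each group against its observed default rate and compare the curve with the 45-degree perfect-calibration line. With predicted PD on the horizontal axis, points above the line indicate under-prediction and points below it indicate over-prediction.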
Two ways calibration breaks
Intercept and slope can be varied separately. The ranking of the portfolio stays the same in both cases, but the PD mapping shifts and the calibration curve changes.
Brier score and log loss
These metrics punish wrong probabilities directly. Unlike Gini or AUC, they care about the PD value itself, not only about ranking.
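A minimal sketch of both metrics on a hypothetical two-grade portfolio (the grade sizes, PDs, and default counts are invented for illustration). The inflated PDs rank borrowers exactly as the original ones do, yet both scores get worse.

```python
import math

def brier_score(pds, defaults):
    """Mean squared difference between predicted PD and the 0/1 outcome."""
    return sum((p - d) ** 2 for p, d in zip(pds, defaults)) / len(pds)

def log_loss(pds, defaults, eps=1e-15):
    """Mean negative log-likelihood; PDs are clipped away from 0 and 1."""
    total = 0.0
    for p, d in zip(pds, defaults):
        p = min(max(p, eps), 1 - eps)
        total -= d * math.log(p) + (1 - d) * math.log(1 - p)
    return total / len(pds)

# Hypothetical portfolio: 100 loans at 2% PD with 2 defaults,
# and 100 loans at 8% PD with 8 defaults.
pds = [0.02] * 100 + [0.08] * 100
defaults = [1] * 2 + [0] * 98 + [1] * 8 + [0] * 92

# Same ranking, every PD tripled.
pds_inflated = [3 * p for p in pds]

brier_cal, brier_inf = brier_score(pds, defaults), brier_score(pds_inflated, defaults)
ll_cal, ll_inf = log_loss(pds, defaults), log_loss(pds_inflated, defaults)

print(f"Brier:    {brier_cal:.4f} vs {brier_inf:.4f}")
print(f"Log loss: {ll_cal:.4f} vs {ll_inf:.4f}")
```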
From TTC-style PD to PIT-style aligned PD
This is where calibration becomes operational. As the portfolio central tendency and the level of macro stress change, rating-grade PDs shift from a base TTC profile toward a scenario-sensitive PIT profile.
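One simple way to perform a central tendency alignment is to add a single constant to the log-odds of every grade PD, chosen so that the exposure-weighted mean PD hits a target. The sketch below assumes that approach; it is not necessarily the method behind this page, and the grade PDs, weights, target, and the name `align_central_tendency` are hypothetical. The target mean stands in for whatever the current or macro-conditioned portfolio default rate is taken to be.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

def align_central_tendency(grade_pds, weights, target_mean, tol=1e-10):
    """Find the log-odds shift that makes the weighted mean PD equal target_mean.
    Uses bisection, since the weighted mean is increasing in the shift."""
    total_weight = sum(weights)

    def mean_after(shift):
        return sum(w * inv_logit(logit(p) + shift)
                   for p, w in zip(grade_pds, weights)) / total_weight

    lo, hi = -20.0, 20.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean_after(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    shift = (lo + hi) / 2
    return shift, [inv_logit(logit(p) + shift) for p in grade_pds]

# Hypothetical TTC grade PDs and portfolio shares (weighted mean 3.16%).
ttc_pds = [0.002, 0.01, 0.03, 0.08, 0.20]
shares = [30, 30, 20, 15, 5]

# Assumed PIT target for the portfolio mean under current conditions.
shift, pit_pds = align_central_tendency(ttc_pds, shares, target_mean=0.045)
pit_mean = sum(w * p for p, w in zip(pit_pds, shares)) / sum(shares)

print(f"log-odds shift: {shift:.4f}, aligned mean PD: {pit_mean:.4f}")
print([round(p, 4) for p in pit_pds])
```

Shifting in log-odds rather than multiplying PDs by a constant keeps every aligned PD strictly between 0 and 1 and preserves the grade ordering.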
Calibration concepts compared
| Concept | Main question | Good value | What failure means |
|---|---|---|---|
| Calibration curve | Do grouped predicted PDs match grouped observed default rates? | Near diagonal | Systematic under/over-prediction |
| Calibration intercept | Is the whole level shifted? | ≈ 0 | Average PD too low/high |
| Calibration slope | Is the spread right? | ≈ 1 | Predictions too extreme or too flat |
| Brier score | How wrong are the probabilities overall? | Lower is better | Poor reliability and/or weak resolution |
| Log loss | How harshly are wrong probabilities punished? | Lower is better | Overconfident wrong predictions |
| Central tendency adjustment | Does portfolio average PD reflect current reality? | Aligned | Portfolio mean under- or overstated |
Concepts every validator should keep
AUC can stay high while calibration breaks
That is why discrimination and calibration should never be treated as interchangeable evidence.
The horizon must match
A 12-month PD must be compared with a 12-month observed default frequency, not with some mixed or lifetime realised rate.
Overall calibration can hide segment failure
A model may look aligned in aggregate while materially missing specific segments, grades, industries, or products.
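The effect is easy to reproduce with invented numbers. In the sketch below, the segment names, loan counts, and the helper `calibration_by_segment` are hypothetical: the portfolio as a whole looks perfectly aligned, while one segment is over-predicted and the other under-predicted.

```python
from collections import defaultdict

def calibration_by_segment(pds, defaults, segments):
    """Return {segment: (count, mean predicted PD, observed default rate)}."""
    totals = defaultdict(lambda: [0, 0.0, 0])
    for p, d, s in zip(pds, defaults, segments):
        entry = totals[s]
        entry[0] += 1
        entry[1] += p
        entry[2] += d
    return {s: (n, sum_pd / n, sum_d / n) for s, (n, sum_pd, sum_d) in totals.items()}

# Hypothetical portfolio: every loan carries a 4% PD.
# Retail: 100 loans, 2 defaults. SME: 100 loans, 6 defaults.
pds = [0.04] * 200
defaults = [1] * 2 + [0] * 98 + [1] * 6 + [0] * 94
segments = ["retail"] * 100 + ["sme"] * 100

overall_pd = sum(pds) / len(pds)
overall_dr = sum(defaults) / len(defaults)
by_segment = calibration_by_segment(pds, defaults, segments)

print(f"overall: predicted={overall_pd:.3f} observed={overall_dr:.3f}")
for name, (n, mean_pd, dr) in by_segment.items():
    print(f"{name}: n={n} predicted={mean_pd:.3f} observed={dr:.3f}")
```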
Not every issue needs redevelopment
Sometimes ranking is still healthy and only the probability layer needs repair through intercept/slope correction or central tendency adjustment.
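A repair of this kind can be sketched as a two-parameter logistic regression of the default flag on the log-odds of the existing PD, which estimates intercept and slope jointly and leaves the ranking untouched. The code below is an illustration under stated assumptions: the data are simulated, the "true" distortion (intercept 0.5, slope 0.8) is invented, and the plain Newton-Raphson fit in `fit_intercept_slope` is a hypothetical helper standing in for whatever regression routine is used in practice.

```python
import math
import random

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

def fit_intercept_slope(pds, defaults, iterations=25):
    """Fit logit(P(default)) = a + b * logit(pd) by Newton-Raphson."""
    xs = [logit(p) for p in pds]
    a, b = 0.0, 1.0  # start from "already calibrated"
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, defaults):
            q = inv_logit(a + b * x)
            w = q * (1 - q)
            g0 += y - q          # gradient w.r.t. intercept
            g1 += (y - q) * x    # gradient w.r.t. slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-10:
            break
    return a, b

# Simulated portfolio whose real default behaviour is a distorted version
# of the model PD (assumed distortion: intercept 0.5, slope 0.8).
random.seed(1)
pd_model = [random.uniform(0.01, 0.30) for _ in range(20000)]
defaulted = [1 if random.random() < inv_logit(0.5 + 0.8 * logit(p)) else 0
             for p in pd_model]

a_hat, b_hat = fit_intercept_slope(pd_model, defaulted)
pd_recalibrated = [inv_logit(a_hat + b_hat * logit(p)) for p in pd_model]

observed_dr = sum(defaulted) / len(defaulted)
mean_before = sum(pd_model) / len(pd_model)
mean_after = sum(pd_recalibrated) / len(pd_recalibrated)
print(f"intercept={a_hat:.3f} slope={b_hat:.3f}")
print(f"observed DR={observed_dr:.4f} mean PD before={mean_before:.4f} after={mean_after:.4f}")
```

Because the slope is fitted together with the intercept, the intercept here is not the same quantity as a level-only check with the slope fixed at 1. As long as the fitted slope is positive, the recalibrated PDs keep the original ordering.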
Philosophy shapes calibration expectations
A TTC-oriented model is not supposed to chase every short-term cycle move. A PIT-style calibration is much more responsive to current conditions.
Use proper probability metrics
If the objective is probability quality, metrics that directly punish wrong probabilities deserve a central role.
What to leave this page with
Calibration is the missing second half of probability modelling. Ranking tells you who is riskier. Calibration tells you whether the PD number itself is believable.
The useful order is: first inspect the grouped calibration curve, then separate intercept and slope effects, then quantify mismatch with proper scoring rules, then connect the result to PD scaling and current portfolio alignment.
Once that structure is clear, calibration stops looking like a minor technical adjustment and starts looking like the probability layer of the entire model.