notes · 03

Calibration, PD Scaling & Probability Alignment

A model can rank borrowers correctly and still produce the wrong probabilities. Calibration is the separate question: does a predicted PD of 3% actually behave like 3% in the real portfolio?

Start with the difference between discrimination and calibration, then move into calibration curves, intercept and slope distortions, proper scoring rules, and finally PD scaling from a through-the-cycle (TTC) profile toward point-in-time (PIT) alignment.

Good ranking is not the same as good probabilities

Discrimination

Discrimination asks whether risky borrowers are ranked above safer borrowers. ROC, AUC, Gini, and KS mostly live here.

A model can discriminate well even if every predicted PD is systematically too high or too low.

Discrimination = ordering

Calibration

Calibration asks whether the predicted probability level matches realised frequency.

If a set of borrowers is assigned a 5% PD, then roughly 5% of them should default over the matching horizon, measured within the same segment definition.

Calibration = probability alignment

Risk validation reality: a model with high Gini but poor calibration can still create wrong provisions, wrong pricing, wrong capital intuition, and wrong override signals. That is why probability alignment deserves its own chapter.
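
A minimal sketch of that gap, on simulated data rather than a real portfolio. The helper `auc`, the sample size, and the uniform PD range are all assumptions made for illustration. Doubling every PD keeps the ordering, so the AUC is unchanged, while the average PD ends up at roughly twice the observed default rate.

```python
import random

def auc(scores, labels):
    """Share of (defaulter, non-defaulter) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

random.seed(0)
# Simulated "true" PDs and default flags drawn from them (illustrative only).
true_pd = [random.uniform(0.005, 0.15) for _ in range(2000)]
defaults = [1 if random.random() < p else 0 for p in true_pd]

# Same ordering, doubled level: ranking is untouched, probabilities are wrong.
inflated_pd = [2 * p for p in true_pd]

auc_true = auc(true_pd, defaults)
auc_inflated = auc(inflated_pd, defaults)
observed_rate = sum(defaults) / len(defaults)
mean_true = sum(true_pd) / len(true_pd)
mean_inflated = sum(inflated_pd) / len(inflated_pd)

print(f"AUC true PDs     : {auc_true:.4f}")
print(f"AUC inflated PDs : {auc_inflated:.4f}")
print(f"observed DR={observed_rate:.4f}  mean true PD={mean_true:.4f}  "
      f"mean inflated PD={mean_inflated:.4f}")
```

The two AUC values are identical because AUC depends only on the ordering of the scores, which any strictly increasing transformation preserves.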

A useful order for learning calibration

01. Start with grouped predicted vs observed rates

The calibration curve is the most intuitive entry point: what the model predicted, versus what the portfolio actually did.

02. Then separate level shift from spread distortion

Intercept problems push probabilities up or down overall. Slope problems distort the spread: predictions are either too extreme (low PDs too low, high PDs too high) or too flat (the reverse).

03. Then quantify the damage

Brier score, log loss, and calibration error metrics translate visual mismatch into disciplined diagnostics.

04. Then connect it to real PD scaling work

Once the idea is clear, TTC-to-PIT adjustment, central tendency alignment, and macro-conditioned scaling become easier to reason about.

Grouped predicted PD vs observed default rate

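A grouped calibration curve plots each group's mean predicted PD against its observed default rate. Points on the 45-degree line mean perfect calibration, so each calibration scenario can be read by how the portfolio-level grouped curve moves relative to that line.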
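
The grouped comparison can be sketched in a few lines. This is an illustrative version on simulated data: the function name `calibration_table`, the choice of five equal-sized groups, and the PD range are assumptions, not a prescribed binning scheme.

```python
import random

def calibration_table(pds, defaults, n_bins=5):
    """Sort by predicted PD, cut into equal-sized groups, and return
    (group size, mean predicted PD, observed default rate) per group."""
    pairs = sorted(zip(pds, defaults))
    size = len(pairs) // n_bins
    rows = []
    for i in range(n_bins):
        start = i * size
        # The last group absorbs any remainder.
        end = (i + 1) * size if i < n_bins - 1 else len(pairs)
        chunk = pairs[start:end]
        mean_pd = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(y for _, y in chunk) / len(chunk)
        rows.append((len(chunk), mean_pd, obs_rate))
    return rows

random.seed(1)
# Simulated, well-calibrated portfolio: defaults are drawn from the PDs themselves.
pds = [random.uniform(0.01, 0.20) for _ in range(5000)]
defaults = [1 if random.random() < p else 0 for p in pds]

table = calibration_table(pds, defaults)
for n, mean_pd, obs in table:
    print(f"n={n:5d}  predicted={mean_pd:.3f}  observed={obs:.3f}")
```

Plotting the second column against the third gives the calibration curve. Because the simulated defaults are drawn from the PDs, the points should sit close to the diagonal, with the remaining gaps being sampling noise.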

Two ways calibration breaks

Intercept and slope can be moved separately. For the same ranked portfolio, changing either one shifts the PD mapping and moves the calibration curve, while the ordering of borrowers stays the same.
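
One common way to express the two distortions is a linear map on the logit scale. The sketch below assumes that form; the function `distort`, the grade PDs, and the parameter values are hypothetical and chosen only to make the effect visible.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(z):
    return 1 / (1 + math.exp(-z))

def distort(pd, intercept=0.0, slope=1.0):
    """Map a PD through logit(pd') = intercept + slope * logit(pd)."""
    return inv_logit(intercept + slope * logit(pd))

# Hypothetical grade-level PDs, ordered from best to worst grade.
grades = [0.005, 0.02, 0.05, 0.15, 0.40]

shifted = [distort(p, intercept=0.5) for p in grades]  # level shift only
steeper = [distort(p, slope=1.5) for p in grades]      # more extreme spread
flatter = [distort(p, slope=0.7) for p in grades]      # compressed spread

for p, s, st, fl in zip(grades, shifted, steeper, flatter):
    print(f"base={p:.3f}  intercept+0.5={s:.3f}  slope1.5={st:.3f}  slope0.7={fl:.3f}")
```

Under this parameterisation the slope pivots around a PD of 50%, where the logit is zero. For a portfolio whose PDs all sit below 50%, a slope change therefore also moves the average level unless the intercept is adjusted alongside it.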

Brier score and log loss

These metrics punish wrong probabilities directly. Unlike Gini or AUC, they care about the PD value itself, not only about ranking.

Score comparison (interactive panel): Brier score and log loss per scenario, with the Brier score split into a reliability part and a resolution part.

Brier: quadratic penalty. Easier to decompose and interpret.
Log loss: harsher penalty for confident, wrong probabilities. Very useful when the model acts too certain.
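
A minimal sketch of both metrics, plus the reliability and resolution parts of the Brier score. It uses the standard Murphy decomposition (Brier = reliability - resolution + uncertainty), where the uncertainty term is an addition not named in these notes. The simulated grades, the sample size, and the "overconfident" scenario are assumptions for illustration.

```python
import math
import random
from collections import defaultdict

def brier(pds, ys):
    return sum((p - y) ** 2 for p, y in zip(pds, ys)) / len(ys)

def log_loss(pds, ys, eps=1e-12):
    total = 0.0
    for p, y in zip(pds, ys):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(ys)

def murphy_decomposition(pds, ys):
    """Group by distinct predicted PD; return (reliability, resolution, uncertainty)."""
    groups = defaultdict(list)
    for p, y in zip(pds, ys):
        groups[p].append(y)
    n = len(ys)
    base = sum(ys) / n
    rel = sum(len(g) * (p - sum(g) / len(g)) ** 2 for p, g in groups.items()) / n
    res = sum(len(g) * (sum(g) / len(g) - base) ** 2 for g in groups.values()) / n
    return rel, res, base * (1 - base)

random.seed(2)
grade_pds = [0.01, 0.03, 0.08, 0.20]                 # hypothetical rating grades
true_pd = [random.choice(grade_pds) for _ in range(4000)]
ys = [1 if random.random() < p else 0 for p in true_pd]
overconfident = [p / 4 for p in true_pd]             # same ranking, PDs far too low

bs_good, bs_bad = brier(true_pd, ys), brier(overconfident, ys)
ll_good, ll_bad = log_loss(true_pd, ys), log_loss(overconfident, ys)
rel, res, unc = murphy_decomposition(true_pd, ys)

print(f"Brier    calibrated={bs_good:.5f}  overconfident={bs_bad:.5f}")
print(f"Log loss calibrated={ll_good:.5f}  overconfident={ll_bad:.5f}")
print(f"reliability={rel:.6f}  resolution={res:.6f}  uncertainty={unc:.6f}")
```

Reliability measures the calibration gap (lower is better), while resolution measures how far the grade-level default rates spread away from the portfolio average (higher is better). The identity holds exactly here because predictions are grouped by their distinct PD values.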

From TTC-style PD to PIT-style aligned PD

This is where calibration becomes operational. Changing the portfolio central tendency or the macro stress shifts rating-grade PDs from a base TTC profile toward a scenario-sensitive PIT profile.
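
One simple way to sketch central tendency alignment is a common shift on the logit scale, solved so that the weighted average PD hits a target. This is an illustrative approach, not necessarily the scaling method behind these notes. The function `shift_to_central_tendency`, the grade PDs, the grade weights, and the target value are all hypothetical.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(z):
    return 1 / (1 + math.exp(-z))

def shift_to_central_tendency(grade_pds, weights, target_ct, tol=1e-10):
    """Find one logit shift so the weighted mean PD equals target_ct (bisection)."""
    total_w = sum(weights)

    def mean_pd(shift):
        return sum(w * inv_logit(logit(p) + shift)
                   for p, w in zip(grade_pds, weights)) / total_w

    lo, hi = -10.0, 10.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # mean_pd is increasing in the shift, so bisection converges.
        if mean_pd(mid) < target_ct:
            lo = mid
        else:
            hi = mid
    shift = (lo + hi) / 2
    return shift, [inv_logit(logit(p) + shift) for p in grade_pds]

ttc_pds = [0.002, 0.008, 0.02, 0.06, 0.18]   # hypothetical TTC PDs per grade
weights = [0.30, 0.30, 0.20, 0.15, 0.05]     # hypothetical share of obligors per grade
target_ct = 0.04                             # hypothetical current-conditions central tendency

shift, pit_pds = shift_to_central_tendency(ttc_pds, weights, target_ct)
pit_mean = sum(w * p for w, p in zip(weights, pit_pds)) / sum(weights)

print(f"logit shift = {shift:.4f}, aligned mean PD = {pit_mean:.4f}")
for t, p in zip(ttc_pds, pit_pds):
    print(f"TTC={t:.3f} -> PIT-style={p:.3f}")
```

A single logit shift keeps the grade ordering intact and moves only the level. A macro-conditioned version would typically feed a scenario-dependent value into `target_ct`, which is left out of this sketch.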

Calibration concepts compared

Concept | Main question | Good value | What failure means
Calibration curve | Do grouped predicted PDs match grouped observed DRs? | Near diagonal | Systematic under/over-prediction
Calibration intercept | Is the whole level shifted? | ≈ 0 | Average PD too low/high
Calibration slope | Is the spread right? | ≈ 1 | Predictions too extreme or too flat
Brier score | How wrong are the probabilities overall? | Lower is better | Poor reliability and/or weak resolution
Log loss | How harshly are wrong probabilities punished? | Lower is better | Overconfident wrong predictions
Central tendency adjustment | Does portfolio average PD reflect current reality? | Aligned | Portfolio mean under- or overstated

Concepts every validator should keep

ranking vs levels

AUC can stay high while calibration breaks

That is why discrimination and calibration should never be treated as interchangeable evidence.

horizon consistency

The horizon must match

A 12-month PD must be compared with a 12-month observed default frequency, not with some mixed or lifetime realised rate.

segment drift

Overall calibration can hide segment failure

A model may look aligned in aggregate while materially missing specific segments, grades, industries, or products.
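
A small arithmetic sketch of how this can happen. The two segments, their counts, and their default numbers are made up for illustration, and `predicted_vs_observed` is a hypothetical helper.

```python
# Hypothetical two-segment portfolio: counts, mean predicted PD, realised defaults.
segments = [
    {"name": "retail", "n": 6000, "mean_pd": 0.030, "defaults": 120},
    {"name": "corporate", "n": 4000, "mean_pd": 0.030, "defaults": 180},
]

def predicted_vs_observed(rows):
    """Count-weighted mean predicted PD and observed default rate for the given rows."""
    n = sum(r["n"] for r in rows)
    predicted = sum(r["n"] * r["mean_pd"] for r in rows) / n
    observed = sum(r["defaults"] for r in rows) / n
    return predicted, observed

total_pred, total_obs = predicted_vs_observed(segments)
by_segment = {r["name"]: predicted_vs_observed([r]) for r in segments}

print(f"portfolio  predicted={total_pred:.3f}  observed={total_obs:.3f}")
for name, (pred, obs) in by_segment.items():
    print(f"{name:10s} predicted={pred:.3f}  observed={obs:.3f}")
```

In this made-up case the portfolio looks aligned at 3% predicted against 3% observed, while retail is over-predicted (2% observed) and corporate is under-predicted (4.5% observed). The two errors cancel in aggregate.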

recalibration

Not every issue needs redevelopment

Sometimes ranking is still healthy and only the probability layer needs repair through intercept/slope correction or central tendency adjustment.
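
An intercept/slope correction can be sketched as a two-parameter logistic regression of the default flag on the logit of the existing PD. The version below is a plain Newton-Raphson fit on simulated data. The function `fit_intercept_slope`, the simulated distortion (intercept 0.5, slope 0.8), and the sample size are assumptions for illustration, not a production recalibration procedure.

```python
import math
import random

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(z):
    return 1 / (1 + math.exp(-z))

def fit_intercept_slope(pds, ys, iters=15):
    """Newton-Raphson fit of logit(P(default)) = a + b * logit(pd)."""
    xs = [logit(p) for p in pds]
    a, b = 0.0, 1.0  # start from "no recalibration needed"
    for _ in range(iters):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, ys):
            q = inv_logit(a + b * x)
            w = q * (1 - q)
            g_a += y - q           # gradient of the log-likelihood
            g_b += (y - q) * x
            h_aa += w              # information matrix entries
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

random.seed(3)
# Simulated model PDs whose ranking is fine but whose level and spread are off.
model_pd = [random.uniform(0.01, 0.30) for _ in range(20000)]
true_pd = [inv_logit(0.5 + 0.8 * logit(p)) for p in model_pd]
ys = [1 if random.random() < p else 0 for p in true_pd]

a_hat, b_hat = fit_intercept_slope(model_pd, ys)
recalibrated = [inv_logit(a_hat + b_hat * logit(p)) for p in model_pd]

print(f"estimated intercept={a_hat:.3f}  slope={b_hat:.3f}")
print(f"observed DR={sum(ys) / len(ys):.4f}  "
      f"mean model PD={sum(model_pd) / len(model_pd):.4f}  "
      f"mean recalibrated PD={sum(recalibrated) / len(recalibrated):.4f}")
```

Because the mapping is strictly increasing whenever the fitted slope is positive, the borrower ranking is left as it was and only the probability layer changes.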

TTC vs PIT

Philosophy shapes calibration expectations

A TTC-oriented model is not supposed to chase every short-term cycle move. A PIT-style calibration is much more responsive to current conditions.

scoring rules

Use proper probability metrics

If the objective is probability quality, metrics that directly punish wrong probabilities deserve a central role.

What to leave this page with

Calibration is the missing second half of probability modelling. Ranking tells you who is riskier. Calibration tells you whether the PD number itself is believable.

The useful order is: first inspect the grouped calibration curve, then separate intercept and slope effects, then quantify mismatch with proper scoring rules, then connect the result to PD scaling and current portfolio alignment.

Once that structure is clear, calibration stops looking like a minor technical adjustment and starts looking like the probability layer of the entire model.