notes · 03

Calibration, PD Scaling & Probability Alignment

A model can rank borrowers correctly and still produce the wrong probabilities. Calibration is the separate question: does a predicted PD of 3% actually behave like 3% in the real portfolio?

Start with the difference between discrimination and calibration, then move into calibration curves, intercept and slope distortions, proper scoring rules, and finally PD scaling from TTC toward PIT-style alignment.


the key distinction

Good ranking is not the same as good probabilities

Discrimination

Discrimination asks whether risky borrowers are ranked above safer borrowers. ROC, AUC, Gini, and KS mostly live here.

A model can discriminate well even if every predicted PD is systematically too high or too low.

Discrimination = ordering

Calibration

Calibration asks whether the predicted probability level matches realised frequency.

If a set of borrowers is assigned 5% PD, then roughly 5% of them should default over the matching horizon, measured within the same segment definition.

Calibration = probability alignment

Risk validation reality: a model with high Gini but poor calibration can still produce misstated provisions, mispriced loans, distorted capital estimates, and misleading override signals. That is why probability alignment deserves its own chapter.
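The gap between ranking and levels can be seen in a small simulation. The sketch below is illustrative only: the portfolio is simulated, and the names `auc`, `true_pd`, and `inflated_pd` are assumptions made for this example. It doubles every borrower's odds of default, which leaves the ordering (and therefore AUC and Gini) untouched while pushing the average PD well above the observed default rate.

```python
import numpy as np

def auc(default_flag, score):
    """AUC as the share of (defaulter, non-defaulter) pairs ranked correctly; ties count half."""
    default_flag = np.asarray(default_flag)
    score = np.asarray(score, dtype=float)
    pos = score[default_flag == 1]
    neg = score[default_flag == 0]
    diff = pos[:, None] - neg[None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg)))

# Hypothetical portfolio: simulated PDs with a mean of about 5%, and simulated outcomes.
rng = np.random.default_rng(0)
true_pd = rng.beta(1, 19, size=5000)
default_flag = rng.binomial(1, true_pd)

# Doubling the odds p / (1 - p) gives 2p / (1 + p): strictly increasing, so ranking is unchanged.
inflated_pd = 2 * true_pd / (1 + true_pd)

auc_original = auc(default_flag, true_pd)
auc_inflated = auc(default_flag, inflated_pd)
print("AUC  original / inflated:", round(auc_original, 4), round(auc_inflated, 4))
print("Gini original / inflated:", round(2 * auc_original - 1, 4), round(2 * auc_inflated - 1, 4))
print("Observed default rate   :", round(float(default_flag.mean()), 4))
print("Mean PD original        :", round(float(true_pd.mean()), 4))
print("Mean PD inflated        :", round(float(inflated_pd.mean()), 4))
```

Both score vectors earn the same discrimination statistics, yet only one of them would produce sensible provisions.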

learning sequence

A useful order for learning calibration

Start with grouped predicted vs observed rates

The calibration curve is the most intuitive entry point: what the model predicted, versus what the portfolio actually did.

Then separate level shift from spread distortion

Intercept problems push probabilities up or down overall. Slope problems make low PDs too low and high PDs too high, or the reverse.

Then quantify the damage

Brier score, log loss, and calibration error metrics translate visual mismatch into disciplined diagnostics.

Then connect it to real PD scaling work

Once the idea is clear, TTC-to-PIT adjustment, central tendency alignment, and macro-conditioned scaling become easier to reason about.

interactive · calibration curve

Grouped predicted PD vs observed default rate

Choose a calibration scenario and watch how the portfolio-level grouped curve moves relative to the 45-degree perfect-calibration line.

[Interactive figure: calibration curve (model curve vs perfect-calibration line) with a grouped bin table. Readouts: scenario, mean predicted PD, observed DR, ECE (approx), traffic light.]

Interpretation: with predicted PD on the horizontal axis and observed default rate on the vertical axis, points above the diagonal mean the model under-predicts risk. Points below the diagonal mean it over-predicts risk.

Grouped view: calibration is rarely checked observation by observation. The practical unit is usually a bucket, band, rating grade, or score bin.
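A grouped calibration table can be sketched in a few lines. This is a minimal illustration, not a standard implementation: the function names `calibration_table` and `expected_calibration_error` are assumptions, bins are formed by rank so they hold near-equal counts (tied PDs may therefore straddle two bins), and ECE is taken as the count-weighted mean absolute gap between mean predicted PD and observed default rate per bin.

```python
import numpy as np

def calibration_table(pd_pred, default_flag, n_bins=10):
    """Return one row per bin: (count, mean predicted PD, observed default rate)."""
    pd_pred = np.asarray(pd_pred, dtype=float)
    default_flag = np.asarray(default_flag, dtype=float)
    order = np.argsort(pd_pred, kind="stable")   # sort borrowers from low to high PD
    rows = []
    for idx in np.array_split(order, n_bins):    # near-equal-sized bins by rank
        if len(idx) > 0:
            rows.append((len(idx), pd_pred[idx].mean(), default_flag[idx].mean()))
    return np.array(rows)

def expected_calibration_error(table):
    """Count-weighted mean absolute gap between predicted PD and observed DR."""
    n, pred, obs = table[:, 0], table[:, 1], table[:, 2]
    return float(np.sum(n * np.abs(pred - obs)) / np.sum(n))

# Simulated example: outcomes are drawn from the predicted PDs, so the curve should sit near the diagonal.
rng = np.random.default_rng(1)
pd_pred = rng.beta(1, 19, size=20000)
default_flag = rng.binomial(1, pd_pred)

table = calibration_table(pd_pred, default_flag, n_bins=10)
for count, mean_pd, observed_dr in table:
    print(f"n={int(count):5d}  mean PD={mean_pd:.4f}  observed DR={observed_dr:.4f}")
print("ECE (approx):", round(expected_calibration_error(table), 5))
```

Plotting the second column against the third gives the calibration curve; the diagonal is the case where the two columns are equal in every bin.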

interactive · intercept & slope

Two ways calibration breaks

Move the intercept and slope separately. Watch how the PD mapping shifts and how the calibration curve changes for the same ranked portfolio.

[Interactive figure: PD mapping after recalibration (original mapping vs adjusted mapping) and grouped observed vs adjusted prediction. Controls: calibration intercept (shown at 0.00), calibration slope (shown at 1.00). Readouts: interpretation, spread effect, mean shift.]

logit(PD*) = a + b · logit(PD)

Intercept: shifts the whole probability level up or down. The shift is the same for every borrower on the logit scale, so the ranking is unchanged.

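Slope: changes the spread on the logit scale. Applying b < 1 pulls the adjusted PDs together, so predictions become more compressed. Applying b > 1 pushes them apart, so predictions become more extreme. Read the other way, as a diagnostic fitted on realised defaults, a slope below 1 says the original PDs were too extreme and a slope above 1 says they were too compressed.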

Validation signal: intercept and slope failure together often indicate that both central tendency and score dispersion need attention, not just a simple scalar shift.
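The mapping logit(PD*) = a + b · logit(PD) can be applied and estimated with very little code. The sketch below is one possible approach under stated assumptions: `recalibrate` and `fit_intercept_slope` are hypothetical helper names, and the fit is a plain logistic regression of the outcome on logit(PD) solved with Newton steps rather than any particular library routine.

```python
import numpy as np

def logit(p):
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))

def expit(z):
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def recalibrate(pd_in, a=0.0, b=1.0):
    """Apply logit(PD*) = a + b * logit(PD)."""
    return expit(a + b * logit(pd_in))

def fit_intercept_slope(pd_in, outcome, n_iter=25):
    """Estimate (a, b) by logistic regression of outcome on logit(PD) using Newton steps."""
    x = logit(pd_in)
    X = np.column_stack([np.ones_like(x), x])
    y = np.asarray(outcome, dtype=float)
    beta = np.array([0.0, 1.0])                  # start from "no recalibration needed"
    for _ in range(n_iter):
        p = expit(X @ beta)
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return float(beta[0]), float(beta[1])

# Simulated example: the model's PDs are too spread out relative to the true risk.
rng = np.random.default_rng(2)
model_pd = rng.beta(1, 19, size=50000)
true_pd = recalibrate(model_pd, a=-0.5, b=0.8)   # hypothetical "truth" used to draw outcomes
default_flag = rng.binomial(1, true_pd)

a_hat, b_hat = fit_intercept_slope(model_pd, default_flag)
print("fitted intercept:", round(a_hat, 3), " fitted slope:", round(b_hat, 3))
```

A fitted pair close to (0, 1) would suggest the probability layer needs no repair. Because the mapping is monotone for b > 0, applying the fitted values changes PD levels without changing the ranking.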

interactive · proper scoring rules

Brier score and log loss

These metrics punish wrong probabilities directly. Unlike Gini or AUC, they care about the PD value itself, not only about ranking.

[Interactive figure: score comparison (Brier vs log loss). Readouts: scenario, Brier score, log loss, reliability part, resolution part.]

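Brier: quadratic penalty, the mean squared gap between predicted PD and the 0/1 outcome. Easier to decompose (into reliability, resolution, and uncertainty parts) and to interpret.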

Log loss: harsher penalty for confident, wrong probabilities. Very useful when the model acts too certain.
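Both scores, and the reliability and resolution parts of the Brier score, can be computed directly. The sketch below is illustrative: the function names are assumptions, and the decomposition used is the Murphy form, Brier = reliability − resolution + uncertainty, grouped by distinct predicted value so that the identity holds exactly (with coarser bins it holds only approximately).

```python
import numpy as np

def brier_score(pd_pred, default_flag):
    """Mean squared gap between predicted PD and the 0/1 outcome."""
    p = np.asarray(pd_pred, dtype=float)
    y = np.asarray(default_flag, dtype=float)
    return float(np.mean((p - y) ** 2))

def log_loss(pd_pred, default_flag, eps=1e-12):
    """Mean negative log-likelihood; PDs are clipped away from 0 and 1 to keep the log finite."""
    p = np.clip(np.asarray(pd_pred, dtype=float), eps, 1.0 - eps)
    y = np.asarray(default_flag, dtype=float)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def brier_decomposition(pd_pred, default_flag):
    """Return (reliability, resolution, uncertainty), grouping by distinct predicted PD."""
    p = np.asarray(pd_pred, dtype=float)
    y = np.asarray(default_flag, dtype=float)
    base_rate = y.mean()
    values, group = np.unique(p, return_inverse=True)
    n_k = np.bincount(group)                       # borrowers per distinct PD
    o_k = np.bincount(group, weights=y) / n_k      # observed default rate per distinct PD
    reliability = np.sum(n_k * (values - o_k) ** 2) / len(y)   # lower is better
    resolution = np.sum(n_k * (o_k - base_rate) ** 2) / len(y) # higher is better
    uncertainty = base_rate * (1.0 - base_rate)
    return float(reliability), float(resolution), float(uncertainty)

# Hypothetical two-grade portfolio: 10 borrowers at 10% PD (1 default), 10 at 50% PD (5 defaults).
pd_pred = [0.1] * 10 + [0.5] * 10
default_flag = [1] + [0] * 9 + [1] * 5 + [0] * 5

reliability, resolution, uncertainty = brier_decomposition(pd_pred, default_flag)
print("Brier:", brier_score(pd_pred, default_flag), " log loss:", log_loss(pd_pred, default_flag))
print("reliability:", reliability, " resolution:", resolution, " uncertainty:", uncertainty)

# One defaulted borrower scored at 1% versus 30%: the log loss gap is far wider than the Brier gap.
print("Brier   :", brier_score([0.01], [1]), "vs", brier_score([0.30], [1]))
print("Log loss:", log_loss([0.01], [1]), "vs", log_loss([0.30], [1]))
```

In the two-grade example the grades are perfectly calibrated, so the reliability part is zero and the whole score is explained by resolution and the base-rate uncertainty.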

interactive · pd scaling

From TTC-style PD to PIT-style aligned PD

This is where calibration becomes operational. Change portfolio central tendency and macro stress, then watch rating-grade PDs shift from a base TTC profile toward a scenario-sensitive PIT profile.

[Interactive figure: grade-level PD scaling (TTC / base PD vs scaled PD) with a grade table. Controls: central tendency multiplier (shown at 1.00), macro stress factor (shown at 1.00), slope preservation (shown at 1.00). Readouts: portfolio avg PD, scaled avg PD, upper grade shift, lower grade shift.]

logit(PD_scaled) = a + b · logit(PD_TTC)

Central tendency alignment: even if ranking is fine, average PD may need to move up or down to reflect current portfolio conditions.

PIT flavor: under stress, the whole distribution may shift upward, and weaker grades often move more sharply once nonlinear effects kick in.
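One common way to make central tendency alignment concrete is to solve for the intercept a in logit(PD_scaled) = a + b · logit(PD_TTC) so that the weighted average scaled PD hits a chosen target. The sketch below is a minimal illustration under assumptions: the grade PDs, the portfolio weights, the 1.5 multiplier, and the name `scale_to_target` are all hypothetical, and the target is taken as multiplier times the base average, which is only one possible reading of a central tendency multiplier.

```python
import numpy as np

def logit(p):
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))

def expit(z):
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def scale_to_target(pd_ttc, weights, target_mean, b=1.0, tol=1e-12):
    """Find intercept a (by bisection) so the weighted mean of expit(a + b*logit(PD_TTC)) equals target_mean."""
    if not 0.0 < target_mean < 1.0:
        raise ValueError("target_mean must lie strictly between 0 and 1")
    x = logit(pd_ttc)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    lo, hi = -20.0, 20.0                       # the weighted mean is increasing in a
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if np.sum(w * expit(mid + b * x)) < target_mean:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    a = 0.5 * (lo + hi)
    return a, expit(a + b * x)

# Hypothetical six-grade master scale and portfolio weights (illustrative values only).
pd_ttc = np.array([0.002, 0.005, 0.012, 0.03, 0.08, 0.20])
weights = np.array([0.15, 0.25, 0.25, 0.20, 0.10, 0.05])

base_mean = float(np.sum(weights * pd_ttc))
target_mean = 1.5 * base_mean                  # assumed central tendency multiplier of 1.5
a, pd_scaled = scale_to_target(pd_ttc, weights, target_mean, b=1.0)

print("base avg PD:", round(base_mean, 5), " target avg PD:", round(target_mean, 5), " a:", round(a, 4))
for grade, (p0, p1) in enumerate(zip(pd_ttc, pd_scaled), start=1):
    print(f"grade {grade}: TTC {p0:.4f} -> scaled {p1:.4f}  (shift {p1 - p0:+.4f})")
```

With b = 1 every grade receives the same shift on the logit scale, so grade ordering is preserved, while the shift in percentage points is larger for the weaker grades.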

reference

Calibration concepts compared

Concept	Main question	Good value	What failure means
Calibration curve	Do grouped predicted PDs match grouped observed DRs?	Near diagonal	Systematic under/over-prediction
Calibration intercept	Is the whole level shifted?	≈ 0	Average PD too low/high
Calibration slope	Is the spread right?	≈ 1	Predictions too extreme or too flat
Brier score	How wrong are the probabilities overall?	Lower is better	Poor reliability and/or weak resolution
Log loss	How harshly are wrong probabilities punished?	Lower is better	Overconfident wrong predictions
Central tendency adjustment	Does portfolio average PD reflect current reality?	Aligned	Portfolio mean under- or overstated

deeper concepts

Concepts every validator should keep

ranking vs levels

AUC can stay high while calibration breaks

That is why discrimination and calibration should never be treated as interchangeable evidence.

horizon consistency

The horizon must match

A 12-month PD must be compared with a 12-month observed default frequency, not with some mixed or lifetime realised rate.

segment drift

Overall calibration can hide segment failure

A model may look aligned in aggregate while materially missing specific segments, grades, industries, or products.
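A toy example shows how aggregate alignment can mask segment failure. The sketch below uses made-up data: two hypothetical segments, "retail" and "sme", each with 100 borrowers scored at 4% PD, and a helper name `calibration_by_segment` chosen for illustration.

```python
from collections import defaultdict

def calibration_by_segment(records):
    """records: iterable of (segment, predicted_pd, default_flag).
    Returns {segment: (count, mean predicted PD, observed default rate)}."""
    totals = defaultdict(lambda: [0, 0.0, 0.0])
    for segment, predicted_pd, default_flag in records:
        entry = totals[segment]
        entry[0] += 1
        entry[1] += predicted_pd
        entry[2] += default_flag
    return {seg: (n, pd_sum / n, d_sum / n) for seg, (n, pd_sum, d_sum) in totals.items()}

# Hypothetical portfolio: same 4% PD everywhere, but 2 defaults in retail and 6 in sme.
records = (
    [("retail", 0.04, 1)] * 2 + [("retail", 0.04, 0)] * 98
    + [("sme", 0.04, 1)] * 6 + [("sme", 0.04, 0)] * 94
)

by_segment = calibration_by_segment(records)
overall = calibration_by_segment(("all", pd_, flag) for _, pd_, flag in records)["all"]

print("overall:", overall)          # mean PD and observed DR agree at the portfolio level
for segment, row in by_segment.items():
    print(segment, row)             # each segment misses in the opposite direction
```

The portfolio-level comparison looks aligned, while retail is over-predicted and sme is under-predicted by the same margin.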

recalibration

Not every issue needs redevelopment

Sometimes ranking is still healthy and only the probability layer needs repair through intercept/slope correction or central tendency adjustment.

TTC vs PIT

Philosophy shapes calibration expectations

A TTC-oriented model is not supposed to chase every short-term cycle move. A PIT-style calibration is much more responsive to current conditions.

scoring rules

Use proper probability metrics

If the objective is probability quality, metrics that directly punish wrong probabilities deserve a central role.

summary

What to leave this page with

Calibration is the missing second half of probability modelling. Ranking tells you who is riskier. Calibration tells you whether the PD number itself is believable.

The useful order is: first inspect the grouped calibration curve, then separate intercept and slope effects, then quantify mismatch with proper scoring rules, then connect the result to PD scaling and current portfolio alignment.

Once that structure is clear, calibration stops looking like a minor technical adjustment and starts looking like the probability layer of the entire model.