Model Stability & Backtesting
A model can still discriminate well and yet no longer be safe to use. Population drift, variable drift, and PD miscalibration all create different types of model risk. This page brings those monitoring dimensions together.
Three questions every ongoing validation cycle should ask
1. Is the population still the same?
PSI and CSI answer whether the score distribution or individual variable distributions have shifted since development.
2. Can the model still rank risk?
Gini, AUC, and KS monitor whether the score is still separating higher-risk and lower-risk observations in a useful way.
3. Are PD levels still correct?
Binomial backtests and Hosmer-Lemeshow examine whether predicted PDs are still aligned with realised default behaviour.
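The ranking question (question 2) can be made concrete with a short calculation. The sketch below is one possible way to compute AUC, Gini, and KS from scores and default flags using only the standard library. The function name `auc_gini_ks` and the convention that a higher score means higher risk are assumptions for illustration, not part of any particular monitoring framework.

```python
def auc_gini_ks(scores, defaults):
    """Return (AUC, Gini, KS), assuming a HIGHER score means HIGHER risk.

    scores: list of numbers; defaults: list of 0/1 flags (1 = defaulted).
    """
    n = len(scores)
    n_bad = sum(defaults)
    n_good = n - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("need at least one default and one non-default")

    order = sorted(range(n), key=lambda i: scores[i])
    rank_sum_bad = 0.0          # sum of (tie-averaged) ranks of defaulters
    cum_bad = cum_good = 0      # running counts for the KS statistic
    ks = 0.0
    i = 0
    while i < n:
        # Collect the group of observations sharing the same score.
        j = i
        while j < n and scores[order[j]] == scores[order[i]]:
            j += 1
        group = order[i:j]
        bad_in_group = sum(defaults[k] for k in group)
        # 1-based ranks i+1 .. j, so the average rank is (i + 1 + j) / 2.
        rank_sum_bad += (i + 1 + j) / 2 * bad_in_group
        cum_bad += bad_in_group
        cum_good += len(group) - bad_in_group
        # KS: largest gap between the two cumulative distributions.
        ks = max(ks, abs(cum_bad / n_bad - cum_good / n_good))
        i = j

    # AUC from the Mann-Whitney U statistic; Gini = 2 * AUC - 1.
    auc = (rank_sum_bad - n_bad * (n_bad + 1) / 2) / (n_bad * n_good)
    return auc, 2 * auc - 1, ks
```

If the score is oriented the other way (higher score means lower risk), the scores would need to be negated before the call. Tracking the three values per quarter, rather than reading them once, is what turns them into a monitoring signal.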
A useful order for monitoring logic
Start with drift
Before judging the model, first ask whether the portfolio or inputs have materially changed since development.
Then check discrimination
If population shift is low but ranking deteriorates, the model itself may be ageing. If shift is high and discrimination falls, portfolio change may be the main driver.
Then test calibration
Backtesting asks whether predicted default rates still match realised experience, grade by grade and overall.
Then decide the action
Some outcomes justify simple monitoring, some require investigation, and some require recalibration or redevelopment.
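The four steps above can be sketched as a small decision function. The mapping from findings to actions below is an illustrative assumption, not a prescribed policy: real governance frameworks define their own rules, and the function name `monitoring_action` and its return strings are hypothetical.

```python
def monitoring_action(drift_high, discrimination_down, calibration_off):
    """Map three yes/no findings to a suggested next step (illustrative rules only)."""
    if discrimination_down:
        # Low drift + weaker ranking: the model itself may be ageing.
        # High drift + weaker ranking: portfolio change may be the main driver.
        cause = "portfolio change" if drift_high else "model ageing"
        return f"redevelopment review (likely driver: {cause})"
    if calibration_off:
        # Ranking still works but PD levels are off: a level adjustment may suffice.
        return "recalibration"
    if drift_high:
        # Population moved, yet ranking and calibration still hold.
        return "investigation"
    return "routine monitoring"


# Example: stable population, ranking intact, PD levels misaligned.
print(monitoring_action(drift_high=False, discrimination_down=False, calibration_off=True))
```

The point of the sketch is the ordering: drift is read first because it changes how a drop in discrimination or calibration should be interpreted.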
PSI — detecting score distribution drift
PSI compares the score distribution at development with the current score distribution, bin by bin. A stable population gives a small PSI. A shifted or newly segmented population pushes PSI upward, and the per-bin contributions show where in the score range the shift is concentrated.
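A minimal sketch of the usual PSI calculation follows, assuming both samples have already been counted into the same score bins. With development share e_i and current share a_i in bin i, the index is the sum over bins of (a_i - e_i) * ln(a_i / e_i). The function names and the small `floor` used to avoid taking the logarithm of zero for empty bins are assumptions for illustration.

```python
import math


def psi_by_bin(dev_counts, cur_counts, floor=1e-6):
    """Per-bin PSI contributions from counts over the SAME bins."""
    if len(dev_counts) != len(cur_counts):
        raise ValueError("both samples must use the same bins")
    dev_total, cur_total = sum(dev_counts), sum(cur_counts)
    contributions = []
    for d, c in zip(dev_counts, cur_counts):
        # The floor (an assumption) keeps empty bins from producing log(0).
        e = max(d / dev_total, floor)   # development share
        a = max(c / cur_total, floor)   # current share
        contributions.append((a - e) * math.log(a / e))
    return contributions


def psi(dev_counts, cur_counts, floor=1e-6):
    """Population Stability Index: the sum of the per-bin contributions."""
    return sum(psi_by_bin(dev_counts, cur_counts, floor))


# Made-up counts for five score bins, for illustration only.
dev = [200, 250, 300, 150, 100]
cur = [120, 200, 310, 220, 150]
print(round(psi(dev, cur), 4), [round(x, 4) for x in psi_by_bin(dev, cur)])
```

Every contribution is non-negative, so the largest entries of `psi_by_bin` point directly at the bins responsible for most of the shift.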
CSI — which variables are actually shifting?
CSI applies the same stability logic to each input variable. This is where score-level drift turns into diagnosis: which variables are moving, and which of them are driving the score PSI?
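As a rough sketch of that diagnosis step, the same stability formula can be applied to each binned input variable and the results ranked. The helper names, the `floor` for empty bins, and the variable names in the example (`income_band`, `ltv_band`) are hypothetical and the counts are made up.

```python
import math


def stability_index(dev_counts, cur_counts, floor=1e-6):
    """Sum over bins of (current share - development share) * ln(current / development)."""
    dev_total, cur_total = sum(dev_counts), sum(cur_counts)
    total = 0.0
    for d, c in zip(dev_counts, cur_counts):
        e = max(d / dev_total, floor)   # floor is an assumption to avoid log(0)
        a = max(c / cur_total, floor)
        total += (a - e) * math.log(a / e)
    return total


def rank_variables_by_csi(binned):
    """binned: {variable: (dev_counts, cur_counts)} -> [(variable, csi)], largest first."""
    csi = {name: stability_index(d, c) for name, (d, c) in binned.items()}
    return sorted(csi.items(), key=lambda item: item[1], reverse=True)


# Hypothetical variables with made-up bin counts.
binned = {
    "income_band": ([300, 400, 300], [290, 410, 300]),
    "ltv_band": ([500, 300, 200], [300, 300, 400]),
}
for name, value in rank_variables_by_csi(binned):
    print(f"{name}: {value:.4f}")
```

A high CSI shows that a variable has moved, not how much it moved the score. A variable with a small weight in the model can drift substantially while contributing little to the score PSI, so the ranking is a starting point for investigation rather than an attribution.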
Predicted PD vs observed default rate
This is the core calibration question in ongoing validation. Per-grade binomial tests show where the model is too optimistic or too conservative. Hosmer-Lemeshow gives an overall calibration view across the portfolio.
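Both tests can be sketched with the standard library alone. The code below is an illustration under stated assumptions: defaults within a grade are treated as independent, the binomial test is one-sided (it looks for more defaults than the PD implies), and the Hosmer-Lemeshow p-value uses degrees of freedom equal to the number of grades, a common choice for out-of-sample backtests, while in-sample fits usually use the number of groups minus 2. All function names are hypothetical.

```python
import math


def binomial_tail(n, d, pd):
    """P(X >= d) for X ~ Binomial(n, pd), computed in log space for stability."""
    if not 0.0 < pd < 1.0:
        raise ValueError("pd must be strictly between 0 and 1")
    log_p, log_q = math.log(pd), math.log1p(-pd)
    total = 0.0
    for k in range(d, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * log_p + (n - k) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


def chi2_sf(x, df):
    """Upper tail of the chi-square distribution for integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Start from Q(1/2, y) = erfc(sqrt(y)) or Q(1, y) = exp(-y), then apply
    # Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1) until a = df / 2.
    a = 0.5 if df % 2 else 1.0
    q = math.erfc(math.sqrt(y)) if df % 2 else math.exp(-y)
    while a < df / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def hosmer_lemeshow(grades, df=None):
    """grades: list of (n_obligors, n_defaults, predicted_pd), one tuple per grade."""
    stat = sum((d - n * p) ** 2 / (n * p * (1.0 - p)) for n, d, p in grades)
    dof = len(grades) if df is None else df   # default df is an assumption
    return stat, chi2_sf(stat, dof)


# Made-up portfolio: (obligors, observed defaults, predicted PD) per grade.
grades = [(2000, 6, 0.002), (1500, 18, 0.01), (800, 35, 0.03)]
for n, d, p in grades:
    print(f"PD={p:.3f}  observed={d / n:.4f}  p-value={binomial_tail(n, d, p):.4f}")
print("Hosmer-Lemeshow (statistic, p-value):", hosmer_lemeshow(grades))
```

A small binomial p-value for a grade suggests the PD is too optimistic there. A model that is too conservative would show up in the opposite tail, which this one-sided sketch does not test.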
Time-series monitoring dashboard
This is the portfolio view that monitoring committees actually need: how PSI, Gini, and observed default behaviour move quarter by quarter, and whether the pattern suggests stable performance, gradual decay, or a structural break.
[Interactive chart: PSI and Gini over time, with a scenario selector and a quarter-by-quarter view]
Monitoring metrics cheat sheet
| Metric | What it checks | Green | Yellow | Red | Frequency |
|---|---|---|---|---|---|
| PSI | Score distribution stability | < 0.10 | 0.10–0.25 | ≥ 0.25 | Quarterly |
| CSI | Input variable stability | < 0.10 | 0.10–0.25 | ≥ 0.25 | Quarterly |
| Gini / KS change | Discrimination decay | small drift | moderate drift | material drop | Quarterly |
| Binomial test | Per-grade PD calibration | p ≥ 0.05 | 0.01 ≤ p < 0.05 | p < 0.01 | Annual |
| Hosmer-Lemeshow | Overall calibration | p ≥ 0.05 | 0.01 ≤ p < 0.05 | p < 0.01 | Annual |
| Observed DR trend | Realised performance drift | within range | near boundary | outside range | Quarterly |
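The numeric thresholds in the table can be encoded as a small lookup, sketched below. The function names are hypothetical, and the boundary handling (0.10 counts as yellow, 0.25 as red, a p-value of exactly 0.01 as yellow) is one reading of the table's ranges. The qualitative rows (Gini / KS change, observed DR trend) are left out because the table gives no numeric cut-offs for them.

```python
def rag_for_index(value):
    """PSI or CSI: green below 0.10, yellow from 0.10 up to 0.25, red at 0.25 or above."""
    if value < 0.10:
        return "green"
    if value < 0.25:
        return "yellow"
    return "red"


def rag_for_p_value(p):
    """Binomial or Hosmer-Lemeshow p-value: green at 0.05 or above, red below 0.01."""
    if p >= 0.05:
        return "green"
    if p >= 0.01:
        return "yellow"
    return "red"


# Example readings (made up) for one monitoring cycle.
print(rag_for_index(0.07), rag_for_index(0.18), rag_for_p_value(0.003))
```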
Concepts every validator should keep
Drift is not automatically model failure
A changing population can increase PSI without necessarily destroying model usefulness. The real question is whether ranking and calibration remain acceptable under the new data.
Calibration depends on philosophy
Through-the-cycle (TTC) models are not meant to chase every short-term cycle move; point-in-time (PIT) models are. Backtesting must be interpreted against the intended design.
Some overprediction is intentional
In regulated environments, a margin of conservatism means predicted PDs are often expected to sit somewhat above realised default rates on average.
Classical tests lose power fast
When defaults are extremely rare, confidence intervals become very wide and annual backtests can become statistically weak. Interpretation must become more judgement-based.
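To illustrate how wide the intervals get, the sketch below computes an approximate 95% Wilson score interval for an observed default rate. The Wilson interval is swapped in here as a simple closed-form choice; it is an assumption, not a method named on this page, and the portfolio sizes are made up.

```python
import math


def wilson_interval(defaults, n, z=1.96):
    """Approximate 95% Wilson score interval for a default rate (z = 1.96)."""
    p_hat = defaults / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(centre - half, 0.0), min(centre + half, 1.0)


# Same observed default rate (0.2%), very different amounts of evidence.
for defaults, n in [(2, 1_000), (200, 100_000)]:
    low, high = wilson_interval(defaults, n)
    print(f"{defaults}/{n}: [{low:.4%}, {high:.4%}]")
```

With only a handful of defaults, the interval spans a range of default rates so broad that many different PD estimates remain statistically compatible with the data, which is why a failure to reject says little in low-default portfolios.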
Single points matter less than paths
The direction and persistence of change often matter more than any single quarterly reading. Good monitoring is about trajectories, not only flags.
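One simple way to look at a path rather than a point is to summarise a quarterly series by its trend and by how long it has been moving in one direction. The two helpers below are an illustrative sketch with hypothetical names; the PSI series is made up, and any cut-offs applied to these summaries would have to be set by the monitoring policy.

```python
def longest_rising_run(values):
    """Length of the longest run of consecutive quarter-on-quarter increases."""
    longest = current = 0
    for prev, nxt in zip(values, values[1:]):
        current = current + 1 if nxt > prev else 0
        longest = max(longest, current)
    return longest


def ols_slope(values):
    """Least-squares slope of the series per quarter (needs at least 2 points)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


# Made-up quarterly PSI readings, all still below a 0.10 threshold.
quarterly_psi = [0.02, 0.04, 0.03, 0.05, 0.07, 0.09]
print(longest_rising_run(quarterly_psi), round(ols_slope(quarterly_psi), 4))
```

In this made-up series no single reading breaches a threshold, yet the sustained upward movement is the kind of pattern a flag-only view would miss.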
Define actions in advance
Green should mean routine monitoring, yellow should trigger investigation, and red should trigger formal escalation. A threshold without an action plan is weak governance.
What to leave this page with
Model monitoring is not one number. PSI and CSI tell you whether the environment has changed. Discrimination tells you whether the ranking still works. Backtesting tells you whether predicted PD levels still match reality.
The useful order is: first detect drift, then assess ranking, then test calibration, then decide whether the right response is monitoring, investigation, recalibration, or redevelopment.
Once that structure is clear, ongoing validation becomes a disciplined monitoring system rather than a collection of disconnected indicators.