notes · 04

Model Stability & Backtesting

A model can still discriminate well and yet no longer be safe to use. Population drift, variable drift, and PD miscalibration all create different types of model risk. This page brings those monitoring dimensions together.

Start with PSI and CSI to understand whether the portfolio has shifted. Then move into PD backtesting and calibration tests such as binomial checks and Hosmer-Lemeshow. End with the quarterly monitoring view, where all of these signals are tracked over time.

Three questions every ongoing validation cycle should ask

1. Is the population still the same?

PSI and CSI answer whether the score distribution or individual variable distributions have shifted since development.

Stability asks: is the model still seeing familiar data?

2. Can the model still rank risk?

Gini, AUC, and KS monitor whether the score is still separating higher-risk and lower-risk observations in a useful way.

Discrimination asks: can the model still sort?

3. Are PD levels still correct?

Binomial backtests and Hosmer-Lemeshow examine whether predicted PDs are still aligned with realised default behaviour.

Calibration asks: are the predicted levels still right?

Important distinction: high PSI does not automatically mean the model is wrong, and a calibration issue does not automatically mean the score has lost ranking power. These are different failure modes and should be diagnosed separately.
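The ranking question can be made concrete with a small sketch. It assumes the score is a predicted PD (higher means riskier) and uses made-up scores and default flags; the function names are illustrative, not taken from any library.

```python
def auc(scores, defaults):
    """Share of (defaulter, non-defaulter) pairs ranked correctly; ties count half."""
    bad = [s for s, d in zip(scores, defaults) if d == 1]
    good = [s for s, d in zip(scores, defaults) if d == 0]
    if not bad or not good:
        raise ValueError("need at least one default and one non-default")
    wins = 0.0
    for b in bad:
        for g in good:
            if b > g:
                wins += 1.0
            elif b == g:
                wins += 0.5
    return wins / (len(bad) * len(good))


def gini(scores, defaults):
    """Gini coefficient (accuracy ratio) derived from AUC."""
    return 2.0 * auc(scores, defaults) - 1.0


def ks(scores, defaults):
    """Largest gap between the cumulative score distributions of goods and bads."""
    bad = [s for s, d in zip(scores, defaults) if d == 1]
    good = [s for s, d in zip(scores, defaults) if d == 0]
    gap = 0.0
    for t in sorted(set(scores)):
        cdf_bad = sum(1 for b in bad if b <= t) / len(bad)
        cdf_good = sum(1 for g in good if g <= t) / len(good)
        gap = max(gap, abs(cdf_good - cdf_bad))
    return gap


# Hypothetical data: predicted PDs and realised default flags (1 = default).
pd_scores = [0.02, 0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.80]
default_flags = [0, 0, 0, 1, 0, 0, 1, 1]

print(f"AUC  = {auc(pd_scores, default_flags):.4f}")
print(f"Gini = {gini(pd_scores, default_flags):.4f}")
print(f"KS   = {ks(pd_scores, default_flags):.4f}")
```

The pairwise loop is quadratic and only suitable for illustration; on a real portfolio a rank-based AUC would be the more practical choice.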

A useful order for monitoring logic

01

Start with drift

Before judging the model, first ask whether the portfolio or inputs have materially changed since development.

02

Then check discrimination

If population shift is low but ranking deteriorates, the model itself may be ageing. If shift is high and discrimination falls, portfolio change may be the main driver.

03

Then test calibration

Backtesting asks whether predicted default rates still match realised experience, grade by grade and overall.

04

Then decide the action

Some outcomes justify simple monitoring, some require investigation, and some require recalibration or redevelopment.
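The four steps above can be sketched as a single ordered check. Everything specific here is an assumption for illustration: the function name, the 0.05 Gini-drop limit, and the mapping from findings to actions are hypothetical, while the 0.25 PSI limit and the 0.01 p-value limit mirror the red thresholds in the cheat sheet on this page.

```python
def diagnose(psi, gini_drop, calibration_p,
             psi_limit=0.25, gini_drop_limit=0.05, p_limit=0.01):
    """Walk the monitoring steps in order and return (findings, action).

    psi           : score-level PSI versus development
    gini_drop     : development Gini minus current Gini
    calibration_p : p-value of the overall calibration test
    """
    drift = psi >= psi_limit
    ranking_lost = gini_drop >= gini_drop_limit
    miscalibrated = calibration_p < p_limit
    findings = []

    # Step 1: has the population materially changed?
    findings.append("material population drift" if drift else "population stable")

    # Step 2: read any ranking loss in the light of step 1.
    if ranking_lost and drift:
        findings.append("ranking loss, portfolio change may be the main driver")
    elif ranking_lost:
        findings.append("ranking loss with stable population, model may be ageing")
    else:
        findings.append("ranking holds")

    # Step 3: are predicted PD levels still aligned with realised defaults?
    findings.append("PD levels misaligned" if miscalibrated else "PD levels aligned")

    # Step 4: choose the response (an assumed policy, not a standard).
    if ranking_lost:
        action = "investigate and consider redevelopment"
    elif miscalibrated:
        action = "recalibrate"
    elif drift:
        action = "investigate"
    else:
        action = "continue routine monitoring"
    return findings, action


findings, action = diagnose(psi=0.08, gini_drop=0.01, calibration_p=0.004)
print(findings)
print(action)
```

In this made-up reading the population is stable and ranking holds, but calibration fails, so the sketch points to recalibration rather than redevelopment.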

PSI — detecting score distribution drift

PSI compares the score distribution in the development sample with the distribution in the current population, bin by bin. A stable population gives a small PSI. A shifted or newly segmented population pushes PSI upward, and the per-bin contributions show which score ranges account for most of the shift.
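A minimal sketch of the calculation, assuming the score has already been cut into the same bins for both samples. The bin counts are made up, and the small floor `eps` is one common way to avoid taking the log of zero for an empty bin, not the only one.

```python
import math


def psi_contributions(expected_counts, actual_counts, eps=1e-6):
    """Per-bin PSI terms: (actual share - expected share) * ln(actual / expected)."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both samples must use the same bins")
    exp_total = sum(expected_counts)
    act_total = sum(actual_counts)
    terms = []
    for e, a in zip(expected_counts, actual_counts):
        # Floor the shares so an empty bin does not produce log(0).
        e_share = max(e / exp_total, eps)
        a_share = max(a / act_total, eps)
        terms.append((a_share - e_share) * math.log(a_share / e_share))
    return terms


# Hypothetical counts per score bin (lowest to highest score band).
dev_counts = [100, 200, 400, 200, 100]   # development sample
cur_counts = [60, 150, 380, 260, 150]    # current quarter

terms = psi_contributions(dev_counts, cur_counts)
psi = sum(terms)
print(f"PSI = {psi:.4f}")
for i, t in enumerate(terms, start=1):
    print(f"  bin {i}: {t:.4f}")
```

Every term is non-negative, so the largest terms point directly at the bins where the two distributions differ most.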

CSI — which variables are actually shifting?

CSI applies the same stability logic to each input variable. This is where score-level drift turns into diagnosis: which variables are moving, and which of them are driving the score PSI?
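As a sketch of that diagnosis step, the same index can be computed per variable and ranked. The variable names and binned counts below are hypothetical. Note that this follows the definition used on this page (the PSI formula applied to each input); some scorecard texts define CSI with score points instead.

```python
import math


def stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI-style index over pre-defined bins of a single variable."""
    exp_total = sum(expected_counts)
    act_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_share = max(e / exp_total, eps)  # floor avoids log(0) on empty bins
        a_share = max(a / act_total, eps)
        total += (a_share - e_share) * math.log(a_share / e_share)
    return total


# Hypothetical binned counts per input variable: (development, current).
variables = {
    "utilisation":    ([300, 400, 300], [150, 400, 450]),
    "months_on_book": ([250, 500, 250], [240, 510, 250]),
    "income_band":    ([200, 600, 200], [260, 560, 180]),
}

csi = {name: stability_index(dev, cur) for name, (dev, cur) in variables.items()}

# Rank variables from most to least shifted.
ranked = sorted(csi.items(), key=lambda kv: kv[1], reverse=True)
for name, value in ranked:
    print(f"{name:15s} CSI = {value:.4f}")
```

In this made-up example only one variable moves materially, which is the kind of pattern that turns a score-level PSI reading into a specific line of investigation.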

Predicted PD vs observed default rate

This is the core calibration question in ongoing validation. Per-grade binomial tests show where the model is too optimistic or too conservative. Hosmer-Lemeshow gives an overall calibration view across the portfolio.
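Both tests can be sketched with the standard library alone. The grade data are made up, and the helper names are illustrative. One assumption deserves emphasis: the Hosmer-Lemeshow degrees of freedom default here to the number of grades, a common convention for out-of-sample backtests, whereas in-sample fits usually use the number of groups minus two.

```python
import math


def binom_pmf(i, n, p):
    """P(X = i) for X ~ Binomial(n, p), in log space to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
               + i * math.log(p) + (n - i) * math.log1p(-p))
    return math.exp(log_pmf)


def binomial_backtest(n, pd, defaults):
    """Return (p_under, p_over) for one grade.

    p_under = P(X >= defaults): small when the PD looks too optimistic.
    p_over  = P(X <= defaults): small when the PD looks too conservative.
    """
    p_under = sum(binom_pmf(i, n, pd) for i in range(defaults, n + 1))
    p_over = sum(binom_pmf(i, n, pd) for i in range(0, defaults + 1))
    return min(p_under, 1.0), min(p_over, 1.0)


def chi2_sf(x, df):
    """Upper tail of the chi-square distribution for integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)               # closed form for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))    # closed form for df = 1
    while a < df / 2.0:
        # Recurrence Q(a + 1, y) = Q(a, y) + y^a e^(-y) / Gamma(a + 1).
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def hosmer_lemeshow(grades, df=None):
    """Chi-square statistic and p-value across (n, pd, defaults) triples."""
    stat = 0.0
    for n, pd, defaults in grades:
        expected = n * pd
        stat += (defaults - expected) ** 2 / (expected * (1.0 - pd))
    if df is None:
        df = len(grades)  # assumption: out-of-sample convention
    return stat, chi2_sf(stat, df)


# Hypothetical portfolio: grade -> (obligors, predicted PD, observed defaults).
portfolio = {"A": (2000, 0.005, 12), "B": (1500, 0.02, 33), "C": (800, 0.06, 70)}

for grade, (n, pd, d) in portfolio.items():
    p_under, p_over = binomial_backtest(n, pd, d)
    print(f"grade {grade}: observed {d / n:.3%} vs PD {pd:.3%}, "
          f"p_under = {p_under:.4f}, p_over = {p_over:.4f}")

hl_stat, hl_p = hosmer_lemeshow(list(portfolio.values()))
print(f"Hosmer-Lemeshow: statistic = {hl_stat:.3f}, p = {hl_p:.4f}")
```

Both tests treat defaults as independent. Default correlation widens the true dispersion, so small p-values from this sketch would overstate the evidence in a correlated portfolio.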

Time-series monitoring dashboard

This is the portfolio view that monitoring committees actually need: how PSI, Gini, and observed default behaviour move quarter by quarter, and whether the pattern suggests stable performance, gradual decay, or structural break.

[Chart: PSI and Gini over time, quarter by quarter, with threshold lines and a scenario selector]

Monitoring interpretation: a one-off bad quarter can be noise. A trend is much more important. Gradual Gini erosion with rising PSI is often more dangerous than a single visible shock because it is easier to ignore until it becomes structural.
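One simple way to separate a trend from a single bad quarter is to look at the fitted slope and at how many consecutive quarters moved in the same direction. The sketch below uses made-up quarterly series; the two helper functions are illustrative, and what counts as a worrying slope or streak is left to the monitoring policy.

```python
def slope(values):
    """Least-squares slope of values against quarter index 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two quarters")
    x_mean = (n - 1) / 2.0
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def longest_run(values, direction):
    """Longest streak of consecutive moves in one direction (+1 up, -1 down)."""
    best = run = 0
    for prev, cur in zip(values, values[1:]):
        if (cur - prev) * direction > 0:
            run += 1
            best = max(best, run)
        else:
            run = 0  # a flat or opposite move breaks the streak
    return best


# Hypothetical quarterly readings, oldest first.
gini_by_quarter = [0.62, 0.61, 0.61, 0.59, 0.58, 0.56, 0.55, 0.53]
psi_by_quarter = [0.02, 0.03, 0.05, 0.06, 0.09, 0.11, 0.14, 0.17]

print(f"Gini slope per quarter: {slope(gini_by_quarter):+.4f}")
print(f"PSI slope per quarter:  {slope(psi_by_quarter):+.4f}")
print(f"Longest Gini decline streak: {longest_run(gini_by_quarter, -1)} quarters")
print(f"Longest PSI increase streak: {longest_run(psi_by_quarter, +1)} quarters")
```

No single quarter in these series would look alarming on its own, which is exactly the gradual pattern the interpretation above warns about.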

Monitoring metrics cheat sheet

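| Metric            | What it checks             | Green        | Yellow         | Red           | Frequency |
| ----------------- | -------------------------- | ------------ | -------------- | ------------- | --------- |
| PSI               | Score distribution stability | < 0.10     | 0.10–0.25      | ≥ 0.25        | Quarterly |
| CSI               | Input variable stability   | < 0.10       | 0.10–0.25      | ≥ 0.25        | Quarterly |
| Gini / KS change  | Discrimination decay       | small drift  | moderate drift | material drop | Quarterly |
| Binomial test     | Per-grade PD calibration   | p ≥ 0.05     | 0.01–0.05      | < 0.01        | Annual    |
| Hosmer-Lemeshow   | Overall calibration        | p ≥ 0.05     | 0.01–0.05      | < 0.01        | Annual    |
| Observed DR trend | Realised performance drift | within range | near boundary  | outside range | Quarterly |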
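The numeric rows of the cheat sheet translate directly into a traffic-light classifier. The sketch below encodes only the PSI/CSI and p-value thresholds from the table; the function names and sample readings are hypothetical, and taking the worst colour as the overall status is an assumed convention.

```python
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def classify_stability(index):
    """Traffic light for PSI or CSI (lower is better)."""
    if index >= 0.25:
        return "red"
    if index >= 0.10:
        return "yellow"
    return "green"


def classify_p_value(p):
    """Traffic light for binomial or Hosmer-Lemeshow p-values (higher is better)."""
    if p < 0.01:
        return "red"
    if p < 0.05:
        return "yellow"
    return "green"


def worst_status(statuses):
    """Overall status taken as the most severe individual light."""
    return max(statuses, key=SEVERITY.get)


# Hypothetical readings for one monitoring cycle.
lights = {
    "PSI": classify_stability(0.12),
    "CSI (worst variable)": classify_stability(0.08),
    "Binomial (worst grade)": classify_p_value(0.03),
    "Hosmer-Lemeshow": classify_p_value(0.20),
}

for metric, light in lights.items():
    print(f"{metric:24s} {light}")
print("overall:", worst_status(lights.values()))
```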

Concepts every validator should keep

drift vs failure

Drift is not automatically model failure

A changing population can increase PSI without necessarily destroying model usefulness. The real question is whether ranking and calibration remain acceptable under the new data.

ttc vs pit

Calibration depends on philosophy

TTC models are not meant to chase every short-term cycle move. PIT models are. Backtesting must be interpreted against the intended design.

conservatism

Some overprediction is intentional

In regulated environments, a margin of conservatism means predicted PDs are often expected to sit somewhat above realised default rates on average.

low default portfolios

Classical tests lose power fast

When defaults are extremely rare, confidence intervals become very wide and annual backtests can become statistically weak. Interpretation must become more judgement-based.
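To see how wide the uncertainty becomes, the sketch below computes an exact one-sided upper confidence bound for the default rate by bisection on the binomial distribution. The portfolio sizes and default counts are made up, the 95% level is an assumption, and defaults are treated as independent.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def upper_bound(defaults, n, confidence=0.95):
    """Largest default rate still consistent with the observed count."""
    if defaults >= n:
        return 1.0
    alpha = 1.0 - confidence
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        # The CDF falls as p rises: while it exceeds alpha, the bound is higher.
        if binom_cdf(defaults, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical low-default portfolios: (obligors, observed defaults).
for n, d in [(500, 0), (500, 2), (5000, 20)]:
    print(f"n = {n:5d}, defaults = {d:2d}: observed {d / n:.3%}, "
          f"95% upper bound {upper_bound(d, n):.3%}")
```

With zero observed defaults the bound has the closed form 1 - alpha^(1/n), which makes a convenient sanity check on the bisection. Even with no defaults at all, the data alone cannot rule out a default rate well above zero.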

monitoring trend

Single points matter less than paths

The direction and persistence of change often matter more than any single quarterly reading. Good monitoring is about trajectories, not only flags.

trigger governance

Define actions in advance

Green should mean routine monitoring, yellow should trigger investigation, and red should trigger formal escalation. A threshold without an action plan is weak governance.

What to leave this page with

Model monitoring is not one number. PSI and CSI tell you whether the environment has changed. Discrimination tells you whether the ranking still works. Backtesting tells you whether predicted PD levels still match reality.

The useful order is: first detect drift, then assess ranking, then test calibration, then decide whether the right response is monitoring, investigation, recalibration, or redevelopment.

Once that structure is clear, ongoing validation becomes a disciplined monitoring system rather than a collection of disconnected indicators.