notes · 04

Model Stability & Backtesting

A model can still discriminate well and yet no longer be safe to use. Population drift, variable drift, and PD miscalibration all create different types of model risk. This page brings those monitoring dimensions together.

Start with PSI and CSI to understand whether the portfolio has shifted. Then move into PD backtesting and calibration tests such as binomial checks and Hosmer-Lemeshow. End with the quarterly monitoring view, where all of these signals are tracked over time.


the monitoring framework

Three questions every ongoing validation cycle should ask

1. Is the population still the same?

PSI and CSI answer whether the score distribution or individual variable distributions have shifted since development.

Stability asks: is the model still seeing familiar data?

2. Can the model still rank risk?

Gini, AUC, and KS monitor whether the score is still separating higher-risk and lower-risk observations in a useful way.

Discrimination asks: can the model still sort?

3. Are PD levels still correct?

Binomial backtests and Hosmer-Lemeshow examine whether predicted PDs are still aligned with realised default behaviour.

Calibration asks: are the predicted levels still right?

Important distinction: high PSI does not automatically mean the model is wrong, and a calibration issue does not automatically mean the score has lost ranking power. These are different failure modes and should be diagnosed separately.
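The second question relies on Gini, AUC, and KS. As a rough illustration of how the three relate, the sketch below computes them from scratch with a pairwise comparison. It assumes that a higher score means higher risk and that defaults are coded as 1; the function name and the sample data are hypothetical, and the quadratic loop is only suitable for small samples.

```python
def auc_gini_ks(scores, defaults):
    """Return (AUC, Gini, KS). Assumes higher score = riskier, default flag = 1."""
    bad = [s for s, d in zip(scores, defaults) if d == 1]
    good = [s for s, d in zip(scores, defaults) if d == 0]
    if not bad or not good:
        raise ValueError("need at least one default and one non-default")

    # AUC: probability that a defaulter scores above a non-defaulter (ties count half).
    wins = sum((b > g) + 0.5 * (b == g) for b in bad for g in good)
    auc = wins / (len(bad) * len(good))

    # Gini (accuracy ratio) is a linear rescaling of AUC.
    gini = 2 * auc - 1

    # KS: largest gap between the two cumulative score distributions.
    ks = 0.0
    for t in sorted(set(scores)):
        cdf_bad = sum(b <= t for b in bad) / len(bad)
        cdf_good = sum(g <= t for g in good) / len(good)
        ks = max(ks, abs(cdf_good - cdf_bad))
    return auc, gini, ks


# Hypothetical sample: six observations, three of them defaults.
sample_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
sample_defaults = [0, 0, 1, 0, 1, 1]
auc, gini, ks = auc_gini_ks(sample_scores, sample_defaults)
```

Tracking these three values against their development-sample levels is what turns them into a monitoring signal; the absolute level alone says little about decay.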

learning sequence

A useful order for monitoring logic

Start with drift

Before judging the model, first ask whether the portfolio or inputs have materially changed since development.

Then check discrimination

If population shift is low but ranking deteriorates, the model itself may be ageing. If shift is high and discrimination falls, portfolio change may be the main driver.

Then test calibration

Backtesting asks whether predicted default rates still match realised experience, grade by grade and overall.

Then decide the action

Some outcomes justify simple monitoring, some require investigation, and some require recalibration or redevelopment.
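The sequence above can be sketched as a small decision helper. This is only an illustration: the PSI limit of 0.10 is the green boundary used on this page, while the Gini drop limit of 0.05 and the function name are assumptions chosen for the example, not thresholds from the source.

```python
def diagnose(psi, gini_dev, gini_now, psi_limit=0.10, gini_drop_limit=0.05):
    """Combine drift and discrimination signals into a first diagnosis.

    psi_limit follows the green boundary quoted on this page;
    gini_drop_limit is a hypothetical tolerance for absolute Gini loss.
    """
    shifted = psi >= psi_limit
    decayed = (gini_dev - gini_now) >= gini_drop_limit

    if not shifted and not decayed:
        return "stable: proceed to calibration tests"
    if not shifted and decayed:
        return "model ageing suspected: ranking fell without population shift"
    if shifted and decayed:
        return "portfolio change likely the main driver: investigate drift sources"
    return "population drift with ranking intact: check calibration on the new mix"
```

The helper deliberately stops before calibration and action: those steps depend on backtesting results and on the governance rules the institution has agreed in advance.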

interactive · population stability index

PSI — detecting score distribution drift

PSI compares the development distribution with the current distribution. A stable population gives a small PSI. A shifted or newly segmented population pushes PSI upward and changes which bins contribute most to the total.

[Interactive chart: score distribution by bin, development vs current, with the PSI contribution of each bin. Summary readouts: total PSI, traffic light, maximum bin PSI, and count of bins > 0.02.]

PSI = Σ_i (Actual_i − Expected_i) × ln(Actual_i / Expected_i), where Expected_i is the development share and Actual_i the current share of observations in bin i

Standard thresholds: PSI < 0.10 usually green, 0.10–0.25 investigate, ≥ 0.25 material shift. The total matters, but so does the composition: a concentrated tail shift can be more dangerous than a smooth overall drift.
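A minimal sketch of the PSI formula and thresholds above, working from bin counts. The bin counts are made-up illustration data, and the small floor applied to empty bins is an assumption added here to avoid taking the logarithm of zero; other conventions for empty bins exist.

```python
import math


def psi_by_bin(expected_counts, actual_counts, floor=1e-6):
    """Per-bin PSI contributions from development (expected) and current (actual) counts."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both distributions must use the same bins")
    exp_total = sum(expected_counts)
    act_total = sum(actual_counts)
    contributions = []
    for e, a in zip(expected_counts, actual_counts):
        # Convert counts to shares; floor guards against empty bins (assumption).
        e_share = max(e / exp_total, floor)
        a_share = max(a / act_total, floor)
        contributions.append((a_share - e_share) * math.log(a_share / e_share))
    return contributions


def psi_traffic_light(psi):
    """Thresholds as quoted on this page."""
    if psi < 0.10:
        return "green"
    if psi < 0.25:
        return "yellow"
    return "red"


# Hypothetical five-bin score distribution.
development = [100, 200, 400, 200, 100]
current = [60, 140, 360, 260, 180]

bins = psi_by_bin(development, current)
total_psi = sum(bins)
status = psi_traffic_light(total_psi)
worst_bin = max(range(len(bins)), key=bins.__getitem__)
```

Keeping the per-bin contributions rather than only the total is what makes it possible to see whether the drift is spread evenly or concentrated in one tail.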

interactive · characteristic stability index

CSI — which variables are actually shifting?

CSI applies the same stability logic to each input variable. This is where score-level drift turns into diagnosis: which variables are moving, and which of them are driving the score PSI?

[Interactive chart: CSI by variable, colour-coded as CSI < 0.10, 0.10–0.25, or ≥ 0.25.]

Root-cause mindset: score PSI tells you that something moved. CSI tells you what moved. That distinction is essential when deciding whether the issue is data quality, economic change, underwriting change, or model ageing.
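To illustrate the root-cause step, the sketch below applies the same stability formula to each input variable and ranks the variables by CSI. The variable names and bin shares are hypothetical, and the floor for empty bins is again an assumption.

```python
import math


def stability_index(dev_shares, cur_shares, floor=1e-6):
    """Same formula as PSI, applied to the binned shares of one variable."""
    total = 0.0
    for d, c in zip(dev_shares, cur_shares):
        d, c = max(d, floor), max(c, floor)
        total += (c - d) * math.log(c / d)
    return total


def rank_csi(variables):
    """Return (variable, CSI) pairs sorted from most to least shifted."""
    results = {name: stability_index(dev, cur) for name, (dev, cur) in variables.items()}
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


# Hypothetical inputs: variable -> (development shares, current shares).
inputs = {
    "utilisation": ([0.25, 0.25, 0.25, 0.25], [0.10, 0.20, 0.30, 0.40]),
    "income_band": ([0.30, 0.40, 0.30], [0.30, 0.40, 0.30]),
    "months_on_book": ([0.50, 0.50], [0.45, 0.55]),
}

ranking = rank_csi(inputs)
```

A ranking like this shows which variables moved, not why they moved; separating data quality from genuine economic or underwriting change still needs judgement.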

interactive · pd backtesting

Predicted PD vs observed default rate

This is the core calibration question in ongoing validation. Per-grade binomial tests show where the model is too optimistic or too conservative. Hosmer-Lemeshow gives an overall calibration view across the portfolio.

[Interactive chart: predicted PD vs observed default rate by grade, with traffic-light results by grade. Summary readouts: counts of green, yellow, and red grades, overall status, and the Hosmer-Lemeshow χ², p-value, and interpretation.]

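Binomial backtest: p-value = P(X ≥ d | n, PD), where X ~ Binomial(n, PD), n is the number of obligors in the grade, and d is the observed number of defaults
Hosmer-Lemeshow: χ² = Σ_g (O_g − E_g)² / (n_g × PD_g × (1 − PD_g)), where O_g is the observed number of defaults in grade g and E_g = n_g × PD_g is the expected number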

Traffic-light logic: green if p ≥ 0.05, yellow if 0.01 ≤ p < 0.05, red if p < 0.01. A few yellow grades may justify investigation; repeated reds point toward recalibration or stronger intervention.
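A minimal sketch of the per-grade binomial backtest and the traffic-light mapping above, using only the standard library. The grade data are hypothetical. The tail P(X ≥ d) is one-sided: it flags grades where the PD looks too optimistic, and a check for excessive conservatism would use the lower tail instead. The test also assumes independent defaults within a grade, which tends to overstate significance when defaults are correlated.

```python
import math


def binomial_pmf(k, n, pd):
    """P(X = k) for X ~ Binomial(n, pd), computed in log space for stability."""
    log_p = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
             + k * math.log(pd) + (n - k) * math.log1p(-pd))
    return math.exp(log_p)


def binomial_tail(d, n, pd):
    """One-sided p-value P(X >= d | n, pd); requires 0 < pd < 1."""
    if not 0.0 < pd < 1.0:
        raise ValueError("pd must lie strictly between 0 and 1")
    return min(1.0, sum(binomial_pmf(k, n, pd) for k in range(d, n + 1)))


def grade_light(p_value):
    """Traffic-light mapping as quoted on this page."""
    if p_value >= 0.05:
        return "green"
    if p_value >= 0.01:
        return "yellow"
    return "red"


def backtest(grades):
    """grades: name -> (obligors, predicted PD, observed defaults)."""
    results = {}
    for name, (n, pd, d) in grades.items():
        p = binomial_tail(d, n, pd)
        results[name] = (p, grade_light(p))
    return results


# Hypothetical rating grades.
portfolio = {
    "A": (2000, 0.005, 10),
    "B": (500, 0.02, 17),
    "C": (200, 0.05, 20),
}
results = backtest(portfolio)
```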

Important caveat: TTC and PIT calibration behave differently. A TTC model can appear conservative or optimistic in certain phases of the cycle without that automatically meaning model failure. Calibration philosophy matters.
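The Hosmer-Lemeshow statistic can be sketched in the same style, with rating grades used as the groups. Two assumptions are made here and should be checked against the institution's own methodology: the degrees of freedom default to the number of grades, a common choice for out-of-sample backtests where no parameters are re-estimated (in-sample fits often use the number of groups minus two), and the p-value comes from a hand-rolled chi-square tail for integer degrees of freedom rather than from a statistics library.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    # Start from Q(1, z) for even df or Q(1/2, z) for odd df,
    # then apply Q(a + 1, z) = Q(a, z) + z^a e^(-z) / Gamma(a + 1).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    while a < df / 2.0:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q


def hosmer_lemeshow(grades, df=None):
    """grades: list of (obligors, predicted PD, observed defaults) per grade."""
    stat = 0.0
    for n, pd, observed in grades:
        expected = n * pd
        stat += (observed - expected) ** 2 / (n * pd * (1.0 - pd))
    if df is None:
        df = len(grades)  # assumption: out-of-sample test, df = number of grades
    return stat, chi2_sf(stat, df)


# Hypothetical grades: the first one shows twice the expected defaults.
hl_grades = [(1000, 0.01, 20), (800, 0.02, 16), (500, 0.05, 25)]
hl_stat, hl_p = hosmer_lemeshow(hl_grades)
```

Because the statistic pools all grades, a single badly calibrated grade can drive the overall result, which is why it is read alongside the per-grade binomial results rather than instead of them.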

interactive · quarterly monitoring

Time-series monitoring dashboard

This is the portfolio view that monitoring committees actually need: how PSI, Gini, and observed default behaviour move quarter by quarter, and whether the pattern suggests stable performance, gradual decay, or structural break.

[Interactive chart: PSI and Gini over time with threshold view, plus a quarter-by-quarter table.]

Monitoring interpretation: a one-off bad quarter can be noise. A trend is much more important. Gradual Gini erosion with rising PSI is often more dangerous than a single visible shock because it is easier to ignore until it becomes structural.
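One simple way to separate a one-off reading from a trend is to count how many consecutive quarters a metric has moved in the same direction up to the latest observation. The sketch below does that for Gini and PSI. The three-quarter run length, the labels, and the quarterly series are all assumptions for illustration.

```python
def trailing_run(values, direction):
    """Number of consecutive most-recent moves in `direction` (+1 rising, -1 falling)."""
    run = 0
    for prev, cur in zip(reversed(values[:-1]), reversed(values[1:])):
        if (cur - prev) * direction > 0:
            run += 1
        else:
            break
    return run


def trend_flag(gini, psi, min_run=3):
    """Classify the recent path; min_run is a hypothetical persistence rule."""
    eroding = trailing_run(gini, -1) >= min_run
    drifting = trailing_run(psi, +1) >= min_run
    if eroding and drifting:
        return "gradual decay"
    if eroding or drifting:
        return "watch"
    return "no persistent trend"


# Hypothetical quarterly series, oldest first.
gini_path = [0.62, 0.61, 0.60, 0.58, 0.57]
psi_path = [0.03, 0.04, 0.06, 0.08, 0.11]
path_status = trend_flag(gini_path, psi_path)

# A single bad quarter followed by recovery does not trigger the flag.
shock_status = trend_flag([0.62, 0.62, 0.55, 0.61, 0.62], [0.03, 0.03, 0.09, 0.04, 0.03])
```

A run-length rule ignores the size of each move, so it works best alongside the level thresholds rather than as a replacement for them.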

reference

Monitoring metrics cheat sheet

Metric	What it checks	Green	Yellow	Red	Frequency
PSI	Score distribution stability	< 0.10	0.10–0.25	≥ 0.25	Quarterly
CSI	Input variable stability	< 0.10	0.10–0.25	≥ 0.25	Quarterly
Gini / KS change	Discrimination decay	small drift	moderate drift	material drop	Quarterly
Binomial test	Per-grade PD calibration	p ≥ 0.05	0.01–0.05	< 0.01	Annual
Hosmer-Lemeshow	Overall calibration	p ≥ 0.05	0.01–0.05	< 0.01	Annual
Observed DR trend	Realised performance drift	within range	near boundary	outside range	Quarterly

deeper concepts

Concepts every validator should keep

drift vs failure

Drift is not automatically model failure

A changing population can increase PSI without necessarily destroying model usefulness. The real question is whether ranking and calibration remain acceptable under the new data.

ttc vs pit

Calibration depends on philosophy

TTC models are not meant to chase every short-term cycle move. PIT models are. Backtesting must be interpreted against the intended design.

conservatism

Some overprediction is intentional

In regulated environments, a margin of conservatism means predicted PDs are often expected to sit somewhat above realised default rates on average.

low default portfolios

Classical tests lose power fast

When defaults are extremely rare, confidence intervals become very wide and annual backtests can become statistically weak. Interpretation must become more judgement-based.
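A short calculation shows how quickly the evidence thins out. Assuming independent defaults, the chance of seeing no defaults at all in n obligors is (1 − PD)^n, and after observing zero defaults the largest PD not rejected at confidence level γ is 1 − (1 − γ)^(1/n). The function names and the portfolio sizes below are illustrative assumptions.

```python
def prob_zero_defaults(n, pd):
    """Probability of observing no defaults among n independent obligors."""
    return (1.0 - pd) ** n


def zero_default_upper_bound(n, confidence=0.95):
    """Largest PD consistent with zero observed defaults at the given confidence."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


# With 500 obligors, zero defaults is a likely outcome whether the true PD
# is 0.1% or twice that, so an annual backtest can barely tell them apart.
p_if_correct = prob_zero_defaults(500, 0.001)
p_if_doubled = prob_zero_defaults(500, 0.002)

# The upper bound shrinks only slowly as the portfolio grows.
bound_100 = zero_default_upper_bound(100)
bound_1000 = zero_default_upper_bound(1000)
```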

monitoring trend

Single points matter less than paths

The direction and persistence of change often matter more than any single quarterly reading. Good monitoring is about trajectories, not only flags.

trigger governance

Define actions in advance

Green should mean routine monitoring, yellow should trigger investigation, and red should trigger formal escalation. A threshold without an action plan is weak governance.
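As an illustration of tying thresholds to predefined actions, the sketch below maps individual metric readings to a status using the boundaries from the cheat sheet and then selects an action. Taking the worst status across metrics as the overall status is an assumption made for this example; an institution may define its own aggregation rule.

```python
ACTIONS = {
    "green": "routine monitoring",
    "yellow": "investigation",
    "red": "formal escalation",
}
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def index_status(value):
    """PSI or CSI reading, using the cheat-sheet boundaries."""
    if value < 0.10:
        return "green"
    if value < 0.25:
        return "yellow"
    return "red"


def pvalue_status(p_value):
    """Binomial or Hosmer-Lemeshow p-value, using the cheat-sheet boundaries."""
    if p_value >= 0.05:
        return "green"
    if p_value >= 0.01:
        return "yellow"
    return "red"


def required_action(statuses):
    """Worst status wins (assumption), then look up the predefined action."""
    worst = max(statuses, key=SEVERITY.__getitem__)
    return worst, ACTIONS[worst]


# Hypothetical cycle: PSI 0.07, one CSI at 0.18, Hosmer-Lemeshow p-value 0.30.
cycle = [index_status(0.07), index_status(0.18), pvalue_status(0.30)]
overall, action = required_action(cycle)
```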

summary

What to leave this page with

Model monitoring is not one number. PSI and CSI tell you whether the environment has changed. Discrimination tells you whether the ranking still works. Backtesting tells you whether predicted PD levels still match reality.

The useful order is: first detect drift, then assess ranking, then test calibration, then decide whether the right response is monitoring, investigation, recalibration, or redevelopment.

Once that structure is clear, ongoing validation becomes a disciplined monitoring system rather than a collection of disconnected indicators.