Model Discrimination & Performance
This is where validation usually gets concrete. Can the model separate good borrowers from bad ones? ROC, AUC, Gini, KS, and CAP all answer that question from different angles.
How discrimination metrics connect
ROC & AUC
ROC traces the trade-off between True Positive Rate and False Positive Rate across all possible thresholds.
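AUC is the area under that curve. It is a threshold-independent summary of rank-ordering power: the probability that a randomly chosen defaulter is scored as riskier than a randomly chosen non-defaulter.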
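One way to see why AUC measures rank-ordering is to compute it as the share of bad/good pairs in which the bad is scored as riskier. The sketch below is a minimal illustration, not a production routine. It assumes higher scores mean higher risk and labels of 1 for default and 0 for non-default; the function name `auc` and the toy data are hypothetical.

```python
def auc(scores, labels):
    """Share of (bad, good) pairs where the bad has the higher score.

    Assumes higher score = riskier and label 1 = default (bad).
    Tied pairs count as half a win.
    """
    bads = [s for s, y in zip(scores, labels) if y == 1]
    goods = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for b in bads:
        for g in goods:
            if b > g:
                wins += 1.0
            elif b == g:
                wins += 0.5
    return wins / (len(bads) * len(goods))


# Toy data: eight borrowers, four of whom defaulted.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

print(auc(scores, labels))  # 12 of 16 pairs are ordered correctly -> 0.75
```

The pairwise loop is quadratic in sample size, so it suits small illustrations; on real portfolios a rank-based implementation from an established library would be the usual choice.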
Gini & Lorenz / CAP
Gini is the same ranking idea expressed on a different scale: Gini = 2 × AUC − 1, so 0 corresponds to random ranking and 1 to perfect ranking. The Lorenz or CAP curve shows how quickly defaults are captured as you move through the population sorted by score, riskiest first.
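The CAP view can be sketched directly: sort by score, accumulate captured defaults, and compare the area against a random and a perfect model. This is an illustrative sketch under stated assumptions (higher score = riskier, label 1 = default, no tied scores). The helper names `cap_curve`, `area_under`, and `accuracy_ratio` are hypothetical, as is the toy data.

```python
def cap_curve(scores, labels):
    """Points (share of population, share of bads captured), riskiest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_bads = sum(labels)
    points, captured = [(0.0, 0.0)], 0
    for rank, i in enumerate(order, start=1):
        captured += labels[i]
        points.append((rank / len(scores), captured / total_bads))
    return points


def area_under(points):
    """Trapezoidal area under a piecewise-linear curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def accuracy_ratio(scores, labels):
    """Area between CAP and the random diagonal, relative to a perfect model."""
    bad_rate = sum(labels) / len(labels)
    model_area = area_under(cap_curve(scores, labels)) - 0.5
    # A perfect model captures all bads within the first `bad_rate` share.
    perfect_area = (1.0 - bad_rate / 2.0) - 0.5
    return model_area / perfect_area


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

curve = cap_curve(scores, labels)
print(curve[2])                        # riskiest 25% holds 50% of bads -> (0.25, 0.5)
print(accuracy_ratio(scores, labels))  # 0.5, the same as 2 * AUC - 1 with AUC = 0.75
```

With tied scores the per-observation CAP points depend on sort order, so a tie-aware version would group equal scores before accumulating.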
KS Statistic
KS is the maximum gap between the cumulative distributions of defaults and non-defaults.
It identifies the single score region where the model separates best.
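A minimal sketch of the KS calculation follows: for each candidate cut-off, compare the cumulative share of goods and of bads at or below that score, and keep the largest gap. It assumes higher scores mean higher risk and label 1 means default; `ks_statistic` and the toy data are hypothetical names and values used only for illustration.

```python
def ks_statistic(scores, labels):
    """Return (largest gap, score at which it first occurs).

    The gap is the absolute difference between the cumulative share of
    goods and the cumulative share of bads with score <= cut-off.
    """
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    best_gap, best_score = 0.0, None
    for cut in sorted(set(scores)):
        cum_bad = sum(1 for s, y in zip(scores, labels)
                      if y == 1 and s <= cut) / n_bad
        cum_good = sum(1 for s, y in zip(scores, labels)
                       if y == 0 and s <= cut) / n_good
        gap = abs(cum_good - cum_bad)
        if gap > best_gap:
            best_gap, best_score = gap, cut
    return best_gap, best_score


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

print(ks_statistic(scores, labels))  # -> (0.5, 0.4)
```

In this toy sample the same gap of 0.5 recurs at a score of 0.7, which is a reminder that the reported KS location can be unstable on small samples.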
A useful order for learning discrimination
Start with rank-ordering, not thresholds
AUC and Gini tell you whether the model ranks bads above goods in general. This is the cleanest starting point.
Then move to threshold trade-offs
Precision, recall, specificity, and F1 all depend on where you cut the score. They are operational decisions, not pure model properties.
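To make the threshold dependence concrete, the sketch below scores the same toy portfolio at two cut-offs. It assumes that a score at or above the threshold is flagged as a predicted default; the function name `threshold_metrics` and the data are hypothetical.

```python
def threshold_metrics(scores, labels, threshold):
    """Confusion-matrix metrics when score >= threshold is flagged as bad."""
    flagged = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f == 1 and y == 1)
    fp = sum(1 for f, y in zip(flagged, labels) if f == 1 and y == 0)
    fn = sum(1 for f, y in zip(flagged, labels) if f == 0 and y == 1)
    tn = sum(1 for f, y in zip(flagged, labels) if f == 0 and y == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

loose = threshold_metrics(scores, labels, threshold=0.5)
strict = threshold_metrics(scores, labels, threshold=0.75)
print(loose["precision"], loose["recall"])    # -> 0.75 0.75
print(strict["precision"], strict["recall"])  # -> 1.0 0.5
```

The scores never change between the two calls; only the cut-off does. Raising it trades recall for precision, which is why these figures describe an operating decision rather than the model alone.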
Then isolate local separation
KS is useful because it highlights the region where the score distributions are maximally separated.
Then compare across time and models
A challenger beats a champion only if the improvement is stable, meaningful, and consistent across samples.
One model, all discrimination metrics
ROC, Gini/CAP, KS, and the confusion matrix are all derived from the same underlying score distribution. Changing model quality moves all of them; changing the classification threshold moves only the operating point and the confusion matrix.
Compare two models side by side
This is the shape of a real validation question: does the challenger materially outperform the champion, or does it only look marginally better on one statistic?
[Figure: ROC overlay of champion and challenger for a selected scenario]
Discrimination metrics compared
| Metric | Range | What it measures | Strength | Limitation |
|---|---|---|---|---|
| AUC | [0, 1.0]; 0.5 = random | Overall ranking power across thresholds | Threshold-independent | Does not measure calibration |
| Gini | [-1.0, 1.0]; 0 = random | Same ranking power, rescaled as 2 × AUC − 1 | Industry-standard in credit risk | Mathematically redundant with AUC |
| KS | [0, 1.0] | Maximum separation at one score point | Operationally intuitive | Summarises a single point only |
| CAP / Lorenz | [0, 1.0] | % of bads captured by top % of population | Portfolio prioritisation view | Equivalent in spirit to Gini |
| Precision | [0, 1.0] | How clean positive predictions are | Business-action relevance | Threshold-dependent |
| Recall | [0, 1.0] | How many actual bads are caught | Miss-risk awareness | Threshold-dependent |
| F1 | [0, 1.0] | Balance of precision and recall | Useful for imbalanced classes | Does not account for true negatives |
Concepts every validator should keep
Discrimination and calibration are different jobs
A model can rank-order perfectly and still predict wrong PD levels. Good discrimination does not mean good calibration.
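A small sketch shows the gap between the two properties: rescaling every PD leaves the ranking, and therefore AUC, untouched while the predicted level drifts away from the observed default rate. The data, the halving factor, and the helper `auc` are hypothetical choices made only for illustration.

```python
def auc(scores, labels):
    """Share of (bad, good) pairs ranked correctly; ties count as half."""
    bads = [s for s, y in zip(scores, labels) if y == 1]
    goods = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((b > g) + 0.5 * (b == g) for b in bads for g in goods)
    return wins / (len(bads) * len(goods))


labels = [1, 1, 0, 1, 0, 0, 1, 0]
pd_model = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
pd_halved = [p / 2 for p in pd_model]  # same ordering, different level

observed_rate = sum(labels) / len(labels)
mean_model = sum(pd_model) / len(pd_model)
mean_halved = sum(pd_halved) / len(pd_halved)

print(auc(pd_model, labels), auc(pd_halved, labels))  # identical: 0.75 0.75
print(observed_rate)            # 0.5
print(round(mean_model, 4))     # 0.5, in line with the observed rate
print(round(mean_halved, 4))    # 0.25, half the observed rate
```

Both versions discriminate equally well, yet the halved PDs would understate expected defaults by a factor of two, which no ranking metric can detect.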
Accuracy can mislead badly
In low-default portfolios, a model that predicts “non-default” for almost everyone can look accurate while being operationally useless.
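The arithmetic is easy to check. The sketch below uses a hypothetical portfolio of 1,000 loans with a 2% default rate and a degenerate model that never predicts default; the figures are invented for illustration.

```python
# Hypothetical low-default portfolio: 20 defaults among 1,000 loans.
labels = [1] * 20 + [0] * 980

# A degenerate "model" that predicts non-default for everyone.
predictions = [0] * len(labels)

correct = sum(1 for p, y in zip(predictions, labels) if p == y)
accuracy = correct / len(labels)

caught = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
recall = caught / sum(labels)

print(accuracy)  # -> 0.98
print(recall)    # -> 0.0
```

An accuracy of 98% here reflects only the class balance: not a single defaulter is identified.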
Track metric decay over time
Falling Gini or AUC can indicate model degradation, but that conclusion holds only once true model decay has been separated from population drift.
Discrimination drops need context
If PSI is high and Gini falls, population shift may be driving the change. If PSI is low and Gini falls, the model itself is under suspicion.
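The reasoning above can be sketched as a PSI calculation plus a simple triage rule. PSI is computed here as the sum over score buckets of (actual − expected) × ln(actual / expected). The cut-offs `psi_cutoff=0.25` and `gini_drop_cutoff=0.05` are illustrative assumptions, not values given in the text, and the function names and bucket shares are hypothetical.

```python
import math


def psi(expected_shares, actual_shares, floor=1e-6):
    """Population Stability Index over matching score buckets.

    Shares are floored to avoid log(0) when a bucket is empty.
    """
    total = 0.0
    for e, a in zip(expected_shares, actual_shares):
        e, a = max(e, floor), max(a, floor)
        total += (a - e) * math.log(a / e)
    return total


def triage(psi_value, gini_drop, psi_cutoff=0.25, gini_drop_cutoff=0.05):
    """Rough first reading of a Gini fall; both cut-offs are assumptions."""
    if gini_drop < gini_drop_cutoff:
        return "no material drop in Gini"
    if psi_value >= psi_cutoff:
        return "population shift is a likely driver"
    return "model itself is under suspicion"


development = [0.25, 0.25, 0.25, 0.25]  # bucket shares at development
recent = [0.10, 0.20, 0.30, 0.40]       # bucket shares in a recent sample

print(psi(development, development))    # identical distributions -> 0.0
print(round(psi(development, recent), 3))
print(triage(psi_value=0.40, gini_drop=0.08))
print(triage(psi_value=0.02, gini_drop=0.08))
```

A rule like this only directs the investigation. A high PSI does not rule out model decay, so the two explanations still need to be tested separately.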
Portfolio averages can hide weakness
A strong aggregate Gini can still conceal a segment where the model barely separates risk. Segment-level breakdowns matter.
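A toy example makes the masking effect visible. The sketch below assumes two hypothetical segments, "A" and "B", with invented scores and outcomes; `gini` is an illustrative helper built on the pairwise definition of AUC.

```python
def gini(scores, labels):
    """Gini = 2 * AUC - 1, with AUC computed over all (bad, good) pairs."""
    bads = [s for s, y in zip(scores, labels) if y == 1]
    goods = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((b > g) + 0.5 * (b == g) for b in bads for g in goods)
    return 2.0 * wins / (len(bads) * len(goods)) - 1.0


# (segment, score, default flag); higher score = riskier.
records = [
    ("A", 0.9, 1), ("A", 0.8, 1), ("A", 0.2, 0), ("A", 0.1, 0),
    ("B", 0.6, 1), ("B", 0.5, 0), ("B", 0.4, 0), ("B", 0.3, 1),
]

overall = gini([s for _, s, _ in records], [y for _, _, y in records])

by_segment = {}
for seg in sorted({r[0] for r in records}):
    rows = [r for r in records if r[0] == seg]
    by_segment[seg] = gini([s for _, s, _ in rows], [y for _, _, y in rows])

print(overall)     # -> 0.75
print(by_segment)  # -> {'A': 1.0, 'B': 0.0}
```

The portfolio-level Gini of 0.75 looks healthy, yet within segment B the model ranks no better than chance.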
Not every delta is real
An AUC of 0.78 versus 0.76 may or may not be meaningful. Bootstrap confidence intervals or formal AUC comparison tests, such as DeLong's test, are needed before making strong claims.
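One way to put an interval around an AUC difference is a paired bootstrap: resample borrowers with replacement, score both models on the same resample, and collect the differences. The sketch below is a minimal illustration on synthetic data. The data-generating choices, the function name `bootstrap_auc_diff`, the 300 resamples, and the percentile method are all assumptions, not a prescribed procedure.

```python
import random


def auc(scores, labels):
    """Share of (bad, good) pairs ranked correctly; ties count as half."""
    bads = [s for s, y in zip(scores, labels) if y == 1]
    goods = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((b > g) + 0.5 * (b == g) for b in bads for g in goods)
    return wins / (len(bads) * len(goods))


def bootstrap_auc_diff(champion, challenger, labels, n_resamples=300, seed=0):
    """95% percentile interval for AUC(challenger) - AUC(champion)."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if sum(y) in (0, n):  # skip resamples with only one class
            continue
        diffs.append(auc([challenger[i] for i in idx], y)
                     - auc([champion[i] for i in idx], y))
    diffs.sort()
    lower = diffs[int(0.025 * len(diffs))]
    upper = diffs[min(len(diffs) - 1, int(0.975 * len(diffs)))]
    return lower, upper


# Synthetic portfolio: the challenger is built with slightly more signal.
rng = random.Random(42)
labels = [1 if rng.random() < 0.3 else 0 for _ in range(150)]
champion = [0.8 * y + rng.gauss(0, 1) for y in labels]
challenger = [1.0 * y + rng.gauss(0, 1) for y in labels]

point = auc(challenger, labels) - auc(champion, labels)
lower, upper = bootstrap_auc_diff(champion, challenger, labels)
print(f"AUC difference {point:+.3f}, 95% interval [{lower:+.3f}, {upper:+.3f}]")
```

If the interval comfortably includes zero, the observed uplift is not yet evidence that the challenger is better. Resampling both models on the same borrowers keeps the comparison paired, which matters because their AUC estimates are correlated.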
What to leave this page with
ROC, AUC, Gini, KS, and CAP are not isolated metrics. They are different windows onto the same question: how well the model separates good from bad.
The useful order is: first understand threshold-independent ranking, then threshold-based trade-offs, then local separation, then champion-challenger comparison.
Once that clicks, discrimination reporting stops being metric memorisation and becomes a coherent validation story.