notes · 02

Model Discrimination & Performance

This is where validation usually gets concrete. Can the model separate good borrowers from bad ones? ROC, AUC, Gini, KS, and CAP all answer that question from different angles.

Start with how the metrics connect, then explore one model across all curves, then compare champion and challenger models. The goal is to make discrimination metrics feel like one system, not five separate formulas.


the landscape

How discrimination metrics connect

ROC & AUC

ROC traces the trade-off between True Positive Rate and False Positive Rate across all possible thresholds.

AUC is the area under that curve. It is the threshold-independent summary of rank-ordering power.

AUC = P(score_default > score_non-default)
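The probability reading of AUC can be checked by brute force on a handful of scores. This is a minimal sketch, assuming a higher score means higher risk and counting ties as half a win; the function name and the scores are hypothetical.

```python
from itertools import product

def pairwise_auc(default_scores, non_default_scores):
    """Share of (default, non-default) pairs in which the default scores higher.

    Ties count as half a correctly ordered pair.
    """
    wins = 0.0
    for d, n in product(default_scores, non_default_scores):
        if d > n:
            wins += 1.0
        elif d == n:
            wins += 0.5
    return wins / (len(default_scores) * len(non_default_scores))

# Hypothetical scores: 3 defaults, 5 non-defaults.
bad = [0.9, 0.7, 0.4]
good = [0.8, 0.5, 0.3, 0.2, 0.1]

auc = pairwise_auc(bad, good)
print(f"AUC = {auc:.3f}")  # 12 of 15 pairs are ordered correctly: 0.800
```

The double loop is quadratic in sample size, so it is only meant to make the definition tangible, not to score a real portfolio.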

Gini & Lorenz / CAP

Gini is the same ranking idea expressed on a different scale. The Lorenz or CAP curve shows how quickly defaults are captured as you move through the population sorted by score.

Gini = 2 × AUC − 1
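The link between the CAP curve and Gini can be seen by computing the accuracy ratio: the area between the model CAP and the random line, divided by the area between the perfect CAP and the random line. The sketch below assumes distinct scores, labels coded 1 = default, and a higher score meaning higher risk; all names and data are hypothetical. On this toy sample 12 of 15 default/non-default pairs are ordered correctly, so AUC is 0.8 and 2 × 0.8 − 1 = 0.6.

```python
def cap_curve(scores, labels):
    """CAP points: (share of population reviewed, share of defaults captured).

    Observations are ranked from riskiest to safest. Assumes no tied scores.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n, n_bad = len(ranked), sum(labels)
    points, captured = [(0.0, 0.0)], 0
    for i, (_, label) in enumerate(ranked, start=1):
        captured += label
        points.append((i / n, captured / n_bad))
    return points

def area(points):
    """Trapezoidal area under a piecewise-linear curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def accuracy_ratio(scores, labels):
    """(model CAP area - random area) / (perfect CAP area - random area)."""
    bad_rate = sum(labels) / len(labels)
    model_area = area(cap_curve(scores, labels))
    perfect_area = 1 - bad_rate / 2  # perfect model captures all bads first
    return (model_area - 0.5) / (perfect_area - 0.5)

scores = [0.9, 0.8, 0.7, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 1, 0, 0, 0]

ar = accuracy_ratio(scores, labels)
print(f"accuracy ratio = {ar:.3f}")  # matches 2 * AUC - 1 = 0.600
```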

KS Statistic

KS is the maximum gap between the cumulative distributions of defaults and non-defaults.

It identifies the single score region where the model separates best.

KS = max |F_bad(s) − F_good(s)|
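The KS formula can be evaluated directly from the two empirical CDFs. This is a small sketch on hypothetical scores; the function name is an assumption, and the quadratic scan is chosen for readability rather than speed.

```python
def ks_statistic(bad_scores, good_scores):
    """Largest gap between the empirical score CDFs, and the score where it occurs."""
    best_gap, best_score = 0.0, None
    for s in sorted(set(bad_scores) | set(good_scores)):
        f_bad = sum(x <= s for x in bad_scores) / len(bad_scores)
        f_good = sum(x <= s for x in good_scores) / len(good_scores)
        gap = abs(f_bad - f_good)
        if gap > best_gap:
            best_gap, best_score = gap, s
    return best_gap, best_score

# Hypothetical scores: higher means riskier.
bad = [0.9, 0.7, 0.4]
good = [0.8, 0.5, 0.3, 0.2, 0.1]

ks, at_score = ks_statistic(bad, good)
print(f"KS = {ks:.3f} at score {at_score}")  # KS = 0.600 at score 0.3
```

At a score of 0.3, 60% of non-defaults but none of the defaults sit at or below the cut, which is the widest gap in this sample.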

The key connection: these metrics are not competing philosophies. They all describe the same underlying question: how well the model ranks risky observations above safe ones.

learning sequence

A useful order for learning discrimination

Start with rank-ordering, not thresholds

AUC and Gini tell you whether the model ranks bads above goods in general. This is the cleanest starting point.

Then move to threshold trade-offs

Precision, recall, specificity, and F1 all depend on where you cut the score. They are operational decisions, not pure model properties.
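To see how these metrics move with the cut-off, the sketch below scores the same hypothetical sample at two thresholds. It assumes observations with a score at or above the threshold are flagged as defaults; the function name and data are illustrative.

```python
def threshold_metrics(scores, labels, threshold):
    """Confusion-matrix metrics when score >= threshold is flagged as default."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp else 0.0,
    }

scores = [0.9, 0.8, 0.7, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 1, 0, 0, 0]

strict = threshold_metrics(scores, labels, 0.75)
loose = threshold_metrics(scores, labels, 0.45)
for name, m in (("threshold 0.75", strict), ("threshold 0.45", loose)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

The scores never change between the two calls. Only the cut moves, and recall rises from 1/3 to 2/3 while specificity falls from 0.8 to 0.6.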

Then isolate local separation

KS is useful because it highlights the region where the score distributions are maximally separated.

Then compare across time and models

A challenger beats a champion only if the improvement is stable, meaningful, and consistent across samples.

interactive · main simulator

One model, all discrimination metrics

Choose a model quality scenario and move the classification threshold. ROC, Gini/CAP, KS, and the confusion matrix all update from the same underlying score distribution.

Simulator panels: ROC curve (model ROC vs. random diagonal); Lorenz / CAP curve (model, perfect, random); KS plot (CDF of non-defaults, CDF of defaults, KS point).

Controls: scenario and classification threshold (default 0.50). Readouts: AUC, Gini, KS and the score at which it occurs, confusion matrix, accuracy, precision, recall / TPR, specificity, F1 score, FPR.

Gini = 2 × AUC − 1
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
KS = max(TPR − FPR)
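The four formulas above can all be read off one threshold sweep. The sketch below builds ROC points by lowering the threshold one observation at a time, then derives AUC by the trapezoid rule, Gini from AUC, and KS as the largest TPR − FPR gap. It assumes distinct scores, labels coded 1 = default, and a higher score meaning higher risk; names and data are hypothetical.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs from sweeping the threshold from high to low."""
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_good, tp / n_bad))
    return points

scores = [0.9, 0.8, 0.7, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 1, 0, 0, 0]

points = roc_points(scores, labels)

# Area under the ROC curve by the trapezoid rule.
auc = sum((x1 - x0) * (y0 + y1) / 2
          for (x0, y0), (x1, y1) in zip(points, points[1:]))
gini = 2 * auc - 1
ks = max(tpr - fpr for fpr, tpr in points)

print(f"AUC = {auc:.3f}, Gini = {gini:.3f}, KS = {ks:.3f}")
# AUC = 0.800, Gini = 0.600, KS = 0.600
```

AUC and Gini summarise the whole list of points, while KS picks out a single one of them.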

Threshold trade-off: a lower threshold catches more bads but also raises more false alarms. ROC summarises this trade-off over all thresholds, which is why AUC is threshold-independent.

Typical benchmark thinking: AUC above ~0.70 or Gini above ~0.40 is often considered acceptable in practice, but realistic targets depend heavily on portfolio type and default rate.

interactive · champion vs challenger

Compare two models side by side

This is the shape of a real validation question: does the challenger materially outperform the champion, or does it only look marginally better on one statistic?

Comparison panel: ROC overlay (champion, challenger, random) with a scenario selector. Readouts: AUC and Gini for each model, plus ΔAUC and ΔGini.

Validation language: a challenger recommendation is strongest when the improvement shows up not only in AUC/Gini but also across out-of-time samples, segments, and operationally relevant thresholds.

Crossover caution: if ROC curves cross, AUC alone can hide the operational story. One model may be better in the exact threshold region the business actually uses.
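A crossover is easy to illustrate with two hand-made ROC curves. The points below are hypothetical, not taken from any real model: the challenger has the larger area, yet the champion catches more defaults at a 10% false positive rate. The helper names are assumptions for this sketch.

```python
def trapezoid_auc(points):
    """Area under a piecewise-linear ROC curve given as (FPR, TPR) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def tpr_at_fpr(points, fpr):
    """TPR at a given FPR, by linear interpolation along the curve."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= fpr <= x1 and x1 > x0:
            return y0 + (y1 - y0) * (fpr - x0) / (x1 - x0)
    raise ValueError("fpr lies outside the curve's range")

# Hypothetical curves that cross between FPR 0.1 and FPR 0.4.
champion_roc = [(0.0, 0.0), (0.1, 0.60), (0.5, 0.75), (1.0, 1.0)]
challenger_roc = [(0.0, 0.0), (0.1, 0.35), (0.4, 0.90), (1.0, 1.0)]

operating_fpr = 0.10  # assumed business tolerance for false alarms
for name, curve in (("champion", champion_roc), ("challenger", challenger_roc)):
    print(f"{name}: AUC = {trapezoid_auc(curve):.4f}, "
          f"TPR at FPR {operating_fpr:.2f} = {tpr_at_fpr(curve, operating_fpr):.2f}")
```

If the business operates near that 10% region, the model with the lower AUC is the better choice in this toy setup.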

reference

Discrimination metrics compared

Metric	Range	What it measures	Strength	Limitation
AUC	[0, 1.0]; 0.5 = random	Overall ranking power across thresholds	Threshold-independent	Does not measure calibration
Gini	[−1.0, 1.0]; 0 = random	Same ranking power, rescaled	Industry-standard in credit risk	Mathematically redundant with AUC
KS	[0, 1.0]	Maximum separation at one score point	Operationally intuitive	Only a one-point summary
CAP / Lorenz	[0, 1.0] on both axes	% of bads captured by top % of population	Portfolio prioritisation view	Equivalent in spirit to Gini
Precision	[0, 1.0]	How clean positive predictions are	Business-action relevance	Threshold-dependent
Recall	[0, 1.0]	How many actual bads are caught	Miss-risk awareness	Threshold-dependent
F1	[0, 1.0]	Balance of precision and recall	Useful for imbalanced classes	Threshold-dependent; ignores true negatives

deeper concepts

Concepts every validator should keep

discrimination vs calibration

They are different jobs

A model can rank-order perfectly and still predict wrong PD levels. Good discrimination does not mean good calibration.
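The split between the two jobs shows up as soon as PDs are rescaled. In the hypothetical sketch below, multiplying every PD by 3 leaves the ranking, and therefore the AUC, unchanged, while the average predicted PD moves from far below the observed default rate to close to it. All numbers and names are illustrative.

```python
def auc(scores, labels):
    """Pairwise AUC: share of (default, non-default) pairs ranked correctly."""
    bad = [s for s, y in zip(scores, labels) if y == 1]
    good = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((b > g) + 0.5 * (b == g) for b in bad for g in good)
    return wins / (len(bad) * len(good))

labels = [1, 0, 1, 0, 1, 0, 0, 0]
raw_pd = [0.30, 0.25, 0.20, 0.10, 0.08, 0.05, 0.03, 0.02]
rescaled_pd = [3 * p for p in raw_pd]  # monotone transform: same ranking

observed_rate = sum(labels) / len(labels)
for name, pds in (("raw", raw_pd), ("rescaled", rescaled_pd)):
    mean_pd = sum(pds) / len(pds)
    print(f"{name}: AUC = {auc(pds, labels):.3f}, mean PD = {mean_pd:.3f}, "
          f"observed default rate = {observed_rate:.3f}")
```

Both versions have identical discrimination. Only the second has PD levels in line with the observed rate.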

imbalanced data

Accuracy can mislead badly

In low-default portfolios, a model that predicts “non-default” for almost everyone can look accurate while being operationally useless.
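A quick arithmetic check makes the point. The portfolio below is hypothetical: 1,000 borrowers with a 2% default rate, scored by a "model" that never predicts default.

```python
# Hypothetical low-default portfolio: 20 defaults among 1,000 borrowers.
labels = [1] * 20 + [0] * 980
predictions = [0] * 1000  # always predict "non-default"

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = caught / sum(labels)

print(f"accuracy = {accuracy:.2%}, recall = {recall:.2%}")
# accuracy = 98.00%, recall = 0.00%
```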

monitoring

Track metric decay over time

Falling Gini or AUC can indicate model degradation, but that conclusion holds only after true model decay has been separated from population drift.

PSI link

Discrimination drops need context

If PSI is high and Gini falls, population shift may be driving the change. If PSI is low and Gini falls, the model itself is under suspicion.
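For reference, PSI compares the share of the population in each score band between a baseline and a recent sample: the sum over bands of (actual − expected) × ln(actual / expected). The sketch below assumes the shares are already bucketed and sum to 1; the small floor that guards against empty bands is an assumption of this example, not a standard value, and the band shares are hypothetical.

```python
import math

def psi(expected_shares, actual_shares, floor=1e-6):
    """Population Stability Index over pre-computed score-band shares."""
    total = 0.0
    for e, a in zip(expected_shares, actual_shares):
        e, a = max(e, floor), max(a, floor)  # avoid log(0) for empty bands
        total += (a - e) * math.log(a / e)
    return total

development = [0.25, 0.25, 0.25, 0.25]  # hypothetical baseline band shares
stable = [0.25, 0.25, 0.25, 0.25]
shifted = [0.10, 0.20, 0.30, 0.40]

print(f"PSI, stable population:  {psi(development, stable):.4f}")
print(f"PSI, shifted population: {psi(development, shifted):.4f}")
```

Reading this number next to the Gini trend is what separates a population shift from a model that has stopped ranking well.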

segment view

Portfolio averages can hide weakness

A strong aggregate Gini can still conceal a segment where the model barely separates risk. Segment-level breakdowns matter.
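A segment breakdown can be sketched in a few lines. The records and segment names below are hypothetical and chosen so that a respectable aggregate Gini hides one segment with no separation at all.

```python
from collections import defaultdict

def gini(scores, labels):
    """Gini = 2 * AUC - 1, with AUC computed over all bad/good pairs."""
    bad = [s for s, y in zip(scores, labels) if y == 1]
    good = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((b > g) + 0.5 * (b == g) for b in bad for g in good)
    return 2 * wins / (len(bad) * len(good)) - 1

# Hypothetical (segment, score, default flag) records.
records = [
    ("retail", 0.90, 1), ("retail", 0.80, 1),
    ("retail", 0.35, 0), ("retail", 0.20, 0), ("retail", 0.10, 0),
    ("sme", 0.50, 1), ("sme", 0.30, 1),
    ("sme", 0.60, 0), ("sme", 0.40, 0), ("sme", 0.20, 0),
]

overall = gini([s for _, s, _ in records], [y for _, _, y in records])

grouped = defaultdict(lambda: ([], []))
for segment, score, label in records:
    grouped[segment][0].append(score)
    grouped[segment][1].append(label)
by_segment = {seg: gini(s, y) for seg, (s, y) in grouped.items()}

print(f"overall Gini = {overall:.3f}")
for seg, value in by_segment.items():
    print(f"  {seg}: Gini = {value:.3f}")
```

Here the aggregate Gini is about 0.67, while the "sme" segment sits at 0, no better than random.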

statistical significance

Not every delta is real

AUC 0.78 vs 0.76 may or may not be meaningful. Bootstrap confidence intervals or formal AUC comparison tests are needed before strong claims.
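One way to put an interval around an AUC difference is a paired percentile bootstrap: resample observations with replacement, recompute both AUCs on each resample, and read off the 2.5th and 97.5th percentiles of the differences. The sketch below runs on synthetic data generated only for illustration; the function names, sample size, and number of resamples are assumptions.

```python
import random

def auc(scores, labels):
    """Pairwise AUC with ties counted as half."""
    bad = [s for s, y in zip(scores, labels) if y == 1]
    good = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((b > g) + 0.5 * (b == g) for b in bad for g in good)
    return wins / (len(bad) * len(good))

def bootstrap_delta_auc(champion, challenger, labels, n_boot=500, seed=0):
    """Approximate 95% percentile interval for AUC(challenger) - AUC(champion)."""
    rng = random.Random(seed)
    n = len(labels)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if 0 < sum(y) < n:  # skip resamples that lack one of the classes
            delta = (auc([challenger[i] for i in idx], y)
                     - auc([champion[i] for i in idx], y))
            deltas.append(delta)
    deltas.sort()
    return deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot) - 1]

# Synthetic data: 18 defaults, 42 non-defaults, noisy scores.
rng = random.Random(42)
labels = [1] * 18 + [0] * 42
champion = [0.8 * y + rng.gauss(0, 1) for y in labels]
challenger = [1.2 * y + rng.gauss(0, 1) for y in labels]

point = auc(challenger, labels) - auc(champion, labels)
lo, hi = bootstrap_delta_auc(champion, challenger, labels)
print(f"delta AUC = {point:.3f}, approx. 95% interval = [{lo:.3f}, {hi:.3f}]")
```

If the interval comfortably includes zero, the observed delta is not clearly distinguishable from sampling noise on this sample.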

summary

What to leave this page with

ROC, AUC, Gini, KS, and CAP are not isolated metrics. They are different windows onto the same question: how well the model separates good from bad.

The useful order is: first understand threshold-independent ranking, then threshold-based trade-offs, then local separation, then champion-challenger comparison.

Once that clicks, discrimination reporting stops being metric memorisation and becomes a coherent validation story.