notes · 02

Model Discrimination & Performance

This is where validation usually gets concrete. Can the model separate good borrowers from bad ones? ROC, AUC, Gini, KS, and CAP all answer that question from different angles.

Start with how the metrics connect, then explore one model across all curves, then compare champion and challenger models. The goal is to make discrimination metrics feel like one system, not five separate formulas.

How discrimination metrics connect

ROC & AUC

ROC traces the trade-off between True Positive Rate and False Positive Rate across all possible thresholds.

AUC is the area under that curve. It is the threshold-independent summary of rank-ordering power.

AUC = P(score_default > score_non-default)
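
As a minimal sketch of this probabilistic reading, the snippet below counts, over every default / non-default pair, how often the default carries the higher score. The function name `auc_pairwise`, the convention that 1 marks a default and that a higher score means higher risk, and the eight-borrower toy data are all illustrative assumptions, not part of the original page.

```python
def auc_pairwise(scores, labels):
    """AUC as P(score of a random default > score of a random non-default).

    labels: 1 = default, 0 = non-default (assumed convention).
    Ties count as half a win. The double loop is O(n_bad * n_good),
    which is fine for illustration but slow on large samples.
    """
    bad = [s for s, y in zip(scores, labels) if y == 1]
    good = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for b in bad:
        for g in good:
            if b > g:
                wins += 1.0
            elif b == g:
                wins += 0.5
    return wins / (len(bad) * len(good))


# Hypothetical toy portfolio: 4 defaults, 4 non-defaults.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

auc = auc_pairwise(scores, labels)
print(auc)  # 12 of the 16 pairs are ordered correctly -> 0.75
```

No threshold appears anywhere in the calculation, which is what "threshold-independent" means in practice.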

Gini & Lorenz / CAP

Gini is the same ranking idea expressed on a different scale. The Lorenz or CAP curve shows how quickly defaults are captured as you move through the population sorted by score.

Gini = 2 × AUC − 1
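
The identity can be checked numerically. The sketch below, under the assumption of no tied scores, builds the CAP curve by walking the population from the riskiest score down, computes the accuracy ratio (area between model and random, divided by area between perfect and random), and compares it with 2 × AUC − 1. Function names and the toy data are hypothetical.

```python
def auc_pairwise(scores, labels):
    """Share of default / non-default pairs ranked correctly (ties = 0.5)."""
    bad = [s for s, y in zip(scores, labels) if y == 1]
    good = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((b > g) + 0.5 * (b == g) for b in bad for g in good)
    return wins / (len(bad) * len(good))


def gini_from_cap(scores, labels):
    """Accuracy ratio read off the CAP curve (assumes no tied scores)."""
    n, n_bad = len(labels), sum(labels)
    # Riskiest first: sort by score, highest to lowest.
    ordered = [y for _, y in sorted(zip(scores, labels), reverse=True)]
    area, captured_prev, cum_bad = 0.0, 0.0, 0
    for y in ordered:
        cum_bad += y
        captured = cum_bad / n_bad           # share of defaults captured so far
        area += (captured_prev + captured) / 2 / n   # trapezoid of width 1/n
        captured_prev = captured
    perfect_area = 1 - (n_bad / n) / 2       # area under the perfect CAP curve
    return (area - 0.5) / (perfect_area - 0.5)


# Hypothetical toy portfolio (1 = default).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

gini_cap = gini_from_cap(scores, labels)
gini_auc = 2 * auc_pairwise(scores, labels) - 1
print(gini_cap, gini_auc)  # both 0.5 on this data
```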

KS Statistic

KS is the maximum gap between the cumulative distributions of defaults and non-defaults.

It identifies the single score region where the model separates best.

KS = max_s |F_bad(s) − F_good(s)|
The key connection: these metrics are not competing philosophies. They all describe the same underlying question: how well the model ranks risky observations above safe ones.
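
A minimal sketch of the KS definition, assuming 1 marks a default: it evaluates both empirical CDFs at every observed score and keeps the largest gap together with the score where it occurs. The function name `ks_statistic` and the toy data are illustrative.

```python
from bisect import bisect_right


def ks_statistic(scores, labels):
    """Return (KS, score at which the CDF gap is largest)."""
    bad = sorted(s for s, y in zip(scores, labels) if y == 1)
    good = sorted(s for s, y in zip(scores, labels) if y == 0)
    best_gap, best_score = 0.0, None
    for s in sorted(set(scores)):
        f_bad = bisect_right(bad, s) / len(bad)     # share of defaults with score <= s
        f_good = bisect_right(good, s) / len(good)  # share of non-defaults with score <= s
        gap = abs(f_bad - f_good)
        if gap > best_gap:                          # strict: keep the first maximum
            best_gap, best_score = gap, s
    return best_gap, best_score


# Hypothetical toy portfolio (1 = default).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

ks, ks_score = ks_statistic(scores, labels)
print(ks, ks_score)  # 0.5 at score 0.4
```

On this data the same gap of 0.5 also occurs at score 0.7; the sketch reports only the first maximum, a reminder that KS is a one-point summary.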

A useful order for learning discrimination

01

Start with rank-ordering, not thresholds

AUC and Gini tell you whether the model ranks bads above goods in general. This is the cleanest starting point.

02

Then move to threshold trade-offs

Precision, recall, specificity, and F1 all depend on where you cut the score. They are operational decisions, not pure model properties.

03

Then isolate local separation

KS is useful because it highlights the region where the score distributions are maximally separated.

04

Then compare across time and models

A challenger beats a champion only if the improvement is stable, meaningful, and consistent across samples.

One model, all discrimination metrics

Choose a model quality scenario and move the classification threshold. ROC, Gini/CAP, KS, and the confusion matrix all update from the same underlying score distribution.

[Interactive panel: ROC curve (model ROC, random diagonal); Lorenz / CAP curve (model, perfect, random); KS plot (CDF of non-defaults, CDF of defaults, KS point). Controls: scenario and classification threshold (0.50 shown). Readouts: AUC, Gini, KS, KS at score; confusion matrix with accuracy, precision, recall / TPR, specificity, F1 score, and FPR.]

Gini = 2 × AUC − 1
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
KS = max(TPR − FPR)
Threshold trade-off: a lower threshold catches more bads, but also creates more false alarms. ROC summarises this over all thresholds, which is why AUC is threshold-independent.
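
The trade-off can be sketched by computing the confusion-matrix ratios at two cut-offs. This assumes an observation is flagged as bad when its score is at or above the threshold; the function name `threshold_metrics` and the toy data are hypothetical.

```python
def threshold_metrics(scores, labels, threshold):
    """Confusion-matrix ratios when score >= threshold is flagged as bad."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    tn = sum(1 for s, y in pairs if s < threshold and y == 0)

    def ratio(num, den):
        return num / den if den else 0.0   # avoid dividing by an empty cell

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "tpr": recall,                      # TPR = TP / (TP + FN)
        "fpr": ratio(fp, fp + tn),          # FPR = FP / (FP + TN)
        "precision": precision,
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, len(pairs)),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical toy portfolio (1 = default).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

at_050 = threshold_metrics(scores, labels, 0.50)
at_015 = threshold_metrics(scores, labels, 0.15)
print(at_050["tpr"], at_050["fpr"])  # 0.75 0.25
print(at_015["tpr"], at_015["fpr"])  # 1.0 0.75
```

Dropping the cut-off from 0.50 to 0.15 lifts recall from 0.75 to 1.0, at the price of the false positive rate rising from 0.25 to 0.75.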
Typical benchmark thinking: AUC above ~0.70 or Gini above ~0.40 is often considered acceptable in practice, but realistic targets depend heavily on portfolio type and default rate.

Compare two models side by side

This is the shape of a real validation question: does the challenger materially outperform the champion, or does it only look marginally better on one statistic?

[Interactive panel: ROC overlay (champion, challenger, random). Control: scenario. Readouts: AUC, Gini, and KS for each model, plus ΔAUC and ΔGini.]

Validation language: a challenger recommendation is strongest when the improvement shows up not only in AUC/Gini but also across out-of-time samples, segments, and operationally relevant thresholds.
Crossover caution: if ROC curves cross, AUC alone can hide the operational story. One model may be better in the exact threshold region the business actually uses.
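
A small, deliberately extreme sketch of the crossover problem: two hypothetical models score the same eight borrowers, and the one with the higher AUC captures fewer defaults in the low-false-alarm region. The model names, the scores, and the 10% FPR budget are invented for illustration.

```python
def roc_points(scores, labels):
    """(FPR, TPR) at every cut-off, flagging score >= cut-off as bad."""
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    points = [(0.0, 0.0)]   # cut-off above every score: nothing flagged
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0)
        points.append((fp / n_good, tp / n_bad))
    return points


def auc_from_roc(points):
    """Trapezoidal area under the ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def tpr_at_fpr(points, max_fpr):
    """Best recall achievable without exceeding the false-alarm budget."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


labels = [1, 1, 1, 1, 0, 0, 0, 0]                       # 1 = default
model_a = [0.9, 0.8, 0.3, 0.2, 0.7, 0.6, 0.5, 0.4]      # sharp at the top, blind below
model_b = [0.8, 0.7, 0.6, 0.5, 0.9, 0.4, 0.3, 0.2]      # one bad miss at the very top

roc_a, roc_b = roc_points(model_a, labels), roc_points(model_b, labels)
auc_a, auc_b = auc_from_roc(roc_a), auc_from_roc(roc_b)
low_fpr_a, low_fpr_b = tpr_at_fpr(roc_a, 0.10), tpr_at_fpr(roc_b, 0.10)
print(auc_a, auc_b)          # 0.5 0.75
print(low_fpr_a, low_fpr_b)  # 0.5 0.0
```

Model B wins on AUC, yet under a 10% false-alarm budget model A catches half the defaults and model B catches none.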

Discrimination metrics compared

Metric | Range | What it measures | Strength | Limitation
AUC | [0, 1]; 0.5 = random, useful models sit in 0.5 to 1.0 | Overall ranking power across thresholds | Threshold-independent | Does not measure calibration
Gini | [−1, 1]; 0 = random, useful models sit in 0 to 1.0 | Same ranking power, rescaled | Industry-standard in credit risk | Mathematically redundant with AUC
KS | [0, 1.0] | Maximum separation at one score point | Operationally intuitive | Only a one-point summary
CAP / Lorenz | [0, 1.0] | % of bads captured by top % of population | Portfolio prioritisation view | Equivalent in spirit to Gini
Precision | [0, 1.0] | How clean positive predictions are | Business-action relevance | Threshold-dependent
Recall | [0, 1.0] | How many actual bads are caught | Miss-risk awareness | Threshold-dependent
F1 | [0, 1.0] | Balance of precision and recall | Useful for imbalanced classes | Does not use true negatives

Concepts every validator should keep

discrimination vs calibration

They are different jobs

A model can rank-order perfectly and still predict wrong PD levels. Good discrimination does not mean good calibration.
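
One way to see the difference: any order-preserving transform of the predicted PDs leaves AUC untouched but can wreck the PD levels. In the sketch below the PDs, labels, and the 0.2 shrink factor are all hypothetical.

```python
def auc_pairwise(scores, labels):
    """Share of default / non-default pairs ranked correctly (ties = 0.5)."""
    bad = [s for s, y in zip(scores, labels) if y == 1]
    good = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((b > g) + 0.5 * (b == g) for b in bad for g in good)
    return wins / (len(bad) * len(good))


# Hypothetical predicted PDs and outcomes (1 = default).
pd_original = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

# Shrinking every PD by the same factor preserves the ordering exactly.
pd_shrunk = [0.2 * p for p in pd_original]

observed_rate = sum(labels) / len(labels)
mean_original = sum(pd_original) / len(pd_original)
mean_shrunk = sum(pd_shrunk) / len(pd_shrunk)

auc_original = auc_pairwise(pd_original, labels)
auc_shrunk = auc_pairwise(pd_shrunk, labels)

print(auc_original, auc_shrunk)                   # identical ranking power
print(observed_rate, mean_original, mean_shrunk)  # roughly 0.5, 0.5, 0.1
```

The shrunk PDs understate the observed default rate by a factor of five while scoring exactly the same AUC, so a discrimination metric alone cannot detect the problem.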

imbalanced data

Accuracy can mislead badly

In low-default portfolios, a model that predicts “non-default” for almost everyone can look accurate while being operationally useless.
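
A quick arithmetic sketch, assuming a hypothetical portfolio of 1,000 borrowers with a 2% default rate and a "model" that never flags anyone:

```python
# Hypothetical low-default portfolio: 20 defaults among 1,000 borrowers.
labels = [1] * 20 + [0] * 980

# A degenerate model that predicts "non-default" for everyone.
predictions = [0] * len(labels)

correct = sum(1 for p, y in zip(predictions, labels) if p == y)
caught = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)

accuracy = correct / len(labels)
recall = caught / sum(labels)

print(accuracy)  # 0.98
print(recall)    # 0.0
```

98% accuracy with zero defaults caught is why accuracy is rarely the headline metric in credit risk.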

monitoring

Track metric decay over time

A falling Gini or AUC can signal model degradation, but that conclusion holds only once population drift has been ruled out as the cause.

PSI link

Discrimination drops need context

If PSI is high and Gini falls, population shift may be driving the change. If PSI is low and Gini falls, the model itself is under suspicion.
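
For reference, a minimal sketch of the PSI calculation using its standard definition, the sum over score bins of (actual share − expected share) × ln(actual share / expected share). The page itself does not give the formula; the bin counts, the function name `psi`, and the small floor that guards against empty bins are assumptions for illustration.

```python
import math


def psi(expected_counts, actual_counts, floor=1e-6):
    """Population Stability Index over pre-defined score bins."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_share = max(e / e_total, floor)   # floor avoids log(0) on empty bins
        a_share = max(a / a_total, floor)
        total += (a_share - e_share) * math.log(a_share / e_share)
    return total


# Hypothetical counts per score bin: development sample vs. a recent month.
development = [100, 200, 400, 200, 100]
recent = [50, 150, 400, 250, 150]

psi_same = psi(development, development)
psi_shift = psi(development, recent)
print(psi_same)             # 0.0 when the distributions match
print(round(psi_shift, 3))  # about 0.08 for this shift towards the higher bins
```

Reading the PSI next to the Gini trend is what allows the two cases above to be told apart.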

segment view

Portfolio averages can hide weakness

A strong aggregate Gini can still conceal a segment where the model barely separates risk. Segment-level breakdowns matter.
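
A sketch of a segment-level breakdown on invented data: the segment names, scores, and outcomes are hypothetical, chosen so that one segment is separated perfectly and the other not at all.

```python
from collections import defaultdict


def gini(scores, labels):
    """Gini = 2 * AUC - 1, with AUC from pairwise comparison (ties = 0.5)."""
    bad = [s for s, y in zip(scores, labels) if y == 1]
    good = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((b > g) + 0.5 * (b == g) for b in bad for g in good)
    return 2 * wins / (len(bad) * len(good)) - 1


# Hypothetical records: (segment, score, label), with 1 = default.
records = [
    ("retail", 0.9, 1), ("retail", 0.8, 1), ("retail", 0.2, 0), ("retail", 0.1, 0),
    ("sme", 0.7, 0), ("sme", 0.6, 1), ("sme", 0.4, 1), ("sme", 0.3, 0),
]

overall_gini = gini([s for _, s, _ in records], [y for _, _, y in records])

by_segment = defaultdict(lambda: ([], []))
for segment, score, label in records:
    by_segment[segment][0].append(score)
    by_segment[segment][1].append(label)

segment_gini = {seg: gini(s, y) for seg, (s, y) in by_segment.items()}

print(overall_gini)   # 0.75
print(segment_gini)   # {'retail': 1.0, 'sme': 0.0}
```

The aggregate Gini of 0.75 looks healthy even though the model has no ranking power at all inside the second segment.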

statistical significance

Not every delta is real

AUC 0.78 vs 0.76 may or may not be meaningful. Bootstrap confidence intervals or formal AUC comparison tests are needed before strong claims.
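
One possible bootstrap, sketched on synthetic data: resample borrowers with replacement (defaults and non-defaults separately, so every replicate contains both classes), recompute both AUCs on the same resample, and read a percentile interval off the differences. The data generator, noise levels, replicate count, and function names are illustrative assumptions; formal tests such as DeLong's are an alternative this sketch does not implement.

```python
import random


def auc(scores, labels):
    """Share of default / non-default pairs ranked correctly (ties = 0.5)."""
    bad = [s for s, y in zip(scores, labels) if y == 1]
    good = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((b > g) + 0.5 * (b == g) for b in bad for g in good)
    return wins / (len(bad) * len(good))


def bootstrap_delta_auc(champ, chall, labels, n_boot=300, alpha=0.05, seed=0):
    """Percentile interval for AUC(challenger) - AUC(champion), paired resampling."""
    rng = random.Random(seed)
    bad_idx = [i for i, y in enumerate(labels) if y == 1]
    good_idx = [i for i, y in enumerate(labels) if y == 0]
    deltas = []
    for _ in range(n_boot):
        # Same resampled borrowers for both models keeps the comparison paired.
        idx = (rng.choices(bad_idx, k=len(bad_idx))
               + rng.choices(good_idx, k=len(good_idx)))
        y = [labels[i] for i in idx]
        deltas.append(auc([chall[i] for i in idx], y)
                      - auc([champ[i] for i in idx], y))
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic data: a latent risk signal seen through more or less noise.
rng = random.Random(42)
labels = [1] * 40 + [0] * 80
risk = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]
champion = [r + rng.gauss(0, 1.0) for r in risk]
challenger = [r + rng.gauss(0, 0.5) for r in risk]

point_delta = auc(challenger, labels) - auc(champion, labels)
lo, hi = bootstrap_delta_auc(champion, challenger, labels)
print(round(point_delta, 3), (round(lo, 3), round(hi, 3)))
```

If the interval comfortably excludes zero, the delta is harder to dismiss as sampling noise; if it straddles zero, a claim that the challenger is better needs more evidence.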

What to leave this page with

ROC, AUC, Gini, KS, and CAP are not isolated metrics. They are different windows onto the same question: how well the model separates good from bad.

The useful order is: first understand threshold-independent ranking, then threshold-based trade-offs, then local separation, then champion-challenger comparison.

Once that clicks, discrimination reporting stops being metric memorisation and becomes a coherent validation story.