notes · 07

KS Statistic & Confusion Matrix

This page goes one level deeper than overall discrimination. KS isolates the single score region where the model separates best, while the confusion matrix shows exactly what happens once a threshold is chosen and the model starts making hard decisions.

Start with KS mechanics, then move into decile analysis, then confusion matrix logic, then threshold-dependent metrics such as precision, recall, F1, MCC, and Youden’s J. The goal is to connect score ranking with operational cut-off decisions.

What KS actually measures

The mechanics

KS starts by sorting observations by model score. Then it builds two cumulative distributions: one for defaults and one for non-defaults.

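At each score, it measures the vertical distance between those two cumulative curves. The largest distance is the KS statistic.

KS = max over s of |F₁(s) − F₀(s)|

Here F₁(s) is the cumulative share of defaults with a score at or below s, and F₀(s) is the same share for non-defaults.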

So KS is not an average summary like AUC. It is the best single-point separation the model achieves.
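The mechanics above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `ks_statistic`, the toy scores, and the convention that label 1 means default and a higher score means riskier are all assumptions made for the example.

```python
def ks_statistic(scores, labels):
    """Return (ks, score_at_max) for binary labels (1 = default, 0 = non-default)."""
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("need at least one default and one non-default")

    # Sort by score, then walk through the observations while accumulating
    # the two cumulative counts. The gap is only measured once every
    # observation sharing the same score has been counted.
    pairs = sorted(zip(scores, labels))
    cum_bad = cum_good = 0
    best_gap, best_score = 0.0, None
    for i, (score, label) in enumerate(pairs):
        if label == 1:
            cum_bad += 1
        else:
            cum_good += 1
        last_of_tie = i == len(pairs) - 1 or pairs[i + 1][0] != score
        if last_of_tie:
            gap = abs(cum_bad / n_bad - cum_good / n_good)
            if gap > best_gap:
                best_gap, best_score = gap, score
    return best_gap, best_score


# Toy data (hypothetical): higher score = riskier.
scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
ks, at_score = ks_statistic(scores, labels)
print(f"KS = {ks:.2f} at score {at_score}")  # KS = 0.75 at score 0.3
```

In this toy sample, 75% of non-defaults but no defaults score at or below 0.30, so the largest gap between the two cumulative curves is 0.75.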

Why validators care

KS is operationally attractive because it points to a concrete threshold region where the model differentiates most strongly.

That makes it useful for cut-off analysis, decile reporting, and score band evaluation.

Important distinction: AUC tells you about overall ranking quality. KS tells you about the strongest local separation point. A model can have decent AUC but still have a weak or unstable practical cut-off region.

A useful order for understanding KS

01

Start with score ordering

KS only makes sense once the score is interpreted as a risk ranking from safer to riskier or vice versa.

02

Then build cumulative curves

The intuition of KS lives in the gap between two CDFs, not in the final number alone.

03

Then inspect deciles

Decile monotonicity is one of the most practical ways to validate whether the score is behaving consistently across the portfolio.

04

Then connect KS to threshold choice

The KS point is often a candidate cut-off, but never the only decision rule. Business costs and strategy still matter.
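The decile check in step 03 can be sketched as follows. This is an illustrative sketch under stated assumptions: the portfolio is simulated, the helper names `decile_table` and `is_monotonic` are invented for the example, and band 1 is taken to be the riskiest tenth of the portfolio.

```python
import random


def decile_table(scores, labels, n_bands=10):
    """Rank riskiest first, cut into equal-sized bands, report each band's bad rate."""
    if len(scores) < n_bands:
        raise ValueError("need at least one observation per band")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    rows = []
    for band in range(n_bands):
        chunk = ranked[band * n // n_bands:(band + 1) * n // n_bands]
        bads = sum(label for _, label in chunk)
        rows.append({"band": band + 1, "n": len(chunk),
                     "bads": bads, "bad_rate": bads / len(chunk)})
    return rows


def is_monotonic(rows):
    """True if the bad rate never increases from the riskiest band to the safest."""
    rates = [row["bad_rate"] for row in rows]
    return all(earlier >= later for earlier, later in zip(rates, rates[1:]))


# Simulated portfolio (an assumption): default probability rises with the score.
rng = random.Random(7)
scores = [rng.random() for _ in range(50_000)]
labels = [1 if rng.random() < 0.5 * s else 0 for s in scores]

rows = decile_table(scores, labels)
for row in rows:
    print(f"band {row['band']:>2}  n={row['n']}  bad_rate={row['bad_rate']:.3f}")
print("monotonic:", is_monotonic(rows))
```

On real data a single reversal between neighbouring bands is not automatically a failure, since small bands are noisy, but repeated or large reversals are worth investigating.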

Build the KS curve yourself

Control the score separation between defaults and non-defaults. Watch the distributions split, the CDF gap appear, and the decile structure become more or less ordered.
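The same experiment can be approximated in code. The sketch below assumes, purely for illustration, that scores of non-defaults and defaults are drawn from two unit-variance normal distributions whose means differ by `separation`; under that assumption the population KS is 2Φ(d/2) − 1, which the simulation should roughly reproduce. The function names are hypothetical.

```python
import math
import random
from bisect import bisect_right


def ks_two_sample(bad_scores, good_scores):
    """Largest gap between the empirical CDFs of two score samples."""
    bad, good = sorted(bad_scores), sorted(good_scores)
    gap = 0.0
    for s in bad + good:  # the maximum gap is attained at an observed score
        f_bad = bisect_right(bad, s) / len(bad)
        f_good = bisect_right(good, s) / len(good)
        gap = max(gap, abs(f_bad - f_good))
    return gap


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


rng = random.Random(0)
n = 2000  # observations per class (an arbitrary choice)
results = {}
for separation in (0.0, 0.5, 1.0, 2.0):
    good = [rng.gauss(0.0, 1.0) for _ in range(n)]
    bad = [rng.gauss(separation, 1.0) for _ in range(n)]
    results[separation] = ks_two_sample(bad, good)
    theory = 2.0 * normal_cdf(separation / 2.0) - 1.0
    print(f"separation={separation:.1f}  simulated KS={results[separation]:.3f}  "
          f"theoretical KS={theory:.3f}")
```

With no separation the simulated KS is small but not exactly zero, which is a reminder that a low KS on a finite sample can be pure noise.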

Once a threshold is chosen, ranking becomes classification

The four outcomes

A confusion matrix records every hard decision outcome: true positive, false positive, true negative, and false negative.

From those four numbers you derive precision, recall, specificity, F1, MCC, false positive rate, and more.

TP, FP, TN, FN → almost every threshold-based metric
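As a minimal sketch of how the four counts feed the derived metrics, the example below flags every score at or above the threshold as a predicted bad. The toy data, the function names, and the choice to return 0.0 when a denominator is zero are assumptions of this example.

```python
import math


def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when score >= threshold is flagged as bad (label 1)."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def ratio(num, den):
    # Convention for this sketch: an undefined ratio is reported as 0.0.
    return num / den if den else 0.0


def derived_metrics(tp, fp, tn, fn):
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "fpr": ratio(fp, fp + tn),
        "f1": ratio(2 * precision * recall, precision + recall),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Toy data (hypothetical): higher score = riskier, label 1 = default.
scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
tp, fp, tn, fn = confusion_counts(scores, labels, threshold=0.35)
print(tp, fp, tn, fn)  # 4 1 3 0
for name, value in derived_metrics(tp, fp, tn, fn).items():
    print(f"{name:<12}{value:.3f}")
```

At a threshold of 0.35 this toy model catches every default (recall 1.0) at the price of one false alarm (precision 0.8).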

Why the threshold matters

Lower thresholds catch more bads, but they also flag more goods. Higher thresholds reduce false alarms, but increase missed defaults.

That is why threshold selection is not a pure statistics problem. It is a business-cost problem.

Risk framing: a false negative means a bad borrower slips through. A false positive means a good borrower is treated as risky. Those two errors almost never have equal cost.
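One way to make the cost asymmetry concrete is to attach a unit cost to each error type and pick the threshold with the lowest total cost. The sketch below is illustrative only: the cost figures are invented, the function names are hypothetical, and a real cost model would usually depend on exposure and margin rather than a flat cost per error.

```python
def total_cost(scores, labels, threshold, cost_fn, cost_fp):
    """Cost of flagging score >= threshold as bad, given per-error unit costs."""
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return cost_fn * fn + cost_fp * fp


def cheapest_threshold(scores, labels, cost_fn, cost_fp):
    # Candidate cut-offs: every observed score, plus "flag nobody".
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates,
               key=lambda t: total_cost(scores, labels, t, cost_fn, cost_fp))


# Toy data (hypothetical): higher score = riskier, label 1 = default.
scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

# Missed defaults ten times as costly as false alarms: prefer a lower cut-off.
print(cheapest_threshold(scores, labels, cost_fn=10, cost_fp=1))  # 0.4
# False alarms ten times as costly as missed defaults: prefer a higher cut-off.
print(cheapest_threshold(scores, labels, cost_fn=1, cost_fp=10))  # 0.6
```

The model and the data are identical in both calls; only the cost assumption moves the cut-off.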

Confusion matrix, PR curve, and metrics vs threshold

Adjust the threshold and watch how the confusion matrix changes, how the precision-recall point moves, and which threshold optimises F1, MCC, or Youden’s J.
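A threshold sweep of this kind can be sketched in plain Python. The example below uses a small invented sample with a 20% bad rate and reports which cut-off maximises each metric; the data, the function names, and the rule that score >= threshold is flagged as bad are all assumptions of the sketch.

```python
import math


def counts(scores, labels, threshold):
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    fn = sum(labels) - tp
    tn = len(labels) - sum(labels) - fp
    return tp, fp, tn, fn


def f1(tp, fp, tn, fn):
    den = 2 * tp + fp + fn
    return 2 * tp / den if den else 0.0


def mcc(tp, fp, tn, fn):
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den else 0.0


def youden_j(tp, fp, tn, fn):
    return tp / (tp + fn) - fp / (fp + tn)  # TPR - FPR


# Toy data (hypothetical): 20 scores from 0.05 to 1.00, four defaults near the top.
scores = [i / 20 for i in range(1, 21)]
labels = [1 if i in (14, 18, 19, 20) else 0 for i in range(1, 21)]

best = {}
for name, metric in (("f1", f1), ("mcc", mcc), ("youden_j", youden_j)):
    best[name] = max(scores, key=lambda t: metric(*counts(scores, labels, t)))
    print(f"{name:<9} is maximised at threshold {best[name]}")
```

In this sample F1 and MCC both pick 0.9, while Youden’s J picks 0.7, because J rewards capturing the fourth default even at the cost of three false positives. When higher scores mean higher risk, the maximum of TPR − FPR over all thresholds coincides with the KS statistic, so the Youden threshold and the KS point are the same cut-off.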

Which metric answers which question?

| Metric | Main question | Threshold-free? | Works well with imbalance? | Validation use |
| --- | --- | --- | --- | --- |
| Accuracy | What % of decisions are correct overall? | No | Weak | Usually not enough for PD models |
| Precision | Of predicted bads, how many are truly bad? | No | Partial | Useful when false positives are costly |
| Recall | Of actual bads, how many are captured? | No | Partial | Useful when missed bads are costly |
| F1 | How balanced are precision and recall? | No | Partial | Useful single summary for positive-class focus |
| MCC | What is the balanced quality across all 4 CM cells? | No | Strong | Very useful in imbalanced portfolios |
| KS | Where is class separation strongest? | No | Partial | Cut-off analysis, decile review |
| PR AUC | How good is ranking when focus is on the positive class? | Yes | Strong | Useful when event rate is very low |
| Youden’s J | Which threshold maximises TPR − FPR? | No | Partial | Statistical cut-off candidate |

Concepts every validator should keep

accuracy trap

High accuracy can still mean a bad model

In very low-default portfolios, predicting “non-default” for almost everyone can look accurate while producing near-zero recall.
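A quick worked example makes the trap visible. The portfolio size and the 1% default rate below are hypothetical numbers chosen for the arithmetic.

```python
# Hypothetical portfolio: 10,000 loans, 100 of which default (1% bad rate).
n_total, n_bad = 10_000, 100

# A "model" that predicts non-default for everyone.
tp, fn = 0, n_bad            # every default is missed
tn, fp = n_total - n_bad, 0  # every non-default is correctly passed

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)
print(f"accuracy = {accuracy:.1%}, recall = {recall:.1%}")  # accuracy = 99.0%, recall = 0.0%
```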

cost asymmetry

False negatives and false positives are not equal

Missing a bad borrower and declining a good borrower create very different business consequences. Threshold design should reflect that asymmetry.

decile logic

Decile monotonicity is a practical stress test

If the observed bad rate does not fall steadily from the riskiest score bucket to the safest, the model’s local ranking structure deserves scrutiny.

pr curve

ROC is not always enough

When defaults are rare, PR metrics often give a more realistic view of how useful the model is in the region the business actually cares about.
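The reason is that a point on the ROC curve fixes TPR and FPR but says nothing about precision, which also depends on the bad rate. The sketch below applies the identity precision = TPR·π / (TPR·π + FPR·(1 − π)), where π is the portfolio bad rate; the operating point of 80% TPR and 10% FPR is an arbitrary assumption for illustration.

```python
def precision_at(tpr, fpr, bad_rate):
    """Precision implied by an ROC point (TPR, FPR) at a given portfolio bad rate."""
    true_pos = tpr * bad_rate
    false_pos = fpr * (1.0 - bad_rate)
    return true_pos / (true_pos + false_pos)


# Same ROC operating point, three hypothetical bad rates.
for bad_rate in (0.50, 0.10, 0.01):
    value = precision_at(tpr=0.80, fpr=0.10, bad_rate=bad_rate)
    print(f"bad rate {bad_rate:.0%}: precision = {value:.3f}")
```

The ROC point is identical in all three cases, yet precision drops from about 0.89 at a 50% bad rate to about 0.07 at a 1% bad rate, which is the kind of deterioration the PR curve exposes and the ROC curve hides.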

ks stability

KS is informative, but local

Because KS depends on a single maximum gap, it can be less stable over time than broader metrics like AUC or Gini.
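One hedged way to gauge that instability on a given sample is to bootstrap the KS statistic and look at its spread. The sketch below uses simulated scores, resamples each class separately (a design choice of this example, not a prescribed method), and uses an arbitrary 200 replications; the function name is hypothetical.

```python
import random
import statistics
from bisect import bisect_right


def ks_two_sample(bad_scores, good_scores):
    """Largest gap between the empirical CDFs of two score samples."""
    bad, good = sorted(bad_scores), sorted(good_scores)
    return max(
        abs(bisect_right(bad, s) / len(bad) - bisect_right(good, s) / len(good))
        for s in bad + good
    )


# Simulated sample (an assumption): 900 non-defaults, 100 defaults.
rng = random.Random(42)
good = [rng.gauss(0.0, 1.0) for _ in range(900)]
bad = [rng.gauss(1.0, 1.0) for _ in range(100)]

point = ks_two_sample(bad, good)
boot = []
for _ in range(200):
    # Resample each class with replacement, keeping class sizes fixed.
    resampled_bad = rng.choices(bad, k=len(bad))
    resampled_good = rng.choices(good, k=len(good))
    boot.append(ks_two_sample(resampled_bad, resampled_good))

spread = statistics.stdev(boot)
print(f"KS = {point:.3f}, bootstrap standard deviation = {spread:.3f}")
```

A bootstrap spread that is large relative to the point estimate suggests that period-to-period movements in KS should be read with caution before they are treated as model deterioration.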

threshold governance

A threshold is a policy choice

The model supplies the ranking. The chosen threshold turns that ranking into approvals, rejections, alerts, or collections priorities.

What to leave this page with

KS helps you understand where the model separates best. The confusion matrix helps you understand what happens when that separation is turned into an actual decision rule.

The useful order is: first inspect ranking and KS, then inspect deciles, then choose a threshold, then evaluate the business consequences through precision, recall, F1, MCC, and false positive behaviour.

Once that structure is clear, threshold setting stops being arbitrary and becomes part of a defensible validation story.