notes · 07

KS Statistic & Confusion Matrix

This page goes one level deeper than overall discrimination. KS isolates the single score region where the model separates best, while the confusion matrix shows exactly what happens once a threshold is chosen and the model starts making hard decisions.

Start with KS mechanics, then move into decile analysis, then confusion matrix logic, then threshold-dependent metrics such as precision, recall, F1, MCC, and Youden’s J. The goal is to connect score ranking with operational cut-off decisions.


ks statistic

What KS actually measures

The mechanics

KS starts by sorting observations by model score. Then it builds two cumulative distributions: one for defaults and one for non-defaults.

At each score, it measures the distance between those two cumulative curves. The largest distance is the KS statistic.

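KS = max over s of |F₁(s) − F₀(s)|

Here F₁(s) is the cumulative share of defaults with a score at or below s, and F₀(s) is the same share for non-defaults.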

So KS is not an average summary like AUC. It is the best single-point separation the model achieves.
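The mechanics above can be sketched in a few lines of Python. This is a minimal illustration, not the page's own implementation: the function name `ks_statistic`, the label convention (1 = default, 0 = non-default), and the ten toy observations are all assumptions made for the example.

```python
def ks_statistic(scores, labels):
    """Return (ks, score_at_ks) for binary labels: 1 = default, 0 = non-default."""
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("both classes are needed to compute KS")

    pairs = sorted(zip(scores, labels))      # step 1: sort by score
    cum_bad = cum_good = 0
    best_gap, best_score = 0.0, None
    i = 0
    while i < len(pairs):
        s = pairs[i][0]
        # Step 2: advance both cumulative counts past every observation tied at s.
        while i < len(pairs) and pairs[i][0] == s:
            if pairs[i][1] == 1:
                cum_bad += 1
            else:
                cum_good += 1
            i += 1
        # Step 3: distance between the two empirical CDFs at this score.
        gap = abs(cum_bad / n_bad - cum_good / n_good)
        if gap > best_gap:
            best_gap, best_score = gap, s
    return best_gap, best_score


# Toy data (hypothetical): higher score = riskier.
scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

ks, at_score = ks_statistic(scores, labels)
print(f"KS = {ks:.4f} at score {at_score}")  # KS = 0.5833 at score 0.6
```

On this sample the largest gap sits at score 0.60: five of six non-defaults but only one of four defaults score at or below it, so the gap is 5/6 − 1/4 = 7/12.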

Why validators care

KS is operationally attractive because it points to a concrete threshold region where the model differentiates most strongly.

That makes it useful for cut-off analysis, decile reporting, and score band evaluation.

Important distinction: AUC tells you about overall ranking quality. KS tells you about the strongest local separation point. A model can have decent AUC but still have a weak or unstable practical cut-off region.

learning sequence

A useful order for understanding KS

Start with score ordering

KS only makes sense once the score is interpreted as a risk ranking from safer to riskier or vice versa.

Then build cumulative curves

The intuition of KS lives in the gap between two CDFs, not in the final number alone.

Then inspect deciles

Decile monotonicity is one of the most practical ways to validate whether the score is behaving consistently across the portfolio.

Then connect KS to threshold choice

The KS point is often a candidate cut-off, but never the only decision rule. Business costs and strategy still matter.

interactive · ks explorer

Build the KS curve yourself

Control the score separation between defaults and non-defaults. Watch the distributions split, the CDF gap appear, and the decile structure become more or less ordered.

Panels: score distributions (non-defaults vs defaults); cumulative distributions with the KS gap and the KS max point marked; decile analysis.

Parameters (default values): separation Δμ = 2.0, default rate = 5%, n (total) = 2000.

Outputs: KS statistic, score at which KS occurs, KS decile, Gini (reference).

KS = max |F_default(s) − F_non-default(s)|

Decile reading: once the portfolio is sorted by score, the default rate should usually decline cleanly across deciles. Reversals often point to weak local ranking or unstable score regions.
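A decile check of this kind might look like the sketch below. It assumes that a higher score means higher risk, so decile 1 is the riskiest bucket. The helper names `decile_table` and `reversals` and the bad counts per decile are hypothetical, chosen so that exactly one reversal appears.

```python
def decile_table(scores, labels, n_bins=10):
    """Sort by score (riskiest first), cut into equal-sized buckets, report bad rates."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(order)
    rows = []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        bads = sum(labels[i] for i in idx)
        rows.append({"decile": b + 1, "n": len(idx), "bads": bads,
                     "bad_rate": bads / len(idx)})
    return rows


def reversals(rows):
    """Deciles whose bad rate is higher than that of the riskier decile before them."""
    return [cur["decile"] for prev, cur in zip(rows, rows[1:])
            if cur["bad_rate"] > prev["bad_rate"]]


# Hypothetical portfolio of 100 accounts: bad counts per decile, riskiest first.
bads_per_decile = [8, 6, 5, 3, 4, 2, 1, 1, 0, 0]
scores, labels = [], []
for d, k in enumerate(bads_per_decile):
    for j in range(10):
        scores.append(99 - (d * 10 + j))   # unique scores, higher = riskier
        labels.append(1 if j < k else 0)

rows = decile_table(scores, labels)
for r in rows:
    print(r["decile"], r["n"], r["bads"], f"{r['bad_rate']:.0%}")
print("reversals at deciles:", reversals(rows))  # reversals at deciles: [5]
```

Decile 5 shows a 40% bad rate after decile 4's 30%, which is the kind of local break the decile reading is meant to surface. Flat steps (deciles 7 and 8 here) are not counted as reversals in this sketch.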

Try this: move separation close to zero. KS collapses and the decile structure flattens. Then move it above 3.0 and watch both KS and decile monotonicity strengthen at the same time.
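The effect of the separation parameter can also be checked in closed form. The sketch below assumes that both score distributions are normal with unit variance and means Δμ apart. That is an assumption about how such a simulation could be set up, not a confirmed detail of the explorer. Under it, the CDF gap peaks midway between the two means, giving KS = 2Φ(Δμ/2) − 1, and Gini = 2·AUC − 1 with AUC = Φ(Δμ/√2).

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF


def theoretical_ks(delta_mu):
    """Population KS for two unit-variance normals whose means differ by delta_mu."""
    return 2 * PHI(delta_mu / 2) - 1


def theoretical_gini(delta_mu):
    """Population Gini (2*AUC - 1) under the same assumption."""
    return 2 * PHI(delta_mu / sqrt(2)) - 1


for d in (0.0, 1.0, 2.0, 3.0):
    print(f"separation={d:.1f}  KS={theoretical_ks(d):.3f}  Gini={theoretical_gini(d):.3f}")
```

At Δμ = 2.0 this gives KS ≈ 0.683 and Gini ≈ 0.843. At Δμ = 0 both collapse to zero. A finite sample of n = 2000 will scatter around these population values.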

confusion matrix

Once a threshold is chosen, ranking becomes classification

The four outcomes

A confusion matrix records every hard decision outcome: true positive, false positive, true negative, and false negative.

From those four numbers you derive precision, recall, specificity, F1, MCC, false positive rate, and more.

TP, FP, TN, FN → almost every threshold-based metric
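As a small illustration of how the four counts feed the derived metrics, the sketch below scores a toy sample at one threshold. It assumes that an account is flagged as bad when its score is at or above the threshold. The function names and the data are hypothetical.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when score >= threshold is flagged as bad (label 1)."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        flagged = s >= threshold
        if flagged and y == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def ratio(num, den):
    """Division that returns NaN instead of raising when the denominator is zero."""
    return num / den if den else float("nan")


def threshold_metrics(tp, fp, tn, fn):
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "fpr": ratio(fp, fp + tn),
        "npv": ratio(tn, tn + fn),
        "fdr": ratio(fp, fp + tp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

tp, fp, tn, fn = confusion_counts(scores, labels, threshold=0.65)
print(tp, fp, tn, fn)                      # 3 1 5 1
m = threshold_metrics(tp, fp, tn, fn)
print({k: round(v, 3) for k, v in m.items()})
```

With the cut-off at 0.65, three of the four defaults are caught and one of the six non-defaults is flagged, so precision and recall are both 0.75 and accuracy is 0.80.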

Why the threshold matters

Lower thresholds catch more bads, but they also flag more goods. Higher thresholds reduce false alarms, but increase missed defaults.

That is why threshold selection is not a pure statistics problem. It is a business-cost problem.

Risk framing: a false negative means a bad borrower slips through. A false positive means a good borrower is treated as risky. Those two errors almost never have equal cost.
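One way to make the cost asymmetry concrete is to attach a cost to each error type and pick the threshold that minimises the total. The sketch below does that on a toy sample. The cost figures (a missed default costing ten times a wrongly flagged good) are purely illustrative assumptions, as are the function names.

```python
def total_cost(scores, labels, threshold, cost_fn, cost_fp):
    """Total error cost when score >= threshold is flagged as bad."""
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return cost_fn * fn + cost_fp * fp


def cheapest_threshold(scores, labels, cost_fn, cost_fp):
    """Lowest-cost cut-off among observed scores, plus 'flag nobody' (infinity)."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates,
               key=lambda t: total_cost(scores, labels, t, cost_fn, cost_fp))


scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

equal = cheapest_threshold(scores, labels, cost_fn=1, cost_fp=1)
skewed = cheapest_threshold(scores, labels, cost_fn=10, cost_fp=1)
print(equal, skewed)  # 0.7 0.4
```

With equal costs the cheapest cut-off on this sample is 0.70. Once a missed default costs ten times more, it drops to 0.40, accepting three false positives to avoid any false negative. When several thresholds tie, `min` returns the lowest one.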

interactive · threshold trade-off

Confusion matrix, PR curve, and metrics vs threshold

Adjust the threshold and watch how the confusion matrix changes, how the precision-recall point moves, and which threshold optimises F1, MCC, or Youden’s J.

Panels: the confusion matrix; the precision-recall curve with the current threshold marked; recall, precision, F1, and FPR plotted against the threshold.

Controls: classification threshold (default 0.50).

Classification metrics

Accuracy = (TP+TN)/N
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
Specificity = TN/(TN+FP)
F1 score = 2PR/(P+R)
FPR = FP/(FP+TN)
NPV = TN/(TN+FN)
FDR = FP/(FP+TP)
MCC: balanced quality

Threshold candidates

Best F1, best MCC, and Youden’s J (the threshold that maximises TPR − FPR), reported alongside PR AUC.

F1 = 2PR / (P + R)
MCC = (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
Youden’s J = TPR + TNR − 1
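To see how the three criteria can disagree, the sketch below sweeps every observed score as a candidate threshold on a toy sample and reports the best cut-off under each formula. It assumes that score >= threshold means flagged as bad and that an undefined F1 or MCC (zero denominator) is set to 0, a common convention. All names and data are hypothetical.

```python
from math import sqrt


def counts(scores, labels, t):
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
    fn = sum(labels) - tp
    tn = len(labels) - sum(labels) - fp
    return tp, fp, tn, fn


def f1(tp, fp, tn, fn):
    den = 2 * tp + fp + fn               # algebraically equal to 2PR / (P + R)
    return 2 * tp / den if den else 0.0


def mcc(tp, fp, tn, fn):
    den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den else 0.0


def youden_j(tp, fp, tn, fn):
    return tp / (tp + fn) + tn / (tn + fp) - 1   # TPR + TNR - 1


def best_threshold(scores, labels, metric):
    return max(sorted(set(scores)),
               key=lambda t: metric(*counts(scores, labels, t)))


scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

for name, metric in (("F1", f1), ("MCC", mcc), ("Youden J", youden_j)):
    t = best_threshold(scores, labels, metric)
    print(f"best {name}: threshold {t}, value {metric(*counts(scores, labels, t)):.3f}")
```

On this sample F1 and Youden’s J both pick 0.70, while MCC prefers 0.90, where there are no false positives at all. The maximum of Youden’s J over all thresholds (7/12 here) is the same quantity as the KS statistic, since TPR − FPR is the gap between the two cumulative curves.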

Why MCC matters: unlike accuracy, MCC uses all four quadrants of the confusion matrix and stays meaningful under class imbalance. That makes it especially useful in low-default settings.

Operational reminder: the mathematically “best” threshold is not always the business-optimal threshold. Portfolio appetite, decline strategy, collection capacity, and regulatory conservatism can all override it.

reference

Which metric answers which question?

Metric	Main question	Threshold-free?	Works well with imbalance?	Validation use
Accuracy	What % of decisions are correct overall?	No	Weak	Usually not enough for PD models
Precision	Of predicted bads, how many are truly bad?	No	Partial	Useful when false positives are costly
Recall	Of actual bads, how many are captured?	No	Partial	Useful when missed bads are costly
F1	How balanced are precision and recall?	No	Partial	Useful single summary for positive-class focus
MCC	What is the balanced quality across all 4 CM cells?	No	Strong	Very useful in imbalanced portfolios
KS	Where is class separation strongest?	Yes (maximised over all cut-offs)	Partial	Cut-off analysis, decile review
PR AUC	How good is ranking when focus is on the positive class?	Yes	Strong	Useful when event rate is very low
Youden’s J	Which threshold maximises TPR − FPR?	No	Partial	Statistical cut-off candidate

deeper concepts

Concepts every validator should keep

accuracy trap

High accuracy can still mean a bad model

In very low-default portfolios, predicting “non-default” for almost everyone can look accurate while producing near-zero recall.
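A quick worked case shows the size of the trap. The numbers are hypothetical: a portfolio of 1,000 accounts with a 2% default rate, and the extreme classifier that flags nobody.

```python
from math import sqrt

n_total, n_bad = 1000, 20          # hypothetical portfolio, 2% default rate
tp, fp = 0, 0                      # the classifier never predicts "default"
fn, tn = n_bad, n_total - n_bad    # so every default is missed

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)

den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = (tp * tn - fp * fn) / den if den else 0.0   # set to 0 when undefined (a common convention)

print(accuracy, recall, mcc)  # 0.98 0.0 0.0
```

Accuracy reads 98% even though not a single default is caught, while recall and MCC both sit at zero.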

cost asymmetry

False negatives and false positives are not equal

Missing a bad borrower and declining a good borrower create very different business consequences. Threshold design should reflect that asymmetry.

decile logic

Decile monotonicity is a practical stress test

If ranked score buckets do not show a clean decline in observed bad rate, the model’s local ranking structure deserves scrutiny.

pr curve

ROC is not always enough

When defaults are rare, PR metrics often give a more realistic view of how useful the model is in the region the business actually cares about.

ks stability

KS is informative, but local

Because KS depends on a single maximum gap, it can be less stable over time than broader metrics like AUC or Gini.
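A bootstrap is one simple way to gauge how much a KS estimate moves with the sample. The sketch below is illustrative only. The synthetic portfolio (500 accounts, roughly 10% defaults, unit-variance normal scores one unit apart), the 200 resamples, and the fixed seed are all assumptions made for the example.

```python
import random


def ks_statistic(scores, labels):
    """Largest gap between the empirical score CDFs of defaults (1) and non-defaults (0)."""
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    pairs = sorted(zip(scores, labels))
    cum_bad = cum_good = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        s = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == s:   # handle tied scores together
            cum_bad += pairs[i][1]
            cum_good += 1 - pairs[i][1]
            i += 1
        best = max(best, abs(cum_bad / n_bad - cum_good / n_good))
    return best


rng = random.Random(0)                               # fixed seed for repeatability
labels = [1 if rng.random() < 0.10 else 0 for _ in range(500)]
scores = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]

point = ks_statistic(scores, labels)

boot = []
for _ in range(200):
    idx = [rng.randrange(len(scores)) for _ in range(len(scores))]
    ys = [labels[i] for i in idx]
    if 0 < sum(ys) < len(ys):                        # skip resamples missing a class
        boot.append(ks_statistic([scores[i] for i in idx], ys))

boot.sort()
lo, hi = boot[int(0.05 * len(boot))], boot[int(0.95 * len(boot)) - 1]
print(f"KS = {point:.3f}, approximate 90% bootstrap range [{lo:.3f}, {hi:.3f}]")
```

Running the same resampling for AUC or Gini on the same data gives a like-for-like view of which metric is steadier for a given portfolio.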

threshold governance

A threshold is a policy choice

The model supplies the ranking. The chosen threshold turns that ranking into approvals, rejections, alerts, or collections priorities.

summary

What to leave this page with

KS helps you understand where the model separates best. The confusion matrix helps you understand what happens when that separation is turned into an actual decision rule.

The useful order is: first inspect ranking and KS, then inspect deciles, then choose a threshold, then evaluate the business consequences through precision, recall, F1, MCC, and false positive behaviour.

Once that structure is clear, threshold setting stops being arbitrary and becomes part of a defensible validation story.