KS Statistic & Confusion Matrix
This page goes one level deeper than overall discrimination. KS isolates the single score region where the model separates best, while the confusion matrix shows exactly what happens once a threshold is chosen and the model starts making hard decisions.
What KS actually measures
The mechanics
KS starts by sorting observations by model score. Then it builds two cumulative distributions: one for defaults and one for non-defaults.
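At each score, it measures the vertical distance between those two cumulative curves, that is, the absolute difference between the share of defaults and the share of non-defaults at or below that score. The largest distance is the KS statistic.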
So KS is not an average summary like AUC. It is the best single-point separation the model achieves.
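The mechanics above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `ks_statistic`, the toy scores, and the convention that a higher score means higher risk are all assumptions made for the example.

```python
def ks_statistic(scores, labels):
    """Return (ks, score_at_max_gap) for binary labels (1 = default, 0 = non-default)."""
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("need at least one default and one non-default")

    # Step 1: sort observations by model score (ascending).
    pairs = sorted(zip(scores, labels))

    cum_bad = cum_good = 0
    best_gap, best_score = 0.0, None
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Consume every observation tied at this score before measuring the gap.
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                cum_bad += 1
            else:
                cum_good += 1
            i += 1
        # Step 2: gap between the two cumulative distributions at this score.
        gap = abs(cum_bad / n_bad - cum_good / n_good)
        if gap > best_gap:
            best_gap, best_score = gap, score
    # Step 3: the largest gap is the KS statistic.
    return best_gap, best_score


# Toy data, invented for illustration (higher score = riskier).
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1]
ks, at_score = ks_statistic(scores, labels)
print(f"KS = {ks:.4f} at score {at_score}")
```

On this toy sample the largest gap is 7/12 and it occurs at score 0.55: five of the six non-defaults sit at or below that score, but only one of the four defaults does.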
Why validators care
KS is operationally attractive because it points to a concrete threshold region where the model differentiates most strongly.
That makes it useful for cut-off analysis, decile reporting, and score band evaluation.
A useful order for understanding KS
Start with score ordering
KS only makes sense once the score is interpreted as a risk ranking from safer to riskier or vice versa.
Then build cumulative curves
The intuition of KS lives in the gap between two CDFs, not in the final number alone.
Then inspect deciles
Decile monotonicity, meaning the observed bad rate moves in one direction as you step through ten equal-sized score bands, is one of the most practical ways to check whether the score behaves consistently across the portfolio.
Then connect KS to threshold choice
The KS point is often a candidate cut-off, but never the only decision rule. Business costs and strategy still matter.
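The decile step in this sequence is easy to prototype. The sketch below splits a ranked sample into equal-count buckets and checks whether the bad rate ever rises when moving from riskier to safer buckets. The helper names, the toy data, and the use of five buckets instead of ten (to keep the sample small) are assumptions of the example.

```python
def bucket_bad_rates(scores, labels, n_buckets=10):
    """Observed bad rate per equal-count score bucket, riskiest bucket first."""
    if len(scores) < n_buckets:
        raise ValueError("need at least one observation per bucket")
    # Assumption: higher score = riskier, so rank in descending order.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    n = len(ranked)
    rates = []
    for b in range(n_buckets):
        lo = b * n // n_buckets
        hi = (b + 1) * n // n_buckets
        bucket = ranked[lo:hi]
        rates.append(sum(label for _, label in bucket) / len(bucket))
    return rates


def is_monotone_decreasing(rates):
    """True if the bad rate never rises from a riskier bucket to a safer one."""
    return all(a >= b for a, b in zip(rates, rates[1:]))


# Toy data, invented for illustration: 20 accounts, listed by ascending score.
scores = [i / 20 for i in range(1, 21)]
labels = [0, 0, 0, 0,  0, 1, 0, 0,  0, 0, 1, 0,  1, 0, 1, 0,  1, 1, 0, 1]

rates = bucket_bad_rates(scores, labels, n_buckets=5)
print(rates, is_monotone_decreasing(rates))
```

Here the bucket bad rates are 0.75, 0.5, 0.25, 0.25, 0.0, so the check passes. The check allows ties between adjacent buckets; a stricter validation standard might not.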
Build the KS curve yourself
Control the score separation between defaults and non-defaults. Watch the distributions split, the CDF gap appear, and the decile structure become more or less ordered.
Once a threshold is chosen, ranking becomes classification
The four outcomes
A confusion matrix records every hard decision outcome: true positive, false positive, true negative, and false negative.
From those four numbers you derive precision, recall, specificity, F1, MCC, false positive rate, and more.
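To make the derivation concrete, here is a small sketch that computes the listed metrics from the four cells. The function name `derived_metrics` and the example counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for the example, not a universal rule.

```python
import math


def derived_metrics(tp, fp, tn, fn):
    """Common metrics derived from the four confusion-matrix cells."""
    def ratio(num, den):
        # Convention for this sketch: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)            # also called TPR or sensitivity
    specificity = ratio(tn, tn + fp)
    fpr = ratio(fp, fp + tn)               # equals 1 - specificity
    f1 = ratio(2 * precision * recall, precision + recall)
    mcc = ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "fpr": fpr,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical portfolio of 1,000 accounts with 50 actual defaults.
m = derived_metrics(tp=30, fp=20, tn=930, fn=20)
for name, value in m.items():
    print(f"{name:12s} {value:.4f}")
```

With these counts accuracy is 0.96 while precision, recall, and F1 are all 0.60 and MCC is about 0.58, which shows how far the headline accuracy can sit from the positive-class metrics.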
Why the threshold matters
Assuming a higher score means higher risk and every account at or above the threshold is flagged, lower thresholds catch more bads, but they also flag more goods. Higher thresholds reduce false alarms, but increase missed defaults.
That is why threshold selection is not a pure statistics problem. It is a business-cost problem.
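One way to see the business-cost point is to attach a cost to each error type and pick the cheapest cut-off. The sketch below does that on a toy sample. The cost figures, function names, and the rule "flag when score >= threshold" are illustrative assumptions, not values or rules taken from any real portfolio.

```python
def total_cost(scores, labels, threshold, cost_fn, cost_fp):
    """Cost of flagging every score >= threshold, given per-error costs."""
    missed_bads = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    flagged_goods = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return cost_fn * missed_bads + cost_fp * flagged_goods


def cheapest_threshold(scores, labels, cost_fn, cost_fp):
    """Observed score that minimises total cost (first one wins on ties)."""
    candidates = sorted(set(scores))
    return min(
        candidates,
        key=lambda t: total_cost(scores, labels, t, cost_fn, cost_fp),
    )


# Toy data, invented for illustration (higher score = riskier).
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1]

# Hypothetical costs: a missed default is 10x as costly as a false alarm...
costly_misses = cheapest_threshold(scores, labels, cost_fn=10, cost_fp=1)
# ...versus both errors costing the same.
equal_costs = cheapest_threshold(scores, labels, cost_fn=1, cost_fp=1)
print(costly_misses, equal_costs)
```

The same scores give a cut-off of 0.30 when missed defaults are expensive and 0.60 when both errors cost the same. The model has not changed; only the cost assumption has.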
Confusion matrix, PR curve, and metrics vs threshold
Adjust the threshold and watch how the confusion matrix changes, how the precision-recall point moves, and which threshold optimises F1, MCC, or Youden’s J.
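The same sweep can be reproduced offline. The following sketch evaluates F1, MCC, and Youden's J at every observed score and reports the best cut-off for each. All names and the toy data are assumptions made for the example, and undefined ratios are reported as 0.0 by convention.

```python
import math


def counts_at(scores, labels, threshold):
    """Confusion-matrix cells when every score >= threshold is flagged as bad."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        flagged = s >= threshold
        if flagged and y == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def f1(tp, fp, tn, fn):
    den = 2 * tp + fp + fn
    return 2 * tp / den if den else 0.0


def mcc(tp, fp, tn, fn):
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den else 0.0


def youden_j(tp, fp, tn, fn):
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr - fpr


def best_thresholds(scores, labels):
    """For each metric, the observed score that maximises it."""
    candidates = sorted(set(scores))
    metrics = {"f1": f1, "mcc": mcc, "youden_j": youden_j}
    return {
        name: max(candidates, key=lambda t: metric(*counts_at(scores, labels, t)))
        for name, metric in metrics.items()
    }


# Toy data, invented for illustration (higher score = riskier).
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1]

best = best_thresholds(scores, labels)
max_j = youden_j(*counts_at(scores, labels, best["youden_j"]))
print(best, round(max_j, 4))
```

On this small sample all three metrics happen to select 0.60, and the maximum of Youden's J is 7/12, the same value as the largest gap between the two cumulative curves. On real portfolios the three optima need not agree.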
Which metric answers which question?
| Metric | Main question | Threshold-free? | Works well with imbalance? | Validation use |
|---|---|---|---|---|
| Accuracy | What % of decisions are correct overall? | No | Weak | Usually not enough for PD models |
| Precision | Of predicted bads, how many are truly bad? | No | Partial | Useful when false positives are costly |
| Recall | Of actual bads, how many are captured? | No | Partial | Useful when missed bads are costly |
| F1 | How balanced are precision and recall? | No | Partial | Useful single summary for positive-class focus |
| MCC | What is the balanced quality across all 4 CM cells? | No | Strong | Very useful in imbalanced portfolios |
| KS | Where is class separation strongest? | No | Partial | Cut-off analysis, decile review |
| PR AUC | How good is ranking when focus is on the positive class? | Yes | Strong | Useful when event rate is very low |
| Youden’s J | Which threshold maximises TPR − FPR? | No | Partial | Statistical cut-off candidate |
Concepts every validator should keep
High accuracy can still mean a bad model
In very low-default portfolios, predicting “non-default” for almost everyone can look accurate while producing near-zero recall.
False negatives and false positives are not equal
Missing a bad borrower and declining a good borrower create very different business consequences. Threshold design should reflect that asymmetry.
Decile monotonicity is a practical stress test
If score buckets ranked from riskiest to safest do not show a steady decline in observed bad rate, the model’s local ranking structure deserves scrutiny.
ROC is not always enough
When defaults are rare, PR metrics often give a more realistic view of how useful the model is in the region the business actually cares about.
KS is informative, but local
Because KS depends on a single maximum gap, it can be less stable over time than broader metrics like AUC or Gini.
A threshold is a policy choice
The model supplies the ranking. The chosen threshold turns that ranking into approvals, rejections, alerts, or collections priorities.
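The first concept in this list, that high accuracy can still mean a bad model, takes only a few lines to verify. The portfolio below is hypothetical: 1,000 accounts with a 1% default rate, scored by a rule that predicts "non-default" for everyone.

```python
# Hypothetical low-default portfolio: 10 defaults among 1,000 accounts.
y_true = [1] * 10 + [0] * 990
# A degenerate "model" that never flags anyone.
y_pred = [0] * 1000

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

accuracy = correct / len(y_true)
recall = true_positives / sum(y_true)
print(f"accuracy = {accuracy:.2%}, recall = {recall:.2%}")
```

Accuracy comes out at 99% while recall is zero: the rule is right about almost every account and useless for the accounts that matter.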
What to leave this page with
KS helps you understand where the model separates best. The confusion matrix helps you understand what happens when that separation is turned into an actual decision rule.
The useful order is: first inspect ranking and KS, then inspect deciles, then choose a threshold, then evaluate the business consequences through precision, recall, F1, MCC, and false positive behaviour.
Once that structure is clear, threshold setting stops being arbitrary and becomes part of a defensible validation story.