notes · 01

Regression Analysis

Regression is where descriptive statistics becomes modelling. Linear regression explains continuous outcomes with least squares. Logistic regression explains binary outcomes with probabilities, log-odds, and classification thresholds.

Start with linear regression and residuals, then move to logistic regression and scorecard logic. The goal is to connect coefficients, diagnostics, discrimination, and calibration into one picture.


the two pillars

Linear vs logistic — why both matter

Linear regression (OLS)

Linear regression predicts a continuous outcome. It fits a line by minimising the sum of squared residuals, so coefficients are directly interpretable in the original scale of Y.

If β₁ = 1.5, a one-unit increase in X raises the expected value of Y by 1.5 units, holding any other predictors fixed.

Y = β₀ + β₁X + ε

Used in: LGD modelling, EAD / CCF estimation, recovery severity, macro overlays, and continuous forecast problems.
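A minimal sketch of the least-squares fit for the single-predictor model above, in plain Python. The function name `fit_ols` and the five data points are illustrative assumptions, not part of the original notes. The data is generated from Y = 2 + 1.5X with no noise, so the recovered slope matches the β₁ = 1.5 example.

```python
def fit_ols(xs, ys):
    """Return (intercept, slope) minimising the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(X, Y) / variance(X); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Illustrative, noise-free data from Y = 2 + 1.5 X (assumed for this sketch).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.5, 5.0, 6.5, 8.0]

b0, b1 = fit_ols(xs, ys)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
```

With noisy data the estimates would scatter around the true values rather than hit them exactly.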

Logistic regression (MLE)

Logistic regression predicts a binary outcome. Instead of a straight-line prediction for Y itself, it models the log-odds and maps the result through a sigmoid into a probability.

That is why it is the standard tool for PD scorecards and default models: predictions stay naturally in the [0, 1] interval.

P(Y=1) = 1 / (1 + e^{−(β₀ + β₁X)})

Used in: PD scorecards, application scoring, behavioural scoring, early warning, and A-IRB default modelling.
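The mapping from a linear score to a probability can be sketched in a few lines. The helper names `sigmoid` and `predict_probability` and the coefficient values are assumptions chosen for illustration. The point is only that any X, however extreme, yields a probability strictly between 0 and 1.

```python
import math


def sigmoid(z):
    """Map a log-odds value z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, b0, b1):
    """P(Y=1) for a single predictor x under the logistic model."""
    return sigmoid(b0 + b1 * x)


# Illustrative coefficients (assumed for this sketch).
b0, b1 = -3.0, 0.8

xs = [-10.0, 0.0, 3.75, 10.0]
probs = [predict_probability(x, b0, b1) for x in xs]
for x, p in zip(xs, probs):
    print(f"X = {x:6.2f}  ->  P(Y=1) = {p:.4f}")
```

A straight-line model for Y itself would have no such bound, which is the practical reason for modelling log-odds instead.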

learning sequence

A useful order for learning regression

Start with the fitted relationship

Learn what the coefficient actually means before worrying about p-values or diagnostics.

Then study residuals

Residuals are the fingerprint of what the model failed to learn. Most serious modelling issues show up there first.

Then separate fit from inference

R², RMSE, coefficient significance, discrimination, and calibration are not the same dimension. A model can look good on one and weak on another.

Then connect it to validation

In risk work, the question is not only “does it fit?” but also “is it stable, interpretable, monotonic, and decision-useful?”

interactive · linear regression

OLS in action — fit, residuals, and assumptions

Choose a data pattern and adjust the noise level. The fitted line updates, the residual plot changes, and the assumption cards show where OLS looks comfortable and where it starts to break.

Panels: scatter plot with the OLS fitted line (legend: data, OLS fit), and residuals vs fitted values.

Reading rule: a random cloud around zero is good. Curvature suggests non-linearity. A fan shape suggests heteroscedasticity. A few extreme points suggest leverage problems.

Controls: noise level (initial value 1.0) and n, the number of data points (initial value 50).

Reported outputs: β₀ (intercept), β₁ (slope), R² (explained variance), adjusted R², RMSE, and the β₁ p-value.

Assumption diagnostics

Y = β₀ + β₁X + ε, ε ~ N(0, σ²)

Validator lens: residual diagnostics matter more than a clean-looking R². A high R² with broken assumptions can still be a bad model.
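A small sketch of that point, under assumed data: a straight line is fitted to a purely quadratic pattern (Y = X² on X = 0..10, chosen only for illustration). R² comes out high, yet the residuals are positive at both ends and negative in the middle, which is the curvature signature described in the reading rule. RMSE is computed here as the square root of the mean squared residual; other conventions divide by the residual degrees of freedom.

```python
import math


def fit_line(xs, ys):
    """Closed-form OLS for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Assumed illustrative data: a curved relationship with no noise.
xs = [float(x) for x in range(11)]
ys = [x ** 2 for x in xs]

b0, b1 = fit_line(xs, ys)
fitted = [b0 + b1 * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

mean_y = sum(ys) / len(ys)
sse = sum(e ** 2 for e in residuals)          # unexplained variation
sst = sum((y - mean_y) ** 2 for y in ys)      # total variation
r2 = 1.0 - sse / sst
rmse = math.sqrt(sse / len(ys))

print(f"R^2 = {r2:.3f}, RMSE = {rmse:.2f}")
# The sign pattern of the residuals reveals the missed curvature.
print("".join("+" if e > 0 else "-" for e in residuals))
```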

interactive · logistic regression

Logistic regression — probabilities, odds, and thresholds

Move the coefficients and threshold, then watch how the sigmoid, log-odds line, and confusion matrix respond. This is the cleanest way to feel why scorecards are built on logistic models.

Panels: sigmoid curve of P(Default=1) vs X (legend: sigmoid, non-defaults Y=0, defaults Y=1), and the log-odds line.

Key idea: probability is curved, but log-odds is linear. That is why the coefficient interpretation happens in odds space.

Parameters: β₀ (intercept, initial value −3.0), β₁ (coefficient, initial value 0.80), and the classification threshold (initial value 0.50).

Reported outputs: decision boundary X* (where P = threshold), odds p / (1−p) at X = 5, P(Default) at X = 5, odds ratio per unit X (e^β₁), and a confusion matrix on simulated data (n = 200) with accuracy, precision, recall, and F1 score.
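The confusion-matrix metrics named above follow directly from four counts. The sketch below uses ten made-up observations and hypothetical helper names (`confusion_counts`, `classification_metrics`); it is meant to show how moving the threshold trades precision against recall, not to reproduce the page's simulation.

```python
def confusion_counts(y_true, probs, threshold):
    """Count TP, FP, TN, FN when predicting 1 for probs >= threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, probs):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall, f1); zero when undefined."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Made-up labels and predicted probabilities (assumed for illustration).
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.55, 0.60, 0.70, 0.45, 0.90]

for threshold in (0.5, 0.3):
    counts = confusion_counts(y_true, probs, threshold)
    acc, prec, rec, f1 = classification_metrics(*counts)
    print(f"threshold={threshold}: acc={acc:.2f} "
          f"precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

Lowering the threshold catches every default in this toy sample, at the cost of more false alarms.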

P(Y=1) = 1 / (1 + e^{−(β₀ + β₁X)})
logit(p) = ln(p / (1−p)) = β₀ + β₁X

Scorecard interpretation: e^β₁ is the odds ratio. If β₁ = 0.8, each one-unit increase in X multiplies default odds by about 2.23.
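As a worked check of these quantities, the sketch below evaluates the model at the page's initial parameter values (β₀ = −3.0, β₁ = 0.80, threshold 0.50). The variable names are illustrative. The decision boundary comes from solving logit(threshold) = β₀ + β₁X for X.

```python
import math

b0, b1, threshold = -3.0, 0.8, 0.5


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


# Log-odds are linear in X, so evaluate the line first.
log_odds_at_5 = b0 + b1 * 5
odds_at_5 = math.exp(log_odds_at_5)
p_at_5 = odds_at_5 / (1.0 + odds_at_5)

# One extra unit of X multiplies the odds by e^b1.
odds_ratio = math.exp(b1)

# X* where the predicted probability equals the threshold.
boundary = (logit(threshold) - b0) / b1

print(f"odds at X=5      = {odds_at_5:.3f}")
print(f"P(Default) at 5  = {p_at_5:.3f}")
print(f"odds ratio       = {odds_ratio:.3f}")
print(f"boundary X*      = {boundary:.2f}")
```

Raising the threshold above 0.5 makes logit(threshold) positive and pushes X* to the right, so fewer cases are flagged as defaults.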

reference · OLS assumptions

The five OLS assumptions and what failure looks like

Assumption	What it means	How to check	What happens if it fails	Typical remedy
Linearity	E(Y|X) is linear in X	Residual vs fitted plot	Biased coefficients, systematic misspecification	Transform X, add polynomial / spline terms, change model class
Independence	Residuals are not serially correlated	Durbin-Watson, time structure review	Standard errors become misleading	Time-series correction, clustered / robust SE
Normality	Residuals are roughly Normal	QQ plot, residual skewness, Shapiro-Wilk	Inference weakens, especially in small samples	Transform Y, bootstrap, robust inference
Homoscedasticity	Residual variance is constant	Residual fan shape, Breusch-Pagan	SE and p-values become unreliable	Robust SE, weighted least squares, transform target
No multicollinearity	Predictors are not too strongly correlated	VIF, correlation matrix	Unstable coefficients, inflated uncertainty	Drop variables, combine variables, ridge / PCA
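One of the checks in the table, the Durbin-Watson statistic, is simple enough to sketch by hand. It is the sum of squared successive differences of the residuals divided by their sum of squares. Values near 2 suggest no first-order serial correlation, values toward 0 suggest positive correlation, and values toward 4 suggest negative correlation. The two residual series below are invented for illustration.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals in time order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Invented residual series (assumptions for this sketch).
trending = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]             # drifts upward
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]    # flips sign

dw_trend = durbin_watson(trending)
dw_alt = durbin_watson(alternating)

print(f"trending residuals:    DW = {dw_trend:.3f}")
print(f"alternating residuals: DW = {dw_alt:.3f}")
```

The statistic only makes sense when the observations have a natural order, such as time.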

validator notes

What matters in practice

OLS intuition

Why squared errors?

Because the penalty grows with the square of the error, OLS punishes large misses disproportionately more than small ones. That gives a closed-form solution and an efficient estimator under the right assumptions, but it also makes the fit sensitive to outliers.

logistic logic

Probability is not linear

Logistic regression works because it is linear in log-odds, not in raw probability. That is the key conceptual shift from linear regression.

discrimination vs calibration

Two different jobs

A logistic model can rank defaults well and still produce wrong PD levels. Good discrimination does not guarantee calibration.
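A toy illustration of the difference, with invented numbers: two sets of PDs that rank the same ten accounts identically, so their AUC is identical, while only one of them matches the observed default rate on average. The `auc` helper is a plain pairwise implementation written for this sketch, and comparing the mean PD with the observed rate is only a coarse, portfolio-level calibration check.

```python
def auc(y_true, scores):
    """Share of (default, non-default) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented outcomes and PDs (assumptions for this sketch).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
pd_a = [0.05, 0.05, 0.10, 0.10, 0.10, 0.20, 0.20, 0.40, 0.30, 0.50]
pd_b = [p / 4 for p in pd_a]  # same ranking, PD levels far too low

observed_rate = sum(y_true) / len(y_true)
for name, pds in (("A", pd_a), ("B", pd_b)):
    mean_pd = sum(pds) / len(pds)
    print(f"model {name}: AUC = {auc(y_true, pds):.4f}, "
          f"mean PD = {mean_pd:.3f}, observed rate = {observed_rate:.3f}")
```

Any strictly increasing transformation of the scores leaves the ranking, and therefore AUC, unchanged, which is why discrimination metrics alone cannot detect a level error.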

overfitting

Better fit is not always better model

Adding predictors never lowers in-sample R², so in-sample fit almost always looks better. Validation is where you discover whether the improvement is real or just memorised noise.
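Adjusted R² gives a first, in-sample hint of this effect because it charges a penalty for each extra predictor. The sketch below uses the standard formula with hypothetical numbers (n = 50, R² moving from 0.600 to 0.605 as predictors go from 1 to 5). It does not replace out-of-sample validation.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (excluding intercept)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Hypothetical fit statistics (assumed for illustration).
small_model = adjusted_r2(0.600, n=50, p=1)
large_model = adjusted_r2(0.605, n=50, p=5)

print(f"1 predictor : R^2 = 0.600, adjusted = {small_model:.4f}")
print(f"5 predictors: R^2 = 0.605, adjusted = {large_model:.4f}")
```

Here raw R² rises slightly while adjusted R² falls, which suggests the four extra predictors are not earning their place.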

residual analysis

Residuals are evidence

Residuals show whether the model systematically misses a region, a segment, or a structure. They are not decorative plots.

scorecard translation

Why logistic dominates scorecards

Logistic regression on weight-of-evidence (WoE) transformed predictors gives monotonicity, interpretability, and a direct route from coefficients to score points and odds scaling.
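One common way to do the odds scaling is the "points to double the odds" (PDO) convention, sketched below. The anchor values (600 points at good:bad odds of 50:1, 20 points to double the odds) are hypothetical choices for illustration, not values from these notes. The sketch also assumes the model predicts the log-odds of default, so the sign is flipped to obtain good:bad odds and higher scores mean lower risk.

```python
import math

# Hypothetical scaling parameters (assumptions for this sketch).
PDO = 20.0          # points that double the good:bad odds
BASE_SCORE = 600.0  # score assigned at the anchor odds
BASE_ODDS = 50.0    # good:bad odds at the anchor score

FACTOR = PDO / math.log(2)
OFFSET = BASE_SCORE - FACTOR * math.log(BASE_ODDS)


def score_from_default_log_odds(log_odds_default):
    """Convert model output (log-odds of default) to scorecard points."""
    good_log_odds = -log_odds_default  # ln((1-p)/p) = -ln(p/(1-p))
    return OFFSET + FACTOR * good_log_odds


# Model outputs corresponding to good:bad odds of 50:1 and 100:1.
score_at_anchor = score_from_default_log_odds(-math.log(50))
score_at_double = score_from_default_log_odds(-math.log(100))

print(f"odds  50:1 -> {score_at_anchor:.1f} points")
print(f"odds 100:1 -> {score_at_double:.1f} points")
```

Because the score is linear in log-odds, each coefficient times its WoE value contributes a fixed number of points, which is what makes the scorecard additive and easy to read.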

summary

What to leave this page with

Linear regression explains continuous outcomes through least squares. Logistic regression explains binary outcomes through log-odds and probability.

The useful order is: first understand the fitted relationship, then inspect residuals, then separate fit from inference, then connect the model to discrimination, calibration, and stability.

Once those pieces connect, regression stops being just coefficient reading and becomes a full model validation framework.