notes · 01

Regression Analysis

Regression is where descriptive statistics becomes modelling. Linear regression explains continuous outcomes with least squares. Logistic regression explains binary outcomes with probabilities, log-odds, and classification thresholds.

Start with linear regression and residuals, then move to logistic regression and scorecard logic. The goal is to connect coefficients, diagnostics, discrimination, and calibration into one picture.

Linear vs logistic — why both matter

Linear regression (OLS)

Linear regression predicts a continuous outcome. It fits a line by minimising the sum of squared residuals. Because the model is linear in Y itself, coefficients are directly interpretable on the original scale of Y.

If β₁ = 1.5, a one-unit increase in X changes Y by 1.5 units on average, holding the other predictors fixed.

Y = β₀ + β₁X + ε
Used in: LGD modelling, EAD / CCF estimation, recovery severity, macro overlays, and continuous forecast problems.
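
A minimal sketch of the least squares fit for the one-predictor model above, using the closed-form solution. The toy data and the function name fit_simple_ols are assumptions made for illustration; they are not from any real portfolio.

```python
def fit_simple_ols(x, y):
    """Closed-form least squares for Y = b0 + b1*X with a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance(X, Y) / variance(X); intercept makes the line pass
    # through the point of means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical toy data: roughly Y = 2 + 1.5*X with small perturbations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.6, 4.9, 6.6, 7.9, 9.5]

b0, b1 = fit_simple_ols(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print("intercept:", round(b0, 3), "slope:", round(b1, 3))
print("residuals:", [round(r, 3) for r in residuals])
```

With an intercept in the model, the residuals sum to zero by construction, so the interesting information is in their pattern rather than their average.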

Logistic regression (MLE)

Logistic regression predicts a binary outcome. Instead of a straight-line prediction for Y itself, it models the log-odds and maps the result through a sigmoid into a probability.

That is why it is the standard tool for PD scorecards and default models: predictions stay naturally in the [0, 1] interval.

P(Y=1) = 1 / (1 + e^(−(β₀ + β₁X)))
Used in: PD scorecards, application scoring, behavioural scoring, early warning, and A-IRB default modelling.
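
The mapping from a linear score to a probability can be sketched in a few lines. The coefficient values below are hypothetical and chosen only to show that equal steps in X give equal steps in log-odds but unequal steps in probability.

```python
import math


def sigmoid(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Inverse of the sigmoid: probability to log-odds."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients, for illustration only.
b0, b1 = -3.0, 0.8

probs = []
for x in [0, 2, 4, 6]:
    log_odds = b0 + b1 * x          # linear in X
    p = sigmoid(log_odds)           # not linear in X
    probs.append(p)
    print(f"X={x}  log-odds={log_odds:+.2f}  P(Y=1)={p:.3f}")
```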

A useful order for learning regression

01

Start with the fitted relationship

Learn what the coefficient actually means before worrying about p-values or diagnostics.

02

Then study residuals

Residuals are the fingerprint of what the model failed to learn. Most serious modelling issues show up there first.

03

Then separate fit from inference

R² and RMSE describe fit, coefficient significance describes inference, and discrimination and calibration describe predictive behaviour. These are different dimensions: a model can look good on one and weak on another.

04

Then connect it to validation

In risk work, the question is not only “does it fit?” but also “is it stable, interpretable, monotonic, and decision-useful?”
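
As a small illustration of step 03, the sketch below computes R² and RMSE for the same fit on two scales. The observations and fitted values are hypothetical. Rescaling the target leaves R² unchanged but multiplies RMSE, which is one reason the two should not be read as the same kind of evidence.

```python
import math


def r_squared(y, y_hat):
    """Share of variance in y explained by the fitted values."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def rmse(y, y_hat):
    """Root mean squared error, in the units of y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


# Hypothetical observations and fitted values.
y = [3.6, 4.9, 6.6, 7.9, 9.5]
y_hat = [3.54, 5.02, 6.50, 7.98, 9.46]

# The same data expressed in units ten times smaller.
y_scaled = [10 * v for v in y]
y_hat_scaled = [10 * v for v in y_hat]

r2_a, rmse_a = r_squared(y, y_hat), rmse(y, y_hat)
r2_b, rmse_b = r_squared(y_scaled, y_hat_scaled), rmse(y_scaled, y_hat_scaled)

print("original: R2 =", round(r2_a, 4), "RMSE =", round(rmse_a, 4))
print("rescaled: R2 =", round(r2_b, 4), "RMSE =", round(rmse_b, 4))
```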

OLS in action — fit, residuals, and assumptions

Choose a data pattern and change the noise level. The fitted line updates, the residual plot changes, and the assumption cards tell you where OLS looks comfortable and where it starts to break.
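
The same experiment can be approximated offline. The sketch below is an assumption-laden stand-in, not the page's own logic: the pattern names, coefficients, and the helper residual_means_by_third are all hypothetical. It fits a straight line to a linear and a curved pattern and averages the residuals over the lower, middle, and upper third of X.

```python
import random


def fit_line(x, y):
    """Closed-form least squares line; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return my - slope * mx, slope


def simulate(pattern, noise_sd, n=60, seed=0):
    """Generate (x, y) on an evenly spaced grid with Gaussian noise."""
    rng = random.Random(seed)
    x = [10.0 * i / (n - 1) for i in range(n)]
    if pattern == "linear":
        signal = [2.0 + 1.5 * xi for xi in x]
    elif pattern == "curved":
        signal = [2.0 + 0.3 * xi ** 2 for xi in x]
    else:
        raise ValueError(f"unknown pattern: {pattern}")
    y = [s + rng.gauss(0.0, noise_sd) for s in signal]
    return x, y


def residual_means_by_third(x, y):
    """Mean residual in the lower, middle, and upper third of x (x is sorted)."""
    b0, b1 = fit_line(x, y)
    res = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    k = len(res) // 3
    return [sum(c) / len(c) for c in (res[:k], res[k:2 * k], res[2 * k:])]


for pattern in ("linear", "curved"):
    x, y = simulate(pattern, noise_sd=0.5)
    means = residual_means_by_third(x, y)
    print(pattern, [round(m, 2) for m in means])
```

For the curved pattern the residual means come out positive, negative, positive: the U-shape that a residual vs fitted plot would show when linearity fails.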

Logistic regression — probabilities, odds, and thresholds

Move the coefficients and threshold, then watch how the sigmoid, log-odds line, and confusion matrix respond. This is the cleanest way to feel why scorecards are built on logistic models.
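
A rough offline version of that exercise, assuming hypothetical coefficients and a tiny made-up sample: predicted probabilities stay fixed, and only the classification threshold moves.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def confusion_matrix(y_true, p_hat, threshold):
    """Count outcomes when predicting 1 whenever p_hat >= threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(y_true, p_hat):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            counts["tp"] += 1
        elif predicted == 1 and y == 0:
            counts["fp"] += 1
        elif predicted == 0 and y == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


# Hypothetical coefficients and observations, for illustration only.
b0, b1 = -4.0, 1.0
x = [1, 2, 3, 4, 5, 6, 7, 8]
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
p_hat = [sigmoid(b0 + b1 * xi) for xi in x]

for threshold in (0.2, 0.5, 0.8):
    print(threshold, confusion_matrix(y_true, p_hat, threshold))
```

Raising the threshold trades false positives for false negatives. The model and its probabilities are unchanged; only the decision rule moves.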

The five OLS assumptions and what failure looks like

Assumption | What it means | How to check | What happens if it fails | Typical remedy
--- | --- | --- | --- | ---
Linearity | E(Y ∣ X) is linear in X | Residual vs fitted plot | Biased coefficients, systematic misspecification | Transform X, add polynomial / spline terms, change model class
Independence | Residuals are not serially correlated | Durbin-Watson, time structure review | Standard errors become misleading | Time-series correction, clustered / robust SE
Normality | Residuals are roughly Normal | QQ plot, residual skewness, Shapiro-Wilk | Inference weakens, especially in small samples | Transform Y, bootstrap, robust inference
Homoscedasticity | Residual variance is constant | Residual fan shape, Breusch-Pagan | SE and p-values become unreliable | Robust SE, weighted least squares, transform target
No multicollinearity | Predictors are not too strongly correlated | VIF, correlation matrix | Unstable coefficients, inflated uncertainty | Drop variables, combine variables, ridge / PCA
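
Two of the checks in the table reduce to short formulas, sketched here in plain Python. The residual sequences and predictor values are hypothetical. The VIF shortcut 1 / (1 − r²) is valid only for the two-predictor case assumed here; with more predictors, each VIF needs the R² from regressing that predictor on all the others.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: near 2 suggests no first-order serial
    correlation, towards 0 positive correlation, towards 4 negative."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den


def vif_two_predictors(x1, x2):
    """Variance inflation factor when the model has exactly two predictors."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    r_squared = s12 ** 2 / (s11 * s22)
    return 1.0 / (1.0 - r_squared)


# Hypothetical residuals: one sequence drifts in runs, one alternates sign.
dw_runs = durbin_watson([1, 1, 1, -1, -1, -1])
dw_alternating = durbin_watson([1, -1, 1, -1, 1, -1])
print("DW, runs:", round(dw_runs, 3), " DW, alternating:", round(dw_alternating, 3))

# Hypothetical correlated predictors.
vif = vif_two_predictors([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print("VIF:", round(vif, 3))
```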

What matters in practice

OLS intuition

Why squared errors?

OLS penalises each miss by its square, so an error twice as large costs four times as much. That makes the math elegant and the estimator efficient under the right assumptions, but it also makes the fit sensitive to outliers.

logistic logic

Probability is not linear

Logistic regression works because it is linear in log-odds, not in raw probability. That is the key conceptual shift from linear regression.
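
A short derivation makes the shift concrete. Writing p(X) for P(Y=1) at a given X, the model is linear in log-odds, so a one-unit increase in X adds β₁ to the log-odds and multiplies the odds by e^β₁, whatever the starting probability:

```latex
\begin{align*}
\log\frac{p(X)}{1 - p(X)} &= \beta_0 + \beta_1 X \\
\log\frac{p(X+1)}{1 - p(X+1)} - \log\frac{p(X)}{1 - p(X)} &= \beta_1 \\
\frac{p(X+1) \,/\, \bigl(1 - p(X+1)\bigr)}{p(X) \,/\, \bigl(1 - p(X)\bigr)} &= e^{\beta_1}
\end{align*}
```

The change in probability itself depends on where you start on the sigmoid, which is why a logistic coefficient is read as an odds ratio rather than as a change in probability.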

discrimination vs calibration

Two different jobs

A logistic model can rank defaults well and still produce wrong PD levels. Good discrimination does not guarantee calibration.
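
A minimal sketch of that separation, using made-up PD estimates. AUC is computed by direct pairwise comparison, so it depends only on the ranking. Dividing every PD by three keeps the ranking and the AUC intact but leaves the average predicted PD far below the observed default rate.

```python
def auc(y_true, scores):
    """Probability that a random defaulter scores above a random
    non-defaulter (ties count one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean(values):
    return sum(values) / len(values)


# Hypothetical outcomes (1 = default) and PD estimates.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
pd_a = [0.05, 0.10, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.60, 0.75]
pd_b = [p / 3 for p in pd_a]  # same ranking, systematically too low

observed_rate = mean(y_true)
for name, pds in (("model A", pd_a), ("model B", pd_b)):
    print(
        name,
        "AUC =", round(auc(y_true, pds), 3),
        "mean PD =", round(mean(pds), 3),
        "observed rate =", round(observed_rate, 3),
    )
```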

overfitting

Better fit is not always better model

Adding predictors never worsens in-sample fit: in OLS, R² cannot decrease when a variable is added. Validation is where you discover whether the improvement is real or just memorised noise.
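
One way to see this is to fit nested polynomial models to data whose true relationship is a straight line. The sketch below is an illustration under stated assumptions: simulated data, a hand-rolled normal-equations solver, and arbitrary degrees 1, 3, and 6. Training error cannot rise as terms are added; the held-out error is printed for comparison and carries no such guarantee.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_poly(x, y, degree):
    """Least squares polynomial fit via the normal equations (X'X) b = X'y."""
    k = degree + 1
    xtx = [[sum(xi ** (i + j) for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum((xi ** i) * yi for xi, yi in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)


def mse(x, y, coefs):
    preds = [sum(c * xi ** p for p, c in enumerate(coefs)) for xi in x]
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


rng = random.Random(42)


def make_data(n):
    """Simulated data: true relationship is linear, noise is Gaussian."""
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.5) for xi in x]
    return x, y


x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

train_mse, test_mse = {}, {}
for degree in (1, 3, 6):
    coefs = fit_poly(x_train, y_train, degree)
    train_mse[degree] = mse(x_train, y_train, coefs)
    test_mse[degree] = mse(x_test, y_test, coefs)
    print(degree, round(train_mse[degree], 4), round(test_mse[degree], 4))
```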

residual analysis

Residuals are evidence

Residuals show whether the model systematically misses a region, a segment, or a structure. They are not decorative plots.

scorecard translation

Why logistic dominates scorecards

Logistic regression on weight-of-evidence (WoE) transformed inputs supports monotonic bin effects, keeps the model interpretable, and gives a direct route from coefficients to score points and odds scaling.
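
A sketch of the two pieces involved, under common but not universal conventions. WoE is taken here as ln(share of goods / share of bads); some teams use the opposite sign. The scaling anchors (600 points at 50:1 good-to-bad odds, 20 points to double the odds) and the bin counts are hypothetical choices for illustration.

```python
import math


def woe(good_count, bad_count, total_good, total_bad):
    """Weight of evidence for one bin: ln(share of goods / share of bads)."""
    return math.log((good_count / total_good) / (bad_count / total_bad))


def scaling(pdo, base_score, base_odds):
    """Factor and offset so that score = offset + factor * ln(odds),
    where pdo is the number of points that doubles the odds."""
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    return factor, offset


def score_from_pd(pd, factor, offset):
    """Convert a predicted PD into a score via good-to-bad odds."""
    odds_good = (1.0 - pd) / pd
    return offset + factor * math.log(odds_good)


# Hypothetical bin: 400 of 800 goods and 20 of 200 bads fall into it.
bin_woe = woe(400, 20, 800, 200)
print("WoE:", round(bin_woe, 3))

# Hypothetical scaling anchors.
factor, offset = scaling(pdo=20, base_score=600, base_odds=50)
for pd in (1 / 51, 1 / 101, 0.10):
    print("PD =", round(pd, 4), "score =", round(score_from_pd(pd, factor, offset), 1))
```

A PD of 1/51 corresponds to 50:1 odds and lands on the base score; halving the bad odds to 100:1 adds exactly one PDO step.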

What to leave this page with

Linear regression explains continuous outcomes through least squares. Logistic regression explains binary outcomes through log-odds and probability.

The useful order is: first understand the fitted relationship, then inspect residuals, then separate fit from inference, then connect the model to discrimination, calibration, and stability.

Once those pieces connect, regression stops being just coefficient reading and becomes a full model validation framework.