Regression Analysis
Regression is where descriptive statistics becomes modelling. Linear regression explains continuous outcomes with least squares. Logistic regression explains binary outcomes with probabilities, log-odds, and classification thresholds.
Linear vs logistic — why both matter
Linear regression (OLS)
Linear regression predicts a continuous outcome. It fits a line by minimising the sum of squared residuals, and because the model is linear in Y itself, coefficients are directly interpretable in the units of Y.
If β₁ = 1.5, a one-unit increase in X is associated with a 1.5-unit increase in Y on average, holding the other predictors fixed.
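As a minimal sketch of that interpretation, the snippet below fits a one-predictor least-squares line in closed form. The data are hypothetical and generated without noise from y = 2 + 1.5x, so the recovered slope should match β₁ = 1.5; the function name `fit_simple_ols` is an illustration, not part of any library.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Return (intercept, slope) minimising the sum of squared residuals."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical, noise-free data generated from y = 2 + 1.5 * x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 + 1.5 * xi for xi in x]

intercept, slope = fit_simple_ols(x, y)

def predict(value):
    return intercept + slope * value

# A one-unit step in X moves the prediction by exactly the slope.
one_unit_effect = predict(3.0) - predict(2.0)
print(intercept, slope, one_unit_effect)
```

With a single predictor there is nothing else to hold fixed; in a multiple regression the same reading applies to each coefficient with the other predictors held constant.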
Logistic regression (MLE)
Logistic regression predicts a binary outcome. Instead of a straight-line prediction for Y itself, it models the log-odds and maps the result through a sigmoid into a probability.
That is why it is the standard tool for probability-of-default (PD) scorecards and default models: predictions always fall between 0 and 1.
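The mapping from log-odds to probability can be sketched in a few lines. The coefficients `b0` and `b1` below are hypothetical values chosen for illustration; the point is only that any linear score lands between 0 and 1 after the sigmoid, and that a one-unit change in x multiplies the odds by exp(b1).

```python
import math

def sigmoid(z):
    """Map a log-odds value z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid: probability back to log-odds."""
    return math.log(p / (1.0 - p))

def odds(p):
    return p / (1.0 - p)

# Hypothetical coefficients, not estimated from any data.
b0, b1 = -3.0, 0.8

xs = [-5.0, 0.0, 5.0, 10.0]
probs = [sigmoid(b0 + b1 * x) for x in xs]
for x, p in zip(xs, probs):
    print(f"x={x:5.1f}  log-odds={b0 + b1 * x:6.2f}  probability={p:.4f}")

# The model is linear in log-odds, so a one-unit step in x
# multiplies the odds by the constant factor exp(b1).
odds_ratio = odds(sigmoid(b0 + b1 * 3.0)) / odds(sigmoid(b0 + b1 * 2.0))
print(odds_ratio, math.exp(b1))
```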
A useful order for learning regression
Start with the fitted relationship
Learn what the coefficient actually means before worrying about p-values or diagnostics.
Then study residuals
Residuals are the fingerprint of what the model failed to learn. Most serious modelling issues show up there first.
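A small illustration, using made-up data: a straight line fitted to a purely quadratic relationship leaves residuals that are positive at both ends and negative in the middle. The residuals still sum to zero, so only their pattern reveals the missed curvature.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data where the true relationship is y = x^2.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [xi ** 2 for xi in x]

intercept, slope = fit_simple_ols(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# A U-shaped pattern: the line under-predicts at the edges
# and over-predicts in the centre.
for xi, r in zip(x, residuals):
    print(f"x={xi:5.1f}  residual={r:6.2f}")
```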
Then separate fit from inference
R², RMSE, coefficient significance, discrimination, and calibration measure different things. A model can look good on one and weak on another.
Then connect it to validation
In risk work, the question is not only “does it fit?” but also “is it stable, interpretable, monotonic, and decision-useful?”
OLS in action — fit, residuals, and assumptions
Choose a data pattern and change the noise level. The fitted line updates, the residual plot changes, and the assumption cards tell you where OLS looks comfortable and where it starts to break.
Logistic regression — probabilities, odds, and thresholds
Move the coefficients and threshold, then watch how the sigmoid, log-odds line, and confusion matrix respond. This is the cleanest way to feel why scorecards are built on logistic models.
The five OLS assumptions and what failure looks like
| Assumption | What it means | How to check | What happens if it fails | Typical remedy |
|---|---|---|---|---|
| Linearity | E(Y \| X) is linear in X | Residual vs fitted plot | Biased coefficients, systematic misspecification | Transform X, add polynomial / spline terms, change model class |
| Independence | Residuals are not serially correlated | Durbin-Watson, time structure review | Standard errors become misleading | Time-series correction, clustered / robust SE |
| Normality | Residuals are roughly Normal | QQ plot, residual skewness, Shapiro-Wilk | Inference weakens, especially in small samples | Transform Y, bootstrap, robust inference |
| Homoscedasticity | Residual variance is constant | Residual fan shape, Breusch-Pagan | SE and p-values become unreliable | Robust SE, weighted least squares, transform target |
| No multicollinearity | Predictors are not too strongly correlated | VIF, correlation matrix | Unstable coefficients, inflated uncertainty | Drop variables, combine variables, ridge / PCA |
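As one concrete check from the table, the variance inflation factor for predictor j is 1 / (1 − R²ⱼ), where R²ⱼ comes from regressing that predictor on the others. With exactly two predictors R²ⱼ is just their squared correlation, which the sketch below exploits. The data and the helper names are hypothetical; a real model with more predictors needs a full auxiliary regression per variable.

```python
from statistics import mean

def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    a_bar, b_bar = mean(a), mean(b)
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / (saa * sbb) ** 0.5

def vif_two_predictors(a, b):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical predictors.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2_collinear = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]  # roughly 2 * x1
x2_unrelated = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]    # almost no linear link to x1

vif_high = vif_two_predictors(x1, x2_collinear)
vif_low = vif_two_predictors(x1, x2_unrelated)
print(f"near-collinear pair: VIF = {vif_high:.1f}")
print(f"unrelated pair:      VIF = {vif_low:.3f}")
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a warning sign, though the cut-off is a convention rather than a test.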
What matters in practice
Why squared errors?
OLS punishes large misses more than small ones. That makes the math elegant and the estimator efficient under the right assumptions, but also sensitive to outliers.
Probability is not linear
Logistic regression works because it is linear in log-odds, not in raw probability. That is the key conceptual shift from linear regression.
Two different jobs
A logistic model can rank defaults well and still produce wrong PD levels. Good discrimination does not guarantee calibration.
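The gap between the two jobs can be shown with a toy example. Below, `pd_b` is `pd_a` divided by three: the ranking is identical, so the AUC is unchanged, but the average predicted PD no longer matches the observed default rate. The labels and scores are invented for illustration, and comparing means is only the crudest calibration check.

```python
from statistics import mean

def auc(scores, labels):
    """Share of (defaulter, non-defaulter) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes (1 = default) and predicted PDs.
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
pd_a = [0.05, 0.10, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.60, 0.90]
pd_b = [p / 3 for p in pd_a]  # same ordering, levels far too low

observed_rate = mean(labels)
auc_a, auc_b = auc(pd_a, labels), auc(pd_b, labels)

print(f"AUC:      model A = {auc_a:.3f}, model B = {auc_b:.3f}")
print(f"Mean PD:  model A = {mean(pd_a):.3f}, model B = {mean(pd_b):.3f}")
print(f"Observed default rate = {observed_rate:.3f}")
```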
Better fit is not always better model
Adding predictors never worsens in-sample fit: R² cannot fall when a variable is added, even a useless one. Validation is where you discover whether the improvement is real or just memorised noise.
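A rough sketch of this effect, on simulated data: the outcome is pure noise and so is the added predictor, yet the training sum of squared errors can only go down when the predictor is included. The seed, sample size, and variable names are arbitrary assumptions, and the holdout comparison is printed without any claim about which way it will go in a single run.

```python
import random
from statistics import mean

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def sse(actual, predicted):
    """Sum of squared errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

rng = random.Random(0)  # arbitrary seed for reproducibility
n = 20
y_train = [rng.gauss(0, 1) for _ in range(n)]
junk_train = [rng.gauss(0, 1) for _ in range(n)]  # predictor unrelated to y
y_test = [rng.gauss(0, 1) for _ in range(n)]
junk_test = [rng.gauss(0, 1) for _ in range(n)]

# Model 0: intercept only. Model 1: intercept plus the junk predictor.
baseline = mean(y_train)
b0, b1 = fit_simple_ols(junk_train, y_train)

sse_train_base = sse(y_train, [baseline] * n)
sse_train_junk = sse(y_train, [b0 + b1 * v for v in junk_train])
sse_test_base = sse(y_test, [baseline] * n)
sse_test_junk = sse(y_test, [b0 + b1 * v for v in junk_test])

print(f"train SSE: intercept only = {sse_train_base:.3f}, with junk = {sse_train_junk:.3f}")
print(f"test SSE:  intercept only = {sse_test_base:.3f}, with junk = {sse_test_junk:.3f}")
```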
Residuals are evidence
Residuals show whether the model systematically misses a region, a segment, or a structure. They are not decorative plots.
Why logistic dominates scorecards
Logistic regression on weight-of-evidence (WoE) encoded inputs gives monotonicity, interpretability, and a direct route from coefficients to score points and odds scaling.
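The odds scaling mentioned here is commonly written as score = offset + factor × ln(odds), with factor = PDO / ln 2, where PDO is the number of points that doubles the good-to-bad odds. The sketch below applies that convention; the anchor values (600 points at odds of 50:1, PDO of 20) are illustrative assumptions, not values taken from this page.

```python
import math

# Illustrative scaling choices (assumptions, not prescribed values).
PDO = 20.0          # points that double the good:bad odds
BASE_SCORE = 600.0  # score assigned at the base odds
BASE_ODDS = 50.0    # good:bad odds at the base score

factor = PDO / math.log(2)
offset = BASE_SCORE - factor * math.log(BASE_ODDS)

def score_from_pd(pd):
    """Convert a predicted probability of default into score points."""
    good_odds = (1.0 - pd) / pd
    return offset + factor * math.log(good_odds)

score_at_base = score_from_pd(1.0 / 51.0)      # good:bad odds of 50:1
score_at_double = score_from_pd(1.0 / 101.0)   # good:bad odds of 100:1

print(f"factor = {factor:.3f}, offset = {offset:.3f}")
print(f"score at 50:1  = {score_at_base:.1f}")
print(f"score at 100:1 = {score_at_double:.1f}")
```

Because a logistic model for default is linear in the log-odds of default, the log of the good-to-bad odds is the negative of its linear predictor, so each coefficient times its WoE value can be converted into points with the same factor.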
What to leave this page with
Linear regression explains continuous outcomes through least squares. Logistic regression explains binary outcomes through log-odds and probability.
The useful order is: first understand the fitted relationship, then inspect residuals, then separate fit from inference, then connect the model to discrimination, calibration, and stability.
Once those pieces connect, regression stops being just coefficient reading and becomes a full model validation framework.