Regularization & Shrinkage

Classical regression tries to fit the data as well as possible. Regularization adds a second goal: keep the coefficients small. That extra constraint reduces coefficient instability, limits overfitting, and makes the model less fragile under noisy or collinear predictors.

Start with why unconstrained coefficients explode, then move into ridge, lasso, and elastic net, then connect the result to the bias-variance tradeoff, multicollinearity, and model stability.

Why unconstrained regression becomes unstable

Pure fit logic

Ordinary least squares asks only one question: which coefficients minimise the residual sum of squares on the observed sample?

When predictors are noisy, correlated, or numerous, that freedom can produce large and unstable coefficients.

OLS cares only about fit

Regularized logic

Regularization adds a penalty for complexity. The model is no longer rewarded only for fitting the sample, but also for staying small, stable, and better behaved.

That is why some bias is introduced on purpose: to reduce variance and improve generalization.

Regularization = fit + discipline
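The "fit + discipline" idea can be written as one objective. The notation here is an assumption, not taken from the original notes: y is the response, X the predictor matrix, beta the coefficient vector, P a penalty function, and lambda the penalty strength.

```latex
\hat{\beta}_{\text{OLS}} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2
\qquad
\hat{\beta}_{\lambda} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \, P(\beta), \quad \lambda \ge 0
```

Setting lambda to zero recovers OLS. As lambda grows, the penalty term takes over and pulls the coefficients toward zero.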
Validation angle: very large coefficients, frequent sign flips, and unstable variable importance across samples are usually warnings that the unconstrained model is too brittle.

A useful order for learning shrinkage

01

Start with coefficient instability

Before penalties make sense, understand why unconstrained coefficients become noisy when the data are noisy, collinear, or thin relative to the number of predictors.

02

Then compare the penalty shapes

Ridge shrinks continuously, lasso can force exact zeros, and elastic net blends both behaviours.

03

Then watch the coefficient path

The coefficient path across lambda shows how each penalty behaves more intuitively than the closed-form formulas alone.

04

Then connect to prediction error

The real payoff is not prettier coefficients, but a better bias-variance balance and more stable out-of-sample behaviour.

Ridge, Lasso, and Elastic Net under one slider

As lambda increases, the coefficients shrink. Ridge keeps all variables alive, lasso drops some to exactly zero, and elastic net behaves in between.
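A minimal sketch of the three behaviours, under a simplifying assumption that is not in the original notes: the predictors are orthonormal (X'X = I) and the objective is 0.5 * squared error + lambda * (alpha * L1 + 0.5 * (1 - alpha) * L2). In that special case each method has a simple closed form applied to the OLS estimates. The function names and the example coefficients are hypothetical.

```python
import numpy as np

def ridge_shrink(beta_ols, lam):
    """Ridge: scale every coefficient by the same factor; none reach zero."""
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    """Lasso: soft-thresholding; coefficients with |value| <= lam become exactly zero."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

def elastic_net_shrink(beta_ols, lam, alpha=0.5):
    """Elastic net: soft-threshold by lam * alpha, then rescale by the L2 part."""
    return lasso_shrink(beta_ols, lam * alpha) / (1.0 + lam * (1.0 - alpha))

# Hypothetical OLS estimates, chosen only for illustration.
beta_ols = np.array([3.0, -1.5, 0.4, 0.1])

for lam in (0.0, 0.5, 1.0):
    print(f"lambda = {lam}")
    print("  ridge       ", np.round(ridge_shrink(beta_ols, lam), 3))
    print("  lasso       ", np.round(lasso_shrink(beta_ols, lam), 3))
    print("  elastic net ", np.round(elastic_net_shrink(beta_ols, lam), 3))
```

With correlated predictors these closed forms no longer hold and an iterative solver is needed, but the qualitative picture (proportional shrinkage versus thresholding) carries over.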

The path tells the story

Instead of one lambda, view the whole shrinkage path. This is often the clearest way to understand which variables are robust and which survive only when the model is allowed to be too flexible.

[Figure: ridge / lasso coefficient path]

Reading the path: variables that remain material across wide lambda ranges are usually more robust than variables that vanish immediately.
Interpretation caution: lasso dropping a variable to zero does not prove the variable is useless in the real world. It only means that, at that lambda, the penalized objective was lower without it, often because a correlated predictor already carries the same signal.
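A lasso path can be traced by refitting over a grid of lambda values. The sketch below uses a plain cyclic coordinate descent solver written for illustration; the objective scaling (1/(2n) on the squared error), the simulated data, and the lambda grid are all assumptions, and a production solver would add a convergence check and warm starts.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    residual = y.astype(float)  # copy of y; equals y - X @ beta while beta is zero
    for _ in range(n_sweeps):
        for j in range(p):
            residual += X[:, j] * beta[j]   # add back feature j's current contribution
            rho = X[:, j] @ residual / n    # correlation of feature j with the partial residual
            beta[j] = soft_threshold(rho, lam) / col_scale[j]
            residual -= X[:, j] * beta[j]   # remove the updated contribution again
    return beta

# Simulated data: two strong signals, one weak signal, two pure-noise predictors.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardise so one lambda is comparable across columns
true_beta = np.array([2.0, -1.0, 0.3, 0.0, 0.0])
y = X @ true_beta + rng.normal(size=n)
y = y - y.mean()                           # centre y so no intercept is needed

lambdas = [0.01, 0.1, 0.3, 1.0, 3.0]
path = np.array([lasso_cd(X, y, lam) for lam in lambdas])
for lam, coefs in zip(lambdas, path):
    print(f"lambda={lam:<5} coefs={np.round(coefs, 2)}")
```

Reading the printed rows top to bottom gives the path: the strongest predictor stays in the model longest, while the weak and noise predictors drop to exactly zero at moderate penalties.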

Why adding bias can lower total error

This is the central logic of shrinkage. Training error rises as the penalty grows, but test error often falls at first, because the drop in variance outweighs the added bias, and rises again only once the model becomes too constrained.
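The training-versus-test pattern can be checked on simulated data. This is a sketch under stated assumptions: closed-form ridge without an intercept, Gaussian predictors, and a deliberately small training set (40 rows, 30 predictors) so that the near-OLS fit is noisy. The sizes, seed, and lambda grid are illustrative choices.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: solve (X'X + lam * I) b = X'y (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def mse(X, y, beta):
    """Mean squared prediction error."""
    return float(np.mean((y - X @ beta) ** 2))

rng = np.random.default_rng(1)
n_train, n_test, p = 40, 2000, 30          # few rows per predictor
true_beta = rng.normal(scale=0.3, size=p)
X_train = rng.normal(size=(n_train, p))
X_test = rng.normal(size=(n_test, p))
y_train = X_train @ true_beta + rng.normal(size=n_train)
y_test = X_test @ true_beta + rng.normal(size=n_test)

lambdas = np.logspace(-3, 3, 13)
betas = [ridge_fit(X_train, y_train, lam) for lam in lambdas]
train_mse = np.array([mse(X_train, y_train, b) for b in betas])
test_mse = np.array([mse(X_test, y_test, b) for b in betas])

for lam, tr, te in zip(lambdas, train_mse, test_mse):
    print(f"lambda={lam:10.3f}  train MSE={tr:6.3f}  test MSE={te:6.3f}")

best = int(np.argmin(test_mse))
print(f"lowest test MSE at lambda={lambdas[best]:.3g}")
```

For ridge, training error can only rise as lambda grows. Where the test error bottoms out depends on the data, so the location of the minimum in this run is a property of the simulation, not a general rule.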

Shrinkage as a response to correlated predictors

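When predictors overlap strongly, OLS struggles to allocate weight cleanly. Coefficients become unstable, signs can flip, and tiny sample changes can produce very different models. Ridge and elastic net respond by spreading weight across the correlated group, whereas lasso tends to keep one member of the group and drop the rest.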
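The instability is easy to reproduce. In this sketch, which rests on simulated data and an arbitrary penalty of lambda = 1, two predictors are near copies of each other and the same model is refitted on 200 fresh samples. All names and settings are illustrative assumptions.

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares coefficients (no intercept)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def ridge_fit(X, y, lam):
    """Closed-form ridge: solve (X'X + lam * I) b = X'y (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(2)
n, n_reps, lam = 50, 200, 1.0
ols_coefs, ridge_coefs = [], []
for _ in range(n_reps):
    x1 = rng.normal(size=n)
    x2 = x1 + 0.05 * rng.normal(size=n)   # x2 is almost a copy of x1
    X = np.column_stack([x1, x2])
    y = x1 + x2 + rng.normal(size=n)      # true coefficients are (1, 1)
    ols_coefs.append(ols_fit(X, y))
    ridge_coefs.append(ridge_fit(X, y, lam))
ols_coefs = np.array(ols_coefs)
ridge_coefs = np.array(ridge_coefs)

print("OLS   coefficient std across samples:", np.round(ols_coefs.std(axis=0), 2))
print("ridge coefficient std across samples:", np.round(ridge_coefs.std(axis=0), 2))
print("share of negative OLS estimates:     ", np.round((ols_coefs < 0).mean(axis=0), 2))
print("std of OLS b1 + b2:                  ", round(float(ols_coefs.sum(axis=1).std()), 2))
```

The last line is the useful contrast: OLS estimates the sum of the two coefficients well but cannot decide how to split it, which is why the overall fit can look acceptable while the individual coefficients swing widely.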

Shrinkage methods compared

Method | Penalty | Main effect | Best use case
OLS | None | Pure fit, no shrinkage | Clean low-noise, low-collinearity settings
Ridge | L2 | Shrinks all coefficients continuously | Many correlated predictors, stability focus
Lasso | L1 | Shrinks and can set coefficients to zero | Sparse solutions, variable selection
Elastic Net | L1 + L2 | Selection + grouped shrinkage | Correlated predictors with sparsity needs
Adaptive Lasso | Weighted L1 | More selective variable penalization | When selection quality matters strongly
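The penalty column can be made explicit. Writing the penalized objective as squared error plus lambda times a penalty P(beta), the four penalties take the forms below. The symbols are assumptions introduced here: alpha is the elastic net mixing weight (this parameterisation is one common convention, and libraries differ in scaling), and the adaptive lasso weights are built from an initial estimate, written beta-tilde, with an exponent gamma > 0.

```latex
\begin{aligned}
\text{Ridge (L2):} \quad & P(\beta) = \lVert \beta \rVert_2^2 = \sum_j \beta_j^2 \\
\text{Lasso (L1):} \quad & P(\beta) = \lVert \beta \rVert_1 = \sum_j \lvert \beta_j \rvert \\
\text{Elastic net:} \quad & P(\beta) = \alpha \lVert \beta \rVert_1 + \tfrac{1-\alpha}{2} \lVert \beta \rVert_2^2, \quad 0 \le \alpha \le 1 \\
\text{Adaptive lasso:} \quad & P(\beta) = \sum_j w_j \lvert \beta_j \rvert, \quad w_j = 1 / \lvert \tilde{\beta}_j \rvert^{\gamma}
\end{aligned}
```

The L1 term has a kink at zero, which is what allows exact zeros. The squared L2 term is smooth at zero, so it shrinks coefficients without eliminating them.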

Concepts every validator should keep

bias

Bias is not always bad

In shrinkage methods, a little bias is introduced intentionally to reduce variance and improve generalization.

selection

Variable selection can be unstable

Lasso selection is useful, but in weak-signal settings the selected set can change noticeably across samples.

collinearity

Correlated predictors confuse OLS allocation

When variables overlap heavily, OLS can distribute weight erratically even if the overall fit still looks acceptable.

paths

The whole path matters more than one lambda

A variable that survives across a broad penalty range is often more robust than a variable that appears only at near-zero penalty.

governance

Interpretability can change with shrinkage

Regularized models may be more stable, but they can be harder to explain when coefficients are heavily shrunk or grouped effects dominate.

validation

Penalty choice must be validated, not assumed

Lambda and alpha should be justified through validation logic, not selected because a single run happened to look good.
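One common form of validation logic is k-fold cross-validation over a lambda grid, repeated with different splits to see whether the choice is stable. The sketch below does this for ridge only, with a hand-written fold split. The data, grid, fold count, and function names are assumptions for illustration; tuning alpha for elastic net would add a second grid dimension to the same loop.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: solve (X'X + lam * I) b = X'y (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def kfold_cv_mse(X, y, lambdas, k=5, seed=0):
    """Mean validation MSE for each lambda, averaged over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = np.zeros((k, len(lambdas)))
    for i, val_idx in enumerate(folds):
        # Train on every fold except the one held out for validation.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        for m, lam in enumerate(lambdas):
            beta = ridge_fit(X[train_idx], y[train_idx], lam)
            scores[i, m] = np.mean((y[val_idx] - X[val_idx] @ beta) ** 2)
    return scores.mean(axis=0)

# Simulated data with zero-mean predictors, so no intercept is fitted.
rng = np.random.default_rng(3)
n, p = 80, 20
X = rng.normal(size=(n, p))
y = X @ rng.normal(scale=0.3, size=p) + rng.normal(size=n)

lambdas = np.logspace(-2, 3, 11)
chosen = []
for seed in range(5):                      # repeat with different fold assignments
    cv_mse = kfold_cv_mse(X, y, lambdas, k=5, seed=seed)
    chosen.append(float(lambdas[np.argmin(cv_mse)]))
    print(f"split seed={seed}  chosen lambda={chosen[-1]:.3g}  cv MSE={cv_mse.min():.3f}")
```

If the chosen lambda moves a lot between seeds, the validation curve is probably flat near its minimum, and any single run's pick should be reported with that caveat.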

What to leave this page with

Regularization is not a cosmetic add-on to regression. It is a controlled way to trade a little bias for a lot more stability.

The useful order is: first understand why unconstrained coefficients become unstable, then compare ridge, lasso, and elastic net, then inspect the coefficient path, then evaluate how shrinkage changes test error and collinearity behaviour.

Once that structure is clear, shrinkage stops looking like an abstract penalty term and starts looking like a practical tool for building more robust models.