notes · 18

Regularization & Shrinkage

Classical regression tries to fit the data as well as possible. Regularization adds a second goal: keep the coefficients small. That extra constraint reduces coefficient instability, softens overfitting, and makes the model less fragile under noisy or collinear predictors.

Start with why unconstrained coefficients explode, then move into ridge, lasso, and elastic net, then connect the result to the bias-variance tradeoff, multicollinearity, and model stability.


the core problem

Why unconstrained regression becomes unstable

Pure fit logic

Ordinary least squares only asks one question: which coefficients minimise the residual sum of squares (RSS) on the observed sample?

When predictors are noisy, correlated, or numerous, that freedom can produce large and unstable coefficients.

OLS cares only about fit

Regularized logic

Regularization adds a penalty for complexity. The model is no longer rewarded only for fitting the sample, but also for staying small, stable, and better behaved.

That is why some bias is introduced on purpose: to reduce variance and improve generalization.

Regularization = fit + discipline

Validation angle: very large coefficients, frequent sign flips, and unstable variable importance across samples are usually warnings that the unconstrained model is too brittle.
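One way to make this brittleness concrete is to compare the closed-form solutions. This is a sketch under stated assumptions: a centred design matrix X with no intercept, noise variance σ², and eigenvalues d_j of XᵀX. None of this notation appears elsewhere in these notes.

```latex
\hat\beta_{\mathrm{OLS}} = (X^\top X)^{-1} X^\top y,
\qquad
\hat\beta_{\mathrm{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y

% Variance of the estimate along the j-th eigen-direction of X^\top X:
\mathrm{Var}_{\mathrm{OLS},\,j} = \frac{\sigma^2}{d_j},
\qquad
\mathrm{Var}_{\mathrm{ridge},\,j} = \frac{\sigma^2 d_j}{(d_j + \lambda)^2} \le \frac{\sigma^2}{4\lambda}
```

When predictors are nearly collinear, some d_j is close to zero, so the OLS variance in that direction becomes very large. With λ > 0 the ridge variance in the same direction stays bounded, which is the "discipline" side of the trade.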

learning sequence

A useful order for learning shrinkage

Start with coefficient instability

Before penalties make sense, first understand why free coefficients become noisy when the data are noisy, collinear, or thin relative to the number of predictors.

Then compare the penalty shapes

Ridge shrinks continuously, lasso can force exact zeros, and elastic net blends both behaviours.

Then watch the coefficient path

Tracing each coefficient across a range of lambda values builds more intuition than the closed-form formulas alone.

Then connect to prediction error

The real payoff is not prettier coefficients, but a better bias-variance balance and more stable out-of-sample behaviour.

interactive · penalty comparison

Ridge, Lasso, and Elastic Net under one slider

Increase lambda and watch coefficients shrink. Compare how ridge keeps all variables alive, lasso drops some to zero, and elastic net behaves in between.

[Interactive panel: coefficient magnitudes under ridge, lasso, and elastic net, alongside a conceptual view of the penalty geometry. Controls: lambda (penalty strength, default 1.00) and elastic net mix α (default 0.50). Readouts: active variables for ridge, lasso, and elastic net; average shrinkage; lasso sparsity; interpretation.]

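Ridge: minimise RSS + λ Σβ²
Lasso: minimise RSS + λ Σ|β|
Elastic Net: minimise RSS + λ[(1−α)Σβ² + αΣ|β|]

Here RSS is the residual sum of squares, λ ≥ 0 sets the penalty strength, and α is the elastic net mix: α = 0 recovers ridge and α = 1 recovers lasso.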

Ridge: useful when many variables carry signal and you mainly want stability.

Lasso: useful when you want variable selection and a sparser model.

Elastic Net: often useful when predictors are correlated and you want a compromise between selection and grouped shrinkage.
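The difference between "shrinks continuously" and "forces exact zeros" can be sketched in the simplest possible setting. The code below assumes an orthonormal design (XᵀX = I), in which case each penalised problem separates per coefficient and has a closed form in terms of the OLS coefficient z. It uses the same λ and α parameterisation as the three objectives above. The function names and the example coefficients are hypothetical and chosen only for illustration.

```python
def soft_threshold(z, t):
    """Move z toward zero by t; return exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def ridge_shrink(z, lam):
    # Minimises (z - b)^2 + lam * b^2: proportional shrinkage, never exactly zero.
    return z / (1.0 + lam)


def lasso_shrink(z, lam):
    # Minimises (z - b)^2 + lam * |b|: soft-thresholding at lam / 2.
    return soft_threshold(z, lam / 2.0)


def elastic_net_shrink(z, lam, alpha):
    # Minimises (z - b)^2 + lam * ((1 - alpha) * b^2 + alpha * |b|):
    # soft-threshold first, then shrink proportionally.
    return soft_threshold(z, lam * alpha / 2.0) / (1.0 + lam * (1.0 - alpha))


# Hypothetical OLS coefficients: two strong, two weak.
ols = [3.0, -1.5, 0.4, -0.2]
lam = 1.0

ridge = [ridge_shrink(z, lam) for z in ols]
lasso = [lasso_shrink(z, lam) for z in ols]
enet = [elastic_net_shrink(z, lam, 0.5) for z in ols]

for name, coefs in [("ridge", ridge), ("lasso", lasso), ("elastic net", enet)]:
    active = sum(1 for b in coefs if b != 0.0)
    print(f"{name:12s} active={active}  coefs={[round(b, 3) for b in coefs]}")
```

In this toy case ridge keeps all four coefficients, lasso zeroes the two weak ones, and elastic net with α = 0.5 zeroes only the weakest. Real designs are not orthonormal, so these formulas are a guide to the mechanism rather than a fitting procedure.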

interactive · coefficient paths

The path tells the story

Instead of one lambda, view the whole shrinkage path. This is often the clearest way to understand which variables are robust and which survive only when the model is allowed to be too flexible.

[Interactive chart: ridge / lasso coefficient path, with a method selector.]

Reading the path: variables that remain material across wide lambda ranges are usually more robust than variables that vanish immediately.
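A ridge path is straightforward to compute because each point on it has a closed form. The sketch below is illustrative only: the data are simulated, `true_beta` and the materiality threshold of 0.1 are hypothetical choices, and the fit assumes standardised predictors with no intercept.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge for RSS + lam * sum(beta^2); no intercept."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def ridge_path(X, y, lambdas):
    """Stack the ridge solution for every lambda: shape (len(lambdas), p)."""
    return np.array([ridge_fit(X, y, lam) for lam in lambdas])


rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardise predictors
true_beta = np.array([2.0, 0.0, -1.0, 0.0])   # hypothetical: two signals, two noise
y = X @ true_beta + rng.standard_normal(n)
y = y - y.mean()                              # centre the response

lambdas = np.logspace(-2, 4, 13)
path = ridge_path(X, y, lambdas)
norms = np.linalg.norm(path, axis=1)

# For each variable, count the grid points at which it is still "material".
threshold = 0.1                               # hypothetical materiality cut-off
survival = (np.abs(path) > threshold).sum(axis=0)

for lam, beta in zip(lambdas, path):
    print(f"lambda={lam:10.2f}  beta={np.round(beta, 3)}")
print("grid points survived per variable:", survival)
```

Variables with real signal stay above the threshold over a much wider stretch of the grid than the noise variables, which is the pattern the path view is meant to expose. Ridge coefficients shrink toward zero but do not reach it, so a threshold is needed here; a lasso path would show exact zeros instead.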

Interpretation caution: lasso dropping a variable to zero does not prove the variable is useless in the real world. It only means the penalized optimization found a cheaper representation.
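One hedged way to check whether a lasso zero means much is to refit on bootstrap resamples and see how often each variable is selected. The sketch below uses a plain coordinate-descent lasso written for the objective RSS + λ Σ|β| used in these notes. The data, `beta_true`, the penalty `lam=20.0`, and the number of resamples are all hypothetical, and the fit assumes no intercept and roughly unit-scale predictors.

```python
import numpy as np


def lasso_cd(X, y, lam, n_sweeps=100):
    """Coordinate descent for RSS + lam * sum(|beta|); no intercept."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    resid = y.astype(float).copy()            # equals y - X @ beta while beta = 0
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * beta[j]        # remove variable j's contribution
            rho = X[:, j] @ resid
            # Soft-threshold at lam / 2, then rescale by the column's squared norm.
            beta[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]        # add the updated contribution back
    return beta


rng = np.random.default_rng(0)
n, p = 50, 6
beta_true = np.array([2.0, 0.15, 0.15, 0.0, 0.0, 0.0])  # one strong, two weak, three null
X = rng.standard_normal((n, p))
y = X @ beta_true + rng.standard_normal(n)

n_boot = 100
selected = np.zeros((n_boot, p), dtype=bool)
for b in range(n_boot):
    rows = rng.integers(0, n, size=n)         # bootstrap: sample rows with replacement
    selected[b] = lasso_cd(X[rows], y[rows], lam=20.0) != 0.0

frequency = selected.mean(axis=0)
distinct_sets = len({tuple(row) for row in selected})

print("selection frequency per variable:", np.round(frequency, 2))
print("distinct selected sets across resamples:", distinct_sets)
```

A variable selected in nearly every resample is a sturdier finding than one that appears in some resamples and vanishes in others. A weak but real effect can easily land in the second group, which is why a zero from a single fit is weak evidence of irrelevance.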

interactive · bias-variance tradeoff

Why adding bias can lower total error

This is the central logic of shrinkage. Training error rises with stronger penalties, but test error often improves until the model becomes too constrained.
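The standard decomposition for squared-error loss makes the trade explicit. It is stated here under the usual assumptions that a new observation satisfies y₀ = f(x₀) + ε with E[ε] = 0 and Var(ε) = σ². The symbols f, x₀, and f̂_λ (the model fitted at penalty λ) are notation introduced for this sketch.

```latex
\mathbb{E}\big[(y_0 - \hat f_\lambda(x_0))^2\big]
  = \underbrace{\sigma^2}_{\text{irreducible noise}}
  + \underbrace{\big(\mathbb{E}[\hat f_\lambda(x_0)] - f(x_0)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}\big(\hat f_\lambda(x_0)\big)}_{\text{variance}}
```

Raising λ typically increases the bias term and decreases the variance term. Test error falls as long as the variance saved exceeds the squared bias added, and rises again once the model is too constrained.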

[Interactive charts: error vs lambda; bias and variance components. Control: scenario. Readouts: best lambda, minimum test error, train-test gap at λ=0, train-test gap at best λ.]

Main lesson: no penalty (λ = 0) gives the lowest training error by construction, but usually not the lowest test error.
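The train-versus-test pattern can be reproduced with a small simulation. This is a sketch, not a benchmark: the sample sizes, the lambda grid, and the distribution of `beta_true` are hypothetical, and the ridge fit assumes zero-mean data with no intercept.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge for RSS + lam * sum(beta^2); lam = 0 gives OLS."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def mse(X, y, beta):
    return float(np.mean((y - X @ beta) ** 2))


rng = np.random.default_rng(1)
n_train, n_test, p = 40, 2000, 30              # few rows per predictor on purpose
beta_true = rng.normal(0.0, 0.3, size=p)       # hypothetical: many small effects
X_train = rng.standard_normal((n_train, p))
X_test = rng.standard_normal((n_test, p))
y_train = X_train @ beta_true + rng.standard_normal(n_train)
y_test = X_test @ beta_true + rng.standard_normal(n_test)

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]
train_err, test_err = [], []
for lam in lambdas:
    beta = ridge_fit(X_train, y_train, lam)
    train_err.append(mse(X_train, y_train, beta))
    test_err.append(mse(X_test, y_test, beta))

best = int(np.argmin(test_err))
gap_at_zero = test_err[0] - train_err[0]
gap_at_best = test_err[best] - train_err[best]

for lam, tr, te in zip(lambdas, train_err, test_err):
    print(f"lambda={lam:8.1f}  train MSE={tr:6.3f}  test MSE={te:6.3f}")
print(f"best lambda on test data: {lambdas[best]}")
print(f"train-test gap at lambda=0: {gap_at_zero:.3f}, at best lambda: {gap_at_best:.3f}")
```

Training error can only rise as λ grows, while test error falls first and then rises. In practice the test set would not be used to pick λ; a validation split or cross-validation plays that role.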

interactive · multicollinearity

Shrinkage as a response to correlated predictors

When predictors overlap strongly, OLS struggles to allocate weight cleanly. Coefficients become unstable, signs can flip, and tiny sample changes can produce very different models.

[Interactive charts: coefficient instability across resamples; variance inflation view. Controls: predictor correlation ρ (default 0.80) and penalty λ (default 1.00). Readouts: approximate VIF, OLS std(β), ridge std(β), OLS sign flips, ridge sign flips, stability verdict.]

Key mechanism: ridge is often preferred under strong collinearity because it keeps variables in the model while damping coefficient volatility.
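The damping effect can be sketched with two correlated predictors. The simulation below draws many fresh samples, fits OLS and ridge to each, and compares the spread and sign of the first coefficient. The function name `simulate`, the true coefficients, the sample size, and the setting ρ = 0.95 are hypothetical. For two predictors the variance inflation factor is 1 / (1 − ρ²).

```python
import numpy as np


def fit(X, y, lam):
    """lam = 0 gives OLS; lam > 0 gives ridge. No intercept, zero-mean data assumed."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


def simulate(rho, lam, n=30, reps=500, seed=0):
    rng = np.random.default_rng(seed)
    beta_true = np.array([0.5, 0.5])           # hypothetical: both effects positive
    ols, ridge = [], []
    for _ in range(reps):
        z = rng.standard_normal((n, 2))
        x1 = z[:, 0]
        x2 = rho * z[:, 0] + np.sqrt(1.0 - rho ** 2) * z[:, 1]  # corr(x1, x2) = rho
        X = np.column_stack([x1, x2])
        y = X @ beta_true + rng.standard_normal(n)
        ols.append(fit(X, y, 0.0))
        ridge.append(fit(X, y, lam))
    ols, ridge = np.array(ols), np.array(ridge)
    return {
        "vif": 1.0 / (1.0 - rho ** 2),
        "ols_std": float(ols[:, 0].std()),
        "ridge_std": float(ridge[:, 0].std()),
        # A "sign flip" is an estimate of the first coefficient that comes out
        # negative although the true effect is positive.
        "ols_sign_flips": int((ols[:, 0] < 0).sum()),
        "ridge_sign_flips": int((ridge[:, 0] < 0).sum()),
    }


result = simulate(rho=0.95, lam=1.0)
for key, value in result.items():
    print(f"{key:18s} {value}")
```

Both predictors stay in the ridge model, but the first coefficient varies less from sample to sample and changes sign less often. The price is a small downward bias in the ridge estimates, which this sketch does not report.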

reference

Shrinkage methods compared

Method	Penalty	Main effect	Best use case
OLS	None	Pure fit, no shrinkage	Clean low-noise, low-collinearity settings
Ridge	L2	Shrinks all coefficients continuously	Many correlated predictors, stability focus
Lasso	L1	Shrinks and can set coefficients to zero	Sparse solutions, variable selection
Elastic Net	L1 + L2	Selection + grouped shrinkage	Correlated predictors with sparsity needs
Adaptive Lasso	Weighted L1	More selective variable penalization	When selection quality matters strongly

deeper concepts

Concepts every validator should keep

bias

Bias is not always bad

In shrinkage methods, a little bias is introduced intentionally to reduce variance and improve generalization.

selection

Variable selection can be unstable

Lasso selection is useful, but in weak-signal settings the selected set can change noticeably across samples.

collinearity

Correlated predictors confuse OLS allocation

When variables overlap heavily, OLS can distribute weight erratically even if the overall fit still looks acceptable.

paths

The whole path matters more than one lambda

A variable that survives across a broad penalty range is often more robust than a variable that appears only at near-zero penalty.

governance

Interpretability can change with shrinkage

Regularized models may be more stable but can also be harder to narrate if coefficients are heavily distorted or grouped effects dominate.

validation

Penalty choice must be validated, not assumed

Lambda and alpha should be justified through validation logic, not selected because a single run happened to look good.
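A minimal version of that validation logic is k-fold cross-validation repeated over several fold assignments, so that the chosen penalty does not hinge on one lucky split. The sketch below covers ridge and λ only; tuning α for elastic net would add a second grid dimension. The data, the grid, and the function name `cv_mse` are hypothetical.

```python
from collections import Counter

import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge for RSS + lam * sum(beta^2); no intercept."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def cv_mse(X, y, lam, k, seed):
    """Mean held-out squared error over k folds for one fold assignment."""
    idx = np.random.default_rng(seed).permutation(len(y))
    fold_errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)        # every row not in the held-out fold
        beta = ridge_fit(X[train], y[train], lam)
        fold_errors.append(np.mean((y[fold] - X[fold] @ beta) ** 2))
    return float(np.mean(fold_errors))


rng = np.random.default_rng(2)
n, p = 60, 20
beta_true = rng.normal(0.0, 0.3, size=p)       # hypothetical: many small effects
X = rng.standard_normal((n, p))
y = X @ beta_true + rng.standard_normal(n)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
choices = []
for seed in range(10):                         # ten different fold assignments
    scores = [cv_mse(X, y, lam, k=5, seed=seed) for lam in grid]
    choices.append(grid[int(np.argmin(scores))])

counts = Counter(choices)
print("lambda chosen per fold assignment:", choices)
print("how often each lambda won:", dict(counts))
```

If the same λ wins under most fold assignments, the choice is defensible. If the winner jumps around the grid, the data may not distinguish the candidates well, and that uncertainty is worth reporting alongside the final model.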

summary

What to leave this page with

Regularization is not a cosmetic add-on to regression. It is a controlled way to trade a little bias for a lot more stability.

The useful order is: first understand why unconstrained coefficients become unstable, then compare ridge, lasso, and elastic net, then inspect the coefficient path, then evaluate how shrinkage changes test error and collinearity behaviour.

Once that structure is clear, shrinkage stops looking like an abstract penalty term and starts looking like a practical tool for building more robust models.