notes · 11

CLT & Correlation

This page connects two ideas that sit at the core of inference and risk modelling: why averages become Normal, and how variables move together. The Central Limit Theorem (CLT) explains the behaviour of sample means; correlation explains the structure between variables.

Start with CLT, then move into covariance and correlation, then compare Pearson vs Spearman, and finally use correlation matrices in a validation context.

Why the Normal distribution keeps returning

The CLT is one of the reasons statistics works at all. It says that under broad conditions (independent draws from a distribution with finite variance are the classic case), the mean of many sampled observations is approximately Normal, even when the original data is not.

The theorem in plain language

Take any population distribution with finite variance. Draw repeated samples of size n. Compute the mean of each sample.

As n gets larger, the distribution of those sample means becomes approximately Normal.

x̄ is approximately N(μ, σ² / n) when n is large, where μ and σ² are the population mean and variance.

The result is not about the original observations becoming Normal. It is about the sampling distribution of the mean.
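The statement above can be checked with a small simulation. The sketch below assumes an Exponential source with scale 1 (so μ = 1 and σ = 1) and uses NumPy; the function name and the number of repetitions are illustrative choices, not part of the original page.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable


def simulate_sample_means(n, num_samples=20_000):
    """Draw num_samples samples of size n from Exponential(scale=1); return their means."""
    draws = rng.exponential(scale=1.0, size=(num_samples, n))
    return draws.mean(axis=1)


# Exponential(scale=1) has population mean 1 and population std 1.
mu, sigma = 1.0, 1.0

for n in (2, 30, 200):
    means = simulate_sample_means(n)
    print(
        f"n={n:>3}  mean of sample means={means.mean():.3f}  "
        f"std of sample means={means.std(ddof=1):.3f}  "
        f"theory sigma/sqrt(n)={sigma / np.sqrt(n):.3f}"
    )
```

The raw draws stay skewed at every n; only the collection of sample means tightens around μ with a spread close to σ / √n.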

Why this matters in practice

CLT is what allows many confidence intervals, z-tests, backtesting bands, and approximation-based validation tools to exist.

In credit risk, the observed default rate is a sample mean of default indicators (1 if the obligor defaulted, 0 otherwise). In portfolio risk, aggregate loss is often the sum of many small random pieces.

Validation context: CLT is most comfortable when observations are independent and no single observation dominates the sample. Dependence and concentration slow convergence.
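To make the default-rate case concrete, here is a minimal sketch of a z-style band around a PD estimate. It assumes independent defaults with a common PD, which is exactly the comfortable case described above and often a simplification in practice. The function name and the portfolio numbers are hypothetical.

```python
import math


def default_rate_band(pd_estimate, n_obligors, z=1.96):
    """Normal-approximation band for the observed default rate.

    Assumes independent defaults with a common PD (a simplification).
    """
    se = math.sqrt(pd_estimate * (1.0 - pd_estimate) / n_obligors)
    lower = max(0.0, pd_estimate - z * se)  # a rate cannot go below 0
    upper = min(1.0, pd_estimate + z * se)  # or above 1
    return lower, upper


# Hypothetical portfolio: PD of 2% and 1,000 obligors.
lower, upper = default_rate_band(0.02, 1000)
print(f"approximate 95% band: [{lower:.4f}, {upper:.4f}]")
```

With these inputs the band runs from roughly 1.1% to 2.9%. Rerunning with 100 obligors pushes the lower bound to the floor at zero, which is a hint that the Normal approximation is strained for small portfolios with low PDs.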

How to think about CLT correctly

01

Do not confuse population with sample mean

The raw population can stay skewed forever. It is the distribution of repeated sample averages that becomes bell-shaped.

02

Sample size controls precision

The spread of sample means shrinks in proportion to 1/√n as n grows, so quadrupling the sample size halves the standard error. This is why larger samples make the mean more stable.

03

Skewed sources converge more slowly

A heavily skewed source may need much larger n before the Normal approximation becomes operationally safe.

04

Dependence complicates everything

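If observations are strongly dependent, the CLT may still hold under additional conditions (for example, dependence that fades quickly between distant observations), but convergence is much less straightforward and the usual σ / √n standard error can be misleading.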
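The dependence point can be illustrated with a small simulation. The sketch below assumes an AR(1) process with unit marginal variance as a stand-in for serially dependent observations; the process, the coefficient 0.8, and the function name are illustrative assumptions rather than anything specified on this page.

```python
import numpy as np

rng = np.random.default_rng(seed=1)


def ar1_sample_means(phi, n, num_samples=5_000):
    """Means of num_samples AR(1) paths of length n, each with unit marginal variance."""
    # Innovations are scaled so the stationary variance stays at 1 for any phi.
    innovations = rng.normal(size=(num_samples, n)) * np.sqrt(1.0 - phi**2)
    paths = np.empty((num_samples, n))
    paths[:, 0] = rng.normal(size=num_samples)  # start in the stationary distribution
    for t in range(1, n):
        paths[:, t] = phi * paths[:, t - 1] + innovations[:, t]
    return paths.mean(axis=1)


n = 100
naive_se = 1.0 / np.sqrt(n)  # what sigma / sqrt(n) claims under independence
std_independent = ar1_sample_means(phi=0.0, n=n).std(ddof=1)
std_dependent = ar1_sample_means(phi=0.8, n=n).std(ddof=1)

print(f"naive SE={naive_se:.3f}  independent={std_independent:.3f}  phi=0.8: {std_dependent:.3f}")
```

With independent draws the simulated spread matches σ / √n. With strong positive serial dependence the sample mean is far more variable than the naive formula suggests, even though n is unchanged.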

Watch the sample mean become Normal

Choose a source distribution, set the sample size, and repeatedly draw samples. The top chart shows the source population shape. The bottom chart shows the distribution of sample means.

[Interactive demo. Top panel: source distribution. Bottom panel: histogram of sample means with a theoretical Normal overlay.]

Controls: sample size (n).

Readouts: population mean (μ), population std (σ), mean of x̄, std of x̄, theoretical SE (σ / √n), and |skewness of x̄| (smaller = more Normal).
Try this: start with Exponential at n = 2, then move n to 30. Watching the histogram of sample means lose its skew makes the theorem more concrete than any formal statement.
x̄ ~ N(μ, σ² / n)
SE = σ / √n

How fast does CLT converge?

Not all source distributions converge at the same speed. A symmetric source reaches approximate normality much faster than a heavily skewed one.

[Chart: |skewness of x̄| vs sample size for Uniform, Exponential, Bimodal, and U-shaped sources.]

Interpretation

The lower the absolute skewness of the sample-mean distribution, the closer it is to a symmetric bell curve. Skewness measures asymmetry only, so treat it as a rough gauge rather than a full test of normality.

Uniform and other symmetric sources converge quickly. Exponential takes longer because the original asymmetry is stronger.

Validation implication: for small, skewed, or concentrated portfolios, the usual Normal approximation behind z-style backtesting can be weaker than it looks.
Rough rule of thumb:
symmetric source → n ≈ 15
moderate skew → n ≈ 30
heavy skew → n ≈ 50+
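The convergence-speed comparison can be reproduced numerically. This is a rough sketch assuming two of the sources named above (Uniform on [0, 1] and Exponential with scale 1) and a simple moment-based skewness; the helper names and the grid of n values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=2)


def sample_skewness(values):
    """Moment-based skewness: third central moment divided by std cubed."""
    centered = values - values.mean()
    return np.mean(centered**3) / np.mean(centered**2) ** 1.5


def skewness_of_means(sampler, n, num_samples=20_000):
    """Absolute skewness of the distribution of sample means for a given source."""
    means = sampler(size=(num_samples, n)).mean(axis=1)
    return abs(sample_skewness(means))


sources = {
    "uniform": lambda size: rng.uniform(0.0, 1.0, size=size),
    "exponential": lambda size: rng.exponential(1.0, size=size),
}

results = {
    name: {n: skewness_of_means(sampler, n) for n in (2, 15, 30, 50)}
    for name, sampler in sources.items()
}
for name, by_n in results.items():
    print(name, {n: round(float(s), 3) for n, s in by_n.items()})
```

For the Exponential source, theory gives a skewness of 2 / √n for the sample mean, so it is still about 0.37 at n = 30. The Uniform source is symmetric, so its sample means show skewness near zero at every n.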

How variables move together

Once you leave one-variable uncertainty and start looking at two variables together, the language changes from variance to covariance and correlation.

Covariance

Covariance asks whether X and Y move above and below their means together. Positive means same direction. Negative means opposite direction.

Cov(X,Y) = E[(X−μx)(Y−μy)]
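As a quick worked example of the formula, the sketch below computes the covariance directly from deviations and compares it with NumPy. The five data points are made up for illustration, and `ddof=0` is used so that NumPy matches the expectation-style definition above.

```python
import numpy as np


def covariance(x, y):
    """Average product of deviations from the means (population-style, divides by n)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((x - x.mean()) * (y - y.mean()))


# Made-up illustrative numbers, not real data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

cov_xy = covariance(x, y)
cov_numpy = np.cov(x, y, ddof=0)[0, 1]  # off-diagonal entry of the 2x2 covariance matrix
print(cov_xy, cov_numpy)
```

The deviation products are 4, 0, 0, 0 and 2, which average to 1.2. The positive sign says X and Y tend to sit on the same side of their means, but the magnitude depends on the units of both variables.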

Pearson

Pearson correlation standardises covariance. It measures linear association and always sits between −1 and +1.

ρ = Cov(X,Y) / (σx · σy)
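A minimal sketch of the standardisation step, using made-up numbers. It also rescales X by 1,000 to show that the coefficient is unit-free, which is the practical reason to prefer it over raw covariance.

```python
import numpy as np


def pearson(x, y):
    """Covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
    return cov_xy / (x.std() * y.std())  # np.std defaults to ddof=0, matching cov_xy


# Made-up illustrative numbers, not real data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

rho = pearson(x, y)
rho_rescaled = pearson(np.multiply(x, 1000.0), y)  # change of units for X
print(round(rho, 4), round(rho_rescaled, 4))
```

Both calls give about 0.7746, and the hand-rolled value agrees with `np.corrcoef(x, y)[0, 1]`.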

Spearman

Spearman applies Pearson to ranks instead of levels. It measures monotonic association and is more robust to outliers and non-normality.

ρs = Pearson(rank(X), rank(Y))
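The rank-then-Pearson recipe can be sketched with NumPy alone. The `average_ranks` helper below is an illustrative implementation that gives tied values the average of their positions, which is the usual convention; in practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
import numpy as np


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    positions = np.empty(len(values), dtype=float)
    positions[order] = np.arange(1, len(values) + 1)
    # Group equal values and replace each position by the group average.
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    tie_sums = np.bincount(inverse, weights=positions)
    return (tie_sums / counts)[inverse]


def spearman(x, y):
    """Pearson correlation applied to the ranks of x and y."""
    return np.corrcoef(average_ranks(x), average_ranks(y))[0, 1]


# Made-up illustrative numbers, not real data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

rho_s = spearman(x, y)
print(average_ranks(y), round(rho_s, 4))
```

Here the ranks of y are 1, 2.5, 4.5, 2.5 and 4.5, and the resulting coefficient is about 0.738.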

See where Pearson and Spearman disagree

The gap between Pearson and Spearman is often more informative than either statistic alone. It tells you something about curvature, monotonicity, and outlier sensitivity.

[Interactive demo. Scatter plot for a selectable scenario.]

Readouts: Pearson ρ (linear association), Spearman ρs (rank monotonicity), the gap |ρ − ρs|, and the sample size n.
Read this way: if Spearman is high but Pearson is lower, the relation may be monotonic but curved. If Pearson jumps because of one outlier, Spearman often stays calmer.
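Both reading rules can be reproduced with synthetic data. The sketch below builds two assumed scenarios: an exponential curve (monotonic but far from linear) and unrelated noise with one extreme point added. The data, seed, and scenario labels are illustrative, and the simple rank trick relies on there being no ties.

```python
import numpy as np

rng = np.random.default_rng(seed=3)


def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]


def spearman(x, y):
    """Pearson on ranks; argsort-of-argsort is enough here because there are no ties."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return np.corrcoef(rank_x, rank_y)[0, 1]


# Scenario 1: monotonic but strongly curved relation.
x_curved = np.linspace(0.0, 5.0, 50)
y_curved = np.exp(x_curved)

# Scenario 2: unrelated noise plus a single extreme point at (20, 20).
x_outlier = np.append(rng.normal(size=49), 20.0)
y_outlier = np.append(rng.normal(size=49), 20.0)

for label, x, y in [("curved", x_curved, y_curved), ("outlier", x_outlier, y_outlier)]:
    p, s = pearson(x, y), spearman(x, y)
    print(f"{label:>8}: Pearson={p:.3f}  Spearman={s:.3f}  gap={abs(p - s):.3f}")
```

In the curved scenario Spearman is essentially 1 while Pearson is noticeably lower. In the outlier scenario the direction flips: one point drags Pearson up sharply while Spearman stays modest.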

Correlation matrices and validation reading

Correlation matrices turn pairwise relationships into a compact map. In model development and validation, they are one of the fastest ways to check dependence structure and multicollinearity risk.

[Figure: correlation matrix heatmap.]

How to read the matrix

The diagonal is always 1.0. The matrix is symmetric. What matters are the off-diagonal relationships.

High predictor-to-predictor correlation can signal multicollinearity. Low predictor-to-target correlation can signal weak standalone information.

Validation context: in a scorecard or IRB model review, you check the matrix to understand whether predictors are redundant, unstable, or structurally dependent in a way that will make coefficients fragile.
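A first-pass redundancy check along these lines might look like the sketch below. The predictor names and the synthetic data are hypothetical, and the 0.7 threshold is an arbitrary illustrative cut-off rather than a standard; any real review would set it according to its own policy.

```python
import numpy as np

rng = np.random.default_rng(seed=4)

# Synthetic predictors with hypothetical names; "utilisation" is built to track "leverage".
n_obs = 500
leverage = rng.normal(size=n_obs)
utilisation = 0.9 * leverage + 0.3 * rng.normal(size=n_obs)
tenure = rng.normal(size=n_obs)

names = ["leverage", "utilisation", "tenure"]
data = np.column_stack([leverage, utilisation, tenure])
corr = np.corrcoef(data, rowvar=False)  # columns are variables, rows are observations


def flag_high_pairs(corr_matrix, labels, threshold=0.7):
    """Return (name_i, name_j, correlation) for off-diagonal pairs at or above the threshold."""
    flagged = []
    rows, cols = np.triu_indices(len(labels), k=1)  # upper triangle, diagonal excluded
    for i, j in zip(rows, cols):
        if abs(corr_matrix[i, j]) >= threshold:
            flagged.append((labels[i], labels[j], float(corr_matrix[i, j])))
    return flagged


flagged_pairs = flag_high_pairs(corr, names)
print(np.round(corr, 2))
print(flagged_pairs)
```

Only the upper triangle is scanned because the matrix is symmetric and the diagonal carries no information. A flagged pair is a prompt for further review, not a verdict on its own.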

Correlation measures compared

| Measure | Range | Captures | Robust? | Typical use |
| --- | --- | --- | --- | --- |
| Pearson | [−1, +1] | Linear association | No | Continuous, roughly well-behaved variables |
| Spearman | [−1, +1] | Monotonic association | More | Ranks, skewed data, outlier-prone settings |
| Kendall τ | [−1, +1] | Concordance | More | Small samples, tied ranks, conservative dependence reading |
| Point-biserial | [−1, +1] | Continuous vs binary | No | Predictor vs default flag type analysis |
| Partial correlation | [−1, +1] | Conditional linear relation | No | Controlling for confounders |
| Autocorrelation | [−1, +1] | Serial dependence | No | Time series, residual checks, macro evolution |
| Asset correlation | Often [0, 1] | Systematic dependence | N/A | Portfolio loss, IRB / Vasicek style dependence |
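One row of the table is worth a short illustration: point-biserial correlation is Pearson correlation with one variable coded as 0/1, which is why it shares Pearson's lack of robustness. The sketch below uses made-up scores and default flags to check that the group-means formula and plain Pearson agree.

```python
import numpy as np


def point_biserial(score, flag):
    """Group-means form: (mean_1 - mean_0) / std(score) * sqrt(p * (1 - p))."""
    score = np.asarray(score, dtype=float)
    flag = np.asarray(flag, dtype=int)
    mean_1 = score[flag == 1].mean()
    mean_0 = score[flag == 0].mean()
    p = flag.mean()  # share of observations with flag == 1
    return (mean_1 - mean_0) / score.std() * np.sqrt(p * (1.0 - p))


# Made-up illustrative scores and default flags, not real data.
score = [0.10, 0.40, 0.35, 0.80, 0.70, 0.20]
flag = [0, 0, 1, 1, 1, 0]

r_pb = point_biserial(score, flag)
r_pearson = np.corrcoef(score, flag)[0, 1]
print(round(r_pb, 4), round(r_pearson, 4))
```

The two numbers match up to floating-point error, so the statistic inherits the same sensitivity to outliers in the continuous variable.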

Concepts every validator should keep

clt

CLT is an approximation, not magic

It becomes powerful under the right conditions, but concentration, dependence, and heavy tails can make the approximation much slower or less reliable.

dependence

Correlation is not causation

Co-movement does not explain mechanism. It only tells you how variables behave together inside the sample.

shape

No single coefficient replaces a scatter plot

Pearson and Spearman can both miss non-monotonic structure, such as a U-shaped relation where both coefficients can sit near zero. Visual inspection remains part of serious analysis.

time series

Autocorrelation changes the game

Serial dependence breaks the usual independence story and can distort standard errors and inference.

portfolio risk

Asset correlation drives tail concentration

In portfolio models, correlation is not a side detail. It is one of the main drivers of diversification or concentration.

multicollinearity

Correlation between predictors matters

Even when model fit looks acceptable, highly correlated predictors can make coefficients unstable and interpretation unreliable.
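The coefficient-instability point can be seen in a small experiment. The sketch below assumes a plain linear model with two standard Normal predictors and true coefficients of 1, refits it on many synthetic samples, and compares how much the estimate for the first predictor moves when the predictors are uncorrelated versus correlated at 0.95. The setup and names are illustrative, not a description of any particular scorecard.

```python
import numpy as np

rng = np.random.default_rng(seed=5)


def coefficient_spread(predictor_corr, n_obs=200, n_repeats=500):
    """Std of the fitted coefficient on x1 across repeated synthetic samples."""
    estimates = np.empty(n_repeats)
    for r in range(n_repeats):
        x1 = rng.normal(size=n_obs)
        noise = rng.normal(size=n_obs)
        # x2 has unit variance and correlation predictor_corr with x1.
        x2 = predictor_corr * x1 + np.sqrt(1.0 - predictor_corr**2) * noise
        y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n_obs)  # true coefficients are both 1
        design = np.column_stack([np.ones(n_obs), x1, x2])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        estimates[r] = beta[1]
    return estimates.std(ddof=1)


spread_low = coefficient_spread(predictor_corr=0.0)
spread_high = coefficient_spread(predictor_corr=0.95)
print(f"corr=0.00: {spread_low:.3f}   corr=0.95: {spread_high:.3f}")
```

The data-generating process is identical in both runs apart from the predictor correlation, yet the estimated coefficient is several times more variable in the correlated case. Overall fit can look similar in both runs, which is why the instability is easy to miss.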

What to leave this page with

CLT explains why means become stable and approximately Normal. Correlation explains how variables move together.

The useful order is: first understand the sampling distribution of the mean, then understand covariance, then compare Pearson and Spearman, then read full correlation matrices.

Once those pieces connect, inference and dependence stop looking like separate topics and start behaving like one system.