CLT & Correlation
This page connects two ideas that sit at the core of inference and risk modelling: why averages become Normal, and how variables move together. The Central Limit Theorem (CLT) explains the behaviour of sample means; correlation explains the structure between variables.
Why the Normal distribution keeps returning
CLT is one of the reasons statistics works at all. It says that under broad conditions, the mean of many sampled observations tends toward a Normal distribution, even when the original data are far from Normal.
The theorem in plain language
Take any population distribution with finite variance. Draw repeated independent samples of size n. Compute the mean of each sample.
As n gets larger, the distribution of those sample means becomes approximately Normal.
The result is not about the original observations becoming Normal. It is about the sampling distribution of the mean.
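A minimal simulation sketch of this statement, using only the Python standard library. The Exponential(rate=1) source, the sample counts, and the helper names `sample_means` and `share_below` are illustrative assumptions, not part of the original page. A Normal distribution puts half its mass below its mean, while the raw exponential puts about 1 − e⁻¹ ≈ 0.632 there, so the share of sample means falling below the population mean is a simple symmetry check.

```python
import random
import statistics

def sample_means(n, n_samples=4000, seed=0):
    """Means of n_samples samples of size n drawn from Exponential(rate=1)."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(n_samples)
    ]

def share_below(values, threshold):
    """Fraction of values strictly below threshold."""
    return sum(1 for v in values if v < threshold) / len(values)

POPULATION_MEAN = 1.0  # mean of Exponential(rate=1)

# n = 1 is just the raw, skewed population; n = 50 averages fifty draws.
share_raw = share_below(sample_means(n=1), POPULATION_MEAN)
share_avg = share_below(sample_means(n=50), POPULATION_MEAN)
print(f"share below the mean, n=1:  {share_raw:.3f}")
print(f"share below the mean, n=50: {share_avg:.3f}")
```

The second share should sit much closer to 0.5 than the first. The population itself has not changed; only the distribution of the averages has become more symmetric.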
Why this matters in practice
CLT is what allows many confidence intervals, z-tests, backtesting bands, and approximation-based validation tools to exist.
In credit risk, the observed default rate is the sample mean of 0/1 default indicators. In portfolio risk, aggregate loss is often the sum of many small random pieces.
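As a sketch of how the CLT turns an observed default rate into a confidence band, the function below computes a Normal-approximation (Wald) interval. The function name `default_rate_interval` and the figures of 30 defaults among 2,000 obligors are hypothetical. The sketch assumes independent defaults, and the approximation is known to be weak when the default count is very small.

```python
import math
from statistics import NormalDist

def default_rate_interval(defaults, obligors, confidence=0.95):
    """Normal-approximation (Wald) interval for a default rate.

    Assumes independent default indicators (an assumption of this sketch).
    """
    p_hat = defaults / obligors
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    standard_error = math.sqrt(p_hat * (1 - p_hat) / obligors)
    lower = max(0.0, p_hat - z * standard_error)
    upper = min(1.0, p_hat + z * standard_error)
    return p_hat, lower, upper

# Hypothetical portfolio: 30 defaults observed among 2,000 obligors.
rate, low, high = default_rate_interval(defaults=30, obligors=2000)
print(f"rate={rate:.4f}, interval=({low:.4f}, {high:.4f})")
```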
How to think about CLT correctly
Do not confuse population with sample mean
The raw population can stay skewed forever. It is the distribution of repeated sample averages that becomes bell-shaped.
Sample size controls precision
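The spread of sample means shrinks as n grows: for independent draws, the standard deviation of the mean is σ/√n. This is why larger samples make the mean more stable, and why halving the spread takes four times the data.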
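The shrinking spread can be checked numerically. This sketch assumes a Uniform(0, 1) source, whose standard deviation is √(1/12), and compares the simulated spread of sample means with σ/√n. The helper name `spread_of_means` and the chosen sample sizes are illustrative.

```python
import math
import random
import statistics

def spread_of_means(n, n_samples=3000, seed=1):
    """Standard deviation of n_samples sample means, each from n Uniform(0, 1) draws."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.random() for _ in range(n))
        for _ in range(n_samples)
    ]
    return statistics.pstdev(means)

SIGMA = math.sqrt(1 / 12)  # standard deviation of Uniform(0, 1)

results = {n: spread_of_means(n) for n in (4, 16, 64)}
for n, simulated in results.items():
    print(f"n={n:>2}  simulated={simulated:.4f}  sigma/sqrt(n)={SIGMA / math.sqrt(n):.4f}")
```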
Skewed sources converge more slowly
A heavily skewed source may need much larger n before the Normal approximation becomes operationally safe.
Dependence complicates everything
If observations are strongly dependent, the CLT may still hold under additional conditions (for example, dependence that dies out quickly), but convergence becomes much less straightforward.
Watch the sample mean become Normal
Choose a source distribution, set the sample size, and repeatedly draw samples. The top chart shows the source population shape. The bottom chart shows the distribution of sample means.
How fast does CLT converge?
Not all source distributions converge at the same speed. A symmetric source reaches approximate normality much faster than a heavily skewed one.
|Skewness of x̄| vs sample size
Interpretation
The lower the absolute skewness of the sample-mean distribution, the closer it is to a bell curve.
Uniform and already-symmetric sources converge quickly. Exponential takes longer because the original asymmetry is stronger.
symmetric source → n ≈ 15
moderate skew → n ≈ 30
heavy skew → n ≈ 50+
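For independent, identically distributed draws, the skewness of the sample mean equals the source skewness divided by √n. The sketch below uses that fact to estimate how large n must be before the skewness drops under a tolerance. The tolerance of 0.3 and the function names are assumptions made for illustration. Skewness is also only one aspect of non-normality, so this does not reproduce the rules of thumb above exactly.

```python
import math

def skewness_of_mean(source_skewness, n):
    """Skewness of the mean of n i.i.d. draws from a source with the given skewness."""
    return source_skewness / math.sqrt(n)

def min_n_for_skewness(source_skewness, tolerance):
    """Smallest n for which |skewness of the sample mean| <= tolerance."""
    return max(1, math.ceil((abs(source_skewness) / tolerance) ** 2))

# Known skewness values: Uniform = 0, Exponential = 2.
for name, skew in (("uniform", 0.0), ("exponential", 2.0)):
    print(name, min_n_for_skewness(skew, tolerance=0.3))
```

With a tolerance of 0.3, the exponential source needs n = 45, which is in line with the "heavy skew → n ≈ 50+" guideline.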
How variables move together
Once you leave one-variable uncertainty and start looking at two variables together, the language changes from variance to covariance and correlation.
Covariance
Covariance asks whether X and Y move above and below their means together. Positive means same direction. Negative means opposite direction.
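A small sketch of the sample covariance, written out from its definition. The data values are made up for illustration.

```python
def covariance(xs, ys):
    """Sample covariance: average product of deviations from the means (n - 1 divisor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0, 4.0, 5.0, 4.0, 5.0]    # tends to rise with x
y_down = [5.0, 4.0, 3.0, 2.0, 1.0]  # falls as x rises

cov_up = covariance(x, y_up)
cov_down = covariance(x, y_down)
print(cov_up, cov_down)
```

The first covariance is positive and the second negative. The magnitudes depend on the units of X and Y, which is why covariance is hard to compare across variable pairs.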
Pearson
Pearson correlation standardises covariance. It measures linear association and always sits between −1 and +1.
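The standardisation step can be sketched as covariance divided by the product of the two standard deviations. The data are illustrative, and the rescaling by 100 is only there to show that the coefficient does not change with units.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation: sample covariance divided by both standard deviations."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson(x, y)
r_rescaled = pearson([100 * v for v in x], y)  # same data, different units for x
print(round(r, 4), round(r_rescaled, 4))
```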
Spearman
Spearman applies Pearson to ranks instead of levels. It measures monotonic association and is more robust to outliers and non-normality.
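A sketch of Spearman as "Pearson on ranks", with tied values sharing their average rank. The helper names and the data, including the single extreme value of 1000, are illustrative assumptions.

```python
import statistics

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks

def pearson(xs, ys):
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / ((len(xs) - 1) * statistics.stdev(xs) * statistics.stdev(ys))

def spearman(xs, ys):
    """Spearman correlation: Pearson applied to the ranks of each variable."""
    return pearson(average_ranks(xs), average_ranks(ys))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_outlier = [1.0, 2.0, 3.0, 4.0, 1000.0]  # perfectly monotonic, one extreme level

rho = spearman(x, y_outlier)
r = pearson(x, y_outlier)
print(round(rho, 3), round(r, 3))
```

Because ranks ignore how far apart the levels are, the extreme value leaves Spearman at 1 while pulling Pearson well below 1.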
See where Pearson and Spearman disagree
The gap between Pearson and Spearman is often more informative than either statistic alone. It tells you something about curvature, monotonicity, and outlier sensitivity.
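To make the gap concrete, this sketch compares both coefficients on three synthetic shapes: a straight line, a convex but monotonic curve, and a U-shape. The shapes, the grid of x values, and the helper names are assumptions chosen for illustration. The simple `ranks` helper assumes there are no tied values, which holds for these inputs.

```python
import math
import statistics

def pearson(xs, ys):
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / ((len(xs) - 1) * statistics.stdev(xs) * statistics.stdev(ys))

def ranks(values):
    """1-based ranks; this sketch assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

x = [i / 2 for i in range(-10, 11)]  # -5.0, -4.5, ..., 5.0
shapes = {
    "linear": [2 * v + 1 for v in x],
    "convex monotone": [math.exp(v) for v in x],
    "U-shape": [(v - 0.1) ** 2 for v in x],  # small shift avoids tied y values
}

gaps = {}
for name, y in shapes.items():
    r, rho = pearson(x, y), spearman(x, y)
    gaps[name] = (r, rho)
    print(f"{name:<16} pearson={r:+.3f}  spearman={rho:+.3f}")
```

A large gap with Spearman near 1 points to curvature in a monotonic relationship. When both are near zero, as in the U-shape, neither coefficient sees the structure and only a plot will.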
Correlation matrices and validation reading
Correlation matrices turn pairwise relationships into a compact map. In model development and validation, they are one of the fastest ways to check dependence structure and multicollinearity risk.
Heatmap
How to read the matrix
The diagonal is always 1.0. The matrix is symmetric. What matters are the off-diagonal relationships.
High predictor-to-predictor correlation can signal multicollinearity. Low predictor-to-target correlation can signal weak standalone information.
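A minimal sketch of building a correlation matrix and flagging highly correlated predictor pairs. The column names, the data, and the 0.8 threshold are hypothetical. In practice the threshold is a judgment call, not a fixed rule.

```python
import statistics
from itertools import combinations

def pearson(xs, ys):
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / ((len(xs) - 1) * statistics.stdev(xs) * statistics.stdev(ys))

def correlation_matrix(columns):
    """Nested dict of pairwise Pearson correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

def high_pairs(matrix, threshold=0.8):
    """Off-diagonal pairs whose absolute correlation reaches the threshold."""
    return [
        (a, b, matrix[a][b])
        for a, b in combinations(matrix, 2)
        if abs(matrix[a][b]) >= threshold
    ]

# Hypothetical predictors; income_proxy is deliberately close to income.
data = {
    "income": [30, 45, 50, 62, 80, 95],
    "debt": [10, 30, 12, 40, 22, 35],
    "income_proxy": [31, 44, 52, 60, 82, 93],
}

matrix = correlation_matrix(data)
flagged = high_pairs(matrix)
for a, b, value in flagged:
    print(f"{a} vs {b}: {value:.3f}")
```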
Correlation measures compared
| Measure | Range | Captures | Robust? | Typical use |
|---|---|---|---|---|
| Pearson | [−1, +1] | Linear association | No | Continuous, roughly well-behaved variables |
| Spearman | [−1, +1] | Monotonic association | More | Ranks, skewed data, outlier-prone settings |
| Kendall τ | [−1, +1] | Concordance | More | Small samples, tied ranks, conservative dependence reading |
| Point-biserial | [−1, +1] | Continuous vs binary | No | Predictor vs default flag type analysis |
| Partial correlation | [−1, +1] | Conditional linear relation | No | Controlling for confounders |
| Autocorrelation | [−1, +1] | Serial dependence | No | Time series, residual checks, macro evolution |
| Asset correlation | [0, 1] often | Systematic dependence | N/A | Portfolio loss, IRB / Vasicek style dependence |
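As one worked example from the table, point-biserial correlation is Pearson correlation with a 0/1 variable, and it can also be written in terms of the two group means. The sketch below checks that both routes agree. The scores and default flags are invented for illustration.

```python
import math
import statistics

def pearson(xs, ys):
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / ((len(xs) - 1) * statistics.stdev(xs) * statistics.stdev(ys))

def point_biserial(values, flags):
    """Point-biserial correlation between a continuous variable and a 0/1 flag."""
    group_1 = [v for v, f in zip(values, flags) if f == 1]
    group_0 = [v for v, f in zip(values, flags) if f == 0]
    n = len(values)
    mean_gap = statistics.fmean(group_1) - statistics.fmean(group_0)
    weight = math.sqrt(len(group_1) * len(group_0) / n**2)
    return mean_gap / statistics.pstdev(values) * weight

# Hypothetical credit scores and default flags (1 = default).
score = [620, 580, 700, 540, 660, 500, 720, 560]
default = [0, 1, 0, 1, 0, 1, 0, 1]

r_pb = point_biserial(score, default)
r_pearson = pearson(score, default)
print(round(r_pb, 4), round(r_pearson, 4))
```

The negative sign says that defaulters in this made-up sample have lower scores on average.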
Concepts every validator should keep
CLT is an approximation, not magic
It becomes powerful under the right conditions, but concentration, dependence, and heavy tails can make the approximation much slower or less reliable.
Correlation is not causation
Co-movement does not explain mechanism. It only tells you how variables behave together inside the sample.
No single coefficient replaces a scatter plot
Pearson and Spearman can both miss non-monotonic structure, such as a U-shaped relationship, and a single number hides clusters and outliers. Visual inspection remains part of serious analysis.
Autocorrelation changes the game
Serial dependence breaks the usual independence story and can distort standard errors and inference.
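A sketch of how serial dependence eats into the information in a sample. It simulates an AR(1) series, estimates the lag-1 autocorrelation, and converts it into an approximate effective sample size n(1 − ρ)/(1 + ρ). That formula is a large-sample approximation for AR(1)-type dependence, and the coefficient 0.7, the series length, and the function names are assumptions made for illustration.

```python
import random
import statistics

def simulate_ar1(phi, n, seed=0):
    """AR(1) series x_t = phi * x_{t-1} + standard Normal noise, starting from 0."""
    rng = random.Random(seed)
    value, series = 0.0, []
    for _ in range(n):
        value = phi * value + rng.gauss(0.0, 1.0)
        series.append(value)
    return series

def lag_autocorrelation(series, lag=1):
    """Sample autocorrelation at the given lag."""
    mean = statistics.fmean(series)
    total = sum((v - mean) ** 2 for v in series)
    cross = sum(
        (series[t] - mean) * (series[t - lag] - mean)
        for t in range(lag, len(series))
    )
    return cross / total

series = simulate_ar1(phi=0.7, n=5000)
rho1 = lag_autocorrelation(series)
# Approximate number of independent observations carrying the same information.
n_effective = len(series) * (1 - rho1) / (1 + rho1)
print(f"lag-1 autocorrelation={rho1:.3f}, effective n={n_effective:.0f} of {len(series)}")
```

A standard error computed with the nominal n would be too small here, which is how autocorrelation leads to overconfident inference.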
Asset correlation drives tail concentration
In portfolio models, correlation is not a side detail. It is one of the main drivers of diversification or concentration.
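The effect can be sketched with the Vasicek one-factor formula for a large homogeneous portfolio, where the loss-rate quantile at level q is Φ((Φ⁻¹(PD) + √ρ Φ⁻¹(q)) / √(1 − ρ)). The PD of 2%, the correlation values, and the function name are hypothetical inputs, and the sketch ignores LGD and maturity effects.

```python
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def vasicek_loss_quantile(pd, rho, q=0.999):
    """Default-rate quantile in the Vasicek one-factor model.

    Assumes an infinitely granular, homogeneous portfolio (sketch assumption).
    """
    numerator = STANDARD_NORMAL.inv_cdf(pd) + sqrt(rho) * STANDARD_NORMAL.inv_cdf(q)
    return STANDARD_NORMAL.cdf(numerator / sqrt(1 - rho))

# Same hypothetical PD of 2%, increasing asset correlation.
quantiles = {rho: vasicek_loss_quantile(pd=0.02, rho=rho) for rho in (0.0, 0.05, 0.20)}
for rho, value in quantiles.items():
    print(f"rho={rho:.2f}  99.9% default rate={value:.4f}")
```

With zero correlation the tail default rate equals the PD, because idiosyncratic risk diversifies away completely. As correlation rises, the same PD produces a much heavier tail.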
Correlation between predictors matters
Even when model fit looks acceptable, highly correlated predictors can make coefficients unstable and interpretation unreliable.
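One way to quantify this is the variance inflation factor (VIF). In the special case of a linear regression with exactly two predictors, the VIF of each is 1 / (1 − r²), where r is their Pearson correlation. The sketch below covers only that two-predictor case. The general VIF uses the R² from regressing each predictor on all the others.

```python
def variance_inflation_factor(r):
    """VIF for a two-predictor linear regression with predictor correlation r."""
    if abs(r) >= 1:
        raise ValueError("perfectly correlated predictors: coefficients are not identified")
    return 1 / (1 - r**2)

for r in (0.5, 0.9, 0.99):
    print(f"r={r:.2f}  VIF={variance_inflation_factor(r):.2f}")
```

A VIF of k means the coefficient's sampling variance is k times what it would be with uncorrelated predictors, so the estimate becomes unstable as r approaches 1.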
What to leave this page with
CLT explains why means become stable and approximately Normal. Correlation explains how variables move together.
The useful order is: first understand the sampling distribution of the mean, then understand covariance, then compare Pearson and Spearman, then read full correlation matrices.
Once those pieces connect, inference and dependence stop looking like separate topics and start behaving like one system.