notes · 09

Descriptive Statistics

Before modelling, you need to learn how to read a dataset. Descriptive statistics are the first layer: where the data sits, how spread out it is, how asymmetric it looks, and whether a few values are distorting the picture.

Start with centre, then spread, then shape. Use the playground to see how the same data can look stable, skewed, or misleading depending on what you summarise.


start here

What descriptive statistics are really for

Descriptive statistics do not explain causality and they do not predict the future. Their job is simpler and more important: to give you a truthful first reading of the data in front of you.

Centre

Measures of central tendency try to answer one question: where does the dataset usually live?

Mean, median, and mode all answer that question differently. Mean reacts to every value. Median cares about position. Mode cares about frequency.

Different centre measures are not rivals — they are different lenses.

If they disagree sharply, that is already information about the shape of the data.
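As a rough illustration of the three lenses, the sketch below computes mean, median, and mode on a small made-up dataset, once with a single extreme value and once without it. The data and the helper name `centre` are assumptions for illustration only.

```python
from statistics import mean, median, mode

# Hypothetical data: eleven ordinary values plus one extreme observation.
values = [4, 5, 5, 5, 6, 6, 7, 7, 8, 8, 9, 50]
core = values[:-1]  # the same data without the extreme value


def centre(data):
    """Return the three centre measures for a list of numbers."""
    return {"mean": mean(data), "median": median(data), "mode": mode(data)}


with_extreme = centre(values)
without_extreme = centre(core)

print("with extreme   :", with_extreme)
print("without extreme:", without_extreme)
```

Here the single value 50 lifts the mean from about 6.4 to 10, moves the median only from 6 to 6.5, and leaves the mode at 5. The disagreement between the three is the signal.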

Spread

Two datasets can have the same centre and feel completely different. That difference often lives in their spread.

Variance, standard deviation, range, IQR, and MAD tell you how tightly or loosely the data clusters around its typical values.

In practice, spread is often what tells you whether a dataset looks reliable, noisy, stable, or fragile.
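To make "same centre, different feel" concrete, here is a minimal sketch with two made-up datasets that share a mean and a median but differ strongly in spread. The names `tight` and `loose` are hypothetical, and population standard deviation is used for simplicity.

```python
from statistics import mean, median, pstdev

# Two hypothetical datasets with the same mean (50) and median (50).
tight = [48, 49, 50, 50, 51, 52]
loose = [20, 35, 50, 50, 65, 80]

for name, data in (("tight", tight), ("loose", loose)):
    value_range = max(data) - min(data)
    print(
        f"{name}: mean={mean(data)}, median={median(data)}, "
        f"std={pstdev(data):.2f}, range={value_range}"
    )
```

The centre measures cannot tell these two apart; the standard deviation and the range can.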

reading sequence

How to read a dataset before you model it

The order below is a useful default. Followed consistently, it turns descriptive statistics from isolated formulas into a checklist for data quality and intuition.

Check the centre

Start with mean, median, and mode. If they are close, the data may be reasonably symmetric. If they diverge, ask why.

Check the spread

Look at variance, standard deviation, and IQR. This tells you whether the dataset is tightly concentrated or structurally dispersed.

Check the shape

Skewness and kurtosis help you ask whether the data is asymmetric or heavy-tailed. This matters because some models quietly assume symmetry or thin tails.

Check quartiles and outliers

Percentiles, IQR, and box-plot logic show whether a few observations are dominating the summary.
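One common reading of "box-plot logic" is the conventional 1.5 × IQR fence rule, which the text does not spell out, so treat it as an assumption here. The sketch below uses Python's `statistics.quantiles` with the inclusive method; other quartile conventions give slightly different cut points.

```python
from statistics import quantiles

# Hypothetical data with one suspicious observation.
values = [4, 5, 5, 5, 6, 6, 7, 7, 8, 8, 9, 50]

# Quartiles under the "inclusive" convention (an assumption; conventions differ).
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Conventional box-plot fences: 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in values if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), flagged={outliers}")
```

A flagged point is a prompt to investigate, not an instruction to delete it.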

core measures

What each family of measures is trying to capture

centre

Mean

The arithmetic average. Useful, standard, and statistically efficient, but easily pulled by extreme values.

centre

Median

The middle observation after sorting. More robust than the mean when the data is skewed or contaminated by outliers.

centre

Mode

The most frequent value. Sometimes ignored, but useful when the data clusters around repeated points or categories.

spread

Variance & Std Dev

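The default way to measure how far data drifts from the mean. Variance is the average squared deviation from the mean, so it is expressed in squared units; standard deviation is its square root, which puts spread back in the original units.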

spread

IQR

The width of the middle 50% of the data. Often a better summary of spread when outliers exist.

shape

Skewness & Kurtosis

Skewness captures asymmetry. Kurtosis captures tail behaviour and concentration of extremes.
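As a sketch of how these two shape measures can be computed, the function below uses the plain population-moment definitions: skewness as the third standardised moment and excess kurtosis as the fourth standardised moment minus 3. This is one common convention and an assumption here; bias-corrected sample versions exist and give slightly different numbers. The function name `shape` and the data are hypothetical.

```python
from statistics import fmean


def shape(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mu = fmean(values)
    m2 = sum((x - mu) ** 2 for x in values) / n  # variance
    m3 = sum((x - mu) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mu) ** 4 for x in values) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("shape is undefined when all values are equal")
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skewness, excess_kurtosis


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 20]

print("symmetric   :", shape(symmetric))
print("right-skewed:", shape(right_skewed))
```

The symmetric sample has skewness of zero, while the sample with one large value on the right has clearly positive skewness.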

interactive playground

Edit the data and watch the summaries move

This section is where the definitions become intuitive. Change the data, switch between presets, add a random point, and watch how centre, spread, and shape respond.

Dataset: 12 data points

Central tendency

Mean: arithmetic average
Median: middle value
Mode: most frequent
Trimmed Mean (10%): robust average

Spread

Variance: population
Std Dev: population
Sample Var: n−1 corrected
Sample Std: n−1 corrected
Range: max − min
IQR: Q3 − Q1
CV (%): relative spread
MAD: mean absolute deviation

Shape & percentiles

Skewness: asymmetry
Kurtosis (excess): tail heaviness
Q1: 25th percentile
Q3: 75th percentile
Min
Max: P100
Median: P50
P95: tail checkpoint

Charts: Histogram (with Mean, Median, and Mode markers), Box plot, Dot plot (sorted), Deviation from mean
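Two of the playground summaries, the 10% trimmed mean and MAD (mean absolute deviation), are less familiar than the rest. The sketch below shows one way they could be computed. The trimming rule (drop `int(n * proportion)` values from each end) is an assumption, since the playground's exact convention is not stated, and the function names and data are hypothetical.

```python
from statistics import fmean


def trimmed_mean(data, proportion=0.10):
    """Mean after dropping int(n * proportion) values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(data)
    k = int(len(ordered) * proportion)  # number of values cut per side
    return fmean(ordered[k:len(ordered) - k])


def mean_abs_deviation(data):
    """Average absolute distance from the mean."""
    mu = fmean(data)
    return fmean([abs(x - mu) for x in data])


# Hypothetical 12-point dataset with one extreme value.
values = [4, 5, 5, 5, 6, 6, 7, 7, 8, 8, 9, 50]

print("mean              :", fmean(values))
print("trimmed mean (10%):", trimmed_mean(values, 0.10))
print("MAD               :", round(mean_abs_deviation(values), 3))
```

With 12 points, a 10% trim removes one value from each end, so the extreme value no longer reaches the average.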

interpretation

What to look for when the numbers disagree

The most interesting datasets are often the ones where the summaries do not line up cleanly. That tension is usually the clue.

Mean far from median

When mean and median separate, ask whether the data is skewed or whether a few extreme observations are pulling the average.

Right-skew often means the mean sits above the median. Left-skew often pushes it below.

High std dev, modest IQR

This often means the central mass is still relatively compact, but a few far observations are inflating the variance.

In other words: the core may be stable even if the tails are not.
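A small made-up example of this pattern: the sketch compares standard deviation and IQR on a compact core, then on the same core plus one far observation. Quartiles use the inclusive convention of `statistics.quantiles`, which is an assumption; the helper name `spread` is hypothetical.

```python
from statistics import pstdev, quantiles


def spread(data):
    """Return (population std dev, IQR) for a list of numbers."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return pstdev(data), q3 - q1


# Hypothetical compact core, then the same core with one far observation.
core = [97, 98, 99, 99, 100, 100, 100, 101, 101, 102, 103]
with_tail = core + [160]

std_core, iqr_core = spread(core)
std_tail, iqr_tail = spread(with_tail)

print(f"core only : std={std_core:.2f}, IQR={iqr_core:.2f}")
print(f"with tail : std={std_tail:.2f}, IQR={iqr_tail:.2f}")
```

The standard deviation grows roughly tenfold while the IQR barely moves, which is the signature of a stable core with an unstable tail.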

modelling context

Where descriptive statistics matter in risk and validation

Situation	Most useful summary	Why it matters
Symmetric and well-behaved data	Mean + Std Dev	These two numbers already tell a large part of the story when the shape is stable.
Skewed variables	Median + IQR	More robust than the mean and variance when tails or asymmetry distort the average.
Outlier-heavy datasets	Trimmed Mean + MAD	Helps separate the central structure from contamination by extremes.
PD backtesting	Mean default rate + variance	The average observed rate matters, but so does its dispersion across periods, pools, or segments.
LGD / recovery style variables	Median + percentiles	Recovery data is often bounded, skewed, and full of structural spikes.
Stress / tail analysis	P95 / P99 + kurtosis	Tail checkpoints often say more than a single average when the concern is severity.
Comparing variability across scales	CV	Standard deviation alone is not enough when the means are on very different levels.
Validation diagnostics	Five-number summary	A simple way to compare predicted vs observed distributions before moving into more formal tests.

foundations

Key concepts worth keeping

Bessel correction

n vs n−1

Sample variance divides by n−1 rather than n because deviations are measured from the sample mean, which is itself fitted to the data. Dividing by n would therefore tend to underestimate population variability.
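The two divisors can be compared directly with the standard library, where `pvariance` divides by n and `variance` divides by n−1. The data below is a made-up example.

```python
from statistics import pvariance, variance

# Hypothetical sample of 8 observations with mean 5.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)

pop_var = pvariance(sample)   # sum of squared deviations / n
sample_var = variance(sample)  # sum of squared deviations / (n - 1)

print(f"divide by n   : {pop_var}")
print(f"divide by n-1 : {sample_var:.4f}")
print(f"ratio         : {sample_var / pop_var:.4f} (equals n / (n - 1))")
```

The corrected estimate is always larger by the factor n / (n − 1), so the difference matters most for small samples and fades as n grows.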

Robustness

Not all summaries break equally

Median and IQR are more robust than mean and standard deviation. That matters when the data contains structural outliers.

Skewness

Direction of asymmetry

Skewness is not just a statistic; it is an early signal that some modelling assumptions may be too clean for the data.

Kurtosis

Tail behaviour

Excess kurtosis above zero suggests more extreme outcomes than a Normal benchmark would expect.

Coefficient of variation

Relative dispersion

The coefficient of variation makes spread comparable across datasets with different scales or units.
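A minimal sketch of the idea, assuming the CV is defined as population standard deviation divided by the mean and expressed in percent (some sources use the sample standard deviation instead). The function name `cv_percent` and the three datasets are hypothetical.

```python
from statistics import fmean, pstdev


def cv_percent(data):
    """Coefficient of variation in percent: 100 * std / mean."""
    mu = fmean(data)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * pstdev(data) / mu


# Hypothetical datasets on very different scales.
small_scale = [10, 12, 14]
large_scale = [1000, 1200, 1400]   # same shape, 100x the scale
large_stable = [1000, 1010, 1020]  # large values, little relative spread

for name, data in (
    ("small_scale", small_scale),
    ("large_scale", large_scale),
    ("large_stable", large_stable),
):
    print(f"{name}: std={pstdev(data):.2f}, CV={cv_percent(data):.2f}%")
```

By standard deviation alone, `large_stable` looks more dispersed than `small_scale`; by CV it is far less dispersed. The CV is only meaningful for ratio-scale data with a mean well away from zero.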

Five-number summary

A fast positional map

Min, Q1, median, Q3, and max often tell the story faster than a full model. They are the anatomy of the box plot.
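The five numbers can be collected in a few lines. The sketch below uses the inclusive quartile convention of `statistics.quantiles`, which is an assumption, and compares two made-up samples labelled `observed` and `predicted` purely for illustration.

```python
from statistics import quantiles


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return (min(data), q1, q2, q3, max(data))


# Hypothetical samples with the same median but different widths.
observed = [1, 2, 3, 4, 5, 6, 7, 8, 9]
predicted = [2, 3, 4, 5, 5, 5, 6, 7, 8]

print("observed :", five_number_summary(observed))
print("predicted:", five_number_summary(predicted))
```

Both samples share a median of 5, but the predicted values sit in a narrower box and a shorter range, which is visible from the five numbers alone.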

summary

What to leave this page with

Descriptive statistics are not decorative. They are the first serious reading of a dataset.

The right sequence is: check centre, then spread, then shape, then quartiles and outliers.

Once that becomes habit, you start seeing data less as a raw list of values and more as a structured object with a centre, a width, a shape, and a set of modelling consequences.