notes · 09

Descriptive Statistics

Before modelling, you need to learn how to read a dataset. Descriptive statistics are the first layer: where the data sits, how spread out it is, how asymmetric it looks, and whether a few values are distorting the picture.

Start with centre, then spread, then shape. Use the playground to see how the same data can look stable, skewed, or misleading depending on what you summarise.

What descriptive statistics are really for

Descriptive statistics do not explain causality and they do not predict the future. Their job is simpler and more important: to give you a truthful first reading of the data in front of you.

Centre

Measures of central tendency try to answer one question: where does the dataset usually live?

Mean, median, and mode all answer that question differently. Mean reacts to every value. Median cares about position. Mode cares about frequency.

Different centre measures are not rivals. They are different lenses on the same data.

If they disagree sharply, that is already information about the shape of the data.
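To make the disagreement concrete, here is a minimal sketch using Python's standard `statistics` module. The data values are hypothetical, chosen so that one large observation pulls the mean away from the median and mode.

```python
from statistics import mean, median, multimode

# Hypothetical right-skewed sample: most values are small, one is large.
data = [2, 3, 3, 3, 4, 5, 6, 40]

centre = {
    "mean": mean(data),       # uses every value, so the 40 pulls it upward
    "median": median(data),   # depends only on position in the sorted data
    "mode": multimode(data),  # most frequent value(s), returned as a list
}
print(centre)
```

The mean lands well above the median and the mode here, which is the kind of gap that should prompt a closer look at the shape of the data.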

Spread

Two datasets can have the same centre and feel completely different. That difference often lives in their spread.

Variance, standard deviation, range, IQR, and MAD tell you how tightly or loosely the data clusters around its typical values.

In practice, spread is often what tells you whether a dataset looks reliable, noisy, stable, or fragile.
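A small sketch of the same idea, assuming two hypothetical datasets built to share a mean: the centre is identical, but the range and standard deviation are not.

```python
from statistics import mean, pstdev

# Two hypothetical datasets with the same mean and very different spread.
tight = [48, 49, 50, 51, 52]
loose = [10, 30, 50, 70, 90]

for name, xs in (("tight", tight), ("loose", loose)):
    value_range = max(xs) - min(xs)
    # pstdev is the population standard deviation (divides by n).
    print(f"{name}: mean={mean(xs)}, range={value_range}, std={pstdev(xs):.2f}")
```

Reporting the mean alone would make these two datasets look interchangeable.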

How to read a dataset before you model it

The four steps below give a useful order. If you follow it consistently, descriptive statistics stop feeling like isolated formulas and start acting like a checklist for data quality and intuition.

01

Check the centre

Start with mean, median, and mode. If they are close, the data may be reasonably symmetric. If they diverge, ask why.

02

Check the spread

Look at variance, standard deviation, and IQR. This tells you whether the dataset is tightly concentrated or widely dispersed.

03

Check the shape

Skewness and kurtosis help you ask whether the data is asymmetric or heavy-tailed. This matters because some models quietly assume symmetry or thin tails.

04

Check quartiles and outliers

Percentiles, IQR, and box-plot logic show whether a few observations are dominating the summary.
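The four checks above can be sketched as one small function. This is an illustration under stated assumptions, not the playground's implementation: `first_reading` is a hypothetical name, the moments use the population form, the quartiles use the "inclusive" convention, and outliers are flagged with the common 1.5 × IQR box-plot rule.

```python
from statistics import mean, median, multimode, pstdev, quantiles

def first_reading(xs):
    """Centre, spread, shape, then quartiles and outliers, in that order."""
    n = len(xs)
    m, sd = mean(xs), pstdev(xs)  # assumes sd > 0 (not all values equal)

    # Quartile conventions differ between tools; "inclusive" is one choice.
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1

    # Moment-based shape measures (population form).
    skew = sum((x - m) ** 3 for x in xs) / (n * sd ** 3)
    excess_kurt = sum((x - m) ** 4 for x in xs) / (n * sd ** 4) - 3

    # Box-plot rule of thumb: flag points beyond 1.5 * IQR from the quartiles.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    return {
        "centre": {"mean": m, "median": median(xs), "mode": multimode(xs)},
        "spread": {"std": sd, "iqr": iqr},
        "shape": {"skewness": skew, "excess_kurtosis": excess_kurt},
        "outliers": [x for x in xs if x < low or x > high],
    }

# Hypothetical sample with one extreme value.
report = first_reading([2, 3, 3, 3, 4, 5, 6, 40])
for section, values in report.items():
    print(section, values)
```

Reading the output top to bottom mirrors the checklist: a mean above the median, a standard deviation much larger than the IQR, positive skewness, and a single flagged observation all point at the same extreme value.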

What each family of measures is trying to capture

centre

Mean

The arithmetic average. Useful, standard, and efficient, but easily pulled by extreme values.

centre

Median

The middle observation after sorting (the average of the two middle values when the count is even). More robust than the mean when the data is skewed or contaminated by outliers.

centre

Mode

The most frequent value. Sometimes ignored, but useful when the data clusters around repeated points or categories.

spread

Variance & Std Dev

The default way to measure how far data sits from the mean. Variance is expressed in squared units; standard deviation puts spread back in the original units.

spread

IQR

The width of the middle 50% of the data. Often a better summary of spread when outliers exist.

shape

Skewness & Kurtosis

Skewness captures asymmetry. Kurtosis captures tail behaviour and concentration of extremes.
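For reference, one common way to define the two shape measures is through central moments. This is the population (moment) form; the page does not state which variant the playground uses, and many tools apply small-sample adjustments, so treat the notation below as an assumption.

```latex
m_k = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^k, \qquad
\text{skewness} = \frac{m_3}{m_2^{3/2}}, \qquad
\text{excess kurtosis} = \frac{m_4}{m_2^{2}} - 3
```

Here $m_2$ is the population variance. Subtracting 3 sets the Normal distribution to zero, which is why excess kurtosis is read relative to a Normal benchmark.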

Edit the data and watch the summaries move

This section is where the definitions become intuitive. Change the data, switch between presets, add a random point, and watch how centre, spread, and shape respond.

Dataset: 12 data points

Central tendency

Mean: arithmetic average
Median: middle value
Mode: most frequent
Trimmed Mean (10%): robust average

Spread

Variance: population
Std Dev: population
Sample Var: n−1 corrected
Sample Std: n−1 corrected
Range: max − min
IQR: Q3 − Q1
CV (%): relative spread
MAD: mean absolute deviation

Shape & percentiles

Skewness: asymmetry
Kurtosis (excess): tail heaviness
Q1: 25th percentile
Q3: 75th percentile
Min: P0
Max: P100
Median: P50
P95: tail checkpoint

Charts: histogram (with mean, median, and mode markers), box plot, dot plot (sorted), deviation from mean.

What to look for when the numbers disagree

The most interesting datasets are often the ones where the summaries do not line up cleanly. That tension is usually the clue.

Mean far from median

When mean and median separate, ask whether the data is skewed or whether a few extreme observations are pulling the average.

Right-skew often means the mean sits above the median. Left-skew often pushes it below.

High std dev, modest IQR

This often means the central mass is still relatively compact, but a few far observations are inflating the variance.

In other words: the core may be stable even if the tails are not.
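A minimal sketch of that pattern, assuming a hypothetical compact core to which two far observations are added. The `iqr` helper is illustrative and uses the "inclusive" quartile convention.

```python
from statistics import pstdev, quantiles

def iqr(xs):
    """Width of the middle 50%, using the 'inclusive' quartile convention."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    return q3 - q1

# Hypothetical compact core, then the same core plus two far observations.
core = [48, 49, 49, 50, 50, 50, 50, 51, 51, 52]
with_tails = core + [5, 120]

for label, xs in (("core only", core), ("with tails", with_tails)):
    print(f"{label:>10}: std = {pstdev(xs):6.2f}, IQR = {iqr(xs):.2f}")
```

The standard deviation grows by more than an order of magnitude while the IQR barely moves, because the middle half of the data has not changed.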

Where descriptive statistics matter in risk and validation

Situation | Most useful summary | Why it matters
Symmetric and well-behaved data | Mean + Std Dev | These two numbers already tell a large part of the story when the shape is stable.
Skewed variables | Median + IQR | More robust than the mean and variance when tails or asymmetry distort the average.
Outlier-heavy datasets | Trimmed Mean + MAD | Helps separate the central structure from contamination by extremes.
PD backtesting | Mean default rate + variance | The average observed rate matters, but so does its dispersion across periods, pools, or segments.
LGD / recovery style variables | Median + percentiles | Recovery data is often bounded, skewed, and full of structural spikes.
Stress / tail analysis | P95 / P99 + kurtosis | Tail checkpoints often say more than a single average when the concern is severity.
Comparing variability across scales | CV | Standard deviation alone is not enough when the means are on very different levels.
Validation diagnostics | Five-number summary | A simple way to compare predicted vs observed distributions before moving into more formal tests.
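The trimmed mean and MAD row is the least familiar of these, so here is a hedged sketch. The function names are hypothetical, and trimming 10% from each tail is an assumption, since "10%" can also mean 10% in total. MAD is shown in two variants: the mean absolute deviation, and the median absolute deviation, which resists outliers more strongly.

```python
from statistics import mean, median

def trimmed_mean(xs, proportion=0.10):
    """Mean after dropping `proportion` of the points from each end (assumed convention)."""
    xs = sorted(xs)
    k = int(len(xs) * proportion)  # points cut per tail, rounded down
    return mean(xs[k:len(xs) - k])

def mean_abs_dev(xs):
    """Average absolute distance from the mean."""
    m = mean(xs)
    return mean(abs(x - m) for x in xs)

def median_abs_dev(xs):
    """Median absolute distance from the median (the more robust variant)."""
    med = median(xs)
    return median(abs(x - med) for x in xs)

# Hypothetical outlier-heavy sample: nine ordinary values and one extreme.
data = [10, 11, 12, 12, 13, 13, 14, 14, 15, 200]

print("mean:          ", mean(data))
print("trimmed mean:  ", trimmed_mean(data))
print("mean abs dev:  ", round(mean_abs_dev(data), 2))
print("median abs dev:", median_abs_dev(data))
```

The plain mean and the mean absolute deviation are both dragged by the single extreme value, while the trimmed mean and the median absolute deviation stay close to the central cluster.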

Key concepts worth keeping

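Bessel's correction

n vs n−1

Sample variance divides by n−1 rather than n. Deviations are measured from the sample mean, which sits closer to the sample points than the true population mean does, so dividing by n tends to underestimate population variability.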
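One way to see the effect without any randomness is to enumerate every possible sample. The sketch below assumes a tiny hypothetical population and draws all ordered samples of size 2 with replacement, then averages both versions of the variance across them.

```python
from itertools import product
from statistics import pvariance, variance

# Tiny hypothetical population whose true variance we can compute directly.
population = [2, 4, 6, 8]
true_var = pvariance(population)

# Every ordered sample of size 2, drawn with replacement (16 in total).
samples = list(product(population, repeat=2))

# Average each estimator over all possible samples.
avg_divide_by_n = sum(pvariance(s) for s in samples) / len(samples)
avg_divide_by_n_minus_1 = sum(variance(s) for s in samples) / len(samples)

print("true variance:          ", true_var)
print("average with n divisor: ", avg_divide_by_n)
print("average with n-1 divisor:", avg_divide_by_n_minus_1)
```

Averaged over all samples, the n divisor comes in at half the true variance for samples of size 2, while the n−1 divisor recovers it exactly.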

Robustness

Not all summaries break equally

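Median and IQR are more robust than mean and standard deviation because they depend on the positions of values in the sorted data, not on how extreme the largest and smallest values are. That matters when the data contains structural outliers.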

Skewness

Direction of asymmetry

Skewness is not just a statistic; it is an early signal that some modelling assumptions may be too clean for the data.

Kurtosis

Tail behaviour

Excess kurtosis above zero suggests heavier tails, meaning more extreme outcomes than a Normal distribution would produce.

CV

Relative dispersion

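The coefficient of variation is the standard deviation divided by the mean, usually quoted as a percentage. It makes spread comparable across datasets with different scales or units, but it becomes unstable when the mean is close to zero.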
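A short sketch, assuming two hypothetical datasets on very different scales and using the population standard deviation (the page does not say which version the playground's CV uses). `cv_percent` is an illustrative name.

```python
from statistics import mean, pstdev

def cv_percent(xs):
    """Coefficient of variation in percent; assumes a positive, non-zero mean."""
    return 100 * pstdev(xs) / mean(xs)

# Hypothetical datasets on different scales.
small_scale = [9, 10, 11]
large_scale = [990, 1000, 1010]

for name, xs in (("small scale", small_scale), ("large scale", large_scale)):
    print(f"{name}: std = {pstdev(xs):.2f}, CV = {cv_percent(xs):.2f}%")
```

The large-scale dataset has the bigger standard deviation in absolute terms, yet its CV is smaller: relative to its own level, it is the tighter of the two.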

Five-number summary

A fast positional map

Min, Q1, median, Q3, and max often tell the story faster than a full model. They are the anatomy of the box plot.

What to leave this page with

Descriptive statistics are not decorative. They are the first serious reading of a dataset.

The right sequence is: check centre, then spread, then shape, then quartiles and outliers.

Once that becomes habit, you start seeing data less as a raw list of values and more as a structured object with a centre, a width, a shape, and a set of modelling consequences.