Descriptive Statistics
Before modelling, you need to learn how to read a dataset. Descriptive statistics are the first layer: where the data sits, how spread out it is, how asymmetric it looks, and whether a few values are distorting the picture.
What descriptive statistics are really for
Descriptive statistics do not explain causality and they do not predict the future. Their job is simpler and more important: to give you a truthful first reading of the data in front of you.
Centre
Measures of central tendency try to answer one question: where does the dataset usually live?
Mean, median, and mode all answer that question differently. Mean reacts to every value. Median cares about position. Mode cares about frequency.
If they disagree sharply, that is already information about the shape of the data.
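As a minimal sketch of that disagreement, the snippet below computes all three measures with Python's standard `statistics` module. The `values` list is made up for illustration: mostly small numbers plus one large one.

```python
import statistics

# Hypothetical sample: a compact cluster plus one large value.
values = [2, 3, 3, 4, 5, 6, 40]

mean = statistics.mean(values)      # reacts to every value, including 40
median = statistics.median(values)  # middle value after sorting
mode = statistics.mode(values)      # most frequent value

print(f"mean={mean}, median={median}, mode={mode}")
```

Here the mean lands well above the median and the mode, which is the kind of gap worth investigating before anything else.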
Spread
Two datasets can have the same centre and feel completely different. That difference often lives in their spread.
Variance, standard deviation, range, IQR, and MAD tell you how tightly or loosely the data clusters around its typical values.
In practice, spread is often what tells you whether a dataset looks reliable, noisy, stable, or fragile.
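The sketch below compares two made-up datasets that share the same mean and median but differ in spread. It assumes Python 3.8+ (for `statistics.quantiles`), reads MAD as the median absolute deviation, and uses the "inclusive" quartile convention; other conventions give slightly different quartiles.

```python
import statistics

def spread_summary(values):
    """Return several spread measures for a list of numbers."""
    # Quartiles under the "inclusive" convention (an assumption of this sketch).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    med = statistics.median(values)
    return {
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "std_dev": statistics.stdev(values),      # square root of the above
        "iqr": q3 - q1,                           # width of the middle 50%
        # MAD taken here as the median absolute deviation from the median.
        "mad": statistics.median([abs(x - med) for x in values]),
    }

# Hypothetical data: both lists are centred on 50.
tight = spread_summary([48, 49, 50, 51, 52])
loose = spread_summary([10, 30, 50, 70, 90])

for name in tight:
    print(f"{name:>8}: tight={tight[name]:.2f}  loose={loose[name]:.2f}")
```

Every measure is larger for the second list even though its centre is identical, which is the point of checking spread separately.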
How to read a dataset before you model it
The four checks below are a useful order to follow. If you apply them consistently, descriptive statistics stop feeling like isolated formulas and start acting like a checklist for data quality and intuition.
Check the centre
Start with mean, median, and mode. If they are close, the data may be reasonably symmetric. If they diverge, ask why.
Check the spread
Look at variance, standard deviation, and IQR. Together they tell you whether the dataset is tightly concentrated or widely dispersed.
Check the shape
Skewness and kurtosis help you ask whether the data is asymmetric or heavy-tailed. This matters because some models quietly assume symmetry or thin tails.
Check quartiles and outliers
Percentiles, IQR, and box-plot logic show whether a few observations are dominating the summary.
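One common form of box-plot logic flags points that lie more than 1.5 × IQR beyond the quartiles. The sketch below assumes that rule, Python 3.8+, and the "inclusive" quartile convention; the function name `tukey_outliers` and the sample are illustrative only.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return the points lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr  # box-plot whisker limits
    return [x for x in values if x < lower or x > upper]

# Hypothetical sample with one value far from the rest.
sample = [12, 14, 15, 15, 16, 17, 18, 19, 95]
flagged = tukey_outliers(sample)
print("flagged:", flagged)
```

A flagged point is a prompt to investigate, not an instruction to delete it.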
What each family of measures is trying to capture
Mean
The arithmetic average. Useful, standard, and efficient, but easily pulled by extreme values.
Median
The middle observation after sorting. More robust than the mean when the data is skewed or contaminated by outliers.
Mode
The most frequent value. Sometimes ignored, but useful when the data clusters around repeated points or categories.
Variance & Std Dev
The default way to measure how far data drifts from the mean. Variance is expressed in squared units; standard deviation, its square root, puts spread back in the original units.
IQR
The width of the middle 50% of the data. Often a better summary of spread when outliers exist.
Skewness & Kurtosis
Skewness captures asymmetry. Kurtosis captures tail behaviour and concentration of extremes.
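Several definitions of skewness and kurtosis are in use. The sketch below assumes the simple moment-based (population) versions, without small-sample corrections, so results may differ slightly from what a given statistics package reports. The function names are illustrative.

```python
import statistics

def _central_moment(values, order):
    """Average of (x - mean) ** order over all values."""
    mu = statistics.fmean(values)
    return sum((x - mu) ** order for x in values) / len(values)

def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (0 for symmetric data)."""
    return _central_moment(values, 3) / _central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a Normal distribution scores 0."""
    return _central_moment(values, 4) / _central_moment(values, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 20]
heavy_tailed = [0] * 18 + [-10, 10]

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric), excess_kurtosis(heavy_tailed))
```

Both functions divide by the variance, so they fail on constant data where the variance is zero.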
Edit the data and watch the summaries move
This section is where the definitions become intuitive. Change the data, switch between presets, add a random point, and watch how centre, spread, and shape respond.
[Interactive widget panels: Dataset, Central tendency, Spread, Shape & percentiles, Histogram, Box plot, Dot plot (sorted), Deviation from mean]
What to look for when the numbers disagree
The most interesting datasets are often the ones where the summaries do not line up cleanly. That tension is usually the clue.
Mean far from median
When mean and median separate, ask whether the data is skewed or whether a few extreme observations are pulling the average.
Right-skew often means the mean sits above the median. Left-skew often pushes it below.
High std dev, modest IQR
This often means the central mass is still relatively compact, but a few far observations are inflating the variance.
In other words: the core may be stable even if the tails are not.
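A small sketch of this pattern, using made-up numbers: adding a single distant observation to a compact dataset leaves the IQR unchanged but inflates the standard deviation many times over. It assumes Python 3.8+ and the "inclusive" quartile convention.

```python
import statistics

def sd_and_iqr(values):
    """Return (sample standard deviation, interquartile range)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return statistics.stdev(values), q3 - q1

# Hypothetical data: a compact core, then the same core plus one far point.
core = [48, 49, 49, 50, 50, 51, 51, 52]
with_tail = core + [120]

sd_core, iqr_core = sd_and_iqr(core)
sd_tail, iqr_tail = sd_and_iqr(with_tail)

print(f"core:      sd={sd_core:.2f}, iqr={iqr_core:.2f}")
print(f"with tail: sd={sd_tail:.2f}, iqr={iqr_tail:.2f}")
```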
Where descriptive statistics matter in risk and validation
| Situation | Most useful summary | Why it matters |
|---|---|---|
| Symmetric and well-behaved data | Mean + Std Dev | These two numbers already tell a large part of the story when the shape is stable. |
| Skewed variables | Median + IQR | More robust than the mean and variance when tails or asymmetry distort the average. |
| Outlier-heavy datasets | Trimmed Mean + MAD | Helps separate the central structure from contamination by extremes. |
| PD backtesting | Mean default rate + variance | The average observed rate matters, but so does its dispersion across periods, pools, or segments. |
| LGD / recovery style variables | Median + percentiles | Recovery data is often bounded, skewed, and full of structural spikes. |
| Stress / tail analysis | P95 / P99 + kurtosis | Tail checkpoints often say more than a single average when the concern is severity. |
| Comparing variability across scales | Coefficient of variation (CV) | Standard deviation alone is not enough when the means are on very different levels. |
| Validation diagnostics | Five-number summary | A simple way to compare predicted vs observed distributions before moving into more formal tests. |
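The trimmed mean in the table can be sketched as follows. This version assumes a symmetric trim that drops the same fraction of points from each end; the function name, the 10% default, and the sample data are illustrative choices, not a standard.

```python
import statistics

def trimmed_mean(values, proportion=0.1):
    """Mean after dropping `proportion` of the points from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    data = sorted(values)
    k = int(len(data) * proportion)  # number of points cut per side
    return statistics.fmean(data[k:len(data) - k])

# Hypothetical sample with one extreme value.
observations = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
plain = statistics.fmean(observations)
trimmed = trimmed_mean(observations, 0.1)
print(f"plain mean={plain}, 10% trimmed mean={trimmed}")
```

With ten points and a 10% trim, one value is removed from each end, so the extreme observation no longer drives the average.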
Key concepts worth keeping
N vs N−1
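Sample variance divides by n−1 rather than n because deviations are measured from the sample mean, which always sits at least as close to the sample points as the true mean does. Dividing by n would therefore tend to underestimate population variability.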
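One way to see the effect without simulation noise is to enumerate every possible sample from a tiny population and average each estimator. The sketch below assumes a made-up population of six values and samples of size 2 drawn with replacement.

```python
import itertools
import statistics

population = [1, 2, 3, 4, 5, 6]
true_var = statistics.pvariance(population)  # divides by N: the population value

# Every ordered sample of size 2, drawn with replacement (36 in total).
samples = list(itertools.product(population, repeat=2))

# Average of each estimator across all possible samples.
avg_div_n = statistics.fmean([statistics.pvariance(s) for s in samples])
avg_div_n_minus_1 = statistics.fmean([statistics.variance(s) for s in samples])

print(f"true variance:          {true_var:.4f}")
print(f"average, divide by n:   {avg_div_n:.4f}")
print(f"average, divide by n-1: {avg_div_n_minus_1:.4f}")
```

Dividing by n averages to only half of the true variance here (a factor of (n−1)/n with n = 2), while dividing by n−1 averages to the true value.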
Not all summaries break equally
Median and IQR are more robust than mean and standard deviation. That matters when the data contains structural outliers.
Direction of asymmetry
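The sign of skewness gives the direction of the longer tail: positive for a right tail, negative for a left tail. It is not just a statistic; it is an early signal that some modelling assumptions may be too clean for the data.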
Tail behaviour
Excess kurtosis above zero suggests heavier tails, and so more extreme outcomes, than a Normal distribution would produce.
Relative dispersion
The coefficient of variation makes spread comparable across datasets with different scales or units.
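A minimal sketch, assuming the usual definition of the coefficient of variation as the sample standard deviation divided by the absolute mean. The two lists are made up so that one is the other scaled by 100.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / abs(mean)

# Hypothetical data on two very different scales.
small_scale = [9, 10, 11]
large_scale = [900, 1000, 1100]

cv_small = coefficient_of_variation(small_scale)
cv_large = coefficient_of_variation(large_scale)
print(f"std devs: {statistics.stdev(small_scale)} vs {statistics.stdev(large_scale)}")
print(f"CVs:      {cv_small:.3f} vs {cv_large:.3f}")
```

The standard deviations differ by a factor of 100, but the CVs match, because the relative dispersion is the same. The measure becomes unstable when the mean is close to zero.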
A fast positional map
Min, Q1, median, Q3, and max often tell the story faster than a full model. They are the anatomy of the box plot.
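The five values can be collected in a few lines. This sketch assumes Python 3.8+ and the "inclusive" quartile convention; the function name and sample are illustrative.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, q3, max(values))

# Hypothetical sample.
summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(dict(zip(["min", "q1", "median", "q3", "max"], summary)))
```

Computing the same summary for predicted and observed values gives a quick side-by-side comparison before any formal test.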
What to leave this page with
Descriptive statistics are not decorative. They are the first serious reading of a dataset.
The right sequence is: check centre, then spread, then shape, then quartiles and outliers.
Once that becomes habit, you start seeing data less as a raw list of values and more as a structured object with a centre, a width, a shape, and a set of modelling consequences.