notes · 13

Parametric & Non-Parametric Tests

Once hypothesis testing is clear, the next practical question is which test to use. This page is about that decision: t-test, ANOVA, Chi-Square, Mann-Whitney, and the logic for choosing between parametric and non-parametric tools.

Start with the distinction, then the test-selection flow, then work through the main tests one by one. The goal is not memorising names — it is learning which question each test actually answers.


the distinction

Parametric vs non-parametric

The real distinction is not “old tests vs new tests.” It is “tests that rely on stronger assumptions” versus “tests that sacrifice power in exchange for robustness.”

Parametric tests

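Parametric tests assume a particular distributional form for the data, usually some combination of Normality, equal variances, and a continuous measurement scale. They also assume independent observations, but so do most non-parametric tests, so independence is not what separates the two families.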

When those assumptions are acceptable, they are usually more powerful because they use more information from the data.

Typical examples: t-test, ANOVA, Pearson correlation, z-test.

They are often the first choice in clean experimental data, but they become fragile if shape assumptions are badly violated.

Non-parametric tests

Non-parametric tests avoid strong distributional assumptions. They often work on ranks or frequencies instead of raw values.

They are less efficient when parametric assumptions truly hold, but safer when the data is skewed, bounded, heavy-tailed, ordinal, or contaminated by outliers.

Typical examples: Mann-Whitney U, Kruskal-Wallis, Chi-Square, Spearman.

Risk reality: PD, LGD, recovery, utilisation, and delinquency data are often skewed, bounded, spiky, or discrete. In many validation settings, non-parametric confirmation is not optional — it is just good practice.

decision framework

Which test should I use?

The cleanest way to choose a test is to ask a small number of structural questions: how many groups, what type of variable, and how strong the assumptions look.

How many groups or categories?

2 groups → t-test / Mann-Whitney
3+ groups → ANOVA / Kruskal-Wallis
Association table → Chi-Square

What type of variable?

Continuous → mean/rank based tests
Categorical → count/frequency based tests

Are parametric assumptions plausible?

Roughly yes → parametric route
No / unsure → non-parametric route

Independent or paired?

Independent → independent t / MW-U
Paired → paired t / Wilcoxon

Variable screening example: suppose you compare DTI between defaulted and non-defaulted borrowers. Two groups + continuous variable. If shape looks acceptable, try an independent t-test. If DTI is heavily skewed, confirm with Mann-Whitney.
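The selection flow above can be written down as a small lookup function. This is only a sketch of the questions on this page: the function name, argument names, and string labels are illustrative choices, not a standard API, and whether assumptions are "plausible" remains a judgement call passed in by the caller.

```python
def choose_test(n_groups, variable_type, assumptions_ok=False, paired=False):
    """Suggest a test name by following the decision flow on this page.

    All names and labels here are illustrative assumptions.
    """
    if variable_type == "categorical":
        # Counts in a contingency table: test association between categories.
        return "Chi-Square test of independence"
    if variable_type != "continuous":
        raise ValueError("variable_type must be 'continuous' or 'categorical'")
    if n_groups < 2:
        raise ValueError("need at least two groups to compare")
    if n_groups == 2:
        if paired:
            return "paired t-test" if assumptions_ok else "Wilcoxon signed-rank"
        return "independent t-test" if assumptions_ok else "Mann-Whitney U"
    # Three or more independent groups (paired designs are outside this flow).
    return "one-way ANOVA" if assumptions_ok else "Kruskal-Wallis"


# DTI for defaulted vs non-defaulted borrowers, heavily skewed:
print(choose_test(2, "continuous", assumptions_ok=False))
```

For the DTI example, the call returns the Mann-Whitney route. Passing `assumptions_ok=True` returns the independent t-test instead.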

learning sequence

A useful way to learn the family of tests

Start with the question, not the formula

Every test is answering a structural question: different means, different distributions, or association between categories.

Then inspect assumptions

Normality, equal variances, and sample size decide whether parametric tools are justified or whether rank-based alternatives are safer.

Then separate significance from effect size

A tiny p-value paired with a negligible effect does not make a variable an important predictor. Validation work should report both the p-value and the effect size.

Then think about business relevance

A statistically detectable difference is not automatically useful. Ask whether the result changes segmentation, ranking, or modelling decisions.

interactive · t-test

Independent samples t-test

The t-test asks whether the difference between two group means is large relative to the variability inside those groups, given the sample sizes.

[Interactive panel: distributions of Group A and Group B. Default parameters: μ_A = 50, μ_B = 58, σ_A = 10, σ_B = 10, n = 40 per group. Reported outputs: x̄_A, x̄_B, t-statistic, p-value, Cohen's d (effect size), and the decision at α = 0.05.]

t = (x̄_A − x̄_B) / √(s_A²/n_A + s_B²/n_B)

Validation lens: if a predictor has clearly different central tendency between good and bad accounts, that is early evidence of discriminatory value. But if the variable is highly skewed, use Mann-Whitney as a robustness check.
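As a rough illustration of the t-statistic and Cohen's d, here is a sketch using only the Python standard library. The statistic is computed with a separate variance per group (the unpooled, Welch form), together with the Welch-Satterthwaite degrees of freedom. The two samples are made-up numbers, and the p-value is left out because the standard library has no t distribution; a statistics package would supply it.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, df) with t = (mean_a - mean_b) / sqrt(var_a/n_a + var_b/n_b)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    se2 = va / na + vb / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


def cohens_d(a, b):
    """Standardised mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


# Hypothetical toy samples, not real portfolio data.
group_a = [48, 52, 50, 47, 53]
group_b = [56, 60, 58, 55, 61]

t_stat, df = welch_t(group_a, group_b)
d = cohens_d(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {df:.1f}, Cohen's d = {d:.3f}")
```

The sign of t and d only reflects which group is listed first. The magnitude of d is what speaks to materiality.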

interactive · anova

One-way ANOVA

ANOVA asks whether at least one group mean differs from the others. It does not tell you which one — only whether the group structure matters.

[Interactive panel: group means for a selectable scenario. Reported outputs: SS_between, SS_within, F-statistic, p-value, df (between, within), η² (effect size), and the decision at α = 0.05.]

F = MS_between / MS_within

Validation lens: rating grades should not all look the same. If observed PD or score means are statistically indistinguishable across grades, the rating structure is not separating risk well.

Important: ANOVA only says “at least one group differs.” To identify where the differences sit, you need post-hoc testing, for example Tukey's HSD or pairwise comparisons with a multiplicity correction.
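The F ratio can be traced through by hand. The sketch below assumes three small groups of made-up values standing in for rating grades. It computes the between-group and within-group sums of squares, the F-statistic, the degrees of freedom, and η² as SS_between divided by the total sum of squares. The p-value is omitted because it needs the F distribution, which the standard library does not provide.

```python
import statistics


def one_way_anova(groups):
    """Return SS_between, SS_within, F, (df_between, df_within), eta squared."""
    all_values = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_values)
    # Between: how far each group mean sits from the grand mean, weighted by size.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within: spread of observations around their own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_sq = ss_between / (ss_between + ss_within)
    return ss_between, ss_within, f_stat, (df_between, df_within), eta_sq


# Hypothetical values for three grades, not real data.
grades = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
ss_b, ss_w, f_stat, dfs, eta_sq = one_way_anova(grades)
print(f"SS_between = {ss_b:.2f}, SS_within = {ss_w:.2f}, F = {f_stat:.2f}")
print(f"df = {dfs}, eta squared = {eta_sq:.4f}")
```

Here most of the total variation sits between the groups, so both F and η² are large. Identifying which grade drives that still requires a post-hoc comparison.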

interactive · chi-square

Chi-Square test of independence

Chi-Square is for categorical association. It asks whether the observed contingency table is too far from what independence would imply.

[Interactive panel: observed table, expected table under H₀, and per-cell contributions for a selectable scenario. Reported outputs: χ² statistic, p-value, Cramér's V (effect size), and the decision at α = 0.05.]

χ² = Σ (O − E)² / E

Variable selection: if a categorical feature such as homeownership or employment status is strongly associated with default, Chi-Square gives you the formal evidence. Cramér's V tells you whether the association is trivial or operationally meaningful.
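To make the χ² formula concrete, the following sketch builds the expected table from the row and column totals, sums the cell contributions, and derives Cramér's V. The 2×2 counts are invented for illustration (rows could be owner / renter, columns default / non-default). No continuity correction is applied, and the p-value shortcut used here is only valid for one degree of freedom.

```python
import math


def chi_square_independence(table):
    """Return expected counts, the chi-square statistic, df and Cramér's V."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Under independence, E = row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    n_rows, n_cols = len(table), len(table[0])
    df = (n_rows - 1) * (n_cols - 1)
    cramers_v = math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))
    return expected, chi2, df, cramers_v


# Hypothetical counts, not real data.
observed = [[10, 90], [30, 70]]
expected, chi2, df, cramers_v = chi_square_independence(observed)

# With df = 1, chi-square is a squared standard Normal, so the upper tail
# probability can be written with the complementary error function.
p_value = math.erfc(math.sqrt(chi2 / 2)) if df == 1 else None

print(expected)
print(f"chi2 = {chi2:.2f}, df = {df}, Cramér's V = {cramers_v:.2f}")
```

For larger tables the p-value needs the χ² distribution with the matching degrees of freedom. The expected counts are also worth printing for their own sake, because small expected cells are what undermine the approximation.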

interactive · mann-whitney

Mann-Whitney U

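Mann-Whitney is the rank-based alternative to the two-sample t-test. Instead of comparing means, it asks whether values in one group tend to be larger than values in the other. It is useful when distributions are skewed, heavy-tailed, bounded, or clearly non-Normal.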

[Interactive panel: rank distributions of Group A and Group B for a selectable scenario. Reported outputs: U statistic, z (normal approximation), p-value, rank-biserial r (effect size), median of A, median of B, and the decision at α = 0.05.]

U_A = ΣR_A − n_A(n_A + 1)/2

Risk reality: LGD and recovery data are often exactly the kinds of variables that make t-tests uncomfortable. Mann-Whitney is usually the cleaner first diagnostic.
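The U formula can be followed step by step: pool the two samples, assign ranks (ties share the average rank), sum the ranks of group A, and subtract n_A(n_A + 1)/2. The sketch below does that with the standard library and adds a Normal approximation for z and the p-value, without tie or continuity correction. The data are toy values, far too few for the approximation to be trusted, and the sign convention for the rank-biserial correlation (positive when A tends to be larger) is one of several in use.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Return U for group a, z, two-sided p (Normal approx.), rank-biserial r."""
    na, nb = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    u_a = sum(ranks[:na]) - na * (na + 1) / 2
    mu = na * nb / 2
    sigma = math.sqrt(na * nb * (na + nb + 1) / 12)  # no tie correction
    z = (u_a - mu) / sigma
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    rank_biserial = 2 * u_a / (na * nb) - 1
    return u_a, z, p_two_sided, rank_biserial


# Hypothetical toy samples with one tie across groups.
sample_a = [1, 2, 3, 4]
sample_b = [3, 5, 6, 7]
u_a, z, p, r = mann_whitney_u(sample_a, sample_b)
print(f"U_A = {u_a}, z = {z:.3f}, p = {p:.3f}, rank-biserial r = {r:.4f}")
```

U_A counts the pairs in which the A value exceeds the B value, with ties counted as one half, which is why the rank-biserial correlation can be read directly as "share of pairs favouring A minus share favouring B".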

reference

Test selection cheat sheet

Scenario	Parametric	Non-parametric	Data type	Validation use
2 groups	Independent t-test	Mann-Whitney U	Continuous	Default vs non-default comparison
Paired comparison	Paired t-test	Wilcoxon signed-rank	Continuous	Before vs after comparison on same sample
3+ groups	One-way ANOVA	Kruskal-Wallis	Continuous	Grades, segments, collateral types
Categorical association	—	Chi-Square	Categorical	Categorical predictor vs default
Small 2×2 table	—	Fisher's exact	Categorical	Rare-event contingency tables
Linear association	Pearson	Spearman	Continuous / ranked	Predictor relationships, dependence checks
Distribution equality	—	Kolmogorov-Smirnov (KS) test	Continuous	Score distribution comparison

deeper concepts

Concepts every validator should keep

assumptions

Test the assumptions before the test

Normality, equal variances, and cell counts are not side details. They determine whether the chosen test is even interpretable.
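A quick descriptive screen can flag obvious problems before any test is run. The sketch below computes a moment-based sample skewness and the ratio of the larger to the smaller sample variance. Both helper names and the toy data are assumptions for illustration, and these numbers are screens only: formal checks such as Shapiro-Wilk or Levene's test, and plots, remain the stronger evidence.

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


def variance_ratio(a, b):
    """Larger sample variance divided by the smaller one."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return max(va, vb) / min(va, vb)


# Hypothetical toy data.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

print(f"skewness (symmetric)    = {sample_skewness(symmetric):.3f}")
print(f"skewness (right-skewed) = {sample_skewness(right_skewed):.3f}")
print(f"variance ratio          = {variance_ratio(symmetric, [2, 4, 6, 8, 10]):.1f}")
```

A skewness near zero and a variance ratio near one are consistent with the parametric route. Large values of either are a reason to add the rank-based check.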

effect size

P-value is not enough

Report effect sizes such as Cohen's d, η², Cramér's V, or rank-biserial correlation. Significance without materiality is weak validation evidence.

multiple testing

False positives multiply fast

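If many predictors are screened, some will appear significant by chance: at α = 0.05, roughly one in twenty predictors with no real relationship to the outcome will still pass. Statistical filtering should be combined with a multiplicity correction and domain logic.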
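Two common corrections show how the threshold tightens when many p-values are screened at once. The sketch below implements Bonferroni (compare each p-value with α / m) and the Benjamini-Hochberg step-up procedure (find the largest rank k whose sorted p-value is at most k/m · α and flag everything up to that rank). The p-values are invented; the page itself does not prescribe either method.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag p-values that pass the Bonferroni threshold alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Flag discoveries under the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k with p_(k) <= k / m * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    flags = [False] * m
    for idx in order[:cutoff]:
        flags[idx] = True
    return flags


# Hypothetical p-values from screening five predictors.
p_values = [0.001, 0.012, 0.025, 0.045, 0.20]
print("Bonferroni:        ", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

Bonferroni controls the chance of any false positive and is the stricter of the two. Benjamini-Hochberg controls the expected share of false discoveries among the flagged predictors, which is often the more natural target in wide variable screens.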

post-hoc logic

ANOVA is only the gatekeeper

ANOVA tells you that differences exist, not where they sit. For graded systems, post-hoc testing matters.

robustness

Non-parametric does not mean weaker thinking

It often means more realistic thinking when the variable is skewed, bounded, or structurally messy.

validation style

Use tests as evidence, not as authority

Good validation combines significance, effect size, stability, model context, and business meaning. No single test is the whole answer.

summary

What to leave this page with

Parametric tests are efficient when their assumptions are credible. Non-parametric tests are safer when the data is messy, skewed, or structurally non-Normal.

The right workflow is: define the question, inspect assumptions, choose the test family, then interpret significance together with effect size.

Once that habit is in place, test selection stops being memorisation and becomes structured judgement.