Background

Most A/B tests at HomeBuddy ask: “is the new variant better?” — a superiority trial. But sometimes the question is different: we want to ship a change for cost, performance, or architectural reasons, and we need to verify that it does not harm the key metric by more than a tolerable margin. This is a non-inferiority trial.

The distinction is not cosmetic — it changes the null hypothesis, the test statistic, and critically, the sample size formula. Using a superiority design when you actually want non-inferiority leads to either an underpowered test (too few users to detect the margin) or a miscalibrated one (wrong null hypothesis).

Problem Definition

Let $\mu_E$ and $\mu_C$ be the true means of the experimental and control groups respectively. Both trial types test a claim about $\mu_E - \mu_C$, but they differ in what they try to prove.

Superiority trial

$$H_0: \mu_E - \mu_C \le 0 \quad \text{vs} \quad H_1: \mu_E - \mu_C > 0$$

Rejecting $H_0$ means the treatment is strictly better than control.

Non-inferiority trial

$$H_0: \mu_E - \mu_C \le -\Delta \quad \text{vs} \quad H_1: \mu_E - \mu_C > -\Delta$$

where $\Delta > 0$ is the non-inferiority margin — the largest degradation you are willing to tolerate. Rejecting $H_0$ means the treatment is not worse than control by more than $\Delta$.
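
As a concrete sketch of that test (the conversion counts below are invented for illustration), the only change from a standard two-proportion z-test is shifting the null center from $0$ to $-\Delta$:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical data: 10,000 users per group, counts made up for illustration
n_c, conv_c = 10_000, 2_000   # control: 20.0%
n_e, conv_e = 10_000, 1_970   # experiment: 19.7%
margin = 0.01                 # Delta: tolerate at most a 1 pp drop

p_c, p_e = conv_c / n_c, conv_e / n_e
se = np.sqrt(p_e * (1 - p_e) / n_e + p_c * (1 - p_c) / n_c)

# Shift the null center from 0 to -Delta: H0 is mu_E - mu_C <= -Delta
z = ((p_e - p_c) + margin) / se
p_value = 1 - norm.cdf(z)     # one-sided

print(f'z = {z:.3f}, one-sided p = {p_value:.4f}')
# p < alpha => reject H0: any drop is smaller than the margin
```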

Side-by-Side Comparison

| Feature | Superiority trial | Non-inferiority trial |
|---|---|---|
| Goal | Prove $\mu_E > \mu_C$ | Prove $\mu_E \ge \mu_C - \Delta$ |
| Null hypothesis | $\mu_E - \mu_C \le 0$ | $\mu_E - \mu_C \le -\Delta$ |
| Alternative | $\mu_E - \mu_C > 0$ | $\mu_E - \mu_C > -\Delta$ |
| Test type | Two-sided (typically) | One-sided (always) |
| Null center | $0$ | $-\Delta$ |
| Test statistic | $Z = \dfrac{(\bar x_E - \bar x_C) - 0}{SE}$ | $Z = \dfrac{(\bar x_E - \bar x_C) + \Delta}{SE}$ |
| Key parameter | $\delta$ (MDE) | $\Delta$ (margin) |
| Sample size | See below | See below |

Sample Size Formulas

Let $\sigma^2$ be the common variance, $\alpha$ the significance level, and $1-\beta$ the desired power.

Superiority (conventionally run two-sided even when the hypothesis of interest is directional, detects effect $\delta$):

$$ n \approx \frac{\left(Z_{1-\alpha/2} + Z_{1-\beta}\right)^2 \cdot 2\sigma^2}{\delta^2} $$

Non-inferiority (one-sided, with anticipated true difference $\delta_{\text{ant}} = \mu_E - \mu_C$):

$$ n \approx \frac{\left(Z_{1-\alpha} + Z_{1-\beta}\right)^2 \cdot 2\sigma^2}{(\Delta + \delta_{\text{ant}})^2} $$

Typically $\delta_{\text{ant}} = 0$ is assumed (no true difference), which simplifies the denominator to $\Delta^2$. Note the sign: a positive $\delta_{\text{ant}}$ (the variant expected to be slightly better) moves the truth further from the null center $-\Delta$ and reduces the required sample size.
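
For a hypothetical illustration of that term: with $\Delta = 0.010$ and an anticipated lift of $\delta_{\text{ant}} = 0.002$, the denominator grows from $(0.010)^2$ to $(0.012)^2$, shrinking $n$ by a factor of $(0.010/0.012)^2 \approx 0.69$, roughly a 31% traffic saving.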

Key insight: the non-inferiority formula uses the one-sided critical value $Z_{1-\alpha}$ (not $Z_{1-\alpha/2}$), and the effect in the denominator is the margin $\Delta$, not the MDE $\delta$. When $\Delta$ and $\delta$ happen to be numerically equal, the one-sided critical value actually makes the non-inferiority $n$ slightly smaller, as the output below shows. In practice, however, margins are usually set much tighter than the MDE you would accept in a superiority test, so a tight margin (small $\Delta$) can easily demand more users than a typical superiority test.

Sample size calculator

```python
from scipy.stats import norm
import numpy as np


def n_superiority(delta, sigma2, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided superiority test."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return int(np.ceil(z**2 * 2 * sigma2 / delta**2))


def n_non_inferiority(margin, sigma2, alpha=0.05, power=0.8, anticipated_delta=0.0):
    """Per-group sample size for a one-sided non-inferiority test.

    anticipated_delta is the anticipated true difference mu_E - mu_C;
    a positive value (variant expected slightly better) reduces n.
    """
    z = norm.ppf(1 - alpha) + norm.ppf(power)
    return int(np.ceil(z**2 * 2 * sigma2 / (margin + anticipated_delta)**2))


# Example: conversion rate ~ 20%, margin / MDE = 1 pp
p = 0.20
sigma2 = p * (1 - p)  # Bernoulli variance p(1 - p)
effect = 0.01  # 1 percentage point

n_sup = n_superiority(effect, sigma2)
n_ni = n_non_inferiority(effect, sigma2)

print(f'Conversion rate: {p:.0%},  effect / margin: {effect:.1%}')
print(f'Superiority   n per group: {n_sup:>8,}')
print(f'Non-inferiority n per group: {n_ni:>8,}')
```

```
Conversion rate: 20%,  effect / margin: 1.0%
Superiority   n per group:   25,117
Non-inferiority n per group:   19,785
```
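
As a quick sanity check, assuming statsmodels is available, the superiority figure can be reproduced with its power solver (`NormalIndPower` works on the standardized effect size $\delta/\sigma$):

```python
from statsmodels.stats.power import NormalIndPower
import numpy as np

# Standardized effect size: delta / sigma for the 20% baseline above
es = 0.01 / np.sqrt(0.20 * 0.80)

# Per-group n for a two-sided z-test with alpha=0.05, power=0.8
n = NormalIndPower().solve_power(effect_size=es, alpha=0.05, power=0.8,
                                 alternative='two-sided')
print(int(np.ceil(n)))  # ~25,117, matching n_superiority above
```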
Comparison across margins and conversion rates

```python
import pandas as pd

# Reuses n_superiority / n_non_inferiority from the cell above
rows = []
for p in [0.05, 0.10, 0.20]:
    for margin_pct in [0.01, 0.02, 0.05]:
        s2 = p * (1 - p)
        rows.append({
            'Baseline p': f'{p:.0%}',
            'Margin': f'{margin_pct:.0%}',
            'Superiority n': n_superiority(margin_pct, s2),
            'Non-inferiority n': n_non_inferiority(margin_pct, s2),
        })

pd.DataFrame(rows).set_index(['Baseline p', 'Margin'])
```

| Baseline p | Margin (abs.) | Superiority n | Non-inferiority n |
|---|---|---|---|
| 5% | 1% | 7457 | 5874 |
| 5% | 2% | 1865 | 1469 |
| 5% | 5% | 299 | 235 |
| 10% | 1% | 14128 | 11129 |
| 10% | 2% | 3532 | 2783 |
| 10% | 5% | 566 | 446 |
| 20% | 1% | 25117 | 19785 |
| 20% | 2% | 6280 | 4947 |
| 20% | 5% | 1005 | 792 |
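
One pattern worth noting: in every row the non-inferiority $n$ is about 21% smaller than the superiority $n$. That is exactly the ratio of the squared critical-value sums, $\left(\frac{Z_{0.95} + Z_{0.8}}{Z_{0.975} + Z_{0.8}}\right)^2 \approx 0.79$, since the two formulas are otherwise identical when $\Delta = \delta$.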

When to Use Each

Use a superiority trial when:

  • You expect the new variant to improve a metric and want to prove it.
  • The cost of a false positive (shipping a change that is not actually better) is high.

Use a non-inferiority trial when:

  • You are shipping a change for non-metric reasons (performance, cost, tech debt) and need to confirm no regression.
  • You are replacing an old component and the new one is expected to be equivalent, not better.
  • You want to retire a feature and need to prove its removal doesn’t degrade key metrics.

A common mistake is to run a superiority test and interpret a non-significant result as proof of non-inferiority — it is not. “We couldn’t detect a difference” is not the same as “we proved there is no meaningful difference.” Only a properly designed non-inferiority test with a pre-specified margin gives you that guarantee.
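
To make that trap concrete, here is a hypothetical, deliberately underpowered example with invented counts: the superiority test comes back non-significant, yet the same data also cannot rule out a drop larger than a 1 pp margin.

```python
import numpy as np
from scipy.stats import norm

# Deliberately small (underpowered) groups; counts invented for illustration
n_c, conv_c = 1_000, 200     # control: 20.0%
n_e, conv_e = 1_000, 188     # experiment: 18.8%
margin = 0.01

p_c, p_e = conv_c / n_c, conv_e / n_e
se = np.sqrt(p_e * (1 - p_e) / n_e + p_c * (1 - p_c) / n_c)

# Superiority (two-sided): is there any difference at all?
z_sup = (p_e - p_c) / se
p_sup = 2 * (1 - norm.cdf(abs(z_sup)))

# Non-inferiority (one-sided): can we rule out a drop larger than the margin?
z_ni = ((p_e - p_c) + margin) / se
p_ni = 1 - norm.cdf(z_ni)

print(f'superiority p = {p_sup:.3f} (not significant)')
print(f'non-inferiority p = {p_ni:.3f} (also cannot reject H0)')
```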

Conclusion

Superiority and non-inferiority trials answer fundamentally different questions and require different statistical designs. Choosing the right one before you run the experiment — not after — is essential for valid inference.

For practical implementation: non-inferiority tests at HomeBuddy use one-sided Z-tests with the margin as the null center. The sample size is calculated against the margin $\Delta$, not the MDE, and the significance level uses the one-sided critical value $Z_{1-\alpha}$ instead of $Z_{1-\alpha/2}$.
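
Equivalently, the same decision rule can be phrased as a confidence bound: ship if the one-sided lower confidence limit for $\mu_E - \mu_C$ clears $-\Delta$. A minimal sketch of that framing (the function name and inputs are illustrative, not an existing HomeBuddy utility):

```python
import numpy as np
from scipy.stats import norm


def non_inferior(p_e, p_c, n_e, n_c, margin, alpha=0.05):
    """Return True if the one-sided (1 - alpha) lower confidence bound
    for p_e - p_c lies above -margin (the CI view of the same z-test)."""
    se = np.sqrt(p_e * (1 - p_e) / n_e + p_c * (1 - p_c) / n_c)
    lower_bound = (p_e - p_c) - norm.ppf(1 - alpha) * se
    return lower_bound > -margin


# Hypothetical numbers: a 0.3 pp drop measured on 50k users per group
print(non_inferior(0.197, 0.200, 50_000, 50_000, margin=0.01))  # True
```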
