Background

Most A/B tests at HomeBuddy ask: “is the new variant better?” — a superiority trial. But sometimes the question is different: we want to ship a change for cost, performance, or architectural reasons, and we need to verify that it does not harm the key metric by more than a tolerable margin. This is a non-inferiority trial.

The distinction is not cosmetic — it changes the null hypothesis, the test statistic, and critically, the sample size formula. Using a superiority design when you actually want non-inferiority leads to either an underpowered test (too few users to detect the margin) or a miscalibrated one (wrong null hypothesis).

Problem Definition

Let $\mu_E$ and $\mu_C$ be the true means of the experimental and control groups respectively. Both trial types test a claim about $\mu_E - \mu_C$, but they differ in what they try to prove.

Superiority trial

$$H_0: \mu_E - \mu_C \le 0 \quad \text{vs} \quad H_1: \mu_E - \mu_C > 0$$

Rejecting $H_0$ means the treatment is strictly better than control.

Non-inferiority trial

$$H_0: \mu_E - \mu_C \le -\Delta \quad \text{vs} \quad H_1: \mu_E - \mu_C > -\Delta$$

where $\Delta > 0$ is the non-inferiority margin — the largest degradation you are willing to tolerate. Rejecting $H_0$ means the treatment is not worse than control by more than $\Delta$.
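
As a concrete sketch of that test (the conversion counts below are invented for illustration), the only change from a standard two-proportion z-test is shifting the null center from $0$ to $-\Delta$:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical data: 10,000 users per group, counts made up for illustration
n_c, conv_c = 10_000, 2_000   # control: 20.0%
n_e, conv_e = 10_000, 1_970   # experiment: 19.7%
margin = 0.01                 # Delta: tolerate at most a 1 pp drop

p_c, p_e = conv_c / n_c, conv_e / n_e
se = np.sqrt(p_e * (1 - p_e) / n_e + p_c * (1 - p_c) / n_c)

# Shift the null center from 0 to -Delta: H0 is mu_E - mu_C <= -Delta
z = ((p_e - p_c) + margin) / se
p_value = 1 - norm.cdf(z)     # one-sided

print(f'z = {z:.3f}, one-sided p = {p_value:.4f}')
# p < alpha => reject H0: any drop is smaller than the margin
```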

Side-by-Side Comparison

| Feature | Superiority trial | Non-inferiority trial |
|---|---|---|
| Goal | Prove $\mu_E > \mu_C$ | Prove $\mu_E \ge \mu_C - \Delta$ |
| Null hypothesis | $\mu_E - \mu_C \le 0$ | $\mu_E - \mu_C \le -\Delta$ |
| Alternative | $\mu_E - \mu_C > 0$ | $\mu_E - \mu_C > -\Delta$ |
| Test type | Two-sided (typically) | One-sided (always) |
| Null center | $0$ | $-\Delta$ |
| Test statistic | $Z = \dfrac{(\bar x_E - \bar x_C) - 0}{SE}$ | $Z = \dfrac{(\bar x_E - \bar x_C) + \Delta}{SE}$ |
| Key parameter | $\delta$ (MDE) | $\Delta$ (margin) |
| Sample size | See below | See below |

Sample Size Formulas

Let $\sigma^2$ be the common variance, $\alpha$ the significance level, and $1-\beta$ the desired power.

Superiority (conventionally run two-sided even when the hypothesis of interest is directional, detects effect $\delta$):

$$ n \approx \frac{\left(Z_{1-\alpha/2} + Z_{1-\beta}\right)^2 \cdot 2\sigma^2}{\delta^2} $$

Non-inferiority (one-sided, with anticipated true difference $\delta_{\text{ant}} = \mu_E - \mu_C$):

$$ n \approx \frac{\left(Z_{1-\alpha} + Z_{1-\beta}\right)^2 \cdot 2\sigma^2}{(\Delta + \delta_{\text{ant}})^2} $$

Typically $\delta_{\text{ant}} = 0$ is assumed (no true difference), which simplifies the denominator to $\Delta^2$. Note the sign: a positive $\delta_{\text{ant}}$ (the variant expected to be slightly better) moves the truth further from the null center $-\Delta$ and reduces the required sample size.
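
For a hypothetical illustration of that term: with $\Delta = 0.010$ and an anticipated lift of $\delta_{\text{ant}} = 0.002$, the denominator grows from $(0.010)^2$ to $(0.012)^2$, shrinking $n$ by a factor of $(0.010/0.012)^2 \approx 0.69$, roughly a 31% traffic saving.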

Key insight: the non-inferiority formula uses the one-sided critical value $Z_{1-\alpha}$ (not $Z_{1-\alpha/2}$), and the effect in the denominator is the margin $\Delta$, not the MDE $\delta$. When $\Delta$ and $\delta$ happen to be numerically equal, the one-sided critical value actually makes the non-inferiority $n$ slightly smaller, as the output below shows. In practice, however, margins are usually set much tighter than the MDE you would accept in a superiority test, so a tight margin (small $\Delta$) can easily demand more users than a typical superiority test.

Sample size calculator

```python
from scipy.stats import norm
import numpy as np


def n_superiority(delta, sigma2, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided superiority test."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return int(np.ceil(z**2 * 2 * sigma2 / delta**2))


def n_non_inferiority(margin, sigma2, alpha=0.05, power=0.8, anticipated_delta=0.0):
    """Per-group sample size for a one-sided non-inferiority test.

    anticipated_delta is the anticipated true difference mu_E - mu_C;
    a positive value (variant expected slightly better) reduces n.
    """
    z = norm.ppf(1 - alpha) + norm.ppf(power)
    return int(np.ceil(z**2 * 2 * sigma2 / (margin + anticipated_delta)**2))


# Example: conversion rate ~ 20%, margin / MDE = 1 pp
p = 0.20
sigma2 = p * (1 - p)  # Bernoulli variance p(1 - p)
effect = 0.01  # 1 percentage point

n_sup = n_superiority(effect, sigma2)
n_ni = n_non_inferiority(effect, sigma2)

print(f'Conversion rate: {p:.0%},  effect / margin: {effect:.1%}')
print(f'Superiority   n per group: {n_sup:>8,}')
print(f'Non-inferiority n per group: {n_ni:>8,}')
```

```
Conversion rate: 20%,  effect / margin: 1.0%
Superiority   n per group:   25,117
Non-inferiority n per group:   19,785
```
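
As a quick sanity check, assuming statsmodels is available, the superiority figure can be reproduced with its power solver (`NormalIndPower` works on the standardized effect size $\delta/\sigma$):

```python
from statsmodels.stats.power import NormalIndPower
import numpy as np

# Standardized effect size: delta / sigma for the 20% baseline above
es = 0.01 / np.sqrt(0.20 * 0.80)

# Per-group n for a two-sided z-test with alpha=0.05, power=0.8
n = NormalIndPower().solve_power(effect_size=es, alpha=0.05, power=0.8,
                                 alternative='two-sided')
print(int(np.ceil(n)))  # ~25,117, matching n_superiority above
```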
Comparison across margins and conversion rates

```python
import pandas as pd

# Reuses n_superiority / n_non_inferiority from the cell above
rows = []
for p in [0.05, 0.10, 0.20]:
    for margin_pct in [0.01, 0.02, 0.05]:
        s2 = p * (1 - p)
        rows.append({
            'Baseline p': f'{p:.0%}',
            'Margin': f'{margin_pct:.0%}',
            'Superiority n': n_superiority(margin_pct, s2),
            'Non-inferiority n': n_non_inferiority(margin_pct, s2),
        })

pd.DataFrame(rows).set_index(['Baseline p', 'Margin'])
```

| Baseline p | Margin (abs.) | Superiority n | Non-inferiority n |
|---|---|---|---|
| 5% | 1% | 7457 | 5874 |
| 5% | 2% | 1865 | 1469 |
| 5% | 5% | 299 | 235 |
| 10% | 1% | 14128 | 11129 |
| 10% | 2% | 3532 | 2783 |
| 10% | 5% | 566 | 446 |
| 20% | 1% | 25117 | 19785 |
| 20% | 2% | 6280 | 4947 |
| 20% | 5% | 1005 | 792 |
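
One pattern worth noting: in every row the non-inferiority $n$ is about 21% smaller than the superiority $n$. That is exactly the ratio of the squared critical-value sums, $\left(\frac{Z_{0.95} + Z_{0.8}}{Z_{0.975} + Z_{0.8}}\right)^2 \approx 0.79$, since the two formulas are otherwise identical when $\Delta = \delta$.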

When to Use Each

Use a superiority trial when:

  • You expect the new variant to improve a metric and want to prove it.
  • The cost of a false positive (shipping a change that is not actually better) is high.

Use a non-inferiority trial when:

  • You are shipping a change for non-metric reasons (performance, cost, tech debt) and need to confirm no regression.
  • You are replacing an old component and the new one is expected to be equivalent, not better.
  • You want to retire a feature and need to prove its removal doesn’t degrade key metrics.

A common mistake is to run a superiority test and interpret a non-significant result as proof of non-inferiority — it is not. “We couldn’t detect a difference” is not the same as “we proved there is no meaningful difference.” Only a properly designed non-inferiority test with a pre-specified margin gives you that guarantee.
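
To make that trap concrete, here is a hypothetical, deliberately underpowered example with invented counts: the superiority test comes back non-significant, yet the same data also cannot rule out a drop larger than a 1 pp margin.

```python
import numpy as np
from scipy.stats import norm

# Deliberately small (underpowered) groups; counts invented for illustration
n_c, conv_c = 1_000, 200     # control: 20.0%
n_e, conv_e = 1_000, 188     # experiment: 18.8%
margin = 0.01

p_c, p_e = conv_c / n_c, conv_e / n_e
se = np.sqrt(p_e * (1 - p_e) / n_e + p_c * (1 - p_c) / n_c)

# Superiority (two-sided): is there any difference at all?
z_sup = (p_e - p_c) / se
p_sup = 2 * (1 - norm.cdf(abs(z_sup)))

# Non-inferiority (one-sided): can we rule out a drop larger than the margin?
z_ni = ((p_e - p_c) + margin) / se
p_ni = 1 - norm.cdf(z_ni)

print(f'superiority p = {p_sup:.3f} (not significant)')
print(f'non-inferiority p = {p_ni:.3f} (also cannot reject H0)')
```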

Conclusion

Superiority and non-inferiority trials answer fundamentally different questions and require different statistical designs. Choosing the right one before you run the experiment — not after — is essential for valid inference.

For practical implementation: non-inferiority tests at HomeBuddy use one-sided Z-tests with the margin as the null center. The sample size is calculated against the margin $\Delta$, not the MDE, and the significance level uses the one-sided critical value $Z_{1-\alpha}$ instead of $Z_{1-\alpha/2}$.
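
Equivalently, the same decision rule can be phrased as a confidence bound: ship if the one-sided lower confidence limit for $\mu_E - \mu_C$ clears $-\Delta$. A minimal sketch of that framing (the function name and inputs are illustrative, not an existing HomeBuddy utility):

```python
import numpy as np
from scipy.stats import norm


def non_inferior(p_e, p_c, n_e, n_c, margin, alpha=0.05):
    """Return True if the one-sided (1 - alpha) lower confidence bound
    for p_e - p_c lies above -margin (the CI view of the same z-test)."""
    se = np.sqrt(p_e * (1 - p_e) / n_e + p_c * (1 - p_c) / n_c)
    lower_bound = (p_e - p_c) - norm.ppf(1 - alpha) * se
    return lower_bound > -margin


# Hypothetical numbers: a 0.3 pp drop measured on 50k users per group
print(non_inferior(0.197, 0.200, 50_000, 50_000, margin=0.01))  # True
```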
