Background
We at HomeBuddy run AB tests on conversion metrics and engagement signals whose natural variance is high. High variance means we need more users to detect the same effect — or equivalently, we might miss a real improvement because the noise drowns it out.
CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance-reduction technique introduced by Microsoft Research that exploits the correlation between a user's pre-experiment behavior and their in-experiment metric. By subtracting a scaled version of the pre-experiment covariate from the metric, CUPED can cut variance by 50–90% when that correlation is strong; since the required sample size scales with metric variance, this is equivalent to reaching the same sensitivity with a half to a tenth of the traffic.
Prerequisites
Python 3.11.4 with NumPy, SciPy, pandas, and statsmodels.
Theory
Let $Y$ be the in-experiment metric and $X$ a covariate unaffected by the treatment (e.g. the same metric measured before the experiment began). Define the CUPED-adjusted metric:
$$ Y_{\text{CUPED}} = Y - \theta(X - \mathbb{E}X) $$
Computing the moments:
$$ \mathbb{E}[Y_{\text{CUPED}}] = \mathbb{E}[Y] - \theta \underbrace{\mathbb{E}[X - \mathbb{E}X]}_{=0} = \mathbb{E}[Y] $$
$$ \mathbb{D}[Y_{\text{CUPED}}] = \mathbb{D}[Y] + \theta^2 \mathbb{D}[X] - 2\theta \operatorname{cov}(Y, X) $$
The expectation is preserved. The variance can be reduced by choosing $\theta$ to minimise $\mathbb{D}[Y_{\text{CUPED}}]$:
$$ \frac{\partial}{\partial \theta}\mathbb{D}[Y_{\text{CUPED}}] = 2\theta\mathbb{D}[X] - 2\operatorname{cov}(Y, X) = 0 \implies \theta^* = \frac{\operatorname{cov}(Y, X)}{\mathbb{D}[X]} $$
Substituting back:
$$ \mathbb{D}[Y_{\text{CUPED}}] = \mathbb{D}[Y]\left(1 - \rho^2_{XY}\right) $$
where $\rho_{XY}$ is the Pearson correlation between $X$ and $Y$. If $\rho = 0.9$, variance drops by 81%; if $\rho = 0$, CUPED provides no benefit (and should not be applied).
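The identity is easy to verify numerically. The snippet below is purely illustrative (it is not part of the pipeline that follows): it draws correlated pairs with a known $\rho$ and compares the empirical variance of the adjusted metric against $\mathbb{D}[Y](1 - \rho^2)$.
import numpy as np

rng = np.random.default_rng(0)
rho = 0.9
n = 100_000

# Draw (X, Y) from a bivariate normal with unit variances and correlation rho.
cov = [[1.0, rho], [rho, 1.0]]
X, Y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

theta = np.cov(Y, X)[0, 1] / np.var(X, ddof=1)  # theta* = cov(Y, X) / D[X]
Y_cuped = Y - theta * (X - X.mean())

print(np.var(Y_cuped, ddof=1))           # empirical, ≈ 0.19
print(np.var(Y, ddof=1) * (1 - rho**2))  # theoretical D[Y](1 - rho^2)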
Key constraints:
- The covariate $X$ must not be affected by the treatment — pre-experiment data is ideal.
- $\theta$ must be estimated on the combined sample (test + control), not on each group separately.
- For a two-sample test, $\mathbb{E}X$ must be the pooled mean — not the per-group mean.
Synthetic data generation with correlated past/future metrics
import datetime
import hashlib

import numpy as np
import pandas as pd
import scipy.stats as sts
from statsmodels.stats.proportion import proportion_confint


def lognorm():
    # Heavy-tailed metric: log-normal shifted by loc=100, mean ≈ 101.65.
    return sts.lognorm(s=1, loc=100)


def generate_samples(distribution, n_users=100, n_days=1, seed=0):
    """
    Generate a dataset with historical (past) and experimental (future) periods.
    Past and future metrics are strongly correlated: the future metric is drawn
    first, then past = future + small Gaussian noise.
    """
    np.random.seed(seed)

    def encoder(x):
        # Deterministic user id and 50/50 group assignment. We derive the flag
        # from the md5 digest rather than the built-in hash(), which is salted
        # per process and would break reproducibility across runs.
        uid = hashlib.md5(str(x).encode()).hexdigest()
        test_flg = int(uid, 16) % 2
        return (uid, 'test' if test_flg else 'control')

    df = pd.DataFrame(
        list(map(
            encoder,
            np.array([[u] * (2 * n_days) for u in range(2 * n_users)]).ravel()
        )),
        columns=['user_id', 'group'],
    )
    df['date'] = pd.to_datetime(
        [datetime.date.today() - datetime.timedelta(days=x) for x in range(2 * n_days)] * 2 * n_users
    )
    # The most recent n_days are the experiment ('future'); earlier dates are 'past'.
    df['history'] = np.where(
        df['date'] > pd.Timestamp(datetime.date.today() - datetime.timedelta(days=n_days)),
        'future', 'past'
    )
    future_metric = distribution.rvs(size=n_days * n_users * 2)
    past_metric = future_metric + sts.norm.rvs(loc=0, scale=1, size=len(future_metric))
    # Sorting by (date, user_id) aligns each user's past row with their future row.
    df = df.sort_values(by=['date', 'user_id'], ascending=True)
    df['metric'] = np.hstack((past_metric, future_metric))
    return df.reset_index(drop=True)
rv = lognorm()
EV = rv.mean() # true population mean ≈ 101.6
df = generate_samples(rv)
user_level = df.groupby(['group', 'history', 'user_id'])[['metric']].mean().reset_index()
print(f'Shape: {df.shape}, EV ≈ {EV:.2f}')
user_level.sample(3)
T-Test Utilities
Two test procedures are needed: a one-sample T-test (used when comparing a single group against a known mean) and Welch’s T-test (for two independent groups with potentially unequal variances).
One-sample and Welch T-test implementations
def one_samp_t_test(X, mu0, alpha=0.05):
    # Two-sided one-sample t-test of H0: E[X] = mu0.
    mu = np.mean(X) - mu0
    sigma = np.sqrt(np.var(X, ddof=1) / len(X))
    t = mu / sigma
    T = sts.t(df=len(X) - 1)
    return {'pvalue': 2 * min(T.sf(t), T.cdf(t))}


def welch_t_test(X, Y, alpha=0.05):
    # Welch's t-test: two independent samples, unequal variances allowed.
    vx = np.var(X, ddof=1) / len(X)
    vy = np.var(Y, ddof=1) / len(Y)
    mu = np.mean(X) - np.mean(Y)
    sigma = np.sqrt(vx + vy)
    t = mu / sigma
    # Welch–Satterthwaite degrees of freedom; no need to truncate to an integer.
    nu = (vx + vy)**2 / (vx**2 / (len(X) - 1) + vy**2 / (len(Y) - 1))
    T = sts.t(df=nu)
    return {'pvalue': 2 * min(T.sf(t), T.cdf(t))}
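As a quick sanity check (not part of the original pipeline), both functions can be compared against SciPy's reference implementations on arbitrary random data; the p-values should agree to numerical precision.
# Sanity check against SciPy's built-in t-tests.
x = sts.norm.rvs(loc=0, scale=1, size=200, random_state=1)
y = sts.norm.rvs(loc=0.1, scale=1.5, size=150, random_state=2)

print(one_samp_t_test(x, mu0=0)['pvalue'],
      sts.ttest_1samp(x, popmean=0).pvalue)         # should match
print(welch_t_test(x, y)['pvalue'],
      sts.ttest_ind(x, y, equal_var=False).pvalue)  # should match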
Two-Sample CUPED
In a standard AB test with test group $T$ and control group $C$, define:
$$ T_{\text{CUPED}} = T - \theta A, \quad C_{\text{CUPED}} = C - \theta B $$
where $A$ and $B$ are pre-experiment covariates for test and control respectively, with $\mathbb{E}A = \mathbb{E}B$ (they're drawn from the same pre-experiment distribution).
The optimal $\theta$ for variance minimisation of the difference $T_{\text{CUPED}} - C_{\text{CUPED}}$:
$$ \theta^* = \frac{\operatorname{cov}(T, A) + \operatorname{cov}(C, B)}{\mathbb{D}[A] + \mathbb{D}[B]} $$
The resulting variance reduction:
$$ \mathbb{D}[T_{\text{CUPED}} - C_{\text{CUPED}}] = \left(1 - \rho^2\right)\mathbb{D}[T - C], \quad \rho = \operatorname{corr}(T - C, A - B) $$
Two-sample CUPED implementation
def two_samples_cuped(test_target, control_target, test_cov, control_cov):
    # Pooled theta across both groups. ddof=1 keeps the variance estimator
    # consistent with np.cov, which uses ddof=1 by default.
    theta = (
        np.cov(test_target, test_cov)[0, 1] + np.cov(control_target, control_cov)[0, 1]
    ) / (np.var(test_cov, ddof=1) + np.var(control_cov, ddof=1))
    return test_target - theta * test_cov, control_target - theta * control_cov
def apply_two_samples_cuped(user_df, cuped, ids='user_id', date='history', metric='metric'):
    # Sorting by (period, user) aligns each user's past value with their future value.
    data = user_df.sort_values(by=[date, ids], ascending=True)
    ft = data[(data.group == 'test') & (data[date] == 'future')][metric].values
    fc = data[(data.group == 'control') & (data[date] == 'future')][metric].values
    pt = data[(data.group == 'test') & (data[date] == 'past')][metric].values
    pc = data[(data.group == 'control') & (data[date] == 'past')][metric].values
    data = data.copy()
    data['cuped_metric'] = np.nan
    tc, cc = cuped(ft, fc, pt, pc)
    data.loc[(data.group == 'test') & (data[date] == 'future'), 'cuped_metric'] = tc
    data.loc[(data.group == 'control') & (data[date] == 'future'), 'cuped_metric'] = cc
    # Past rows have no cuped_metric, so dropna keeps only the experiment period.
    return data.dropna().reset_index(drop=True)
df_cuped = apply_two_samples_cuped(user_level, two_samples_cuped)
orig_var = df_cuped['metric'].var()
cuped_var = df_cuped['cuped_metric'].var()
print(f'Original variance: {orig_var:.2f}')
print(f'CUPED variance: {cuped_var:.2f}')
print(f'Reduction factor: {orig_var / cuped_var:.1f}×')
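One more quick check, illustrative rather than part of the pipeline: CUPED should leave the test-vs-control difference in means (nearly) unchanged while shrinking the variance; the difference is preserved exactly in expectation.
# The group mean difference should be almost identical before and after CUPED.
t = df_cuped[df_cuped.group == 'test']
c = df_cuped[df_cuped.group == 'control']
print(t['metric'].mean() - c['metric'].mean())
print(t['cuped_metric'].mean() - c['cuped_metric'].mean())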
Correctness: AA Test
Before measuring power gains, we verify the criterion is properly calibrated: in an AA test (no real effect), the false-positive rate should stay at $\alpha$.
We also check a common mistake — subtracting the per-group mean of the covariate instead of the pooled mean. This breaks the equal-expectation assumption and inflates FPR dramatically.
AA test: correct vs incorrect demeaning
def two_samples_cuped_demeaned(test_target, control_target, test_cov, control_cov):
    """Correct: subtract the POOLED covariate mean."""
    theta = (
        np.cov(test_target, test_cov)[0, 1] + np.cov(control_target, control_cov)[0, 1]
    ) / (np.var(test_cov, ddof=1) + np.var(control_cov, ddof=1))
    pooled_mean = np.hstack((test_cov, control_cov)).mean()
    return (test_target - theta * (test_cov - pooled_mean),
            control_target - theta * (control_cov - pooled_mean))


def two_samples_cuped_incorrect(test_target, control_target, test_cov, control_cov):
    """Incorrect: subtract per-group means; this inflates the false-positive rate."""
    theta = (
        np.cov(test_target, test_cov)[0, 1] + np.cov(control_target, control_cov)[0, 1]
    ) / (np.var(test_cov, ddof=1) + np.var(control_cov, ddof=1))
    return (test_target - theta * (test_cov - test_cov.mean()),
            control_target - theta * (control_cov - control_cov.mean()))
def simulate(procedure, n_tests=500, alpha=0.05, mode='AA'):
    """Monte Carlo rejection rate: FPR in 'AA' mode, power in 'AB' mode."""
    n_err = 0
    for i in range(n_tests):
        d = generate_samples(lognorm(), seed=8 * i)
        ul = d.groupby(['group', 'history', 'user_id'])[['metric']].mean().reset_index()
        if mode == 'AB':
            # Inject a 1% multiplicative effect into the test group's experiment period.
            ul['metric'] = ul.apply(
                lambda r: 1.01 * r.metric if r.group == 'test' and r.history == 'future' else r.metric,
                axis=1)
        dc = ul.pipe(apply_two_samples_cuped, cuped=procedure)
        p = welch_t_test(
            dc.loc[dc.group == 'test', 'cuped_metric'],
            dc.loc[dc.group == 'control', 'cuped_metric'],
        )['pvalue']
        if p < alpha:
            n_err += 1
    # Wilson confidence interval for the observed rejection rate.
    lo, hi = proportion_confint(n_err, n_tests, alpha=0.05, method='wilson')
    return f'{n_err / n_tests:.3f} 95% CI [{lo:.3f}, {hi:.3f}]'
print('AA FPR — correct (pooled mean): ', simulate(two_samples_cuped_demeaned))
print('AA FPR — correct (no demeaning): ', simulate(two_samples_cuped))
print('AA FPR — incorrect (per-group): ', simulate(two_samples_cuped_incorrect))
The incorrect per-group demeaning blows up the FPR to ~40% — a catastrophic failure that would be hard to detect without a simulation like this. Both the no-demeaning and pooled-demeaning variants maintain FPR ≈ 5%.
Power Improvement
Now let’s measure the power gain under a 1% true effect.
Power comparison: no CUPED vs CUPED
def no_cuped(test_target, control_target, test_cov, control_cov):
    # Baseline: return the metrics unchanged.
    return test_target, control_target
print('Power (1% effect) — no CUPED: ', simulate(no_cuped, mode='AB'))
print('Power (1% effect) — CUPED: ', simulate(two_samples_cuped, mode='AB'))
Conclusion
CUPED is one of the most cost-effective variance-reduction techniques available:
- No extra traffic — it uses data you already have (pre-experiment history).
- Large gains — if pre/post correlation is high ($\rho \approx 0.9$), variance drops ~81%.
- Unbiased — the expected value of the metric is preserved exactly.
- Easy to implement — requires only three statistics per group: covariance, variance of covariate, and means.
Critical implementation detail: $\theta$ must be estimated from both groups combined, and the covariate mean subtracted must be the pooled pre-experiment mean — not per-group. Subtracting per-group means silently inflates the false-positive rate from 5% to ~40%.
Alternatives when pre-experiment data is unavailable or CUPED gains are small:
- Stratification — randomise within strata rather than fully at random, reducing imbalance.
- Prediction subtraction — generalises CUPED from a single covariate to a full regression model; useful when many pre-experiment features are available (see the sketch after this list).
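A minimal sketch of prediction subtraction, under the assumption that a matrix of pre-experiment features is available per user; the function and argument names here are hypothetical. The model is fitted on the pooled sample (test + control), mirroring the pooled-theta requirement above, and the prediction is subtracted from the metric.
import numpy as np
import statsmodels.api as sm


def prediction_cuped(y_test, y_control, X_test, X_control):
    """
    Regression-adjusted metric: y - f(X), where f is an OLS fit on the
    pooled sample (test + control) to avoid biasing the comparison.
    X_* are per-user matrices of pre-experiment features.
    """
    X = sm.add_constant(np.vstack((X_test, X_control)))
    y = np.hstack((y_test, y_control))
    fit = sm.OLS(y, X).fit()
    resid_test = y_test - fit.predict(sm.add_constant(X_test))
    resid_control = y_control - fit.predict(sm.add_constant(X_control))
    # Add back the pooled mean so the metric stays on its original scale.
    return resid_test + y.mean(), resid_control + y.mean()
Single-covariate CUPED is the special case where the regression has one feature; as with the pooled-mean variant above, any such procedure should be validated with an AA simulation before use.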
References
- Deng, A. et al. (2013). Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data — the original CUPED paper.
- Booking.com: How CUPED increases power of online experiments
- Avito: Variance reduction methods — Part 1, Part 2
- Variance reduction techniques — conference talk