Background

We at HomeBuddy track many ratio metrics: average session duration, pages per visit, revenue per order. These metrics share a common structure — a sum of events in the numerator divided by a count of sessions (or users) in the denominator, where both vary per user.

$$\mathcal{R} = \frac{\sum_{u \in A} X(u)}{\sum_{u \in A} Y(u)}$$

The problem: $\mathcal{R}$ is not a simple average of independent observations. Each user $u$ contributes $Y(u)$ observations to the denominator, creating dependence between rows, so naively applying a T-test to the raw rows yields invalid p-values. Collapsing to a per-user mean loses the weighting: a user with 100 sessions gets the same weight as one with a single session, which can distort and even flip the direction of the test. For example, if one user has 100 sessions averaging 10 and another has a single session worth 100, the ratio metric is $(1000 + 100)/101 \approx 10.9$, while the mean of per-user means is $(10 + 100)/2 = 55$.

Linearization resolves this with an elegant algebraic trick: it produces a per-user scalar whose group mean shifts exactly when the ratio shifts relative to control, and which is fully compatible with standard T-tests and sensitivity methods.

Four Approaches to Ratio Metrics

| Approach | Description | Notes |
|---|---|---|
| Naive per-user average | Mean of per-user means | Distorts weighting; direction of test may flip |
| Bootstrap | Resample the ratio statistic | Computationally expensive at scale |
| Delta Method | Asymptotic variance for $X/Y$ | Requires 3 aggregations per user; see dedicated post |
| Linearization | Per-user scalar via Taylor expansion | Simple; enables all per-user techniques |
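
For reference, here is what the bootstrap row above might look like in practice: a minimal sketch that resamples users (not rows!) with replacement and recomputes the ratio for every replicate, which is exactly what makes it expensive at scale. The function bootstrap_ratio_ci and its parameters are illustrative, not part of our production code.

import numpy as np

def bootstrap_ratio_ci(obs, n_boot=10_000, seed=0):
    """95% percentile-bootstrap CI for sum(X)/sum(Y), resampling users."""
    rng = np.random.default_rng(seed)
    x = np.array([sum(row) for row in obs], dtype=float)  # X(u): per-user sums
    y = np.array([len(row) for row in obs], dtype=float)  # Y(u): per-user counts
    idx = rng.integers(0, len(x), size=(n_boot, len(x)))  # resampled user indices
    ratios = x[idx].sum(axis=1) / y[idx].sum(axis=1)      # one ratio per replicate
    return np.percentile(ratios, [2.5, 97.5])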

Theory

Taylor Expansion

Expand $\mathcal{R}(X, Y)$ around the control group means $(\mu_X, \mu_Y)$ to first order:

$$\mathcal{R}(X, Y) \approx \frac{\mu_X}{\mu_Y} + \frac{1}{\mu_Y}(X - \mu_X) - \frac{\mu_X}{\mu_Y^2}(Y - \mu_Y)$$

$$= \frac{1}{\mu_Y}\left(X - \frac{\mu_X}{\mu_Y}\cdot Y\right) + \text{const}$$

Denoting $\alpha = \mu_X / \mu_Y = \mathcal{R}_{control}$ (the control group ratio), the linearized metric per user is:

$$L(u) = X(u) - \alpha \cdot Y(u)$$

The per-user values $\{L(u)\}$ are independent and identically distributed — T-test ready.

Why It Works

The group-level mean of $L$ satisfies:

$$\overline{L_A} = \overline{X_A} - \alpha \cdot \overline{Y_A} = \overline{Y_A}\left(\frac{\overline{X_A}}{\overline{Y_A}} - \alpha\right) = \overline{Y_A}\left(\mathcal{R}_A - \alpha\right)$$

where the last step uses $\mathcal{R}_A = \overline{X_A} / \overline{Y_A}$ (the group ratio).

Since $\overline{Y_A} > 0$, the sign of $\overline{L_A}$ equals the sign of $(\mathcal{R}_A - \alpha)$. A T-test on $L$ therefore tests exactly the hypothesis we care about: it detects a difference in means precisely when $\mathcal{R}_{test}$ differs from $\mathcal{R}_{control}$.

The parameter $\alpha$ is estimated from the control group:

$$\hat\alpha = \frac{\sum_{u \in control} X(u)}{\sum_{u \in control} Y(u)}$$
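
As a quick sanity check of the sign identity, the snippet below builds a synthetic test group whose ratio sits about 5% above a pretend control ratio of 3.0 (all numbers invented for illustration) and confirms that the mean of $L$ and the ratio difference share a sign.

import numpy as np

rng = np.random.default_rng(42)
y_t = rng.integers(1, 20, size=500).astype(float)    # Y(u) for a test group
x_t = 1.05 * 3.0 * y_t + rng.normal(0, 1, size=500)  # X(u): ratio ~5% above 3.0
alpha = 3.0                                          # pretend control ratio

L = x_t - alpha * y_t
r_test = x_t.sum() / y_t.sum()
# mean(L) = mean(Y) * (r_test - alpha), so the signs must agree exactly
print(np.sign(L.mean()) == np.sign(r_test - alpha))  # True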

Implementation

Linearization function
def linearization(control: list[list], test: list[list]) -> tuple[list, list]:
    """
    Convert ratio-metric observations into linearized per-user scalars.

    Parameters
    ----------
    control, test : list of lists, one inner list per user with all their observations

    Returns
    -------
    (linearized_control, linearized_test): one scalar per user, T-test compatible
    """
    total_x = sum(sum(row) for row in control)
    total_y = sum(len(row) for row in control)
    alpha = total_x / total_y  # control ratio estimate

    linearized_control = [sum(row) - len(row) * alpha for row in control]
    linearized_test    = [sum(row) - len(row) * alpha for row in test]
    return linearized_control, linearized_test
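
A toy call with invented numbers: the control group below has totals $\sum X = 60$ and $\sum Y = 3$, so $\hat\alpha = 20$.

ctrl = [[10, 20], [30]]   # two control users: sums 30 and 30, counts 2 and 1
tst  = [[25, 25], [40]]   # two test users
lc, lt = linearization(ctrl, tst)
print(lc)  # [-10.0, 10.0]  (30 - 2*20, 30 - 1*20)
print(lt)  # [10.0, 20.0]   (50 - 2*20, 40 - 1*20)

Note that the control scalars sum to zero here; that is the same by-construction property called out in the conclusion.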

Synthetic Data

We simulate an AB test where each user has a persistent quality effect that correlates all their sessions — the key property that makes row-level tests invalid for ratio metrics. The test group gets a 5% uplift on the ratio metric for the single-run p-value comparison.

Data generation
import hashlib

import numpy as np
import pandas as pd
import scipy.stats as sts


def generate_samples(n_users, n_samples, seed=0, effect=0.0):
    np.random.seed(seed)

    # User-level quality — each user has a fixed base that correlates all their sessions
    user_quality = np.random.lognormal(0, 1, n_users)

    def encoder(x):
        uid = hashlib.md5(str(x).encode()).hexdigest()
        # deterministic group split via MD5 (avoids Python hash randomisation)
        test_flg = int(uid, 16) % 2
        return (uid, 'test' if test_flg else 'control', float(user_quality[x]))

    rows = list(map(encoder, np.random.randint(0, n_users, 2 * n_samples)))
    df = pd.DataFrame(rows, columns=['user_id', 'group', 'user_quality'])

    # Row metric = user quality * session noise (creates within-user correlation)
    row_noise = sts.lognorm.rvs(0.5, size=2 * n_samples)
    metric = df['user_quality'].values * row_noise * 100

    is_test = (df['group'] == 'test').values
    metric[is_test] *= (1 + effect)
    return df[['user_id', 'group']].assign(metric=metric)


df = generate_samples(100, 10000, effect=0.05)

# Build per-user observation lists
grouped = df.groupby(['group', 'user_id'])['metric'].apply(list)
control_obs = grouped['control'].tolist()
test_obs    = grouped['test'].tolist()

print(f'Control users: {len(control_obs)}, test users: {len(test_obs)}')
print(f'Avg obs per control user: {np.mean([len(r) for r in control_obs]):.1f}')
Control users: 48, test users: 52
Avg obs per control user: 200.2

Apply linearization and compare approaches
lin_control, lin_test = linearization(control_obs, test_obs)

# Naive: row-level T-test (incorrect)
ctrl_rows = df[df.group == 'control'].metric.values
test_rows = df[df.group == 'test'].metric.values
_, p_naive = sts.ttest_ind(test_rows, ctrl_rows)

# Per-user average T-test (biased weighting)
ctrl_avg = [np.mean(r) for r in control_obs]
test_avg = [np.mean(r) for r in test_obs]
_, p_avg = sts.ttest_ind(test_avg, ctrl_avg)

# Linearization T-test (correct)
_, p_lin = sts.ttest_ind(lin_test, lin_control)

print(f'Row-level T-test p-value:      {p_naive:.4f}')
print(f'Per-user average p-value:      {p_avg:.4f}')
print(f'Linearization p-value:         {p_lin:.4f}')
Row-level T-test p-value:      0.0000
Per-user average p-value:      0.0715
Linearization p-value:         0.0635

Correctness and Power

A simulation over 500 AA (no-effect) trials verifies that the linearized T-test holds the false positive rate (FPR) at the nominal 5% level, while the row-level approach inflates it to ~88% due to within-user correlation. A subsequent AB simulation with a 50% effect then compares statistical power among the approaches that control the FPR.

AA / AB simulation
from statsmodels.stats.proportion import proportion_confint


def simulate(effect=0.0, n_tests=500, alpha=0.05):
    row_fp, avg_fp, lin_fp = 0, 0, 0  # rejection counts: FPs under AA, detections under AB
    for i in range(n_tests):
        d = generate_samples(100, 10000, seed=i, effect=effect)
        g = d.groupby(['group', 'user_id'])['metric'].apply(list)
        ctrl = g['control'].tolist()
        tst  = g['test'].tolist()

        ctrl_rows = d[d.group == 'control'].metric.values
        test_rows = d[d.group == 'test'].metric.values

        _, p1 = sts.ttest_ind(test_rows, ctrl_rows)
        _, p2 = sts.ttest_ind([np.mean(r) for r in tst], [np.mean(r) for r in ctrl])
        lc, lt = linearization(ctrl, tst)
        _, p3 = sts.ttest_ind(lt, lc)

        if p1 < alpha: row_fp += 1
        if p2 < alpha: avg_fp += 1
        if p3 < alpha: lin_fp += 1

    def ci(n): 
        lo, hi = proportion_confint(n, n_tests, alpha=0.05, method='wilson')
        return f'{n/n_tests:.3f} [{lo:.3f},{hi:.3f}]'

    label = 'AA FPR' if effect == 0 else f'AB power ({effect:.0%} effect)'
    print(f'{label}')
    print(f'  Row-level:      {ci(row_fp)}')
    print(f'  Per-user avg:   {ci(avg_fp)}')
    print(f'  Linearization:  {ci(lin_fp)}')
    print()


simulate(effect=0.0)
simulate(effect=0.50)
AA FPR
  Row-level:      0.884 [0.853,0.909]
  Per-user avg:   0.036 [0.023,0.056]
  Linearization:  0.038 [0.024,0.059]

AB power (50% effect)
  Row-level:      0.972 [0.954,0.983]
  Per-user avg:   0.370 [0.329,0.413]
  Linearization:  0.378 [0.337,0.421]

Linearization vs Delta Method

Both linearization and the Delta Method address ratio metrics, but they differ in approach and output:

| | Linearization | Delta Method |
|---|---|---|
| Output | Per-user scalar (T-test ready) | Asymptotic confidence interval |
| Variance reduction | Yes: enables CUPED on the linearized metric | Not directly |
| Bias | Slight bias: $\alpha$ estimated from control only | Asymptotically unbiased |
| Enables bucketing/CUPED | Yes | No |
| SQL friendliness | Very high | Very high |

The key advantage of linearization is that once you have per-user scalars, you can apply any per-user technique: CUPED, bucketing, stratification. The Delta Method gives you a confidence interval directly but stops there.
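
To make that concrete, here is a minimal CUPED sketch on top of the linearized values. It assumes a pre-experiment covariate (lin_control_pre / lin_test_pre, for example the same linearized metric computed over the pre-period for the same users); cuped_adjust is a hypothetical helper, not part of the pipeline above.

import numpy as np

def cuped_adjust(lin, lin_pre):
    """Remove the part of `lin` explained by the pre-period covariate."""
    lin, lin_pre = np.asarray(lin), np.asarray(lin_pre)
    theta = np.cov(lin, lin_pre)[0, 1] / np.var(lin_pre, ddof=1)
    return lin - theta * (lin_pre - lin_pre.mean())

# Usage sketch (pre-period arrays are hypothetical and must be aligned
# user-by-user with the experiment arrays; in practice theta is usually
# estimated on pooled control+test data so both groups share one adjustment):
# adj_c = cuped_adjust(lin_control, lin_control_pre)
# adj_t = cuped_adjust(lin_test,    lin_test_pre)
# sts.ttest_ind(adj_t, adj_c)  # same test, lower variance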

Conclusion

Linearization is a lightweight transformation that unlocks the full toolkit of per-user AB methods for ratio metrics. The formula is a single line: $L(u) = X(u) - \alpha \cdot Y(u)$, where $\alpha$ is the control group ratio.

Practical notes:

  • Estimate $\alpha$ from the control group only — using test data would introduce a circular dependency.
  • The linearized metric has mean $\approx 0$ in the control group by construction — this is expected and correct.
  • For further variance reduction, apply CUPED or bucketing to the linearized values.

References

  1. Deng, A., Knoblich, U., & Lu, J. (2018). Applying the Delta Method in Metric Analytics: A Practical Guide with Novel Ideas. KDD 2018
  2. Video: Sensitivity improvement for ratio metrics — includes the square-root reweighting trick
  3. Video: Advantages of linearization vs reweighting
  4. Avito variance reduction series — Part 1, Part 2