Background

We at HomeBuddy deal with metrics that are measured at the event level but compared at the user level: each user may have dozens of sessions, purchases, or page views. Working with millions of raw rows is computationally expensive, and the raw metric distribution is typically heavy-tailed, making a T-test unreliable without a large enough sample.

Bucketing solves both problems at once: users are randomly partitioned into $b$ buckets, and the metric is aggregated within each bucket. The resulting $b$ bucket-level values are nearly normally distributed by the Central Limit Theorem, enabling a T-test with far fewer data points — and dramatically reduced storage and computation.

Prerequisites

Python 3.11.4 with NumPy, SciPy, pandas, and Plotly.
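
The snippets below assume this consolidated import cell; Plotly is imported separately in the chart snippet.

Shared imports
import hashlib

import numpy as np
import pandas as pd
import scipy.stats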

Synthetic Data

We reuse the same lognormal simulation from the Delta Method post: 1 000 unique users, 10 000 rows per group, metric sampled from $\mathrm{LogNormal}(3, 100)$ (in scipy terms, lognorm with shape 3 and a location shift of 100, hence the values clustered just above 100).

Data generation
def generate_samples(n_users, n_samples, seed=0):
    np.random.seed(seed)

    def encoder(x):
        # Derive both the id and the group from the same md5 digest, so the
        # split is deterministic (the built-in hash() is salted per process)
        uid = hashlib.md5(str(x).encode()).hexdigest()
        test_flg = int(uid, 16) % 2
        return (uid, 'test' if test_flg else 'control')

    df = pd.DataFrame(
        list(map(encoder, np.random.randint(0, n_users, 2 * n_samples))),
        columns=['user_id', 'group'],
    )
    return df.assign(metric=scipy.stats.lognorm.rvs(3, loc=100, size=2 * n_samples))


df = generate_samples(1000, 10000)
df.sample(3)

user_id group metric
2512 f2fc990265c712c49d51a18a32b39f0c control 100.555829
19645 98d6f58ab0dafbb86b083a001561bb34 test 100.458050
8912 07a96b1f61097ccb54be14d6a47439b0 test 100.369466

Bucketing

Each user is deterministically assigned to one of $b$ buckets using the hash of their ID modulo $b$. The metric is then summed within each bucket, yielding $b$ approximately normally distributed values per group.

Bucketing function
def make_bucket(df, ids='user_id', groups='group', num_buck=100):
    def stable_bucket(uid):
        # md5 rather than the built-in hash(), which is salted per process
        return 1 + int(hashlib.md5(str(uid).encode()).hexdigest(), 16) % num_buck

    return (
        df.assign(bucket=df[ids].map(stable_bucket))
        .groupby([groups, 'bucket'])
        .agg({'metric': 'sum'})
        .reset_index()
    )


bucketed = make_bucket(df)
bucketed.sample(3)

group bucket metric
180 test 83 18836.189558
2 control 3 35560.334840
169 test 72 19833.689535
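
With bucket-level values in hand, the comparison itself is a plain two-sample T-test on the ~100 bucket sums per group. A minimal sketch using the bucketed frame from above:

T-test on bucket sums
test = bucketed[bucketed.group == 'test'].metric
control = bucketed[bucketed.group == 'control'].metric

t_stat, p_value = scipy.stats.ttest_ind(test, control)
print(f't = {t_stat:.3f}, p = {p_value:.3f}')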

Distribution Before and After Bucketing

The raw metric is lognormally distributed — highly right-skewed. After aggregating into 100 buckets, the bucket sums are approximately normal, justifying the T-test.
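
The visual impression can be backed by a quick numerical check, for instance sample skewness before and after bucketing (a side check, not part of the original pipeline; a lognormal's skewness is large and positive, a normal's is near zero):

Skewness before and after bucketing
for grp in ['test', 'control']:
    raw_skew = scipy.stats.skew(df[df.group == grp].metric)
    buck_skew = scipy.stats.skew(bucketed[bucketed.group == grp].metric)
    print(f'{grp}: raw skew = {raw_skew:.2f}, bucket-sum skew = {buck_skew:.2f}')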

Distribution comparison chart
import plotly.graph_objs as go
from plotly.subplots import make_subplots


fig = make_subplots(rows=2, cols=2,
    subplot_titles=('Raw Metric — Test', 'Raw Metric — Control',
                    'Bucket Sums — Test', 'Bucket Sums — Control'))

colors = {'test': '#636efa', 'control': '#EF553B'}
for col, grp in enumerate(['test', 'control'], start=1):
    fig.add_trace(go.Histogram(
        x=df[df.group == grp].metric.values, nbinsx=100,
        name=f'Raw {grp}', marker_color=colors[grp], showlegend=False), row=1, col=col)
    fig.add_trace(go.Histogram(
        x=bucketed[bucketed.group == grp].metric.values,
        name=f'Buckets {grp}', marker_color=colors[grp], showlegend=False), row=2, col=col)

fig.update_layout(
    title={'text': 'Bucketing: Raw vs Bucket-Sum Distribution', 'x': 0.5},
    template='plotly_dark', height=600)
fig.show()

Choosing the Number of Buckets

A critical constraint: the number of users must substantially exceed the number of buckets. If $n \approx b$, many buckets will be empty, making the statistic meaningless.

Let $P(n, b)$ be the probability that all $b$ buckets contain at least one of $n$ users. Picture the $n$ users in a row and split them into buckets with $b-1$ dividers. Arrangements where every bucket is non-empty place the dividers in distinct gaps, one of $\binom{n-1}{b-1}$ choices among the $n-1$ gaps between users. If instead each divider may land in any of the $n+1$ positions (the two ends included, repeats allowed), there are $(n+1)^{b-1}$ arrangements in total. Therefore:

$$ P(n, b) = \frac{\binom{n-1}{b-1}}{(n + 1)^{b - 1}} $$

Applying Stirling’s approximation $n! \approx \sqrt{2\pi n}(n/e)^n$:

$$ P(n, b) \approx \frac{n!}{(n-b)! \cdot b! \cdot n^b} $$

which tends to zero as $n \to \infty$ whenever $b$ grows in proportion to $n$ (for fixed $b$ the ratio instead approaches $1/b!$).

Two particularly informative regimes:

  • $b \sim n$: $P \sim 1/n^n \to 0$ (vanishingly small probability)
  • $b \sim n/2$: $P \sim \sqrt{2/(\pi n)}(4/n)^{n/2} \to 0$ (also zero)

Practical rule of thumb: keep $b \ll n$, typically $b \leq n/10$. With 1 000 users, use at most 100 buckets; with 10 000 users, 200–500 buckets is reasonable.
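
A quick Monte Carlo check of the qualitative claim, simulating uniform random assignment directly rather than the divider model above (trial count and n are illustrative choices):

Monte Carlo: probability that all buckets are filled
def prob_all_filled(n, b, trials=2000, seed=0):
    # Fraction of trials in which every one of the b buckets
    # receives at least one of the n users
    rng = np.random.default_rng(seed)
    hits = sum(
        len(np.unique(rng.integers(0, b, n))) == b
        for _ in range(trials)
    )
    return hits / trials


n = 1000
for b in [n // 10, n // 2, n]:
    print(f'b = {b:4d}: P(all filled) ≈ {prob_all_filled(n, b):.3f}')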

Empty bucket rate vs bucket count
n_users = df.user_id.nunique() // 2  # users are split roughly evenly between groups
print(f'Users per group: {n_users}')

test_rows = df[df.group == 'test']
for b in [10, 50, 100, 200, 500]:
    bucketed_check = make_bucket(test_rows, num_buck=b)
    filled = bucketed_check.bucket.nunique()
    print(f'  {b:4d} buckets → {filled:4d} non-empty  ({100*(b-filled)/b:.1f}% empty)')
Users per group: 500
    10 buckets →   10 non-empty  (0.0% empty)
    50 buckets →   50 non-empty  (0.0% empty)
   100 buckets →   99 non-empty  (1.0% empty)
   200 buckets →  185 non-empty  (7.5% empty)
   500 buckets →  320 non-empty  (36.0% empty)
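
These rates track the standard occupancy expectation: with $n$ users thrown uniformly into $b$ buckets, the expected number of empty buckets is $b(1 - 1/b)^n$. A quick side check (not part of the derivation above):

Expected empty-bucket rate
n = 500  # users per group, as printed above
for b in [10, 50, 100, 200, 500]:
    expected_empty = b * (1 - 1 / b) ** n
    print(f'  {b:4d} buckets → expected {100 * expected_empty / b:.1f}% empty')

For 200 and 500 buckets this evaluates to roughly 8% and 37%, close to the observed 7.5% and 36.0%.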

Conclusion

Bucketing is one of the simplest variance-reduction techniques available in AB testing:

  • Normalises the metric distribution, making T-tests valid on far fewer data points.
  • Reduces computation: instead of testing millions of rows, you compare $b$ bucket aggregates.
  • Distributable: bucket assignment is a hash modulo — trivially parallelisable in any SQL or Spark query.

The main constraint is bucket count: choose $b$ significantly smaller than the number of users per group to avoid empty buckets that would bias the test statistic. For our typical experiment sizes at HomeBuddy, 100–200 buckets gives a good trade-off between normalisation effectiveness and statistical efficiency.