Intro

Sequential testing designs have gained significant popularity in AB testing due to their ability to potentially reduce the required sample size and experiment duration while maintaining statistical correctness. This approach allows for interim analyses of the data as it accumulates, offering the possibility to stop the experiment early if a clear winner emerges, or if it becomes evident that the treatment effect is insufficient to justify continuing (a stop for futility).

In this article, we shift our focus away from the theoretical intricacies of the problem and instead delve into a comprehensive exploration of available sequential testing solutions. We will discuss their implementations, compare their performance, and highlight their strengths and weaknesses. By examining these practical aspects, we aim to equip practitioners with the knowledge to make informed decisions when incorporating sequential testing into their AB testing workflows.

Approach

Among the variety of sequential testing designs, there are two broad families of algorithms: Group Sequential Testing (GST) and Always Valid Inference (AVI). These methods represent distinct philosophies in their handling of interim analyses and experiment stopping criteria.

  • GST works with predefined interim analysis points and utilizes predetermined stopping boundaries to decide whether to stop the experiment at each stage.

  • AVI allows continuous monitoring and provides valid confidence intervals at any point, making it adaptable to uncertain experiment durations or analysis frequencies.

This article primarily focuses on these two techniques, providing an overview of their methodologies and practical implications.

Group Sequential Testing

For group sequential testing there is a handy package in R, and I will show below how to use it. For those who prefer Python for any reason, whether it is the absence of an R interpreter, infrastructure limitations or just personal preference, there is no direct and popular alternative package, so I had to write one on my own, and I am now ready to share it with you after careful testing and benchmarking.

GST in R

If you’re ready to use the R library, there are two options: use the R runtime directly or go through the rpy2 Python package. Both options are available, for example, within the Google Colab environment.

Here is an example of using the rpy2 package within a Colab notebook, where you run R code from the Python interpreter using an extension.

iPython Notebook

%load_ext rpy2.ipython
%%R
R.version.string
install.packages("ldbounds")
%%R
library(ldbounds)
ldBounds(t=seq(1/4, 1, 1/4), iuse=3, phi=1, alpha=0.05, sides=1)$upper.bounds
[1] 2.241403 2.125078 2.018644 1.925452

Python File

The code above may be rewritten as a simple .py file as follows; you can then use the created stats package as a plain Python package.

⚠️ Caution

This will not work in Google Colab, for instance, as it requires R to be installed in addition to Python.

Code
import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector

utils = rpackages.importr('utils')

# select a mirror for R packages
utils.chooseCRANmirror(ind=1)

# R package names
packnames = ('ldbounds',)

# Selectively install what needs to be installed.
names_to_install = [
    package for package in packnames if not rpackages.isinstalled(package)
]

if len(names_to_install) > 0:
    utils.install_packages(StrVector(names_to_install))

stats = rpackages.importr('ldbounds')
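
Once the package is imported, its functions can be called directly from Python. Here is a minimal sketch mirroring the %%R example above; it assumes the standard rpy2 conversion, where rx2 extracts a named element of the returned R list.

from rpy2.robjects import FloatVector

bounds = stats.ldBounds(t=FloatVector([0.25, 0.5, 0.75, 1.0]), iuse=3, phi=1, alpha=0.05, sides=1)
# "upper.bounds" is the same component accessed via $upper.bounds in the R snippet above
print(bounds.rx2("upper.bounds"))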

If R is not your cup of tea, or there is simply no option to run it within your production infrastructure (which, by the way, is the most common limitation), what follows is exactly what you need.

GST in Python

There is a quite popular yet inaccurate implementation powered by Zalando’s expan. It works without probability integration and mistakenly uses the alpha spending function as the critical value at each step. This is a common misunderstanding of the alpha-spending approach: a number of implementations do it in exactly the same way, and even the world’s leading publication for data science, according to their own definition, sometimes makes the same mistake. For instance, Understanding of Group Sequential Testing published in Towards Data Science is a good post for beginners, though its application of the alpha spending function is misleading. As will be shown below, that approach is statistically incorrect, so it’s highly recommended to avoid it.

Instead, I propose the new library seqabpy, which is powerful, accurate and, more importantly, implemented according to the original papers: Interim analysis: The alpha spending function approach by K. K. Gordon Lan and David L. DeMets (1994) and further related publications. You can find them all mentioned in the References lines of the methods’ docstrings. Let’s take a look at the functionality.

Code
import numpy as np
import pandas as pd

from scipy.stats import norm

seqabpy provides Group Sequential Testing in a separate module, gatsby, whose name is an anagram of Group Sequential AB Testing in PYthon; below it’s shown why gatsby is often referred to as «The Great Gatsby».

#!pip install seqabpy
from seqabpy import gatsby

Lan-DeMets

The calculate_sequential_bounds function implements a rigorous approach to calculating confidence bounds in one-sided GST. In addition to the upper bounds, if beta is provided it also calculates lower bounds, unlocking the option of stopping the test for futility while maintaining the specified Type II error rate. The algorithm is taken from the article Group sequential designs using both type I and type II error probability spending functions by Chang MN, Hwang I, Shih WJ (1998).

gatsby.calculate_sequential_bounds(np.linspace(1/10, 1, 10), alpha=0.05, beta=0.2)
Sequential bounds algorithm to stop for futility converged to 0.00064 tolerance in 8 iterations using O'Brien-Fleming spending function.

(array([-3.02102866, -1.41478896, -0.59632535, -0.05664366,  0.35069266,
         0.68203514,  0.96419224,  1.21185008,  1.43395402,  1.79496377]),
 array([6.08789285, 4.22919942, 3.39632756, 2.90614903, 2.57897214,
        2.34174062, 2.15981329, 2.0146325 , 1.89528829, 1.79496377]))

The ldBounds function returns exactly the same numbers as the R package and is tailored to have a similar interface, in both input and output.

As a subtle benefit, it supports more spending functions; take a look at the docstring for more details.

gatsby.ldBounds(t=np.linspace(1/4, 1, 4), iuse=3, phi=1)
{'time.points': array([0.25, 0.5 , 0.75, 1.  ]),
 'alpha.spending': array([0.0125, 0.0125, 0.0125, 0.0125]),
 'overall.alpha': 0.05,
 'upper.bounds': array([2.24140273, 2.1251188 , 2.01870509, 1.92553052]),
 'nominal.alpha': array([0.0125    , 0.01678835, 0.02175894, 0.02708151])}

When only the upper bounds are calculated, the algorithm is faster. Given that these boundaries should be defined and fixed offline prior to the experiment start, the performance is perfectly sensible; when the lower (aka futility) bounds are computed in addition, it takes more time.

%%time
gatsby.ldBounds(t=np.linspace(1/10, 1, 10), alpha=0.1)
CPU times: total: 7.73 s
Wall time: 7.75 s

{'time.points': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
 'alpha.spending': array([1.97703624e-07, 2.34868089e-04, 2.43757240e-03, 6.62960181e-03,
        1.07070137e-02, 1.37029812e-02, 1.55891351e-02, 1.66134835e-02,
        1.70337597e-02, 1.70513868e-02]),
 'overall.alpha': 0.10000000000000009,
 'upper.bounds': array([5.07115563, 3.4973068 , 2.79510152, 2.38848818, 2.11870064,
        1.92350461, 1.77394818, 1.65463799, 1.55659472, 1.47418673]),
 'nominal.alpha': array([1.97703624e-07, 2.34990500e-04, 2.59417098e-03, 8.45892623e-03,
        1.70578873e-02, 2.72083541e-02, 3.80358613e-02, 4.89989759e-02,
        5.97833696e-02, 7.02156613e-02])}

An example of the less popular Haybittle-Peto spending function:

gatsby.ldBounds(t=np.linspace(1/4, 1, 4), iuse=5)
{'time.points': array([0.25, 0.5 , 0.75, 1.  ]),
 'alpha.spending': array([0.0013499, 0.       , 0.       , 0.0486501]),
 'overall.alpha': 0.05,
 'upper.bounds': array([3.        , 3.        , 3.        , 1.63391418]),
 'nominal.alpha': array([0.0013499 , 0.0013499 , 0.0013499 , 0.05113844])}

GST

GST is the general function that accounts for various deviations in the experiment design; in other words, if the peeking strategy changes, the method adjusts the bounds to guarantee a valid statistical procedure whenever it’s possible.

In particular, it is perfect for handling a few changed peeking points when the total number of peeks remains the same, and it handles under- and over-sampling in the best possible way; in the latter case, though, the procedure is not fully correct, as we will see in the simulations part.

The idea of implementation is taken from Group Sequential and Confirmatory Adaptive Designs in Clinical Trials by G. Wassmer and W. Brannath (2016)

Here are a few examples that show how different peeking strategies affect the bounds:

gatsby.ldBounds(t=np.array([0.3, 0.6, 1.0]), alpha=0.025)
{'time.points': array([0.3, 0.6, 1. ]),
 'alpha.spending': array([4.27257874e-05, 3.76533752e-03, 2.11919367e-02]),
 'overall.alpha': 0.02499999999999991,
 'upper.bounds': array([3.92857254, 2.669972  , 1.98103004]),
 'nominal.alpha': array([4.27257874e-05, 3.79287858e-03, 2.37939526e-02])}
  • in the case of under-sampling, the last upper bound is lower, which reflects that all the remaining alpha budget is spent at this point
  • in the case of over-sampling, the last upper bound is higher, which helps to control the Type I error rate after the expected sample size has been reached
gatsby.GST(actual=np.array([0.3, 0.6, 0.8]), expected=np.array([0.3, 0.6, 1]), alpha=0.025)
array([3.92857254, 2.669972  , 1.96890411])
gatsby.GST(actual=np.array([0.3, 0.6, 1.2]), expected=np.array([0.3, 0.6, 1]), alpha=0.025)
array([3.92857254, 2.669972  , 1.98949242])

Under- and over-sampling may also happen in a more natural way, when a few peeking points are added or removed.

gatsby.ldBounds(t=np.array([0.3, 0.6, 0.8, 1.0]), alpha=0.025)
{'time.points': array([0.3, 0.6, 0.8, 1. ]),
 'alpha.spending': array([4.27257874e-05, 3.76533752e-03, 8.40372704e-03, 1.27882097e-02]),
 'overall.alpha': 0.02499999999999991,
 'upper.bounds': array([3.92857254, 2.669972  , 2.28886308, 2.03074404]),
 'nominal.alpha': array([4.27257874e-05, 3.79287858e-03, 1.10436544e-02, 2.11404831e-02])}
gatsby.GST(actual=np.array([0.3, 0.6, 0.8]), expected=np.array([0.3, 0.6, 0.8, 1]), alpha=0.025)
array([3.92857254, 2.669972  , 2.15083427])
gatsby.GST(actual=np.array([0.3, 0.6, 1, 1.2]), expected=np.array([0.3, 0.6, 1]), alpha=0.025)
array([3.92857254, 2.669972  , 1.98102292, 2.0375539 ])

GST also supports an int as input; if peeking points are distributed uniformly, that is what you should use for the sake of convenience.

gatsby.GST(7, 7)
array([5.05481268, 3.48557771, 2.78550934, 2.38021304, 2.1113425 ,
       1.9168349 , 1.76778516])

While the method comes in handy in most scenarios, it doesn’t support all possible deviations: the beginning of the expected and actual peeking strategies must be the same, so it’s either over- or under-sampling, or a change of peeking points while their number remains equal.

Needless to say, the application of GST and the other functions mentioned above is not limited to one-sided hypotheses. To test a two-sided alternative, just set $\alpha$ to half of the value, e.g. 0.025 if you want to challenge a two-sided hypothesis at the 0.95 confidence level, and define the lower bounds symmetrically about zero, so they are the same in absolute value but negative.
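
For illustration, here is a minimal sketch of that recipe built on the ldBounds call shown earlier; the only new step is mirroring the bounds.

two_sided = gatsby.ldBounds(t=np.linspace(1/4, 1, 4), alpha=0.025)
upper = two_sided["upper.bounds"]  # one-sided boundaries at alpha = 0.025
lower = -upper                     # mirrored lower boundaries for the two-sided test at overall 0.05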

gatsby.GST(actual=np.array([0.3, 0.6, 0.9, 1.2]), expected=np.array([0.3, 0.6, 0.8, 1]), alpha=0.025)
array([3.92857254, 2.669972  , 2.30467653, 2.05129976])

# the following will not work
# gatsby.GST(actual=np.array([0.3, 0.6]), expected=np.array([0.8, 1]))
# gatsby.GST(
#     actual=np.array([0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 1]),
#     expected=np.array([0.3, 0.6, 0.8, 1])
# )

Always Valid Inference

While AVI is becoming increasingly popular in the field, overtaking GST, it’s worth noting that there are currently no widely adopted, comprehensive Python or R packages that focus solely on this approach.

There is one recent package, savvi, which appeared this year, but it is still in a v0.x version and has not yet been fully acknowledged by the community. What is more, it focuses only on the publications of Lindon et al. from 2022 and 2024, while there are other notable authors like Zhao et al. and Howard et al., whose approaches will be challenged in addition to Lindon’s work.

from seqabpy import gavi

seqabpy provides Always Valid Inference functionality in the gavi module, where, as of now, AlwaysValidInference is the main class; it implements confidence intervals that are valid at any point. While the intervals, namely their continuous comparison to the current z-score, provide an apparatus that is sufficient for practical decisions, p-values are to be released later as well, to complete the experiment analysis picture.

AlwaysValidInference takes an array of sample sizes at which the peeking happens, along with the metric variance and the resulting point difference. The supported methods implement different algorithms (a detailed description may be found in each docstring) that return a boolean array indicating whether the null hypothesis is rejected in favour of a one- or two-sided alternative at each size.

avi = gavi.AlwaysValidInference(size=np.arange(10, 100, 10), sigma2=1, estimate=1)

GAVI is the method proposed by Howard et al. and widely adopted in tech by Eppo

avi.GAVI(50)
array([False,  True,  True,  True,  True,  True,  True,  True,  True])

mSPRT is the approach proposed by M. Lindon in his article and is leveraged by Netflix

avi.mSPRT(0.08)
array([False, False,  True,  True,  True,  True,  True,  True,  True])

StatSig_SPRT is the variation proposed by Zhao et al. and, as the name suggests, is currently used by StatSig.

avi.StatSig_SPRT()
array([False,  True,  True,  True,  True,  True,  True,  True,  True])

The last and, this time, indeed least is the heavily criticized statsig_alpha_corrected_v1 approach, which was StatSig’s first attempt to furnish their platform with a sequential testing framework. It is mainly included for reference, to show how sequential testing must not work.

avi.statsig_alpha_corrected_v1(100)
array([False, False, False,  True,  True,  True,  True,  True,  True])

Simulations

For those who have visited my blog before, there is nothing new in how we will conduct the testing: it is good old Monte Carlo. For more details, check out my previous posts, like Dunnett’s Correction for ABC testing.

We will measure the False and True Positive Rates for two kinds of target metric: a continuous variable and a conversion. Furthermore, we will learn how tolerant different methods are to under- and over-sampling.

Code
# Global simulation settings
N = 500
alpha = 0.05
n_iterations = 100_000


def stops_at(is_significant: np.ndarray, sample_size: np.ndarray) -> int:
    """
    Determines the stopping sample size.

    This function identifies the first instance where the input
    condition is True and returns the corresponding sample size.

    Args:
        is_significant: A boolean array of the stop condition for each size
        sample_size: An array of sample sizes.

    Returns:
        The stopping sample size.

    Example:
        >>> stops_at([False, False, True, True], [50, 100, 150, 200])
        150
    """
    if len(is_significant) != len(sample_size):
        raise ValueError("Input arrays must have the same length.")
    w = np.where(is_significant)[0]
    return None if len(w) == 0 else sample_size[w[0]]

One great thing about GST is the incredible freedom in the choice of spending function, which makes it possible to experiment and find the best fit for your data. For demonstration purposes, I suggest using the Kim-DeMets spending function with different values of the power $\phi$: the higher $\phi$, the more strict the function is at the beginning of the experiment.

Code
gst_linear = gatsby.GST(actual=10, expected=10, iuse=3, phi=1, alpha=alpha)
gst_quadratic = gatsby.GST(actual=10, expected=10, iuse=3, phi=2, alpha=alpha)
gst_cubic = gatsby.GST(actual=10, expected=10, iuse=3, phi=3, alpha=alpha)

You can play around with the trade-off: would you like to spend more $\alpha$ at the start, detecting greater uplifts in your experiment group faster, or preserve the major part of the alpha until the end, keeping maximum power to reject the hypothesis once the expected sample size is reached?
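
As a quick numeric illustration of this trade-off, here is a minimal sketch using the boundary arrays computed above: the permissive linear function is cheaper to cross at the first look, while the stricter cubic one keeps a lower bar at the final analysis.

for name, bounds in [("linear", gst_linear), ("quadratic", gst_quadratic), ("cubic", gst_cubic)]:
    # first element: critical value at the earliest peek, last element: at the final analysis
    print(f"{name:>9}: first bound = {bounds[0]:.2f}, last bound = {bounds[-1]:.2f}")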

💡 Tip

If the title or the legend items are not visible to you, double-click one of the legend items and the chart will render properly. This may happen due to LaTeX usage.

Code
import plotly.express as px
import plotly.graph_objs as go

def hex2rgba(hex, alpha):
    """
    Convert plotly hex colors to rgba and enable transparency adjustment
    """
    col_hex = hex.lstrip('#')
    col_rgb = tuple(int(col_hex[i : i + 2], 16) for i in (0, 2, 4))
    col_rgb += (alpha,)
    return 'rgba' + str(col_rgb)

def get_new_color(colors):
    while True:
        for color in colors:
            yield color

colors_list = px.colors.qualitative.Plotly
rgba_colors = [hex2rgba(color, alpha=0.5) for color in colors_list]
palette = get_new_color(rgba_colors)

def add_chart(figure, data, title=None):

    x = np.arange(1, len(data) + 1) / len(data)
    color = next(palette)

    figure.add_trace(
        go.Scatter(
            name=title,
            x=x,
            y=data,
            mode='lines',
            line=dict(color=color, width=4, dash='solid'),
            hovertemplate="%{y:.3f}"
        ),
    )


figure = go.Figure()

add_chart(figure, gst_linear, r"$\text{Linear: } \phi = 1$")
add_chart(figure, gst_quadratic, r"$\text{Quadratic: } \phi = 2$")
add_chart(figure, gst_cubic, r"$\text{Cubic: } \phi = 3$")

figure.update_xaxes(
    title_text="Peeking moments"
)

figure.update_layout(
    yaxis_title="Critical value for z-score",
    title={
        "x": 0.5,
        "text": r"$\text{Kim-DeMets spending function: } \alpha \cdot t^{\phi} \text{ differences}$",
    },
    hovermode="x",
    template="plotly_dark",
)

figure.write_json("alpha-spending-functions-comparison.json")
figure.show()

Expan Flaw

It’s kind of a hot take. Remember, I promised to provide evidence that the expan way of determining boundaries is inaccurate, so here is a quick proof; the code is taken with no changes from their GitHub: zalando/expan/early_stopping

Code
from statsmodels.stats.proportion import proportion_confint


def sample_size(x):
    """ Calculates valid sample size given the data.

    :param x: sample to calculate the sample size
    :type  x: pd.Series or list (array-like)

    :return: sample size of the sample excluding nans
    :rtype: int
    """
    # cast into a dummy numpy array to infer the dtype
    x_as_array = np.array(x)

    if np.issubdtype(x_as_array.dtype, np.number):
        _x = np.array(x, dtype=float)
        x_nan = np.isnan(_x).sum()
    # assuming categorical sample
    elif isinstance(x, pd.core.series.Series):
        x_nan = x.str.contains('NA').sum()
    else:
        x_nan = list(x).count('NA')

    return int(len(x) - x_nan)

def obrien_fleming(information_fraction, alpha=0.05):
    """ Calculate an approximation of the O'Brien-Fleming alpha spending function.

    :param information_fraction: share of the information  amount at the point of evaluation,
                                 e.g. the share of the maximum sample size
    :type  information_fraction: float
    :param alpha: type-I error rate
    :type  alpha: float

    :return: redistributed alpha value at the time point with the given information fraction
    :rtype:  float
    """
    return (1 - norm.cdf(norm.ppf(1 - alpha / 2) / np.sqrt(information_fraction))) * 2


def group_sequential(x, y, spending_function='obrien_fleming', estimated_sample_size=None, alpha=0.05, cap=8):
    """ Group sequential method to determine whether to stop early.

    :param x: sample of a treatment group
    :type  x: pd.Series or array-like
    :param y: sample of a control group
    :type  y: pd.Series or array-like
    :param spending_function: name of the alpha spending function, currently supports only 'obrien_fleming'.
    :type  spending_function: str
    :param estimated_sample_size: sample size to be achieved towards the end of experiment
    :type  estimated_sample_size: int
    :param alpha: type-I error rate
    :type  alpha: float
    :param cap: upper bound of the adapted z-score
    :type  cap: int

    :return: results of type EarlyStoppingTestStatistics
    :rtype:  EarlyStoppingTestStatistics
    """

    # Coercing missing values to right format
    _x = np.array(x, dtype=float)
    _y = np.array(y, dtype=float)

    n_x = sample_size(_x)
    n_y = sample_size(_y)

    if not estimated_sample_size:
        information_fraction = 1.0
    else:
        information_fraction = min(1.0, (n_x + n_y) / estimated_sample_size)

    # alpha spending function
    if spending_function in ('obrien_fleming'):
        func = eval(spending_function)
    else:
        raise NotImplementedError
    alpha_new = func(information_fraction, alpha=alpha)

    # calculate the z-score bound
    bound = norm.ppf(1 - alpha_new / 2)
    # replace potential inf with an upper bound
    if bound == np.inf:
        bound = cap

    mu_x = np.nanmean(_x)
    mu_y = np.nanmean(_y)
    sigma_x = np.nanstd(_x)
    sigma_y = np.nanstd(_y)
    z = (mu_x - mu_y) / np.sqrt(sigma_x ** 2 / n_x + sigma_y ** 2 / n_y)

    if z > bound or z < -bound:
        stop = True
    else:
        stop = False

    return stop

fpr = 0

for r in range(n_iterations):

    x = np.random.normal(1, 1, N)
    y = np.random.normal(1, 1, N)

    for current_size in np.linspace(N/10, N, 10).astype(int):
        stopping = group_sequential(x[:current_size], y[:current_size], estimated_sample_size=2*N, alpha=0.05)
        if stopping:
            fpr += 1
            break

l, r = proportion_confint(count=fpr, nobs=n_iterations, alpha=0.10, method='wilson')
print(f"false positives: {fpr/n_iterations:.3f} ± {(r - l) / 2:.3f} is significantly higher than {alpha}")
false positives: 0.070 ± 0.001 is significantly higher than 0.05

So, as was said above, it doesn’t control the FPR as it should according to the Group Sequential Testing design, and hence the myth of the direct application of the alpha spending function has to be dispelled: it doesn’t work this way, and further on you will see that it’s not much better than custom ad-hoc corrections.
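
To see where the inflation comes from, here is a minimal sketch that contrasts the shortcut with the boundaries produced by probability integration (it reuses the ldBounds call from above with its default spending function): the shortcut plugs the cumulative alpha spent by each look straight into the normal quantile, so at the final analysis it falls back to the fixed-horizon critical value as if no alpha had been consumed by the earlier peeks, while the correct final boundary stays strictly above it.

t = np.linspace(1/10, 1, 10)
b = gatsby.ldBounds(t=t, alpha=alpha)

# expan-style shortcut: cumulative spent alpha reused as a per-look significance level
naive_bounds = norm.ppf(1 - np.cumsum(b["alpha.spending"]))

print(f"naive final bound:   {naive_bounds[-1]:.3f}")       # norm.ppf(1 - alpha), i.e. the no-peeking value
print(f"correct final bound: {b['upper.bounds'][-1]:.3f}")  # noticeably larger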

⚠️ Warning

Please, do not use expan for sequential testing as it inflates Type I error rate.

Monte Carlo

Code
from collections import defaultdict


def monte_carlo(
    metric: str="normal",
    sampling: str="accurate",
    effect_size: float=0.10,
    aa_test: bool=True,
    N: int = N,
) -> pd.DataFrame:

    result = defaultdict(list)
    eff = 0 if aa_test else effect_size

    if metric == "normal":
        mu, sigma = 1, 1
    else:
        p = 0.10
        sigma = (p * (1 - p)) ** 0.5
        # for bernoulli rv sigma is less than for normal
        # so it's better to increase N to get similar power
        N *= int((sigma / p) ** 2)

    for _ in range(n_iterations):
        if metric == "normal":
            x = np.random.normal(mu, sigma, N)
            y = np.random.normal(mu+eff, sigma, N)
        else:
            x = np.random.choice(a=[0, 1], size=N, replace=True, p=[1 - p, p])
            y = np.random.choice(a=[0, 1], size=N, replace=True, p=[1 - p*(1+eff), p*(1+eff)])

        size = np.arange(1, N + 1)
        diff = (np.cumsum(y) / size) - (np.cumsum(x) / size)

        test = gavi.AlwaysValidInference(size=size, sigma2=sigma**2, estimate=diff, alpha=alpha)

        itermittent_analyses = np.linspace(N/10, N, 10).astype(int) - 1
        z_score = diff[itermittent_analyses] / np.sqrt(2 * sigma ** 2 / size[itermittent_analyses])

        result['No_Seq'].append(N if z_score[-1] > norm.ppf(1 - alpha) else None)

        if sampling == "accurate":

            result['GAVI'].append(stops_at(test.GAVI(), size))
            result['mSPRT'].append(stops_at(test.mSPRT(), size))
            result['StatSig_SPRT'].append(stops_at(test.StatSig_SPRT(), size))
            result['StatSig_v1'].append(stops_at(test.statsig_alpha_corrected_v1(), size))

            result['GST_linear'].append(stops_at(z_score > gst_linear, size[itermittent_analyses]))
            result['GST_quadratic'].append(stops_at(z_score > gst_quadratic, size[itermittent_analyses]))
            result['GST_cubic'].append(stops_at(z_score > gst_cubic, size[itermittent_analyses]))

        elif sampling == "undersampled":

            result['GAVI'].append(stops_at(test.GAVI(phi=N*7/5), size))
            # undersampling is the case when the effect is larger than expected,
            # so let's say the effect is ~ 7/5 times larger: 4 * (5/7)^2 ~ 2
            result['mSPRT'].append(stops_at(test.mSPRT(phi=2 * sigma**2 / diff**2), size))
            result['StatSig_SPRT'].append(stops_at(test.StatSig_SPRT(), size))
            result['StatSig_v1'].append(stops_at(test.statsig_alpha_corrected_v1(N=N*7/5), size))

            result['GST_linear'].append(stops_at(z_score > gst_linear_undersampled, size[itermittent_analyses]))
            result['GST_quadratic'].append(stops_at(z_score > gst_quadratic_undersampled, size[itermittent_analyses]))
            result['GST_cubic'].append(stops_at(z_score > gst_cubic_undersampled, size[itermittent_analyses]))

        elif sampling == "oversampled":

            result['GAVI'].append(stops_at(test.GAVI(phi=N*7/10), size))
            # oversampling is the case when the effect is lower than expected,
            # so let's say the effect is ~ 7/10 times the expected one: 4 * (10/7)^2 ~ 8
            result['mSPRT'].append(stops_at(test.mSPRT(phi=8 * sigma**2 / diff**2), size))
            result['StatSig_SPRT'].append(stops_at(test.StatSig_SPRT(), size))
            result['StatSig_v1'].append(stops_at(test.statsig_alpha_corrected_v1(N=N*7/10), size))

            result['GST_linear'].append(stops_at(z_score > gst_linear_oversampled, size[itermittent_analyses]))
            result['GST_quadratic'].append(stops_at(z_score > gst_quadratic_oversampled, size[itermittent_analyses]))
            result['GST_cubic'].append(stops_at(z_score > gst_cubic_oversampled, size[itermittent_analyses]))

        else:
            raise ValueError("Unknown sampling method")

    # remove StatSig_v1 from Power comparison
    if not aa_test:
        result.pop('StatSig_v1')

    df = pd.DataFrame(result).agg(["count", "median"]).T.assign(
          PositiveRate=lambda x: (x["count"] / n_iterations).round(3)
        ).assign(
            SampleSize=lambda x: x["median"].astype(int)
        )[["PositiveRate", "SampleSize"]]

    return df


def plot_positive_rate(
    df: pd.DataFrame,
    aa_test: bool=True,
    sampling: str=None
):

    fig = go.Figure()

    if aa_test:
        error_const = round(3 * (alpha * (1 - alpha) / n_iterations) ** 0.5, 3)
    else:
        error_array = round(3 * (df["PositiveRate"] * (1 - df["PositiveRate"]) / n_iterations) ** 0.5, 3)

    fig.add_trace(go.Bar(
        x=df.index,
        y=df["PositiveRate"],
        marker_color=next(palette),
        error_y=dict(type='constant', value=error_const) if aa_test else dict(type='data', array=error_array),
    ))

    if aa_test:
        fig.add_hline(
            y=0.05,
            line_dash="dot",
            annotation_text="designed Type I error rate",
            annotation_position="top right"
        )

    title = (
        f"{'Correctness' if aa_test else 'Power'} of"
        f"{' ' + sampling if sampling else ''} Sequential Testing Design"
    )
    
    fig.update_layout(
        yaxis_title=f"{str(not aa_test)} Positive Rate",
        title={
            "x": 0.5,
            "text": title,
        },
        hovermode="x",
        template="plotly_dark",
    )

    fig.write_json(f"{title.replace(' ', '-').lower()}.json")

    fig.show()

Continuous Variable

As you can see, for GST the bounds are pre-calculated for the number of interim analyses that were expected to take place and that actually took place. We calculate bounds for the 10-analyses scenario, additionally considering over- and under-sampling designs.

Code
gst_linear_undersampled = gatsby.GST(actual=10, expected=14, iuse=3, phi=1, alpha=alpha)
gst_quadratic_undersampled = gatsby.GST(actual=10, expected=14, iuse=3, phi=2, alpha=alpha)
gst_cubic_undersampled = gatsby.GST(actual=10, expected=14, iuse=3, phi=3, alpha=alpha)

gst_linear_oversampled = gatsby.GST(actual=10, expected=7, iuse=3, phi=1, alpha=alpha)
gst_quadratic_oversampled = gatsby.GST(actual=10, expected=7, iuse=3, phi=2, alpha=alpha)
gst_cubic_oversampled = gatsby.GST(actual=10, expected=7, iuse=3, phi=3, alpha=alpha)

False Positives

Code
df = monte_carlo(aa_test=True)
df

PositiveRate SampleSize
No_Seq 0.051 500
GAVI 0.018 221
mSPRT 0.048 38
StatSig_SPRT 0.026 43
StatSig_v1 0.074 421
GST_linear 0.051 300
GST_quadratic 0.050 400
GST_cubic 0.051 400
Code
plot_positive_rate(df, aa_test=True)

As immediately becomes clear, the StatSig v1 correction was a flaw, while all the other methods target $\alpha$ as needed. However, among the AVI methods only mSPRT gets close enough to the designed level; the rest make fewer false positives, which is usually a sign of lower statistical power, as we will see later.

Code
monte_carlo(aa_test=True, sampling="undersampled")

PositiveRate SampleSize
No_Seq 0.050 500
GAVI 0.014 253
mSPRT 0.043 38
StatSig_SPRT 0.026 41
StatSig_v1 0.015 451
GST_linear 0.047 350
GST_quadratic 0.046 500
GST_cubic 0.045 500
Code
df = monte_carlo(aa_test=True, sampling="oversampled")
df

PositiveRate SampleSize
No_Seq 0.051 500
GAVI 0.021 189
mSPRT 0.046 35
StatSig_SPRT 0.027 36
StatSig_v1 0.187 378
GST_linear 0.065 250
GST_quadratic 0.072 300
GST_cubic 0.077 350
Code
plot_positive_rate(df, aa_test=True, sampling="oversampled")
  1. Over-sampling is a tough cookie for GST: in this case GST doesn’t work correctly and inflates the Type I error, so it’s important to note the difference between AVI and GST here, as the latter is not designed to handle over-sampling
  2. StatSig distinguished itself: their v1 version suffers from both under- and over-sampling more than any other method, while on the flip side their SPRT implementation is totally resistant to under- and over-sampling and, if it identifies a positive, it does so quickly; most likely it will be underpowered though.
  3. As of now, mSPRT seems to be the best choice, as it identifies differences quickly and only a little less often than it should.

True Positives

It’s time to compare the power of the different methods. I’m not going to consider the StatSig alpha-corrected version anymore, as it’s not a valid procedure.

Code
df = monte_carlo(aa_test=False)
df

PositiveRate SampleSize
No_Seq 0.474 500
GAVI 0.222 285
mSPRT 0.268 202
StatSig_SPRT 0.188 230
GST_linear 0.409 300
GST_quadratic 0.445 350
GST_cubic 0.459 400
Code
plot_positive_rate(df, aa_test=False)
Code
df = monte_carlo(aa_test=False, sampling="undersampled")
df

PositiveRate SampleSize
No_Seq 0.478 500
GAVI 0.211 306
mSPRT 0.253 210
StatSig_SPRT 0.190 233
GST_linear 0.425 350
GST_quadratic 0.449 450
GST_cubic 0.455 500
Code
plot_positive_rate(df, aa_test=False, sampling="undersampled")
Code
monte_carlo(aa_test=False, sampling="oversampled")

PositiveRate SampleSize
No_Seq 0.476 500
GAVI 0.236 268
mSPRT 0.263 208
StatSig_SPRT 0.190 233
GST_linear 0.443 300
GST_quadratic 0.496 300
GST_cubic 0.519 350

So, as follows from the bar charts and tables:

  1. all AVI methods (including the over-sampled options) are much weaker than even under-sampled GST, so power-wise GST is the unconditional winner
  2. it’s appealing that for under-sampled GST the power shows only a subtle decline, and even then only for the strict spending function (Cubic), while it actually increases for the permissive one (Linear)
  3. although, when AVI does reject the null hypothesis, it does so quicker (the required Sample Size is smaller) than GST on average

Conversion Rate

In addition to a continuous measure, let’s consider a ratio variable and see how the methods work with conversions.

False Positives

Code
df = monte_carlo(aa_test=True, metric="choice")
df

PositiveRate SampleSize
No_Seq 0.050 4500
GAVI 0.018 1913
mSPRT 0.072 88
StatSig_SPRT 0.044 60
StatSig_v1 0.074 3796
GST_linear 0.050 2700
GST_quadratic 0.049 3600
GST_cubic 0.049 3600
Code
plot_positive_rate(df, aa_test=True, sampling="ratio")

As you see, in addition to the StatSig v1 Alpha Correction, which is again an outsider, the mSPRT approximation is not good enough for a Bernoulli random variable. For conversions, a different approach should be applied, and the savvi package might come in handy here, as the main purpose of that library is to work with inhomogeneous Bernoulli or Poisson processes. Alternatively, you may use the sequential_p_value function from the gavi module of seqabpy; it’s a valid procedure following the algorithm defined by M. Lindon and A. Malek in Anytime-Valid Inference For Multinomial Count Data (2022), though it could be a little less powerful than the savvi implementation, which follows even more recent articles.

Code
expected_probs = [0.5, 0.5]

# it's an asymptotic algorithm, so only the numerators are compared,
# assuming the denominators of the conversions are similar, as in a fair A/B test
actual_counts = [156, 212]

print(f"AVI p-value for Conversion: {gavi.sequential_p_value(actual_counts, expected_probs):.3f}")
AVI p-value for Conversion: 0.075
Code
df = monte_carlo(aa_test=True, metric="choice", sampling="oversampled")
df

PositiveRate SampleSize
No_Seq 0.051 4500
GAVI 0.022 1620
mSPRT 0.069 85
StatSig_SPRT 0.045 61
StatSig_v1 0.189 3377
GST_linear 0.064 2250
GST_quadratic 0.072 2700
GST_cubic 0.076 3150
Code
plot_positive_rate(df, aa_test=True, sampling="oversampled ratio")

The chart above is just to assure you that over-sampled GST doesn’t work for conversions either. I can’t help but stress that GST in an over-sampling design is flawed, although still much better than the StatSig v1 Alpha Correction.

True Positives

Let’s take a brief look at the power comparison for a couple of different effect sizes.

Code
df = monte_carlo(aa_test=False, metric="choice", effect_size=0.10)
df

PositiveRate SampleSize
No_Seq 0.480 4500
GAVI 0.240 2482
mSPRT 0.310 1441
StatSig_SPRT 0.226 1702
GST_linear 0.420 2700
GST_quadratic 0.449 3150
GST_cubic 0.460 3600
Code
plot_positive_rate(df, aa_test=False, sampling="ratio")
Code
df = monte_carlo(aa_test=False, metric="choice", effect_size=0.2)
df

PositiveRate SampleSize
No_Seq 0.930 4500
GAVI 0.767 2127
mSPRT 0.796 1593
StatSig_SPRT 0.718 1923
GST_linear 0.898 2250
GST_quadratic 0.915 2250
GST_cubic 0.921 2700
Code
plot_positive_rate(df, aa_test=False, sampling="strong effect")

With a growing effect size the relative difference in power gets smaller, but you can check that for any reasonable effect size GST outperforms AVI. What is more, even for a conversion variable, where the mSPRT method doesn’t really control the Type I error rate, mSPRT is still less powerful than GST.

Conclusion

Generally speaking, I’d say that GST is still the best framework for sequential testing, despite all the recent publications on cutting-edge AVI variations.

However, I have to add a caveat: while AVI is noticeably less powerful, it’s perfect for working in a streaming manner on guardrail metrics, while GST is better for the target metrics within your AB test.

💡 Practical Tip

By combining these methodologies you may set up a robust Sequential Testing framework, gaining from both: quick detection of major deteriorations in your product with AVI, and reliable uplift discoveries in your decisive metrics with the more powerful GST procedure.
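
Below is a minimal, synthetic sketch of such a combination that uses only the calls shown earlier in this post; the data, the number of looks and the mSPRT tuning value are illustrative placeholders rather than recommendations.

rng = np.random.default_rng(7)

# Guardrail metric: monitored continuously (streaming) with AVI
sizes = np.arange(100, 5001, 100)
guardrail_diff = rng.normal(0.0, 1.0, len(sizes)) / np.sqrt(sizes)  # synthetic running difference
guardrail = gavi.AlwaysValidInference(size=sizes, sigma2=1.0, estimate=guardrail_diff, alpha=alpha)
stop_for_harm = guardrail.mSPRT(0.08).any()

# Target metric: checked only at the 10 planned interim analyses against GST bounds
target_bounds = gatsby.GST(actual=10, expected=10, alpha=alpha)
target_z_scores = rng.normal(0.5, 1.0, 10)  # synthetic z-scores at each planned look
ship_winner = (target_z_scores > target_bounds).any()

print(f"stop for harm: {stop_for_harm}, ship the winner: {ship_winner}")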

Another important point is to be deliberate about the choice of the specific version of the algorithm you will use.

For instance, running under-sampled experiments with a strict alpha spending function, like Cubic, is less preferable: under-sampling works better with permissive spending functions, while over-sampling with Cubic spending is worse, as it inflates $\alpha$ more.

💡 General Rule

The more permissive the spending function, the faster an effect is identified, but the less power is achieved at the end of the experiment, which is especially striking for less substantial effect sizes.

To round this extensive blog post up, here are the recommendations on choosing a sequential testing framework, wrapped up in a single decision tree:

References

seqabpy is an open-source library that is perfect for GST and AVI in Python. In addition to the implemented functionality, it cites all the referenced original papers in the functions’ docstrings, so you may get acquainted with the original works.

There are similar posts by Booking and Spotify; however, they do not share implementation details, so you may read them to deepen your understanding, but can hardly apply them in practice: