II. Design of experiments: Behaviour of e-variables under alternatives

Alexander Ly, Rosanne Turner, Udo Boehm, Rianne de Heide, Peter Grunwald

In the previous markdown document we used simulations to illustrate the behaviour of three sequential testing procedures. We saw that if data are generated under the null that the sequential use of both classical p-values and subjective Bayes factors leads to the incorrect conclusion that the null is false with chance larger than the tolerable alpha. On the other hand, with e-variables we retain type I error control uniformly across time.

A testing problem typically admits a large candidate set of e-variables. The class of all e-variables contains the procedure that always outputs the e-value one. This e-variable has the desired type I error control regardless of the sample size, because it just never rejects the null. This is not a great procedure, as it is unable to detect any effect. To select amongst a family of e-variables, we should not only be consider the null, but also the alternative hypothesis.

For the working two-sample z-test example we considered the candidate set of e-variables of type “eGauss” with parameter g. Any g > 0 defines an e-variable that we can use for inference. The selection of a particular e-variable with in this class can be guided by the alternative. In particular, when given a minimally clinically relevant effect size, the design function selects that g such that the fastest detecting “eGauss” e-variable is used for inference.

We will see that the choice of the minimal clinically relevant effect size and therefore g, however, is not crucial for fast inference. To illustrate this, we will elaborate on the following:

  1. In the first section we give insights on the safe design function, which determines both the parameter g and the planned sample size, if it is given a minimal clinically relevant effect size, and a targeted power/tolerable type II error rate, respectively.
  2. We show how such an e-variable can lead to fast inference due to its capacity for early stopping in the setting where the true effect size equals the minimal clinical relevant effect size.
  3. Simulations with a true effect size larger than the minimal clinically relevant effect size, then show that the test is able to extract information regarding the presence of the effect even faster.
  4. To further illustrate that a minimal clinically relevant effect size is not crucial for fast inference, we simulate under a true effect size smaller than the minimal clinically relevant effect size.

For this demonstration we use functions from the safestats package and the following colours:

eColours <- c("#1F78B4E6", "#A6CEE380")
lineColour <- "#DAA52066"
histInnerColour <- eColours[2]
histBorderColour <- eColours[1]

1. Designing safe z-test experiments

One sample path

Recall the example described in the installation guide with a minimal clinically relevant mean difference of 10 (IQ score increase) in a population with standard deviation sigmaTrue=12. In other words, a minimal clinically relevant (standardised) effect size of 10/12 = 0.83.

To sequentially detect such a minimal clinically relevant mean difference with 80% chance/power, thus, tolerable type II error beta=0.2, we have to plan an experiment with about 39 participants in both the treatment and the control group:

alpha <- 0.05
beta <- 0.2
meanDiffMin <- 10
sigmaTrue <- 12

designObj <- designSafeZ(meanDiffMin=meanDiffMin, beta=beta,

To compute a z-test e-value the function safeZTest uses the parameter defined in the designObj, which for the e-value type “eGauss” is referred to as g. This g is the variance of a so-called Gaussian mixture distribution, similar, to a normal prior in a Bayes factor with prior variance g. The design function sets g equal to meanDiffMin^2 / sigma^2, that is, (10 / 12)^2=0.6944.

In the installation guide we noticed that for a particular sample generated with a true mean difference of meanDiffMin, we correctly reject the null, if we compute the e-value after observing the planned (batch) sample sizes.

treatmentGroup <- rnorm(designObj$nPlanBatch[1], mean=122, sd=sigmaTrue)
controlGroup <- rnorm(designObj$nPlanBatch[2], mean=112, sd=sigmaTrue)

resultObj <- safeZTest(x=treatmentGroup, y=controlGroup,

The resulting e-value of 310.36 is much larger than 1/alpha, and a natural question to ask is whether we could have rejected the null earlier. The following code shows that this is indeed the case:

n1Plan <- designObj$nPlanBatch[1]
eValuesAtTime <- numeric(n1Plan)

for (i in 1:n1Plan) {
  eValuesAtTime[i] <- safeZTest(x=treatmentGroup[1:i],

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes()
plot(1:n1Plan, eValuesAtTime, type="l", lwd=2, xlab="n",
     ylab="e-value", col=eColours[1])
lines(c(1, n1Plan), c(1/alpha, 1/alpha), lwd=2, lty=2)
which(eValuesAtTime > 1/alpha)

The plot shows that we could have stopped after 23 pairs of observations rather than waiting until we saw 44 pairs. Hence, this emphasises how the realised sample size is random, as it depends on the data via the e-value. The more compelling the evidence indicated by the e-value, the earlier we can stop sampling. This thus allows us to adapt to the difficulty of the problem.

It is worth emphasising that the planned sample size is not a commitment, nor a promise of the number of samples we have to collect. The realised sample size at which the experiment is stopped can be smaller than planned, as was the case above. Below we see a case where it is larger. Furthermore, we also do not have to stop as soon as the e-variable crosses the threshold of 1/alpha. If a researcher wants to gather more evidence, then there are no repercussions in terms of over-inflating the type I error. In general, e-variables will have an (exponentially) fast upwards drift as a function of the sample size, whenever the alternative holds true. Hence, continue sampling, thus, yields even more evidence for an effect, whenever it holds true.

Early stopping: True effect size equals the minimal clinical relevant mean effect size

The planned sample sizes as reported by the design object result from a simulation under the worst case alternative. That is, when the true mean difference equals the minimal clinically relevant mean difference. The “designSafeZ” function first determines the batch sample size(s) necessary to detect the effect with 80% power if data analysis can only be performed once. For the problem at hand, after n1PlanBatch=n2PlanBatch=44 observations.

As multiple looks yield more opportunities to reject the null, we simulate at most up to the batch sample size. The simulation is performed by the function “sampleStoppingTimesSafeZ”, which generates a 1,000 data sets of length nPlanBatch, and calculates the e-values as the data come in. The first-time at which the e-value passes the threshold of 1/alpha is then saved. The following code visualises the stopping times.

stoppingTimes <- designObj$bootObjN1Plan$data

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes()
     xlim=c(0, designObj$nPlanBatch[1]),col=histInnerColour, 
     border=histBorderColour, lwd=2, main="", density=100)

Note the bump at nPlanBatch, which contains all experiments that yielded e-values above, but also those yielding e-values below 20. The recommended sample size to plan for to detect the minimal clinically relevant effect size with 80% chance is the 80% quantile of the stopping time distribution. The reported mean sample size is the average of this stopping time distribution.

The following code shows only the first-passage times, the times at which the e-value crosses the threshold of 1/alpha, and in yellow/gold e-value sample paths underneath it.

firstPassageTimes <-
firstPassageTimeHist <-
  hist(firstPassageTimes, plot=FALSE,

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes()
plot(NULL, xlim = c(0, designObj$nPlanBatch[1]),
     ylim = c(-1 * log(30), log(3000)),
     xlab = "", ylab = "", cex.lab = 1.3, cex.axis = 1.3, las = 1,
     yaxt = "n", bty = "n", type = "p", pch = 21, bg = "grey")

abline(h = log(1), col = "darkgrey", lwd = 1, lty = 2)
abline(h = log(20))

criticalP = log(c(1/20, 1, 20))

axis(side = 2, at = c(criticalP), tick = TRUE, las = 2, cex.axis = 1,
     labels = c("1/20", "1", "20"))

mtext("Evidence", side = 2, line = 2.5, las = 0, cex = 1.3, adj=0.175)
mtext("Sample size", side = 1, line = 2.5, las = 1, cex = 1.3)

stoppedPaths <- designObj$samplePaths[!designObj$breakVector, ]

someConstant <- 120

y <- firstPassageTimeHist$density
nB <- length(firstPassageTimeHist$breaks)
ylim <- range(y, 0)

rect(firstPassageTimeHist$breaks[-nB]+0.5, log(20),
     firstPassageTimeHist$breaks[-1L]+0.5, someConstant*y+log(20),
     col = histInnerColour, border = histBorderColour, lwd=2,
     angle = 45, density = 600, lty = NULL)

for (i in 1:100) {
  stoppedTime <- firstPassageTimes[i]
  evidenceLine <- stoppedPaths[i, 1:stoppedTime]

  if (evidenceLine[stoppedTime] > 20)
    evidenceLine[stoppedTime] <- 20

  lines(0:stoppedTime, c(0, log(evidenceLine)), col=lineColour,
        lwd=1.5, lty=1)

  if (evidenceLine[stoppedTime]==20)
    points(stoppedTime, log(evidenceLine[stoppedTime]),
           col=histBorderColour, pch=15, lwd=1.5)

Even earlier stopping: True effect size larger than the minimal clinical relevant mean effect size

When the true minimal clinically relevant effect size is larger than the minimal clinically relevant one, we can stop data collection even earlier. The following code describes a scenario where the true mean difference is 1.3 times larger than the minimal clinical relevant mean difference. For this we use the function “sampleStoppingTimesSafeZ” with the same parameter “g” as defined by the designObj as follows:

samplingResultLarger <- sampleStoppingTimesSafeZ(meanDiffTrue=1.3*designObj$esMin,

stoppingTimesLarger <- samplingResultLarger$stoppingTimes

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes()
     xlim=c(0, designObj$nPlanBatch[1]),col=histInnerColour, 
     border=histBorderColour, lwd=2, main="", density=100)
quantile(stoppingTimesLarger, 0.8)

This shows that the 80% quantile is now at sample size n1Plan=n2Plan=23, compared to n1Plan=n2Plan=39, and in this situation the procedure takes on a mean stopping time at sample size 16.8, compared to n1=n2=24.77. Hence, the relevant quantiles moved to smaller sample sizes.

Optional continuation: True effect size smaller than the minimal clinical relevant mean effect size

On the other hand, the 80% quantile of the stopping time distribution shifts to larger values whenever the true effect size is smaller than the minimal clinical relevant effect. The following code provides some insights:

samplingResultSmaller <- sampleStoppingTimesSafeZ(meanDiffTrue=designObj$esMin/1.3,

stoppingTimesSmaller <- samplingResultSmaller$stoppingTimes

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes()
     xlim=c(0, designObj$nPlanBatch[1]),col=histInnerColour, 
     border=histBorderColour, lwd=2, main="", density=100)
quantile(stoppingTimesSmaller, 0.8)

nRejectedExperiments <- sum(1-samplingResultSmaller$breakVector)

In this case, about 57.7% of the experiments stopped due to the e-values being larger than 1/alpha.

Fortunately, with e-variables we can continue sampling without incurring a type I error. This means that whenever the null is true, we cannot nudge the e-value above 1/alpha by simply extending trials, even if the decision to extend depends on the data. To derive the derive the additional number of samples needed to reach the desired power, we can use the design function as follows:

designSafeZ(meanDiffMin=designObj$esMin/1.3, beta=0.2, parameter=designObj$parameter, 
            sigma=sigmaTrue, testType="twoSample")

This shows that we should have planned for about 64 samples in each group instead. Thus, by sampling an additional 25 samples.

Multiplying with e-values from a replication attempt

Instead of directly extending the study, we can also consider a replication attempt and multiply the resulting e-values. This is also will not over-inflate the type I error, even if the decision to start a new replication attempt depends on the already acquired data. In general, to reach the desired power of 80%, however, we need replication attempts consisting of more than 25 samples in the two groups. In this case about 38 in each condition. As 577 studies yielded a rejection there are still 423 that did not. The following code simulates 423 replication attempts and multiply the resulting e-values.

# First batch of e-values
indexNotRejectedStudies <- which(samplingResultSmaller$breakVector==1)

firstBatchEvalues <- samplingResultSmaller$eValuesStopped[indexNotRejectedStudies]

secondBatch <- sampleStoppingTimesSafeZ(meanDiffTrue=designObj$esMin/1.3,

metaEValues <- firstBatchEvalues*secondBatch$eValuesStopped
additionalNRejections <- sum(metaEValues > 20)

Concluding remarks

We provide some insights in how to tune an e-variable, and we showed how inference can be sped up by exploiting the sequential nature of e-variables.