III. Interval estimation: Classical confidence intervals, (Bayesian) credible intervals, and anytime-valid confidence sequences

Alexander Ly, Rosanne Turner, Udo Boehm, Rianne de Heide, Peter Grunwald

Previously we considered the testing problem. We showed that, while keeping the type I error rate below a tolerable alpha, we may reject the null and stop the experiment as soon as the e-value exceeds 1/alpha, that is, as soon as e > 1/alpha.

Here we consider the problem of interval estimation. An anytime-valid 95% interval estimate should contain the true value with at least 95% chance, regardless of when, why, or even whether data collection stopped.

We illustrate the following:

  1. Classical 95% confidence intervals fail to cover the true mean with 95% chance across time
  2. Similarly, Bayesian 95% credible intervals fail to cover the true mean with 95% chance across time
  3. The 95% anytime-valid confidence sequence/intervals will always cover the true mean with at least 95% chance

For this demonstration we use functions from the safestats package and the following colours:

library("safestats")
freqColours <- c("#E31A1CE6", "#FB9A9980")
bayesColours <- c("#B15928E6", "#FFFF9980")
eColours <- c("#1F78B4E6", "#A6CEE380")
alpha <- 0.05

Interval estimation for a normal mean

As a running example we consider the problem of interval estimation of a normal mean with known variance, say, \(\sigma=1\). That is, we want to give an uncertainty estimate of \(\mu\) where

\[\begin{align} X_{i} \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^{2}). \end{align}\]

For all three examples we use the computeConfidenceIntervalZ function from the safestats package, which produces (1) classical confidence intervals if eType="freq", (2) credible intervals if eType="credibleInterval", and (3) an anytime-valid confidence sequence if eType="mom".

To assess the quality of these intervals across time, we fix a true data generating mean, say, muTrue=8, and we set sigmaTrue=1. For simplicity we take n=100 and the following code generates a single data set that we use to highlight the difference between (1) the standard confidence intervals, (2) the Bayesian intervals, and (3) the anytime-valid confidence sequence.

n <- 100
muTrue <- 8
sigmaTrue <- 1
  
set.seed(4)
someDataSingle <- rnorm(n, mean=muTrue,
                        sd=sigmaTrue)

After each observation a confidence interval can be computed, yielding n=100 intervals. If any of these n=100 intervals does not contain muTrue, we say that the sequence of intervals does not cover across time.
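
A small helper along the following lines can check this coverage-across-time criterion for any n-by-2 matrix of (lower, upper) bounds; the function coversAcrossTime is our own convenience sketch, not part of the safestats package.

# TRUE if every interval in an n-by-2 matrix of (lower, upper)
# bounds contains the true mean, i.e. the sequence covers across time
coversAcrossTime <- function(ciMatrix, mu) {
  all(ciMatrix[, 1] <= mu & mu <= ciMatrix[, 2])
}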

1. Classical 95% confidence intervals fail to cover the true mean with 95% chance across time

The classical confidence interval for a normal mean is directly related to the z-test and its p-value, where the z-statistic is defined as

\[\begin{align} Z := \frac{\sqrt{n} (\bar{x} - \mu)}{\sigma} \end{align}\]

A classical 95% confidence interval consists of all those parameter values \(\mu\) that would not lead to a rejection at level alpha=5%. “Inverting” the z-test relies on the sampling distribution

\[\begin{align} \bar{X} \sim \mathcal{N}(\mu, \frac{1}{n}) . \end{align}\]

A 95% confidence interval can be derived from quantiles of a normal distribution with location \(\bar{x}\) and variance \(1/n\).
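
As a quick sanity check, the interval can be computed directly from these quantiles; the following minimal sketch (ours, valid under the known-variance assumption \(\sigma=1\)) should match the output of computeConfidenceIntervalZ with eType="freq":

xbar <- mean(someDataSingle)

# Classical 95% interval: xbar +/- z_{0.975} * sigma/sqrt(n), with sigma=1
c(xbar - qnorm(1 - alpha/2)/sqrt(n),
  xbar + qnorm(1 - alpha/2)/sqrt(n))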

One sample path

To assess the performance of the classical confidence interval on the data generated with muTrue=8, we run the following code:

# Storage for the n lower and upper interval bounds
freqCi <- matrix(nrow=n, ncol=2)

# Running means after 1, 2, ..., n observations
meanVector <- 1/(1:n)*cumsum(someDataSingle)

for (i in 1:n) {
  tempResult <- computeConfidenceIntervalZ(nEff=i,
                                          meanObs=meanVector[i],
                                          eType="freq")
  freqCi[i, ] <- tempResult
}

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(1:n, freqCi[, 1], xlim=c(0, n), ylim=c(7, 9),
     type="l", xlab = "", ylab = "", cex.lab = 1.3, 
     cex.axis = 1.3)

polygon(c(1:n, n:1), c(freqCi[, 2], rev(freqCi[, 1])),
        col=freqColours[2], border=freqColours[1], lwd=2,
        density = NULL, angle = -20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)

The following code shows that between the 21st and 45th observation the true mean is outside the intervals.

# The indices where the lower bound is above muTrue
which(freqCi[, 1] > muTrue)

# No indices where the upper bound is below muTrue
which(freqCi[, 2] < muTrue)

Hence, classical confidence intervals lead to incorrect conclusions regarding muTrue if data collection happened to stop in this range.

The behaviour of classical confidence intervals across time under multiple use

The previous code concerned a single data set, that is, the outcomes of a single experiment; note that this one data set yielded 100 intervals. We now repeat the procedure over 1,000 data sets/experiments, and we will see that far fewer than 950 out of these 1,000 data sets/repeated experiments cover the true mean across time.

The following code stores 1,000 times 100 classical confidence intervals in the variable allFreqCis:

mIter <- 1000

set.seed(1)
allData <- matrix(rnorm(mIter*n, mean=muTrue), nrow=mIter)
allFreqCis <- array(dim=c(mIter, n, 2))


# This indicates whether a simulation yielded an interval that 
# does not cover the true mean
freqCiError <- integer(mIter)

# This indicates the first time an interval does not cover the true mean
firstPassageTime <- rep(Inf, times=mIter)

for (sim in 1:mIter) {
  someData <- allData[sim, ]
  meanVector <- 1/(1:n)*cumsum(someData)
  
  for (i in 1:n) {
    tempResult <- computeConfidenceIntervalZ(
      nEff=i, meanObs=meanVector[i], eType="freq")
    allFreqCis[sim, i, ] <- tempResult
    
    if ((tempResult[1] > muTrue || tempResult[2] < muTrue)
        && freqCiError[sim] != 1) {
      freqCiError[sim] <- 1
      firstPassageTime[sim] <- i
    }
  }
}

To count the number of data sets/experiments that yielded an interval that did not cover the true mean at some observation between 1 and 100, we run the following code:

complementCoverageRate <- numeric(n)

for (i in 1:n) {
  complementCoverageRate[i] <- mean(firstPassageTime <= i)
}

freqCoverageRate <- 1-complementCoverageRate

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(1:n, 100*freqCoverageRate, type="l", lwd=2, xlab="n",
     ylab="Coverage rate (%)", col=freqColours[1], ylim=c(60, 100))
lines(c(1, n), 100*c(1-alpha, 1-alpha), lwd=2, lty=2)

This plot reiterates the point that classical methods (i.e. p-values and confidence intervals) are made for one-shot inference. After the first time a confidence interval is computed, the tolerable 5% error is already “spent”.

The following code can be run to inspect one of the failed intervals:

failedIndices <- which(freqCiError==1)

someIndex <- failedIndices[3]

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(NULL, xlim=c(0, n), ylim=c(7, 9),
     type="l", xlab = "", ylab = "", cex.lab = 1.3, 
     cex.axis = 1.3)

polygon(c(1:n, n:1), c(allFreqCis[someIndex, , 2],
                       rev(allFreqCis[someIndex, , 1])),
        col=freqColours[2], border=freqColours[1], lwd=2,
        density = NULL, angle = -20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)

2. Bayesian 95% credible intervals fail to cover the true mean with 95% chance across time

The (Bayesian) credible interval for a normal mean is easily computed with a normal prior on the mean parameter mu. A normal prior combined with normal data leads to a normal posterior, which is computationally attractive. More precisely, with a normal prior centred at a with variance g, that is,

\[\begin{align} \mu \sim \mathcal{N}(a, g) \end{align}\]

and normal data with variance \(\sigma^{2}=1\) we get the following normal posterior

\[\begin{align} \mu \mid x, n \sim \mathcal{N}\left ( \tfrac{n g}{1 + n g} \bar{x} + \tfrac{1}{1+ n g} a, \frac{g}{1 + n g} \right ) \end{align}\]

For our demonstration we set a=0 and g=1; we discuss sensitivity to the hyperparameter g below. A 95% credible interval can now be derived from quantiles of a normal distribution with the mean and variance shown above.
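
For concreteness, here is a minimal sketch (our illustration) that computes this credible interval directly from the posterior quantiles; it should agree with computeConfidenceIntervalZ with eType="credibleInterval":

someA <- 0
someG <- 1
xbar <- mean(someDataSingle)

# Posterior mean and variance from the formula above
postMean <- (n*someG/(1 + n*someG))*xbar + (1/(1 + n*someG))*someA
postVar <- someG/(1 + n*someG)

# 95% credible interval from the 2.5% and 97.5% posterior quantiles
qnorm(c(alpha/2, 1 - alpha/2), mean=postMean, sd=sqrt(postVar))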

One sample path

To assess the performance of the credible interval on the data with muTrue=8, we run the following code:

someA <- 0
someG <- 1
bayesCi <- matrix(nrow=n, ncol=2)

meanVector <- 1/(1:n)*cumsum(someDataSingle)

for (i in 1:n) {
  tempResult <- computeConfidenceIntervalZ(
    nEff=i, meanObs=meanVector[i], 
    a=someA, g=someG, eType="credibleInterval")
  bayesCi[i, ] <- tempResult
}

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(NULL, xlim=c(0, n), ylim=c(7, 9),
     type="l", xlab = "", ylab = "", cex.lab = 1.3, 
     cex.axis = 1.3)
polygon(c(1:n, n:1), c(bayesCi[, 2], rev(bayesCi[, 1])),
        col=bayesColours[2], border=bayesColours[1], lwd=2,
        density = NULL, angle = 20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)

Similar to the frequentist confidence intervals, the true mean falls outside the credible intervals between the 21st and the 45th observation, but also at the 11th and 17th observations.

# The indices where the lower bound is above muTrue
which(bayesCi[, 1] > muTrue)

# The indices where the upper bound is below muTrue
which(bayesCi[, 2] < muTrue)

The credible intervals thus provide incorrect information regarding the size of the true population mean if data collection happened to stop in this range. Better coverage would be obtained if the prior mean were set equal to the true mean. However, if the true mean were known beforehand, it would not be prior knowledge; in fact, there would be no need to run the experiment at all. In general, if the hyperparameter “a” is set close to the true mean, the coverage will be better, but for any a priori guess “a” there are many true values of mu for which the Bayesian interval has bad coverage. Our point is not to open a discussion on the choice of “a”, but to examine coverage across time: we will see that even if “a” is set exactly equal to muTrue, a 95% credible interval has a coverage rate below 95% across time.

The behaviour of (Bayesian) credible intervals across time under multiple use

The previous code concerned a single data set, that is, the outcomes of a single experiment that yielded 100 intervals. We now repeat the procedure over 1,000 data sets/experiments. The code below stores 1,000 times 100 credible intervals in the variable allBayesCis.

We will see that far fewer than 950 out of these 1,000 data sets/repeated experiments cover the true mean across time. To show that this is not a matter of setting the mean parameter a, we set it equal to the true mean in our demonstration, despite this not being realistic in practice.

mIter <- 1000
someA <- muTrue

set.seed(1)
allData <- matrix(rnorm(mIter*n, mean=muTrue), nrow=mIter)
allBayesCis <- array(dim=c(mIter, n, 2))


# This indicates whether a simulation yielded an interval that 
# does not cover the true mean
bayesCiError <- integer(mIter)

# This indicates the first time an interval does not cover the true mean
firstPassageTime <- rep(Inf, times=mIter)

for (sim in 1:mIter) {
  someData <- allData[sim, ]
  meanVector <- 1/(1:n)*cumsum(someData)
  
  for (i in 1:n) {
    tempResult <- computeConfidenceIntervalZ(
      nEff=i, meanObs=meanVector[i], 
      a=someA, g=someG, 
      eType="credibleInterval")
    allBayesCis[sim, i, ] <- tempResult
    
    if ((tempResult[1] > muTrue || tempResult[2] < muTrue) 
        && bayesCiError[sim] != 1) {
      bayesCiError[sim] <- 1
      firstPassageTime[sim] <- i
    }
  }
}

To count the number of data sets/experiments that yielded an interval that did not cover the true mean at some observation between 1 and 100, we run the following code:

complementCoverageRate <- numeric(n)

for (i in 1:n) {
  complementCoverageRate[i] <- mean(firstPassageTime <= i)
}

bayesCoverageRate <- 1-complementCoverageRate

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(1:n, 100*bayesCoverageRate, type="l", lwd=2, xlab="n",
     ylab="Coverage rate (%)", col=bayesColours[1], ylim=c(60, 100))
lines(c(1, n), 100*c(1-alpha, 1-alpha), lwd=2, lty=2)

This plot shows that the credible intervals centred at a=muTrue yield even worse coverage than the classical confidence intervals.

The following code can be run to inspect one of the failed intervals:

failedIndices <- which(bayesCiError==1)

someIndex <- failedIndices[3]

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(NULL, xlim=c(0, n), ylim=c(7, 9),
     type="l", xlab = "", ylab = "", cex.lab = 1.3, 
     cex.axis = 1.3)

polygon(c(1:n, n:1), c(allBayesCis[someIndex, , 2],
                       rev(allBayesCis[someIndex, , 1])),
        col=bayesColours[2], border=bayesColours[1], lwd=2,
        density = NULL, angle = 20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)

3. The 95% anytime-valid confidence sequence/intervals will always cover the true mean with at least 95% chance

The anytime-valid confidence sequence inverts an e-variable based on a mixture over a so-called non-local moment distribution. In other words, the 95% anytime-valid confidence sequence contains all parameter values mu for which the e-variable is less than 1/alpha = 20. Because the e-variable itself retains the type I error rate guarantee across time, so does the resulting confidence sequence. The design object below provides the mixture parameter used in the interval computations:

designObj <- designSafeZ(meanDiffMin=0.5,
                         testType="twoSample",
                         sigma=sigmaTrue)
designObj
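
To make “inverting the e-variable” concrete, here is a minimal sketch of the same idea for the simpler normal-mixture e-variable (Robbins' Gaussian mixture with a N(0, g) prior over the effect) rather than the non-local moment mixture that the package uses; both functions below are our illustration, not part of safestats. Solving e(mu) < 1/alpha for mu yields a closed-form interval.

# e-variable for H0: mean = mu based on a N(0, g) mixture, sigma=1
# (for illustration only; not the package's moment mixture)
eNormalMixture <- function(xbar, n, mu, g=1) {
  z <- sqrt(n)*(xbar - mu)
  exp(n*g*z^2/(2*(1 + n*g)))/sqrt(1 + n*g)
}

# Inverting e(mu) < 1/alpha gives the interval xbar +/- halfWidth
normalMixtureCS <- function(xbar, n, g=1, alpha=0.05) {
  halfWidth <- sqrt(2*(1 + n*g)*log(sqrt(1 + n*g)/alpha))/(n*sqrt(g))
  c(xbar - halfWidth, xbar + halfWidth)
}

normalMixtureCS(mean(someDataSingle), n=n)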

One sample path

To assess the performance of the anytime-valid confidence sequence on the data with muTrue=8, we run the following code:

anytimeCi <- matrix(nrow=n, ncol=2)

meanVector <- 1/(1:n)*cumsum(someDataSingle)

for (i in 1:n) {
  tempResult <- computeConfidenceIntervalZ(
    nEff=i, parameter=designObj$parameter, 
    meanObs=meanVector[i])
  anytimeCi[i, ] <- tempResult
}

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(NULL, xlim=c(0, n), ylim=c(7, 9),
     type="l", xlab = "", ylab = "", cex.lab = 1.3, 
     cex.axis = 1.3)
polygon(c(1:n, n:1), c(anytimeCi[, 2], rev(anytimeCi[, 1])),
        col=eColours[2], border=eColours[1], lwd=2,
        density = NULL, angle = -20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)

The following code shows that the anytime-valid confidence sequence covers the true mean at each moment in time.

# No indices where the lower bound is above muTrue
which(anytimeCi[, 1] > muTrue)

# No indices where the upper bound is below muTrue
which(anytimeCi[, 2] < muTrue)

The anytime-valid intervals would thus lead to the correct conclusion that muTrue is a viable value for the true mean, regardless of when, why, or even whether data collection is stopped. To compare the three intervals we run the following code.

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(NULL, xlim=c(0, n), ylim=c(7, 9),
     type="l", xlab = "", ylab = "", cex.lab = 1.3, 
     cex.axis = 1.3)

polygon(c(1:n, n:1), c(anytimeCi[, 2], rev(anytimeCi[, 1])),
        col=eColours[2], border=eColours[1], lwd=2,
        density = NULL, angle = -20)
polygon(c(1:n, n:1), c(freqCi[, 2], rev(freqCi[, 1])),
        col=freqColours[2], border=freqColours[1], lwd=2,
        density = 60, angle = -20)
polygon(c(1:n, n:1), c(bayesCi[, 2], rev(bayesCi[, 1])),
        col=bayesColours[2], border=bayesColours[1], lwd=2,
        density = 40, angle = 20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)

For large sample sizes the standard confidence interval and the credible interval become indistinguishable; the difference between them is most prominent at the start.

The behaviour of anytime-valid confidence sequences across time under multiple use

The previous code concerned a single data set, that is, the outcomes of a single experiment that yielded 100 intervals. We now repeat the procedure over 1,000 data sets/experiments. The code below stores 1,000 times 100 anytime-valid confidence intervals in the variable allAnytimeCis.

We will see that more than 950 out of these 1,000 data sets/repeated experiments will cover the true mean across time.

mIter <- 1000

set.seed(1)
allData <- matrix(rnorm(mIter*n, mean=muTrue), nrow=mIter)
allAnytimeCis <- array(dim=c(mIter, n, 2))


# This indicates whether a simulation yielded an interval that 
# does not cover the true mean
anytimeCiError <- integer(mIter)

# This indicates the first time an interval does not cover the true mean
firstPassageTime <- rep(Inf, times=mIter)

for (sim in 1:mIter) {
  someData <- allData[sim, ]
  meanVector <- 1/(1:n)*cumsum(someData)
  
  for (i in 1:n) {
    tempResult <- computeConfidenceIntervalZ(
      nEff=i, meanObs=meanVector[i],
      parameter=designObj$parameter)
    
    allAnytimeCis[sim, i, ] <- tempResult
    
    if ((tempResult[1] > muTrue || tempResult[2] < muTrue) 
        && anytimeCiError[sim] != 1) {
      anytimeCiError[sim] <- 1
      firstPassageTime[sim] <- i
    }
  }
}

To count the number of data sets/experiments that yielded an interval that did not cover the true mean at some observation between 1 and 100, we run the following code:

complementCoverageRate <- numeric(n)

for (i in 1:n) {
  complementCoverageRate[i] <- mean(firstPassageTime <= i)
}

anytimeCiCoverageRate <- 1-complementCoverageRate

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(1:n, 100*anytimeCiCoverageRate, type="l", lwd=2,
     xlab="n", ylab="Coverage rate (%)", col=eColours[1],
     ylim=c(60, 100))
lines(c(1, n), 100*c(1-alpha, 1-alpha), lwd=2, lty=2)

This plot shows that the anytime-valid confidence sequences cover the true mean under repeated use as promised.

The following code can be run to inspect one of the failed intervals:

failedIndices <- which(anytimeCiError==1)

someIndex <- failedIndices[3]

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(NULL, xlim=c(0, n), ylim=c(7, 9),
     type="l", xlab = "", ylab = "", cex.lab = 1.3, 
     cex.axis = 1.3)

polygon(c(1:n, n:1), c(allAnytimeCis[someIndex, , 2],
                       rev(allAnytimeCis[someIndex, , 1])),
        col=eColours[2], border=eColours[1], lwd=2,
        density = NULL, angle = 20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)

The following code can be used to see all coverage rate profiles.

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(1:n, 100*anytimeCiCoverageRate, type="l", lwd=2,
     xlab="n", ylab="Coverage rate (%)", col=eColours[1],
     ylim=c(60, 100))
lines(1:n, 100*bayesCoverageRate, col=bayesColours[1],
      lwd=2)
lines(1:n, 100*freqCoverageRate, col=freqColours[1], lwd=2)
lines(c(1, n), 100*c(1-alpha, 1-alpha), lwd=2, lty=2)

Sensitivity with respect to the parameters g and gMom: Credible intervals versus anytime-valid confidence intervals

We provide some further elaboration on how and why an anytime-valid confidence interval improves over the other intervals. Of importance is that the widths of both the classical confidence intervals and the credible intervals are proportional to \(1/\sqrt{n}\). The width of the anytime-valid confidence interval, on the other hand, is stretched by a logarithmic function of \(n\). To see this we plot the widths of the intervals as a function of n.

someA <- 0

nDomain <- 1:30
anytimeCiWidth <- credibleIntervalWidth <- freqCiWidth <- numeric(length(nDomain))

for (i in nDomain) {
  tempResult <- computeConfidenceIntervalZ(
    nEff=i, meanObs=8,  eType="freq")
  freqCiWidth[i] <- tempResult[2]-tempResult[1]
  
  tempResult <- computeConfidenceIntervalZ(
    nEff=i, meanObs=8, a=someA,
    g=someG, eType="credibleInterval")
  credibleIntervalWidth[i] <- tempResult[2]-tempResult[1]
  
  tempResult <- computeConfidenceIntervalZ(
    nEff=i, meanObs=8, parameter=designObj$parameter)
  anytimeCiWidth[i] <- tempResult[2]-tempResult[1]
}

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();
plot(nDomain, anytimeCiWidth, col=eColours[1], type="l",
     lwd=2, ylim=c(0, 12), xlab="n", ylab="Interval width")
lines(nDomain, credibleIntervalWidth, col=bayesColours[1],
      lwd=2)
lines(nDomain, freqCiWidth, col=freqColours[1], lwd=2,
      lty=2)

Hence, the anytime-valid confidence sequence is wider than the standard and Bayesian intervals. Equivalently, both standard confidence and (Bayesian) credible intervals are too narrow for them to be anytime-valid.
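
For the normal-mixture e-variable sketched earlier (again our illustration, not the package's moment mixture), the inversion is available in closed form and makes this scaling explicit:

\[\begin{align} |\bar{x} - \mu| \leq \sqrt{\frac{2(1 + n g)}{n^{2} g} \log \left ( \frac{\sqrt{1 + n g}}{\alpha} \right )} = O \left ( \sqrt{\frac{\log n}{n}} \right ), \end{align}\]

so the half-width carries an extra \(\sqrt{\log n}\) factor on top of the \(1/\sqrt{n}\) rate of the classical and credible intervals.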

Note that the width of the credible intervals behaves similarly to that of the classical confidence intervals. This correspondence becomes exact if the hyperparameter g of the credible interval is taken to be large; in that case, the credible interval equals the classical confidence interval.
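
As a quick check of this limit (our illustration), the two calls below should return essentially the same interval for a very large g:

computeConfidenceIntervalZ(nEff=10, meanObs=8, eType="freq")
computeConfidenceIntervalZ(nEff=10, meanObs=8, a=0, g=1e6,
                           eType="credibleInterval")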

Another difference to highlight is that credible intervals and anytime-valid confidence sequences respond differently to their hyperparameters g and gMom. For small g the credible interval narrows around “a”; the prior belief that “a” is the true mean then dominates. As noted above, as g grows the credible interval coincides with the classical confidence interval.

For anytime-valid confidence sequences the following holds. For small gMom the interval becomes infinitely wide, reflecting more uncertainty. As gMom increases the width also increases, though only slowly. The following code illustrates the dependence of the interval widths on g and gMom.

someA <- 0

gDomain <- seq(0.01, 100, by=0.01)
anytimeCiWidth <- credibleIntervalWidth <- numeric(length(gDomain))


for (i in seq_along(gDomain)) {
  tempResult <- computeConfidenceIntervalZ(
    nEff=1, meanObs=8, a=someA, g=gDomain[i],
    eType="credibleInterval")
  
  credibleIntervalWidth[i] <- tempResult[2]-tempResult[1]
  
  tempResult <- computeConfidenceIntervalZ(
    nEff=1, meanObs=8, parameter=gDomain[i])
  anytimeCiWidth[i] <- tempResult[2]-tempResult[1]
}

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();
plot(gDomain, anytimeCiWidth, col=eColours[1], type="l",
     lwd=2, ylim=c(0, 16), xlab="g/gMom", ylab="Interval width")
lines(gDomain, credibleIntervalWidth, col=bayesColours[1],
      lwd=2)

The plot gives an indication of how the widths of the two intervals differ over a range of hyperparameter values, with gMom set equal to g. The increase in width of the anytime-valid confidence sequence is hardly noticeable because it grows very slowly (logarithmically), whereas the width of the credible interval stabilises at the width of the classical confidence interval.