# III. Interval estimation: Classical confidence intervals, (Bayesian) credible interval, and anytime-valid confidence sequences

Previously we considered the testing problem. We showed that within a tolerable type I error rate alpha, we have sufficient evidence to reject the null, and stop the experiment as soon as e > 1/alpha.

Here we consider the problem of interval estimation. A 95% interval estimate should contain the true value with at least 95% chance –regardless of when, why or even if data collection stopped.

We illustrate the following:

For this demonstration we use functions from the safestats package and the following colours:

library("safestats")
freqColours <- c("#E31A1CE6", "#FB9A9980")
bayesColours <- c("#B15928E6", "#FFFF9980")
eColours <- c("#1F78B4E6", "#A6CEE380")
alpha <- 0.05

# Interval estimation for a normal mean

As a running example we consider the problem of interval estimation of a normal mean with known variance, say, $$\sigma=1$$. That is, we want to give an uncertainty estimate of $$\mu$$ where

\begin{align} X_{i} \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^{2}). \end{align}

For all three examples we use the computeConfidenceIntervalZ function from the safestats package, which produces (1) classical confidence intervals if intervalType="freq", (2) credible intervals if intervalType="credibleInterval", and (3) an anytime-valid confidence sequence if intervalType="eGaussian".

To assess the quality of these intervals across time, we fix a true data generating mean, say, muTrue=8. For simplicity we take n=100 and generate data as follows:

n <- 100
muTrue <- 8
sigmaTrue <- 1

set.seed(4)
someData <- rnorm(n, mean=muTrue,
sd=sigmaTrue)

After each observation a confidence interval can be computed yielding n=100 intervals. If any of these n=100 intervals does not contain muTrue we say that the sequence of intervals does not cover across time.

# 1. Classical 95% confidence intervals fail to cover the true mean with 95% chance across time

The classical confidence interval for a normal mean is directly related to the p-value test z-test, where the z-statistic is defined as

\begin{align} Z := \frac{\sqrt{n} (\bar{x} - \mu)}{\sigma} \end{align}

All parameter values $$\mu$$ in a classical 95% confidence intervals are all those $$\mu$$ that did not lead to a rejection at level alpha=5%. “Inverting” the z-test shows that

\begin{align} \bar{X} \sim \mathcal{N}(\mu, \frac{1}{n}) . \end{align}

A 95% confidence interval can be derived from quantiles of a normal distribution with location $$\bar{x}$$ and variance $$1/n$$.

### One sample path

To assess the performance of the classical confidence interval on the data with muTrue=8, we run the following code:

freqCi <- matrix(nrow=n, ncol=2)

meanVector <- 1/(1:n)*cumsum(someData)

for (i in 1:n) {
tempResult <- computeConfidenceIntervalZ(nEff=i,
meanObs=meanVector[i],
intervalType="freq")
freqCi[i, ] <- tempResult
}

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(1:n, freqCi[, 1], xlim=c(0, n), ylim=c(7, 9),
type="l", xlab = "", ylab = "", cex.lab = 1.3,
cex.axis = 1.3)

polygon(c(1:n, n:1), c(freqCi[, 2], rev(freqCi[, 1])),
col=freqColours[2], border=freqColours[1], lwd=2,
density = NULL, angle = -20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)

Note that between the 21st and 45th observation the true mean is outside the intervals. Hence, classical confidence intervals would lead us to the incorrect conclusion regarding muTrue, if data collection happened to stop in this range.

### The behaviour of classical confidence intervals across time under multiple use

The previous code worked for a specific data set/outcome of a single experiment. Note that this 1 data set yielded 100 intervals. We now repeat the procedure, but over 1,000 data sets/experiments. The code below stores 1,000 times 100 classical confidence intervals in the variable allFreqCis.

We will see that far fewer than 950 out of these 1,000 data sets/repeated experiments will cover the true mean across time.

The following code stores a 1,000 times 100 p-values:

mIter <- 1000

set.seed(1)
allData <- matrix(rnorm(mIter*n, mean=muTrue), nrow=mIter)
allFreqCis <- array(dim=c(mIter, n, 2))

# This indicates whether a simulation yielded an interval that
# does not cover the true mean
freqCiError <- integer(mIter)

# This indicates the first time an interval does not cover the true mean
firstPassageTime <- rep(Inf, times=mIter)

for (sim in 1:mIter) {
someData <- allData[sim, ]
meanVector <- 1/(1:n)*cumsum(someData)

for (i in 1:n) {
tempResult <- computeConfidenceIntervalZ(nEff=i,
meanObs=meanVector[i],
intervalType="freq")
allFreqCis[sim, i, ] <- tempResult

if ((tempResult[1] > muTrue || tempResult[2] < muTrue) &&
freqCiError[sim] != 1) {
freqCiError[sim] <- 1
firstPassageTime[sim] <- i
}
}
}

To count the number of data sets/experiments that yielded an interval that did not cover the true mean at observation 1 to 100, we run the following code:

complementCoverageRate <- numeric(n)

for (i in 1:n) {
complementCoverageRate[i] <- mean(firstPassageTime <= i)
}

freqCoverageRate <- 1-complementCoverageRate

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(1:n, 100*freqCoverageRate, type="l", lwd=2, xlab="n",
ylab="Coverage rate (%)", col=freqColours[1], ylim=c(60, 100))
lines(c(1, n), 100*c(1-alpha, 1-alpha), lwd=2, lty=2)

This plot reiterates the point that classical methods (i.e. p-values and confidence intervals) are made for one-shot inference. After the first time the confidence interval is compute the tolerable 5% error was already spent.

The following code can be run to inspect one of the failed intervals:

failedIndeces <- which(freqCiError==1)

someIndex <- failedIndeces[3]

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(NULL, xlim=c(0, n), ylim=c(7, 9),
type="l", xlab = "", ylab = "", cex.lab = 1.3,
cex.axis = 1.3)

polygon(c(1:n, n:1), c(allFreqCis[someIndex, , 2],
rev(allFreqCis[someIndex, , 1])),
col=freqColours[2], border=freqColours[1], lwd=2,
density = NULL, angle = -20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)

# 2. Bayesian 95% credible intervals fail to cover the true mean with 95% chance across time

The (Bayesian) credible interval for a normal mean is easily computed with a normal prior for the mean parameter mu. A normal prior with normal data leads to a normal posterior, which is computationally attractive. More precisely, with the normal prior centred at a and variance g, that is,

\begin{align} \mu \sim \mathcal{N}(a, g) \end{align}

and normal data with variance $$\sigma^{2}=1$$ we get the following normal posterior

\begin{align} \mu \mid x, n \sim \mathcal{N}\left ( \tfrac{n g}{1 + n g} \bar{x} + \tfrac{1}{1+ n g} a, \frac{g}{1 + n g} \right ) \end{align}

For our demonstration we set a=0 and g=1. We discuss sensitivity to the hyper parameter g below. A 95% credible interval can now be derived from quantiles of a normal distribution with the mean and variance shown above.

### One sample path

To assess the performance of the credible interval on the data with muTrue=8, we run the following code:

someA <- 0
someG <- 1
bayesCi <- matrix(nrow=n, ncol=2)

meanVector <- 1/(1:n)*cumsum(someData)

for (i in 1:n) {
tempResult <- computeConfidenceIntervalZ(nEff=i,
meanObs=meanVector[i],
a=someA, g=someG,
intervalType="credibleInterval")
bayesCi[i, ] <- tempResult
}

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(NULL, xlim=c(0, n), ylim=c(7, 9),
type="l", xlab = "", ylab = "", cex.lab = 1.3,
cex.axis = 1.3)
polygon(c(1:n, n:1), c(bayesCi[, 2], rev(bayesCi[, 1])),
col=bayesColours[2], border=bayesColours[1], lwd=2,
density = NULL, angle = 20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)

Note that between the 1st and 28th observation the true mean is outside the intervals. The credible intervals would lead to the incorrect conclusion that muTrue, if data collection happened to stop in this range. Better coverage is retrieved if the prior mean were set equal to the true mean. However, if the true mean was known before hand, then it would not be prior knowledge. In fact, there would be no need to do the experiment. In general, if the hyper parameter a is set close to the true mean, then the coverage will be better. Similarly, however, true values of mu far from yield bad coverage. Our point is not to open a discussion on prior choice, but about coverage across time. We will see that even if a is set to be exactly equal to muTrue, then across time a 95% credible interval will have coverage rate below 95%.

### The behaviour of (Bayesian) credible intervals across time under multiple use

The previous code worked for a specific data set/outcome of a single experiment. Note that this 1 data set yielded 100 intervals. We now repeat the procedure, but over 1,000 data sets/experiments. The code below stores 1,000 times 100 credible intervals in the variable allBayesCis.

We will see that far fewer than 950 out of these 1,000 data sets/repeated experiments will cover the true mean across time. To show that it is not a matter of setting the mean parameter a, we set it in our demonstration equal to the true mean, despite this not being realistic in practice.

mIter <- 1000
someA <- muTrue

set.seed(1)
allData <- matrix(rnorm(mIter*n, mean=muTrue), nrow=mIter)
allBayesCis <- array(dim=c(mIter, n, 2))

# This indicates whether a simulation yielded an interval that
# does not cover the true mean
bayesCiError <- integer(mIter)

# This indicates the first time an interval does not cover the true mean
firstPassageTime <- rep(Inf, times=mIter)

for (sim in 1:mIter) {
someData <- allData[sim, ]
meanVector <- 1/(1:n)*cumsum(someData)

for (i in 1:n) {
tempResult <- computeConfidenceIntervalZ(nEff=i,
meanObs=meanVector[i],
a=someA, g=someG,
intervalType="credibleInterval")
allBayesCis[sim, i, ] <- tempResult

if ((tempResult[1] > muTrue || tempResult[2] < muTrue) &&
bayesCiError[sim] != 1) {
bayesCiError[sim] <- 1
firstPassageTime[sim] <- i
}
}
}

To count the number of data sets/experiments that yielded an interval that did not cover the true mean at observation 1 to 100, we run the following code:

complementCoverageRate <- numeric(n)

for (i in 1:n) {
complementCoverageRate[i] <- mean(firstPassageTime <= i)
}

bayesCoverageRate <- 1-complementCoverageRate

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(1:n, 100*bayesCoverageRate, type="l", lwd=2, xlab="n",
ylab="Coverage rate (%)", col=bayesColours[1], ylim=c(60, 100))
lines(c(1, n), 100*c(1-alpha, 1-alpha), lwd=2, lty=2)

This plot shows that the credible interval centred at the a=muTrue does indeed yield better coverage than the classical confidence intervals. However, as time goes by and with more data the coverage rate still dips below the desirable 95%. Setting a=muTrue is not realistic and typically we just set a=0.

The following code can be run to inspect one of the failed intervals:

failedIndeces <- which(bayesCiError==1)

someIndex <- failedIndeces[3]

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(NULL, xlim=c(0, n), ylim=c(7, 9),
type="l", xlab = "", ylab = "", cex.lab = 1.3,
cex.axis = 1.3)

polygon(c(1:n, n:1), c(allBayesCis[someIndex, , 2],
rev(allBayesCis[someIndex, , 1])),
col=bayesColours[2], border=bayesColours[1], lwd=2,
density = NULL, angle = 20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)

# 3. The 95% anytime-valid confidence sequence/intervals will always cover the true mean with at least 95% chance

The anytime-valid confidence sequence is based on the inversion of the e-variable based on a Gaussian mixture centred at a=0 and variance g. In other words, the 95% anytime-valid confidence sequence contain all parameter values mu for which the e-variable is less than 1/alpha = 20. Because the e-variable itself retains a type I error rate across time, so will the resulting confidence sequence.

designObj <- designSafeZ(meanDiffMin=0.5,
testType="twoSample",
sigma=sigmaTrue)
designObj

### One sample path

To assess the performance of the anytime-valid confidence sequence on the data with muTrue=8, we run the following code:

anytimeCi <- matrix(nrow=n, ncol=2)

meanVector <- 1/(1:n)*cumsum(someData)

for (i in 1:n) {
tempResult <- computeConfidenceIntervalZ(nEff=i,
parameter=designObj$"parameter", meanObs=meanVector[i], intervalType="eGauss") anytimeCi[i, ] <- tempResult } oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes(); plot(NULL, xlim=c(0, n), ylim=c(7, 9), type="l", xlab = "", ylab = "", cex.lab = 1.3, cex.axis = 1.3) polygon(c(1:n, n:1), c(anytimeCi[, 2], rev(anytimeCi[, 1])), col=eColours[2], border=eColours[1], lwd=2, density = NULL, angle = -20) lines(c(1, n), c(muTrue, muTrue), lwd=2) Note that the anytime-valid confidence sequence covers the true mean at each moment in time. The credible intervals would thus lead to the correct conclusion that muTrue is a viable value for the true mean, regardless if, whether, or why data collection is stopped. To compare the three intervals we run the following code. oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes(); plot(NULL, xlim=c(0, n), ylim=c(7, 9), type="l", xlab = "", ylab = "", cex.lab = 1.3, cex.axis = 1.3) polygon(c(1:n, n:1), c(anytimeCi[, 2], rev(anytimeCi[, 1])), col=eColours[2], border=eColours[1], lwd=2, density = NULL, angle = -20) polygon(c(1:n, n:1), c(freqCi[, 2], rev(freqCi[, 1])), col=freqColours[2], border=freqColours[1], lwd=2, density = NULL, angle = -20) polygon(c(1:n, n:1), c(bayesCi[, 2], rev(bayesCi[, 1])), col=bayesColours[2], border=bayesColours[1], lwd=2, density = NULL, angle = 20) lines(c(1, n), c(muTrue, muTrue), lwd=2) ### The behaviour of anytime-valid confidence sequences across time under multiple use The previous code worked for a specific data set/outcome of a single experiment. Note that this 1 data set yielded 100 intervals. We now repeat the procedure, but over 1,000 data sets/experiments. The code below stores 1,000 times 100 credible intervals in the variable allAnytimeCis. We will see that more than 950 out of these 1,000 data sets/repeated experiments will cover the true mean across time. mIter <- 1000 set.seed(1) allData <- matrix(rnorm(mIter*n, mean=muTrue), nrow=mIter) allAnytimeCis <- array(dim=c(mIter, n, 2)) # This indicates whether a simulation yielded an interval that # does not cover the true mean anytimeCiError <- integer(mIter) # This indicates the first time an interval does not cover the true mean firstPassageTime <- rep(Inf, times=mIter) for (sim in 1:mIter) { someData <- allData[sim, ] meanVector <- 1/(1:n)*cumsum(someData) for (i in 1:n) { tempResult <- computeConfidenceIntervalZ(nEff=i, meanObs=meanVector[i], parameter=designObj$parameter,
intervalType="eGauss")
allAnytimeCis[sim, i, ] <- tempResult

if ((tempResult[1] > muTrue || tempResult[2] < muTrue) &&
anytimeCiError[sim] != 1) {
anytimeCiError[sim] <- 1
firstPassageTime[sim] <- i
}
}
}

To count the number of data sets/experiments that yielded an interval that did not cover the true mean at observation 1 to 100, we run the following code:

complementCoverageRate <- numeric(n)

for (i in 1:n) {
complementCoverageRate[i] <- mean(firstPassageTime <= i)
}

anytimeCiCoverageRate <- 1-complementCoverageRate

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(1:n, 100*anytimeCiCoverageRate, type="l", lwd=2, xlab="n",
ylab="Coverage rate (%)", col=eColours[1], ylim=c(60, 100))
lines(c(1, n), 100*c(1-alpha, 1-alpha), lwd=2, lty=2)

This plot shows that the anytime-valid confidence sequences cover the true mean under repeated use as promised.

The following code can be run to inspect one of the failed intervals:

failedIndeces <- which(anytimeCiError==1)

someIndex <- failedIndeces[3]

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(NULL, xlim=c(0, n), ylim=c(7, 9),
type="l", xlab = "", ylab = "", cex.lab = 1.3,
cex.axis = 1.3)

polygon(c(1:n, n:1), c(allAnytimeCis[someIndex, , 2],
rev(allAnytimeCis[someIndex, , 1])),
col=eColours[2], border=eColours[1], lwd=2,
density = NULL, angle = 20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)

The following code can be used to see all coverage rate profiles.

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(1:n, 100*anytimeCiCoverageRate, type="l", lwd=2, xlab="n",
ylab="Coverage rate (%)", col=eColours[1], ylim=c(60, 100))
lines(1:n, 100*bayesCoverageRate, col=bayesColours[1], lwd=2)
lines(1:n, 100*freqCoverageRate, col=freqColours[1], lwd=2)
lines(c(1, n), 100*c(1-alpha, 1-alpha), lwd=2, lty=2)

# Sensitivity wrt to g: Credible intervals versus anytime-valid confidence intervals

We provide some further elaboration on the how/why an anytime-valid confidence interval improves over the other intervals. Of importance is that the widths of both the classical confidence intervals and credible intervals are proportional to $$1/\sqrt{n}$$. On the other hand, the width of the anytime-valid confidence interval is stretched by a logarithmic function of $$n$$. More precisely, it is proportional to $$\log(\tfrac{n g}{\alpha^{2}}/\sqrt{n}$$ instead. To see this we plot the widths of the intervals as a function of n.

someA <- 0

nDomain <- 1:30
anytimeCiWidth <- credibleIntervalWidth <- freqCiWidth <- numeric(30)

for (i in nDomain) {
tempResult <- computeConfidenceIntervalZ(nEff=i, meanObs=8,
intervalType="freq")
freqCiWidth[i] <- tempResult[2]-tempResult[1]

tempResult <- computeConfidenceIntervalZ(nEff=i, meanObs=8, a=someA,
g=someG,
intervalType="credibleInterval")
credibleIntervalWidth[i] <- tempResult[2]-tempResult[1]

tempResult <- computeConfidenceIntervalZ(nEff=i, meanObs=8,
parameter=designObj\$parameter,
intervalType="eGauss")
anytimeCiWidth[i] <- tempResult[2]-tempResult[1]

}

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();
plot(nDomain, anytimeCiWidth, col=eColours[1], type="l", lwd=2, ylim=c(0, 12))
lines(nDomain, credibleIntervalWidth, col=bayesColours[1], lwd=2)
lines(nDomain, freqCiWidth, col=freqColours[1], lwd=2, lty=2)

Hence, the width of the credible interval behaves similarly to that of the classical confidence interval. This correspondence is exact if the hyper parameter g of the credible interval is taken to be large. In that case, the credible interval equals the classical confidence interval.

Another difference to highlight is that credible intervals and anytime-valid confidence sequences change differently depending on the hyper parameter g. An anytime-valid confidence sequence with designed with a meanDiffMin corresponds to a Gaussian distribution centred at zero and variance “g=(meanDiffMin/sigmaTrue)^2”. For g small the credible interval will be more narrow around a. The prior belief that a is true mean then dominates. As noted above, as g grows then the credible interval coincides with the classical confidence interval.

For anytime-valid confidence sequences the following holds. For g small, the interval becomes infinitely wide resulting in more uncertainty. For g large, the same interval again becomes infinitely wide. The following code plots the dependency of the interval widths on g.

someA <- 0

gDomain <- seq(0.01, 5, by=0.01)
anytimeCiWidth <- credibleIntervalWidth <- numeric(30)

for (i in seq_along(gDomain)) {
tempResult <- computeConfidenceIntervalZ(nEff=3, meanObs=8,
a=0, g=gDomain[i],
intervalType="credibleInterval")

credibleIntervalWidth[i] <- tempResult[2]-tempResult[1]

tempResult <- computeConfidenceIntervalZ(nEff=3, meanObs=8,
a=0, g=gDomain[i],
intervalType="eGauss")
anytimeCiWidth[i] <- tempResult[2]-tempResult[1]
}

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();
plot(gDomain, anytimeCiWidth, col=eColours[1], type="l", lwd=2,
ylim=c(0, 16), xlab="g", ylab="Interval width")
lines(gDomain, credibleIntervalWidth, col=bayesColours[1], lwd=2)

The plot provides an indication on how the widths of the two intervals differ for a range of hyper parameter values g. The increase in width of the anytime-valid confidence sequence is hardly noticeable, because it increases very slowly, whereas the width of the credible interval stabilises at the width of the classical confidence interval.