For work, I’ve needed to use confidence intervals for the sample proportion of a binomially distributed random variable, in particular for “small samples.” In the interests of concreteness, I am going to motivate this problem with an example.

Suppose we have a weighted coin which, when flipped, comes up heads with probability $p$ and comes up tails with probability $1 - p$. We don’t know the actual value of $p$, only that it’s between $0$ and $1$. To estimate $p$, we start flipping our coin. If the outcome of the $i$th flip is heads, we record $X_i = 1$; if tails, $X_i = 0$. After $n$ flips, we observe the number of heads, which we can write as

$$X = \sum_{i=1}^{n} X_i.$$
Obviously, the value of $X$ is an integer between $0$ and $n$, including the endpoints. There is no reason to suspect that the outcome of one flip affects the outcome of another flip; in other words, the trials are independent. So the probability of any specific sequence of heads and tails depends only on the number of heads $x$. For instance, if $n = 10$ and $x = 4$, then the probability of any particular sequence with four heads and six tails is

$$p^4 (1 - p)^6.$$
There are $\binom{10}{4}$ sequences with four heads, hence the probability that four of the ten flips are heads is

$$P(X = 4) = \binom{10}{4} p^4 (1 - p)^6.$$
In general, the probability that $x$ of the $n$ flips are heads is

$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}, \qquad x = 0, 1, \ldots, n.$$
By now, you might have surmised that $X$ is a binomially distributed random variable with parameters $n$ and $p$. More compactly, $X \sim \mathrm{Bin}(n, p)$.
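In R, the binomial pmf is available as `dbinom`, which makes it easy to check the formula above; the choice $p = 0.5$ below is purely illustrative.

```r
# The pmf written out for n = 10, x = 4, and an illustrative p = 0.5.
choose(10, 4) * 0.5^4 * (1 - 0.5)^6
#> 0.2050781

# R's built-in binomial pmf agrees.
dbinom(4, 10, 0.5)
#> 0.2050781
```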

Returning to our original problem, we estimate the value of $p$ by computing the sample proportion of heads $\hat{p} = X/n$. The sample proportion is itself a random variable. It takes values in the set $\{0, 1/n, 2/n, \ldots, 1\}$ and has probability mass function

$$P\!\left(\hat{p} = \frac{x}{n}\right) = \binom{n}{x} p^x (1 - p)^{n - x}, \qquad x = 0, 1, \ldots, n.$$
Using the mean and variance of the binomial distribution and the linearity of expectation, we see that the mean and variance of $\hat{p}$ are respectively given by

$$E[\hat{p}] = p, \qquad \mathrm{Var}(\hat{p}) = \frac{p(1 - p)}{n}.$$
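These two facts are easy to verify by simulation; the sketch below uses illustrative values $n = 100$ and $p = 0.3$ with 100,000 replications.

```r
# Monte Carlo check of the mean and variance of the sample proportion.
# The values n = 100, p = 0.3, and 100,000 replications are illustrative.
set.seed(1)
n <- 100
p <- 0.3
p_hat <- rbinom(100000, n, p) / n  # 100,000 simulated sample proportions

mean(p_hat)  # close to p = 0.3
var(p_hat)   # close to p * (1 - p) / n = 0.0021
```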
The sample proportion provides us with a point estimate of the coin’s parameter $p$, but a point estimate alone is not terribly helpful if it’s inaccurate. Hence, we seek a measure of the error of the estimator $\hat{p}$. One technique is to construct a random interval based on the observed data that contains the true value with high probability. We call such an interval a *confidence interval* and refer to the probability that the interval contains the true value of the parameter as the *coverage probability*. Typically, we specify a nominal significance level $\alpha$ and then say that the confidence of the interval is $100(1 - \alpha)\%$. A common value is 95%.

Most introductory statistics textbooks (or at least, all the ones I have seen) take the approach that, by the central limit theorem, $\hat{p}$ is approximately normally distributed with mean $p$ and variance $p(1 - p)/n$, where this approximation improves as $n \to \infty$. Not knowing $p$, we also do not know the standard deviation of $\hat{p}$. Instead, we estimate the standard deviation by substituting $\hat{p}$ for $p$ and obtain an approximate pivotal quantity

$$Z = \frac{\hat{p} - p}{\sqrt{\hat{p}(1 - \hat{p})/n}}.$$
If our significance level is $\alpha$, then the idea is the approximation

$$P\!\left(z_{\alpha/2} \leq Z \leq z_{1 - \alpha/2}\right) \approx 1 - \alpha,$$

where $z_{\alpha/2}$ and $z_{1 - \alpha/2}$ are the values of the inverse standard normal cumulative distribution function evaluated at $\alpha/2$ and $1 - \alpha/2$, respectively. Since $z_{\alpha/2} = -z_{1 - \alpha/2}$ by the symmetry of the standard normal distribution, we are led to the Wald interval

$$\hat{p} \pm z_{1 - \alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}.$$

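The interval is a one-liner in R. Here is a minimal sketch; the function name `wald.interval` and the default level are my own choices.

```r
# A sketch of the Wald interval; the function name and default level are mine.
wald.interval <- function(x, n, a = 0.05) {
  p_hat <- x / n
  z <- qnorm(1 - a / 2)                # z_{1 - alpha/2}
  se <- sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
  c(lower = p_hat - z * se, upper = p_hat + z * se)
}

wald.interval(4, 10)  # 95% interval for 4 heads in 10 flips: roughly (0.096, 0.704)
```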
One cannot stress enough the fact that the distribution used to compute the Wald interval relies on an approximation, the appropriateness of which depends on the sample size $n$. The expression “sufficiently large” is qualitative and vague and therefore useless. Quantitative conditions for the appropriateness of the Wald interval include

- $n\hat{p} \geq 5$ and $n(1 - \hat{p}) \geq 5$,
- $n\hat{p}(1 - \hat{p}) \geq 5$,
- $n \geq 30$.
All three require that $n$ is not too small, but the first two also allow $p$ to be close to the endpoints $0$ and $1$ only when $n$ is large. Students seem to accept these prescriptions without question; I certainly did in my AP Statistics class in high school. Some far more learned professor is telling you the rule of thumb. What basis do you have to doubt him or her?

I wrote a little R script to calculate the coverage probability of the Wald interval for a fixed sample size $n$ and a varying population parameter $p$. Below are some graphs to illustrate the poor performance, as measured by coverage probability, of the Wald interval for sample sizes considered sufficiently large by many statistics textbooks.

```r
wald.coverage <- function(n, a, subtitle.str) {
  # Grid of candidate values for the population parameter p.
  p <- seq(0, 1, 1/20000)
  # Two-sided critical value for significance level a.
  z <- abs(qnorm(0.5 * a, 0, 1))
  coverage <- numeric(length(p))

  for (i in 1:length(p)) {
    coverage[i] <- 1
    for (k in 0:n) {
      p_hat <- k / n
      se <- sqrt((p_hat * (1 - p_hat)) / n)
      # Subtract the probability of each outcome k whose Wald interval misses p[i].
      coverage[i] <- coverage[i] -
        ifelse((p_hat + (z * se)) < p[i] | p[i] < (p_hat - (z * se)), 1, 0) * dbinom(k, n, p[i])
    }
  }

  plot(p, coverage, type = "l", main = "Wald Interval Coverage Probability",
       xlab = "p (population parameter)", ylab = "coverage probability",
       ylim = c(0.8, 1), xaxp = c(0, 1, 20), yaxp = c(0.8, 1, 20))
  abline(h = 1 - a)
  title(sub = subtitle.str)
}
```
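As a quick sanity check, the same exact-coverage computation can be done at a single point without the plotting; $n = 50$, $p = 0.05$, and $\alpha = 0.05$ below are illustrative choices.

```r
# Exact coverage of the 95% Wald interval at a single point.
# n = 50 and p = 0.05 are illustrative choices.
n <- 50
p <- 0.05
a <- 0.05
z <- qnorm(1 - a / 2)

k <- 0:n
p_hat <- k / n
se <- sqrt(p_hat * (1 - p_hat) / n)
# Sum the binomial probabilities of the outcomes whose interval covers p.
covers <- (p_hat - z * se <= p) & (p <= p_hat + z * se)
coverage <- sum(dbinom(k[covers], n, p))
coverage  # about 0.92, noticeably below the nominal 0.95
```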

```r
k <- ceiling(50 * (p[i] + z * sqrt((p[i] * (1 - p[i])) / 50)))
coverage[i] <- pbinom(k - 1, 50, p[i])
```

I found your blog on Google and also read some of your early posts. Your site is just wonderful.

I find this material a good treatment of the binomial distribution topic. With the coverage probability plots, are you trying to show that the curve is far from a normal/Gaussian distribution? I am kind of lost on that point. Also, kindly share the R code. Thanks.

Hi Stanley,

My apologies for getting back to you so late; I’ve been very busy. In the plots, I am showing that even for n large by “rule of thumb” standards, the Wald interval has poor coverage probability if the actual parameter p is close to 0 or 1.

I’ve uploaded the R code to the site and added a link in the post.

Best Regards,

Matt