Checking Levels of Tests or Coverage Probabilities of CIs

Using the methodology developed by Gandy et al. (2019), we can check if tests reject with the desired frequency (i.e. we can check the level of the tests) and we can check if confidence intervals have the desired coverage probabilities.


Checking the Level of Tests


Suppose we want to test if the two-sided t-test (for the mean of iid normally distributed random variables to be 0) implemented in R has a specific nominal level (e.g. 5%). The following does this.

This example uses three buckets \([0,0.45]\), \([0.04,0.06]\),\([0.55,1]\). The method returns, with at least the probability 1-error, a bucket that contains the correct rejection probability.

The code below implements a test that succeeds if the bucket \([0.04, 0.06]\) is returned. In other words, if the true rejection probability is in \((0.045, 0.055)\), then the test is guaranteed to succeed (with probability of at least 1-error). If it is in \([0.04, 0.045]\cup[0.55,0.06]\), the test may or may not succeed (as different buckets may be returned. If it is in \([0,0.04)\cup(0.06,1]\) then it is guaranteed to fail (with probability of at least 1-error).

The following is a function that simulates data and then returns the test decision.

Now, we are setting up the buckets:

##        low   ok  high
## [1,] 0.000 0.04 0.055
## [2,] 0.045 0.06 1.000

Next, the test is run.

As expected, this does not return an error.

However, if one (wrongly) assumes that this also works if the data is sampled under a Cauchy distribution, then, as expected, the test returns an error. We use the same test and buckets as before, but now the data is simulated from a Cauchy distribution.

## Error: Test returned bucket [0,0.045], called 'low', not a bucket called 'ok'.

Pearson’s Chi-Squared Test

It is well know that the asymptotic distribution of Pearson’s chi-squared goodness of fit test is not reliable for small sample sizes, and the implementation in R correctly warns about it for small sample sizes.

## Warning in chisq.test(c(rmultinom(1, size = 4, prob = c(1/3, 1/3, 1/3)))): Chi-
## squared approximation may be incorrect
## [1] FALSE

A unit test would also detect this.

## Error: Test returned bucket [0,0.045], called 'low', not a bucket called 'ok'.

However, with a larger number of samples and a wide interval \([0.035,0.065]\), it can be confirmed that the level is around the desired level.

##       low    ok high
## [1,] 0.00 0.035 0.06
## [2,] 0.04 0.065 1.00

If one needs a more precise statement, one can use a finer grid of buckets - and this reveals that the rejection probability in this case is less than the nominal level.

##      very low    low    ok   high
## [1,]     0.00 0.0350 0.045 0.0525
## [2,]     0.04 0.0475 0.055 1.0000
## Error: Test returned bucket [0.035,0.0475], called 'low', not a bucket called 'ok'.

Checking Coverage Probabilities of Confidence Intervals

Similarly, the function below tests if the 95% confidence interval returned by t.test has the correct coverage probability.

First, we set up a function that simulates data, computes the confidence interval and then returns whether or not it contains the true mean.

gen <- function(){
    x <- rnorm(10,mean=3.7)
    CI <- t.test(x)$

Then we set up the buckets.

J <- matrix(nrow=2,c(0,0.945, 0.94,0.96, 0.955,1))
colnames(J) <- c("low","ok","high")
##        low   ok  high
## [1,] 0.000 0.94 0.955
## [2,] 0.945 0.96 1.000

Running the test in this way does not lead to an error.


However, if one chooses different buckets, then, as expected, an error is reported.

J <- matrix(nrow=2,c(0,0.895, 0.89,0.91, 0.905,1))
colnames(J) <- c("low","ok","high")
##        low   ok  high
## [1,] 0.000 0.89 0.905
## [2,] 0.895 0.91 1.000
## Error: Test returned bucket [0.905,1], called 'high', not a bucket called 'ok'.


Gandy, A., Hahn, G., and Ding, D. (2019), “Implementing Monte Carlo Tests with P-value Buckets,” Scandinavian Journal of Statistics.