Introducing discretefit

Josh McCormick

October 20, 2021

The package discretefit implements Monte Carlo simulations for goodness-of-fit (GOF) tests for discrete distributions. This includes tests based on the root-mean-square statistic, the Chi-squared statistic, the log-likelihood-ratio (\(G^2\)) statistic, the Freeman-Tukey (Hellinger-distance) statistic, the Kolmogorov-Smirnov statistic, and the Cramer-von Mises statistic.

Simulations are written in C++ (utilizing Rcpp) and are considerably faster than the simulated Chi-squared GOF test in the R stats package.

Usage

The GOF tests in discretefit function on a vector of counts, x, and a vector of probabilities, p. In the below example, x represents a vector of counts for five categories, and p represents a vector of probabilities for each corresponding category. The GOF tests provides p-values for the null hypothesis that x is a random sample of the discrete distribution defined by p. 

library(discretefit)
library(bench)

x <- c(42, 0, 13, 2, 109)
p <- c(0.2, 0.05, 0.1, 0.05, 0.6)

chisq_gof(x, p)
#> 
#>  Simulated Chi-squared goodness-of-fit test
#> 
#> data:  x
#> Chi-squared = 17.082, p-value = 0.0017
rms_gof(x, p)
#> 
#>  Simulated root-mean-square goodness-of-fit test
#> 
#> data:  x
#> RMS = 1.731, p-value = 0.0397
g_gof(x, p)
#> 
#>  Simulated log-likelihood-ratio goodness-of-fit test
#> 
#> data:  x
#> G2 = 27.362, p-value = 9.999e-05
ft_gof(x, p)
#> 
#>  Simulated Freeman-Tukey goodness-of-fit test
#> 
#> data:  x
#> FT = 45.599, p-value = 9.999e-05
ks_gof(x, p)
#> 
#>  Simulated Kolmogorov-Smirnov goodness-of-fit test
#> 
#> data:  x
#> KS = 0.056627, p-value = 0.2385
cvm_gof(x, p)
#> 
#>  Simulated Cramer-von Mises goodness-of-fit test
#> 
#> data:  x
#> W2 = 0.12578, p-value = 0.1833

Root-mean-square statistic

In a surprising number of cases, a simulated GOF test based on the root-mean-square statistic outperforms the Chi-squared test and other tests in the Cressie-Read power divergence family. This has been demonstrated by Perkins, Tygert, and Ward (2011). They provide the following toy example.

Take a discrete distribution with 50 bins (or categories). The probability for the first and second bin is 0.25. The probability for each of the remaining 48 bins is 0.5 / 48 (~0.0104).

Now take the observed counts of 15 for the first bin, 5 for the second bin, and zero for each of the remaining 48 bins. It’s obvious that these observations are very unlikely to occur for a random sample from the above distribution. However, the Chi-squared, \(G^2\), and Freeman-Tukey tests fail to reject the null hypothesis.

x <- c(15, 5, rep(0, 48))
p <- c(0.25, 0.25, rep(1/(2 * 50 -4), 48))

chisq_gof(x, p)
#> 
#>  Simulated Chi-squared goodness-of-fit test
#> 
#> data:  x
#> Chi-squared = 30, p-value = 0.9707
g_gof(x, p)
#> 
#>  Simulated log-likelihood-ratio goodness-of-fit test
#> 
#> data:  x
#> G2 = 32.958, p-value = 0.6641
ft_gof(x, p)
#> 
#>  Simulated Freeman-Tukey goodness-of-fit test
#> 
#> data:  x
#> FT = 50.718, p-value = 0.1334

By contrast, the root-mean-square test convincingly rejects the null hypothesis.

rms_gof(x, p)
#> 
#>  Simulated root-mean-square goodness-of-fit test
#> 
#> data:  x
#> RMS = 5.1042, p-value = 9.999e-05

For additional examples, also see Perkins, Tygert, and Ward (2011) and Ward and Carroll (2014).

Speed

The simulated Chi-squared GOF test in discretefit produces identical p-values to the simulated Chi-squared GOF test in the stats package that is part of base R.

set.seed(499)
chisq_gof(x, p, reps = 2000)$p.value
#> [1] 0.9685157
set.seed(499)
chisq.test(x, p = p, simulate.p.value = TRUE)$p.value
#> [1] 0.9685157

However, because Monte Carlo simulations in discretefit are implemented in C++, chisq_gof is much faster than chisq.test, especially when a large number of simulations are required.

bench::system_time(
  chisq_gof(x, p, reps = 20000)
)
#> process    real 
#>   203ms   201ms

bench::system_time(
  chisq.test(x, p = p, simulate.p.value = TRUE, B = 20000)
)
#> process    real 
#>   1.61s   1.61s

Additionally, the simulated GOF tests in base R is vectorized, so for large vectors attempting a large number of simulations may not be possible because of memory constraints. Since the functions in discretefit are not vectorized, memory use is minimized.

Computation

The cvm_gof function utilizes the formula for the Cramer-von Mises statistic for discrete distributions originally introduced by Choulakian, Lockhart, and Stephens (1994). Let N represent the number of observations, let k represent the number of bins, and let \(p_{i}\) represent the probability of an observation falling in the \(i^{th}\) bin. Furthermore, let \(S_{i}\) correspond to the empirical cumulative distribution function for the observed data and \(T_{i}\) correspond to the cumulative distribution function for the theoretical distribution. Then where \(Z_{i}\) = \(S_{i}\) - \(T_{i}\), the Cramer-von Mises statistic, represented by \(W^2\), is defined as follows:

\[ W^2 = N^{-1} \sum_{i = 1}^{k} Z_{i}^2p_{i} \] For an alternative formula see Lockhart, Spinelli, and Stephens (2007).

All p-values calculated by the functions in discretefit follow the formula for simulated p-values proposed by Dwass (1957). For the below equation, let m represent the number of simulations and let B represent the number of simulations where the test statistic for the simulated data is greater than or equal to the test statistic for the observed data.

\[{p}_{u} = P(B <= b) = \frac{b + 1}{m + 1}\]

This is the equation used to calculate simulated p-values in the chisq.test function in the stats package but some implementations of simulated p-values, for example the dgof package, use the unbiased estimator, \(\frac{B}{m}\). For an explanation of why the biased estimator yields a test of the correct size and the unbiased estimator does not, see Phipson and Smyth (2011).

Alternatives

Several other packages implement GOF tests for discrete distributions.

As noted above, the stats package in base R implements a simulated Chi-squared GOF test.

I’m not aware of an R package that implements a simulated \(G^2\) GOF test but the packages RVAideMemoire and DescTools implement GOF tests that utilize approximations based on the Chi-squared distribution.

The dgof package (Anderson and Emerson 2011) implements simulated Kolmogorov-Smirnov GOF tests and simulated Cramer-von Mises GOF tests . The cvmdisc package also implements a simulated Cramer-von Mises GOF test.

The KSgeneral package (Dimitrova, Kaishev, and Tan, 2020) implements exact Kolmogorov-Smirnov GOF tests and fast, simulated GOF tests utilizing the algorithm introduced by Wood and Altavela (1978) which depends on asymptotic properties.

I’m not aware of another R package that implements a root-mean-square GOF test.

References

Arnold, Taylor B. and John W. Emerson. “Nonparametric goodness-of-fit tests for discrete null distributions.” R Journal. https://doi.org/10.32614/rj-2011-016

Choulakian, V., Richard. A. Lockhart, and Michael A. Stephens. “Cramer–von Mises statistics for discrete distributions.” Canadian Journal of Statistics, 1994. https://doi.org/10.2307/3315828

Dimitrova, Dimitrina S., Vladimir K. Kaishev, and Senren Tan. “Computing the Kolmogorov-Smirnov distribution when the underlying CDF is purely discrete, mixed, or continuous.” Journal of Statistical Software, 2020. https://doi.org/10.18637/jss.v095.i10

Dwass, Meyer. “Modified randomization tests for nonparametric hypotheses.” Annuls of Mathematical Statistics, 1957. https://doi.org/10.1214/aoms/1177707045

Eddelbuettel, Dirk and Romain Francois. “Rcpp: Seamless R and C++ Integration.” Journal of Statistical Software, 2011. https://www.jstatsoft.org/article/view/v040i08

Lockhart, Richard A., John J. Spinelli and Michael A. Stephens. “Cramer–von Mises statistics for discrete distributions with unknown parameters,” Canadian Journal of Statistics, 2007. https://doi.org/10.1002/cjs.5550350111

Perkins, William, Mark Tygert, and Rachel Ward. “Computing the confidence levels for a root-mean-square test of goodness-of-fit.” Applied Mathematics and Computation, 2011. https://doi.org/10.1016/j.amc.2011.03.124

Phipson, Belinda, and Gordon K. Smyth. “Permutation p-values should never be zero: calculating exact p-values when permutations are randomly drawn.” Statistical Applications in Genetics and Molecular Biology, 2010. https://doi.org/10.2202/1544-6115.1585

Ward, Rachel and Raymond J. Carroll. “Testing Hardy–Weinberg equilibrium with a simple root-mean-square statistic.” Biostatistics, 2014. https://doi.org/10.1093/biostatistics/kxt028

Wood, Constance L., and Michele M. Altavela. “Large-Sample Results for Kolmogorov–Smirnov Statistics for Discrete Distributions.” Biometrika, 1978. https://doi.org/10.1093/biomet/65.1.235