dfba_sim_data

library(DFBA)

1 Introduction

An important aspect of research is planning a forthcoming empirical study. Some key issues that scientists need to consider are: the nature of the population to be studied, the measure or measures that will be collected per sampled unit, the design of the study, and the planned sample size. The first two of those topics are not really statistical; rather, they involve the nature of the research question and the availability of measurement tools in the scientific field. But the last two issues – the research design and the sample size – are statistical topics. Statistical precision and therefore power grow with the sample size. Determining the statistical power of studies has important ethical and practical implications. When the research subjects are animals or patients, and the number tested was more than necessary to answer the research question, then too many subjects might have been unnecessarily exposed to participation risks. But if too few were tested, then the risks undertaken by the subjects might be wasted because there is not enough evidence to answer the research question. Time and financial resources are limited, so a careful researcher plans future studies so that there is a good chance that the research question is answered. In fact, funding agencies usually require scientists to provide a statistical rationale showing that their research design is feasible. Too few samples compromises the likelihood that the research question can be answered, but too many samples sinks more time and financial resources than needed into one experiment.

The DFBA package has three functions that are designed to assist researchers in their planning of a forthcoming study where there are two conditions and where there is a univariate continuous measure for each observation. The dfba_bayes_vs_t_power() function and the dfba_power_curve() function deal with frequentist and Bayesian statistical power. These two functions are discussed together in a separate vignette (see the dfba_power_functions vignette). The third function is the dfba_sim_data() function.

The dfba_sim_data() function creates two data sets from one of nine different probability models. The dfba_sim_data() function has an argument called model for stipulating the model for data generation. The model argument requires a text string from the following list:

• "normal"

• "weibull"

• "cauchy"

• "lognormal"

• "chisquare"

• "logistic"

• "exponential"

• "gumbel"

• "pareto"

The dfba_sim_data() function is called many times in the Monte Carlo sampling that is used by the dfba_bayes_vs_t_power() and dfba_power_curve() functions. The output from the dfba_sim_data() function has the frequentist $$p$$-value from the appropriate $$t$$-test and the posterior probability from the corresponding Bayesian distribution-free test for a difference between the two conditions. If the research design has paired scores for the two variates, then the frequentist $$t$$-test is the paired $$t$$-test, and the Bayesian analysis is the Wilcoxon signed-rank test. If the design has two independent conditions, then the frequentist test is the two-sample $$t$$-test, and the Bayesian analysis is the Mann-Whitney $$U$$ test. Thus, from each call of the dfba_sim_data() function, there are two data sets generated along with the primary results from a frequentist $$t$$-test and the corresponding results from a Bayesian distribution-free statistical assessment. However, power is not defined for a single sample. Power estimates are calculated by either the dfba_bayes_vs_t_power() function or the dfba_power_curve() function from the Monte Carlo simulations that repeatedly call the dfba_sim_data() function.

Besides generating random scores for each of two conditions, the dfba_sim_data() function can also include a block effect – for example, variation among individuals with regard to a variable – specified by the block.max argument. The dfba_sim_data() function generates random scores from a uniform distribution over the interval of $$[0, \texttt{block.max}]$$ where block.max is a non-negative number. The default value for block.max is $$0$$. A score for blocking is considered true variability rather than random measurement error. If an experiment is designed such that each participant is tested in both the control condition and in the experimental condition, then the experimental design is a paired (also known by other names including within-block, repeated-measures, matched-samples, and randomized block) design. For a paired design, the variation of blocks cancels out because the same block effect is added to the scores for each condition, thus, the difference scores per block removes the effect of block variation. This feature of paired-design experiments is a key advantage. For many research topics, though, it is either impractical or impossible to use the same block for testing performance in both conditions. If an experiment is designed such that participants are tested in only one of two conditions, then the experimental design is an independent groups (also known by other names including between-block, completely randomized, and between-participants) design. With an independent-grousp design, block variation does not cancel out, so block variability adds to statistical noise. Block variation increases statistical noise for the independent-groups design, thus affecting statistical power.

Researchers might be wary of using distribution-free methods because of a presumed lack of statistical power. Those concerns are largely unjustified. In fact, in many cases the distribution-free analyses such as the Bayesian Wilcoxon-signed rank test and the Bayesian Mann-Whitney $$U$$ test outperforms the frequentist $$t$$-test (Chechile, 2020). The DFBA power functions provide power estimates across a wide range of data types, so researchers can see for themselves what the relative power may be for a wide variety of data models. Since most researchers cannot be sure that the data in a forthcoming experiment will be from Gaussian distributions, power studies using functions in the DFBA package are prudent experimental design tools. The DFBA functions provide the user with the opportunity to explore Gaussian data as well as eight other probability models to see the relative benefits for the frequentist parametric $$t$$-test versus a distribution-free Bayesian analysis. The dfba_sim_data() function is the essential function that interfaces with the other functions in the DFBA package for doing power analyses (even though the dfba_sim_data() function does not itself compute power estimates).

Since the dfba_sim_data() function is used primarily by the two DFBA power functions, it is not expected that most users will ever need to implement the dfba_sim_data() function directly. Consequently. for those users it may not be necessary to read this vignette further. Such readers can skip to the dfba_power_functions vignette that describes both the dfba_bayes_vs_t_power() function and the dfba_power_curve() function. The rest of the current vignette is primarily reference material. Nine Probability Models for Data Generation is devoted to describing each of the nine probability models. Using the dfba_sim_data() Function describes the arguments for the function. Examples consists of some selective examples for calling the dfba_sim_data() function to get two vectors of random scores along with the Bayesian posterior probability for a difference between the conditions along with frequentist $$p$$-value from a $$t$$-test.

2 Nine Probability Models for Data Generation

2.1 Normal Distribution

The normal or Gaussian distribution is the familiar symmetric probability-density function that is the basis of parametric statistics. The normal distribution has two population parameters: the mean and the standard deviation. Random values from any normal distribution can be sampled using the stats package via the command rnorm(n, mean = 0, sd = 1). For example, rnorm(100, 5, 2.2) produces $$100$$ random values from a normal distribution with the mean of $$5$$ and the standard deviation of $$2.2$$. The normal distribution is important when all values of a continuous variate are composed of some true score plus measurement error, where the measurement errors are composed of the sum of latent independent random perturbations. The normal has the density function:

$$$f(x)=\frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{(x-\mu)}{\sigma}\right)^2}, \tag{2.1}$$$

where $$\mu$$ is the mean and $$\sigma$$ is the standard deviation. The support for $$x$$ is the real line (i.e., $$x \in (-\infty,\,\infty)$$). The normal does not have a closed form function for the cumulative probability $$F(x)=P(y \le x)$$; however, estimates are easily obtained with the pnorm(x, mean = 0, sd = 1) command.

In the dfba_sim_data() function with the argument model = "normal", the control condition variate has a normal distribution with a mean equal to $$0$$ and a standard deviation equal to the shape1 argument for the function; in the experimental condition, the variate has a normal distribution with a mean equal to the delta argument with a standard deviation equal to the shape2 input for the dfba_sim_data() function.

2.2 Weibull Distribution

The Weibull distribution plays an important role for variates where the underlying process depends on a maximum or a minimum of a latent factor. For example, a system with many components, which are all required for proper functioning, fails when the weakest component breaks down (i.e., the component with the minimum lifetime). The Weibull probability density function is:

$$$f(x)=\frac{k}{s}\left(\frac{x}{s}\right)^{k-1}e^{-(\frac{x}{s})^{k}}, \tag{2.2}$$$

where $$s$$ is a scale factor and $$k$$ is a shape parameter and where $$x \ge 0$$. Both the $$k$$ and $$s$$ parameters must be positive. When the shape parameter $$k$$ equals $$1$$, the Weibull distribution is equivalent to the exponential distribution.

From the stats package, $$n$$ random values from a Weibull distribution can be sampled with the command rweibull(n, shape, scale=1). Consequently, the command rweibull(30, .7) randomly samples $$30$$ scores from a Weibull distribution with a shape parameter of $$.7$$ and with a scale factor of $$1$$. In the dfba_sim_data() function, the variate for the first (control) condition C has a Weibull distribution with a scale factor $$s=1$$ and a shape parameter $$k$$ equal to the value of the input for the function. The variate for the second (experimental) condition E has an offset equal to the delta input for the function plus a value from a Weibull distribution with a scale factor $$s=1$$ and a shape parameter $$k$$ equal to the shape2 input. The value for the offset is added to the Weibull component of the scores in the experimental condition. The offset parameter and both shape parameter inputs must be positive values.

2.3 Cauchy Distribution

The Cauchy distribution occurs when there is a latent process that is the ratio of two independent normal distributions each with a mean of $$0$$ (Johnson, Kotz, & Balakrishnan, 1995). The density function distribution is similar in shape to – but has heavier tails than – a normal distribution. This distribution is unusual because it does not have a defined expected value or variance. The density function is

$$$f(x) = \frac{1}{\pi s\left[1+\left(\frac{x-l}{s}\right)^{2}\right]}, \tag{2.3}$$$

where $$s$$ is the scale factor and $$l$$1 is the location parameter (note: when $$s=1$$ and $$l=0$$, the distribution is equivalent to a $$t$$-distribution with 1 degree of freedom). The support for the Cauchy distribution is $$x \in (-\infty,\infty)$$, and the scale factor must be positive.

The Cauchy is one of the distributions included in the stats package: for example, the command rcauchy(50, location=0, scale=1) generates $$50$$ random scores from a Cauchy distribution where the location parameter $$l$$ equals $$0$$ and where the scale factor $$s$$ is $$1$$. In the dfba_sim_data() function, the first (control) condition variate has a Cauchy distribution with location $$l$$ equal to $$0$$ and scale factor $$s$$ equal to the value of the shape1 argument for the function; in the second (experimental) condition, the variate has a Cauchy distribution with location $$l$$ equal to the value of the delta input and with a scale factor $$s$$ equal to the shape2 argument for the dfba_sim_data() function. The dfba_sim_data() function thus enables power studies where the variates are Cauchy-distributed and where the location parameters are separated by the amount stipulated with the delta argument and where the scale factor for the experimental condition can be varied to be different from the scale factor for the control condition.

2.4 Lognormal Distribution

The lognormal distribution is a continuous density function that arises when the variate is the multiplicative product of a number of latent independent positive components (Johnson, Kotz, & Balakrishnan, 1995). The probability density is

$$$f(x) = \frac{1}{x \,\sigma \sqrt{2\pi}}e^{\left(-\frac{1}{2}\frac{(\log x- \mu)^{2}}{\sigma^{2}}\right)}, \tag{2.4}$$$

where $$\mu$$ and $$\sigma$$ are the respective mean and standard deviation on the log scale. The lognormal has support for $$x$$ on the positive real numbers.

The lognormal is one of the distributions included in the stats package, for example, the command rlnorm(40, meanlog = 0, sdlog = 2) generates $$40$$ random scores from a lognormal distribution that has the mean and standard deviation on the log scale of, respectively, $$0$$ and $$2$$. In the dfba_sim_data() function, the first (control) condition variate C has a lognormal distribution with a mean and standard deviation on the log scale of respectively $$0$$ and the shape1 argument; the second (experimental) condition variate E has a lognormal distribution with a mean and standard deviation on the log scale of the respective the delta and shape2 arguments.

2.5$$\chi^2$$ Distribution

The $$\chi^2$$ distribution is a continuous density function for a variate that is the sum of squares of independent latent normal variables (Johnson, Kotz, & Balakrishnan, 1995). The probability density is

$$$f(x) = \frac{1}{2^{k/2}\Gamma(k/2)} x^{k/2 -1}e^{ -\frac{x}{2}}, \tag{2.5}$$$

where $$k$$ is the degrees-of-freedom $$(df)$$ parameter and where $$\Gamma$$ is the gamma function. The $$k$$ parameter must be non-negative, but it need not be an integer. The support for $$x$$ is on the non-negative real numbers.

The $$\chi^2$$ distribution is included in the stats package, for example: the command rchisq(35, df = 5) generates $$35$$ scores from a $$\chi^2$$ distribution with $$k = 5$$. For the dfba_sim_data() function, the first (control) condition variable C has a $$\chi^2$$ distribution where the degrees of freedom are equal to the shape1 argument; the second (experimental) condition variate E has a positive offset equal to the delta argument plus a $$\chi^2$$ component with degrees of freedom equal to the shape2 argument.

2.6 Logistic Distribution

The logistic distribution is often used as an approximation to a normal, although the logistic has heavier tails than that of the normal (Johnson, Kotz, & Balakrishnan, 1995). Unlike the normal, the logistic has a closed-form equation for the cumulative probability. The probability density function $$f(x)$$ and the cumulative probability $$F(x)$$ are

\begin{aligned} f(x) &= \frac{e^{-(x-\mu)/s}}{s(1+ e^{-(x-\mu)/s})^{2}},\\ F(x) &= \frac{1}{1+e^{-(x-\mu)/s}}, \end{aligned} \tag{2.6}

where $$\mu$$ and $$s$$ are, respectively, the mean and scale factor of the logistic distribution. The support for $$x$$ is the whole real line. The logistic distribution is a good approximation of a normal with mean $$\mu$$ and standard deviation $$\sigma$$ when the mean of the logistic is $$\mu$$ and its scale factor $$s=\frac{\sqrt{3}\sigma}{\pi}$$.

The logistic distribution is included in the stats package, for example: the command rlogis(50, location = 0, scale = 3) generates $$50$$ random scores from a logistic distribution with a mean of $$0$$ and a scale factor of $$3$$. In the dfba_sim_data() function, the first (control) variate C is sampled from a logistic distribution with a mean of $$0$$ and a scale factor equal to the shape1 argument; the second (experimental) variate E is sampled from a logistic distribution with a mean and scale factor that are, respectively, the values of the delta and shape2 arguments.

2.7 Exponential Distribution

The exponential distribution is a special continuous density function that has the property of constant hazard (Chechile, 2003), that is: the probability for a critical event occurring given that it did not occur earlier (the hazard) is a constant. Because of this characteristic, the exponential is described as a memoryless distribution. The probability density function is:

$$$f(x) = k e^{-k x}, \tag{2.7}$$$

where $$k$$ is a positive constant that is equal to the hazard rate. The support for the distribution for $$x \in [0,~\infty)$$.

The exponential distribution is included in the stats package, for example: command rexp(60, rate = 5) generates $$50$$ random scores from an exponential distribution with a rate parameter equal to $$5$$. In the dfba_sim_data() function, the first (control) variate C is sampled from an exponential distribution with a rate parameter equal to the shape1 argument; the second (experimental) variate E is sampled from an exponential distribution with a rate parameter equal to the shape2 argument plus an added offset value that is equal to the delta argument.

2.8 Gumbel Distribution

The Gumbel distribution, like the Weibull distribution, is a probability model of a system where the process is controlled by the maximum or the minimum of latent factors (Gumbel, 1958). The probability density $$f(x)$$ and the cumulative probability $$F(x)$$ are

\begin{aligned} f(x) &= \frac{1}{s} e^{-\left[(x-\mu)/s\right]+e^{-(x-\mu)/s}}\,,\\ F(x) &= e^{-e^{-(x-\mu)/s}} , \end{aligned} \tag{2.8}

where $$\mu$$ is the mode and $$s$$ is a scale factor. The support for the distribution is for $$x \in -(\infty,~\infty)$$. The scale factor must be a positive value.

The Gumbel distribution is not included in the stats package. However, since the Gumbel distribution has a closed form cumulative probability, random samples from any Gumbel can be obtained via the inverse transform method (Fishman, 1996). For the dfba_sim_data() function, the first (control) variate C is a Gumbel with a mode $$\mu$$ equal to $$0$$ and scale factor $$s$$ equal to the shape1 argument; the second (experimental) variate E is a Gumbel with a mode and scale factor that are, respectively, the values of the delta and the shape2 arguments.

2.9 Pareto Distribution

The Pareto distribution arose as a probability model of incomes in economics (Pareto, 1897). Harris (1968) observed that a Pareto distribution can model cases of a mixture of exponentially distributed variates where the component rate parameters $$k^{-1}$$ have a gamma distribution. The cumulative function for a Pareto distribution is

$$$F(x) = 1-\left(\frac{x_m}{x}\right)^{\alpha}, \tag{2.9}$$$

where $$x\ge x_m$$ and where $$x_m$$ is the mode (Arnold, 1983). Pareto observed that the value of $$\alpha$$ is typically near the value of $$1.5$$. However, when $$\alpha=1.161$$, the distribution represents a 80-20 law, which stipulates that 20 percent of people receive 80 percent of the income (Hardy, 2010).

Although the Pareto distribution is not included in the stats package, random values can be easily obtained by the inverse transform method of Monte Carlo sampling (Fishman, 1996). In the dfba_sim_data() function, the first (control) variate C is sampled from a Pareto distribution with $$x_m=1$$; the second (experimental) variate E is sampled from a Pareto with $$x_m$$ equal to the value of the delta argument plus $$1$$. The $$\alpha$$ parameters for the control and experimental conditions are 1.16 times the respective shape1 and shape2 arguments. Since the default value for both the shape1 and shape2 arguments is $$1$$, the default condition results in random data samples that satisfy the 80-20 law.

3 Using the dfba_sim_data() Function

There are three required arguments for the dfba_sim_data() function: model, design and delta. The model argument is a character string from the following list:

• normal

• weibull

• cauchy

• lognormal

• chisquare

• logistic

• exponential

• gumbel

• pareto

The design argument must be either the character string paired or the character string independent. The delta argument must be a non-negative value for the separation between the first (experimental) variable and the second (control) variable (corresponding respectively to the values E and C).

The dfba_sim_data() function also has six optional arguments; listed with their corresponding default values, they are: n = 20, a0 = 1, b0 = 1, shape1 = 1, shape2 = 1, and block.max = 0. The value of the n argument must be an integer greater than or equal to $$20$$. The reason for the constraint on n is to assure that the Bayesian posterior distribution has sufficient sample size to use the large-$$n$$ approximation method for either the Bayesian Mann-Whitney analysis or the Bayesian Wilcoxon signed-rank analysis. The a0 and b0 arguments represent the shape parameters for the prior beta distribution used for the large-$$n$$ approximation for the two Bayesian methods; the default value of $$1$$ for each represents a uniform prior. The shape1 and shape2 arguments are associated with, respectively, the C and E variates. These arguments refer to different distribution shape parameters depending on the model input, for example: given the argument model = "normal", shape1 and shape2 define the values of the standard deviations for the normal distributions from which the E and the C variates, respectively, are sampled. (see above for more details). The last optional argument is block.max. As described above, the dfba_sim_data() function has the feature of enabling two other DFBA functions – dfba_bayes_vs_t_power() and dfba_power_curve() – to show the effect of block variation on power. In the default value block.max = 0, there is no blocking effect. As the value of the block.max argument increases, there can be reduced power for studies that have independent groups. But as mentioned previously, the dfba_sim_data() function itself does not compute power. Variation of the block.max argument from the default value is an option best employed by way of either the dfba_bayes_vs_t_power() or the dfba_power_curve() function rather than with the dfba_sim_data() function.

4 Examples

As an example of dfba_sim_data() function consider the following commands:

set.seed(1)
example_A1 <- dfba_sim_data(n = 80,
model = "normal",
design = "paired",
delta = 0.4,
shape2 = 2)

example_A1
#> Frequentist p-value
#>  0.3475369
#> Bayesian posterior probability
#>  0.5473132
example_A1$E #> [1] -0.73733747 0.12964277 2.75617399 -2.64713360 1.58789238 1.06590074 #> [7] 2.52619967 -0.20836785 1.14003762 0.93419758 -0.68504006 2.81573561 #> [13] 2.72080523 1.80042730 3.57366691 1.51697285 -2.15318442 -0.74653083 #> [19] -2.04922523 -0.54680127 -0.84073335 0.48423175 -1.42184330 0.71605754 #> [25] -0.90916929 3.93457454 1.83341495 2.22034846 1.16837072 3.76435216 #> [31] -0.87147291 -0.52328946 3.26456448 -0.90139271 -0.01476149 -0.38561586 #> [37] -0.23998574 -0.15822661 1.38837666 0.04533904 -0.61191492 3.08607765 #> [43] -0.02915882 0.04088694 0.19961852 1.82533261 0.25287119 0.32473166 #> [49] -0.96332096 -0.24854054 0.52032088 -0.77778897 1.46299239 -2.63678816 #> [55] 1.01311572 -2.67289965 -0.20195225 -0.65655981 -0.90418956 0.28620644 #> [61] -3.42871885 2.75316662 -2.92994487 -0.52706080 -1.83184021 -1.10163800 #> [67] 4.57433309 0.43479124 -2.17260106 -2.88121107 1.30037420 0.36288033 #> [73] -0.23613675 -1.45872429 -2.57492062 -1.75038459 2.40005761 -0.84253339 #> [79] -2.36885369 4.13858124 example_A1$C
#>  [1] -0.626453811  0.183643324 -0.835628612  1.595280802  0.329507772
#>  [6] -0.820468384  0.487429052  0.738324705  0.575781352 -0.305388387
#> [11]  1.511781168  0.389843236 -0.621240581 -2.214699887  1.124930918
#> [16] -0.044933609 -0.016190263  0.943836211  0.821221195  0.593901321
#> [21]  0.918977372  0.782136301  0.074564983 -1.989351696  0.619825748
#> [26] -0.056128740 -0.155795507 -1.470752384 -0.478150055  0.417941560
#> [31]  1.358679552 -0.102787727  0.387671612 -0.053805041 -1.377059557
#> [36] -0.414994563 -0.394289954 -0.059313397  1.100025372  0.763175748
#> [41] -0.164523596 -0.253361680  0.696963375  0.556663199 -0.688755695
#> [46] -0.707495157  0.364581962  0.768532925 -0.112346212  0.881107726
#> [51]  0.398105880 -0.612026393  0.341119691 -1.129363096  1.433023702
#> [56]  1.980399899 -0.367221476 -1.044134626  0.569719627 -0.135054604
#> [61]  2.401617761 -0.039240003  0.689739362  0.028002159 -0.743273209
#> [66]  0.188792300 -1.804958629  1.465554862  0.153253338  2.172611670
#> [71]  0.475509529 -0.709946431  0.610726353 -0.934097632 -1.253633400
#> [76]  0.291446236 -0.443291873  0.001105352  0.074341324 -0.589520946
plot(example_A1)

Note that the above commands generate $$80$$ random samples from the standard normal for the C variate and $$80$$ random samples for the E variate from a normal distribution with a mean equal to $$0.4$$ and a standard deviation of $$2$$. Repeating the above commands with a different seed draws a second set of $$80$$ scores for each condition:

set.seed(2)

example_A2 <- dfba_sim_data(n = 80,
model = "normal",
design = "paired",
delta = 0.4,
shape2 = 2)

example_A2$E #> [1] 2.39196918 -2.99152981 -0.66674429 -2.34453890 -4.01583956 4.04424504 #> [7] -0.90678682 -0.16936244 -0.37389921 1.17338995 3.60078170 3.76230991 #> [13] -1.96721278 -2.31691451 -2.62534159 -2.10620980 4.31871415 0.41529174 #> [19] -1.28523040 -0.80232021 2.54891881 0.92119567 -0.22854396 -1.09926019 #> [25] -1.32439666 4.49608061 2.27984016 4.41737423 -0.44274714 -0.30166885 #> [31] -1.65476120 -0.10103825 1.34371893 3.11787964 1.52833721 1.31196018 #> [37] 2.86190733 2.69427370 0.61319608 -1.16663331 2.88239965 0.67771684 #> [43] 3.82126318 -0.46128195 -1.68845916 1.47515905 -0.93917197 1.67761122 #> [49] -3.04797967 -3.08486016 1.77960835 1.06192635 2.14213542 -3.63249116 #> [55] 2.82515821 2.80098940 2.46413665 1.97282051 4.62014703 -2.50761969 #> [61] -0.76620770 1.21944797 -1.21396327 0.57110088 1.89248634 -0.90734612 #> [67] 1.71421197 1.49981847 -1.21345872 -1.59475943 2.35178128 0.06115364 #> [73] 1.84438356 -1.28883721 2.95458737 -2.28622110 1.93068134 1.32840514 #> [79] 0.93598656 1.73504537 example_A2$C
#>  [1] -0.896914547  0.184849185  1.587845331 -1.130375674 -0.080251757
#>  [6]  0.132420284  0.707954729 -0.239698024  1.984473937 -0.138787012
#> [11]  0.417650751  0.981752777 -0.392695356 -1.039668977  1.782228960
#> [16] -2.311069085  0.878604581  0.035806718  1.012828692  0.432265155
#> [21]  2.090819205 -1.199925820  1.589638200  1.954651642  0.004937777
#> [26] -2.451706388  0.477237303 -0.596558169  0.792203270  0.289636710
#> [31]  0.738938604  0.318960401  1.076164354 -0.284157720 -0.776675274
#> [36] -0.595660499 -1.725979779 -0.902584480 -0.559061915 -0.246512567
#> [41] -0.383586228 -1.959103175 -0.841705060  1.903547467  0.622493930
#> [46]  1.990920436 -0.305483725 -0.090844235 -0.184161452 -1.198767765
#> [51] -0.838287148  2.066301356 -0.562247053  1.275715512 -1.047572627
#> [56] -1.965878241 -0.322971094  0.935862527  1.139229803  1.671618767
#> [61] -1.788242207  2.031242519 -0.703144333  0.158164763  0.506234797
#> [66] -0.819995106 -1.998846995 -0.479292591  0.084179904 -0.895486611
#> [71] -0.921275666  0.330449503 -0.141660809  0.434847762 -0.053722626
#> [76] -0.907110376  1.303512232  0.771789776  1.052525595 -1.410038341

As an example with a distribution very different than the normal, consider the following commands:

set.seed(1)

example_B1 <- dfba_sim_data(n = 100,
model = "cauchy",
design = "paired",
delta = 0.5)

example_B1
#> Frequentist p-value
#>  0.9103605
#> Bayesian posterior probability
#>  0.2911831
example_B1$E #> [1] -1.39263889 2.51232669 1.63614785 0.47701225 -1.74300312 #> [6] 1.29195003 0.93039956 15.02375500 0.25684386 -2.61894118 #> [11] 0.42499787 -0.62148240 2.56959890 5.07309228 1.00246836 #> [16] 0.54110755 -0.74366507 0.83601581 6.36947328 -1.62335831 #> [21] 0.47435465 72.73310490 20.82227994 1.10608966 -0.47015879 #> [26] 7.35574602 -27.98571014 1.26340017 1.37415416 -2.72486647 #> [31] -3.67267954 0.74694880 0.61212033 -1.57755063 0.27190120 #> [36] -2.64162132 -4.66276783 -11.70237908 0.45314101 -41.14565941 #> [41] -1.04565497 -2.52772284 1.43239638 1.55267080 -0.63924892 #> [46] 7.16152714 1.11334873 -0.52096358 0.84233315 0.04676054 #> [51] -2.15537644 -5.00881091 2.17617954 7.24239304 -721.33547226 #> [56] 1.13844383 -10.21155879 0.74099516 1.69158751 1.28935329 #> [61] 1.74652167 0.15795373 6.36401818 -0.32735191 0.10631988 #> [66] 4.07253422 0.70318925 2.25944225 -0.68038392 2.28714708 #> [71] -1.80263989 -0.04724313 0.01450588 3.31526997 3.03720625 #> [76] 0.15918611 -1.55240928 -0.55768640 -2.41170041 0.18575031 #> [81] 1.82086740 1.18540206 0.12733169 -94.81354023 0.09333932 #> [86] 1.17590232 -0.45034045 -0.67458889 0.32134102 -6.13067393 #> [91] -0.77471045 3.24791285 0.82795500 0.26755719 1.73409762 #> [96] -2.91903140 0.86130239 -0.04768216 2.05368834 -0.31229431 example_B1$C
#>   [1]   1.10251990   2.35383060  -4.29262624  -0.29664256   0.73464686
#>   [6]  -0.33052199  -0.17557937  -1.80824308  -2.32862439   0.19658244
#>  [11]   0.75561999   0.61954832  -1.50147326   2.62405381  -0.88250440
#>  [16] 138.34760311  -1.22737450  -0.02543323   2.52652746  -0.84088155
#>  [21]  -0.20805599   0.78651704  -1.93735849   0.41625813   1.11450727
#>  [26]   2.67469787   0.04209180   2.58214078  -0.43389224   1.82372681
#>  [31]  17.74417330  -3.09202775  49.27718408   0.66236678  -0.60259133
#>  [36]  -1.70964962  -0.75456222   0.35274134  -1.18049671   3.49417974
#>  [41]  -0.63045697  -2.00824906  -0.81186981  -5.94609241 -10.67930318
#>  [46]  -0.77892393   0.07342868  13.95554098  -1.11779219  -1.44463156
#>  [51]  14.19927731  -0.46593149   5.07709560   0.96783322   0.22576743
#>  [56]   0.32306654   1.53568997 -17.06244971  -1.79215906   3.31831965
#>  [61]  -0.28075465   1.31977477   7.73320790   1.72031518  -1.94941320
#>  [66]   1.05168352  14.81386093  -0.90243119   0.27102773  -0.41303137
#>  [71]   1.80651036  -0.55204644   1.91306982   1.73761346  13.43512638
#>  [76]  -0.35223994  -0.45401782   2.77732059  -0.84154958  -0.12435738
#>  [81]   4.80293716  -1.26837364   3.07749829   1.63591139  -0.95643295
#>  [86]   0.73954537  -1.27985783   0.40208959   0.97204780   0.48330667
#>  [91]   0.93687401   0.18729283  -2.08605029  -0.40954977  -0.83303134
#>  [96]  -0.73954011   7.07006090   3.44541936  -0.67561006  -2.92275964
plot(example_B1)

As with the previous example, repeating these commands with a different starting seed results in two different random scores for paired Cauchy variates.

set.seed(2)

example_B2 <- dfba_sim_data(n = 100,
model = "cauchy",
design = "paired",
delta = 0.5)

example_B2$E #> [1] 1.231039e+00 4.822881e+00 4.389775e-01 -9.597813e-02 1.764190e+00 #> [6] -2.717543e+00 1.715058e-01 7.278421e+00 9.993489e-01 9.278128e-01 #> [11] 5.776155e-01 -5.899310e-01 2.879434e+00 -3.701513e+00 -1.113842e-01 #> [16] -1.627588e-01 7.728193e-02 8.619926e-01 3.502999e-01 -4.040561e+00 #> [21] 6.163465e-01 1.470841e+00 4.335672e-01 1.247760e-01 1.444890e+00 #> [26] -4.556848e-01 -4.499709e+00 1.923836e+00 -9.357222e-01 2.265352e+00 #> [31] 1.256286e+00 2.408186e-01 5.717904e-01 3.856534e-01 2.031410e+00 #> [36] -1.245426e+00 -8.954369e+00 -1.443718e-01 1.158063e+00 3.561885e+00 #> [41] 1.127824e+00 1.751699e+00 -1.821464e+00 1.877294e+00 6.092095e+00 #> [46] -6.328349e-01 -1.213392e+00 1.988059e+00 1.533664e+01 1.801053e+00 #> [51] 1.144243e+00 2.638391e+00 1.882284e-01 3.392492e+00 -3.279043e-01 #> [56] 1.741467e+00 5.111861e-03 1.100947e+00 7.542938e-01 1.806607e+00 #> [61] -4.823711e-02 3.263398e-01 6.422095e-01 -4.486686e-01 1.848459e+00 #> [66] -1.446642e+00 7.735345e-01 4.228188e-01 5.428301e-01 -7.648460e+00 #> [71] 3.920843e-01 8.286522e-01 1.543346e+00 1.581401e-01 3.222311e+00 #> [76] -2.543375e-01 2.453450e+00 9.657548e-01 -1.454877e+00 -6.687467e+00 #> [81] 3.262620e-01 -3.861250e+00 3.532966e-01 -3.259340e-01 8.897068e-01 #> [86] -6.163003e-02 7.808741e-01 -5.088356e-01 7.076853e-01 7.490537e-01 #> [91] 8.426706e-01 3.812900e+00 4.211861e-01 1.069697e+00 -1.038526e+02 #> [96] 1.234029e+00 1.225208e+00 1.126431e+00 1.662443e+00 9.350540e-01 example_B2$C
#>   [1]    0.65634786   -1.35501351   -4.26394344    0.58316746   -0.17828774
#>   [6]   -0.17946910    0.42960529   -0.57686664    9.91942932   -6.31583937
#>  [11]   -5.98774837    0.93254957   -0.93603240    0.63823928    3.26083375
#>  [16]   -0.49556224   -0.07428251    0.85857558    5.70953592    0.24001040
#>  [21]   -1.79356668    2.71191838   -0.56255042    0.51151135    1.92171947
#>  [26]   28.34100330    0.50654872    2.07518260   -0.11789894    0.44161475
#>  [31]    0.03272988    0.56890129   -0.67871723   -0.43699346  -22.27289880
#>  [36]   -2.36786977   -0.53177166    1.24716447   -1.72504774    0.51138575
#>  [41]   -0.05746675    1.34954628    0.37817039    0.56292291   -0.17763152
#>  [46]   -0.75149043   -0.07968836    1.94879322 -161.58632951   -0.67777639
#>  [51]    0.02233742    0.04619510   -1.53912291   -0.22444953    1.17384226
#>  [56]   -0.67109223   -0.79662763   -0.03487904   -2.67298247   -1.28764014
#>  [61]   -0.88146307   -0.37076171   -2.41160461    1.06690534   -0.47412584
#>  [66]    5.02634933    2.72762030    8.22767128    0.82028557    0.21015696
#>  [71]    1.17608458    1.47525153    0.13327918    0.65540973    0.64958580
#>  [76]   -0.96625441    1.27309215   -0.44080158    3.16690936   -4.30292473
#>  [81]    1.97244393   -1.66692710    0.07886089    3.11429671    0.72642988
#>  [86]   -0.48396807   -0.08972656    1.61724576   -1.11160867    1.81991733
#>  [91]   -0.07315583    2.98227544    2.52568571   -5.20772070    8.75714527
#>  [96]    0.71118400    4.28025666    0.30086180    0.37897874    5.24501729

5 References

Arnold, B. C. (1983). Pareto Distribution. Fairland, MD: International Cooperative Publishing House.

Chechile, R. A. (2003). Mathematical tools for hazard function analysis. Journal of Mathematical Psychology, 47, 478-494.

Fishman, G. S. (1996). Monte Carlo: Concepts, Algorithms, and Applications, Springer, New York.

Gumbel, E. J. (1958). Statistics of Extremes, Columbia University Press, New York.

Hardy, M. (2010). Pareto’s Law. Mathematical Intelligencer, 32, 38-43.

Harris, C. M. (1968). The Pareto distribution as a queue service discipline. Operations Research, 16, 307-313.

Johnson, N. L., Kotz, S. and Balakrishnan, N. (1995). Continuous Univariate Distributions, Wiley, New York.

Pareto, V. (1897). Cours d’Economie Politique Vol. 2, F. Rouge: Lausanna.

1. The letter $$l$$, not to be confused with the numeral $$1$$↩︎