For illustration, we selected the Cleveland Clinic Heart Disease data set from the University of California, Irvine (UCI) machine learning repository (Dua and Graff 2017). Below, we use eleven variables: five continuous, four dichotomous, and two categorical.

```
library(modgo)
data("Cleveland", package = "modgo")
```

```
# Specifying dichotomous and ordinal categorical variables
binary_variables <- c("Sex","HighFastBloodSugar","CAD","ExInducedAngina")
categorical_variables <- c("Chestpaintype","RestingECG")
# Number of simulated data sets
nrep <- 500
# Variables shown in the plots
plot_variables <- c("Age", "STDepression", binary_variables[c(1,3)], categorical_variables)
```

In this section, we run *modgo* with its default settings. For
*modgo* to produce results that mimic the original data set
efficiently, the user needs to specify the dichotomous and ordinal
categorical variables. All other variables are treated as continuous.
All *modgo* runs in this and the following sections produce 500
data sets through the specification nrep = 500; the default is 100.

Figure 1 shows the correlation plots for the default *modgo*
run, and Figure 2 displays the distribution plots for the original data
set and one simulated data set. By default, the first simulated data
set is displayed. All plots are restricted to a subset of the
variables.

```
test <- modgo(data = Cleveland,
              bin_variables = binary_variables,
              categ_variables = categorical_variables,
              nrep = nrep)
```

*modgo* provides an option so that only subjects (instances)
are simulated that fulfill a specific requirement. In the simplest case
(Section 2.1), the user can specify an upper or a lower boundary, or an
interval for a variable. Alternatively, the user may specify a
combination of variables and thresholds.

Three steps are required when subjects need to fulfill a specific
selection criterion for a continuous variable. First, the name of the
variable for which the threshold is to be set is specified. Second, the
left and right boundaries are specified. Third, a data frame with three
columns is defined, with Column 1: name of the threshold variable,
Column 2: left boundary, i.e., lower bound, Column 3: right boundary,
i.e., upper bound. The data frame is then passed to *modgo* through the
*thresh_var* argument. In the example, all subjects have to be at
least 66 years old. The selection variable therefore is *Age*, with
left threshold *65* and right threshold *NA*, i.e., no upper bound.

If the percentage of samples fulfilling the indicated threshold
requirements is less than 10% of the simulated samples, *modgo*
stops to avoid excessive computation time. However, users can force
the requested simulation to run by setting *thresh_force = TRUE*.
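
The selection rule and the 10% check can be illustrated with a small base R sketch. It does not call *modgo*; the ages are made-up values, `within_bounds` is a hypothetical helper, and treating both boundaries as strict inequalities is an assumption. Because *modgo* applies the 10% rule to the simulated samples, the share computed on observed values is only a rough indication.

```r
# Toy ages (made-up values, not the Cleveland data)
age <- c(41, 52, 58, 63, 66, 67, 70, 45, 59, 71)

# Threshold specification in the three-column layout described above
thresholds_toy <- data.frame(Variables = "Age",
                             thresh_left = 65,
                             thresh_right = NA)

# Hypothetical helper: TRUE where x lies inside the bounds.
# NA means "no bound"; strict inequalities are an assumption.
within_bounds <- function(x, left, right) {
  above_left  <- if (is.na(left))  rep(TRUE, length(x)) else x > left
  below_right <- if (is.na(right)) rep(TRUE, length(x)) else x < right
  above_left & below_right
}

keep <- within_bounds(age,
                      thresholds_toy$thresh_left,
                      thresholds_toy$thresh_right)

# Share of observations that satisfy the selection criterion
share <- mean(keep)
share

# TRUE would indicate a selection below the 10% mark mentioned above
share < 0.10
```

A share well above 10%, as in this toy example, suggests that a run would not need *thresh_force = TRUE*.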

Figure 3 shows the correlation plot for this illustration. Substantial differences between the original and the simulated correlation plots can be observed for RestingECG and several other variables. Figure 4 displays the corresponding distribution plot. The age distribution is shifted, as expected. Furthermore, the proportion of subjects with coronary artery disease (CAD = 1) is higher in the simulated than in the original data set.

```
Variables <- c("Age")
thresh_left <- c(65)
thresh_right <- c(NA)
thresholds <- data.frame(Variables, thresh_left, thresh_right)
print(as.matrix(thresholds))
```

```
##      Variables thresh_left thresh_right
## [1,] "Age"     "65"        NA
```

```
test_thresh <- modgo(data = Cleveland,
                     bin_variables = binary_variables,
                     categ_variables = categorical_variables,
                     thresh_var = thresholds,
                     nrep = nrep,
                     thresh_force = TRUE)
```

For continuous variables, *modgo* provides the option to add
normally distributed noise with mean 0 and variance \(\sigma_{p}^2\). With this perturbation, the
variance of the perturbed variable is identical to the variance of the
original variable. This option permits the generation of values for
continuous variables that were not observed in the original data
set.

To specify which variables are to be perturbed and to which degree,
i.e., percentage, the user needs to provide *modgo* with a named
vector: the values are the percentages, and the names are the
corresponding variable names.
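
One way a variance-preserving perturbation of this kind can be constructed is sketched below in base R. This is an illustration under stated assumptions, not necessarily the computation *modgo* performs internally: the noise variance is assumed to be \(\sigma_{p}^2 = p\,\sigma^2\) for a perturbation proportion \(p\), and the centred variable is shrunk by \(\sqrt{1-p}\) so that the two variance components add up to \(\sigma^2\). The data are simulated toy values.

```r
set.seed(1)

# Toy continuous variable (made-up values, not the Cleveland data)
x <- rnorm(1e5, mean = 130, sd = 18)

# Perturbation proportion; 0.9 mirrors the first entry of a named vector
p <- 0.9

# Assumption: noise has mean 0 and variance p * var(x)
noise <- rnorm(length(x), mean = 0, sd = sqrt(p * var(x)))

# Assumption: shrink the centred variable by sqrt(1 - p) before adding
# the noise, so that (1 - p) * var(x) + p * var(x) = var(x) in expectation
x_pert <- mean(x) + sqrt(1 - p) * (x - mean(x)) + noise

# The two variances agree up to sampling error
c(original = var(x), perturbed = var(x_pert))
```

With a large proportion such as 0.9, most of the variance of the perturbed variable stems from the normal noise, so its distribution can differ visibly from the original even though the variance is retained.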

As in the previous examples, Figure 5 shows the correlation plots for the run with perturbation, and Figure 6 displays the distribution plots. Figure 6 shows that the distributions of both resting blood pressure and cholesterol change substantially due to the perturbation.

```
# Create named vector of perturbation proportions
perturb_vector <- c(0.9,0.7)
names(perturb_vector) <- c("RestingBP","Cholsterol")
test_pertru <- modgo(data = Cleveland,
                     bin_variables = binary_variables,
                     categ_variables = categorical_variables,
                     pertr_vec = perturb_vector,
                     nrep = nrep)
```