# Model Customisation in missingHE

For each of the three types of models that can be fitted using missingHE, namely selection, pattern mixture, and hurdle models, the package provides a series of customisation options to allow a flexible specification of the models in terms of modelling assumptions and prior choices. These can be extremely useful for handling the typical features of trial-based CEA data, such as non-normality, clustering, and type of missingness mechanism. This tutorial shows how it is possible to customise different aspects of the models using the arguments of each type of function in the package. Throughout, we will use the built-in dataset called MenSS as a toy example, which is directly available when installing and loading missingHE in your R workspace. See the vignette called Introduction to missingHE for an introductory tutorial of each function in missingHE and a general presentation of the data from the MenSS dataset.

If you would like to have more information on the package, or would like to point out potential issues in the current version, feel free to contact the maintainer at a.gabrio@maastrichtuniversity.nl. Suggestions on how to improve the package are also very welcome.

## Changing the distributional assumptions

A general concern when analysing trial-based CEA data is that, in many cases, both effectiveness and costs are characterised by highly skewed distributions, which may cause standard normality modelling assumptions to be difficult to justify, especially for small samples. missingHE allows to chose among a range of parametric distributions for modelling both outcome variables, which were selected based on the most common choices in standard practice and the literature.

In each model, the specific type of distributions for the effectiveness ($$e$$) and cost ($$c$$) outcome can be selected by setting the arguments dist_e and dist_c to specific character names. Available choices include: Normal ("norm") and Beta ("beta") distributions for $$e$$ and Normal ("norm") and Gamma ("gamma") for $$c$$. Distributions for modelling both discrete and binary effectiveness variables are also available, such as Poisson ("pois") and Bernoulli (bern) distributions. The full list of available distributions for each type of outcome can be seen by using the help function on each function of the package.

In the MenSS dataset the default effectiveness variables are the QALYs. However, in general, other types of effectiveness measures may be of interest in the economic analysis (e.g. the primary outcome from a trial). In our dataset we have the number of instances of unprotected sex at $$12$$ months follow-up, denoted as sex_inst, which could be used in the CEA instead of QALYs. Thus, we create a second dataset called MenSS2, where we assign the name $$e$$ to the variable sex_inst, which allows missingHE to identify this variable as the main effectiveness variable for the analysis. This can be done by typing

> MenSS2 <- MenSS
> MenSS2$e <- MenSS$sex_inst
>
> #first 10 entries of e
> head(MenSS2$e, n = 10) [1] NA 50 3 0 99 NA 20 NA NA NA The new effectiveness outcome is now a discrete variable and therefore the use of discrete distributions is likely to be more appropriate for modelling purposes compared to standard normality assumptions. We can check the empirical histograms of $$e$$ by treatment group by typing > par(mfrow=c(1,2)) > hist(MenSS2$e[MenSS2$t==1], main = "N sex instances - Control") > hist(MenSS2$e[MenSS2$t==2], main = "N sex instances - Intervention") We can also see that the proportion of missing values in $$e$$ is considerably large in both treatment groups. > #proportions of missing values in the control group > sum(is.na(MenSS2$e[MenSS$t==1])) / length(MenSS2$e[MenSS$t==1]) [1] 0.48 > > #proportions of missing values in the intervention group > sum(is.na(MenSS2$e[MenSS$t==2])) / length(MenSS2$e[MenSS$t==2]) [1] 0.5238095 As an example, we fit a selection model assuming Poisson distributions to handle the discrete nature of $$e$$, and we choose Gamma distributions to capture the skewness in the costs. We note that, in case some of individuals have costs that are equal to zero (as in the MenSS dataset), standard parametric distributions with a positive support are not typically defined at $$0$$ (e.g. the Gamma distributions), making their implementation impossible. Thus, in these cases, it is necessary to use a trick to modify the boundary values before fitting the model. A common approach is to add a small constant to the cost variables. These can be done by typing > MenSS2$c <- MenSS2$c + 0.01 We note that, although simple, this strategy has the potential drawback that results may be affected by the choice of the constant added and therefore sensitivity analysis to the value used is typically recommended. missingHE provides an alternative way to deal with this issue by means of fitting a two-part regression or hurdle model which does not require the use of any constant. For more information on hurdle models, type help(hurdle). We are now ready to fit our selection model to the MenSS2 dataset using the following command > PG.sel=selection(data = MenSS2, model.eff = e ~ sex_inst.0, model.cost = c ~ 1, + model.me = me ~ age + ethnicity + employment, + model.mc = mc ~ age + ethnicity + employment, type = "MAR", + n.iter = 1000, dist_e = "pois", dist_c = "gamma") The arguments dist_e = "pois" and dist_c = "gamma" specify the distributions assumed for the outcome variables and, in the model of $$e$$, we also adjust for the baseline outcome values (sex_inst.0). According to the type of distributions chosen, missingHE automatically models the dependence between covariates and the mean outcome on a specific scale to reduce the chance of incurring into numerical problems due to the constraints of the distributions. For example, for both Poisson and Gamma distributions means are modelled on the log scale, while for Beta and Bernoulli distirbutions they are modelled on the logit scale. To see the modelling scale used by missingHE according to the type of distribution selected, use the help command on each function of the package. The model assumes MAR conditional on age, ethnicity and employment as auxiliary variables for predicting missingness in both outcomes. We can look at how the model generate imputations for the outcomes by treatment group using the generic function plot. For example, we can look at how the missing $$e$$ are imputed by typing > plot(PG.sel, outcome = "effects") Summary results of our model from a statistical perspective can be inspected using the command coef, which extracts the estimates of the mean regression coefficients for $$e$$ and $$c$$ by treatment group. By default, the lower and upper bounds provide the $$95\%$$ credibile intervals for each estimate (based on the $$0.025$$ and $$0.975$$ quantiles of the posterior distribution). However, it is possible to modify these values using the argument prob to change the level of the intervals to match the one desired. For example, if we want $$90\%$$ intervals, we can type > coef(PG.sel, prob = c(0.05, 0.95))$Comparator
$Comparator$Effects
mean    sd lower upper
(Intercept) 3.233 0.049 3.155 3.312
sex_inst.0  0.004 0.001 0.003 0.006

$Comparator$Costs
mean   sd lower upper
(Intercept) 5.505 0.39 4.908 6.193

$Reference$Reference$Effects mean sd lower upper (Intercept) 3.136 0.056 3.042 3.225 sex_inst.0 0.011 0.002 0.007 0.015$Reference$Costs mean sd lower upper (Intercept) 5.48 0.482 4.699 6.211 The entire posterior distribution for each parameter of the model can also be extracted from the output of the model by typing PG.sel$model_output, which returns a list object containing the posterior estimates for each model parameter. An overall summary of the economic analysis based on the model estimates can be obtained using the summary command

> summary(PG.sel)

Cost-effectiveness analysis summary

Comparator intervention: intervention 1
Reference intervention: intervention 2

Parameter estimates under MAR assumption

Comparator intervention
mean      sd      LB      UB
mean effects (t = 1)    27.3   1.083  25.603  29.152
mean costs (t = 1)   266.067 114.279 135.404 489.173

Reference intervention
mean     sd      LB     UB
mean effects (t = 2)  25.867  1.056  24.178 27.602
mean costs (t = 2)   269.266 135.37 109.892 498.09

Incremental results
mean      sd       LB      UB
delta effects -1.433   1.524   -3.764    1.03
delta costs    3.199 162.358 -251.133 255.221
ICER          -2.233                         

which shows summary statistics for the mean effectiveness and costs in each treatment group, for the mean differentials and the estimate of the ICER.

## Including random effects terms

For each type of model, missingHE allows the inclusion of random effects terms to handle clustered data. To be precise, the term random effects does not have much meaning within a Bayesian approach since all parameters are in fact random variables. However, this terminology is quite useful to explain the structure of the model.

We show how random effects can be added to the model of $$e$$ and $$c$$ within a pattern mixture model fitted to the MenSS2 dataset using the function pattern. The clustering variable over which the random effects are specified is the factor variable site, representing the centres at which data were collected in the trial. Using the same distributional assumptions of the selection model, we fit the pattern mixture model by typing

> PG.pat=pattern(data = MenSS2, model.eff = e ~ sex_inst.0 + (1 | site),
+                model.cost = c ~ 1 + (1 | site), type = "MAR", restriction = "AC",
+                n.iter = 1000, Delta_e = 0, Delta_c = 0, dist_e = "pois", dist_c = "gamma")

The function fits a random intercept only model for $$e$$ and $$c$$, as indicated by the notation (1 | site)and (1 | site). In both models, site is the clustering variable over which the random coefficients are estimated. missingHE allows the user to choose among different clustering variables for the model of $$e$$ and $$c$$ if these are available in the dataset. Random intercept and slope models can be specified using the notation (1 + sex_inst.0 | site), where sex_inst.0 is the name of a covariate which should be inlcuded as fixed effects in the corresponding outcome model. Finally, it is also possible to specify random slope only models using the notation (0 + sex_inst.0 | site), where 0 indicates the removal of the random intercept. The same notation can be applied when using the selection and hurdle functions inside missingHE, with the addition that for these models random effects can also be specified for the missingness and structural value mechanisms. Use the help command to obtain more information on how random effects can be specified for each type of model.

Coefficient estimates for the random effects can be extracted using the coef function and setting the argument random = TRUE (if set to FALSE then the fixed effects estimates are displayed).

> coef(PG.pat, random = TRUE)
$Comparator$Comparator$Effects mean sd lower upper (Intercept) 1 -3.478 21.040 -52.371 26.520 (Intercept) 2 -2.725 21.031 -51.548 27.215 (Intercept) 3 -3.971 21.024 -52.842 25.958$Comparator$Costs mean sd lower upper (Intercept) 1 4.859 0.809 3.486 6.249 (Intercept) 2 4.823 0.824 3.473 6.218 (Intercept) 3 4.869 0.811 3.575 6.265$Reference
$Reference$Effects
mean     sd   lower  upper
(Intercept) 1 23.342 27.106 -13.118 64.545
(Intercept) 2 22.863 27.112 -13.472 64.096
(Intercept) 3 23.483 27.104 -12.883 64.719

$Reference$Costs
mean    sd  lower upper
(Intercept) 1 2.892 1.982  0.176 5.683
(Intercept) 2 2.698 2.000 -0.208 5.446
(Intercept) 3 2.945 2.020  0.179 5.703

For both $$e$$ and $$c$$ models, summary statistics for the random coefficient estimates are displayed for each treatment group and each of the $$3$$ clusters in site.

## Changing the priors

By default, all models in missingHE are fitted using vague prior distributions so that posterior results are essentially derived based on the information from the observed data alone. This ensures a rough approximation to results obtained under a frequentist approach based on the same type of models.

However, in some cases, it may be reasonable to use more informative priors to ensure a better stability of the posterior estimates by restricting the range over which estimates can be obtained. For example if, based on previous evidence or knowledge, the range over which a specific parameter is likely to vary is known, then priors can be specified so to give less weight to values outside that range when deriving the posterior estimates. However, unless the user is familiar with the choice of informative priors, it is generally recommended not to change the default priors of missingHE as the unintended use of informative priors may substantially drive posterior estimates and lead to incorrect results.

For each type of model in missingHE, priors can be modified using the argument prior, which allows to change the hyperprior values for each model parameter. The interpretation of the prior values change according to the type of parameter and model considered. For example, we can fit a hurdle model using hurdle to the MenSS2 dataset using more informative priors on some parameters.

Prior values can be modified by first creating a list object which, for example, we call my.prior. Within this list, we create a number of elements (vectors of length two) which should be assigned specific names based on the type of parameters which priors we want to change.

> my.prior <- list(
+   "alpha0.prior" = c(0 , 0.0000001),
+   "alpha.prior" = c(0, 0.0000001),
+   "beta0.prior" = c(0, 0.0000001),
+   "gamma0.prior.c"= c(0, 1),
+   "gamma.prior.c" = c(0, 0.01),
+   "mu.b0.prior" = c(0, 0.001),
+   "mu.g0.prior.c"= c(0, 0.001),
+   "s.b0.prior" = c(0, 100),
+   "s.g0.prior.c"= c(0, 100),
+   "sigma.prior.c" = c(0, 10000)
+ )

The names above have the following interpretations in terms of the model parameters:

• "alpha0.prior" is the intercept of the model of $$e$$. The first and second elements inside the vector for this parameter are the mean and precision (inverse of variance) that should be used for the normal prior given to this parameter by missingHE.

• "alpha.prior" are the regression coefficients (exclusing the intercept) of the model of $$e$$. The first and second elements inside the vector for this parameter are the mean and precision (inverse of variance) that should be used for the normal priors given to each coefficient by missingHE.

• "beta0.prior" is the intercept of the model of $$c$$. The first and second elements inside the vector for this parameter are the mean and precision (inverse of variance) that should be used for the normal prior given to this parameter by missingHE.

• "gamma0.prior.c" is the intercept of the model of $$sc$$. The first and second elements inside the vector for this parameter are the mean and precision (inverse of variance) that should be used for the logistic prior given to this parameter by missingHE.

• "gamma.prior.c" are the regression coefficients (exclusing the intercept) of the model of $$sc$$. The first and second elements inside the vector for this parameter are the mean and precision (inverse of variance) that should be used for the normal priors given to each coefficient by missingHE.

• "mu.b0.prior" is the mean of the random intercept of the model of $$c$$. The first and second elements inside the vector for this parameter are the mean and precision (inverse of variance) that should be used for the normal prior given to this parameter by missingHE.

• "s.b0.prior" is the standard deviation of the random intercept of the model of $$c$$. The first and second elements inside the vector for this parameter are the lower and upper bounds that should be used for the uniform prior given to this parameter by missingHE.

• "mu.g0.prior" is the mean of the random intercept of the model of $$sc$$. The first and second elements inside the vector for this parameter are the mean and precision (inverse of variance) that should be used for the normal prior given to this parameter by missingHE.

• "s.g0.prior" is the standard deviation of the random intercept of the model of $$sc$$. The first and second elements inside the vector for this parameter are the lower and upper bounds that should be used for the uniform prior given to this parameter by missingHE.

• "sigma.prior.c" is the standard deviation of the model of $$c$$. The first and second elements inside the vector for this parameter are the lower and upper bounds that should be used for the uniform prior given to this parameter by missingHE.

The values shown above are the default values set in missingHE for each of these parameters. It is possible to change the priors by providing different values, for example by increasing the precision for some of the coefficient estimates or decreasing the upper bound for standard deviation parameters. Different names should be used to indicate for which parameter the prior should be modified, keeping in mind that the full list of names that can be used varies depending on the type of models and modelling assumptions specified. The full list of parameter names for each type of model can be assessed using the helpcommand on each function of missingHE.

We can now fit the hurdle model using our priors by typing

> #remove added constant from costs
> #MenSS2$c <- MenSS2$c - 0.01
>
> PG.hur=hurdle(data = MenSS2, model.eff = e ~ sex_inst.0, model.cost = c ~ 1 + (1 | site),
+   model.se = se ~ 1, model.sc = sc ~ age + (1 | site), type = "SAR", se = NULL, sc = 0,
+   n.iter = 1000, dist_e = "pois", dist_c = "gamma", prior = my.prior)

In this case, we do not require handling any structural value in $$e$$ and we pass this information to the function by setting se = NULL. Finally, we can inspect the statistical results from the model by typing

> coef(PG.hur, random = FALSE)
$Comparator$Comparator$Effects mean sd lower upper (Intercept) 3.256 0.049 3.166 3.359 sex_inst.0 0.004 0.001 0.002 0.006$Comparator$Costs mean sd lower upper (Intercept) 2.525 3.86 -4.329 8.299$Reference
$Reference$Effects
mean    sd lower upper
(Intercept) 3.194 0.055 3.087 3.300
sex_inst.0  0.009 0.002 0.005 0.014

$Reference$Costs
mean   sd  lower upper
(Intercept) 3.675 2.69 -1.288 6.845

and

> coef(PG.hur, random = TRUE)
$Comparator$Comparator$Effects NULL$Comparator$Costs mean sd lower upper (Intercept) 1 3.044 3.837 -2.871 9.735 (Intercept) 2 3.319 3.834 -2.544 10.072 (Intercept) 3 3.147 3.812 -2.627 9.772$Reference
$Reference$Effects
NULL

$Reference$Costs
mean    sd  lower upper
(Intercept) 1 1.817 2.710 -1.446 6.979
(Intercept) 2 2.063 2.704 -1.219 7.103
(Intercept) 3 1.773 2.680 -1.445 6.916