# Generalized Integration Model

#### 2020-06-12

The gim package is used to integrate summary data reported to literature (external studies) with your own raw data (internal study). In this short tutorial, I will use examples to show various applications with gim. To use gim, you will have to find out what models are fitted for those external summary data, while it is up to you to chose the model to be fitted on your own internal data. Keep in mind that the parameters in the model fitted on internal data are of interest in gim analysis.

## Flexible Input

The gim package supports flexible ways to specify models fitted for the internal data and those of external studies. We use formula to specify a linear or logistic regression model for the internal raw data, and consider this model as the true underlying model. This is important as gim will calculate the standard error of its estimations under this model. Also, only the estimations of the coefficients in this model are of your interest. Currently, formula accept arithmetic expressions, e.g., log(x), or other operators like * for interactions. For example, y~as.factor(x1) * I(x2 >0)-1 is a valid formula for gim.

Three options are supported for family: gaussian for linear regression; binomial for logistic regression fitted on random sample; case-control for logistic regression fitted on case-control data. Note that the case/control ratio can vary among case-control datasets. It is always benefitial to use case-control over binomial even if your data for binary outcome is indeed a random sample from the population.

Your individual-level internal data should be passed through data. Note that every variable in formula should be present in data. gim discards incomplete samples from analysis.

gim uses model, a list of list, to organize summary information from external studies. Please use ?gim for more details of model. A rule of thumb is that if a set of coefficients are estimated from a fitted model, then one should pack them as an element into model. Two fitted external models could use the same or partially overlapping datasets, however, estimations from these two models should be stored as two separate elements in model as you do have fitted two models. Please refer to examples below.

External studies may or may not share data when models being fitted. If summary data from k external models are stored in model, then nsample should be a k by k matrix, with its (i,j) element representing the numbers of shared samples being used when fitting the two models. If models are fitted on independent datasets, then the corresponding elements in nsample should be zeros. The diagonal elements are numbers of samples used in fitting each of the external models. When analyzing case-control data, we have to specify ncase and nctrl instead, and leave nsample to be NULL, the default value.

gim is able to handle the senario when internal and external studies are conducted on different populations. In that case, an extra reference set is required for making a valid inference on model parameters, otherwise, as shown in our paper, the estimation could be biased and the standard error of estimation may be underestimated. The reference set ref is a data frame consisting variables used in formula and model[[i]]$formula. Unused variables in ref would be discarded automatically. gim does not assume an outcome variable is present in ref, that is what a reference means. Such a reference set is used to estimate the distribution of covariates in the external population. It is tricky to determine if data or ref contain ALL necessary variables when invoking gim. If you encounter error message when calling gim, try glm(formula, family, data) and for(m in model){ glm(m$formula, family, data)
}

to find out a full list of those variables. You may also add an artificial outcome into ref to replace data in codes above. If you still encounter any error message when running those codes, then very likely some variables are missing in data or ref.

?gim provides a good example showing how to use gim. Here we give some more specified examples that could be of interest in practice.

## Example 1: Integrating Population Characteristics

Many research articles in epidemiology usually provide a table in which two-way contingency tables are listed for outcome versus several population characteristics. For example, assume that in a paper about cancer, it may gives the following table

N <- 800
set.seed(0)
sex <- sample(c('F', 'M'), N, TRUE, c(.4, .6))
snp <- rbinom(N, 2, c(.4, .3, .3))
age <- runif(N, 20, 60)
pr <- plogis(.1 + .5 * I(sex == 'F') + .5 * I(age > 40) - .2 * snp)
cancer <- rbinom(N, 1, pr)
ext <- data.frame(cancer, sex, snp, age, stringsAsFactors = FALSE)
Case Control
Sex M 261 222
F 216 101
Age < 30 127 92
[30,50] 214 159
> 50 136 72
SNP 0 241 133
1 or 2 236 190

Note that the raw dataset ext could not be obtained from literature. Instead, one can only calculate univariate odds ratio of sex based on numbers in the table. For age, the odds ratios are subject to indicators I(age < 30) and I(age > 50). For SNP, the best we can have is the odds ratio in a dominant model. Statistically, these are equivalent to fitting univariate models on ext as follows

m1 <- glm(cancer ~ sex, data = ext, family = "binomial")
m2 <- glm(cancer ~ I(age < 30) + I(age > 50), data = ext, family = "binomial")
m3 <- glm(cancer ~ I(snp == 0), data = ext, family = "binomial")

Sometimes even the contingency tables are also inaccessible, but only the odds ratios are mentioned in text of articles. We consider a more complicated senario in this example.

summary(m2)$coef #> Estimate Std. Error z value Pr(>|z|) #> (Intercept) 0.2970718 0.1047006 2.8373472 0.004549012 #> I(age < 30)TRUE 0.0253267 0.1723537 0.1469461 0.883174576 #> I(age > 50)TRUE 0.3389170 0.1794548 1.8885921 0.058946501 We can see that only I(age > 50) shows significant association with cancer, researchers may only report its log odds ratio 0.34 in their articles. We thus illustrate gim by ignoring estimation of intercept in all the three models above. We also assume that only the log odds ratio of I(age > 50) is given to gim, while that of I(age < 30) is not. We create model as model1 <- list(formula = 'cancer ~sex', info = data.frame(var = names(coef(m1))[-1], bet = coef(m1)[-1], stringsAsFactors = FALSE)) model2 <- list(formula = 'cancer~I(age< 30) + I(age > 50)', info = data.frame(var = names(coef(m2))[3], bet = coef(m2)[3], stringsAsFactors = FALSE)) model3 <- list(formula = 'cancer ~ I(snp == 0)', info = data.frame(var = names(coef(m3))[-1], bet = coef(m3)[-1], stringsAsFactors = FALSE)) model <- list(model1, model2, model3) model2 #>$formula
#> [1] "cancer~I(age< 30) + I(age > 50)"
#>
#> $info #> var bet #> I(age > 50)TRUE I(age > 50)TRUE 0.338917 For each external model, gim needs to know the model used to fit external data (model[[i]]$formula), as well as the estimation of coefficients (model[[i]]$info). In this example, some coefficients (e.g. intercept and I(age < 30)) are estimated but not given to gim. This is a sweet spot of gim that can work with partial summary information quite well. Note that it is Since all three external models are fitted on the same dataset, we specify nsample for the three fitted models as nsample <- matrix(N, 3, 3) If we would consider external studies as case-control studies, which is what we recommend, we could instead specify ncase <- matrix(sum(cancer), 3, 3) nctrl <- matrix(sum(!cancer), 3, 3) and leave nsample to be NULL. Now we collect samples from an internal study. Note that we modify the model by using a differnt intercept 0.3 while keeping all other coefficients remain the same as before, because we want a case-control data of a different case/control ratio. n <- 300 set.seed(1) sex <- sample(c('F', 'M'), n, TRUE, c(.4, .6)) snp <- rbinom(n, 2, c(.4, .3, .3)) age <- runif(n, 20, 60) pr <- plogis(.3 + .5 * I(sex == 'F') + .5 * I(age > 40) - .2 * snp) cancer <- rbinom(n, 1, pr) smoking <- sample(c(TRUE, FALSE), n, TRUE, c(.3, .7)) int <- data.frame(cancer, sex, snp, age, smoking, stringsAsFactors = FALSE) In this internal study, we also collect a binary smoking, although it does not affect the cancer risk in our setting. Now we can integrate internal data int with external model. Let’s assume that, based on some knowledges, we believe that a single cutoff for age at 40 should be used in the underlying model instead 30 and 50 used in external studies, and an additive model for SNP would be more of interest. We also want to investigate potential effect of smoking status. So we assume the underlying true model would be cancer ~ I(sex == "F") + I(age > 40) + snp + smoking, and invoke gim as follows library(gim) fit1 <- gim(cancer ~ I(sex == "F") + I(age > 40) + snp + smoking, "case-control", int, model, ncase = ncase, nctrl = nctrl) summary(fit1) #> Call: #> gim.default(formula = cancer ~ I(sex == "F") + I(age > 40) + #> snp + smoking, family = "case-control", data = int, model = model, #> ncase = ncase, nctrl = nctrl) #> #> Estimate Std. Error z value Pr(>|z|) #> (Intercept) -0.23729 0.15403 -1.5405 0.1234325 #> I(sex == "F")TRUE 0.55741 0.14218 3.9204 8.839e-05 *** #> I(age > 40)TRUE 0.67326 0.20335 3.3109 0.0009301 *** #> snp -0.31230 0.10497 -2.9752 0.0029278 ** #> smokingTRUE -0.15390 0.27265 -0.5645 0.5724365 #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 One can compare the result with the one from glm fitted on int, which has higher standard errors on its estimations. fit0 <- glm(cancer ~ I(sex =="F") + I(age>40) +snp + smoking, "binomial", int) summary(fit0)$coef
#>                     Estimate Std. Error    z value    Pr(>|z|)
#> (Intercept)        0.4469066  0.2417339  1.8487546 0.064493257
#> I(sex == "F")TRUE  0.4094388  0.2636631  1.5528865 0.120450258
#> I(age > 40)TRUE    0.6681712  0.2526914  2.6442180 0.008187991
#> snp               -0.2303803  0.1784850 -1.2907543 0.196788903
#> smokingTRUE       -0.1550470  0.2727146 -0.5685323 0.569673562

The following table compares models fitted on internal and external datasets:

External
Internal model 1 model 2 model 3
Sex I(sex == 'F') I(sex == 'M')
Age I(age > 40) I(age < 30)
I(age > 50)
SNP snp I(snp == 0)
Smoking smoking

In this example, we used the feature in gim that supports flexible formula. One can, however, create variable instead using such a feature. For example, we can add columns to int as follows. Exactly the same result would be returned.

int$sex_F <- ifelse(int$sex == "F", 1, 0)
int$age_gt_40 <- ifelse(int$age > 40, 1, 0)
fit2 <- gim(cancer ~ sex_F + age_gt_40 + snp + smoking,
"case-control", int, model,
ncase = ncase, nctrl = nctrl)
summary(fit2)
#> Call:
#> gim.default(formula = cancer ~ sex_F + age_gt_40 + snp + smoking,
#>     family = "case-control", data = int, model = model, ncase = ncase,
#>     nctrl = nctrl)
#>
#>             Estimate Std. Error z value  Pr(>|z|)
#> (Intercept) -0.23729    0.15403 -1.5405 0.1234325
#> sex_F        0.55741    0.14218  3.9204 8.839e-05 ***
#> age_gt_40    0.67326    0.20335  3.3109 0.0009301 ***
#> snp         -0.31230    0.10497 -2.9752 0.0029278 **
#> smokingTRUE -0.15390    0.27265 -0.5645 0.5724365
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The list model may looks complicated. For example,

model[[1]]
#> $formula #> [1] "cancer ~sex" #> #>$info
#>       var        bet
#> sexM sexM -0.5983149

So where does the sexM come from? It is because that gim parses model[[1]]$formula and automatically creates a dummy variable for I(sex == "M"). R names such a dummy variable to be sexM, and in this example, we put this name to model[[1]]$info$var. As long as sex could be found in int, gim is able to define a dummy variable sexM itself and move everything forward smoothly. We can, however, use a more straight way to invoke gim. For example, we add a column dummy_sex to int int$dummy_sex <- ifelse(int$sex == "M", 1, 0) model1 <- list(formula = 'cancer ~ dummy_sex', info = data.frame(var = 'dummy_sex', bet = coef(m1)[-1], stringsAsFactors = FALSE)) model1 #>$formula
#> [1] "cancer ~ dummy_sex"
#>
#> $info #> var bet #> sexM dummy_sex -0.5983149 model <- list(model1, model2, model3) fit3 <- gim(cancer ~ I(sex == "F") + I(age > 40) + snp + smoking, "case-control", int, model, ncase = ncase, nctrl = nctrl) summary(fit3) #> Call: #> gim.default(formula = cancer ~ I(sex == "F") + I(age > 40) + #> snp + smoking, family = "case-control", data = int, model = model, #> ncase = ncase, nctrl = nctrl) #> #> Estimate Std. Error z value Pr(>|z|) #> (Intercept) -0.23729 0.15403 -1.5405 0.1234325 #> I(sex == "F")TRUE 0.55741 0.14218 3.9204 8.839e-05 *** #> I(age > 40)TRUE 0.67326 0.20335 3.3109 0.0009301 *** #> snp -0.31230 0.10497 -2.9752 0.0029278 ** #> smokingTRUE -0.15390 0.27265 -0.5645 0.5724365 #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 The result is still the same. Note that row name in model[[1]]$info does not matter.

This example illustrates several features of gim:

• flexible support on formula
• capability to work with partial extenrnal information
• account for variation due to sample overlapping through nsample or ncase, nctrl

gim also provides generic accessor functions coef, confint and vcov for extract information from its returned object.

coef(fit1)
#>       (Intercept) I(sex == "F")TRUE   I(age > 40)TRUE               snp
#>        -0.2372865         0.5574127         0.6732635        -0.3123015
#>       smokingTRUE
#>        -0.1539044
confint(fit1, level = 0.9)
#>                          5 %        95 %
#> (Intercept)       -0.4906427  0.01606971
#> I(sex == "F")TRUE  0.3235451  0.79128019
#> I(age > 40)TRUE    0.3387818  1.00774520
#> snp               -0.4849581 -0.13964487
#> smokingTRUE       -0.6023813  0.29457241
all(diag(vcov(fit1))^.5 == summary(fit1)\$coef[, 2])
#> [1] TRUE