[@Hurvich_Tsai_1989; @Al-Subaihi_2002]$^1$", "$n\\ln\\left(\\frac{|\\text{SSE}|}{n}\\right) + \\frac{nq(n+p)}{n-p-q-1}$

[@Hurvich_Tsai_1989; @Bedrick_Tsai_1994]$^2$", "$n\\ln\\left(\\frac{SSE}{n}\\right) + 2(p+2)o - 2o^2, o = \\frac{n\\sigma^2}{SSE}$

[@Sawa_1978; @Judge_1985]

not available for MMR", "$\\frac{SSE}{\\sigma^2} + 2p - n$

[@Mallows_1973; @Hocking_1976]

not available for MMR", "$n\\ln\\left(\\frac{|\\text{SSE}|}{n}\\right) + 2pq\\ln(\\ln(n))$

[@Hannan_Quinn_1979; @McQuarrie_Tsai_1998; @Hurvich_Tsai_1989]", "$n\\ln\\left(\\frac{|\\text{SSE}|}{n}\\right) + p$

[@Nelder_Wedderburn_1972; @Smith_Spiegelhalter_1980] not available for MMR", "$n\\ln\\left(\\frac{|\\text{SSE}|}{n}\\right) + \\frac{3}{2}p$

[@Smith_Spiegelhalter_1980]

not available for MMR", "$n\\ln\\left(\\frac{|\\text{SSE}|}{n}\\right) + pq \\ln(n)$

[@Hurvich_Tsai_1989; @Schwarz_1978; @Judge_1985; @Al-Subaihi_2002]

not available for MMR", "$\\textit{F test}$ for UMR and $\\textit{Approximate F test}$ for MMR", "$1 - \\frac{SSE}{SST}$

not available for MMR", "$1 - \\frac{(n-1)(1-R^2)}{n-p}$

[@Darlington_1968; @Judge_1985]

not available for MMR") Formula_in_Logit_Cox_Poisson_Gamma <- c("logit, cox, poisson and gamma", "$-2\\text{LL} + 2p$

[@Darlington_1968; @Judge_1985]", "$-2\\text{LL} + \\frac{n(n+p)}{n-p-2}$

[@Hurvich_Tsai_1989]", "not available", "not available", "$-2\\text{LL} + 2p\\ln(\\ln(n))$

[@Hannan_Quinn_1979]", "$-2\\text{LL} + p$

[@Nelder_Wedderburn_1972; @Smith_Spiegelhalter_1980]", "$-2\\text{LL} + \\frac{3}{2}p$

[@Smith_Spiegelhalter_1980]", "$-2\\text{LL} + p\\ln(n)$

[@Schwarz_1978; @Judge_1985]", "Forward: LRT and Rao Chi-square test (logit, poisson, gamma); LRT (cox);

Backward: Wald test", "not available", "not available")

df <- data.frame(Abbreviation, Definition, Formula_in_Linear, Formula_in_Logit_Cox_Poisson_Gamma)
colnames(df) <- c("Abbreviation", "Definition", "Formula", "")
kable(df, format = "html", align = "l", booktabs = TRUE, escape = FALSE,
      caption = 'Abbreviation, Definition, and Formula of the Selection Metric for Linear, Logit, Cox, Poisson, and Gamma regression') %>%
  footnote(number = c("Unsupported AIC formula (which does not affect the selection process as it only differs by constant additive and multiplicative factors):\n $AIC=n\\ln\\left(\\frac{SSE}{n}\\right) + 2p$ [@Darlington_1968; @Judge_1985]",
                      "Unsupported AICc formula (which does not affect the selection process as it only differs by constant additive and multiplicative factors):\n $AICc=\\ln\\left(\\frac{SSE}{n}\\right) + 1 + \\frac{2(p+1)}{n-p-2}$ [@McQuarrie_Tsai_1998]")) %>%
  kable_styling() %>%
  column_spec(3, width = "0.5in") %>%
  column_spec(4, width = "0.4in")
```

No metric is necessarily optimal for all datasets; the right choice depends on your data and research goals. We recommend using multiple metrics simultaneously so that you can select the best model for your specific needs. General guidance is summarized below.

+ AIC: AIC penalizes the inclusion of additional variables in a model; the lower the AIC, the better the model. AIC does not include the sample size in its penalty term, and it is optimal in minimizing the mean square error of predictions [@Brewer_2016].
+ AICc: AICc is a variant of AIC that works better for small sample sizes, especially when `numObs / numParam < 40` [@Burnham_2002].
+ Cp: Cp is used for linear models and is equivalent to AIC for Gaussian linear model selection.
+ IC(1) and IC(3/2): IC(1) and IC(3/2) use 1 and 3/2 as penalty factors, respectively, compared to the 2 used by AIC.
As such, IC(1) tends to return a more complex model with more variables, which may suffer from overfitting.
+ BIC and SBC: Both BIC and SBC are variants of the Bayesian Information Criterion. The main distinction between BIC/SBC and AIC lies in the magnitude of the penalty imposed: BIC/SBC penalize model complexity more heavily, which typically results in a simpler model [@SAS_Institute_2018; @Sawa_1978; @Hurvich_Tsai_1989; @Schwarz_1978; @Judge_1985; @Al-Subaihi_2002]. The precise definitions of these criteria vary across the literature and in the SAS environment. Here, BIC aligns with the definition of the Sawa Bayesian Information Criterion as outlined in the [SAS](https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.3/statug/statug_glmselect_syntax07.htm) documentation, while SBC corresponds to the Schwarz Bayesian Information Criterion. According to Richard's [post](https://www.linkedin.com/pulse/aicbic-model-selection-richard-randa/), whereas AIC often favors overly complex models, BIC/SBC prioritize smaller models. Consequently, AIC may be preferable for limited sample sizes, whereas BIC/SBC tend to perform better with larger sample sizes.
+ HQ: HQ is an alternative to AIC, differing primarily in how the penalty is calculated. However, HQ has remained relatively underutilized in practice [@Burnham_2002].
+ Rsq: The R-squared (R²) statistic measures the proportion of variation explained by the model. It ranges from 0 to 1, with 1 indicating that all of the variability in the response variable is accounted for by the independent variables. As such, R-squared is valuable for communicating the explanatory power of a model. However, R-squared alone is not sufficient for model selection because it does not account for model complexity. Therefore, while R-squared is useful for understanding how well the model fits the data, it should not be the sole criterion for model selection.
+ adjRsq: The adjusted R-squared (adj-R²) seeks to overcome the limitation of R-squared in model selection by accounting for the number of predictors. It serves a similar purpose to the information criteria, as both compare models by weighing goodness of fit against the number of parameters. However, information criteria are typically regarded as superior in this context [@Stevens_2016].
+ SL: SL stands for Significance Level (P-value), embodying a distinct approach to model selection in contrast to the information criteria. The SL method operates by calculating a P-value through a specific hypothesis test. Should this P-value fall below a predefined threshold, such as 0.05, one should favor the alternative hypothesis, indicating that the full model significantly outperforms the reduced model. The effectiveness of this method hinges on the choice of the P-value threshold, with smaller thresholds tending to yield simpler models.

## Multicollinearity {#multicollinearity}

This [blog](https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/) by Jim Frost gives an excellent overview of multicollinearity and of when it is necessary to remove it. Simply put, a dataset contains multicollinearity when the input predictors are correlated. When multicollinearity occurs, the interpretability of the predictors suffers because changes in one input variable are accompanied by changes in other input variables, making it hard to estimate the relationship between each individual input variable and the dependent variable. Multicollinearity can dramatically reduce the precision of the estimated regression coefficients of correlated input variables, making it hard to find the correct model. However, as Jim pointed out, “Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, precision of the predictions, and the goodness-of-fit statistics.
If your primary goal is to make predictions, and you don’t need to understand the role of each independent variable, you don’t need to reduce severe multicollinearity.” In StepReg, [QR Matrix Decomposition](https://towardsdatascience.com/qr-matrix-factorization-15bae43a6b2#:~:text=The%20QR%20matrix%20decomposition%20allows%20one%20to%20express%20a%20matrix,zero%2C%20it%20is%20also%20invertible.) is performed ahead of time to detect and remove input variables causing multicollinearity.

# StepReg output {#stepregoutput}

StepReg provides multiple functions for summarizing the model-building results. The function `stepwise()` generates a list of tables that describe the feature selection steps and the final model. To facilitate collaboration, you can redirect the tables into various formats such as "xlsx", "html", and "docx" with the function `report()`. Furthermore, you can easily compare the variable selection procedures for multiple selection metrics by visualizing the steps with the function `plot()`. See below for details.

Depending on the number of selected regression strategies and metrics, you can expect to receive at least four tables from `stepwise()`. The content of each of the four tables from the Quick Demo \@ref(quickdemo) is described below.

```{r, echo = FALSE}
Table_Name <- c("Summary of arguments for model selection",
                "Summary of variables in dataset",
                "Summary of selection process under xxx(strategy) with xxx(metric)",
                "Summary of coefficients for the selected model with xxx(dependent variable) under xxx(strategy) and xxx(metric)")
Table_Description <- c("Arguments used in the stepwise function, either default or user-supplied values.",
                       "Variable names, types, and classes in dataset.",
                       "Overview of the variable selection process under specified strategy and metric.",
                       "Coefficients for the selected models under specified strategy with metric.
Please note that this table will not be generated for the strategy 'subset' when using the metric 'SL'.")

data.frame(Table_Name, Table_Description) %>%
  kable(format = "html", align = 'l', escape = FALSE, caption = "Tables generated by StepReg") %>%
  kable_styling()
```

You can save the output in different formats such as "xlsx", "docx", "html", and "pptx", facilitating easy sharing. Of note, the suffix will be automatically added to the `report_name`. For instance, the following example generates both "results.xlsx" and "results.docx" reports.

```{r, eval = FALSE}
report(res, report_name = "results", format = c("xlsx", "docx"))
```

# Use cases {#usecases}

Please choose the regression model that best suits the type of your response variable. For detailed guidance, see section \@ref(regressioncategories). Below, we present various examples utilizing different models tailored to specific datasets.

## Linear regression with the _mtcars_ dataset

In this section, we'll demonstrate how to perform linear regression analysis using the [mtcars](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html) dataset, showcasing different scenarios with varying numbers of predictors and dependent variables. We set `type = "linear"` to direct the function to perform linear regression.

**Description of the `mtcars` dataset**

`mtcars` is a classic dataset in statistics and is included in the base R installation. It was sourced from the 1974 _Motor Trend_ US magazine, comprising 32 observations on 11 variables. Here's a brief description of the variables included:

1. mpg: miles per gallon (fuel efficiency)
2. cyl: number of cylinders
3. disp: displacement (engine size) in cubic inches
4. hp: gross horsepower
5. drat: rear axle ratio
6. wt: weight (in thousands of pounds)
7. qsec: 1/4 mile time (in seconds)
8. vs: engine type (0 = V-shaped, 1 = straight)
9. am: transmission type (0 = automatic, 1 = manual)
10. gear: number of forward gears
11.
carb: number of carburetors

**Why choose linear regression**

Linear regression is an ideal choice for analyzing the `mtcars` dataset due to its inclusion of continuous variables like "mpg", "hp", or "wt", which can serve as response variables. Furthermore, the dataset exhibits potential linear relationships between the response variable and the other variables.

### Example1: single dependent variable ("mpg")

In this example, we employ the "forward" strategy with "AIC" as the selection criterion. Additionally, we specify with the `include` argument that "disp" and "cyl" must always be included in the model.

```{r, message = FALSE}
library(StepReg)
data(mtcars)

formula <- mpg ~ .
res1 <- stepwise(formula = formula,
                 data = mtcars,
                 type = "linear",
                 include = c("disp", "cyl"),
                 strategy = "forward",
                 metric = "AIC")
res1
```

To visualize the selection process:

```{r plot_res1, warning = FALSE}
plot_list <- plot(res1)
cowplot::plot_grid(plotlist = plot_list$forward, ncol = 1)
```

To exclude the intercept from the model, adjust the formula as follows:

```{r, eval = FALSE}
formula <- mpg ~ . + 0
```

To limit the model to a specific subset of predictors, adjust the formula as follows, which will only consider "cyl", "disp", "hp", "wt", "vs", and "am" as predictors.

```{r, eval = FALSE}
formula <- mpg ~ cyl + disp + hp + wt + vs + am + 0
```

Another way is to use the minus symbol (`-`) to exclude some predictors from variable selection. For example, to include all variables except "disp", "wt", and the intercept:

```{r, eval = FALSE}
formula <- mpg ~ . - 1 - disp - wt
```

You can simultaneously provide multiple selection strategies and metrics. For example, the following code snippet employs both the "forward" and "backward" strategies using the metrics "AIC", "BIC", and "SL". It's worth mentioning that when "SL" is specified, you may also want to set the significance levels for entry ("sle") and stay ("sls"), both of which default to 0.15.

```{r, message = FALSE, warning = FALSE}
formula <- mpg ~ .
res2 <- stepwise(formula = formula,
                 data = mtcars,
                 type = "linear",
                 strategy = c("forward", "backward"),
                 metric = c("AIC", "BIC", "SL"),
                 sle = 0.05,
                 sls = 0.05)
res2
```

```{r, message = FALSE, warning = FALSE, fig.width=9, fig.height=12}
plot_list <- plot(res2)
cowplot::plot_grid(plotlist = plot_list$forward, ncol = 1, rel_heights = c(2, 1))
cowplot::plot_grid(plotlist = plot_list$backward, ncol = 1, rel_heights = c(2, 1))
```

### Example2: multivariate regression ("mpg" and "drat")

In this scenario, there are two dependent variables, "mpg" and "drat". The model selection aims to identify the most influential predictors that affect both variables.

```{r}
formula <- cbind(mpg, drat) ~ . + 0
res3 <- stepwise(formula = formula,
                 data = mtcars,
                 type = "linear",
                 strategy = "bidirection",
                 metric = c("AIC", "HQ"))
res3
plot(res3)
```

## Logistic regression with the _remission_ dataset

In this example, we'll showcase logistic regression using the `remission` dataset. By setting `type = "logit"`, we instruct the function to perform logistic regression.

**Description of the `remission` dataset**

The [remission](https://online.stat.psu.edu/stat501/book/export/html/1011) dataset, obtained from the online course STAT501 at Penn State University, has been integrated into StepReg. It consists of 27 observations across seven variables, including a binary variable named "remiss":

1. remiss: whether leukemia remission occurred, a value of 1 indicates occurrence while 0 means non-occurrence
2. cell: cellularity of the marrow clot section
3. smear: smear differential percentage of blasts
4. infil: percentage of absolute marrow leukemia cell infiltrate
5. li: percentage labeling index of the bone marrow leukemia cells
6. blast: the absolute number of blasts in the peripheral blood
7.
temp: the highest temperature before the start of treatment

**Why choose logistic regression**

Logistic regression effectively captures the relationship between predictors and a categorical response variable, offering insights into the probability of being assigned to specific response categories given a set of predictors. It is suitable for analyzing binary outcomes, such as the remission status ("remiss") in the `remission` dataset.

### Example1: using "forward" strategy

In this example, we employ a "forward" strategy with "AIC" as the selection criterion, while ensuring that the "cell" variable is always included in the model.

```{r, message = FALSE}
data(remission)

formula <- remiss ~ .
res4 <- stepwise(formula = formula,
                 data = remission,
                 type = "logit",
                 include = "cell",
                 strategy = "forward",
                 metric = "AIC")
res4
plot(res4)
```

### Example2: using "subset" strategy

In this example, we employ a "subset" strategy, utilizing "SBC" as the selection criterion while excluding the intercept. Meanwhile, we set `best_n = 3` to restrict the output to the top 3 models for each number of variables.

```{r, message = FALSE}
data(remission)

formula <- remiss ~ . + 0
res5 <- stepwise(formula = formula,
                 data = remission,
                 type = "logit",
                 strategy = "subset",
                 metric = "SBC",
                 best_n = 3)
res5
plot(res5)
```

Here, the `0` in the above plot means that there is no intercept in the model.

## Cox regression with the `lung` dataset

In this example, we'll demonstrate how to perform Cox regression analysis using the [`lung`](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/lung.html) dataset. By setting `type = "cox"`, we instruct the function to conduct Cox regression.

**Description of the `lung` dataset**

The `lung` dataset, available in the `"survival"` R package, includes information on survival times for 228 patients with advanced lung cancer.
It comprises ten variables, among which the "status" variable codes for censoring status (1 = censored, 2 = dead), and the "time" variable denotes the patient survival time in days. To learn more about the dataset, use `?survival::lung`.

**Why choose Cox regression**

Cox regression, also termed the Cox proportional hazards model, is specifically designed for analyzing survival data, making it well-suited for datasets like `lung` that include information on the time until an event (e.g., death) occurs. This method accommodates censoring and assumes proportional hazards, enhancing its applicability to medical studies involving time-to-event outcomes.

### Example1: using "forward" strategy

In this example, we employ a "forward" strategy with "AICc" as the selection criterion.

```{r, message = FALSE}
library(dplyr)
library(survival)

# Preprocess:
lung <- survival::lung %>%
  mutate(sex = factor(sex, levels = c(1, 2))) %>% # make sex a factor
  na.omit() # get rid of incomplete records

formula <- Surv(time, status) ~ .
res6 <- stepwise(formula = formula,
                 data = lung,
                 type = "cox",
                 strategy = "forward",
                 metric = "AICc")
res6
plot(res6)
```

## Poisson regression with the `creditCard` dataset

In this example, we'll demonstrate how to perform Poisson regression analysis using the [`creditCard`](https://search.r-project.org/CRAN/refmans/AER/html/CreditCard.html) dataset. We set `type = "poisson"` to direct the function to perform Poisson regression.

**Description of the `creditCard` dataset**

The `creditCard` dataset, included in the `"AER"` package, contains credit history information for a sample of applicants for a specific type of credit card. It encompasses 1319 observations across 12 variables, including "reports", "age", and "income", among others. The "reports" variable represents the number of major derogatory reports. For detailed information, refer to `?AER::CreditCard`.
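Before fitting a Poisson model, it can help to confirm that the response really is count-valued. A minimal sketch, assuming the copy of `creditCard` bundled with StepReg:

```r
library(StepReg)

# Inspect the bundled creditCard dataset
data(creditCard)
dim(creditCard) # 1319 observations of 12 variables

# "reports" is a non-negative integer count, the kind of
# response Poisson regression is designed for
summary(creditCard$reports)
stopifnot(all(creditCard$reports >= 0),
          all(creditCard$reports == round(creditCard$reports)))
```

If `stopifnot()` passes silently, the variable satisfies the basic requirements of a count response.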
**Why choose Poisson regression**

Poisson regression is a frequently employed method for analyzing count data, where the response variable represents the number of occurrences of an event within a defined window of time or space. In the context of the `creditCard` dataset, Poisson regression can model the count of major derogatory reports ("reports"), enabling assessment of the predictors' impact on this variable.

### Example1: using "forward" strategy

In this example, we employ a "forward" strategy with "SL" as the selection criterion. We set the significance level for entry to 0.05 (`sle = 0.05`).

```{r, message = FALSE}
data(creditCard)

formula <- reports ~ .
res7 <- stepwise(formula = formula,
                 data = creditCard,
                 type = "poisson",
                 strategy = "forward",
                 metric = "SL",
                 sle = 0.05)
res7
plot(res7)
```

# Interactive app {#shinyapp}

We have developed an interactive Shiny application to simplify model selection tasks for non-programmers. You can access the app through the following URL: https://junhuili1017.shinyapps.io/StepReg/

You can also launch the Shiny app directly from your local machine with the following code:

```{r, eval = FALSE}
library(StepReg)
StepRegShinyApp()
```

Here is the user interface.

![](src/StepReg_shiny_UI_description_File.png)

![](src/StepReg_shiny_UI_description_Analyze.png)

# Session info

```{r sessionInfo, echo = FALSE}
sessionInfo()
```