The tbl_summary() function calculates descriptive statistics for continuous, categorical, and dichotomous variables in R, and presents the results in a customizable summary table ready for publication (for example, Table 1 or demographic tables).
This vignette walks through the tbl_summary() function and the various functions available to modify and make additions to an existing table summary object.
To start, a quick note on the {magrittr} package’s pipe function, %>%. By default, the pipe operator passes whatever is on the left-hand side of %>% as the first argument of the function on the right-hand side. The pipe can make code relating to tbl_summary() easier to read, but it is not required. Here are a few examples of how %>% translates into typical R notation.
x %>% f() is equivalent to f(x)
x %>% f(y) is equivalent to f(x, y)
y %>% f(x, .) is equivalent to f(x, y)
z %>% f(x, y, arg = .) is equivalent to f(x, y, arg = z)
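As a quick check of the equivalences above, the sketch below defines a made-up function f() (hypothetical, used only for illustration) and compares each piped call with its plain counterpart. It assumes the {magrittr} package is installed.

```r
library(magrittr)

# Hypothetical function, defined only to make argument order visible
f <- function(a, b = 0, arg = 0) a - b + 100 * arg

x <- 10
y <- 3
z <- 2

# Each element compares a piped call with the equivalent plain call
results <- c(
  first_argument  = identical(x %>% f(),               f(x)),
  extra_argument  = identical(x %>% f(y),              f(x, y)),
  dot_positional  = identical(y %>% f(x, .),           f(x, y)),
  dot_named       = identical(z %>% f(x, y, arg = .),  f(x, y, arg = z))
)

print(results)
```

Because f() is not symmetric in its arguments, a pipe that placed the left-hand side in the wrong position would show up as a mismatch.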
Here’s how this translates into the use of tbl_summary().
mtcars %>% tbl_summary() is equivalent to tbl_summary(mtcars)
mtcars %>% tbl_summary(by = am) is equivalent to tbl_summary(mtcars, by = am)
tbl_summary(mtcars, by = am) %>% add_p() is equivalent to
tbl = tbl_summary(mtcars, by = am)
add_p(tbl)
Before going through the tutorial, install {gtsummary} and {gt}.
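The installation commands are not shown in this section. A minimal sketch, assuming both packages are available from your configured CRAN mirror, installs whichever of the two is missing and then loads {gtsummary}:

```r
# Packages named in the text; installing from CRAN is an assumption here
pkgs <- c("gtsummary", "gt")

# Install only the packages that are not already available
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}

library(gtsummary)
```

Function names and arguments in {gtsummary} have changed between releases, so the examples in this vignette may need small adjustments depending on the installed version.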
We’ll be using the trial data set throughout this example.
This set contains data from 200 patients who received one of two types of chemotherapy (Drug A or Drug B). The outcomes are tumor response and death.
Each variable in the data frame has been assigned an attribute label (e.g. attr(trial$trt, "label") == "Chemotherapy Treatment") with the labelled package, which we highly recommend using. These labels are displayed in the {gtsummary} output table by default. Using {gtsummary} on a data frame without labels will simply print variable names, or labels can be added later with the label argument.
trt  Chemotherapy Treatment
age  Age
marker  Marker Level (ng/mL)
stage  T Stage
grade  Grade
response  Tumor Response
death  Patient Died
ttdeath  Years from Treatment to Death/Censor
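To make the idea of a label attribute concrete, here is a small base R sketch on a made-up data frame (the data frame df and its values are hypothetical). The {labelled} package offers var_label() for the same purpose; plain attr() is used here so the example has no dependencies.

```r
# Hypothetical data frame, for illustration only
df <- data.frame(
  age = c(23, 9, 31),
  trt = c("Drug A", "Drug B", "Drug A")
)

# Attach a "label" attribute to each column
attr(df$age, "label") <- "Age"
attr(df$trt, "label") <- "Chemotherapy Treatment"

# Read the labels back, returning NA for any column without one
labels <- vapply(
  df,
  function(col) {
    lbl <- attr(col, "label")
    if (is.null(lbl)) NA_character_ else lbl
  },
  character(1)
)

print(labels)
```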
Our example dataset has a mix of continuous, dichotomous (0/1), and categorical variables, some with missing data (NA).
head(trial)
#> # A tibble: 6 x 8
#> trt age marker stage grade response death ttdeath
#> <chr> <dbl> <dbl> <fct> <fct> <int> <int> <dbl>
#> 1 Drug A 23 0.16 T1 II 0 0 24
#> 2 Drug B 9 1.11 T2 I 1 0 24
#> 3 Drug A 31 0.277 T1 II 0 0 24
#> 4 Drug A NA 2.07 T3 III 1 1 17.6
#> 5 Drug A 51 2.77 T4 III 1 1 16.4
#> 6 Drug B 39 0.613 T4 I 0 1 15.6
For brevity in the tutorial, let’s keep a subset of the variables from the trial data set.
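The subsetting code is not shown in this section. Judging from the tables that follow, which display only treatment, marker level, and T stage, a step along these lines would create trial2 (the exact column choice is an inference from those tables):

```r
library(gtsummary)
library(dplyr)

# Keep the three columns that appear in the tables below (inferred, not confirmed)
trial2 <- trial %>%
  select(trt, marker, stage)

head(trial2)
```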
The default output from tbl_summary() is meant to be publication ready.
Let’s start by creating a table of summary statistics from the trial data set. The tbl_summary() function requires only a data frame as input, and returns descriptive statistics for each column in the data frame.
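A call of roughly this shape would produce the basic table that follows. The definition of trial2 is repeated so the snippet runs on its own, and its column choice is an assumption based on the variables shown in the output.

```r
library(gtsummary)
library(dplyr)

# Assumed subset of the trial data (columns inferred from the printed table)
trial2 <- trial %>% select(trt, marker, stage)

# The data frame is the only required input
tbl_basic <- trial2 %>% tbl_summary()

tbl_basic
```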
Characteristic  N = 200^{1} 

Chemotherapy Treatment  
Drug A  98 (49%) 
Drug B  102 (51%) 
Marker Level (ng/mL)  0.64 (0.22, 1.39) 
Unknown  10 
T Stage  
T1  53 (26%) 
T2  54 (27%) 
T3  43 (22%) 
T4  50 (25%) 
^{1} Statistics presented: n (%); median (IQR)

Note the sensible defaults with this basic usage (that can be customized later):
Variable types are automatically detected so that appropriate descriptive statistics are calculated.
Label attributes from the dataset are automatically printed.
Missing values are listed as “Unknown” in the table.
Variable levels are indented and footnotes are added if printed using {gt}. (The table can alternatively be printed using knitr::kable().)
This is a great basic table, but for this study data the summary statistics should be split by treatment group, which can be done with the by = argument. To compare two or more groups, add add_p() to the function call; it detects the variable type and uses an appropriate test.
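The split-and-compare step described above can be sketched as follows. As before, trial2 is an assumed three-column subset of trial, and depending on the installed {gtsummary} version add_p() may ask for additional packages to run the tests.

```r
library(gtsummary)
library(dplyr)

# Assumed subset of the trial data (columns inferred from the printed table)
trial2 <- trial %>% select(trt, marker, stage)

tbl_by_trt <- trial2 %>%
  tbl_summary(by = trt) %>%  # one column of statistics per treatment group
  add_p()                    # test chosen automatically by variable type

tbl_by_trt
```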
Characteristic  Drug A, N = 98^{1}  Drug B, N = 102^{1}  p-value^{2} 

Marker Level (ng/mL)  0.84 (0.24, 1.57)  0.52 (0.19, 1.20)  0.085 
Unknown  6  4  
T Stage  0.9  
T1  28 (29%)  25 (25%)  
T2  25 (26%)  29 (28%)  
T3  22 (22%)  21 (21%)  
T4  23 (23%)  27 (26%)  
^{1} Statistics presented: median (IQR); n (%)
^{2} Statistical tests performed: Wilcoxon rank-sum test; chi-square test of independence

There are four primary ways to customize the output of the summary table.
tbl_summary() function input arguments
add_*() functions
{gtsummary} functions that modify and format the summary table
{gt} package functions
tbl_summary() function arguments
The tbl_summary() function includes many input options for modifying the appearance.
label  specify the variable labels printed in table
type  specify the variable type (e.g. continuous, categorical, etc.)
statistic  change the summary statistics presented
digits  number of digits the summary statistics will be rounded to
missing  whether to display a row with the number of missing observations
sort  change the sorting of categorical levels by frequency
percent  print column, row, or cell percentages
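As an illustration of a few of these arguments together, the sketch below hides the missing-data rows, orders the T stage levels by frequency, and prints row percentages. The specific values chosen are only examples, and trial2 is again an assumed subset of trial.

```r
library(gtsummary)
library(dplyr)

# Assumed subset of the trial data (columns inferred from the printed tables)
trial2 <- trial %>% select(trt, marker, stage)

tbl_args <- trial2 %>%
  tbl_summary(
    by = trt,
    missing = "no",                       # drop the "Unknown" rows
    sort = list(stage ~ "frequency"),     # most common stage first
    percent = "row"                       # percentages sum across each row
  )

tbl_args
```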
The {gtsummary} package has built-in functions for adding to results from tbl_summary(). The following functions add columns and/or information to the summary table.
add_p()  add p-values to the output comparing values across groups
add_overall()  add a column with overall summary statistics
add_n()  add a column with N (or N missing) for each variable
add_stat_label()  add a column showing a label for the summary statistics shown in each row
add_q()  add a column of q-values to control for multiple comparisons
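These functions are designed to be chained. A minimal sketch that stacks several of them onto one table is shown below; the order of the calls is one workable choice rather than a requirement stated in the text, and trial2 is an assumed subset of trial.

```r
library(gtsummary)
library(dplyr)

# Assumed subset of the trial data (columns inferred from the printed tables)
trial2 <- trial %>% select(trt, marker, stage)

tbl_added <- trial2 %>%
  tbl_summary(by = trt) %>%
  add_p() %>%        # p-values comparing the treatment groups
  add_overall() %>%  # column of statistics for all patients combined
  add_n() %>%        # column with the number of non-missing observations
  add_q()            # q-values computed from the p-values above

tbl_added
```

Note that add_q() adjusts existing p-values, so it has to come after add_p().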
The {gtsummary} package comes with functions specifically made to modify and format summary tables.
modify_header()  relabel columns in summary table
bold_labels()  bold variable labels
bold_levels()  bold variable levels
italicize_labels()  italicize variable labels
italicize_levels()  italicize variable levels
bold_p()  bold significant p-values
The {gt} package is packed with many great functions for modifying table output—too many to list here. Review the package’s website for a full listing. https://gt.rstudio.com/index.html
To use the {gt} package functions with {gtsummary} tables, the summary table must first be converted into a gt object. To this end, use the as_gt() function after modifications have been completed with {gtsummary} functions.
The code below calculates the standard table with summary statistics split by treatment, with the modifications described in the code comments.
trial2 %>%
  # build base summary table
  tbl_summary(
    # split table by treatment variable
    by = trt,
    # change variable labels
    label = list(marker ~ "Marker, ng/mL",
                 stage ~ "Clinical T Stage"),
    # change statistics printed in table
    statistic = list(all_continuous() ~ "{mean} ({sd})",
                     all_categorical() ~ "{n} / {N} ({p}%)"),
    digits = list("marker" ~ c(1, 2))
  ) %>%
  # add p-values, report t-test, round large p-values to two decimal places
  add_p(test = list(marker ~ "t.test"),
        pvalue_fun = function(x) style_pvalue(x, digits = 2)) %>%
  # add statistic labels
  add_stat_label() %>%
  # bold variable labels, italicize levels
  bold_labels() %>%
  italicize_levels() %>%
  # bold p-values under a given threshold (default is 0.05)
  bold_p(t = 0.2) %>%
  # include percent in headers
  modify_header(stat_by = "**{level}**, N = {n} ({style_percent(p, symbol = TRUE)})")
Characteristic  Drug A, N = 98 (49%)  Drug B, N = 102 (51%)  p-value^{1} 

Marker, ng/mL, mean (SD)  1.0 (0.89)  0.8 (0.83)  0.12 
Unknown  6  4  
Clinical T Stage, n / N (%)  0.87  
T1  28 / 98 (29%)  25 / 102 (25%)  
T2  25 / 98 (26%)  29 / 102 (28%)  
T3  22 / 98 (22%)  21 / 102 (21%)  
T4  23 / 98 (23%)  27 / 102 (26%)  
^{1} Statistical tests performed: t-test; chi-square test of independence

Each of the modification functions has additional options outlined in its help file.
There is flexibility in how you select variables for {gtsummary} arguments, which allows for many customization opportunities! For example, if you want to show age and the marker levels to one decimal place in tbl_summary(), you can pass digits = c(age, marker) ~ 1. The selecting input is flexible, and you may also pass quoted column names.
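A short sketch of the two selection styles just mentioned, bare names and quoted names, is shown below. The choice of columns kept from trial is only for illustration.

```r
library(gtsummary)
library(dplyr)

# Small illustrative subset of the trial data
trial_small <- trial %>% select(age, marker, stage)

# Bare column names on the left-hand side of the formula
tbl_bare <- trial_small %>%
  tbl_summary(digits = c(age, marker) ~ 1)

# The same selection written with quoted column names
tbl_quoted <- trial_small %>%
  tbl_summary(digits = c("age", "marker") ~ 1)

tbl_bare
```

Both calls are expected to round the age and marker statistics to one decimal place.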
Going beyond typing out specific variables in your dataset, you can use:
All {tidyselect} helpers available throughout the tidyverse, such as starts_with(), contains(), and everything() (i.e. anything you can use with the dplyr::select() function can be used with {gtsummary}).
Additional {gtsummary} selectors that are included in the package to supplement tidyselect functions.
Summary type: There are three summary types in {gtsummary}, and you may use the type to select columns. This is useful, for example, when you wish to report the mean and standard deviation for all continuous variables: statistic = all_continuous() ~ "{mean} ({sd})".
Vector class or type: Select columns based on their class or type.
In the example below, we report the mean and standard deviation for continuous variables, and percent for all categorical. We’ll report t-tests rather than the Wilcoxon rank-sum test for continuous variables, and report Fisher’s exact test for response.
Note that dichotomous variables are, by default, included with all_categorical(). Use all_categorical(dichotomous = FALSE) to exclude dichotomous variables.
trial %>%
  select(trt, response, age, stage, marker, grade) %>%
  tbl_summary(
    by = trt,
    type = list(c(response, grade) ~ "categorical"), # select by variables in c()
    statistic = list(all_continuous() ~ "{mean} ({sd})",
                     all_categorical() ~ "{p}%") # select by summary type
  ) %>%
  add_p(test = list(contains("response") ~ "fisher.test", # select using functions in tidyselect
                    all_continuous() ~ "t.test"))
Characteristic  Drug A, N = 98^{1}  Drug B, N = 102^{1}  p-value^{2} 

Tumor Response  0.5  
0  71%  66%  
1  29%  34%  
Unknown  3  4  
Age  47 (15)  47 (14)  0.8 
Unknown  7  4  
T Stage  0.9  
T1  29%  25%  
T2  26%  28%  
T3  22%  21%  
T4  23%  26%  
Marker Level (ng/mL)  1.02 (0.89)  0.82 (0.83)  0.12 
Unknown  6  4  
Grade  0.9  
I  36%  32%  
II  33%  35%  
III  32%  32%  
^{1} Statistics presented: %; mean (SD)
^{2} Statistical tests performed: Fisher's exact test; t-test; chi-square test of independence

When you print output from the tbl_summary() function to the R console or in an R Markdown document, default printing functions are called in the background: print.tbl_summary() and knit_print.tbl_summary(). The true output from tbl_summary() is a named list, but when you print the object, a formatted version of .$table_body is displayed. All formatting and modifications are made using the {gt} package.
tbl_summary(trial2) %>% names()
#> [1] "table_body" "table_header" "meta_data" "inputs" "N"
#> [6] "call_list"
These are the additional data stored in the tbl_summary() output list.
table_body  data frame with summary statistics
meta_data  data frame with one row per variable, with data about each
by, df_by  the by variable name, and a data frame with information about the by variable
call_list  named list of each function called on the `tbl_summary` object
inputs  inputs from the `tbl_summary()` function call
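Since the object is an ordinary named list, its pieces can be pulled out with the usual list accessors. The sketch below looks only at table_body and inputs; other element names listed above may differ between {gtsummary} versions. As elsewhere, trial2 is an assumed subset of trial.

```r
library(gtsummary)
library(dplyr)

# Assumed subset of the trial data (columns inferred from the printed tables)
trial2 <- trial %>% select(trt, marker, stage)

tbl <- tbl_summary(trial2)

# The data frame behind the printed table
body <- tbl$table_body
head(body)

# Names of the arguments recorded from the tbl_summary() call
input_names <- names(tbl$inputs)
print(input_names)
```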
When a {gtsummary} object is printed, it is first converted to a {gt} object with as_gt() via a sequence of {gt} commands executed on x$table_body. Here’s an example of the first few calls saved with tbl_summary():
tbl_summary(trial2) %>% as_gt(return_calls = TRUE) %>% head(n = 4)
#> $gt
#> gt::gt(data = x$table_body)
#>
#> $fmt_missing
#> gt::fmt_missing(columns = gt::everything(), missing_text = "")
#>
#> $fmt_missing_emdash
#> list()
#>
#> $cols_align
#> $cols_align[[1]]
#> gt::cols_align(columns = gt::vars(variable, row_type, stat_0),
#> align = "center")
#>
#> $cols_align[[2]]
#> gt::cols_align(columns = gt::vars(label), align = "left")
The {gt} functions are called in the order they appear, always beginning with the gt::gt() function.
If the user does not want a specific {gt} function to run (i.e. would like to change default printing), any {gt} call can be excluded in the as_gt() function. In the example below, the default footnote will be excluded from the output.
After the as_gt() function is run, additional formatting may be added to the table using {gt} formatting functions. In the example below, a spanning header for the by= variable is included with the {gt} function tab_spanner().
tbl_summary(trial2, by = trt) %>%
  # the minus sign excludes the footnote call from the {gt} calls that are run
  as_gt(include = -tab_footnote) %>%
  gt::tab_spanner(label = gt::md("**Treatment Group**"),
                  columns = gt::starts_with("stat_"))
Characteristic  Treatment Group  

Drug A, N = 98  Drug B, N = 102  
Marker Level (ng/mL)  0.84 (0.24, 1.57)  0.52 (0.19, 1.20) 
Unknown  6  4 
T Stage  
T1  28 (29%)  25 (25%) 
T2  25 (26%)  29 (28%) 
T3  22 (22%)  21 (21%) 
T4  23 (23%)  27 (26%) 
The {gtsummary} tbl_summary() function and the related functions have sensible defaults for rounding and presenting results. If you would like to change the defaults, however, there are a few options. The default options can be changed using the {gtsummary} themes function set_gtsummary_theme(). The package includes pre-specified themes, and you can also create your own. Themes can control baseline behavior, for example, how p-values and percentages are rounded, which statistics are presented in tbl_summary(), default statistical tests in add_p(), etc.
For details on creating a theme and setting personal defaults, visit the themes vignette.
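As a small illustration of a pre-specified theme, the sketch below applies the journal theme helper theme_gtsummary_journal() and then restores the defaults with reset_gtsummary_theme(). Whether these helpers, and the "jama" option in particular, are available depends on the installed {gtsummary} version, so treat this as an assumption to verify against the themes vignette.

```r
library(gtsummary)
library(dplyr)

# Apply a pre-specified theme; it affects tables created from here on
theme_gtsummary_journal(journal = "jama")

tbl_themed <- trial %>%
  select(trt, age, grade) %>%
  tbl_summary(by = trt)

tbl_themed

# Restore the package defaults so later tables are unaffected
reset_gtsummary_theme()
```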
The {gtsummary} package also supports survey data (objects created with the {survey} package) via the tbl_svysummary() function. The syntax of tbl_svysummary() is nearly identical to that of tbl_summary(), so the examples above apply to survey summaries as well.
To begin, we’ll install the {survey} package and load the apiclus1 data set, which has a complex survey design.
First, we convert the data frame to a survey object, registering the ID and weighting columns, and setting the finite population correction column.
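The conversion step can be sketched as follows, assuming the {survey} package is installed. The column names dnum (cluster ID), pw (sampling weight), and fpc (finite population correction) come from the apiclus1 data set shipped with {survey}; the object name svy_apiclus1 matches the one used in the code further down.

```r
# Load the api data sets bundled with {survey}; this creates apiclus1
data(api, package = "survey")

# Register the cluster ID, weights, and finite population correction
svy_apiclus1 <- survey::svydesign(
  id = ~dnum,
  weights = ~pw,
  data = apiclus1,
  fpc = ~fpc
)

print(svy_apiclus1)
```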
After creating the survey object, we can now summarize it similarly to a standard data frame using tbl_svysummary(). Like tbl_summary(), tbl_svysummary() can accept a by variable and works with the add_p() and add_overall() functions.
One thing to note is that, unlike tbl_summary(), it is not possible to pass custom functions to the statistic argument of tbl_svysummary(). You must use one of the pre-defined summary statistic functions (e.g. {mean}, {median}), which leverage functions from the {survey} package to calculate the correct survey statistics.
svy_apiclus1 %>%
  tbl_svysummary(
    # stratify summary statistics by the "both" column
    by = both,
    # summarize a subset of the columns
    include = c(cname, api00, api99, both),
    # adding labels to table
    label = list(
      cname ~ "County",
      api00 ~ "API in 2000",
      api99 ~ "API in 1999"
    )
  ) %>%
  # comparing values by "both" column
  add_p() %>%
  add_overall() %>%
  # adding spanning header
  modify_spanning_header(starts_with("stat_") ~ "**Met Both Targets**")
Characteristic  Met Both Targets  p-value^{2}  

Overall, N = 6,194  No, N = 1,692^{1}  Yes, N = 4,502^{1}  
County  0.13  
Alameda  372 (6.0%)  68 (4.0%)  305 (6.8%)  
Fresno  135 (2.2%)  0 (0%)  135 (3.0%)  
Kern  68 (1.1%)  34 (2.0%)  34 (0.8%)  
Los Angeles  508 (8.2%)  135 (8.0%)  372 (8.3%)  
Mendocino  135 (2.2%)  135 (8.0%)  0 (0%)  
Merced  135 (2.2%)  34 (2.0%)  102 (2.3%)  
Orange  542 (8.7%)  102 (6.0%)  440 (9.8%)  
Plumas  305 (4.9%)  169 (10%)  135 (3.0%)  
San Diego  1,862 (30%)  508 (30%)  1,354 (30%)  
San Joaquin  1,252 (20%)  372 (22%)  880 (20%)  
Santa Clara  880 (14%)  135 (8.0%)  745 (17%)  
API in 2000  652 (552, 718)  631 (556, 710)  654 (551, 722)  0.4 
API in 1999  615 (512, 691)  632 (548, 698)  611 (497, 686)  0.2 
^{1} Statistics presented: n (%); median (IQR)
^{2} Statistical tests performed: chi-squared test with Rao & Scott's second-order correction; Wilcoxon rank-sum test for complex survey samples

tbl_svysummary() can also handle weighted survey data where each row represents several individuals:
d <- dplyr::as_tibble(Titanic)
head(d, n = 10)
#> # A tibble: 10 x 5
#> Class Sex Age Survived n
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 1st Male Child No 0
#> 2 2nd Male Child No 0
#> 3 3rd Male Child No 35
#> 4 Crew Male Child No 0
#> 5 1st Female Child No 0
#> 6 2nd Female Child No 0
#> 7 3rd Female Child No 17
#> 8 Crew Female Child No 0
#> 9 1st Male Adult No 118
#> 10 2nd Male Adult No 154
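The call that produces the summary below is not shown in this section. A sketch of how it could look, using the count column n as the weight, is given here; the include argument is spelled out as an assumption so that only the four descriptive columns are summarized.

```r
library(gtsummary)

d <- dplyr::as_tibble(Titanic)

# Each row stands for n passengers, so n is used as the weight
svy_titanic <- survey::svydesign(ids = ~1, data = d, weights = ~n)

tbl_titanic <- svy_titanic %>%
  tbl_svysummary(include = c(Class, Sex, Age, Survived))

tbl_titanic
```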
Characteristic  N = 2,201^{1} 

Class  
1st  325 (15%) 
2nd  285 (13%) 
3rd  706 (32%) 
Crew  885 (40%) 
Sex  
Female  470 (21%) 
Male  1,731 (79%) 
Age  
Adult  2,092 (95%) 
Child  109 (5.0%) 
Survived  711 (32%) 
^{1} Statistics presented: n (%)

In addition to tbl_summary(), you can also use tbl_cross() to quickly and easily compare two categorical variables in your data. tbl_cross() is a wrapper for tbl_summary() that:
Adds a spanning header via gt::tab_spanner() to your table with the name or label of your comparison variable.
Uses "{n} ({p}%)" as the default statistic argument with percent = "cell" (customizable through the statistic and percent arguments).
Adds row and column totals (customizable through the margin argument).
Handles missing data in the row and column variables (customizable through the missing argument).
Characteristic  Chemotherapy Treatment  Total  p-value^{1}  

Drug A  Drug B  
T Stage  0.9  
T1  28 (14%)  25 (12%)  53 (26%)  
T2  25 (12%)  29 (14%)  54 (27%)  
T3  22 (11%)  21 (10%)  43 (22%)  
T4  23 (12%)  27 (14%)  50 (25%)  
Total  98 (49%)  102 (51%)  200 (100%)  
^{1} chi-square test of independence
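A cross table like the one above could be produced along these lines. The row and column variables are read off the printed table, and percent = "cell" is written out explicitly even though the text describes it as the default.

```r
library(gtsummary)

tbl_stage_by_trt <- trial %>%
  tbl_cross(
    row = stage,       # T Stage down the rows
    col = trt,         # Chemotherapy Treatment across the columns
    percent = "cell"   # each cell as a share of all patients
  ) %>%
  add_p()              # test of association between the two variables

tbl_stage_by_trt
```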
