Introduction to Crosstable

Dan Chaltiel

2021-03-05

Crosstable

Crosstable is a package centered on a single function, crosstable, which easily computes descriptive statistics on datasets.

Before starting this vignette, here are a few points:

Dataset: modified mtcars

First, since crosstable relies on the label attribute, let’s start by building a labelled dataset. In this vignette, we will use a modified version of the mtcars dataset, which comprises 11 aspects of design and performance for 32 automobiles. We modify it to add textual categories, turn some numeric variables into factors, and add labels from a table using import_labels.

For convenience, this dataset is already shipped with crosstable, so there is no need to re-create it for your own tests.

library(crosstable)
library(dplyr)
mtcars_labels = read.table(header=TRUE, text="
  name label
  mpg  'Miles/(US) gallon'
  cyl  'Number of cylinders'
  disp 'Displacement (cu.in.)'
  hp   'Gross horsepower'
  drat 'Rear axle ratio'
  wt   'Weight (1000 lbs)'
  qsec '1/4 mile time'
  vs   'Engine'
  am   'Transmission'
  gear 'Number of forward gears'
  carb 'Number of carburetors'
")
mtcars2 = mtcars %>% 
  mutate(vs=ifelse(vs==0, "vshaped", "straight"),
         am=ifelse(am==0, "auto", "manual")) %>% 
  mutate_at(c("cyl", "gear"), factor) %>% 
  import_labels(mtcars_labels, name_from="name", label_from="label") 
#I also could have used `Hmisc::label` or `expss::apply_labels` to add labels
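
As a rough illustration of what a label is, here is a sketch that attaches labels by hand using base R only. It assumes that a label is simply stored in the "label" attribute of each column (the convention followed by Hmisc and expss). The small data frame df and the labels vector are made up for this example.

```r
# Toy data frame (hypothetical values, for illustration only)
df = data.frame(mpg = c(21, 22.8), cyl = c(6, 4))

# One label per column, stored in the "label" attribute
labels = c(mpg = "Miles/(US) gallon", cyl = "Number of cylinders")
for (nm in names(labels)) {
  attr(df[[nm]], "label") = labels[[nm]]
}

# Read the labels back as a named character vector
sapply(df, attr, "label")
```

In practice, import_labels does this for every row of the label table at once.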

Here is the result when we apply crosstable on it:

crosstable(mtcars2) %>% head(11)
#>     .id                 label   variable               value
#> 1   mpg     Miles/(US) gallon  Min / Max         10.4 / 33.9
#> 2   mpg     Miles/(US) gallon  Med [IQR]    19.2 [15.4;22.8]
#> 3   mpg     Miles/(US) gallon Mean (std)          20.1 (6.0)
#> 4   mpg     Miles/(US) gallon     N (NA)              32 (0)
#> 5   cyl   Number of cylinders          4         11 (34.38%)
#> 6   cyl   Number of cylinders          6          7 (21.88%)
#> 7   cyl   Number of cylinders          8         14 (43.75%)
#> 8  disp Displacement (cu.in.)  Min / Max        71.1 / 472.0
#> 9  disp Displacement (cu.in.)  Med [IQR] 196.3 [120.8;326.0]
#> 10 disp Displacement (cu.in.) Mean (std)       230.7 (123.9)
#> 11 disp Displacement (cu.in.)     N (NA)              32 (0)

In this vignette, I only printed the mpg, cyl, and disp variables (the first 11 rows), but all other variables were also described by this call.

By default, numeric variables (like mpg and disp) are described with min/max, median/IQR, mean/sd, and number of observations/missing values, while categorical (factor/character) variables (like cyl) are described with the count and percentage of each level. All of this is fully customizable, as you will see below.
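
To make these default statistics concrete, here is a base R sketch that rebuilds the four mpg rows of the table above. It is only a cross-check, not crosstable's internal code, and the sprintf formats are my own choice to mimic the one-decimal display.

```r
# Reproduce the default numeric summary of mpg with base R only
x = mtcars$mpg
q = quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)

default_summary = c(
  "Min / Max"  = sprintf("%.1f / %.1f", min(x, na.rm = TRUE), max(x, na.rm = TRUE)),
  "Med [IQR]"  = sprintf("%.1f [%.1f;%.1f]", median(x, na.rm = TRUE), q[1], q[2]),
  "Mean (std)" = sprintf("%.1f (%.1f)", mean(x, na.rm = TRUE), sd(x, na.rm = TRUE)),
  "N (NA)"     = sprintf("%d (%d)", sum(!is.na(x)), sum(is.na(x)))
)
default_summary
```

The four strings should match the mpg rows printed by crosstable(mtcars2) above.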

Arguments

This is already pretty handy, but we can easily gain more control with a few arguments.

For instance, if we wanted the statistics of mpg and cyl for both levels of am, we could write this:

crosstable(mtcars2, mpg, cyl, by = am)
#>   .id               label   variable             auto           manual
#> 1 mpg   Miles/(US) gallon  Min / Max      10.4 / 24.4      15.0 / 33.9
#> 2 mpg   Miles/(US) gallon  Med [IQR] 17.3 [14.9;19.2] 22.8 [21.0;30.4]
#> 3 mpg   Miles/(US) gallon Mean (std)       17.1 (3.8)       24.4 (6.2)
#> 4 mpg   Miles/(US) gallon     N (NA)           19 (0)           13 (0)
#> 5 cyl Number of cylinders          4       3 (27.27%)       8 (72.73%)
#> 6 cyl Number of cylinders          6       4 (57.14%)       3 (42.86%)
#> 7 cyl Number of cylinders          8      12 (85.71%)       2 (14.29%)

The by column should be a factor, character, or logical vector. If it is a numeric vector, then only numeric variables can be described, and correlation coefficients will be displayed.

There are many ways to select variables: names, character vectors, tidyselect helpers, formulas… For instance, you could write crosstable(mtcars2, starts_with("d")), crosstable(mtcars2, where(is.numeric)), crosstable(mtcars2, -c(mpg:hp)), or crosstable(mtcars2, mpg~am). This is described in detail in vignette("crosstable-selection").

HTML printing

This might be good enough for a preview, but the console is not the best place to read tables, especially in a vignette like this one.

Fortunately, you can use the as_flextable function to turn any crosstable into an HTML, ready-to-print table, which is automatically displayed in the Viewer if you are using RStudio.

crosstable(mtcars2, mpg, cyl, by=am) %>% 
  as_flextable(keep_id=TRUE)

If you want to share your tables in a MS Word document, you can use David Gohel’s awesome package officer.

Here, I use the keep_id argument to keep the variable name, but in practice you would usually drop it in favor of the label. See vignette("crosstable-officer") for more on as_flextable and on how to integrate crosstables with officer and Rmarkdown.

Descriptive functions

crosstable’s default descriptive function may give too much information. In this case, you can use any set of functions that fits your needs. They will be applied to all numeric variables.

crosstable(mtcars2, am, mpg, wt, by=vs, funs=c(median, mean, sd)) %>% 
  as_flextable(keep_id=TRUE)

You might want to use crosstable’s convenience functions such as meansd, mediqr, minmax, or nna. For additional arguments, you can use the funs_arg argument: crosstable(mtcars3, c(disp, hp, am), by=vs, funs=c(meansd,quantile), funs_arg = list(dig=3, probs=c(0.25,0.75))). Numbers are formatted to have the same number of decimal places. You might want to take a look at ?summaryFunctions and ?format_fixed to customize their behaviour.
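
You can also write your own function. The sketch below defines a hypothetical helper, cv (coefficient of variation), which is not part of crosstable. It assumes that funs accepts a named function and uses that name as the row label, so treat the exact output as something to check on your side.

```r
library(crosstable)

# Hypothetical helper (not part of crosstable): coefficient of variation.
# The `...` absorbs any extra argument forwarded through `funs_arg`.
cv = function(x, ...) {
  sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}

# Assumption: a named function in `funs` is labelled with its name
ct = crosstable(mtcars2, c(mpg, wt), by = vs, funs = c("Coef. of variation" = cv))
ct
```

Accepting `...` in the helper is a defensive choice: it keeps the function usable even when funs_arg carries arguments meant for another function.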

Margins for percentage calculation

When crossing two factor variables, like cyl and am, percentages are calculated by row. You can use the margin argument to change this behavior.

crosstable(mtcars2, am, mpg, by=vs, funs=meansd, margin="row") %>% 
  as_flextable(keep_id=TRUE)
crosstable(mtcars2, am, mpg, by=vs, funs=meansd, margin="column") %>% 
  as_flextable(keep_id=TRUE)
crosstable(mtcars2, am, mpg, by=vs, funs=meansd, margin="none") %>% 
  as_flextable(keep_id=TRUE)
#I also could have used margin="all" here
crosstable(mtcars2, am, mpg, by=vs, funs=meansd, margin=c("column","row","cell")) %>% 
  as_flextable(keep_id=TRUE) 

Of course, if by is not set, margin has no effect and percentages are calculated by column.

Missing values are not taken into account when calculating percentages. You can change this behaviour by using tidyr::replace_na or forcats::fct_explicit_na on your dataset before applying crosstable.
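
If the difference between the margins is not obvious, this base R sketch computes the three kinds of percentages with prop.table on cyl crossed with am. It only illustrates the arithmetic and is not how crosstable is implemented.

```r
# Cross cyl with am, as in crosstable(mtcars2, cyl, by=am)
am = ifelse(mtcars$am == 0, "auto", "manual")
counts = table(cyl = mtcars$cyl, am = am)

row_pct  = prop.table(counts, margin = 1) * 100  # each row sums to 100
col_pct  = prop.table(counts, margin = 2) * 100  # each column sums to 100
cell_pct = prop.table(counts) * 100              # the whole table sums to 100

round(row_pct, 2)
```

The row percentages are the ones shown in the console output of the Arguments section (for instance, 27.27% of 4-cylinder cars are automatic).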

Totals

crosstable can also compute totals and display them in rows and/or columns with the total argument:

crosstable(mtcars2, am, mpg, by=vs, funs=mean, total="row") %>% 
  as_flextable(keep_id=TRUE)
#of course, total="column" only has an effect on categorical variables.
crosstable(mtcars2, am, mpg, by=vs, funs=mean, total="column") %>% 
  as_flextable(keep_id=TRUE)
crosstable(mtcars2, am, mpg, by=vs, funs=mean, total="both") %>% 
  as_flextable(keep_id=TRUE)

Of course, if by is not set, totals will always be calculated by column.

Totals take missing values into account. Therefore, be aware that with showNA="no", totals may be higher than the sum of the values displayed in the table.
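
Here is a small base R sketch of why this happens. I introduce one missing value by hand (a hypothetical NA, the real dataset has none) and compare the sum of the displayed counts with the number of observations.

```r
cyl = factor(mtcars$cyl)
cyl[1] = NA  # hypothetical missing value, introduced for the example

shown = table(cyl, useNA = "no")  # counts displayed when NA are hidden
sum(shown)    # 31: sum of the displayed counts
length(cyl)   # 32: total that still includes the missing value
```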

Tests

It is possible to perform statistical tests automatically.

library(flextable)
crosstable(mtcars2, vs, qsec, by=am, funs=mean, test=TRUE) %>% 
  as_flextable(keep_id=TRUE)

Of course, this should only be done in an exploratory context, as it would cause extensive alpha inflation otherwise.
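
To give an idea of the size of the problem, this sketch computes the probability of getting at least one false positive when running several tests at the 5% level. It assumes the tests are independent, which is a simplification.

```r
alpha = 0.05
n_tests = c(1, 5, 10, 20)

# Probability of at least one false positive among independent tests
fwer = 1 - (1 - alpha)^n_tests
round(setNames(fwer, n_tests), 3)
```

With 20 tests, this probability is already about 64%.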

Tests are chosen depending on the characteristics of the crossed variables (class, size, distribution, …). See ?crosstable_test_args for more details on the test choice algorithm.
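
To illustrate the general idea of choosing a test from the data, here is a deliberately simplified sketch. The choose_test function and its rule are hypothetical and are not crosstable's actual algorithm, which is documented in ?crosstable_test_args.

```r
# Hypothetical, simplified rule (NOT crosstable's actual algorithm):
# use a t-test when both groups look normal, a Wilcoxon test otherwise.
choose_test = function(x, g) {
  g = factor(g)
  stopifnot(nlevels(g) == 2)
  normal = all(tapply(x, g, function(v) shapiro.test(v)$p.value > 0.05))
  if (normal) t.test(x ~ g) else wilcox.test(x ~ g, exact = FALSE)
}

res = choose_test(mtcars$qsec, mtcars$am)
res$method   # name of the test that was selected
res$p.value
```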

Effects

If the by variable has only 2 levels, it is also possible to automatically compute an effect.

library(flextable)
crosstable(mtcars2, vs, qsec, by=am, funs=mean, effect=TRUE) %>% 
  as_flextable(keep_id=TRUE)

The type of effect (method, bootstrap, etc.) is also chosen depending on the characteristics of the crossed variables (class, size, distribution, …). See ?crosstable_effect_args for more details on the effect choice algorithm.
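
As an example of what an effect can look like, the sketch below computes a difference in means with its 95% confidence interval using base R. This is one plausible effect for a numeric variable crossed with a two-level factor; I am not claiming that crosstable reports exactly this one for these data.

```r
# Difference in mean qsec between the two levels of am, with a 95% CI
tt = t.test(qsec ~ am, data = mtcars)

diff_means = unname(tt$estimate[1] - tt$estimate[2])
ci = as.numeric(tt$conf.int)
c(difference = diff_means, lower = ci[1], upper = ci[2])
```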

Miscellaneous

Dates

You can describe Dates and POSIXt variables with crosstable.

mtcars2$x_date = as.Date(mtcars2$hp , origin="2010-01-01") %>% set_label("Date")
mtcars2$x_posix = as.POSIXct(mtcars2$qsec*3600*24 , origin="2010-01-01") %>% set_label("Date+time")
crosstable(mtcars2, x_date, x_posix) %>% 
  as_flextable(keep_id=TRUE)

The format can be set using date_format, and the date unit can be set globally in funs_arg using the date_unit key. The latter can come in handy when two groups have different date units.

crosstable(mtcars2, x_date, x_posix, by=vs, date_format="%d/%m/%Y", funs_arg = list(date_unit="days")) %>% 
  as_flextable(keep_id=TRUE)
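
If you are not familiar with date formats and units, this base R sketch shows both on a single value. It assumes that date_format takes the same kind of format string as base format(), which is what the "%d/%m/%Y" value above suggests.

```r
# hp of the first car (110) read as a number of days since 2010-01-01
d = as.Date(110, origin = "2010-01-01")
format(d)               # ISO format: "2010-04-21"
format(d, "%d/%m/%Y")   # "21/04/2010"

# The same duration expressed in two different units
span = as.Date("2010-04-21") - as.Date("2010-01-01")
as.numeric(span, units = "days")   # 110
as.numeric(span, units = "weeks")
```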

Correlations

If by refers to a numeric variable, correlation coefficients will be calculated.

library(survival)
crosstable(mtcars2, where(is.numeric), by=mpg) %>% 
  as_flextable(keep_id=TRUE)

Note that you can use the cor_method argument to choose which coefficient to calculate ("pearson", "kendall", or "spearman").
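
For reference, the three coefficients can be compared on a single pair of variables with base R's cor, as in this small sketch (mpg against wt):

```r
methods = c("pearson", "kendall", "spearman")

# Correlation between mpg and wt under each method
sapply(methods, function(m) cor(mtcars$mpg, mtcars$wt, method = m))
```

The three values differ because Pearson measures linear association while Kendall and Spearman are rank-based.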

Survival data

crosstable can also describe survival data at specific times:

library(survival)
aml$surv = Surv(aml$time, aml$status)
crosstable(aml, surv, by=x, times=c(0,15,30,150), followup=TRUE) %>% 
  as_flextable(keep_id=TRUE)

Using the formula interface, you can also declare the Surv object directly inside the crosstable function: crosstable(aml, Surv(time, status) ~ x).
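
If you want to cross-check these numbers, the sketch below reads Kaplan-Meier estimates at the same time points directly with the survival package. It is an independent computation, so the layout differs from crosstable's output.

```r
library(survival)

# Kaplan-Meier estimates by group, read at the same time points
fit = survfit(Surv(time, status) ~ x, data = aml)
s = summary(fit, times = c(0, 15, 30, 150), extend = TRUE)

data.frame(group = s$strata, time = s$time, survival = round(s$surv, 2),
           at_risk = s$n.risk)
```

The extend=TRUE argument is needed here because 150 lies beyond the last observed time of one of the groups.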

Acknowledgement

crosstable is a rewrite of the awesome biostat2 package written by David Hajage. The user interface is quite different but the concept is the same.

Thanks David!