COM(P)ADRE databases contain thousands of matrix population models. If we want to derive traits from a large set of these matrices, we’ll need to use either loops or vectorisation.
Vectorising means applying a function to each element of a vector.
Here vector is defined broadly — it could be a sequence of character
strings, a column of a data.frame
, a list
of
matrices, etc. Vectorised code generally runs faster than loops, and
many R users find that vectorised code is easier to write and
understand.
We’ll start by loading a few packages and a dataset that we’ll be
using throughout this vignette. The dataset Compadre
is a
subset of a recent COMPADRE release that’s built into
Rcompadre
.
To understand vectorisation, we first need a vector. For this
purpose, we’ll extract a list of A matrices from the
mat
column of Compadre
, and add this list of
matrices to Compadre
as a new column.
This new column, Compadre$matA
, is both a vector and a
list.
is.vector(Compadre$matA) # it really is a vector
#> [1] TRUE
is.list(Compadre$matA) # and also a list
#> [1] TRUE
length(Compadre$matA) # with 150 matrices
#> [1] 150
Compadre$matA[1:3] # here are the first three
#> [[1]]
#> A1 A2 A3 A4
#> A1 0.61 0.38 0.30 1.16
#> A2 0.23 0.50 0.31 0.32
#> A3 0.00 0.18 0.54 0.41
#> A4 0.00 0.05 0.08 0.50
#>
#> [[2]]
#> A1 A2 A3 A4 A5 A6
#> A1 0.8430 0.000 0.000 176.0600 635.7200 1601.2500
#> A2 0.0003 0.000 0.000 0.0584 0.2109 0.5313
#> A3 0.0000 0.476 0.132 0.1430 0.0000 0.0560
#> A4 0.0000 0.095 0.105 0.2860 0.0000 0.0000
#> A5 0.0000 0.167 0.474 0.4290 0.8000 0.0000
#> A6 0.0000 0.143 0.237 0.0710 0.2000 0.8890
#>
#> [[3]]
#> A1 A2 A3
#> A1 0.017 0.000 1.142
#> A2 0.655 0.902 0.009
#> A3 0.000 0.001 0.988
Let’s say we want to calculate the dimension of every matrix in
Compadre$matA
. In fact, Compadre
already has
this data in the column ‘MatrixDimension’, but let’s say we want to
double-check it. We’ll use the function nrow()
, and assume
that the number of rows and columns are equal. But we can’t use
nrow()
directly on Compadre$matA
, because the
function nrow()
isn’t vectorised. It can only take one
object at a time.
We could do something like this…
Compadre$dim <- numeric(nrow(Compadre)) # create empty vector to store output
Compadre$dim[1] <- nrow(Compadre$matA[[1]]) # nrow matrix 1
Compadre$dim[2] <- nrow(Compadre$matA[[2]]) # nrow matrix 2
Compadre$dim[3] <- nrow(Compadre$matA[[3]]) # nrow matrix 3
# ... all the way to 150
But that’s not very efficient for 150 matrices.
A loop would be much more efficient here.
An even nicer approach is to vectorise the function
nrow()
over the vector Compadre$matA
using
sapply()
.
sapply()
applies the function specified in the 2nd
argument (nrow()
) to every element of the vector in the
first argument (Compadre$matA
), and returns a vector of the
results. The advantage of sapply()
over the loop is that we
don’t need to pre-define an object to store the results.
We can also vectorise with a custom function. Let’s say we want to know, for every matrix, whether there are stages with no transitions (i.e. any column sums equal to zero). Here’s a vectorised approach.
# function to determine whether matrix 'mat' has any stages with no transitions
NullStages <- function(mat) any(colSums(mat) == 0)
# apply function to every element of A
Compadre$null_stages <- sapply(Compadre$matA, NullStages)
The key to vectorising is to make sure the function works on individual elements of the vector.
Note that, in the example above, it wasn’t necessary to create the
column Compadre$matA
before vectorising over the
A matrices. We could have simply used the
matA()
accessor within sapply()
.
cdb_unnest()
to avoid accessorsThat said, using accessor funtions can get tedious. Rather than
constantly using accessor functions to extract components of the
mat
column, we could use the function
cdb_unnest()
to extract separate columns for all matrix
components at the start of our analysis.
# create new columns matA, matU, matF, matC, MatrixClassAuthor, etc..
CompUnnest <- cdb_unnest(Compadre)
Then we can refer to any component using $
, e.g.
# apply NullStages to every matA
CompUnnest$null_stages <- sapply(CompUnnest$matA, NullStages)
# count number of dormant stages in every MatrixClassOrganized
NumberDormant <- function(stages) length(which(stages == "dorm"))
CompUnnest$n_dormant <- sapply(CompUnnest$MatrixClassOrganized, NumberDormant)
vapply()
is similar to sapply()
, except
that the output type is specified as an argument.
sapply(CompUnnest$matA[1:6], nrow)
#> [1] 4 6 3 5 5 3
vapply(CompUnnest$matA[1:6], nrow, numeric(1)) # must specify output type
#> [1] 4 6 3 5 5 3
lapply()
always returns a list
, so it’s
useful if our output is more complex than a single value for each input.
For example, we could use lapply
to calculate vectors of
stage-specific survival (column sums of matU
).
lapply(CompUnnest$matU[1:4], function(m) colSums(m))
#> [[1]]
#> U1 U2 U3 U4
#> 0.84 0.73 0.93 0.91
#>
#> [[2]]
#> U1 U2 U3 U4 U5 U6
#> 0.8433 0.8810 0.9480 0.9290 1.0000 0.9450
#>
#> [[3]]
#> U1 U2 U3
#> 0.672 0.903 0.997
#>
#> [[4]]
#> U1 U2 U3 U4 U5
#> 0.83820 0.82850 0.75345 0.86665 0.05000
mapply()
is for vectorising over multiple arguments. For
example, the lifeExpectancy()
function below (taken from
the package Rage)
calculates life expectancy given two arguments: a U
matrix, and an integer indicator for the stage class reflecting the
‘start of life’. The start of life is often defined as the first
‘active’ stage class (i.e. not propagule or dormant), the index of which
will vary from row to row.
# function to calculate life expectancy
lifeExpectancy <- function(matU, startLife) {
N <- solve(diag(nrow(matU)) - matU)
return(colSums(N)[startLife])
}
# get index of first active stage class with mpm_first_active()
CompUnnest$start_life <- mpm_first_active(CompUnnest)
# vectorise lifeExpectancy over matU and start_life
mapply(
lifeExpectancy, # function
CompUnnest$matU[1:6], # first argument to vectorise over
CompUnnest$start_life[1:6]
) # second argument to vectorise over
#> U1 U2 U1 U1 U1 U1
#> 5.727060 22.194872 8.439966 3.009829 10.877215 18.181818
Just like loops, vectorisation fails if the function being vectorised
throws an error on any element of the vector. Here’s an
example. The eigs()
function from popdemo calculates
the expected population growth rate given a projection matrix. It’ll
work on most of the A matrices in
Compadre
, but fails on matrices that contain missing values
(NA
).
# works for a single matrix
popdemo::eigs(CompUnnest$matA[[1]], what = "lambda")
#> [1] 1.058029
# but fails when applied to all matrices because a few have missing values
CompUnnest$lambda <- sapply(CompUnnest$matA, popdemo::eigs, what = "lambda")
#> Error in eigen(A): infinite or missing values in 'x'
There are two basic approaches to overcoming this:
If eigs()
doesn’t work on matrices with missing values,
one approach is to simply remove matrices with missing values. The
cdb_flag()
function is an easy way to check for missing
values, and other common issues that might hinder our analyses.
# add column 'check_NA_A', indicating whether matA contains missing values (T/F)
CompFlag <- cdb_flag(CompUnnest, checks = "check_NA_A")
# remove rows where matA contains missing values
CompSub <- subset(CompFlag, check_NA_A == FALSE)
# apply lambda() to every remaining matA
CompSub$lambda <- sapply(matA(CompSub), popdemo::eigs, what = "lambda")
#> Warning in FUN(X[[i]], ...): More than one eigenvalues have equal absolute
#> magnitude
Alternatively, if we want to avoid subsetting, we could pre-define a
placeholder column for the result, and then selectively apply the
eigs()
function to only those matrices that don’t contain
missing values.
# identify rows with no missing values in matA
no_missing <- which(CompFlag$check_NA_A == FALSE)
# create placeholder column for lambda
CompFlag$lambda <- NA
# apply eigs() to all matA with no missing values
CompFlag$lambda[no_missing] <- sapply(CompFlag$matA[no_missing],
popdemo::eigs,
what = "lambda"
)
#> Warning in FUN(X[[i]], ...): More than one eigenvalues have equal absolute
#> magnitude
Rows where there were missing values in matA
retain the
original placeholder value of NA
.
A second approach is to modify the function we want to vectorise with
so that it can natively handle special cases. For example, we might
modify eigs()
so that it returns NA
if a
matrix contains missing values.
lambdaFn1 <- function(mat) {
# check mat for missing values: if TRUE return NA, else return eigs(mat)
ifelse(anyNA(mat), NA, popdemo::eigs(mat, what = "lambda"))
}
CompUnnest$lambda <- sapply(CompUnnest$matA, lambdaFn1)
#> Warning in popdemo::eigs(mat, what = "lambda"): More than one eigenvalues have
#> equal absolute magnitude
If the special cases are harder to test for, we could use R’s
condition-handling functions like try()
or
tryCatch()
. Here’s an example.
lambdaFn2 <- function(mat) {
# try eigs(mat): if error return NA
tryCatch(eigs(mat, what = "lambda"), error = function(err) NA)
}
CompUnnest$lambda <- sapply(CompUnnest$matA, lambdaFn2)
#> Warning in eigs(mat, what = "lambda"): More than one eigenvalues have equal
#> absolute magnitude
This latter approach requires caution, as we’ll get an
NA
for any error. Some errors might reflect
problems with our data or code that are fixable, in which case an
NA
may be misleading.