packageRank: compute and visualize package download counts and percentiles

features

NOTE: ‘packageRank’ relies on an active internet connection.

getting started

To install ‘packageRank’ from CRAN:

install.packages("packageRank")

To install the latest development version from GitHub:

# You may first need to install 'devtools' via install.packages("devtools").

devtools::install_github("lindbrook/packageRank", build_vignettes = TRUE)

background

The ‘cranlogs’ package computes the raw number of downloads using RStudio’s CRAN mirror. For example, we can see that the ‘HistData’ package was downloaded 51 times on the first day of 2019:

cranlogs::cran_downloads(packages = "HistData", from = "2019-01-01", to = "2019-01-01")
>         date count  package
> 1 2019-01-01    51 HistData

And 787 times in the first week:

cranlogs::cran_downloads(packages = "HistData", from = "2019-01-01", to = "2019-01-07")
>         date count  package
> 1 2019-01-01    51 HistData
> 2 2019-01-02   100 HistData
> 3 2019-01-03   137 HistData
> 4 2019-01-04   113 HistData
> 5 2019-01-05    85 HistData
> 6 2019-01-06    96 HistData
> 7 2019-01-07   205 HistData
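As a quick check of the weekly total, the daily counts printed above can be summed directly. This is a minimal sketch that re-enters the counts by hand instead of querying ‘cranlogs’ (so it needs no internet connection); the names daily.counts and weekly.total are purely illustrative.

```r
# Daily 'HistData' download counts for 2019-01-01 to 2019-01-07,
# copied by hand from the cran_downloads() output above.
daily.counts <- c(51, 100, 137, 113, 85, 96, 205)

# Total downloads for the week.
weekly.total <- sum(daily.counts)
weekly.total
# > [1] 787

# Share of the week's downloads that fell on the first day, in percent.
round(100 * daily.counts[1] / weekly.total, 1)
```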

In both cases, lurking in the background is the “compared to what?” question. Is 51 downloads large or small? Is the pattern that week typical or unusual? To answer these questions, ‘packageRank’ puts package download counts into greater context.

compute percentiles and ranks

To do so, the package can compute the rank percentile and nominal rank of a package’s downloads:

packageRank(packages = "HistData", date = "2019-01-01")
>         date packages downloads percentile          rank
> 1 2019-01-01 HistData        51       93.4 920 of 14,020

Here, we see that 51 downloads put ‘HistData’ in the 93rd percentile. This statistic, familiar to anyone who’s taken a standardized test, tells us that 93% of packages had fewer downloads than ‘HistData’: [1]

pkg.rank <- packageRank(packages = "HistData", date = "2019-01-01")
downloads <- pkg.rank$crosstab

round(100 * mean(downloads < downloads["HistData"]), 1)
> [1] 93.4

# OR

(pkgs.with.fewer.downloads <- sum(downloads < downloads["HistData"]))
> [1] 13092

(tot.pkgs <- length(downloads))
> [1] 14020

round(100 * pkgs.with.fewer.downloads / tot.pkgs , 1)
> [1] 93.4

We also see that 51 downloads puts ‘HistData’ in 920th place among the 14,020 packages with at least one download. What makes this rank “nominal” is the fact that multiple packages can have the same number of downloads. As a result, a package’s nominal rank (but not its rank percentile) will sometimes be affected by its name: packages with the same number of downloads will be sorted in alphabetical order. For the case at hand, ‘HistData’ benefits from the fact that it is second in the list (vector) of packages with 51 downloads:

pkg.rank <- packageRank(packages = "HistData", date = "2019-01-01")
downloads <- pkg.rank$crosstab

downloads[downloads == 51]
> 
>  dynamicTreeCut        HistData          kimisc  NeuralNetTools 
>              51              51              51              51 
>   OpenStreetMap       pkgKitten plotlyGeoAssets            spls 
>              51              51              51              51 
>        webutils            zoom 
>              51              51
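The arithmetic behind the nominal rank can be reconstructed from the figures already shown. The sketch below is not the package's implementation: it hard-codes the counts reported above (14,020 packages, of which 13,092 had fewer downloads) and lists the ten tied packages in the order printed, and all variable names are illustrative.

```r
# Figures reported earlier for 2019-01-01.
tot.pkgs <- 14020
pkgs.with.fewer.downloads <- 13092

# The ten packages tied at 51 downloads, in the order printed above.
tied <- c("dynamicTreeCut", "HistData", "kimisc", "NeuralNetTools",
          "OpenStreetMap", "pkgKitten", "plotlyGeoAssets", "spls",
          "webutils", "zoom")

# Packages with strictly more downloads than the tied group.
pkgs.with.more.downloads <- tot.pkgs - pkgs.with.fewer.downloads - length(tied)

# Nominal rank: packages ahead of the tied group, plus position within it.
nominal.rank <- pkgs.with.more.downloads + which(tied == "HistData")
nominal.rank
# > [1] 920

# Under the same rule, the last of the tied packages ranks lower,
# even though its rank percentile is identical.
pkgs.with.more.downloads + which(tied == "zoom")
```

The rank percentile depends only on the number of packages with fewer downloads, so all ten tied packages share the 93.4 figure while their nominal ranks differ.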

visualization (cross-sectional)

To visualize a package’s relative position in a given day’s downloads, use the following:

plot(packageRank(packages = "HistData", date = "2019-05-01"))

This cross-sectional view plots a package’s rank (x-axis) against the logarithm of its downloads (y-axis) and highlights its position in the overall distribution of downloads.

It also shows 1) the package’s rank percentile and raw download count (in red); 2) the location of the 75th, 50th and 25th percentiles (dotted gray vertical lines); 3) the package with the most downloads, in this case ‘devtools’ (in blue); and 4) the total number of downloads (2,982,767) on that day (in blue).

Note that you can even pass a vector of packages:

plot(packageRank(packages = c("cholera", "HistData", "regtools"), date = "2019-05-01"))

visualization (longitudinal)

To visualize a package’s relative position over time, use packageRankTime():

plot(packageRankTime(packages = "HistData", when = "last-month"), graphics_pkg = "base")

This longitudinal view plots the date (x-axis) against the logarithm of a package’s downloads (y-axis).

In the background, the same variables are plotted (in gray) for a stratified random sample of packages.[2] This sample approximates the “typical” pattern of package downloads for that time period.

As above, you can pass a vector of packages:

plot(packageRankTime(packages = c("Rcpp", "HistData", "rlang"), when = "last-month"))

Note that only two time frames are available: “last-week” and “last-month”.

visualizing ‘cranlogs’

To visualize the download counts from cranlogs::cran_downloads(), ‘packageRank’ provides a generic S3 plot() method. All you need to do is substitute cran_downloads2() for cran_downloads():

plot(cran_downloads2(packages = c("data.table", "Rcpp", "rlang"), from = "2019-01-01", to = "2019-01-01"))

plot(cran_downloads2(packages = c("data.table", "Rcpp", "rlang"), when = "last-month"))

plot(cran_downloads2(packages = c("data.table", "Rcpp", "rlang"), from = "2019-01-01", to = "2019-01-31"))
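For readers unfamiliar with S3, the dispatch mechanism can be illustrated with a small, self-contained sketch. The class name downloads_sketch and the helper as_downloads_sketch() are hypothetical and are not part of ‘packageRank’ or ‘cranlogs’; the data are the ‘HistData’ counts shown earlier, entered by hand. The idea is that a wrapper returns the data with an extra class, and plot() then dispatches on that class.

```r
# 'HistData' counts for the first week of 2019, entered by hand.
wk <- data.frame(
  date = as.Date("2019-01-01") + 0:6,
  count = c(51, 100, 137, 113, 85, 96, 205),
  package = "HistData"
)

# Hypothetical wrapper: prepend a class so that plot() can dispatch on it.
as_downloads_sketch <- function(df) {
  class(df) <- c("downloads_sketch", class(df))
  df
}

# S3 method: plot() calls this for objects of class "downloads_sketch".
plot.downloads_sketch <- function(x, ...) {
  plot(x$date, x$count, type = "o", xlab = "Date", ylab = "Count",
       main = unique(x$package), ...)
  invisible(x)
}

wk.sketch <- as_downloads_sketch(wk)
class(wk.sketch)
```

With these definitions, plot(wk.sketch) draws the daily counts as a line chart, whereas plot(wk) on the unwrapped data frame falls back to the default data frame method.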

graphics: base R and ‘ggplot2’

All plots are available as both base R and ‘ggplot2’ graphs. By default, plots with a single frame/panel (one package or one day) use base graphics, while those with multiple frames/panels use ‘ggplot2’. You can override these defaults with the “graphics_pkg” argument of the plot() method, as in the longitudinal example above.

memoization

To avoid the bottleneck of downloading multiple log files, packageRank() is limited to individual days. However, to reduce the need to re-download logs for a given day, ‘packageRank’ makes use of memoization via the ‘memoise’ package.

Here’s the relevant code:

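# Read a log file from the address 'x'.
fetchLog <- function(x) data.table::fread(x)

# Memoized version: repeated calls with the same argument reuse the cached result.
mfetchLog <- memoise::memoise(fetchLog)

# 'url' is assumed to hold the address of the log file for the requested day.
if (RCurl::url.exists(url)) {
  cran_log <- mfetchLog(url)
}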

If you use fetchLog(), the log file, which can sometimes be as large as 50 MB, will be downloaded every time you call the function. If you use mfetchLog(), logs are cached: those already downloaded in your current R session will not be downloaded again.
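The effect of memoization can be seen without downloading anything. The sketch below uses a hand-rolled memoizer in base R as a stand-in for memoise::memoise(); it only illustrates the idea and does not reproduce that package's behavior. The functions slowFetch(), memoize() and mSlowFetch() and the counter n.calls are hypothetical names introduced for this example.

```r
# Stand-in for an expensive download; counts how often it actually runs.
n.calls <- 0
slowFetch <- function(x) {
  n.calls <<- n.calls + 1
  paste("log for", x)
}

# Hand-rolled memoizer: results are stored in an environment, keyed by argument.
memoize <- function(f) {
  cache <- new.env()
  function(x) {
    key <- as.character(x)
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(x), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

mSlowFetch <- memoize(slowFetch)

first <- mSlowFetch("2019-01-01")   # runs slowFetch()
second <- mSlowFetch("2019-01-01")  # served from the cache
third <- mSlowFetch("2019-01-02")   # new argument, runs slowFetch() again

n.calls
# > [1] 2
```

Three requests trigger only two real calls because the repeated date is served from the cache. As with the cache described above, this one lives only as long as the R session that created it.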

Notes

  1. Because packages with zero downloads are not recorded in the log, there is a censoring problem: percentiles and ranks are computed only among packages with at least one download that day.

  2. Within each 5% interval of rank percentiles (0 to 5, 5 to 10, ..., 95 to 100), a random sample of 5% of packages is selected and tracked over time.
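The stratified sampling described in note 2 could be sketched as follows. This is an illustration on simulated counts, not the package's code: the lognormal data, the package names, rounding the 5% sample size up, and sampling without replacement are all assumptions made for the example.

```r
set.seed(1)

# Simulated download counts for 2,000 hypothetical packages (not CRAN data).
downloads <- round(rlnorm(2000, meanlog = 3, sdlog = 1.5)) + 1
names(downloads) <- paste0("pkg", seq_along(downloads))

# Rank percentile: percentage of packages with strictly fewer downloads.
n.fewer <- rank(downloads, ties.method = "min") - 1
percentile <- 100 * n.fewer / length(downloads)

# Assign each package to a 5-point interval: [0,5), [5,10), ..., [95,100].
stratum <- cut(percentile, breaks = seq(0, 100, by = 5),
               right = FALSE, include.lowest = TRUE)

# Within each non-empty interval, draw 5% of its packages (rounded up),
# without replacement.
by.stratum <- split(names(downloads), stratum, drop = TRUE)
sampled <- unlist(lapply(by.stratum, function(pkgs) {
  sample(pkgs, size = ceiling(0.05 * length(pkgs)))
}), use.names = FALSE)

length(sampled)
```

Sampling within intervals, as opposed to drawing 5% of all packages at once, ensures that both rarely and heavily downloaded packages appear in the background sample.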