litteR User Manual

Dennis Walvoort Wageningen Environmental Research, The Netherlands
Willem van Loon Rijkswaterstaat, The Netherlands

2019-12-09

1 Introduction

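litteR is a modular tool for analyzing litter data (e.g., beach litter). The current version (0.7.0) contains the following modules:

  • descriptive statistics (module_stats)
  • trend analysis (module_trend)
  • baseline analysis (module_baseline)
  • power analysis (module_power)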

One can optionally switch modules on or off. These modules run independently from each other.

This user guide consists of two parts. The first part describes the user interface; the second part gives more details on the modules.

For applications with litteR see Schulz et al. (2019).

2 Loading the litteR-package

The litteR-package should be loaded in R before you can use it. This can be done by running the following code in the R-console or the RStudio-console: library(litteR).

3 User interface

3.1 Create a new project

The easiest way to start working with litteR is to create an empty project directory. This directory can be filled with example and reference files by running:

create_litter_project("d:/work/litter-projects/beach-litter")

in the RStudio-console. For more information on how to obtain and use RStudio, consult its website or read our installation guide.

The argument of function create_litter_project (i.e., the quoted part in parentheses) is an existing work directory on your computer. This can be any valid directory name for which you have sufficient user privileges. Note for MS-Windows users: use forward slashes (/) in the directory name, because R treats the backslash (\) as an escape character.

It is also possible to run create_litter_project() without an argument. In that case, a simple graphical user interface pops up for interactive directory selection.
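A Windows path copied from the file explorer contains backslashes and has to be converted before it can be used. The following is a minimal sketch in base R; the helper to_forward_slashes is hypothetical and not part of litteR, and the path is the example path used above.

```r
# Hypothetical helper (not part of litteR): replace every backslash
# in a path by a forward slash.
to_forward_slashes <- function(path) {
  gsub("\\", "/", path, fixed = TRUE)
}

# In R source code a literal backslash has to be written as "\\".
windows_path <- "d:\\work\\litter-projects\\beach-litter"
project_dir <- to_forward_slashes(windows_path)
print(project_dir)

# The converted path can then be passed on, for example:
# create_litter_project(project_dir)
```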

3.2 Perform litter analysis

litteR can be started by typing litter() in the RStudio console (see the figure below).

Functions to start a litteR session.

After entering litter(), a simple graphical user interface pops up for file selection. An example of a file selection dialogue is given below.

File open dialogue.

4 Input

4.1 Litter file

The current version of litteR supports three data formats:

  • wide format
  • long format
  • OSPAR format

These formats will be briefly described below.

4.1.1 Wide format

The wide format is litteR’s native and recommended format. It is comparable to the OSPAR format given below, but less restrictive. The following columns are required: “region_name”, “country_code”, “country_name”, “location_name”, and “date”. The columns are separated by commas (CSV-file).

The image below gives an example of the wide format.

# A tibble: 10 x 8
   region_name               country_code country_name location_code
   <chr>                     <chr>        <chr>        <chr>        
 1 North East Atlantic Ocean NL           Netherlands  NL001        
 2 North East Atlantic Ocean NL           Netherlands  NL001        
 3 North East Atlantic Ocean NL           Netherlands  NL001        
 4 North East Atlantic Ocean NL           Netherlands  NL001        
 5 North East Atlantic Ocean NL           Netherlands  NL001        
 6 North East Atlantic Ocean NL           Netherlands  NL001        
 7 North East Atlantic Ocean NL           Netherlands  NL001        
 8 North East Atlantic Ocean NL           Netherlands  NL001        
 9 North East Atlantic Ocean NL           Netherlands  NL001        
10 North East Atlantic Ocean NL           Netherlands  NL001        
   location_name date       `Plastic: Yokes [1]` `Plastic: Bags [2]`
   <chr>         <date>                    <dbl>               <dbl>
 1 Bergen        2012-01-27                    0                   3
 2 Bergen        2012-04-20                    0                   8
 3 Bergen        2012-07-22                    0                   1
 4 Bergen        2012-10-19                    0                   2
 5 Bergen        2013-02-19                    0                  24
 6 Bergen        2013-04-11                    0                   0
 7 Bergen        2013-07-20                    0                  10
 8 Bergen        2013-10-16                    0                   7
 9 Bergen        2014-01-08                    0                   9
10 Bergen        2014-04-23                    0                  10

It’s less restrictive than the OSPAR-format in the sense that litter types are not restricted to the format

litter group : litter type [litter code]

The only requirement is that a [litter code] should be available. In fact, all litter specifications given below are valid:

  • “Plastic: spoon [56]”
  • “Spoon [G56]”
  • “Spoon [AB56]”
  • “[56] spoon plastic”
  • “Spoon plastic [56]”
  • “Spoon [56] plastic”
  • “Spoon [56]”

The first three specifications correspond to the OSPAR-code, the TSG-ML general code (Technical Subgroup on Marine Litter), and the UNEP-code respectively.
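To illustrate why the position of the [litter code] within the name does not matter, the sketch below extracts the bracketed code from each of the specifications listed above with a regular expression. The helper extract_litter_code is hypothetical; it is not litteR's own parser and only shows one way such names could be handled in base R.

```r
type_names <- c(
  "Plastic: spoon [56]", "Spoon [G56]", "Spoon [AB56]",
  "[56] spoon plastic", "Spoon plastic [56]", "Spoon [56] plastic",
  "Spoon [56]"
)

# Hypothetical helper: return the first "[...]" part of each name,
# or NA if a name contains no bracketed code.
extract_litter_code <- function(x) {
  m <- regexpr("\\[[^]]+\\]", x)
  codes <- rep(NA_character_, length(x))
  codes[m > 0] <- regmatches(x, m)
  codes
}

print(extract_litter_code(type_names))
```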

4.1.2 Long format

The long format is convenient for data analysis. The following columns are required: “region_name”, “country_code”, “country_name”, “location_name”, “date”, “type_name”, and “abundance”. The columns are separated by commas (CSV-file).

The image below gives an example of the long format. It supports the same litter coding as the wide format.

# A tibble: 10 x 8
   region_name               country_code country_name location_code
   <chr>                     <chr>        <chr>        <chr>        
 1 North East Atlantic Ocean NL           Netherlands  NL001        
 2 North East Atlantic Ocean NL           Netherlands  NL001        
 3 North East Atlantic Ocean NL           Netherlands  NL001        
 4 North East Atlantic Ocean NL           Netherlands  NL001        
 5 North East Atlantic Ocean NL           Netherlands  NL001        
 6 North East Atlantic Ocean NL           Netherlands  NL001        
 7 North East Atlantic Ocean NL           Netherlands  NL001        
 8 North East Atlantic Ocean NL           Netherlands  NL001        
 9 North East Atlantic Ocean NL           Netherlands  NL001        
10 North East Atlantic Ocean NL           Netherlands  NL001        
   location_name date       type_name          abundance
   <chr>         <date>     <chr>                  <dbl>
 1 Bergen        2012-01-27 Plastic: Yokes [1]         0
 2 Bergen        2012-04-20 Plastic: Yokes [1]         0
 3 Bergen        2012-07-22 Plastic: Yokes [1]         0
 4 Bergen        2012-10-19 Plastic: Yokes [1]         0
 5 Bergen        2013-02-19 Plastic: Yokes [1]         0
 6 Bergen        2013-04-11 Plastic: Yokes [1]         0
 7 Bergen        2013-07-20 Plastic: Yokes [1]         0
 8 Bergen        2013-10-16 Plastic: Yokes [1]         0
 9 Bergen        2014-01-08 Plastic: Yokes [1]         0
10 Bergen        2014-04-23 Plastic: Yokes [1]         0
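The relation between the wide and the long format can be sketched in base R as follows. The two records are taken from the wide-format example; only two identifier columns are kept for brevity. This is an illustration of the two layouts, not a description of how litteR converts data internally.

```r
# Two records from the wide-format example (subset of columns).
wide <- data.frame(
  location_name = c("Bergen", "Bergen"),
  date = as.Date(c("2012-01-27", "2012-04-20")),
  `Plastic: Yokes [1]` = c(0, 0),
  `Plastic: Bags [2]` = c(3, 8),
  check.names = FALSE  # keep the litter type names unchanged
)

id_cols <- c("location_name", "date")
type_cols <- setdiff(names(wide), id_cols)

# Stack one block of rows per litter type: the column name becomes
# 'type_name' and the cell values become 'abundance'.
long <- do.call(rbind, lapply(type_cols, function(type) {
  data.frame(wide[id_cols], type_name = type, abundance = wide[[type]])
}))

print(long)
```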

4.1.3 OSPAR format

The OSPAR format is a wide format, meaning that all litter types are stored in columns and each row represents a survey. OSPAR beach litter data can be downloaded from the OSPAR website.

The image below gives an example of the first 10 columns and records of litter data in the OSPAR-format.

# A tibble: 10 x 10
   RefNo `Beach name` Country     Region                `Survey date` Period
   <chr> <chr>        <chr>       <chr>                 <chr>          <dbl>
 1 NL001 Bergen       Netherlands 3. Southern North Sea 27/01/2012         1
 2 NL001 Bergen       Netherlands 3. Southern North Sea 20/04/2012         2
 3 NL001 Bergen       Netherlands 3. Southern North Sea 22/07/2012         3
 4 NL001 Bergen       Netherlands 3. Southern North Sea 19/10/2012         4
 5 NL001 Bergen       Netherlands 3. Southern North Sea 19/02/2013        -1
 6 NL001 Bergen       Netherlands 3. Southern North Sea 11/04/2013         2
 7 NL001 Bergen       Netherlands 3. Southern North Sea 20/07/2013         3
 8 NL001 Bergen       Netherlands 3. Southern North Sea 16/10/2013         4
 9 NL001 Bergen       Netherlands 3. Southern North Sea 08/01/2014         1
10 NL001 Bergen       Netherlands 3. Southern North Sea 23/04/2014         2
   `Plastic: Yokes [1]` `Plastic: Bags [2]` `Plastic: Small_bags [3]`
                  <dbl>               <dbl>                     <dbl>
 1                    0                   3                         9
 2                    0                   8                        12
 3                    0                   1                         5
 4                    0                   2                         4
 5                    0                  24                        23
 6                    0                   0                         9
 7                    0                  10                         4
 8                    0                   7                         5
 9                    0                   9                        20
10                    0                  10                        29
   `Plastic: Bag_ends [112]`
                       <dbl>
 1                         0
 2                         0
 3                         0
 4                         0
 5                        13
 6                         1
 7                         0
 8                         1
 9                         0
10                         0

The columns are separated by commas (CSV-file). Five columns are compulsory, i.e., “refno”, “beach name”, “country”, “region”, and “survey date”. Note that the OSPAR date format currently does not comply with the ISO 8601 standard date format. Instead, OSPAR uses dd/mm/YYYY (see the example above). However, for convenience and consistency, litteR also accepts dates in the ISO 8601 format (YYYY-mm-dd). The other columns contain litter types. The names of these columns have the following format:

litter group: litter type [litter code]

for instance, ‘Plastic: Bags [2]’.

Optionally, other columns may be added as metadata. However, these columns will be ignored by litteR.
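Accepting both date notations can be sketched in base R as shown below. The function parse_survey_date is a hypothetical helper written for this illustration; litteR's own date handling may differ.

```r
# Hypothetical helper: parse dates given either as YYYY-mm-dd (ISO 8601)
# or as dd/mm/YYYY (OSPAR). Anything else becomes NA.
parse_survey_date <- function(x) {
  is_iso <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)
  out <- rep(as.Date(NA), length(x))
  out[is_iso] <- as.Date(x[is_iso], format = "%Y-%m-%d")
  out[!is_iso] <- as.Date(x[!is_iso], format = "%d/%m/%Y")
  out
}

dates <- parse_survey_date(c("27/01/2012", "2012-04-20"))
print(dates)
```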

4.1.4 Data Quality Control

All input files are validated by litteR. The following validation rules apply:

  1. all required columns (see above) should be available;
  2. the date format should be valid, i.e. YYYY-mm-dd (ISO 8601) for the Wide and Long formats or dd/mm/YYYY or YYYY-mm-dd (ISO 8601) for the OSPAR format;
  3. litter type names should adhere to the specifications given above;
  4. abundances are natural numbers (ISO 80000-2);
  5. all records should be unique; duplicated records will be removed with a warning;
  6. all cells should be filled with the appropriate data type (numbers, text or dates);
  7. the data file should be a comma-separated values file (CSV), i.e., a text file where the columns are separated by commas (,) and not by spaces, semicolons (;) or tabs.
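Some of these rules are easy to check by hand before running litteR. The sketch below mirrors rules 1, 4 and 5 in base R. All helper names are hypothetical, and the third record is a deliberately made-up duplicate; this is not litteR's own validation code.

```r
required_columns <- c("region_name", "country_code", "country_name",
                      "location_name", "date")

# Rule 1: which required columns are missing?
missing_columns <- function(d) setdiff(required_columns, names(d))

# Rule 4: natural numbers are non-negative whole numbers.
is_natural_number <- function(x) {
  if (!is.numeric(x)) return(rep(FALSE, length(x)))
  !is.na(x) & x >= 0 & x == round(x)
}

d <- data.frame(
  region_name = "North East Atlantic Ocean",
  country_code = "NL",
  country_name = "Netherlands",
  location_name = "Bergen",
  date = as.Date(c("2012-01-27", "2012-04-20", "2012-04-20")),
  `Plastic: Bags [2]` = c(3, 8, 8),  # last record duplicates the second
  check.names = FALSE
)

print(missing_columns(d))
print(all(is_natural_number(d[["Plastic: Bags [2]"]])))

# Rule 5: drop duplicated records and report how many were removed.
d_unique <- d[!duplicated(d), ]
message(nrow(d) - nrow(d_unique), " duplicated record(s) removed")
```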

4.2 Settings file

The settings file contains all settings needed to run litteR. The settings file is in the YAML-format. This is a human-readable data language that is commonly used for settings files. An example of the contents of a settings file is given in the figure below.

### BASIC SETTINGS ###

# Name of analyst
analyst_name: "RWS"

# Which modules to run (false or true)
module_stats: true
module_trend: true
module_baseline: false
module_power: false

# Period to analyse (YYYY-mm-dd)
min_date: 2012-01-01
max_date: 2017-12-31

# Percentage of total abundance to analyse (0 < percentage_total_abundance <= 100)
percentage_total_abundance: 80

# name of group file (see package vignette for more details)
file_groups: ospar-groups.csv

# Litter type(s) and/or groups to analyse 
# (e.g., OSPAR codes in square brackets and [TA] for total abundance)
litter_types_groups: [[TA], [49]]

# Image quality: high or low
image_quality: high

### ADVANCED SETTINGS ###

# Power-analysis: number of Monte Carlo simulations
# Note that larger values lead to longer run times.
# The default number of simulations is 100 to speed up computation. 
# However, 1000 simulations generally give more accurate results
number_of_simulations: 100

# Power-analysis: significance level
alpha: 0.05

# Power-analysis: resolution of effect size (range: 5% ... 50%)
resolution_effect_size: 10

# Power-analysis: minimum number of surveys to sample from
min_surveys: 16

# Show source code? (true or false)
show_source_code: false

The YAML-file contains the following entries:

entry | description | value
analyst_name | name of the person who performs the litter analysis | text
module_stats | activate the descriptive statistics module? | true or false
module_trend | activate the trend analysis module? | true or false
module_baseline | activate the baseline analysis module? | true or false
module_power | activate the power analysis module? | true or false
min_date | first date to analyse | YYYY-mm-dd (ISO 8601)
max_date | last date to analyse | YYYY-mm-dd (ISO 8601)
percentage_total_abundance | percentage of total abundance to analyse | percentage, default value: 80%
file_groups | name of litter group file | text
litter_types_groups | litter type(s) and group(s) to analyse | litter/group code(s) in square brackets, e.g., [[49], [TA], [SUP]]
image_quality | quality of the images | high or low
number_of_simulations | number of Monte Carlo simulations for power analysis | integer greater than 0, default value: 100
alpha | significance level used for power analysis | numeric in 0..1, default value: 0.05
resolution_effect_size | resolution of the effect size (power analysis) | range: 5% .. 50%
min_surveys | minimum number of surveys to sample from in power analysis | integer greater than 0, default value: 16
show_source_code | show all R source code? | true or false
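The value ranges in this table can be checked before starting an analysis. The sketch below assumes that the settings have already been read into an R list (for instance with yaml::read_yaml() from the yaml package, which is an assumption about tooling and not a description of litteR internals). The helper check_settings is hypothetical; the values are copied from the example settings file.

```r
# Settings as an R list; values copied from the example settings file.
settings <- list(
  min_date = as.Date("2012-01-01"),
  max_date = as.Date("2017-12-31"),
  percentage_total_abundance = 80,
  number_of_simulations = 100,
  alpha = 0.05,
  resolution_effect_size = 10
)

# Hypothetical helper: return a character vector describing all
# problems that were found (empty if the settings look fine).
check_settings <- function(s) {
  problems <- character(0)
  if (s$min_date > s$max_date)
    problems <- c(problems, "min_date is later than max_date")
  if (s$percentage_total_abundance <= 0 || s$percentage_total_abundance > 100)
    problems <- c(problems, "percentage_total_abundance not in (0, 100]")
  if (s$number_of_simulations < 1)
    problems <- c(problems, "number_of_simulations should be greater than 0")
  if (s$alpha < 0 || s$alpha > 1)
    problems <- c(problems, "alpha not in 0..1")
  if (s$resolution_effect_size < 5 || s$resolution_effect_size > 50)
    problems <- c(problems, "resolution_effect_size not in 5..50")
  problems
}

print(check_settings(settings))
```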

4.3 Groups file

The work directory should also contain a litter groups file. An example file, named ‘ospar-groups.csv’ is automatically generated when using the create_litter_project-function, described earlier in this tutorial. A groups file assigns each litter type (type_name, in rows) to one or more litter groups (columns) by placing an x in a cell. The first 11 rows of ospar-groups.csv are given in the table below.

First 10 records of the litter-groups.csv file.

Both individual type codes and litter groups (column names) can be specified as litter_types_groups in the settings-file (*.yaml). For instance:

litter_types_groups: [[TA], [49], [SUP], [FISH]]

The user may use ‘ospar-groups.csv’ as a template for their own groups file. litteR uses the groups file that is specified in the settings-file (*.yaml), e.g., file_groups: ospar-groups.csv. Users can thus create tailor-made groups files that match their input data.
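The structure of a groups file (litter types in rows, groups in columns, membership marked with an x) can be illustrated with a small data frame. The group memberships and the group name BAGS below are made up for this sketch; consult ospar-groups.csv for the real assignments. The helper types_in_group is hypothetical.

```r
# Made-up group memberships, for illustration only.
groups <- data.frame(
  type_name = c("Plastic: Yokes [1]", "Plastic: Bags [2]",
                "Plastic: Small_bags [3]"),
  SUP  = c("x", "x", "x"),
  BAGS = c("",  "x", "x"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all litter types marked with an "x" in a group.
# %in% also treats empty cells that were read as NA as 'not a member'.
types_in_group <- function(groups, group) {
  groups$type_name[groups[[group]] %in% "x"]
}

print(types_in_group(groups, "BAGS"))
```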

5 Output

5.1 Report

litteR produces an HTML report that is best viewed with a modern web browser like Mozilla Firefox, Google Chrome, or Safari. These browsers are freely available from the internet.

The filename of each report starts with ‘litter-report’, for example: litter-report-[TA][49]-STABP-20190521-074547.html

In the remainder of this section, each section of the HTML-report is briefly described.

5.1.1 Settings

This section gives a summary of the settings/parameters in the settings file.

5.1.2 Data Quality Control

In this section (potential) problems in the input files are reported.

5.1.3 Descriptive statistics

For each selected litter type and period, this section gives several descriptive statistics. These statistics provide useful information about the data in a concise way. The following statistics are given:

  • mean abundance (mean), i.e., the arithmetic mean of the counts for each litter type;
  • median abundance (median), i.e., the median of the counts for each litter type;
  • relative abundance (rel.abund.), i.e., the contribution of each litter type to the total abundance of all litter types (%);
  • coefficient of variation (CV), i.e., the ratio of the standard deviation to the mean of the counts for each litter type (%);
  • ratio of the median absolute deviation (MAD) to the median (RMAD, %);
  • number of surveys (N).

These statistics are estimated for the top x% of litter types, i.e., the types with the greatest abundances that together make up x% of the total abundance at each location. For example, in OSPAR data analysis the top 80% is used by default.
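The statistics above can be reproduced in base R for the ten example surveys at Bergen shown in the OSPAR example. This is a sketch under two assumptions: R's mad() rescales the MAD by a constant (1.4826) by default, which may or may not match litteR's definition, and the cut-off rule in the hypothetical helper top_types is one plausible reading of "top x%".

```r
# Counts per survey, copied from the OSPAR example data.
yokes      <- rep(0, 10)
bags       <- c(3, 8, 1, 2, 24, 0, 10, 7, 9, 10)
small_bags <- c(9, 12, 5, 4, 23, 9, 4, 5, 20, 29)
bag_ends   <- c(0, 0, 0, 0, 13, 1, 0, 1, 0, 0)

# Descriptive statistics for one litter type.
bag_stats <- c(
  mean   = mean(bags),
  median = median(bags),
  cv     = 100 * sd(bags) / mean(bags),     # coefficient of variation (%)
  rmad   = 100 * mad(bags) / median(bags),  # ratio of MAD to median (%)
  n      = length(bags)
)
print(round(bag_stats, 1))

# Total abundance per litter type and relative abundance (%).
totals <- c("Plastic: Yokes [1]" = sum(yokes),
            "Plastic: Bags [2]" = sum(bags),
            "Plastic: Small_bags [3]" = sum(small_bags),
            "Plastic: Bag_ends [112]" = sum(bag_ends))
print(round(100 * totals / sum(totals), 1))

# Hypothetical helper: keep the most abundant types until their
# cumulative share reaches the requested percentage.
top_types <- function(totals, percentage = 80) {
  sorted <- sort(totals, decreasing = TRUE)
  share_before <- 100 * (cumsum(sorted) - sorted) / sum(sorted)
  names(sorted)[share_before < percentage]
}
print(top_types(totals))
```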

5.1.4 Trend analysis

This section gives trend analysis results. The figures show time-series of litter items for each location, together with a monotonic trend line based on the Theil-Sen slope estimator. The Theil-Sen slope estimator is usually more robust than slopes estimated by ordinary least squares regression. In addition, a loess-smoother is given to reveal potential non-linearities in the trend.

Finally, a table is provided showing the magnitude of the Theil-Sen slope estimator and its corresponding p-value.

Example of a trend plot for total abundance (TA) at a beach near Bergen (The Netherlands). In this plot, the black dots are the observations, the thin gray line segments connect the dots and guide the eye, the blue line is a loess-smoother, and the red line is the Theil-Sen slope.
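The Theil-Sen slope is the median of the slopes between all pairs of observations. The sketch below implements this definition in base R. The helper theil_sen is hypothetical, taking the intercept as the median of the residual offsets is one common convention, and time is expressed in approximate years since 1970; litteR's implementation may differ in these details.

```r
# Hypothetical helper: Theil-Sen slope (median of all pairwise slopes)
# and an intercept taken as the median of y - slope * x.
theil_sen <- function(x, y) {
  pairs <- combn(length(x), 2)            # all index pairs (i < j)
  dx <- x[pairs[2, ]] - x[pairs[1, ]]
  dy <- y[pairs[2, ]] - y[pairs[1, ]]
  keep <- dx != 0                          # skip pairs with identical x
  slope <- median(dy[keep] / dx[keep])
  intercept <- median(y - slope * x)
  c(intercept = intercept, slope = slope)
}

# Counts of 'Plastic: Bags [2]' at Bergen (example data).
dates <- as.Date(c("2012-01-27", "2012-04-20", "2012-07-22", "2012-10-19",
                   "2013-02-19", "2013-04-11", "2013-07-20", "2013-10-16",
                   "2014-01-08", "2014-04-23"))
bags <- c(3, 8, 1, 2, 24, 0, 10, 7, 9, 10)

# Dates are stored as days since 1970-01-01; convert to years.
years <- as.numeric(dates) / 365.25
fit <- theil_sen(years, bags)
print(fit)  # slope is the change in abundance per year
```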

5.1.5 Baseline analysis

The aim of baseline analysis is to identify the minimum number of surveys needed to obtain stable baseline estimates.

This section provides figures showing the moving average as function of window size, i.e. the number of consecutive years, for each location.

The following procedure was followed to produce these plots:

  1. Start with a window size of one year. One year usually corresponds to four surveys;
  2. For each selected litter type, move the window over its time series. The step size is equal to one survey.
  3. During each step, two statistics are computed for the survey data within the window:
    • the mean and median abundance;
    • the number of days spanned by the window, i.e. the window size. The window size will vary because the surveys are not equidistant in time;
  4. Increase the window size by one survey and repeat the procedure above until the maximum window size has been reached.

Example of a baseline plot. Each dot is the average abundance of a specific litter type or the total abundance (TA) within a moving window of the size given on the x-axis.

In addition, a table is presented that gives, for each location and number of years (# years), the mean, the standard deviation (sd), the coefficient of variation (CV), the median, the median absolute deviation (MAD), and the ratio of MAD to median of the baseline statistics (mean and median) plotted above.
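The moving-window procedure of this section can be sketched in base R as follows. The helper moving_window is hypothetical, the window is expressed as a number of surveys (four surveys being roughly one year), and the data are the ten example surveys at Bergen; this illustrates the idea and is not litteR's own code.

```r
dates <- as.Date(c("2012-01-27", "2012-04-20", "2012-07-22", "2012-10-19",
                   "2013-02-19", "2013-04-11", "2013-07-20", "2013-10-16",
                   "2014-01-08", "2014-04-23"))
bags <- c(3, 8, 1, 2, 24, 0, 10, 7, 9, 10)

# Hypothetical helper: slide a window of 'window' surveys over the
# series in steps of one survey; per step return the number of days
# spanned and the mean and median abundance within the window.
moving_window <- function(dates, y, window) {
  starts <- seq_len(length(y) - window + 1)
  ends <- starts + window - 1
  data.frame(
    n_surveys = window,
    days   = as.numeric(dates[ends] - dates[starts]),
    mean   = vapply(starts, function(i) mean(y[i:(i + window - 1)]), numeric(1)),
    median = vapply(starts, function(i) median(y[i:(i + window - 1)]), numeric(1))
  )
}

# Start with four surveys and increase the window by one survey at a time.
baseline <- do.call(rbind, lapply(4:9, function(w) moving_window(dates, bags, w)))
print(head(baseline))

# Spread of the moving averages for each window size.
print(aggregate(mean ~ n_surveys, data = baseline, FUN = sd))
```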

Snapshot of the baseline table in the report. For an explanation, see main text.

5.1.6 Power analysis

In this section, the power of the Wilcoxon signed rank test is estimated. The null hypothesis of this test is

H0: distribution of litter data is symmetric about the baseline value

and the alternative hypothesis is

H1: distribution of litter data is less than the baseline value

Hence, this is a test for a step trend. The power of a hypothesis test is the probability that the test correctly rejects the null hypothesis (H0) when a specific alternative hypothesis (H1) is true.

The power is useful to check if the number of surveys is sufficient. If the power is too low, sampling effort should be increased to be able to correctly detect trends. On the other hand, if the power is too high, sampling effort can be reduced. In both cases, power analysis may lead to more efficient allocation of financial resources.

In litteR, power analysis is carried out by means of Monte Carlo simulation for different values of the reduction (effect size), sample size and statistical significance. The procedure is as follows:

For each location, the time-series of the selected litter types are selected. For each of these time-series:

  1. Estimate its mean abundance over the period of interest. This value is used as baseline value in the power analysis;
  2. Multiply the abundances by a reduction factor f, where 0 < f < 1 (effect size);
  3. Draw n (integer) values from the empirical cumulative distribution function (ECDF) of these reduced abundances;
  4. Test if the simulated data are significantly below the baseline value by means of the Wilcoxon test;
  5. Simulate a new data set and perform the test above many times (say, 1000 times);
  6. Estimate the power as the proportion of simulations in which the null hypothesis is rejected (i.e., p < \(\alpha\));
  7. Repeat this procedure for different values of n and f, given \(\alpha\).

The reduction factor f scales the monitoring data. The following expression holds:

mean(simulated data) \(\approx\) f \(\times\) mean(monitoring data) = f \(\times\) (baseline value)

Note that f = 1 means no reduction (mean of the simulated data is approximately equal to the baseline value), and f = 0 means absence of litter (for instance, a pristine clean beach).
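The simulation procedure described above can be sketched in base R with wilcox.test() from the stats package. The helper estimate_power and its arguments are hypothetical, sampling with replacement from the reduced abundances stands in for drawing from their ECDF, and the normal approximation (exact = FALSE) is used because the simulated data contain ties. litteR's implementation may differ in such details.

```r
set.seed(1)  # make this sketch reproducible

# Hypothetical helper: Monte Carlo estimate of the power of the
# one-sided Wilcoxon signed rank test for reduction factor f and
# n surveys.
estimate_power <- function(abundance, f, n, n_sim = 200, alpha = 0.05) {
  baseline <- mean(abundance)   # step 1: baseline value
  reduced <- f * abundance      # step 2: apply the reduction factor
  rejected <- replicate(n_sim, {
    simulated <- sample(reduced, n, replace = TRUE)  # step 3: draw n values
    p <- wilcox.test(simulated, mu = baseline,       # step 4: test
                     alternative = "less", exact = FALSE)$p.value
    p < alpha
  })                            # step 5: repeat n_sim times
  mean(rejected)                # step 6: proportion of rejections
}

# Counts of 'Plastic: Bags [2]' at Bergen (example data).
bags <- c(3, 8, 1, 2, 24, 0, 10, 7, 9, 10)

# Step 7: repeat for several reduction factors (here with n = 16).
factors <- c(0.2, 0.5, 0.9)
power <- sapply(factors, function(f) estimate_power(bags, f = f, n = 16))
names(power) <- paste0("f=", factors)
print(power)
```

Smaller values of f correspond to stronger reductions and are therefore easier to detect; increasing n_sim gives more stable estimates at the cost of a longer run time.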

Example of a power analysis plot. It gives the power (y-axis) as function of the number of surveys (x-axis) for different effect sizes (see legend).

5.2 Statistical summary file

In addition to a report, a CSV-file with summary statistics will be produced for each location. This file is accompanied by a file with metadata. The metadata are given below:

column_name | description | unit
region_name | administrative unit, e.g., OSPAR or Southern North Sea | 1
country_code | two-letter upper case country code according to ISO 3166-1 alpha-2 | 1
location_name | name of the survey location | 1
type_name | name of the litter type | 1
type_code | code of the litter type | 1
from | first date of the survey | date
to | final date of the survey | date
mean | mean abundance | count
median | median abundance | count
cv | coefficient of variation of the abundance | 1
rmad | ratio of MAD to median | 1
n | number of surveys used to estimate these statistics | 1
intercept | intercept of the Theil-Sen trend line, i.e., intercept + slope * (year - 1970) | count
slope | slope of the Theil-Sen trend line (annual increase in abundance) | 1/a
p_value_slope | p-value of the Theil-Sen slope | 1
min | minimum abundance | count
p01 | 1st percentile of the abundance | count
p05 | 5th percentile of the abundance | count
p10 | 10th percentile of the abundance | count
p25 | 25th percentile of the abundance (first quartile) | count
p50 | 50th percentile of the abundance (second quartile or median) | count
p75 | 75th percentile of the abundance (third quartile) | count
p90 | 90th percentile of the abundance | count
p95 | 95th percentile of the abundance | count
p99 | 99th percentile of the abundance | count
max | maximum abundance | count

6 Troubleshooting

7 References

Schulz, Marcus, Dennis J.J. Walvoort, Jon Barry, David M. Fleet, Willem M.G.M. van Loon, 2019. Baseline and power analyses for the assessment of beach litter reductions in the European OSPAR region. Environmental Pollution 248:555-564. https://doi.org/10.1016/j.envpol.2019.02.030