# 2 Main workflow

## 2.1 Step 1: importation, check and normalization / transformation of data if needed

### 2.1.1 General format of imported data

#### 2.1.1.1 Importation of data from a unique text file

Whatever the type of data imported in DRomics (e.g. RNAseq,
microarray, metabolomic data), data can be imported from a .txt file
(e.g. “mydata.txt”) organized with **one row per item**
(e.g. transcript, probe, metabolite) and **one column per
sample**. In **an additional first row**, after a
name for the item identifier (e.g. “id”), we must have the
**tested doses or concentrations in a numeric format** for
the corresponding sample (for example, if there are triplicates for each
treatment, the first line could be “item”, 0, 0, 0, 0.1, 0.1, 0.1,
etc.). An **additional first column** must give the
**identifier of each item** (identifier of the probe,
transcript, metabolite, …, or name of the endpoint for anchoring data),
and the other columns give the responses of the item for each sample.
This file will be imported within DRomics using an internal call to the
function read.table() with its default field separator (sep argument)
and its default decimal separator (dec argument at “.”). So remember, if
necessary, to **transform another decimal separator (e.g. “,”) in
“.” before importing your data**.

Different examples of .txt files formatted for the DRomics workflow
are available in the package, such as the one named “RNAseq_sample.txt”.
You can have a look of how the data are coded in this file using the
following code. **To use a local dataset formatted in the same
way, such use a datafilename of type
"yourchosenname.txt".**

```
# Import the text file just to see what will be automatically imported
datafilename <- system.file("extdata", "RNAseq_sample.txt", package = "DRomics")
# datafilename <- "yourchosenname.txt" # for a local file
# Have a look of what information is coded in this file
d <- read.table(file = datafilename, header = FALSE)
nrow(d)
```

`## [1] 1000`

```
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 1 RefSeq 0 0 0.22 0.22 0.22 0.67 0.67 0.67 2
## 2 NM_144958 2072 2506 2519.00 2116.00 1999.00 2113.00 2219.00 2322.00 2359
## 3 NR_102758 0 0 0.00 0.00 0.00 0.00 0.00 0.00 0
## 4 NM_172405 198 265 250.00 245.00 212.00 206.00 227.00 246.00 265
## 5 NM_029777 18 29 25.00 19.00 19.00 13.00 22.00 19.00 19
## 6 NM_001130188 0 0 0.00 0.00 0.00 0.00 0.00 1.00 0
## V11 V12 V13 V14 V15
## 1 2 2 6 6 6
## 2 1932 1705 2110 2311 2140
## 3 0 0 0 0 0
## 4 205 175 288 315 242
## 5 26 16 26 32 33
## 6 0 0 1 0 1
```

#### 2.1.1.2 Importation of data as an R object

**Alternatively an R object of class data.frame can be directly
given in input**, that corresponds to the output of
read.table(file, header = FALSE) on a file described in the previous section.

You can see below an example of an RNAseq data set that is available in DRomics as an R object (named Zhou_kidney_pce) and which is an extended version (more rows) of the previous dataset coded in “RNAseq_sample.txt”.

```
# Load and look at the dataset directly coded as an R object
data(Zhou_kidney_pce)
nrow(Zhou_kidney_pce)
```

`## [1] 33395`

```
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 1 RefSeq 0 0 0.22 0.22 0.22 0.67 0.67 0.67 2
## 2 NM_144958 2072 2506 2519.00 2116.00 1999.00 2113.00 2219.00 2322.00 2359
## 3 NR_102758 0 0 0.00 0.00 0.00 0.00 0.00 0.00 0
## 4 NM_172405 198 265 250.00 245.00 212.00 206.00 227.00 246.00 265
## 5 NM_029777 18 29 25.00 19.00 19.00 13.00 22.00 19.00 19
## 6 NM_001130188 0 0 0.00 0.00 0.00 0.00 0.00 1.00 0
## V11 V12 V13 V14 V15
## 1 2 2 6 6 6
## 2 1932 1705 2110 2311 2140
## 3 0 0 0 0 0
## 4 205 175 288 315 242
## 5 26 16 26 32 33
## 6 0 0 1 0 1
```

If your data are already imported in R in a **different format
as the one described above**, you can use the
**formatdata4DRomics() function to build an R object directly
useable by the DRomics workflow**. formatdata4DRomics() needs two
arguments in input:

- the
**matrix of the data**with**one row for each item**and**one column for each sample**, - and the
**numeric vector coding for the dose**for each sample.

The names of the samples can be added in a third optional argument (see ?formatdata4DRomics for details). Below is an example using a RNAseq dataset of the package coded as an R object named zebraf.

```
## List of 3
## $ counts: int [1:1000, 1:16] 453 331 897 12 326 533 1948 904 583 154 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:1000] "ENSDARG00000102141" "ENSDARG00000102123" "ENSDARG00000114503" "ENSDARG00000115971" ...
## .. ..$ : chr [1:16] "I10_05mG_E5" "I10_05mG_E6" "I10_05mG_E7" "I10_C5" ...
## $ dose : num [1:16] 500 500 500 0 0 0 0 50000 50000 50000 ...
## $ batch : Factor w/ 2 levels "I10","I17": 1 1 1 1 1 1 1 2 2 2 ...
```

```
# Formatting of data for use in DRomics
#
data4DRomics <- formatdata4DRomics(signalmatrix = zebraf$counts,
dose = zebraf$dose)
# Look at the dataset coded as an R object
nrow(data4DRomics)
```

`## [1] 1001`

```
## V1 I10_05mG_E5 I10_05mG_E6 I10_05mG_E7 I10_C5 I10_C6 I10_C7
## 1 item 500 500 500 0 0 0
## 2 ENSDARG00000102141 453 656 590 636 529 453
## 3 ENSDARG00000102123 331 505 401 465 488 430
## 4 ENSDARG00000114503 897 1055 1073 1017 1022 992
## 5 ENSDARG00000115971 12 23 24 21 40 32
## 6 ENSDARG00000098311 326 544 416 484 330 297
## I10_C8 I17_50mG_E5 I17_50mG_E6 I17_50mG_E8 I17_5mG_E5 I17_5mG_E7 I17_5mG_E8
## 1 0 50000 50000 50000 5000 5000 5000
## 2 566 790 557 1005 688 775 425
## 3 582 498 427 815 450 675 307
## 4 1124 1153 900 2115 1064 1555 824
## 5 30 22 14 34 19 36 21
## 6 396 502 380 1065 522 688 332
## I17_C5 I17_C7 I17_C8
## 1 0 0 0
## 2 719 516 667
## 3 522 475 541
## 4 1449 1152 1341
## 5 44 27 22
## 6 666 471 506
```

Whatever the way you format your data, **we strongly recommend
you to carefully look at the following sections** to
**check that you use the good scale for you data**, which
**depends on the type of measured signal ** (counts of
reads, fluorescence signal, …).

### 2.1.2 What types of data can be analyzed using DRomics ?

#### 2.1.2.1 Description of the classical types of data handled by DRomics

DRomics offers the possibility to work on different types of omics
data (see following subsections for their description) but also on
continuous anchoring data. **When working on omics data, all the
lines of the data frame** (except the first one coding for the
doses or concentrations) **correspond to the same type of
data** (e.g. raw counts for RNAseq data). **When working on
anchoring data, the different lines** (except the first one
coding for the doses or concentrations) **correspond to different
endpoints that may correspond to different types of data**
(e.g. biomass, length,..), but all are assumed continuous data
compatible with a Gaussian (normal) error model (after transformation if
needed, e.g. logarithmic transformation) for the selection and modeling
steps (see the section on least squares if
you need a reminder on the is condition).

**Three types of omics data** may be imported in DRomics
using the following functions:

**RNAseqdata()**should be used to import**RNAseq as counts of reads**(for details look at the example with RNAseq data),**microarraydata()**should be used to import**single-channel microarray data in log2 scale**(for details look at the example with microarray data),**continuousomicdata()**should be used to import**other continuous omics data**such as metabolomics data, or proteomics data (only when expressed in intensity),…,**in a scale that enables the use of a Gaussian error model**(for details look at the example with metabolomic omics data).

It is also possible to import in DRomics **continuous anchoring
data** measured at the apical level, especially for **sake
of comparison of benchmark doses** (see Step
4 for definition of BMD) estimated **at different levels of
organization** but using **the same metrics**.
Nevertheless, one should keep in mind that the DRomics workflow was
optimized for an automatic analysis of high throughput omics data
(especially implying a selection and modeling steps on high-dimensional
data) and that **other tools may be better suited for the sole
analysis of apical dose-response data** (for details look at the
example with continuous apical data).

In Steps 1 and 2 **count data** are internally analysed
using functions of the Bioconductor package DESeq2,
continuous omics data (**microarray data and other continuous
omics data**) are internally analysed using functions of the
Bioconductor package limma
and **continuous anchoring data** are internally analysed
using the classical lm() function.

##### 2.1.2.1.1 An example with RNAseq data

For RNAseq data, **imperatively imported in raw counts**
(if your counts come from Kallisto or Salmon put add the argument
`round.counts = TRUE`

in order to round them), you have to
choose the transformation method used to stabilize the variance (“rlog”
or “vst”). In the example below “vst” was used to make this vignette
quick to compile, but **“rlog” is recommended and chosen by
default even if more computer intensive than “vst” except when the
number of samples is very large (> 30)** (as encountered for
*in situ* data for example: see ?RNAseqdata and the section dedicated to *in situ* data for
details on this point). Whatever the chosen method, **data are
automatically normalized with respect to library size and transformed in
log2 scale**.

```
RNAseqfilename <- system.file("extdata", "RNAseq_sample.txt", package = "DRomics")
# RNAseqfilename <- "yourchosenname.txt" # for a local file
```

```
## Elements of the experimental design in order to check the coding of the data:
## Tested doses and number of replicates for each dose:
##
## 0 0.22 0.67 2 6
## 2 3 3 3 3
## Number of items: 999
## Identifiers of the first 20 items:
## [1] "NM_144958" "NR_102758" "NM_172405" "NM_029777" "NM_001130188"
## [6] "NM_207141" "NM_001162368" "NM_008117" "NM_001168290" "NM_010910"
## [11] "NM_001004147" "NM_001146318" "NM_145597" "NM_001161797" "NM_021483"
## [16] "NR_002862" "NR_033520" "NM_134027" "NM_010381" "NM_019388"
## Data were normalized with respect to library size and tranformed using
## the following method: vst
```

The plot of the output shows the distribution of the signal on all the contigs/genes, for each sample, before and after the normalization and transformation of data.

##### 2.1.2.1.2 An example with microarray data

For single-channel microarray data, **imperatively imported in
log scale** (classical and recommended log2 scale), you can
choose between array normalization methods (“cyclicloess”, “quantile”,
“scale” or “none”). In the example below, “quantile” was used to make
this vignette quick to compile, but **“cyclicloess” is recommended
and chosen by default even if more computer intensive than the
others** (see ?microarraydata for details).

```
microarrayfilename <- system.file("extdata", "transcripto_sample.txt", package = "DRomics")
# microarrayfilename <- "yourchosenname.txt" # for a local file
```

```
## Elements of the experimental design in order to check the coding of the data:
## Tested doses and number of replicates for each dose:
##
## 0 0.69 1.223 2.148 3.774 6.631
## 5 5 5 5 5 5
## Number of items: 1000
## Identifiers of the first 20 items:
## [1] "1" "2" "3" "4" "5.1" "5.2" "6.1" "6.2" "7.1" "7.2"
## [11] "8.1" "8.2" "9.1" "9.2" "10.1" "10.2" "11.1" "11.2" "12.1" "12.2"
## Data were normalized between arrays using the following method: quantile
```

The plot of the output shows the distribution of the signal over all the probes, for each sample, before and after the normalization of data.

##### 2.1.2.1.3 An example with metabolomic data

**Neither normalization nor transformation** is provided
in the function continuousomicdata(). The **pre-treatment of
metabolomic data must be done before importation of data**, and
**data must be imported in log scale**, so that they can be
directly modelled using a **Gaussian (normal) error
model**. This strong hypothesis is required both for selection of
items and for dose-reponse modeling (see the section on least squares for a reminder if
needed). In the context of a multi-omics approach we recommend the use
of a log2 transformation, instead of the classical log10 for such data,
so as to facilitate the comparison of results obtained with
transcriptomics data generally handled in a log2 scale.

For instance, a basic procedure for the pre-treatment of metabolomic data could follow the three steps described thereafter: i) removing of metabolites for which the proportion of missing data (non detected) across all the samples is too high (more than 20 to 50 percents according to your tolerance level); ii) retrieving of missing values data using half minimum method (i.e. half of the minimum value found for a metabolite across all samples); iii) log-transformation of values. If a scaling to the total intensity (normalization by sum of signals in each sample) or another normalization is necessary and pertinent, we recommend to do it before those three previously described steps.

```
metabolofilename <- system.file("extdata", "metabolo_sample.txt", package = "DRomics")
# metabolofilename <- "yourchosenname.txt" # for a local file
```

```
## Elements of the experimental design in order to check the coding of the data:
## Tested doses and number of replicates for each dose:
##
## 0 0.69 1.1 1.79 2.92 4.78 7.76
## 10 6 2 2 2 6 2
## Number of items: 109
## Identifiers of the first 20 items:
##
## [1] "P_2" "P_4" "P_5" "P_6" "P_7" "P_10" "P_11" "P_12" "P_14" "P_16"
## [11] "P_19" "P_21" "P_22" "P_26" "P_32" "P_34" "P_35" "P_36" "P_37" "P_38"
```

The plot of the output shows the distribution of the signal over all the metabolites, for each sample.

The deprecated metabolomicdata() function was renamed
continuousomicdata() in the recent versions of the package (while
keeping the first name available) to **offer its use to other
continuous omic data** such as **proteomics data**
(when expressed in intensity) or **RT-qPCR data**. As for
metabolomic data, the **pre-treatment** of other continuous
omic data must be done **before importation**, and
**data must be imported in a scale that enables the use of a
Gaussian error model** as this strong hypothesis is required both
for selection of items and for dose-response modeling.

##### 2.1.2.1.4 An example with continuous anchoring apical data

No transformation is provided in the function
continuousanchoringdata(). **If needed the pre-treatment of data
must be done before importation of data**, so that they can be
directly modelled using a **Gaussian error model**. This
strong hypothesis is required both for selection of responsive endpoints
and for dose-reponse modeling (see the section
on least squares for a reminder if needed).

```
anchoringfilename <- system.file("extdata", "apical_anchoring.txt", package = "DRomics")
# anchoringfilename <- "yourchosenname.txt" # for a local file
```

In the following example the argument backgrounddose is used to
specify that doses below or equal to 0.1 are considered as 0 in the
DRomics workflow. Specifying this argument is necessary when there is no
dose at 0 in the data (see section on *in
situ* data for details on this point).

```
## Elements of the experimental design in order to check the coding of the data:
## Tested doses and number of replicates for each dose:
##
## 0 2.4 3.8 6.2 10.1 16.5 26.8 43.5 70.7
## 12 6 2 2 2 6 2 2 2
## Number of endpoints: 2
## Names of the endpoints:
## [1] "growth" "photosynthesis"
```

For such data the plot() function simply provides a dose-response plot for each endpoint.

#### 2.1.2.2 Handling of data collected through specific designs

The DRomics workflow was first developed for data collected through a
typical dose-response experiment, with a reasonable number of tested
doses (or concentrations - at least 4 in addition to the control and
ideally 6 to 8) and a small number of replicates per dose. Recently we
made some modifications in the package to **make possible the use
of designs with only 3 doses in addition to the control even if this
type of design is not recommended for dose-response
modeling**.

**We also extended our workflow to handle in situ
(observational) data**, for which there is

**no replication, as the dose (or concentration) is not controlled**(see an example with in situ data for more details).

It is also now possible to handle **experimental data collected
using a design with a batch effect** using DRomics together with
functions from the Bioconductor package sva
to correct for this batch effect before selection and modeling steps. We
also developed the **PCAplot() function to help visualizing this
batch effect** and the impact of the **batch effect
correction (BEC)** on data (see an example
with RNAseq data from an experiment with a batch effect for more
details on how to handle such a case and see ?PCAplot for details on
this specific function **that could also be used to identify
potential outlier samples**).

##### 2.1.2.2.1 An example with
*in situ* (observational) RNAseq data

One of the problem that may occur in particular with *in situ*
data, is the absence of real control samples, corresponding to a
strictly null exposure dose or concentration. **To prevent an
hazardous calculation of the BMD (see Step 4 for
definition of BMD) by extrapolation in such a case, one should use the
argument backgrounddose to define the maximal measured dose that can be
considered as a negligible dose.** All doses below or equal to
the value given in backgrounddose will be fixed at 0, so as to be
considered at the **background level of exposition**.

For *in situ* data (and more generally for data with a very
large number of samples), the **use of the rlog transformation in
RNAseqdata() is not recommended**, both for speed reason and
because you are **more likely to encounter a problem with the rlog
transformation in case of outliers** in such a case (see https://support.bioconductor.org/p/105334/
for an explanation of the author of DESeq2
and if you want to see an example of problems that may appear with
outliers in that case, just force the `transfo.method`

to
`"rlog"`

in the following example).

```
datafilename <- system.file("extdata", "insitu_RNAseq_sample.txt", package="DRomics")
# Importation of data specifying that observed doses below the background dose
# fixed here to 2e-2 will be considered as null dose to have a control
(o.insitu <- RNAseqdata(datafilename, backgrounddose = 2e-2, transfo.method = "vst"))
```

```
## Elements of the experimental design in order to check the coding of the data:
## Tested doses and number of replicates for each dose:
##
## 0 0.0205 0.0216 0.0248 0.0272 0.0322 0.0339 0.0387 0.0432 0.0463
## 25 1 1 1 1 1 1 1 1 1
## 0.04866 0.0528 0.0726 0.101 0.112 0.1122 0.167 2.089 2.474 2.892
## 1 1 3 1 1 1 1 1 1 1
## 2.899 2.904 3.008 3.16 3.199 3.251 3.323 3.483 3.604 4.484
## 1 1 1 1 1 1 1 1 1 1
## 4.917 4.924 5.509 8.53 9.16 9.38 9.5 9.83 10.35 11.06
## 1 1 1 1 1 1 1 1 1 1
## 11.34 11.94 12.11 12.39 13.51 14.8 16.26 16.89 18.2 18.28
## 1 1 1 1 1 1 1 1 1 1
## 19.14 20.62 20.7 21.13 23.0348 23.69 24.15 27.06 28.05 29.88
## 1 1 1 1 1 1 1 1 1 1
## 36.28
## 1
## Number of items: 1000
## Identifiers of the first 20 items:
## [1] "N00000000001_c0_g1" "N00000000002_c0_g1" "N00000000003_c0_g1"
## [4] "N00000000004_c0_g1" "N00000000005_c0_g1" "N00000000006_c0_g1"
## [7] "N00000000008_c0_g1" "N10000_c0_g1" "N10000_c0_g1.1"
## [10] "N10001_c0_g1" "N10008_c1_g1" "N10010_c0_g1"
## [13] "N10010_c1_g1" "N10011_c0_g1" "N10012_c0_g1"
## [16] "N100156_c0_g1" "N1001_c0_g1" "N10021_c0_g1"
## [19] "N10021_c0_g1.1" "N10021_c0_g1.2"
## Data were normalized with respect to library size and tranformed using
## the following method: vst
```

The plot of the output shows the distribution of the signal on all the contigs, for each sample (here the box plots are stuck to each other due to the large number of samples), before and after the normalization and transformation of data.

##### 2.1.2.2.2 An example with RNAseq data from an experiment with a batch effect

When omics data are collected through a **design with a known
potential batch effect**, the **DRomics function
PCAplot()** can be used as in the example below to
**visualize the impact of this batch effect on the data**.
If it seems necessary, **functions from specific packages**
can then be used to perform **batch effect correction
(BEC)**. We recommend the use of functions
**ComBat()** and **ComBat_seq()** from the
Bioconductor **sva**
package for this purpose, respectively for **microarray**
(or other continuous omic data) and **RNAseq data**.

As **sva**
is a Bioconductor package, it must be installed in the same way as
**DESeq2**
and **limma**
previously to be loaded. If needed look at the DRomics web page to get
the good instruction to install Bioconductor packages: https://lbbe.univ-lyon1.fr/fr/dromics).

Below is an example using ComBat-seq() on RNAseq data with batch effect. As the sva package does not import the RNAseq data in the same format as DRomics, it is necessary to use the DRomics function formatdata4DRomics() to interoperate between ComBat-seq in DRomics functions (see the section on importation as an R object for details on this function or ?formatdata4DRomics).

```
## List of 3
## $ counts: int [1:1000, 1:16] 453 331 897 12 326 533 1948 904 583 154 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:1000] "ENSDARG00000102141" "ENSDARG00000102123" "ENSDARG00000114503" "ENSDARG00000115971" ...
## .. ..$ : chr [1:16] "I10_05mG_E5" "I10_05mG_E6" "I10_05mG_E7" "I10_C5" ...
## $ dose : num [1:16] 500 500 500 0 0 0 0 50000 50000 50000 ...
## $ batch : Factor w/ 2 levels "I10","I17": 1 1 1 1 1 1 1 2 2 2 ...
```

```
## zebraf$batch
## zebraf$dose I10 I17
## 0 4 3
## 500 3 0
## 5000 0 3
## 50000 0 3
```

It appears in this design that the data were obtained using to batches, with only the controlled condition (null dose) appearing in both batches.

```
# Formating of data using the formatdata4DRomics() function
data4DRomics <- formatdata4DRomics(signalmatrix = zebraf$counts,
dose = zebraf$dose)
# Importation of data just to use DRomics functions
# As only raw data will be given to ComBat_seq after
(o <- RNAseqdata(data4DRomics))
```

```
## Just wait, the transformation using regularized logarithm (rlog) may
## take a few minutes.
```

```
## Elements of the experimental design in order to check the coding of the data:
## Tested doses and number of replicates for each dose:
##
## 0 500 5000 50000
## 7 3 3 3
## Number of items: 1000
## Identifiers of the first 20 items:
## [1] "ENSDARG00000102141" "ENSDARG00000102123" "ENSDARG00000114503"
## [4] "ENSDARG00000115971" "ENSDARG00000098311" "ENSDARG00000104839"
## [7] "ENSDARG00000100143" "ENSDARG00000102474" "ENSDARG00000104049"
## [10] "ENSDARG00000102226" "ENSDARG00000103095" "ENSDARG00000102128"
## [13] "ENSDARG00000110470" "ENSDARG00000100422" "ENSDARG00000104632"
## [16] "ENSDARG00000100660" "ENSDARG00000113107" "ENSDARG00000099787"
## [19] "ENSDARG00000112451" "ENSDARG00000070546"
## Data were normalized with respect to library size and tranformed using
## the following method: rlog
```

The PCA plot shows an impact of the batch effect, that clearly appears on controls (red points) which were obtained on two different batches.

```
# Batch effect correction using ComBat_seq{sva}
require(sva)
BECcounts <- ComBat_seq(as.matrix(o$raw.counts),
batch = as.factor(zebraf$batch),
group = as.factor(o$dose))
```

```
# Formating of data after batch effect correction
BECdata4DRomics <- formatdata4DRomics(signalmatrix = BECcounts,
dose = o$dose)
o.BEC <- RNAseqdata(BECdata4DRomics)
```

```
## Just wait, the transformation using regularized logarithm (rlog) may
## take a few minutes.
```

The PCA plot after BEC shows the impact of the correction on the batch effect that is no more visible on controls (red points).

## 2.2 Step 2: selection of significantly responding items

For the second step of the workflow, the function
**itemselect()** must be used simply **taking as
input in a first argument the output of the function used in step 1** (output of RNAseqdata(),
microarraydata(), continuousomicdata() or continuousanchoringdata()).
Below is an example with microarray data.

```
## Number of selected items using a quadratic trend test with an FDR of 0.01: 150
## Identifiers of the first 20 most responsive items:
## [1] "383.2" "384.2" "363.1" "383.1" "384.1" "363.2" "364.2" "364.1" "300.2"
## [10] "301.1" "300.1" "301.2" "263.2" "27.2" "25.1" "368.1" "351.1" "15"
## [19] "370" "350.2"
```

The **false discovery rate (FDR) corresponds to the expected
proportion of items that will be falsely detected as
responsive**. **With a very large data set** it is
important to define a selection step based on an FDR not only **to
reduce the number of items to be further processed**, but also
**to remove too noisy dose-response signals that may impair the
quality of the results**. We recommend to set a value between
0.001 and 0.1 depending of the initial number of items. When this number
is very high (more than several tens of thousands), we recommend a FDR
less than 0.05 (0.001 to 0.01) to increase the robustness of the results
(Larras et
al. 2018).

Concerning the method used for selection, **we recommend the
default choice (“quadratic”) for a typical omics dose-response design
(many doses/concentrations with few replicates per condition)**.
It enables the **selection of both monotonic and biphasic
dose-response relationships**. If you want to focus on monotonic
dose-response relationships, the “linear” method could be chosen. For a
design with a small number of doses/concentrations and many replicates
(not optimal for dose-response modeling), the “ANOVA” method could be
preferable. **For in situ data** (observational
data without replicates due to uncontrolled dose),

**only trend tests will be proposed**as the use of an ANOVA test in absence of replicates for some conditions is not reasonable.

Each of the three methods proposed for this selection step is based
on the use of a simple model (quadratic,linear or ANOVA-type) linking
the signal to the dose in a rank scale. This model is internally fitted
to data by an **empirical Bayesian approach** using the
respective packages **DESeq2**
and **limma**
for **RNAseq data** and **microarray or continuous
omics data**, and by classical **linear regression**
using the lm() function for **continuous anchoring data**.
The adjustment of p-values according to the specified FDR is performed
in any case, even for continuous anchoring data, so as to ensure the
unicity of the workflow independently of the type of data.

See ?itemselect for more details and Larras et al. 2018 comparison of the three proposed methods on an example.

It is easy, using for example the package
**VennDiagram**, to compare the selection of items obtained
using two different methods, as in the following example.

```
require(VennDiagram)
s_lin <- itemselect(o.microarray, select.method = "linear", FDR = 0.01)
index_quad <- s_quad$selectindex
index_lin <- s_lin$selectindex
plot(c(0,0), c(1,1), type = "n", xaxt = "n", yaxt = "n", bty = "n", xlab = "", ylab = "")
draw.pairwise.venn(area1 = length(index_quad), area2 = length(index_lin),
cross.area = length(which(index_quad %in% index_lin)),
category = c("quadratic trend test", "linear trend test"),
cat.col=c("cyan3", "darkorange1"), col=c("black", "black"),
fill = c("cyan3", "darkorange1"), lty = "blank", cat.pos = c(1,11))
```

## 2.3 Step 3: fit of dose-response models, choice of the best fit for each curve

### 2.3.1 Fit of the best model

For Step 3 the function drcfit() simply **takes as input in a
first argument the output of itemselect()**. For each item
selected in Step 2, the model that best fits the dose-response data is
chosen among a **family of five simple models built to describe a
wide variety of monotonic and biphasic dose-response curves
(DRC)** (and exclusively monotonic and biphasic curves : it is
why more flexible models such as polynomial third and fourth order
classical polynomial models were deliberately not considered). For a
complete description of those models see the last
section of Step 3 or Larras et
al. 2018.

The procedure used to **select the best fit** is based
on an **information criterion** as described in Larras et al. 2018
and in ?drcfit. The **classical** and **former
default option** of the **AIC** (Akaike criterion -
default information criterion used in DRomics versions < 2.2-0) was
**replaced by the default use of the AICc** (second-order
Akaike criterion) in order to **prevent the overfitting**
that may occur with dose-response designs with a small number of data
points, as recommended and now classically done in regression (Hurvich
and Tsai, 1989; Burnham and Anderson DR, 2004).

As the call to the drcfit() function may take time when the number of
pre-selected items is large, by default a progressbar is provided. Some
arguments of this function can be used to specify **parallel
computing to accelerate the computation** (see ?drcfit for
details).

```
## Results of the fitting using the AICc to select the best fit model
## 23 dose-response curves out of 150 previously selected were removed
## because no model could be fitted reliably.
## Distribution of the chosen models among the 127 fitted dose-response curves:
##
## Hill linear exponential Gauss-probit
## 0 30 39 48
## log-Gauss-probit
## 10
## Distribution of the trends (curve shapes) among the 127 fitted dose-response curves:
##
## U bell dec inc
## 19 38 37 33
```

In the following you can see the first lines of the output data frame
on our example (see ?drcfit for a complete description of the columns of
the output data frame.) This output data frame provides information for
each item, such as **the best-fit model**, **its
parameter values**, the **standard residual error
(SDres)** (see the section on least
squares for his definition), **coordinates of particular
points**, and the **trend of the curve** (among
increasing, decreasing, U-shaped, bell-shaped).

An extensive description of the outputs of the complete DRomics workflow is provided in the last section of the main workflow. Note that the number of items successfully fitted (output of Step 3) is often smaller that the number of items selected in Step 2, as for some of the selected items, all models may fail to converge or fail to significantly better describe the data than the constant model.

```
## id irow adjpvalue model nbpar b c d e f SDres
## 1 383.2 725 2.08e-07 Gauss-probit 4 5.5836 8.58 8.58 1.70 3.62 0.157
## 2 384.2 727 2.08e-07 exponential 3 -0.0298 NA 12.24 1.76 NA 0.160
## 3 363.1 686 2.24e-07 exponential 3 -0.2058 NA 9.10 3.11 NA 0.218
## 4 383.1 724 2.24e-07 Gauss-probit 4 5.4879 8.75 8.75 1.72 3.58 0.169
## 5 384.1 726 3.41e-07 Gauss-probit 4 6.8453 7.26 7.26 1.77 5.01 0.158
## 6 363.2 687 7.01e-07 exponential 3 -0.1467 NA 9.10 2.77 NA 0.206
## typology trend y0 yatdosemax yrange maxychange xextrem yextrem
## 1 GP.bell bell 12.0 11.03 1.17 1.004 1.70 12.2
## 2 E.dec.concave dec 12.2 10.99 1.25 1.251 NA NA
## 3 E.dec.concave dec 9.1 7.58 1.53 1.527 NA NA
## 4 GP.bell bell 12.2 11.15 1.18 1.012 1.72 12.3
## 5 GP.bell bell 12.1 11.15 1.11 0.949 1.77 12.3
## 6 E.dec.concave dec 9.1 7.64 1.46 1.461 NA NA
```

### 2.3.2 Plot of fitted curves

By default the plot() function used on the output of the drcfit()
function provides the first 20 fitted curves (or the ones you specify
using the argument items) with observed points. **Fitted curves
are represented in red, replicates are represented in open circles and
means of replicates at each dose/concentration are represented by solid
circles**. All the fitted curves may be saved in a pdf file using
the plotfit2pdf() function (see ?drcfit).

The fitted curves are by default **represented using a log
scale** for dose/concentration, which is more suited in common
cases where the range of observed doses/concentrations is very wide
and/or where tested doses/concentrations are obtained by dilutions. Use
`dose_log_transfo = FALSE`

to use the raw scale of doses (see
?drcfit for details and examples). It is why the observations at the
control appear differently as the other observations, as half circles on
the y-axis, to remind that their true value is minus infinity in a log
scale.

Another **specific plot function named targetplot() can be used
to plot targeted items, whether they were or not selected** in
step 2 and fitted in step 3. See an example below and details in
?targetplot

### 2.3.3 Plot of residuals

**To check the assumption of the Gaussian error model**
(see the section on least squares), two
types of residual plots can be used, `"dose_residuals"`

for
plot of residuals against the observed doses/concentrations, or
`"fitted_residuals"`

for plot of residuals against fitted
values of the modeled signal. The residual plots for all items may also
be saved in a pdf file using the plotfit2pdf() function (see
?drcfit).

### 2.3.4 Description of the family of dose-response models fitted in DRomics

The best fit model is chosen among the five following models describing the observed signal \(y\) as a function of \(x\) the dose (or concentration):

- the
**linear model**: \[y = d + b \times x\] with**2 parameters**, \(b\) the slope and \(d\) the mean signal at control. - the
**exponential model**: \[y = d + b \times \left(exp\left(\frac{x}{e}\right)-1\right)\] with**3 parameters**,

\(b\) a shape parameter, \(d\) the mean signal at control, \(e\) a shape parameter.- When \(e>0\) the dose response curve - DRC - is increasing if \(b>0\) and decreasing if \(b<0\), with no asymptote for high doses.
- When \(e<0\) the DRC is increasing if \(b<0\) and decreasing if \(b>0\), with an asymptote at \(d-b\) for high doses.

- the
**Hill model**: \[y = c + \frac{d-c}{1+(\frac{x}{e})^b}\] with**4 parameters**,

\(b\) (\(>0\)) a shape parameter, \(c\) the asymtotic signal for high doses, \(d\) the mean signal at control, and \(e\) (\(>0\)) the dose at the inflection point of the sigmoid. - the
**Gauss-probit model**built as the sum of a Gauss and a probit part sharing the same parameters as defined below:\[y = f \times exp\left(-0.5 \left(\frac{x-e}{b}\right)^2\right) +d+(c-d) \times \Phi\left(\frac{x-e}{b}\right)\] with**5 parameters**,

\(b\) (\(>0\)) a shape parameter corresponding to the standard deviation of the Gauss part of the model, \(c\) the asymtotic signal for high doses, \(d\) the asymptotic signal on the left of the DRC (generally corresponding to a fictive negative dose), \(e\) (\(>0\)) a shape parameter corresponding to the mean of the Gauss part of the model, and \(f\) the amplitude and sign of the Gauss part of the model (the model is U-shaped if \(f<0\) and bell-shaped if \(f<0\)).

\(\Phi\) represents the cumulative distribution function (CDF) of the standard Gauss (also named normal or Gaussian) distribution.

This model encompasses**two simplifed versions with 4 parameters**, one**monotonic**(for \(f=0\)) and one with**symmetrical asymptotes**(for \(c=d\)). - the
**log-Gauss-probit model**, a variant of the previous one with on the log scale of the dose: \[y = f \times exp\left(-0.5\left(\frac{ln(x)-ln(e)}{b}\right)^2\right) +d+(c-d) \times \Phi\left(\frac{ln(x)-ln(e)}{b}\right)\]

with**5 parameters**,

\(b\) (\(>0\)) a shape parameter corresponding to the standard deviation of the Gauss part of the model, \(c\) the asymtotic signal for high dose, \(d\) the asymptotic signal on the left of the DRC, reached at the control (for \(ln(x) = ln(0) = -\infty\)), \(ln(e)\) (\(>0\)) a shape parameter corresponding to the mean of the Gauss part of the model, and \(f\) the amplitude and sign of the Gauss part of the model (the model is U-shaped if \(f<0\) and bell-shaped if \(f<0\)).

\(\Phi\) represents the cumulative distribution function (CDF) of the standard Gauss distribution.

As the previous one, this model encompasses**two simplifed versions with 4 parameters**, one**monotonic**(for \(f=0\)) and one with**symmetrical asymptotes**(for \(c=d\)).

This family of five models was built to be able to describe a wide
range of monotonic and biphasic DRC. In the following plot were
represented typologies of curves that can be described using those
models, depending of the definition of their parameters. In the
following plot the curves are represented with the signal in y-axis and
the raw dose in x-axis. **As the range of tested or observed doses
is often large**, we decided to plot model fits **by
default using a log scale of doses**. The **shape of models
will be transformed in this log x-scale**, and **especially
the linear model will no more appear as a straight line as
below**.

### 2.3.5 Reminder on least squares regression

It is important when using DRomics to have in mind that the
dose-response models are fitted using the **least squares
regression**, assuming an **additive Gaussian (normal)
error model** for the observed signal. It is why the scale under
which you should import your data is very important: a log (or
pseudo-log) transformation may be necessary to meet the use conditions
of the model for some types of data.

Let us recall the formulation of the Gaussian model defining the signal
(after transformation if needed) \(y\)
as a function of the dose (or concentration) \(x\), \(f\)
being one of the five models previously described, and \(\theta\) the vector of its parameters (of
length 2 to 5).

\[y = f(x, \theta) + \epsilon\] with \[\epsilon \sim N(0, \sigma)\] \(N(0, \sigma)\) representing the Gaussian (normal) distribution of mean \(0\) and standard deviation (SD) \(\sigma\).

In this model, the **residual standard deviation \(\sigma\) is assumed constant**. It
is the classical “homoscedasticity” hypothesis (see the following figure
for an illustration). The examination of residuals (see the section on
plot of residuals) is a good way to check that
the error model is not strongly violated on your data.

## 2.4 Step 4: calculation of benchmark doses (BMD)

### 2.4.1 Calculation of BMD

The **two types of benchmark doses (BMD-zSD and BMD-xfold)
proposed by the EFSA
(2017) are systematically calculated** for each fitted
dose-response curve using the function **bmdcalc()** with
the output of the drcfit() function as a first argument, **but we
strongly recommend the use of the first one (BMD-zSD)** for
reasons explained in Larras et al. 2018
(see ?bmdcalc for details on this function).

```
## 1 BMD-xfold values and 0 BMD-zSD values are not defined (coded NaN as
## the BMR stands outside the range of response values defined by the model).
## 59 BMD-xfold values and 0 BMD-zSD values could not be calculated (coded
## NA as the BMR stands within the range of response values defined by the
## model but outside the range of tested doses).
```

For the **recommended BMD-zSD**,the argument \(z\), by default at 1, is used to define the
**BMD-zSD** as the dose at which the response is reaching
the **BMR (benchmark response)** defined as \[BMR = y_0 \pm z \times SD\] with \(y_0\) the level at the control given by the
dose-response fitted model and \(SD\)
the residual standard deviation of the dose-response fitted model (also
named \(\sigma\) in the previous mathematical definition of the Gaussian
model).

For the less recommended BMD-xfold, the argument \(x\), by default at 10 (for 10%), is used to define the BMD-xfold as the dose at which the response is reaching the BMR defined as \(BMR = y_0 \pm \frac{x}{100} \times y_0\). So this second BMD version does not take into account the residual standard deviation, and is strongly dependent of the magnitude of \(y_0\), which may be a problem if the signal at the control is close to 0, which is not rare on omics data that are classically handled on log scale.

In the following you can see the first lines of the output data frame
of the function bmdcalc() on our example. BMD values are coded
`NA`

when the BMR stands within the range of response values
defined by the model but outside the range of tested doses, and
`NaN`

when the BMR stands outside the range of response
values defined by the model due to asymptotes.

Very low BMD values obtained by extrapolation between 0 and the smallest
non null tested dose, that correspond to very sensitive items (that we
do not want to exclude), are thresholded at minBMD, an argument by
default fixed at the smallest non null tested dose divided by 100, but
that can be fixed by the user as what he considers to be a negligible
dose.

An extensive description of the outputs of the complete DRomics workflow is provided in the last section of the main workflow. You can also see ?bmdcalc for a complete description of its arguments and of the columns of its output data frame.

```
## id irow adjpvalue model nbpar b c d e f SDres
## 1 383.2 725 2.08e-07 Gauss-probit 4 5.5836 8.58 8.58 1.70 3.62 0.157
## 2 384.2 727 2.08e-07 exponential 3 -0.0298 NA 12.24 1.76 NA 0.160
## 3 363.1 686 2.24e-07 exponential 3 -0.2058 NA 9.10 3.11 NA 0.218
## 4 383.1 724 2.24e-07 Gauss-probit 4 5.4879 8.75 8.75 1.72 3.58 0.169
## 5 384.1 726 3.41e-07 Gauss-probit 4 6.8453 7.26 7.26 1.77 5.01 0.158
## 6 363.2 687 7.01e-07 exponential 3 -0.1467 NA 9.10 2.77 NA 0.206
## typology trend y0 yatdosemax yrange maxychange xextrem yextrem BMD.zSD
## 1 GP.bell bell 12.0 11.03 1.17 1.004 1.70 12.2 1.33
## 2 E.dec.concave dec 12.2 10.99 1.25 1.251 NA NA 3.26
## 3 E.dec.concave dec 9.1 7.58 1.53 1.527 NA NA 2.25
## 4 GP.bell bell 12.2 11.15 1.18 1.012 1.72 12.3 1.52
## 5 GP.bell bell 12.1 11.15 1.11 0.949 1.77 12.3 1.41
## 6 E.dec.concave dec 9.1 7.64 1.46 1.461 NA NA 2.43
## BMR.zSD BMD.xfold BMR.xfold
## 1 12.19 NA 10.83
## 2 12.08 6.59 11.02
## 3 8.89 5.26 8.19
## 4 12.33 NA 10.95
## 5 12.26 NA 10.89
## 6 8.90 5.47 8.19
```

### 2.4.2 Plots of the BMD distribution

The default plot of the output of the bmdcalc() function provides the distribution of benchmark doses as an ECDF (Empirical Cumulative Density Function) plot for the chosen BMD (“zSD”” or “xfold”). See an example below.

Different alternative plots are proposed (see ?bmdcalc for details) that can be obtained using the argument plottype to choose the type of plot (“ecdf”, “hist” or “density”) and the argument by to split the plot for example by “trend”. You can also use the bmdplot() function to make an ECDF plot of BMDs and personalize it (see ?bmdplot for details).

On a BMD ECDF plot one can add a **color gradient for each
item** coding for the **intensity of the signal**
(after shift of the control signal at 0) as a function of the dose (see
?bmdplotwithgradient for details and an example below). It is generally
necessary to use the argument line.size to manually adjust the width of
lines in that plot as the default value does not always give a
satisfactory resut. It is also recommended (but not mandatory but it is
the default option for the argument `scaling`

) to scale the
signal in order to focus on the shape of the dose-reponse curves and not
on the amplitude of the signal change.

### 2.4.3 Calculation of confidence intervals on the BMDs by bootstrap

**Confidence intervals on BMD values** can be calculated
by **bootstrap**. As the call to this function may take
much time, by default a progressbar is provided and some arguments can
be used to specify parallel computing to accelerate the computation (see
?bmdboot for details).

In the example below, a small number of iterations was used just to
make this vignette quick to compile, but **the default value of
the argument niter (1000) should be considered as a minimal value to
obtain stable results**.

```
## Bootstrap confidence interval computation failed on 21 items among 127
## due to lack of convergence of the model fit for a fraction of the
## bootstrapped samples greater than 0.5.
## For 5 BMD.zSD values and 65 BMD.xfold values among 127 at least one
## bound of the 95 percent confidence interval could not be computed due
## to some bootstrapped BMD values not reachable due to model asymptotes
## or reached outside the range of tested doses (bounds coded Inf)).
```

This function gives an output corresponding to the output of the bmdcalc() function completed with bounds of BMD confidence intervals (by default 95% confidence intervals) and the number of bootstrap iterations for which the model was successfully fitted to the data. An extensive description of the outputs of the complete DRomics workflow is provided in the last section of the main workflow.

```
## id irow adjpvalue model nbpar b c d e f SDres
## 1 383.2 725 2.08e-07 Gauss-probit 4 5.5836 8.58 8.58 1.70 3.62 0.157
## 2 384.2 727 2.08e-07 exponential 3 -0.0298 NA 12.24 1.76 NA 0.160
## 3 363.1 686 2.24e-07 exponential 3 -0.2058 NA 9.10 3.11 NA 0.218
## 4 383.1 724 2.24e-07 Gauss-probit 4 5.4879 8.75 8.75 1.72 3.58 0.169
## 5 384.1 726 3.41e-07 Gauss-probit 4 6.8453 7.26 7.26 1.77 5.01 0.158
## 6 363.2 687 7.01e-07 exponential 3 -0.1467 NA 9.10 2.77 NA 0.206
## typology trend y0 yatdosemax yrange maxychange xextrem yextrem BMD.zSD
## 1 GP.bell bell 12.0 11.03 1.17 1.004 1.70 12.2 1.33
## 2 E.dec.concave dec 12.2 10.99 1.25 1.251 NA NA 3.26
## 3 E.dec.concave dec 9.1 7.58 1.53 1.527 NA NA 2.25
## 4 GP.bell bell 12.2 11.15 1.18 1.012 1.72 12.3 1.52
## 5 GP.bell bell 12.1 11.15 1.11 0.949 1.77 12.3 1.41
## 6 E.dec.concave dec 9.1 7.64 1.46 1.461 NA NA 2.43
## BMR.zSD BMD.xfold BMR.xfold BMD.zSD.lower BMD.zSD.upper BMD.xfold.lower
## 1 12.19 NA 10.83 0.489 3.93 Inf
## 2 12.08 6.59 11.02 1.724 4.58 6.40
## 3 8.89 5.26 8.19 1.366 3.52 4.61
## 4 12.33 NA 10.95 0.423 4.17 Inf
## 5 12.26 NA 10.89 0.481 4.36 Inf
## 6 8.90 5.47 8.19 1.334 3.31 4.75
## BMD.xfold.upper nboot.successful
## 1 Inf 36
## 2 Inf 47
## 3 5.96 50
## 4 Inf 36
## 5 Inf 26
## 6 6.00 50
```

The plot() function applied on the output of the bmdboot() function gives an ECDF plot of the chosen BMD with the confidence interval of each BMD (see ?bmdcalc for examples). By default BMDs with an infinite confidence interval bound are not plotted.

### 2.4.4 Filtering BMDs according to estimation quality

Using the bmdfilter() function, it is possible to use one of the three filters proposed to retain only the items associated to the best estimated BMD values.

- By default are retained only the items for which the BMD and its
confidence interval are defined (using
`"CIdefined"`

) (so excluding items for which the bootstrap procedure failed). - One can be even more restrictive by retaining items only if the BMD
confidence interval is within the range of tested/observed doses (using
`"CIfinite"`

), - or less restrictive (using
`"BMDdefined"`

) requiring that the BMD point estimate only must be defined within the range of tested/observed doses.

Let us recall that in the `bmdcalc()`

output, if it is not
the case the BMD is coded `NA`

or `NaN`

.

Below is an example of application of the different filters based on BMD-xfold values, chosen just to better illustrate the way filters work, as there far more bad BMD-xfold estimations than bad BMD-zSD estimations.

```
# Plot of BMDs with no filtering
subres <- bmdfilter(b$res, BMDfilter = "none")
bmdplot(subres, BMDtype = "xfold", point.size = 2, point.alpha = 0.4,
add.CI = TRUE, line.size = 0.4) + theme_bw()
```

```
# Plot of items with defined BMD point estimate
subres <- bmdfilter(b$res, BMDtype = "xfold", BMDfilter = "definedBMD")
bmdplot(subres, BMDtype = "xfold", point.size = 2, point.alpha = 0.4,
add.CI = TRUE, line.size = 0.4) + theme_bw()
```

### 2.4.5 Plot of fitted curves with BMD values and confidence intervals

It is possible to add the output of bmdcalc() (or of bmdboot()) in
the argument BMDoutput of the plot() function of drcfit(), in order to
add BMD values (when defined) as a **vertical line on each fitted
curve**, and **bounds of the confidence intervals**
(when successfully calculated) as **two dashed lines**.
**Horizontal dotted lines corresponding to the two BMR potential
values will be also added**. See an example below.

```
# If you do not want to add the confidence intervals just replace b
# the output of bmdboot() by r the output of bmdcalc()
plot(f, BMDoutput = b)
```

All the fitted curves may also be saved in the same way in a pdf file using the plotfit2pdf() function (see ?drcfit).

### 2.4.6 Plot of all the fitted curves in one figure with points at BMD-BMR values

It is possible to use the curvesplot() function to plot all the
fitted curves in one figure and to add The use of the curvesplot()
function is more extensively described in the next parts and in the
corresponding help page. By default in this plot the curves are scaled
to focus on the shape of the dose-response and not on their amplitude
(add `scaling = FALSE`

to see the curves without
scaling).

In the following plot we also added vertical lines corresponding to tested doses on the plot and add transparency to visualize the density of curves when shapes are similar (especially the case for linear shapes).