Getting started with wrProteo

Wolfgang Raffelsberger

2020-04-29

Introduction

This package contains a collection of various tools in Proteomics. To get started, we need to load the package “wrMisc” and this package(wrProteo), both are available from CRAN.

Calculating molecular masses from composition formulas

Molecular masses based on (summed) chemical formulas

massDeFormula(c("12H12O","HO"," 2H 1 Se, 6C 2N","HSeCN"," ","e"))
#>  -> massDeFormula :  can't find ' ' .. setting to 0 mass
#>       12H12O         1H1O  2H1Se1e6C2N  1H1Se1e1C1N                        1e 
#> 2.040329e+02 1.700274e+01 1.819389e+02 1.069280e+02 0.000000e+00 5.485799e-04

Molecular masses based on amino-acid sequence

AAmass()
#>         A         R         N         D         C         E         Q         G 
#>  71.03711 156.10111 114.04293 115.02694 103.00918 129.04259 128.05858  57.02146 
#>         H         I         L         K         M         F         P         S 
#> 137.05891 113.08406 113.08406 128.09496 131.04048 147.06841  97.05276  87.03203 
#>         T         W         Y         V 
#> 101.04768 186.07931 163.06333  99.06841
## mass of peptide (or protein)
pep1 <- c(aa="AAAA",de="DEFDEF")
convAASeq2mass(pep1,seqN=FALSE)
#>       aa       de 
#> 302.1590 800.2865

Reading Fasta files (from Uniprot)

This package contains a parser for Fasta-files allowing to separate different fields of meta-data like IDs, Name and Species. Here we will read a tiny example fasta-file obtained from a collection of typical contaminants in proteomics.

path1 <- system.file('extdata',package='wrProteo')
fiNa <-  "conta1.fasta"
## basic reading of Fasta
fasta1 <- readFasta2(file.path(path1,fiNa))
str(fasta1)
#>  Named chr [1:35] "FPTDDDDKIVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWVVSAAHCYKSRIQVRLGEHNIDVLEGNEQFINAAKIITHPNFNGNTLDNDIMLIKLSSPATLNSRVATV"| __truncated__ ...
#>  - attr(*, "names")= chr [1:35] "TRYP_PIG TRYPSIN PRECURSOR" "TRY1_BOVIN Cationic trypsin (Fragment) - Bos taurus (Bovine)" "CTRA_BOVIN Chymotrypsinogen A - Bos taurus (Bovine)" "CTRB_BOVIN Chymotrypsinogen B - Bos taurus (Bovine)" ...

## now let's read and further separate annotation-fields
fasta2 <- readFasta2(file.path(path1,fiNa),tableOut=TRUE)
str(fasta2)
#>  chr [1:35, 1:9] "generic" "generic" "generic" "generic" "generic" ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : chr [1:9] "database" "uniqueIdentifier" "entryName" "proteinName" ...

Importing and treating data from quantitative proteomics

This package provides support for importing quantitation results from Proteome Discoverer, MaxQuant and Proline. All quantitation import functions offer special features for separating species for optional different treatment. Normalization aims to eliminate/reduce variablity introduced to the data not linked to the original biological question. Of course, a well designed experiment with sufficent replicates is crucial for performing efficient normalization.
Technical replicates are very frequently found in proteomics, they allow to asses the influence of repeated injection of the same material. Biological replicates allow interpreting experiments in a more general way. Multiple options for normalization are available in the package wrMisc and can be called when importing data. Global median or global mean normalization have proved to provide robust results in numerous settings. However, they depend on the hypothesis that there are no global changes. This implies that the number of proteins changing abundance in a strong way should be small.

Typical contaminants in proteomics like keratins are very abudant and thus may influence global mean normalization. The import functions from this package let’s the user to choose if all proteins or a subset should be considered when determining normalization factors. For example, you may want to exclude contaminants from normalization since their dyamics may reflect other cofactors than the biological question. In the case of benchmark-tests it is very common to different species, like the human spike-in set UPS-1.
In such cases you may orient normalization on a primary species to avoid having high concentrations of spike-in standards influencing the normalization.

All import functions generate lists separating (selected) annotation data (annot) from normalized log2-quantitation data (quant) and initial quantitation (raw).

Read data from Proteome Discoverer

In Proteome Discoverer quantitation data on level of consensus-proteins should be exported to tabulated text files, which can be treated by this function. Note, for this vignette we have only extracted a few hundred lines to make the data-set easier to manipulate.

path1 <- system.file("extdata",package="wrProteo")
fiNa <- "exampleProtDiscov1.txt"
dataPD <- readPDExport(file=fiNa,path=path1)

## a summary of the quantitation data
summary(dataPD$quant)
#>      S1rep1          S1rep2          S1rep3          S2rep1     
#>  Min.   :17.63   Min.   :17.49   Min.   :18.37   Min.   :17.15  
#>  1st Qu.:20.18   1st Qu.:20.23   1st Qu.:20.03   1st Qu.:20.27  
#>  Median :21.72   Median :21.72   Median :21.72   Median :21.72  
#>  Mean   :21.75   Mean   :21.83   Mean   :21.71   Mean   :21.90  
#>  3rd Qu.:23.06   3rd Qu.:23.19   3rd Qu.:23.02   3rd Qu.:23.29  
#>  Max.   :28.16   Max.   :28.29   Max.   :28.16   Max.   :28.70  
#>  NA's   :1       NA's   :3       NA's   :1       NA's   :2      
#>      S2rep2          S2rep3     
#>  Min.   :17.58   Min.   :17.14  
#>  1st Qu.:20.23   1st Qu.:20.09  
#>  Median :21.72   Median :21.72  
#>  Mean   :21.86   Mean   :21.78  
#>  3rd Qu.:23.26   3rd Qu.:23.13  
#>  Max.   :28.65   Max.   :28.58  
#>                  NA's   :2

Read data from MaxQuant

Typically MaxQuant exports by default quantitation data on level of consensus-proteins as a folder called txt with a file called proteinGroups.txt . So in a standard case one needs only to provide the path to this file. However, for this vignette we have only extracted a few hundred lines to make the data-set easier to manipulate, and thus it has gotten a different name.

fiNa <- "proteinGroupsMaxQuantUps1.txt"
specPref1=c(conta="conta|CON_|LYSC_CHICK",mainSpecies="YEAST",spike="HUMAN_UPS")
dataMQ <- readMaxQuantFile(path1,file=fiNa,specPref=specPref1)
#>    by species : conta: 9  mainSpe: 245  spike: 48

## a summary of the quantitation data
summary(dataMQ$quant)
#>    25000am.1       25000am.2       25000am.3        2500am.1    
#>  Min.   :20.95   Min.   :21.22   Min.   :20.96   Min.   :19.52  
#>  1st Qu.:27.01   1st Qu.:27.01   1st Qu.:26.97   1st Qu.:26.96  
#>  Median :27.66   Median :27.66   Median :27.66   Median :27.66  
#>  Mean   :27.74   Mean   :27.74   Mean   :27.74   Mean   :27.33  
#>  3rd Qu.:28.44   3rd Qu.:28.44   3rd Qu.:28.48   3rd Qu.:28.48  
#>  Max.   :33.20   Max.   :33.19   Max.   :33.11   Max.   :33.10  
#>  NA's   :2       NA's   :2       NA's   :3       NA's   :10     
#>     2500am.2        2500am.3        250am.1         250am.2     
#>  Min.   :18.69   Min.   :19.53   Min.   :18.64   Min.   :17.98  
#>  1st Qu.:26.99   1st Qu.:26.91   1st Qu.:27.08   1st Qu.:27.08  
#>  Median :27.66   Median :27.66   Median :27.66   Median :27.66  
#>  Mean   :27.32   Mean   :27.29   Mean   :27.70   Mean   :27.66  
#>  3rd Qu.:28.49   3rd Qu.:28.48   3rd Qu.:28.53   3rd Qu.:28.54  
#>  Max.   :33.22   Max.   :33.16   Max.   :33.11   Max.   :33.13  
#>  NA's   :9       NA's   :10      NA's   :39      NA's   :38     
#>     250am.3     
#>  Min.   :18.75  
#>  1st Qu.:27.06  
#>  Median :27.66  
#>  Mean   :27.71  
#>  3rd Qu.:28.47  
#>  Max.   :33.08  
#>  NA's   :41

Read data from Proline

In Proline quantitation data on level of consensus-proteins can be exported to csv or tabulated text files, which can be treated by this function. Note, for this vignette we have only extracted a few hundred lines to make the data-set easier to manipulate.

path1 <- system.file("extdata",package="wrProteo")
fiNa <- "exampleProlineABC.csv"
dataPL <- readProlineFile(file.path(path1,fiNa))
#> -> readProlineFile -> extrColsDeX :  Can't find column(s) 'protein_set.score'

## a summary of the quantitation data
summary(dataPL$quant)
#>       A_01            A_02            A_03            A_04      
#>  Min.   :14.26   Min.   :14.09   Min.   :14.10   Min.   :14.35  
#>  1st Qu.:19.26   1st Qu.:19.26   1st Qu.:19.22   1st Qu.:19.37  
#>  Median :20.80   Median :20.67   Median :20.72   Median :20.65  
#>  Mean   :20.98   Mean   :20.95   Mean   :20.93   Mean   :21.00  
#>  3rd Qu.:22.50   3rd Qu.:22.46   3rd Qu.:22.54   3rd Qu.:22.48  
#>  Max.   :28.66   Max.   :28.62   Max.   :28.61   Max.   :28.73  
#>  NA's   :29      NA's   :27      NA's   :21      NA's   :27     
#>       B_01            B_02            B_03            B_04      
#>  Min.   :15.78   Min.   :12.15   Min.   :15.32   Min.   :13.37  
#>  1st Qu.:18.83   1st Qu.:18.90   1st Qu.:18.78   1st Qu.:18.61  
#>  Median :20.30   Median :20.22   Median :20.26   Median :20.17  
#>  Mean   :20.43   Mean   :20.42   Mean   :20.42   Mean   :20.33  
#>  3rd Qu.:22.03   3rd Qu.:22.08   3rd Qu.:22.00   3rd Qu.:21.94  
#>  Max.   :27.61   Max.   :27.74   Max.   :27.75   Max.   :27.75  
#>  NA's   :77      NA's   :77      NA's   :76      NA's   :73     
#>       C_01            C_02            C_03            C_04      
#>  Min.   :14.44   Min.   :14.35   Min.   :14.69   Min.   :13.84  
#>  1st Qu.:19.31   1st Qu.:19.32   1st Qu.:19.18   1st Qu.:19.41  
#>  Median :20.92   Median :20.85   Median :20.77   Median :20.80  
#>  Mean   :21.08   Mean   :21.06   Mean   :21.03   Mean   :21.10  
#>  3rd Qu.:22.65   3rd Qu.:22.63   3rd Qu.:22.66   3rd Qu.:22.65  
#>  Max.   :28.85   Max.   :28.84   Max.   :28.83   Max.   :28.83  
#>  NA's   :22      NA's   :18      NA's   :20      NA's   :21

Normalization

As mentioned, the aim of normalization is to remove bias in data not linked to the original (biological) question. The import functions presented above do already by default global median normalization. In overal it is important to inspect results from normalization, graphical display of histograms, boxplots or violin-plots usually work well to compare distributions. Multiple options exist for normalizing data, for example the function normalizeThis from the package wrMisc can be used to run additional normalization.

Different normalization procedures intervene with different ‘aggressiveness’, ie capacity of deformining the initial data. In general we suggest to start normalizing using ‘milder’ procedures, like global median and switch to more intervening methods if initial results seem not satisfactory. Beware, heavy normalization procedures may also alter the main information you want to analyze. Ie, some biologically true positive changes may start to fade or dissapear when inappropriate normalization gets performed. Please note, that normalization should be performed before NA-inputation to avoid introducing new bias in the group of imputed values.

Imputation of NA-values

In Proteomics the quantitation of very low abundances is very challenging and typically absent or very low abundances appear in the results as 0 or NA. Typically this may be liked to the fact that no peak is detected in a MS-1 chromatogram (for a given LC elution-time) where other samples had a strong peak with succesful MS-2 identification. Data teated with MaxQuant have typically a higher degree of NA values. To simplify the treatment all 0 values are transformed to NA, anyway they would not allow log2 transformation either.

Before replacing NA-values it is important to verify that such values may be associated to absent or very low abundances. To do so, we suggest to inspect groups of replicate-measurments. In particular with multiple technical replicates of the same sample it is supposed that any variability observed is not linked to the sample itself. So for each NA that occurs in the data we suggest to look what was reported for the same protein with the other (technical) replicates. This brings us to the term of ‘NA-neighbours’ (quantifications for the same protein in replicates). When drawing histograms of NA-neighbours one can visually inspect and verify that NA-neighbours are typically low abundance values, however, but not necessarily the lowest values observed in the entire data-set.

This package proposes two related strategies for NA-imputation. First, the classical imputation of NA-values using Normal distributed random data is presented. Then, a more elaborate version based on repeated implementations to obtain more robust results is presented.

## Let's inspect NA values as graphic
matrixNAinspect(dataPD$quant,gr=gl(2,3),tit="Tiny example data from ProteomeDiscoverer") 

## Let's inspect NA values as graphic
matrixNAinspect(dataMQ$quant,gr=gl(3,3),tit="Tiny example data from MaxQuant") 

## Let's inspect NA values as graphic
matrixNAinspect(dataPL$quant,gr=as.factor(substr(colnames(dataPL$quant),1,1)),
  tit="Tiny example data from Proline") 

So only if the hypothesis of NA-neighbours as typically low abundance values gets confirmed by visual inspection of the histograms, one may proceed to replacing them by low random values. If one uses a unique (very) low value for NA-replacements, this will easily pose a problem when t-tests should be employed to look for proteins changing abundance between two or more groups of samples. Therefore it is common practice to draw random values from a Normal distribution representing this lower end of abundance values. Nevertheless, the choice of the parameters of this Normal distribution is very delicate.

The functions proposed here offer automatic selection of these parameters, which have been tested in a number of different projects. However, this choice should be checked by critically inpecting the histograms of ‘NA-neighbours’ (ie successful quantitations in other replicate samples of the same protein) and the final resulting distribution. Initially all NA-neighbours are extracted. It is also worth mentioning that in the majority of data-sets we encountered, such NA-neighbours form skewed distributions. Occasionally one may also find instances where 2 out of 3 (or more replicates) are NA. The successful quantitations of such instances with 2 NA-values per group are considered even more representative, but of course much less observed values remain. Thus a primary choice is made: If the selection of (min) 2 NA-values per group has more than 300 values, this distribution will be used as base to model the distribution for drawing random values. In this case, by default the 0.18 quantile of the 2 NA-neighbour distribution will be used as mean for the new Normal distribution used for NA-replacements. If the number of 2 NA-neighbours is >= 300, (by default) the 0.1 quantile all NA-neighbour values will used. Of course, the user has also the possibility to use custom choices for these parameters.

The final replacement is done on all NA values. This also includes proteins with are all NA in a given condition as well a instances of mixed successful qunatitation and NA values.

## ProteomeDiscoverer
dataPD$NArepl <- matrixNAneighbourImpute(dataPD$quant,gr=gl(2,3),
  tit="Tiny example from ProteomeDiscoverer") 
#>  -> matrixNAneighbourImpute :  n.woNA= 1785  n.NA = 9
#>     model 10 %-tile of (min 1 NA/grp) 10 NA-neighbour values
#>     imputation: mean= 18.1   sd= 0.56

## MaxQuant
dataMQ$NArepl <- matrixNAneighbourImpute(dataMQ$quant,gr=gl(3,3),tit="Tiny example from MaxQuant") 
#>  -> matrixNAneighbourImpute :  n.woNA= 2564  n.NA = 154
#>     model 10 %-tile of (min 1 NA/grp) 36 NA-neighbour values
#>     imputation: mean= 19   sd= 0.74

## Proline
dataPL$NArepl <- matrixNAneighbourImpute(dataPL$quant,gr=gl(3,4),tit="Tiny example from Proline") 
#>  -> matrixNAneighbourImpute :  n.woNA= 4012  n.NA = 488
#>     model 10 %-tile of (min 1 NA/grp) 285 NA-neighbour values
#>     imputation: mean= 16.4   sd= 0.62

However, imputing using normal distributed random data also brings the risk of occasional extreme values. In the extreme case it may happen that a given protein is all NA in one group, and by chance the random values turn out be rather high. Then the final group mean of imputed values may be even higher than the mean of another group with successfull quantitations. Of course in this case it would be a bad interpretation to consider the protein in question upregulated in the sample where initially all values were NA. To circumvent this problem there are 2 options : Firstly, one may use special filtering schemes to exclude such constellations from final results or secondly, one could repeat replacement of NA-values numerous times.

This type of filtering can be performed using presenceFilt (package wrMisc).

The function testRobustToNAimputation allows such repeated replacement of NA-values. This function performs NA-imputation and statistical testing (after repeated imputation) between all groups of samples the same time (as it would be inefficient to separate these two tasks). The tests underneith apply shrinkage from the empirical Bayes procedure from the bioconductor package limma.

## Impute NA-values repeatedly and run statistical testing after each round of imputations
testPL <- testRobustToNAimputation(dataPL$quant,gr=gl(3,4),lfdrInclude=FALSE) 
#>  -> matrixNAneighbourImpute :  n.woNA= 4012  n.NA = 488
#>     model 10 %-tile of (min 1 NA/grp) 285 NA-neighbour values
#>     imputation: mean= 16.4   sd= 0.62
#>  -> combineMultFilterNAimput :     at presenceFilt:   352 360 357   out of  375 
#>  -> combineMultFilterNAimput :     at abundanceFilt:  346 355 353
#>  -> combineMultFilterNAimput :    at NA> mean:   284, 338 and 288
## Note, this example is too small for reliable lfdr estimation (would give warning)

## test results: classical BH FDR
head(testPL$BH)
#>               1-2       1-3          2-3
#> [1,] 1.277109e-01 0.1607831 2.065986e-01
#> [2,]           NA 0.8787038           NA
#> [3,] 1.358713e-04 0.9537559 1.132843e-04
#> [4,] 2.051206e-05 0.6742809 7.391326e-06
#> [5,] 3.871227e-06 0.1925214 4.751582e-07
#> [6,] 3.886196e-07 0.3789325 4.504443e-07
sum(testPL$BH[,1] < 0.05,na.rm=TRUE)
#> [1] 242
## the data after repeated NA-imputation
head(testPL$datImp)
#>          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
#> [1,] 22.56945 22.19699 22.39452 22.42658 22.48411 22.66792 22.41511 22.90290
#> [2,] 17.61153 18.00776 16.35186 17.15051 16.29294 16.39084 16.34952 16.30732
#> [3,] 25.40560 25.28552 25.29061 25.33628 24.08922 24.01434 23.94017 24.86961
#> [4,] 19.35777 19.39723 19.73350 19.06219 17.77645 17.89538 18.11154 17.76764
#> [5,] 23.43934 23.55885 23.47963 23.28007 22.72747 22.61432 22.56039 22.50835
#> [6,] 25.36381 25.32356 25.17127 25.20114 24.21076 24.23250 24.10540 24.12083
#>          [,9]    [,10]    [,11]    [,12]
#> [1,] 22.97558 22.85138 22.70813 22.64852
#> [2,] 17.26139 17.44857 17.38586 17.40649
#> [3,] 25.32381 25.35219 25.33581 25.38618
#> [4,] 19.64330 19.08661 19.58276 19.89625
#> [5,] 23.64638 23.70737 23.60144 23.63291
#> [6,] 25.22358 25.07744 25.14947 25.09119

Session-Info

#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18363)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=C                   LC_CTYPE=French_France.1252   
#> [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
#> [5] LC_TIME=French_France.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] wrProteo_1.1.1 wrMisc_1.2.0  
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.4.6       digest_0.6.25      magrittr_1.5       evaluate_0.14     
#>  [5] rlang_0.4.5        stringi_1.4.6      limma_3.43.11      rmarkdown_2.1     
#>  [9] RColorBrewer_1.1-2 tools_4.0.0        stringr_1.4.0      xfun_0.13         
#> [13] yaml_2.2.1         compiler_4.0.0     htmltools_0.4.0    knitr_1.28