This package contains a parser for Fasta-files allowing to separate different fields of meta-data like IDs, Name and Species. Here we will read a tiny example fasta-file obtained from a collection of typical contaminants in proteomics.
path1 <- system.file('extdata',package='wrProteo') fiNa <- "conta1.fasta" ## basic reading of Fasta fasta1 <- readFasta2(file.path(path1,fiNa)) str(fasta1) #> Named chr [1:35] "FPTDDDDKIVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWVVSAAHCYKSRIQVRLGEHNIDVLEGNEQFINAAKIITHPNFNGNTLDNDIMLIKLSSPATLNSRVATV"| __truncated__ ... #> - attr(*, "names")= chr [1:35] "TRYP_PIG TRYPSIN PRECURSOR" "TRY1_BOVIN Cationic trypsin (Fragment) - Bos taurus (Bovine)" "CTRA_BOVIN Chymotrypsinogen A - Bos taurus (Bovine)" "CTRB_BOVIN Chymotrypsinogen B - Bos taurus (Bovine)" ... ## now let's read and further separate annotation-fields fasta2 <- readFasta2(file.path(path1,fiNa),tableOut=TRUE) str(fasta2) #> chr [1:35, 1:9] "generic" "generic" "generic" "generic" "generic" ... #> - attr(*, "dimnames")=List of 2 #> ..$ : NULL #> ..$ : chr [1:9] "database" "uniqueIdentifier" "entryName" "proteinName" ...
This package provides support for importing quantitation results from Proteome Discoverer, MaxQuant and Proline. All quantitation import functions offer special features for separating species for optional different treatment. Normalization aims to eliminate/reduce variablity introduced to the data not linked to the original biological question. Of course, a well designed experiment with sufficent replicates is crucial for performing efficient normalization.
Technical replicates are very frequently found in proteomics, they allow to asses the influence of repeated injection of the same material. Biological replicates allow interpreting experiments in a more general way. Multiple options for normalization are available in the package wrMisc and can be called when importing data. Global median or global mean normalization have proved to provide robust results in numerous settings. However, they depend on the hypothesis that there are no global changes. This implies that the number of proteins changing abundance in a strong way should be small.
Typical contaminants in proteomics like keratins are very abudant and thus may influence global mean normalization. The import functions from this package let’s the user to choose if all proteins or a subset should be considered when determining normalization factors. For example, you may want to exclude contaminants from normalization since their dyamics may reflect other cofactors than the biological question. In the case of benchmark-tests it is very common to different species, like the human spike-in set UPS-1.
In such cases you may orient normalization on a primary species to avoid having high concentrations of spike-in standards influencing the normalization.
All import functions generate lists separating (selected) annotation data (annot) from normalized log2-quantitation data (quant) and initial quantitation (raw).
In Proteome Discoverer quantitation data on level of consensus-proteins should be exported to tabulated text files, which can be treated by this function. Note, for this vignette we have only extracted a few hundred lines to make the data-set easier to manipulate.
## a summary of the quantitation data summary(dataPD$quant) #> S1rep1 S1rep2 S1rep3 S2rep1 #> Min. :17.63 Min. :17.49 Min. :18.37 Min. :17.15 #> 1st Qu.:20.18 1st Qu.:20.23 1st Qu.:20.03 1st Qu.:20.27 #> Median :21.72 Median :21.72 Median :21.72 Median :21.72 #> Mean :21.75 Mean :21.83 Mean :21.71 Mean :21.90 #> 3rd Qu.:23.06 3rd Qu.:23.19 3rd Qu.:23.02 3rd Qu.:23.29 #> Max. :28.16 Max. :28.29 Max. :28.16 Max. :28.70 #> NA's :1 NA's :3 NA's :1 NA's :2 #> S2rep2 S2rep3 #> Min. :17.58 Min. :17.14 #> 1st Qu.:20.23 1st Qu.:20.09 #> Median :21.72 Median :21.72 #> Mean :21.86 Mean :21.78 #> 3rd Qu.:23.26 3rd Qu.:23.13 #> Max. :28.65 Max. :28.58 #> NA's :2
Typically MaxQuant exports by default quantitation data on level of consensus-proteins as a folder called txt with a file called proteinGroups.txt . So in a standard case one needs only to provide the path to this file. However, for this vignette we have only extracted a few hundred lines to make the data-set easier to manipulate, and thus it has gotten a different name.
## a summary of the quantitation data summary(dataMQ$quant) #> 25000am.1 25000am.2 25000am.3 2500am.1 #> Min. :20.95 Min. :21.22 Min. :20.96 Min. :19.52 #> 1st Qu.:27.01 1st Qu.:27.01 1st Qu.:26.97 1st Qu.:26.96 #> Median :27.66 Median :27.66 Median :27.66 Median :27.66 #> Mean :27.74 Mean :27.74 Mean :27.74 Mean :27.33 #> 3rd Qu.:28.44 3rd Qu.:28.44 3rd Qu.:28.48 3rd Qu.:28.48 #> Max. :33.20 Max. :33.19 Max. :33.11 Max. :33.10 #> NA's :2 NA's :2 NA's :3 NA's :10 #> 2500am.2 2500am.3 250am.1 250am.2 #> Min. :18.69 Min. :19.53 Min. :18.64 Min. :17.98 #> 1st Qu.:26.99 1st Qu.:26.91 1st Qu.:27.08 1st Qu.:27.08 #> Median :27.66 Median :27.66 Median :27.66 Median :27.66 #> Mean :27.32 Mean :27.29 Mean :27.70 Mean :27.66 #> 3rd Qu.:28.49 3rd Qu.:28.48 3rd Qu.:28.53 3rd Qu.:28.54 #> Max. :33.22 Max. :33.16 Max. :33.11 Max. :33.13 #> NA's :9 NA's :10 NA's :39 NA's :38 #> 250am.3 #> Min. :18.75 #> 1st Qu.:27.06 #> Median :27.66 #> Mean :27.71 #> 3rd Qu.:28.47 #> Max. :33.08 #> NA's :41
In Proline quantitation data on level of consensus-proteins can be exported to csv or tabulated text files, which can be treated by this function. Note, for this vignette we have only extracted a few hundred lines to make the data-set easier to manipulate.
## a summary of the quantitation data summary(dataPL$quant) #> A_01 A_02 A_03 A_04 #> Min. :14.26 Min. :14.09 Min. :14.10 Min. :14.35 #> 1st Qu.:19.26 1st Qu.:19.26 1st Qu.:19.22 1st Qu.:19.37 #> Median :20.80 Median :20.67 Median :20.72 Median :20.65 #> Mean :20.98 Mean :20.95 Mean :20.93 Mean :21.00 #> 3rd Qu.:22.50 3rd Qu.:22.46 3rd Qu.:22.54 3rd Qu.:22.48 #> Max. :28.66 Max. :28.62 Max. :28.61 Max. :28.73 #> NA's :29 NA's :27 NA's :21 NA's :27 #> B_01 B_02 B_03 B_04 #> Min. :15.78 Min. :12.15 Min. :15.32 Min. :13.37 #> 1st Qu.:18.83 1st Qu.:18.90 1st Qu.:18.78 1st Qu.:18.61 #> Median :20.30 Median :20.22 Median :20.26 Median :20.17 #> Mean :20.43 Mean :20.42 Mean :20.42 Mean :20.33 #> 3rd Qu.:22.03 3rd Qu.:22.08 3rd Qu.:22.00 3rd Qu.:21.94 #> Max. :27.61 Max. :27.74 Max. :27.75 Max. :27.75 #> NA's :77 NA's :77 NA's :76 NA's :73 #> C_01 C_02 C_03 C_04 #> Min. :14.44 Min. :14.35 Min. :14.69 Min. :13.84 #> 1st Qu.:19.31 1st Qu.:19.32 1st Qu.:19.18 1st Qu.:19.41 #> Median :20.92 Median :20.85 Median :20.77 Median :20.80 #> Mean :21.08 Mean :21.06 Mean :21.03 Mean :21.10 #> 3rd Qu.:22.65 3rd Qu.:22.63 3rd Qu.:22.66 3rd Qu.:22.65 #> Max. :28.85 Max. :28.84 Max. :28.83 Max. :28.83 #> NA's :22 NA's :18 NA's :20 NA's :21
As mentioned, the aim of normalization is to remove bias in data not linked to the original (biological) question. The import functions presented above do already by default global median normalization. In overal it is important to inspect results from normalization, graphical display of histograms, boxplots or violin-plots usually work well to compare distributions. Multiple options exist for normalizing data, for example the function normalizeThis from the package wrMisc can be used to run additional normalization.
Different normalization procedures intervene with different ‘aggressiveness’, ie capacity of deformining the initial data. In general we suggest to start normalizing using ‘milder’ procedures, like global median and switch to more intervening methods if initial results seem not satisfactory. Beware, heavy normalization procedures may also alter the main information you want to analyze. Ie, some biologically true positive changes may start to fade or dissapear when inappropriate normalization gets performed. Please note, that normalization should be performed before NA-inputation to avoid introducing new bias in the group of imputed values.
In Proteomics the quantitation of very low abundances is very challenging and typically absent or very low abundances appear in the results as 0 or NA. Typically this may be liked to the fact that no peak is detected in a MS-1 chromatogram (for a given LC elution-time) where other samples had a strong peak with succesful MS-2 identification. Data teated with MaxQuant have typically a higher degree of NA values. To simplify the treatment all 0 values are transformed to NA, anyway they would not allow log2 transformation either.
Before replacing NA-values it is important to verify that such values may be associated to absent or very low abundances. To do so, we suggest to inspect groups of replicate-measurments. In particular with multiple technical replicates of the same sample it is supposed that any variability observed is not linked to the sample itself. So for each NA that occurs in the data we suggest to look what was reported for the same protein with the other (technical) replicates. This brings us to the term of ‘NA-neighbours’ (quantifications for the same protein in replicates). When drawing histograms of NA-neighbours one can visually inspect and verify that NA-neighbours are typically low abundance values, however, but not necessarily the lowest values observed in the entire data-set.
This package proposes two related strategies for NA-imputation. First, the classical imputation of NA-values using Normal distributed random data is presented. Then, a more elaborate version based on repeated implementations to obtain more robust results is presented.
So only if the hypothesis of NA-neighbours as typically low abundance values gets confirmed by visual inspection of the histograms, one may proceed to replacing them by low random values. If one uses a unique (very) low value for NA-replacements, this will easily pose a problem when t-tests should be employed to look for proteins changing abundance between two or more groups of samples. Therefore it is common practice to draw random values from a Normal distribution representing this lower end of abundance values. Nevertheless, the choice of the parameters of this Normal distribution is very delicate.
The functions proposed here offer automatic selection of these parameters, which have been tested in a number of different projects. However, this choice should be checked by critically inpecting the histograms of ‘NA-neighbours’ (ie successful quantitations in other replicate samples of the same protein) and the final resulting distribution. Initially all NA-neighbours are extracted. It is also worth mentioning that in the majority of data-sets we encountered, such NA-neighbours form skewed distributions. Occasionally one may also find instances where 2 out of 3 (or more replicates) are NA. The successful quantitations of such instances with 2 NA-values per group are considered even more representative, but of course much less observed values remain. Thus a primary choice is made: If the selection of (min) 2 NA-values per group has more than 300 values, this distribution will be used as base to model the distribution for drawing random values. In this case, by default the 0.18 quantile of the 2 NA-neighbour distribution will be used as mean for the new Normal distribution used for NA-replacements. If the number of 2 NA-neighbours is >= 300, (by default) the 0.1 quantile all NA-neighbour values will used. Of course, the user has also the possibility to use custom choices for these parameters.
The final replacement is done on all NA values. This also includes proteins with are all NA in a given condition as well a instances of mixed successful qunatitation and NA values.
However, imputing using normal distributed random data also brings the risk of occasional extreme values. In the extreme case it may happen that a given protein is all NA in one group, and by chance the random values turn out be rather high. Then the final group mean of imputed values may be even higher than the mean of another group with successfull quantitations. Of course in this case it would be a bad interpretation to consider the protein in question upregulated in the sample where initially all values were NA. To circumvent this problem there are 2 options : Firstly, one may use special filtering schemes to exclude such constellations from final results or secondly, one could repeat replacement of NA-values numerous times.
This type of filtering can be performed using presenceFilt (package wrMisc).
The function testRobustToNAimputation allows such repeated replacement of NA-values. This function performs NA-imputation and statistical testing (after repeated imputation) between all groups of samples the same time (as it would be inefficient to separate these two tasks). The tests underneith apply shrinkage from the empirical Bayes procedure from the bioconductor package limma.
## Impute NA-values repeatedly and run statistical testing after each round of imputations testPL <- testRobustToNAimputation(dataPL$quant,gr=gl(3,4),lfdrInclude=FALSE) #> -> matrixNAneighbourImpute : n.woNA= 4012 n.NA = 488 #> model 10 %-tile of (min 1 NA/grp) 285 NA-neighbour values #> imputation: mean= 16.4 sd= 0.62 #> -> combineMultFilterNAimput : at presenceFilt: 352 360 357 out of 375 #> -> combineMultFilterNAimput : at abundanceFilt: 346 355 353 #> -> combineMultFilterNAimput : at NA> mean: 284, 338 and 288 ## Note, this example is too small for reliable lfdr estimation (would give warning) ## test results: classical BH FDR head(testPL$BH) #> 1-2 1-3 2-3 #> [1,] 1.277109e-01 0.1607831 2.065986e-01 #> [2,] NA 0.8787038 NA #> [3,] 1.358713e-04 0.9537559 1.132843e-04 #> [4,] 2.051206e-05 0.6742809 7.391326e-06 #> [5,] 3.871227e-06 0.1925214 4.751582e-07 #> [6,] 3.886196e-07 0.3789325 4.504443e-07 sum(testPL$BH[,1] < 0.05,na.rm=TRUE) #>  242 ## the data after repeated NA-imputation head(testPL$datImp) #> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] #> [1,] 22.56945 22.19699 22.39452 22.42658 22.48411 22.66792 22.41511 22.90290 #> [2,] 17.61153 18.00776 16.35186 17.15051 16.29294 16.39084 16.34952 16.30732 #> [3,] 25.40560 25.28552 25.29061 25.33628 24.08922 24.01434 23.94017 24.86961 #> [4,] 19.35777 19.39723 19.73350 19.06219 17.77645 17.89538 18.11154 17.76764 #> [5,] 23.43934 23.55885 23.47963 23.28007 22.72747 22.61432 22.56039 22.50835 #> [6,] 25.36381 25.32356 25.17127 25.20114 24.21076 24.23250 24.10540 24.12083 #> [,9] [,10] [,11] [,12] #> [1,] 22.97558 22.85138 22.70813 22.64852 #> [2,] 17.26139 17.44857 17.38586 17.40649 #> [3,] 25.32381 25.35219 25.33581 25.38618 #> [4,] 19.64330 19.08661 19.58276 19.89625 #> [5,] 23.64638 23.70737 23.60144 23.63291 #> [6,] 25.22358 25.07744 25.14947 25.09119
#> R version 4.0.0 (2020-04-24) #> Platform: x86_64-w64-mingw32/x64 (64-bit) #> Running under: Windows 10 x64 (build 18363) #> #> Matrix products: default #> #> locale: #>  LC_COLLATE=C LC_CTYPE=French_France.1252 #>  LC_MONETARY=French_France.1252 LC_NUMERIC=C #>  LC_TIME=French_France.1252 #> #> attached base packages: #>  stats graphics grDevices utils datasets methods base #> #> other attached packages: #>  wrProteo_1.1.1 wrMisc_1.2.0 #> #> loaded via a namespace (and not attached): #>  Rcpp_22.214.171.124 digest_0.6.25 magrittr_1.5 evaluate_0.14 #>  rlang_0.4.5 stringi_1.4.6 limma_3.43.11 rmarkdown_2.1 #>  RColorBrewer_1.1-2 tools_4.0.0 stringr_1.4.0 xfun_0.13 #>  yaml_2.2.1 compiler_4.0.0 htmltools_0.4.0 knitr_1.28