‘Omics’ data now form a core part of systems biology, enabling researchers to understand the integrated functions of a living organism. The integrative analysis of transcriptomics, proteomics and metabolomics data jointly measured on the same samples poses analytical challenges for the statistician, who must extract meaningful information while coping with the high dimension, noise and multicollinearity of these multiple data sets. To answer the biological questions correctly, appropriate statistical methodologies must be used that take into account the relationships between the different functional levels. The now well-known multivariate projection approaches greatly facilitate the understanding of complex data structures. In particular, PLS-based methods can address a variety of problems and provide valuable graphical outputs. These approaches are therefore an indispensable and versatile tool in the statistician’s repertoire.
Variable selection on high-throughput biological data has become inevitable in order to select relevant information and to propose a parsimonious model. In this article, we give a general survey of PLS before focusing on the latest developments of PLS for variable selection to deal with large omics data sets. In a specific discriminant analysis framework, we compare two variants of PLS for variable selection on a biological data set: a backward PLS based on Variable Importance in Projection (VIP), whose good performance has already been demonstrated, and a recently developed sparse PLS (sPLS) based on Lasso penalization of the loading vectors.
We demonstrate the good generalization performance of sPLS and its superior computational efficiency, and we underline the importance of the graphical outputs resulting from sPLS in facilitating the biological interpretation of the results.
‘Omics’ data are widely used in systems biology to understand the biological mechanisms involved in the functioning of living organisms. Integrating these transcriptomics, proteomics or metabolomics data, which are sometimes measured on the same samples, is a challenge for the statistician, who must be able to extract the relevant information they contain while coping with high-dimensional data that frequently suffer from multicollinearity. In this context, it is essential to identify the statistical methods able to answer the biological questions correctly, which sometimes involve relationships between different functional levels. Multivariate statistical techniques that project the data into reduced-dimension spaces greatly facilitate the understanding of the complex structures of omics data. In particular, PLS-based approaches are an indispensable tool in the statistician’s toolbox. Their great versatility makes it possible to address a wide variety of biological problems while providing graphical outputs that are relevant for biological interpretation.
Given the large number of variables considered (genes, proteins, ...), variable selection has become an inevitable step. The aim is to select only the relevant information in order to build the most parsimonious model possible. In this article, we present the PLS method and then focus on the latest developments in variable selection for PLS in the context of large omics data sets. Two PLS variable selection approaches are compared in a discriminant analysis applied to a biological data set: a backward approach based on the VIP (‘Variable Importance in Projection’) criterion, whose good performance has already been demonstrated in the literature, and sparse PLS (sPLS), a recent approach based on a Lasso penalization of the loading vectors.
Sparse PLS shows very good overall performance as well as a very clear superiority in computation time. It also demonstrates the effectiveness of the graphical representations produced by PLS for the biological interpretation of the results.
Keywords: Partial Least Squares regression, variable selection
@article{JSFS_2011__152_2_77_0,
  author = {L\^e Cao, Kim-Anh and Le Gall, Caroline},
  title = {Integration and variable selection of {\textquoteleft}omics{\textquoteright} data sets with {PLS:} a survey},
  journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
  pages = {77--96},
  publisher = {Soci\'et\'e fran\c{c}aise de statistique},
  volume = {152},
  number = {2},
  year = {2011},
  zbl = {1316.62007},
  language = {en},
  url = {http://www.numdam.org/item/JSFS_2011__152_2_77_0/}
}
Lê Cao, Kim-Anh; Le Gall, Caroline. Integration and variable selection of ‘omics’ data sets with PLS: a survey. Journal de la société française de statistique, Volume 152 (2011) no. 2, pp. 77-96. http://www.numdam.org/item/JSFS_2011__152_2_77_0/