Regularization is an important theme in statistics and machine learning and provides a principled way to address problems that would otherwise be ill-posed. It can be thought of as a restriction of the set of functions over which empirical risk minimization is performed. When the original empirical risk minimization problem is ill-posed, in the sense that it admits several solutions or that its solution is very sensitive to small changes in the data, constraining the optimization to a smaller set of functions is known to sometimes yield better estimates of the true (population) risk minimizer. In particular, when one expects a good estimate to have a certain type of regularity, using this measure of regularity to build the constraint can decrease the variance of the estimator without adding too much bias. With the growing availability of biological data from high-throughput technologies such as microarrays and next-generation sequencing, being able to apply statistical learning methods to predict which treatment is best suited to a patient, or how their disease is likely to evolve, is of utmost importance. Since in practice few samples are available compared to the dimension of the data (typically tens of thousands of measurements), designing adequate regularity measures from prior biological information is essential to make these problems amenable to statistical learning. Several such measures have been proposed in recent years to address particular problems. In this work, we review some of these methods. We also present one of them in more detail, designed to constrain the support of a linear function to be a union of predefined, possibly overlapping groups of covariates, and discuss its performance on a breast cancer dataset.
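In symbols, a minimal sketch of this setup (the notation $w$, $\ell$, $\lambda$, $\mathcal{G}$ below is generic, not taken from the paper): given $n$ samples $(x_i, y_i)$, regularized empirical risk minimization solves
\[
\hat{w} \in \operatorname*{arg\,min}_{w \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(y_i, w^\top x_i\bigr) + \lambda\, \Omega(w),
\]
where the penalty $\Omega$ encodes the expected regularity and $\lambda > 0$ controls the bias-variance trade-off. A penalty of the kind discussed here, whose minimizers have a support contained in a union of predefined, possibly overlapping groups $g \in \mathcal{G}$ of covariates, can be written as a decomposition norm:
\[
\Omega_{\cup}(w) = \min_{\substack{(v^{(g)})_{g \in \mathcal{G}} \,:\, \operatorname{supp}(v^{(g)}) \subseteq g, \\ \sum_{g \in \mathcal{G}} v^{(g)} = w}} \; \sum_{g \in \mathcal{G}} \bigl\lVert v^{(g)} \bigr\rVert_2.
\]
Each latent vector $v^{(g)}$ is supported on a single group, and the $\ell_2/\ell_1$ structure of the sum drives most of them to zero, so the support of $\hat{w}$ ends up being a union of the groups with nonzero $v^{(g)}$.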
Keywords: Bioinformatics, Supervised learning, Regularization
Jacob, Laurent. Regularized learning in bioinformatics. Journal de la société française de statistique, Volume 152 (2011) no. 2, pp. 51-76. http://www.numdam.org/item/JSFS_2011__152_2_51_0/