Special issue: high-dimensional data analysis
Variable clustering in high dimensional linear regression models
Journal de la société française de statistique, Volume 155 (2014) no. 2, pp. 38-56.

Over the last three decades, the advent of technologies for massive data collection has brought deep changes to many scientific fields. What was first seen as a blessing rapidly came to be known as the curse of dimensionality. Reducing dimensionality has therefore become a central challenge in statistical learning. In high dimensional linear regression models, the quest for parsimony has long been driven by the idea that a few relevant variables may suffice to describe the modeled phenomenon. Recently, a new paradigm was introduced in a series of articles from which the present work derives. We propose here a model that simultaneously performs variable clustering and regression. Our approach no longer considers the regression coefficients as fixed parameters to be estimated, but as unobserved random variables following a Gaussian mixture model. The latent partition is then determined by maximum likelihood, and predictions are obtained from the conditional distribution of the regression coefficients given the data. The number of latent components is chosen using a BIC criterion. Our model achieves very competitive predictive performance compared to standard approaches and brings significant improvements in interpretability.
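The idea above can be illustrated with a simplified two-stage sketch (not the authors' actual estimation algorithm, which fits the mixture and the regression jointly): coefficients are drawn from a Gaussian mixture, point estimates are obtained by ridge regression, the mixture is fitted to those estimates, and the number of components is chosen by minimizing BIC. All data, parameter values, and variable names here are illustrative, and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n, p = 200, 30

# True coefficients follow a 2-component Gaussian mixture (means 0 and 3),
# with z the latent partition of the p variables.
z = rng.integers(0, 2, size=p)
beta = np.where(z == 0, rng.normal(0.0, 0.1, p), rng.normal(3.0, 0.1, p))

X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(scale=0.5, size=n)

# Stage 1: point estimates of the coefficients (ridge keeps them stable).
b_hat = Ridge(alpha=1.0).fit(X, y).coef_.reshape(-1, 1)

# Stage 2: fit Gaussian mixtures with K = 1..4 components, then select K
# by minimizing BIC, as the abstract describes for the number of groups.
fits = [GaussianMixture(n_components=k, random_state=0).fit(b_hat)
        for k in range(1, 5)]
best = min(fits, key=lambda g: g.bic(b_hat))
labels = best.predict(b_hat)   # recovered grouping of the variables

print("chosen K:", best.n_components)
```

With well-separated coefficient groups, BIC recovers the two latent components and the predicted labels match the true partition up to a relabeling; the joint model of the paper goes further by also drawing predictions from the conditional distribution of the coefficients given the data.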

The last three decades have seen profound changes in many scientific disciplines. Some of these changes, directly linked to massive data collection, have given rise to numerous challenges in statistical learning; dimension reduction is one of them. In linear regression, the idea of parsimony has long been associated with the possibility of modeling a phenomenon with a small number of variables. A new paradigm was recently introduced, within which the present work fully fits. We present here a model that simultaneously fits a regression model and clusters the covariates. This model does not treat the regression coefficients as parameters to be estimated but rather as unobserved random variables following a Gaussian mixture distribution. The latent partition of the variables is estimated by maximum likelihood. The number of groups of variables is chosen by minimizing the BIC criterion. Our model has very good predictive performance, and its interpretation is made easy by the introduction of groups of variables.

Keywords: Dimension reduction, Linear regression, Variable clustering
@article{JSFS_2014__155_2_38_0,
     author = {Yengo, Lo{\"\i}c and Jacques, Julien and Biernacki, Christophe},
     title = {Variable clustering in high dimensional linear regression models},
     journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
     pages = {38--56},
     publisher = {Soci\'et\'e fran\c{c}aise de statistique},
     volume = {155},
     number = {2},
     year = {2014},
     zbl = {1316.62104},
     language = {en},
     url = {http://www.numdam.org/item/JSFS_2014__155_2_38_0/}
}
