Special issue: high-dimensional data analysis
Comparing Model Selection and Regularization Approaches to Variable Selection in Model-Based Clustering
[Comparison of regularization and mixture model selection approaches to variable selection in unsupervised classification]
Journal de la société française de statistique, Tome 155 (2014) no. 2, pp. 57-71.

We compare two major approaches to variable selection in clustering: model selection and regularization. Based on previous results, we take the method of Maugis et al. (2009b), which generalizes the method of Raftery and Dean (2006), as the current state-of-the-art model selection method, and the method of Witten and Tibshirani (2010) as the current state-of-the-art regularization method. We compare the two methods by simulation in terms of their accuracy in both classification and variable selection. In the first simulation experiment, all the variables are conditionally independent given cluster membership. Variable selection (of either kind) yields substantial gains in classification accuracy when the clusters are well separated, but few gains when the clusters are close together. The two variable selection methods have comparable classification accuracy, but the model selection approach is substantially more accurate in selecting variables. In the second simulation experiment, the variables are correlated given the cluster memberships. There, the model selection approach is substantially more accurate than the regularization approach in both classification and variable selection, and both give more accurate classifications than K-means without variable selection. However, the model selection approach is not practicable in very high-dimensional settings. Finally, the comparison is also carried out on real data.
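The kind of simulation described in the abstract can be illustrated with a small sketch. It is not the authors' experimental design: the sample size, the number of informative and noise variables, the separation parameter `delta`, and every function name below are assumptions chosen for illustration. K-means is run once on all variables and once on the truly informative variables (an oracle selection standing in for a real variable selection method), and each partition is scored against the true labels with the adjusted Rand index of Hubert and Arabie (1985).

```python
import random
from collections import Counter
from math import comb


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two partitions given as label lists."""
    pairs = Counter(zip(a, b))
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(len(a), 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # both partitions are trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def simulate(n_per_cluster, n_informative, n_noise, delta, rng):
    """Two Gaussian clusters; only the first n_informative variables differ in mean."""
    X, y = [], []
    for k in (0, 1):
        for _ in range(n_per_cluster):
            informative = [rng.gauss(k * delta, 1.0) for _ in range(n_informative)]
            noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
            X.append(informative + noise)
            y.append(k)
    return X, y


def sq_dist(u, v):
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))


def assign(X, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(x, centers[j])) for x in X]


def kmeans(X, k, rng, n_init=5, n_iter=50):
    """Plain Lloyd iterations with random restarts; keeps the lowest-inertia run."""
    best_labels, best_inertia = None, float("inf")
    for _ in range(n_init):
        centers = rng.sample(X, k)
        for _ in range(n_iter):
            labels = assign(X, centers)
            new_centers = []
            for j in range(k):
                members = [x for x, l in zip(X, labels) if l == j]
                if members:
                    new_centers.append([sum(col) / len(members) for col in zip(*members)])
                else:  # keep the old center if a cluster becomes empty
                    new_centers.append(centers[j])
            if new_centers == centers:
                break
            centers = new_centers
        labels = assign(X, centers)
        inertia = sum(sq_dist(x, centers[l]) for x, l in zip(X, labels))
        if inertia < best_inertia:
            best_labels, best_inertia = labels, inertia
    return best_labels


def compare(delta, seed=0, n_informative=2, n_noise=20):
    """Return (ARI with all variables, ARI with the informative variables only)."""
    rng = random.Random(seed)
    X, y = simulate(50, n_informative, n_noise, delta, rng)
    ari_all = adjusted_rand_index(y, kmeans(X, 2, rng))
    X_selected = [x[:n_informative] for x in X]  # oracle selection (assumption)
    ari_selected = adjusted_rand_index(y, kmeans(X_selected, 2, rng))
    return ari_all, ari_selected


if __name__ == "__main__":
    for delta in (1.0, 4.0):  # close vs. well-separated clusters (illustrative values)
        ari_all, ari_selected = compare(delta)
        print(f"delta={delta}: ARI all={ari_all:.2f}, ARI selected={ari_selected:.2f}")
```

Varying `delta` and `n_noise` gives a rough feel for how irrelevant variables affect a clustering that uses every variable. The methods compared in the paper select the variables from the data rather than by oracle, so this sketch says nothing about their relative performance.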

Keywords: Model-based clustering, Gaussian mixtures, Model selection, Regularization, Variable selection
@article{JSFS_2014__155_2_57_0,
     author = {Celeux, Gilles and Martin-Magniette, Marie-Laure and Maugis-Rabusseau, Cathy and Raftery, Adrian E.},
     title = {Comparing {Model} {Selection} and {Regularization} {Approaches} to {Variable} {Selection} in {Model-Based} {Clustering}},
     journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
     pages = {57--71},
     publisher = {Soci\'et\'e fran\c{c}aise de statistique},
     volume = {155},
     number = {2},
     year = {2014},
     zbl = {1316.62083},
     language = {en},
     url = {http://www.numdam.org/item/JSFS_2014__155_2_57_0/}
}
Celeux, Gilles; Martin-Magniette, Marie-Laure; Maugis-Rabusseau, Cathy; Raftery, Adrian E. Comparing Model Selection and Regularization Approaches to Variable Selection in Model-Based Clustering. Journal de la société française de statistique, Tome 155 (2014) no. 2, pp. 57-71. http://www.numdam.org/item/JSFS_2014__155_2_57_0/

[1] Bouveyron, C.; Brunet, C. Discriminative variable selection for clustering with the sparse Fisher-EM algorithm, Computational Statistics (2013) (to appear)

[2] Bouveyron, C.; Brunet, C. Model-based clustering of high-dimensional data: A review, Computational Statistics and Data Analysis (2013) (to appear)

[3] Breiman, L.; Friedman, J. H.; Olshen, R. A.; Stone, C. J. Classification and Regression Trees, Wadsworth International, Belmont, California, 1984 | Zbl

[4] Bouveyron, C.; Girard, S.; Schmid, C. High-dimensional data clustering, Computational Statistics & Data Analysis, Volume 52 (2007), pp. 502-519 | DOI | Zbl

[5] Banfield, J. D.; Raftery, A. E. Model-based Gaussian and non-Gaussian clustering, Biometrics, Volume 49 (1993), pp. 803-821 | Zbl

[6] Celeux, G.; Govaert, G. Gaussian Parsimonious Clustering Models, Pattern Recognition, Volume 28 (1995), pp. 781-793

[7] Celeux, G.; Martin-Magniette, M.-L.; Maugis-Rabusseau, C.; Raftery, A. E. Letter to the Editor, Journal of the American Statistical Association, Volume 106 (2011), p. 383 | DOI

[8] Fraiman, R.; Justel, A.; Svarc, M. Selection of Variables for Cluster Analysis and Classification Rules, Journal of the American Statistical Association, Volume 103 (2008), pp. 1294-1303 | Zbl

[9] Friedman, J. H.; Meulman, J. J. Clustering objects on subsets of attributes (with discussion), Journal of the Royal Statistical Society, Series B, Volume 66 (2004), pp. 815-849 | Zbl

[10] Fraley, C.; Raftery, A. E. Model-based clustering, discriminant analysis, and density estimation, Journal of the American Statistical Association, Volume 97 (2002), pp. 611-631 | Zbl

[11] Guo, J.; Levina, E.; Michailidis, G.; Zhu, J. Pairwise Variable Selection for High-Dimensional Model-Based Clustering, Biometrics, Volume 66 (2010), pp. 793-804 | Zbl

[12] Galimberti, G.; Montanari, A.; Viroli, C. Penalized factor mixture analysis for variable selection in clustered data, Computational Statistics and Data Analysis, Volume 53 (2009), pp. 4301-4310 | Zbl

[13] Gagnot, S.; Tamby, J.-P.; Martin-Magniette, M.-L.; Bitton, F.; Taconnat, L.; Balzergue, S.; Aubourg, S.; Renou, J.-P.; Lecharny, A.; Brunaud, V. CATdb: a public access to Arabidopsis transcriptome data from the URGV-CATMA platform, Nucleic Acids Research, Volume 36 (2008), pp. 986-990

[14] Hubert, L. J.; Arabie, P. Comparing partitions, Journal of Classification, Volume 2 (1985), pp. 193-218 | Zbl

[15] Kim, S.; Song, D. K. H.; DeSarbo, W. S. Model-Based Segmentation Featuring Simultaneous Segment-Level Variable Selection, Journal of Marketing Research, Volume 49 (2012), pp. 725-736

[16] Law, M. H.; Figueiredo, M. A. T.; Jain, A. K. Simultaneous Feature Selection and Clustering Using Mixture Models, IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 26 (2004), pp. 1154-1166

[17] Lee, H.; Li, J. Variable Selection for Clustering by Separability Based on Ridgelines, Journal of Computational and Graphical Statistics, Volume 21 (2012), pp. 315-337

[18] McLachlan, G. J.; Basford, K. E. Mixture Models: Inference and Applications to Clustering, Marcel Dekker, New York, 1988 | Zbl

[19] McLachlan, G.J.; Baek, J.; Rathnayake, S. I. Mixtures of factor analyzers for the analysis of high-dimensional data, Mixture Estimation and Applications (Mengersen, K.L.; Robert, C.P.; Titterington, D.M., eds.), New Jersey: Wiley, 2011, pp. 171-191

[20] Maugis, C.; Celeux, G.; Martin-Magniette, M.-L. Variable Selection for Clustering with Gaussian Mixture Models, Biometrics, Volume 65 (2009), pp. 701-709 | Zbl

[21] Maugis, C.; Celeux, G.; Martin-Magniette, M.-L. Variable selection in model-based clustering: A general variable role modeling, Computational Statistics and Data Analysis, Volume 53 (2009), pp. 3872-3882 | Zbl

[22] McNicholas, P.D.; Murphy, T.B. Parsimonious Gaussian Mixture Models, Statistics and Computing, Volume 18 (2008), pp. 285-296

[23] Maugis, C.; Martin-Magniette, M.-L.; Tamby, J.-P.; Renou, J.-P.; Lecharny, A.; Aubourg, S.; Celeux, G. Sélection de variables pour la classification par mélanges gaussiens pour prédire la fonction des gènes orphelins, La Revue Modulad, Volume 40 (2009), pp. 69-80

[24] Nia, V. P.; Davison, A. C. High-Dimensional Bayesian Clustering with Variable Selection: The R Package bclust, Journal of Statistical Software, Volume 47 (2012) no. 5 | DOI

[25] Pan, W.; Shen, X. Penalized Model-Based Clustering with Application to Variable Selection, Journal of Machine Learning Research, Volume 8 (2007), pp. 1145-1164 | Zbl

[26] Poon, L. K. M.; Zhang, N. L.; Liu, A. H. Model-based clustering of high-dimensional data: Variable selection versus facet determination, International Journal of Approximate Reasoning, Volume 54 (2013), pp. 196-215 | Zbl

[27] Raftery, A. E.; Dean, N. Variable Selection for Model-Based Clustering, Journal of the American Statistical Association, Volume 101 (2006), pp. 168-178 | Zbl

[28] Steinley, D.; Brusco, M. J. Selection of Variables in Cluster Analysis: An Empirical Comparison of Eight Procedures, Psychometrika, Volume 73 (2008), pp. 125-144 | Zbl

[29] Scrucca, L. Dimension reduction for model-based clustering, Statistics and Computing, Volume 20 (2010), pp. 471-484

[30] Sun, W.; Wang, J.; Fang, Y. Regularized k-means clustering of high dimensional data and its asymptotic consistency, Electronic Journal of Statistics, Volume 6 (2012), pp. 148-167 | Zbl

[31] Tadesse, M. G.; Sha, N.; Vannucci, M. Bayesian variable selection in clustering high-dimensional data, Journal of the American Statistical Association, Volume 100 (2005), pp. 602-617 | Zbl

[32] Tibshirani, R.; Walther, G.; Hastie, T. Estimating the number of clusters in a data set via the gap statistic, Journal of the Royal Statistical Society. Series B. Statistical Methodology, Volume 63 (2001), pp. 411-423 | Zbl

[33] Wolfe, J. H. Pattern Clustering by Multivariate Mixture Analysis, Multivariate Behavioral Research, Volume 5 (1970), pp. 329-350

[34] Witten, D. M.; Tibshirani, R. A framework for feature selection in clustering, Journal of the American Statistical Association, Volume 105 (2010), pp. 713-726 | Zbl

[35] Wang, S.; Zhu, J. Variable Selection for Model-Based High-Dimensional Clustering and Its Application to Microarray Data, Biometrics, Volume 64 (2008), pp. 440-448 | Zbl

[36] Xie, B.; Pan, W.; Shen, X. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables, Electronic Journal of Statistics, Volume 2 (2008), pp. 168-212 | DOI | Zbl

[37] Zhou, H.; Pan, W.; Shen, X. Penalized model-based clustering with unconstrained covariance matrices, Electronic Journal of Statistics, Volume 3 (2009), pp. 1473-1496 | DOI | Zbl