Handling missing values in exploratory multivariate data analysis methods
[Gestion des données manquantes en analyse factorielle]
Journal de la société française de statistique, Tome 153 (2012) no. 2, pp. 79-99.

Cet article fait suite à la conférence de Julie Josse sur ses travaux de thèse lors de la réception du prix Marie-Jeanne Laurent-Duhamel, dans le cadre des 44e Journées de Statistique (Bruxelles, 2012). Il reprend les principaux résultats des papiers de Julie Josse et François Husson sur la gestion des données manquantes en analyse factorielle et décrit de nouvelles avancées sur le sujet. Dans un premier temps, nous détaillons un algorithme d’ACP itérative régularisée qui permet d’estimer les axes et composantes principales en présence de données manquantes et qui pallie le problème majeur du surajustement. L’estimation ponctuelle est enrichie par la construction de zone de confiance. Une méthode d’imputation multiple non-paramétrique est alors développée pour prendre en compte l’incertitude due aux données manquantes. Enfin, nous abordons le problème récurrent du choix du nombre de dimensions et définissons des approximations de la validation croisée de type validation croisée généralisée. Tous ces travaux sont mis à disposition de l’utilisateur grâce au package missMDA du logiciel libre R.

This paper is a written version of the talk Julie Josse delivered at the 44e Journées de Statistique (Bruxelles, 2012), when she was awarded the Marie-Jeanne Laurent-Duhamel prize for her Ph.D. dissertation by the French Statistical Society. It gives an overview of results from Julie Josse and François Husson’s papers, as well as new challenges in handling missing values in exploratory multivariate data analysis methods, especially principal component analysis (PCA). First, we describe a regularized iterative PCA algorithm that provides point estimates of the principal axes and components and overcomes the major issue of overfitting. Then, we give insight into the variance of the parameters using a nonparametric multiple imputation procedure. Finally, we discuss the choice of the number of dimensions and detail cross-validation approximation criteria. The proposed methodology is implemented in the R package missMDA.
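
To make the idea of a regularized iterative PCA more concrete, here is a minimal NumPy sketch. It is not the missMDA implementation: the function name `regularized_iterative_pca`, its parameters, and the noise-variance estimate (the mean of the discarded squared singular values) are simplifying assumptions chosen for illustration. The loop fills missing cells with column means, fits a rank-`ncp` PCA with shrunk singular values, replaces only the missing cells with the fitted values, and repeats until the imputed values stabilize.

```python
import numpy as np

def regularized_iterative_pca(X, ncp=2, max_iter=1000, tol=1e-6):
    """Illustrative sketch: impute NaN cells of X with a shrunk rank-ncp PCA fit.

    Assumes every column has at least one observed value and that ncp is
    smaller than min(X.shape). Hypothetical helper, not the missMDA API.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Step 1: initialize missing cells with the observed column means.
    X_hat = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(max_iter):
        # Step 2: PCA (via SVD) of the completed, re-centered matrix.
        mean = X_hat.mean(axis=0)
        U, d, Vt = np.linalg.svd(X_hat - mean, full_matrices=False)
        # Assumed noise estimate: mean of the discarded squared singular values.
        sigma2 = np.mean(d[ncp:] ** 2) if ncp < len(d) else 0.0
        # Shrink the retained singular values; this is the regularization
        # that limits overfitting of the imputed cells.
        d_kept = np.maximum(d[:ncp], 1e-12)  # guard against division by zero
        shrunk = np.maximum(d_kept ** 2 - sigma2, 0.0) / d_kept
        fitted = (U[:, :ncp] * shrunk) @ Vt[:ncp] + mean
        # Step 3: overwrite only the missing cells; observed cells stay fixed.
        X_new = np.where(missing, fitted, X)
        change = np.sum((X_new - X_hat) ** 2)
        X_hat = X_new
        if change < tol:
            break
    return X_hat
```

Setting the shrinkage aside (that is, using `d[:ncp]` directly) would give the unregularized iterative PCA; the shrinkage step is what the abstract refers to as the remedy for overfitting.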

Keywords: Missing values, PCA, Multiple imputation, MCA, EM algorithm, Regularization, Residual bootstrap, Number of dimensions, Generalized cross-validation
Mots-clés : Données manquantes, ACP, Imputation multiple, ACM, Algorithme EM, Régularisation, Bootstrap des résidus, Nombre de dimensions, Validation croisée généralisée
@article{JSFS_2012__153_2_79_0,
     author = {Josse, Julie and Husson, Fran\c{c}ois},
     title = {Handling missing values in exploratory multivariate data analysis methods},
     journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
     pages = {79--99},
     publisher = {Soci\'et\'e fran\c{c}aise de statistique},
     volume = {153},
     number = {2},
     year = {2012},
     mrnumber = {3008600},
     zbl = {1316.62006},
     language = {en},
     url = {http://www.numdam.org/item/JSFS_2012__153_2_79_0/}
}
Josse, Julie; Husson, François. Handling missing values in exploratory multivariate data analysis methods. Journal de la société française de statistique, Tome 153 (2012) no. 2, pp. 79-99. http://www.numdam.org/item/JSFS_2012__153_2_79_0/
