Handling missing values in exploratory multivariate data analysis methods
[Gestion des données manquantes en analyse factorielle]
Journal de la société française de statistique, Tome 153 (2012) no. 2, pp. 79-99.

This article follows the lecture Julie Josse gave on her doctoral work when she received the Marie-Jeanne Laurent-Duhamel prize at the 44th Journées de Statistique (Bruxelles, 2012). It reviews the main results of the papers by Julie Josse and François Husson on handling missing data in exploratory multivariate data analysis and describes new advances on the subject. First, we detail a regularized iterative PCA algorithm that estimates the principal axes and components in the presence of missing data and mitigates the major problem of overfitting. The point estimate is supplemented by the construction of confidence areas. A nonparametric multiple imputation method is then developed to take into account the uncertainty due to the missing data. Finally, we address the recurring problem of choosing the number of dimensions and define approximations of cross-validation of the generalized cross-validation type. All of this work is made available to users through the missMDA package for the free software R.

This paper is a written version of the talk Julie Josse delivered at the 44th Journées de Statistique (Bruxelles, 2012), when she was awarded the Marie-Jeanne Laurent-Duhamel prize by the French Statistical Society for her Ph.D. dissertation. It gives an overview of results from Julie Josse and François Husson’s papers, together with new challenges, in the field of handling missing values in exploratory multivariate data analysis methods, especially principal component analysis (PCA). First, we describe a regularized iterative PCA algorithm that provides point estimates of the principal axes and components and overcomes the major issue of overfitting. Then, we give insight into the variance of the parameters using a nonparametric multiple imputation procedure. Finally, we discuss the choice of the number of dimensions and detail criteria that approximate cross-validation. The proposed methodology is implemented in the R package missMDA.

Keywords: Missing data, PCA, Multiple imputation, MCA, EM algorithm, Regularization, Residual bootstrap, Number of dimensions, Generalized cross-validation
@article{JSFS_2012__153_2_79_0,
     author = {Josse, Julie and Husson, Fran\c{c}ois},
     title = {Handling missing values in exploratory multivariate data analysis methods},
     journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
     pages = {79--99},
     publisher = {Soci\'et\'e fran\c{c}aise de statistique},
     volume = {153},
     number = {2},
     year = {2012},
     zbl = {1316.62006},
     mrnumber = {3008600},
     language = {en},
     url = {http://www.numdam.org/item/JSFS_2012__153_2_79_0/}
}
Josse, Julie; Husson, François. Handling missing values in exploratory multivariate data analysis methods. Journal de la société française de statistique, Tome 153 (2012) no. 2, pp. 79-99. http://www.numdam.org/item/JSFS_2012__153_2_79_0/
