A common approach to handling missing values in Principal Component Analysis (PCA) is to ignore them by optimizing the loss function over the non-missing elements only. This can be achieved by several methods, including NIPALS, weighted regression and iterative PCA. The latter iteratively imputes the missing elements while the parameters are being estimated, and can be seen as a particular EM algorithm. First, we review these approaches in terms of the criterion they minimize. This presentation gives a good understanding of their properties and of the difficulties they encounter. Then, we point out the problem of overfitting and show how the probabilistic formulation of PCA (Tipping & Bishop, 1997) offers a proper and convenient regularization term to overcome it. Finally, the performance of the new algorithm is compared with that of the other algorithms through simulations.
A classical solution for performing Principal Component Analysis (PCA) on incomplete data is to seek the axes and components that minimize the reconstruction error over the observed data. Several algorithms have been proposed in the literature, such as NIPALS, a weighted alternating least squares approach, and an iterative PCA approach. The latter iteratively imputes the missing data during the estimation process and amounts to an EM algorithm for a particular model. These algorithms are described within the common framework of minimizing the criterion. This unified presentation gives a better understanding of their properties and of the difficulties they encounter. We then focus on the main problem, overfitting, and show how the probabilistic formulation of PCA (Tipping & Bishop, 1997) provides a suitable regularization term to overcome it. The performance of the algorithm finally proposed is compared with that of the other algorithms through simulations.
Keywords: PCA, missing values, alternating weighted least squares, EM algorithm, GEM-PCA, overfitting, probabilistic PCA
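The iterative PCA idea summarized in the abstract (fit a low-rank model, impute the missing cells from the fit, repeat) can be illustrated with a minimal sketch. This is not the authors' implementation and it includes no regularization; the function name `iterative_pca_impute` and the parameters `n_components`, `max_iter` and `tol` are assumptions made for illustration, and every column is assumed to have at least one observed value.

```python
import numpy as np


def iterative_pca_impute(X, n_components=2, max_iter=500, tol=1e-8):
    """Fill NaN cells of X by alternating a rank-k PCA fit and re-imputation.

    Illustrative sketch only: unregularized, so it can overfit when many
    cells are missing or when n_components is too large.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initialization: replace each missing cell by its column's observed mean.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)
    prev_loss = np.inf
    loss = np.inf
    k = n_components
    for _ in range(max_iter):
        # Fit step: rank-k PCA (via SVD) of the completed, re-centred matrix.
        mean = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        fitted = mean + (U[:, :k] * s[:k]) @ Vt[:k]
        # Criterion: squared reconstruction error over observed cells only.
        loss = float(np.sum((X[~missing] - fitted[~missing]) ** 2))
        # Imputation step: overwrite the missing cells, keep observed ones.
        filled = np.where(missing, fitted, X)
        # Stop once the criterion no longer decreases by more than tol.
        if prev_loss - loss < tol:
            break
        prev_loss = loss
    return filled, loss
```

On a small noiseless matrix with a single underlying dimension, calling `iterative_pca_impute(X, n_components=1)` returns the completed matrix together with the final value of the criterion on the observed cells. The observed cells are never modified; only the cells that were `NaN` are replaced by fitted values.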
@article{JSFS_2009__150_2_28_0,
  author = {Josse, Julie and Husson, Fran\c{c}ois and Pag\`es, J\'er\^ome},
  title = {Gestion des donn\'ees manquantes en {Analyse} en {Composantes} {Principales}},
  journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
  pages = {28--51},
  publisher = {Soci\'et\'e fran\c{c}aise de statistique},
  volume = {150},
  number = {2},
  year = {2009},
  mrnumber = {2609690},
  zbl = {1311.62091},
  language = {fr},
  url = {http://www.numdam.org/item/JSFS_2009__150_2_28_0/}
}
Josse, Julie; Husson, François; Pagès, Jérôme. Gestion des données manquantes en Analyse en Composantes Principales. Journal de la société française de statistique, Volume 150 (2009) no. 2, pp. 28-51. http://www.numdam.org/item/JSFS_2009__150_2_28_0/
[1] Latent Variable Models and Factor Analysis, Griffin, 1987 | Zbl
[2] Damped Newton algorithms for matrix factorization with missing data, Computer vision and pattern recognition, Volume 2 (2005), pp. 316-322
[3] Pattern recognition and machine learning, Springer, 2006 | MR | Zbl
[4] Bayesian PCA, Proceedings of the 1998 conference on Advances in neural information processing systems II, MIT Press, Cambridge, MA, USA (1999), pp. 382-388
[5] Multi-way analysis in the food industry. Models, algorithms and applications (1998) (Ph. D. Thesis)
[6] Collaborative filtering with privacy via factor analysis, Proceedings of the IEEE Symposium on Security and Privacy (2002), pp. 45-57
[7] Models and uses of principal component analysis (with discussion), Multidimensional Data Analysis (de Leeuw, J; Heiser, W; Meulman, J; Critchley, F, eds.), DSWO Press, 1986, pp. 149-178
[8] The one-component model with incomplete data, Uppsala University, Institute of statistics (1969) (Ph. D. Thesis)
[9] Ajustements de modèles linéaires et bilinéaires sous contraintes linéaires avec données manquantes, Revue de statistique appliquée, Volume 39 (1991), pp. 5-24
[10] Modèles pour l’analyse des données multidimensionnelles, Economica, 1992
[11] An application of Factor Analysis with missing data, Technometrics, Volume 23 (1981), pp. 91-95
[12] Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society B, Volume 39 (1977), pp. 1-38 | MR | Zbl
[13] Le traitement des variables vectorielles, Biometrics, Volume 29 (1973), pp. 751-760 | MR
[14] Netflix Challenge, 2008 http://sifter.org/~simon/journal/20061211.html
[15] Missing values in principal component analysis, Chemometrics and Intelligent Laboratory Systems, Volume 42 (1998), pp. 125-139
[16] Theory and applications of correspondence analysis, Academic Press, 1984 | MR | Zbl
[17] Matrix computations, Johns Hopkins University Press, 1996 | MR | Zbl
[18] Lower Rank Approximation of Matrices by Least Squares with Any Choice of Weights, Technometrics, Volume 21 (1979), pp. 236-246 | Zbl
[19] Convergent computation by iterative majorization: theory and applications in multidimensional data analysis, Recent Advances in Descriptive Multivariate Analysis (Krzanowski, W J, ed.), Oxford University Press, 1995, pp. 157-189 | MR
[20] FactoMineR R package version 1.11, 2009 http://factominer.free.fr
[21] The elements of statistical learning. Data Mining, Inference and Prediction, Springer series in statistics, 2001 | Zbl
[22] Missing values in experiments analyzed on automatic computers, Applied statistics, Volume 5 (1956), pp. 203-206
[23] Testing the significance of the RV coefficient, Computational Statistics and Data Analysis, Volume 53 (2008), pp. 82-91 | MR | Zbl
[24] Weighted least squares fitting using ordinary least squares algorithms, Psychometrika, Volume 62 (1997), pp. 251-266 | Zbl
[25] Applied Multiway data analysis (chap.7), Wiley series in probability and statistics, 2008 | MR | Zbl
[26] Statistical analysis with missing data, Wiley series in probability and statistics, New York, 1987, 2002 | MR | Zbl
[27] Missing data problems in machine learning, University of Toronto (2008) (Ph. D. Thesis)
[28] Une méthode de reconstitution et d’analyse de données incomplètes, Université Pierre et Marie Curie (1974) (Ph. D. Thesis)
[29] On lines and planes of closest fit to systems of points in space, Phil. Mag., Volume 2 (1901), pp. 559-572 | JFM
[30] Principal Component Analysis for sparse High-Dimensional Data, Neural Information Processing (2007), pp. 566-575
[31] EM algorithms for PCA and Sensible PCA, Advances in Neural Information Processing Systems, Volume 10 (1998), pp. 626-632
[32] EM algorithms for ML factor analysis, Psychometrika, Volume 47 (1982), pp. 69-76 | MR | Zbl
[33] Multiple imputation for nonresponse in surveys, Wiley, 1987 | MR | Zbl
[34] Multiple imputation after 18+ years, Journal of the American Statistical Association, Volume 91 (1996), pp. 473-489 | Zbl
[35] Analysis of incomplete multivariate data, Chapman & Hall/CRC, 1997 | MR | Zbl
[36] Missing Data: Our View of the State of the Art, Psychological Methods, Volume 7 (2002), pp. 147-177
[37] Learning with Matrix Factorizations, Massachusetts Institute of Technology (2004) (Ph. D. Thesis) | MR
[38] pcaMethods-a bioconductor package providing PCA methods for incomplete data, Bioinformatics, Volume 23 (2007), pp. 1164-1167
[39] Mixture of probabilistic principal component analysers, Neural Computation, Volume 11 (1999), pp. 443-482
[40] Probabilistic Principal Component Analysis, Journal of the Royal Statistical Society B, Volume 61 (1999), pp. 611-622 | MR | Zbl
[41] La régression PLS théorie et pratique, Technip, 1998 | MR | Zbl
[42] Notes on Probabilistic PCA with missing values (2009) (Technical report)
[43] Maximum likelihood principal component analysis, J. Chemom., Volume 11 (1997), pp. 339-366
[44] Dealing with missing data Part I, Chemometrics and Intelligent Laboratory Systems, Volume 58 (2001), pp. 15-27
[45] Estimation of principal components and related methods by iterative least squares, Multivariate Analysis (Krishnaiah, P R, ed.), Academic Press, 1966, pp. 391-420 | MR | Zbl
[46] Nonlinear estimation by iterative least squares procedures, Research Papers in Statistics: Festschrift for Jerzy Neyman (David, F N, ed.), Wiley, 1966, pp. 411-444 | MR | Zbl