Special issue: Survey sampling
Analysing large datasets of functional data: a survey sampling point of view
[Analyse statistique de grandes bases de données fonctionnelles : le point de vue de sondeurs]
Journal de la société française de statistique, Tome 155 (2014) no. 4, pp. 70-94.

In the era of massive data, it is no longer unusual to have to handle very large databases of temporal phenomena. When the goal is to estimate simple indicators such as the mean or median trajectory, or the main modes of variation around the mean captured through a principal components analysis, survey sampling techniques are attractive approaches. They offer a good trade-off between the size of the data to be processed and the accuracy of the estimation. This work reviews the survey sampling approaches developed in recent years to analyse large functional databases. The emphasis is on ways of using auxiliary information to improve estimation compared with simple random sampling without replacement, and on the construction of confidence bands. These techniques are illustrated on a dataset of electricity load curves measured every half-hour for one week.

In the age of Big Data, it is now common to deal with very large datasets of phenomena that evolve over time. When the aim is to estimate simple quantities such as the mean or median trajectory, or the main modes of variation of the data captured through a principal components analysis, survey sampling techniques can be employed successfully. They offer an attractive trade-off between the size of the data and the accuracy of the estimators. This paper reviews survey sampling approaches recently developed to deal with large datasets of functional data. We present different sampling techniques that can be employed to build confidence bands and to improve, with the help of auxiliary information, the accuracy of estimators compared to simple random sampling without replacement. These procedures are illustrated on a dataset of electricity load curves measured every half-hour over a period of one week.
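
As a rough illustration of the kind of estimator named in the keywords, the sketch below computes a Horvitz-Thompson estimate of a mean curve from a simple random sample drawn without replacement, together with a pointwise variance estimate. It is a minimal sketch under stated assumptions and not the authors' implementation: the population is synthetic, the function names `ht_mean_curve` and `srswor_pointwise_variance` are hypothetical, and the variance is only pointwise (the simultaneous confidence bands discussed in the paper need more than this).

```python
import numpy as np


def ht_mean_curve(sample_curves, incl_probs, pop_size):
    """Horvitz-Thompson estimate of the population mean curve.

    sample_curves: (n, T) array, one sampled curve per row.
    incl_probs:    (n,) first-order inclusion probabilities of the sampled units.
    pop_size:      population size N (assumed known).
    """
    y = np.asarray(sample_curves, dtype=float)
    pi = np.asarray(incl_probs, dtype=float)
    # Weight each sampled curve by 1 / pi_k, sum over the sample, divide by N.
    return (y / pi[:, None]).sum(axis=0) / pop_size


def srswor_pointwise_variance(sample_curves, pop_size):
    """Pointwise variance estimate of the mean curve under simple random
    sampling without replacement: (1 - n/N) * s^2(t) / n."""
    y = np.asarray(sample_curves, dtype=float)
    n = y.shape[0]
    s2 = y.var(axis=0, ddof=1)  # sample variance at each time point
    return (1.0 - n / pop_size) * s2 / n


# Synthetic population (an assumption for illustration only):
# N curves observed at T = 48 * 7 = 336 half-hourly points over one week.
rng = np.random.default_rng(0)
N, T, n = 1000, 336, 100
t = np.arange(T)
daily_pattern = 1.0 + 0.5 * np.sin(2 * np.pi * t / 48)
population = daily_pattern + rng.normal(scale=0.3, size=(N, T))

# Simple random sampling without replacement: every unit has pi_k = n / N.
idx = rng.choice(N, size=n, replace=False)
sample = population[idx]
pi = np.full(n, n / N)

mu_hat = ht_mean_curve(sample, pi, N)
var_hat = srswor_pointwise_variance(sample, N)
```

With equal inclusion probabilities the Horvitz-Thompson estimate reduces to the sample mean curve; unequal probability designs only change the `incl_probs` argument, although the variance formula above would no longer apply.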

Keywords: Big Data, Confidence bands, Horvitz-Thompson estimator, Model-assisted estimation, Unequal probability sampling designs, Variance estimation
@article{JSFS_2014__155_4_70_0,
     author = {Lardin-Puech, Pauline and Cardot, Herv\'e and Goga, Camelia},
     title = {Analysing large datasets of functional data: a survey sampling point of view},
     journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
     pages = {70--94},
     publisher = {Soci\'et\'e fran\c{c}aise de statistique},
     volume = {155},
     number = {4},
     year = {2014},
     mrnumber = {3286190},
     zbl = {1316.62019},
     language = {en},
     url = {http://www.numdam.org/item/JSFS_2014__155_4_70_0/}
}
Lardin-Puech, Pauline; Cardot, Hervé; Goga, Camelia. Analysing large datasets of functional data: a survey sampling point of view. Journal de la société française de statistique, Tome 155 (2014) no. 4, pp. 70-94. http://www.numdam.org/item/JSFS_2014__155_4_70_0/
