A general approach to account for dependence in large-scale multiple testing
[Un cadre global pour la prise en compte de la dépendance dans les procédures de tests multiples en grande dimension]
Journal de la société française de statistique, Volume 153 (2012) no. 2, pp. 100-122.

Data generated by high-throughput biotechnologies are characterized by their high dimensionality and heterogeneity. Analyzing such data calls even well-established statistical inference approaches into question. Motivated by issues raised by the analysis of gene expression data, I focus on the impact of dependence on the properties of multiple testing procedures in high dimension. This article presents the main results: after an introduction to the issues raised by dependence among variables, it studies in more detail the impact of dependence on the error rates and on the procedures developed to control them. This leads to an innovative methodology that uses a factor structure to model data heterogeneity and provides a general framework for dealing with dependence in multiple testing. The proposed framework reduces the variability of error rates and consequently yields large improvements in the power and stability of simultaneous inference compared with existing multiple testing procedures. The estimation of the model parameters in a high-dimensional setting and the determination of the number of factors to include in the model are also discussed. These results are then illustrated on real data from microarray experiments, analyzed using the R package FAMT.
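
To make the effect of dependence on error rates concrete, here is a minimal simulation sketch. It is not the FAMT methodology: it only applies the standard Benjamini-Hochberg step-up procedure to test statistics that share a single latent factor, and summarizes the false discovery proportion (FDP) over repeated data sets. The one-factor model with a common loading, and every parameter value (`m`, `m1`, `effect`, `loading`, `n_rep`, `seed`), are illustrative assumptions rather than settings taken from the article.

```python
import math
import random
import statistics


def two_sided_pvalue(z):
    """Two-sided p-value of a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose ordered p-value is below its threshold
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k = rank
    return set(order[:k])


def simulate_fdp(m=200, m1=20, effect=3.0, loading=0.0, rng=None):
    """Simulate one data set and return the FDP of the BH procedure.

    Assumed model (hypothetical): z_i = mu_i + b * W + sqrt(1 - b^2) * e_i,
    with W and e_i standard normal, so each z_i has unit variance and any
    two statistics have correlation b^2. The first m1 hypotheses are false.
    """
    rng = rng or random.Random()
    w = rng.gauss(0.0, 1.0)  # latent factor shared by all tests
    scale = math.sqrt(1.0 - loading ** 2)
    z = [
        (effect if i < m1 else 0.0) + loading * w + scale * rng.gauss(0.0, 1.0)
        for i in range(m)
    ]
    rejected = benjamini_hochberg([two_sided_pvalue(v) for v in z])
    false_rejections = sum(1 for i in rejected if i >= m1)
    return false_rejections / max(len(rejected), 1)


def fdp_summary(loading, n_rep=300, seed=1):
    """Mean and standard deviation of the FDP over n_rep simulated data sets."""
    rng = random.Random(seed)
    fdps = [simulate_fdp(loading=loading, rng=rng) for _ in range(n_rep)]
    return statistics.mean(fdps), statistics.stdev(fdps)


if __name__ == "__main__":
    for b in (0.0, 0.7):  # independent tests vs. strongly dependent tests
        mean, sd = fdp_summary(b)
        print(f"loading={b:.1f}  mean FDP={mean:.3f}  sd FDP={sd:.3f}")
```

Comparing the two printed lines gives a rough feel for how a shared factor changes the spread of the FDP across data sets, which is the kind of instability the factor-based framework is meant to reduce. The script uses only the Python standard library, so the results depend on the chosen seed and on the small number of replications.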

This paper is an extended written version of the talk I gave on the same topic at the 44th Journées de Statistique, organized by the French Statistical Society (SFdS) in Brussels, Belgium, in May 2012, on being awarded the Marie-Jeanne Laurent-Duhamel prize.

Keywords: Multiple testing, Dependence, High dimension, Error rates, Factor analysis, Proportion of null hypotheses
Friguet, Chloé. A general approach to account for dependence in large-scale multiple testing. Journal de la société française de statistique, Volume 153 (2012) no. 2, pp. 100-122. http://www.numdam.org/item/JSFS_2012__153_2_100_0/
