Segmentation of the Poisson and negative binomial rate models: a penalized estimator
ESAIM: Probability and Statistics, Volume 18 (2014), pp. 750-769.

We consider the segmentation problem for Poisson and negative binomial (i.e. overdispersed Poisson) rate distributions. In segmentation, the choice of the number of segments remains an important issue. To this end, we propose a penalized log-likelihood estimator where the penalty function is constructed in a non-asymptotic context following the works of L. Birgé and P. Massart. The resulting estimator is proved to satisfy an oracle inequality. The performance of our criterion is assessed using simulated and real datasets in the RNA-seq data analysis context.
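
The idea of choosing the number of segments by penalizing the log-likelihood can be illustrated with a minimal sketch for the Poisson case only. Everything named here is an assumption made for illustration: the function names, the plain quadratic-time dynamic programming (not a pruned algorithm), and in particular `example_penalty`, which is a placeholder and not the penalty derived in the paper.

```python
import math


def poisson_segment_cost(prefix_sum, start, end):
    """Negative Poisson log-likelihood of y[start:end] at its MLE rate.

    The log(y!) terms are dropped because they do not depend on the
    segmentation.
    """
    total = prefix_sum[end] - prefix_sum[start]
    length = end - start
    if total == 0:
        return 0.0  # estimated rate is 0, so the cost is 0
    return total - total * math.log(total / length)


def best_segmentations(y, max_segments):
    """For each K in 1..max_segments, return (minimal cost, change-points)."""
    n = len(y)
    prefix_sum = [0]
    for value in y:
        prefix_sum.append(prefix_sum[-1] + value)
    inf = float("inf")
    # cost[k][t]: minimal cost of splitting y[:t] into exactly k segments
    cost = [[inf] * (n + 1) for _ in range(max_segments + 1)]
    back = [[0] * (n + 1) for _ in range(max_segments + 1)]
    cost[0][0] = 0.0
    for k in range(1, max_segments + 1):
        for t in range(k, n + 1):
            for s in range(k - 1, t):
                c = cost[k - 1][s] + poisson_segment_cost(prefix_sum, s, t)
                if c < cost[k][t]:
                    cost[k][t] = c
                    back[k][t] = s
    results = {}
    for k in range(1, max_segments + 1):
        starts, t = [], n
        for j in range(k, 0, -1):
            t = back[j][t]
            starts.append(t)
        # Segment starts always include 0; the remaining ones are change-points.
        results[k] = (cost[k][n], sorted(starts)[1:])
    return results


def example_penalty(k, n, beta=2.0):
    """Placeholder penalty (an assumption), growing with K and log(n / K)."""
    return beta * k * (1.0 + math.log(n / k))


def select_num_segments(y, max_segments, penalty=example_penalty):
    """Pick K minimizing (negative log-likelihood + penalty)."""
    n = len(y)
    results = best_segmentations(y, max_segments)
    best_k = min(results, key=lambda k: results[k][0] + penalty(k, n))
    return best_k, results[best_k][1]
```

Calling `select_num_segments(counts, max_segments)` on a list of non-negative integer counts returns the selected number of segments and the indices at which a new segment starts. The multiplicative constant `beta` would have to be calibrated in practice; the value used here is arbitrary.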

DOI: https://doi.org/10.1051/ps/2014005
Classification: 62G05, 62G07, 62P10
Keywords: distribution estimation, change-point detection, count data (RNA-seq), Poisson and negative binomial distributions, model selection
@article{PS_2014__18__750_0,
     author = {Cleynen, Alice and Lebarbier, Emilie},
     title = {Segmentation of the Poisson and negative binomial rate models: a penalized estimator},
     journal = {ESAIM: Probability and Statistics},
     pages = {750--769},
     publisher = {EDP-Sciences},
     volume = {18},
     year = {2014},
     doi = {10.1051/ps/2014005},
     language = {en},
     url = {http://www.numdam.org/articles/10.1051/ps/2014005/}
}
Cleynen, Alice; Lebarbier, Emilie. Segmentation of the Poisson and negative binomial rate models: a penalized estimator. ESAIM: Probability and Statistics, Volume 18 (2014), pp. 750-769. doi: 10.1051/ps/2014005. http://www.numdam.org/articles/10.1051/ps/2014005/

[1] H. Akaike, Information Theory and Extension of the Maximum Likelihood Principle. Second Int. Symp. Inf. Theory (1973) 267-281. | MR 483125 | Zbl 0283.62006

[2] N. Akakpo, Estimating a discrete distribution via histogram selection. ESAIM: PS 15 (2011) 1-29. | EuDML 197753 | MR 2793047

[3] S. Arlot and P. Massart, Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res. 10 (2009) 245-279.

[4] Y. Baraud and L. Birgé, Estimating the intensity of a random measure by histogram type estimators. Probab. Theory Relat. Fields 143 (2009) 239-284. | MR 2449129 | Zbl 1149.62019

[5] A. Barron, L. Birgé and P. Massart, Risk bounds for model selection via penalization. Probab. Theory Relat. Fields 113 (1999) 301-413. | MR 1679028 | Zbl 0946.62036

[6] C. Biernacki, G. Celeux and G. Govaert, Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 719-725.

[7] L. Birgé, Model selection for Poisson processes. In Asymptotics: particles, processes and inverse problems, Vol. 55 of IMS Lect. Notes Monogr. Ser. Beachwood, OH: Inst. Math. Statist. (2007) 32-64. | MR 2459930 | Zbl 1176.62082

[8] L. Birgé and P. Massart, From model selection to adaptive estimation, in Festschrift for Lucien Le Cam. New York, Springer (1997) 55-87. | MR 1462939 | Zbl 0920.62042

[9] L. Birgé and P. Massart, Gaussian model selection. J. Eur. Math. Soc. 3 (2001) 203-268. | EuDML 277724 | MR 1848946 | Zbl 1037.62001

[10] L. Birgé and P. Massart, Minimal penalties for Gaussian model selection. Probab. Theory Relat. Fields 138 (2007) 33-73. | MR 2288064 | Zbl 1112.62082

[11] J.V. Braun, R. Braun and H.G. Müller, Multiple changepoint fitting via quasilikelihood, with application to DNA sequence segmentation. Biometrika 87 (2000) 301-314. | MR 1782480 | Zbl 0963.62067

[12] J.V. Braun and H.G. Müller, Statistical methods for DNA sequence segmentation. Stat. Sci. (1998) 142-162. | Zbl 0960.62121

[13] L. Breiman, J. Friedman, R. Olshen and C. Stone, Classification and Regression Trees. Wadsworth and Brooks (1984). | Zbl 0541.62042

[14] G. Castellan, Modified Akaike's criterion for histogram density estimation. Technical Report #9961 (1999).

[15] A. Cleynen, M. Koskas, E. Lebarbier, G. Rigaill and S. Robin, Segmentor3IsBack, an R package for the fast and exact segmentation of Seq-data. Algorithms for Molecular Biology (2014).

[16] N. Johnson, A. Kemp and S. Kotz, Univariate Discrete Distributions. John Wiley & Sons, Inc. (2005). | MR 2163227 | Zbl 1092.62010

[17] R. Killick and I.A. Eckley, Changepoint: an R package for changepoint analysis. Lancaster University (2011).

[18] E. Lebarbier, Detecting multiple change-points in the mean of Gaussian process by model selection. Signal Process. 85 (2005) 717-736. | Zbl 1148.94403

[19] T.M. Luong, Y. Rozenholc and G. Nuel, Fast estimation of posterior probabilities in change-point analysis through a constrained hidden Markov model. Comput. Stat. Data Anal. (2013). | MR 3103767

[20] P. Massart, Concentration inequalities and model selection. In Lect. Notes Math. Springer Berlin/Heidelberg (2007). | MR 2319879 | Zbl 1170.60006

[21] P. Reynaud-Bouret, Adaptive estimation of the intensity of inhomogeneous Poisson processes via concentration inequalities. Probab. Theory Relat. Fields 126 (2003) 103-153. | MR 1981635 | Zbl 1019.62079

[22] G. Rigaill, Pruned dynamic programming for optimal multiple change-point detection. arXiv:1004.0887 (2010). http://arxiv.org/abs/1004.0887

[23] G. Rigaill, E. Lebarbier and S. Robin, Exact posterior distributions and model selection criteria for multiple change-point detection problems. Stat. Comput. 22 (2012) 917-929. | MR 2913792 | Zbl 1252.62027

[24] D. Risso, K. Schwartz, G. Sherlock and S. Dudoit, GC-Content Normalization for RNA-Seq Data. BMC Bioinform. 12 (2011) 480.

[25] Y.C. Yao, Estimating the number of change-points via Schwarz' criterion. Stat. Probab. Lett. 6 (1988) 181-189. | MR 919373 | Zbl 0642.62016

[26] N.R. Zhang and D.O. Siegmund, A modified Bayes information criterion with applications to the analysis of comparative genomic hybridization data. Biometrics 63 (2007) 22-32. | MR 2345571 | Zbl 1206.62174