Quantization/Clustering: when and why does k -means work?
[Quantification et/ou Classification non-supervisée : quand et pourquoi la méthode des centres mobiles marche-t-elle ?]
Journal de la société française de statistique, Tome 159 (2018) no. 1, pp. 1-26.

Bien qu’utilisé comme algorithme de classification, les k -moyennes sont à la base conçus pour fournir un quantifieur, c’est-à-dire un moyen de compresser une distribution de probabilités avec k points. En nous appuyant sur les travaux de Levrard (2015) et Tang and Monteleoni (2016a) , nous essayerons d’expliquer en quoi et sous quelles conditions ces deux objectifs a priori distincts sont compatibles. Plus précisément, nous montrerons que dans le cas où la distribution d’où sont tirés les points satisfait une condition de type marge (baptisée ainsi par analogie avec les conditions de marge établies en classification supervisée dans Mammen and Tsybakov, 1999 ), non seulement le minimiseur théorique du risque empirique associé mais aussi le résultat fourni par l’algorithme de Lloyd fournissent d’une part une classification sinon optimale (au sens de Azizyan et al., 2013 ) du moins pertinente et d’autre part une compression rapide (en la taille de l’échantillon) et optimale.

Though mostly used as a clustering algorithm, k -means is originally designed as a quantization algorithm. Namely, it aims at providing a compression of a probability distribution with k points. Building upon Levrard (2015) ; Tang and Monteleoni (2016a) , we try to investigate how and when these two approaches are compatible. Namely, we show that provided the sample distribution satisfies a margin like condition (in the sense of Mammen and Tsybakov, 1999 for supervised learning), both the associated empirical risk minimizer and the output of Lloyd’s algorithm provide almost optimal classification in certain cases (in the sense of Azizyan et al., 2013 ). Besides, we also show that they achieved fast and optimal convergence rates in terms of sample size and compression risk.

Keywords: $k$-means, clustering, quantization, separation rate, distortion
Mot clés : $k$-means, classification non-supervisée, quantification, vitesse de séparation, distorsion
@article{JSFS_2018__159_1_1_0,
     author = {Levrard, Cl\'ement},
     title = {Quantization/Clustering: when and why does $k$-means work?},
     journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
     pages = {1--26},
     publisher = {Soci\'et\'e fran\c{c}aise de statistique},
     volume = {159},
     number = {1},
     year = {2018},
     zbl = {1397.62222},
     language = {en},
     url = {http://www.numdam.org/item/JSFS_2018__159_1_1_0/}
}
TY  - JOUR
AU  - Levrard, Clément
TI  - Quantization/Clustering: when and why does $k$-means work?
JO  - Journal de la société française de statistique
PY  - 2018
SP  - 1
EP  - 26
VL  - 159
IS  - 1
PB  - Société française de statistique
UR  - http://www.numdam.org/item/JSFS_2018__159_1_1_0/
LA  - en
ID  - JSFS_2018__159_1_1_0
ER  - 
%0 Journal Article
%A Levrard, Clément
%T Quantization/Clustering: when and why does $k$-means work?
%J Journal de la société française de statistique
%D 2018
%P 1-26
%V 159
%N 1
%I Société française de statistique
%U http://www.numdam.org/item/JSFS_2018__159_1_1_0/
%G en
%F JSFS_2018__159_1_1_0
Levrard, Clément. Quantization/Clustering: when and why does $k$-means work?. Journal de la société française de statistique, Tome 159 (2018) no. 1, pp. 1-26. http://www.numdam.org/item/JSFS_2018__159_1_1_0/

[1] Antoniadis, Anestis; Brosat, Xavier; Cugliari, Jairo; Poggi, Jean-Michel Clustering functional data using wavelets (2011) no. RR-7515, 30 pages https://hal.inria.fr/inria-00559115 (Research Report) | Zbl

[2] Auder, Benjamin; Fischer, Aurélie Projection-based curve clustering, Journal of Statistical Computation and Simulation, Volume 82 (2012) no. 8, pp. 1145-1168 | DOI | Zbl

[3] Antos, András; Györfi, László; György, András Individual convergence rates in empirical vector quantizer design, IEEE Trans. Inform. Theory, Volume 51 (2005) no. 11, pp. 4013-4022 | DOI | MR | Zbl

[4] Awasthi, Pranjal; Sheffet, Or Improved spectral-norm bounds for clustering, Approximation, randomization, and combinatorial optimization (Lecture Notes in Comput. Sci.), Volume 7408, Springer, Heidelberg, 2012, pp. 37-49 | DOI | MR | Zbl

[5] Azizyan, Martin; Singh, Aarti; Wasserman, Larry Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation, Advances in Neural Information Processing Systems 26 (Burges, C. J. C.; Bottou, L.; Welling, M.; Ghahramani, Z.; Weinberger, K. Q., eds.), Curran Associates, Inc., 2013, pp. 2139-2147 http://papers.nips.cc/paper/4983-minimax-theory-for-high-dimensional-gaussian-mixtures-with-sparse-mean-separation.pdf

[6] Arthur, David; Vassilvitskii, Sergei k-means++: the advantages of careful seeding, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, ACM, New York (2007), pp. 1027-1035 | MR | Zbl

[7] Biau, Gérard; Devroye, Luc; Lugosi, Gábor On the performance of clustering in Hilbert spaces, IEEE Trans. Inform. Theory, Volume 54 (2008) no. 2, pp. 781-790 | DOI | MR | Zbl

[8] Bunea, F.; Giraud, C.; Royer, M.; Verzelen, N. PECOK: a convex optimization approach to variable clustering, ArXiv e-prints (2016) | arXiv

[9] Brezis, Haim Functional analysis, Sobolev spaces and partial differential equations, Universitext, Springer, New York, 2011, xiv+599 pages | MR | Zbl

[10] Celeux, Gilles; Govaert, Gérard A classification EM algorithm for clustering and two stochastic versions, Comput. Statist. Data Anal., Volume 14 (1992) no. 3, pp. 315-332 | DOI | MR | Zbl

[11] Chou, P. A. The distortion of vector quantizers trained on n vectors decreases to the optimum as 𝒪 p ( 1 / n ) , Proc. IEEE Int. Symp. Inf. Theory, Trondheim, Norway (1994), 457 pages

[12] Dempster, A. P.; Laird, N. M.; Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, series B, Volume 39 (1977) no. 1, pp. 1-38 | MR | Zbl

[13] Dasgupta, Sanjoy; Schulman, Leonard A probabilistic analysis of EM for mixtures of separated, spherical Gaussians, J. Mach. Learn. Res., Volume 8 (2007), pp. 203-226 | MR | Zbl

[14] Fischer, Aurélie Quantization and clustering with Bregman divergences, J. Multivariate Anal., Volume 101 (2010) no. 9, pp. 2207-2221 | DOI | MR | Zbl

[15] Gersho, Allen; Gray, Robert M. Vector Quantization and Signal Compression, Kluwer Academic Publishers, Norwell, MA, USA, 1991 | Zbl

[16] Graf, Siegfried; Luschgy, Harald Foundations of quantization for probability distributions, Lecture Notes in Mathematics, 1730, Springer-Verlag, Berlin, 2000, x+230 pages | DOI | MR | Zbl

[17] Kumar, Amit; Kannan, Ravindran Clustering with spectral norm and the k -means algorithm, 2010 IEEE 51st Annual Symposium on Foundations of Computer Science—FOCS 2010, IEEE Computer Soc., Los Alamitos, CA, 2010, pp. 299-308 | MR

[18] Kumar, Amit; Sabharwal, Yogish; Sen, Sandeep Linear time algorithms for clustering problems in any dimensions, Automata, languages and programming (Lecture Notes in Comput. Sci.), Volume 3580, Springer, Berlin, 2005, pp. 1374-1385 | DOI | MR | Zbl

[19] Kim, Kyungpil; Zhang, Shibo; Jiang, Keni; Cai, Li; Lee, In-Beum; Feldman, Lewis J.; Huang, Haiyan Measuring similarities between gene expression profiles through new data transformations, BMC Bioinformatics, Volume 8 (2007) no. 1 | DOI

[20] Levrard, Clément Fast rates for empirical vector quantization, Electron. J. Statist., Volume 7 (2013), pp. 1716-1746 | DOI | MR | Zbl

[21] Levrard, Clément Nonasymptotic bounds for vector quantization in Hilbert spaces, Ann. Statist., Volume 43 (2015) no. 2, pp. 592-619 | DOI | MR | Zbl

[22] Levrard, Clément Sparse oracle inequalities for variable selection via regularized quantization, Bernoulli, Volume 24 (2018) no. 1, pp. 271-296 | DOI | MR | Zbl

[23] Linder, T. Learning-Theoretic Methods in Vector Quantization, Springer Vienna, Vienna (2002), pp. 163-210 | DOI

[24] Lloyd, Stuart P. Least squares quantization in PCM, IEEE Trans. Inform. Theory, Volume 28 (1982) no. 2, pp. 129-137 | DOI | MR | Zbl

[25] Lu, Y.; Zhou, H. H. Statistical and Computational Guarantees of Lloyd’s Algorithm and its Variants, ArXiv e-prints (2016) | arXiv

[26] MacQueen, J. Some methods for classification and analysis of multivariate observations, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, University of California Press, Berkeley, Calif. (1967), pp. 281-297 https://projecteuclid.org/euclid.bsmsp/1200512992 | MR | Zbl

[27] Mahajan, Meena; Nimbhorkar, Prajakta; Varadarajan, Kasturi The planar k -means problem is NP-hard, Theoret. Comput. Sci., Volume 442 (2012), pp. 13-21 | DOI | MR | Zbl

[28] Mammen, Enno; Tsybakov, Alexandre B. Smooth discrimination analysis, Ann. Statist., Volume 27 (1999) no. 6, pp. 1808-1829 | DOI | MR | Zbl

[29] Ng, Andrew Y.; Jordan, Michael I.; Weiss, Yair On Spectral Clustering: Analysis and an algorithm, Advances in Neural Information Processing Systems, MIT Press (2001), pp. 849-856

[30] Orhan, Umut; Hekim, Mahmut; Ozer, Mahmut EEG signals classification using the K-means clustering and a multilayer perceptron neural network model., Expert Syst. Appl., Volume 38 (2011) no. 10, pp. 13475-13481 http://dblp.uni-trier.de/db/journals/eswa/eswa38.html#OrhanHO11

[31] Ostrovsky, Rafail; Rabani, Yuval; Schulman, Leonard J.; Swamy, Chaitanya The effectiveness of Lloyd-type methods for the k -means problem, J. ACM, Volume 59 (2012) no. 6, 22 pages | DOI | MR | Zbl

[32] Pollard, David A central limit theorem for k -means clustering, Ann. Probab., Volume 10 (1982) no. 4, pp. 919-926 http://links.jstor.org/sici?sici=0091-1798(198211)10:4<919:ACLTFC>2.0.CO;2-Q&origin=MSN | MR | Zbl

[33] Tavazoie, Saeed; Hugues, Jason D.; Campbell, Michael J.; Cho, Raymond J.; Church, George M. Systematic determination of genetic network architecture, Nature genetics, Volume 22 (1999), p. 281-5

[34] Tang, Cheng; Monteleoni, Claire On Lloyd’s Algorithm: New Theoretical Insights for Clustering in Practice, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (Gretton, Arthur; Robert, Christian C., eds.) (Proceedings of Machine Learning Research), Volume 51, PMLR, Cadiz, Spain (2016), pp. 1280-1289 http://proceedings.mlr.press/v51/tang16b.html

[35] Tang, Cheng; Monteleoni, Claire On Lloyd’s Algorithm: New Theoretical Insights for Clustering in Practice. Supplementary material (2016)

[36] Tsybakov, Alexandre B. Introduction to Nonparametric Estimation, Springer Publishing Company, Incorporated, 2008 | MR