Quantization/Clustering: when and why does k-means work?
[Quantification et/ou Classification non-supervisée : quand et pourquoi la méthode des centres mobiles marche-t-elle ?]
Journal de la société française de statistique, Tome 159 (2018) no. 1, pp. 1-26.

Though used as a clustering algorithm, k-means was originally designed to provide a quantizer, that is, a way of compressing a probability distribution with k points. Building on the work of Levrard (2015) and Tang and Monteleoni (2016a), we attempt to explain how and under which conditions these two a priori distinct objectives are compatible. More precisely, we show that when the distribution from which the points are drawn satisfies a margin-type condition (named by analogy with the margin conditions established for supervised classification in Mammen and Tsybakov, 1999), both the theoretical minimizer of the associated empirical risk and the output of Lloyd's algorithm provide, on the one hand, a classification that is, if not optimal (in the sense of Azizyan et al., 2013), at least relevant, and, on the other hand, a compression that is fast (in the sample size) and optimal.

Though mostly used as a clustering algorithm, k-means was originally designed as a quantization algorithm: it aims at providing a compression of a probability distribution with k points. Building upon Levrard (2015) and Tang and Monteleoni (2016a), we investigate how and when these two approaches are compatible. Namely, we show that, provided the sample distribution satisfies a margin-like condition (in the sense of Mammen and Tsybakov, 1999, for supervised learning), both the associated empirical risk minimizer and the output of Lloyd's algorithm provide an almost optimal classification in certain cases (in the sense of Azizyan et al., 2013). Besides, we show that they achieve fast and optimal convergence rates in terms of sample size and compression risk.
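The abstract refers repeatedly to Lloyd's algorithm (reference [24] below) and to the empirical distortion it tries to minimize. As a concrete anchor, here is a minimal plain-Python sketch of the two alternating steps the algorithm consists of; the naive seeding and the `lloyd` and `distortion` names are illustrative choices, not the paper's.

```python
def lloyd(points, k, n_iter=100):
    """A minimal sketch of Lloyd's algorithm (the standard k-means heuristic).

    points: list of equal-length tuples (a sample in R^d).
    Returns k centers, a local minimizer of the empirical distortion
    (1/n) * sum_i min_j ||x_i - c_j||^2.
    """
    # Naive seeding with the first k sample points; k-means++ seeding
    # (Arthur and Vassilvitskii, reference [6]) behaves better in practice.
    centers = list(points[:k])
    for _ in range(n_iter):
        # Assignment step: send each point to its nearest center (its Voronoi cell).
        cells = [[] for _ in range(k)]
        for x in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
            cells[j].append(x)
        # Update step: move each center to the mean of its cell.
        new_centers = [
            tuple(sum(x[i] for x in cell) / len(cell) for i in range(len(cell[0])))
            if cell else centers[j]
            for j, cell in enumerate(cells)
        ]
        if new_centers == centers:  # fixed point reached: no further decrease possible
            break
        centers = new_centers
    return centers


def distortion(points, centers):
    """Empirical distortion (quantization risk) of the codebook `centers`."""
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
        for x in points
    ) / len(points)
```

Each assignment step and each update step can only decrease the empirical distortion, so the iteration converges to a local minimum; which local minimum is reached depends on the seeding, which is why the initialization question studied in references [6] and [31] matters.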

Keywords: k-means, clustering, quantization, separation rate, distortion
@article{JSFS_2018__159_1_1_0,
     author = {Levrard, Cl\'ement},
     title = {Quantization/Clustering: when and why does $k$-means work?},
     journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
     pages = {1--26},
     publisher = {Soci\'et\'e fran\c{c}aise de statistique},
     volume = {159},
     number = {1},
     year = {2018},
     zbl = {1397.62222},
     language = {en},
     url = {http://www.numdam.org/item/JSFS_2018__159_1_1_0/}
}
TY  - JOUR
AU  - Levrard, Clément
TI  - Quantization/Clustering: when and why does $k$-means work?
JO  - Journal de la société française de statistique
PY  - 2018
DA  - 2018///
SP  - 1
EP  - 26
VL  - 159
IS  - 1
PB  - Société française de statistique
UR  - http://www.numdam.org/item/JSFS_2018__159_1_1_0/
UR  - https://zbmath.org/?q=an%3A1397.62222
LA  - en
ID  - JSFS_2018__159_1_1_0
ER  - 
Levrard, Clément. Quantization/Clustering: when and why does $k$-means work?. Journal de la société française de statistique, Tome 159 (2018) no. 1, pp. 1-26. http://www.numdam.org/item/JSFS_2018__159_1_1_0/

[1] Antoniadis, Anestis; Brossat, Xavier; Cugliari, Jairo; Poggi, Jean-Michel Clustering functional data using wavelets (2011) no. RR-7515, 30 pages https://hal.inria.fr/inria-00559115 (Research Report) | Zbl 1271.62131

[2] Auder, Benjamin; Fischer, Aurélie Projection-based curve clustering, Journal of Statistical Computation and Simulation, Volume 82 (2012) no. 8, pp. 1145-1168 | Article | Zbl 1431.62679

[3] Antos, András; Györfi, László; György, András Individual convergence rates in empirical vector quantizer design, IEEE Trans. Inform. Theory, Volume 51 (2005) no. 11, pp. 4013-4022 | Article | MR 2239017 | Zbl 1284.94039

[4] Awasthi, Pranjal; Sheffet, Or Improved spectral-norm bounds for clustering, Approximation, randomization, and combinatorial optimization (Lecture Notes in Comput. Sci.), Volume 7408, Springer, Heidelberg, 2012, pp. 37-49 | Article | MR 3003539 | Zbl 1358.68220

[5] Azizyan, Martin; Singh, Aarti; Wasserman, Larry Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation, Advances in Neural Information Processing Systems 26 (Burges, C. J. C.; Bottou, L.; Welling, M.; Ghahramani, Z.; Weinberger, K. Q., eds.), Curran Associates, Inc., 2013, pp. 2139-2147 http://papers.nips.cc/paper/4983-minimax-theory-for-high-dimensional-gaussian-mixtures-with-sparse-mean-separation.pdf

[6] Arthur, David; Vassilvitskii, Sergei k-means++: the advantages of careful seeding, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (2007), pp. 1027-1035 | MR 2485254 | Zbl 1302.68273

[7] Biau, Gérard; Devroye, Luc; Lugosi, Gábor On the performance of clustering in Hilbert spaces, IEEE Trans. Inform. Theory, Volume 54 (2008) no. 2, pp. 781-790 | Article | MR 2444554 | Zbl 1304.62088

[8] Bunea, F.; Giraud, C.; Royer, M.; Verzelen, N. PECOK: a convex optimization approach to variable clustering, ArXiv e-prints (2016) | arXiv:1606.05100

[9] Brezis, Haim Functional analysis, Sobolev spaces and partial differential equations, Universitext, Springer, New York, 2011, xiv+599 pages | MR 2759829 | Zbl 1220.46002

[10] Celeux, Gilles; Govaert, Gérard A classification EM algorithm for clustering and two stochastic versions, Comput. Statist. Data Anal., Volume 14 (1992) no. 3, pp. 315-332 | Article | MR 1192205 | Zbl 0937.62605

[11] Chou, P. A. The distortion of vector quantizers trained on n vectors decreases to the optimum as O_p(1/n), Proc. IEEE Int. Symp. Inf. Theory (1994), p. 457

[12] Dempster, A. P.; Laird, N. M.; Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, series B, Volume 39 (1977) no. 1, pp. 1-38 | MR 501537 | Zbl 0364.62022

[13] Dasgupta, Sanjoy; Schulman, Leonard A probabilistic analysis of EM for mixtures of separated, spherical Gaussians, J. Mach. Learn. Res., Volume 8 (2007), pp. 203-226 | MR 2320668 | Zbl 1222.62142

[14] Fischer, Aurélie Quantization and clustering with Bregman divergences, J. Multivariate Anal., Volume 101 (2010) no. 9, pp. 2207-2221 | Article | MR 2671211 | Zbl 1201.62080

[15] Gersho, Allen; Gray, Robert M. Vector Quantization and Signal Compression, Kluwer Academic Publishers, Norwell, MA, USA, 1991 | Zbl 0782.94001

[16] Graf, Siegfried; Luschgy, Harald Foundations of quantization for probability distributions, Lecture Notes in Mathematics, 1730, Springer-Verlag, Berlin, 2000, x+230 pages | Article | MR 1764176 | Zbl 0951.60003

[17] Kumar, Amit; Kannan, Ravindran Clustering with spectral norm and the k-means algorithm, 2010 IEEE 51st Annual Symposium on Foundations of Computer Science—FOCS 2010, IEEE Computer Soc., Los Alamitos, CA, 2010, pp. 299-308 | MR 3025203

[18] Kumar, Amit; Sabharwal, Yogish; Sen, Sandeep Linear time algorithms for clustering problems in any dimensions, Automata, languages and programming (Lecture Notes in Comput. Sci.), Volume 3580, Springer, Berlin, 2005, pp. 1374-1385 | Article | MR 2184726 | Zbl 1081.68746

[19] Kim, Kyungpil; Zhang, Shibo; Jiang, Keni; Cai, Li; Lee, In-Beum; Feldman, Lewis J.; Huang, Haiyan Measuring similarities between gene expression profiles through new data transformations, BMC Bioinformatics, Volume 8 (2007) no. 1 | Article

[20] Levrard, Clément Fast rates for empirical vector quantization, Electron. J. Statist., Volume 7 (2013), pp. 1716-1746 | Article | MR 3080408 | Zbl 1349.62038

[21] Levrard, Clément Nonasymptotic bounds for vector quantization in Hilbert spaces, Ann. Statist., Volume 43 (2015) no. 2, pp. 592-619 | Article | MR 3316191 | Zbl 1314.62143

[22] Levrard, Clément Sparse oracle inequalities for variable selection via regularized quantization, Bernoulli, Volume 24 (2018) no. 1, pp. 271-296 | Article | MR 3706757 | Zbl 1426.62211

[23] Linder, T. Learning-Theoretic Methods in Vector Quantization (2002), pp. 163-210 | Article

[24] Lloyd, Stuart P. Least squares quantization in PCM, IEEE Trans. Inform. Theory, Volume 28 (1982) no. 2, pp. 129-137 | Article | MR 651807 | Zbl 0504.94015

[25] Lu, Y.; Zhou, H. H. Statistical and Computational Guarantees of Lloyd’s Algorithm and its Variants, ArXiv e-prints (2016) | arXiv:1612.02099

[26] MacQueen, J. Some methods for classification and analysis of multivariate observations, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics (1967), pp. 281-297 https://projecteuclid.org/euclid.bsmsp/1200512992 | MR 214227 | Zbl 0214.46201

[27] Mahajan, Meena; Nimbhorkar, Prajakta; Varadarajan, Kasturi The planar k-means problem is NP-hard, Theoret. Comput. Sci., Volume 442 (2012), pp. 13-21 | Article | MR 2927097 | Zbl 1260.68158

[28] Mammen, Enno; Tsybakov, Alexandre B. Smooth discrimination analysis, Ann. Statist., Volume 27 (1999) no. 6, pp. 1808-1829 | Article | MR 1765618 | Zbl 0961.62058

[29] Ng, Andrew Y.; Jordan, Michael I.; Weiss, Yair On Spectral Clustering: Analysis and an algorithm, Advances in Neural Information Processing Systems (2001), pp. 849-856

[30] Orhan, Umut; Hekim, Mahmut; Ozer, Mahmut EEG signals classification using the K-means clustering and a multilayer perceptron neural network model, Expert Syst. Appl., Volume 38 (2011) no. 10, pp. 13475-13481 http://dblp.uni-trier.de/db/journals/eswa/eswa38.html#OrhanHO11

[31] Ostrovsky, Rafail; Rabani, Yuval; Schulman, Leonard J.; Swamy, Chaitanya The effectiveness of Lloyd-type methods for the k-means problem, J. ACM, Volume 59 (2012) no. 6, 22 pages | Article | MR 3008400 | Zbl 1281.68229

[32] Pollard, David A central limit theorem for k-means clustering, Ann. Probab., Volume 10 (1982) no. 4, pp. 919-926 http://links.jstor.org/sici?sici=0091-1798(198211)10:4<919:ACLTFC>2.0.CO;2-Q&origin=MSN | MR 672292 | Zbl 0502.62055

[33] Tavazoie, Saeed; Hughes, Jason D.; Campbell, Michael J.; Cho, Raymond J.; Church, George M. Systematic determination of genetic network architecture, Nature Genetics, Volume 22 (1999), pp. 281-285

[34] Tang, Cheng; Monteleoni, Claire On Lloyd’s Algorithm: New Theoretical Insights for Clustering in Practice, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research), Volume 51 (2016), pp. 1280-1289 http://proceedings.mlr.press/v51/tang16b.html

[35] Tang, Cheng; Monteleoni, Claire On Lloyd’s Algorithm: New Theoretical Insights for Clustering in Practice. Supplementary material (2016)

[36] Tsybakov, Alexandre B. Introduction to Nonparametric Estimation, Springer Publishing Company, Incorporated, 2008 | MR 2724359