Exact Cross-Validation for k NN : application to passive and active learning in classification
[Validation-croisée exacte pour les k NN : application à l’apprentissage passif et actif en classification]
Journal de la société française de statistique, Tome 152 (2011) no. 3, pp. 83-97.

Pour l’algorithme de classification des k plus proches voisins ( k NN), une expression explicite de l’estimateur du taux d’erreur de classification par validation croisée Leave p Out (L p O) est proposée. Cette expression explicite est d’abord utilisée dans le cadre de l’apprentissage passif pour étudier l’impact du choix du paramètre p du L p O sur le choix de k dans l’algorithme k NN. On s’intéresse ensuite au problème de l’apprentissage actif (active learning). Une procédure de sélection des exemples basée sur la recommandation du comité des classificateurs L p O est considérée. L’influence du paramètre p sur le choix des nouveaux exemples et sur le choix du paramètre k à chaque étape de l’apprentissage actif est étudiée. En particulier, il est montré que l’évolution de la valeur du paramètre k choisie par L p O en apprentissage actif est différente de celle observée en apprentissage passif.

In the binary classification framework, a closed form expression of the cross-validation Leave- p -Out (L p O) risk estimator for the k Nearest Neighbor algorithm ( k NN) is derived. It is first used to study the L p O risk minimization strategy for choosing k in the passive learning setting. The impact of p on the choice of k and the L p O estimation of the risk are inferred. In the active learning setting, a procedure is proposed that selects new examples using a L p O committee of k NN classifiers. The influence of p on the choice of new examples and the tuning of k at each step is investigated. The behavior of k chosen by L p O is shown to be different from what is observed in passive learning.

Keywords: Classification, Cross-validation, $k$NN algorithm Active learning
Mot clés : Classification, Valildation-croisée, $k$NN, Apprentissage actif
@article{JSFS_2011__152_3_83_0,
     author = {Celisse, Alain and Mary-Huard, Tristan},
     title = {Exact {Cross-Validation} for $k${NN} : application to passive and active learning in classification},
     journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
     pages = {83--97},
     publisher = {Soci\'et\'e fran\c{c}aise de statistique},
     volume = {152},
     number = {3},
     year = {2011},
     mrnumber = {2871178},
     zbl = {1316.62084},
     language = {en},
     url = {http://www.numdam.org/item/JSFS_2011__152_3_83_0/}
}
TY  - JOUR
AU  - Celisse, Alain
AU  - Mary-Huard, Tristan
TI  - Exact Cross-Validation for $k$NN : application to passive and active learning in classification
JO  - Journal de la société française de statistique
PY  - 2011
SP  - 83
EP  - 97
VL  - 152
IS  - 3
PB  - Société française de statistique
UR  - http://www.numdam.org/item/JSFS_2011__152_3_83_0/
LA  - en
ID  - JSFS_2011__152_3_83_0
ER  - 
%0 Journal Article
%A Celisse, Alain
%A Mary-Huard, Tristan
%T Exact Cross-Validation for $k$NN : application to passive and active learning in classification
%J Journal de la société française de statistique
%D 2011
%P 83-97
%V 152
%N 3
%I Société française de statistique
%U http://www.numdam.org/item/JSFS_2011__152_3_83_0/
%G en
%F JSFS_2011__152_3_83_0
Celisse, Alain; Mary-Huard, Tristan. Exact Cross-Validation for $k$NN : application to passive and active learning in classification. Journal de la société française de statistique, Tome 152 (2011) no. 3, pp. 83-97. http://www.numdam.org/item/JSFS_2011__152_3_83_0/

[1] Arlot, S.; Celisse, A. A survey of cross-validation procedures for model selection, Statist. Surv., Volume 4 (2010), pp. 40-79 | MR | Zbl

[2] Celisse, A.; Robin, S. Nonparametric density estimation by exact leave- p -out cross-validation, Comput. Statist. Data Anal., Volume 52 (2008) no. 5, pp. 2350-2368 | DOI | MR | Zbl

[3] Devroye, L.; Gyorfi, L.; Lugosi, G. A probabilistic theory of pattern recognition, Springer, 1996 | MR | Zbl

[4] Fix, E.; Hodges, J. Discriminatory analysis- nonparametric discrimination: Consistency principles, Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques, IEEE Computer Society Press, Los Alamitos, CA (1991) (Reprint of original work from 1952)

[5] Fix, E.; Hodges, J. Nonparametric Discrimination: small sample performance, Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques, IEEE Computer Society Press, Los Alamitos, CA (1991) (Reprint of original work from 1952)

[6] Golub, T. R.; Slonim, D. K.; Tamayo, P.; Huard, C.; Gaasenbeek, M.; Mesirov, J.P.; Coller, H.; Loh, M.L.; Downing, J.R.; Caligiuri, M.A.; Bloomfield, C.D.; Lander, E.S. Class prediction and discovery using gene expression data, Science, Volume 286 (1999), pp. 531-537

[7] Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction., Springer, New York, 2001 | MR | Zbl

[8] McCallum, A.; Nigam, K. Employing EM and Pool-Based Active Learning for Text Classification, Mach. Learn.: Proc. of the Fifteenth Intern. Conf. (ICML ’98) (1998), pp. 359-367

[9] Settles, B.; Craven, M.; Friedland, L. Active Learning with Real Annotation Costs, Proceedings of the NIPS Workshop on Cost-Sensitive Learning (2008), pp. 1-10

[10] Settles, B. Active Learning Literature Survey (2009) no. 1648 (Comp. Sci. Tech. Report)

[11] Simard, P.Y.; LeCun, Y.; Denker, J.S.; Victorri, B. Transformation Invariance in Pattern Recognition – Tangent Distance and Tangent Propagation, International Journal of Imaging Systems and Technology, Volume 11 (2001) no. 3

[12] Seung, H. S.; Opper, M; Sompolinsky, H. Query by committee, Annual Workshop on Computational Learning Theory (1992), pp. 287-294

[13] Stone, M. Cross-validatory choice and assessment of statistical predictions, J. Roy. Statist. Soc. Ser. B, Volume 36 (1974), pp. 111-147 | MR | Zbl