Partial differential equations/Dynamical systems
On the convergence of formally diverging neural net-based classifiers
[Convergence de classifieurs par réseaux de neurones formellement divergents]
Comptes Rendus. Mathématique, Volume 356 (2018) no. 4, pp. 395-405.

In this Note, we study the asymptotic behavior of gradient descent algorithms applied to classification problems based on elementary models of neural networks with supervised learning. We prove that these algorithms diverge in the strict mathematical sense, since the sequence of parameters defining the classifier is unbounded. However, by developing entropy–entropy production methods, our approach leads to explicit rates showing, at least when the classes can be well separated, that the parameters diverge only logarithmically while the cost function converges to 0 polynomially. Consequently, from a practical standpoint, the algorithm does yield a classifier with good performance, and may even appear to converge.
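
To make the log-divergence/polynomial-decay mechanism concrete, here is a minimal worked computation for a generic linear classifier with the logistic loss \ell(m) = \log(1 + e^{-m}); this is an illustrative setting assumed here, not the specific network model or proof of this Note. If a unit direction w^* separates the two classes with margin \gamma > 0, i.e. y_i\, w^* \cdot x_i \ge \gamma for every sample, then for the rescaled parameter s\,w^* with s \ge 0,

\[
  L(s\,w^*) \;=\; \frac{1}{n}\sum_{i=1}^{n} \log\!\bigl(1 + e^{-s\,y_i\, w^*\cdot x_i}\bigr)
  \;\le\; e^{-\gamma s},
  \qquad\text{so } s=\frac{\log t}{\gamma}
  \;\Longrightarrow\;
  L(s\,w^*) \le \frac{1}{t}
  \ \text{ while }\
  \|s\,w^*\| = \frac{\log t}{\gamma}.
\]

Any minimizing sequence is therefore unbounded, but its norm only needs to grow like \log t for the cost to decay like 1/t, which is exactly the practical-convergence picture described above.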

We present an analytical study of gradient descent algorithms applied to a classification problem in machine learning based on artificial neural networks. Our approach is based on entropy–entropy dissipation estimates that yield explicit rates. Specifically, as long as the neural nets remain within a set of “good classifiers”, we establish a striking feature of the algorithm: it mathematically diverges as the number of gradient descent iterations (“time”) goes to infinity but this divergence is only logarithmic, while the loss function vanishes polynomially. As a consequence, this algorithm still yields a classifier that exhibits good numerical performance and may even appear to converge.
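
The same phenomenon is easy to observe numerically. The following sketch is a toy, hypothetical setup (plain gradient descent on the logistic loss for two well-separated Gaussian clusters), not the network architecture analyzed in the paper; it prints the parameter norm and the loss along the iterations, which should show the norm growing roughly like log t while the loss decays roughly like 1/t.

# Toy illustration of "formal divergence / practical convergence":
# gradient descent on the logistic loss for a linearly separable problem.
# (Hypothetical data and hyperparameters, for illustration only.)
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated Gaussian clusters in the plane, labels +1 / -1.
n = 200
X = np.vstack([rng.normal(+2.0, 0.3, size=(n // 2, 2)),
               rng.normal(-2.0, 0.3, size=(n // 2, 2))])
y = np.hstack([np.ones(n // 2), -np.ones(n // 2)])

def loss_and_grad(w):
    margins = y * (X @ w)
    loss = np.mean(np.logaddexp(0.0, -margins))    # mean of log(1 + exp(-m)), computed stably
    sigma = np.exp(-np.logaddexp(0.0, margins))    # 1 / (1 + exp(m)), computed stably
    grad = -(X.T @ (y * sigma)) / n
    return loss, grad

w = np.zeros(2)
lr = 0.2
for t in range(1, 100_001):
    loss, grad = loss_and_grad(w)
    w -= lr * grad
    if t in (10, 100, 1_000, 10_000, 100_000):
        # On separable data one expects |w| ~ C log(t) and loss ~ C'/t.
        print(f"t={t:6d}   |w| = {np.linalg.norm(w):5.2f}   loss = {loss:.2e}")

Each time the iteration count is multiplied by 10, the printed norm should increase by a roughly constant increment while the loss drops by roughly a factor of 10: logarithmic divergence of the parameters together with polynomial decay of the cost.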

DOI: 10.1016/j.crma.2018.03.003
Berlyand, Leonid 1 ; Jabin, Pierre-Emmanuel 2

1 Department of Mathematics, The Pennsylvania State University, University Park, PA 16802, USA
2 CSCAMM and Department of Mathematics, University of Maryland, College Park, MD 20742, USA
@article{CRMATH_2018__356_4_395_0,
     author = {Berlyand, Leonid and Jabin, Pierre-Emmanuel},
     title = {On the convergence of formally diverging neural net-based classifiers},
     journal = {Comptes Rendus. Math\'ematique},
     pages = {395--405},
     publisher = {Elsevier},
     volume = {356},
     number = {4},
     year = {2018},
     doi = {10.1016/j.crma.2018.03.003},
     language = {en},
     url = {http://www.numdam.org/articles/10.1016/j.crma.2018.03.003/}
}
Berlyand, Leonid; Jabin, Pierre-Emmanuel. On the convergence of formally diverging neural net-based classifiers. Comptes Rendus. Mathématique, Volume 356 (2018) no. 4, pp. 395-405. doi: 10.1016/j.crma.2018.03.003. http://www.numdam.org/articles/10.1016/j.crma.2018.03.003/


LB was partially supported by NSF DMREF grant DMS-1628411; PEJ was partially supported by NSF grant DMS-161453, NSF grant RNMS (Ki-Net) DMS-1107444, and LTS grants DO 0048-0049-0050 and 0052.