Sparsity in penalized empirical risk minimization
Annales de l'I.H.P. Probabilités et statistiques, Volume 45 (2009) no. 1, pp. 7-57.

Let $(X,Y)$ be a random couple in $S \times T$ with unknown distribution $P$. Let $(X_1,Y_1),\dots,(X_n,Y_n)$ be i.i.d. copies of $(X,Y)$, and let $P_n$ denote their empirical distribution. Let $h_1,\dots,h_N : S \to [-1,1]$ be a dictionary consisting of $N$ functions. For $\lambda \in \mathbb{R}^N$, denote $f_\lambda := \sum_{j=1}^N \lambda_j h_j$. Let $\ell : T \times \mathbb{R} \to \mathbb{R}$ be a given loss function, convex with respect to the second variable, and denote $(\ell \bullet f)(x,y) := \ell(y; f(x))$. We study the following penalized empirical risk minimization problem

$$\hat{\lambda}^{\varepsilon} := \mathop{\mathrm{argmin}}_{\lambda \in \mathbb{R}^N} \Bigl[ P_n(\ell \bullet f_\lambda) + \varepsilon \|\lambda\|_{\ell_p}^p \Bigr],$$

which is an empirical version of the problem

$$\lambda^{\varepsilon} := \mathop{\mathrm{argmin}}_{\lambda \in \mathbb{R}^N} \Bigl[ P(\ell \bullet f_\lambda) + \varepsilon \|\lambda\|_{\ell_p}^p \Bigr]$$

(here $\varepsilon \ge 0$ is a regularization parameter; $\lambda^0$ corresponds to the case $\varepsilon = 0$). A number of regression and classification problems fit this general framework. We are interested in the case when $p \ge 1$ is close enough to 1 (so that $p-1$ is of order $1/\log N$, or smaller). We show that "sparsity" of $\lambda^{\varepsilon}$ implies "sparsity" of $\hat{\lambda}^{\varepsilon}$, and we study the impact of "sparsity" on bounds for the excess risk $P(\ell \bullet f_{\hat{\lambda}^{\varepsilon}}) - P(\ell \bullet f_{\lambda^0})$ of the solutions of these empirical risk minimization problems.
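
To make the optimization problem concrete, the following is a minimal numerical sketch (an illustration, not the paper's procedure): it instantiates the penalized empirical risk with the logistic loss $\ell(y; u) = \log(1 + e^{-yu})$, which is convex in the second argument as the framework requires, takes $p = 1 + 1/\log N$, and uses a simulated dictionary. All names (H, eps, beta, and so on) are assumptions introduced for the example.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, N = 200, 50                        # sample size and dictionary size
H = rng.uniform(-1.0, 1.0, (n, N))    # H[i, j] = h_j(X_i), values in [-1, 1]
beta = np.zeros(N)
beta[:3] = 1.0                        # a sparse "true" coefficient vector (assumption)
Y = np.sign(H @ beta + 0.1 * rng.standard_normal(n))

p = 1.0 + 1.0 / np.log(N)             # p - 1 of order 1 / log N, as in the abstract
eps = 0.1                             # regularization parameter epsilon (arbitrary)

def objective(lam):
    f = H @ lam                                   # f_lambda(X_i) for each i
    risk = np.logaddexp(0.0, -Y * f).mean()       # P_n(ell . f_lambda), numerically stable
    return risk + eps * np.sum(np.abs(lam) ** p)  # plus eps * ||lambda||_{l_p}^p

# For p > 1 the objective is convex and differentiable, so a quasi-Newton
# solver with finite-difference gradients suffices for a demonstration.
res = minimize(objective, np.zeros(N), method="L-BFGS-B")
lam_hat = res.x
print("coordinates with |lambda_j| < 1e-3:", int(np.sum(np.abs(lam_hat) < 1e-3)))

Since $p > 1$ strictly, the penalty does not force exact zeros the way the $\ell_1$ norm does, but with $p - 1 \sim 1/\log N$ it behaves almost like $\ell_1$, so many coordinates of the minimizer are driven numerically close to zero; this is the "sparsity" phenomenon the abstract refers to.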

DOI: https://doi.org/10.1214/07-AIHP146
Classification: 62G99, 62J99, 62H30
Keywords: empirical risk, penalized empirical risk, ℓ_p-penalty, sparsity, oracle inequalities
@article{AIHPB_2009__45_1_7_0,
author = {Koltchinskii, Vladimir},
title = {Sparsity in penalized empirical risk minimization},
journal = {Annales de l'I.H.P. Probabilit\'es et statistiques},
pages = {7--57},
publisher = {Gauthier-Villars},
volume = {45},
number = {1},
year = {2009},
doi = {10.1214/07-AIHP146},
zbl = {1168.62044},
mrnumber = {2500227},
language = {en},
url = {http://www.numdam.org/articles/10.1214/07-AIHP146/}
}
Koltchinskii, Vladimir. Sparsity in penalized empirical risk minimization. Annales de l'I.H.P. Probabilités et statistiques, Volume 45 (2009) no. 1, pp. 7-57. doi: 10.1214/07-AIHP146. http://www.numdam.org/articles/10.1214/07-AIHP146/

 A. Barron, L. Birgé and P. Massart. Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 (1999) 301-413. | MR 1679028 | Zbl 0946.62036

 P. Bartlett, O. Bousquet and S. Mendelson. Local Rademacher complexities. Ann. Statist. 33 (2005) 1497-1537. | MR 2166554 | Zbl 1083.62034

 A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms and Engineering Applications. MPS/SIAM Series on Optimization, SIAM, Philadelphia, 2001. | MR 1857264 | Zbl 0986.90032

 F. Bunea, A. Tsybakov and M. Wegkamp. Aggregation for Gaussian regression. Ann. Statist. 35 (2007) 1674-1697. | MR 2351101

 F. Bunea, A. Tsybakov and M. Wegkamp. Sparsity oracle inequalities for the LASSO. Electron. J. Statist. 1 (2007) 169-194. | MR 2312149 | Zbl 1146.62028

 E. Candes and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist. 35 (2007) 2313-2351. | MR 2382644 | Zbl 1139.62019

 E. Candes, M. Rudelson, T. Tao and R. Vershynin. Error correction via linear programming. In Proc. 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2005) 295-308. IEEE, Pittsburgh, PA, 2005.

 E. Candes, J. Romberg and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math. 59 (2006) 1207-1223. | MR 2230846 | Zbl 1098.94009

 O. Catoni. Statistical Learning Theory and Stochastic Optimization. Springer, New York, 2004. | MR 2163920 | Zbl 1076.93002

 D. L. Donoho. For most large underdetermined systems of equations the minimal ℓ1-norm near-solution approximates the sparsest near-solution. Preprint, 2004. | Zbl 1105.90068

 D. L. Donoho. For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution. Comm. Pure Appl. Math. 59 (2006) 797-829. | MR 2217606 | Zbl 1113.15004

 D. L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory 52 (2006) 1289-1306. | MR 2241189

 D. L. Donoho, M. Elad and V. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Inform. Theory 52 (2006) 6-18. | MR 2237332

 S. van de Geer. High-dimensional generalized linear models and the Lasso. Ann. Statist. 36 (2008) 614-645. | MR 2396809 | Zbl 1138.62323

 V. Koltchinskii. Model selection and aggregation in sparse classification problems. In Oberwolfach Reports: Meeting on Statistical and Probabilistic Methods of Model Selection, October 2005.

 V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist. 34 (2006) 2593-2656. | MR 2329442 | Zbl 1118.62065

 V. Koltchinskii and D. Panchenko. Complexities of convex combinations and bounding the generalization error in classification. Ann. Statist. 33 (2005) 1455-1496. | MR 2166553 | Zbl 1080.62045

 M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, New York, 1991. | MR 1102015 | Zbl 0748.60004

 P. Massart. Some applications of concentration inequalities to statistics. Ann. Fac. Sci. Toulouse (IX) 9 (2000) 245-303. | Numdam | MR 1813803 | Zbl 0986.62002

 P. Massart. Concentration Inequalities and Model Selection. Springer, Berlin, 2007. | MR 2319879 | Zbl 1170.60006

 S. Mendelson, A. Pajor and N. Tomczak-Jaegermann. Reconstruction and subgaussian operators in asymptotic geometric analysis. Geom. Funct. Anal. 17 (2007) 1248-1282. | MR 2373017 | Zbl 1163.46008

 N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the LASSO. Ann. Statist. 34 (2006) 1436-1462. | MR 2278363 | Zbl 1113.62082

 A. Nemirovski. Topics in non-parametric statistics. In École d'Été de Probabilités de Saint-Flour XXVIII - 1998, 85-277. P. Bernard (Ed.). Springer, New York, 2000. | MR 1775640 | Zbl 0998.62033

 M. Rudelson and R. Vershynin. Geometric approach to error correcting codes and reconstruction of signals. Int. Math. Res. Not. 64 (2005) 4019-4041. | MR 2206919 | Zbl 1103.94014

 R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. Ser. B 58 (1996) 267-288. | MR 1379242 | Zbl 0850.62538

 A. Tsybakov. Optimal rates of aggregation. In Proc. 16th Annual Conference on Learning Theory (COLT) and 7th Annual Workshop on Kernel Machines, 303-313. Lecture Notes in Artificial Intelligence 2777. Springer, New York, 2003.

 A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer, New York, 1996. | MR 1385671 | Zbl 0862.60002

 Y. Yang. Mixing strategies for density estimation. Ann. Statist. 28 (2000) 75-87. | MR 1762904 | Zbl 1106.62322

 Y. Yang. Aggregating regression procedures for a better performance. Bernoulli 10 (2004) 25-47. | MR 2044592 | Zbl 1040.62030

 P. Zhao and B. Yu. On model selection consistency of LASSO. J. Mach. Learn. Res. 7 (2006) 2541-2563. | MR 2274449