Sparsity in penalized empirical risk minimization
Annales de l'I.H.P. Probabilités et statistiques, Volume 45 (2009) no. 1, pp. 7-57.

Let $(X,Y)$ be a random couple in $S\times T$ with unknown distribution $P$. Let $(X_1,Y_1),\dots,(X_n,Y_n)$ be i.i.d. copies of $(X,Y)$, with empirical distribution $P_n$. Let $h_1,\dots,h_N : S \to [-1,1]$ be a dictionary consisting of $N$ functions. For $\lambda \in \mathbb{R}^N$, denote $f_\lambda := \sum_{j=1}^N \lambda_j h_j$. Let $\ell : T\times\mathbb{R} \to \mathbb{R}$ be a given loss function, convex with respect to the second variable. Denote $(\ell \bullet f)(x,y) := \ell(y; f(x))$. We study the following penalized empirical risk minimization problem

$$\hat\lambda^{\varepsilon} := \operatorname*{argmin}_{\lambda \in \mathbb{R}^N}\Bigl[P_n(\ell \bullet f_\lambda) + \varepsilon \|\lambda\|_{\ell_p}^{p}\Bigr],$$

which is an empirical version of the problem

$$\lambda^{\varepsilon} := \operatorname*{argmin}_{\lambda \in \mathbb{R}^N}\Bigl[P(\ell \bullet f_\lambda) + \varepsilon \|\lambda\|_{\ell_p}^{p}\Bigr]$$

(here $\varepsilon \ge 0$ is a regularization parameter; $\lambda^0$ corresponds to $\varepsilon = 0$). A number of regression and classification problems fit this general framework. We are interested in the case when $p \ge 1$ is close enough to $1$ (so that $p-1$ is of order $\frac{1}{\log N}$, or smaller). We show that the “sparsity” of $\lambda^{\varepsilon}$ implies the “sparsity” of $\hat\lambda^{\varepsilon}$ and study the impact of “sparsity” on bounding the excess risk $P(\ell \bullet f_{\hat\lambda^{\varepsilon}}) - P(\ell \bullet f_{\lambda^0})$ of solutions of empirical risk minimization problems.
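The empirical problem above can be illustrated numerically. The sketch below is not from the paper: it instantiates the framework with a squared loss $\ell(y;u)=(y-u)^2$ and solves the $\ell_p$-penalized empirical risk minimization by plain subgradient descent (taking the subgradient $0$ at $\lambda_j=0$), with $p = 1 + 1/\log N$ as in the regime the paper studies. The function name, step size, and iteration count are illustrative choices, not part of the source.

```python
import numpy as np

def penalized_erm(H, y, eps, p, lr=0.05, n_iter=3000):
    """Subgradient descent for
        argmin_lambda  (1/n) * sum_i (y_i - (H @ lambda)_i)^2 + eps * ||lambda||_p^p,
    where H is the n x N matrix with entries h_j(X_i) in [-1, 1]
    (columns play the role of the dictionary functions h_1, ..., h_N)."""
    n, N = H.shape
    lam = np.zeros(N)
    for _ in range(n_iter):
        resid = H @ lam - y
        grad_risk = (2.0 / n) * (H.T @ resid)
        # subgradient of eps * sum_j |lam_j|^p; equals 0 at lam_j = 0
        grad_pen = eps * p * np.sign(lam) * np.abs(lam) ** (p - 1)
        lam -= lr * (grad_risk + grad_pen)
    return lam

# Sparse ground truth: only 2 of N = 20 coefficients are nonzero.
rng = np.random.default_rng(0)
n, N = 200, 20
H = rng.uniform(-1.0, 1.0, (n, N))
lam_true = np.zeros(N)
lam_true[0], lam_true[1] = 1.0, -0.5
y = H @ lam_true + 0.05 * rng.standard_normal(n)

p = 1.0 + 1.0 / np.log(N)          # p - 1 of order 1 / log N
lam_hat = penalized_erm(H, y, eps=0.01, p=p)
```

With $p$ this close to $1$ the penalty behaves much like the Lasso: coefficients outside the true support stay near zero while the two active coefficients are recovered up to a small shrinkage bias of order $\varepsilon$.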


DOI: 10.1214/07-AIHP146
Classification: 62G99,  62J99,  62H30
Keywords: empirical risk, penalized empirical risk, ℓ_p-penalty, sparsity, oracle inequalities
@article{AIHPB_2009__45_1_7_0,
author = {Koltchinskii, Vladimir},
title = {Sparsity in penalized empirical risk minimization},
journal = {Annales de l'I.H.P. Probabilit\'es et statistiques},
pages = {7--57},
publisher = {Gauthier-Villars},
volume = {45},
number = {1},
year = {2009},
doi = {10.1214/07-AIHP146},
zbl = {1168.62044},
mrnumber = {2500227},
language = {en},
url = {http://www.numdam.org/articles/10.1214/07-AIHP146/}
}
Koltchinskii, Vladimir. Sparsity in penalized empirical risk minimization. Annales de l'I.H.P. Probabilités et statistiques, Volume 45 (2009) no. 1, pp. 7-57. doi : 10.1214/07-AIHP146. http://www.numdam.org/articles/10.1214/07-AIHP146/

[1] A. Barron, L. Birgé and P. Massart. Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 (1999) 301-413. | MR | Zbl

[2] P. Bartlett, O. Bousquet and S. Mendelson. Local Rademacher complexities. Ann. Statist. 33 (2005) 1497-1537. | MR | Zbl

[3] A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization. Analysis, Algorithms and Engineering Applications. MPS/SIAM, Series on Optimization, Philadelphia, 2001. | MR | Zbl

[4] F. Bunea, A. Tsybakov and M. Wegkamp. Aggregation for Gaussian regression. Ann. Statist. 35 (2007) 1674-1697. | MR

[5] F. Bunea, A. Tsybakov and M. Wegkamp. Sparsity oracle inequalities for the LASSO. Electron. J. Statist. 1 (2007) 169-194. | MR | Zbl

[6] E. Candes and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist. 35 (2007) 2313-2351. | MR | Zbl

[7] E. Candes, M. Rudelson, T. Tao and R. Vershynin. Error correction via linear programming. In Proc. 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS05) 295-308. IEEE, Pittsburgh, PA, 2005.

[8] E. Candes, J. Romberg and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math. 59 (2006) 1207-1223. | MR | Zbl

[9] O. Catoni. Statistical Learning Theory and Stochastic Optimization. Springer, New York, 2004. | MR | Zbl

[10] D. L. Donoho. For most large underdetermined systems of equations the minimal ℓ1-norm near-solution approximates the sparsest near-solution. Preprint, 2004. | Zbl

[11] D. L. Donoho. For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution. Comm. Pure Appl. Math. 59 (2006) 797-829. | MR | Zbl

[12] D. L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory 52 (2006) 1289-1306. | MR

[13] D. L. Donoho, M. Elad and V. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Inform. Theory 52 (2006) 6-18. | MR

[14] S. van de Geer. High-dimensional generalized linear models and the Lasso. Ann. Statist. 36 (2008) 614-645. | MR | Zbl

[15] V. Koltchinskii. Model selection and aggregation in sparse classification problems. Oberwolfach Reports Meeting on Statistical and Probabilistic Methods of Model Selection, October, 2005.

[16] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist. 34 (2006) 2593-2656. | MR | Zbl

[17] V. Koltchinskii and D. Panchenko. Complexities of convex combinations and bounding the generalization error in classification. Ann. Statist. 33 (2005) 1455-1496. | MR | Zbl

[18] M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, New York, 1991. | MR | Zbl

[19] P. Massart. Some applications of concentration inequalities to statistics. Ann. Fac. Sci. Toulouse (IX) 9 (2000) 245-303. | Numdam | MR | Zbl

[20] P. Massart. Concentration Inequalities and Model Selection. Springer, Berlin, 2007. | MR | Zbl

[21] S. Mendelson, A. Pajor and N. Tomczak-Jaegermann. Reconstruction and subgaussian operators in Asymptotic Geometric Analysis. Geom. Funct. Anal. 17 (2007) 1248-1282. | MR | Zbl

[22] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the LASSO. Ann. Statist. 34 (2006) 1436-1462. | MR | Zbl

[23] A. Nemirovski. Topics in non-parametric statistics. In École d'Été de Probabilités de Saint-Flour XXVIII, 1998 85-277. P. Bernard (Ed). Springer, New York, 2000. | MR | Zbl

[24] M. Rudelson and R. Vershynin. Geometric approach to error correcting codes and reconstruction of signals. Int. Math. Res. Not. 64 (2005) 4019-4041. | MR | Zbl

[25] R. Tibshirani. Regression shrinkage and selection via the Lasso. J. Royal Statist. Soc. Ser. B 58 (1996) 267-288. | MR | Zbl

[26] A. Tsybakov. Optimal rates of aggregation. In Proc. 16th Annual Conference on Learning Theory (COLT) and 7th Annual Workshop on Kernel Machines, 303-313. Lecture Notes in Artificial Intelligence 2777. Springer, New York, 2003.

[27] A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer, New York, 1996. | MR | Zbl

[28] Y. Yang. Mixing strategies for density estimation. Ann. Statist. 28 (2000) 75-87. | MR | Zbl

[29] Y. Yang. Aggregating regression procedures for a better performance. Bernoulli 10 (2004) 25-47. | MR | Zbl

[30] P. Zhao and B. Yu. On model selection consistency of LASSO. J. Mach. Learn. Res. 7 (2006) 2541-2563. | MR
