Variable selection through CART
ESAIM: Probability and Statistics, Volume 18 (2014), pp. 770-798.

This paper deals with variable selection in regression and binary classification frameworks. It proposes an automatic and exhaustive procedure that relies on the CART algorithm and on model selection via penalization. This work, of a theoretical nature, aims at determining adequate penalties, i.e., penalties that yield oracle-type inequalities justifying the performance of the proposed procedure. Since the exhaustive procedure cannot be carried out when the number of variables is too large, a more practical procedure is also proposed and is likewise theoretically validated. A simulation study completes the theoretical results.

DOI: https://doi.org/10.1051/ps/2014006
Classification: 62G05, 62G07, 62G20
Keywords: binary classification, CART, model selection, penalization, regression, variable selection
@article{PS_2014__18__770_0,
     author = {Sauve, Marie and Tuleau-Malot, Christine},
     title = {Variable selection through CART},
     journal = {ESAIM: Probability and Statistics},
     pages = {770--798},
     publisher = {EDP-Sciences},
     volume = {18},
     year = {2014},
     doi = {10.1051/ps/2014006},
     language = {en},
     url = {http://www.numdam.org/articles/10.1051/ps/2014006/}
}
Sauve, Marie; Tuleau-Malot, Christine. Variable selection through CART. ESAIM: Probability and Statistics, Volume 18 (2014), pp. 770-798. doi: 10.1051/ps/2014006. http://www.numdam.org/articles/10.1051/ps/2014006/
