Impact of subsampling and tree depth on random forests
ESAIM: Probability and Statistics, Volume 22 (2018), pp. 96–128.

Random forests are ensemble learning methods introduced by Breiman [Mach. Learn. 45 (2001) 5–32] that operate by averaging several decision trees built on randomly selected subsamples of the data set. Despite their widespread use in practice, the respective roles of the different mechanisms at work in Breiman’s forests are not yet fully understood, nor is the tuning of the corresponding parameters. In this paper, we study the influence of two parameters, namely the subsampling rate and the tree depth, on the performance of Breiman’s forests. More precisely, we prove that quantile forests (a specific type of random forest) based on subsampling and quantile forests whose tree construction is stopped early have similar performance, provided their respective parameters (subsampling rate and tree depth) are well chosen. Moreover, experiments show that a proper tuning of these parameters leads, in most cases, to an improvement over Breiman’s original forests in terms of mean squared error.
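To make the two parameters concrete, here is a minimal sketch (not the authors' code) contrasting a subsampling-controlled forest with a depth-controlled one. It uses scikit-learn's RandomForestRegressor as a stand-in for the quantile forests analyzed in the paper, with max_samples playing the role of the subsampling rate and max_depth that of the tree depth; the Friedman #1 regression problem is an arbitrary illustrative choice.

# Illustrative sketch: two ways to regularize a forest, per the paper's theme.
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Forest 1: fully grown trees, each built on a small subsample of the data
# (regularization via the subsampling rate).
subsampled = RandomForestRegressor(
    n_estimators=200, max_samples=0.2, max_depth=None, random_state=0
).fit(X_tr, y_tr)

# Forest 2: trees whose construction is stopped early, each built on a
# full-size bootstrap sample (regularization via the tree depth).
shallow = RandomForestRegressor(
    n_estimators=200, max_samples=None, max_depth=4, random_state=0
).fit(X_tr, y_tr)

for name, forest in [("subsampled", subsampled), ("shallow", shallow)]:
    print(name, mean_squared_error(y_te, forest.predict(X_te)))

With the two parameters matched appropriately, the paper's result suggests the two test errors should be of comparable magnitude; in practice both values are worth tuning against Breiman's defaults.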

DOI: 10.1051/ps/2018008
Classification: 62G05, 62G20
Keywords: Random forests, randomization, parameter tuning, subsampling, tree depth
Duroux, Roxane; Scornet, Erwan

@article{PS_2018__22__96_0,
     author = {Duroux, Roxane and Scornet, Erwan},
     title = {Impact of subsampling and tree depth on random forests},
     journal = {ESAIM: Probability and Statistics},
     pages = {96--128},
     publisher = {EDP-Sciences},
     volume = {22},
     year = {2018},
     doi = {10.1051/ps/2018008},
     mrnumber = {3891755},
     zbl = {1409.62072},
     language = {en},
     url = {http://www.numdam.org/articles/10.1051/ps/2018008/}
}
TY  - JOUR
AU  - Duroux, Roxane
AU  - Scornet, Erwan
TI  - Impact of subsampling and tree depth on random forests
JO  - ESAIM: Probability and Statistics
PY  - 2018
SP  - 96
EP  - 128
VL  - 22
PB  - EDP-Sciences
UR  - http://www.numdam.org/articles/10.1051/ps/2018008/
DO  - 10.1051/ps/2018008
LA  - en
ID  - PS_2018__22__96_0
ER  - 
%0 Journal Article
%A Duroux, Roxane
%A Scornet, Erwan
%T Impact of subsampling and tree depth on random forests
%J ESAIM: Probability and Statistics
%D 2018
%P 96-128
%V 22
%I EDP-Sciences
%U http://www.numdam.org/articles/10.1051/ps/2018008/
%R 10.1051/ps/2018008
%G en
%F PS_2018__22__96_0
Duroux, Roxane; Scornet, Erwan. Impact of subsampling and tree depth on random forests. ESAIM: Probability and Statistics, Volume 22 (2018), pp. 96–128. doi: 10.1051/ps/2018008. http://www.numdam.org/articles/10.1051/ps/2018008/

S. Arlot and R. Genuer, Analysis of Purely Random Forests Bias. Preprint (2014). | arXiv

G. Biau, Analysis of a random forests model. J. Mach. Learn. Res. 13 (2012) 1063–1095. | MR | Zbl

G. Biau and L. Devroye, Cellular tree classifiers, in Algorithmic Learning Theory. Springer, Cham (2014) 8–17. | MR | Zbl

G. Biau, L. Devroye and G. Lugosi, Consistency of random forests and other averaging classifiers. J. Mach. Learn. Res. 9 (2008) 2015–2033. | MR | Zbl

L. Breiman, Random forests. Mach. Learn. 45 (2001) 5–32. | DOI | Zbl

L. Breiman, J.H. Friedman, R.A. Olshen and C.J. Stone, Classification and Regression Trees. Chapman & Hall, CRC, Boca Raton (1984). | Zbl

P. Bühlmann, Bagging, boosting and ensemble methods, in Handbook of Computational Statistics. Springer, Berlin, Heidelberg (2012) 985–1022. | DOI | MR

M. Denil, D. Matheson and N. De Freitas, Consistency of online random forests, in Vol. 28 of Proceedings of the 30th International Conference on Machine Learning (ICML’13), Atlanta, GA, USA, June 16–21 (2013) 1256–1264.

M. Denil, D. Matheson and N. De Freitas, Narrowing the gap: random forests in theory and in practice, in International Conference on Machine Learning (ICML) (2014).

L. Devroye, L. Györfi and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Springer, New York (1996). | DOI | MR | Zbl

R. Díaz-Uriarte and S. Alvarez de Andrés, Gene selection and classification of microarray data using random forest. BMC Bioinform. 7 (2006) 1–13. | DOI

M. Fernández-Delgado, E. Cernadas, S. Barro and D. Amorim, Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res. 15 (2014) 3133–3181. | MR | Zbl

R. Genuer, Variance reduction in purely random forests. J. Nonparametric Stat. 24 (2012) 543–562. | DOI | MR | Zbl

R. Genuer, J. Poggi and C. Tuleau-Malot, Variable selection using random forests. Pattern Recognit. Lett. 31 (2010) 2225–2236. | DOI

H. Ishwaran and U.B. Kogalur, Consistency of random survival forests. Stat. Probab. Lett. 80 (2010) 1056–1064. | DOI | MR | Zbl

L. Meier, S. van de Geer and P. Bühlmann, High-dimensional additive modeling. Ann. Stat. 37 (2009) 3779–3821. | DOI | MR | Zbl

L. Mentch and G. Hooker, Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J. Mach. Learn. Res. 17 (2015) 841–881. | MR

Y. Qi, Random forest for bioinformatics, in Ensemble Machine Learning. Springer, Boston, MA (2012) 307–323.

G. Rogez, J. Rihan, S. Ramalingam, C. Orrite and P.H. Torr, Randomized trees for human pose detection, in IEEE Conference on Computer Vision and Pattern Recognition (2008) 1–8.

M. Sabzevari, G. Martínez-Muñoz and A. Suárez, Improving the Robustness of Bagging with Reduced Sampling Size. Université catholique de Louvain (2014).

E. Scornet, On the asymptotics of random forests. J. Multivar. Anal. 146 (2016) 72–83. | DOI | MR | Zbl

E. Scornet, G. Biau and J.-P. Vert, Consistency of random forests. Ann. Stat. 43 (2015) 1716–1741. | DOI | MR | Zbl

C.J. Stone, Optimal rates of convergence for nonparametric estimators. Ann. Stat. 8 (1980) 1348–1360. | DOI | MR | Zbl

C.J. Stone, Optimal global rates of convergence for nonparametric regression. Ann. Stat. 10 (1982) 1040–1053. | DOI | MR | Zbl

M. van der Laan, E.C. Polley and A.E. Hubbard, Super learner. Stat. Appl. Genet. Mol. Biol. 6 (2007). | DOI | MR | Zbl

S. Wager, Asymptotic Theory for Random Forests. Preprint (2014). | arXiv

S. Wager and S. Athey, Estimation and inference of heterogeneous treatment effects using random forests. J. Am. Stat. Assoc. (2018) 1–15. | MR | Zbl

S. Wager and G. Walther, Adaptive Concentration of Regression Trees, with Application to Random Forests. Preprint (2015).

F. Zaman and H. Hirose, Effect of subsampling rate on subbagging and related ensembles of stable classifiers, in International Conference on Pattern Recognition and Machine Intelligence. Springer (2009) 44–49. | DOI
