--- class: middle, center, title-slide # `WildWood`: a new Random Forest algorithm
**Stéphane Gaïffas**
[https://stephanegaiffas.github.io](https://stephanegaiffas.github.io)
[Joint work with **Ibrahim Merad** and **Yiyang Yu** (LPSM, Univ. de Paris)]
.grid[ .kol-1-4[.width-100[![](figures/lpsm.svg)]] .kol-1-4[.width-55[![](figures/up.svg)]] .kol-1-4[.width-45[![](figures/ens.svg)]] .kol-1-4[.width-100[![](figures/prairie.svg)]] ] --- class: middle, center, end-slide # 1. Introduction --- class: middle # Why Random Forest ? - Still widely used for **tabular data** - Ease of use, robustness to hyperparameters, embarrassingly parallel - Very popular in some domains (finance, bioinformatics, predictive medicine, astronomy, among many others...) Some **`stackoverflow`** search results for *"deep learning"*: .center.width-70[![](figures/stack-deep.jpg)] and for *"random forest"*: .center.width-70[![](figures/stack-rf.jpg)] --- class: middle Let's have a look at **Google Trends**: .center.width-100[![](figures/google-trends1.jpg)] .center.width-100[![](figures/google-trends2.jpg)] ## In this talk 1. We'll describe the *standard Random Forest* algorithm 1. And describe the construction of a *new version* of it called **`WildWood`** --- class: middle, center, end-slide # 2. Motivating examples --- # Toy example .center.width-90[![](figures/fig_aggregation_samples.svg)] .center[Binary classification dataset with 2 continuous features] --- # Standard Random Forest .center.width-90[![](figures/fig_aggregation_1.svg)] .center[Decision function of a **standard Random Forest** with 10 trees] --- # Standard Random Forest .center.width-90[![](figures/fig_aggregation_1_tree_0.svg)] .center[Decision function of tree #1 in the **standard Random Forest**] --- # Standard Random Forest .center.width-90[![](figures/fig_aggregation_1_tree_1.svg)] .center[Decision function of tree #2 in the **standard Random Forest**] --- # Standard Random Forest .center.width-90[![](figures/fig_aggregation_1_tree_2.svg)] .center[Decision function of tree #3 in the **standard Random Forest**] --- # Standard Random Forest .center.width-90[![](figures/fig_aggregation_1_tree_3.svg)] .center[Decision function of tree #4 in the **standard Random Forest**] --- # Standard Random Forest .center.width-90[![](figures/fig_aggregation_1_tree_4.svg)] .center[Decision function of tree #5 in the **standard Random Forest**] --- # `WildWood` .center.width-90[![](figures/fig_aggregation_0.svg)] .center[Decision function of **`WildWood`** with 10 trees] --- # `WildWood` .center.width-90[![](figures/fig_aggregation_0_tree_0.svg)] .center[Decision function of tree #1 in **`WildWood`**] --- # `WildWood` .center.width-90[![](figures/fig_aggregation_0_tree_1.svg)] .center[Decision function of tree #2 in **`WildWood`**] --- # `WildWood` .center.width-90[![](figures/fig_aggregation_0_tree_2.svg)] .center[Decision function of tree #3 in **`WildWood`**] --- # `WildWood` .center.width-90[![](figures/fig_aggregation_0_tree_3.svg)] .center[Decision function of tree #4 in **`WildWood`**] --- # `WildWood` .center.width-90[![](figures/fig_aggregation_0_tree_4.svg)] .center[Decision function of tree #5 in **`WildWood`**] --- # Standard RF vs `WildWood` On the previous toy example: .grid[ .kol-1-2[ .width-105[![](figures/fig_aggregation_1.svg)]
.center[Standard Random Forest] ] .kol-1-2[ .width-100[![](figures/fig_aggregation_0.svg)]
.center[`WildWood`] ] ] --- # Standard RF vs `WildWood` .center.width-100[![](figures/fig_n_trees.png)] - Mean test AUC (y-axis) on 4 datasets against the number of trees used (x-axis) - Scores averaged over 10 train/test splits for **`WildWood`** and **`scikit-learn`**’s `RandomForestClassifier` and `ExtraTreesClassifier` classes [1, 2] Thanks to **subtrees aggregation** (and other tricks), `WildWood` improves on these baselines, even with *few trees* - Even a *single tree* performs well! .footnote[ [1]: F. Pedregosa et al. *Scikit-learn: Machine learning in Python*. Journal of Machine Learning Research, 2011.
[2]: G. Louppe, *Understanding random forests: From theory to practice.* PhD thesis, University of Liege, 2014.
] --- # `WildWood` vs EGB libraries EGB = Extreme Gradient Boosting libraries .center.width-100[![](figures/fig_auc_timings_new.jpg)] - Test AUC (top) and training time (bottom) of `WildWood` compared with very popular EGB libraries (**after hyperoptimization** of all algorithms) - `WildWood` is *slightly below such very strong baselines*, but is **faster** (training times are on a *logarithmic scale*) on the considered datasets .footnote[ **XGBoost**: T. Chen and C. Guestrin, *Xgboost: A scalable tree boosting system*, SIGKDD 2016
**LightGBM**: G. Ke et al., *Lightgbm: A highly efficient gradient boosting decision tree*, NeurIPS 2017
**CatBoost**: L. Prokhorenkova et al., *Catboost: unbiased boosting with categorical features*, arXiv preprint arXiv:1706.09516, 2017. ] --- class: middle, center, end-slide # 3. Standard decision trees and Random Forest --- class: middle # Training data Supervised training data $(x\_i, y\_i)$ for $i \in I = \\{ 1, \ldots, n \\}$ where - $x_i$ is the vector of *features* of sample $i$ - $y_i$ is the *label* of sample $i$ About **features**: we have $x_i \in \mathcal X \subset \mathbb R^d$ where $\mathcal X$ is the *feature* space - $x_i$ can contain *numerical* or *categorical* features - Random Forest with *categorical* features: the standard encoding approach is *one-hot encoding* About **labels**: we have $y_i \in \mathcal Y$ where $\mathcal Y$ is the label set - For regression we have $\mathcal Y = \mathbb R$ - For $K$-class classification we have $\mathcal Y = \\{ 0, \ldots, K - 1\\}$, with predictions in $\widehat{\mathcal Y} = \mathcal P(\mathcal Y) =$ set of probabilities on $\mathcal Y$ --- class: middle # Decision trees Standard example of a classification tree on `iris` .grid[ .kol-1-2.width-100[![](figures/iris_plot.svg)] .kol-1-2.width-100[![](figures/cart_iris.svg)] ] --- ## Decision trees: main principles - Construction of a *partition* of the feature space: a *guillotine partition*, represented equivalently by a *binary tree* - Each *branch* of the tree corresponds to a pair (feature, threshold) called a *split* - Each *node* of the tree is a *rectangular region* of the feature space. Final nodes are called *leaves* and are the elements of the considered partition - Predictions are *constant within leaves*, and use the *training samples they contain* (for regression: the average of the labels in the leaf) .grid[ .kol-1-2.width-70[![](figures/tree1.png)] .kol-1-2.width-60[![](figures/tree2.png)] ] --- class: middle ## Binary trees A *finite ordered binary tree* $\mathcal T$ is a finite subset of the set $\\{ 0, 1 \\}^{\*} = \bigcup\_{n \geq 0} \\{ 0, 1 \\}^n$ of all finite words on $\\{ 0, 1\\}$. .grid[ .kol-2-3[ - $\\{ 0, 1 \\}^{\*}$ is endowed with a *tree structure* (complete binary tree), the empty word being $\mathtt{root}$ - For any $\mathbf v \in \\{ 0, 1\\}^{\*}$, the *left child* is ${\mathbf v}0$ and the *right child* is ${\mathbf v}1$ - $\text{inodes} (\mathcal T) := \\{ {\mathbf v}\in \mathcal T : {\mathbf v}0, {\mathbf v}1 \in \mathcal T \\}$ - $\text{leaves} (\mathcal T) := \\{ {\mathbf v}\in \mathcal T : {\mathbf v}0, {\mathbf v}1 \not\in \mathcal T \\}$ ] .kol-1-3[ .width-100[![](figures/binary-tree.svg)] ] ] Note that $\text{inodes} (\mathcal T)$ and $\text{leaves} (\mathcal T)$ are *disjoint* and the set of all nodes is $$ \text{nodes}(\mathcal T) := \text{inodes} (\mathcal T) \cup \text{leaves} (\mathcal T) $$ --- class: middle ## Splits and cells The *split* $\sigma\_{\mathbf v} = (j\_{\mathbf v}, t\_{\mathbf v})$ of each ${\mathbf v}\in \text{inodes} (\mathcal T)$ is characterized by: - its *dimension* $j\_{\mathbf v} \in \\{ 1, \ldots, d \\}$ - a *threshold* $t\_{\mathbf v} \in \mathbb R$ To each ${\mathbf v} \in \text{nodes}(\mathcal T)$ corresponds a *cell* $C\_{\mathbf v} \subseteq \mathcal X$ defined recursively: $C\_{\mathtt{root}} = \mathcal X$ and, for each ${\mathbf v} \in \text{inodes}(\mathcal T)$, we put $$ C\_{{\mathbf v}0} := \\{ x \in C\_{\mathbf v} : x\_{j\_{\mathbf v}} \leq t\_{{\mathbf v}} \\} \quad \text{and} \quad C\_{{\mathbf v}1} := C\_{\mathbf v} \setminus C\_{{\mathbf v}0}. $$ By construction, $(C\_{\mathbf v})\_{\mathbf v \in \text{leaves} (\mathcal T)}$ is a *partition* of $\mathcal X$.
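To make the cells concrete, here is a minimal sketch (plain `Python`, not `WildWood`'s code) of a binary tree stored as split nodes $(j\_{\mathbf v}, t\_{\mathbf v})$, with the routine that finds the leaf whose cell contains a given $x$; the class and field names are purely illustrative.

```python
import numpy as np
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # Split (j_v, t_v) for an internal node; `feature` is None for a leaf
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None        # child v0, cell {x in C_v : x_{j_v} <= t_v}
    right: Optional["Node"] = None       # child v1, the rest of C_v
    prediction: Optional[float] = None   # constant prediction attached to a leaf

def find_leaf(root: Node, x: np.ndarray) -> Node:
    """Return the leaf whose cell contains x, following the recursive split rule."""
    node = root
    while node.feature is not None:      # internal node: go left or right
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node

# Toy tree on d=2 features: split on feature 1 at 0.5, then on feature 2 at 1.0
tree = Node(
    feature=0, threshold=0.5,
    left=Node(feature=1, threshold=1.0,
              left=Node(prediction=0.2), right=Node(prediction=0.7)),
    right=Node(prediction=0.9),
)
print(find_leaf(tree, np.array([0.3, 2.0])).prediction)  # prints 0.7
```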
--- class: middle ## Nodes predictions Let $\mathbf v \in \text{leaves} (\mathcal T)$. For *regression* $(\mathcal Y = \mathbb R)$, decision trees typically use $$ {\widehat y}\_{\mathbf v} := \frac{1}{n\_{\mathbf v}} \sum\_{i \in I : x\_i \in C\_{\mathbf v}} y\_i \quad \text{where} \quad n\_{\mathbf v} := | \\{ i \in I : x\_i \in C\_{\mathbf v} \\} | $$ while for $K$-class *classification* $(\mathcal Y = \\{ 0, \ldots, K-1 \\} )$, they use $$ {\widehat y}\_{\mathbf v}(k) := \frac{n\_{\mathbf v}(k)}{n\_{\mathbf v}} \quad \text{where} \quad n\_{\mathbf v}(k) := | \\{ i \in I : x\_i \in C\_{\mathbf v}, y_i = k \\} | $$ for $k \in \\{ 0, \ldots, K-1 \\}$. ## Tree prediction The prediction of a decision tree $\mathcal T$ for $x \in \mathcal X$ is defined as the *prediction of the leaf* ${\mathbf v}\_{\mathcal T} (x)$ that contains $x$: $$ {\widehat y}\_{\mathcal T}(x) = {\widehat y}\_{{\mathbf v}\_{\mathcal T} (x)} $$ --- ## Tree growing The quality of the prediction depends on the tree (hence on the partition). Issue: finding the *optimal* tree is *very hard*! Decision trees use a recursive **greedy** approach: - Start from the tree with a single node `root` corresponding to $\mathcal X$ - Recursively split the *current* leaves $\mathbf v \in \text{leaves} (\mathcal T)$: look for the *best* $\sigma\_{\mathbf v} = (j\_{\mathbf v}, t\_{\mathbf v})$ for each $\mathbf v$ - The *best* $\sigma\_{\mathbf v}$ is the one that leads to children $\mathbf v0$ and $\mathbf v1$ with improved *homogeneity* of the labels they contain - *Homogeneity* is quantified by *impurity* measures such as the *variance* for regression, and *Gini* or *entropy* for classification ## Remarks - Here *greedy* means that splits are chosen one at a time and never reconsidered afterwards - This procedure is called *CART* (Classification and Regression Trees). A single tree usually leads to *poor predictions* and *overfitting*. --- class: middle ## Split finding Look for the split $\sigma\_{\mathbf v} = (j\_{\mathbf v}, t\_{\mathbf v})$, leading to children $\mathbf v0$ and $\mathbf v1$, that maximizes $$ \mathrm{IG}(j\_{\mathbf v}, t\_{\mathbf v}) = H(\mathbf v) - \frac{n\_{\mathbf v0}}{n\_{\mathbf v}} H(\mathbf v0) - \frac{n\_{\mathbf v1}}{n\_{\mathbf v}} H(\mathbf v1) $$ with $$ H(\mathbf v) = \frac{1}{n\_{\mathbf v}} \sum\_{i \in I : x\_i \in C\_{\mathbf v}} (y\_i - \widehat y\_{\mathbf v})^2 $$ for the *variance* criterion and $$ H(\mathbf v) = \sum\_{k=0}^{K-1} \frac{n\_{\mathbf v}(k)}{n\_{\mathbf v}} \Big(1 - \frac{n\_{\mathbf v}(k)}{n\_{\mathbf v}} \Big) = 1 - \sum\_{k=0}^{K-1} \Big(\frac{n\_{\mathbf v}(k)}{n\_{\mathbf v}} \Big)^2 $$ for the *Gini* criterion ($K$-class classification). **Remarks.** Split finding requires trying out all $j=1, \ldots, d$ and all possible thresholds $t$ among the *sorted feature values* $(x\_{i,j})\_i$ for $\\{i \in I : x\_i \in C\_{\mathbf v}\\}$
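For concreteness, here is a deliberately naive sketch of exhaustive split finding with the Gini criterion on raw (unbinned) features; it only illustrates the formulas above and is not `WildWood`'s implementation, which works on binned features as described later.

```python
import numpy as np

def gini(y: np.ndarray, n_classes: int) -> float:
    """Gini impurity H(v) = 1 - sum_k p_k^2 of the labels y falling in a node."""
    if y.size == 0:
        return 0.0
    p = np.bincount(y, minlength=n_classes) / y.size
    return 1.0 - np.sum(p ** 2)

def best_split(X: np.ndarray, y: np.ndarray, n_classes: int):
    """Exhaustive search of (feature j, threshold t) maximizing the information
    gain H(v) - (n_v0 / n_v) H(v0) - (n_v1 / n_v) H(v1)."""
    n, d = X.shape
    parent = gini(y, n_classes)
    best = (None, None, -np.inf)  # (j, t, gain)
    for j in range(d):
        # candidate thresholds: midpoints between consecutive sorted unique values
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2:
            left = X[:, j] <= t
            gain = (parent
                    - left.mean() * gini(y[left], n_classes)
                    - (1 - left.mean()) * gini(y[~left], n_classes))
            if gain > best[2]:
                best = (j, t, gain)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(np.int64)
print(best_split(X, y, n_classes=2))  # picks feature 0 with a threshold close to 0
```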
--- # Standard Random Forest Combine the following ingredients [Breiman (2001), Biau et al. (2016), Louppe (2014)]: **1. Ensemble of regression or classification trees.** Train several decision trees in parallel. They all work *independently of each other* (embarrassingly parallel) **2. Bagging.** These trees are *randomized* by training them on *bootstrap samples* (sampling with replacement) of the dataset - Use an *average* of the predictions given by the trees: this reduces the variance of a single tree - Even better if the trees are "not correlated" **3. Column subsampling.** *Further randomize* the trees by sampling a *random subset of features* each time we look for a split - This further reduces the correlation between trees --- class: middle ## Bootstrap - $I = \\{1, \ldots, n\\} = $ training sample indices - From $I$, sample $n$ elements *uniformly* at random, *with replacement* - Define $I_{\mathtt{itb}} \subset I$ as the set of *"in-the-bag"* indices containing the unique values sampled - Define the *"out-of-bag"* indices as $I\_{\mathtt{otb}} := I \setminus I\_{\mathtt{itb}}$. They are *only used* to compute **out-of-bag scores** in a standard Random Forest. Standard 0.632 rule (Efron 1997): we have $$ \mathbb P[i \in I\_{\mathtt{itb}}] = 1 - (1 - 1/n)^n \rightarrow 1 - e^{-1} \approx 0.632 \quad \text{as} \quad n \rightarrow +\infty $$ ## Feature subsampling - For *each* split search, *sample uniformly at random* a subset $J \subset \\{ 1, \ldots, d\\}$ of size $d\_{\max} \leq d$, and try out only splits $(j, t)$ with $j \in J$ - The default choice is often $d\_{\max} = \sqrt{d}$ --- class: middle ## Bagging Grow (in parallel) $M$ *randomized* decision trees $$ {\widehat f} (\bullet \; ; \; \Pi\_1), \ldots, {\widehat f} (\bullet \; ; \; \Pi\_M) $$ where $\Pi_1, \ldots, \Pi\_M$ are i.i.d. realizations of the *bootstrap* and *feature subsampling* random process. Given $x \in \mathcal X$, the Random Forest prediction ${\widehat g}(x)$ uses the *average* of the trees' predictions $$ {\widehat g}(x \; ; \; \mathbf \Pi) = \frac 1M \sum\_{m=1}^M {\widehat f} (x \; ; \; \Pi\_m) $$ - The default choice in `scikit-learn` is $M=100$ - We use $M=10$ by default in `WildWood` --- class: center, middle, end-slide # 4. A new Random Forest algorithm: `WildWood` --- class: middle # `WildWood` Main ingredients in `WildWood` are: 1. *Features binning* and a *histogram strategy* for **faster split finding** 1. Careful handling of *categorical* features: **no one-hot encoding** required 1. A new prediction function based on *subtrees aggregation* - **Improved** predictions - **Fewer trees** required - Involves a very *different use of out-of-bag samples* --- # Features binning `WildWood` performs split finding on **binned features**. This is common practice in *extreme gradient boosting*, introduced in `LightGBM`. - The features matrix $\mathbf X$ is transformed into a *same-size* matrix of *binned features* denoted $\mathbf X^{\mathrm{bin}}$. - To each $j=1, \ldots, d$ (columns of $\mathbf X$) is associated a set $B\_j = \\{ 1, \ldots, b\_j \\}$ of bins with $b\_j \leq b\_{\max}$ (default is $b\_{\max} = 255$: a single byte per value). - When a feature is *continuous*, it is binned into $b\_{\max}$ bins using **inter-quantile intervals**. - If it is *categorical*, each category is **mapped to a bin** (and the rarest categories end up binned together) - If a feature contains *missing values*, we just use a **specific bin** to encode them - After binning, each column satisfies $\mathbf X\_{\bullet, j}^\mathrm{bin} \in B\_j^n$, and we denote the *binned feature space* by $C = \prod_{j=1}^d B_j$. --- # Split finding on histograms Let us consider $K$-class classification. ## Histograms When looking for the best split of a node ${\mathbf v}$, we compute the *"histogram"* of the in-the-bag samples it contains: $$ \mathrm{hist}\_{\mathbf v}[j, b, k] = \sum\_{i \in I\_{\mathtt{itb}} : x\_i \in C\_{\mathbf v}} \mathbf 1\_{x\_{i, j} = b, y\_i = k} $$ for each feature $j$, bin $b$ and label class $k$.
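A minimal `numpy` sketch of this histogram computation (purely illustrative: `WildWood`'s actual code is `numba`-compiled and organized differently, and the function and variable names below are made up; `X_bin` is assumed to hold integer bin indices):

```python
import numpy as np

def node_histogram(X_bin: np.ndarray, y: np.ndarray, samples: np.ndarray,
                   n_bins: int, n_classes: int) -> np.ndarray:
    """hist[j, b, k] = number of samples of the node with bin b on feature j
    and label k, i.e. the statistic hist_v[j, b, k] displayed above."""
    n_features = X_bin.shape[1]
    hist = np.zeros((n_features, n_bins, n_classes), dtype=np.int64)
    for i in samples:                    # samples = {i in I_itb : x_i in C_v}
        for j in range(n_features):
            hist[j, X_bin[i, j], y[i]] += 1
    return hist

# Toy data: 8 samples, 2 binned features (4 bins each), binary labels
X_bin = np.array([[0, 1], [1, 1], [2, 3], [3, 0],
                  [0, 2], [1, 3], [2, 0], [3, 2]], dtype=np.uint8)
y = np.array([0, 0, 1, 1, 0, 1, 1, 0], dtype=np.int64)
node_samples = np.arange(8)              # pretend all samples fall in the node
hist = node_histogram(X_bin, y, node_samples, n_bins=4, n_classes=2)
print(hist[0])                           # per-bin label counts for feature 0
```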
One has $$ \mathrm{hist}\_{\mathbf v} = \mathrm{hist}\_{\mathbf v0} + \mathrm{hist}\_{\mathbf v 1} $$ $\Rightarrow$ no need to compute histograms for both ${\mathbf v}0$ and ${\mathbf v}1$, but only a single one. --- ## Split finding We **loop** over: - the set $J\_{\mathbf v}$ of *non-constant features* in the node $$ J\_{\mathbf v} := \Big\\{ j : | H\_{\mathbf v}[j] | \geq 2 \Big\\} \quad \text{ where } \quad H\_{\mathbf v}[j] := \Big\\{ b : \sum\_{k} \mathrm{hist}\_{\mathbf v}[j, b, k] \geq 1 \Big\\} $$ - and over the set of *non-empty* bins $H\_{\mathbf v}[j]$ **Along the loop** we: - compute the *Gini* or *entropy* criterion using the **histogram**'s statistics - keep the best split found **Remark** - Split finding in standard RF requires to *sort* $n\_{\mathbf v}$ numbers $d\_{\max}$ times to find the best threshold for each feature [Louppe (2014)] - Split finding on histograms requires to *compute the histogram* **once** and to *sort* $b\_j \ll n\_{\mathbf v}$ numbers $d\_{\max}$ times. **Much faster** on *large* datasets --- class: middle ## Splits and cells The split $\sigma\_{\mathbf v} = (j\_{\mathbf v}, t\_{\mathbf v})$ of each ${\mathbf v}\in \text{inodes} (\mathcal T)$ is characterized by: - its dimension $j\_{\mathbf v} \in \\{ 1, \ldots, d \\}$ - a *subset of bins* $t\_{\mathbf v} \subsetneq \\{ 1, \ldots, b\_{j\_{\mathbf v}} \\}$ and cells are defined recursively as before: $C\_{\mathtt{root}} = C$ and $$ C\_{{\mathbf v}0} := \\{ x \in C\_{\mathbf v} : x\_{j\_{\mathbf v}} \in t\_{{\mathbf v}} \\} \quad \text{and} \quad C\_{{\mathbf v}1} := C\_{\mathbf v} \setminus C\_{{\mathbf v}0}. $$ - if $j\_{\mathbf v}$ is a *continuous* feature, bins have a natural order and $t\_{\mathbf v}= \\{ 1, 2, \ldots, s\_{\mathbf v} \\}$ for some bin threshold $s\_{\mathbf v} \in B\_{j\_{\mathbf v}}$ - if $j\_{\mathbf v}$ is a *categorical* feature, the whole set $t\_{\mathbf v}$ is required --- class: middle ## Bin order for categorical features **Order of the bins** used in the loop depends on the **type** of the feature. If **continuous**, use the *natural order* of bins. If **categorical**: - For binary classification ($y\_i \in \\{0, 1\\}$), use the bin order that sorts the proportion of labels $1$ in each bin: $$ \frac{\mathrm{hist}\_{\mathbf v}[j, b, 1]}{\sum\_{k \in \\{ 0, 1\\}} \mathrm{hist}\_{\mathbf v}[j, b, k]} $$ - This finds the *optimal split* with complexity $O(b_j \log b_j)$ (Theorem 9.6 in Breiman et al. (1984)) while there are *$2^{b_j - 1} -1$ possible splits*! - Cute trick introduced in `LightGBM`, using an order of gradient / hessian statistics of the boosting loss --- class: middle For $K$-class classification with $K > 2$, we implement two *heuristic strategies*: 1. **one-versus-rest**: train $M K$ trees instead of $M$, each tree trained with a binary one-versus-rest label 1. 
**heuristic**: train $M$ trees, split finding uses $K$ loops over the bin orders that sort $$ \frac{\mathrm{hist}\_{\mathbf v}[j, b, k]}{\sum\_{k' \in \\{ 0, \ldots, K-1\\}} \mathrm{hist}\_{\mathbf v}[j, b, k']} $$ for $k=0, \ldots, K-1$ If a feature contains **missing values**: - loop *left to right* (along bin order(s)), and *right to left* as well - this compares splits that put missing values *on the left* or *on the right* --- # Splits and tree growth stopping For $\mathbf v \in \text{nodes}(\mathcal T)$, let $$ n\_{\mathtt{itb}, \mathbf v} := | \\{ i \in I\_{\mathtt{itb}} : x\_i \in C\_{\mathbf v} \\} | \quad \text{and} \quad n\_{\mathtt{otb}, \mathbf v} := | \\{ i \in I\_{\mathtt{otb}} : x\_i \in C\_{\mathbf v} \\} | $$ Given $n\_{\text{min-split}} \geq 2$ and $n\_{\text{min-leaf}} \geq 1$ we: 1. discard all potential splits of $\mathbf v$ leading to children ${\mathbf v}0, {\mathbf v}1$ if $$ \min\big( n\_{\mathtt{itb}, \mathbf v0}, \; n\_{\mathtt{otb}, \mathbf v0}, \; n\_{\mathtt{itb}, \mathbf v1}, \; n\_{\mathtt{otb}, \mathbf v1} \big) < n\_{\text{min-split}} $$ 1. do not split $\mathbf v$ and make it a leaf if $\min\big(n\_{\mathtt{itb}, \mathbf v}, \; n\_{\mathtt{otb}, \mathbf v} \big) < n\_{\text{min-leaf}}$ 1. When only *leaves* or *non-splittable nodes* remain: **stop growth** of the tree 1. Trees grow in a *depth-first fashion* (a requirement in Algorithm 1 below) Points 1-4 ensure that all nodes contain both ${\mathtt{itb}}$ and ${\mathtt{otb}}$ samples - $n\_{\text{min-split}}$ and $n\_{\text{min-leaf}}$ *weakly impact* `WildWood`'s performances --- # `WildWood`'s prediction function Given a fully-grown tree $\mathcal T$ its prediction function is an *aggregation* of the predictions given by **all subtrees** rooted at ${\mathtt{root}}$ $\\{T : T \subset \mathcal T \\}$ - $\mathcal T$ is grown using ${\mathtt{itb}}$ samples - We use ${\mathtt{otb}}$ samples to perform *aggregation with exponential weights*, where weights give more importance to subtrees with a *good predictive performance* on ${\mathtt{otb}}$ samples ## Node and subtree prediction Fix a subtree $T \subset \mathcal T$ and $x \in C$ and let ${\mathbf v}\_{T} (x)$ be the leaf of $T$ containing $x$. 
The prediction of a node ${\mathbf v} \in \mathrm{nodes}(\mathcal T)$ and of a subtree $T \subset \mathcal T$ is given by $$ {\widehat y}\_{\mathbf v} = h\big( (y\_i)\_{i \in I\_{\mathtt{itb}} : x\_i \in C\_{\mathbf v}} \big) \quad \text{and} \quad \widehat y\_{T} (x) = {\widehat y}\_{{\mathbf v}\_{T} (x)} $$ where $h : \cup\_{n \geq 0} \mathcal Y^n \to \widehat{\mathcal Y}$ is a *generic forecaster* used in each cell $C\_{\mathbf v}$ --- class: middle name: node-predictors For *regression* we use $$ {\widehat y}\_{\mathbf v} = \frac{1}{n\_{\mathtt{itb}, \mathbf v}} \sum\_{i \in I\_{\mathtt{itb}} : x\_i \in C\_{\mathbf v}} y\_i $$ For *classification* with ${\mathcal Y} = \\{ 0, \ldots, K-1 \\}$, we use $$ {\widehat y}\_{\mathbf v} (k) = \frac{n\_{\mathtt{itb}, \mathbf v} (k) + \alpha}{n\_{\mathtt{itb}, \mathbf v} + K \alpha} $$ where $n\_{\mathtt{itb}, \mathbf v}(k) := | \\{ i \in I\_{\mathtt{itb}} : x\_i \in C\_{\mathbf v}, y\_i = k \\} |$ for $k \in \mathcal Y$ - This is the Bayes predictive posterior with a $\mathrm{Dir}(\alpha, \ldots, \alpha)$ prior - `WildWood` uses $\alpha = 1/2$ by default (Krichevsky-Trofimov forecaster [Tjalkens (1993)]) - Using $\alpha > 0$ ensures that ${\widehat y}\_{\mathbf v}(k) > 0$ for all $k \in \mathcal Y$: this is useful since `WildWood` uses the log-loss by default to assess ${\mathtt{otb}}$ performance --- class: middle ## Subtrees loss We use ${\mathtt{otb}}$ samples to compute the *cumulative losses* of the predictions of all subtrees $T \subset \mathcal T$ $$ L\_T := \sum\_{i \in I\_{\mathtt{otb}}} \ell ({\widehat y}\_{T} (x\_i), y\_i) $$ where $\ell : \widehat {\mathcal Y} \times {\mathcal Y} \to \mathbb R^+$ is a loss function. - The default is $\ell ({\widehat y}, y) = ({\widehat y} - y)^2$ for regression - The default is $\ell ({\widehat y}, y) = - \log {\widehat y}(y)$ for multiclass classification, where ${\widehat y}(y) \in (0, 1]$ --- class: middle name: prediction-function ## Prediction function Let $x \in C$. The prediction function $\widehat f$ of a tree $\mathcal T$ in `WildWood` is $$ \widehat f(x) = \frac{\sum\_{T \subset \mathcal T} \pi (T) e^{-\eta L\_T} {\widehat y}\_{T} (x)}{\sum\_{T \subset \mathcal T} \pi (T) e^{-\eta L\_T}} \quad \text{with} \quad \pi(T) = 2^{- \|\| T \|\|}, $$ where: - the sum is over **all subtrees** $T$ of $\mathcal T$ rooted at ${\mathtt{root}}$ (it's a **huge** sum!) - $\eta > 0$ is called the "step"; the default is $\eta=1$ for the log-loss - $\|\|T\|\| = | \mathrm{nodes}(T)| - | (\mathrm{leaves}(T) \cap \mathrm{leaves}(\mathcal T)) |$ - $\pi =$ the distribution of the branching process with branching probability $1 / 2$ at each node of $\mathcal T$, with two children when it branches $\widehat f(x)$ *aggregates* the predictions $\widehat y\_T(x)$ of *all subtrees* $T \subset \mathcal T$, *weighted* by their *performance* on ${\mathtt{otb}}$ samples. It can be understood as a **non-greedy way to prune trees** --- class: middle ## Remarks on the prediction function - Computing $\widehat f$ directly is **computationally and memory-wise infeasible** for a large $\mathcal T$, since it involves a sum over all $T \subset \mathcal T$ and requires one weight for each $T$ - The number of subtrees of a minimal tree that separates $n$ points is *exponential* in the number of nodes, hence *exponential* in $n$. - But we can compute $\widehat f$ **exactly** and **very efficiently** thanks to the *prior choice* $\pi$, using an adaptation of **context tree weighting** (Willems et al. 1995, 1998, Helmbold et al.
1997, Catoni 2004) --- class: middle **Theorem 1.** We can write $\widehat f(x) = \widehat f\_{{\mathtt{root}}}(x)$, where $\widehat f\_{{\mathtt{root}}}(x)$ satisfies the recursion $$ \widehat f\_{\mathbf v}(x) = \frac 12 \frac{w\_{\mathbf v}}{w\_{\mathbf v}^{\mathrm{den}}} {\widehat y}\_{\mathbf v} + \Big(1 - \frac 12 \frac{w\_{\mathbf v}}{w\_{\mathbf v}^{\mathrm{den}}} \Big) \widehat f\_{{\mathbf v} a}(x) $$ for $\mathbf{v}, \mathbf{v}a \in \mathtt{path}(x)$, the path in $\mathcal T$ going from $\mathtt{root}$ to ${\mathbf v}\_{\mathcal T}(x)$, where $$ w\_{\mathbf v} := \exp(-\eta L\_{\mathbf v}) \quad \text{with} \quad L\_{\mathbf v} := \sum\_{i \in I\_{\mathtt{otb}} : x\_i \in C\_{\mathbf v}} \ell ({\widehat y}\_{\mathbf v}, y\_i) $$ and where ${w\_{\mathbf v}^{\mathrm{den}}}$ are weights given by ${w\_{\mathbf v}^{\mathrm{den}}} = w\_{\mathbf v}$ if $\mathbf v \in \mathrm{leaves}(\mathcal T)$ and $$ {w\_{\mathbf v}^{\mathrm{den}}} = \frac 12 w\_{\mathbf v} + \frac 12 {w\_{\mathbf v 0}^{\mathrm{den}}} {w\_{\mathbf v 1}^{\mathrm{den}}} \quad \text{if} \quad \mathbf v \notin \mathrm{leaves}(\mathcal T) $$ **Consequence:** *very efficient computation of $\widehat f(x)$* described in the Algorithms 1 and 2 below. --- class: middle **Algorithm 1** (Computation of $\log({w\_{\mathbf v}^{\mathrm{den}}})$ for all ${\mathbf v} \in {\mathrm{nodes}}(\mathcal T)$)
**Inputs:**
$\quad$ $\mathcal T$, $\eta > 0$ and losses $L\_{\mathbf v}$ for all ${\mathbf v} \in \mathrm{nodes}(\mathcal T)$.
$\quad$ Nodes from $\mathrm{nodes}(\mathcal T)$ are stored in a data structure $\mathtt{nodes}$ that respects
$\quad$ parenthood order: for any ${\mathbf v} = \mathtt{nodes}[i\_{\mathbf v}] \in \text{inodes}(\mathcal T)$ and children
$\quad$ ${\mathbf v}a = \mathtt{nodes}[i\_{{\mathbf v}a}]$ for $a \in \\{0, 1\\}$, we have $i\_{\mathbf v a} > i\_{\mathbf v}$
**For** ${\mathbf v}\in \mathrm{reversed}(\mathtt{nodes}):$
$\quad$ **If** $\mathbf v \in \mathrm{leaves}(\mathcal T)$ :
$\quad$ $\quad$ Put $\log({w\_{\mathbf v}^{\mathrm{den}}}) \gets - \eta L\_{\mathbf v}$
$\quad$ **Else** :
$\quad$ $\quad$ Put $ \log({w\_{\mathbf v}^{\mathrm{den}}}) \gets \log\big( \frac 12 e^{-\eta L\_{\mathbf v}} + \frac 12 e^{\log({w\_{\mathbf v0}^{\mathrm{den}}}) + \log({w\_{\mathbf v1}^{\mathrm{den}}})} \big) $ **Return**:
$\quad$ The set of log-weights $\\{ \log({w\_{\mathbf v}^{\mathrm{den}}}) : \mathbf v \in \mathrm{nodes}(\mathcal T) \\}$ Algorithm 1 is applied *once $\mathcal T$ is fully grown*, so that `WildWood` is ready to produce predictions using Algorithm 2 below --- class: middle ## About Algorithm 1 - Computes ${w\_{\mathbf v}^{\mathrm{den}}}$ using the fact that trees in `WildWood` are grown in a *depth-first fashion* - We can loop *once* over nodes, leading to a $O(|\mathrm{nodes}(\mathcal T)|)$ complexity in time and memory - Works recursively over the logarithms $\log({w\_{\mathbf v}^{\mathrm{den}}})$ to avoid over- or under-flows - If required, hyper-optimization of $\eta$ or $\alpha$ is fast: does not need to grow $\mathcal T$ again, but only to update ${w\_{\mathbf v}^{\mathrm{den}}}$ for all ${\mathbf v}\in \mathrm{nodes}(\mathcal T)$ using Algorithm 1 --- class: middle **Algorithm 2** (Computation of $\widehat f(x)$ for any $x \in C$) **Inputs:**
$\quad$ Tree $\mathcal T$, losses $L\_{\mathbf v}$ and log-weights $\log({w\_{\mathbf v}^{\mathrm{den}}})$ computed by Algorithm 1 Find ${\mathbf v}\_{\mathcal T}(x) \in \mathrm{leaves}(\mathcal T)$ (the leaf containing $x$) and put $\mathbf v \gets {\mathbf v}\_{\mathcal T}(x)$
Put $\widehat f(x) \gets {\widehat y}\_{\mathbf v}$
**While** ${\mathbf v} \neq {\mathtt{root}}$ :
$\quad$ $\mathbf v \gets \mathrm{parent}(\mathbf v)$
$\quad$ Put $\alpha \gets \frac 12 \exp(-\eta L\_{\mathbf v} - \log({w\_{\mathbf v}^{\mathrm{den}}}))$
$\quad$ Put $\widehat f(x) \gets \alpha {\widehat y}\_{\mathbf v}+ (1 - \alpha) \widehat f(x)$
**Return**:
$\quad$ The prediction $\widehat f(x)$ of the tree for $x \in C$
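To make Algorithms 1 and 2 concrete, here is a small self-contained sketch of both on a toy tree (an illustration only, not `WildWood`'s `numba` code): the log-weights $\log({w\_{\mathbf v}^{\mathrm{den}}})$ are computed children-first with a log-sum-exp, then the prediction is aggregated along the path from the leaf back to ${\mathtt{root}}$.

```python
import numpy as np

# Toy tree with 5 nodes stored in parenthood order (parents before children):
#         0
#       /   \
#      1     2
#     / \
#    3   4
parent = np.array([-1, 0, 0, 1, 1])          # parent index of each node (-1 for root)
left = np.array([1, 3, -1, -1, -1])          # left child v0 (-1 means "leaf")
right = np.array([2, 4, -1, -1, -1])         # right child v1
is_leaf = left == -1
L = np.array([5.0, 2.0, 1.5, 0.5, 1.0])      # otb losses L_v of each node
y_hat = np.array([0.5, 0.7, 0.2, 0.9, 0.4])  # node predictions yhat_v
eta = 1.0                                    # aggregation step

def logsumexp2(a, b):
    m = max(a, b)
    return m + np.log(np.exp(a - m) + np.exp(b - m))

# Algorithm 1: compute log(w_v^den) for every node, looping children-first
log_wden = np.empty(len(L))
for v in reversed(range(len(L))):
    if is_leaf[v]:
        log_wden[v] = -eta * L[v]
    else:
        log_wden[v] = logsumexp2(
            np.log(0.5) - eta * L[v],
            np.log(0.5) + log_wden[left[v]] + log_wden[right[v]],
        )

# Algorithm 2: prediction for an x whose leaf v_T(x) is known (say node 4)
def predict_from_leaf(leaf: int) -> float:
    v, f = leaf, y_hat[leaf]
    while parent[v] != -1:                   # walk back from the leaf to the root
        v = parent[v]
        a = 0.5 * np.exp(-eta * L[v] - log_wden[v])
        f = a * y_hat[v] + (1.0 - a) * f
    return f

print(predict_from_leaf(4))
```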
--- class: middle **About Algorithm 2** Let $\mathrm{path}\_{\mathcal T}(x)$ be the path in $\mathcal T$ from $\mathtt{root}$ to ${\mathbf v}\_{\mathcal T}(x)$ - Algorithm 2 just needs to go to ${\mathbf v}\_{\mathcal T}(x)$ and back to ${\mathtt{root}}$ along $\mathrm{path\_{\mathcal T}}(x)$ - So, Algorithm 2 only **increases by a factor 2** the prediction complexity of `WildWood` compared to a *standard* Random Forest - **Improved predictions** at a prediction cost *comparable* to that of a standard Random Forest (and training is faster) --- class: end-slide, center # 5. Theoretical guarantees for `WildWood` --- # Theoretical guarantees We give theoretical guarantees only for the **subtrees aggregation** used in `WildWood` **No theoretical guarantees** about the *random forest* itself (an independent, open and **difficult** problem) - $\ell$ is $\eta$-*exp-concave* for some $\eta > 0$ whenever $z \mapsto \exp(-\eta \ell(z, y))$ is concave for any $y \in \mathcal Y$. - Recall that $\mathcal T$ is grown using $\mathtt{itb}$ samples and is independent of the $\mathtt{otb}$ samples used to compute $L\_T$. **Theorem 2** (Oracle inequality). Assume that $\ell$ is $\eta$-exp-concave. Then, $\widehat f$ satisfies the oracle inequality (the inf is over all subtrees $T \subset \mathcal T$) $$ \frac{1}{n\_{\mathtt{otb}}} \sum\_{i \in I\_{\mathtt{otb}}} \ell(\widehat f(x\_i), y\_i) \leq \inf\_{T \subset \mathcal T} \bigg\\{ \frac{1}{n\_{\mathtt{otb}}} \sum\_{i \in I\_{\mathtt{otb}}} \ell (\widehat y\_{T} (x\_i), y\_i) + \frac{\log 2}{\eta} \frac{\|\|T\|\|}{n\_{\mathtt{otb}} + 1} \bigg\\} $$ where $\|\|T\|\| = | \mathrm{nodes}(T)| - | (\mathrm{leaves}(T) \cap \mathrm{leaves}(\mathcal T)) |$ --- class: middle ## About Theorem 2 - `WildWood`'s prediction function performs *nearly as well* as the best *oracle* subtree $\mathrm{argmin}\_{T \subset \mathcal T} \sum\_{i \in I\_{\mathtt{otb}}} \ell (\widehat y\_{T} (x\_i), y\_i)$ on $\mathtt{otb}$ samples - Finding such an oracle is *computationally infeasible*: it requires trying out *all possible subtrees*, while `WildWood`'s prediction function comes at a cost comparable to that of a standard Random Forest - $\|\|T\|\| = O(\log N\_{\mathcal T})$ with a number of "experts" $N\_{\mathcal T} = |\\{ T : T \subset \mathcal T \\}|$ for a well-balanced $\mathcal T$: the rate $O(\|\| T \|\| / n\_{\mathtt{otb}})$ is *optimal* for model-selection oracle inequalities [Tsybakov (2003)] - Theorem 2 relies on PAC-Bayesian theory [McAllester (1998, 1999), Catoni (2007)] --- class: middle **Consequences** of Theorem 2 are Corollary 1 for the *log-loss* (classification) and Corollary 2 for the *least-squares* loss (regression). **Corollary 1** (Classification). Consider classification with $\mathcal Y = \\{ 0, \ldots, K-1 \\}$ and - the prediction function $\widehat f$ with step $\eta = 1$ and log-loss (defined [here](#prediction-function)) - node predictions given by the Krichevsky-Trofimov forecaster (see [here](#node-predictors)) Then, we have $$ \frac{1}{n\_{\mathtt{otb}}} \sum\_{i \in I\_{\mathtt{otb}}} \ell(\widehat f (x\_i), y\_i) \leq \inf\_{T \subset \mathcal T} \Big\\{ \frac{1}{n\_{\mathtt{otb}}} \sum\_{i \in I\_{\mathtt{otb}}} \ell(g\_T(x\_i), y\_i) + c\_K \frac{\|\| T \|\| + 1}{n\_{\mathtt{otb}}} \Big\\} $$ with $c\_K = (K + 4 \log 2 - 1) / 4$ and where $g\_T$ is *any constant function on the leaves* of $T$. --- class: middle **Corollary 2** (Regression). Consider regression with $\mathcal Y = [-B, B]$ for some $B > 0$ and - the prediction function $\widehat f$ with step $\eta = 1/(8B^2)$ and square loss (defined [here](#prediction-function)) - node predictions given by label averages (see [here](#node-predictors)) Then, we have $$ \frac{1}{n\_{\mathtt{otb}}} \sum\_{i \in I\_{\mathtt{otb}}} \ell(\widehat f (x\_i), y\_i) \leq \inf\_{T \subset \mathcal T} \Big\\{ \frac{1}{n\_{\mathtt{otb}}} \sum\_{i \in I\_{\mathtt{otb}}} \ell(g\_T(x\_i), y\_i) + c\_B \frac{\|\| T \|\|}{n\_{\mathtt{otb}}} \Big\\} $$ with $c\_B = 8 (\log 2) B^2 $ and where $g\_T$ is *any constant function on the leaves* of $T$. --- class: middle, center, end-slide # 6. Implementation and experiments --- class: middle # `WildWood`'s implementation - Available at https://github.com/pyensemble/wildwood.git under the `BSD-3-Clause` licence - You can install it using **`pip install wildwood`** - Documentation is here: https://wildwood.readthedocs.io/ - Implemented in **`Python`** but strongly accelerated with **`numba`**. This means JIT compilation (the first `fit` and `predict_proba` calls compile the code, which takes $\approx$ 10 seconds). Trees grow in parallel using `joblib`. - Heavily inspired by `scikit-learn`'s implementation of Random Forest [Louppe (2014)] and follows `scikit-learn`'s API
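A minimal usage sketch following this `scikit-learn`-style API (the estimator name `ForestClassifier` and its keyword arguments are assumptions to be checked against the documentation above; only `pip install wildwood`, `fit` and `predict_proba` come from the slides):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from wildwood import ForestClassifier  # assumed import path / class name

X, y = make_classification(n_samples=10_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# n_estimators=10 mirrors WildWood's default of M = 10 trees (argument name assumed)
clf = ForestClassifier(n_estimators=10)
clf.fit(X_train, y_train)               # the first call also triggers numba JIT compilation
y_score = clf.predict_proba(X_test)[:, 1]
print("test AUC:", roc_auc_score(y_test, y_score))
```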
--- class: middle # Experiments We compare `WildWood` ($\mathtt{WW}n$ for $n$ trees) with several **strong baselines**: - $\mathtt{RF}n$: `scikit-learn`'s Random Forest using $n$ trees - $\mathtt{HGB}$: a histogram-based implementation of extreme gradient boosting (inspired by LightGBM) from `scikit-learn` - $\mathtt{XGB}$: `XGBoost` - $\mathtt{LGBM}$: `LightGBM` - $\mathtt{CB}$: `CatBoost` The experiments can be reproduced using the `Python` scripts in the repository --- class: middle ## Datasets used - Publicly available datasets from the UCI repository - Including small and large datasets - Multi-class classification problems .width-100[![](figures/table5.jpg)] --- class: middle ## Experimental protocol - Each dataset is randomly split into training (70%) and testing (30%) sets - Categorical-encode and specify categorical features to the algorithms that support them ($\mathtt{HGB}$, $\mathtt{LGBM}$, $\mathtt{CB}$ and $\mathtt{WW}n$) - One-hot encode categorical features for the other algorithms ($\mathtt{RF}n$, $\mathtt{XGB}$) - Hyperoptimization: from the training set, use $4/5$ for training and $1/5$ for validation and perform 50 steps of optimization using the Tree Parzen estimator from `hyperopt` [Bergstra et al. (2015)] - Refit on the whole training set with the best hyperparameters and report scores on the test set - This is performed 5 times to report standard deviations (see `WildWood`'s paper) - Use the area under the ROC curve (AUC); with more than 2 classes, compute the average one-versus-rest AUC over the classes
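A sketch of this protocol with `hyperopt` (using `scikit-learn`'s `RandomForestClassifier` as a stand-in estimator and a made-up two-parameter search space; the actual search spaces and scripts live in `WildWood`'s repository):

```python
from hyperopt import fmin, hp, tpe
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# 4/5 of the training set for fitting, 1/5 for validation during hyperoptimization
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

def make_clf(params):
    return RandomForestClassifier(
        n_estimators=100,
        max_features=params["max_features"],
        min_samples_leaf=int(params["min_samples_leaf"]),
        n_jobs=-1,
        random_state=0,
    )

def objective(params):
    clf = make_clf(params).fit(X_fit, y_fit)
    # hyperopt minimizes, so return the negative validation AUC
    return -roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])

space = {
    "max_features": hp.uniform("max_features", 0.1, 1.0),
    "min_samples_leaf": hp.quniform("min_samples_leaf", 1, 20, 1),
}
# 50 steps of the Tree Parzen Estimator, as in the protocol above
best = fmin(objective, space, algo=tpe.suggest, max_evals=50)

# Refit on the whole training set with the best hyperparameters, report the test AUC
clf = make_clf(best).fit(X_train, y_train)
print("test AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```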
--- class: middle .width-95[![](figures/table1.jpg)] - Test AUCs **after hyperoptimization** - $\mathtt{WW}$ is better than (or identical to) $\mathtt{RF}$ and is close to the EGB libraries - $\mathbf{bold}=$ best extreme gradient boosting - $\underline{\mathrm{underline}}=$ best random forest --- class: middle .width-100[![](figures/table2.jpg)] - Test AUCs and training times with **default hyperparameters** on the 5 largest datasets - `WW` is almost always the *fastest algorithm*, with performance comparable to all baselines - $\mathbf{bold}=$ best extreme gradient boosting - $\underline{\mathrm{underline}}=$ best random forest --- class: middle, center, end-slide # Conclusion --- class: middle ## Take-home messages - `WildWood` is a **new Random Forest algorithm** for batch supervised learning - `WildWood` uses out-of-bag samples differently: it performs **aggregation with exponential weights** of the predictions of **all subtrees** - **Improved predictions** in each individual tree, at a **small computational cost** - The histogram strategy makes `WildWood` competitive with strong baselines, both in terms of performance and training times. ## Future work - `WildWood`'s implementation: room for many improvements - The forest prediction is a simple arithmetic mean of the trees' predictions; we could also perform exponentially weighted aggregation between trees - A `WildWood`-based implementation of isolation forests [Liu et al. (2008)], using the same aggregation mechanism with the log-loss for density estimation --- class: end-slide, center count: false # Thank you!