--- class: middle, center, title-slide # `WildWood`: a new Random Forest algorithm
**Stéphane Gaïffas**
[https://stephanegaiffas.github.io](https://stephanegaiffas.github.io)
[Joint work with **Ibrahim Merad** and **Yiyang Yu** (LPSM, Univ. de Paris)]
.grid[ .kol-1-4[.width-100[![](figures/lpsm.svg)]] .kol-1-4[.width-55[![](figures/up.svg)]] .kol-1-4[.width-45[![](figures/ens.svg)]] .kol-1-4[.width-100[![](figures/prairie.svg)]] ] --- class: middle, center, end-slide # 1. Introduction --- class: middle # Why Random Forest ? - Still widely used for **tabular data** - Ease of use, robustness to hyperparameters, embarrassingly parallel - Very popular in some domains (finance, bioinformatics, predictive medicine, astronomy, among many others...) Some **`stackoverflow`** search results for *"deep learning"*: .center.width-70[![](figures/stack-deep.jpg)] and for *"random forest"*: .center.width-70[![](figures/stack-rf.jpg)] --- class: middle Let's have a look at **Google Trends**: .center.width-100[![](figures/google-trends1.jpg)] .center.width-100[![](figures/google-trends2.jpg)] ## In this talk 1. We'll describe the *standard Random Forest* algorithm 1. And describe the construction of a *new version* of it called **`WildWood`** --- class: middle, center, end-slide # 2. Motivating examples --- # Toy example .center.width-90[![](figures/fig_aggregation_samples.svg)] .center[Binary classification dataset with 2 continuous features] --- # Standard Random Forest .center.width-90[![](figures/fig_aggregation_1.svg)] .center[Decision function of a **standard Random Forest** with 10 trees] --- # Standard Random Forest .center.width-90[![](figures/fig_aggregation_1_tree_0.svg)] .center[Decision function of tree #1 in the **standard Random Forest**] --- # Standard Random Forest .center.width-90[![](figures/fig_aggregation_1_tree_1.svg)] .center[Decision function of tree #2 in the **standard Random Forest**] --- # Standard Random Forest .center.width-90[![](figures/fig_aggregation_1_tree_2.svg)] .center[Decision function of tree #3 in the **standard Random Forest**] --- # Standard Random Forest .center.width-90[![](figures/fig_aggregation_1_tree_3.svg)] .center[Decision function of tree #4 in the **standard Random Forest**] --- # Standard Random Forest .center.width-90[![](figures/fig_aggregation_1_tree_4.svg)] .center[Decision function of tree #5 in the **standard Random Forest**] --- # `WildWood` .center.width-90[![](figures/fig_aggregation_0.svg)] .center[Decision function of **`WildWood`** with 10 trees] --- # `WildWood` .center.width-90[![](figures/fig_aggregation_0_tree_0.svg)] .center[Decision function of tree #1 in **`WildWood`**] --- # `WildWood` .center.width-90[![](figures/fig_aggregation_0_tree_1.svg)] .center[Decision function of tree #2 in **`WildWood`**] --- # `WildWood` .center.width-90[![](figures/fig_aggregation_0_tree_2.svg)] .center[Decision function of tree #3 in **`WildWood`**] --- # `WildWood` .center.width-90[![](figures/fig_aggregation_0_tree_3.svg)] .center[Decision function of tree #4 in **`WildWood`**] --- # `WildWood` .center.width-90[![](figures/fig_aggregation_0_tree_4.svg)] .center[Decision function of tree #5 in **`WildWood`**] --- # Standard RF vs `WildWood` On the previous toy example: .grid[ .kol-1-2[ .width-105[![](figures/fig_aggregation_1.svg)]
.center[Standard Random Forest] ] .kol-1-2[ .width-100[![](figures/fig_aggregation_0.svg)]
.center[`WildWood`] ] ] --- # Standard RF vs `WildWood` .center.width-100[![](figures/fig_n_trees.png)] - Mean test AUC (y-axis) on 4 datasets against the number of trees used (x-axis) - Scores averaged over 10 train/test splits for **`WildWood`** and **`scikit-learn`**’s `RandomForestClassifier` and `ExtraTreesClassifier` classes [1, 2] Thanks to **subtrees aggregation** (and other tricks), `WildWood` improves on these baselines, even with *few trees* - Even a *single tree* performs well! .footnote[ [1]: F. Pedregosa et al. *Scikit-learn: Machine learning in Python*. Journal of Machine Learning Research, 2011.
[2]: G. Louppe, *Understanding random forests: From theory to practice.* PhD thesis, University of Liege, 2014.
] --- # `WildWood` vs EGB libraries EGB = Extreme Gradient Boosting libraries .center.width-100[![](figures/fig_auc_timings_new.jpg)] - Test AUC (top) and training time (bottom) of `WildWood` compared with very popular EGB libraries (**after hyperoptimization** of all algorithms) - `WildWood` is *slightly below such very strong baselines*, but is **faster** (training times are on a *logarithmic scale*) on the considered datasets .footnote[ **XGBoost**: T. Chen and C. Guestrin, *Xgboost: A scalable tree boosting system*, SIGKDD 2016
**LightGBM**: G. Ke et al., *Lightgbm: A highly efficient gradient boosting decision tree*, NeurIPS 2017
**CatBoost**: L. Prokhorenkova et al., *Catboost: unbiased boosting with categorical features*, arXiv preprint arXiv:1706.09516, 2017. ] --- class: middle, center, end-slide # 3. Standard decision trees and Random Forest --- class: middle # Training data Supervised training data $(x\_i, y\_i)$ for $i \in I = \\{ 1, \ldots, n \\}$ where - $x_i$ is the vector of *features* of sample $i$ - $y_i$ is the *label* of sample $i$ About **features**: we have $x_i \in \mathcal X \subset \mathbb R^d$ where $\mathcal X$ is the *feature* space - $x_i$ can contain *numerical* or *categorical* features - Random Forest with *categorical* features: the standard encoding approach is *one-hot encoding* About **labels**: we have $y_i \in \mathcal Y$ where $\mathcal Y$ is the label set - For regression we have $\mathcal Y = \mathbb R$ - For $K$-class classification we have $\mathcal Y = \\{ 0, \ldots, K - 1\\}$, with predictions in $\widehat{\mathcal Y} = \mathcal P(\mathcal Y) =$ set of probabilities on $\mathcal Y$ --- class: middle # Decision trees Standard example of a classification tree on `iris` .grid[ .kol-1-2.width-100[![](figures/iris_plot.svg)] .kol-1-2.width-100[![](figures/cart_iris.svg)] ] --- ## Decision trees: main principles - Construction of a *partition* of the feature space: a *guillotine partition*, represented equivalently by a *binary tree* - Each *branch* of the tree corresponds to a pair (feature, threshold) called a *split* - Each *node* of the tree is a *rectangular region* of the feature space. Final nodes are called *leaves* and are the elements of the considered partition - Predictions are *constant within leaves*, and use the *training samples they contain* (for regression: the average of the labels in the leaf) .grid[ .kol-1-2.width-70[![](figures/tree1.png)] .kol-1-2.width-60[![](figures/tree2.png)] ] --- class: middle ## Binary trees A *finite ordered binary tree* $\mathcal T$ is a finite subset of the set $\\{ 0, 1 \\}^{\*} = \bigcup\_{n \geq 0} \\{ 0, 1 \\}^n$ of all finite words on $\\{ 0, 1\\}$. .grid[ .kol-2-3[ - $\\{ 0, 1 \\}^{\*}$ is endowed with a *tree structure* (complete binary tree), the empty word being $\mathtt{root}$ - For any $\mathbf v \in \\{ 0, 1\\}^{\*}$, the *left child* is ${\mathbf v}0$ and the *right child* is ${\mathbf v}1$ - $\text{inodes} (\mathcal T) := \\{ {\mathbf v}\in \mathcal T : {\mathbf v}0, {\mathbf v}1 \in \mathcal T \\}$ - $\text{leaves} (\mathcal T) := \\{ {\mathbf v}\in \mathcal T : {\mathbf v}0, {\mathbf v}1 \not\in \mathcal T \\}$ ] .kol-1-3[ .width-100[![](figures/binary-tree.svg)] ] ] Note that $\text{inodes} (\mathcal T)$ and $\text{leaves} (\mathcal T)$ are *disjoint* and the set of all nodes is $$ \text{nodes}(\mathcal T) := \text{inodes} (\mathcal T) \cup \text{leaves} (\mathcal T) $$ --- class: middle ## Splits and cells The *split* $\sigma\_{\mathbf v} = (j\_{\mathbf v}, t\_{\mathbf v})$ of each ${\mathbf v}\in \text{inodes} (\mathcal T)$ is characterized by: - its *dimension* $j\_{\mathbf v} \in \\{ 1, \ldots, d \\}$ - a *threshold* $t\_{\mathbf v} \in \mathbb R$ To each ${\mathbf v} \in \text{nodes}(\mathcal T)$ corresponds a *cell* $C\_{\mathbf v} \subseteq \mathcal X$ defined recursively: $C\_{\mathtt{root}} = \mathcal X$ and, for each ${\mathbf v} \in \text{inodes}(\mathcal T)$, we put $$ C\_{{\mathbf v}0} := \\{ x \in C\_{\mathbf v} : x\_{j\_{\mathbf v}} \leq t\_{{\mathbf v}} \\} \quad \text{and} \quad C\_{{\mathbf v}1} := C\_{\mathbf v} \setminus C\_{{\mathbf v}0}. $$ By construction, $(C\_{\mathbf v})\_{\mathbf v \in \text{leaves} (\mathcal T)}$ is a *partition* of $\mathcal X$.
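To make the cells concrete, here is a minimal sketch (plain `Python`, not `WildWood`'s code) of a binary tree stored as split nodes $(j\_{\mathbf v}, t\_{\mathbf v})$, with the routine that finds the leaf whose cell contains a given $x$; the class and field names are purely illustrative.

```python
import numpy as np
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # Split (j_v, t_v) for an internal node; `feature` is None for a leaf
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None        # child v0, cell {x in C_v : x_{j_v} <= t_v}
    right: Optional["Node"] = None       # child v1, the rest of C_v
    prediction: Optional[float] = None   # constant prediction attached to a leaf

def find_leaf(root: Node, x: np.ndarray) -> Node:
    """Return the leaf whose cell contains x, following the recursive split rule."""
    node = root
    while node.feature is not None:      # internal node: go left or right
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node

# Toy tree on d=2 features: split on feature 1 at 0.5, then on feature 2 at 1.0
tree = Node(
    feature=0, threshold=0.5,
    left=Node(feature=1, threshold=1.0,
              left=Node(prediction=0.2), right=Node(prediction=0.7)),
    right=Node(prediction=0.9),
)
print(find_leaf(tree, np.array([0.3, 2.0])).prediction)  # prints 0.7
```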
--- class: middle ## Nodes predictions Let $\mathbf v \in \text{leaves} (\mathcal T)$. For *regression* $(\mathcal Y = \mathbb R)$, decision trees typically use $$ {\widehat y}\_{\mathbf v} := \frac{1}{n\_{\mathbf v}} \sum\_{i \in I : x\_i \in C\_{\mathbf v}} y\_i \quad \text{where} \quad n\_{\mathbf v} := | \\{ i \in I : x\_i \in C\_{\mathbf v} \\} | $$ while for $K$-class *classification* $(\mathcal Y = \\{ 0, \ldots, K-1 \\} )$, they use $$ {\widehat y}\_{\mathbf v}(k) := \frac{n\_{\mathbf v}(k)}{n\_{\mathbf v}} \quad \text{where} \quad n\_{\mathbf v}(k) := | \\{ i \in I : x\_i \in C\_{\mathbf v}, y_i = k \\} | $$ for $k \in \\{ 0, \ldots, K-1 \\}$. ## Tree prediction The prediction of a decision tree $\mathcal T$ for $x \in \mathcal X$ is defined as the *prediction of the leaf* ${\mathbf v}\_{\mathcal T} (x)$ that contains $x$: $$ {\widehat y}\_{\mathcal T}(x) = {\widehat y}\_{{\mathbf v}\_{\mathcal T} (x)} $$ --- ## Tree growing The quality of the prediction depends on the tree (hence on the partition). Issue: finding the *optimal* tree is *very hard*! Decision trees use a recursive **greedy** approach: - Start from the tree with a single node `root` corresponding to $\mathcal X$ - Recursively split the *current* leaves $\mathbf v \in \text{leaves} (\mathcal T)$: look for the *best* $\sigma\_{\mathbf v} = (j\_{\mathbf v}, t\_{\mathbf v})$ for each $\mathbf v$ - The *best* $\sigma\_{\mathbf v}$ is the one that leads to children $\mathbf v0$ and $\mathbf v1$ with improved *homogeneity* of the labels they contain - *Homogeneity* is quantified by *impurity* measures such as the *variance* for regression, and *Gini* or *entropy* for classification ## Remarks - Here *greedy* means that splits are chosen one at a time and never reconsidered afterwards - This procedure is called *CART* (Classification and Regression Trees). A single tree usually leads to *poor predictions* and *overfitting*. --- class: middle ## Split finding Look for the split $\sigma\_{\mathbf v} = (j\_{\mathbf v}, t\_{\mathbf v})$, leading to children $\mathbf v0$ and $\mathbf v1$, that maximizes $$ \mathrm{IG}(j\_{\mathbf v}, t\_{\mathbf v}) = H(\mathbf v) - \frac{n\_{\mathbf v0}}{n\_{\mathbf v}} H(\mathbf v0) - \frac{n\_{\mathbf v1}}{n\_{\mathbf v}} H(\mathbf v1) $$ with $$ H(\mathbf v) = \frac{1}{n\_{\mathbf v}} \sum\_{i \in I : x\_i \in C\_{\mathbf v}} (y\_i - \widehat y\_{\mathbf v})^2 $$ for the *variance* criterion and $$ H(\mathbf v) = \sum\_{k=0}^{K-1} \frac{n\_{\mathbf v}(k)}{n\_{\mathbf v}} \Big(1 - \frac{n\_{\mathbf v}(k)}{n\_{\mathbf v}} \Big) = 1 - \sum\_{k=0}^{K-1} \Big(\frac{n\_{\mathbf v}(k)}{n\_{\mathbf v}} \Big)^2 $$ for the *Gini* criterion ($K$-class classification). **Remarks.** Split finding requires trying out all $j=1, \ldots, d$ and all possible thresholds $t$ among the *sorted feature values* $(x\_{i,j})\_i$ for $\\{i \in I : x\_i \in C\_{\mathbf v}\\}$
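For concreteness, here is a deliberately naive sketch of exhaustive split finding with the Gini criterion on raw (unbinned) features; it only illustrates the formulas above and is not `WildWood`'s implementation, which works on binned features as described later.

```python
import numpy as np

def gini(y: np.ndarray, n_classes: int) -> float:
    """Gini impurity H(v) = 1 - sum_k p_k^2 of the labels y falling in a node."""
    if y.size == 0:
        return 0.0
    p = np.bincount(y, minlength=n_classes) / y.size
    return 1.0 - np.sum(p ** 2)

def best_split(X: np.ndarray, y: np.ndarray, n_classes: int):
    """Exhaustive search of (feature j, threshold t) maximizing the information
    gain H(v) - (n_v0 / n_v) H(v0) - (n_v1 / n_v) H(v1)."""
    n, d = X.shape
    parent = gini(y, n_classes)
    best = (None, None, -np.inf)  # (j, t, gain)
    for j in range(d):
        # candidate thresholds: midpoints between consecutive sorted unique values
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2:
            left = X[:, j] <= t
            gain = (parent
                    - left.mean() * gini(y[left], n_classes)
                    - (1 - left.mean()) * gini(y[~left], n_classes))
            if gain > best[2]:
                best = (j, t, gain)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(np.int64)
print(best_split(X, y, n_classes=2))  # picks feature 0 with a threshold close to 0
```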
--- # Standard Random Forest Combine the following ingredients [Breiman (2001), Biau et al. (2016), Louppe (2014)]: **1. Ensemble of regression or classification trees.** Train several decision trees in parallel. They all work *independently of each other* (embarrassingly parallel) **2. Bagging.** These trees are *randomized* by training them on *bootstrap samples* (sampling with replacement) of the dataset - Use an *average* of the predictions given by the trees: this reduces the variance of a single tree - Even better if the trees are "not correlated" **3. Column subsampling.** *Further randomize* the trees by sampling a *random subset of features* each time we look for a split - This further reduces the correlation between trees --- class: middle ## Bootstrap - $I = \\{1, \ldots, n\\} = $ training sample indices - From $I$, sample $n$ elements *uniformly* at random, *with replacement* - Define $I_{\mathtt{itb}} \subset I$ as the set of *"in-the-bag"* indices containing the unique values sampled - Define the *"out-of-bag"* indices as $I\_{\mathtt{otb}} := I \setminus I\_{\mathtt{itb}}$. They are *only used* to compute **out-of-bag scores** in a standard Random Forest. Standard 0.632 rule (Efron 1997): we have $$ \mathbb P[i \in I\_{\mathtt{itb}}] = 1 - (1 - 1/n)^n \rightarrow 1 - e^{-1} \approx 0.632 \quad \text{as} \quad n \rightarrow +\infty $$ ## Feature subsampling - For *each* split search, *sample uniformly at random* a subset $J \subset \\{ 1, \ldots, d\\}$ of size $d\_{\max} \leq d$, and try out only splits $(j, t)$ with $j \in J$ - The default choice is often $d\_{\max} = \sqrt{d}$ --- class: middle ## Bagging Grow (in parallel) $M$ *randomized* decision trees $$ {\widehat f} (\bullet \; ; \; \Pi\_1), \ldots, {\widehat f} (\bullet \; ; \; \Pi\_M) $$ where $\Pi_1, \ldots, \Pi\_M$ are i.i.d. realizations of the *bootstrap* and *feature subsampling* random process. Given $x \in \mathcal X$, the Random Forest prediction ${\widehat g}(x)$ uses the *average* of the trees' predictions $$ {\widehat g}(x \; ; \; \mathbf \Pi) = \frac 1M \sum\_{m=1}^M {\widehat f} (x \; ; \; \Pi\_m) $$ - The default choice in `scikit-learn` is $M=100$ - We use $M=10$ by default in `WildWood` --- class: center, middle, end-slide # 4. A new Random Forest algorithm: `WildWood` --- class: middle # `WildWood` Main ingredients in `WildWood` are: 1. *Features binning* and a *histogram strategy* for **faster split finding** 1. Careful handling of *categorical* features: **no one-hot encoding** required 1. A new prediction function based on *subtrees aggregation* - **Improved** predictions - **Fewer trees** required - Involves a very *different use of out-of-bag samples* --- # Features binning `WildWood` performs split finding on **binned features**. This is common practice in *extreme gradient boosting*, introduced in `LightGBM`. - The features matrix $\mathbf X$ is transformed into a *same-size* matrix of *binned features* denoted $\mathbf X^{\mathrm{bin}}$. - To each $j=1, \ldots, d$ (columns of $\mathbf X$) is associated a set $B\_j = \\{ 1, \ldots, b\_j \\}$ of bins with $b\_j \leq b\_{\max}$ (default is $b\_{\max} = 255$: a single byte per value). - When a feature is *continuous*, it is binned into $b\_{\max}$ bins using **inter-quantile intervals**. - If it is *categorical*, each category is **mapped to a bin** (and the rarest categories end up binned together) - If a feature contains *missing values*, we just use a **specific bin** to encode them - After binning, each column satisfies $\mathbf X\_{\bullet, j}^\mathrm{bin} \in B\_j^n$, and we denote the *binned feature space* by $C = \prod_{j=1}^d B_j$. --- # Split finding on histograms Let us consider $K$-class classification. ## Histograms When looking for the best split of a node ${\mathbf v}$, we compute the *"histogram"* of the in-the-bag samples it contains: $$ \mathrm{hist}\_{\mathbf v}[j, b, k] = \sum\_{i \in I\_{\mathtt{itb}} : x\_i \in C\_{\mathbf v}} \mathbf 1\_{x\_{i, j} = b, y\_i = k} $$ for each feature $j$, bin $b$ and label class $k$.
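A minimal `numpy` sketch of this histogram computation (purely illustrative: `WildWood`'s actual code is `numba`-compiled and organized differently, and the function and variable names below are made up; `X_bin` is assumed to hold integer bin indices):

```python
import numpy as np

def node_histogram(X_bin: np.ndarray, y: np.ndarray, samples: np.ndarray,
                   n_bins: int, n_classes: int) -> np.ndarray:
    """hist[j, b, k] = number of samples of the node with bin b on feature j
    and label k, i.e. the statistic hist_v[j, b, k] displayed above."""
    n_features = X_bin.shape[1]
    hist = np.zeros((n_features, n_bins, n_classes), dtype=np.int64)
    for i in samples:                    # samples = {i in I_itb : x_i in C_v}
        for j in range(n_features):
            hist[j, X_bin[i, j], y[i]] += 1
    return hist

# Toy data: 8 samples, 2 binned features (4 bins each), binary labels
X_bin = np.array([[0, 1], [1, 1], [2, 3], [3, 0],
                  [0, 2], [1, 3], [2, 0], [3, 2]], dtype=np.uint8)
y = np.array([0, 0, 1, 1, 0, 1, 1, 0], dtype=np.int64)
node_samples = np.arange(8)              # pretend all samples fall in the node
hist = node_histogram(X_bin, y, node_samples, n_bins=4, n_classes=2)
print(hist[0])                           # per-bin label counts for feature 0
```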
One has $$ \mathrm{hist}\_{\mathbf v} = \mathrm{hist}\_{\mathbf v0} + \mathrm{hist}\_{\mathbf v 1} $$ $\Rightarrow$ no need to compute histograms for both ${\mathbf v}0$ and ${\mathbf v}1$, but only a single one. --- ## Split finding We **loop** over: - the set $J\_{\mathbf v}$ of *non-constant features* in the node $$ J\_{\mathbf v} := \Big\\{ j : | H\_{\mathbf v}[j] | \geq 2 \Big\\} \quad \text{ where } \quad H\_{\mathbf v}[j] := \Big\\{ b : \sum\_{k} \mathrm{hist}\_{\mathbf v}[j, b, k] \geq 1 \Big\\} $$ - and over the set of *non-empty* bins $H\_{\mathbf v}[j]$ **Along the loop** we: - compute the *Gini* or *entropy* criterion using the **histogram**'s statistics - keep the best split found **Remark** - Split finding in standard RF requires to *sort* $n\_{\mathbf v}$ numbers $d\_{\max}$ times to find the best threshold for each feature [Louppe (2014)] - Split finding on histograms requires to *compute the histogram* **once** and to *sort* $b\_j \ll n\_{\mathbf v}$ numbers $d\_{\max}$ times. **Much faster** on *large* datasets --- class: middle ## Splits and cells The split $\sigma\_{\mathbf v} = (j\_{\mathbf v}, t\_{\mathbf v})$ of each ${\mathbf v}\in \text{inodes} (\mathcal T)$ is characterized by: - its dimension $j\_{\mathbf v} \in \\{ 1, \ldots, d \\}$ - a *subset of bins* $t\_{\mathbf v} \subsetneq \\{ 1, \ldots, b\_{j\_{\mathbf v}} \\}$ and cells are defined recursively as before: $C\_{\mathtt{root}} = C$ and $$ C\_{{\mathbf v}0} := \\{ x \in C\_{\mathbf v} : x\_{j\_{\mathbf v}} \in t\_{{\mathbf v}} \\} \quad \text{and} \quad C\_{{\mathbf v}1} := C\_{\mathbf v} \setminus C\_{{\mathbf v}0}. $$ - if $j\_{\mathbf v}$ is a *continuous* feature, bins have a natural order and $t\_{\mathbf v}= \\{ 1, 2, \ldots, s\_{\mathbf v} \\}$ for some bin threshold $s\_{\mathbf v} \in B\_{j\_{\mathbf v}}$ - if $j\_{\mathbf v}$ is a *categorical* feature, the whole set $t\_{\mathbf v}$ is required --- class: middle ## Bin order for categorical features **Order of the bins** used in the loop depends on the **type** of the feature. If **continuous**, use the *natural order* of bins. If **categorical**: - For binary classification ($y\_i \in \\{0, 1\\}$), use the bin order that sorts the proportion of labels $1$ in each bin: $$ \frac{\mathrm{hist}\_{\mathbf v}[j, b, 1]}{\sum\_{k \in \\{ 0, 1\\}} \mathrm{hist}\_{\mathbf v}[j, b, k]} $$ - This finds the *optimal split* with complexity $O(b_j \log b_j)$ (Theorem 9.6 in Breiman et al. (1984)) while there are *$2^{b_j - 1} -1$ possible splits*! - Cute trick introduced in `LightGBM`, using an order of gradient / hessian statistics of the boosting loss --- class: middle For $K$-class classification with $K > 2$, we implement two *heuristic strategies*: 1. **one-versus-rest**: train $M K$ trees instead of $M$, each tree trained with a binary one-versus-rest label 1. 
**heuristic**: train $M$ trees, split finding uses $K$ loops over the bin orders that sort $$ \frac{\mathrm{hist}\_{\mathbf v}[j, b, k]}{\sum\_{k' \in \\{ 0, \ldots, K-1\\}} \mathrm{hist}\_{\mathbf v}[j, b, k']} $$ for $k=0, \ldots, K-1$ If a feature contains **missing values**: - loop *left to right* (along bin order(s)), and *right to left* as well - this compares splits that put missing values *on the left* or *on the right* --- # Splits and tree growth stopping For $\mathbf v \in \text{nodes}(\mathcal T)$, let $$ n\_{\mathtt{itb}, \mathbf v} := | \\{ i \in I\_{\mathtt{itb}} : x\_i \in C\_{\mathbf v} \\} | \quad \text{and} \quad n\_{\mathtt{otb}, \mathbf v} := | \\{ i \in I\_{\mathtt{otb}} : x\_i \in C\_{\mathbf v} \\} | $$ Given $n\_{\text{min-split}} \geq 2$ and $n\_{\text{min-leaf}} \geq 1$ we: 1. discard all potential splits of $\mathbf v$ leading to children ${\mathbf v}0, {\mathbf v}1$ if $$ \min\big( n\_{\mathtt{itb}, \mathbf v0}, \; n\_{\mathtt{otb}, \mathbf v0}, \; n\_{\mathtt{itb}, \mathbf v1}, \; n\_{\mathtt{otb}, \mathbf v1} \big) < n\_{\text{min-split}} $$ 1. do not split $\mathbf v$ and make it a leaf if $\min\big(n\_{\mathtt{itb}, \mathbf v}, \; n\_{\mathtt{otb}, \mathbf v} \big) < n\_{\text{min-leaf}}$ 1. When only *leaves* or *non-splittable nodes* remain: **stop growth** of the tree 1. Trees grow in a *depth-first fashion* (a requirement in Algorithm 1 below) Points 1-4 ensure that all nodes contain both ${\mathtt{itb}}$ and ${\mathtt{otb}}$ samples - $n\_{\text{min-split}}$ and $n\_{\text{min-leaf}}$ *weakly impact* `WildWood`'s performances --- # `WildWood`'s prediction function Given a fully-grown tree $\mathcal T$ its prediction function is an *aggregation* of the predictions given by **all subtrees** rooted at ${\mathtt{root}}$ $\\{T : T \subset \mathcal T \\}$ - $\mathcal T$ is grown using ${\mathtt{itb}}$ samples - We use ${\mathtt{otb}}$ samples to perform *aggregation with exponential weights*, where weights give more importance to subtrees with a *good predictive performance* on ${\mathtt{otb}}$ samples ## Node and subtree prediction Fix a subtree $T \subset \mathcal T$ and $x \in C$ and let ${\mathbf v}\_{T} (x)$ be the leaf of $T$ containing $x$. 
The prediction of a node ${\mathbf v} \in \mathrm{nodes}(\mathcal T)$ and of a subtree $T \subset \mathcal T$ is given by $$ {\widehat y}\_{\mathbf v} = h\big( (y\_i)\_{i \in I\_{\mathtt{itb}} : x\_i \in C\_{\mathbf v}} \big) \quad \text{and} \quad \widehat y\_{T} (x) = {\widehat y}\_{{\mathbf v}\_{T} (x)} $$ where $h : \cup\_{n \geq 0} \mathcal Y^n \to \widehat{\mathcal Y}$ is a *generic forecaster* used in each cell $C\_{\mathbf v}$ --- class: middle name: node-predictors For *regression* we use $$ {\widehat y}\_{\mathbf v} = \frac{1}{n\_{\mathtt{itb}, \mathbf v}} \sum\_{i \in I\_{\mathtt{itb}} : x\_i \in C\_{\mathbf v}} y\_i $$ For *classification* with ${\mathcal Y} = \\{ 0, \ldots, K-1 \\}$, we use $$ {\widehat y}\_{\mathbf v} (k) = \frac{n\_{\mathtt{itb}, \mathbf v} (k) + \alpha}{n\_{\mathtt{itb}, \mathbf v} + K \alpha} $$ where $n\_{\mathtt{itb}, \mathbf v}(k) := | \\{ i \in I\_{\mathtt{itb}} : x\_i \in C\_{\mathbf v}, y\_i = k \\} |$ for $k \in \mathcal Y$ - This is the Bayes predictive posterior with a $\mathrm{Dir}(\alpha, \ldots, \alpha)$ prior - `WildWood` uses $\alpha = 1/2$ by default (Krichevsky-Trofimov forecaster [Tjalkens (1993)]) - Using $\alpha > 0$ ensures that ${\widehat y}\_{\mathbf v}(k) > 0$ for all $k \in \mathcal Y$: this is useful since `WildWood` uses the log-loss by default to assess ${\mathtt{otb}}$ performance --- class: middle ## Subtrees loss We use ${\mathtt{otb}}$ samples to compute the *cumulative losses* of the predictions of all subtrees $T \subset \mathcal T$ $$ L\_T := \sum\_{i \in I\_{\mathtt{otb}}} \ell ({\widehat y}\_{T} (x\_i), y\_i) $$ where $\ell : \widehat {\mathcal Y} \times {\mathcal Y} \to \mathbb R^+$ is a loss function. - The default is $\ell ({\widehat y}, y) = ({\widehat y} - y)^2$ for regression - The default is $\ell ({\widehat y}, y) = - \log {\widehat y}(y)$ for multiclass classification, where ${\widehat y}(y) \in (0, 1]$ --- class: middle name: prediction-function ## Prediction function Let $x \in C$. The prediction function $\widehat f$ of a tree $\mathcal T$ in `WildWood` is $$ \widehat f(x) = \frac{\sum\_{T \subset \mathcal T} \pi (T) e^{-\eta L\_T} {\widehat y}\_{T} (x)}{\sum\_{T \subset \mathcal T} \pi (T) e^{-\eta L\_T}} \quad \text{with} \quad \pi(T) = 2^{- \|\| T \|\|}, $$ where: - the sum is over **all subtrees** $T$ of $\mathcal T$ rooted at ${\mathtt{root}}$ (it's a **huge** sum!) - $\eta > 0$ is called the "step"; the default is $\eta=1$ for the log-loss - $\|\|T\|\| = | \mathrm{nodes}(T)| - | (\mathrm{leaves}(T) \cap \mathrm{leaves}(\mathcal T)) |$ - $\pi =$ the distribution of the branching process with branching probability $1 / 2$ at each node of $\mathcal T$, with two children when it branches $\widehat f(x)$ *aggregates* the predictions $\widehat y\_T(x)$ of *all subtrees* $T \subset \mathcal T$, *weighted* by their *performance* on ${\mathtt{otb}}$ samples. It can be understood as a **non-greedy way to prune trees** --- class: middle ## Remarks on the prediction function - Computing $\widehat f$ directly is **computationally and memory-wise infeasible** for a large $\mathcal T$, since it involves a sum over all $T \subset \mathcal T$ and requires one weight for each $T$ - The number of subtrees of a minimal tree that separates $n$ points is *exponential* in the number of nodes, hence *exponential* in $n$. - But we can compute $\widehat f$ **exactly** and **very efficiently** thanks to the *prior choice* $\pi$, using an adaptation of **context tree weighting** (Willems et al. 1995, 1998, Helmbold et al.
1997, Catoni 2004) --- class: middle **Theorem 1.** We can write $\widehat f(x) = \widehat f\_{{\mathtt{root}}}(x)$, where $\widehat f\_{{\mathtt{root}}}(x)$ satisfies the recursion $$ \widehat f\_{\mathbf v}(x) = \frac 12 \frac{w\_{\mathbf v}}{w\_{\mathbf v}^{\mathrm{den}}} {\widehat y}\_{\mathbf v} + \Big(1 - \frac 12 \frac{w\_{\mathbf v}}{w\_{\mathbf v}^{\mathrm{den}}} \Big) \widehat f\_{{\mathbf v} a}(x) $$ for $\mathbf{v}, \mathbf{v}a \in \mathtt{path}(x)$, the path in $\mathcal T$ going from $\mathtt{root}$ to ${\mathbf v}\_{\mathcal T}(x)$, where $$ w\_{\mathbf v} := \exp(-\eta L\_{\mathbf v}) \quad \text{with} \quad L\_{\mathbf v} := \sum\_{i \in I\_{\mathtt{otb}} : x\_i \in C\_{\mathbf v}} \ell ({\widehat y}\_{\mathbf v}, y\_i) $$ and where ${w\_{\mathbf v}^{\mathrm{den}}}$ are weights given by ${w\_{\mathbf v}^{\mathrm{den}}} = w\_{\mathbf v}$ if $\mathbf v \in \mathrm{leaves}(\mathcal T)$ and $$ {w\_{\mathbf v}^{\mathrm{den}}} = \frac 12 w\_{\mathbf v} + \frac 12 {w\_{\mathbf v 0}^{\mathrm{den}}} {w\_{\mathbf v 1}^{\mathrm{den}}} \quad \text{if} \quad \mathbf v \notin \mathrm{leaves}(\mathcal T) $$ **Consequence:** *very efficient computation of $\widehat f(x)$* described in the Algorithms 1 and 2 below. --- class: middle **Algorithm 1** (Computation of $\log({w\_{\mathbf v}^{\mathrm{den}}})$ for all ${\mathbf v} \in {\mathrm{nodes}}(\mathcal T)$)
**Inputs:**
$\quad$ $\mathcal T$, $\eta > 0$ and losses $L\_{\mathbf v}$ for all ${\mathbf v} \in \mathrm{nodes}(\mathcal T)$.
$\quad$ Nodes from $\mathrm{nodes}(\mathcal T)$ are stored in a data structure $\mathtt{nodes}$ that respects
$\quad$ parenthood order: for any ${\mathbf v} = \mathtt{nodes}[i\_{\mathbf v}] \in \text{inodes}(\mathcal T)$ and children
$\quad$ ${\mathbf v}a = \mathtt{nodes}[i\_{{\mathbf v}a}]$ for $a \in \\{0, 1\\}$, we have $i\_{\mathbf v a} > i\_{\mathbf v}$
**For** ${\mathbf v}\in \mathrm{reversed}(\mathtt{nodes}):$
$\quad$ **If** $\mathbf v \in \mathrm{leaves}(\mathcal T)$ :
$\quad$ $\quad$ Put $\log({w\_{\mathbf v}^{\mathrm{den}}}) \gets - \eta L\_{\mathbf v}$
$\quad$ **Else** :
$\quad$ $\quad$ Put $ \log({w\_{\mathbf v}^{\mathrm{den}}}) \gets \log\big( \frac 12 e^{-\eta L\_{\mathbf v}} + \frac 12 e^{\log({w\_{\mathbf v0}^{\mathrm{den}}}) + \log({w\_{\mathbf v1}^{\mathrm{den}}})} \big) $ **Return**:
$\quad$ The set of log-weights $\\{ \log({w\_{\mathbf v}^{\mathrm{den}}}) : \mathbf v \in \mathrm{nodes}(\mathcal T) \\}$ Algorithm 1 is applied *once $\mathcal T$ is fully grown*, so that `WildWood` is ready to produce predictions using Algorithm 2 below --- class: middle ## About Algorithm 1 - Computes ${w\_{\mathbf v}^{\mathrm{den}}}$ using the fact that trees in `WildWood` are grown in a *depth-first fashion* - We can loop *once* over nodes, leading to a $O(|\mathrm{nodes}(\mathcal T)|)$ complexity in time and memory - Works recursively over the logarithms $\log({w\_{\mathbf v}^{\mathrm{den}}})$ to avoid over- or under-flows - If required, hyper-optimization of $\eta$ or $\alpha$ is fast: does not need to grow $\mathcal T$ again, but only to update ${w\_{\mathbf v}^{\mathrm{den}}}$ for all ${\mathbf v}\in \mathrm{nodes}(\mathcal T)$ using Algorithm 1 --- class: middle **Algorithm 2** (Computation of $\widehat f(x)$ for any $x \in C$) **Inputs:**
$\quad$ Tree $\mathcal T$, losses $L\_{\mathbf v}$ and log-weights $\log({w\_{\mathbf v}^{\mathrm{den}}})$ computed by Algorithm 1 Find ${\mathbf v}\_{\mathcal T}(x) \in \mathrm{leaves}(\mathcal T)$ (the leaf containing $x$) and put $\mathbf v \gets {\mathbf v}\_{\mathcal T}(x)$
Put $\widehat f(x) \gets {\widehat y}\_{\mathbf v}$
**While** ${\mathbf v} \neq {\mathtt{root}}$ :
$\quad$ $\mathbf v \gets \mathrm{parent}(\mathbf v)$
$\quad$ Put $\alpha \gets \frac 12 \exp(-\eta L\_{\mathbf v} - \log({w\_{\mathbf v}^{\mathrm{den}}}))$
$\quad$ Put $\widehat f(x) \gets \alpha {\widehat y}\_{\mathbf v}+ (1 - \alpha) \widehat f(x)$
**Return**:
$\quad$ The prediction $\widehat f(x)$ of the tree for $x \in C$
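To make Algorithms 1 and 2 concrete, here is a small self-contained sketch of both on a toy tree (an illustration only, not `WildWood`'s `numba` code): the log-weights $\log({w\_{\mathbf v}^{\mathrm{den}}})$ are computed children-first with a log-sum-exp, then the prediction is aggregated along the path from the leaf back to ${\mathtt{root}}$.

```python
import numpy as np

# Toy tree with 5 nodes stored in parenthood order (parents before children):
#         0
#       /   \
#      1     2
#     / \
#    3   4
parent = np.array([-1, 0, 0, 1, 1])          # parent index of each node (-1 for root)
left = np.array([1, 3, -1, -1, -1])          # left child v0 (-1 means "leaf")
right = np.array([2, 4, -1, -1, -1])         # right child v1
is_leaf = left == -1
L = np.array([5.0, 2.0, 1.5, 0.5, 1.0])      # otb losses L_v of each node
y_hat = np.array([0.5, 0.7, 0.2, 0.9, 0.4])  # node predictions yhat_v
eta = 1.0                                    # aggregation step

def logsumexp2(a, b):
    m = max(a, b)
    return m + np.log(np.exp(a - m) + np.exp(b - m))

# Algorithm 1: compute log(w_v^den) for every node, looping children-first
log_wden = np.empty(len(L))
for v in reversed(range(len(L))):
    if is_leaf[v]:
        log_wden[v] = -eta * L[v]
    else:
        log_wden[v] = logsumexp2(
            np.log(0.5) - eta * L[v],
            np.log(0.5) + log_wden[left[v]] + log_wden[right[v]],
        )

# Algorithm 2: prediction for an x whose leaf v_T(x) is known (say node 4)
def predict_from_leaf(leaf: int) -> float:
    v, f = leaf, y_hat[leaf]
    while parent[v] != -1:                   # walk back from the leaf to the root
        v = parent[v]
        a = 0.5 * np.exp(-eta * L[v] - log_wden[v])
        f = a * y_hat[v] + (1.0 - a) * f
    return f

print(predict_from_leaf(4))
```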
--- class: middle **About Algorithm 2** Let $\mathrm{path}\_{\mathcal T}(x)$ be the path in $\mathcal T$ from $\mathtt{root}$ to ${\mathbf v}\_{\mathcal T}(x)$ - Algorithm 2 just needs to go to ${\mathbf v}\_{\mathcal T}(x)$ and back to ${\mathtt{root}}$ along $\mathrm{path\_{\mathcal T}}(x)$ - So, Algorithm 2 only **increases by a factor 2** the prediction complexity of `WildWood` compared to a *standard* Random Forest - **Improved predictions** at a prediction cost *comparable* to that of a standard Random Forest (and training is faster) --- class: end-slide, center # 5. Theoretical guarantees for `WildWood` --- # Theoretical guarantees We give theoretical guarantees only for the **subtrees aggregation** used in `WildWood` **No theoretical guarantees** about the *random forest* itself (an independent, open and **difficult** problem) - $\ell$ is $\eta$-*exp-concave* for some $\eta > 0$ whenever $z \mapsto \exp(-\eta \ell(z, y))$ is concave for any $y \in \mathcal Y$. - Recall that $\mathcal T$ is grown using $\mathtt{itb}$ samples and is independent of the $\mathtt{otb}$ samples used to compute $L\_T$. **Theorem 2** (Oracle inequality). Assume that $\ell$ is $\eta$-exp-concave. Then, $\widehat f$ satisfies the oracle inequality (the inf is over all subtrees $T \subset \mathcal T$) $$ \frac{1}{n\_{\mathtt{otb}}} \sum\_{i \in I\_{\mathtt{otb}}} \ell(\widehat f(x\_i), y\_i) \leq \inf\_{T \subset \mathcal T} \bigg\\{ \frac{1}{n\_{\mathtt{otb}}} \sum\_{i \in I\_{\mathtt{otb}}} \ell (\widehat y\_{T} (x\_i), y\_i) + \frac{\log 2}{\eta} \frac{\|\|T\|\|}{n\_{\mathtt{otb}} + 1} \bigg\\} $$ where $\|\|T\|\| = | \mathrm{nodes}(T)| - | (\mathrm{leaves}(T) \cap \mathrm{leaves}(\mathcal T)) |$ --- class: middle ## About Theorem 2 - `WildWood`'s prediction function performs *nearly as well* as the best *oracle* subtree $\mathrm{argmin}\_{T \subset \mathcal T} \sum\_{i \in I\_{\mathtt{otb}}} \ell (\widehat y\_{T} (x\_i), y\_i)$ on $\mathtt{otb}$ samples - Finding such an oracle is *computationally infeasible*: it requires trying out *all possible subtrees*, while `WildWood`'s prediction function comes at a cost comparable to that of a standard Random Forest - $\|\|T\|\| = O(\log N\_{\mathcal T})$ with a number of "experts" $N\_{\mathcal T} = |\\{ T : T \subset \mathcal T \\}|$ for a well-balanced $\mathcal T$: the rate $O(\|\| T \|\| / n\_{\mathtt{otb}})$ is *optimal* for model-selection oracle inequalities [Tsybakov (2003)] - Theorem 2 relies on PAC-Bayesian theory [McAllester (1998, 1999), Catoni (2007)] --- class: middle **Consequences** of Theorem 2 are Corollary 1 for the *log-loss* (classification) and Corollary 2 for the *least-squares* loss (regression). **Corollary 1** (Classification). Consider classification with $\mathcal Y = \\{ 0, \ldots, K-1 \\}$ and - the prediction function $\widehat f$ with step $\eta = 1$ and log-loss (defined [here](#prediction-function)) - node predictions given by the Krichevsky-Trofimov forecaster (see [here](#node-predictors)) Then, we have $$ \frac{1}{n\_{\mathtt{otb}}} \sum\_{i \in I\_{\mathtt{otb}}} \ell(\widehat f (x\_i), y\_i) \leq \inf\_{T \subset \mathcal T} \Big\\{ \frac{1}{n\_{\mathtt{otb}}} \sum\_{i \in I\_{\mathtt{otb}}} \ell(g\_T(x\_i), y\_i) + c\_K \frac{\|\| T \|\| + 1}{n\_{\mathtt{otb}}} \Big\\} $$ with $c\_K = (K + 4 \log 2 - 1) / 4$ and where $g\_T$ is *any constant function on the leaves* of $T$. --- class: middle **Corollary 2** (Regression). Consider regression with $\mathcal Y = [-B, B]$ for some $B > 0$ and - the prediction function $\widehat f$ with step $\eta = 1/(8B^2)$ and square loss (defined [here](#prediction-function)) - node predictions given by label averages (see [here](#node-predictors)) Then, we have $$ \frac{1}{n\_{\mathtt{otb}}} \sum\_{i \in I\_{\mathtt{otb}}} \ell(\widehat f (x\_i), y\_i) \leq \inf\_{T \subset \mathcal T} \Big\\{ \frac{1}{n\_{\mathtt{otb}}} \sum\_{i \in I\_{\mathtt{otb}}} \ell(g\_T(x\_i), y\_i) + c\_B \frac{\|\| T \|\|}{n\_{\mathtt{otb}}} \Big\\} $$ with $c\_B = 8 (\log 2) B^2 $ and where $g\_T$ is *any constant function on the leaves* of $T$. --- class: middle, center, end-slide # 6. Implementation and experiments --- class: middle # `WildWood`'s implementation - Available at https://github.com/pyensemble/wildwood.git under the `BSD-3-Clause` licence - You can install it using **`pip install wildwood`** - Documentation is here: https://wildwood.readthedocs.io/ - Implemented in **`Python`** but strongly accelerated with **`numba`**. This means JIT compilation (the first `fit` and `predict_proba` calls compile the code, which takes $\approx$ 10 seconds). Trees grow in parallel using `joblib`. - Heavily inspired by `scikit-learn`'s implementation of Random Forest [Louppe (2014)] and follows `scikit-learn`'s API
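A minimal usage sketch following this `scikit-learn`-style API (the estimator name `ForestClassifier` and its keyword arguments are assumptions to be checked against the documentation above; only `pip install wildwood`, `fit` and `predict_proba` come from the slides):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from wildwood import ForestClassifier  # assumed import path / class name

X, y = make_classification(n_samples=10_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# n_estimators=10 mirrors WildWood's default of M = 10 trees (argument name assumed)
clf = ForestClassifier(n_estimators=10)
clf.fit(X_train, y_train)               # the first call also triggers numba JIT compilation
y_score = clf.predict_proba(X_test)[:, 1]
print("test AUC:", roc_auc_score(y_test, y_score))
```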
--- class: middle # Experiments We compare `WildWood` ($\mathtt{WW}n$ for $n$ trees) with several **strong baselines**: - $\mathtt{RF}n$: `scikit-learn`'s Random Forest using $n$ trees - $\mathtt{HGB}$: a histogram-based implementation of extreme gradient boosting (inspired by LightGBM) from `scikit-learn` - $\mathtt{XGB}$: `XGBoost` - $\mathtt{LGBM}$: `LightGBM` - $\mathtt{CB}$: `CatBoost` The experiments can be reproduced using the `Python` scripts in the repository --- class: middle ## Datasets used - Publicly available datasets from the UCI repository - Including small and large datasets - Multi-class classification problems .width-100[![](figures/table5.jpg)] --- class: middle ## Experimental protocol - Each dataset is randomly split into training (70%) and testing (30%) sets - Categorical-encode and specify categorical features to the algorithms that support them ($\mathtt{HGB}$, $\mathtt{LGBM}$, $\mathtt{CB}$ and $\mathtt{WW}n$) - One-hot encode categorical features for the other algorithms ($\mathtt{RF}n$, $\mathtt{XGB}$) - Hyperoptimization: from the training set, use $4/5$ for training and $1/5$ for validation and perform 50 steps of optimization using the Tree Parzen estimator from `hyperopt` [Bergstra et al. (2015)] - Refit on the whole training set with the best hyperparameters and report scores on the test set - This is performed 5 times to report standard deviations (see `WildWood`'s paper) - Use the area under the ROC curve (AUC); with more than 2 classes, compute the average one-versus-rest AUC over the classes
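A sketch of this protocol with `hyperopt` (using `scikit-learn`'s `RandomForestClassifier` as a stand-in estimator and a made-up two-parameter search space; the actual search spaces and scripts live in `WildWood`'s repository):

```python
from hyperopt import fmin, hp, tpe
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# 4/5 of the training set for fitting, 1/5 for validation during hyperoptimization
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

def make_clf(params):
    return RandomForestClassifier(
        n_estimators=100,
        max_features=params["max_features"],
        min_samples_leaf=int(params["min_samples_leaf"]),
        n_jobs=-1,
        random_state=0,
    )

def objective(params):
    clf = make_clf(params).fit(X_fit, y_fit)
    # hyperopt minimizes, so return the negative validation AUC
    return -roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])

space = {
    "max_features": hp.uniform("max_features", 0.1, 1.0),
    "min_samples_leaf": hp.quniform("min_samples_leaf", 1, 20, 1),
}
# 50 steps of the Tree Parzen Estimator, as in the protocol above
best = fmin(objective, space, algo=tpe.suggest, max_evals=50)

# Refit on the whole training set with the best hyperparameters, report the test AUC
clf = make_clf(best).fit(X_train, y_train)
print("test AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```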
--- class: middle .width-95[![](figures/table1.jpg)] - Test AUCs **after hyperoptimization** - $\mathtt{WW}$ is better than (or identical to) $\mathtt{RF}$ and is close to the EGB libraries - $\mathbf{bold}=$ best extreme gradient boosting - $\underline{\mathrm{underline}}=$ best random forest --- class: middle .width-100[![](figures/table2.jpg)] - Test AUCs and training times with **default hyperparameters** on the 5 largest datasets - `WW` is almost always the *fastest algorithm*, with performance comparable to all baselines - $\mathbf{bold}=$ best extreme gradient boosting - $\underline{\mathrm{underline}}=$ best random forest --- class: middle, center, end-slide # Conclusion --- class: middle ## Take-home messages - `WildWood` is a **new Random Forest algorithm** for batch supervised learning - `WildWood` uses out-of-bag samples differently: it performs **aggregation with exponential weights** of the predictions of **all subtrees** - **Improved predictions** in each individual tree, at a **small computational cost** - The histogram strategy makes `WildWood` competitive with strong baselines, both in terms of performance and training times. ## Future work - `WildWood`'s implementation: room for many improvements - The forest prediction is a simple arithmetic mean of the trees' predictions; we could also perform exponentially weighted aggregation between trees - A `WildWood`-based implementation of isolation forests [Liu et al. (2008)], using the same aggregation mechanism with the log-loss for density estimation --- class: end-slide, center count: false # Thank you!