class: center, middle, inverse

# Binarsity

## A penalization for one-hot encoded features in supervised learning

### Stéphane Gaïffas

### [joint work with M. Alaya, S. Bussy and A. Guilloux]

---
layout: true
class: top

---
# Who am I?

- Stéphane Gaïffas
- Professor
- Research and teaching in machine learning, big data and AI
- Laboratoire de Probabilités, Statistique et Modélisation
- Université Paris Diderot
- [http://www.cmap.polytechnique.fr/~gaiffas/](http://www.cmap.polytechnique.fr/~gaiffas/)
- [stephane.gaiffas@lpsm.paris](mailto:stephane.gaiffas@lpsm.paris)

.center[
] --- # Who are my co-authors .pull-left[
- Dr. Mokhtar Z. Alaya
- Postdoc at the LITIS Lab
- University of Rouen
]

.pull-left[
- Dr. Simon Bussy
- Researcher at AP-HP
- Co-founder of Califrais
]

.pull-left[
- Prof. Agathe Guilloux
- Professor at LaMME
- Univ. d'Evry Val d'Essonne
]

---
class: center, middle, inverse

# Binarsity

### [Why one-hot encoded features?]

`$$ \newcommand{\R}{\mathbb R} \newcommand{\cY}{\mathcal Y} \newcommand{\cJ}{\mathcal J} \newcommand{\bX}{\boldsymbol X} \newcommand{\XB}{\boldsymbol X^B} \newcommand{\TV}{\text{TV}} \newcommand{\norm}[1]{\| #1 \|} \newcommand{\inr}[1]{\langle #1 \rangle} \newcommand{\ind}[1]{\mathbf 1_{#1}} \DeclareMathOperator{\argmin}{argmin} \DeclareMathOperator{\pen}{pen} \DeclareMathOperator{\prox}{prox} \DeclareMathOperator{\bina}{bina} $$`

---
# Motivation

- You want an .stress[interpretable] and .stress[quick and easy] model to train
- You want to stick with a .stress[linear] model
- Several columns of your dataset contain .stress[continuous features]

# Setting

- You have a training dataset $(x_i, y_i)$ for samples $i=1, \ldots, n$
- With features `$x_i = [x_{i,1} \cdots x_{i,p}]^\top \in \R^p$` and labels $y_i \in \cY \subset \R$
- Let `$\bX = [x_{i,j}]_{1 \leq i \leq n; 1 \leq j \leq p}$` be the $n \times p$ feature matrix
- Let $\bX_{\bullet, j}$ be the $j$-th feature column of $\bX$

## A useful trick: .stress[one-hot encoding of continuous features]

It is a simple and **well-known trick**

---
# One-hot encoding of continuous features

- Replaces `$\bX_{\bullet, j}$` by `$d_j$` .stress[binary] columns `$\XB_{\bullet, j, 1}, \ldots, \XB_{\bullet, j, d_j}$`
- The $i$-th row of $\XB$ reads `$$ x_i^B = [x^B_{i,1,1} \cdots x^B_{i,1,d_1} x^B_{i,2,1} \cdots x^B_{i,2,d_2} \cdots x^B_{i,p,1} \cdots x^B_{i,p, d_p}]^\top \in \R^d $$` where `$d = \sum_{j=1}^p d_j$`

Partition the range of `$\bX_{\bullet, j}$` into intervals `$I_{j,1}, \ldots, I_{j, d_j}$` and put `$$x^B_{i, j, k} = \begin{cases} 1 &\text{ if } x_{i,j} \in I_{j, k}, \\ 0 & \text{ otherwise} \end{cases}$$`

How to choose the `$I_{j, k}$`? .stress[Inter-quantile] intervals `$$ I_{j, 1} = \Big[ q_j(0), q_j(\frac {1}{d_j})\Big] \quad \text{ and } \quad I_{j, k} = \Big(q_j(\frac {k-1}{d_j}) , q_j(\frac {k}{d_j}) \Big] $$` for `$k=2, \ldots, d_j$`, where `$q_j(\alpha) = $` quantile of order $\alpha \in [0, 1]$ of $\bX_{\bullet, j}$

---
# One-hot encoding of continuous features
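
To make this concrete, here is a minimal `numpy` sketch (not the `tick` implementation used later in these slides) of the inter-quantile one-hot encoding of a single continuous column; the function name `one_hot_encode` and the default `d_j=10` are purely illustrative.

```python
import numpy as np

def one_hot_encode(x, d_j=10):
    """Inter-quantile one-hot encoding of one continuous feature x of shape (n,)."""
    # Quantiles q_j(0), q_j(1/d_j), ..., q_j(1) delimit the intervals I_{j,1}, ..., I_{j,d_j}
    quantiles = np.quantile(x, np.linspace(0, 1, d_j + 1))
    # Index of the interval containing each x_i (first interval closed, the others left-open)
    bins = np.digitize(x, quantiles[1:-1], right=True)
    # Binary one-hot matrix of shape (n, d_j)
    X_B = np.zeros((x.shape[0], d_j))
    X_B[np.arange(x.shape[0]), bins] = 1.0
    return X_B

X_B = one_hot_encode(np.random.randn(1000), d_j=10)
assert np.all(X_B.sum(axis=1) == 1)  # each sample falls in exactly one interval I_{j,k}
```

In practice, the `FeaturesBinarizer` from `tick.preprocessing` (used at the end of these slides) performs this binning on all columns at once.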
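---
# One-hot encoding of continuous features

A small companion sketch (same `numpy` assumptions as above, with illustrative names `theta_j` and `impact`) of the resulting piecewise-constant impact of one feature, as discussed on the next slide:

```python
import numpy as np

# One continuous feature binned into d_j = 5 inter-quantile intervals
x_j = np.random.randn(1000)
d_j = 5
edges = np.quantile(x_j, np.linspace(0, 1, d_j + 1))
# Toy weights: one theta_{j,k} per interval I_{j,k}
theta_j = np.array([-0.5, -0.5, 0.0, 0.8, 0.8])

def impact(z):
    """Piecewise-constant impact z -> sum_k theta_{j,k} 1_{z in I_{j,k}}."""
    k = np.digitize(z, edges[1:-1], right=True)  # index of the interval containing z
    return theta_j[k]

print(impact(np.quantile(x_j, 0.05)))  # -0.5: z falls in the first inter-quantile interval
print(impact(np.quantile(x_j, 0.95)))  # 0.8: z falls in the last interval
```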
---
# One-hot encoding of continuous features

## Linear regression on .stress[raw continuous features]

Look for weights $w \in \R^p$ such that
`$$ y_{i} \approx w^\top x_{i} + b = \sum_{j=1}^p x_{i,j} w_j + b $$`
Impact of feature $j$ is .stress[linear] $z \mapsto w_j \times z$ and encoded by a **single weight** $w_j$

## Linear regression on .stress[one-hot-encoded features]

Look for weights $\theta \in \R^{d}$ where `$d = \sum_{j=1}^p d_j$` such that
`$$ y_{i} \approx \theta^\top x_{i}^B + b = \sum_{j=1}^p \sum_{k=1}^{d_j} x^B_{i,j, k} \theta_{j, k} + b = \sum_{j=1}^p \sum_{k=1}^{d_j} \theta_{j, k} \mathbf{1}_{x_{i, j} \in I_{j, k}} + b $$`
Impact of feature $j$ is .stress[piecewise constant] `$z \mapsto \sum_{k=1}^{d_j} \theta_{j, k} \mathbf{1}_{z \in I_{j, k}} $` and encoded with a weight `$\theta_{j, k}$` for each range of values `$I_{j, k}$`

---
# Goodness-of-fit

Linear method with .stress[goodness-of-fit]
`$$ R_n(\theta, b) = \frac 1n \sum_{i=1}^n \ell(y_i, x_i^\top \theta + b) $$`
coming from a GLM (Generalized Linear Model) where
`$$ \mathbb P_\theta(y | x) = \exp \Big( \frac{y m_\theta(x) - b(m_\theta(x))}{\phi} + c(y, \phi) \Big) $$`
where
`$$ m_\theta(x_i) = x_i ^\top \theta $$`
and `$\ell(y, y') = - y y' + b(y')$`.

## Example: logistic regression

Most important example, where $\ell$ is the .stress[logistic loss]:
`$$ \ell(y, z) = \log(1 + e^{- yz}) $$`

---
# Problems with one-hot encoding

1. **Collinear binary features.** By construction we have `$\sum_{k=1}^{d_j} x_{i, j, k}^B = 1$` for any $j=1, \ldots, p$, hence `$X^B$` is .stress[not full rank]
2. **Choice of `$d_j$`.** Too large leads to .stress[overfitting]
3. **Feature selection.** We want sparsity of blocks of weights, `$\theta_{j, 1} = \cdots = \theta_{j, d_j} = 0$`, in order to .stress[kill feature $j$]

# Solutions

1. Impose `$n_j^\top \theta_{j, \bullet} = 0$` for all $j=1, \ldots, p$ for some $n_j \in \R^{d_j}$
2. .stress[Total variation penalization] in each block `$$ \sum_{j=1}^p \sum_{k=2}^{d_j} \lambda_{j, k} |\theta_{j, k} - \theta_{j, k-1}| $$`
3. Actually solved by 1. and 2.!

---
# Binarsity penalization

Consider the following .stress[penalized] goodness-of-fit
`$$ \begin{align*} R_n(\theta, b) &= \frac 1n \sum_{i=1}^n \ell(y_i, \theta^\top x_i^B + b) + \bina(\theta), \quad \text{ where} \\ \bina(\theta) &= \sum_{j=1}^p \Big( \sum_{k=2}^{d_j} \lambda_{j, k} |\theta_{j, k} - \theta_{j, k-1}| \Big) + \delta_j(\theta_{j, \bullet}) \end{align*} $$`
where `$ \delta_j(\theta_{j, \bullet}) = \begin{cases} 0 & \text{ if } n_j^\top \theta_{j, \bullet} = 0 \\ +\infty & \text{ otherwise} \end{cases} $` where we put `$n_j = [n_{j, 1} \cdots n_{j, d_j}] \in \mathbb N^{d_j}$` with `$n_{j, k} = | \{ i : x_{i, j} \in I_{j, k} \} |$` and
`$$ \lambda_{j,k} \approx \sqrt{\frac{\log d}{n} \pi_{j,k}} \quad \text{ with } \quad \pi_{j,k} = \frac{\big|\big\{ i=1, \ldots, n: x_{i,j} \in \cup_{k'=k}^{d_j} I_{j, k'} \big\}\big|}{n} $$`

---
# Binarsity penalization

Where does it all come from?

## Some explanations

- `$d_j$` may be too large, but we **force consecutive `$\theta_{j, k}$` and `$\theta_{j, k-1}$` to be close**
- `$\ell_1$` penalization leads to sparsity. Binarsity leads to .stress[piecewise constant `$\theta_{j, \bullet}$`]
- If `$\theta_{j, \bullet}$` is constant, `$n_j^\top \theta_{j, \bullet} = 0$` entails that `$\theta_{j, \bullet}$` is 0 everywhere: .stress[block sparsity]

## In practice...

- We use inter-quantile intervals: coordinates of $n_j$ are almost equal.
  .stress[Simply use $n_j = [1 \cdots 1]$]
- Forget about the $\lambda_{j, k}$: just .stress[cross-validate a single constant $\lambda$]

---
class: center, middle, inverse

# Illustrations

### [Might help to understand]

---
# Weights of a logistic regression

Churn dataset (UCI) $n=3333, p=14$

## Raw continuous features

.center[
] ## Binarized features .stress[no penalization] .center[
] --- # Weights of a logistic regression Churn dataset (UCI) $n=3333, p=14$ ## Raw continuous features .center[
] ## Binarized features .stress[low binarsity penalization] .center[
] --- # Weights of a logistic regression Churn dataset (UCI) $n=3333, p=14$ ## Raw continuous features .center[
] ## Binarized features .stress[strong binarsity penalization] .center[
] --- # Decision function of a logistic regression On 3 toy datasets with $n=1000$, $p=2$ and $d_1 = d_2 = 100$ .center[
]

---
class: center, middle, inverse

# Some theoretical guarantees

### [We can totally skip this]

---
# Assumptions on the GLM

`$$ \mathbb P(y | x) = \exp \Big( \frac{y m^0(x) - b(m^0(x))}{\phi} + c(y, \phi) \Big) $$`

1. $b$ is three times continuously differentiable
1. `$|b'''(z)| \leq C_b |b''(z)|$` for some `$C_b > 0$`
1. `$C_n = \max_{i=1, \ldots, n}|m^0(x_i)| < \infty$`
1. `$L_n \leq \max_{i=1, \ldots, n} b''\big(m^0(x_i)\big) \leq U_n$`

Satisfied by the following standard GLMs, with

.pure-table.pure-table-bordered.pure-table-striped.smaller-font[
| Model | $\phi$ | $b(z)$ | $b'(z)$ | $b''(z)$ | $b'''(z)$ | $C_b$ | $L_n$ | $U_n$ |
|-------|--------|--------|---------|----------|-----------|-------|-------|-----|
| Normal | $\sigma^2$ | $\frac{z^2}{2}$ | $z$ | $1$ | $0$ | $0$ | $1$ | $1$ |
| Logistic | $1$ | $\log(1 + e^z)$ | $\frac{e^z}{1+e^z}$ | $\frac{e^z}{(1+e^z)^2}$ | $\frac{1 - e^z}{1+e^z} b''(z)$ | 2 | $\frac{e^{C_n}}{(1 + e^{C_n})^2}$ | $\frac{1}{4}$ |
| Poisson | $1$ | $e^z$ | $e^z$ | $e^z$ | $b''(z)$ | 1 | $e^{-C_n}$ | $e^{C_n}$ |
]

---
# What is an oracle inequality?

Expected risk for a GLM (Van de Geer 2008)
`$$ R(m_{\theta}) = \dfrac 1n \sum_{i=1}^n \big\{- b'\big(m^0(X_i)\big) m_{\theta}(X_i) + b\big(m_{\theta}(X_i)\big)\big\} $$`

Non-asymptotic .stress[oracle inequality]

- How close is $m_{\hat \theta}$ to the minimal expected risk?
- Measured by the excess risk `$R(m_{\hat\theta}) - R(m^0)$`

For fixed $n$, prove something like
`$$ R(m_{\hat\theta}) - R(m^0) \leq \inf_{\theta} \Big\{ R(m_{\theta}) - R(m^0) + \frac{\text{complexity}(\theta)}{n} \Big\} $$`

For `$\ell_1$` penalization (`$\norm{u}_1 = \sum_{j=1}^p |u_j|$`) one has
`$$ \text{complexity}(\theta) = s(\theta) \log p $$`
where `$s(\theta) = \text{sparsity}(\theta) = |\{ j=1, \ldots, p : \theta_j \neq 0 \}|$` (Bickel, Ritov and Tsybakov 2009, among many others)

---
# A new measure of sparsity: .stress[binarsity]

We consider .stress[a different sparsity] measure

- For $\theta \in \R^d,$ let `$J(\theta)= [J_1 (\theta),\ldots, J_p (\theta)]$` where `$$ J_j(\theta)= \{k = 2, \ldots, d_j \; : \; \theta_{j,k} \neq \theta_{j,k-1} \}. $$`
- Let `$J^\complement(\theta)= \big[J_1^\complement(\theta),\ldots,J_p^\complement(\theta)\big]$` be the complement of $J(\theta).$
- If `$J =[J_1, \ldots, J_p]$`, put `$|J| = \sum_{j=1}^p |J_j|$` where $|E| = $ cardinality of $E$
- The measure of sparsity we use is `$$ \text{binarsity}(\theta) = |J(\theta)| = \sum_{j=1}^p |J_j(\theta)| = \sum_{j=1}^p | \{k = 2, \ldots, d_j : \theta_{j,k} \neq \theta_{j,k-1} \}| $$` It counts the .stress[number of non-equal consecutive values of] $\theta$
- If $\theta$ is block-sparse, namely $|\cJ(\theta)| \ll p$ where `$\cJ(\theta) = \{ j = 1, \ldots, p : \theta_{j, \bullet} \neq 0_{d_j} \}$`, then `$$ |J(\theta)| \leq |\cJ(\theta)| \max_{j \in \cJ(\theta)} |J_j(\theta)| $$` which means that $|J(\theta)|$ is controlled by the block sparsity $|\cJ(\theta)|$.
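---
# A new measure of sparsity: .stress[binarsity]

A toy numerical illustration of this sparsity measure (a sketch with the blocks stored as a list of `numpy` arrays; the name `binarsity_sparsity` is illustrative):

```python
import numpy as np

def binarsity_sparsity(theta_blocks):
    """|J(theta)|: number of non-equal consecutive weights, summed over the p blocks."""
    return sum(int(np.sum(block[1:] != block[:-1])) for block in theta_blocks)

# p = 2 features with d_1 = d_2 = 5: the first block is piecewise constant with a single
# jump, the second block is identically zero (feature 2 is "killed")
theta = [np.array([0.3, 0.3, -0.1, -0.1, -0.1]), np.zeros(5)]
print(binarsity_sparsity(theta))  # 1, while the dimension is d = 10
```

The zero block contributes nothing, consistent with the bound above.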
---
# Restricted eigenvalues assumption

- Define `$$ \kappa (K) = \inf_{u \in \mathscr{C}(K) \backslash \{ 0 \} } \bigg\{ \frac{\norm{\XB u}_2}{\sqrt n \norm{u_K}_2} \bigg\} $$` with `$$ \mathscr{C}(K) = \bigg\{u \in \R^d: \sum_{j=1}^p \| (u_{j, \bullet})_{{K_j}^\complement} \|_{\text{TV}} \leq 2\sum_{j=1}^p \|(u_{j, \bullet})_{K_j} \|_{\TV} \bigg\} $$` where `$(u_K)_k = u_k$` if $k \in K$ and `$(u_K)_k = 0$` if $k \not \in K$, and `$$ \|(u_{j, \bullet}) \|_{\TV} = \sum_{k=2}^{d_j} \lambda_{j, k} |u_{j, k} - u_{j, k-1}| $$`
- The set $\mathscr{C}(K)$ is a cone composed of all vectors with a support "close" to $K$

**Assumption.** We have $\kappa (K) > 0$ for any $K$ such that `$|K| \leq J^\star $`

---
# Theorem

Define
`$$ \lambda_{j,k} = \sqrt{\frac{2U_n\phi(A +\log d)}{n}\, \pi_{j,k}} $$`
for some $A > 0$ and consider
`$$ \hat \theta \in \argmin_{\theta \in B_d(\rho)} \big\{R_n(\theta) + \bina(\theta) \big\}, $$`
where
`$$ B_d(\rho) = \Big \{\theta \in \R^d: \sum_{j=1}^p\norm{\theta_{j , \bullet}}_\infty \leq \rho \Big\} $$`

A standard constraint in the literature for the **proof of oracle inequalities for sparse GLMs** (Van de Geer 2008)

---
# Theorem

Then
`$$ \begin{align*} R(m_{\hat \theta}) - R(m^0) \leq & \inf_{\theta} \Big\{ 3 (R(m_{\theta}) - R(m^0)) \\ & \quad + \frac{c (C_b(C_n + \rho) + 2)}{L_n \kappa^2(J(\theta))} \; |J(\theta)| \; \max_{j=1, \ldots, p} \norm{(\lambda_{j,\bullet})_{J_j(\theta)}}_\infty^2 \Big\}, \end{align*} $$`
with probability at least $1 -2e^{-A}$, where the inf is taken over `$\theta \in B_d(\rho)$` such that `$n_j^\top \theta_{j, \bullet} = 0$` for all `$j=1, \ldots, p$` and such that `$|J(\theta)| \leq J^\star$`.

The **complexity term** satisfies
`$$ |J(\theta)| \max_{j=1, \ldots, p}\norm{(\lambda_{j,\bullet})_{J_j(\theta)}}_\infty^2 \leq 2 U_n\phi \frac{|J(\theta)|(A + \log d)}{n}. $$`

- This means that $\hat \theta$ .stress[achieves an optimal tradeoff between bias] $R(m_{\theta}) - R(m^0)$ .stress[and sparsity] $|J(\theta)|$
- .stress[Fast rate] of order `$(\text{binarsity}(\theta) \times \log d) /n$`

---
# Remarks

- $\theta \in B_d(\rho)$ is a **standard technical assumption**
- Related to the use of .stress[self-concordance] (Bach 2010) to connect the empirical squared $\ell_2$-norm and the empirical Kullback divergence
- Common for the **proof of oracle inequalities for sparse GLMs**, see Van de Geer (2008) and Ivanoff et al. (2016) for Poisson regression
- Note that `$$ \max_{i=1,\ldots,n} | \inr{x_i^B, \theta} | \leq \sum_{j=1}^p \norm{\theta_{j , \bullet}}_\infty \leq |\cJ(\theta)| \times \norm{\theta}_\infty, $$` where `$\norm{\theta}_\infty = \max_{j=1, \ldots, p} \norm{\theta_{j, \bullet}}_\infty$`.
- So that $\theta \in B_d(\rho)$ is entailed by a box constraint on $\theta$, which depends on `$|\cJ(\theta)|$` (and not $p$)

---
# Sparse additive model

- $x_i \in [0, 1]^p$ for all $i=1, \ldots, n$.
- .stress[Gaussian setting] with the least-squares loss $\ell(y, y') = \frac 12 (y - y')^2$, $b(y) = \frac 12 y^2$ and $\phi = \sigma^2$ (noise variance) with $L_n = U_n = 1$, $C_b = 0$
- $m^0$ has a .stress[sparse additive structure] `$$ m^0(x) = \sum_{j \in \cJ_*} m_j^0(x_j) $$` for `$x = [x_1 \cdots x_p] \in \R^p$`, where `$m_j^0$` satisfies `$$ |m_j^0(z) - m_j^0(z')| \leq L |z - z'| $$` for any `$z, z' \in \R$` and where `$\cJ_* \subset \{ 1, \ldots, p \}$` is a **set of active features** (sparsity means that `$|\cJ_*| \ll p$`).
Also, assume
`$$ \sum_{i=1}^n m_j^0(x_{i, j}) = 0 \quad \text{ for all } \quad j=1, \ldots, p $$`

---
# Sparse additive model

- .stress[Identifiability] and .stress[smoothness] requirements
- Standard when studying additive models (Meier et al. 2009)
- Restrict $m_j^0$ to be **Lipschitz** and not smoother: binarsity produces a .stress[piecewise constant decision function] with respect to each $j$

## Theorem

Consider `$d_j = D = n^{1/3}$` and `$I_{j, 1} = [0, \frac{1}{D}]$, $I_{j, k} = I_{k} = (\frac{k-1}{D}, \frac{k}{D}]$`. Introduce
`$$ \theta_{j, k}^* = \frac{\sum_{i=1}^n m_j^0(x_{i, j}) \ind{I_k}(x_{i, j})}{\sum_{i=1}^n \ind{I_k}(x_{i, j})} $$`
for `$j \in \cJ_*$` and `$\theta_{j, \bullet}^* = \boldsymbol 0_D$` for `$j \notin \cJ_*$`. Then,
`$$ \norm{m_{\hat \theta} - m^0}_n^2 \leq \Big(3 L^2 |\cJ_*| + \frac{c M_n \sigma^2 (A + \log(p n^{1/3} M_n))}{\kappa^2(J(\theta^*))} \Big) \frac{|\cJ_*|}{n^{2/3}}, $$`
where `$M_n = \max_{j=1, \ldots, p} \max_{i=1, \ldots, n} |m_j^0(x_{i, j})|.$`

---
# Remarks

- Convergence rate of order `$|\cJ_*|^2 n^{-2/3}$`
- Matches the rate $n^{-2 / 3} = n^{-2r / (2r + 1)}$ under the Lipschitz assumption ($r = 1$)
- Matches the rate `$|\cJ_*|^2 n^{-4 / 5} = |\cJ_*|^2 n^{-2r / (2r + 1)} $` from Van de Geer and Bühlmann (2011) under $C^2$ smoothness, namely $r = 2$
- .stress[Dimension reduction] thanks to additive modelling (the full non-parametric rate is `$n^{-2r / (2r + p)}$`)
- .stress[Binarsity matches the literature] in the particular case of a **sparse additive model**

---
class: center, middle, inverse

# Implementation

### [And useful tools]

---
# Optimization problem

Need to minimize
`$$ R_n(\theta) + \bina(\theta) $$`
where `$R_n(\cdot)$` is **smooth** (gradient Lipschitz) and $\bina(\cdot)$ is **non-differentiable**

First-order optimization (Bach et al. 2012); the simplest method is **proximal gradient descent**
`$$ \theta^{k+1} \gets \prox_{\eta \bina(\cdot)} ( \theta^k - \eta \nabla R_n(\theta^k) ) $$`
where $\eta > 0$ is the learning rate and `$\prox_g$` is the .stress[proximal operator]
`$$ \prox_g(v) \in \argmin_{u \in \R^d} \Big\{\frac 12 \norm{v- u}_2^2 + g(u) \Big\} $$`

In our implementation: **more advanced optimization techniques** (SGD with variance reduction, waaaaaaay faster)

---
# Proximal operator of binarsity

The following algorithm .stress[computes the proximal operator] of $\bina(\theta)$

--------
1. **Input:** `$\theta \in \R^d$`, `$\lambda_{j, k}$` and `$n_{j, k}$` for `$j=1, \ldots, p$` and `$k=1, \ldots, d_j$`
1. **Output:** `$\eta = \prox_{\bina}(\theta)$`
1. **for** $j=1, \ldots, p$ **do**
  - `$\beta_{j,\bullet} \gets \prox_{\norm{\theta_{j, \bullet}}_{\TV}} (\theta_{j, \bullet})$` $\qquad\quad$ (weighted TV penalization in block `$\theta_{j, \bullet}$`)
  - `$\eta_{j,\bullet} \gets \beta_{j,\bullet} - \frac{n_j^\top \beta_{j,\bullet}}{\norm{n_j}_2^2} n_j$` $\qquad\qquad$ (project onto `$\text{span}(n_j)^\perp$`)
1. **Return:** $\eta$

--------

- Fast
- Main difficulty is `$\prox_{\norm{\cdot}_{\TV}}$`

---
# Proximal operator of binarsity

The following computes `$\beta = \prox_{\norm{\cdot}_\TV}(\theta)$` where `$\norm{u}_{\TV} = \sum_{j=2}^m \lambda_{j} |u_j - u_{j-1}|$`

.pull-left[
] .pull-right[
]

.stress[Modification of an algorithm by Condat (2013)] to include weights $\lambda_j$

---
# Proximal operator of binarsity

If you like .stress[proximal operators], have a look at

.center[https://x-datainitiative.github.io/tick/modules/prox.html]

## Example

```python
import numpy as np
from tick.prox import ProxTV, ProxBinarsity

x = np.random.randn(50)
strength = 0.4

prox_tv = ProxTV(strength=strength)
x_tv = prox_tv.call(x)

prox_bina = ProxBinarsity(
    strength=strength,
    blocks_start=np.arange(0, 50, 10),
    blocks_length=10 * np.ones((5,))
)
x_bina = prox_bina.call(x)
```

---
class: center, middle, inverse

# Numerical experiments

### [On a bunch of datasets]

---
# Datasets

We consider the following datasets for .stress[binary classification]

.pure-table.pure-table-bordered.pure-table-striped[
| Dataset | \#samples | \#features |
|--------|-----|-----|
| Ionosphere | 351 | 34 |
| Churn | 3333 | 21 |
| Default of credit card | 30000 | 24 |
| Adult | 32561 | 14 |
| Bank marketing | 45211 | 17 |
| Covertype | 550088 | 10 |
| SUSY | 5000000 | 18 |
| HEPMASS | 10500000 | 28 |
| HIGGS | 11000000 | 24 |
]

---
# Baselines

We compare our procedure to the following .stress[baselines]

.pure-table.pure-table-bordered.pure-table-striped[
| Procedure | Description |
|-----------|-------------|
| Lasso | Logistic regression (LR) with $\ell_1$ penalization |
| Group Lasso | LR with group $\ell_1$ penalization |
| Group TV | LR with group total-variation penalization |
| SVM | Support vector machine with Gaussian kernel |
| GAM | Generalized Additive Model |
| RF | Random Forest |
| GB | Gradient Boosting |
]

---
# Comparison of training times

.center[
]

- Binarsity is usually .stress[orders of magnitude faster] than RF, SVM, GB and GAM
- With .stress[similar performance]

---
# Comparison of ROC curves

.center[
] --- # Computation time ## On a .stress[dense] dataset .center[
]

## On a .stress[sparse] dataset

- $\ell_1$ penalization (or any separable penalization) can benefit from .stress[lazy updates]
- Total-variation is not separable: no lazy updates in SGD steps

---
# About the number of bins

- What about the .stress[number of bins] used in one-hot encoding?
- .stress[Do we need to care about it] when using the binarsity penalization?

.center[
]

--

- .large[**No!**]

--

- Well, **not really**... use 50, 100 or 200 bins (depending on $n$...)

---
# Try this yourself: just install .stress[`tick`]

Just type (ideally in a `Python3.6` environment)

```bash
pip install tick
```

.pull-left[
.center[
]

.stress[Documentation] is here: .small[https://x-datainitiative.github.io/tick]

.stress[Code] is here: .small[https://github.com/X-DataInitiative/tick]
]

.pull-right[
]

---
# Try this yourself

```python
# Input: features X, labels y
from tick.inference import LogisticRegression
from tick.preprocessing import FeaturesBinarizer
from sklearn.model_selection import train_test_split

# Binarize data
binarizer = FeaturesBinarizer(n_cuts=50)
X = binarizer.fit_transform(X)

# Shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

# Fit the model
learner = LogisticRegression(
    penalty='binarsity', C=1,
    blocks_start=binarizer.blocks_start,
    blocks_length=binarizer.blocks_length
)
learner.fit(X_train, y_train)

# Predict on the test set
y_pred = learner.predict_proba(X_test)[:, 1]
```

---
class: center, middle, inverse

# Conclusion

---
# Take home message

- .stress[New penalization] for one-hot encoded continuous features
- Good .stress[compromise between performance and training time]
- And .stress[interpretability] (change points)
- Nice .stress[theoretical properties]
- Go use it (and **make money**)!

## Paper
M.Z. Alaya, S. Bussy, S. Gaïffas, A. Guilloux, *Binarsity: a penalization for one-hot encoded features*, in revision in *JMLR* ## Code
Code available at `https://github.com/SimonBussy/binarsity`

`tick` library available at `https://github.com/X-DataInitiative/tick`

---
class: center, middle, inverse

# Thank you!

### [Questions?]
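
---
# Backup: proximal gradient descent in a few lines

A minimal sketch (not the `tick` solvers used in the experiments) of the proximal gradient iteration shown earlier, using `ProxBinarsity` from `tick.prox` on a toy binarized design matrix; the names `X_B`, `grad`, the toy data and the fixed step size are illustrative, and for simplicity the step size is folded into `strength`.

```python
import numpy as np
from tick.prox import ProxBinarsity

# Toy one-hot encoded design matrix: p = 5 features with d_j = 10 bins each
n, p, d_j = 500, 5, 10
rng = np.random.RandomState(0)
bins = rng.randint(d_j, size=(n, p))
X_B = np.zeros((n, p * d_j))
X_B[np.arange(n)[:, None], np.arange(p) * d_j + bins] = 1.0
y = np.sign(rng.randn(n))  # labels in {-1, +1}

prox = ProxBinarsity(strength=1e-2,
                     blocks_start=np.arange(0, p * d_j, d_j),
                     blocks_length=d_j * np.ones(p))

def grad(theta):
    """Gradient of the logistic goodness-of-fit R_n(theta)."""
    z = X_B.dot(theta)
    return -X_B.T.dot(y / (1.0 + np.exp(y * z))) / n

eta = 0.5  # fixed step size for this sketch
theta = np.zeros(p * d_j)
for _ in range(200):
    theta = prox.call(theta - eta * grad(theta))  # theta^{k+1} <- prox(theta^k - eta * grad)
```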