class: center, middle, inverse

# Binarsity

## A penalization for one-hot encoded features in supervised learning

### Stéphane Gaïffas

### [joint work with M. Alaya, S. Bussy and A. Guilloux]

---
layout: true
class: top

---
# Who am I?

- Stéphane Gaïffas
- Professor
- Research and teaching in machine learning, big data and AI
- Laboratoire de Probabilités, Statistique et Modélisation
- Université Paris Diderot
- [http://www.cmap.polytechnique.fr/~gaiffas/](http://www.cmap.polytechnique.fr/~gaiffas/)
- [stephane.gaiffas@lpsm.paris](mailto:stephane.gaiffas@lpsm.paris)

.center[
] --- # Who are my co-authors .pull-left[
- Dr. Mokhtar Z. Alaya
- Postdoc at the LITIS Lab
- University of Rouen
]

.pull-left[
- Dr. Simon Bussy
- Researcher at AP-HP
- Co-founder of Califrais
]

.pull-left[
- Prof. Agathe Guilloux
- Professor at LaMME
- Univ. d'Evry Val d'Essonne
]

---
class: center, middle, inverse

# Binarsity

### [Why one-hot encoded features?]

`$$ \newcommand{\R}{\mathbb R} \newcommand{\cY}{\mathcal Y} \newcommand{\cJ}{\mathcal J} \newcommand{\bX}{\boldsymbol X} \newcommand{\XB}{\boldsymbol X^B} \newcommand{\TV}{\text{TV}} \newcommand{\norm}[1]{\| #1 \|} \newcommand{\inr}[1]{\langle #1 \rangle} \newcommand{\ind}[1]{\mathbf 1_{#1}} \DeclareMathOperator{\argmin}{argmin} \DeclareMathOperator{\pen}{pen} \DeclareMathOperator{\prox}{prox} \DeclareMathOperator{\bina}{bina} $$`

---
# Motivation

- You want an .stress[interpretable] and .stress[quick and easy] model to train
- You want to stick with a .stress[linear] model
- Several columns of your dataset contain .stress[continuous features]

# Setting

- You have a training dataset $(x_i, y_i)$ for samples $i=1, \ldots, n$
- With features `$x_i = [x_{i,1} \cdots x_{i,p}]^\top \in \R^p$` and labels $y_i \in \cY \subset \R$
- Let `$\bX = [x_{i,j}]_{1 \leq i \leq n; 1 \leq j \leq p}$` be the $n \times p$ feature matrix
- Let $\bX_{\bullet, j}$ be the $j$-th feature column of $\bX$

## A useful trick: .stress[one-hot encoding of continuous features]

It is a simple and **well-known trick**

---
# One-hot encoding of continuous features

- Replaces `$\bX_{\bullet, j}$` by `$d_j$` .stress[binary] columns `$\XB_{\bullet, j, 1}, \ldots, \XB_{\bullet, j, d_j}$`
- The $i$-th row of $\XB$ reads `$$ x_i^B = [x^B_{i,1,1} \cdots x^B_{i,1,d_1} x^B_{i,2,1} \cdots x^B_{i,2,d_2} \cdots x^B_{i,p,1} \cdots x^B_{i,p, d_p}]^\top \in \R^d $$` where `$d = \sum_{j=1}^p d_j$`

Partition the range of `$\bX_{\bullet, j}$` into intervals `$I_{j,1}, \ldots, I_{j, d_j}$` and put `$$x^B_{i, j, k} = \begin{cases} 1 &\text{ if } x_{i,j} \in I_{j, k}, \\ 0 & \text{ otherwise} \end{cases}$$`

How to choose the `$I_{j, k}$`? .stress[Inter-quantile] intervals `$$ I_{j, 1} = \Big[ q_j(0), q_j(\frac {1}{d_j})\Big] \quad \text{ and } \quad I_{j, k} = \Big(q_j(\frac {k-1}{d_j}) , q_j(\frac {k}{d_j}) \Big] $$` for `$k=2, \ldots, d_j$`, where `$q_j(\alpha) = $` quantile of order $\alpha \in [0, 1]$ of $\bX_{\bullet, j}$

---
# One-hot encoding of continuous features
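
To make this concrete, here is a minimal `numpy` sketch (not the `tick` implementation used later in these slides) of the inter-quantile one-hot encoding of a single continuous column; the function name `one_hot_encode` and the default `d_j=10` are purely illustrative.

```python
import numpy as np

def one_hot_encode(x, d_j=10):
    """Inter-quantile one-hot encoding of one continuous feature x of shape (n,)."""
    # Quantiles q_j(0), q_j(1/d_j), ..., q_j(1) delimit the intervals I_{j,1}, ..., I_{j,d_j}
    quantiles = np.quantile(x, np.linspace(0, 1, d_j + 1))
    # Index of the interval containing each x_i (first interval closed, the others left-open)
    bins = np.digitize(x, quantiles[1:-1], right=True)
    # Binary one-hot matrix of shape (n, d_j)
    X_B = np.zeros((x.shape[0], d_j))
    X_B[np.arange(x.shape[0]), bins] = 1.0
    return X_B

X_B = one_hot_encode(np.random.randn(1000), d_j=10)
assert np.all(X_B.sum(axis=1) == 1)  # each sample falls in exactly one interval I_{j,k}
```

In practice, the `FeaturesBinarizer` from `tick.preprocessing` (used at the end of these slides) performs this binning on all columns at once.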
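---
# One-hot encoding of continuous features

A small companion sketch (same `numpy` assumptions as above, with illustrative names `theta_j` and `impact`) of the resulting piecewise-constant impact of one feature, as discussed on the next slide:

```python
import numpy as np

# One continuous feature binned into d_j = 5 inter-quantile intervals
x_j = np.random.randn(1000)
d_j = 5
edges = np.quantile(x_j, np.linspace(0, 1, d_j + 1))
# Toy weights: one theta_{j,k} per interval I_{j,k}
theta_j = np.array([-0.5, -0.5, 0.0, 0.8, 0.8])

def impact(z):
    """Piecewise-constant impact z -> sum_k theta_{j,k} 1_{z in I_{j,k}}."""
    k = np.digitize(z, edges[1:-1], right=True)  # index of the interval containing z
    return theta_j[k]

print(impact(np.quantile(x_j, 0.05)))  # -0.5: z falls in the first inter-quantile interval
print(impact(np.quantile(x_j, 0.95)))  # 0.8: z falls in the last interval
```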
---
# One-hot encoding of continuous features

## Linear regression on .stress[raw continuous features]

Look for weights $w \in \R^p$ such that
`$$ y_{i} \approx w^\top x_{i} + b = \sum_{j=1}^p x_{i,j} w_j + b $$`
Impact of feature $j$ is .stress[linear] $z \mapsto w_j \times z$ and encoded by a **single weight** $w_j$

## Linear regression on .stress[one-hot-encoded features]

Look for weights $\theta \in \R^{d}$ where `$d = \sum_{j=1}^p d_j$` such that
`$$ y_{i} \approx \theta^\top x_{i}^B + b = \sum_{j=1}^p \sum_{k=1}^{d_j} x^B_{i,j, k} \theta_{j, k} + b = \sum_{j=1}^p \sum_{k=1}^{d_j} \theta_{j, k} \mathbf{1}_{x_{i, j} \in I_{j, k}} + b $$`
Impact of feature $j$ is .stress[piecewise constant] `$z \mapsto \sum_{k=1}^{d_j} \theta_{j, k} \mathbf{1}_{z \in I_{j, k}} $` and encoded with a weight `$\theta_{j, k}$` for each range of values `$I_{j, k}$`

---
# Goodness-of-fit

Linear method with .stress[goodness-of-fit]
`$$ R_n(\theta, b) = \frac 1n \sum_{i=1}^n \ell(y_i, x_i^\top \theta + b) $$`
coming from a GLM (Generalized Linear Model) where
`$$ \mathbb P_\theta(y | x) = \exp \Big( \frac{y m_\theta(x) - b(m_\theta(x))}{\phi} + c(y, \phi) \Big) $$`
where
`$$ m_\theta(x_i) = x_i ^\top \theta $$`
and `$\ell(y, y') = - y y' + b(y')$`.

## Example: logistic regression

Most important example, where $\ell$ is the .stress[logistic loss]:
`$$ \ell(y, z) = \log(1 + e^{- yz}) $$`

---
# Problems with one-hot encoding

1. **Collinear binary features.** By construction we have `$\sum_{k=1}^{d_j} x_{i, j, k}^B = 1$` for any $j=1, \ldots, p$, hence `$X^B$` is .stress[not full rank]
2. **Choice of `$d_j$`.** Too large leads to .stress[overfitting]
3. **Feature selection.** We want sparsity of blocks of weights, `$\theta_{j, 1} = \cdots = \theta_{j, d_j} = 0$`, in order to .stress[kill feature $j$]

# Solutions

1. Impose `$n_j^\top \theta_{j, \bullet} = 0$` for all $j=1, \ldots, p$ for some $n_j \in \R^{d_j}$
2. .stress[Total variation penalization] in each block `$$ \sum_{j=1}^p \sum_{k=2}^{d_j} \lambda_{j, k} |\theta_{j, k} - \theta_{j, k-1}| $$`
3. Actually solved by 1. and 2.!

---
# Binarsity penalization

Consider the following .stress[penalized] goodness-of-fit
`$$ \begin{align*} R_n(\theta, b) &= \frac 1n \sum_{i=1}^n \ell(y_i, \theta^\top x_i^B + b) + \bina(\theta), \quad \text{ where} \\ \bina(\theta) &= \sum_{j=1}^p \Big( \sum_{k=2}^{d_j} \lambda_{j, k} |\theta_{j, k} - \theta_{j, k-1}| \Big) + \delta_j(\theta_{j, \bullet}) \end{align*} $$`
where `$ \delta_j(\theta_{j, \bullet}) = \begin{cases} 0 & \text{ if } n_j^\top \theta_{j, \bullet} = 0 \\ +\infty & \text{ otherwise} \end{cases} $` where we put `$n_j = [n_{j, 1} \cdots n_{j, d_j}] \in \mathbb N^{d_j}$` with `$n_{j, k} = | \{ i : x_{i, j} \in I_{j, k} \} |$` and
`$$ \lambda_{j,k} \approx \sqrt{\frac{\log d}{n} \pi_{j,k}} \quad \text{ with } \quad \pi_{j,k} = \frac{\big|\big\{ i=1, \ldots, n: x_{i,j} \in \cup_{k'=k}^{d_j} I_{j, k'} \big\}\big|}{n} $$`

---
# Binarsity penalization

Where does it all come from?

## Some explanations

- `$d_j$` may be too large, but we **force consecutive `$\theta_{j, k}$` and `$\theta_{j, k-1}$` to be close**
- `$\ell_1$` penalization leads to sparsity. Binarsity leads to .stress[piecewise constant `$\theta_{j, \bullet}$`]
- If `$\theta_{j, \bullet}$` is constant, `$n_j^\top \theta_{j, \bullet} = 0$` entails that `$\theta_{j, \bullet}$` is 0 everywhere: .stress[block sparsity]

## In practice...

- We use inter-quantile intervals: coordinates of $n_j$ are almost equal.
  .stress[Simply use $n_j = [1 \cdots 1]$]
- Forget about the $\lambda_{j, k}$: just .stress[cross-validate a single constant $\lambda$]

---
class: center, middle, inverse

# Illustrations

### [Might help to understand]

---
# Weights of a logistic regression

Churn dataset (UCI) $n=3333, p=14$

## Raw continuous features

.center[
] ## Binarized features .stress[no penalization] .center[
] --- # Weights of a logistic regression Churn dataset (UCI) $n=3333, p=14$ ## Raw continuous features .center[
] ## Binarized features .stress[low binarsity penalization] .center[
] --- # Weights of a logistic regression Churn dataset (UCI) $n=3333, p=14$ ## Raw continuous features .center[
] ## Binarized features .stress[strong binarsity penalization] .center[
] --- # Decision function of a logistic regression On 3 toy datasets with $n=1000$, $p=2$ and $d_1 = d_2 = 100$ .center[
]

---
class: center, middle, inverse

# Some theoretical guarantees

### [We can totally skip this]

---
# Assumptions on the GLM

`$$ \mathbb P(y | x) = \exp \Big( \frac{y m^0(x) - b(m^0(x))}{\phi} + c(y, \phi) \Big) $$`

1. $b$ is three times continuously differentiable
1. `$|b'''(z)| \leq C_b |b''(z)|$` for some `$C_b > 0$`
1. `$C_n = \max_{i=1, \ldots, n}|m^0(x_i)| < \infty$`
1. `$L_n \leq \max_{i=1, \ldots, n} b''\big(m^0(x_i)\big) \leq U_n$`

Satisfied by the following standard GLMs, with

.pure-table.pure-table-bordered.pure-table-striped.smaller-font[
| Model | $\phi$ | $b(z)$ | $b'(z)$ | $b''(z)$ | $b'''(z)$ | $C_b$ | $L_n$ | $U_n$ |
|-------|--------|--------|---------|----------|-----------|-------|-------|-----|
| Normal | $\sigma^2$ | $\frac{z^2}{2}$ | $z$ | $1$ | $0$ | $0$ | $1$ | $1$ |
| Logistic | $1$ | $\log(1 + e^z)$ | $\frac{e^z}{1+e^z}$ | $\frac{e^z}{(1+e^z)^2}$ | $\frac{1 - e^z}{1+e^z} b''(z)$ | 2 | $\frac{e^{C_n}}{(1 + e^{C_n})^2}$ | $\frac{1}{4}$ |
| Poisson | $1$ | $e^z$ | $e^z$ | $e^z$ | $b''(z)$ | 1 | $e^{-C_n}$ | $e^{C_n}$ |
]

---
# What is an oracle inequality?

Expected risk for a GLM (Van de Geer 2008)
`$$ R(m_{\theta}) = \dfrac 1n \sum_{i=1}^n \big\{- b'\big(m^0(X_i)\big) m_{\theta}(X_i) + b\big(m_{\theta}(X_i)\big)\big\} $$`

Non-asymptotic .stress[oracle inequality]

- How close is $m_{\hat \theta}$ to the minimal expected risk?
- Measured by the excess risk `$R(m_{\hat\theta}) - R(m^0)$`

For fixed $n$, prove something like
`$$ R(m_{\hat\theta}) - R(m^0) \leq \inf_{\theta} \Big\{ R(m_{\theta}) - R(m^0) + \frac{\text{complexity}(\theta)}{n} \Big\} $$`

For `$\ell_1$` penalization (`$\norm{u}_1 = \sum_{j=1}^p |u_j|$`) one has
`$$ \text{complexity}(\theta) = s(\theta) \log p $$`
where `$s(\theta) = \text{sparsity}(\theta) = |\{ j=1, \ldots, p : \theta_j \neq 0 \}|$` (Bickel, Ritov and Tsybakov 2009, among many others)

---
# A new measure of sparsity: .stress[binarsity]

We consider .stress[a different sparsity] measure

- For $\theta \in \R^d,$ let `$J(\theta)= [J_1 (\theta),\ldots, J_p (\theta)]$` where `$$ J_j(\theta)= \{k = 2, \ldots, d_j \; : \; \theta_{j,k} \neq \theta_{j,k-1} \}. $$`
- Let `$J^\complement(\theta)= \big[J_1^\complement(\theta),\ldots,J_p^\complement(\theta)\big]$` be the complement of $J(\theta).$
- If `$J =[J_1, \ldots, J_p]$`, put `$|J| = \sum_{j=1}^p |J_j|$` where $|E| = $ cardinality of $E$
- The measure of sparsity we use is `$$ \text{binarsity}(\theta) = |J(\theta)| = \sum_{j=1}^p |J_j(\theta)| = \sum_{j=1}^p | \{k = 2, \ldots, d_j : \theta_{j,k} \neq \theta_{j,k-1} \}| $$` It counts the .stress[number of non-equal consecutive values of] $\theta$
- If $\theta$ is block-sparse, namely $|\cJ(\theta)| \ll p$ where `$\cJ(\theta) = \{ j = 1, \ldots, p : \theta_{j, \bullet} \neq 0_{d_j} \}$`, then `$$ |J(\theta)| \leq |\cJ(\theta)| \max_{j \in \cJ(\theta)} |J_j(\theta)| $$` which means that $|J(\theta)|$ is controlled by the block sparsity $|\cJ(\theta)|$.
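---
# A new measure of sparsity: .stress[binarsity]

A toy numerical illustration of this sparsity measure (a sketch with the blocks stored as a list of `numpy` arrays; the name `binarsity_sparsity` is illustrative):

```python
import numpy as np

def binarsity_sparsity(theta_blocks):
    """|J(theta)|: number of non-equal consecutive weights, summed over the p blocks."""
    return sum(int(np.sum(block[1:] != block[:-1])) for block in theta_blocks)

# p = 2 features with d_1 = d_2 = 5: the first block is piecewise constant with a single
# jump, the second block is identically zero (feature 2 is "killed")
theta = [np.array([0.3, 0.3, -0.1, -0.1, -0.1]), np.zeros(5)]
print(binarsity_sparsity(theta))  # 1, while the dimension is d = 10
```

The zero block contributes nothing, consistent with the bound above.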
---
# Restricted eigenvalues assumption

- Define `$$ \kappa (K) = \inf_{u \in \mathscr{C}(K) \backslash \{ 0 \} } \bigg\{ \frac{\norm{\XB u}_2}{\sqrt n \norm{u_K}_2} \bigg\} $$` with `$$ \mathscr{C}(K) = \bigg\{u \in \R^d: \sum_{j=1}^p \| (u_{j, \bullet})_{{K_j}^\complement} \|_{\text{TV}} \leq 2\sum_{j=1}^p \|(u_{j, \bullet})_{K_j} \|_{\TV} \bigg\} $$` where `$(u_K)_k = u_k$` if $k \in K$ and `$(u_K)_k = 0$` if $k \not \in K$, and `$$ \|(u_{j, \bullet}) \|_{\TV} = \sum_{k=2}^{d_j} \lambda_{j, k} |u_{j, k} - u_{j, k-1}| $$`
- The set $\mathscr{C}(K)$ is a cone composed of all vectors with a support "close" to $K$

**Assumption.** We have $\kappa (K) > 0$ for any $K$ such that `$|K| \leq J^\star $`

---
# Theorem

Define
`$$ \lambda_{j,k} = \sqrt{\frac{2U_n\phi(A +\log d)}{n}\, \pi_{j,k}} $$`
for some $A > 0$ and consider
`$$ \hat \theta \in \argmin_{\theta \in B_d(\rho)} \big\{R_n(\theta) + \bina(\theta) \big\}, $$`
where
`$$ B_d(\rho) = \Big \{\theta \in \R^d: \sum_{j=1}^p\norm{\theta_{j , \bullet}}_\infty \leq \rho \Big\} $$`

A standard constraint in the literature for the **proof of oracle inequalities for sparse GLMs** (Van de Geer 2008)

---
# Theorem

Then
`$$ \begin{align*} R(m_{\hat \theta}) - R(m^0) \leq & \inf_{\theta} \Big\{ 3 (R(m_{\theta}) - R(m^0)) \\ & \quad + \frac{c (C_b(C_n + \rho) + 2)}{L_n \kappa^2(J(\theta))} \; |J(\theta)| \; \max_{j=1, \ldots, p} \norm{(\lambda_{j,\bullet})_{J_j(\theta)}}_\infty^2 \Big\}, \end{align*} $$`
with probability at least $1 -2e^{-A}$, where the inf is taken over `$\theta \in B_d(\rho)$` such that `$n_j^\top \theta_{j, \bullet} = 0$` for all `$j=1, \ldots, p$` and such that `$|J(\theta)| \leq J^\star$`.

The **complexity term** satisfies
`$$ |J(\theta)| \max_{j=1, \ldots, p}\norm{(\lambda_{j,\bullet})_{J_j(\theta)}}_\infty^2 \leq 2 U_n\phi \frac{|J(\theta)|(A + \log d)}{n}. $$`

- This means that $\hat \theta$ .stress[achieves an optimal tradeoff between bias] $R(m_{\theta}) - R(m^0)$ .stress[and sparsity] $|J(\theta)|$
- .stress[Fast rate] of order `$(\text{binarsity}(\theta) \times \log d) /n$`

---
# Remarks

- $\theta \in B_d(\rho)$ is a **standard technical assumption**
- Related to the use of .stress[self-concordance] (Bach 2010) to connect the empirical squared $\ell_2$-norm and the empirical Kullback divergence
- Common for the **proof of oracle inequalities for sparse GLMs**, see Van de Geer (2008) and Ivanoff et al. (2016) for Poisson regression
- Note that `$$ \max_{i=1,\ldots,n} | \inr{x_i^B, \theta} | \leq \sum_{j=1}^p \norm{\theta_{j , \bullet}}_\infty \leq |\cJ(\theta)| \times \norm{\theta}_\infty, $$` where `$\norm{\theta}_\infty = \max_{j=1, \ldots, p} \norm{\theta_{j, \bullet}}_\infty$`.
- So that $\theta \in B_d(\rho)$ is entailed by a box constraint on $\theta$, which depends on `$|\cJ(\theta)|$` (and not $p$)

---
# Sparse additive model

- $x_i \in [0, 1]^p$ for all $i=1, \ldots, n$.
- .stress[Gaussian setting] with the least-squares loss $\ell(y, y') = \frac 12 (y - y')^2$, $b(y) = \frac 12 y^2$ and $\phi = \sigma^2$ (noise variance) with $L_n = U_n = 1$, $C_b = 0$
- $m^0$ has a .stress[sparse additive structure] `$$ m^0(x) = \sum_{j \in \cJ_*} m_j^0(x_j) $$` for `$x = [x_1 \cdots x_p] \in \R^p$`, where `$m_j^0$` satisfies `$$ |m_j^0(z) - m_j^0(z')| \leq L |z - z'| $$` for any `$z, z' \in \R$` and where `$\cJ_* \subset \{ 1, \ldots, p \}$` is a **set of active features** (sparsity means that `$|\cJ_*| \ll p$`).
Also, assume
`$$ \sum_{i=1}^n m_j^0(x_{i, j}) = 0 \quad \text{ for all } \quad j=1, \ldots, p $$`

---
# Sparse additive model

- .stress[Identifiability] and .stress[smoothness] requirements
- Standard when studying additive models (Meier et al. 2009)
- Restrict $m_j^0$ to be **Lipschitz** and not smoother: binarsity produces a .stress[piecewise constant decision function] with respect to each $j$

## Theorem

Consider `$d_j = D = n^{1/3}$` and `$I_{j, 1} = [0, \frac{1}{D}]$, $I_{j, k} = I_{k} = (\frac{k-1}{D}, \frac{k}{D}]$`. Introduce
`$$ \theta_{j, k}^* = \frac{\sum_{i=1}^n m_j^0(x_{i, j}) \ind{I_k}(x_{i, j})}{\sum_{i=1}^n \ind{I_k}(x_{i, j})} $$`
for `$j \in \cJ_*$` and `$\theta_{j, \bullet}^* = \boldsymbol 0_D$` for `$j \notin \cJ_*$`. Then,
`$$ \norm{m_{\hat \theta} - m^0}_n^2 \leq \Big(3 L^2 |\cJ_*| + \frac{c M_n \sigma^2 (A + \log(p n^{1/3} M_n))}{\kappa^2(J(\theta^*))} \Big) \frac{|\cJ_*|}{n^{2/3}}, $$`
where `$M_n = \max_{j=1, \ldots, p} \max_{i=1, \ldots, n} |m_j^0(x_{i, j})|.$`

---
# Remarks

- Convergence rate of order `$|\cJ_*|^2 n^{-2/3}$`
- Matches the rate $n^{-2 / 3} = n^{-2r / (2r + 1)}$ under the Lipschitz assumption ($r = 1$)
- Matches the rate `$|\cJ_*|^2 n^{-4 / 5} = |\cJ_*|^2 n^{-2r / (2r + 1)} $` from Van de Geer and Bühlmann (2011) under $C^2$ smoothness, namely $r = 2$
- .stress[Dimension reduction] thanks to additive modelling (the full non-parametric rate is `$n^{-2r / (2r + p)}$`)
- .stress[Binarsity matches the literature] in the particular case of a **sparse additive model**

---
class: center, middle, inverse

# Implementation

### [And useful tools]

---
# Optimization problem

Need to minimize
`$$ R_n(\theta) + \bina(\theta) $$`
where `$R_n(\cdot)$` is **smooth** (gradient Lipschitz) and $\bina(\cdot)$ is **non-differentiable**

First-order optimization (Bach et al. 2012); the simplest method is **proximal gradient descent**
`$$ \theta^{k+1} \gets \prox_{\eta \bina(\cdot)} ( \theta^k - \eta \nabla R_n(\theta^k) ) $$`
where $\eta > 0$ is the learning rate and `$\prox_g$` is the .stress[proximal operator]
`$$ \prox_g(v) \in \argmin_{u \in \R^d} \Big\{\frac 12 \norm{v- u}_2^2 + g(u) \Big\} $$`

In our implementation: **more advanced optimization techniques** (SGD with variance reduction, waaaaaaay faster)

---
# Proximal operator of binarsity

The following algorithm .stress[computes the proximal operator] of $\bina(\theta)$

--------
1. **Input:** `$\theta \in \R^d$`, `$\lambda_{j, k}$` and `$n_{j, k}$` for `$j=1, \ldots, p$` and `$k=1, \ldots, d_j$`
1. **Output:** `$\eta = \prox_{\bina}(\theta)$`
1. **for** $j=1, \ldots, p$ **do**
  - `$\beta_{j,\bullet} \gets \prox_{\norm{\theta_{j, \bullet}}_{\TV}} (\theta_{j, \bullet})$` $\qquad\quad$ (weighted TV penalization in block `$\theta_{j, \bullet}$`)
  - `$\eta_{j,\bullet} \gets \beta_{j,\bullet} - \frac{n_j^\top \beta_{j,\bullet}}{\norm{n_j}_2^2} n_j$` $\qquad\qquad$ (project onto `$\text{span}(n_j)^\perp$`)
1. **Return:** $\eta$

--------

- Fast
- Main difficulty is `$\prox_{\norm{\cdot}_{\TV}}$`

---
# Proximal operator of binarsity

The following computes `$\beta = \prox_{\norm{\cdot}_\TV}(\theta)$` where `$\norm{u}_{\TV} = \sum_{j=2}^m \lambda_{j} |u_j - u_{j-1}|$`

.pull-left[
] .pull-right[
]

.stress[Modification of an algorithm by Condat (2013)] to include weights $\lambda_j$

---
# Proximal operator of binarsity

If you like .stress[proximal operators], have a look at

.center[https://x-datainitiative.github.io/tick/modules/prox.html]

## Example

```python
import numpy as np
from tick.prox import ProxTV, ProxBinarsity

x = np.random.randn(50)
strength = 0.4

prox_tv = ProxTV(strength=strength)
x_tv = prox_tv.call(x)

prox_bina = ProxBinarsity(
    strength=strength,
    blocks_start=np.arange(0, 50, 10),
    blocks_length=10 * np.ones((5,))
)
x_bina = prox_bina.call(x)
```

---
class: center, middle, inverse

# Numerical experiments

### [On a bunch of datasets]

---
# Datasets

We consider the following datasets for .stress[binary classification]

.pure-table.pure-table-bordered.pure-table-striped[
| Dataset | \#samples | \#features |
|--------|-----|-----|
| Ionosphere | 351 | 34 |
| Churn | 3333 | 21 |
| Default of credit card | 30000 | 24 |
| Adult | 32561 | 14 |
| Bank marketing | 45211 | 17 |
| Covertype | 550088 | 10 |
| SUSY | 5000000 | 18 |
| HEPMASS | 10500000 | 28 |
| HIGGS | 11000000 | 24 |
]

---
# Baselines

We compare our procedure to the following .stress[baselines]

.pure-table.pure-table-bordered.pure-table-striped[
| Procedure | Description |
|-----------|-------------|
| Lasso | Logistic regression (LR) with $\ell_1$ penalization |
| Group Lasso | LR with group $\ell_1$ penalization |
| Group TV | LR with group total-variation penalization |
| SVM | Support vector machine with Gaussian kernel |
| GAM | Generalized Additive Model |
| RF | Random Forest |
| GB | Gradient Boosting |
]

---
# Comparison of training times

.center[
]

- Binarsity is usually .stress[orders of magnitude faster] than RF, SVM, GB and GAM
- With .stress[similar performance]

---
# Comparison of ROC curves

.center[
] --- # Computation time ## On a .stress[dense] dataset .center[
]

## On a .stress[sparse] dataset

- $\ell_1$ penalization (or any separable penalization) can benefit from .stress[lazy updates]
- Total-variation is not separable: no lazy updates in SGD steps

---
# About the number of bins

- What about the .stress[number of bins] used in one-hot encoding?
- .stress[Do we need to care about it] when using the binarsity penalization?

.center[
]

--

- .large[**No!**]

--

- Well, **not really**... use 50, 100 or 200 bins (depending on $n$...)

---
# Try this yourself: just install .stress[`tick`]

Just type (ideally in a `Python3.6` environment)

```bash
pip install tick
```

.pull-left[
.center[
]

.stress[Documentation] is here: .small[https://x-datainitiative.github.io/tick]

.stress[Code] is here: .small[https://github.com/X-DataInitiative/tick]
]

.pull-right[
]

---
# Try this yourself

```python
# Input: features X, labels y
from tick.inference import LogisticRegression
from tick.preprocessing import FeaturesBinarizer
from sklearn.model_selection import train_test_split

# Binarize data
binarizer = FeaturesBinarizer(n_cuts=50)
X = binarizer.fit_transform(X)

# Shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

# Fit the model
learner = LogisticRegression(
    penalty='binarsity', C=1,
    blocks_start=binarizer.blocks_start,
    blocks_length=binarizer.blocks_length
)
learner.fit(X_train, y_train)

# Predict on the test set
y_pred = learner.predict_proba(X_test)[:, 1]
```

---
class: center, middle, inverse

# Conclusion

---
# Take home message

- .stress[New penalization] for one-hot encoded continuous features
- Good .stress[compromise between performance and training time]
- And .stress[interpretability] (change points)
- Nice .stress[theoretical properties]
- Go use it (and **make money**)!

## Paper
M.Z. Alaya, S. Bussy, S. Gaïffas, A. Guilloux, *Binarsity: a penalization for one-hot encoded features*, in revision in *JMLR* ## Code
Code available at `https://github.com/SimonBussy/binarsity`

`tick` library available at `https://github.com/X-DataInitiative/tick`

---
class: center, middle, inverse

# Thank you!

### [Questions?]
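
---
# Backup: proximal gradient descent in a few lines

A minimal sketch (not the `tick` solvers used in the experiments) of the proximal gradient iteration shown earlier, using `ProxBinarsity` from `tick.prox` on a toy binarized design matrix; the names `X_B`, `grad`, the toy data and the fixed step size are illustrative, and for simplicity the step size is folded into `strength`.

```python
import numpy as np
from tick.prox import ProxBinarsity

# Toy one-hot encoded design matrix: p = 5 features with d_j = 10 bins each
n, p, d_j = 500, 5, 10
rng = np.random.RandomState(0)
bins = rng.randint(d_j, size=(n, p))
X_B = np.zeros((n, p * d_j))
X_B[np.arange(n)[:, None], np.arange(p) * d_j + bins] = 1.0
y = np.sign(rng.randn(n))  # labels in {-1, +1}

prox = ProxBinarsity(strength=1e-2,
                     blocks_start=np.arange(0, p * d_j, d_j),
                     blocks_length=d_j * np.ones(p))

def grad(theta):
    """Gradient of the logistic goodness-of-fit R_n(theta)."""
    z = X_B.dot(theta)
    return -X_B.T.dot(y / (1.0 + np.exp(y * z))) / n

eta = 0.5  # fixed step size for this sketch
theta = np.zeros(p * d_j)
for _ in range(200):
    theta = prox.call(theta - eta * grad(theta))  # theta^{k+1} <- prox(theta^k - eta * grad)
```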