Sitemap

A list of all the posts and pages found on the site. For you robots out there is an XML version available for digesting as well.

Page Not Found

Page not found. Your pixels are in another canvas. Read more

Page not in menu

This is a page not in th emain menu Read more

Jupyter notebook markdown generator

Posts

Future Blog Post

less than 1 minute read

Published: January 01, 2199

This post will show up by default. To disable scheduling of future posts, edit config.yml and set future: false. Read more

Blog Post number 4

less than 1 minute read

Published: August 14, 2015

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool. Read more

Blog Post number 3

less than 1 minute read

Published: August 14, 2014

Blog Post number 2

less than 1 minute read

Published: August 14, 2013

Blog Post number 1

less than 1 minute read

Published: August 14, 2012

portfolio

Portfolio item number 1

Short description of portfolio item number 1
Read more

Portfolio item number 2

Short description of portfolio item number 2
Read more

publications

Convergence rates for pointwise curve estimation with a degenerate design

S. Gaïffas

Mathematical Methods of Statistics, 2005

The nonparametric regression with a random design model is considered. We want to recover the regression function at a point $x_0$ where the design density is vanishing or exploding. Read more

PDF Download here

Sharp estimation in sup norm with random design

S. Gaïffas

Statistics and Probability Letters, 2007

In this paper, we study the estimation of a function based on noisy inhomogeneous data (the amount of data can vary on the estimation domain). We consider the model of regression with random design, where the design density is unknown. Read more

PDF Download here

Optimal rates and adaptation in the single-index model using aggregation

S. Gaïffas and G. Lecué

Electronic Journal of Statistics, 2007

We want to recover the regression function in the single-index model. Using an aggregation algorithm with local polynomial estimators, we answer in particular to the second part of Question 2 from Stone (1982) on the optimal convergence rate. The procedure constructed here has strong adaptation properties: it adapts both to the smoothness of the link function and to the unknown index. Read more

PDF Download here

On pointwise adaptive curve estimation based on inhomogeneous data

S. Gaïffas

ESAIM: Probability and Statistics, 2007

We want to recover a signal based on noisy inhomogeneous data (the amount of data can vary strongly on the estimation domain). We model the data using nonparametric regression with random design, and we focus on the estimation of the regression at a fixed point $x_0$ with little, or much data. Read more

PDF Download here

Aggregation of penalized empirical risk minimizers in regression

S. Gaïffas and G. Lecué

arXiv preprint, 2008

We give a general result concerning the rates of convergence of penalized empirical risk minimizers (PERM) in the regression model. Then, we consider the problem of agnostic learning of the regression, and give in this context an oracle inequality and a lower bound for PERM over a finite class. Read more

PDF Download here

Uniform estimation of a signal based on inhomogeneous data

S. Gaïffas

Statistica Sinica, 2009

he aim of this paper is to recover a signal based on inhomogeneous noisy data (the amount of data can vary strongly from one point to another.) In particular, we focus on the understanding of the consequences of the inhomogeneity of the data on the accuracy of estimation. For that purpose, we consider the model of regression with a random design, and we consider the minimax framework. Read more

PDF Download here

Learning and adaptive estimation for marker-dependent counting processes

S. Gaïffas and A. Guilloux

arXiv preprint, 2009

We consider the problem of statistical learning for the intensity of a counting process with covariates. In this context, we introduce an empirical risk, and prove risk bounds for the corresponding empirical risk minimizers. Then, we give an oracle inequality for the popular algorithm of aggregation with exponential weights. Read more

PDF Download here

Adaptive estimation of the conditional intensity of marker-dependent counting processes

F. Comte, S. Gaïffas and A. Guilloux

Annales de l’Institut Henri Poincaré - Probabilités et Statistiques, 2010

We propose in this work an original estimator of the conditional intensity of a marker-dependent counting process, that is, a counting process with covariates. We use model selection methods and provide a nonasymptotic bound for the risk of our estimator on a compact set. Read more

PDF Download here

Weighted algorithms for compressed sensing and matrix completion

S. Gaïffas and G. Lecué

arXiv preprint, 2011

This paper is about iteratively reweighted basis-pursuit algorithms for compressed sensing and matrix completion problems. In a first part, we give a theoretical explanation of the fact that reweighted basis pursuit can improve a lot upon basis pursuit for exact recovery in compressed sensing. We exhibit a condition that links the accuracy of the weights to the RIP and incoherency constants, which ensures exact recovery. Read more

PDF Download here

Nonparametric regression with martingale increment errors

S. Delattre and S. Gaïffas

Stochastic Processes and their Applications, 2011

We consider the problem of adaptive estimation of the regression function in a framework where we replace ergodicity assumptions (such as independence or mixing) by another structural assumption on the model. Namely, we propose adaptive upper bounds for kernel estimators with data-driven bandwidth (Lepski’s selection rule) in a regression model where the noise is an increment of martingale. Read more

PDF Download here

Sharp Oracle Inequalities for High-Dimensional Matrix Prediction

S. Gaïffas and G. Lecué

IEEE Transaction on Information Theory, 2011

We observe $(X_i, Y_i)$ where the $Y_i$ are real valued outputs and the $X_i$ are random matrices. We observe a new entry $X$ and we want to predict the output $Y$ associated with it. Read more

PDF Download here

Hyper-Sparse Optimal Aggregation

S. Gaïffas and G. Lecué

Journal of Machine Learning Research, 2011

Given a finite set $F$ of functions and a learning sample, the aim of an aggregation procedure is to have a risk as close as possible to risk of the best function in $F$. Up to now, optimal aggregation procedures are convex combinations of every elements of $F$. In this paper, we prove that optimal aggregation procedures combining only two functions in $F$ exist. Read more

PDF Download here

High-dimensional additive hazards models and the Lasso

S. Gaïffas and A. Guilloux

Electronic Journal of Statistics, 2012

We consider a general high-dimensional additive hazards model in a non-asymptotic setting, including regression for censored-data. In this context, we consider a Lasso estimator with a fully data-driven $\ell_1$-penalization, which is tuned for the estimation problem at hand. We prove sharp oracle inequalities for this estimator. Read more

PDF Download here

Link Prediction in Graphs with Autoregressive Features

E. Richard, S. Gaïffas and N. Vayatis

NIPS, 2012

In the paper, we consider the problem of link prediction in time-evolving graphs. We assume that certain graph features, such as the node degree, follow a vector autoregressive (VAR) model and we propose to use this information to improve the accuracy of prediction. Our strategy involves a joint optimization procedure over the space of adjacency matrices and VAR matrices which takes into account both sparsity and low rank properties of the matrices. Read more

PDF Download here

Sparse Bayesian Unsupervised Learning

S. Gaïffas and B. Michel

arXiv preprint, 2014

This paper is about variable selection, clustering and estimation in an unsupervised high-dimensional setting. Our approach is based on fitting constrained Gaussian mixture models, where we learn the number of clusters $K$ and the set of relevant variables $S$ using a generalized Bayesian posterior with a sparsity inducing prior. Read more

PDF Download here

Link Prediction in Graphs with Autoregressive Features

E. Richard, S. Gaïffas and N. Vayatis

Journal of Machine Learning Research, 2014

PDF Download here

Learning the Intensity of Time Events With Change-Points

M. Z. Alaya, S. Gaïffas and A. Guilloux

IEEE Transactions on Information Theory, 2015

We consider the problem of learning the inhomogeneous intensity of a counting process, under a sparse segmentation assumption. We introduce a weighted total-variation penalization, using data-driven weights that correctly scale the penalization along the observation interval. We prove that this leads to a sharp tuning of the convex relaxation of the segmentation prior, by stating oracle inequalities with fast rates of convergence, and consistency for change-points detection. Read more

PDF Download here

Mean-field inference of Hawkes point processes

E. Bacry, S. Gaïffas, I. Mastromatteo and J.-F. Muzy

Journal of Physics A: Mathematical and Theoretical, 2016

We propose a fast and efficient estimation method that is able to accurately recover the parameters of a $d$-dimensional Hawkes point-process from a set of observations. We exploit a mean-field approximation that is valid when the fluctuations of the stochastic intensity are small. We show that this is notably the case in situations when interactions are sufficiently weak, when the dimension of the system is high or when the fluctuations are self-averaging due to the large number of past events they involve. Read more

PDF Download here

SGD with Variance Reduction beyond Empirical Risk Minimization

M. Achab, A. Guilloux, S. Gaïffas and E. Bacry

International Conference of Monte Carlo Methods and Applications, 2017

We introduce a doubly stochastic proximal gradient algorithm for optimizing a finite average of smooth convex functions, whose gradients depend on numerically expensive expectations. Indeed, the effectiveness of SGD-like algorithms relies on the assumption that the computation of a subfunction’s gradient is cheap compared to the computation of the total function’s gradient. This is true in the Empirical Risk Minimization (ERM) setting, but can be false when each subfunction depends on a sequence of examples. Our main motivation is the acceleration of the optimization of the regularized Cox partial-likelihood (the core model in survival analysis), but other settings can be considered as well. Read more

PDF Download here

High dimensional matrix estimation with unknown variance of the noise

S. Gaïffas and O. Klopp

Statistica Sinica, 2017

Assume that we observe a small set of entries or linear combinations of entries of an unknown matrix $A_0$ corrupted by noise. We propose a new method for estimating $A_0$ that does not rely on the knowledge or on an estimation of the standard deviation of the noise $\sigma$. Our estimator achieves, up to a logarithmic factor, optimal rates of convergence under Frobenius risk and, thus, has the same prediction performance as previously proposed estimators that rely on the knowledge of $\sigma$. Some numerical experiments show the benefits of this approach. Read more

PDF Download here

Concentration inequalities for matrix martingales in continuous time

E. Bacry, S. Gaïffas and J-F Muzy

Probability Theory and Related Fields, 2017

This paper gives new concentration inequalities for the spectral norm of a wide class of matrix martingales in continuous time. These results extend previously established Freedman and Bernstein inequalities for series of random matrices to the class of continuous time processes. Our analysis relies on a new supermartingale property of the trace exponential proved within the framework of stochastic calculus. We provide also several examples that illustrate the fact that our results allow us to recover easily several formerly obtained sharp bounds for discrete time matrix martingales. Read more

PDF Download here

Uncovering Causality from Multivariate Hawkes Integrated Cumulants

M. Achab, E. Bacry, S. Gaïffas, I. Mastromatteo and J.-F. Muzy

ICML, 2017

We design a new nonparametric method that allows one to estimate the matrix of integrated kernels of a multivariate Hawkes process. This matrix not only encodes the mutual influences of each node of the process, but also disentangles the causality relationships between them. Our approach is the first that leads to an estimation of this matrix without any parametric modeling and estimation of the kernels themselves. Read more

PDF Download here

Universal consistency and minimax rates for online Mondrian Forests

J. Mourtada, S. Gaïffas and E. Scornet

NIPS, 2017

We establish the consistency of an algorithm of Mondrian Forests, a randomized classification algorithm that can be implemented online. First, we amend the original Mondrian Forest algorithm, that considers a fixed lifetime parameter. Indeed, the fact that this parameter is fixed hinders the statistical consistency of the original procedure. Read more

PDF Download here

High-dimensional robust regression and outliers detection with SLOPE

A. Virouleau, A. Guilloux, S. Gaïffas and M. Bogdan

arXiv preprint, 2017

The problems of outliers detection and robust regression in a high-dimensional setting are fundamental in statistics, and have numerous applications. Following a recent set of works providing methods for simultaneous robust regression and outliers detection, we consider in this paper a model of linear regression with individual intercepts, in a high-dimensional setting. We introduce a new procedure for simultaneous estimation of the linear regression coefficients and intercepts, using two dedicated sorted-$\ell_1$ penalizations, also called SLOPE. Read more

PDF Download here

Design d’un algorithme d’IA en grande dimension pour prédire la réadmission à l’hôpital

S. Bussy, R. Veil, V. Looten, A Burgun, S. Gaïffas, A. Guilloux, B. Ranque, A.-S. Jannot

conférence 'L'Intelligence Artificielle en Santé', 2018

La production d’un algorithme d’intelligence artificielle (IA) à partir de données de vie réelle réside dans l’intégration et la transformation d’un très grand nombre de covariables (grande dimension), puis dans la construction d’un modèle d’apprentissage. Dans cet article, nous allons détailler les différentes étapes de ce processus à travers une illustration sur des données de soin réutilisées pour prédire la réadmis- sion à l’hôpital après une crise vaso-occlusive chez les patients atteints de drépanocytose. Read more

PDF Download here

Machine learning and big data in health: the partnership between Caisse Nationale d’Assurance Maladie and Ecole polytechnique

Emmanuel Bacry, Stéphane Gaïffas

Livre IA et Santé : présent et futur, Académie Nationale de Médecine, 2018

The Caisse Nationale d’Assurance Maladie (Cnam) and the Ecole Polytechnique signed at the end of 2014 a research partnership agreement for a period of 3 years which was renewed for 3 years at the beginning of 2018. The objective of this partnership is to develop big data technologies and machine learning based on data from the Système National Inter-Régimes de l’Assurance Maladie (Sniiram). Read more

PDF Download here

C-mix: A high-dimensional mixture model for censored durations, with applications to genetic data

S. Bussy, A. Guilloux, S. Gaïffas and A.-S. Jannot

Statistical Methods in Medical Research, 2018

We introduce a supervised learning mixture model for censored durations (C-mix) to simultaneously detect subgroups of patients with different prognosis and order them based on their risk. Our method is applicable in a high-dimensional setting, i.e. with a large number of biomedical covariates. Indeed, we penalize the negative log-likelihood by the Elastic-Net, which leads to a sparse parameterization of the model and automatically pinpoints the relevant covariates for the survival prediction. Read more

PDF Download here

Uncovering Causality from Multivariate Hawkes Integrated Cumulants

M. Achab, E. Bacry, S. Gaïffas, I. Mastromatteo and J.-F. Muzy

Journal of Machine Learning Research, 2018

We design a new nonparametric method that allows one to estimate the matrix of integrated kernels of a multivariate Hawkes process. This matrix not only encodes the mutual influences of each node of the process, but also disentangles the causality relationships between them. Our approach is the first that leads to an estimation of this matrix without any parametric modeling and estimation of the kernels themselves. Read more

PDF Download here

tick: a Python library for statistical learning, with a particular emphasis on time-dependent modeling

E. Bacry, M. Bompaire, P. Deegan, S. Gaïffas and S. V. Poulsen

Journal of Machine Learning Research, 2018

tick is a statistical learning library for Python 3, with a particular emphasis on time-dependent models, such as point processes, and tools for generalized linear models and survival analysis. The core of the library is an optimization module providing model computational classes, solvers and proximal operators for regularization. Read more

PDF Download here

Sparse inference of the drift of a high-dimensional Ornstein–Uhlenbeck process

S. Gaïffas and G. Matulewicz

Journal of Multivariate Analysis, 2018

Given the observation of a high-dimensional Ornstein–Uhlenbeck (OU) process in continuous time, we are interested in inference on the drift parameter under a row-sparsity assumption. Towards that aim, we consider the negative log-likelihood of the process, penalized by an $\ell_1$-penalization (Lasso and Adaptive Lasso). We provide both finite and large-sample results for this procedure, by means of a sharp oracle inequality, and a limit theorem in the long-time asymptotics, including asymptotic consistency for variable selection. Read more

PDF Download here

Description and assessment of trustability of motives for self-exclusion reported by online poker gamblers in a cohort using account-based gambling data

A. Luquiens, D. Vendryes, H.-J. Aubin, A. Benyamina, S. Gaiffas, E. Bacry

BMJ Open, 2018

Self-exclusion is one of the main responsible gambling tools. The aim of this study was to assess the reliability of self-exclusion motives in self-reports to the gambling service provider. Read more

PDF Download here

Dual optimization for convex constrained objectives without the gradient-Lipschitz assumption

Martin Bompaire, Stéphane Gaïffas and Emmanuel Bacry

arXiv preprint, 2018

The minimization of convex objectives coming from linear supervised learning problems, such as penalized generalized linear models, can be formulated as finite sums of convex functions. For such problems, a large set of stochastic first-order solvers based on the idea of variance reduction are available and combine both computational efficiency and sound theoretical guarantees (linear convergence rates). Read more

PDF Download here

Binarsity: a penalization for one-hot encoded features in linear supervised learning

M. Z. Alaya, S. Bussy, S. Gaïffas and A. Guilloux

Journal of Machine Learning Research, 2019

This paper deals with the problem of large-scale linear supervised learning in settings where a large number of continuous features are available. We propose to combine the well-known trick of one-hot encoding of continuous features with a new penalization called binarsity. In each group of binary features coming from the one-hot encoding of a single raw continuous feature, this penalization uses total-variation regularization together with an extra linear constraint. Read more

PDF Download here

Comparison of methods for early-readmission prediction in a high-dimensional heterogeneous covariates and time-to-event outcome framework

Simon Bussy, Raphaël Veil, Vincent Looten, Anita Burgun, Stéphane Gaïffas, Agathe Guilloux, Brigitte Ranque and Anne-Sophie Jannot

BMC Medical Research Methodology, 2019

Choosing the most performing method in terms of outcome prediction or variables selection is a recurring problem in prognosis studies, leading to many publications on methods comparison. But some aspects have received little attention. Read more

PDF Download here

ConvSCCS: convolutional self-controlled case series model for lagged adverse event detection

M. Morel, E. Bacry, S. Gaïffas, A. Guilloux and F. Leroy

Biostatistics, 2019

With the increased availability of large electronic health records databases comes the chance of enhancing health risks screening. Most post-marketing detection of adverse drug reaction (ADR) relies on physicians’ spontaneous reports, leading to under-reporting. To take up this challenge, we develop a scalable model to estimate the effect of multiple longitudinal features (drug exposures) on a rare longitudinal outcome. Read more

PDF Download here

On the optimality of the Hedge algorithm in the stochastic regime

Jaouad Mourtada, Stéphane Gaïffas

Journal of Machine Learning Research, 2019

In this paper, we study the behavior of the Hedge algorithm in the online stochastic setting. We prove that anytime Hedge with decreasing learning rate, which is one of the simplest algorithm for the problem of prediction with expert advice, is remarkably both worst-case optimal and adaptive to the easier stochastic and adversarial with a gap problems. Read more

PDF Download here

Screening anxiolytics, hypnotics, antidepressants and neuroleptics for bone fracture risk among elderly: a nation-wide dynamic multivariate self-control study using the SNDS claims database.

Maryan Morel, Benjamin Bouyer, Agathe Guilloux, Moussa Laanani, Fanny Leroy, Dinh Phong Nguyen, Youcef Sebiat, Emmanuel Bacry and Stephane Gaiffas

submitted to Drugs Safety, 2020

Existing screening works provide point estimates for drug-outcome pairs risk assessment. We propose a flexible approach based on dynamic risk estimates to support alert generation while providing additional information on risk qual- ification (delay, shape) and LOD-specific biases. We illustrate this approach by studying the longitudinal effect of anxiolytic, hypnotic, antidepressant, and neuroleptic molecules on fractures using SNDS, a French large healthcare claims database Read more

PDF Download here

Sparse and low-rank multivariate Hawkes processes

E. Bacry, M. Bompaire, S. Gaïffas and J-F. Muzy

Journal of Machine Learning Research, 2020

We consider the problem of unveiling the implicit network structure of node interactions (such as user interactions in a social network), based only on high-frequency timestamps. Our inference is based on the minimization of the least-squares loss associated with a multivariate Hawkes model, penalized by $\ell_1$ and trace norm of the interaction tensor. We provide a first theoretical analysis for this problem, that includes sparsity and low-rank inducing penalizations. Read more

PDF Download here

Minimax optimal rates for Mondrian trees and forests

Jaouad Mourtada, Stéphane Gaïffas and Erwan Scornet

Annals of Statistics, 2020

Introduced by Breiman, Random Forests are widely used classification and regression algorithms. While being initially designed as batch algorithms, several variants have been proposed to handle online learning. One particular instance of such forests is the Mondrian Forest, whose trees are built using the so-called Mondrian process, therefore allowing to easily update their construction in a streaming fashion. In this paper, we provide a thorough theoretical study of Mondrian Forests in a batch learning setting, based on new results about Mondrian partitions. Read more

PDF Download here

ZiMM: a deep learning model for long term adverse events with non-clinical claims data

E. Bacry, S. Gaïffas, A. Kabeshova, Y. Yu

Machine Learning for Health (ML4H) at NeurIPS 2019 and Journal of Biomedical Informatics, 2020

This paper considers the problem of modeling long-term adverse events following prostatic surgery performed on patients with urination problems, using the French national health insurance database (SNIIRAM), which is a non-clinical claims database built around healthcare reimbursements of more than 65 million people. This makes the problem particularly challenging compared to what could be done using clinical hospital data, albeit a much smaller sample, while we exploit here the claims of almost all French citizens diagnosed with prostatic problems (with between 1.5 and 5 years of history). Read more

PDF Download here

SCALPEL3: a scalable open-source library for healthcare claims databases

E. Bacry,, S. Gaïffas, F. Leroy, M. Morel, D. P. Nguyen, Y. Sebiat and D. Sun

International Journal of Medical Informatics, 2020

This article introduces SCALPEL3, a scalable open-source framework for studies involving Large Observational Databases (LODs). Its design eases medical observational studies thanks to abstractions allowing concept extraction, high-level cohort manipulation, and production of data formats compatible with machine learning libraries. SCALPEL3 has successfully been used on the SNDS database, a huge healthcare claims database that handles the reimbursement of almost all French citizens. Read more

PDF Download here

About contrastive unsupervised representation learning for classification and its convergence

Ibrahim Merad, Yiyang Yu, Emmanuel Bacry and Stéphane Gaïffas

submitted, 2020

Contrastive representation learning has been recently proved to be very efficient for self-supervised training. These methods have been successfully used to train encoders which perform comparably to supervised training on downstream classification tasks. Read more

PDF Download here

AMF: Aggregated Mondrian Forests for Online Learning

J. Mourtada, S. Gaïffas and E. Scornet

Journal of the Royal Society Series B, 2021

Random Forests (RF) is one of the algorithms of choice in many supervised learning applications, be it classification or regression. The appeal of such methods comes from a combination of several characteristics: a remarkable accuracy in a variety of tasks, a small number of parameters to tune, robustness with respect to features scaling, a reasonable computational cost for training and prediction, and their suitability in high-dimensional settings. Read more

PDF Download here

A review on contrastive learning methods and applications to roof-type classification on aerial images

Ahmed Ben Saad, Sebastien Drouyer, Bastien Hell, Sylvain Gavoille, Stéphane Gaïffas, Gabriele Facciolo

IGARSS, 2021

Unsupervised learning based on Contrastive Learning (CL) has attracted a lot of interest recently. This is due to excellent results on a variety of subsequent tasks (especially classification) on benchmark datasets (ImageNet, CIFAR-10, etc.) without the need of large quantities of labeled samples. This work explores the application of some of the most relevant CL techniques on a large unlabeled dataset of aerial images of building rooftops. The task that we want to solve is roof type classification using a much smaller labeled dataset. The main problem with this task is the strong dataset bias and class imbalance. This is caused by the abundance of certain types of roofs and the rarity of other types. Quantitative results show that this issue heavily affects the quality of learned representations, depending on the chosen CL technique. Read more

PDF Download here

WildWood: a new Random Forest algorithm

Stéphane Gaïffas, Ibrahim Merad, Yiyang Yu

arXiv, 2021

We introduce WildWood (WW), a new ensemble algorithm for supervised learning of Random Forest (RF) type. While standard RF algorithms use bootstrap out-of-bag samples to compute out-of-bag scores, WW uses these samples to produce improved predictions given by an aggregation of the predictions of all possible subtrees of each fully grown tree in the forest. This is achieved by aggregation with exponential weights computed over out-of-bag samples, that are computed exactly and very efficiently thanks to an algorithm called context tree weighting. This improvement, combined with a histogram strategy to accelerate split finding, makes WW fast and competitive compared with other well-established ensemble methods, such as standard RF and extreme gradient boosting algorithms. Read more

PDF Download here

An improper estimator with optimal excess risk in misspecified density estimation and logistic regression

J. Mourtada, S. Gaïffas

Journal of Machine Learning Research, 2021

We introduce a procedure for predictive conditional density estimation under logarithmic loss, which we call SMP (Sample Minmax Predictor). This predictor minimizes a new general excess risk bound, which critically remains valid under model misspecification. On standard examples, this bound scales as $d / n$ where $d$ is the dimension of the model and $n$ the sample size, regardless of the true distribution. The SMP, which is an improper (out-of-model) procedure, improves over proper (within-model) estimators (such as the maximum likelihood estimator), whose excess risk can degrade arbitrarily in the misspecified case. Read more

PDF Download here

Robust supervised learning with coordinate gradient descent

S. Gaïffas, I. Merad

arXiv preprint, 2022

This paper considers the problem of supervised learning with linear methods when both features and labels can be corrupted, either in the form of heavy tailed data and/or corrupted rows. We introduce a combination of coordinate gradient descent as a learning algorithm together with robust estimators of the partial derivatives. This leads to robust statistical learning methods that have a numerical complexity nearly identical to non-robust ones based on empirical risk minimization. Read more

PDF Download here

talks

Introduction to machine learning, application to Hawkes processes

Published: March 21, 2016

Read more

Statistical Learning with Hawkes Processes and new Matrix Concentration Inequalities

Published: May 31, 2016

Read more

Methods for covariance and inverse covariance estimation in high dimension

Published: June 06, 2016

Read more

Statistical learning with Hawkes processes and new matrix concentration inequalities

Published: November 11, 2016

Read more

Statistical learning with Hawkes processes

Published: March 28, 2017

Read more

Optimization for Data Science

Published: June 20, 2017

Read more

Big Data en santé avec le Caisse Nationale d’Assurance Maladie

Published: June 23, 2017

Read more

Utilisation de techniques big data sur le SNIIRAM : le partenariat X / CNAMTS

Published: June 27, 2017

Read more

Time-oriented statistical machine learning, with applications to electronic health-care, social networks and finance

Published: August 17, 2017

Read more

High-dimensional robust regression and outliers detection with SLOPE

Published: September 22, 2017

Read more

Statistical Learning with Hawkes Processes

Published: November 29, 2017

Read more

Utilisation de techniques big data sur le SNIIRAM : le partenariat X / CNAM

Published: January 30, 2018

Read more

Machine learning et big data en santé: le partenariat X / CNAM

Published: April 05, 2018

Read more

Anytime online forests: optimal rates and algorithmic aspects

Published: June 14, 2018

Read more

A partnership about big data in health: the partnership between CNAM and Ecole polytechnique

Published: November 20, 2018

Read more

Anytime online forests: optimal rates and algorithmic aspects

Published: December 16, 2018

Read more

Binarsity: a penalization for one-hot encoded features in supervised learning

Published: June 11, 2019

Read more

SMP: Sample Minmax Predictor

Published: November 07, 2019

Read more

Deep Learning for Electronic Health Records

Published: December 17, 2020

Read more

WildWood: a new Random Forest algorithm

Published: October 21, 2021

Read more

Deep Learning for Electronic Health Records

Published: February 03, 2022

Read more

teaching

Technologies Big Data

Université de Paris, 2020

The course will focus mainly on Spark for big data processing and starts with a description and usage of the ‘Python stack’ for data science. The course will use Python as main programming language. Read more

Introduction au Machine Mearning et au Deep Learning

CNRS Formation, 2020

L’apprentissage machine (machine learning) est une discipline scientifique qui s’intéresse à la conception et au développement d’algorithmes permettant aux ordinateurs d’apprendre à prendre des décisions à partir de données. L’ensemble des données possibles qui alimentent une tâche d’apprentissage peut être très vaste et varié, ce qui rend la modélisation et les hypothèses préalables critiques pour la conception d’algorithmes pertinents. Ce stage se concentre sur la méthodologie sous-jacente à l’apprentissage supervisé avec un accent particulier sur la formulation mathématique des algorithmes et la façon dont ils peuvent être mis en oeuvre et utilisés dans la pratique. Read more

Groupe de travail ‘Outils mathématiques pour l’apprentissage statistique’

Ecole Normale Supérieure, 2020

Le but de ce groupe de travail est de travailler sur certains outils et résultats mathématiques en lien avec l’étude théorique des algorithmes d’apprentissage statistique. Read more

Statistiques

Ecole Normale Supérieure, 2020

Le but de ce cours est d’étudier des méthodes statistiques et leurs propriétés d’un point de vue théorique. En fonction des goûts et du niveau d’allergie au clavier de chacun, on pourra s’attarder plus ou moins en profondeur sur des exemples d’applications. Nous essaierons de proposer dans ce cours 70% de contenus classiques et inévitables dans un cours de statistique et 30% de résultats récents et problèmes ouverts. Read more

Deep learning

Université de Paris, 2021

A course about deep learning. Read more

Introduction to Machine Learning

Univ. de Paris, Masters MIDS et M2MO, 2021

This course focuses on the methodology underlying supervised and unsupervised learning, with a particular emphasis on the mathematical formulation of algorithms, and the way they can be implemented and used in practice. The course will describe for instance some necessary tools from optimization theory, and explain how to use them for machine learning. Numerical illustrations and applications to datasets will be given for the methods studied in the course. Read more

Stéphane Gaïffas

Sitemap

Pages

Posts

portfolio

publications

talks

teaching