Introduction to Machine Learning

Univ. Paris Diderot, Master M2MO, 2019

Syllabus

Machine learning is a scientific discipline that is concerned with the design and development of algorithms that allow computers to learn from data. A major focus of machine learning is to automatically learn complex patterns and to make intelligent decisions based on them. The set of possible data inputs that feed a learning task can be very large and diverse, which makes modeling and prior assumptions critical problems for the design of relevant algorithms.

This course focuses on the methodology underlying supervised and unsupervised learning, with a particular emphasis on the mathematical formulation of algorithms and on the way they can be implemented and used in practice. The course will describe, for instance, some necessary tools from optimization theory and explain how to use them for machine learning. Numerical illustrations and applications to datasets will be given for the methods studied in the course. Practical sessions will start with a presentation of the Python language and of the main libraries for data science and scientific computing.

Format

  • We use moodle to centralize all teaching material (slides, notebooks, datasets, exercises)
  • Courses on slides and blackboard, all material in English
  • Practical sessions use python, jupyter, scikit-learn, tensorflow, keras (deep learning) with the standard stack (numpy, scipy, matplotlib)
  • Practical sessions will start with a quick introduction to python and the jupyter notebook, and the necessary libraries for data science

When and where

  • Tuesdays 10-17-24 Sept., 1-8-15-22-29 Oct.
  • Courses 13:30 – 16:00, room 0011 Sophie Germain
  • Practical sessions 16:30 – 19:00, room 265E Halle aux Farines

Evaluation

  • Practical sessions work 40% (jupyter notebooks for Tuesday of week $w$ must be sent before Monday 23:59 of week $w+1$, using the moodle platform)
  • Final exam 60%, Nov. 12 Amphi 3B Halle aux Farines

Agenda of the course

1. Introduction to supervised learning (courses 1, 2 and 3)

The first three courses will be about:

  • Binary classification, standard metrics and recipes (overfitting, cross-validation) and regression
  • LDA / QDA for Gaussian models
  • Logistic regression, Generalized Linear Models
  • Regularization (Ridge, Lasso, etc.)
  • Support Vector Machine, the Hinge loss
  • Kernel methods
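
As a small illustration of two items from the list above (cross-validation and Ridge regularization), here is a minimal sketch in plain Python. It assumes a one-dimensional regression y ≈ w x without intercept, for which the Ridge estimator has the closed form w = <x, y> / (<x, x> + lam). The helper names (k_fold_indices, ridge_1d, cv_error) and the data are made up for the example; in the practical sessions, scikit-learn would normally handle both steps.

```python
# Hypothetical helper names; plain Python so it runs without extra libraries.

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def ridge_1d(xs, ys, lam):
    """Closed-form Ridge for y ~ w * x (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def cv_error(xs, ys, lam, k=3):
    """Mean squared validation error of ridge_1d, averaged over k folds."""
    errors = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        w = ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        errors.append(sum((ys[i] - w * xs[i]) ** 2 for i in fold) / len(fold))
    return sum(errors) / len(errors)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]  # roughly y = 2 x, made-up values
# Pick the penalty with the smallest cross-validated error.
best_lam = min([0.0, 0.1, 1.0, 10.0], key=lambda lam: cv_error(xs, ys, lam))
```

Each penalty is scored only on data that was held out during fitting, which is what guards against overfitting when choosing it.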

Practical sessions will introduce the Python language and the basic libraries for scientific computing and data science with Python.

Slides of the first 3 courses

Some exercises

Notebooks and corresponding slides introducing Python, numpy and scipy

2. Optimization for Machine Learning (courses 3, 4)

Then, we will talk about basic tools from convex optimization that are of importance in machine learning, namely:

  • Proximal gradient descent
  • Coordinate descent / coordinate gradient descent
  • Stochastic gradient descent and beyond
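
To make the first item concrete, here is a minimal sketch of proximal gradient descent (often called ISTA) applied to the Lasso objective (1 / (2n)) ||Xw - y||² + lam ||w||₁. The choice of the Lasso as the example, the function names and the toy data are assumptions made for illustration; the code is plain Python and uses a conservative step size rather than the exact Lipschitz constant.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def lasso_ista(X, y, lam, n_iter=200):
    """Minimize (1 / (2 n)) * ||X w - y||^2 + lam * ||w||_1 by proximal gradient."""
    n, d = len(X), len(X[0])
    # The squared Frobenius norm of X divided by n upper-bounds the Lipschitz
    # constant of the gradient, so 1 / L is a safe (if conservative) step size.
    L = sum(v * v for row in X for v in row) / n
    step = 1.0 / L
    w = [0.0] * d
    for _ in range(n_iter):
        residual = [sum(X[i][j] * w[j] for j in range(d)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * residual[i] for i in range(n)) / n for j in range(d)]
        # Gradient step on the smooth part, then prox of the l1 penalty.
        w = [soft_threshold(w[j] - step * grad[j], step * lam) for j in range(d)]
    return w


X = [[1.0, 0.0], [0.0, 1.0]]  # toy design matrix
y = [3.0, 0.5]
w = lasso_ista(X, y, lam=0.5)
```

On this toy problem the second coefficient is set exactly to zero, which is the sparsity effect of the l1 penalty.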

The practical sessions will continue to describe tools for data science with Python (pandas) and we will start to use the scikit-learn library for simple machine learning tasks.

The implementation of optimization algorithms for simple linear methods is proposed as a first assignment for the course.

Slides of courses 3 and 4

Notebooks and slides about pandas, seaborn and scikit-learn

The notebook for the first assignment is here:

Instructions for the first assignment:

  • jupyter notebook only
  • Must be done in pairs of students
  • All students must send their work using moodle
  • Work must be sent before Monday, October 14th at 23:59
  • Students in the same pair must send the same file (with the same file name)
  • Name of the file must be in the format bonnie_parker_and_clyde_barrow.ipynb if the first student is Bonnie Parker and the second is Clyde Barrow

3. Trees and boosting methods (course 5)

This course will be all about machine learning methods based on trees:

  • Decision trees, CART, Random Forests and Boosting methods
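
The building block of a CART-style classification tree is the search for the split that most reduces impurity. The sketch below, in plain Python, computes the Gini impurity and scans the candidate thresholds of a single numerical feature. Function names and data are hypothetical, and real implementations (for example in scikit-learn) are considerably more optimized.

```python
def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(xs, ys):
    """Return (threshold, weighted impurity) of the best split x <= threshold."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = (None, gini(ys))  # baseline: no split at all
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates equal feature values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            # Use the midpoint between consecutive feature values as threshold.
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, score)
    return best


threshold, impurity = best_split([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
```

A full tree applies this search recursively to each side of the chosen split; Random Forests and Boosting then combine many such trees.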

We will again use the scikit-learn library, first through quick illustrations of some machine learning algorithms, then through some more advanced uses of it. If time permits, we will also use more powerful, industrial-level implementations of gradient boosting methods, such as XGBoost or LightGBM.

Slides

Some exercises

Notebooks

4. A glimpse of unsupervised learning (course 6)

This course will provide a quick glimpse of the most popular methods for unsupervised learning, including clustering and dimension reduction algorithms:

  • K-Means, Gaussian mixtures and EM
  • Dimension reduction: matrix factorization, NMF, t-SNE
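
As an illustration of the first item, here is a minimal sketch of K-Means through Lloyd's algorithm, alternating an assignment step and an update step. It is written in plain Python for 2-D points and assumes the initial centers are given by the caller; the function name and data are made up for the example.

```python
def kmeans(points, centers, n_iter=20):
    """Lloyd's algorithm on 2-D points, starting from the given centers."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)),
                key=lambda k: (p[0] - centers[k][0]) ** 2
                + (p[1] - centers[k][1]) ** 2)
            for p in points
        ]
        # Update step: each center moves to the mean of its cluster.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centers.append((sum(p[0] for p in members) / len(members),
                                    sum(p[1] for p in members) / len(members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return centers, labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

The result depends on the initial centers, which is why practical implementations usually run several initializations and keep the best one.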

Slides

5. Deep learning (courses 7, 8)

Finally, the last two courses will be about deep learning:

  • Introduction to neural networks
  • The perceptron, examples of “shallow” neural nets
  • Multilayer neural networks, deep learning
  • Adaptive-rate stochastic gradient descent, back-propagation
  • Convolutional neural networks
  • Recurrent neural nets
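
The perceptron mentioned above is simple enough to fit in a few lines. The following sketch, in plain Python, trains it on the logical AND function with labels encoded as -1 / +1. The function names and the toy data are assumptions made for illustration; the practical sessions rely on TensorFlow and Keras instead.

```python
def train_perceptron(samples, labels, n_epochs=100):
    """Rosenblatt's perceptron with labels in {-1, +1}; returns (weights, bias)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(n_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            score = sum(wj * xj for wj, xj in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or exactly on the boundary)
                w = [wj + y * xj for wj, xj in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            break  # every training point is on the correct side
    return w, b


def predict(w, b, x):
    """Sign of the linear score, returned as -1 or +1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1


samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
labels = [-1, -1, -1, 1]  # logical AND, encoded with -1 / +1
w, b = train_perceptron(samples, labels)
```

Because AND is linearly separable, the updates stop after finitely many mistakes. A function such as XOR is not separable, which is one motivation for multilayer networks.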

We will use TensorFlow with Keras (or TensorFlow 2 release candidate) for practical sessions, with some applications to image classification and text sentiment analysis, among other things if time permits.

A second and final assignment will be proposed, about some application of machine learning to a practical problem.

Slides

Notebooks

Course material

All the material for the course (slides and notebooks) is available on the moodle platform of the university:

References

  • Machine Learning: A Probabilistic Perspective, K. P. Murphy, MIT Press
  • Foundations of Machine Learning, M. Mohri, A. Rostamizadeh and A. Talwalkar, MIT Press
  • Deep Learning, I. Goodfellow, Y. Bengio and A. Courville, MIT Press
  • Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, W. McKinney, O’Reilly
  • Statistics for High-Dimensional Data: Methods, Theory and Applications, P. Bühlmann, S. van de Geer, Springer-Verlag