# Big Data Technologies

## Master Mathematics and Informatics

.center[
    <img src="figs/lpsm.png" style="height: 160px;" />
    <img src="" style="width: 30px;" />
    <img src="figs/paris-diderot.png" style="height: 90px;" />
    <img src="" style="width: 30px;" />
    <img src="figs/uparis.png" style="height: 120px;" />
]

---

# The "Python Stack" for Data Science

---

# What is `Python`?

- multi-purpose

- focused on readability

- easy to learn

- object-oriented

- strongly and dynamically typed

- cross-platform
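
A minimal sketch of what "strongly and dynamically typed" means in practice (the variable names are illustrative only): a name can be rebound to objects of different types, but unrelated types are never silently converted.

```python
# Dynamic typing: the same name can be rebound to objects of different types.
x = 42
type_before = type(x).__name__   # "int"
x = "forty-two"
type_after = type(x).__name__    # "str"

# Strong typing: no implicit conversion between unrelated types.
try:
    result = "1" + 1             # str + int is an error, not "11" or 2
except TypeError as err:
    result = f"TypeError: {err}"

print(type_before, type_after)
print(result)
```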

---

# Features of `Python`

- High-level data types (`tuples`, `dict`, `list`, `set`, etc.)

- Standard libraries with batteries included

- String services, regular expressions

- libraries for scientific computing

- Easy and efficient I/O, many file formats

- OS, threading, multiprocessing

- networking, email, html, webserver

- Can be extended with `C/C++` and easily accelerated (`cython`, `numba`, `pypy`)

- Tons of external libraries
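
As a small illustration of the built-in data types and the "batteries included" standard library, here is a word count using only `re` and `collections` (the sample text is made up for the example):

```python
import re
from collections import Counter

text = "big data, big tools: python for big data"

words = re.findall(r"[a-z]+", text)             # list of strings
counts = Counter(words)                         # dict subclass: word -> count
unique_words = set(words)                       # set of distinct words
top_word, top_count = counts.most_common(1)[0]  # tuple unpacking

print(top_word, top_count)
print(sorted(unique_words))
```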

---

# The `stackoverflow` 2018 survey

---

# `Python` popularity growth

---

# Why `Python` for data science?

Besides these features, `Python` has:

- large communities for data science, analytics, etc.

- many well-established libraries

- lots of examples and documentation

- **huge** demand from the industry

---

# The `Python` Data Science Stack

### Maths / Science

.center[
<img src="figs/numpy.jpg" width=28%/>
<img src="" width=10%/>
<img src="figs/scipy.png" width=28%/>
]

---

# The `Python` Data Science Stack

### Maths / Science

- `numpy` is all about **multi-dimensional arrays** and **matrices**.

- high-level mathematical computation such as **linear algebra** in `numpy.linalg` and **random number generation** in `numpy.random`

- **Fast**, but mostly single-threaded (only some operations, such as BLAS-backed linear algebra, use several cores)

- And not for **distributed** multi-machine settings
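
A short sketch combining `numpy.random` and `numpy.linalg`, as mentioned above. The matrix size and the seed are arbitrary choices for the example.

```python
import numpy as np

# Random number generation with a seeded generator, for reproducibility.
rng = np.random.default_rng(seed=0)
A = rng.normal(size=(3, 3))

# Make the matrix symmetric positive definite so the linear system is well posed.
A = A @ A.T + 3 * np.eye(3)
b = np.ones(3)

# Linear algebra: solve A x = b, then measure the residual.
x = np.linalg.solve(A, b)
residual = np.linalg.norm(A @ x - b)

print(x)
print(residual)
```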

---

# The `Python` Data Science Stack

### Maths / Science

- `scipy` extends `numpy` with extra modules

- Mainly optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing

- And very useful sparse matrix formats in `scipy.sparse`
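
For instance, a sparse matrix in CSR format stores only its non-zero entries. The tiny matrix below is an assumption chosen for illustration; the benefit appears with much larger, mostly-zero matrices.

```python
import numpy as np
from scipy import sparse

dense = np.array([
    [0.0, 0.0, 3.0],
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 2.0],
])

# Compressed Sparse Row format: only the 3 non-zero values are stored.
S = sparse.csr_matrix(dense)

# Sparse matrix times dense vector gives a regular numpy array.
v = np.ones(3)
product = S @ v

print(S.nnz)
print(product)
```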

---

# The `Python` Data Science Stack

### Data processing

.center[
<img src="figs/pandas.png" width=40%/>
<img src="" width=5%/>
<img src="figs/dask.png" width=10%/>
<img src="" width=5%/>
<img src="figs/pyspark.jpg" width=20%/>
]

---

# The `Python` Data Science Stack

### Data processing

.center[
<img src="figs/pandas.png" width=40%/>
<img src="" width=5%/>
<img src="" width=10%/>
<img src="" width=5%/>
<img src="" width=20%/>
]

- `pandas` builds upon `numpy` to provide a high-performance, easy-to-use `DataFrame` object, with high-level data processing

- Easy I/O with most data formats: `csv`, `json`, `hdf5`, `feather`, `parquet`, etc.

- `SQL` semantics: `groupby`, `agg`, `select`, `where`, etc.

- Some data visualization tools

- Very large **general-purpose library for data processing**, not distributed, **medium-scale** data only (the data must fit in memory)
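
A minimal sketch of the `SQL`-like style of `pandas`, on a made-up table of sales (the column names and values are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Paris", "Lyon", "Lyon", "Nice"],
    "sales": [10, 20, 5, 15, 8],
})

# Roughly: SELECT city, SUM(sales), AVG(sales)
#          FROM df WHERE sales > 5 GROUP BY city
summary = (
    df[df["sales"] > 5]        # WHERE
    .groupby("city")["sales"]  # GROUP BY
    .agg(["sum", "mean"])      # aggregates
)

print(summary)
```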

---

# The `Python` Data Science Stack

### Data processing

.center[
<img src="" width=40%/>
<img src="" width=5%/>
<img src="figs/dask.png" width=10%/>
<img src="" width=5%/>
<img src="" width=20%/>
]

- `dask` is roughly a **distributed** and **parallel** `pandas`

- Nearly the same API as `pandas`!

- Task scheduling, lazy evaluation, distributed dataframes

- Still young and **far behind** `spark`, but can be useful

- Easier than `spark`, pure `Python` (no `JVM`)

---

# The `Python` Data Science Stack

### Data processing

.center[
<img src="" width=40%/>
<img src="" width=5%/>
<img src="" width=10%/>
<img src="" width=5%/>
<img src="figs/pyspark.jpg" width=20%/>
]

- `pyspark` is the `python` API to `spark`, a big data processing framework

- We will use it **a lot** in this course

- Native API to `spark` is `scala`: `pyspark` can be **slower** (much slower if you are not careful)

---

# The `Python` Data Science Stack

### Data Visualization

.center[
<img src="figs/matplotlib.png" width=25%/>
<img src="" width=10%/>
<img src="figs/seaborn.png" width=20%/>
<img src="" width=10%/>
<img src="figs/bokeh.png" width=20%/>
]

---

# The `Python` Data Science Stack

### Data Visualization

.center[
<img src="figs/matplotlib.png" width=25%/>
<img src="" width=10%/>
<img src="" width=20%/>
<img src="" width=10%/>
<img src="" width=20%/>
]

- `matplotlib` provides **2D plotting capabilities**

- **Very large** and **highly customizable** library

- The historical one, somewhat **low-level** when plotting things related to data
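
A basic line plot, as a sketch of the `matplotlib` object-oriented interface. The `Agg` backend and the in-memory buffer are choices made here so that the example runs without a display; in a notebook they are not needed.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no display required
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A basic line plot")
ax.legend()

# Render to an in-memory PNG; pass a filename instead to write to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

Every element (axes, labels, legend) is set explicitly, which is what makes the library both highly customizable and somewhat low-level.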

---

# The `Python` Data Science Stack

### Data Visualization

.center[
<img src="" width=25%/>
<img src="" width=10%/>
<img src="figs/seaborn.png" width=20%/>
<img src="" width=10%/>
<img src="" width=20%/>
]

- `seaborn` is a **higher-level** plotting library built on top of `matplotlib`

- To be used **with `pandas` dataframes** as the data source

- Higher-level plotting possibilities

- Usually better-looking plots with good default parameters

---

# The `Python` Data Science Stack

### Data Visualization

.center[
<img src="" width=25%/>
<img src="" width=10%/>
<img src="" width=20%/>
<img src="" width=10%/>
<img src="figs/bokeh.png" width=20%/>
]

- `bokeh` is an **interactive visualization library** for web browsers; plots are rendered by its own `javascript` library (`BokehJS`), in the style of `d3.js`

- With a clean and simple `python` interface, can be used in a `jupyter` notebook

- Interactions enabled by default (zoom, etc.) and fast rendering

- Very good looking plots with good default parameters

[there is also `plotly`...]

---

# The `Python` Data Science Stack

### Interfaces

Ways to use all these tools

- Write a script `script.py` and use `python` directly in a CLI : `python script.py`

- Use the `ipython` interactive shell

---

# The `Python` Data Science Stack

### Interfaces

- Use `jupyter`: a web application to create and run documents called **notebooks** (with the `.ipynb` extension)

- Notebooks can contain code, equations, visualizations, text, etc. We will **use these a lot** in the course.

- Each notebook has a `kernel`: a separate `python` process that runs its code

- A **problem**: an `.ipynb` file is a `json` document that also stores cell outputs. This leads to noisy diffs, a problem for `git` versioning
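
One common workaround is to strip outputs before committing. The sketch below does this with the standard `json` module on a hand-written minimal notebook. The helper `strip_outputs` is hypothetical, written for this slide; dedicated tools such as `nbstripout` or `jupytext` handle many more cases.

```python
import json

# A hand-written, minimal notebook in the nbformat 4 layout (illustration only).
notebook_text = json.dumps({
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["hello\n"]}
            ],
            "source": ["print('hello')"],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["# A title"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})


def strip_outputs(text):
    """Return notebook JSON with code-cell outputs and execution counts removed."""
    nb = json.loads(text)
    for cell in nb["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    # Stable key order and indentation keep later diffs small.
    return json.dumps(nb, indent=1, sort_keys=True)


clean = json.loads(strip_outputs(notebook_text))
print(clean["cells"][0])
```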

---

# But also...

Many libraries for statistics, machine learning and deep learning

### Statistics

- `statsmodels`

### Machine learning

- `scikit-learn`, `xgboost`, `lightgbm`

### Deep learning

- `tensorflow`, `pytorch`

### Getting faster

- `numba`, `cython`

---

# But also...

- `Python` APIs for most databases and clouds

- Processing and plotting tools for Geospatial data

- Image processing

- Web development, web scraping

among many many many other things...

---

# Thank you!