class: center, middle

# Big Data Technologies

## Master Mathematics and Informatics

.medium[Stéphane Gaïffas - Stéphane Boucheron]

.center[
]

---
layout: true
class: top

---
class: center, middle, inverse

# The "Python Stack" for Data Science

---

# What is `Python`?

.center[
]

- multi-purpose
- focused on readability
- easy to learn
- object-oriented
- strongly and dynamically typed
- cross-platform

---

# Features of `Python`

- High-level data types (`tuple`, `dict`, `list`, `set`, etc.)
- Standard libraries with batteries included
- String services, regular expressions
- Libraries for scientific computing
- Easy and efficient I/O, many file formats
- OS, threading, multiprocessing
- Networking, email, HTML, web server
- Can be extended with `C/C++` and easily accelerated (`cython`, `numba`, `pypy`)
- Tons of external libraries

---

# Features of `Python`

.center[
]

---

# The `stackoverflow` 2018 survey

.center[
]

---

# `Python` popularity growth

.center[
]

---

# `Python` popularity growth

.center[
]

---

# Why `Python` for data science?

Besides these features, `Python` has:

- large communities for data science, analytics, etc.
- many well-established libraries
- lots of examples and documentation
- **huge** demand from industry

---

# The `Python` Data Science Stack

### Maths / Science

.center[
]

---

# The `Python` Data Science Stack

### Maths / Science

.center[
]

- `numpy` is all about **multi-dimensional arrays** and **matrices**
- High-level mathematical computation, such as **linear algebra** in `numpy.linalg` and **random number generation** in `numpy.random`
- **Fast**, but not optimized for multi-threaded architectures
- And not designed for **distributed** multi-machine settings

---

# The `Python` Data Science Stack

### Maths / Science

.center[
]

- `scipy` extends `numpy` with extra modules
- Mainly optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing
- And very useful sparse matrix formats in `scipy.sparse`

---

# The `Python` Data Science Stack

### Data processing

.center[
]

---

# The `Python` Data Science Stack

### Data processing

.center[
]

- `pandas` builds upon `numpy` to provide a high-performance, easy-to-use `DataFrame` object, with high-level data processing
- Easy I/O with most data formats: `csv`, `json`, `hdf5`, `feather`, `parquet`, etc.
- `SQL` semantics: `groupby`, `agg`, `select`, `where`, etc.
- Some data visualization tools
- Very large **general-purpose library for data processing**, not distributed, **medium-scale** data only

---

# The `Python` Data Science Stack

### Data processing

.center[
]

- `dask` is roughly a **distributed** and **parallel** `pandas`
- Same API as `pandas`!
- Task scheduling, lazy evaluation, distributed dataframes
- Still young and **far behind** `spark`, but can be useful
- Easier than `spark`, full `Python` (no `JVM`)

---

# The `Python` Data Science Stack

### Data processing

.center[
]

- `pyspark` is the `python` API to `spark`, a big data processing framework
- We will use it **a lot** in this course
- The native API of `spark` is in `scala`: `pyspark` can be **slower** (much slower if you are not careful)

---

# The `Python` Data Science Stack

### Data Visualization

.center[
]

---

# The `Python` Data Science Stack

### Data Visualization

.center[
]

- `matplotlib` provides **2D plotting capabilities**
- **Very large** and **highly customizable** library
- The historical one, somewhat **low-level** when plotting things related to data

---

# The `Python` Data Science Stack

### Data Visualization

.center[
]

- A **higher-level** plotting library built on top of `matplotlib`
- To be used **with `pandas` dataframes** as the data source
- Higher-level plotting possibilities
- Usually better-looking plots with good default parameters

---

# The `Python` Data Science Stack

### Data Visualization

.center[
]

- An **interactive visualization library** for web browsers based on `javascript` and `d3.js`
- With a clean and simple `python` interface, can be used in a `jupyter` notebook
- Interactions enabled by default (zoom, etc.) and fast rendering
- Very good-looking plots with good default parameters

[there is also `plotly`...]

---

# The `Python` Data Science Stack

### Interfaces
---

# The `Python` Data Science Stack

### Interfaces
Ways to use all these tools:

- Write a script `script.py` and run it with `python` directly from the command line: `python script.py`
- Use the `ipython` interactive shell

---

# The `Python` Data Science Stack

### Interfaces
- Use `jupyter`: a web application that lets you create and run documents called **notebooks** (with the `.ipynb` extension)
- Notebooks can contain code, equations, visualizations, text, etc. We will **use these a lot** in the course.
- Each notebook has a `kernel` running a `python` process
- A **problem**: an `.ipynb` file is a `json` document. This leads to poor code diffs, which is a problem for `git` versioning

---

# But also...

Many libraries for statistics, machine learning and deep learning

### Statistics

- `statsmodels`

### Machine learning

- `scikit-learn`, `xgboost`, `lightgbm`

### Deep learning

- `tensorflow`, `pytorch`

### Getting faster

- `numba`, `cython`

---

# But also...

- `Python` APIs for most databases and clouds
- Processing and plotting tools for geospatial data
- Image processing
- Web development, web scraping

among many, many other things...

---
class: center, middle, inverse

# Thank you!
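---

# Backup: notebooks are `json`

To illustrate the versioning problem mentioned on the `jupyter` slide, here is a minimal sketch using only the standard `json` module. The `notebook` dictionary is a hypothetical, hand-written example that mimics the usual notebook layout (a list of `cells`, with `outputs` and `execution_count` on code cells), and `strip_outputs` and `code_sources` are illustrative helper names, not part of any library. The sketch suggests why diffs get noisy: outputs and execution counters are stored next to the code.

```python
import json

# Hypothetical minimal notebook, written by hand for illustration only
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title\n"]},
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "source": ["x = 1 + 1\n", "print(x)\n"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
}


def strip_outputs(nb):
    """Return a copy of the notebook without outputs and execution counts."""
    clean = json.loads(json.dumps(nb))  # deep copy through a JSON round trip
    for cell in clean["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return clean


def code_sources(nb):
    """Return the source of each code cell as a single string."""
    return ["".join(c["source"]) for c in nb["cells"] if c["cell_type"] == "code"]


# The serialized text is what git actually compares
raw_text = json.dumps(notebook, indent=1)
clean_text = json.dumps(strip_outputs(notebook), indent=1)
```

Committing the stripped version (or only the extracted code) is one possible way to keep diffs readable. Dedicated tools exist for this, but this sketch does not rely on any of them.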