class: center, middle # Big Data Technologies ## Master Mathematics and Informatics .medium[Stéphane Gaïffas - Stéphane Boucheron] .center[
] --- layout: true class: top --- class: center, middle, inverse # Course logistics --- # Who are we ?
.pull-left-20[
] .pull-right-80[ - Stéphane Boucheron - Professor - LPSM, Univ. Paris Diderot - Concentration inequalities, Complex networks, Extreme values, Information theory, Machine learning - [https://stephane-v-boucheron.fr](https://stephane-v-boucheron.fr) ]
.pull-left-20[
]

.pull-right-80[
- Stéphane Gaïffas
- Professor
- LPSM, Univ. Paris Diderot and DMA, ENS
- Data Science, Machine Learning and Statistics
- [https://stephanegaiffas.github.io](https://stephanegaiffas.github.io)
]

---

# Course logistics

- 24 hours = 2 hours $\times$ .stress[12 weeks] : classes + hands-on
- Tuesdays, 10:30 - 12:30

## About the hands-on

- Hands-on and homeworks using .stress[`Jupyter` notebooks]
- Using a `Docker` image specially built for the course
- Hands-on must be done using your .stress[own laptop]. Bring it to **every class**

---

# Course logistics

- The .stress[webpage] of the course is:
.center[[https://stephanegaiffas.github.io/big_data_course/](https://stephanegaiffas.github.io/big_data_course/)]
- .stress[Bookmark it] !
- Follow .stress[carefully] the steps described in the `tools` page:
.center[[https://stephanegaiffas.github.io/big_data_course/tools](https://stephanegaiffas.github.io/big_data_course/tools)]
- Who knows about `docker` ?

.center[
]

---

# Course evaluation

- .stress[Evaluation] using **homeworks** and a **final project**
- Find a .stress[friend] : all work done by **pairs of students**
- A single .stress[private] `GitHub` repository for **each pair of students**. You put all your work there and **grant us access**
- **All your work** goes in your private repository and nowhere else: .stress[no emails] !
- All your homework will be using .stress[`jupyter` notebooks]
- .stress[Follow the steps] described here:
.center[.small[[https://stephanegaiffas.github.io/big_data_course/homeworks/](https://stephanegaiffas.github.io/big_data_course/homeworks/)]]

---

class: center, middle, inverse

# `Docker`

---

# Why `docker` ? What is it ?

- Don't mess with your `python` env. and configuration files
- Everything is embedded in a .stress[container] (better than a VM)
- A .stress[container] is an **instance** of an .stress[image]
- Same image = same environment for everybody
- Same image = no {version, dependencies, install} problems
- It is an .stress[industry standard] used everywhere now!

.pull-left[
] .pull-right[
]

---

# `docker`

Let's :
- have a look at
.center[[https://stephanegaiffas.github.io/big_data_course/](https://stephanegaiffas.github.io/big_data_course/)]
- have a look at the `Dockerfile` to see how the image is built
- do a quick demo on how to use the `docker` image
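Once the container is running, a quick sanity check in a notebook confirms everyone shares the same environment (a minimal sketch using only the standard library; the exact packages shipped depend on the course image):

```python
import platform
import sys

# Same image = same environment: these values should match
# across all students' containers
print("python  :", platform.python_version())
print("platform:", platform.system())
print("prefix  :", sys.prefix)  # the environment baked into the image
```

If the printed Python version differs between two laptops, one of them is not running inside the course image.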
### And that's it for the logistics !

---

class: center, middle, inverse

# Big data

---

# Big data

- .stress[Moore's Law]: computing power **doubled** every two years from 1975 to 2012
- Nowadays, the doubling takes closer to **two and a half years**
- .stress[Rapid growth of datasets]: **internet activity**, social networks, genomics, physics, sensor networks, etc.
- .stress[Data size trends]: **doubles every year** according to [IDC executive summary](https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm)
- .stress[Now, data grows faster than Moore's law]

### Question

- How do we **scale** to **process it** and to **learn from it** ?

---

# Let's recall some units

### bit

A bit is a value of either 1 or 0 (on or off)

### byte (B)

A byte is 8 bits
- 1 character, e.g., "a", is one byte

### Kilobyte (KB)

A kilobyte is **1 024** B
- **2** or **3** paragraphs of text

---

# Let's recall some units

### Megabyte (MB)

A megabyte is **1 048 576** B or **1 024** KB
- **873** pages of plain text
- **4** books (200 pages or 240 000 characters)

### Gigabyte (GB)

A gigabyte is **1 073 741 824** B, **1 024** MB or **1 048 576** KB
- **894 784** pages of plain text (1 200 characters)
- **4 473** books (200 pages or 240 000 characters)
- **640** web pages (with 1.6 MB average file size)
- **341** digital pictures (with 3 MB average file size)
- **256** MP3 audio files (with 4 MB average file size)
- **1.5** 650 MB CDs

---

# Let's recall some units

### Terabyte (TB)

A terabyte is **1 099 511 627 776** B, **1 024** GB or **1 048 576** MB.
- **916 259 689** pages of plain text (1 200 characters)
- **4 581 298** books (200 pages or 240 000 characters)
- **655 360** web pages (with 1.6 MB average file size)
- **349 525** digital pictures (with 3 MB average file size)
- **262 144** MP3 audio files (with 4 MB average file size)
- **1 613** 650 MB CDs
- **233** 4.38 GB DVDs
- **40** 25 GB Blu-ray discs

---

# Let's recall some units

### Petabyte (PB)

A petabyte is **1 024** TB, **1 048 576** GB or **1 073 741 824** MB
- **938 249 922 368** pages of plain text (1 200 characters)
- **4 691 249 611** books (200 pages or 240 000 characters)
- **671 088 640** web pages (with 1.6 MB average file size)
- **357 913 941** digital pictures (with 3 MB average file size)
- **268 435 456** MP3 audio files (with 4 MB average file size)
- **1 651 910** 650 MB CDs
- **239 400** 4.38 GB DVDs
- **41 943** 25 GB Blu-ray discs

### Exabyte, etc.

- 1 EB = 1 exabyte = 1 024 PB
- 1 ZB = 1 zettabyte = 1 024 EB

---

# Some figures

Every .stress[single second]$^1$, there are:
- At least **8,000 tweets** sent
- **900+ photos** posted on **Instagram**
- **Thousands of Skype calls** made
- Over **70,000 Google searches** performed
- Around **80,000 YouTube videos** viewed
- Over **2 million emails** sent

.footnote[$^1$[https://www.internetlivestats.com](https://www.internetlivestats.com)]

---

# Some figures

There are$^1$:
- .stress[5 billion web pages] as of mid-2019 (indexed web)

and we expect$^2$:
- .stress[4.8 ZB] of annual IP traffic in 2022

Note that
- **1** ZB $\approx$ **36 000** years of HD video
- Netflix's **entire catalog** is $\approx$ **3.5 years** of HD video

.footnote[
$^1$[https://www.worldwidewebsize.com](https://www.worldwidewebsize.com)
$^2$Cisco's Visual Networking Index
]

---

# Some figures

More figures:
- **facebook** daily logs: **60TB**
- **1000 genomes** project: **200TB**
- Google web index: **10+ PB**
- Cost of **1TB** of storage: **~$35**
- Time to read **1TB** from disk: **3 hours** at **100MB/s**

### Let's give some .stress[latencies] now

---

# Latency numbers

.small[.pure-table.pure-table-striped[
| Memory type | Latency (ns) | Latency (us) | Latency (ms) | Comparison |
| :--------------------------------- | ---------------: | ----------: | -----: | :-------------------------- |
| L1 cache reference | 0.5 ns | | | |
| L2 cache reference | 7 ns | | | 14x L1 cache |
| Main memory reference | 100 ns | | | 20x L2, 200x L1 |
| Compress 1K bytes with Zippy | 3,000 ns | 3 us | | |
| Send 1K bytes over 1 Gbps network | 10,000 ns | 10 us | | |
| Read 4K randomly from SSD* | 150,000 ns | 150 us | | ~1GB/sec SSD |
| Read 1 MB sequentially from memory | 250,000 ns | 250 us | | |
| Round trip within same datacenter | 500,000 ns | 500 us | | |
| Read 1 MB sequentially from SSD* | 1,000,000 ns | 1,000 us | 1 ms | ~1GB/sec SSD, 4X memory |
| Disk seek | 10,000,000 ns | 10,000 us | 10 ms | 20x datacenter roundtrip |
| Read 1 MB sequentially from disk | 20,000,000 ns | 20,000 us | 20 ms | 80x memory, 20x SSD |
| Send packet US -> Europe -> US | 150,000,000 ns | 150,000 us | 150 ms | 600x memory |
]]

---

# Latency numbers

- Reading 1MB from **disk** = **~80X** reading 1MB from **memory**
- Sending a packet from **US to Europe to US** $\approx$ **1 500 000X** a main memory reference

### General tendency

True in general, not always:
- memory operations : .stress[fastest]
- disk operations : .stress[slow]
- network operations : .stress[slowest]

---

# Latency numbers

.small[[https://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html](https://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html)]

.center[
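]

The comparison column of the latency table above can be recomputed directly (a small pure-Python check; the numbers are the table's, in nanoseconds):

```python
# Latency numbers from the table, in nanoseconds
latency_ns = {
    "read 1 MB from memory": 250_000,
    "read 1 MB from SSD": 1_000_000,
    "read 1 MB from disk": 20_000_000,
}

mem = latency_ns["read 1 MB from memory"]
print("SSD  / memory:", latency_ns["read 1 MB from SSD"] / mem)   # 4x
print("disk / memory:", latency_ns["read 1 MB from disk"] / mem)  # 80x
```

.center[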
]

---

# Humanized latency numbers

Let's multiply all these durations by a billion

.small[.pure-table.pure-table-striped[
| Memory type | Latency | Human duration |
| :--------------------------------- | -----------: | ----------------------------------------------------: |
| L1 cache reference | 0.5 s | One heart beat (0.5 s) |
| L2 cache reference | 7 s | Long yawn |
| Main memory reference | 100 s | Brushing your teeth |
| Send 2K bytes over 1 Gbps network | 5.5 hr | From lunch to end of work day |
| SSD random read | 1.7 days | A normal weekend |
| Read 1 MB sequentially from memory | 2.9 days | A long weekend |
| Round trip within same datacenter | 5.8 days | A medium vacation |
| Read 1 MB sequentially from SSD | 11.6 days | Waiting for almost 2 weeks for a delivery |
| Disk seek | 16.5 weeks | A semester in university |
| Read 1 MB sequentially from disk | 7.8 months | Almost producing a new human being |
| Send packet US -> Europe -> US | 4.8 years | Average time it takes to complete a bachelor's degree |
]]

---

## Challenges

Challenges with big datasets
- Large datasets .stress[don't fit] on a **single** hard-drive
- **One** large machine .stress[can't process or store] **all** the data
- For **computations**, how do we .stress[stream data] from the **disk to the different layers of memory** ?
- **Concurrent accesses** to the data: disks .stress[cannot] be **read in parallel**

## Solutions

- Combine .stress[several machines] containing **hard drives** and **processors** on a **network**
- Using .stress[commodity hardware]: cheap, common architecture, i.e., **processor** + **RAM** + **disk**
- .stress[Scalability] = **more machines** on the network
- .stress[Partition] the data across the machines

---

## Challenges

Dealing with distributed computations adds **software complexity**

- .stress[Scheduling]: How to **split the work across machines**?
Must exploit and optimize data locality, since moving data is very expensive
- .stress[Reliability]: How to **deal with failure**?
Commodity (cheap) hardware fails more often: at Google, **1-5%** hard-drive failures/year and **0.2%** DIMM failures/year
- .stress[Uneven performance] of the machines: some nodes are slower than others

## Solutions

- .stress[Schedule], **manage** and **coordinate** threads and resources using appropriate software
- .stress[Locks] to **limit** access to resources
- .stress[Replicate] data for **faster reading** and **reliability**

---

# Is it HPC ?

- **High Performance Computing** (HPC)
- **Parallel computing**

### Comments

- For HPC, scaling up means using a .stress[bigger machine]
- Huge performance increase for **medium** scale problems
- .stress[Very expensive], specialized machines, lots of processors and memory

### The answer is no !

---

# The Big Data universe

Many technologies combining .stress[software] and .stress[cloud computing]

.center[
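]

---

# Partitioning data, a sketch

The .stress[Partition] idea from the Solutions slide can be illustrated with simple hash partitioning (a pure-Python sketch with a made-up number of machines, not the API of any real framework):

```python
# Assign each record to one of n_machines by hashing its key
n_machines = 4
records = [(f"user_{i}", i) for i in range(10)]

partitions = [[] for _ in range(n_machines)]
for key, value in records:
    partitions[hash(key) % n_machines].append((key, value))

# Every record lands on exactly one machine,
# and a lookup by key only needs to touch one partition
assert sum(len(p) for p in partitions) == len(records)
```

Real systems add replication on top of this, so that a record survives the failure of the machine holding its partition.

.center[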
]

---

# The Big Data universe

Often used with/for .stress[Machine Learning] (or AI)

.center[
]

---

# Tools

- Software such as .stress[`Spark`] or .stress[`HadoopMR`] (Hadoop Map Reduce) is in charge of these challenges
- They are .stress[distributed compute engines]: software that eases the development of distributed algorithms

They run on .stress[clusters] (several machines on a network), managed by a .stress[resource manager] such as :
- **`Yarn` :** [https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html)
- **`Mesos` :** [http://mesos.apache.org](http://mesos.apache.org)
- **`Kubernetes` :** [https://kubernetes.io](https://kubernetes.io/)

A resource manager ensures that the tasks running on the cluster do not try to use the same resources all at once

---

class: center, middle, inverse

# `Apache Spark`

---

# Apache `Spark`

The course will focus mainly on .stress[`Spark`] for big data processing

.center[
[https://spark.apache.org](https://spark.apache.org)
]

- `Spark` is an .stress[industry standard]
(cf [https://spark.apache.org/powered-by.html](https://spark.apache.org/powered-by.html))
- One of the most used .stress[big data processing frameworks]
- .stress[Open source]

The predecessor of `Spark` is `Hadoop`

---

# `Hadoop`

- `Hadoop` has a simple API and good fault tolerance (tolerance to nodes failing midway through a processing job)
- The cost is lots of .stress[data shuffling] across the network
- With intermediate computations .stress[written to disk] **over the network**, which we know is .stress[very time expensive]

It is made of three components:
- .stress[`HDFS`] (Hadoop Distributed File System), inspired by the `GoogleFileSystem`, see .small[[https://ai.google/research/pubs/pub51](https://ai.google/research/pubs/pub51)]
- .stress[`YARN`] (Yet Another Resource Negotiator)
- .stress[`MapReduce`], inspired by Google
.small[[https://research.google.com/archive/mapreduce.html](https://research.google.com/archive/mapreduce.html)] --- # MapReduce's wordcount example .center[
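]

The picture above can be mimicked in a few lines of pure Python: map emits `(word, 1)` pairs, shuffle groups them by key, reduce sums each group (an illustration of the programming model, not `HadoopMR` code):

```python
from collections import defaultdict

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# map: emit a (word, 1) pair for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# shuffle: group the pairs by key (word)
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# reduce: sum the counts of each group
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts["the"], counts["fox"])  # 3 2
```

In `HadoopMR`, the mapped pairs and the reduced counts are written to `HDFS`, and the shuffle moves data across the network.

.center[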
] --- # `Spark` Advantages of `Spark` over `HadoopMR` ? - .stress[In-memory storage]: use **RAM** for fast iterative computations - .stress[Lower overhead] for starting jobs - .stress[Simple and expressive] with `Scala`, `Python`, `R`, `Java` APIs - .stress[Higher level libraries] with `SparkSQL`, `SparkStreaming`, etc. Disadvantages of `Spark` over `HadoopMR` ? - `Spark` requires servers with **more CPU** and **more memory** - But still much cheaper than HPC `Spark` is .stress[much faster] than `Hadoop` - `Hadoop` uses **disk** and **network** - `Spark` tries to use **memory** as much as possible for operations while minimizing network use --- # `Spark` and `Hadoop` comparison
.pure-table.pure-table-striped[
| | HadoopMR | Spark |
| -----------------------: | -----------: | ------------------------------: |
| Storage | Disk | In-memory or disk |
| Operations | Map, reduce | Map, reduce, join, sample, among many others |
| Execution model | Batch | Batch, interactive, streaming |
| Programming environments | Java | Scala, Java, Python, R |
]

---

# `Spark` and `Hadoop` comparison

For **logistic regression** training (a simple **classification** algorithm which requires **several passes** over a dataset)

.center[
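]

Why do the passes matter ? A toy gradient-descent loop makes it visible: every iteration rereads the whole dataset, so keeping it in **memory** (`Spark`) instead of rereading it from **disk** (`HadoopMR`) pays off at each pass (a pure-Python sketch with made-up data, not `Spark` code):

```python
import math

# Toy dataset kept in memory: (features, label) pairs
data = [([1.0, x / 10], 1 if x > 5 else 0) for x in range(10)]
w = [0.0, 0.0]   # weights: bias + one feature
lr = 0.5

for _ in range(100):              # several full passes over the data
    grad = [0.0, 0.0]
    for features, y in data:      # each pass reads the *whole* dataset
        z = sum(wi * xi for wi, xi in zip(w, features))
        p = 1 / (1 + math.exp(-z))
        for i, xi in enumerate(features):
            grad[i] += (p - y) * xi
    w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]

print("weights:", w)
```

With 100 passes, a dataset that takes 3 hours to read from disk costs 300 hours of I/O alone if nothing is cached.

.center[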
]
.center[
] --- # The `Spark` stack .center[
] --- # The `Spark` stack .center[
] --- # `Spark` can run "everywhere" .center[
] --- class: center, middle, inverse # Agenda, tools and references --- # Tentative agenda for the course **Weeks 1, 2 and 3**
The .stress[`Python` data-science stack] for **medium-scale** problems **Weeks 4 and 5**
Introduction to .stress[`spark`] and its .stress[low-level API] **Weeks 6, 7 and 8**
`Spark`'s high-level API: .stress[`spark.sql`]. Data from different formats and sources
Run a job on a cluster with .stress[`spark-submit`], monitoring, mistakes and debugging **Weeks 10, 11, 12**
Introduction to .stress[`spark-streaming`] and a glimpse at other big data technologies

---

# Main tools for the course (tentative...)

### Infrastructure

.center[
] ### Python stack .center[
] ### Data Visualization .center[
] --- # Main tools for the course (tentative...) ### Big data processing .center[
] ### Data storage / formats / querying .center[
] --- # Learning resources - .stress[Spark Documentation Website]
.small[[http://spark.apache.org/docs/latest/](http://spark.apache.org/docs/latest/)] - .stress[API docs]
.small[[http://spark.apache.org/docs/latest/api/scala/index.html](http://spark.apache.org/docs/latest/api/scala/index.html)]
.small[[http://spark.apache.org/docs/latest/api/python/](http://spark.apache.org/docs/latest/api/python/)] - .stress[`Databricks` learning notebooks]
.small[[https://databricks.com/resources](https://databricks.com/resources)] - .stress[StackOverflow]
.small[[https://stackoverflow.com/tags/apache-spark](https://stackoverflow.com/tags/apache-spark)]
.small[[https://stackoverflow.com/tags/pyspark](https://stackoverflow.com/tags/pyspark)] - .stress[More advanced]
.small[[http://books.japila.pl/apache-spark-internals/](http://books.japila.pl/apache-spark-internals/)] --- # Learning Resources .pull-left-80[ - .stress[Book]: **"Spark The Definitive Guide"** .small[[http://shop.oreilly.com/product/0636920034957.do](http://shop.oreilly.com/product/0636920034957.do)]
.tiny[[https://github.com/databricks/Spark-The-Definitive-Guide](https://github.com/databricks/Spark-The-Definitive-Guide)] ] .pull-right-20[
]
And the **most important thing is:** .pull-left[ .stress[.large[Practice!]] ] .pull-right[
] --- class: center, middle, inverse # Data centers --- # Data centers Wonder what a .stress[datacenter looks like] ? - Have a look at [http://www.google.com/about/datacenters](http://www.google.com/about/datacenters) --- # Data centers Wonder what a .stress[datacenter looks like] ? .center[
] --- # Data centers Wonder what a .stress[datacenter looks like] ?
.center[
] --- class: center, middle, inverse # Thank you !