class: center, middle # Big Data Technologies ## Master Mathematics and Informatics .medium[Stéphane Gaïffas - Stéphane Boucheron] .center[
] --- layout: true class: top --- class: center, middle, inverse # Course logistics --- # Who are we ?
.pull-left-20[
] .pull-right-80[ - Stéphane Boucheron - Professor - LPSM, Univ. Paris Diderot - Concentration inequalities, Complex networks, Extreme values, Information theory, Machine learning - [https://stephane-v-boucheron.fr](https://stephane-v-boucheron.fr) ]
.pull-left-20[
]

.pull-right-80[
- Stéphane Gaïffas
- Professor
- LPSM, Univ. Paris Diderot and DMA, ENS
- Data Science, Machine Learning and Statistics
- [https://stephanegaiffas.github.io](https://stephanegaiffas.github.io)
]

---

# Course logistics

- 24 hours = 2 hours $\times$ .stress[12 weeks] : classes + hands-on
- Tuesdays, 10:30 - 12:30

## About the hands-on

- Hands-on and homeworks using .stress[`Jupyter` notebooks]
- Using a `Docker` image specially built for the course
- Hands-on must be done using your .stress[own laptop]. Bring it to **every class**

---

# Course logistics

- The .stress[webpage] of the course is:
.center[[https://stephanegaiffas.github.io/big_data_course/](https://stephanegaiffas.github.io/big_data_course/)]
- .stress[Bookmark it] !
- Follow .stress[carefully] the steps described in the `tools` page:
.center[[https://stephanegaiffas.github.io/big_data_course/tools](https://stephanegaiffas.github.io/big_data_course/tools)]
- Who knows about `docker` ?

.center[
]

---

# Course evaluation

- .stress[Evaluation] using **homeworks** and a **final project**
- Find a .stress[friend] : all work done by **pairs of students**
- A single .stress[private] `GitHub` repository for **each pair of students**. You put all your work there and **grant us access**
- **All your work** goes in your private repository and nowhere else: .stress[no emails] !
- All your homework will be using .stress[`jupyter` notebooks]
- .stress[Follow the steps] described here:
.center[.small[[https://stephanegaiffas.github.io/big_data_course/homeworks/](https://stephanegaiffas.github.io/big_data_course/homeworks/)]]

---

class: center, middle, inverse

# `Docker`

---

# Why `docker` ? What is it ?

- Don't mess with your `python` env. and configuration files
- Everything is embedded in a .stress[container] (better than a VM)
- A .stress[container] is an **instance** of an .stress[image]
- Same image = same environment for everybody
- Same image = no {version, dependencies, install} problems
- It is an .stress[industry standard] used everywhere now!

.pull-left[
] .pull-right[
]

---

# `docker`

Let's :
- have a look at
.center[[https://stephanegaiffas.github.io/big_data_course/](https://stephanegaiffas.github.io/big_data_course/)]
- have a look at the `Dockerfile` to see how the image is built
- do a quick demo on how to use the `docker` image
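Once the container is running, a quick sanity check in a notebook confirms everyone shares the same environment (a minimal sketch using only the standard library; the exact packages shipped depend on the course image):

```python
import platform
import sys

# Same image = same environment: these values should match
# across all students' containers
print("python  :", platform.python_version())
print("platform:", platform.system())
print("prefix  :", sys.prefix)  # the environment baked into the image
```

If the printed Python version differs between two laptops, one of them is not running inside the course image.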
### And that's it for the logistics !

---

class: center, middle, inverse

# Big data

---

# Big data

- .stress[Moore's Law]: computing power **doubled** every two years from 1975 to 2012
- Nowadays, the doubling takes closer to **two and a half years**
- .stress[Rapid growth of datasets]: **internet activity**, social networks, genomics, physics, sensor networks, etc.
- .stress[Data size trends]: **doubles every year** according to [IDC executive summary](https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm)
- .stress[Now, data grows faster than Moore's law]

### Question

- How do we **scale** to **process it** and to **learn from it** ?

---

# Let's recall some units

### bit

A bit is a value of either 1 or 0 (on or off)

### byte (B)

A byte is 8 bits
- 1 character, e.g., "a", is one byte

### Kilobyte (KB)

A kilobyte is **1 024** B
- **2** or **3** paragraphs of text

---

# Let's recall some units

### Megabyte (MB)

A megabyte is **1 048 576** B or **1 024** KB
- **873** pages of plain text
- **4** books (200 pages or 240 000 characters)

### Gigabyte (GB)

A gigabyte is **1 073 741 824** B, **1 024** MB or **1 048 576** KB
- **894 784** pages of plain text (1 200 characters)
- **4 473** books (200 pages or 240 000 characters)
- **640** web pages (with 1.6 MB average file size)
- **341** digital pictures (with 3 MB average file size)
- **256** MP3 audio files (with 4 MB average file size)
- **1.5** 650 MB CDs

---

# Let's recall some units

### Terabyte (TB)

A terabyte is **1 099 511 627 776** B, **1 024** GB or **1 048 576** MB.
- **916 259 689** pages of plain text (1 200 characters)
- **4 581 298** books (200 pages or 240 000 characters)
- **655 360** web pages (with 1.6 MB average file size)
- **349 525** digital pictures (with 3 MB average file size)
- **262 144** MP3 audio files (with 4 MB average file size)
- **1 613** 650 MB CDs
- **233** 4.38 GB DVDs
- **40** 25 GB Blu-ray discs

---

# Let's recall some units

### Petabyte (PB)

A petabyte is **1 024** TB, **1 048 576** GB or **1 073 741 824** MB
- **938 249 922 368** pages of plain text (1 200 characters)
- **4 691 249 611** books (200 pages or 240 000 characters)
- **671 088 640** web pages (with 1.6 MB average file size)
- **357 913 941** digital pictures (with 3 MB average file size)
- **268 435 456** MP3 audio files (with 4 MB average file size)
- **1 651 910** 650 MB CDs
- **239 400** 4.38 GB DVDs
- **41 943** 25 GB Blu-ray discs

### Exabyte, etc.

- 1 EB = 1 exabyte = 1 024 PB
- 1 ZB = 1 zettabyte = 1 024 EB

---

# Some figures

Every .stress[single second]$^1$, there are:
- At least **8,000 tweets** sent
- **900+ photos** posted on **Instagram**
- **Thousands of Skype calls** made
- Over **70,000 Google searches** performed
- Around **80,000 YouTube videos** viewed
- Over **2 million emails** sent

.footnote[$^1$[https://www.internetlivestats.com](https://www.internetlivestats.com)]

---

# Some figures

There are$^1$:
- .stress[5 billion web pages] as of mid-2019 (indexed web)

and we expect$^2$:
- .stress[4.8 ZB] of annual IP traffic in 2022

Note that
- **1** ZB $\approx$ **36 000** years of HD video
- Netflix's **entire catalog** is $\approx$ **3.5 years** of HD video

.footnote[
$^1$[https://www.worldwidewebsize.com](https://www.worldwidewebsize.com)
$^2$Cisco's Visual Networking Index
]

---

# Some figures

More figures:
- **facebook** daily logs: **60TB**
- **1000 genomes** project: **200TB**
- Google web index: **10+ PB**
- Cost of **1TB** of storage: **~$35**
- Time to read **1TB** from disk: **3 hours** at **100MB/s**

### Let's give some .stress[latencies] now

---

# Latency numbers

.small[.pure-table.pure-table-striped[
| Memory type | Latency (ns) | Latency (us) | Latency (ms) | Comparison |
| :--------------------------------- | ---------------: | ----------: | -----: | :-------------------------- |
| L1 cache reference | 0.5 ns | | | |
| L2 cache reference | 7 ns | | | 14x L1 cache |
| Main memory reference | 100 ns | | | 20x L2, 200x L1 |
| Compress 1K bytes with Zippy | 3,000 ns | 3 us | | |
| Send 1K bytes over 1 Gbps network | 10,000 ns | 10 us | | |
| Read 4K randomly from SSD* | 150,000 ns | 150 us | | ~1GB/sec SSD |
| Read 1 MB sequentially from memory | 250,000 ns | 250 us | | |
| Round trip within same datacenter | 500,000 ns | 500 us | | |
| Read 1 MB sequentially from SSD* | 1,000,000 ns | 1,000 us | 1 ms | ~1GB/sec SSD, 4X memory |
| Disk seek | 10,000,000 ns | 10,000 us | 10 ms | 20x datacenter roundtrip |
| Read 1 MB sequentially from disk | 20,000,000 ns | 20,000 us | 20 ms | 80x memory, 20x SSD |
| Send packet US -> Europe -> US | 150,000,000 ns | 150,000 us | 150 ms | 600x memory |
]]

---

# Latency numbers

- Reading 1MB from **disk** = **~80X** reading 1MB from **memory**
- Sending a packet from **US to Europe to US** $\approx$ **1 500 000X** a main memory reference

### General tendency

True in general, not always:
- memory operations : .stress[fastest]
- disk operations : .stress[slow]
- network operations : .stress[slowest]

---

# Latency numbers

.small[[https://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html](https://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html)]

.center[
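]

The comparison column of the latency table above can be recomputed directly (a small pure-Python check; the numbers are the table's, in nanoseconds):

```python
# Latency numbers from the table, in nanoseconds
latency_ns = {
    "read 1 MB from memory": 250_000,
    "read 1 MB from SSD": 1_000_000,
    "read 1 MB from disk": 20_000_000,
}

mem = latency_ns["read 1 MB from memory"]
print("SSD  / memory:", latency_ns["read 1 MB from SSD"] / mem)   # 4x
print("disk / memory:", latency_ns["read 1 MB from disk"] / mem)  # 80x
```

.center[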
]

---

# Humanized latency numbers

Let's multiply all these durations by a billion

.small[.pure-table.pure-table-striped[
| Memory type | Latency | Human duration |
| :--------------------------------- | -----------: | ----------------------------------------------------: |
| L1 cache reference | 0.5 s | One heart beat (0.5 s) |
| L2 cache reference | 7 s | Long yawn |
| Main memory reference | 100 s | Brushing your teeth |
| Send 2K bytes over 1 Gbps network | 5.5 hr | From lunch to end of work day |
| SSD random read | 1.7 days | A normal weekend |
| Read 1 MB sequentially from memory | 2.9 days | A long weekend |
| Round trip within same datacenter | 5.8 days | A medium vacation |
| Read 1 MB sequentially from SSD | 11.6 days | Waiting for almost 2 weeks for a delivery |
| Disk seek | 16.5 weeks | A semester in university |
| Read 1 MB sequentially from disk | 7.8 months | Almost producing a new human being |
| Send packet US -> Europe -> US | 4.8 years | Average time it takes to complete a bachelor's degree |
]]

---

## Challenges

Challenges with big datasets
- Large datasets .stress[don't fit] on a **single** hard-drive
- **One** large machine .stress[can't process or store] **all** the data
- For **computations**, how do we .stress[stream data] from the **disk to the different layers of memory** ?
- **Concurrent accesses** to the data: disks .stress[cannot] be **read in parallel**

## Solutions

- Combine .stress[several machines] containing **hard drives** and **processors** on a **network**
- Using .stress[commodity hardware]: cheap, common architecture, i.e., **processor** + **RAM** + **disk**
- .stress[Scalability] = **more machines** on the network
- .stress[Partition] the data across the machines

---

## Challenges

Dealing with distributed computations adds **software complexity**

- .stress[Scheduling]: How to **split the work across machines**?
Must exploit and optimize data locality, since moving data is very expensive
- .stress[Reliability]: How to **deal with failure**?
Commodity (cheap) hardware fails more often: at Google, **1-5%** hard-drive failures/year and **0.2%** DIMM failures/year
- .stress[Uneven performance] of the machines: some nodes are slower than others

## Solutions

- .stress[Schedule], **manage** and **coordinate** threads and resources using appropriate software
- .stress[Locks] to **limit** access to resources
- .stress[Replicate] data for **faster reading** and **reliability**

---

# Is it HPC ?

- **High Performance Computing** (HPC)
- **Parallel computing**

### Comments

- For HPC, scaling up means using a .stress[bigger machine]
- Huge performance increase for **medium** scale problems
- .stress[Very expensive], specialized machines, lots of processors and memory

### The answer is no !

---

# The Big Data universe

Many technologies combining .stress[software] and .stress[cloud computing]

.center[
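]

---

# Partitioning data, a sketch

The .stress[Partition] idea from the Solutions slide can be illustrated with simple hash partitioning (a pure-Python sketch with a made-up number of machines, not the API of any real framework):

```python
# Assign each record to one of n_machines by hashing its key
n_machines = 4
records = [(f"user_{i}", i) for i in range(10)]

partitions = [[] for _ in range(n_machines)]
for key, value in records:
    partitions[hash(key) % n_machines].append((key, value))

# Every record lands on exactly one machine,
# and a lookup by key only needs to touch one partition
assert sum(len(p) for p in partitions) == len(records)
```

Real systems add replication on top of this, so that a record survives the failure of the machine holding its partition.

.center[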
]

---

# The Big Data universe

Often used with/for .stress[Machine Learning] (or AI)

.center[
]

---

# Tools

- Software such as .stress[`Spark`] or .stress[`HadoopMR`] (Hadoop Map Reduce) is in charge of these challenges
- They are .stress[distributed compute engines]: software that eases the development of distributed algorithms

They run on .stress[clusters] (several machines on a network), managed by a .stress[resource manager] such as :
- **`Yarn` :** [https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html)
- **`Mesos` :** [http://mesos.apache.org](http://mesos.apache.org)
- **`Kubernetes` :** [https://kubernetes.io](https://kubernetes.io/)

A resource manager ensures that the tasks running on the cluster do not try to use the same resources all at once

---

class: center, middle, inverse

# `Apache Spark`

---

# Apache `Spark`

The course will focus mainly on .stress[`Spark`] for big data processing

.center[
[https://spark.apache.org](https://spark.apache.org)
]

- `Spark` is an .stress[industry standard]
(cf [https://spark.apache.org/powered-by.html](https://spark.apache.org/powered-by.html))
- One of the most used .stress[big data processing frameworks]
- .stress[Open source]

The predecessor of `Spark` is `Hadoop`

---

# `Hadoop`

- `Hadoop` has a simple API and good fault tolerance (tolerance to nodes failing midway through a processing job)
- The cost is lots of .stress[data shuffling] across the network
- With intermediate computations .stress[written to disk] **over the network**, which we know is .stress[very time expensive]

It is made of three components:
- .stress[`HDFS`] (Hadoop Distributed File System), inspired by the `GoogleFileSystem`, see .small[[https://ai.google/research/pubs/pub51](https://ai.google/research/pubs/pub51)]
- .stress[`YARN`] (Yet Another Resource Negotiator)
- .stress[`MapReduce`], inspired by Google
.small[[https://research.google.com/archive/mapreduce.html](https://research.google.com/archive/mapreduce.html)] --- # MapReduce's wordcount example .center[
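]

The picture above can be mimicked in a few lines of pure Python: map emits `(word, 1)` pairs, shuffle groups them by key, reduce sums each group (an illustration of the programming model, not `HadoopMR` code):

```python
from collections import defaultdict

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# map: emit a (word, 1) pair for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# shuffle: group the pairs by key (word)
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# reduce: sum the counts of each group
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts["the"], counts["fox"])  # 3 2
```

In `HadoopMR`, the mapped pairs and the reduced counts are written to `HDFS`, and the shuffle moves data across the network.

.center[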
] --- # `Spark` Advantages of `Spark` over `HadoopMR` ? - .stress[In-memory storage]: use **RAM** for fast iterative computations - .stress[Lower overhead] for starting jobs - .stress[Simple and expressive] with `Scala`, `Python`, `R`, `Java` APIs - .stress[Higher level libraries] with `SparkSQL`, `SparkStreaming`, etc. Disadvantages of `Spark` over `HadoopMR` ? - `Spark` requires servers with **more CPU** and **more memory** - But still much cheaper than HPC `Spark` is .stress[much faster] than `Hadoop` - `Hadoop` uses **disk** and **network** - `Spark` tries to use **memory** as much as possible for operations while minimizing network use --- # `Spark` and `Hadoop` comparison
.pure-table.pure-table-striped[
| | HadoopMR | Spark |
| -----------------------: | -----------: | ------------------------------: |
| Storage | Disk | In-memory or disk |
| Operations | Map, reduce | Map, reduce, join, sample, among many others |
| Execution model | Batch | Batch, interactive, streaming |
| Programming environments | Java | Scala, Java, Python, R |
]

---

# `Spark` and `Hadoop` comparison

For **logistic regression** training (a simple **classification** algorithm which requires **several passes** over a dataset)

.center[
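]

Why do the passes matter ? A toy gradient-descent loop makes it visible: every iteration rereads the whole dataset, so keeping it in **memory** (`Spark`) instead of rereading it from **disk** (`HadoopMR`) pays off at each pass (a pure-Python sketch with made-up data, not `Spark` code):

```python
import math

# Toy dataset kept in memory: (features, label) pairs
data = [([1.0, x / 10], 1 if x > 5 else 0) for x in range(10)]
w = [0.0, 0.0]   # weights: bias + one feature
lr = 0.5

for _ in range(100):              # several full passes over the data
    grad = [0.0, 0.0]
    for features, y in data:      # each pass reads the *whole* dataset
        z = sum(wi * xi for wi, xi in zip(w, features))
        p = 1 / (1 + math.exp(-z))
        for i, xi in enumerate(features):
            grad[i] += (p - y) * xi
    w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]

print("weights:", w)
```

With 100 passes, a dataset that takes 3 hours to read from disk costs 300 hours of I/O alone if nothing is cached.

.center[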
]
.center[
] --- # The `Spark` stack .center[
] --- # The `Spark` stack .center[
] --- # `Spark` can run "everywhere" .center[
] --- class: center, middle, inverse # Agenda, tools and references --- # Tentative agenda for the course **Weeks 1, 2 and 3**
The .stress[`Python` data-science stack] for **medium-scale** problems **Weeks 4 and 5**
Introduction to .stress[`spark`] and its .stress[low-level API] **Weeks 6, 7 and 8**
`Spark`'s high-level API: .stress[`spark.sql`]. Data from different formats and sources
Run a job on a cluster with .stress[`spark-submit`], monitoring, mistakes and debugging **Weeks 10, 11, 12**
Introduction to .stress[`spark-streaming`] and a glimpse at other big data technologies

---

# Main tools for the course (tentative...)

### Infrastructure

.center[
] ### Python stack .center[
] ### Data Visualization .center[
] --- # Main tools for the course (tentative...) ### Big data processing .center[
] ### Data storage / formats / querying .center[
] --- # Learning resources - .stress[Spark Documentation Website]
.small[[http://spark.apache.org/docs/latest/](http://spark.apache.org/docs/latest/)] - .stress[API docs]
.small[[http://spark.apache.org/docs/latest/api/scala/index.html](http://spark.apache.org/docs/latest/api/scala/index.html)]
.small[[http://spark.apache.org/docs/latest/api/python/](http://spark.apache.org/docs/latest/api/python/)] - .stress[`Databricks` learning notebooks]
.small[[https://databricks.com/resources](https://databricks.com/resources)] - .stress[StackOverflow]
.small[[https://stackoverflow.com/tags/apache-spark](https://stackoverflow.com/tags/apache-spark)]
.small[[https://stackoverflow.com/tags/pyspark](https://stackoverflow.com/tags/pyspark)] - .stress[More advanced]
.small[[http://books.japila.pl/apache-spark-internals/](http://books.japila.pl/apache-spark-internals/)] --- # Learning Resources .pull-left-80[ - .stress[Book]: **"Spark The Definitive Guide"** .small[[http://shop.oreilly.com/product/0636920034957.do](http://shop.oreilly.com/product/0636920034957.do)]
.tiny[[https://github.com/databricks/Spark-The-Definitive-Guide](https://github.com/databricks/Spark-The-Definitive-Guide)] ] .pull-right-20[
]
And the **most important thing is:** .pull-left[ .stress[.large[Practice!]] ] .pull-right[
] --- class: center, middle, inverse # Data centers --- # Data centers Wonder what a .stress[datacenter looks like] ? - Have a look at [http://www.google.com/about/datacenters](http://www.google.com/about/datacenters) --- # Data centers Wonder what a .stress[datacenter looks like] ? .center[
] --- # Data centers Wonder what a .stress[datacenter looks like] ?
.center[
] --- class: center, middle, inverse # Thank you !