{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2020-10-09T12:32:56.371842Z",
"start_time": "2020-10-09T12:32:56.367372Z"
}
},
"source": [
"# Introduction to `pandas`\n",
"\n",
"The `pandas` library (https://pandas.pydata.org) is one of the most used tool at the disposal of people working with data in `python` today.\n",
"\n",
"- It allows to **crunch data** easily\n",
"- It mainly provides a `DataFrame` object (a **table of data**) with a huge set of functionalities"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Why ?\n",
"\n",
"Through `pandas`, you get acquainted with your data by **cleaning**, **transforming**, and **analyzing** it.\n",
"\n",
"- What's the average, median, max, or min of each column?\n",
"- Does column A correlate with column B?\n",
"- What does the distribution of data in column C look like?\n",
"- Removing missing values, filter rows or columns using some criteria\n",
"- Data visualization (when combined `matplotlib` or `seaborn` or others)\n",
"- Store the cleaned, transformed data back into virtually any format or database"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Where ?\n",
"\n",
"`pandas` is a central component of the `python` \"stack\" for data science\n",
"\n",
"- `pandas` is built on top of `numpy`\n",
"- often used in conjunction with other libraries\n",
"- a `DataFrame` is often fed to plotting functions or machine learning algorithms (such as `scikit-learn`)\n",
"- Well-interfaced with `jupyter`, leading to a nice interactive environment for data exploration and modeling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Core components of pandas\n",
"\n",
"The primary two components of pandas are the `Series` and `DataFrame`.\n",
"\n",
"- A `Series` is essentially a column\n",
"- a `DataFrame` is a multi-dimensional table made up of a collection of `Series`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating a `DataFrame` from scratch"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:07:08.634333Z",
"start_time": "2022-01-19T11:07:08.201274Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" apples | \n",
" oranges | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 3 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" 2 | \n",
" 0 | \n",
" 7 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" apples oranges\n",
"0 3 0\n",
"1 2 3\n",
"2 0 7\n",
"3 1 2"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"fruits = {\n",
" \"apples\": [3, 2, 0, 1],\n",
" \"oranges\": [0, 3, 7, 2]\n",
"}\n",
"\n",
"df_fruits = pd.DataFrame(fruits)\n",
"df_fruits"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:08:05.535463Z",
"start_time": "2022-01-19T11:08:05.530187Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(df_fruits)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:08:12.680877Z",
"start_time": "2022-01-19T11:08:12.674128Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"0 3\n",
"1 2\n",
"2 0\n",
"3 1\n",
"Name: apples, dtype: int64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_fruits[\"apples\"]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:08:40.517193Z",
"start_time": "2022-01-19T11:08:40.511875Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(df_fruits[\"apples\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Indexing\n",
"\n",
"- By default, a `DataFrame` uses a contiguous index\n",
"- But what if we want to say **who** buys the fruits ?"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:08:51.018797Z",
"start_time": "2022-01-19T11:08:51.009897Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" apples | \n",
" oranges | \n",
"
\n",
" \n",
" \n",
" \n",
" Daniel | \n",
" 3 | \n",
" 0 | \n",
"
\n",
" \n",
" Sean | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" Pierce | \n",
" 0 | \n",
" 7 | \n",
"
\n",
" \n",
" Roger | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" apples oranges\n",
"Daniel 3 0\n",
"Sean 2 3\n",
"Pierce 0 7\n",
"Roger 1 2"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_fruits = pd.DataFrame(fruits, index=[\"Daniel\", \"Sean\", \"Pierce\", \"Roger\"])\n",
"df_fruits"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## `.loc` versus `.iloc`\n",
"\n",
"- `.loc` **loc**ates by name\n",
"- `.iloc` **loc**ates by numerical **i**ndex"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:11:43.050618Z",
"start_time": "2022-01-19T11:11:43.042623Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" apples | \n",
" oranges | \n",
"
\n",
" \n",
" \n",
" \n",
" Daniel | \n",
" 3 | \n",
" 0 | \n",
"
\n",
" \n",
" Sean | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" Pierce | \n",
" 0 | \n",
" 7 | \n",
"
\n",
" \n",
" Roger | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" apples oranges\n",
"Daniel 3 0\n",
"Sean 2 3\n",
"Pierce 0 7\n",
"Roger 1 2"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_fruits"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:12:17.606988Z",
"start_time": "2022-01-19T11:12:17.599866Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"apples 2\n",
"oranges 3\n",
"Name: Sean, dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# What's in Sean's basket ?\n",
"df_fruits.loc['Sean']"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:12:31.777154Z",
"start_time": "2022-01-19T11:12:31.771307Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Daniel 0\n",
"Sean 3\n",
"Pierce 7\n",
"Roger 2\n",
"Name: oranges, dtype: int64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Who has oranges ?\n",
"df_fruits.loc[:, 'oranges']"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:13:12.355217Z",
"start_time": "2022-01-19T11:13:12.350177Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# How many apples in Pierce's basket ?\n",
"df_fruits.loc['Pierce', 'apples']"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:13:18.688274Z",
"start_time": "2022-01-19T11:13:18.676461Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" apples | \n",
" oranges | \n",
"
\n",
" \n",
" \n",
" \n",
" Daniel | \n",
" 3 | \n",
" 0 | \n",
"
\n",
" \n",
" Sean | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" Pierce | \n",
" 0 | \n",
" 7 | \n",
"
\n",
" \n",
" Roger | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" apples oranges\n",
"Daniel 3 0\n",
"Sean 2 3\n",
"Pierce 0 7\n",
"Roger 1 2"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_fruits"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:13:20.548872Z",
"start_time": "2022-01-19T11:13:20.542442Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"7"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_fruits.iloc[2, 1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Main attributes and methods of a `DataFrame`\n",
"\n",
"A `DataFrame` has many **attributes**"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:13:38.902805Z",
"start_time": "2022-01-19T11:13:38.896985Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['apples', 'oranges'], dtype='object')"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_fruits.columns"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:14:23.322276Z",
"start_time": "2022-01-19T11:14:23.316645Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Daniel', 'Sean', 'Pierce', 'Roger'], dtype='object')"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_fruits.index"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:14:32.512895Z",
"start_time": "2022-01-19T11:14:32.506454Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"apples int64\n",
"oranges int64\n",
"dtype: object"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_fruits.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A `DataFrame` has many **methods**"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:17:20.721882Z",
"start_time": "2022-01-19T11:17:20.708707Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Index: 4 entries, Daniel to Roger\n",
"Data columns (total 2 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 apples 3 non-null float64\n",
" 1 oranges 4 non-null int64 \n",
"dtypes: float64(1), int64(1)\n",
"memory usage: 268.0+ bytes\n"
]
}
],
"source": [
"df_fruits.info()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:15:35.080217Z",
"start_time": "2022-01-19T11:15:35.062875Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" apples | \n",
" oranges | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 4.000000 | \n",
" 4.00000 | \n",
"
\n",
" \n",
" mean | \n",
" 1.500000 | \n",
" 3.00000 | \n",
"
\n",
" \n",
" std | \n",
" 1.290994 | \n",
" 2.94392 | \n",
"
\n",
" \n",
" min | \n",
" 0.000000 | \n",
" 0.00000 | \n",
"
\n",
" \n",
" 25% | \n",
" 0.750000 | \n",
" 1.50000 | \n",
"
\n",
" \n",
" 50% | \n",
" 1.500000 | \n",
" 2.50000 | \n",
"
\n",
" \n",
" 75% | \n",
" 2.250000 | \n",
" 4.00000 | \n",
"
\n",
" \n",
" max | \n",
" 3.000000 | \n",
" 7.00000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" apples oranges\n",
"count 4.000000 4.00000\n",
"mean 1.500000 3.00000\n",
"std 1.290994 2.94392\n",
"min 0.000000 0.00000\n",
"25% 0.750000 1.50000\n",
"50% 1.500000 2.50000\n",
"75% 2.250000 4.00000\n",
"max 3.000000 7.00000"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_fruits.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Missing values\n",
"\n",
"What if we don't know how many apples are in Sean's basket ?"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:15:57.731545Z",
"start_time": "2022-01-19T11:15:57.721720Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" apples | \n",
" oranges | \n",
"
\n",
" \n",
" \n",
" \n",
" Daniel | \n",
" 3.0 | \n",
" 0 | \n",
"
\n",
" \n",
" Sean | \n",
" NaN | \n",
" 3 | \n",
"
\n",
" \n",
" Pierce | \n",
" 0.0 | \n",
" 7 | \n",
"
\n",
" \n",
" Roger | \n",
" 1.0 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" apples oranges\n",
"Daniel 3.0 0\n",
"Sean NaN 3\n",
"Pierce 0.0 7\n",
"Roger 1.0 2"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_fruits.loc['Sean', 'apples'] = None\n",
"df_fruits"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:16:40.181235Z",
"start_time": "2022-01-19T11:16:40.163184Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" apples | \n",
" oranges | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 3.000000 | \n",
" 4.00000 | \n",
"
\n",
" \n",
" mean | \n",
" 1.333333 | \n",
" 3.00000 | \n",
"
\n",
" \n",
" std | \n",
" 1.527525 | \n",
" 2.94392 | \n",
"
\n",
" \n",
" min | \n",
" 0.000000 | \n",
" 0.00000 | \n",
"
\n",
" \n",
" 25% | \n",
" 0.500000 | \n",
" 1.50000 | \n",
"
\n",
" \n",
" 50% | \n",
" 1.000000 | \n",
" 2.50000 | \n",
"
\n",
" \n",
" 75% | \n",
" 2.000000 | \n",
" 4.00000 | \n",
"
\n",
" \n",
" max | \n",
" 3.000000 | \n",
" 7.00000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" apples oranges\n",
"count 3.000000 4.00000\n",
"mean 1.333333 3.00000\n",
"std 1.527525 2.94392\n",
"min 0.000000 0.00000\n",
"25% 0.500000 1.50000\n",
"50% 1.000000 2.50000\n",
"75% 2.000000 4.00000\n",
"max 3.000000 7.00000"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_fruits.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `count` is **3** for apples now, since we have 1 missing value among the 4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Adding a column\n",
"\n",
"Ooooops, we forgot about the bananas !"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:17:38.191383Z",
"start_time": "2022-01-19T11:17:38.181663Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" apples | \n",
" oranges | \n",
" bananas | \n",
"
\n",
" \n",
" \n",
" \n",
" Daniel | \n",
" 3.0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" Sean | \n",
" NaN | \n",
" 3 | \n",
" 2 | \n",
"
\n",
" \n",
" Pierce | \n",
" 0.0 | \n",
" 7 | \n",
" 1 | \n",
"
\n",
" \n",
" Roger | \n",
" 1.0 | \n",
" 2 | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" apples oranges bananas\n",
"Daniel 3.0 0 0\n",
"Sean NaN 3 2\n",
"Pierce 0.0 7 1\n",
"Roger 1.0 2 6"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_fruits[\"bananas\"] = [0, 2, 1, 6]\n",
"df_fruits"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Adding a column with the date\n",
"\n",
"And we forgot the dates !"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:17:45.450938Z",
"start_time": "2022-01-19T11:17:45.439619Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" apples | \n",
" oranges | \n",
" bananas | \n",
" time | \n",
"
\n",
" \n",
" \n",
" \n",
" Daniel | \n",
" 3.0 | \n",
" 0 | \n",
" 0 | \n",
" 2020/10/08 12:13 | \n",
"
\n",
" \n",
" Sean | \n",
" NaN | \n",
" 3 | \n",
" 2 | \n",
" 2020/10/07 11:37 | \n",
"
\n",
" \n",
" Pierce | \n",
" 0.0 | \n",
" 7 | \n",
" 1 | \n",
" 2020/10/10 14:07 | \n",
"
\n",
" \n",
" Roger | \n",
" 1.0 | \n",
" 2 | \n",
" 6 | \n",
" 2020/10/09 10:51 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" apples oranges bananas time\n",
"Daniel 3.0 0 0 2020/10/08 12:13\n",
"Sean NaN 3 2 2020/10/07 11:37\n",
"Pierce 0.0 7 1 2020/10/10 14:07\n",
"Roger 1.0 2 6 2020/10/09 10:51"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_fruits['time'] = [\n",
" \"2020/10/08 12:13\", \"2020/10/07 11:37\", \n",
" \"2020/10/10 14:07\", \"2020/10/09 10:51\"\n",
"]\n",
"df_fruits"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:18:04.481768Z",
"start_time": "2022-01-19T11:18:04.475480Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"apples float64\n",
"oranges int64\n",
"bananas int64\n",
"time object\n",
"dtype: object"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_fruits.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:18:19.387128Z",
"start_time": "2022-01-19T11:18:19.381754Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"str"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(df_fruits.loc[\"Roger\", \"time\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's not a date but a string (`str`) ! So we convert this column to something called `datetime`"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:19:11.306595Z",
"start_time": "2022-01-19T11:19:11.293662Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" apples | \n",
" oranges | \n",
" bananas | \n",
" time | \n",
"
\n",
" \n",
" \n",
" \n",
" Daniel | \n",
" 3.0 | \n",
" 0 | \n",
" 0 | \n",
" 2020-10-08 12:13:00 | \n",
"
\n",
" \n",
" Sean | \n",
" NaN | \n",
" 3 | \n",
" 2 | \n",
" 2020-10-07 11:37:00 | \n",
"
\n",
" \n",
" Pierce | \n",
" 0.0 | \n",
" 7 | \n",
" 1 | \n",
" 2020-10-10 14:07:00 | \n",
"
\n",
" \n",
" Roger | \n",
" 1.0 | \n",
" 2 | \n",
" 6 | \n",
" 2020-10-09 10:51:00 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" apples oranges bananas time\n",
"Daniel 3.0 0 0 2020-10-08 12:13:00\n",
"Sean NaN 3 2 2020-10-07 11:37:00\n",
"Pierce 0.0 7 1 2020-10-10 14:07:00\n",
"Roger 1.0 2 6 2020-10-09 10:51:00"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_fruits[\"time\"] = pd.to_datetime(df_fruits[\"time\"])\n",
"df_fruits"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:19:41.683741Z",
"start_time": "2022-01-19T11:19:41.677507Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"apples float64\n",
"oranges int64\n",
"bananas int64\n",
"time datetime64[ns]\n",
"dtype: object"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_fruits.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What if we want to keep only the baskets after (including) October, 9th ?"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:19:51.787589Z",
"start_time": "2022-01-19T11:19:51.775535Z"
},
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" apples | \n",
" oranges | \n",
" bananas | \n",
" time | \n",
"
\n",
" \n",
" \n",
" \n",
" Pierce | \n",
" 0.0 | \n",
" 7 | \n",
" 1 | \n",
" 2020-10-10 14:07:00 | \n",
"
\n",
" \n",
" Roger | \n",
" 1.0 | \n",
" 2 | \n",
" 6 | \n",
" 2020-10-09 10:51:00 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" apples oranges bananas time\n",
"Pierce 0.0 7 1 2020-10-10 14:07:00\n",
"Roger 1.0 2 6 2020-10-09 10:51:00"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_fruits.loc[df_fruits[\"time\"] >= pd.Timestamp(\"2020/10/09\")]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Slices and subsets of rows or columns"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:20:30.215178Z",
"start_time": "2022-01-19T11:20:30.205482Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" apples | \n",
" oranges | \n",
" bananas | \n",
" time | \n",
"
\n",
" \n",
" \n",
" \n",
" Daniel | \n",
" 3.0 | \n",
" 0 | \n",
" 0 | \n",
" 2020-10-08 12:13:00 | \n",
"
\n",
" \n",
" Sean | \n",
" NaN | \n",
" 3 | \n",
" 2 | \n",
" 2020-10-07 11:37:00 | \n",
"
\n",
" \n",
" Pierce | \n",
" 0.0 | \n",
" 7 | \n",
" 1 | \n",
" 2020-10-10 14:07:00 | \n",
"
\n",
" \n",
" Roger | \n",
" 1.0 | \n",
" 2 | \n",
" 6 | \n",
" 2020-10-09 10:51:00 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" apples oranges bananas time\n",
"Daniel 3.0 0 0 2020-10-08 12:13:00\n",
"Sean NaN 3 2 2020-10-07 11:37:00\n",
"Pierce 0.0 7 1 2020-10-10 14:07:00\n",
"Roger 1.0 2 6 2020-10-09 10:51:00"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_fruits"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:20:32.076067Z",
"start_time": "2022-01-19T11:20:32.066100Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" bananas | \n",
" time | \n",
"
\n",
" \n",
" \n",
" \n",
" Daniel | \n",
" 0 | \n",
" 2020-10-08 12:13:00 | \n",
"
\n",
" \n",
" Sean | \n",
" 2 | \n",
" 2020-10-07 11:37:00 | \n",
"
\n",
" \n",
" Pierce | \n",
" 1 | \n",
" 2020-10-10 14:07:00 | \n",
"
\n",
" \n",
" Roger | \n",
" 6 | \n",
" 2020-10-09 10:51:00 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" bananas time\n",
"Daniel 0 2020-10-08 12:13:00\n",
"Sean 2 2020-10-07 11:37:00\n",
"Pierce 1 2020-10-10 14:07:00\n",
"Roger 6 2020-10-09 10:51:00"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_fruits.loc[:, \"bananas\":\"time\"]"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:20:46.504725Z",
"start_time": "2022-01-19T11:20:46.494468Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" apples | \n",
" oranges | \n",
" bananas | \n",
"
\n",
" \n",
" \n",
" \n",
" Daniel | \n",
" 3.0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" Sean | \n",
" NaN | \n",
" 3 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" apples oranges bananas\n",
"Daniel 3.0 0 0\n",
"Sean NaN 3 2"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_fruits.loc[\"Daniel\":\"Sean\", \"apples\":\"bananas\"]"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:21:08.463797Z",
"start_time": "2022-01-19T11:21:08.452887Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" apples | \n",
" time | \n",
"
\n",
" \n",
" \n",
" \n",
" Daniel | \n",
" 3.0 | \n",
" 2020-10-08 12:13:00 | \n",
"
\n",
" \n",
" Sean | \n",
" NaN | \n",
" 2020-10-07 11:37:00 | \n",
"
\n",
" \n",
" Pierce | \n",
" 0.0 | \n",
" 2020-10-10 14:07:00 | \n",
"
\n",
" \n",
" Roger | \n",
" 1.0 | \n",
" 2020-10-09 10:51:00 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" apples time\n",
"Daniel 3.0 2020-10-08 12:13:00\n",
"Sean NaN 2020-10-07 11:37:00\n",
"Pierce 0.0 2020-10-10 14:07:00\n",
"Roger 1.0 2020-10-09 10:51:00"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_fruits[[\"apples\", \"time\"]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Write our data to a CSV file\n",
"\n",
"What if we want to write the file ?"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:22:26.287324Z",
"start_time": "2022-01-19T11:22:26.277669Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" apples | \n",
" oranges | \n",
" bananas | \n",
" time | \n",
"
\n",
" \n",
" \n",
" \n",
" Daniel | \n",
" 3.0 | \n",
" 0 | \n",
" 0 | \n",
" 2020-10-08 12:13:00 | \n",
"
\n",
" \n",
" Sean | \n",
" NaN | \n",
" 3 | \n",
" 2 | \n",
" 2020-10-07 11:37:00 | \n",
"
\n",
" \n",
" Pierce | \n",
" 0.0 | \n",
" 7 | \n",
" 1 | \n",
" 2020-10-10 14:07:00 | \n",
"
\n",
" \n",
" Roger | \n",
" 1.0 | \n",
" 2 | \n",
" 6 | \n",
" 2020-10-09 10:51:00 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" apples oranges bananas time\n",
"Daniel 3.0 0 0 2020-10-08 12:13:00\n",
"Sean NaN 3 2 2020-10-07 11:37:00\n",
"Pierce 0.0 7 1 2020-10-10 14:07:00\n",
"Roger 1.0 2 6 2020-10-09 10:51:00"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_fruits"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:22:28.613982Z",
"start_time": "2022-01-19T11:22:28.602540Z"
}
},
"outputs": [],
"source": [
"df_fruits.to_csv(\"fruits.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:22:41.309341Z",
"start_time": "2022-01-19T11:22:40.971618Z"
},
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-rw-r--r-- 1 root root 163 Jan 19 12:22 fruits.csv\r\n"
]
}
],
"source": [
"# Use !dir on windows\n",
"!ls -alh | grep fru"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:23:12.183964Z",
"start_time": "2022-01-19T11:23:11.859384Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
",apples,oranges,bananas,time\r\n",
"Daniel,3.0,0,0,2020-10-08 12:13:00\r\n",
"Sean,,3,2,2020-10-07 11:37:00\r\n",
"Pierce,0.0,7,1,2020-10-10 14:07:00\r\n",
"Roger,1.0,2,6,2020-10-09 10:51:00\r\n"
]
}
],
"source": [
"!head -n 5 fruits.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading data and working with it\n",
"\n",
"Let's read the file `tips.csv`. It is a `CSV` file (**C**omma **S**eparated **V**alues).\n",
"\n",
"It contains data about a restaurant: the bill, tip and some informations about the customers."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:24:03.254808Z",
"start_time": "2022-01-19T11:24:03.193606Z"
},
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The file ./tips.csv already exists.\n"
]
}
],
"source": [
"import requests\n",
"import os\n",
"\n",
"# The path containing your notebook\n",
"path_data = './'\n",
"# The name of the file\n",
"filename = 'tips.csv'\n",
"\n",
"if os.path.exists(os.path.join(path_data, filename)):\n",
" print('The file %s already exists.' % os.path.join(path_data, filename))\n",
"else:\n",
" url = 'http://stephanegaiffas.github.io/big_data_course/data/tips.csv'\n",
" r = requests.get(url)\n",
" with open(os.path.join(path_data, filename), 'wb') as f:\n",
" f.write(r.content)\n",
" print('Downloaded file %s.' % os.path.join(path_data, filename))"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:25:11.872794Z",
"start_time": "2022-01-19T11:25:11.854234Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" total_bill | \n",
" tip | \n",
" sex | \n",
" smoker | \n",
" day | \n",
" time | \n",
" size | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 16.99 | \n",
" 1.01 | \n",
" Female | \n",
" No | \n",
" Sun | \n",
" Dinner | \n",
" 2 | \n",
"
\n",
" \n",
" 1 | \n",
" 10.34 | \n",
" 1.66 | \n",
" Male | \n",
" No | \n",
" Sun | \n",
" Dinner | \n",
" 3 | \n",
"
\n",
" \n",
" 2 | \n",
" 21.01 | \n",
" 3.50 | \n",
" Male | \n",
" No | \n",
" Sun | \n",
" Dinner | \n",
" 3 | \n",
"
\n",
" \n",
" 3 | \n",
" 23.68 | \n",
" 3.31 | \n",
" Male | \n",
" No | \n",
" Sun | \n",
" Dinner | \n",
" 2 | \n",
"
\n",
" \n",
" 4 | \n",
" 24.59 | \n",
" 3.61 | \n",
" Female | \n",
" No | \n",
" Sun | \n",
" Dinner | \n",
" 4 | \n",
"
\n",
" \n",
" 5 | \n",
" 25.29 | \n",
" 4.71 | \n",
" Male | \n",
" No | \n",
" Sun | \n",
" Dinner | \n",
" 4 | \n",
"
\n",
" \n",
" 6 | \n",
" 8.77 | \n",
" 2.00 | \n",
" Male | \n",
" No | \n",
" Sun | \n",
" Dinner | \n",
" 2 | \n",
"
\n",
" \n",
" 7 | \n",
" 26.88 | \n",
" 3.12 | \n",
" Male | \n",
" No | \n",
" Sun | \n",
" Dinner | \n",
" 4 | \n",
"
\n",
" \n",
" 8 | \n",
" 15.04 | \n",
" 1.96 | \n",
" Male | \n",
" No | \n",
" Sun | \n",
" Dinner | \n",
" 2 | \n",
"
\n",
" \n",
" 9 | \n",
" 14.78 | \n",
" 3.23 | \n",
" Male | \n",
" No | \n",
" Sun | \n",
" Dinner | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" total_bill tip sex smoker day time size\n",
"0 16.99 1.01 Female No Sun Dinner 2\n",
"1 10.34 1.66 Male No Sun Dinner 3\n",
"2 21.01 3.50 Male No Sun Dinner 3\n",
"3 23.68 3.31 Male No Sun Dinner 2\n",
"4 24.59 3.61 Female No Sun Dinner 4\n",
"5 25.29 4.71 Male No Sun Dinner 4\n",
"6 8.77 2.00 Male No Sun Dinner 2\n",
"7 26.88 3.12 Male No Sun Dinner 4\n",
"8 15.04 1.96 Male No Sun Dinner 2\n",
"9 14.78 3.23 Male No Sun Dinner 2"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv(\"tips.csv\")\n",
"\n",
"# `.head()` shows the first rows of the dataframe\n",
"df.head(n=10)"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:25:14.135464Z",
"start_time": "2022-01-19T11:25:14.120703Z"
},
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 244 entries, 0 to 243\n",
"Data columns (total 7 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 total_bill 244 non-null float64\n",
" 1 tip 244 non-null float64\n",
" 2 sex 244 non-null object \n",
" 3 smoker 244 non-null object \n",
" 4 day 244 non-null object \n",
" 5 time 244 non-null object \n",
" 6 size 244 non-null int64 \n",
"dtypes: float64(2), int64(1), object(4)\n",
"memory usage: 13.5+ KB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:25:55.860622Z",
"start_time": "2022-01-19T11:25:55.855132Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"'Sun'"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[42, \"day\"]"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:26:04.074000Z",
"start_time": "2022-01-19T11:26:04.068748Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"str"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(df.loc[42, \"day\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default, columns that are non-numerical contain strings (`str` type)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The `category` type\n",
"\n",
"An important type in `pandas` is `category` for variables that are **non-numerical**\n",
"\n",
"**Pro tip.** It's always a good idea to tell `pandas` which columns should be imported as **categorical**\n",
"\n",
"So, let's read again the file specifying some `dtype`s to the `read_csv` function"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:27:43.266029Z",
"start_time": "2022-01-19T11:27:43.254997Z"
}
},
"outputs": [],
"source": [
"dtypes = {\n",
" \"sex\": \"category\",\n",
" \"smoker\": \"category\",\n",
" \"day\": \"category\",\n",
" \"time\": \"category\"\n",
"} \n",
"\n",
"df = pd.read_csv(\"tips.csv\", dtype=dtypes)"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:27:43.397120Z",
"start_time": "2022-01-19T11:27:43.390806Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"total_bill float64\n",
"tip float64\n",
"sex category\n",
"smoker category\n",
"day category\n",
"time category\n",
"size int64\n",
"dtype: object"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computing statistics"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:28:48.597833Z",
"start_time": "2022-01-19T11:28:48.572605Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" total_bill | \n",
" tip | \n",
" size | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 244.000000 | \n",
" 244.000000 | \n",
" 244.000000 | \n",
"
\n",
" \n",
" mean | \n",
" 19.785943 | \n",
" 2.998279 | \n",
" 2.569672 | \n",
"
\n",
" \n",
" std | \n",
" 8.902412 | \n",
" 1.383638 | \n",
" 0.951100 | \n",
"
\n",
" \n",
" min | \n",
" 3.070000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
"
\n",
" \n",
" 25% | \n",
" 13.347500 | \n",
" 2.000000 | \n",
" 2.000000 | \n",
"
\n",
" \n",
" 50% | \n",
" 17.795000 | \n",
" 2.900000 | \n",
" 2.000000 | \n",
"
\n",
" \n",
" 75% | \n",
" 24.127500 | \n",
" 3.562500 | \n",
" 3.000000 | \n",
"
\n",
" \n",
" max | \n",
" 50.810000 | \n",
" 10.000000 | \n",
" 6.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" total_bill tip size\n",
"count 244.000000 244.000000 244.000000\n",
"mean 19.785943 2.998279 2.569672\n",
"std 8.902412 1.383638 0.951100\n",
"min 3.070000 1.000000 1.000000\n",
"25% 13.347500 2.000000 2.000000\n",
"50% 17.795000 2.900000 2.000000\n",
"75% 24.127500 3.562500 3.000000\n",
"max 50.810000 10.000000 6.000000"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# The describe method only shows statistics for the numerical columns by default\n",
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:28:52.927812Z",
"start_time": "2022-01-19T11:28:52.861560Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" total_bill | \n",
" tip | \n",
" sex | \n",
" smoker | \n",
" day | \n",
" time | \n",
" size | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 244.000000 | \n",
" 244.000000 | \n",
" 244 | \n",
" 244 | \n",
" 244 | \n",
" 244 | \n",
" 244.000000 | \n",
"
\n",
" \n",
" unique | \n",
" NaN | \n",
" NaN | \n",
" 2 | \n",
" 2 | \n",
" 4 | \n",
" 2 | \n",
" NaN | \n",
"
\n",
" \n",
" top | \n",
" NaN | \n",
" NaN | \n",
" Male | \n",
" No | \n",
" Sat | \n",
" Dinner | \n",
" NaN | \n",
"
\n",
" \n",
" freq | \n",
" NaN | \n",
" NaN | \n",
" 157 | \n",
" 151 | \n",
" 87 | \n",
" 176 | \n",
" NaN | \n",
"
\n",
" \n",
" mean | \n",
" 19.785943 | \n",
" 2.998279 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 2.569672 | \n",
"
\n",
" \n",
" std | \n",
" 8.902412 | \n",
" 1.383638 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 0.951100 | \n",
"
\n",
" \n",
" min | \n",
" 3.070000 | \n",
" 1.000000 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 1.000000 | \n",
"
\n",
" \n",
" 25% | \n",
" 13.347500 | \n",
" 2.000000 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 2.000000 | \n",
"
\n",
" \n",
" 50% | \n",
" 17.795000 | \n",
" 2.900000 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 2.000000 | \n",
"
\n",
" \n",
" 75% | \n",
" 24.127500 | \n",
" 3.562500 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 3.000000 | \n",
"
\n",
" \n",
" max | \n",
" 50.810000 | \n",
" 10.000000 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 6.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" total_bill tip sex smoker day time size\n",
"count 244.000000 244.000000 244 244 244 244 244.000000\n",
"unique NaN NaN 2 2 4 2 NaN\n",
"top NaN NaN Male No Sat Dinner NaN\n",
"freq NaN NaN 157 151 87 176 NaN\n",
"mean 19.785943 2.998279 NaN NaN NaN NaN 2.569672\n",
"std 8.902412 1.383638 NaN NaN NaN NaN 0.951100\n",
"min 3.070000 1.000000 NaN NaN NaN NaN 1.000000\n",
"25% 13.347500 2.000000 NaN NaN NaN NaN 2.000000\n",
"50% 17.795000 2.900000 NaN NaN NaN NaN 2.000000\n",
"75% 24.127500 3.562500 NaN NaN NaN NaN 3.000000\n",
"max 50.810000 10.000000 NaN NaN NaN NaN 6.000000"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# We use the include=\"all\" option to see everything\n",
"df.describe(include=\"all\")"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-19T11:30:32.769008Z",
"start_time": "2022-01-19T11:30:32.758201Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" total_bill | \n",
" tip | \n",
" size | \n",
"
\n",
" \n",
" \n",
" \n",
" total_bill | \n",
" 1.000000 | \n",
" 0.675734 | \n",
" 0.598315 | \n",
"
\n",
" \n",
" tip | \n",
" 0.675734 | \n",
" 1.000000 | \n",
" 0.489299 | \n",
"
\n",
" \n",
" size | \n",
" 0.598315 | \n",
" 0.489299 | \n",
" 1.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" total_bill tip size\n",
"total_bill 1.000000 0.675734 0.598315\n",
"tip 0.675734 1.000000 0.489299\n",
"size 0.598315 0.489299 1.000000"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Correlation between the numerical columns\n",
"df.corr()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data visualization with `matplotlib` and `seaborn`\n",
"\n",
"Let's show how we can use `matplotlib` and `seaborn` to visualize data contained in a `pandas` dataframe"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2022-01-26T07:51:56.942207Z",
"start_time": "2022-01-26T07:51:55.697821Z"
}
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import seaborn as sns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How do the tip depends on the total bill ?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:19.482039Z",
"start_time": "2021-01-15T09:51:19.072820Z"
}
},
"outputs": [],
"source": [
"sns.jointplot(x=\"total_bill\", y=\"tip\", data=df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## When do customers go to this restaurant ?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:19.607955Z",
"start_time": "2021-01-15T09:51:19.484391Z"
},
"scrolled": false
},
"outputs": [],
"source": [
"sns.countplot(x='day', hue=\"time\", data=df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## When do customers spend the most ?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:19.809624Z",
"start_time": "2021-01-15T09:51:19.609684Z"
}
},
"outputs": [],
"source": [
"plt.figure(figsize=(7, 5))\n",
"sns.boxplot(x='day', y='total_bill', hue='time', data=df)\n",
"plt.legend(loc=\"upper left\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:20.023200Z",
"start_time": "2021-01-15T09:51:19.812067Z"
}
},
"outputs": [],
"source": [
"plt.figure(figsize=(7, 5))\n",
"sns.violinplot(x='day', y='total_bill', hue='time', split=True, data=df)\n",
"plt.legend(loc=\"upper left\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Who spends the most ?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:20.225167Z",
"start_time": "2021-01-15T09:51:20.025918Z"
},
"scrolled": true
},
"outputs": [],
"source": [
"sns.boxplot(x='sex', y='total_bill', hue='smoker', data=df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## When should waiters want to work ?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:20.442940Z",
"start_time": "2021-01-15T09:51:20.227565Z"
},
"scrolled": true
},
"outputs": [],
"source": [
"sns.boxplot(x='day', y='tip', hue='time', data=df)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:20.689849Z",
"start_time": "2021-01-15T09:51:20.444898Z"
}
},
"outputs": [],
"source": [
"sns.violinplot(x='day', y='tip', hue='time', data=df)"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2020-10-09T12:32:56.371842Z",
"start_time": "2020-10-09T12:32:56.367372Z"
}
},
"source": [
"# Data processing with `pandas`\n",
"\n",
"Let us read again the `tips.csv` file"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:20.710132Z",
"start_time": "2021-01-15T09:51:20.692036Z"
},
"scrolled": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"dtypes = {\n",
" \"sex\": \"category\",\n",
" \"smoker\": \"category\",\n",
" \"day\": \"category\",\n",
" \"time\": \"category\"\n",
"} \n",
"\n",
"df = pd.read_csv(\"tips.csv\", dtype=dtypes)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computations using `pandas` : broadcasting\n",
"\n",
"Let's add a column that contains the tip percentage"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:20.725863Z",
"start_time": "2021-01-15T09:51:20.712207Z"
},
"scrolled": false
},
"outputs": [],
"source": [
"df[\"tip_percentage\"] = df[\"tip\"] / df[\"total_bill\"]\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The computation\n",
"```python\n",
"df[\"tip\"] / df[\"total_bill\"]\n",
"```\n",
"uses a **broadcast** rule.\n",
"\n",
"- We can multiply, add, subtract, etc. together `numpy` arrays, `Series` or `pandas` dataframes when the computation **makes sense** in view of their respective **shape**\n",
"\n",
"This principle is called **broadcast** or **broadcasting**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:20.732473Z",
"start_time": "2021-01-15T09:51:20.727786Z"
},
"scrolled": false
},
"outputs": [],
"source": [
"df[\"tip\"].shape, df[\"total_bill\"].shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `tip` and `total_bill`columns have the same `shape`, so broadcasting performs **pairwise division**.\n",
"\n",
"This corresponds to the following \"hand-crafted\" approach with a `for` loop:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:20.839274Z",
"start_time": "2021-01-15T09:51:20.734938Z"
}
},
"outputs": [],
"source": [
"for i in range(df.shape[0]):\n",
" df.loc[i, \"tip_percentage\"] = df.loc[i, \"tip\"] / df.loc[i, \"total_bill\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But using such a loop is: \n",
"\n",
"- much longer to write\n",
"- prone to mistakes\n",
"- ugly \n",
"- and **excruciatingly slower** :(\n",
"\n",
"**NEVER** use `Python` for-loops for numerical computations !"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:25.284758Z",
"start_time": "2021-01-15T09:51:20.841712Z"
}
},
"outputs": [],
"source": [
"%%timeit -n 10\n",
"for i in range(df.shape[0]):\n",
" df.loc[i, \"tip_percentage\"] = df.loc[i, \"tip\"] / df.loc[i, \"total_bill\"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:25.332780Z",
"start_time": "2021-01-15T09:51:25.286877Z"
}
},
"outputs": [],
"source": [
"%%timeit -n 10\n",
"df[\"tip_percentage\"] = df[\"tip\"] / df[\"total_bill\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `for` loop is $\\approx$ **200 times slower** ! (even worse on larger data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pitfall. Changing values in a `DataFrame`\n",
"\n",
"When you want to change a value in a `DataFrame`, never use\n",
"```python\n",
"df[\"tip_percentage\"].loc[i] = 42\n",
"```\n",
"but use\n",
"```python\n",
"df.loc[i, \"tip_percentage\"] = 42\n",
"```\n",
"namely, use a **single** `loc` or `iloc` statement. The first version **might not work**: it might modify a copy of the column and not the dataframe itself !"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another example of broadcasting is:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:25.345631Z",
"start_time": "2021-01-15T09:51:25.335148Z"
}
},
"outputs": [],
"source": [
"(100 * df[[\"tip_percentage\"]]).head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"where we multiplied **each entry** of the `tip_percentage` column by 100.\n",
"\n",
"**Remark.** Note the difference between\n",
"```python\n",
"df[['tip_percentage']]\n",
"```\n",
"which returns a `DataFrame` containing only the `tip_percentage` column and\n",
"```python\n",
"df['tip_percentage']\n",
"```\n",
"which returns a `Series` containing the data of the `tip_percentage` column"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Some more plots"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How do the tip percentages relates to the total bill ?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:25.832352Z",
"start_time": "2021-01-15T09:51:25.349460Z"
}
},
"outputs": [],
"source": [
"sns.jointplot(x=\"total_bill\", y=\"tip_percentage\", data=df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Who tips best ?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:26.063153Z",
"start_time": "2021-01-15T09:51:25.835329Z"
}
},
"outputs": [],
"source": [
"sns.boxplot(x='sex', y='tip_percentage', hue='smoker', data=df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Who tips best without the `tip_percentage` outliers ?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:26.338967Z",
"start_time": "2021-01-15T09:51:26.065918Z"
}
},
"outputs": [],
"source": [
"sns.boxplot(\n",
" x='sex', y='tip_percentage', hue='smoker', \n",
" data=df.loc[df[\"tip_percentage\"] <= 0.3]\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:26.347568Z",
"start_time": "2021-01-15T09:51:26.341347Z"
}
},
"outputs": [],
"source": [
"id(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The all-mighty `groupby` and `aggregate`\n",
"\n",
"Many computations can be formulated as a **groupby** followed by and **aggregation**.\n",
"\n",
"### What is the mean `tip` and `tip percentage` each day ?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:26.367070Z",
"start_time": "2021-01-15T09:51:26.350529Z"
}
},
"outputs": [],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:26.385448Z",
"start_time": "2021-01-15T09:51:26.369477Z"
},
"scrolled": true
},
"outputs": [],
"source": [
"df.groupby(\"day\").mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But we don't care about the `size` column here, so we can use insead"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:26.404776Z",
"start_time": "2021-01-15T09:51:26.389083Z"
}
},
"outputs": [],
"source": [
"df[[\"total_bill\", \"tip\", \"tip_percentage\", \"day\"]].groupby(\"day\").mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we want to be more precise, we can `groupby` using several columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:26.441028Z",
"start_time": "2021-01-15T09:51:26.408033Z"
}
},
"outputs": [],
"source": [
"df[[\"total_bill\", \"tip\", \"tip_percentage\", \"day\", \"time\"]].groupby([\"day\", \"time\"]).mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Remarks**\n",
"\n",
"- We obtain a `DataFrame` with a two-level indexing: on the `day` and the `time`\n",
"- Groups must be homogeneous: we have `NaN` values for empty groups (e.g. `Sat`, `Lunch`)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pro tip\n",
"\n",
"Sometimes it's more convenient to get the groups as columns instead of a multi-level index.
\n",
"For this, use `reset_index`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:26.477405Z",
"start_time": "2021-01-15T09:51:26.444302Z"
}
},
"outputs": [],
"source": [
"df[[\"total_bill\", \"tip\", \"tip_percentage\", \"day\", \"time\"]]\\\n",
" .groupby([\"day\", \"time\"]).mean().reset_index()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Another pro tip\n",
"\n",
"Computations with pandas can include many operations that are **pipelined** until the final computation.
\n",
"Pipelining many operations is good practice and perfectly normal, but in order to make the code readable you can put it between parenthesis (`python` expression) as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:26.511329Z",
"start_time": "2021-01-15T09:51:26.480282Z"
},
"scrolled": true
},
"outputs": [],
"source": [
"(\n",
" df[[\"total_bill\", \"tip\", \"tip_percentage\", \"day\", \"time\"]]\n",
" .groupby([\"day\", \"time\"])\n",
" .mean()\n",
" .reset_index()\n",
" # and on top of all this we sort the dataframe with respect \n",
" # to the tip_percentage\n",
" .sort_values(\"tip_percentage\")\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Displaying a `DataFrame` with `style`\n",
"\n",
"Now, we can answer, with style, to the question: what are the average tip percentages along the week ?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:26.668935Z",
"start_time": "2021-01-15T09:51:26.514338Z"
},
"scrolled": true
},
"outputs": [],
"source": [
"(\n",
" df[[\"tip_percentage\", \"day\", \"time\"]]\n",
" .groupby([\"day\", \"time\"])\n",
" .mean()\n",
" # At the end of the pipeline you can use .style\n",
" .style\n",
" # Print numerical values as percentages \n",
" .format(\"{:.2%}\")\n",
" .background_gradient()\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Removing the `NaN` values\n",
"\n",
"But the `NaN` values are somewhat annoying. Let's remove them"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:26.704232Z",
"start_time": "2021-01-15T09:51:26.674169Z"
}
},
"outputs": [],
"source": [
"(\n",
" df[[\"tip_percentage\", \"day\", \"time\"]]\n",
" .groupby([\"day\", \"time\"])\n",
" .mean()\n",
" # We just add this from the previous pipeline\n",
" .dropna()\n",
" .style\n",
" .format(\"{:.2%}\")\n",
" .background_gradient()\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we clearly see when `tip_percentage` is maximal. But what about the standard deviation ?\n",
"\n",
"- We used only `.mean()` for now, but we can use several aggregating function using `.agg()`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:26.766086Z",
"start_time": "2021-01-15T09:51:26.707178Z"
},
"scrolled": true
},
"outputs": [],
"source": [
"(\n",
" df[[\"tip_percentage\", \"day\", \"time\"]]\n",
" .groupby([\"day\", \"time\"])\n",
" .agg([\"mean\", \"std\"])\n",
" .dropna()\n",
" .style\n",
" .format(\"{:.2%}\")\n",
" .background_gradient()\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And we can use also `.describe()` as aggregation function. Moreover we\n",
"- use the `subset` option to specify which column we want to style\n",
"- we use `(\"tip_percentage\", \"count\")` to access multi-level index"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:26.851497Z",
"start_time": "2021-01-15T09:51:26.768613Z"
}
},
"outputs": [],
"source": [
"(\n",
" df[[\"tip_percentage\", \"day\", \"time\"]]\n",
" .groupby([\"day\", \"time\"])\n",
" .describe()\n",
" .dropna()\n",
" .style\n",
" .bar(subset=[(\"tip_percentage\", \"count\")])\n",
" .background_gradient(subset=[(\"tip_percentage\", \"50%\")])\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Supervised learning of `tip` based on the `total_bill` \n",
"\n",
"As an example of very simple **machine-learning** problem, let's try to understand how we can predict `tip` based on `total_bill`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:26.993824Z",
"start_time": "2021-01-15T09:51:26.854233Z"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"plt.scatter(df[\"total_bill\"], df[\"tip\"])\n",
"plt.xlabel(\"total_bill\", fontsize=12)\n",
"plt.ylabel(\"tip\", fontsize=12)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There's a rough **linear** dependence between the two. Let's try to find it by hand!
\n",
"Namely, we look for numbers $b$ and $w$ such that\n",
"\n",
"$$\n",
"\\texttt{tip} \\approx b + w \\times \\texttt{total_bill}\n",
"$$\n",
"\n",
"for all the examples of pairs of $(\\texttt{tip}, \\texttt{total_bill})$ we observe in the data.\n",
"\n",
"In **machine learning**, we say that this is a very simple example of a **supervised learning** problem (here it is a regression problem), where $\\texttt{tip}$ is the **label** and where $\\texttt{total_bill}$ is the (only) **feature**, for which we intend to use a **linear predictor**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:27.158925Z",
"start_time": "2021-01-15T09:51:26.996352Z"
}
},
"outputs": [],
"source": [
"plt.scatter(df[\"total_bill\"], df[\"tip\"])\n",
"plt.xlabel(\"total_bill\", fontsize=12)\n",
"plt.ylabel(\"tip\", fontsize=12)\n",
"\n",
"slope = 1.0\n",
"intercept = 0.0\n",
"\n",
"x = np.linspace(0, 50, 1000)\n",
"plt.plot(x, intercept + slope * x, color=\"red\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try to find it in a more interactive way. This might require\n",
"```python\n",
"!pip install ipympl\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:27.164804Z",
"start_time": "2021-01-15T09:51:27.161590Z"
}
},
"outputs": [],
"source": [
"# !pip install ipympl"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:27.361662Z",
"start_time": "2021-01-15T09:51:27.167569Z"
},
"scrolled": false
},
"outputs": [],
"source": [
"import ipywidgets as widgets\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"%matplotlib widget\n",
"%matplotlib inline\n",
"\n",
"x = np.linspace(0, 50, 1000)\n",
"\n",
"@widgets.interact(intercept=(-5, 5, 1.), slope=(0, 1, .05))\n",
"def update(intercept=0.0, slope=0.5):\n",
" plt.scatter(df[\"total_bill\"], df[\"tip\"])\n",
" plt.plot(x, intercept + slope * x, color=\"red\")\n",
" plt.xlim((0, 50))\n",
" plt.ylim((0, 10))\n",
" plt.xlabel(\"total_bill\", fontsize=12)\n",
" plt.ylabel(\"tip\", fontsize=12)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is kind of tedious to do this by hand... it would be nice to come up with an **automated** way of doing this. Moreover:\n",
"\n",
"- We are using a **linear** function, while something more complicated (such as a polynomial) might be better\n",
"- More importantly, we use **only** the `total_bill` column to predict the `tip`, while we know about many other things"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:27.383439Z",
"start_time": "2021-01-15T09:51:27.364363Z"
}
},
"outputs": [],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## One-hot encoding of categorical variables\n",
"\n",
"We can't perform computations (products and sums) with columns containing **categorical** variables. So, we can't use them like this to predict the `tip`.\n",
"We need to **convert** them to numbers somehow.\n",
"\n",
"The most classical approach for this is **one-hot encoding** (or \"create dummies\" or \"binarize\") of the categorical variables, which can be easily achieved with `pandas.get_dummies`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:57:27.607858Z",
"start_time": "2021-01-15T09:57:27.563562Z"
},
"scrolled": true,
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"df_one_hot = pd.get_dummies(df, prefix_sep='#')\n",
"df_one_hot.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Only the categorical columns have been one-hot encoded. For instance, the `\"day\"` column is replaced by 4 columns named `\"day#Thur\"`, `\"day#Fri\"`, `\"day#Sat\"`, `\"day#Sun\"`, since `\"day\"` has 4 modalities (see next line)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:27.433926Z",
"start_time": "2021-01-15T09:51:27.426358Z"
},
"scrolled": true
},
"outputs": [],
"source": [
"df['day'].unique()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:27.442464Z",
"start_time": "2021-01-15T09:51:27.436439Z"
},
"scrolled": false
},
"outputs": [],
"source": [
"df_one_hot.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pitfall. Colinearities with one-hot encoding\n",
"\n",
"Sums over dummies for `sex`, `smoker`, `day`, `time` and `size` are all equal to one (by constrution of the one-hot encoded vectors).\n",
"\n",
"- Leads to **colinearities** in the matrix of features\n",
"- It is **much harder** to train a linear regressor when the columns of the features matrix has colinearities"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:27.456985Z",
"start_time": "2021-01-15T09:51:27.445235Z"
},
"scrolled": true
},
"outputs": [],
"source": [
"day_cols = [col for col in df_one_hot.columns if col.startswith(\"day\")]\n",
"df_one_hot[day_cols].head()\n",
"df_one_hot[day_cols].sum(axis=1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:27.466975Z",
"start_time": "2021-01-15T09:51:27.459209Z"
}
},
"outputs": [],
"source": [
"all(df_one_hot[day_cols].sum(axis=1) == 1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The most standard solution is to remove a modality (i.e. remove a one-hot encoding vector). Simply achieved by specifying `drop_first=True` in the `get_dummies` function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:27.477837Z",
"start_time": "2021-01-15T09:51:27.469797Z"
}
},
"outputs": [],
"source": [
"df[\"day\"].unique()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-15T09:51:27.516588Z",
"start_time": "2021-01-15T09:51:27.480318Z"
}
},
"outputs": [],
"source": [
"pd.get_dummies(df, prefix_sep='#', drop_first=True).head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, if a categorical feature has $K$ modalities, we use only $K-1$ dummies.\n",
"For instance, there is no more `sex#Female` binary column. \n",
"\n",
"**Question.** So, a linear regression won't fit a weight for `sex#Female`. But, where do the model weights of the dropped binary columns go ?\n",
"\n",
"**Answer.** They just \"go\" to the **intercept**: interpretation of the population bias depends on the \"dropped\" one-hot encodings.\n",
"\n",
"So, we actually fit\n",
"\n",
"$$\n",
"\\begin{align*}\n",
"\\texttt{tip} \\approx b &+ w_1 \\times \\texttt{total_bill}\n",
"+ w_2 \\times \\texttt{size} \\\\\n",
"&+ w_3 \\times \\texttt{sex#Male}\n",
"+ w_4 \\times \\texttt{smoker#Yes} \\\\\n",
"&+ w_5 \\times \\texttt{day#Sat}\n",
"+ w_6 \\times \\texttt{day#Sun}\n",
"+ w_7 \\times \\texttt{day#Thur} \\\\\n",
"&+ w_8 \\times \\texttt{time#Lunch}\n",
"\\end{align*}\n",
"$$"
]
}
],
"metadata": {
"anaconda-cloud": {},
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {
"height": "calc(100% - 180px)",
"left": "10px",
"top": "150px",
"width": "193.25px"
},
"toc_section_display": true,
"toc_window_display": true
}
},
"nbformat": 4,
"nbformat_minor": 1
}