{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-09T12:32:56.371842Z",
     "start_time": "2020-10-09T12:32:56.367372Z"
    }
   },
   "source": [
    "# Introduction to `pandas`\n",
    "\n",
    "The `pandas` library (https://pandas.pydata.org) is one of the most used tool at the disposal of people working with data in `python` today.\n",
    "\n",
    "- It allows to **crunch data** easily\n",
    "- It mainly provides a `DataFrame` object (a **table of data**) with a huge set of functionalities"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Why ?\n",
    "\n",
    "Through `pandas`, you get acquainted with your data by **cleaning**, **transforming**, and **analyzing** it.\n",
    "\n",
    "- What's the average, median, max, or min of each column?\n",
    "- Does column A correlate with column B?\n",
    "- What does the distribution of data in column C look like?\n",
    "- Removing missing values, filter rows or columns using some criteria\n",
    "- Data visualization (when combined `matplotlib` or `seaborn` or others)\n",
    "- Store the cleaned, transformed data back into virtually any format or database"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Where ?\n",
    "\n",
    "`pandas` is a central component of the `python` \"stack\" for data science\n",
    "\n",
    "- `pandas` is built on top of `numpy`\n",
    "- often used in conjunction with other libraries\n",
    "- a `DataFrame` is often fed to plotting functions or machine learning algorithms (such as `scikit-learn`)\n",
    "- Well-interfaced with `jupyter`, leading to a nice interactive environment for data exploration and modeling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Core components of pandas\n",
    "\n",
    "The primary two components of pandas are the `Series` and `DataFrame`.\n",
    "\n",
    "- A `Series` is essentially a column\n",
    "- a `DataFrame` is a multi-dimensional table made up of a collection of `Series`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating a `DataFrame` from scratch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:07:08.634333Z",
     "start_time": "2022-01-19T11:07:08.201274Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>apples</th>\n",
       "      <th>oranges</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   apples  oranges\n",
       "0       3        0\n",
       "1       2        3\n",
       "2       0        7\n",
       "3       1        2"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "fruits = {\n",
    "    \"apples\": [3, 2, 0, 1],\n",
    "    \"oranges\": [0, 3, 7, 2]\n",
    "}\n",
    "\n",
    "df_fruits = pd.DataFrame(fruits)\n",
    "df_fruits"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:08:05.535463Z",
     "start_time": "2022-01-19T11:08:05.530187Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.frame.DataFrame"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(df_fruits)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:08:12.680877Z",
     "start_time": "2022-01-19T11:08:12.674128Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    3\n",
       "1    2\n",
       "2    0\n",
       "3    1\n",
       "Name: apples, dtype: int64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fruits[\"apples\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:08:40.517193Z",
     "start_time": "2022-01-19T11:08:40.511875Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.series.Series"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(df_fruits[\"apples\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Indexing\n",
    "\n",
    "- By default, a `DataFrame` uses a contiguous index\n",
    "- But what if we want to say **who** buys the fruits ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:08:51.018797Z",
     "start_time": "2022-01-19T11:08:51.009897Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>apples</th>\n",
       "      <th>oranges</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Daniel</th>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sean</th>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pierce</th>\n",
       "      <td>0</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Roger</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        apples  oranges\n",
       "Daniel       3        0\n",
       "Sean         2        3\n",
       "Pierce       0        7\n",
       "Roger        1        2"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fruits = pd.DataFrame(fruits, index=[\"Daniel\", \"Sean\", \"Pierce\", \"Roger\"])\n",
    "df_fruits"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `.loc` versus `.iloc`\n",
    "\n",
    "- `.loc` **loc**ates by name\n",
    "- `.iloc` **loc**ates by numerical **i**ndex"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:11:43.050618Z",
     "start_time": "2022-01-19T11:11:43.042623Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>apples</th>\n",
       "      <th>oranges</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Daniel</th>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sean</th>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pierce</th>\n",
       "      <td>0</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Roger</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        apples  oranges\n",
       "Daniel       3        0\n",
       "Sean         2        3\n",
       "Pierce       0        7\n",
       "Roger        1        2"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fruits"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:12:17.606988Z",
     "start_time": "2022-01-19T11:12:17.599866Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "apples     2\n",
       "oranges    3\n",
       "Name: Sean, dtype: int64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# What's in Sean's basket ?\n",
    "df_fruits.loc['Sean']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:12:31.777154Z",
     "start_time": "2022-01-19T11:12:31.771307Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Daniel    0\n",
       "Sean      3\n",
       "Pierce    7\n",
       "Roger     2\n",
       "Name: oranges, dtype: int64"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Who has oranges ?\n",
    "df_fruits.loc[:, 'oranges']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:13:12.355217Z",
     "start_time": "2022-01-19T11:13:12.350177Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# How many apples in Pierce's basket ?\n",
    "df_fruits.loc['Pierce', 'apples']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:13:18.688274Z",
     "start_time": "2022-01-19T11:13:18.676461Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>apples</th>\n",
       "      <th>oranges</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Daniel</th>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sean</th>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pierce</th>\n",
       "      <td>0</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Roger</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        apples  oranges\n",
       "Daniel       3        0\n",
       "Sean         2        3\n",
       "Pierce       0        7\n",
       "Roger        1        2"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fruits"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:13:20.548872Z",
     "start_time": "2022-01-19T11:13:20.542442Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fruits.iloc[2, 1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Main attributes and methods of a `DataFrame`\n",
    "\n",
    "A `DataFrame` has many **attributes**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:13:38.902805Z",
     "start_time": "2022-01-19T11:13:38.896985Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['apples', 'oranges'], dtype='object')"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fruits.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:14:23.322276Z",
     "start_time": "2022-01-19T11:14:23.316645Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['Daniel', 'Sean', 'Pierce', 'Roger'], dtype='object')"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fruits.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:14:32.512895Z",
     "start_time": "2022-01-19T11:14:32.506454Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "apples     int64\n",
       "oranges    int64\n",
       "dtype: object"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fruits.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A `DataFrame` has many **methods**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:17:20.721882Z",
     "start_time": "2022-01-19T11:17:20.708707Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Index: 4 entries, Daniel to Roger\n",
      "Data columns (total 2 columns):\n",
      " #   Column   Non-Null Count  Dtype  \n",
      "---  ------   --------------  -----  \n",
      " 0   apples   3 non-null      float64\n",
      " 1   oranges  4 non-null      int64  \n",
      "dtypes: float64(1), int64(1)\n",
      "memory usage: 268.0+ bytes\n"
     ]
    }
   ],
   "source": [
    "df_fruits.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:15:35.080217Z",
     "start_time": "2022-01-19T11:15:35.062875Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>apples</th>\n",
       "      <th>oranges</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>4.000000</td>\n",
       "      <td>4.00000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>1.500000</td>\n",
       "      <td>3.00000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>1.290994</td>\n",
       "      <td>2.94392</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>0.750000</td>\n",
       "      <td>1.50000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>1.500000</td>\n",
       "      <td>2.50000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>2.250000</td>\n",
       "      <td>4.00000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>3.000000</td>\n",
       "      <td>7.00000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         apples  oranges\n",
       "count  4.000000  4.00000\n",
       "mean   1.500000  3.00000\n",
       "std    1.290994  2.94392\n",
       "min    0.000000  0.00000\n",
       "25%    0.750000  1.50000\n",
       "50%    1.500000  2.50000\n",
       "75%    2.250000  4.00000\n",
       "max    3.000000  7.00000"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fruits.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Missing values\n",
    "\n",
    "What if we don't know how many apples are in Sean's basket ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:15:57.731545Z",
     "start_time": "2022-01-19T11:15:57.721720Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>apples</th>\n",
       "      <th>oranges</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Daniel</th>\n",
       "      <td>3.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sean</th>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pierce</th>\n",
       "      <td>0.0</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Roger</th>\n",
       "      <td>1.0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        apples  oranges\n",
       "Daniel     3.0        0\n",
       "Sean       NaN        3\n",
       "Pierce     0.0        7\n",
       "Roger      1.0        2"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fruits.loc['Sean', 'apples'] = None\n",
    "df_fruits"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:16:40.181235Z",
     "start_time": "2022-01-19T11:16:40.163184Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>apples</th>\n",
       "      <th>oranges</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>3.000000</td>\n",
       "      <td>4.00000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>1.333333</td>\n",
       "      <td>3.00000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>1.527525</td>\n",
       "      <td>2.94392</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>0.500000</td>\n",
       "      <td>1.50000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>2.50000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>2.000000</td>\n",
       "      <td>4.00000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>3.000000</td>\n",
       "      <td>7.00000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         apples  oranges\n",
       "count  3.000000  4.00000\n",
       "mean   1.333333  3.00000\n",
       "std    1.527525  2.94392\n",
       "min    0.000000  0.00000\n",
       "25%    0.500000  1.50000\n",
       "50%    1.000000  2.50000\n",
       "75%    2.000000  4.00000\n",
       "max    3.000000  7.00000"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fruits.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that `count` is **3** for apples now, since we have 1 missing value among the 4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Adding a column\n",
    "\n",
    "Ooooops, we forgot about the bananas !"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:17:38.191383Z",
     "start_time": "2022-01-19T11:17:38.181663Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>apples</th>\n",
       "      <th>oranges</th>\n",
       "      <th>bananas</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Daniel</th>\n",
       "      <td>3.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sean</th>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pierce</th>\n",
       "      <td>0.0</td>\n",
       "      <td>7</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Roger</th>\n",
       "      <td>1.0</td>\n",
       "      <td>2</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        apples  oranges  bananas\n",
       "Daniel     3.0        0        0\n",
       "Sean       NaN        3        2\n",
       "Pierce     0.0        7        1\n",
       "Roger      1.0        2        6"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fruits[\"bananas\"] = [0, 2, 1, 6]\n",
    "df_fruits"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Adding a column with the date\n",
    "\n",
    "And we forgot the dates !"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:17:45.450938Z",
     "start_time": "2022-01-19T11:17:45.439619Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>apples</th>\n",
       "      <th>oranges</th>\n",
       "      <th>bananas</th>\n",
       "      <th>time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Daniel</th>\n",
       "      <td>3.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2020/10/08 12:13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sean</th>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>2020/10/07 11:37</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pierce</th>\n",
       "      <td>0.0</td>\n",
       "      <td>7</td>\n",
       "      <td>1</td>\n",
       "      <td>2020/10/10 14:07</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Roger</th>\n",
       "      <td>1.0</td>\n",
       "      <td>2</td>\n",
       "      <td>6</td>\n",
       "      <td>2020/10/09 10:51</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        apples  oranges  bananas              time\n",
       "Daniel     3.0        0        0  2020/10/08 12:13\n",
       "Sean       NaN        3        2  2020/10/07 11:37\n",
       "Pierce     0.0        7        1  2020/10/10 14:07\n",
       "Roger      1.0        2        6  2020/10/09 10:51"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fruits['time'] = [\n",
    "    \"2020/10/08 12:13\", \"2020/10/07 11:37\", \n",
    "    \"2020/10/10 14:07\", \"2020/10/09 10:51\"\n",
    "]\n",
    "df_fruits"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:18:04.481768Z",
     "start_time": "2022-01-19T11:18:04.475480Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "apples     float64\n",
       "oranges      int64\n",
       "bananas      int64\n",
       "time        object\n",
       "dtype: object"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fruits.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:18:19.387128Z",
     "start_time": "2022-01-19T11:18:19.381754Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "str"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(df_fruits.loc[\"Roger\", \"time\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's not a date but a string (`str`) ! So we convert this column to something called `datetime`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:19:11.306595Z",
     "start_time": "2022-01-19T11:19:11.293662Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>apples</th>\n",
       "      <th>oranges</th>\n",
       "      <th>bananas</th>\n",
       "      <th>time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Daniel</th>\n",
       "      <td>3.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2020-10-08 12:13:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sean</th>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>2020-10-07 11:37:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pierce</th>\n",
       "      <td>0.0</td>\n",
       "      <td>7</td>\n",
       "      <td>1</td>\n",
       "      <td>2020-10-10 14:07:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Roger</th>\n",
       "      <td>1.0</td>\n",
       "      <td>2</td>\n",
       "      <td>6</td>\n",
       "      <td>2020-10-09 10:51:00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        apples  oranges  bananas                time\n",
       "Daniel     3.0        0        0 2020-10-08 12:13:00\n",
       "Sean       NaN        3        2 2020-10-07 11:37:00\n",
       "Pierce     0.0        7        1 2020-10-10 14:07:00\n",
       "Roger      1.0        2        6 2020-10-09 10:51:00"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fruits[\"time\"] = pd.to_datetime(df_fruits[\"time\"])\n",
    "df_fruits"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:19:41.683741Z",
     "start_time": "2022-01-19T11:19:41.677507Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "apples            float64\n",
       "oranges             int64\n",
       "bananas             int64\n",
       "time       datetime64[ns]\n",
       "dtype: object"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fruits.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What if we want to keep only the baskets after (including) October, 9th ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:19:51.787589Z",
     "start_time": "2022-01-19T11:19:51.775535Z"
    },
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>apples</th>\n",
       "      <th>oranges</th>\n",
       "      <th>bananas</th>\n",
       "      <th>time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Pierce</th>\n",
       "      <td>0.0</td>\n",
       "      <td>7</td>\n",
       "      <td>1</td>\n",
       "      <td>2020-10-10 14:07:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Roger</th>\n",
       "      <td>1.0</td>\n",
       "      <td>2</td>\n",
       "      <td>6</td>\n",
       "      <td>2020-10-09 10:51:00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        apples  oranges  bananas                time\n",
       "Pierce     0.0        7        1 2020-10-10 14:07:00\n",
       "Roger      1.0        2        6 2020-10-09 10:51:00"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fruits.loc[df_fruits[\"time\"] >= pd.Timestamp(\"2020/10/09\")]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Slices and subsets of rows or columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:20:30.215178Z",
     "start_time": "2022-01-19T11:20:30.205482Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>apples</th>\n",
       "      <th>oranges</th>\n",
       "      <th>bananas</th>\n",
       "      <th>time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Daniel</th>\n",
       "      <td>3.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2020-10-08 12:13:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sean</th>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>2020-10-07 11:37:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pierce</th>\n",
       "      <td>0.0</td>\n",
       "      <td>7</td>\n",
       "      <td>1</td>\n",
       "      <td>2020-10-10 14:07:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Roger</th>\n",
       "      <td>1.0</td>\n",
       "      <td>2</td>\n",
       "      <td>6</td>\n",
       "      <td>2020-10-09 10:51:00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        apples  oranges  bananas                time\n",
       "Daniel     3.0        0        0 2020-10-08 12:13:00\n",
       "Sean       NaN        3        2 2020-10-07 11:37:00\n",
       "Pierce     0.0        7        1 2020-10-10 14:07:00\n",
       "Roger      1.0        2        6 2020-10-09 10:51:00"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fruits"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:20:32.076067Z",
     "start_time": "2022-01-19T11:20:32.066100Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>bananas</th>\n",
       "      <th>time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Daniel</th>\n",
       "      <td>0</td>\n",
       "      <td>2020-10-08 12:13:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sean</th>\n",
       "      <td>2</td>\n",
       "      <td>2020-10-07 11:37:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pierce</th>\n",
       "      <td>1</td>\n",
       "      <td>2020-10-10 14:07:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Roger</th>\n",
       "      <td>6</td>\n",
       "      <td>2020-10-09 10:51:00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        bananas                time\n",
       "Daniel        0 2020-10-08 12:13:00\n",
       "Sean          2 2020-10-07 11:37:00\n",
       "Pierce        1 2020-10-10 14:07:00\n",
       "Roger         6 2020-10-09 10:51:00"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fruits.loc[:, \"bananas\":\"time\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:20:46.504725Z",
     "start_time": "2022-01-19T11:20:46.494468Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>apples</th>\n",
       "      <th>oranges</th>\n",
       "      <th>bananas</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Daniel</th>\n",
       "      <td>3.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sean</th>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        apples  oranges  bananas\n",
       "Daniel     3.0        0        0\n",
       "Sean       NaN        3        2"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fruits.loc[\"Daniel\":\"Sean\", \"apples\":\"bananas\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:21:08.463797Z",
     "start_time": "2022-01-19T11:21:08.452887Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>apples</th>\n",
       "      <th>time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Daniel</th>\n",
       "      <td>3.0</td>\n",
       "      <td>2020-10-08 12:13:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sean</th>\n",
       "      <td>NaN</td>\n",
       "      <td>2020-10-07 11:37:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pierce</th>\n",
       "      <td>0.0</td>\n",
       "      <td>2020-10-10 14:07:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Roger</th>\n",
       "      <td>1.0</td>\n",
       "      <td>2020-10-09 10:51:00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        apples                time\n",
       "Daniel     3.0 2020-10-08 12:13:00\n",
       "Sean       NaN 2020-10-07 11:37:00\n",
       "Pierce     0.0 2020-10-10 14:07:00\n",
       "Roger      1.0 2020-10-09 10:51:00"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fruits[[\"apples\", \"time\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Write our data to a CSV file\n",
    "\n",
    "What if we want to write the file ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:22:26.287324Z",
     "start_time": "2022-01-19T11:22:26.277669Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>apples</th>\n",
       "      <th>oranges</th>\n",
       "      <th>bananas</th>\n",
       "      <th>time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Daniel</th>\n",
       "      <td>3.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2020-10-08 12:13:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sean</th>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>2020-10-07 11:37:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pierce</th>\n",
       "      <td>0.0</td>\n",
       "      <td>7</td>\n",
       "      <td>1</td>\n",
       "      <td>2020-10-10 14:07:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Roger</th>\n",
       "      <td>1.0</td>\n",
       "      <td>2</td>\n",
       "      <td>6</td>\n",
       "      <td>2020-10-09 10:51:00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        apples  oranges  bananas                time\n",
       "Daniel     3.0        0        0 2020-10-08 12:13:00\n",
       "Sean       NaN        3        2 2020-10-07 11:37:00\n",
       "Pierce     0.0        7        1 2020-10-10 14:07:00\n",
       "Roger      1.0        2        6 2020-10-09 10:51:00"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fruits"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:22:28.613982Z",
     "start_time": "2022-01-19T11:22:28.602540Z"
    }
   },
   "outputs": [],
   "source": [
    "df_fruits.to_csv(\"fruits.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:22:41.309341Z",
     "start_time": "2022-01-19T11:22:40.971618Z"
    },
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-rw-r--r--  1 root root  163 Jan 19 12:22 fruits.csv\r\n"
     ]
    }
   ],
   "source": [
    "# Use !dir on windows\n",
    "!ls -alh | grep fru"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:23:12.183964Z",
     "start_time": "2022-01-19T11:23:11.859384Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ",apples,oranges,bananas,time\r\n",
      "Daniel,3.0,0,0,2020-10-08 12:13:00\r\n",
      "Sean,,3,2,2020-10-07 11:37:00\r\n",
      "Pierce,0.0,7,1,2020-10-10 14:07:00\r\n",
      "Roger,1.0,2,6,2020-10-09 10:51:00\r\n"
     ]
    }
   ],
   "source": [
    "!head -n 5 fruits.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading data and working with it\n",
    "\n",
    "Let's read the file `tips.csv`. It is a `CSV` file (**C**omma **S**eparated **V**alues).\n",
    "\n",
    "It contains data about a restaurant: the bill, tip and some informations about the customers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:24:03.254808Z",
     "start_time": "2022-01-19T11:24:03.193606Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The file ./tips.csv already exists.\n"
     ]
    }
   ],
   "source": [
    "import requests\n",
    "import os\n",
    "\n",
    "# The path containing your notebook\n",
    "path_data = './'\n",
    "# The name of the file\n",
    "filename = 'tips.csv'\n",
    "\n",
    "if os.path.exists(os.path.join(path_data, filename)):\n",
    "    print('The file %s already exists.' % os.path.join(path_data, filename))\n",
    "else:\n",
    "    url = 'http://stephanegaiffas.github.io/big_data_course/data/tips.csv'\n",
    "    r = requests.get(url)\n",
    "    with open(os.path.join(path_data, filename), 'wb') as f:\n",
    "        f.write(r.content)\n",
    "    print('Downloaded file %s.' % os.path.join(path_data, filename))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:25:11.872794Z",
     "start_time": "2022-01-19T11:25:11.854234Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>total_bill</th>\n",
       "      <th>tip</th>\n",
       "      <th>sex</th>\n",
       "      <th>smoker</th>\n",
       "      <th>day</th>\n",
       "      <th>time</th>\n",
       "      <th>size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>16.99</td>\n",
       "      <td>1.01</td>\n",
       "      <td>Female</td>\n",
       "      <td>No</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>10.34</td>\n",
       "      <td>1.66</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>21.01</td>\n",
       "      <td>3.50</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>23.68</td>\n",
       "      <td>3.31</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>24.59</td>\n",
       "      <td>3.61</td>\n",
       "      <td>Female</td>\n",
       "      <td>No</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>25.29</td>\n",
       "      <td>4.71</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>8.77</td>\n",
       "      <td>2.00</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>26.88</td>\n",
       "      <td>3.12</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>15.04</td>\n",
       "      <td>1.96</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>14.78</td>\n",
       "      <td>3.23</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   total_bill   tip     sex smoker  day    time  size\n",
       "0       16.99  1.01  Female     No  Sun  Dinner     2\n",
       "1       10.34  1.66    Male     No  Sun  Dinner     3\n",
       "2       21.01  3.50    Male     No  Sun  Dinner     3\n",
       "3       23.68  3.31    Male     No  Sun  Dinner     2\n",
       "4       24.59  3.61  Female     No  Sun  Dinner     4\n",
       "5       25.29  4.71    Male     No  Sun  Dinner     4\n",
       "6        8.77  2.00    Male     No  Sun  Dinner     2\n",
       "7       26.88  3.12    Male     No  Sun  Dinner     4\n",
       "8       15.04  1.96    Male     No  Sun  Dinner     2\n",
       "9       14.78  3.23    Male     No  Sun  Dinner     2"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(\"tips.csv\")\n",
    "\n",
    "# `.head()` shows the first rows of the dataframe\n",
    "df.head(n=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:25:14.135464Z",
     "start_time": "2022-01-19T11:25:14.120703Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 244 entries, 0 to 243\n",
      "Data columns (total 7 columns):\n",
      " #   Column      Non-Null Count  Dtype  \n",
      "---  ------      --------------  -----  \n",
      " 0   total_bill  244 non-null    float64\n",
      " 1   tip         244 non-null    float64\n",
      " 2   sex         244 non-null    object \n",
      " 3   smoker      244 non-null    object \n",
      " 4   day         244 non-null    object \n",
      " 5   time        244 non-null    object \n",
      " 6   size        244 non-null    int64  \n",
      "dtypes: float64(2), int64(1), object(4)\n",
      "memory usage: 13.5+ KB\n"
     ]
    }
   ],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:25:55.860622Z",
     "start_time": "2022-01-19T11:25:55.855132Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Sun'"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.loc[42, \"day\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:26:04.074000Z",
     "start_time": "2022-01-19T11:26:04.068748Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "str"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(df.loc[42, \"day\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, columns that are non-numerical contain strings (`str` type)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The `category` type\n",
    "\n",
    "An important type in `pandas` is `category` for variables that are **non-numerical**\n",
    "\n",
    "**Pro tip.** It's always a good idea to tell `pandas` which columns should be imported as **categorical**\n",
    "\n",
    "So, let's read again the file specifying some `dtype`s to the `read_csv` function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:27:43.266029Z",
     "start_time": "2022-01-19T11:27:43.254997Z"
    }
   },
   "outputs": [],
   "source": [
    "dtypes = {\n",
    "    \"sex\": \"category\",\n",
    "    \"smoker\": \"category\",\n",
    "    \"day\": \"category\",\n",
    "    \"time\": \"category\"\n",
    "} \n",
    "\n",
    "df = pd.read_csv(\"tips.csv\", dtype=dtypes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:27:43.397120Z",
     "start_time": "2022-01-19T11:27:43.390806Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "total_bill     float64\n",
       "tip            float64\n",
       "sex           category\n",
       "smoker        category\n",
       "day           category\n",
       "time          category\n",
       "size             int64\n",
       "dtype: object"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Computing statistics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:28:48.597833Z",
     "start_time": "2022-01-19T11:28:48.572605Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>total_bill</th>\n",
       "      <th>tip</th>\n",
       "      <th>size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>244.000000</td>\n",
       "      <td>244.000000</td>\n",
       "      <td>244.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>19.785943</td>\n",
       "      <td>2.998279</td>\n",
       "      <td>2.569672</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>8.902412</td>\n",
       "      <td>1.383638</td>\n",
       "      <td>0.951100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>3.070000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>13.347500</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>2.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>17.795000</td>\n",
       "      <td>2.900000</td>\n",
       "      <td>2.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>24.127500</td>\n",
       "      <td>3.562500</td>\n",
       "      <td>3.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>50.810000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>6.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       total_bill         tip        size\n",
       "count  244.000000  244.000000  244.000000\n",
       "mean    19.785943    2.998279    2.569672\n",
       "std      8.902412    1.383638    0.951100\n",
       "min      3.070000    1.000000    1.000000\n",
       "25%     13.347500    2.000000    2.000000\n",
       "50%     17.795000    2.900000    2.000000\n",
       "75%     24.127500    3.562500    3.000000\n",
       "max     50.810000   10.000000    6.000000"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The describe method only shows statistics for the numerical columns by default\n",
    "df.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:28:52.927812Z",
     "start_time": "2022-01-19T11:28:52.861560Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>total_bill</th>\n",
       "      <th>tip</th>\n",
       "      <th>sex</th>\n",
       "      <th>smoker</th>\n",
       "      <th>day</th>\n",
       "      <th>time</th>\n",
       "      <th>size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>244.000000</td>\n",
       "      <td>244.000000</td>\n",
       "      <td>244</td>\n",
       "      <td>244</td>\n",
       "      <td>244</td>\n",
       "      <td>244</td>\n",
       "      <td>244.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>unique</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>top</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "      <td>Sat</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>freq</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>157</td>\n",
       "      <td>151</td>\n",
       "      <td>87</td>\n",
       "      <td>176</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>19.785943</td>\n",
       "      <td>2.998279</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.569672</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>8.902412</td>\n",
       "      <td>1.383638</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.951100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>3.070000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>13.347500</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>17.795000</td>\n",
       "      <td>2.900000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>24.127500</td>\n",
       "      <td>3.562500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>50.810000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>6.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        total_bill         tip   sex smoker  day    time        size\n",
       "count   244.000000  244.000000   244    244  244     244  244.000000\n",
       "unique         NaN         NaN     2      2    4       2         NaN\n",
       "top            NaN         NaN  Male     No  Sat  Dinner         NaN\n",
       "freq           NaN         NaN   157    151   87     176         NaN\n",
       "mean     19.785943    2.998279   NaN    NaN  NaN     NaN    2.569672\n",
       "std       8.902412    1.383638   NaN    NaN  NaN     NaN    0.951100\n",
       "min       3.070000    1.000000   NaN    NaN  NaN     NaN    1.000000\n",
       "25%      13.347500    2.000000   NaN    NaN  NaN     NaN    2.000000\n",
       "50%      17.795000    2.900000   NaN    NaN  NaN     NaN    2.000000\n",
       "75%      24.127500    3.562500   NaN    NaN  NaN     NaN    3.000000\n",
       "max      50.810000   10.000000   NaN    NaN  NaN     NaN    6.000000"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# We use the include=\"all\" option to see everything\n",
    "df.describe(include=\"all\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-19T11:30:32.769008Z",
     "start_time": "2022-01-19T11:30:32.758201Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>total_bill</th>\n",
       "      <th>tip</th>\n",
       "      <th>size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>total_bill</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.675734</td>\n",
       "      <td>0.598315</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>tip</th>\n",
       "      <td>0.675734</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.489299</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>size</th>\n",
       "      <td>0.598315</td>\n",
       "      <td>0.489299</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            total_bill       tip      size\n",
       "total_bill    1.000000  0.675734  0.598315\n",
       "tip           0.675734  1.000000  0.489299\n",
       "size          0.598315  0.489299  1.000000"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Correlation between the numerical columns\n",
    "df.corr()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data visualization with `matplotlib` and `seaborn`\n",
    "\n",
    "Let's show how we can use `matplotlib` and `seaborn` to visualize data contained in a `pandas` dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-01-26T07:51:56.942207Z",
     "start_time": "2022-01-26T07:51:55.697821Z"
    }
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How do the tip depends on the total bill ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:19.482039Z",
     "start_time": "2021-01-15T09:51:19.072820Z"
    }
   },
   "outputs": [],
   "source": [
    "sns.jointplot(x=\"total_bill\", y=\"tip\", data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## When do customers go to this restaurant ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:19.607955Z",
     "start_time": "2021-01-15T09:51:19.484391Z"
    },
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "sns.countplot(x='day', hue=\"time\", data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## When do customers spend the most ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:19.809624Z",
     "start_time": "2021-01-15T09:51:19.609684Z"
    }
   },
   "outputs": [],
   "source": [
    "plt.figure(figsize=(7, 5))\n",
    "sns.boxplot(x='day', y='total_bill', hue='time', data=df)\n",
    "plt.legend(loc=\"upper left\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:20.023200Z",
     "start_time": "2021-01-15T09:51:19.812067Z"
    }
   },
   "outputs": [],
   "source": [
    "plt.figure(figsize=(7, 5))\n",
    "sns.violinplot(x='day', y='total_bill', hue='time', split=True, data=df)\n",
    "plt.legend(loc=\"upper left\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Who spends the most ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:20.225167Z",
     "start_time": "2021-01-15T09:51:20.025918Z"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "sns.boxplot(x='sex', y='total_bill', hue='smoker', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## When should waiters want to work ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:20.442940Z",
     "start_time": "2021-01-15T09:51:20.227565Z"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "sns.boxplot(x='day', y='tip', hue='time', data=df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:20.689849Z",
     "start_time": "2021-01-15T09:51:20.444898Z"
    }
   },
   "outputs": [],
   "source": [
    "sns.violinplot(x='day', y='tip', hue='time', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-09T12:32:56.371842Z",
     "start_time": "2020-10-09T12:32:56.367372Z"
    }
   },
   "source": [
    "# Data processing with `pandas`\n",
    "\n",
    "Let us read again the `tips.csv` file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:20.710132Z",
     "start_time": "2021-01-15T09:51:20.692036Z"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "dtypes = {\n",
    "    \"sex\": \"category\",\n",
    "    \"smoker\": \"category\",\n",
    "    \"day\": \"category\",\n",
    "    \"time\": \"category\"\n",
    "} \n",
    "\n",
    "df = pd.read_csv(\"tips.csv\", dtype=dtypes)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Computations using `pandas` : broadcasting\n",
    "\n",
    "Let's add a column that contains the tip percentage"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:20.725863Z",
     "start_time": "2021-01-15T09:51:20.712207Z"
    },
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "df[\"tip_percentage\"] = df[\"tip\"] / df[\"total_bill\"]\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The computation\n",
    "```python\n",
    "df[\"tip\"] / df[\"total_bill\"]\n",
    "```\n",
    "uses a **broadcast** rule.\n",
    "\n",
    "- We can multiply, add, subtract, etc. together `numpy` arrays, `Series` or `pandas` dataframes when the computation **makes sense** in view of their respective **shape**\n",
    "\n",
    "This principle is called **broadcast** or **broadcasting**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:20.732473Z",
     "start_time": "2021-01-15T09:51:20.727786Z"
    },
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "df[\"tip\"].shape, df[\"total_bill\"].shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `tip` and `total_bill`columns have the same `shape`, so broadcasting performs **pairwise division**.\n",
    "\n",
    "This corresponds to the following \"hand-crafted\" approach with a `for` loop:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:20.839274Z",
     "start_time": "2021-01-15T09:51:20.734938Z"
    }
   },
   "outputs": [],
   "source": [
    "for i in range(df.shape[0]):\n",
    "    df.loc[i, \"tip_percentage\"] = df.loc[i, \"tip\"] / df.loc[i, \"total_bill\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But using such a loop is: \n",
    "\n",
    "- much longer to write\n",
    "- prone to mistakes\n",
    "- ugly \n",
    "- and **excruciatingly slower** :(\n",
    "\n",
    "**NEVER** use `Python` for-loops for numerical computations !"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:25.284758Z",
     "start_time": "2021-01-15T09:51:20.841712Z"
    }
   },
   "outputs": [],
   "source": [
    "%%timeit -n 10\n",
    "for i in range(df.shape[0]):\n",
    "    df.loc[i, \"tip_percentage\"] = df.loc[i, \"tip\"] / df.loc[i, \"total_bill\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:25.332780Z",
     "start_time": "2021-01-15T09:51:25.286877Z"
    }
   },
   "outputs": [],
   "source": [
    "%%timeit -n 10\n",
    "df[\"tip_percentage\"] = df[\"tip\"] / df[\"total_bill\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `for` loop is $\\approx$ **200 times slower** ! (even worse on larger data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pitfall. Changing values in a `DataFrame`\n",
    "\n",
    "When you want to change a value in a `DataFrame`, never use\n",
    "```python\n",
    "df[\"tip_percentage\"].loc[i] = 42\n",
    "```\n",
    "but use\n",
    "```python\n",
    "df.loc[i, \"tip_percentage\"] = 42\n",
    "```\n",
    "namely, use a **single** `loc` or `iloc` statement. The first version **might not work**: it might modify a copy of the column and not the dataframe itself !"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another example of broadcasting is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:25.345631Z",
     "start_time": "2021-01-15T09:51:25.335148Z"
    }
   },
   "outputs": [],
   "source": [
    "(100 * df[[\"tip_percentage\"]]).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "where we multiplied **each entry** of the `tip_percentage` column by 100.\n",
    "\n",
    "**Remark.** Note the difference between\n",
    "```python\n",
    "df[['tip_percentage']]\n",
    "```\n",
    "which returns a `DataFrame` containing only the `tip_percentage` column and\n",
    "```python\n",
    "df['tip_percentage']\n",
    "```\n",
    "which returns a `Series` containing the data of the `tip_percentage` column"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Some more plots"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How do the tip percentages relates to the total bill ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:25.832352Z",
     "start_time": "2021-01-15T09:51:25.349460Z"
    }
   },
   "outputs": [],
   "source": [
    "sns.jointplot(x=\"total_bill\", y=\"tip_percentage\", data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Who tips best ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:26.063153Z",
     "start_time": "2021-01-15T09:51:25.835329Z"
    }
   },
   "outputs": [],
   "source": [
    "sns.boxplot(x='sex', y='tip_percentage', hue='smoker', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Who tips best without the `tip_percentage` outliers ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:26.338967Z",
     "start_time": "2021-01-15T09:51:26.065918Z"
    }
   },
   "outputs": [],
   "source": [
    "sns.boxplot(\n",
    "    x='sex', y='tip_percentage', hue='smoker', \n",
    "    data=df.loc[df[\"tip_percentage\"] <= 0.3]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:26.347568Z",
     "start_time": "2021-01-15T09:51:26.341347Z"
    }
   },
   "outputs": [],
   "source": [
    "id(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The all-mighty `groupby` and `aggregate`\n",
    "\n",
    "Many computations can be formulated as a **groupby** followed by and **aggregation**.\n",
    "\n",
    "### What is the mean `tip` and `tip percentage` each day ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:26.367070Z",
     "start_time": "2021-01-15T09:51:26.350529Z"
    }
   },
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:26.385448Z",
     "start_time": "2021-01-15T09:51:26.369477Z"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "df.groupby(\"day\").mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But we don't care about the `size` column here, so we can use insead"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:26.404776Z",
     "start_time": "2021-01-15T09:51:26.389083Z"
    }
   },
   "outputs": [],
   "source": [
    "df[[\"total_bill\", \"tip\", \"tip_percentage\", \"day\"]].groupby(\"day\").mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we want to be more precise, we can `groupby` using several columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:26.441028Z",
     "start_time": "2021-01-15T09:51:26.408033Z"
    }
   },
   "outputs": [],
   "source": [
    "df[[\"total_bill\", \"tip\", \"tip_percentage\", \"day\", \"time\"]].groupby([\"day\", \"time\"]).mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Remarks**\n",
    "\n",
    "- We obtain a `DataFrame` with a two-level indexing: on the `day` and the `time`\n",
    "- Groups must be homogeneous: we have `NaN` values for empty groups (e.g. `Sat`, `Lunch`)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pro tip\n",
    "\n",
    "Sometimes it's more convenient to get the groups as columns instead of a multi-level index.<br>\n",
    "For this, use `reset_index`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:26.477405Z",
     "start_time": "2021-01-15T09:51:26.444302Z"
    }
   },
   "outputs": [],
   "source": [
    "df[[\"total_bill\", \"tip\", \"tip_percentage\", \"day\", \"time\"]]\\\n",
    "    .groupby([\"day\", \"time\"]).mean().reset_index()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Another pro tip\n",
    "\n",
    "Computations with pandas can include many operations that are **pipelined** until the final computation.<br>\n",
    "Pipelining many operations is good practice and perfectly normal, but in order to make the code readable you can put it between parenthesis (`python` expression) as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:26.511329Z",
     "start_time": "2021-01-15T09:51:26.480282Z"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "(\n",
    "    df[[\"total_bill\", \"tip\", \"tip_percentage\", \"day\", \"time\"]]\n",
    "    .groupby([\"day\", \"time\"])\n",
    "    .mean()\n",
    "    .reset_index()\n",
    "    # and on top of all this we sort the dataframe with respect \n",
    "    # to the tip_percentage\n",
    "    .sort_values(\"tip_percentage\")\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Displaying a `DataFrame` with `style`\n",
    "\n",
    "Now, we can answer, with style, to the question: what are the average tip percentages along the week ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:26.668935Z",
     "start_time": "2021-01-15T09:51:26.514338Z"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "(\n",
    "    df[[\"tip_percentage\", \"day\", \"time\"]]\n",
    "    .groupby([\"day\", \"time\"])\n",
    "    .mean()\n",
    "    # At the end of the pipeline you can use .style\n",
    "    .style\n",
    "    # Print numerical values as percentages \n",
    "    .format(\"{:.2%}\")\n",
    "    .background_gradient()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Removing the `NaN` values\n",
    "\n",
    "But the `NaN` values are somewhat annoying. Let's remove them"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:26.704232Z",
     "start_time": "2021-01-15T09:51:26.674169Z"
    }
   },
   "outputs": [],
   "source": [
    "(\n",
    "    df[[\"tip_percentage\", \"day\", \"time\"]]\n",
    "    .groupby([\"day\", \"time\"])\n",
    "    .mean()\n",
    "    # We just add this from the previous pipeline\n",
    "    .dropna()\n",
    "    .style\n",
    "    .format(\"{:.2%}\")\n",
    "    .background_gradient()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we clearly see when `tip_percentage` is maximal. But what about the standard deviation ?\n",
    "\n",
    "- We used only `.mean()` for now, but we can use several aggregating function using `.agg()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:26.766086Z",
     "start_time": "2021-01-15T09:51:26.707178Z"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "(\n",
    "    df[[\"tip_percentage\", \"day\", \"time\"]]\n",
    "    .groupby([\"day\", \"time\"])\n",
    "    .agg([\"mean\", \"std\"])\n",
    "    .dropna()\n",
    "    .style\n",
    "    .format(\"{:.2%}\")\n",
    "    .background_gradient()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And we can use also `.describe()` as aggregation function. Moreover we\n",
    "- use the `subset` option to specify which column we want to style\n",
    "- we use `(\"tip_percentage\", \"count\")` to access multi-level index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:26.851497Z",
     "start_time": "2021-01-15T09:51:26.768613Z"
    }
   },
   "outputs": [],
   "source": [
    "(\n",
    "    df[[\"tip_percentage\", \"day\", \"time\"]]\n",
    "    .groupby([\"day\", \"time\"])\n",
    "    .describe()\n",
    "    .dropna()\n",
    "    .style\n",
    "    .bar(subset=[(\"tip_percentage\", \"count\")])\n",
    "    .background_gradient(subset=[(\"tip_percentage\", \"50%\")])\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Supervised learning of `tip` based on the `total_bill` \n",
    "\n",
    "As an example of very simple **machine-learning** problem, let's try to understand how we can predict `tip` based on `total_bill`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:26.993824Z",
     "start_time": "2021-01-15T09:51:26.854233Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "plt.scatter(df[\"total_bill\"], df[\"tip\"])\n",
    "plt.xlabel(\"total_bill\", fontsize=12)\n",
    "plt.ylabel(\"tip\", fontsize=12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There's a rough **linear** dependence between the two. Let's try to find it by hand!<br>\n",
    "Namely, we look for numbers $b$ and $w$ such that\n",
    "\n",
    "$$\n",
    "\\texttt{tip} \\approx b + w \\times \\texttt{total_bill}\n",
    "$$\n",
    "\n",
    "for all the examples of pairs of $(\\texttt{tip}, \\texttt{total_bill})$ we observe in the data.\n",
    "\n",
    "In **machine learning**, we say that this is a very simple example of a **supervised learning** problem (here it is a regression problem), where $\\texttt{tip}$ is the **label** and where $\\texttt{total_bill}$ is the (only) **feature**, for which we intend to use a **linear predictor**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:27.158925Z",
     "start_time": "2021-01-15T09:51:26.996352Z"
    }
   },
   "outputs": [],
   "source": [
    "plt.scatter(df[\"total_bill\"], df[\"tip\"])\n",
    "plt.xlabel(\"total_bill\", fontsize=12)\n",
    "plt.ylabel(\"tip\", fontsize=12)\n",
    "\n",
    "slope = 1.0\n",
    "intercept = 0.0\n",
    "\n",
    "x = np.linspace(0, 50, 1000)\n",
    "plt.plot(x, intercept + slope * x, color=\"red\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's try to find it in a more interactive way. This might require\n",
    "```python\n",
    "!pip install ipympl\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:27.164804Z",
     "start_time": "2021-01-15T09:51:27.161590Z"
    }
   },
   "outputs": [],
   "source": [
    "# !pip install ipympl"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:27.361662Z",
     "start_time": "2021-01-15T09:51:27.167569Z"
    },
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "import ipywidgets as widgets\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "\n",
    "%matplotlib widget\n",
    "%matplotlib inline\n",
    "\n",
    "x = np.linspace(0, 50, 1000)\n",
    "\n",
    "@widgets.interact(intercept=(-5, 5, 1.), slope=(0, 1, .05))\n",
    "def update(intercept=0.0, slope=0.5):\n",
    "    plt.scatter(df[\"total_bill\"], df[\"tip\"])\n",
    "    plt.plot(x, intercept + slope * x, color=\"red\")\n",
    "    plt.xlim((0, 50))\n",
    "    plt.ylim((0, 10))\n",
    "    plt.xlabel(\"total_bill\", fontsize=12)\n",
    "    plt.ylabel(\"tip\", fontsize=12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is kind of tedious to do this by hand... it would be nice to come up with an **automated** way of doing this. Moreover:\n",
    "\n",
    "- We are using a **linear** function, while something more complicated (such as a polynomial) might be better\n",
    "- More importantly, we use **only** the `total_bill` column to predict the `tip`, while we know about many other things"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:27.383439Z",
     "start_time": "2021-01-15T09:51:27.364363Z"
    }
   },
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## One-hot encoding of categorical variables\n",
    "\n",
    "We can't perform computations (products and sums) with columns containing **categorical** variables. So, we can't use them like this to predict the `tip`.\n",
    "We need to **convert** them to numbers somehow.\n",
    "\n",
    "The most classical approach for this is **one-hot encoding** (or \"create dummies\" or \"binarize\") of the categorical variables, which can be easily achieved with `pandas.get_dummies`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:57:27.607858Z",
     "start_time": "2021-01-15T09:57:27.563562Z"
    },
    "scrolled": true,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "df_one_hot = pd.get_dummies(df, prefix_sep='#')\n",
    "df_one_hot.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Only the categorical columns have been one-hot encoded. For instance, the `\"day\"` column is replaced by 4 columns named `\"day#Thur\"`, `\"day#Fri\"`, `\"day#Sat\"`, `\"day#Sun\"`, since `\"day\"` has 4 modalities (see next line)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:27.433926Z",
     "start_time": "2021-01-15T09:51:27.426358Z"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "df['day'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:27.442464Z",
     "start_time": "2021-01-15T09:51:27.436439Z"
    },
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "df_one_hot.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pitfall. Colinearities with one-hot encoding\n",
    "\n",
    "Sums over dummies for `sex`, `smoker`, `day`, `time` and `size` are all equal to one (by constrution of the one-hot encoded vectors).\n",
    "\n",
    "- Leads to **colinearities** in the matrix of features\n",
    "- It is **much harder** to train a linear regressor when the columns of the features matrix has colinearities"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:27.456985Z",
     "start_time": "2021-01-15T09:51:27.445235Z"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "day_cols = [col for col in df_one_hot.columns if col.startswith(\"day\")]\n",
    "df_one_hot[day_cols].head()\n",
    "df_one_hot[day_cols].sum(axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:27.466975Z",
     "start_time": "2021-01-15T09:51:27.459209Z"
    }
   },
   "outputs": [],
   "source": [
    "all(df_one_hot[day_cols].sum(axis=1) == 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The most standard solution is to remove a modality (i.e. remove a one-hot encoding vector). Simply achieved by specifying `drop_first=True` in the `get_dummies` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:27.477837Z",
     "start_time": "2021-01-15T09:51:27.469797Z"
    }
   },
   "outputs": [],
   "source": [
    "df[\"day\"].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-01-15T09:51:27.516588Z",
     "start_time": "2021-01-15T09:51:27.480318Z"
    }
   },
   "outputs": [],
   "source": [
    "pd.get_dummies(df, prefix_sep='#', drop_first=True).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, if a categorical feature has $K$ modalities, we use only $K-1$ dummies.\n",
    "For instance, there is no more `sex#Female` binary column. \n",
    "\n",
    "**Question.** So, a linear regression won't fit a weight for `sex#Female`. But, where do the model weights of the dropped binary columns go ?\n",
    "\n",
    "**Answer.** They just \"go\" to the **intercept**: interpretation of the population bias depends on the \"dropped\" one-hot encodings.\n",
    "\n",
    "So, we actually fit\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "\\texttt{tip} \\approx b &+ w_1 \\times \\texttt{total_bill}\n",
    "+ w_2 \\times \\texttt{size} \\\\\n",
    "&+ w_3 \\times \\texttt{sex#Male}\n",
    "+ w_4 \\times \\texttt{smoker#Yes} \\\\\n",
    "&+ w_5 \\times \\texttt{day#Sat}\n",
    "+ w_6 \\times \\texttt{day#Sun}\n",
    "+ w_7 \\times \\texttt{day#Thur} \\\\\n",
    "&+ w_8 \\times \\texttt{time#Lunch}\n",
    "\\end{align*}\n",
    "$$"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "193.25px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}