Lightspeed introduction to pandas

Pandas is a library that provides simple tools for crunching data.

Its main data structure is the DataFrame object.

In [1]:
import numpy as np
import pandas as pd

df = pd.DataFrame({ 
    'A' : 1.,
    'B' : pd.Timestamp('20130102'),
    'C' : pd.Series(1, index=[0, 2], dtype='float32'),
    'D' : pd.Series([1, 2, 3, 4], dtype='int32'),
    'E' : pd.Categorical(["test", "train", "test", "train"]),
    'F' : 'foo',
    'G' : np.random.randn(4)
})
df
Out[1]:
A B C D E F G
0 1.0 2013-01-02 1.0 1 test foo 1.224033
1 1.0 2013-01-02 NaN 2 train foo -2.020326
2 1.0 2013-01-02 1.0 3 test foo 1.437953
3 1.0 2013-01-02 NaN 4 train foo 1.117535
In [2]:
# Columns can be accessed as attributes (when the name is a valid Python identifier)
df.B
Out[2]:
0   2013-01-02
1   2013-01-02
2   2013-01-02
3   2013-01-02
Name: B, dtype: datetime64[ns]
In [3]:
# Or like keys in a dict
df['B']
Out[3]:
0   2013-01-02
1   2013-01-02
2   2013-01-02
3   2013-01-02
Name: B, dtype: datetime64[ns]
In [4]:
type(df.B)
Out[4]:
pandas.core.series.Series
In [5]:
# To select a list of columns
df[['A', 'C']]
Out[5]:
A C
0 1.0 1.0
1 1.0 NaN
2 1.0 1.0
3 1.0 NaN
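A side note on the two selection syntaxes above: a single label returns a Series, while a list of labels returns a DataFrame, even when the list has one element. A minimal sketch (with an illustrative frame, not the tutorial's df):

```python
import pandas as pd

# Small frame just to check what each selection syntax returns
df = pd.DataFrame({'A': [1.0, 1.0], 'B': ['x', 'y']})

single = df['A']    # a single label yields a Series
frame = df[['A']]   # a list of labels yields a DataFrame, even one-column
```

This matters in practice: a Series and a one-column DataFrame behave differently (e.g. for further indexing or plotting).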
In [6]:
df.dtypes
Out[6]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
G           float64
dtype: object
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 7 columns):
A    4 non-null float64
B    4 non-null datetime64[ns]
C    2 non-null float32
D    4 non-null int32
E    4 non-null category
F    4 non-null object
G    4 non-null float64
dtypes: category(1), datetime64[ns](1), float32(1), float64(2), int32(1), object(1)
memory usage: 292.0+ bytes

Groupby and aggregations

Basic syntax: group rows using the categories of a column and perform an aggregation on the groups

In [8]:
df
Out[8]:
A B C D E F G
0 1.0 2013-01-02 1.0 1 test foo 1.224033
1 1.0 2013-01-02 NaN 2 train foo -2.020326
2 1.0 2013-01-02 1.0 3 test foo 1.437953
3 1.0 2013-01-02 NaN 4 train foo 1.117535
In [9]:
# Compute the sum of D for each category in E
df.groupby('E').sum()
Out[9]:
A C D G
E
test 2.0 2.0 4 2.661987
train 2.0 0.0 6 -0.902790
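The D column of the result above can be reproduced on a minimal frame: groupby('E') partitions the rows by their value in E, then sum() aggregates each group.

```python
import pandas as pd

# Minimal reproduction of the groupby-sum mechanics
df = pd.DataFrame({
    'D': [1, 2, 3, 4],
    'E': pd.Categorical(['test', 'train', 'test', 'train']),
})

# 'test' rows contribute 1 + 3, 'train' rows contribute 2 + 4
sums = df.groupby('E', observed=True).sum()
```

Any other aggregation (mean, max, count, ...) can replace sum() with the same grouping.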

Lightspeed introduction to seaborn

It's a graphics library built on top of matplotlib which

  • works pretty neatly with pandas DataFrames
  • provides simpler ways to make nice visualizations of datasets

Let's illustrate this using the toy tips dataset that comes with seaborn

In [10]:
import seaborn as sns

# Load one of the data sets that comes with seaborn
tips = sns.load_dataset("tips")

# First 10 rows of the dataframe
tips.head(n=10)
Out[10]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
5 25.29 4.71 Male No Sun Dinner 4
6 8.77 2.00 Male No Sun Dinner 2
7 26.88 3.12 Male No Sun Dinner 4
8 15.04 1.96 Male No Sun Dinner 2
9 14.78 3.23 Male No Sun Dinner 2
In [11]:
tips.describe(include='all')
Out[11]:
total_bill tip sex smoker day time size
count 244.000000 244.000000 244 244 244 244 244.000000
unique NaN NaN 2 2 4 2 NaN
top NaN NaN Male No Sat Dinner NaN
freq NaN NaN 157 151 87 176 NaN
mean 19.785943 2.998279 NaN NaN NaN NaN 2.569672
std 8.902412 1.383638 NaN NaN NaN NaN 0.951100
min 3.070000 1.000000 NaN NaN NaN NaN 1.000000
25% 13.347500 2.000000 NaN NaN NaN NaN 2.000000
50% 17.795000 2.900000 NaN NaN NaN NaN 2.000000
75% 24.127500 3.562500 NaN NaN NaN NaN 3.000000
max 50.810000 10.000000 NaN NaN NaN NaN 6.000000
In [12]:
tips['day'].unique()
Out[12]:
[Sun, Sat, Thur, Fri]
Categories (4, object): [Sun, Sat, Thur, Fri]
In [13]:
tips.dtypes
Out[13]:
total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object
In [14]:
tips.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null category
time          244 non-null category
size          244 non-null int64
dtypes: category(4), float64(2), int64(1)
memory usage: 7.2 KB
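The small memory footprint reported by info() comes largely from the category dtype: each value is stored as a small integer code pointing into a table of unique labels. A hedged sketch on synthetic data (not the tips table) comparing it with plain object storage:

```python
import pandas as pd

# A low-cardinality string column, similar in spirit to the 'day' column
days = pd.Series(['Sun', 'Sat', 'Thur', 'Fri'] * 1000)

as_object = days.memory_usage(deep=True)
as_category = days.astype('category').memory_usage(deep=True)
# category keeps one small integer code per row plus 4 unique labels,
# so it should be far smaller than 4000 Python string objects
```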
In [15]:
sns.jointplot(x="total_bill", y="tip", data=tips)
Out[15]:
<seaborn.axisgrid.JointGrid at 0x7f954082fba8>

Exercise 1

Compute the average tip percentage for Dinner vs. Lunch on each day of the week

In [16]:
tips.head()
Out[16]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

Answer

In [17]:
tips['tip_percentage'] = 100 * tips['tip'] / tips['total_bill']
tips.groupby(['time', 'day']).mean()[['tip_percentage']]
Out[17]:
tip_percentage
time day
Lunch Thur 16.130074
Fri 18.876489
Sat NaN
Sun NaN
Dinner Thur 15.974441
Fri 15.891611
Sat 15.315172
Sun 16.689729
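The same table can also be obtained with pivot_table, which does the groupby and the mean in one call and lays one of the keys out as columns. A sketch on a hypothetical mini version of the tips table (made-up values):

```python
import pandas as pd

# Tiny stand-in for tips, with tip_percentage already computed
mini = pd.DataFrame({
    'time': ['Lunch', 'Lunch', 'Dinner', 'Dinner'],
    'day':  ['Thur',  'Fri',   'Thur',   'Fri'],
    'tip_percentage': [16.0, 19.0, 16.0, 16.0],
})

# index='time' groups rows, columns='day' spreads days across columns
table = mini.pivot_table(values='tip_percentage', index='time',
                         columns='day', aggfunc='mean')
```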

Exercise 2

Prepare the dataset for training a linear regression that predicts the tip based on the other features

  • Convert size to a categorical variable
  • One-hot encode the variables
  • Normalize the continuous features
  • Train a linear regression and display its coefficients
In [18]:
# We want to deal with size as a categorical variable
tips['size'] = tips['size'].astype('category')
tips.head(5)
Out[18]:
total_bill tip sex smoker day time size tip_percentage
0 16.99 1.01 Female No Sun Dinner 2 5.944673
1 10.34 1.66 Male No Sun Dinner 3 16.054159
2 21.01 3.50 Male No Sun Dinner 3 16.658734
3 23.68 3.31 Male No Sun Dinner 2 13.978041
4 24.59 3.61 Female No Sun Dinner 4 14.680765
In [19]:
tips.dtypes
Out[19]:
total_bill         float64
tip                float64
sex               category
smoker            category
day               category
time              category
size              category
tip_percentage     float64
dtype: object

We need to "one-hot encode" ("create dummies", "binarize") the categorical variables. This can be easily achieved with the pandas.get_dummies function.

In [20]:
data = pd.get_dummies(tips, prefix_sep='#')
data.head(5)
Out[20]:
total_bill tip tip_percentage sex#Male sex#Female smoker#Yes smoker#No day#Thur day#Fri day#Sat day#Sun time#Lunch time#Dinner size#1 size#2 size#3 size#4 size#5 size#6
0 16.99 1.01 5.944673 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 0
1 10.34 1.66 16.054159 1 0 0 1 0 0 0 1 0 1 0 0 1 0 0 0
2 21.01 3.50 16.658734 1 0 0 1 0 0 0 1 0 1 0 0 1 0 0 0
3 23.68 3.31 13.978041 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0
4 24.59 3.61 14.680765 0 1 0 1 0 0 0 1 0 1 0 0 0 1 0 0

Only categorical columns have been "binarized". For instance, the 'day' column is replaced by 4 columns named 'day#Thur', 'day#Fri', 'day#Sat', 'day#Sun', since 'day' has 4 modalities (see next cell).

In [21]:
tips['day'].unique()
Out[21]:
[Sun, Sat, Thur, Fri]
Categories (4, object): [Sun, Sat, Thur, Fri]

Important remark

For each of sex, smoker, day, time and size, the dummies sum to one across their modalities.

  • This leads to collinearities, hence bad conditioning of the features matrix
  • It can be checked through an SVD (but don't compute the SVD of a large matrix!)
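This claim can be checked numerically on a tiny synthetic one-hot matrix (assumption: an all-ones column plays the role of the constant that the dummies sum to):

```python
import numpy as np
import pandas as pd

# One categorical column, fully one-hot encoded (no level dropped)
s = pd.Series(pd.Categorical(['a', 'b', 'a', 'c']))
dummies = pd.get_dummies(s).to_numpy(dtype=float)

# Stack an all-ones column next to the dummies: since the dummies sum to
# one row-wise, the columns are linearly dependent (rank-deficient matrix)
X = np.hstack([np.ones((len(s), 1)), dummies])
sv = np.linalg.svd(X, compute_uv=False)
# The smallest singular value is (numerically) zero
```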
In [22]:
import matplotlib.pyplot as plt
%matplotlib inline

s = np.linalg.svd(data, compute_uv=False)
plt.yscale('log')
plt.title('Spectrum of the features matrix', fontsize=16)
_ = plt.stem(s, use_line_collection=True)
In [23]:
data = pd.get_dummies(tips, prefix_sep='#', drop_first=True)
data.head()
Out[23]:
total_bill tip tip_percentage sex#Female smoker#No day#Fri day#Sat day#Sun time#Dinner size#2 size#3 size#4 size#5 size#6
0 16.99 1.01 5.944673 1 1 0 0 1 1 1 0 0 0 0
1 10.34 1.66 16.054159 0 1 0 0 1 1 0 1 0 0 0
2 21.01 3.50 16.658734 0 1 0 0 1 1 0 1 0 0 0
3 23.68 3.31 13.978041 0 1 0 0 1 1 1 0 0 0 0
4 24.59 3.61 14.680765 1 1 0 0 1 1 0 0 1 0 0

Now, if a categorical feature has $K$ modalities, we use only $K-1$ dummies

Question

Where have the model weights for sex#Male and the other dropped modalities gone?

Final preprocessing

  • Let's pop the tip column out of the dataframe: it's the label we want to predict
  • And let's remove the tip_percentage column (otherwise it would be cheating...)
In [24]:
tip = data.pop('tip')
_ = data.pop('tip_percentage')
data.head(5)
Out[24]:
total_bill sex#Female smoker#No day#Fri day#Sat day#Sun time#Dinner size#2 size#3 size#4 size#5 size#6
0 16.99 1 1 0 0 1 1 1 0 0 0 0
1 10.34 0 1 0 0 1 1 0 1 0 0 0
2 21.01 0 1 0 0 1 1 0 1 0 0 0
3 23.68 0 1 0 0 1 1 1 0 0 0 0
4 24.59 1 1 0 0 1 1 0 0 1 0 0

Remark

The first time you pop the tip column you're fine: it removes it from data and puts it in tip. If you do it a second time you get:

In [25]:
tip = data.pop('tip')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2656             try:
-> 2657                 return self._engine.get_loc(key)
   2658             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'tip'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-25-b580cd52955f> in <module>
----> 1 tip = data.pop('tip')

/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in pop(self, item)
    807         3  monkey        NaN
    808         """
--> 809         result = self[item]
    810         del self[item]
    811         try:

/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2925             if self.columns.nlevels > 1:
   2926                 return self._getitem_multilevel(key)
-> 2927             indexer = self.columns.get_loc(key)
   2928             if is_integer(indexer):
   2929                 indexer = [indexer]

/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2657                 return self._engine.get_loc(key)
   2658             except KeyError:
-> 2659                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2660         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2661         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'tip'

This long traceback boils down to a KeyError: 'tip'.

  • It's simply a long way of saying that there is no column named tip in data.
  • It's fair enough, since pop removed this column the first time.
  • But this is annoying, since in a notebook you're likely to execute cells several times.

Question

Solve this problem, where we recall that data is obtained through

In [26]:
tips = sns.load_dataset("tips")
tips['size'] = tips['size'].astype('category')
data = pd.get_dummies(tips, prefix_sep='#', drop_first=True)
data.head(5)
Out[26]:
total_bill tip sex#Female smoker#No day#Fri day#Sat day#Sun time#Dinner size#2 size#3 size#4 size#5 size#6
0 16.99 1.01 1 1 0 0 1 1 1 0 0 0 0
1 10.34 1.66 0 1 0 0 1 1 0 1 0 0 0
2 21.01 3.50 0 1 0 0 1 1 0 1 0 0 0
3 23.68 3.31 0 1 0 0 1 1 1 0 0 0 0
4 24.59 3.61 1 1 0 0 1 1 0 0 1 0 0

Answer

In [27]:
if 'tip' in data:
    tip = data.pop('tip')

# Let's try it twice to check that it's really ok...
if 'tip' in data:
    tip = data.pop('tip')
In [28]:
data.head(2)
Out[28]:
total_bill sex#Female smoker#No day#Fri day#Sat day#Sun time#Dinner size#2 size#3 size#4 size#5 size#6
0 16.99 1 1 0 0 1 1 1 0 0 0 0
1 10.34 0 1 0 0 1 1 0 1 0 0 0
In [29]:
tip.head(2)
Out[29]:
0    1.01
1    1.66
Name: tip, dtype: float64
In [30]:
from sklearn.linear_model import LinearRegression

# The features matrix is data and the labels vector is tip
X, y = data, tip
# Create and train a linear regression
linreg = LinearRegression().fit(X, y)

plt.title('Linear regression coefficients for tips prediction', 
          fontsize=16)
plt.stem(linreg.coef_, use_line_collection=True)
_ = plt.xticks(np.arange(linreg.coef_.shape[0]), data.columns,
               rotation='vertical', fontsize=14)
_ = plt.yticks(fontsize=14)

Question

Why isn't total_bill the most important feature? That seems counter-intuitive!

Answer

We did not normalize the continuous features when training this linear regression!

  • While we would expect the 'total_bill' feature to have a strong impact on the prediction, the weight associated with it is not much larger than the other ones
  • This comes from the fact that 'total_bill' is not normalized: it takes values that are arbitrarily larger than the 0/1 dummies

So, let's try again with a min-max normalization (puts features between 0 and 1).

In [31]:
from sklearn.preprocessing import MinMaxScaler

# We normalize all the features by min-max scaling 
#  (putting all features between 0 and 1)
X2 = MinMaxScaler().fit_transform(X)

linreg = LinearRegression().fit(X2, y)

plt.title('Linear regression coefficients for tips prediction', 
          fontsize=16)
plt.stem(linreg.coef_, use_line_collection=True)
_ = plt.xticks(np.arange(linreg.coef_.shape[0]), data.columns,
               rotation='vertical', fontsize=14)
_ = plt.yticks(fontsize=14)
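To see why normalization changes the picture, here is a hedged synthetic illustration (not the tips data): a wide-range feature with a small true weight next to a 0/1 dummy with a larger one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
big = rng.uniform(0.0, 100.0, size=(200, 1))            # wide-range feature
dummy = rng.integers(0, 2, size=(200, 1)).astype(float)  # 0/1 dummy
X = np.hstack([big, dummy])
y = 0.15 * big[:, 0] + 1.0 * dummy[:, 0]  # known true weights

raw_coef = LinearRegression().fit(X, y).coef_
scaled_coef = LinearRegression().fit(MinMaxScaler().fit_transform(X), y).coef_
# On raw features the wide-range feature's weight (0.15) looks smaller than
# the dummy's (1.0); after scaling, its weight reflects its full 0-100 range
# and dominates, matching its actual influence on the prediction.
```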