This notebook was created by Jean de Dieu Nyandwi for the love of the machine learning community. For any feedback, errors, or suggestions, he can be reached by email (johnjw7084 at gmail dot com), Twitter, or LinkedIn.
How to Handle Categorical Data?¶
Real-world data comes with its own unique blend. Sometimes when working with real-world data you will have to deal with categorical data, and other times not. Categorical data are those types of data whose features take values from a limited number of categories. Take the example of a feature gender
that can have two categories: male and female
.
In many cases, categorical features have text values, yet most ML models accept only numerical inputs. That is why we have to convert these types of features into a proper format accepted by ML algorithms.
There are five techniques to encode or convert categorical features into numbers:

  1. Mapping method
  2. Ordinal encoding
  3. Label encoding
  4. Pandas dummies
  5. One hot encoding

Note that some of these encoding techniques produce the same output; the difference is only in implementation. The first three produce integer codes, while the last two produce a one-hot matrix (of 1s and 0s).
Let's implement each of them.
# Loading the dataset
import seaborn as sns
import pandas as pd
We are going to use the Titanic dataset from seaborn's built-in datasets. It has plenty of categorical features to choose from.
titanic = sns.load_dataset('titanic')
titanic.head()
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
titanic.isnull().sum()
survived 0 pclass 0 sex 0 age 177 sibsp 0 parch 0 fare 0 embarked 2 class 0 who 0 adult_male 0 deck 688 embark_town 2 alive 0 alone 0 dtype: int64
titanic.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 survived 891 non-null int64 1 pclass 891 non-null int64 2 sex 891 non-null object 3 age 714 non-null float64 4 sibsp 891 non-null int64 5 parch 891 non-null int64 6 fare 891 non-null float64 7 embarked 889 non-null object 8 class 891 non-null category 9 who 891 non-null object 10 adult_male 891 non-null bool 11 deck 203 non-null category 12 embark_town 889 non-null object 13 alive 891 non-null object 14 alone 891 non-null bool dtypes: bool(2), category(2), float64(2), int64(4), object(5) memory usage: 80.7+ KB
You can see that, even just from displaying information about the dataset, some features like deck
or class
have category as their data type.
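If you want to list those columns programmatically, pandas' `select_dtypes` can do it. A minimal sketch on a tiny made-up frame (the column names mimic the Titanic data, but the rows are invented):

```python
import pandas as pd

# A tiny frame mimicking a few Titanic columns (made-up sample rows)
df = pd.DataFrame({
    'survived': [0, 1],
    'fare': [7.25, 71.28],
    'sex': ['male', 'female'],                    # object dtype
    'class': pd.Categorical(['Third', 'First']),  # category dtype
    'alone': [False, False],                      # bool dtype
})

# Keep only the columns whose dtype is object, category, or bool
cat_cols = df.select_dtypes(include=['object', 'category', 'bool']).columns
print(list(cat_cols))  # ['sex', 'class', 'alone']
```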
Let's peek at some categorical features in our data.
titanic['sex'].value_counts()
male 577 female 314 Name: sex, dtype: int64
titanic['embarked'].value_counts()
S 644 C 168 Q 77 Name: embarked, dtype: int64
titanic['class'].value_counts()
Third 491 First 216 Second 184 Name: class, dtype: int64
titanic['who'].value_counts()
man 537 woman 271 child 83 Name: who, dtype: int64
titanic['adult_male'].value_counts()
True 537 False 354 Name: adult_male, dtype: int64
titanic['embark_town'].value_counts()
Southampton 644 Cherbourg 168 Queenstown 77 Name: embark_town, dtype: int64
titanic['alone'].value_counts()
True 537 False 354 Name: alone, dtype: int64
titanic['deck'].value_counts()
C 59 B 47 D 33 E 32 A 15 F 13 G 4 Name: deck, dtype: int64
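Note that `value_counts` silently drops missing values by default, which is why the deck counts above sum to 203 rather than 891. Passing `dropna=False` makes the missing entries visible; a minimal sketch on a made-up series:

```python
import numpy as np
import pandas as pd

s = pd.Series(['C', 'B', np.nan, 'C', np.nan])

# dropna=False includes NaN as its own row in the counts
counts = s.value_counts(dropna=False)
print(counts)
```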
1. Mapping Method¶
The mapping method is a straightforward way to encode categorical features that have only a few categories. Let's apply it to the class feature, which has three categories: First, Second, and Third. We create a dictionary whose keys are the categories and whose values are the numbers to encode them as, and then map it onto the dataframe.
Here is how it is done:
map_dict = {
'First':0,
'Second': 1,
'Third': 2
}
titanic['class'] = titanic['class'].map(map_dict)
titanic.head()
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | 2 | man | True | NaN | Southampton | no | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | 0 | woman | False | C | Cherbourg | yes | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | 2 | woman | False | NaN | Southampton | yes | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | 0 | woman | False | C | Southampton | yes | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | 2 | man | True | NaN | Southampton | no | True |
titanic['class'].value_counts()
2 491 0 216 1 184 Name: class, dtype: int64
As you can see, the class feature is encoded. Everywhere the class was First
, it was replaced with 0, and the same thing happened to the other classes.
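One caveat of the mapping method worth keeping in mind: `map` silently replaces any category missing from the dictionary with NaN instead of raising an error. A minimal sketch on a made-up series:

```python
import pandas as pd

s = pd.Series(['First', 'Second', 'Third', 'Crew'])  # 'Crew' is not in the dictionary
map_dict = {'First': 0, 'Second': 1, 'Third': 2}

encoded = s.map(map_dict)
print(encoded.tolist())  # [0.0, 1.0, 2.0, nan] -- the unmapped 'Crew' became NaN
```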
2. Ordinal Encoding¶
Scikit-learn's OrdinalEncoder encodes categorical features as integers. Let's apply it to the alive and alone features.
from sklearn.preprocessing import OrdinalEncoder
cats_feats = titanic[['alive', 'alone']]
encoder = OrdinalEncoder()
cats_encoded = encoder.fit_transform(cats_feats)
The output of the encoder is a NumPy array. We can convert it back to the pandas dataframe.
titanic[['alive', 'alone']] = pd.DataFrame(cats_encoded, columns=cats_feats.columns, index=cats_feats.index)
titanic.head()
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | adult_male | deck | embark_town | alive | alone | man | woman | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | 2 | True | NaN | Southampton | 0.0 | 0.0 | 1 | 0 |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | 0 | False | C | Cherbourg | 1.0 | 0.0 | 0 | 1 |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | 2 | False | NaN | Southampton | 1.0 | 1.0 | 0 | 1 |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | 0 | False | C | Southampton | 1.0 | 0.0 | 0 | 1 |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | 2 | True | NaN | Southampton | 0.0 | 1.0 | 1 | 0 |
encoder.categories_
[array(['no', 'yes'], dtype=object), array([False, True])]
Warning: the Ordinal Encoder can't handle missing values; it will raise an error. Try it on embarked
and see...
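One common workaround, sketched below on a made-up column standing in for embarked, is to fill the missing values with a placeholder category before encoding (the placeholder name 'missing' is an arbitrary choice):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

embarked = pd.DataFrame({'embarked': ['S', 'C', np.nan, 'Q', 'S']})

# Fill NaN with a placeholder so OrdinalEncoder does not raise on missing values
filled = embarked.fillna('missing')

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(filled)
print(encoder.categories_)  # the placeholder shows up as its own category
```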
3. Label Encoding¶
Label encoding is noted (per the sklearn documentation) to be meant for encoding target labels, but it can also be used to achieve our purpose of encoding categorical features.
It also can't handle missing values, so to keep things simple, let's drop all rows with missing values.
titanic = sns.load_dataset('titanic')
titanic_cleaned = titanic.dropna()
titanic_cleaned.isnull().sum()
survived 0 pclass 0 sex 0 age 0 sibsp 0 parch 0 fare 0 embarked 0 class 0 who 0 adult_male 0 deck 0 embark_town 0 alive 0 alone 0 dtype: int64
from sklearn.preprocessing import LabelEncoder
deck_feat = titanic_cleaned[['deck']]
label_encoder = LabelEncoder()
deck_encoded = label_encoder.fit_transform(deck_feat)
/Users/jean/opt/miniconda3/envs/tensor/lib/python3.7/site-packages/sklearn/utils/validation.py:63: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel(). return f(*args, **kwargs)
As with the ordinal encoder, the output of the label encoder is a NumPy array.
titanic_cleaned['deck'] = pd.DataFrame(deck_encoded, columns=deck_feat.columns, index=deck_feat.index)
titanic_cleaned.head()
/Users/jean/opt/miniconda3/envs/tensor/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy """Entry point for launching an IPython kernel.
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | 2 | Cherbourg | yes | False |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | 2 | Southampton | yes | False |
6 | 0 | 1 | male | 54.0 | 0 | 0 | 51.8625 | S | First | man | True | 4 | Southampton | no | True |
10 | 1 | 3 | female | 4.0 | 1 | 1 | 16.7000 | S | Third | child | False | 6 | Southampton | yes | False |
11 | 1 | 1 | female | 58.0 | 0 | 0 | 26.5500 | S | First | woman | False | 2 | Southampton | yes | True |
label_encoder.classes_
array(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype=object)
titanic_cleaned['deck'].value_counts()
2 51 1 43 3 31 4 30 0 12 5 11 6 4 Name: deck, dtype: int64
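A handy property of LabelEncoder is that the integer codes can be mapped back to the original labels with `inverse_transform`. A minimal sketch on a made-up deck list:

```python
from sklearn.preprocessing import LabelEncoder

decks = ['C', 'B', 'C', 'A', 'G']  # a made-up deck column

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(decks)
print(codes.tolist())  # [2, 1, 2, 0, 3] -- codes follow the sorted classes A, B, C, G

# Recover the original labels from the integer codes
restored = label_encoder.inverse_transform(codes)
print(list(restored))  # ['C', 'B', 'C', 'A', 'G']
```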
4. Pandas Dummies¶
This is another simple way to handle categorical features. It creates extra columns based on the available categories. Let's apply it to the feature who
.
dummies = pd.get_dummies(titanic['who'], drop_first=True)
titanic.head()
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | 2 | man | True | NaN | Southampton | no | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | 0 | woman | False | C | Cherbourg | yes | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | 2 | woman | False | NaN | Southampton | yes | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | 0 | woman | False | C | Southampton | yes | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | 2 | man | True | NaN | Southampton | no | True |
titanic = pd.concat([titanic.drop('who',axis=1),dummies],axis=1)
titanic.head()
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | adult_male | deck | embark_town | alive | alone | man | woman | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | 2 | True | NaN | Southampton | no | False | 1 | 0 |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | 0 | False | C | Cherbourg | yes | False | 0 | 1 |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | 2 | False | NaN | Southampton | yes | True | 0 | 1 |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | 0 | False | C | Southampton | yes | False | 0 | 1 |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | 2 | True | NaN | Southampton | no | True | 1 | 0 |
# Or you can do it at once with this code
#titanic[['man', 'woman']] = pd.get_dummies(titanic['who'], drop_first=True)
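`get_dummies` can also encode several columns in one call by passing the whole dataframe and a `columns` list; numeric columns pass through untouched. A minimal sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    'who': ['man', 'woman', 'child'],
    'embarked': ['S', 'C', 'S'],
    'fare': [7.25, 71.28, 8.05],
})

# Encode both categorical columns at once; 'fare' is left as-is
dummies = pd.get_dummies(df, columns=['who', 'embarked'], drop_first=True)
print(dummies.columns.tolist())  # ['fare', 'who_man', 'who_woman', 'embarked_S']
```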
5. One Hot Encoding¶
This is the last encoding type on our list. It converts a feature into a one-hot matrix: additional columns are created, one per category value. It is basically the same as pandas dummies.
from sklearn.preprocessing import OneHotEncoder
one_hot = OneHotEncoder()
town_encoded = one_hot.fit_transform(titanic_cleaned[['embark_town']])
one_hot.categories_
[array(['Cherbourg', 'Queenstown', 'Southampton'], dtype=object)]
town_encoded
<182x3 sparse matrix of type '<class 'numpy.float64'>' with 182 stored elements in Compressed Sparse Row format>
The output of the one hot encoder is a sparse matrix. We need to convert it into a dense NumPy array.
town_encoded = town_encoded.toarray()
columns = one_hot.categories_[0]
town_df = pd.DataFrame(town_encoded, columns=columns)
town_df.head()
Cherbourg | Queenstown | Southampton | |
---|---|---|---|
0 | 1.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 1.0 |
2 | 0.0 | 0.0 | 1.0 |
3 | 0.0 | 0.0 | 1.0 |
4 | 0.0 | 0.0 | 1.0 |
len(town_df)
182
len(titanic_cleaned)
182
drop_embark = titanic_cleaned.drop('embark_town', axis=1)
# town_df has a fresh 0..181 index while titanic_cleaned kept its original
# row labels, so assign by position with .values to avoid misalignment
drop_embark[['Cherbourg', 'Queenstown', 'Southampton']] = town_df.values
drop_embark.head()
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | alive | alone | Cherbourg | Queenstown | Southampton | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | 2 | yes | False | 1.0 | 0.0 | 0.0 |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | 2 | yes | False | 0.0 | 0.0 | 1.0 |
6 | 0 | 1 | male | 54.0 | 0 | 0 | 51.8625 | S | First | man | True | 4 | no | True | 0.0 | 0.0 | 1.0 |
10 | 1 | 3 | female | 4.0 | 1 | 1 | 16.7000 | S | Third | child | False | 6 | yes | False | 0.0 | 0.0 | 1.0 |
11 | 1 | 1 | female | 58.0 | 0 | 0 | 26.5500 | S | First | woman | False | 2 | yes | True | 0.0 | 0.0 | 1.0 |
Hopefully these techniques will help you to handle all kinds of categorical features.
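As a closing sketch, in practice these encoders are often applied per column with scikit-learn's ColumnTransformer, so numeric and categorical features are handled in one step. The column names below mirror the Titanic data, but the frame itself and the choice of encoder per column are illustrative assumptions:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# A small made-up frame with Titanic-like columns
df = pd.DataFrame({
    'fare': [7.25, 71.28, 8.05],
    'class': ['Third', 'First', 'Third'],
    'embark_town': ['Southampton', 'Cherbourg', 'Southampton'],
})

preprocess = ColumnTransformer(
    transformers=[
        # Explicit category order so First < Second < Third maps to 0 < 1 < 2
        ('ordinal', OrdinalEncoder(categories=[['First', 'Second', 'Third']]), ['class']),
        ('onehot', OneHotEncoder(), ['embark_town']),
    ],
    remainder='passthrough',  # keep 'fare' untouched
)

encoded = preprocess.fit_transform(df)
print(encoded.shape)  # 4 columns: class code, 2 town columns, fare
```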