This notebook was created by Jean de Dieu Nyandwi for the love of the machine learning community. For any feedback, errors, or suggestions, he can be reached by email (johnjw7084 at gmail dot com), Twitter, or LinkedIn.
How to Handle Categorical Data?¶
Real-world data comes with its own unique blend. Sometimes when working with real-world data you will have to deal with categorical data, and other times not. Categorical data are those types of data whose features take values from a limited number of categories. Take the example of a feature gender
that can have two categories: male and female
.
In many cases, categorical features have text values, yet most ML models accept only numerical inputs. That is why we have to convert these types of features into a proper format accepted by ML algorithms.
There are five techniques to encode or convert categorical features into numbers:

  1. Mapping method
  2. Ordinal encoding
  3. Label encoding
  4. Pandas dummies
  5. One hot encoding

Note that some of these encoding techniques produce the same output; the difference is only in implementation. The first three produce integer codes, while the last two produce a one-hot matrix (of 1s and 0s).
Let's implement each of them.
# Loading the dataset
import seaborn as sns
import pandas as pd
We are going to use the Titanic dataset from seaborn's built-in datasets. It has plenty of categorical features to choose from.
titanic = sns.load_dataset('titanic')
titanic.head()
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
titanic.isnull().sum()
survived 0 pclass 0 sex 0 age 177 sibsp 0 parch 0 fare 0 embarked 2 class 0 who 0 adult_male 0 deck 688 embark_town 2 alive 0 alone 0 dtype: int64
titanic.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 survived 891 non-null int64 1 pclass 891 non-null int64 2 sex 891 non-null object 3 age 714 non-null float64 4 sibsp 891 non-null int64 5 parch 891 non-null int64 6 fare 891 non-null float64 7 embarked 889 non-null object 8 class 891 non-null category 9 who 891 non-null object 10 adult_male 891 non-null bool 11 deck 203 non-null category 12 embark_town 889 non-null object 13 alive 891 non-null object 14 alone 891 non-null bool dtypes: bool(2), category(2), float64(2), int64(4), object(5) memory usage: 80.7+ KB
You can see that, even just from displaying information about the dataset, some features like deck
or class
have category as their data type.
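If you want to list those columns programmatically, pandas' `select_dtypes` can do it. A minimal sketch on a tiny made-up frame (the column names mimic the Titanic data, but the rows are invented):

```python
import pandas as pd

# A tiny frame mimicking a few Titanic columns (made-up sample rows)
df = pd.DataFrame({
    'survived': [0, 1],
    'fare': [7.25, 71.28],
    'sex': ['male', 'female'],                    # object dtype
    'class': pd.Categorical(['Third', 'First']),  # category dtype
    'alone': [False, False],                      # bool dtype
})

# Keep only the columns whose dtype is object, category, or bool
cat_cols = df.select_dtypes(include=['object', 'category', 'bool']).columns
print(list(cat_cols))  # ['sex', 'class', 'alone']
```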
Let's peek at some categorical features in our data.
titanic['sex'].value_counts()
male 577 female 314 Name: sex, dtype: int64
titanic['embarked'].value_counts()
S 644 C 168 Q 77 Name: embarked, dtype: int64
titanic['class'].value_counts()
Third 491 First 216 Second 184 Name: class, dtype: int64
titanic['who'].value_counts()
man 537 woman 271 child 83 Name: who, dtype: int64
titanic['adult_male'].value_counts()
True 537 False 354 Name: adult_male, dtype: int64
titanic['embark_town'].value_counts()
Southampton 644 Cherbourg 168 Queenstown 77 Name: embark_town, dtype: int64
titanic['alone'].value_counts()
True 537 False 354 Name: alone, dtype: int64
titanic['deck'].value_counts()
C 59 B 47 D 33 E 32 A 15 F 13 G 4 Name: deck, dtype: int64
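Note that `value_counts` silently drops missing values by default, which is why the deck counts above sum to 203 rather than 891. Passing `dropna=False` makes the missing entries visible; a minimal sketch on a made-up series:

```python
import numpy as np
import pandas as pd

s = pd.Series(['C', 'B', np.nan, 'C', np.nan])

# dropna=False includes NaN as its own row in the counts
counts = s.value_counts(dropna=False)
print(counts)
```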
1. Mapping Method¶
The mapping method is a straightforward way to encode categorical features that have only a few categories. Let's apply it to the class feature, which has three categories: First, Second, and Third. We create a dictionary whose keys are the categories and whose values are the numbers to encode them as, and then map it onto the dataframe.
Here is how it is done:
map_dict = {
'First':0,
'Second': 1,
'Third': 2
}
titanic['class'] = titanic['class'].map(map_dict)
titanic.head()
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | 2 | man | True | NaN | Southampton | no | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | 0 | woman | False | C | Cherbourg | yes | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | 2 | woman | False | NaN | Southampton | yes | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | 0 | woman | False | C | Southampton | yes | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | 2 | man | True | NaN | Southampton | no | True |
titanic['class'].value_counts()
2 491 0 216 1 184 Name: class, dtype: int64
As you can see, the class feature is encoded. Everywhere the class was First
, it was replaced with 0, and the same thing happened to the other classes.
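One caveat of the mapping method worth keeping in mind: `map` silently replaces any category missing from the dictionary with NaN instead of raising an error. A minimal sketch on a made-up series:

```python
import pandas as pd

s = pd.Series(['First', 'Second', 'Third', 'Crew'])  # 'Crew' is not in the dictionary
map_dict = {'First': 0, 'Second': 1, 'Third': 2}

encoded = s.map(map_dict)
print(encoded.tolist())  # [0.0, 1.0, 2.0, nan] -- the unmapped 'Crew' became NaN
```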
2. Ordinal Encoding¶
Scikit-learn's OrdinalEncoder encodes categorical features as integers. Let's apply it to the alive and alone features.
from sklearn.preprocessing import OrdinalEncoder
cats_feats = titanic[['alive', 'alone']]
encoder = OrdinalEncoder()
cats_encoded = encoder.fit_transform(cats_feats)
The output of the encoder is a NumPy array. We can convert it back to the pandas dataframe.
titanic[['alive', 'alone']] = pd.DataFrame(cats_encoded, columns=cats_feats.columns, index=cats_feats.index)
titanic.head()
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | adult_male | deck | embark_town | alive | alone | man | woman | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | 2 | True | NaN | Southampton | 0.0 | 0.0 | 1 | 0 |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | 0 | False | C | Cherbourg | 1.0 | 0.0 | 0 | 1 |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | 2 | False | NaN | Southampton | 1.0 | 1.0 | 0 | 1 |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | 0 | False | C | Southampton | 1.0 | 0.0 | 0 | 1 |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | 2 | True | NaN | Southampton | 0.0 | 1.0 | 1 | 0 |
encoder.categories_
[array(['no', 'yes'], dtype=object), array([False, True])]
Warning: the Ordinal Encoder can't handle missing values; it will raise an error. Try it on embarked
and see...
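One common workaround, sketched below on a made-up column standing in for embarked, is to fill the missing values with a placeholder category before encoding (the placeholder name 'missing' is an arbitrary choice):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

embarked = pd.DataFrame({'embarked': ['S', 'C', np.nan, 'Q', 'S']})

# Fill NaN with a placeholder so OrdinalEncoder does not raise on missing values
filled = embarked.fillna('missing')

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(filled)
print(encoder.categories_)  # the placeholder shows up as its own category
```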
3. Label Encoding¶
Label encoding is noted (per the sklearn documentation) to be meant for encoding target labels, but it can also be used to achieve our purpose of encoding categorical features.
It also can't handle missing values, so to keep things simple, let's drop all rows with missing values.
titanic = sns.load_dataset('titanic')
titanic_cleaned = titanic.dropna()
titanic_cleaned.isnull().sum()
survived 0 pclass 0 sex 0 age 0 sibsp 0 parch 0 fare 0 embarked 0 class 0 who 0 adult_male 0 deck 0 embark_town 0 alive 0 alone 0 dtype: int64
from sklearn.preprocessing import LabelEncoder
deck_feat = titanic_cleaned[['deck']]
label_encoder = LabelEncoder()
deck_encoded = label_encoder.fit_transform(deck_feat)
/Users/jean/opt/miniconda3/envs/tensor/lib/python3.7/site-packages/sklearn/utils/validation.py:63: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel(). return f(*args, **kwargs)
As with the ordinal encoder, the output of the label encoder is a NumPy array.
titanic_cleaned['deck'] = pd.DataFrame(deck_encoded, columns=deck_feat.columns, index=deck_feat.index)
titanic_cleaned.head()
/Users/jean/opt/miniconda3/envs/tensor/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy """Entry point for launching an IPython kernel.
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | 2 | Cherbourg | yes | False |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | 2 | Southampton | yes | False |
6 | 0 | 1 | male | 54.0 | 0 | 0 | 51.8625 | S | First | man | True | 4 | Southampton | no | True |
10 | 1 | 3 | female | 4.0 | 1 | 1 | 16.7000 | S | Third | child | False | 6 | Southampton | yes | False |
11 | 1 | 1 | female | 58.0 | 0 | 0 | 26.5500 | S | First | woman | False | 2 | Southampton | yes | True |
label_encoder.classes_
array(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype=object)
titanic_cleaned['deck'].value_counts()
2 51 1 43 3 31 4 30 0 12 5 11 6 4 Name: deck, dtype: int64
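A handy property of LabelEncoder is that the integer codes can be mapped back to the original labels with `inverse_transform`. A minimal sketch on a made-up deck list:

```python
from sklearn.preprocessing import LabelEncoder

decks = ['C', 'B', 'C', 'A', 'G']  # a made-up deck column

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(decks)
print(codes.tolist())  # [2, 1, 2, 0, 3] -- codes follow the sorted classes A, B, C, G

# Recover the original labels from the integer codes
restored = label_encoder.inverse_transform(codes)
print(list(restored))  # ['C', 'B', 'C', 'A', 'G']
```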
4. Pandas Dummies¶
This is another simple way to handle categorical features. It creates extra columns based on the available categories. Let's apply it to the feature who
.
dummies = pd.get_dummies(titanic['who'], drop_first=True)
titanic.head()
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | 2 | man | True | NaN | Southampton | no | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | 0 | woman | False | C | Cherbourg | yes | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | 2 | woman | False | NaN | Southampton | yes | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | 0 | woman | False | C | Southampton | yes | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | 2 | man | True | NaN | Southampton | no | True |
titanic = pd.concat([titanic.drop('who',axis=1),dummies],axis=1)
titanic.head()
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | adult_male | deck | embark_town | alive | alone | man | woman | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | 2 | True | NaN | Southampton | no | False | 1 | 0 |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | 0 | False | C | Cherbourg | yes | False | 0 | 1 |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | 2 | False | NaN | Southampton | yes | True | 0 | 1 |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | 0 | False | C | Southampton | yes | False | 0 | 1 |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | 2 | True | NaN | Southampton | no | True | 1 | 0 |
# Or you can do it at once with this code
#titanic[['man', 'woman']] = pd.get_dummies(titanic['who'], drop_first=True)
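`get_dummies` can also encode several columns in one call by passing the whole dataframe and a `columns` list; numeric columns pass through untouched. A minimal sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    'who': ['man', 'woman', 'child'],
    'embarked': ['S', 'C', 'S'],
    'fare': [7.25, 71.28, 8.05],
})

# Encode both categorical columns at once; 'fare' is left as-is
dummies = pd.get_dummies(df, columns=['who', 'embarked'], drop_first=True)
print(dummies.columns.tolist())  # ['fare', 'who_man', 'who_woman', 'embarked_S']
```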
5. One Hot Encoding¶
This is the last encoding type on our list. It converts a feature into a one-hot matrix: additional columns are created, one per category value. It is basically the same as pandas dummies.
from sklearn.preprocessing import OneHotEncoder
one_hot = OneHotEncoder()
town_encoded = one_hot.fit_transform(titanic_cleaned[['embark_town']])
one_hot.categories_
[array(['Cherbourg', 'Queenstown', 'Southampton'], dtype=object)]
town_encoded
<182x3 sparse matrix of type '<class 'numpy.float64'>' with 182 stored elements in Compressed Sparse Row format>
The output of the one hot encoder is a sparse matrix. We need to convert it into a dense NumPy array.
town_encoded = town_encoded.toarray()
columns = one_hot.categories_[0]
town_df = pd.DataFrame(town_encoded, columns=columns)
town_df.head()
Cherbourg | Queenstown | Southampton | |
---|---|---|---|
0 | 1.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 1.0 |
2 | 0.0 | 0.0 | 1.0 |
3 | 0.0 | 0.0 | 1.0 |
4 | 0.0 | 0.0 | 1.0 |
len(town_df)
182
len(titanic_cleaned)
182
drop_embark = titanic_cleaned.drop('embark_town', axis=1)
# town_df has a fresh 0..181 index while titanic_cleaned kept its original
# row labels, so assign by position with .values to avoid misalignment
drop_embark[['Cherbourg', 'Queenstown', 'Southampton']] = town_df.values
drop_embark.head()
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | alive | alone | Cherbourg | Queenstown | Southampton | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | 2 | yes | False | 1.0 | 0.0 | 0.0 |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | 2 | yes | False | 0.0 | 0.0 | 1.0 |
6 | 0 | 1 | male | 54.0 | 0 | 0 | 51.8625 | S | First | man | True | 4 | no | True | 0.0 | 0.0 | 1.0 |
10 | 1 | 3 | female | 4.0 | 1 | 1 | 16.7000 | S | Third | child | False | 6 | yes | False | 0.0 | 0.0 | 1.0 |
11 | 1 | 1 | female | 58.0 | 0 | 0 | 26.5500 | S | First | woman | False | 2 | yes | True | 0.0 | 0.0 | 1.0 |
Hopefully these techniques will help you to handle all kinds of categorical features.
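As a closing sketch, in practice these encoders are often applied per column with scikit-learn's ColumnTransformer, so numeric and categorical features are handled in one step. The column names below mirror the Titanic data, but the frame itself and the choice of encoder per column are illustrative assumptions:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# A small made-up frame with Titanic-like columns
df = pd.DataFrame({
    'fare': [7.25, 71.28, 8.05],
    'class': ['Third', 'First', 'Third'],
    'embark_town': ['Southampton', 'Cherbourg', 'Southampton'],
})

preprocess = ColumnTransformer(
    transformers=[
        # Explicit category order so First < Second < Third maps to 0 < 1 < 2
        ('ordinal', OrdinalEncoder(categories=[['First', 'Second', 'Third']]), ['class']),
        ('onehot', OneHotEncoder(), ['embark_town']),
    ],
    remainder='passthrough',  # keep 'fare' untouched
)

encoded = preprocess.fit_transform(df)
print(encoded.shape)  # 4 columns: class code, 2 town columns, fare
```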