This notebook was created by Jean de Dieu Nyandwi for the love of the machine learning community. For any feedback, errors, or suggestions, he can be reached by email (johnjw7084 at gmail dot com), Twitter, or LinkedIn.
Handy Notes about Feature Scaling
Machine Learning models are selective about the type and range of the values they take as input. With the exception of decision trees, most ML models will expect you to scale the input features. What is feature scaling?
Feature scaling is nothing other than transforming the numerical features into a small range of values. In this notebook, we will see the following scaling techniques:
- Normalization
- Standardization
- Robust Scaling
The terms normalization and standardization are often used interchangeably, but the two techniques are quite different and suited for different purposes.
1. Normalization
Normalization is a scaling technique that transforms a numerical feature to the range of values between 0 and 1.
Here is the formula followed when normalizing the data, where $X_{min}$ is the minimum value of feature $X$ and $X_{max}$ is the maximum value:
$$ X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}} $$
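To make the formula concrete, here is a minimal sketch that applies it by hand to a small, made-up feature array (the values are purely illustrative):
import numpy as np

# A made-up feature with values on an arbitrary scale
x = np.array([1.0, 5.0, 10.0, 50.0, 100.0])

# Apply the min-max formula: (X - Xmin) / (Xmax - Xmin)
x_norm = (x - x.min()) / (x.max() - x.min())
x_norm  # every value now falls between 0 and 1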
When Should you Normalize the Features?
When you have features that have different ranges of values, normalizing these features can be a good practice.
Take an example: if you have two features with different ranges (say one varies from 1-100 and the other from 5-300), you will want to scale them so they have the same range of values.
More specifically, normalization is the preferable scaling technique when the data at hand does not have a normal (Gaussian) distribution. If the data's distribution is Gaussian, standardization is preferable. If you don't know the distribution of the data, normalization is still a good first choice.
With that said, when the ML algorithm of choice is a neural network or K-Nearest Neighbors (KNN), normalization is a good choice because these types of algorithms don't make any assumptions about the distribution of the input data.
Most popular ML frameworks provide functions to normalize the numerical data.
For illustration purposes, I will use the tips dataset available in Seaborn.
from seaborn import load_dataset
tip_data = load_dataset('tips')
tip_data.head()
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
Let's take all numerical features from the above data.
num_feats = tip_data[['total_bill', 'tip', 'size']]
Now let's scale those numerical features with Scikit-Learn preprocessing functions. We will use MinMaxScaler, which scales the data to the range between 0 and 1 by default. If you want a different range, you can change that with the feature_range argument (a short example follows at the end of this section).
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
num_scaled = scaler.fit_transform(num_feats)
num_scaled[:5]
array([[0.29157939, 0.00111111, 0.2       ],
       [0.1522832 , 0.07333333, 0.4       ],
       [0.3757855 , 0.27777778, 0.4       ],
       [0.43171345, 0.25666667, 0.2       ],
       [0.45077503, 0.29      , 0.6       ]])
The output of the scaler is a NumPy array. You can convert it back to a Pandas DataFrame.
import pandas as pd
num_scaled_df = pd.DataFrame(num_scaled, columns=num_feats.columns)
num_scaled_df.head()
| | total_bill | tip | size |
|---|---|---|---|
| 0 | 0.291579 | 0.001111 | 0.2 |
| 1 | 0.152283 | 0.073333 | 0.4 |
| 2 | 0.375786 | 0.277778 | 0.4 |
| 3 | 0.431713 | 0.256667 | 0.2 |
| 4 | 0.450775 | 0.290000 | 0.6 |
As you can see above, all the values are scaled to values between 0 and 1.
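As mentioned earlier, if you want a different output range, you can pass a feature_range argument to MinMaxScaler. Here is a small sketch, reusing num_feats from above, that scales the features to the range -1 to 1:
from sklearn.preprocessing import MinMaxScaler

# Scale to the range -1 to 1 instead of the default 0 to 1
custom_scaler = MinMaxScaler(feature_range=(-1, 1))
num_custom = custom_scaler.fit_transform(num_feats)
num_custom[:5]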
2. Standardization
In standardization, the numerical features are rescaled to have a mean ($\mu$) of 0 and a unit standard deviation (std or $\sigma$).
Here is the formula of standardization, where $X_{std}$ is the standardized feature, $X$ is the feature, $\mu$ is the mean of the feature, and $\sigma$ is the standard deviation.
$$ X_{std} = \frac{X - \mu}{\sigma} $$
When Should you Standardize the Features?
When you know that the training data at hand has a normal (Gaussian) distribution, you should standardize it.
Some ML algorithms, such as Support Vector Machines (with the RBF kernel) and linear models, expect the input data to have a normal distribution.
In most cases, the choice between normalization and standardization won't make much of a difference, but sometimes it does. So it makes sense to try both, especially if you are not sure about the distribution of the data.
Here is how Standardization is implemented in Scikit-Learn.
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
num_std = std_scaler.fit_transform(num_feats)
num_std[:5]
array([[-0.31471131, -1.43994695, -0.60019263],
       [-1.06323531, -0.96920534,  0.45338292],
       [ 0.1377799 ,  0.36335554,  0.45338292],
       [ 0.4383151 ,  0.22575414, -0.60019263],
       [ 0.5407447 ,  0.4430195 ,  1.50695847]])
# The mean of each feature in the scaled data
std_scaler.mean_
array([19.78594262, 2.99827869, 2.56967213])
# Variance of the scaled features
std_scaler.var_
array([78.92813149, 1.90660851, 0.9008835 ])
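As a quick sanity check, here is a minimal sketch that applies the standardization formula by hand, using the fitted mean_ and var_ attributes (the standard deviation is the square root of the variance), and confirms it matches the scaler's output:
import numpy as np

# Apply (X - mean) / std manually, per feature (column)
num_std_manual = (num_feats.values - std_scaler.mean_) / np.sqrt(std_scaler.var_)

# The manual result matches StandardScaler's output
np.allclose(num_std_manual, num_std)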
The scaled data has zero mean and unit variance.
import numpy as np
print(f'The mean of scaled data: {np.round(num_std.mean(axis=0))}')
print(f'The standard deviation of scaled data: {num_std.std(axis=0)}')
The mean of scaled data: [-0.  0. -0.]
The standard deviation of scaled data: [1. 1. 1.]
# Converting the scaled values back to a dataframe
num_std_scaled_df = pd.DataFrame(num_std, columns=num_feats.columns)
num_std_scaled_df.head()
| | total_bill | tip | size |
|---|---|---|---|
| 0 | -0.314711 | -1.439947 | -0.600193 |
| 1 | -1.063235 | -0.969205 | 0.453383 |
| 2 | 0.137780 | 0.363356 | 0.453383 |
| 3 | 0.438315 | 0.225754 | -0.600193 |
| 4 | 0.540745 | 0.443020 | 1.506958 |
3. Robust Scaler
The robust scaler is similar to standardization but is used when the data contains many outliers.
Instead of subtracting the mean, the median is subtracted, and the data is scaled by the interquartile range (IQR). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
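Written in the same style as the formulas above, robust scaling of a feature $X$ can be sketched as follows, where $X_{median}$ is the median of the feature and $IQR$ is its interquartile range:
$$ X_{robust} = \frac{X - X_{median}}{IQR} $$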
Like normalization and standardization, the robust scaler is also easily implemented in Scikit-Learn.
from sklearn.preprocessing import RobustScaler
rob_scaler = RobustScaler()
num_rob_scaled = rob_scaler.fit_transform(num_feats)
num_rob_scaled[:5]
array([[-0.07467532, -1.2096    ,  0.        ],
       [-0.69155844, -0.7936    ,  1.        ],
       [ 0.29823748,  0.384     ,  1.        ],
       [ 0.54591837,  0.2624    ,  0.        ],
       [ 0.63033395,  0.4544    ,  2.        ]])
After scaling the data with the robust scaler, the resulting values will have a median of zero.
print(f'The median of scaled data: {np.round(np.median(num_rob_scaled, axis=0))}')
The median of scaled data: [-0. 0. 0.]
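As another sanity check, here is a minimal sketch that reproduces the robust scaling by hand with NumPy, subtracting the median and dividing by the IQR (75th minus 25th percentile), which are RobustScaler's defaults:
import numpy as np

# Subtract the median and divide by the IQR, per feature (column)
median = np.median(num_feats.values, axis=0)
q25, q75 = np.percentile(num_feats.values, [25, 75], axis=0)
num_rob_manual = (num_feats.values - median) / (q75 - q25)

# The manual result matches RobustScaler's output
np.allclose(num_rob_manual, num_rob_scaled)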
Final Notes
Scaling the input data before feeding it to a machine learning model is generally a good practice.
Here are the punchlines:
- Scaling the features helps the model to converge faster.
- Normalization is scaling the data to be between 0 and 1. It is preferred when the data does not have a normal distribution.
- Standardization is scaling the data to have 0 mean and unit standard deviation. It is preferred when the data has a normal (Gaussian) distribution.
- Robust scaling technique is used if the data has many outliers.
- In most cases, the choice of scaling technique won't make much difference, but sometimes it can. Try all of them and see what works best with your data.
- Only the features are scaled. The labels should not be scaled.
- Make sure not to fit the scaler on the test data. Only transform it.
Don't: scaler.fit_transform(X_test)
Do: scaler.transform(X_test)
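To make that last point concrete, here is a minimal sketch of the fit-on-train, transform-on-test pattern, using a hypothetical train/test split of the numerical tips features from above:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical split of the numerical tips features used above
X_train, X_test = train_test_split(num_feats, test_size=0.2, random_state=42)

scaler = StandardScaler()

# Fit the scaler on the training data only, then transform both sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only, never fit on the test set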