This notebook was created by Jean de Dieu Nyandwi for the love of the machine learning community. For any feedback, errors, or suggestions, he can be reached by email (johnjw7084 at gmail dot com), Twitter, or LinkedIn.
Handy Notes about Feature Scaling
Machine Learning models are selective about the type and range of the values they take as input. With the exception of decision trees, most ML models will expect you to scale the input features. What is feature scaling?
Feature scaling is nothing other than transforming the numerical features into a small range of values. In this notebook, we will see the following scaling techniques:
- Normalization
- Standardization
- Robust Scaling
The terms normalization and standardization are often used interchangeably, but the two techniques are quite different and suited for different purposes.
1. Normalization
Normalization is a scaling technique that transforms a numerical feature to the range of values between 0 and 1.
Here is the formula followed when normalizing the data, where $X_{min}$ is the minimum value of feature $X$ and $X_{max}$ is the maximum value:
$$ X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}} $$
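To make the formula concrete, here is a minimal sketch that applies it by hand to a small, made-up feature array (the values are purely illustrative):
import numpy as np

# A made-up feature with values on an arbitrary scale
x = np.array([1.0, 5.0, 10.0, 50.0, 100.0])

# Apply the min-max formula: (X - Xmin) / (Xmax - Xmin)
x_norm = (x - x.min()) / (x.max() - x.min())
x_norm  # every value now falls between 0 and 1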
When Should you Normalize the Features?
When you have features that have different ranges of values, normalizing these features can be a good practice.
Take an example: if you have two features with different ranges (say one varies from 1-100 and the other from 5-300), you will want to scale them so they have the same range of values.
More specifically, normalization is the preferable scaling technique when the data at hand does not have a normal (Gaussian) distribution. If the data's distribution is Gaussian, standardization is preferable. If you don't know the distribution of the data, normalization is still a good first choice.
With that said, when the ML algorithm of choice is a neural network or K-Nearest Neighbors (KNN), normalization is a good choice because these types of algorithms don't make any assumptions about the distribution of the input data.
Most popular ML frameworks provide functions to normalize the numerical data.
For illustration purposes, I will use the tips dataset available in Seaborn.
from seaborn import load_dataset
tip_data = load_dataset('tips')
tip_data.head()
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
Let's take all numerical features from the above data.
num_feats = tip_data[['total_bill', 'tip', 'size']]
Now let's scale those numerical features with Scikit-Learn preprocessing functions. We will use MinMaxScaler, which scales the data to the range between 0 and 1 by default. If you want a different range, you can change that with the feature_range argument (a short example follows at the end of this section).
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
num_scaled = scaler.fit_transform(num_feats)
num_scaled[:5]
array([[0.29157939, 0.00111111, 0.2       ],
       [0.1522832 , 0.07333333, 0.4       ],
       [0.3757855 , 0.27777778, 0.4       ],
       [0.43171345, 0.25666667, 0.2       ],
       [0.45077503, 0.29      , 0.6       ]])
The output of the scaler is a NumPy array. You can convert it back to a Pandas DataFrame.
import pandas as pd
num_scaled_df = pd.DataFrame(num_scaled, columns=num_feats.columns)
num_scaled_df.head()
| | total_bill | tip | size |
|---|---|---|---|
| 0 | 0.291579 | 0.001111 | 0.2 |
| 1 | 0.152283 | 0.073333 | 0.4 |
| 2 | 0.375786 | 0.277778 | 0.4 |
| 3 | 0.431713 | 0.256667 | 0.2 |
| 4 | 0.450775 | 0.290000 | 0.6 |
As you can see above, all the values are scaled to values between 0 and 1.
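As mentioned earlier, if you want a different output range, you can pass a feature_range argument to MinMaxScaler. Here is a small sketch, reusing num_feats from above, that scales the features to the range -1 to 1:
from sklearn.preprocessing import MinMaxScaler

# Scale to the range -1 to 1 instead of the default 0 to 1
custom_scaler = MinMaxScaler(feature_range=(-1, 1))
num_custom = custom_scaler.fit_transform(num_feats)
num_custom[:5]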
2. Standardization
In standardization, the numerical features are rescaled to have a mean ($\mu$) of 0 and a unit standard deviation (std or $\sigma$).
Here is the formula of standardization, where $X_{std}$ is the standardized feature, $X$ is the feature, $\mu$ is the mean of the feature, and $\sigma$ is the standard deviation.
$$ X_{std} = \frac{X - \mu}{\sigma} $$
When Should you Standardize the Features?
When you know that the training data at hand has a normal (Gaussian) distribution, you should standardize it.
Some ML algorithms, such as Support Vector Machines (with the RBF kernel) and linear models, expect the input data to have a normal distribution.
In most cases, the choice between normalization and standardization won't make much of a difference, but sometimes it does. So it makes sense to try both, especially if you are not sure about the distribution of the data.
Here is how Standardization is implemented in Scikit-Learn.
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
num_std = std_scaler.fit_transform(num_feats)
num_std[:5]
array([[-0.31471131, -1.43994695, -0.60019263],
       [-1.06323531, -0.96920534,  0.45338292],
       [ 0.1377799 ,  0.36335554,  0.45338292],
       [ 0.4383151 ,  0.22575414, -0.60019263],
       [ 0.5407447 ,  0.4430195 ,  1.50695847]])
# The mean of each feature in the scaled data
std_scaler.mean_
array([19.78594262, 2.99827869, 2.56967213])
# Variance of the scaled features
std_scaler.var_
array([78.92813149, 1.90660851, 0.9008835 ])
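As a quick sanity check, here is a minimal sketch that applies the standardization formula by hand, using the fitted mean_ and var_ attributes (the standard deviation is the square root of the variance), and confirms it matches the scaler's output:
import numpy as np

# Apply (X - mean) / std manually, per feature (column)
num_std_manual = (num_feats.values - std_scaler.mean_) / np.sqrt(std_scaler.var_)

# The manual result matches StandardScaler's output
np.allclose(num_std_manual, num_std)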
The scaled data has zero mean and unit variance.
import numpy as np
print(f'The mean of scaled data: {np.round(num_std.mean(axis=0))}')
print(f'The standard deviation of scaled data: {num_std.std(axis=0)}')
The mean of scaled data: [-0.  0. -0.]
The standard deviation of scaled data: [1. 1. 1.]
# Converting the scaled values back to a dataframe
num_std_scaled_df = pd.DataFrame(num_std, columns=num_feats.columns)
num_std_scaled_df.head()
| | total_bill | tip | size |
|---|---|---|---|
| 0 | -0.314711 | -1.439947 | -0.600193 |
| 1 | -1.063235 | -0.969205 | 0.453383 |
| 2 | 0.137780 | 0.363356 | 0.453383 |
| 3 | 0.438315 | 0.225754 | -0.600193 |
| 4 | 0.540745 | 0.443020 | 1.506958 |
3. Robust Scaler
The robust scaler is similar to standardization but is used when the data contains many outliers.
Instead of subtracting the mean, the median is subtracted, and the data is scaled by the interquartile range (IQR). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
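Written in the same style as the formulas above, robust scaling of a feature $X$ can be sketched as follows, where $X_{median}$ is the median of the feature and $IQR$ is its interquartile range:
$$ X_{robust} = \frac{X - X_{median}}{IQR} $$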
Like normalization and standardization, the robust scaler is also easily implemented in Scikit-Learn.
from sklearn.preprocessing import RobustScaler
rob_scaler = RobustScaler()
num_rob_scaled = rob_scaler.fit_transform(num_feats)
num_rob_scaled[:5]
array([[-0.07467532, -1.2096    ,  0.        ],
       [-0.69155844, -0.7936    ,  1.        ],
       [ 0.29823748,  0.384     ,  1.        ],
       [ 0.54591837,  0.2624    ,  0.        ],
       [ 0.63033395,  0.4544    ,  2.        ]])
After scaling the data with the robust scaler, the resulting values will have a median of zero.
print(f'The median of scaled data: {np.round(np.median(num_rob_scaled, axis=0))}')
The median of scaled data: [-0. 0. 0.]
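As another sanity check, here is a minimal sketch that reproduces the robust scaling by hand with NumPy, subtracting the median and dividing by the IQR (75th minus 25th percentile), which are RobustScaler's defaults:
import numpy as np

# Subtract the median and divide by the IQR, per feature (column)
median = np.median(num_feats.values, axis=0)
q25, q75 = np.percentile(num_feats.values, [25, 75], axis=0)
num_rob_manual = (num_feats.values - median) / (q75 - q25)

# The manual result matches RobustScaler's output
np.allclose(num_rob_manual, num_rob_scaled)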
Final Notes
Scaling the input data before feeding it to a machine learning model is generally a good practice.
Here are the punchlines:
- Scaling the features helps the model to converge faster.
- Normalization is scaling the data to be between 0 and 1. It is preferred when the data does not have a normal distribution.
- Standardization is scaling the data to have 0 mean and unit standard deviation. It is preferred when the data has a normal (Gaussian) distribution.
- Robust scaling technique is used if the data has many outliers.
- In most cases, the choice of scaling technique won't make much difference, but sometimes it can. Try all of them and see what works best with your data.
- Only the features are scaled. The labels should not be scaled.
- Make sure not to fit the scaler on the test data. Only transform it.
Don't: scaler.fit_transform(X_test)
Do: scaler.transform(X_test)
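To make that last point concrete, here is a minimal sketch of the fit-on-train, transform-on-test pattern, using a hypothetical train/test split of the numerical tips features from above:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical split of the numerical tips features used above
X_train, X_test = train_test_split(num_feats, test_size=0.2, random_state=42)

scaler = StandardScaler()

# Fit the scaler on the training data only, then transform both sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only, never fit on the test set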