This notebook was created by Jean de Dieu Nyandwi for the love of the machine learning community. For any feedback, errors, or suggestions, he can be reached by email (johnjw7084 at gmail dot com), Twitter, or LinkedIn.
Random Forests - Intro and Regression¶
Random Forests are powerful machine learning algorithms used for supervised classification and regression. A random forest works by averaging the predictions of multiple randomized decision trees. Individual decision trees tend to overfit, so combining many of them reduces the effect of overfitting.
Random forests are a type of ensemble model. More about ensemble models in the next notebook.
Unlike many other learning algorithms, random forests provide a way to estimate the importance of each feature, and this is implemented in Sklearn.
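To make the averaging idea concrete, below is a minimal sketch (using hypothetical toy data from make_regression, not the dataset used in this notebook) showing that a fitted regression forest's prediction is simply the mean of its individual trees' predictions.
# A small sketch: a regression forest's prediction is the mean of its trees' predictions
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_toy, y_toy = make_regression(n_samples=100, n_features=4, random_state=0)
toy_forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Average the predictions of the individual trees by hand...
tree_preds = np.stack([tree.predict(X_toy) for tree in toy_forest.estimators_])
manual_average = tree_preds.mean(axis=0)

# ...and confirm it matches the forest's own prediction
assert np.allclose(manual_average, toy_forest.predict(X_toy))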
Random Forests for Regression¶
1 - Imports¶
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline
2 - Loading the data¶
In this regression task with random forests, we will use the same dataset previously used with the decision trees regressor: the Machine CPU (Central Processing Unit) data, which is available on OpenML. We will load it with Sklearn's fetch_openml function.
If you are reading this, it's very likely that you know what a CPU is, or that you have thought about it at least once (or many times) when buying a computer. In this notebook, we will predict the relative performance of a CPU given the following data:
- MYCT: machine cycle time in nanoseconds (integer)
- MMIN: minimum main memory in kilobytes (integer)
- MMAX: maximum main memory in kilobytes (integer)
- CACH: cache memory in kilobytes (integer)
- CHMIN: minimum channels in units (integer)
- CHMAX: maximum channels in units (integer)
- PRP: published relative performance (integer) (target variable)
# Let's hide warnings
import warnings
warnings.filterwarnings('ignore')
from sklearn.datasets import fetch_openml
machine_cpu = fetch_openml(name='machine_cpu', version=1)
type(machine_cpu)
sklearn.utils.Bunch
machine_cpu.details
{'id': '230', 'name': 'machine_cpu', 'version': '1', 'description_version': '1', 'format': 'ARFF', 'contributor': 'L. Torgo', 'upload_date': '2014-04-23T13:20:36', 'language': 'English', 'licence': 'Public', 'url': 'https://www.openml.org/data/v1/download/3667/machine_cpu.arff', 'file_id': '3667', 'default_target_attribute': 'class', 'version_label': '1', 'citation': 'https://archive.ics.uci.edu/ml/citation_policy.html', 'tag': 'OpenML-Reg19', 'visibility': 'public', 'original_data_url': 'http://www.ics.uci.edu/~mlearn/MLSummary.html', 'minio_url': 'http://openml1.win.tue.nl/dataset230/dataset_230.pq', 'status': 'active', 'processing_date': '2020-11-20 19:15:43', 'md5_checksum': 'e26d62e83069b74dff6cf492e06868a0'}
machine_cpu.data.shape
(209, 6)
print(machine_cpu.DESCR)
**Author**: **Source**: Unknown - **Please cite**: The problem concerns Relative CPU Performance Data. More information can be obtained in the UCI Machine Learning repository (http://www.ics.uci.edu/~mlearn/MLSummary.html). The used attributes are : MYCT: machine cycle time in nanoseconds (integer) MMIN: minimum main memory in kilobytes (integer) MMAX: maximum main memory in kilobytes (integer) CACH: cache memory in kilobytes (integer) CHMIN: minimum channels in units (integer) CHMAX: maximum channels in units (integer) PRP: published relative performance (integer) (target variable) Original source: UCI machine learning repository. Source: collection of regression datasets by Luis Torgo (ltorgo@ncc.up.pt) at http://www.ncc.up.pt/~ltorgo/Regression/DataSets.html Characteristics: 209 cases; 6 continuous variables Downloaded from openml.org.
# Displaying feature names
machine_cpu.feature_names
['MYCT', 'MMIN', 'MMAX', 'CACH', 'CHMIN', 'CHMAX']
# Getting the whole dataframe
machine_data = machine_cpu.frame
type(machine_data)
pandas.core.frame.DataFrame
3 - Exploratory Analysis¶
Before doing exploratory analysis, let's get the training and test data.
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(machine_data, test_size=0.2, random_state=20)
print('The size of training data is: {} \nThe size of testing data is: {}'.format(len(train_data), len(test_data)))
The size of training data is: 167 The size of testing data is: 42
Let's visualize the histograms of all numeric features.
train_data.hist(bins=50, figsize=(15,10))
plt.show()
Or we can quickly use sns.pairplot() to look into the data.
sns.pairplot(train_data)
<seaborn.axisgrid.PairGrid at 0x7faa5972d890>
# Checking summary stats
train_data.describe()
|       | MYCT | MMIN | MMAX | CACH | CHMIN | CHMAX | class |
|-------|------|------|------|------|-------|-------|-------|
| count | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 |
| mean | 207.958084 | 2900.826347 | 11761.161677 | 26.071856 | 4.760479 | 18.616766 | 109.185629 |
| std | 266.772823 | 4165.950964 | 12108.332354 | 42.410014 | 6.487439 | 27.489919 | 174.061117 |
| min | 17.000000 | 64.000000 | 64.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 |
| 25% | 50.000000 | 768.000000 | 4000.000000 | 0.000000 | 1.000000 | 5.000000 | 27.500000 |
| 50% | 110.000000 | 2000.000000 | 8000.000000 | 8.000000 | 2.000000 | 8.000000 | 50.000000 |
| 75% | 232.500000 | 3100.000000 | 16000.000000 | 32.000000 | 6.000000 | 24.000000 | 110.000000 |
| max | 1500.000000 | 32000.000000 | 64000.000000 | 256.000000 | 52.000000 | 176.000000 | 1150.000000 |
# Checking missing values
train_data.isnull().sum()
MYCT     0
MMIN     0
MMAX     0
CACH     0
CHMIN    0
CHMAX    0
class    0
dtype: int64
We don't have any missing values.
# Checking feature correlation
corr = train_data.corr()
corr['class']
MYCT    -0.301805
MMIN     0.797751
MMAX     0.869077
CACH     0.671581
CHMIN    0.648653
CHMAX    0.606557
class    1.000000
Name: class, dtype: float64
## Visualizing correlation
plt.figure(figsize=(12,7))
sns.heatmap(corr, annot=True, cmap='crest')
<AxesSubplot:>
4 - Data Preprocessing¶
It is here that we prepare the data to be in the proper format for the machine learning model.
Let's set up a pipeline to scale the features, but before that, let's separate the training inputs from the labels. Note that tree-based models such as random forests are not sensitive to feature scaling, so this step is optional; we keep it to maintain a consistent workflow.
X_train = train_data.drop('class', axis=1)
y_train = train_data['class']
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
scale_pipe = Pipeline([
('scaler', StandardScaler())
])
X_train_scaled = scale_pipe.fit_transform(X_train)
5 - Training Random Forests Regressor¶
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(min_samples_split=2, bootstrap=False, random_state=42, n_jobs=-1)
forest_reg.fit(X_train_scaled, y_train)
RandomForestRegressor(bootstrap=False, n_jobs=-1, random_state=42)
6 - Evaluating Random Forests Regressor¶
Let's first check the root mean squared error on the training set. It is not advised to evaluate the model on the test data since we haven't improved it yet. I will make a function to make this easier and to avoid repetition.
from sklearn.metrics import mean_squared_error
def predict(input_data, model, labels):
    """
    Take the input data, model and labels and return the root mean squared error of the predictions
    """
    preds = model.predict(input_data)
    mse = mean_squared_error(labels, preds)
    rmse = np.sqrt(mse)
    return rmse
predict(X_train_scaled, forest_reg, y_train)
9.724590719956222
7 - Improving Random Forests¶
forest_reg.get_params()
{'bootstrap': False, 'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': -1, 'oob_score': False, 'random_state': 42, 'verbose': 0, 'warm_start': False}
We will use GridSearch to find the best hyperparameters and retrain the model with them. By setting refit to True, the random forest is automatically retrained on the dataset with the best hyperparameters. By default, refit is True. This search may take a while to run.
from sklearn.model_selection import GridSearchCV
params_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_leaf_nodes': list(range(2, 52))}  # max_leaf_nodes must be greater than 1

# refit is True by default: the best estimator is retrained on the whole training set
grid_search = GridSearchCV(RandomForestRegressor(min_samples_split=2, bootstrap=False, random_state=42), params_grid, verbose=1, cv=5)
grid_search.fit(X_train_scaled, y_train)
Fitting 5 folds for each of 250 candidates, totalling 1250 fits
GridSearchCV(cv=5, estimator=RandomForestRegressor(bootstrap=False, random_state=42), param_grid={'max_leaf_nodes': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, ...], 'n_estimators': [100, 200, 300, 400, 500]}, verbose=1)
grid_search.best_params_
{'max_leaf_nodes': 42, 'n_estimators': 200}
grid_search.best_estimator_
RandomForestRegressor(bootstrap=False, max_leaf_nodes=42, n_estimators=200, random_state=42)
forest_best = grid_search.best_estimator_
Let's make predictions on the training data again.
predict(X_train_scaled, forest_best, y_train)
12.709506767466658
Surprisingly, searching over model hyperparameters did not improve the model. Can you guess why? I have observed a few things while running Grid Search and reading about random forests. If you can't get good results, set bootstrap to False. It is True by default, which means that each tree is trained on a bootstrap sample of the training set instead of the whole training set. Try going back to the original model, change it to True, and note how the prediction changes. Also, learn more about the other hyperparameters.
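As a quick sketch of the suggested experiment (not run in the original notebook), you can retrain the same model with bootstrap=True and compare the training RMSE using the predict helper defined earlier.
# Sketch of the suggested experiment: same model, but with bootstrap=True,
# so each tree is trained on a bootstrap sample of the training set
forest_boot = RandomForestRegressor(min_samples_split=2, bootstrap=True,
                                    random_state=42, n_jobs=-1)
forest_boot.fit(X_train_scaled, y_train)
predict(X_train_scaled, forest_boot, y_train)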
8 - Feature Importance¶
Unlike many other machine learning models, random forests can show how much each feature contributed to the model's predictions. Let's find out. The importances are values between 0 and 1 that sum to 1; the higher the value, the more important the feature was to the model.
feat_import = forest_best.feature_importances_
feat_dict ={
'Features': X_train.columns,
'Feature Importance': feat_import
}
pd.DataFrame(feat_dict)
|   | Features | Feature Importance |
|---|----------|--------------------|
| 0 | MYCT | 0.002991 |
| 1 | MMIN | 0.005537 |
| 2 | MMAX | 0.835814 |
| 3 | CACH | 0.117431 |
| 4 | CHMIN | 0.007570 |
| 5 | CHMAX | 0.030657 |
As you can see above, the two features that contributed most to the prediction of the relative CPU performance are MMAX, the maximum main memory in kilobytes, and CACH (cache memory).
It makes sense that the model was able to find that out. Main memory (RAM, Random Access Memory) and cache memory (which stores frequently used information, facilitating faster processing and quick retrieval) are two of the most important factors in CPU performance. If you are going to buy a new computer, you want plenty of RAM and cache memory in order to have a powerful machine that can process and retrieve things quickly.
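If you want to see the importances at a glance, here is a small optional sketch (using the feat_import array and training columns from above) that plots them as a sorted bar chart.
# Optional: plot the feature importances as a sorted horizontal bar chart
importances = pd.Series(feat_import, index=X_train.columns).sort_values()
importances.plot.barh(figsize=(8, 5), title='Random forest feature importances')
plt.show()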
9 - Evaluating the Model on the Test Set¶
Let us evaluate the model on the test set. But first, we run the scaling pipeline on the test data. Note that we only transform (not fit_transform), so that the test data is scaled with the statistics learned from the training set.
X_test = test_data.drop('class', axis=1)
y_test = test_data['class']
test_scaled = scale_pipe.transform(X_test)
predict(test_scaled, forest_best, y_test)
41.35371179215193
The result on the test set is not appealing, and it is a sign that the model is still overfitting: it does well on the training set and poorly on new data. Ways to improve the model include regularizing it by searching for better hyperparameters and increasing the quantity and quality of the data. The latter is what improves the model in many scenarios.
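As one possible direction (a sketch only; the hyperparameter ranges below are illustrative, not tuned), you could search over regularizing hyperparameters such as max_depth and min_samples_leaf with cross-validation.
# A sketch of searching regularizing hyperparameters (ranges are illustrative)
from sklearn.model_selection import RandomizedSearchCV

reg_params = {
    'max_depth': [4, 8, 16, None],
    'min_samples_leaf': [1, 2, 4, 8],
    'n_estimators': [100, 200, 400],
}
reg_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42, n_jobs=-1),
    reg_params, n_iter=20, cv=5, random_state=42)
reg_search.fit(X_train_scaled, y_train)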
This is the end of the notebook. We have learned the fundamental idea behind random forests and used it to predict relative CPU performance. In the next lab, we will use random forests for a classification task on a real-world dataset so that we can practice improving them.