This notebook was created by Jean de Dieu Nyandwi for the love of the machine learning community. For any feedback, errors, or suggestions, he can be reached by email (johnjw7084 at gmail dot com), Twitter, or LinkedIn.
Linear Models for Regression
Earlier in the previous parts, we saw that regression is a type of supervised learning in which we predict a continuous value. For example: given information about a house, such as its region, size, and number of bedrooms, can you predict the price of that house?
That is in fact what you will see in this lab, and to achieve it, we will use the regression models available in Scikit-Learn.
Let's get started. How exciting is that!!
As much as we can, we will try to structure all machine learning projects in accordance with the standard ML workflow. Here are the typical steps that you will see in most ML projects:
1. Problem Formulation
Here is the problem: a real estate agent knows that you're a Machine Learning Engineer and would like you to help build a machine learning model that can predict the price of a house given the information about that particular house.
The idea of creating a model clicked, and you replied: oh yeah, that sounds cool! Let's do it!!
You have understood the problem, and you probably already have an idea of the type of model you will use. You know there are many models, such as random forests, decision trees, and neural networks, but you have learned that it's always okay to start simple, so you will use Linear Regression since this is not a complex problem.
Understanding the problem well goes beyond determining the right model for it; it also guides effective data processing and error analysis, as you will find out.
It's time to collect the data now.
2. Collecting the data
Ideally, the real estate agent would hand you their own housing data, but unfortunately they told you that since the model will be used in California, it's okay to use the California housing dataset, which is publicly available and free to use.
So, we will collect the data from the internet. Fortunately, Scikit-Learn provides the exact same data. Let's load it, but first, let's import all the relevant libraries that we will need.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
# Loading the data
X, y = fetch_california_housing(return_X_y=True)
type(X)
numpy.ndarray
type(y)
numpy.ndarray
Now that we have the dataset, we can see what it looks like. X is the training data and y contains the training labels.

But wait, the data from Sklearn comes as NumPy arrays, already prepared to be fed directly to a model. That would be easier, but quite often real-world data is not like that. We usually have to do our own work to process the data before it can be fed to an ML model.
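As a side note, recent versions of Scikit-Learn can also hand you this built-in dataset as a pandas DataFrame. Here is a minimal sketch, assuming scikit-learn >= 0.23 (the version where the as_frame parameter was added):

# Optional: fetch the built-in dataset as a DataFrame instead of NumPy arrays
# (assumes scikit-learn >= 0.23, where as_frame was introduced)
cal_bunch = fetch_california_housing(as_frame=True)
cal_frame = cal_bunch.frame  # features and target combined in one DataFrame
cal_frame.head()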
Let's get the real data. You can learn more about the data on Kaggle.
import urllib.request
data_path = 'https://raw.githubusercontent.com/nyandwi/public_datasets/master/housing.csv'
def download_read_data(path):
    """
    Function to retrieve data from the data path
    and to read it into a pandas dataframe.
    """
    # urlretrieve downloads the file and returns the local file path first
    local_path = urllib.request.urlretrieve(path)[0]
    # Read the downloaded file into a dataframe
    data = pd.read_csv(local_path)
    return data
Now that we have real-world data, let's see what it looks like.
cal_data = download_read_data(data_path)
cal_data.head()
|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
cal_data.tail()
|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | INLAND |
| 20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | INLAND |
| 20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 | 92300.0 | INLAND |
| 20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | INLAND |
| 20639 | -121.24 | 39.37 | 16.0 | 2785.0 | 616.0 | 1387.0 | 530.0 | 2.3886 | 89400.0 | INLAND |
Information about the features
1. longitude: A measure of how far west a house is; a higher value is farther west
2. latitude: A measure of how far north a house is; a higher value is farther north
3. housing_median_age: Median age of a house within a block; a lower number is a newer building
4. total_rooms: Total number of rooms within a block
5. total_bedrooms: Total number of bedrooms within a block
6. population: Total number of people residing within a block
7. households: Total number of households, a group of people residing within a home unit, for a block
8. median_income: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
9. median_house_value: Median house value for households within a block (measured in US Dollars)
10. ocean_proximity: Location of the house w.r.t. the ocean/sea
Source: Kaggle
cal_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
len(cal_data)
20640
len(cal_data.columns)
10
So, we have 20640 data points and 10 features. Of those 10 features, 9 are input features and median_house_value is the target variable/label.
3. Exploratory Data Analysis
This is where we are going to understand more about the data. But before we get there, let's split the data into training and testing sets. This is because in the course of data analysis and processing there can be data leakage. Put in other words, we don't want the model to see the data that it will be tested on. If it does, the model will appear to make good predictions (or generalize well) on the test data because it has already seen it, but it will fail to give accurate predictions on future data.
As a side note, the training set is used during model training, and the testing set is used during model evaluation. As we go, we will try to explain some terminology! ML is huge :)
To split the data, sklearn provides a handy function.
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(cal_data, test_size=0.1, random_state=20)
print('The size of training data is: {} \nThe size of testing data is: {}'.format(len(train_data), len(test_data)))
The size of training data is: 18576 
The size of testing data is: 2064
As you can see, we have allocated 10 percent of the full data to the test set.
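A quick aside that is not required for this lab: when a feature such as median_income is known to drive the target, a stratified split keeps its distribution similar across the training and test sets. Here is a minimal sketch; the income bin edges below are our own choice, not something the dataset prescribes.

# Illustrative sketch: stratify the split on binned median income so that
# the train and test sets share a similar income distribution
income_bins = pd.cut(cal_data['median_income'],
                     bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                     labels=[1, 2, 3, 4, 5])
strat_train, strat_test = train_test_split(cal_data, test_size=0.1,
                                           random_state=20, stratify=income_bins)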
# Let's copy the training data so we can revert to it in case we mess things up
cal_train = train_data.copy()
Checking data statistics
# By default, describe shows the stats of the numerical features.
# The include parameter gives us the option to show all features
train_data.describe(include='all').transpose()
|   | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| longitude | 18576.0 | NaN | NaN | NaN | -119.56753 | 2.000581 | -124.35 | -121.79 | -118.49 | -118.01 | -114.49 |
| latitude | 18576.0 | NaN | NaN | NaN | 35.630217 | 2.13326 | 32.54 | 33.93 | 34.26 | 37.71 | 41.95 |
| housing_median_age | 18576.0 | NaN | NaN | NaN | 28.661068 | 12.604039 | 1.0 | 18.0 | 29.0 | 37.0 | 52.0 |
| total_rooms | 18576.0 | NaN | NaN | NaN | 2631.567453 | 2169.46745 | 2.0 | 1445.0 | 2127.0 | 3149.0 | 39320.0 |
| total_bedrooms | 18390.0 | NaN | NaN | NaN | 537.344698 | 417.672864 | 1.0 | 295.0 | 435.0 | 648.0 | 6445.0 |
| population | 18576.0 | NaN | NaN | NaN | 1422.408376 | 1105.486111 | 3.0 | 785.75 | 1166.0 | 1725.0 | 28566.0 |
| households | 18576.0 | NaN | NaN | NaN | 499.277078 | 379.473497 | 1.0 | 279.0 | 410.0 | 606.0 | 6082.0 |
| median_income | 18576.0 | NaN | NaN | NaN | 3.870053 | 1.900225 | 0.4999 | 2.5643 | 3.5341 | 4.742725 | 15.0001 |
| median_house_value | 18576.0 | NaN | NaN | NaN | 206881.011305 | 115237.605962 | 14999.0 | 120000.0 | 179800.0 | 264700.0 | 500001.0 |
| ocean_proximity | 18576 | 5 | <1H OCEAN | 8231 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Checking Missing Values
train_data.isnull().sum()
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        186
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64
We only have missing values in the total_bedrooms feature. To go a little further, here is the percentage of missing values in that feature.
print('The Percentage of missing values in total_bedrooms is: {}%'.format(train_data.isnull().sum()['total_bedrooms'] / len(train_data) * 100))
The Percentage of missing values in total_bedrooms is: 1.0012919896640826%
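Only about 1 percent of the values are missing, so they will be easy to handle during data processing. As a minimal sketch, assuming we simply fill them with the feature's median, Scikit-Learn's SimpleImputer could do it like this:

from sklearn.impute import SimpleImputer

# Illustrative sketch: fill missing total_bedrooms values with the median.
# fit_transform learns the median and fills the gaps in one step.
imputer = SimpleImputer(strategy='median')
bedrooms_filled = imputer.fit_transform(train_data[['total_bedrooms']])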
Checking Values in the Categorical Feature(s)
train_data['ocean_proximity'].value_counts()
<1H OCEAN     8231
INLAND        5896
NEAR OCEAN    2384
NEAR BAY      2061
ISLAND           4
Name: ocean_proximity, dtype: int64
sns.countplot(data=train_data, x='ocean_proximity')
<AxesSubplot:xlabel='ocean_proximity', ylabel='count'>
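One thing to keep in mind: ocean_proximity is a categorical feature, and linear models expect numerical inputs, so it will have to be encoded during data processing. Here is a minimal sketch using pandas one-hot encoding, just one of several possible encodings:

# Illustrative sketch: one-hot encode the categorical feature with pandas
ocean_encoded = pd.get_dummies(train_data['ocean_proximity'], prefix='ocean')
ocean_encoded.head()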
Checking Correlation Between Features
correlation = train_data.corr()
correlation['median_house_value']
longitude            -0.048622
latitude             -0.142543
housing_median_age    0.105237
total_rooms           0.133927
total_bedrooms        0.049672
population           -0.026109
households            0.065508
median_income         0.685433
median_house_value    1.000000
Name: median_house_value, dtype: float64
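A small convenience: sorting these correlations makes the strongest relationships with the target easier to scan. A quick sketch:

# Sort correlations with the target from strongest positive to strongest negative
correlation['median_house_value'].sort_values(ascending=False)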
Visualizing correlation
plt.figure(figsize=(12,7))
sns.heatmap(correlation,annot=True,cmap='crest')
<AxesSubplot:>
Some features, like total_bedrooms and households, are highly correlated. The same goes for total_bedrooms and total_rooms, and that makes sense: for many houses, the number of people who live in a particular house (households) goes with the number of available rooms (total_rooms) and bedrooms (total_bedrooms).
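Because these raw counts are block-level totals, per-household ratios are often more informative than the totals themselves. As an illustrative sketch (the derived column names below are our own):

# Illustrative sketch: derived ratios often correlate with the house value
# better than the raw block-level totals do
ratios = train_data.copy()
ratios['rooms_per_household'] = ratios['total_rooms'] / ratios['households']
ratios['bedrooms_per_room'] = ratios['total_bedrooms'] / ratios['total_rooms']
ratios[['rooms_per_household', 'bedrooms_per_room']].corrwith(ratios['median_house_value'])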
The other interesting insight is that the house price (median_house_value) is closely correlated with the median income (median_income), and that makes sense too. In most cases, you will reasonably seek a house that you can afford based on your income.
Plotting Geographical Features
Since we have latitude and longitude, let's plot them. This can help us see where the houses are located on a map, and hopefully the plot will resemble the map of California.
plt.figure(figsize=(12,7))
sns.scatterplot(data = train_data, x='longitude', y='latitude')
<AxesSubplot:xlabel='longitude', ylabel='latitude'>
plt.figure(figsize=(12,7))
sns.scatterplot(data = train_data, x='longitude', y='latitude', hue='median_house_value')
<AxesSubplot:xlabel='longitude', ylabel='latitude'>
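On this last plot, dense areas and expensive areas can be hard to tell apart. A minimal variation of the same scatter plot (our own tweak, not required) adds transparency and scales the point size by population:

# Illustrative tweak: alpha reveals dense areas; point size reflects population
plt.figure(figsize=(12,7))
sns.scatterplot(data=train_data, x='longitude', y='latitude',
                hue='median_house_value', size='population',
                sizes=(10, 200), alpha=0.4)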