This notebook was created by Jean de Dieu Nyandwi for the love of the machine learning community. For any feedback, errors, or suggestions, he can be reached by email (johnjw7084 at gmail dot com), Twitter, or LinkedIn.
Linear Models for Classification¶
In the previous lab, we learned about regression, where the goal was to predict a continuous value such as the price of a house given information about the house.
In this lab, we will learn about classification, where the task is to predict a class or category. Regression and classification are the two main types of supervised learning.
As always, we are going to approach our problem following a typical Machine Learning workflow.
- 1. Problem formulation
- 2. Finding data
- 3. Exploring insights in data or EDA
- 4. Data preprocessing
- 5. Choosing and training a model
- 6. Evaluating a model
Being systematic and keeping things organized will help you reproduce parts of the project or reuse them in other problems.
1. Problem Formulation¶
Let's say you have an idea for a revolutionary mobile phone and you want to establish a startup, but you know little about the prices of mobile phones. You are interested in learning about that!
Fortunately, there is a mobile phone dataset on Kaggle that you can use to learn about the price ranges of phones based on their features, such as WiFi and Bluetooth support.
So, to keep it simple, you have a dataset containing the features of mobile phones, and the problem is to predict the price range, not the exact price.
2. Finding the Data¶
The data that we are going to use is found on Kaggle.
Here are the details of the features. There are 21 features in total. The target feature is price_range, which has four values: 0 (low cost), 1 (medium cost), 2 (high cost), and 3 (very high cost).
- battery_power: Total energy a battery can store in one charge, measured in mAh
- blue: Has Bluetooth or not
- clock_speed: Speed at which the microprocessor executes instructions
- dual_sim: Has dual SIM support or not
- fc: Front camera megapixels
- four_g: Has 4G or not
- int_memory: Internal memory in gigabytes
- m_dep: Mobile depth in cm
- mobile_wt: Weight of the mobile phone
- n_cores: Number of processor cores
- pc: Primary camera megapixels
- px_height: Pixel resolution height
- px_width: Pixel resolution width
- ram: Random access memory in megabytes
- sc_h: Screen height of the mobile in cm
- sc_w: Screen width of the mobile in cm
- talk_time: Longest time that a single battery charge will last during a call
- three_g: Has 3G or not
- touch_screen: Has a touch screen or not
- wifi: Has WiFi or not
- price_range: The target variable, with values 0 (low cost), 1 (medium cost), 2 (high cost), and 3 (very high cost)
Let's download the data.
import urllib.request
import pandas as pd
train_data_path = 'https://raw.githubusercontent.com/nyandwi/public_datasets/master/mobile_price_train.csv'
test_data_path = 'https://raw.githubusercontent.com/nyandwi/public_datasets/master/mobile_price_test.csv'
def download_read_data(path):
    """
    Download the data from the given URL,
    read it into a pandas dataframe,
    and return the dataframe.
    """
    # urlretrieve returns a (local_filename, headers) tuple;
    # we only need the local file path
    data_path = urllib.request.urlretrieve(path)[0]
    data = pd.read_csv(data_path)
    return data
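As a side note, pandas can read a CSV directly from a URL, so the helper above could be replaced by a one-liner (shown commented out, as an alternative to the cells below):
# Alternative: pandas reads remote CSVs directly from a URL
# mobile_train = pd.read_csv(train_data_path)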
# Getting train data
mobile_train = download_read_data(train_data_path)
mobile_train.head(5)
 | battery_power | blue | clock_speed | dual_sim | fc | four_g | int_memory | m_dep | mobile_wt | n_cores | ... | px_height | px_width | ram | sc_h | sc_w | talk_time | three_g | touch_screen | wifi | price_range
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 842 | 0 | 2.2 | 0 | 1 | 0 | 7 | 0.6 | 188 | 2 | ... | 20 | 756 | 2549 | 9 | 7 | 19 | 0 | 0 | 1 | 1 |
1 | 1021 | 1 | 0.5 | 1 | 0 | 1 | 53 | 0.7 | 136 | 3 | ... | 905 | 1988 | 2631 | 17 | 3 | 7 | 1 | 1 | 0 | 2 |
2 | 563 | 1 | 0.5 | 1 | 2 | 1 | 41 | 0.9 | 145 | 5 | ... | 1263 | 1716 | 2603 | 11 | 2 | 9 | 1 | 1 | 0 | 2 |
3 | 615 | 1 | 2.5 | 0 | 0 | 0 | 10 | 0.8 | 131 | 6 | ... | 1216 | 1786 | 2769 | 16 | 8 | 11 | 1 | 0 | 0 | 2 |
4 | 1821 | 1 | 1.2 | 0 | 13 | 1 | 44 | 0.6 | 141 | 2 | ... | 1208 | 1212 | 1411 | 8 | 2 | 15 | 1 | 1 | 0 | 1 |
5 rows × 21 columns
# Getting test data
mobile_test = download_read_data(test_data_path)
# mobile_test.head(2)
# Looking at tail (last rows) of the data
mobile_train.tail()
 | battery_power | blue | clock_speed | dual_sim | fc | four_g | int_memory | m_dep | mobile_wt | n_cores | ... | px_height | px_width | ram | sc_h | sc_w | talk_time | three_g | touch_screen | wifi | price_range
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1995 | 794 | 1 | 0.5 | 1 | 0 | 1 | 2 | 0.8 | 106 | 6 | ... | 1222 | 1890 | 668 | 13 | 4 | 19 | 1 | 1 | 0 | 0 |
1996 | 1965 | 1 | 2.6 | 1 | 0 | 0 | 39 | 0.2 | 187 | 4 | ... | 915 | 1965 | 2032 | 11 | 10 | 16 | 1 | 1 | 1 | 2 |
1997 | 1911 | 0 | 0.9 | 1 | 1 | 1 | 36 | 0.7 | 108 | 8 | ... | 868 | 1632 | 3057 | 9 | 1 | 5 | 1 | 1 | 0 | 3 |
1998 | 1512 | 0 | 0.9 | 0 | 4 | 1 | 46 | 0.1 | 145 | 5 | ... | 336 | 670 | 869 | 18 | 10 | 19 | 1 | 1 | 1 | 0 |
1999 | 510 | 1 | 2.0 | 1 | 5 | 1 | 45 | 0.9 | 168 | 6 | ... | 483 | 754 | 3919 | 19 | 4 | 2 | 1 | 1 | 1 | 3 |
5 rows × 21 columns
mobile_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   battery_power  2000 non-null   int64
 1   blue           2000 non-null   int64
 2   clock_speed    2000 non-null   float64
 3   dual_sim       2000 non-null   int64
 4   fc             2000 non-null   int64
 5   four_g         2000 non-null   int64
 6   int_memory     2000 non-null   int64
 7   m_dep          2000 non-null   float64
 8   mobile_wt      2000 non-null   int64
 9   n_cores        2000 non-null   int64
 10  pc             2000 non-null   int64
 11  px_height      2000 non-null   int64
 12  px_width       2000 non-null   int64
 13  ram            2000 non-null   int64
 14  sc_h           2000 non-null   int64
 15  sc_w           2000 non-null   int64
 16  talk_time      2000 non-null   int64
 17  three_g        2000 non-null   int64
 18  touch_screen   2000 non-null   int64
 19  wifi           2000 non-null   int64
 20  price_range    2000 non-null   int64
dtypes: float64(2), int64(19)
memory usage: 328.2 KB
# Checking the number of data points/size of the data
print('The size of training data is: {} \nThe size of testing data is: {}'.format(len(mobile_train), len(mobile_test)))
The size of training data is: 2000
The size of testing data is: 1000
# Checking the number of features
len(mobile_train.columns)
21
Now that we have our data, it's time to do exploratory data analysis, looking for insights that can be helpful in our analysis and modelling.
3. Exploring Insights in Data or EDA¶
In this part, we are going to learn more about the dataset. Let's start with the summary statistics, but before that, I will make a copy of the training data so that we can easily recover the original if we mess something up down the road.
train_data = mobile_train.copy()
Checking summary statistics¶
mobile_train.describe()
 | battery_power | blue | clock_speed | dual_sim | fc | four_g | int_memory | m_dep | mobile_wt | n_cores | ... | px_height | px_width | ram | sc_h | sc_w | talk_time | three_g | touch_screen | wifi | price_range
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 2000.000000 | 2000.0000 | 2000.000000 | 2000.000000 | 2000.000000 | 2000.000000 | 2000.000000 | 2000.000000 | 2000.000000 | 2000.000000 | ... | 2000.000000 | 2000.000000 | 2000.000000 | 2000.000000 | 2000.000000 | 2000.000000 | 2000.000000 | 2000.000000 | 2000.000000 | 2000.000000 |
mean | 1238.518500 | 0.4950 | 1.522250 | 0.509500 | 4.309500 | 0.521500 | 32.046500 | 0.501750 | 140.249000 | 4.520500 | ... | 645.108000 | 1251.515500 | 2124.213000 | 12.306500 | 5.767000 | 11.011000 | 0.761500 | 0.503000 | 0.507000 | 1.500000 |
std | 439.418206 | 0.5001 | 0.816004 | 0.500035 | 4.341444 | 0.499662 | 18.145715 | 0.288416 | 35.399655 | 2.287837 | ... | 443.780811 | 432.199447 | 1084.732044 | 4.213245 | 4.356398 | 5.463955 | 0.426273 | 0.500116 | 0.500076 | 1.118314 |
min | 501.000000 | 0.0000 | 0.500000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 0.100000 | 80.000000 | 1.000000 | ... | 0.000000 | 500.000000 | 256.000000 | 5.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 851.750000 | 0.0000 | 0.700000 | 0.000000 | 1.000000 | 0.000000 | 16.000000 | 0.200000 | 109.000000 | 3.000000 | ... | 282.750000 | 874.750000 | 1207.500000 | 9.000000 | 2.000000 | 6.000000 | 1.000000 | 0.000000 | 0.000000 | 0.750000 |
50% | 1226.000000 | 0.0000 | 1.500000 | 1.000000 | 3.000000 | 1.000000 | 32.000000 | 0.500000 | 141.000000 | 4.000000 | ... | 564.000000 | 1247.000000 | 2146.500000 | 12.000000 | 5.000000 | 11.000000 | 1.000000 | 1.000000 | 1.000000 | 1.500000 |
75% | 1615.250000 | 1.0000 | 2.200000 | 1.000000 | 7.000000 | 1.000000 | 48.000000 | 0.800000 | 170.000000 | 7.000000 | ... | 947.250000 | 1633.000000 | 3064.500000 | 16.000000 | 9.000000 | 16.000000 | 1.000000 | 1.000000 | 1.000000 | 2.250000 |
max | 1998.000000 | 1.0000 | 3.000000 | 1.000000 | 19.000000 | 1.000000 | 64.000000 | 1.000000 | 200.000000 | 8.000000 | ... | 1960.000000 | 1998.000000 | 3998.000000 | 19.000000 | 18.000000 | 20.000000 | 1.000000 | 1.000000 | 1.000000 | 3.000000 |
8 rows × 21 columns
Checking missing values¶
mobile_train.isnull().sum()
battery_power    0
blue             0
clock_speed      0
dual_sim         0
fc               0
four_g           0
int_memory       0
m_dep            0
mobile_wt        0
n_cores          0
pc               0
px_height        0
px_width         0
ram              0
sc_h             0
sc_w             0
talk_time        0
three_g          0
touch_screen     0
wifi             0
price_range      0
dtype: int64
We are lucky not to have missing values. If we had some, we would have to handle them with a strategy such as filling them with the mean, removing them completely, or leaving them as they are. None of those three options is always the right choice; it depends on the problem and the size of your dataset. For example, by removing a feature you are losing data, and by filling in values you may be adding noise. There is a trade-off when it comes to imputing missing values.
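Just for illustration, here is a minimal sketch of what two of those strategies would look like in pandas. This is hypothetical, since our dataset has no missing values:
# Hypothetical sketches -- our dataset has no missing values
# Strategy 1: fill missing numeric values with each column's mean
filled = mobile_train.fillna(mobile_train.mean(numeric_only=True))
# Strategy 2: drop every row that contains a missing value
dropped = mobile_train.dropna()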
Checking Correlation Between Features¶
correlation = mobile_train.corr()
correlation['price_range']
battery_power    0.200723
blue             0.020573
clock_speed     -0.006606
dual_sim         0.017444
fc               0.021998
four_g           0.014772
int_memory       0.044435
m_dep            0.000853
mobile_wt       -0.030302
n_cores          0.004399
pc               0.033599
px_height        0.148858
px_width         0.165818
ram              0.917046
sc_h             0.022986
sc_w             0.038711
talk_time        0.021859
three_g          0.023611
touch_screen    -0.030411
wifi             0.018785
price_range      1.000000
Name: price_range, dtype: float64
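To make the strongest relationships stand out, we can sort these correlations (a small optional step):
# Sort features by their correlation with the target
correlation['price_range'].sort_values(ascending=False)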
# Visualizing correlation
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(20,7))
sns.heatmap(correlation, annot=True, cmap='PuBu')
<AxesSubplot:>
Looking at the correlation map, the price range of a mobile phone is strongly correlated with ram (Random Access Memory), with a correlation coefficient of 0.92. That means the single biggest determinant of how expensive a phone is going to be is its memory size, and that makes sense for many electronic devices.
If this is your first time reading correlations: a correlation close to 1 means the two features are strongly positively related and largely carry the same information, so if you remove one of them, your model will not be much affected. A correlation close to -1 means the features are strongly related in the opposite direction (when one goes up, the other goes down), which is also redundant information. A correlation close to 0 means the features have no linear relationship.
There is another interesting insight to draw from the correlation. It seems that the feature fc (front camera megapixels) is correlated with pc (primary camera megapixels). The same kind of relationship holds between three_g and four_g, and it makes sense: if you pick up any smartphone that supports a 3G network, there is a good chance it will also support 4G. The same is true for screen size (sc_h, sc_w) and pixel resolution (px_height, px_width). A quick way to surface such pairs programmatically is sketched below.
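Here is a minimal sketch that lists feature pairs whose absolute correlation exceeds a threshold; the 0.5 cutoff is an arbitrary choice for illustration:
# List off-diagonal feature pairs with absolute correlation above 0.5
# (each pair appears twice because the correlation matrix is symmetric)
abs_corr = correlation.abs()
high_pairs = abs_corr.where((abs_corr > 0.5) & (abs_corr < 1.0)).stack()
print(high_pairs.sort_values(ascending=False))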
Let's take those insights from words to visualizations.
More Data Exploration¶
Again, let's see what we have in the price ranges.
mobile_train['price_range'].value_counts()
0    500
1    500
2    500
3    500
Name: price_range, dtype: int64
sns.countplot(data=mobile_train, x='price_range')
plt.title('Mobile Price Ranges')
Text(0.5, 1.0, 'Mobile Price Ranges')
This is cool: the price ranges are equally divided, so we can confidently say that our data is balanced. Having imbalanced classes is a big issue because your model may learn to recognize the classes that dominate the data and fail to recognize the classes that are underrepresented.
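If the classes had been imbalanced, a common precaution is to stratify any train/validation split by the target so that every subset keeps the same class proportions. Here is a sketch using scikit-learn (not strictly needed for this balanced dataset; the variable names are ours):
from sklearn.model_selection import train_test_split

# Stratified split: each subset keeps the same class proportions
X = mobile_train.drop('price_range', axis=1)
y = mobile_train['price_range']
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)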
We can also try to explore what's in the number of cores and their price ranges.
plt.figure(figsize=(12,7))
sns.countplot(data=mobile_train, x='n_cores')
plt.title('Number of Cores')
Text(0.5, 1.0, 'Number of Cores')
plt.figure(figsize=(12,7))
sns.countplot(data=mobile_train, x='n_cores', hue='price_range')
plt.title('Number of Cores per Price Range')
Text(0.5, 1.0, 'Number of Cores per Price Range')
Let's also try to explore the distributions of features, starting with mobile weight and ram (random access memory).
plt.figure(figsize=(12,7))
sns.histplot(data=mobile_train, x='mobile_wt', color='darkgreen')
<AxesSubplot:xlabel='mobile_wt', ylabel='Count'>
plt.figure(figsize=(12,7))
sns.histplot(data=mobile_train, x='mobile_wt', palette='gist_rainbow', hue='price_range')
<AxesSubplot:xlabel='mobile_wt', ylabel='Count'>
plt.figure(figsize=(12,7))
sns.histplot(data=mobile_train, x='ram')
<AxesSubplot:xlabel='ram', ylabel='Count'>
plt.figure(figsize=(12,7))
sns.histplot(data=mobile_train, x='ram', palette='PRGn', hue='price_range')
<AxesSubplot:xlabel='ram', ylabel='Count'>
Again, it seems that phones with over 2,500 MB (about 2.5 GB) of RAM are the most expensive, and that makes sense: RAM is a big factor in determining the price of a phone.
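We can back that observation up with a quick check of the average RAM in each price range:
# Average RAM (in MB) for each price range
mobile_train.groupby('price_range')['ram'].mean()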
plt.figure(figsize=(12,7))
sns.barplot(data=mobile_train, x='n_cores', y='ram')
<AxesSubplot:xlabel='n_cores', ylabel='ram'>
n_cores is the number of cores possessed by a given processor. Plotting it against RAM doesn't show anything remarkable.
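We can also check that numerically:
# Correlation between number of cores and RAM
mobile_train['n_cores'].corr(mobile_train['ram'])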
We can also try to visualize the relationships between some features, starting with the features that we found to be correlated.
plt.figure(figsize=(12,7))
sns.scatterplot(data=mobile_train, x='pc', y='fc')
plt.title('Front Camera Vs Primary Camera')
Text(0.5, 1.0, 'Front Camera Vs Primary Camera')
plt.figure(figsize=(12,7))
sns.scatterplot(data=mobile_train, x='px_height', y='px_width')
<AxesSubplot:xlabel='px_height', ylabel='px_width'>
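To put numbers on these scatter plots, we can read the relevant coefficients straight off the correlation matrix computed earlier:
# Correlation coefficients for the plotted pairs
print(correlation.loc['fc', 'pc'])
print(correlation.loc['px_height', 'px_width'])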