This notebook was created by Jean de Dieu Nyandwi for the love of the machine learning community. For any feedback, errors, or suggestions, he can be reached by email (johnjw7084 at gmail dot com), Twitter, or LinkedIn.

Intro to Scikit-Learn for Shallow Machine Learning

Scikit-Learn (aka sklearn) is a simple, beautiful, and well-designed machine learning library.

Scikit-Learn provides various learning algorithms. Using them is as simple as filling a cup of coffee.

Beyond learning algorithms, the library also provides data processing functions, which are just as simple to use. In addition, Scikit-Learn offers other tools such as pipelines, model tuning, and much more.

To learn more about Scikit-Learn, check out its website. It's easy to navigate, and you will immediately see how well designed it is.

To see how simple Scikit-Learn is in practice, let's build a simple linear regressor. Our goal is to fit a line.

We will start with the imports. In this case, we will import LinearRegression from sklearn.linear_model. We will also import NumPy, which we will use to create our sample data, and Matplotlib, to plot that data.

In [ ]:

            
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

After importing LinearRegression, we will create our data: two NumPy arrays, X and y. If you are quick at crunching numbers, you will see that X and y follow the linear equation y=2X+1.

In [ ]:

            
X = np.array([[1.0],[2.0],[3.0],[4.0],[5.0],[6.0],[7.0],[8.0]], dtype='float')
y = np.array([[3.0],[5.0],[7.0],[9.0],[11.0],[13.0],[15.0],[17.0]], dtype='float')
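As a quick sanity check, the relationship can be confirmed numerically rather than by eye. The sketch below rebuilds the same two arrays and compares y against 2X+1. The name `expected` is introduced here purely for illustration.

```python
import numpy as np

# Same data as in the cell above
X = np.array([[1.0],[2.0],[3.0],[4.0],[5.0],[6.0],[7.0],[8.0]], dtype='float')
y = np.array([[3.0],[5.0],[7.0],[9.0],[11.0],[13.0],[15.0],[17.0]], dtype='float')

# `expected` is an illustrative name: the line y = 2X + 1 evaluated at every X
expected = 2 * X + 1

# Both arrays are 2-D columns with one row per sample
print(X.shape, y.shape)

# True when every y value lies on the line y = 2X + 1
print(np.allclose(y, expected))
```

Both arrays have eight rows and one column, and the comparison holds for every row.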

As we said, the relationship between X and y is a line. Here it is plotted.

In [ ]:

            
plt.plot(X,y)

Out[ ]:

[<matplotlib.lines.Line2D at 0x7f98e7294490>]

Now that we have created the data, it's time to create a model.

In [ ]:

            
model = LinearRegression()

That was simple. Next, we are going to train the model. Using the fit method, we pass the input data X and the output data y. y is also referred to as the label.

In [ ]:

            
model.fit(X, y)

Out[ ]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
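One detail worth noting about fit: scikit-learn estimators expect the input X to be two-dimensional, with shape (n_samples, n_features), which is why each value in X above sits in its own inner list. The sketch below assumes the data starts out as a flat array instead and reshapes it with reshape(-1, 1) before fitting. The names `x_flat`, `y_flat`, `X_2d`, and `model_flat` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative flat (1-D) data following y = 2x + 1
x_flat = np.arange(1.0, 9.0)   # 1.0, 2.0, ..., 8.0
y_flat = 2 * x_flat + 1

# Turn the 1-D array into a single-feature column: one row per sample
X_2d = x_flat.reshape(-1, 1)

model_flat = LinearRegression()
model_flat.fit(X_2d, y_flat)

print(X_2d.shape)
# With a 1-D y, coef_ is a 1-D array and intercept_ is a single number
print(model_flat.coef_, model_flat.intercept_)
```

The target y may stay one-dimensional. Only the shape of the learned attributes changes, not the fitted line.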

Hmm, that was quick. Now the model is trained on the dataset. Let's see how well it does on data it has never seen. We will create a test array holding the numbers 9 to 11 and check whether the predictions are 19, 21, and 23 respectively. Let's do that!

In [ ]:

            
test_array = np.array([[9.0],[10.0],[11.0]], dtype='float')

model.predict(test_array)

Out[ ]:

array([[19.],
       [21.],
       [23.]])
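To put a number on how close the predictions are, one option is to compare them with the values the line y=2X+1 would give. The sketch below refits the same model so that it runs on its own. `expected`, `predictions`, and `mse` are illustrative names, and mean_squared_error comes from sklearn.metrics.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Same training data as in the notebook
X = np.array([[1.0],[2.0],[3.0],[4.0],[5.0],[6.0],[7.0],[8.0]], dtype='float')
y = np.array([[3.0],[5.0],[7.0],[9.0],[11.0],[13.0],[15.0],[17.0]], dtype='float')
model = LinearRegression().fit(X, y)

# Unseen inputs and the values the line y = 2X + 1 gives for them
test_array = np.array([[9.0],[10.0],[11.0]], dtype='float')
expected = 2 * test_array + 1

predictions = model.predict(test_array)

# Mean squared error between expected and predicted values
mse = mean_squared_error(expected, predictions)

print(np.allclose(predictions, expected))
print(mse)
```

Because the training data lies exactly on a line, the error should be zero up to floating-point rounding. On noisy real-world data it would not be.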

As you can see, it did well. It was able to learn the relationship between X and y just from the data. This is what we mean when we say that, unlike traditional programming, which takes rules and data and gives results, machine learning takes data and results and gives the rules.
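The contrast can be sketched in code. In the first half the rule is written by hand. In the second, only inputs and results are supplied and the rule is estimated from them. `rule_based` is a hypothetical function written for this illustration, and the other names are assumptions too.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Traditional programming: rule + data -> results
def rule_based(x):
    """Hand-written rule (hypothetical helper): y = 2x + 1."""
    return 2 * x + 1

inputs = np.array([[1.0], [2.0], [3.0], [4.0]])
results = rule_based(inputs)

# Machine learning: data + results -> rule
learned = LinearRegression().fit(inputs, results)

# Both approaches should agree on an input neither has seen
new_input = np.array([[10.0]])
print(rule_based(new_input))
print(learned.predict(new_input))
```

The fitted model never sees the expression 2 * x + 1. It only sees the pairs of inputs and results.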

Let's try to see the rules in our example.

In [ ]:

            
model.coef_

Out[ ]:

array([[2.]])

In [ ]:

            
model.intercept_

Out[ ]:

array([1.])

Great, the model was able to determine the exact linear equation that we used when creating the data. coef_ is the coefficient, commonly known as the weight. In this case it is 2, and it is what the input data X is multiplied by. On the other hand, 1 is the intercept, commonly known as the bias. Combining them we get our equation, y=2X+1. These two kinds of parameters (weights and biases) are what a linear model learns during training; many other models, such as neural networks, learn parameters of the same kind.
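To make the role of the weight and the bias concrete, the prediction can be rebuilt by hand from coef_ and intercept_. This is a minimal sketch that refits the same model so it runs on its own. `manual` is an illustrative name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same training data as in the notebook
X = np.array([[1.0],[2.0],[3.0],[4.0],[5.0],[6.0],[7.0],[8.0]], dtype='float')
y = np.array([[3.0],[5.0],[7.0],[9.0],[11.0],[13.0],[15.0],[17.0]], dtype='float')
model = LinearRegression().fit(X, y)

test_array = np.array([[9.0],[10.0],[11.0]], dtype='float')

# Prediction by hand: inputs times weight, plus bias.
# coef_ has shape (1, 1) here because y is a 2-D column.
manual = test_array @ model.coef_.T + model.intercept_

# The hand-built values should match what predict returns
print(np.allclose(manual, model.predict(test_array)))
```

For a linear model, predict performs this same multiply-and-add using the learned parameters.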

In our case, since the data was so simple, it is easy to tell directly that the output is a linear equation. In real-world scenarios, it may be much harder because there are many features and data points.
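To get a feel for the many-feature case, here is a small sketch on synthetic data. Everything in it is an assumption made for illustration: the three features, the coefficients in `true_weights`, the bias `true_bias`, and the random seed. No real dataset is involved, and the data is noise-free.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 100 samples with 3 features each (illustrative values)
rng = np.random.default_rng(0)
X_multi = rng.normal(size=(100, 3))

# Assumed "true" rule used only to generate the targets
true_weights = np.array([2.0, -1.0, 0.5])
true_bias = 1.0
y_multi = X_multi @ true_weights + true_bias

model_multi = LinearRegression().fit(X_multi, y_multi)

# One learned weight per feature, plus a single bias
print(model_multi.coef_.shape)
print(np.allclose(model_multi.coef_, true_weights))
print(np.isclose(model_multi.intercept_, true_bias))
```

With three features the relationship is no longer something you can read off a 2-D plot, yet the fitted weights still recover the rule that generated the data.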

Hopefully this was a good introduction to machine learning with Scikit-Learn. We have only scratched the surface of what Scikit-Learn can do; in the next labs, you will see more models applied to real-world datasets!