Tuesday, May 21, 2019

Pandas - 41 Supervised Learning with scikit-learn (classification using K-Nearest Neighbors Classifier)

To perform a classification with the scikit-learn library, we need a classifier. Given a new measurement of an iris flower, the task of the classifier is to figure out to which of the three species it belongs. The simplest possible classifier is the nearest neighbor: this algorithm searches the training set for the observation that lies closest to the new test sample.
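
To make the idea concrete, here is a minimal sketch of the nearest-neighbor rule in plain NumPy. The train, labels, and sample arrays are hypothetical values used only for illustration:

import numpy as np

# Hypothetical training data: two measurements per flower, with labels
train = np.array([[5.1, 3.5], [6.2, 2.9], [4.7, 3.2]])
labels = np.array(['setosa', 'versicolor', 'setosa'])
sample = np.array([5.0, 3.4])   # the new flower to classify

# Euclidean distance from the sample to every training observation
distances = np.sqrt(((train - sample) ** 2).sum(axis=1))
print(labels[distances.argmin()])   # label of the closest observation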

We have only a single dataset, and we shouldn't use the same data both for training and for testing. The elements of the dataset are therefore divided into two parts, one dedicated to training the algorithm and the other to validating it.

Thus, before proceeding further, we'll divide our Iris Dataset into two parts. It is wise to randomly shuffle the array elements before making the division, because the data may have been collected in a particular order; indeed, the Iris Dataset contains items sorted by species. To shuffle the elements of the dataset we will use the NumPy function random.permutation(). The shuffled dataset consists of 150 observations; the first 140 will be used as the training set, the remaining 10 as the test set. See the following program:

import numpy as np
from sklearn import datasets

np.random.seed(0)
iris = datasets.load_iris()
x = iris.data
y = iris.target
# Shuffle the 150 indices, then split: first 140 train, last 10 test
i = np.random.permutation(len(iris.data))
x_train = x[i[:-10]]
y_train = y[i[:-10]]
x_test = x[i[-10:]]
y_test = y[i[-10:]]
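
As a side note, scikit-learn provides a utility for exactly this kind of split, train_test_split() from sklearn.model_selection. A sketch of the equivalent operation (the random_state value here is illustrative, and the resulting split is equally random but not identical to the one above):

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=10, random_state=0)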


Now that we have divided our Iris Dataset into two parts, we can apply the K-Nearest Neighbors algorithm. Import the KNeighborsClassifier, call the constructor of the classifier, and then train it with the fit() function. See the following program:

import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

np.random.seed(0)
iris = datasets.load_iris()
x = iris.data
y = iris.target
i = np.random.permutation(len(iris.data))
x_train = x[i[:-10]]
y_train = y[i[:-10]]
x_test = x[i[-10:]]
y_test = y[i[-10:]]

knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
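
Called with no arguments, the constructor uses its default parameters; in particular, scikit-learn's KNeighborsClassifier considers the 5 nearest neighbors (n_neighbors=5). A different k can be passed explicitly, for example:

knn3 = KNeighborsClassifier(n_neighbors=3)   # vote among the 3 nearest neighbors
knn3.fit(x_train, y_train)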


Now that we have a predictive model, the knn classifier trained on 140 observations, we can check how well it performs. The classifier should correctly predict the species of the 10 irises in the test set. To obtain the predictions we use the predict() function, applied directly to the predictive model knn. Finally, we compare the predicted results with the actual values contained in y_test.

import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

np.random.seed(0)
iris = datasets.load_iris()
x = iris.data
y = iris.target
i = np.random.permutation(len(iris.data))
x_train = x[i[:-10]]
y_train = y[i[:-10]]
x_test = x[i[-10:]]
y_test = y[i[-10:]]

knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
print(knn.predict(x_test))   # predicted species
print(y_test)                # actual species


The output of the program is shown below:

[1 2 1 0 0 0 2 1 2 0]
[1 1 1 0 0 0 2 1 2 0]
------------------
(program exited with code: 0)

Press any key to continue . . .
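
Comparing the two lines, nine of the ten predictions match. The same check can be done numerically with the classifier's score() method, which returns the mean accuracy on the given test data (reusing knn and the test split from the program above):

print(knn.score(x_test, y_test))   # 0.9 with this split, i.e. 90% accuracy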


So we obtained a 10% error rate. Now we can visualize all this using decision boundaries in a space represented by the 2D scatterplot of the sepal measurements. See the following program:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
x = iris.data[:,:2]  # keep only sepal length and width
y = iris.target      # species

x_min, x_max = x[:,0].min() - .5, x[:,0].max() + .5
y_min, y_max = x[:,1].min() - .5, x[:,1].max() + .5

#MESH
cmap_light = ListedColormap(['#AAAAFF','#AAFFAA','#FFAAAA'])
h = .02   # step size of the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
knn = KNeighborsClassifier()
knn.fit(x,y)
# Classify every point of the mesh and reshape back to the grid
Z = knn.predict(np.c_[xx.ravel(),yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure()
plt.pcolormesh(xx,yy,Z,cmap=cmap_light)

#Plot the training points
plt.scatter(x[:,0],x[:,1],c=y)
plt.xlim(xx.min(),xx.max())
plt.ylim(yy.min(),yy.max())
plt.show()


The output of the program is the scatterplot subdivided into decision-boundary regions:

[Figure: KNN decision boundaries in the sepal length-width plane]

Let's do the same thing considering the size of the petals. See the following program:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
x = iris.data[:,2:4]  # keep only petal length and width
y = iris.target       # species

x_min, x_max = x[:,0].min() - .5, x[:,0].max() + .5
y_min, y_max = x[:,1].min() - .5, x[:,1].max() + .5

#MESH
cmap_light = ListedColormap(['#AAAAFF','#AAFFAA','#FFAAAA'])
h = .02   # step size of the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
knn = KNeighborsClassifier()
knn.fit(x,y)
# Classify every point of the mesh and reshape back to the grid
Z = knn.predict(np.c_[xx.ravel(),yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure()
plt.pcolormesh(xx,yy,Z,cmap=cmap_light)

#Plot the training points
plt.scatter(x[:,0],x[:,1],c=y)
plt.xlim(xx.min(),xx.max())
plt.ylim(yy.min(),yy.max())
plt.show()


The output of the program is shown below:

[Figure: KNN decision boundaries in the petal length-width plane]
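
Since the sepal and petal programs differ only in the feature columns they select, the plotting logic can be wrapped in a small helper function. Here is a sketch, assuming the same imports as in the programs above:

def plot_knn_boundaries(x, y, h=.02):
    cmap_light = ListedColormap(['#AAAAFF','#AAFFAA','#FFAAAA'])
    x_min, x_max = x[:,0].min() - .5, x[:,0].max() + .5
    y_min, y_max = x[:,1].min() - .5, x[:,1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    knn = KNeighborsClassifier()
    knn.fit(x, y)
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
    plt.scatter(x[:,0], x[:,1], c=y)   # overlay the training points
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.show()

plot_knn_boundaries(iris.data[:,:2], iris.target)    # sepals
plot_knn_boundaries(iris.data[:,2:4], iris.target)   # petals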
Here I am ending today's post. In the next post we'll discuss the diabetes dataset. Until we meet again, keep practicing and learning Python, as Python is easy to learn!