Support Vector Machine (SVM) classifiers are binary, or discriminating, models that work on two classes of differentiation. Their main task is to decide to which of two classes a new observation belongs. During the learning phase, these classifiers project the observations into a multidimensional space called the decision space and build a separating surface, called the decision boundary, that divides this space into two regions of membership.
In the simplest case, the linear one, the decision boundary is a plane (in 3D) or a straight line (in 2D). In more complex cases, the separating surfaces are curved, with increasingly articulated shapes.
SVMs can be used both for regression, with SVR (Support Vector Regression), and for classification, with SVC (Support Vector Classification).
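As a quick orientation, here is a minimal sketch of the two scikit-learn interfaces (the tiny dataset is made up purely for illustration):

from sklearn.svm import SVC, SVR

# Classification: discrete 0/1 labels
clf = SVC(kernel='linear').fit([[0,0],[1,1],[2,2],[3,3]], [0,0,1,1])

# Regression: continuous targets
reg = SVR(kernel='linear').fit([[0,0],[1,1],[2,2],[3,3]], [0.0,1.0,2.0,3.0])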
Support Vector Classification (SVC)
To understand how this algorithm works, we'll start from the simplest case, the linear 2D case, where the decision boundary is a straight line dividing the decision area into two parts. Take for example a simple training set in which some points are assigned to two different classes.
The training set consists of 11 points (observations) with two attributes, whose values lie between 0 and 4. These values are contained in a NumPy array called x; membership in one of the two classes is defined by 0 or 1 values in another array called y.
Let's visualize the distribution of these points in space with a scatterplot; this space will then serve as the decision space. See the following program:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

# 11 training points with two attributes each; the first 6 belong to class 0, the last 5 to class 1
x = np.array([[1,3],[1,2],[1,1.5],[1.5,2],[2,3],[2.5,1.5],[2,1],[3,1],[3,2],[3.5,1],[3.5,3]])
y = [0]*6 + [1]*5
plt.scatter(x[:,0],x[:,1],c=y,s=50,alpha=0.9)
plt.show()
The output of the program is shown below:
After defining the training set, we can apply the SVC (Support Vector Classification) algorithm. This algorithm creates a line (the decision boundary) dividing the decision area into two parts, and places this straight line so as to maximize its distance from the closest observations in the training set. This condition should produce two portions, each containing all the points of one class.
Next we apply the SVC algorithm to the training set. First we define the model with the SVC() constructor, specifying a linear kernel. (A kernel is a class of algorithms for pattern analysis.) Then we use the fit() function with the training set as an argument. Once the model is trained, we can plot the decision boundary with the decision_function() function. Finally we draw the scatterplot, giving a different color to the two portions of the decision space. See the following program:
x = np.array([[1,3],[1,2],[1,1.5],[1.5,2],[2,3],[2.5,1.5],[2,1],[3,1],[3,2],[3.5,1],[3.5,3]])
y = [0]*6 + [1]*5

# Train a linear SVC on the training set
svc = svm.SVC(kernel='linear').fit(x,y)

# Evaluate the decision function on a 200x200 grid covering the decision space
X,Y = np.mgrid[0:4:200j,0:4:200j]
Z = svc.decision_function(np.c_[X.ravel(),Y.ravel()])
Z = Z.reshape(X.shape)

# Color the two regions and draw the decision boundary (Z = 0)
plt.contourf(X,Y,Z > 0,alpha=0.4)
plt.contour(X,Y,Z,colors=['k'],linestyles=['-'],levels=[0])
plt.scatter(x[:,0],x[:,1],c=y,s=50,alpha=0.9)
plt.show()
The output of the program shows the two portions containing the two classes. The division can be considered successful, except for a purple dot that ends up in the yellow portion (see the output figure).
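To quantify this, we can ask the trained model to score itself on the training set (a quick check, not part of the original listing; with one misclassified point out of 11 we would expect a value of about 0.91):

# Mean accuracy on the training set
print(svc.score(x,y))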
Once the model has been trained, it is easy to understand how the predictions operate. Graphically, the position occupied by a new observation tells us its membership in one of the two classes. More programmatically, the predict() function returns the number of the corresponding class (0 for the purple class, 1 for the yellow class). For example, add these lines to the above program:
print(svc.predict([[1.5,2.5]]))
print(svc.predict([[2.5,1]]))
We get the output [0] and [1].
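Under the hood, predict() simply checks on which side of the decision boundary each point falls, that is, the sign of the decision function. We can verify this directly (an extra check added here for illustration):

# Negative values fall in the class-0 (purple) region, positive values in class 1 (yellow)
print(svc.decision_function([[1.5,2.5],[2.5,1]]))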
Regularization
A concept related to the SVC algorithm is regularization. It is set by the parameter C: a small value of C means that the margin is calculated using many or all of the observations around the separation line (greater regularization), while a large value of C means that the margin is calculated only on the observations close to the separation line (lower regularization). Unless otherwise specified, the default value of C is 1. See the following program:
x = np.array([[1,3],[1,2],[1,1.5],[1.5,2],[2,3],[2.5,1.5],[2,1],[3,1],[3,2],[3.5,1],[3.5,3]])
y = [0]*6 + [1]*5

# Train a linear SVC with the default regularization parameter C=1
svc = svm.SVC(kernel='linear',C=1).fit(x,y)
X,Y = np.mgrid[0:4:200j,0:4:200j]
Z = svc.decision_function(np.c_[X.ravel(),Y.ravel()])
Z = Z.reshape(X.shape)
plt.contourf(X,Y,Z > 0,alpha=0.4)

# Draw the decision boundary (solid) and the two margin lines (dashed)
plt.contour(X,Y,Z,colors=['k','k','k'],linestyles=['--','-','--'],levels=[-1,0,1])

# Highlight the support vectors as rimmed (hollow) circles
plt.scatter(svc.support_vectors_[:,0],svc.support_vectors_[:,1],s=120,facecolors='none',edgecolors='k')
plt.scatter(x[:,0],x[:,1],c=y,s=50,alpha=0.9)
plt.show()
The output of the program is shown below:
We can highlight the points that participated in the margin calculation, identifying them through the support_vectors_ attribute. These points are drawn as rimmed circles in the scatterplot. Furthermore, they lie within an evaluation area in the vicinity of the separation line (see the dashed lines in the output figure).
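We can also inspect these points directly; this quick addition (not in the original listing) prints the coordinates of the support vectors found with C=1:

# Coordinates of the observations used as support vectors
print(svc.support_vectors_)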
To see the effect on the decision boundary, we can restrict the value to C = 0.1 and check how many points are now taken into consideration. In the following program we only change C from 1 to 0.1:
x = np.array([[1,3],[1,2],[1,1.5],[1.5,2],[2,3],[2.5,1.5],[2,1],[3,1],[3,2],[3.5,1],[3.5,3]])
y = [0]*6 + [1]*5

# Same model as before, but with stronger regularization (C=0.1)
svc = svm.SVC(kernel='linear',C=0.1).fit(x,y)
X,Y = np.mgrid[0:4:200j,0:4:200j]
Z = svc.decision_function(np.c_[X.ravel(),Y.ravel()])
Z = Z.reshape(X.shape)
plt.contourf(X,Y,Z > 0,alpha=0.4)

# Decision boundary (solid) and margin lines (dashed)
plt.contour(X,Y,Z,colors=['k','k','k'],linestyles=['--','-','--'],levels=[-1,0,1])

# Highlight the support vectors as rimmed (hollow) circles
plt.scatter(svc.support_vectors_[:,0],svc.support_vectors_[:,1],s=120,facecolors='none',edgecolors='k')
plt.scatter(x[:,0],x[:,1],c=y,s=50,alpha=0.9)
plt.show()
The output of the program shows that the number of points taken into consideration has increased and, consequently, the separation line (decision boundary) has changed orientation. There are now two points in the wrong decision regions (see the output figure).
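To make the effect of C explicit, a small sketch like the following (not in the original listings) prints how many support vectors each setting uses; we would expect the count to grow as C shrinks:

# Compare how many points participate in the margin for each value of C
for C in (10, 1, 0.1):
    svc = svm.SVC(kernel='linear',C=C).fit(x,y)
    print(C, len(svc.support_vectors_))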
Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!