Saturday, August 31, 2019

Classification and Regression Using Supervised Learning 3 (Logistic Regression classifier)

Logistic regression is a technique used to explain the relationship between input variables and an output variable. The input variables are assumed to be independent, and the output variable is referred to as the dependent variable. The dependent variable can take only a fixed set of values, and these values correspond to the classes of the classification problem.

Our goal is to identify the relationship between the independent variables and the dependent variable by estimating class probabilities using a logistic function. This logistic function is a sigmoid curve whose parameters are learned from the data. Logistic regression is closely related to generalized linear model analysis, where we try to fit a line to a set of points so as to minimize the error; instead of fitting the output values directly, however, we fit the probability of each class. Strictly speaking, logistic regression is not a classification technique by itself, but we use it this way by thresholding the predicted probabilities, and it is very commonly used in machine learning because of its simplicity.
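As a quick illustration of the idea (this snippet is separate from the classifier we build below, and the weights in it are made up rather than learned), the sigmoid maps any real-valued score, such as a weighted sum of the inputs, to a probability between 0 and 1:

import numpy as np

def sigmoid(z):
    # Squash any real-valued score into the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative (not learned) weights and bias for a two-dimensional input
w, b = np.array([0.8, -0.5]), 0.1
x = np.array([3.1, 7.2])
print(sigmoid(np.dot(w, x) + b))  # prints a probability between 0 and 1

Let's now see how to build a classifier using logistic regression, as shown in the program below: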

import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt

# Define sample input data
X = np.array([[3.1, 7.2], [4, 6.7], [2.9, 8], [5.1, 4.5], [6, 5], [5.6, 5], [3.3, 0.4], [3.9, 0.9], [2.8, 1], [0.5, 3.4], [1, 4], [0.6, 4.9]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])

# Create the logistic regression classifier
classifier = linear_model.LogisticRegression(solver='liblinear', C=1)

# Train the classifier
classifier.fit(X, y)

# Visualize the performance of the classifier
visualize_classifier(classifier, X, y)


Our program begins with the necessary imports, and then we define sample input data consisting of two-dimensional points and their corresponding class labels. (The last line of the program calls visualize_classifier, a helper function that we will define shortly.)


X = np.array([[3.1, 7.2], [4, 6.7], [2.9, 8], [5.1, 4.5], [6, 5], [5.6, 5], [3.3, 0.4], [3.9, 0.9], [2.8, 1], [0.5, 3.4], [1, 4], [0.6, 4.9]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])



Once we have the labeled data, we can train the classifier on it. First, we create the logistic regression classifier object:


classifier = linear_model.LogisticRegression(solver='liblinear', C=1)
   
Then we train the classifier using the data that we defined earlier:   

classifier.fit(X, y)
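Before visualizing, it can be helpful to sanity-check the trained model on the training points themselves. As a rough sketch (not part of the original program), we can use predict and accuracy_score from scikit-learn:

from sklearn.metrics import accuracy_score

# Predict the class of each training sample and compare with the true labels
predictions = classifier.predict(X)
print('Training accuracy:', accuracy_score(y, predictions))

# predict_proba returns the estimated probability of each class for a sample
print('Class probabilities for the first sample:', classifier.predict_proba(X[:1]))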

Finally we visualize the performance of the classifier by looking at the boundaries of the classes:

visualize_classifier(classifier, X, y)

The visualize_classifier(classifier, X, y) function can either be defined in the same program or in a separate file. As we'll be using it multiple times in future posts, it's better to define it in a separate file and import the function. This function is given in the utilities.py file as shown below:

import numpy as np
import matplotlib.pyplot as plt

def visualize_classifier(classifier, X, y):
    # Define the minimum and maximum values for X and Y
    # that will be used in the mesh grid
    min_x, max_x = X[:, 0].min() - 1.0, X[:, 0].max() + 1.0
    min_y, max_y = X[:, 1].min() - 1.0, X[:, 1].max() + 1.0

    # Define the step size to use in plotting the mesh grid
    mesh_step_size = 0.01
    # Define the mesh grid of X and Y values
    x_vals, y_vals = np.meshgrid(np.arange(min_x, max_x, mesh_step_size),
                                 np.arange(min_y, max_y, mesh_step_size))
   
    # Run the classifier on the mesh grid
    output = classifier.predict(np.c_[x_vals.ravel(), y_vals.ravel()])
    # Reshape the output array
    output = output.reshape(x_vals.shape)
    # Create a plot
    plt.figure()
    # Choose a color scheme for the plot
    plt.pcolormesh(x_vals, y_vals, output, cmap=plt.cm.gray)
    # Overlay the training points on the plot
    plt.scatter(X[:, 0], X[:, 1], c=y, s=75, edgecolors='black', linewidth=1, cmap=plt.cm.Paired)
   
    # Specify the boundaries of the plot
    plt.xlim(x_vals.min(), x_vals.max())
    plt.ylim(y_vals.min(), y_vals.max())
    # Specify the ticks on the X and Y axes
    plt.xticks(np.arange(int(X[:, 0].min() - 1), int(X[:, 0].max() + 1), 1.0))
    plt.yticks(np.arange(int(X[:, 1].min() - 1), int(X[:, 1].max() + 1), 1.0))
    plt.show()


We create the function definition by taking the classifier object, input data, and labels as input parameters:

def visualize_classifier(classifier, X, y):

We then define the minimum and maximum values of the X and Y directions that will be used in our mesh grid. This grid is basically a set of points at which we evaluate the classifier, so that we can visualize the boundaries of the classes.

min_x, max_x = X[:, 0].min() - 1.0, X[:, 0].max() + 1.0
min_y, max_y = X[:, 1].min() - 1.0, X[:, 1].max() + 1.0

Define the step size for the grid and create it using the minimum and maximum values:

mesh_step_size = 0.01
x_vals, y_vals = np.meshgrid(np.arange(min_x, max_x, mesh_step_size),
                             np.arange(min_y, max_y, mesh_step_size))

Next we run the classifier on all the points on the grid and then reshape the output array:

output = classifier.predict(np.c_[x_vals.ravel(), y_vals.ravel()])
output = output.reshape(x_vals.shape)
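If np.c_ and ravel are unfamiliar, here is a tiny standalone example: ravel flattens a 2D array into 1D, and np.c_ stacks 1D arrays as columns, producing one (x, y) coordinate pair per row, which is the shape the classifier expects:

import numpy as np

a = np.array([[1, 2], [3, 4]])
print(a.ravel())      # [1 2 3 4]

xs = np.array([1, 2, 3])
ys = np.array([10, 20, 30])
print(np.c_[xs, ys])  # one (x, y) pair per row: [[1 10] [2 20] [3 30]]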

Once we have the output, we create the figure, pick a color scheme, and overlay all the points:

plt.figure()

plt.pcolormesh(x_vals, y_vals, output, cmap=plt.cm.gray)

plt.scatter(X[:, 0], X[:, 1], c=y, s=75, edgecolors='black', linewidth=1, cmap=plt.cm.Paired)

Finally we specify the boundaries of the plots using the minimum and maximum values, add the tick marks, and display the figure:

plt.xlim(x_vals.min(), x_vals.max())
plt.ylim(y_vals.min(), y_vals.max())

plt.xticks(np.arange(int(X[:, 0].min() - 1), int(X[:, 0].max() + 1), 1.0))
plt.yticks(np.arange(int(X[:, 1].min() - 1), int(X[:, 1].max() + 1), 1.0))
plt.show()

To use this in our main program, we import the function by adding the following statement (utilities.py should be in the same directory as the main program, or somewhere on the Python path):

from utilities import visualize_classifier

Thus our program now is:

import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
from utilities import visualize_classifier

# Define sample input data
X = np.array([[3.1, 7.2], [4, 6.7], [2.9, 8], [5.1, 4.5], [6, 5], [5.6, 5], [3.3, 0.4], [3.9, 0.9], [2.8, 1], [0.5, 3.4], [1, 4], [0.6, 4.9]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])

# Create the logistic regression classifier
classifier = linear_model.LogisticRegression(solver='liblinear', C=1)

# Train the classifier
classifier.fit(X, y)

# Visualize the performance of the classifier
visualize_classifier(classifier, X, y)

When we run the program, the output is a plot showing the decision boundaries of the four classes, with the training points overlaid.

If we change the value of C to 100 in the following line, we will see that the boundaries follow the training points more closely:

classifier = linear_model.LogisticRegression(solver='liblinear', C=100)

The reason is that C controls the penalty on misclassification (it is the inverse of the regularization strength), so a larger value of C makes the algorithm fit the training data more closely. We should be careful with this parameter: if we increase it too much, the model will overfit the training data and won't generalize well. Running the code with C set to 100 produces a plot in which the class boundaries track the training points more tightly.
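One rough way to see this effect numerically (the exact numbers depend on the data; this snippet assumes the X, y, and linear_model names from the program above are in scope) is to compare the training accuracy for a few values of C:

from sklearn.metrics import accuracy_score

# Larger C means weaker regularization, so the model fits the training data more closely
for C in [1, 100, 10000]:
    clf = linear_model.LogisticRegression(solver='liblinear', C=C)
    clf.fit(X, y)
    print('C =', C, 'training accuracy =', accuracy_score(y, clf.predict(X)))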



