Thursday, September 5, 2019

Classification and Regression Using Supervised Learning 7 (Regression)

Regression is the process of estimating the relationship between input and output variables. Note that the output variables are continuous-valued real numbers, so there are infinitely many possible outputs. This is in contrast with classification, where the output classes belong to a fixed, finite set of possibilities.

In regression, it is assumed that the output variables depend on the input variables, and we want to model how they are related. Consequently, the input variables are called independent variables (also known as predictors), and the output variables are called dependent variables (also known as criterion variables). Note that the input variables are not necessarily independent of each other; in many situations there are correlations between them.

Regression analysis helps us understand how the value of the output variable changes when we vary some input variables while keeping the others fixed. In linear regression, we assume that the relationship between input and output is linear. This constrains our modeling procedure, but it's fast and efficient. Sometimes a linear model is not sufficient to explain the relationship between input and output, in which case we can use polynomial regression, where a polynomial captures the relationship. This is more computationally complex, but it can fit the data more closely.
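To make the contrast concrete, here is a minimal sketch of polynomial regression in scikit-learn. The toy data and the choice of degree 2 are assumptions made purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data (made up): y grows roughly quadratically with x,
# so a straight line would underfit
X = np.array([[1], [2], [3], [4], [5]], dtype=float)
y = np.array([1.1, 4.2, 9.1, 16.3, 24.8])

# Expand x into [1, x, x^2], then fit an ordinary linear model
# on the expanded features
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[6.0]]))  # extrapolate to x = 6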

Depending on the problem at hand, we use different forms of regression to extract the relationship. Regression is frequently used to predict prices, economic indicators, variations, and so on. Now let's see how to build a single-variable regression model with the help of the following program:

import pickle
import numpy as np
from sklearn import linear_model
import sklearn.metrics as sm
import matplotlib.pyplot as plt


# Input file containing data
input_file = 'data_singlevar_regr.txt'

# Read data
data = np.loadtxt(input_file, delimiter=',')
X, y = data[:, :-1], data[:, -1]

# Train and test split
num_training = int(0.8 * len(X))
num_test = len(X) - num_training

# Training data
X_train, y_train = X[:num_training], y[:num_training]

# Test data
X_test, y_test = X[num_training:], y[num_training:]

# Create linear regressor object
regressor = linear_model.LinearRegression()

# Train the model using the training sets
regressor.fit(X_train, y_train)

# Predict the output
y_test_pred = regressor.predict(X_test)

# Plot outputs
plt.scatter(X_test, y_test, color='green')
plt.plot(X_test, y_test_pred, color='black', linewidth=4)
plt.xticks(())
plt.yticks(())
plt.show()


In the above program I've used the file data_singlevar_regr.txt as the data source; you should replace this with your own data file.
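If you don't have such a file handy, the following sketch writes a synthetic one in the expected two-column, comma-separated format; the slope, intercept, and noise level here are arbitrary values chosen only for illustration:

import numpy as np

# Generate noisy points along a made-up line y = 1.5x + 2
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=60)
y = 1.5 * x + 2.0 + rng.normal(scale=0.7, size=60)

# Write one "x,y" pair per line, matching what np.loadtxt expects
np.savetxt('data_singlevar_regr.txt', np.column_stack([x, y]), delimiter=',')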

As usual, our program starts by importing the required packages and defining the data source file:
input_file = 'data_singlevar_regr.txt'
 
Our data source file is a comma-separated file which we read as shown below:
data = np.loadtxt(input_file, delimiter=',')
X, y = data[:, :-1], data[:, -1]
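Here data[:, :-1] keeps every column except the last as the features X, and data[:, -1] takes the last column as the target y. An optional shape check confirms that X is two-dimensional, which is what scikit-learn expects:

# For a file with, say, 50 rows and one input column this prints (50, 1) (50,)
print(X.shape, y.shape)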


Next we split the data into training and testing: 
num_training = int(0.8 * len(X))
num_test = len(X) - num_training


X_train, y_train = X[:num_training], y[:num_training]
X_test, y_test = X[num_training:], y[num_training:]
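Note that this takes the first 80% of the rows in file order; if the rows happen to be sorted, the test set may not be representative. As an alternative sketch, scikit-learn's train_test_split shuffles the data before splitting (the random_state value is an arbitrary choice that makes the shuffle reproducible):

from sklearn.model_selection import train_test_split

# Shuffled 80/20 split, equivalent in spirit to the manual slicing above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)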

After splitting the data we create a linear regressor object and train it using the training data:
regressor = linear_model.LinearRegression() 
regressor.fit(X_train, y_train)
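After fitting, the regressor exposes the parameters of the learned line y = ax + b, which is handy for a quick sanity check:

# Learned slope (a) and intercept (b) of the fitted line
print("Slope:", regressor.coef_)
print("Intercept:", regressor.intercept_)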

Next we predict the output for the testing dataset using the trained model:
y_test_pred = regressor.predict(X_test)

Now it's time to plot the output (the empty xticks and yticks calls just hide the axis tick marks):

plt.scatter(X_test, y_test, color='green')
plt.plot(X_test, y_test_pred, color='black', linewidth=4)
plt.xticks(())
plt.yticks(())
plt.show()



When we run the program, we get a plot showing the test data points in green and the fitted regression line in black.

Now let's compute the performance metrics for the regressor by comparing the ground truth, which refers to the actual outputs, with the predicted outputs:

# Compute performance metrics
print("Linear regressor performance:")
print("Mean absolute error =", round(sm.mean_absolute_error(y_test,
y_test_pred), 2))
print("Mean squared error =", round(sm.mean_squared_error(y_test,
y_test_pred), 2))
print("Median absolute error =", round(sm.median_absolute_error(y_test,
y_test_pred), 2))
print("Explain variance score =", round(sm.explained_variance_score(y_test,
y_test_pred), 2))
print("R2 score =", round(sm.r2_score(y_test, y_test_pred), 2))


Once the model has been created, we can save it to a file so that we can use it later. Python provides a nice module called pickle that enables us to do this:

# Model persistence
output_model_file = 'model.pkl'


# Save the model
with open(output_model_file, 'wb') as f:
    pickle.dump(regressor, f)
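As a side note, scikit-learn models are also commonly persisted with the joblib library, which is efficient for objects that carry large NumPy arrays; this sketch assumes the joblib package is installed:

import joblib

# Same idea as pickle, but optimized for NumPy-heavy objects
joblib.dump(regressor, 'model.joblib')
regressor_from_joblib = joblib.load('model.joblib')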


Now when we run the program, we get the plot and the following text in the output window:

Linear regressor performance:
Mean absolute error = 0.59
Mean squared error = 0.49
Median absolute error = 0.51
Explained variance score = 0.86
R2 score = 0.86
------------------
(program exited with code: 0)

Press any key to continue . . .


Now let's load the model from the file on the disk and perform prediction:

# Load the model
with open(output_model_file, 'rb') as f:
    regressor_model = pickle.load(f)

# Perform prediction on test data
y_test_pred_new = regressor_model.predict(X_test)
print("\nNew mean absolute error =", round(sm.mean_absolute_error(y_test, y_test_pred_new), 2))
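As a quick sanity check, the reloaded model should reproduce the original predictions exactly:

# True if the reloaded model predicts identically to the original one
print(np.allclose(y_test_pred, y_test_pred_new))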

 
Now when we run the program, we get the plot and the following text in the output window:
Linear regressor performance:
Mean absolute error = 0.59
Mean squared error = 0.49
Median absolute error = 0.51
Explained variance score = 0.86
R2 score = 0.86

New mean absolute error = 0.59
------------------
(program exited with code: 0)

Press any key to continue . . .


