Classification and Regression Using Supervised Learning 9 (building a regressor to estimate the housing prices) ~ Python is easy to learn

In the following program we'll see how to use the SVM concept to build a regressor that estimates housing prices. I'll use the dataset available in sklearn, where each data point is defined by 13 attributes. The housing dataset contains information about different houses in Boston. This data was originally part of the UCI Machine Learning Repository and has since been removed from it. We can also access this data from the scikit-learn library. There are 506 samples and 13 feature variables in this dataset. Our aim is to estimate the housing prices based on these attributes. Note that load_boston was deprecated in scikit-learn 1.0 and removed in version 1.2, so this program needs an older scikit-learn release to run as written. See the code below:

import numpy as np
from sklearn import datasets
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, explained_variance_score
from sklearn.utils import shuffle

# Load housing data
data = datasets.load_boston()

# Shuffle the data
X, y = shuffle(data.data, data.target, random_state=7)

# Split the data into training and testing datasets
num_training = int(0.8 * len(X))
X_train, y_train = X[:num_training], y[:num_training]
X_test, y_test = X[num_training:], y[num_training:]

# Create Support Vector Regression model
sv_regressor = SVR(kernel='linear', C=1.0, epsilon=0.1)

# Train Support Vector Regressor
sv_regressor.fit(X_train, y_train)

# Evaluate performance of Support Vector Regressor
y_test_pred = sv_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_test_pred)
evs = explained_variance_score(y_test, y_test_pred)
print("\n#### Performance ####")
print("Mean squared error =", round(mse, 2))
print("Explained variance score =", round(evs, 2))

# Test the regressor on test datapoint
test_data = [3.7, 0, 18.4, 1, 0.87, 5.95, 91, 2.5052, 26, 666, 20.2,
351.34, 15.27]
print("\nPredicted price:", sv_regressor.predict([test_data])[0])

First, we import the required libraries. Next, we load the housing data from the scikit-learn library and print its keys to see what it contains.

data = datasets.load_boston()
print(data.keys())

It prints:

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

data: contains the information for various houses
target: prices of the houses
feature_names: names of the features
DESCR: describes the dataset
filename: location of the dataset file on disk

To learn more about the features, use data.DESCR. The description of all the features is given below:

CRIM: Per capita crime rate by town
ZN: Proportion of residential land zoned for lots over 25,000 sq. ft
INDUS: Proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: Nitric oxide concentration (parts per 10 million)
RM: Average number of rooms per dwelling
AGE: Proportion of owner-occupied units built prior to 1940
DIS: Weighted distances to five Boston employment centers
RAD: Index of accessibility to radial highways
TAX: Full-value property tax rate per $10,000
PTRATIO: Pupil-teacher ratio by town
B: 1000(Bk - 0.63)², where Bk is the proportion of [people of African American descent] by town
LSTAT: Percentage of lower status of the population
MEDV: Median value of owner-occupied homes in $1000s

The house price, given by the variable MEDV, is our target variable. The remaining variables are the features on which we base our prediction of a house's value.

The next step is to shuffle the data so that the original ordering of the rows doesn't bias our analysis:

X, y = shuffle(data.data, data.target, random_state=7)
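
The shuffle call above reorders X and y with the same permutation, so each row stays paired with its own target. Here is a minimal pure-Python sketch of that idea on a small made-up dataset. The helper name shuffle_in_unison and the toy values are illustrative assumptions, not part of scikit-learn, and the resulting order will not match what sklearn.utils.shuffle produces for the same seed:

```python
import random

def shuffle_in_unison(features, targets, seed=7):
    """Return copies of features and targets reordered by one shared permutation."""
    if len(features) != len(targets):
        raise ValueError("features and targets must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # seeded so the order is reproducible
    return [features[i] for i in indices], [targets[i] for i in indices]

# Toy data: each target is ten times the single feature value
toy_X = [[1], [2], [3], [4], [5]]
toy_y = [10, 20, 30, 40, 50]

shuffled_X, shuffled_y = shuffle_in_unison(toy_X, toy_y)

# Every row is still paired with its own target after shuffling
print(all(x[0] * 10 == t for x, t in zip(shuffled_X, shuffled_y)))
```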

Now split the dataset into training and testing in an 80/20 format:

num_training = int(0.8 * len(X))
X_train, y_train = X[:num_training], y[:num_training]
X_test, y_test = X[num_training:], y[num_training:]
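
With the 506 samples in this dataset, the slicing above yields 404 training rows and 102 test rows, because int() truncates 0.8 * 506 = 404.8. A small sketch that checks this arithmetic on placeholder data (the list of integers only stands in for the real rows):

```python
num_samples = 506  # sample count stated for the housing dataset
rows = list(range(num_samples))  # placeholder rows standing in for X

num_training = int(0.8 * len(rows))  # int() truncates 404.8 down to 404
train_rows, test_rows = rows[:num_training], rows[num_training:]

print(len(train_rows), len(test_rows))  # 404 102
```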

Create and train the Support Vector Regressor using a linear kernel. The C parameter represents the penalty for training error. If you increase the value of C, the model fits the training data more closely, but this might lead to overfitting and cause it to lose generality. The epsilon parameter specifies a threshold: there is no penalty for training error if the predicted value is within this distance of the actual value:

sv_regressor = SVR(kernel='linear', C=1.0, epsilon=0.1)
sv_regressor.fit(X_train, y_train)
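
To make the role of epsilon concrete, here is a minimal sketch of the epsilon-insensitive loss for a single prediction. The function name and sample values are illustrative assumptions. scikit-learn applies this loss internally during training rather than exposing a helper like this one:

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Error beyond the epsilon threshold; zero when the prediction is within it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# Prediction within 0.1 of the actual value: no penalty
inside = epsilon_insensitive_loss(20.0, 20.05)

# Prediction 0.5 away: only the part beyond epsilon is penalised
outside = epsilon_insensitive_loss(20.0, 20.5)

print(inside, round(outside, 2))
```

In the SVR objective, C multiplies the sum of these per-sample penalties, which is why a larger C pushes the model to fit the training points more tightly.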

Now evaluate the performance of the regressor and print the metrics:

y_test_pred = sv_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_test_pred)
evs = explained_variance_score(y_test, y_test_pred)
print("\n#### Performance ####")
print("Mean squared error =", round(mse, 2))
print("Explained variance score =", round(evs, 2))
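
For reference, both metrics can be computed by hand. The sketch below uses made-up values rather than the housing predictions, and follows the usual definitions: mean squared error is the average squared residual, and explained variance is 1 - Var(y - y_pred) / Var(y). The function names are illustrative only:

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def variance(values):
    """Population variance (divides by n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def explained_variance_manual(y_true, y_pred):
    """1 - Var(residuals) / Var(actual values)."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - variance(residuals) / variance(y_true)

# Made-up actual and predicted prices (in $1000s)
actual = [20.0, 25.0, 30.0, 35.0]
predicted = [22.0, 24.0, 29.0, 37.0]

print("Mean squared error =", round(mean_squared_error_manual(actual, predicted), 2))
print("Explained variance score =", round(explained_variance_manual(actual, predicted), 2))
```

A lower mean squared error is better, while an explained variance score closer to 1 means the predictions account for more of the variation in the actual values.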

Let's take a test data point and perform prediction:

test_data = [3.7, 0, 18.4, 1, 0.87, 5.95, 91, 2.5052, 26, 666, 20.2, 351.34, 15.27]
print("\nPredicted price:", sv_regressor.predict([test_data])[0])
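
The 13 values in test_data must follow the same column order as the features listed earlier, and predict expects a 2D input, which is why the point is wrapped in a list. Below is a short sketch that pairs each value with its feature name so the ordering is easier to check. The names are typed out by hand from the feature list above. With the dataset loaded, data.feature_names would provide them:

```python
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
test_data = [3.7, 0, 18.4, 1, 0.87, 5.95, 91, 2.5052, 26, 666, 20.2,
             351.34, 15.27]

# A mismatch in length would mean a feature is missing or duplicated
assert len(test_data) == len(feature_names)

labelled = dict(zip(feature_names, test_data))
for name, value in labelled.items():
    print(f"{name:>8}: {value}")
```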

If you run the code with the print(data.keys()) line included, you will see the following printed on the terminal (the exact keys depend on your scikit-learn version):

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

#### Performance ####
Mean squared error = 15.38
Explained variance score = 0.82

Predicted price: 18.521780107258536



Monday, September 9, 2019
