Sunday, September 15, 2019

Using grid search for finding optimal training parameters

When we are working with classifiers, we do not always know what the best parameters are, and it is impractical to check every possible combination by hand. This is where grid search becomes useful. Grid search lets us specify a range of values for each parameter; it then trains and evaluates the classifier across the various configurations and reports the best combination of parameters. The following program demonstrates how to implement this:

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# Load input data
input_file = 'data_random_forests.txt'
data = np.loadtxt(input_file, delimiter=',')
X, y = data[:, :-1], data[:, -1]

# Separate input data into three classes based on labels
class_0 = np.array(X[y==0])
class_1 = np.array(X[y==1])
class_2 = np.array(X[y==2])

# Split the data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=5)

# Define the parameter grid
parameter_grid = [ {'n_estimators': [100], 'max_depth': [2, 4, 7, 12, 16]},
                   {'max_depth': [4], 'n_estimators': [25, 50, 100, 250]}
                 ]

metrics = ['precision_weighted', 'recall_weighted']

for metric in metrics:
    print("\n##### Searching optimal parameters for", metric)

    classifier = GridSearchCV(
            ExtraTreesClassifier(random_state=0),
            parameter_grid, cv=5, scoring=metric)
    classifier.fit(X_train, y_train)

    print("\nGrid scores for the parameter grid:")
    '''
    for params, avg_score, _ in classifier.cv_results_:
        print(params, '-->', round(avg_score, 3))

    print("\nBest parameters:", classifier.best_params_)
    '''
cv_keys = ('mean_test_score', 'std_test_score', 'params')

for r, _ in enumerate(classifier.cv_results_['mean_test_score']):
    print("%0.3f +/- %0.2f %r"
              % (classifier.cv_results_[cv_keys[0]][r],
                 classifier.cv_results_[cv_keys[1]][r] / 2.0,
                 classifier.cv_results_[cv_keys[2]][r]))

print('Best parameters: %s' % classifier.best_params_)
print('Accuracy: %.2f' % classifier.best_score_)
y_pred = classifier.predict(X_test)
print("\nPerformance report:\n")
print(classification_report(y_test, y_pred))


As usual, our program begins by importing the required packages. I am using the data available in the file data_random_forests.txt for this analysis; you can replace it with your own data source file. First we load the data and separate it into three classes based on labels:

input_file = 'data_random_forests.txt'
data = np.loadtxt(input_file, delimiter=',')
X, y = data[:, :-1], data[:, -1]

class_0 = np.array(X[y==0])
class_1 = np.array(X[y==1])
class_2 = np.array(X[y==2])
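
If you want to see how the three classes are distributed before running the search, a quick scatter plot helps. Here is a minimal sketch, assuming the input features are two-dimensional as they are in data_random_forests.txt:

import matplotlib.pyplot as plt

# Scatter plot of the three classes (assumes two feature columns)
plt.figure()
plt.scatter(class_0[:, 0], class_0[:, 1], s=75, facecolors='white',
            edgecolors='black', marker='s', label='Class 0')
plt.scatter(class_1[:, 0], class_1[:, 1], s=75, facecolors='white',
            edgecolors='black', marker='o', label='Class 1')
plt.scatter(class_2[:, 0], class_2[:, 1], s=75, facecolors='white',
            edgecolors='black', marker='^', label='Class 2')
plt.title('Input data')
plt.legend()
plt.show()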

The next step is to split the data into training and testing datasets:

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=5)
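
If the classes are imbalanced, it can help to preserve the class proportions in both splits. A minimal sketch using the stratify argument of train_test_split:

# Keep the class proportions the same in the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=5, stratify=y)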

Now we specify the grid of parameters that we want the classifier to test. Usually we keep one parameter fixed while varying the other, and then swap the roles, to figure out the best combination. In this case, we want to find the best values for n_estimators and max_depth:

parameter_grid = [ {'n_estimators': [100], 'max_depth': [2, 4, 7, 12, 16]},
                   {'max_depth': [4], 'n_estimators': [25, 50, 100, 250]}
                 ]
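
If we instead want GridSearchCV to try every combination of both parameters rather than varying one at a time, we can pass a single dictionary; the search then evaluates the full cross product (20 combinations here). A minimal sketch:

# Every combination of n_estimators and max_depth will be evaluated
parameter_grid_full = {
    'n_estimators': [25, 50, 100, 250],
    'max_depth': [2, 4, 7, 12, 16]
}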

Now we define the metrics that the classifier should use to find the best combination of parameters:

metrics = ['precision_weighted', 'recall_weighted']
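
As a side note, recent versions of scikit-learn also let GridSearchCV evaluate several metrics in a single search by passing a list to scoring and telling it which metric to refit on. A minimal sketch, assuming the same parameter_grid as above:

# One search, two metrics; the final model is refit on precision_weighted
classifier = GridSearchCV(
        ExtraTreesClassifier(random_state=0),
        parameter_grid, cv=5,
        scoring=['precision_weighted', 'recall_weighted'],
        refit='precision_weighted')
classifier.fit(X_train, y_train)

# Scores are stored per metric in cv_results_
print(classifier.cv_results_['mean_test_recall_weighted'])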

For each metric, we need to run the grid search, where we train the classifier for a particular combination of parameters:

for metric in metrics:
    print("\n##### Searching optimal parameters for", metric)

    classifier = GridSearchCV(
            ExtraTreesClassifier(random_state=0),
            parameter_grid, cv=5, scoring=metric)
    classifier.fit(X_train, y_train)
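
As a side note, once fit has been called, the estimator refit with the best parameters is exposed directly on the GridSearchCV object, so it can be reused without retraining. A minimal sketch:

best_model = classifier.best_estimator_   # ExtraTreesClassifier refit with the best parameters
print(best_model.get_params()['n_estimators'], best_model.get_params()['max_depth'])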

Still inside the loop, we print the score for each parameter combination, the best parameter combination, and a performance report on the test dataset:

print("\nGrid scores for the parameter grid:")
cv_keys = ('mean_test_score', 'std_test_score', 'params')

for r, _ in enumerate(classifier.cv_results_['mean_test_score']):
    print("%0.3f +/- %0.2f %r"
              % (classifier.cv_results_[cv_keys[0]][r],
                 classifier.cv_results_[cv_keys[1]][r] / 2.0,
                 classifier.cv_results_[cv_keys[2]][r]))

print('Best parameters: %s' % classifier.best_params_)
print('Accuracy: %.2f' % classifier.best_score_)
y_pred = classifier.predict(X_test)
print("\nPerformance report:\n")
print(classification_report(y_test, y_pred))
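
Since cv_results_ is a plain dictionary, it can also be convenient to inspect the whole grid as a table instead of printing row by row. A minimal sketch, assuming pandas is installed (it is an extra dependency, not used in the program above):

import pandas as pd

# Turn the cross-validation results into a sortable table
results = pd.DataFrame(classifier.cv_results_)
print(results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
      .sort_values('rank_test_score'))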

When we run the code, the Terminal first shows the grid scores for each parameter combination using the precision metric, followed by the best combination found for precision and the performance report on the test dataset. The search is then repeated with recall as the scoring metric, and the corresponding output follows. The best combination for recall is different, which makes sense because precision and recall are different metrics that demand different parameter combinations.







