Thursday, August 29, 2019

Classification and Regression Using Supervised Learning, Part 1: Classification

Classification is a technique in which we arrange data into a fixed number of categories, or classes, so that it can be used most effectively and efficiently.

In machine learning, classification solves the problem of identifying the category to which a new data point belongs. We build the classification model based on the training dataset containing data points and the corresponding labels.

For example, let's say that we want to check whether a given image contains a person's face or not. We would build a training dataset containing images corresponding to two classes: face and no-face. We then train the model based on the training samples we have. This trained model is then used for inference.

A good classification system makes it easy to find and retrieve data. This is used extensively in face recognition, spam identification, recommendation engines, and so on. The algorithms for data classification will come up with the right criteria to separate the given data into the given number of classes.

We need to provide a sufficiently large number of samples so that the algorithm can generalize those criteria. If there is an insufficient number of samples, the algorithm will overfit to the training data. This means that it won't perform well on unknown data because the model has been fine-tuned too closely to the patterns observed in the training data. This is a very common problem in the world of machine learning, and it's good to consider this factor when you build various machine learning models.
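To make overfitting concrete, here is a minimal sketch (an addition to the original text, using a small synthetic dataset rather than the face/no-face example) that trains an unconstrained decision tree and compares its accuracy on the training samples with its accuracy on held-out samples; a large gap between the two numbers is a typical symptom of overfitting:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Small synthetic two-class dataset standing in for real training data
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Hold out 30% of the samples so we can measure generalization
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# An unconstrained decision tree can memorize the training samples
model = DecisionTreeClassifier().fit(X_train, y_train)

print("Training accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))

Training accuracy will typically be 1.0 because the tree memorizes the training data, while test accuracy is noticeably lower; collecting more samples or constraining the model reduces this gap.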

Preprocessing data

Machine learning algorithms expect data to be formatted in a certain way before they start the training process. In the real world, we deal with a lot of raw data, and in order to prepare it for ingestion by machine learning algorithms, we have to preprocess it and convert it into the right format. Some of the most common preprocessing techniques are:

  • Binarization
  • Mean removal
  • Scaling
  • Normalization

Let's start with:

1. Binarization

This technique is used when we want to convert numerical values into Boolean values. The following program shows how to do it:

import numpy as np
from sklearn import preprocessing

input_data = np.array([[5.1, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

# Binarize data
data_binarized = preprocessing.Binarizer(threshold=2.1).transform(input_data)
print("\nBinarized data:\n", data_binarized)

We start by importing the packages and then define some sample data, which is stored in the variable input_data.

input_data = np.array([[5.1, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

Next, we use the built-in Binarizer to binarize the input data using 2.1 as the threshold value.

data_binarized = preprocessing.Binarizer(threshold=2.1).transform(input_data)

Once we run the code, we will see the following output:

Binarized data:

[[ 1. 0. 1.]
[ 0. 1. 0.]
[ 1. 0. 0.]
[ 1. 0. 0.]]

From the output, we can see that all values above 2.1 become 1 and the remaining values become 0.
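As a quick sanity check (an addition, not part of the original program), the same result can be reproduced with plain NumPy, since binarization simply compares each value against the threshold:

import numpy as np

input_data = np.array([[5.1, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

# Values strictly greater than the threshold map to 1.0, the rest to 0.0
manual_binarized = (input_data > 2.1).astype(float)
print("Manually binarized data:\n", manual_binarized)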

The next preprocessing technique is:

2. Mean removal

This preprocessing technique is useful for removing the mean from our feature vector, so that each feature is centered on zero. We do this in order to remove bias from the features in our feature vector. See the following program:

import numpy as np
from sklearn import preprocessing

input_data = np.array([[5.1, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

# Print mean and standard deviation
print("\nBEFORE:")
print("Mean =", input_data.mean(axis=0))
print("Std deviation =", input_data.std(axis=0))

# Remove mean
data_scaled = preprocessing.scale(input_data)
print("\nAFTER:")
print("Mean =", data_scaled.mean(axis=0))
print("Std deviation =", data_scaled.std(axis=0))

We start by importing the packages and then define some sample data, which is stored in the variable input_data.

input_data = np.array([[5.1, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

Next we display the mean and standard deviation of the input data:

print("\nBEFORE:")
print("Mean =", input_data.mean(axis=0))
print("Std deviation =", input_data.std(axis=0))

Finally we remove the mean and print the mean and standard deviation:

data_scaled = preprocessing.scale(input_data)
print("\nAFTER:")
print("Mean =", data_scaled.mean(axis=0))
print("Std deviation =", data_scaled.std(axis=0))

Once we run the code, we will see the following output:

BEFORE:
Mean = [ 3.775 -1.15 -1.3 ]
Std deviation = [ 3.12039661 6.36651396 4.0620192 ]
AFTER:
Mean = [ 1.11022302e-16 0.00000000e+00 2.77555756e-17]
Std deviation = [ 1. 1. 1.]

From the output, we can see that the mean is now very close to 0 and the standard deviation is exactly 1.
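For reference (an addition to the original text), the same standardization can be done directly with NumPy by subtracting each column's mean and dividing by its standard deviation, which is what preprocessing.scale does with its default settings:

import numpy as np

input_data = np.array([[5.1, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

# Column-wise standardization: subtract the mean, divide by the standard deviation
manual_scaled = (input_data - input_data.mean(axis=0)) / input_data.std(axis=0)
print("Mean =", manual_scaled.mean(axis=0))
print("Std deviation =", manual_scaled.std(axis=0))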

The next preprocessing technique is:

3. Scaling

In our feature vector, the value of each feature can vary over a wide range. It therefore becomes important to scale the features so that they provide a level playing field for the machine learning algorithm to train on. We don't want any feature to be artificially large or small just because of the nature of the measurements. See the following program:

import numpy as np
from sklearn import preprocessing

input_data = np.array([[5.1, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

# Min max scaling
data_scaler_minmax = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled_minmax = data_scaler_minmax.fit_transform(input_data)
print("\nMin max scaled data:\n", data_scaled_minmax)

Once we run the code, we will see the following output:

Min max scaled data:
[[ 0.74117647 0.39548023 1. ]
[ 0. 1. 0. ]
[ 0.6 0.5819209 0.87234043]
[ 1. 0. 0.17021277]]

From the output, we can see that each column is scaled so that its minimum value becomes 0 and its maximum value becomes 1, with all other values falling proportionally in between.
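To show what the scaler is doing under the hood (this sketch is an addition to the original text), the min-max formula (x - min) / (max - min) can be applied to each column with plain NumPy:

import numpy as np

input_data = np.array([[5.1, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

# Apply (x - min) / (max - min) to each column
col_min = input_data.min(axis=0)
col_max = input_data.max(axis=0)
manual_minmax = (input_data - col_min) / (col_max - col_min)
print("Min max scaled data:\n", manual_minmax)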

The next preprocessing technique is:

4. Normalization

Normalization is used to modify the values in the feature vector so that we can measure them on a common scale. In machine learning, we use many different forms of normalization. Some of the most common forms of normalization aim to modify the values so that they sum up to 1.

L1 normalization, which refers to Least Absolute Deviations, works by making sure that the sum of absolute values is 1 in each row. L2 normalization, which refers to least squares, works by making sure that the sum of squares is 1 in each row.

In general, the L1 normalization technique is considered more robust than the L2 normalization technique because it is resistant to outliers in the data. Data often contains outliers that we cannot do anything about, so we want to use techniques that can safely and effectively ignore them during the calculations. If we are solving a problem where outliers are important, then L2 normalization may be the better choice. See the following program:

import numpy as np
from sklearn import preprocessing

input_data = np.array([[5.1, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

# Normalize data
data_normalized_l1 = preprocessing.normalize(input_data, norm='l1')
data_normalized_l2 = preprocessing.normalize(input_data, norm='l2')
print("\nL1 normalized data:\n", data_normalized_l1)
print("\nL2 normalized data:\n", data_normalized_l2)

Once we run the code, we will see the following output:

L1 normalized data:
[[ 0.45132743 -0.25663717 0.2920354 ]
[-0.0794702 0.51655629 -0.40397351]
[ 0.609375 0.0625 0.328125 ]
[ 0.33640553 -0.4562212 -0.20737327]]

L2 normalized data:
[[ 0.75765788 -0.43082507 0.49024922]
[-0.12030718 0.78199664 -0.61156148]
[ 0.87690281 0.08993875 0.47217844]
[ 0.55734935 -0.75585734 -0.34357152]]

From the output, we can see the L1- and L2-normalized data: in each row of the L1 output, the absolute values sum to 1, and in each row of the L2 output, the squares sum to 1.
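As an additional illustration (not part of the original program), both normalizations can be computed directly with NumPy by dividing each row by its L1 or L2 norm:

import numpy as np

input_data = np.array([[5.1, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

# L1: divide each row by the sum of its absolute values
manual_l1 = input_data / np.abs(input_data).sum(axis=1, keepdims=True)

# L2: divide each row by its Euclidean (L2) norm
manual_l2 = input_data / np.linalg.norm(input_data, axis=1, keepdims=True)

print("L1 normalized data:\n", manual_l1)
print("L2 normalized data:\n", manual_l2)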
