Machine learning algorithms require data in a specific format before training can begin. The data we usually deal with is in raw form, so before supplying it as input to a machine learning algorithm, we must convert it into a meaningful representation. In other words, before feeding data to a machine learning algorithm, we need to preprocess it.
Data preprocessing involves the following steps:
Step 1: Importing the useful packages: For Python users, this is the first step in converting the data into a certain format, i.e., preprocessing. It can be done as follows:
import numpy as np
from sklearn import preprocessing
Here we have used the following two packages:
a. NumPy: NumPy is a general purpose array-processing package designed to efficiently manipulate large multi-dimensional arrays of arbitrary records without sacrificing too much speed for small multi-dimensional arrays.
b. Sklearn.preprocessing: This package provides many common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for machine learning algorithms.
Step 2: Defining sample data: After importing the packages, we need to define some sample data so that we can apply preprocessing techniques to it. We will now define the following sample data:
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])
Step 3: Applying a preprocessing technique: In this step, we apply any of the following preprocessing techniques:
a. Binarization
It is used when we need to convert numerical values into Boolean values. We can use an inbuilt method to binarize the input data, say by using 0.5 as the threshold value, in the following way:
data_binarized = preprocessing.Binarizer(threshold=0.5).transform(input_data)
print("\nBinarized data:\n", data_binarized)
print("\nBinarized data:\n", data_binarized)
Now, after running the above code we will get the following output: all values above 0.5 (the threshold value) are converted to 1, and all values at or below 0.5 are converted to 0.
Binarized data:
[[ 1. 0. 1.]
[ 0. 1. 1.]
[ 0. 0. 1.]
[ 1. 1. 0.]]
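For intuition, the same output can be reproduced with a plain NumPy comparison. The snippet below is a minimal sketch, not part of sklearn, assuming the input_data array defined in Step 2:

import numpy as np

input_data = np.array([[2.1, -1.9, 5.5], [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6], [5.9, 2.3, -5.8]])

# Element-wise comparison yields a Boolean array; casting to float
# reproduces the 0./1. output of Binarizer(threshold=0.5)
manual_binarized = (input_data > 0.5).astype(float)
print(manual_binarized)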
b. Mean Removal
It is used to eliminate the mean from the feature vector so that every feature is centered on zero. This also removes the bias from the features in the feature vector.
The following code will display the mean and standard deviation of each feature (column) of the input data:
print("Mean =", input_data.mean(axis=0))
print("Std deviation = ", input_data.std(axis=0))
print("Std deviation = ", input_data.std(axis=0))
We will get the following output after running the above lines of code:
Mean = [ 1.75 -1.275 2.2 ]
Std deviation = [ 2.71431391 4.20022321 4.69414529]
Now, the code below will remove the mean from the input data and scale each feature to unit standard deviation:
data_scaled = preprocessing.scale(input_data)
print("Mean =", data_scaled.mean(axis=0))
print("Std deviation =", data_scaled.std(axis=0))
print("Mean =", data_scaled.mean(axis=0))
print("Std deviation =", data_scaled.std(axis=0))
We will get the following output after running the above lines of code. Note that the mean is now effectively zero (the tiny value is floating-point rounding error) and the standard deviation is 1:
Mean = [ 1.11022302e-16 0.00000000e+00 0.00000000e+00]
Std deviation = [ 1. 1. 1.]
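preprocessing.scale() is a one-off convenience function. When the same centering and scaling must later be applied to new data, sklearn also provides the StandardScaler transformer class, which can be fitted once and reused. A minimal sketch:

import numpy as np
from sklearn import preprocessing

input_data = np.array([[2.1, -1.9, 5.5], [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6], [5.9, 2.3, -5.8]])

# Fit the scaler on the sample data; the fitted object can then be
# reused to transform any new data with the same mean and scale
scaler = preprocessing.StandardScaler()
data_standardized = scaler.fit_transform(input_data)

print("Mean =", data_standardized.mean(axis=0))          # ~0 for every feature
print("Std deviation =", data_standardized.std(axis=0))  # 1 for every feature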
c. Scaling
It is used to scale the feature vectors. Scaling is needed because the values of different features can span very different ranges. In other words, scaling is important because we do not want any feature to be artificially large or small. With the help of the following code, we can scale our input data, i.e., the feature vector:
# Min max scaling
data_scaler_minmax = preprocessing.MinMaxScaler(feature_range=(0,1))
data_scaled_minmax = data_scaler_minmax.fit_transform(input_data)
print ("\nMin max scaled data:\n", data_scaled_minmax)
We will get the following output after running the above lines of code:
Min max scaled data:
[[ 0.48648649 0.58252427 0.99122807]
[ 0. 1. 0.81578947]
[ 0.27027027 0. 1. ]
[ 1. 0.99029126 0. ]]
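Under the hood, min-max scaling maps each feature to the target range via (x - min) / (max - min), computed per column. The snippet below is a minimal sketch that reproduces the same numbers by hand, assuming the input_data array from Step 2:

import numpy as np

input_data = np.array([[2.1, -1.9, 5.5], [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6], [5.9, 2.3, -5.8]])

# Per-column (per-feature) minimum and maximum
col_min = input_data.min(axis=0)
col_max = input_data.max(axis=0)

# (x - min) / (max - min) rescales every feature to [0, 1]
manual_minmax = (input_data - col_min) / (col_max - col_min)
print(manual_minmax)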
d. Normalization
It is another data preprocessing technique used to modify the feature vectors. Such modification is necessary so that the feature vectors can be measured on a common scale. The following are two types of normalization that can be used in machine learning:
i. L1 Normalization
It is also referred to as Least Absolute Deviations. This kind of normalization modifies the values so that the sum of the absolute values in each row is always 1. It can be implemented on the input data with the help of the following code:
# Normalize data
data_normalized_l1 = preprocessing.normalize(input_data, norm='l1')
print("\nL1 normalized data:\n", data_normalized_l1)
The above lines of code generate the following output:
L1 normalized data:
[[ 0.22105263 -0.2 0.57894737]
[-0.2027027 0.32432432 0.47297297]
[ 0.03571429 -0.56428571 0.4 ]
[ 0.42142857 0.16428571 -0.41428571]]
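Equivalently, L1 normalization divides each row by the sum of its absolute values. A minimal sketch of the same computation in plain NumPy, assuming the input_data array from Step 2:

import numpy as np

input_data = np.array([[2.1, -1.9, 5.5], [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6], [5.9, 2.3, -5.8]])

# Divide each row by the sum of its absolute values
row_l1 = np.abs(input_data).sum(axis=1, keepdims=True)
manual_l1 = input_data / row_l1
print(manual_l1)

# Sanity check: the absolute values in each row sum to 1
print(np.abs(manual_l1).sum(axis=1))   # [1. 1. 1. 1.]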
ii. L2 Normalization
This kind of normalization, also referred to as least squares, modifies the values so that the sum of the squares in each row is always 1. It can be implemented on the input data with the help of the following code:
# Normalize data
data_normalized_l2 = preprocessing.normalize(input_data, norm='l2')
print("\nL2 normalized data:\n", data_normalized_l2)
The above lines of code will generate the following output:
L2 normalized data:
[[ 0.33946114 -0.30713151 0.88906489]
[-0.33325106 0.53320169 0.7775858 ]
[ 0.05156558 -0.81473612 0.57753446]
[ 0.68706914 0.26784051 -0.6754239 ]]
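Equivalently, L2 normalization divides each row by its Euclidean norm. A minimal sketch of the same computation in plain NumPy, assuming the input_data array from Step 2:

import numpy as np

input_data = np.array([[2.1, -1.9, 5.5], [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6], [5.9, 2.3, -5.8]])

# Divide each row by its Euclidean (L2) norm
row_l2 = np.linalg.norm(input_data, axis=1, keepdims=True)
manual_l2 = input_data / row_l2
print(manual_l2)

# Sanity check: the squares in each row sum to 1
print((manual_l2 ** 2).sum(axis=1))   # [1. 1. 1. 1.]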