Monday, June 8, 2020

Building Naïve Bayes Classifier in Python

Naïve Bayes classification technique used to build classifier using the Bayes theorem. The assumption is that the predictors are independent. In simple words, it assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For building Naïve Bayes classifier we need to use the scikit learn python library. There are three types of Naïve Bayes models named Gaussian, Multinomial and Bernoulli under scikit learn package.

Applying Gaussian Naïve Bayes Classifier in Python: Part One

To build a Naïve Bayes machine learning classifier model, we need the following:

1. Dataset

We are going to use the dataset named Breast Cancer Wisconsin Diagnostic Database. The dataset includes various information about breast cancer tumors, as well as classification labels of malignant or benign. The dataset has 569 instances, or data, on 569 tumors and includes information on 30 attributes, or features, such as the radius of the tumor, texture, smoothness, and area. We can import this dataset from sklearn package.

2. Naïve Bayes Model

For building Naïve Bayes classifier, we need a Naïve Bayes model. As told earlier, there are three types of Naïve Bayes models named Gaussian, Multinomial and Bernoulli under scikit learn package. Here, we are going to use the Gaussian Naïve Bayes model.

By using the above, we are going to build a Naïve Bayes machine learning model to use the tumor information to predict whether or not a tumor is malignant or benign.

To begin with, we need to import the sklearn module. It can be done with the help of the following command:

Import Sklearn

Next, we need to import the dataset named Breast Cancer Wisconsin Diagnostic Database.

from sklearn.datasets import load_breast_cancer

Now, the following command will load the dataset.

data = load_breast_cancer()

The data can be organized as follows:

label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']


Now, to make it clearer we can print the class labels, the first data instance’s label, our feature names and the feature’s value with the help of following commands:

print(label_names)

The above command will print the class names which are malignant and benign respectively. It is shown as the output below:

['malignant' 'benign']

Now, the command given below will show that they are mapped to binary values 0 and 1. Here 0 represents malignant cancer and 1 represents benign cancer. It is shown as the output below:

print(labels[0])

0

The following two commands will produce the feature names and feature values.

print(feature_names[0])

mean radius

print(features[0])

[ 1.79900000e+01 1.03800000e+01 1.22800000e+02 1.00100000e+03 1.18400000e-01 2.77600000e-01 3.00100000e-01 1.47100000e-01
2.41900000e-01 7.87100000e-02 1.09500000e+00 9.05300000e-01
8.58900000e+00 1.53400000e+02 6.39900000e-03 4.90400000e-02
5.37300000e-02 1.58700000e-02 3.00300000e-02 6.19300000e-03
2.53800000e+01 1.73300000e+01 1.84600000e+02 2.01900000e+03
1.62200000e-01 6.65600000e-01 7.11900000e-01 2.65400000e-01
4.60100000e-01 1.18900000e-01]

From the above output, we can see that the first data instance is a malignant tumor the main radius of which is 1.7990000e+01.

For testing our model on unseen data, we need to split our data into training and testing data. It can be done with the help of the following code:

from sklearn.model_selection import train_test_split

The above command will import the train_test_split function from sklearn and the command below will split the data into training and test data. In the below example, we are using 40 % of the data for testing and the remining data would be used for training the model.

train, test, train_labels, test_labels = train_test_split(features,labels,test_size = 0.40, random_state = 42)

Now, we are building the model with the following commands:

from sklearn.naive_bayes import GaussianNB

The above command will import the GaussianNB module. Now, with the command given below, we need to initialize the model.

gnb = GaussianNB()

We will train the model by fitting it to the data by using gnb.fit().

model = gnb.fit(train, train_labels)

Now, evaluate the model by making prediction on the test data and it can be done as follows:

preds = gnb.predict(test)

print(preds)

[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0 1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 0 1 1 0 1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 0 1 1 0 1]

The above series of 0s and 1s are the predicted values for the tumor classes i.e. malignant and benign.
Now, by comparing the two arrays namely test_labels and preds, we can find out the accuracy of our model. We are going to use the accuracy_score() function to determine the accuracy. Consider the following command:

from sklearn.metrics import accuracy_score
print(accuracy_score(test_labels,preds))

0.951754385965

The result shows that NaïveBayes classifier is 95.17% accurate.

In this way we can build a machine learning classifier based on the Naïve Bayse Gaussian model.


Share:

0 comments:

Post a Comment