Wednesday, July 1, 2020

Category Prediction using BoW model

In a set of documents, not only the words themselves but also the category they fall into matters: for example, we may want to predict whether a given sentence belongs to a category such as email, news, sports, or computing. In the following example, we use tf-idf to build a feature vector and predict the category of a document, using the 20 newsgroups dataset from sklearn.

We need to import the necessary packages:

from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

Define the category map. We use five categories: Religion, Autos, Hockey, Electronics, and Space.

category_map = {'talk.religion.misc': 'Religion',
                'rec.autos': 'Autos',
                'rec.sport.hockey': 'Hockey',
                'sci.electronics': 'Electronics',
                'sci.space': 'Space'}

Create the training set:

training_data = fetch_20newsgroups(subset='train',
                                   categories=category_map.keys(),
                                   shuffle=True, random_state=5)

Build a count vectorizer and extract the term counts:

vectorizer_count = CountVectorizer()
train_tc = vectorizer_count.fit_transform(training_data.data)
print("\nDimensions of training data:", train_tc.shape)
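To see what the count vectorizer produces, here is a minimal sketch on a toy corpus (the documents below are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration
docs = ['hockey puck', 'space shuttle', 'hockey stick']

cv = CountVectorizer()
tc = cv.fit_transform(docs)  # sparse matrix of raw term counts

# One row per document, one column per unique term
print(tc.shape)                # (3, 5)
print(sorted(cv.vocabulary_))  # ['hockey', 'puck', 'shuttle', 'space', 'stick']
print(tc.toarray())
```

Each row of the resulting matrix holds the raw counts of every vocabulary term for one document.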

The tf-idf transformer is created as follows:

tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_tc)
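What the transformer does can be checked on a toy matrix: a term that appears in every document receives a lower weight than a term confined to a single document. The corpus here is invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# 'common' appears in every document, 'rare' in only one (invented corpus)
docs = ['common rare', 'common word', 'common term']

cv = CountVectorizer()
tc = cv.fit_transform(docs)
weights = TfidfTransformer().fit_transform(tc).toarray()

vocab = cv.vocabulary_  # maps each term to its column index
# In the first document, the rare term outweighs the common one
print(weights[0, vocab['rare']] > weights[0, vocab['common']])  # True
```

This inverse-document-frequency reweighting is what makes discriminative terms stand out for the classifier.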

Now, define the test data:

input_data = [
'Discovery was a space shuttle',
'Hindu, Christian, Sikh all are religions',
'We must drive safely',
'Puck is a disk made of rubber',
'Television, Microwave, Refrigerator all use electricity'
]

Train a Multinomial Naive Bayes classifier on the tf-idf training features:

classifier = MultinomialNB().fit(train_tfidf, training_data.target)

Transform the input data using the count vectorizer:

input_tc = vectorizer_count.transform(input_data)
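Note that `transform` (not `fit_transform`) is used here, so the vocabulary learned from the training set is reused; terms never seen during training are simply dropped. A minimal sketch with invented strings:

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
cv.fit(['hockey puck ice'])             # vocabulary is fixed here
x = cv.transform(['puck and zamboni'])  # 'zamboni' was never seen, so it is ignored

print(x.toarray())  # only the 'puck' column is non-zero
```

Calling `fit_transform` on the test data instead would build a new, incompatible vocabulary and break the classifier.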

Now, we will transform the vectorized data using the tfidf transformer:

input_tfidf = tfidf.transform(input_tc)

We will predict the output categories:

predictions = classifier.predict(input_tfidf)

The output is generated as follows:

for sent, category in zip(input_data, predictions):
    print('\nInput Data:', sent, '\n Category:',
          category_map[training_data.target_names[category]])
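The same pipeline can also report how confident the classifier is via `predict_proba`. Since the newsgroup download is not always available, the sketch below uses a tiny invented training set with made-up labels:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Tiny invented stand-in for the newsgroup data
texts = ['the shuttle reached orbit', 'rocket launch to orbit',
         'the puck hit the net', 'hockey puck on ice']
labels = [0, 0, 1, 1]  # 0 = Space, 1 = Hockey (invented labels)

cv = CountVectorizer()
tf = TfidfTransformer()
X = tf.fit_transform(cv.fit_transform(texts))
clf = MultinomialNB().fit(X, labels)

x_new = tf.transform(cv.transform(['the shuttle is in orbit']))
print(clf.predict(x_new))        # predicted class label
print(clf.predict_proba(x_new))  # probability for each class
```

Inspecting the probabilities is a quick way to spot inputs the model finds ambiguous.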

The category predictor generates the following output:

Dimensions of training data: (2755, 39297)

Input Data: Discovery was a space shuttle
Category: Space

Input Data: Hindu, Christian, Sikh all are religions
Category: Religion

Input Data: We must drive safely
Category: Autos

Input Data: Puck is a disk made of rubber
Category: Hockey

Input Data: Television, Microwave, Refrigerator all use electricity
Category: Electronics