Category Prediction using BoW model

In a set of documents, not only the words themselves matter but also the category of text in which each word appears. For example, we may want to predict whether a given sentence belongs to the category email, news, sports, computer, etc. In the following example, we use tf-idf to build a feature vector and predict the category of documents. We use the 20 newsgroups dataset, which can be loaded through sklearn.

We need to import the necessary packages:

from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

Define the category map. We are using five different categories named Religion, Autos, Hockey, Electronics and Space.

category_map = {'talk.religion.misc': 'Religion', 'rec.autos': 'Autos',
    'rec.sport.hockey': 'Hockey', 'sci.electronics': 'Electronics',
    'sci.space': 'Space'}

Create the training set:

training_data = fetch_20newsgroups(subset='train',
    categories=list(category_map.keys()), shuffle=True, random_state=5)

Build a count vectorizer and extract the term counts:

vectorizer_count = CountVectorizer()
train_tc = vectorizer_count.fit_transform(training_data.data)
print("\nDimensions of training data:", train_tc.shape)
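To see what the term-count matrix contains, here is a minimal sketch on a hypothetical three-sentence corpus. The names toy_docs, toy_vectorizer and toy_counts are illustrative only and are not part of the example in this section. Each row corresponds to one document and each column to one vocabulary term.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus, used only for illustration.
toy_docs = [
    "the shuttle reached orbit",
    "the puck crossed the line",
    "the shuttle landed",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each term to its column index; sorting the terms
# alphabetically reproduces the column order.
vocabulary = sorted(toy_vectorizer.vocabulary_)
print(vocabulary)
print(toy_counts.toarray())
print("Dimensions:", toy_counts.shape)
```

The second sentence contains "the" twice, so its row holds a 2 in the column for that term.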

The tf-idf transformer is created as follows:

tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_tc)
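The effect of the tf-idf step can be sketched on a small hypothetical corpus. This assumes the default TfidfTransformer settings (smoothed idf and L2 row normalization); the toy_* names are illustrative. A term that occurs in every document, such as "the", receives the lowest idf weight, while a term found in a single document is weighted more heavily.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical corpus, used only for illustration.
toy_docs = [
    "the shuttle reached orbit",
    "the puck crossed the line",
    "the shuttle landed",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

toy_tfidf_transformer = TfidfTransformer()
toy_tfidf = toy_tfidf_transformer.fit_transform(toy_counts)

the_idx = toy_vectorizer.vocabulary_["the"]      # appears in all 3 documents
orbit_idx = toy_vectorizer.vocabulary_["orbit"]  # appears in 1 document

print("idf('the')  =", toy_tfidf_transformer.idf_[the_idx])
print("idf('orbit') =", toy_tfidf_transformer.idf_[orbit_idx])

# With the default norm, every document row has unit Euclidean length.
row_norms = np.linalg.norm(toy_tfidf.toarray(), axis=1)
print("Row norms:", row_norms)
```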

Now, define the test data:

input_data = [
    'Discovery was a space shuttle',
    'Hindu, Christian, Sikh all are religions',
    'We must have to drive safely',
    'Puck is a disk made of rubber',
    'Television, Microwave, Refrigrated all uses electricity'
]

Train a Multinomial Naive Bayes classifier on the tf-idf features of the training set:

classifier = MultinomialNB().fit(train_tfidf, training_data.target)

Transform the input data using the count vectorizer:

input_tc = vectorizer_count.transform(input_data)

Now, we will transform the vectorized data using the tfidf transformer:

input_tfidf = tfidf.transform(input_tc)
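Note that new data goes through transform rather than fit_transform, so that it is mapped onto the vocabulary and idf weights learned from the training set. The following sketch illustrates this with hypothetical sentences (all names here are illustrative): words that were never seen during fitting are simply ignored, and the number of columns stays the same.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical training and new sentences, used only for illustration.
train_docs = ["the shuttle reached orbit", "the puck crossed the line"]
new_docs = ["the shuttle met a comet"]  # "met" and "comet" were never seen

toy_vectorizer = CountVectorizer()
train_counts = toy_vectorizer.fit_transform(train_docs)
toy_tfidf_transformer = TfidfTransformer().fit(train_counts)

# transform() reuses the fitted vocabulary and idf weights.
new_counts = toy_vectorizer.transform(new_docs)
new_tfidf = toy_tfidf_transformer.transform(new_counts)

print("Training columns:", train_counts.shape[1])
print("New-data columns:", new_counts.shape[1])
print("Known terms counted in the new sentence:", new_counts.sum())
```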

We will predict the output categories:

predictions = classifier.predict(input_tfidf)

Print each input sentence together with its predicted category:

for sent, category in zip(input_data, predictions):
    print('\nInput Data:', sent, '\n Category:',
          category_map[training_data.target_names[category]])
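The classifier returns integer class indices, which are then looked up in the list of target names and finally in the category map. A minimal, self-contained sketch of that lookup is shown here on a hypothetical two-class corpus; toy_target_names and toy_label_map stand in for training_data.target_names and category_map, and the sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-ins for training_data.target_names and category_map.
toy_target_names = ["hockey", "space"]
toy_label_map = {"hockey": "Hockey", "space": "Space"}

# Hypothetical training sentences with their class indices.
toy_docs = [
    "the puck slid across the ice rink",
    "the goalie stopped the puck",
    "the shuttle reached orbit around earth",
    "astronauts launched the rocket into orbit",
]
toy_targets = [0, 0, 1, 1]

toy_vectorizer = CountVectorizer()
toy_tfidf_transformer = TfidfTransformer()
toy_features = toy_tfidf_transformer.fit_transform(
    toy_vectorizer.fit_transform(toy_docs))
toy_classifier = MultinomialNB().fit(toy_features, toy_targets)

new_sentences = ["the rocket left orbit", "the puck hit the goalie"]
toy_predictions = toy_classifier.predict(
    toy_tfidf_transformer.transform(toy_vectorizer.transform(new_sentences)))

# Each prediction is an index into toy_target_names.
predicted_labels = [toy_label_map[toy_target_names[idx]]
                    for idx in toy_predictions]
for sent, label in zip(new_sentences, predicted_labels):
    print(sent, "->", label)
```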

The category predictor generates the following output:

Dimensions of training data: (2755, 39297)

Input Data: Discovery was a space shuttle
Category: Space

Input Data: Hindu, Christian, Sikh all are religions
Category: Religion

Input Data: We must have to drive safely
Category: Autos

Input Data: Puck is a disk made of rubber
Category: Hockey

Input Data: Television, Microwave, Refrigrated all uses electricity
Category: Electronics
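
As a side note, scikit-learn also offers TfidfVectorizer, which combines the counting and tf-idf steps used above into a single object. The sketch below, on a hypothetical corpus and assuming default settings for all three classes, checks that both routes give the same feature matrix; it is an alternative formulation, not the approach followed in this section.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Hypothetical corpus, used only for illustration.
toy_docs = [
    "the shuttle reached orbit",
    "the puck crossed the line",
    "the shuttle landed",
]

# Two-step route: term counts first, then tf-idf weighting.
two_step = TfidfTransformer().fit_transform(
    CountVectorizer().fit_transform(toy_docs))

# One-step route: TfidfVectorizer does both at once.
one_step = TfidfVectorizer().fit_transform(toy_docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print("Same features from both routes:", same)
```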


