Monday, June 29, 2020

Dividing Data into Chunks: Chunking

The main job of chunking is to group the tokens of a sentence into short phrases, such as noun phrases, using their part-of-speech tags. We have already studied tokenization, the process of creating tokens. Chunking labels groups of those tokens as phrases. In other words, chunking shows us the structure of the sentence.

There are two types of chunking:

Chunking up

In chunking up, objects and ideas become more general and the language gets more abstract, so agreement is more likely. In this process, we zoom out. For example, if we chunk up the question “What are cars for?”, we may get the answer “transport”.

Chunking down

In chunking down, objects and ideas become more specific and the language gets more detailed. The deeper structure is examined. In this process, we zoom in. For example, if we chunk down with the request “Tell me specifically about a car”, we get smaller pieces of information about the car.

Example

In this example, we will do noun phrase chunking, a category of chunking that finds the noun phrase chunks in a sentence, using the NLTK module in Python.

Follow these steps in Python to implement noun phrase chunking:

  • Step 1: Define the grammar for chunking. It consists of the rules that describe which tag sequences form a chunk.
  • Step 2: Create a chunk parser. It applies the grammar to the tagged sentence and gives the output.
  • Step 3: The output is produced in a tree format.

Let us import the necessary NLTK package as follows:

import nltk

Now, we need to define the sentence as a list of (word, tag) pairs. Here, DT means determiner, JJ means adjective, NN means noun, VBD means past-tense verb, VBG means verb in gerund or present-participle form, and IN means preposition.

sentence = [("a", "DT"), ("clever", "JJ"), ("fox", "NN"), ("was", "VBD"), ("jumping", "VBG"), ("over", "IN"), ("the", "DT"), ("wall", "NN")]
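If the tagged text is available as a string in word/TAG form, the same list of pairs can be built without typing each tuple by hand. The following is a minimal sketch that assumes such a string; the variable name tagged_text and its contents are illustrative only, and str2tuple is NLTK's helper for splitting a word/TAG token.

```python
from nltk.tag import str2tuple

# Illustrative input: tokens written as word/TAG, separated by spaces.
tagged_text = "a/DT clever/JJ fox/NN"

# Split each token at the last "/" into a (word, tag) pair.
sentence = [str2tuple(token) for token in tagged_text.split()]

print(sentence)
```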

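Now, we need to give the grammar. Here, we write the grammar as a regular expression over the tags. The rule below says that a noun phrase (NP) chunk is an optional determiner, followed by any number of adjectives, followed by a noun.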

grammar = "NP:{<DT>?<JJ>*<NN>}"
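To see what this rule does and does not match, it can help to try it on a few tiny tagged inputs. The sketch below is only an illustration: the helper noun_phrases and the sample inputs are made up for this purpose and are not part of NLTK.

```python
import nltk

# Same rule as above: optional determiner, any number of adjectives, one noun.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")

def noun_phrases(tagged):
    """Return the words of each NP chunk found in a tagged sentence."""
    tree = chunker.parse(tagged)
    return [[word for word, tag in subtree.leaves()]
            for subtree in tree.subtrees(lambda t: t.label() == "NP")]

# A bare noun is enough to form a chunk.
print(noun_phrases([("dog", "NN")]))

# A determiner and several adjectives are absorbed into the same chunk.
print(noun_phrases([("a", "DT"), ("big", "JJ"), ("red", "JJ"), ("ball", "NN")]))

# <NN> matches only the tag NN, so a plural noun (NNS) forms no chunk here.
print(noun_phrases([("the", "DT"), ("dogs", "NNS")]))
```

If plural or proper nouns should also be chunked, a broader tag pattern such as <NN.*> in place of <NN> is one common choice.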

We need to define a parser that uses this grammar.

parser_chunking = nltk.RegexpParser(grammar)

The parser parses the sentence as follows:

parser_chunking.parse(sentence)

Next, we need to keep the output. Here, it is stored in a variable called output_chunk.

output_chunk = parser_chunking.parse(sentence)

Upon execution of the following code, we can draw our output in the form of a tree.

output_chunk.draw()
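Drawing the tree opens a graphical window, which requires Tkinter and a display. Where that is not available, the result can be inspected as text instead. The following is a minimal sketch of that approach: it repeats the example's words so that it runs on its own, and the list name noun_phrases is just an illustrative choice.

```python
import nltk

# The example sentence as (word, tag) pairs; the verb tags do not affect the NP rule.
sentence = [("a", "DT"), ("clever", "JJ"), ("fox", "NN"), ("was", "VBD"),
            ("jumping", "VBG"), ("over", "IN"), ("the", "DT"), ("wall", "NN")]

parser_chunking = nltk.RegexpParser("NP:{<DT>?<JJ>*<NN>}")
output_chunk = parser_chunking.parse(sentence)

# Print the tree in bracketed text form instead of drawing it.
print(output_chunk)

# Collect the words of every subtree labeled NP.
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in output_chunk.subtrees()
                if subtree.label() == "NP"]

print(noun_phrases)  # ['a clever fox', 'the wall']
```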
