NLP chunking and information extraction from text using NLTK

NLP4Everyone

Hey there! In today’s article, we’re going to explore how to use NLP chunking to extract information from text. We’ll cover the following concepts:

  1. Parsing and parsing trees
  2. POS tagging
  3. Chunking
  4. Chunk grammar and tag patterns
  5. IOB tags
  6. Regular expression-based chunkers
  7. N-gram chunkers
  8. Classifier based tagger
  9. Cascaded chunkers
  10. Named entity recognition
  11. Information extraction

Parsing

Parsing is the process of working out how the different components of a sentence relate to each other. To create a parse tree, we use a model that combines a context-free grammar with probabilities assigned to each rule (a PCFG), which lets us derive the tree that best represents the structure of the sentence.

Source: Ranjan, N., Mundada, K., Phaltane, K., & Ahmad, S. (2016). A Survey on Techniques in NLP. International Journal of Computer Applications, 134(8), 6–9.

Parse Tree

A parse tree is an ordered, rooted tree that represents the syntactic structure of a sentence, according to some context-free grammar.
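
As a quick illustration, here is a minimal sketch of building a parse tree with NLTK, using a toy probabilistic context-free grammar (the rules, words, and probabilities are all made up for the example):

import nltk

# A toy PCFG; every rule carries a probability, and the probabilities
# for each left-hand side must sum to 1
grammar = nltk.PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> DT NN [0.6] | NNP [0.4]
    VP -> VBZ NP [1.0]
    DT -> 'the' [1.0]
    NN -> 'dog' [0.5] | 'ball' [0.5]
    NNP -> 'Rex' [1.0]
    VBZ -> 'chases' [1.0]
""")

# ViterbiParser returns the most probable parse tree for the sentence
parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("the dog chases the ball".split()):
    print(tree)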

POS tagging

POS tagging is the process of assigning a lexical class marker to each word in a sentence according to its context. The lexical classes assigned to words are types like nouns, pronouns, adjectives, and verbs, among others. (See the NLTK part-of-speech tags cheat sheet for the full tag set.)

Source: Ranjan, N., Mundada, K., Phaltane, K., & Ahmad, S. (2016). A Survey on Techniques in NLP. International Journal of Computer Applications, 134(8), 6–9.
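
A minimal sketch of POS tagging with NLTK (the sentence is illustrative, and the tokenizer and tagger models must be downloaded once with nltk.download):

import nltk

# One-time setup: nltk.download('punkt') and
# nltk.download('averaged_perceptron_tagger')
tokens = nltk.word_tokenize("The little yellow dog barked at the cat")
print(nltk.pos_tag(tokens))
# Expected output (may vary slightly with tagger version):
# [('The', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'),
#  ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]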

Chunking

The role of chunking is to segment and label multi-token sequences.

Source: NLTK.org

Chunk grammar and tag patterns

Chunk grammar is made up of rules that guide how sentences should be chunked. These rules use tag patterns to describe sequences of tagged words. A tag pattern is a sequence of part-of-speech tags enclosed in angle brackets, similar to a regular expression pattern. For example, <DT>?<JJ>*<NN> matches an optional determiner, followed by any number of adjectives, followed by a noun, so it covers phrases like “the little dog”.

It is possible to specify multiple rules using regular expressions. The chunker will apply them one by one. It’s important to note that if a tag pattern matches at overlapping locations, the leftmost match takes precedence.
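
For instance, a minimal sketch with two NP rules (the tagged sentence is hard-coded for illustration):

import nltk

# Two tag patterns under the same NP label; the chunker tries them in
# order, and overlapping matches are resolved leftmost-first
grammar = r"""
  NP: {<DT>?<JJ>*<NN>}   # optional determiner, adjectives, noun
      {<NNP>+}           # sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
tagged = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
          ("saw", "VBD"), ("Rex", "NNP")]
print(cp.parse(tagged))
# (S (NP the/DT little/JJ dog/NN) saw/VBD (NP Rex/NNP))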

IOB tags

Chunk structures can be represented using either tags or trees. The most common file representation uses IOB tags.

Tokens are tagged as either I (inside), O (outside), or B (beginning). A token is tagged as B if it marks the start of a chunk. Subsequent tokens within the chunk are tagged as I. All other tokens are tagged as O. The B and I tags are suffixed with the chunk type, for example, B-NP, I-NP. It is not necessary to specify a chunk type for tokens that appear outside a chunk, so these are labeled O.

Source: NLTK.org
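
A minimal sketch of moving between the two representations with nltk.chunk.tree2conlltags (the tagged sentence is illustrative); nltk.chunk.conlltags2tree converts in the opposite direction:

import nltk
from nltk.chunk import tree2conlltags

tagged = [("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD")]
tree = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}").parse(tagged)
print(tree2conlltags(tree))
# [('the', 'DT', 'B-NP'), ('little', 'JJ', 'I-NP'),
#  ('dog', 'NN', 'I-NP'), ('barked', 'VBD', 'O')]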

There are four types of chunkers:

  1. The regular-expression-based chunker: a simple chunker built from regular-expression rules over POS tags, for example to chunk noun phrases (NP) from a sample sentence.

import nltk

sentence = "The little yellow dog barked at the cat"  # illustrative sentence

tokens_sentence = nltk.word_tokenize(sentence)
tagged_sentence = nltk.pos_tag(tokens_sentence)

# NP rule: an optional determiner, any number of adjectives, then a noun
grammar = """NP: {<DT>?<JJ>*<NN>}
"""

cp = nltk.RegexpParser(grammar)
print(cp.parse(tagged_sentence))

2. The n-gram chunker

A better approach would be to create a “chunker” that labels sentences with “chunk tags” using a “unigram tagger”. The chunker’s job is to determine the correct chunk tag given each word’s part-of-speech tag.

To evaluate our chunker, we can use a chunked corpus such as the CoNLL 2000 corpus, and to improve our results we can try a bigram or trigram tagger.

import nltk
from nltk.corpus import conll2000

# CoNLL 2000 ships with ready-made train/test splits of chunked sentences
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])

class ChunkParser(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # Train on (POS tag, chunk tag) pairs extracted from the chunk trees
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.TrigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

NPChunker = ChunkParser(train_sents)
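
Following the NLTK book, the chunker can then be scored against the held-out test split; evaluate returns a ChunkScore reporting IOB accuracy, precision, recall, and F-measure:

# Score the trigram chunker on the CoNLL 2000 test set
print(NPChunker.evaluate(test_sents))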

3. The classifier-based tagger

Incorporating information about the content of words, in addition to part-of-speech tags, can help maximize chunking performance. Here’s an example: even though two phrases might have the same sequence of part-of-speech tags, they should still be chunked differently.

For instance:

  • “I gave/VBD the/DT boy/NN toys/NN”
  • “He broke/VBD the/DT computer/NN screen/NN”

Here, “the boy” and “toys” should be separate chunks, while “the computer screen” is a single chunk.

If you want to build a classifier-based chunker, you’ll need to define the feature extractor function. One example of a feature you can use is the part-of-speech tag of the current token. This will make it similar to the unigram chunker. You can also add a feature for the previous part-of-speech tag. By doing this, you allow the classifier to model interactions between adjacent tags. This will result in a chunker that is closely related to the bigram chunker.

import nltk

# Baseline feature extractor: just the POS tag of the current token
# (this makes the chunker behave like the unigram chunker)
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    return {"pos": pos}

# Extended feature extractor: add the previous POS tag, so the classifier
# can model interactions between adjacent tags (closer to a bigram chunker)
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    return {"pos": pos, "prevpos": prevpos}

class ConsecutiveNPChunkTagger(nltk.TaggerI):

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        # 'megam' requires the external MEGAM binary; NLTK's built-in
        # 'IIS' or 'GIS' algorithms avoid that dependency but train more slowly
        self.classifier = nltk.MaxentClassifier.train(
            train_set, algorithm='megam', trace=0)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

class ConsecutiveNPChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        tagged_sents = [[((w, t), c) for (w, t, c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)
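
A usage sketch, reusing the CoNLL 2000 splits loaded earlier:

# Train the classifier-based chunker and score it on the test split
chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))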

4. Cascaded chunkers

A cascaded chunker uses a multi-stage chunk grammar that contains recursive rules. It enables you to detect patterns for noun phrases, prepositional phrases, verb phrases, and sentences, and you can also define whatever patterns you want to detect.

grammar = r"""
Pattern: {<NN><VBZ>}
NP: {<DT|JJ|NN.*>+} # Chunk sequences of DT, JJ, NN
PP: {<IN><NP>} # Chunk prepositions followed by NP
VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
CLAUSE: {<NP><VP>} # Chunk NP, VP
Obj: {<VB><PRP$>?<NN>}

"""

Named entity recognition

Named entity recognition involves identifying and classifying named entities in text into predefined categories like people, organizations, and locations. Detecting named entities is often used to identify relations in information extraction, the process of automatically extracting structured information from unstructured or semi-structured data sources such as text documents. NLTK ships with a classifier that can recognize named entities, accessible through the function nltk.ne_chunk().

Entities = nltk.chunk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
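
A runnable sketch (the sentence is illustrative, and the maxent_ne_chunker and words models need a one-time download); it fills the Entities variable used by the relation-extraction code below:

import nltk

# One-time setup: nltk.download('maxent_ne_chunker') and nltk.download('words')
sentence = "Mark Pedersen is a director of Megacorp in Chicago."  # illustrative
Entities = nltk.chunk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
# Named entities appear as subtrees labelled PERSON, ORGANIZATION, GPE, ...
for subtree in Entities.subtrees(lambda t: t.label() != "S"):
    print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))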

Relation extraction

Relation extraction focuses on finding connections between specific types of named entities. To start, we can identify all triples that follow the structure (NE1, α, NE2): pairs of named entities (NE1 and NE2) and the string of words (α) that appears between them.


import re
from nltk.sem import relextract

# Match an intervening "in", but skip gerunds ending in "-ing"
IN = re.compile(r'.*\bin\b(?!\b.+ing\b)')
for rel in relextract.extract_rels('PER', 'ORG', Entities, corpus='ace', pattern=IN):
    print(relextract.rtuple(rel))


It is also possible to specify the pattern to be detected in the string of words (α), for instance, the role of a person in an organization.

roles = r"""
(.*(
analyst|
chair(wo)?man|
commissioner|
counsel|
director|
economist|
executive|
governor|
head|
lawyer|
leader|
librarian).*)|
manager|
partner|
president|
producer|
professor|
researcher|
spokes(wo)?man|
writer|
,\sof\sthe?\s* # "X, of (the) Y"
"""
ROLES = re.compile(roles, re.VERBOSE)


for rel in relextract.extract_rels('PER', 'ORG', Entities, corpus='ace', pattern=ROLES):
    print(relextract.rtuple(rel))

I hope you found this article helpful! Follow for more articles about natural language processing.
