
In this article by Richard M Reese, author of the book Natural Language Processing with Java, we will demonstrate the NER process using OpenNLP, the Stanford API, and LingPipe. Each provides an alternative technique that can often do a good job of identifying entities in text. The following declaration will serve as the sample text to demonstrate the APIs:

   String sentences[] = {"Joe was the last person to see Fred. ",
       "He saw him in Boston at McKenzie's pub at 3:00 where he paid "
       + "$2.45 for an ale. ",
       "Joe wanted to go to Vermont for the day to visit a cousin who "
       + "works at IBM, but Sally and he had to look for Fred"};

Using OpenNLP for NER

We will demonstrate the use of the TokenNameFinderModel class to perform NER using the OpenNLP API. In addition, we will demonstrate how to determine the probability that the entity identified is correct.

The general approach is to convert the text into a series of tokenized sentences, create an instance of the TokenNameFinderModel class using an appropriate model, and then use the find method to identify the entities in the text.

The next example demonstrates the use of the TokenNameFinderModel class. We will use a simple sentence initially and then use multiple sentences. The sentence is defined here:

   String sentence = "He was the last person to see Fred.";

We will use the models found in the en-token.bin and en-ner-person.bin files for the tokenizer and name finder models, respectively. An InputStream for each of these files is opened using a try-with-resources block as shown here:

   try (InputStream tokenStream = new FileInputStream(
           new File(getModelDir(), "en-token.bin"));
           InputStream modelStream = new FileInputStream(
                new File(getModelDir(), "en-ner-person.bin"))) {
       ...
 
   } catch (Exception ex) {
       // Handle exceptions
   }

Within the try block, the TokenizerModel and Tokenizer objects are created:

   TokenizerModel tokenModel = new TokenizerModel(tokenStream);
   Tokenizer tokenizer = new TokenizerME(tokenModel);

Next, an instance of the NameFinderME class is created using the person model:

   TokenNameFinderModel entityModel =
       new TokenNameFinderModel(modelStream);
   NameFinderME nameFinder = new NameFinderME(entityModel);

We can now use the tokenize method to tokenize the text and the find method to identify the person in the text. The find method takes the tokenized String array as input and returns an array of Span objects, as follows:

   String tokens[] = tokenizer.tokenize(sentence);
   Span nameSpans[] = nameFinder.find(tokens);

The following for statement displays each person found in the sentence. The positional information and the entity are displayed on separate lines:

   for (int i = 0; i < nameSpans.length; i++) {
       System.out.println("Span: " + nameSpans[i].toString());
       System.out.println("Entity: "
           + tokens[nameSpans[i].getStart()]);
   }

The output is as follows:

Span: [7..9) person
Entity: Fred

Often, we will work with multiple sentences. To demonstrate this, we will use the previously defined sentences string array. The previous for statement is replaced with the following sequence. The tokenize method is invoked against each sentence, and then the entity information is displayed as before:

   for (String sentence : sentences) {
       String tokens[] = tokenizer.tokenize(sentence);
       Span nameSpans[] = nameFinder.find(tokens);
       for (int i = 0; i < nameSpans.length; i++) {
           System.out.println("Span: " + nameSpans[i].toString());
           System.out.println("Entity: "
               + tokens[nameSpans[i].getStart()]);
       }
       System.out.println();
   }

The output is as follows. There are extra blank lines between the two groups of people detected because the second sentence did not contain a person.

Span: [0..1) person
Entity: Joe
Span: [7..9) person
Entity: Fred
 
 
Span: [0..1) person
Entity: Joe
Span: [19..20) person
Entity: Sally
Span: [26..27) person
Entity: Fred

Determining the accuracy of the entity

When TokenNameFinderModel identifies entities in text, it computes a probability for that entity. We can access this information using the probs method as shown next. The method returns an array of doubles, which corresponds to the elements of the nameSpans array:

   double[] spanProbs = nameFinder.probs(nameSpans);

Add this statement to the previous example immediately after the use of the find method. Then add the next statement at the end of the nested for statement:

   System.out.println("Probability: " + spanProbs[i]);

When the example is executed, you will get the following output. The probability fields reflect the confidence level of the entity assignment. For the first entity, the model is 80.529 percent confident that Joe is a person:

Span: [0..1) person
Entity: Joe
Probability: 0.8052914774025202
Span: [7..9) person
Entity: Fred
Probability: 0.9042160889302772
 
 
Span: [0..1) person
Entity: Joe
Probability: 0.9620970782763985
Span: [19..20) person
Entity: Sally
Probability: 0.964568603518126
Span: [26..27) person
Entity: Fred
Probability: 0.990383039618594

Using other entity types

OpenNLP supports models for different entity types, as listed in the following table. These models can be downloaded from http://opennlp.sourceforge.net/models-1.5/. The prefix en specifies English as the language, while ner indicates that the model is for NER.

English finder models             File name
Location name finder model        en-ner-location.bin
Money name finder model           en-ner-money.bin
Organization name finder model    en-ner-organization.bin
Percentage name finder model      en-ner-percentage.bin
Person name finder model          en-ner-person.bin
Time name finder model            en-ner-time.bin

If we modify the statement to use a different model file, we can see how the other models work against the sample sentences:

   InputStream modelStream = new FileInputStream(
       new File(getModelDir(), "en-ner-time.bin"));

When the en-ner-money.bin model is used, the index into the tokens array in the earlier code sequence has to be increased by one. Otherwise, all that is returned is the dollar sign.
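
For example, the display statement inside the loop might be adjusted as follows; the + 1 is the only change:

   // The money span starts at the "$" token, so we advance one token
   // to display the amount rather than the dollar sign
   System.out.println("Entity: "
       + tokens[nameSpans[i].getStart() + 1]);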

The various outputs are shown in the following table.

Model: en-ner-location.bin
Output:
Span: [4..5) location
Entity: Boston
Probability: 0.8656908776583051
Span: [5..6) location
Entity: Vermont
Probability: 0.9732488014011262

Model: en-ner-money.bin
Output:
Span: [14..16) money
Entity: 2.45
Probability: 0.7200919701507937

Model: en-ner-organization.bin
Output:
Span: [16..17) organization
Entity: IBM
Probability: 0.9256970736336729

Model: en-ner-time.bin
Output: The model was not able to detect time in this text sequence.

The model failed to find any time entities in the sample text because it did not have enough confidence in any candidate to report one.

Processing multiple entity types

We can also handle multiple entity types at the same time. This involves creating an instance of NameFinderME for each model within a loop, applying each model against each sentence, and keeping track of the entities as they are found.

We will illustrate this process with the next example. It requires rewriting the previous try block to create the InputStream within the block as shown here:

   try {
       InputStream tokenStream = new FileInputStream(
           new File(getModelDir(), "en-token.bin"));
       TokenizerModel tokenModel = new TokenizerModel(tokenStream);
       Tokenizer tokenizer = new TokenizerME(tokenModel);
       ...
   } catch (Exception ex) {
       // Handle exceptions
   }

Within the try block, we will define a string array to hold the names of the model files. As shown here, we will use models for people, locations, and organizations:

   String modelNames[] = {"en-ner-person.bin",
       "en-ner-location.bin", "en-ner-organization.bin"};

An ArrayList instance is created to hold the entities as they are discovered:

   ArrayList<String> list = new ArrayList<String>();

A for-each statement is used to load one model at a time and then to create an instance of the NameFinderME class:

   for(String name : modelNames) {
       TokenNameFinderModel entityModel = new TokenNameFinderModel(
           new FileInputStream(new File(getModelDir(), name)));
       NameFinderME nameFinder = new NameFinderME(entityModel);
       ...
   }

Previously, we did not try to identify which sentences the entities were found in. This is not hard to do, but we need to use a simple for statement instead of a for-each statement to keep track of the sentence indexes. This is shown next, where the previous example has been modified to use the integer variable index to track the sentence. Otherwise, the code works the same way as before:

   for (int index = 0; index < sentences.length; index++) {
       String tokens[] = tokenizer.tokenize(sentences[index]);
       Span nameSpans[] = nameFinder.find(tokens);
       for(Span span : nameSpans) {
           list.add("Sentence: " + index
               + " Span: " + span.toString() + " Entity: "
               + tokens[span.getStart()]);
       }
   }

The entities discovered are then displayed:

   for(String element : list) {
       System.out.println(element);
   }

The output is as follows:

Sentence: 0 Span: [0..1) person Entity: Joe
Sentence: 0 Span: [7..9) person Entity: Fred
Sentence: 2 Span: [0..1) person Entity: Joe
Sentence: 2 Span: [19..20) person Entity: Sally
Sentence: 2 Span: [26..27) person Entity: Fred
Sentence: 1 Span: [4..5) location Entity: Boston
Sentence: 2 Span: [5..6) location Entity: Vermont
Sentence: 2 Span: [16..17) organization Entity: IBM

Using the Stanford API for NER

We will demonstrate how the CRFClassifier class is used to perform NER. This class implements what is known as a linear chain Conditional Random Field (CRF) sequence model.

To demonstrate the use of the CRFClassifier class, we will start with a declaration of the classifier file string, as shown here:

   String model = getModelDir() +
       "\\english.conll.4class.distsim.crf.ser.gz";

The classifier is then created using the model:

   CRFClassifier<CoreLabel> classifier =
       CRFClassifier.getClassifierNoExceptions(model);

The classify method takes a single string representing the text to be processed. To use the sentences text, we need to convert it to a simple string:

   String sentence = "";
   for (String element : sentences) {
       sentence += element;
   }

The classify method is then applied to the text:

   List<List<CoreLabel>> entityList = classifier.classify(sentence);

The classify method returns a list of lists. Each contained list is a list of CoreLabel objects and represents one sentence of the text; the CoreLabel class represents a word with additional information attached to it. In the outer for-each statement in the next code sequence, the internalList variable holds one sentence. In the inner for-each statement, each word in that sentence is displayed: the word method returns the word, and the get method returns the word's category.

The words and their types are then displayed:

   for (List<CoreLabel> internalList: entityList) {
       for (CoreLabel coreLabel : internalList) {
           String word = coreLabel.word();
           String category = coreLabel.get(
               CoreAnnotations.AnswerAnnotation.class);
           System.out.println(word + ":" + category);
       }
   }

Part of the output is as follows. It has been truncated because every word is displayed. The O represents the Other category:

Joe:PERSON
was:O
the:O
last:O
person:O
to:O
see:O
Fred:PERSON
.:O
He:O
...
look:O
for:O
Fred:PERSON

To filter out the words that are not relevant, replace the println statement with the following statements. This will eliminate the Other category:

   if (!"O".equals(category)) {
       System.out.println(word + ":" + category);
   }

The output is simpler now:

Joe:PERSON
Fred:PERSON
Boston:LOCATION
McKenzie:PERSON
Joe:PERSON
Vermont:LOCATION
IBM:ORGANIZATION
Sally:PERSON
Fred:PERSON

Using LingPipe for NER

Here, we will demonstrate how named entity models and the ExactDictionaryChunker class are used to perform NER analysis.

Using LingPipe’s named entity models

LingPipe has a few named entity models that we can use with chunking. Each of these files contains a serialized object that can be read from a file and then applied to text. These objects implement the Chunker interface. The chunking process produces a series of Chunking objects that identify the entities of interest.

A list of the NER models is found in the following table. These models can be downloaded from http://alias-i.com/lingpipe/web/models.html.

Genre              Corpus     File
English News       MUC-6      ne-en-news-muc6.AbstractCharLmRescoringChunker
English Genes      GeneTag    ne-en-bio-genetag.HmmChunker
English Genomics   GENIA      ne-en-bio-genia.TokenShapeChunker

We will use the model found in the file, ne-en-news-muc6.AbstractCharLmRescoringChunker, to demonstrate how this class is used. We start with a try-catch block to deal with exceptions as shown next. The file is opened and used with the AbstractExternalizable class’ static readObject method to create an instance of a Chunker class. This method will read in the serialized model:

   try {
       File modelFile = new File(getModelDir(),
           "ne-en-news-muc6.AbstractCharLmRescoringChunker");
       Chunker chunker = (Chunker)
           AbstractExternalizable.readObject(modelFile);
       ...
   } catch (IOException | ClassNotFoundException ex) {
       // Handle exception
   }

The Chunker and Chunking interfaces provide methods that work with sets of chunks of text. The chunk method of Chunker returns an object that implements the Chunking interface. The following sequence displays the chunks found in each sentence of the text:

   for (int i = 0; i < sentences.length; ++i) {
       Chunking chunking = chunker.chunk(sentences[i]);
       System.out.println("Chunking=" + chunking);
   }

The output of this sequence is as follows:

Chunking=Joe was the last person to see Fred. : [0-3:PERSON@-Infinity, 31-35:ORGANIZATION@-Infinity]
Chunking=He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale. : [14-20:LOCATION@-Infinity, 24-32:PERSON@-Infinity]
Chunking=Joe wanted to go to Vermont for the day to visit a cousin who works at IBM, but Sally and he had to look for Fred : [0-3:PERSON@-Infinity, 20-27:ORGANIZATION@-Infinity, 71-74:ORGANIZATION@-Infinity, 109-113:ORGANIZATION@-Infinity]

Instead, we can use methods of the Chunk class to extract specific pieces of information, as illustrated next. We will replace the previous for statement with the following for-each statement, which calls a displayChunkSet method (sketched after the loop):

   for (String sentence : sentences) {
       displayChunkSet(chunker, sentence);
   }
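
The displayChunkSet method itself is not listed in this excerpt. A minimal sketch, assuming LingPipe's standard Chunking and Chunk accessors (chunkSet, type, start, end, and score), might look like this:

   private static void displayChunkSet(Chunker chunker, String text) {
       Chunking chunking = chunker.chunk(text);
       Set<Chunk> chunkSet = chunking.chunkSet();
       for (Chunk chunk : chunkSet) {
           // Display the entity type, the matched text, and the score
           System.out.println("Type: " + chunk.type() + " Entity: ["
               + text.substring(chunk.start(), chunk.end())
               + "] Score: " + chunk.score());
       }
   }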

The output that follows shows the result. However, the chunker did not always assign the correct entity type:

Type: PERSON Entity: [Joe] Score: -Infinity
Type: ORGANIZATION Entity: [Fred] Score: -Infinity
Type: LOCATION Entity: [Boston] Score: -Infinity
Type: PERSON Entity: [McKenzie] Score: -Infinity
Type: PERSON Entity: [Joe] Score: -Infinity
Type: ORGANIZATION Entity: [Vermont] Score: -Infinity
Type: ORGANIZATION Entity: [IBM] Score: -Infinity
Type: ORGANIZATION Entity: [Fred] Score: -Infinity

Using the ExactDictionaryChunker class

The ExactDictionaryChunker class provides an easy way to create a dictionary of entities and their types, which can later be used to find those entities in text. A MapDictionary object stores the entries, and the ExactDictionaryChunker class is then used to extract chunks based on that dictionary.

The AbstractDictionary interface supports basic operations for entities, categories, and scores. The score is used in the matching process. The MapDictionary and TrieDictionary classes implement the AbstractDictionary interface. The TrieDictionary class stores information using a character trie structure, which uses less memory; this can be useful when memory is a concern. We will use the MapDictionary class for our example.
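
Should memory become an issue, swapping in TrieDictionary requires only changing the declaration and construction; a minimal sketch of that alternative is shown here, with the addEntry calls left unchanged:

   // Trie-based alternative to MapDictionary; entries are added with
   // the same addEntry calls used in initializeDictionary below
   private static TrieDictionary<String> dictionary =
       new TrieDictionary<String>();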

To illustrate this approach we start with a declaration of the MapDictionary:

   private static MapDictionary<String> dictionary;

The dictionary will contain the entities that we are interested in finding. We need to initialize the dictionary, as performed in the following initializeDictionary method. The DictionaryEntry constructor used here accepts three arguments:

  • String: It gives the name of the entity
  • String: It gives the category of the entity
  • Double: It represents a score for the entity

The score is used when determining matches. A few entities are declared and added to the dictionary:

   private static void initializeDictionary() {
       dictionary = new MapDictionary<String>();
       dictionary.addEntry(
           new DictionaryEntry<String>("Joe","PERSON",1.0));
       dictionary.addEntry(
           new DictionaryEntry<String>("Fred","PERSON",1.0));
       dictionary.addEntry(
           new DictionaryEntry<String>("Boston","PLACE",1.0));
       dictionary.addEntry(
           new DictionaryEntry<String>("pub","PLACE",1.0));
       dictionary.addEntry(
           new DictionaryEntry<String>("Vermont","PLACE",1.0));
       dictionary.addEntry(
           new DictionaryEntry<String>("IBM","ORGANIZATION",1.0));
       dictionary.addEntry(
           new DictionaryEntry<String>("Sally","PERSON",1.0));
   }

An ExactDictionaryChunker instance will use this dictionary. The arguments of the ExactDictionaryChunker constructor are detailed here:

  • Dictionary<String>: It is a dictionary containing the entities
  • TokenizerFactory: It is a tokenizer used by the chunker
  • boolean: If true, the chunker should return all matches
  • boolean: If true, the matches are case sensitive

Matches can be overlapping. For example, in the phrase, “The First National Bank”, the entity “bank” could be used by itself or in conjunction with the rest of the phrase. The third parameter determines if all of the matches are returned.
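
A hypothetical illustration (these entries are not part of the running example): with both of the entries below in the dictionary and the third argument set to true, the text "The First National Bank" would yield both the full phrase and the embedded "Bank" match; set to false, only non-overlapping matches are returned:

   // Hypothetical entries showing how matches can overlap
   dictionary.addEntry(new DictionaryEntry<String>(
       "First National Bank", "ORGANIZATION", 1.0));
   dictionary.addEntry(new DictionaryEntry<String>(
       "Bank", "ORGANIZATION", 1.0));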

In the following sequence, the dictionary is initialized. We then create an instance of the ExactDictionaryChunker class using the Indo-European tokenizer where we return all matches and ignore the case of the tokens:

   initializeDictionary();
   ExactDictionaryChunker dictionaryChunker
       = new ExactDictionaryChunker(dictionary,
           IndoEuropeanTokenizerFactory.INSTANCE, true, false);

The dictionaryChunker object is used with each sentence as shown next. We will reuse the displayChunkSet method:

   for (String sentence : sentences) {
       System.out.println("nTEXT=" + sentence);
       displayChunkSet(dictionaryChunker, sentence);
   }

When executed, we get the following output:

TEXT=Joe was the last person to see Fred. 
Type: PERSON Entity: [Joe] Score: 1.0
Type: PERSON Entity: [Fred] Score: 1.0
 
TEXT=He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale. 
Type: PLACE Entity: [Boston] Score: 1.0
Type: PLACE Entity: [pub] Score: 1.0
 
TEXT=Joe wanted to go to Vermont for the day to visit a cousin who works at IBM, but Sally and he had to look for Fred
Type: PERSON Entity: [Joe] Score: 1.0
Type: PLACE Entity: [Vermont] Score: 1.0
Type: ORGANIZATION Entity: [IBM] Score: 1.0
Type: PERSON Entity: [Sally] Score: 1.0
Type: PERSON Entity: [Fred] Score: 1.0

This approach does a pretty good job, but creating the dictionary requires a lot of effort for a large vocabulary.

Training a model

We will use OpenNLP to demonstrate how a model is trained. The training file used must have the following:

  • Marks to demarcate the entities
  • One sentence per line

We will use the following training file, named en-ner-person.train:

<START:person> Joe <END> was the last person to see <START:person> Fred <END>. 
He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale. 
<START:person> Joe <END> wanted to go to Vermont for the day to visit a cousin 
who works at IBM, but <START:person> Sally <END> and he had to look for <START:person> Fred <END>.

Several methods in this example are capable of throwing exceptions. These statements will be placed in a try-with-resources block, as shown here, where the model’s output stream is created:

   try (OutputStream modelOutputStream = new BufferedOutputStream(
           new FileOutputStream(new File("modelFile")))) {
       ...
   } catch (IOException ex) {
       // Handle exception
   }

Within the block, we create an ObjectStream<String> object using the PlainTextByLineStream class. This class’ constructor takes a FileInputStream and returns each line as a String object. The en-ner-person.train file is used as the input file, as shown here. UTF-8 refers to the encoding used:

   ObjectStream<String> lineStream = new PlainTextByLineStream(
       new FileInputStream("en-ner-person.train"), "UTF-8");

The lineStream object contains strings that are annotated with tags delineating the entities in the text. These need to be converted to NameSample objects so that the model can be trained. This conversion is performed by the NameSampleDataStream class, as shown next. A NameSample object holds the names of the entities found in the text:

   ObjectStream<NameSample> sampleStream =
       new NameSampleDataStream(lineStream);

The train method can now be executed as shown next:

   TokenNameFinderModel model = NameFinderME.train(
       "en", "person", sampleStream,
       Collections.<String, Object>emptyMap(), 100, 5);

The arguments of the method are as detailed in the following table.

Parameter                                 Meaning
"en"                                      Language code
"person"                                  Entity type
sampleStream                              Sample data
Collections.<String, Object>emptyMap()    Resources (none are used here)
100                                       The number of iterations
5                                         The cutoff

The model is then serialized to the file:

   model.serialize(modelOutputStream);

The output of this sequence is as follows. It has been shortened to conserve space. Basic information about the model creation is detailed:

Indexing events using cutoff of 5
 
   Computing event counts... done. 53 events
   Indexing... done.
Sorting and merging events... done. Reduced 53 events to 46.
Done indexing.
Incorporating indexed data for training... 
   Number of Event Tokens: 46
    Number of Outcomes: 2
    Number of Predicates: 34
...done.
Computing model parameters ...
Performing 100 iterations.
 1: ... loglikelihood=-36.73680056967707 0.05660377358490566
 2: ... loglikelihood=-17.499660626361216 0.9433962264150944
 3: ... loglikelihood=-13.216835449617108 0.9433962264150944
 4: ... loglikelihood=-11.461783667999262 0.9433962264150944
 5: ... loglikelihood=-10.380239416084963 0.9433962264150944
 6: ... loglikelihood=-9.570622475692486 0.9433962264150944
 7: ... loglikelihood=-8.919945779143012 0.9433962264150944
...
 99: ... loglikelihood=-3.513810438211968 0.9622641509433962
100: ... loglikelihood=-3.507213816708068 0.9622641509433962

Evaluating the model

The model can be evaluated using the TokenNameFinderEvaluator class. The evaluation process uses marked-up sample text. For this simple example, a file called en-ner-person.eval was created, containing the following text:

<START:person> Bill <END> went to the farm to see <START:person> Sally <END>. 
Unable to find <START:person> Sally <END> he went to town.
There he saw <START:person> Fred <END> who had seen <START:person> 
Sally <END> at the book store with <START:person> Mary <END>.

The following code is used to perform the evaluation. A NameFinderME instance created from the previous model is used as the argument of the TokenNameFinderEvaluator constructor. A NameSampleDataStream instance is created based on the evaluation file. The TokenNameFinderEvaluator class’ evaluate method performs the evaluation:

   TokenNameFinderEvaluator evaluator =
       new TokenNameFinderEvaluator(new NameFinderME(model));  
   lineStream = new PlainTextByLineStream(
       new FileInputStream("en-ner-person.eval"), "UTF-8");
   sampleStream = new NameSampleDataStream(lineStream);
   evaluator.evaluate(sampleStream);

To determine how well the model worked with the evaluation data, the getFMeasure method is executed. The results are then displayed:

   FMeasure result = evaluator.getFMeasure();
   System.out.println(result.toString());

The following output displays the precision, recall, and F-measure. The precision indicates that 50 percent of the entities found exactly match the evaluation data. The recall is the percentage of entities defined in the corpus that were found in the same location. The performance measure is the harmonic mean of the two, defined as F1 = 2 * Precision * Recall / (Recall + Precision); here, that is 2 * 0.5 * 0.25 / 0.75 ≈ 0.33:

Precision: 0.5
Recall: 0.25
F-Measure: 0.3333333333333333

The data and evaluation sets should be much larger to create a better model. The intent here was to demonstrate the basic approach used to train and evaluate an NER model.

Summary

NER involves detecting entities and then classifying them. Common categories include names, locations, and things. This is an important task that many applications use to support searching, resolving references, and finding meaning in text. The process is frequently used in downstream tasks.

We investigated several techniques for performing NER. Regular expressions are one approach, supported both by core Java classes and by NLP APIs. This technique is useful for many applications, and there are a large number of regular expression libraries available.

Dictionary-based approaches are also possible and work well for some applications. However, they can require considerable effort to populate. We used LingPipe’s MapDictionary class to illustrate this approach.

Trained models can also be used to perform NER. We examined several of these and demonstrated how to train a model using the OpenNLP NameFinderME class.
