
In this article by Richard M Reese, author of the book Natural Language Processing with Java, we will see how to use NLP APIs.

Using NLP APIs

We will demonstrate the NER process using OpenNLP, the Stanford API, and LingPipe. Each of these provides alternative techniques that can often do a good job of identifying entities in text. The following declaration will serve as the sample text to demonstrate the APIs:

String sentences[] = {"Joe was the last person to see Fred. ",
"He saw him in Boston at McKenzie's pub at 3:00 where he "
+ " paid $2.45 for an ale. ",
"Joe wanted to go to Vermont for the day to visit a cousin who "
+ "works at IBM, but Sally and he had to look for Fred"};

Using OpenNLP for NER

We will demonstrate the use of the TokenNameFinderModel class to perform NER using the OpenNLP API. Additionally, we will demonstrate how to determine the probability that the entity identified is correct.

The general approach is to convert the text into a series of tokenized sentences, create an instance of the TokenNameFinderModel class using an appropriate model, and then use the find method to identify the entities in the text.


The following example demonstrates the use of the TokenNameFinderModel class. We will use a simple sentence initially and then use multiple sentences. The sentence is defined here:

String sentence = "He was the last person to see Fred.";

We will use the models found in the en-token.bin and en-ner-person.bin files for the tokenizer and name finder models, respectively. The InputStream object for these files is opened using a try-with-resources block, as shown here:

try (InputStream tokenStream = new FileInputStream(
       new File(getModelDir(), "en-token.bin"));
       InputStream modelStream = new FileInputStream(
           new File(getModelDir(), "en-ner-person.bin"));) {
   ...
 
} catch (Exception ex) {
   // Handle exceptions
}
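The getModelDir method used throughout these examples is not part of OpenNLP; it is a small helper that returns the directory containing the downloaded model files. A minimal sketch, where the "models" subdirectory and the model.dir property name are assumptions for illustration:

```java
import java.io.File;

public class ModelLocator {
    // Hypothetical helper: resolves the directory holding the .bin model
    // files, defaulting to a "models" subdirectory of the working directory.
    // The model.dir system property is an illustrative override mechanism.
    public static File getModelDir() {
        return new File(System.getProperty("model.dir", "models"));
    }
}
```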

Within the try block, the TokenizerModel and Tokenizer objects are created:

   TokenizerModel tokenModel = new TokenizerModel(tokenStream);
   Tokenizer tokenizer = new TokenizerME(tokenModel);

Next, an instance of the NameFinderME class is created using the person model:

TokenNameFinderModel entityModel =
   new TokenNameFinderModel(modelStream);
NameFinderME nameFinder = new NameFinderME(entityModel);

We can now use the tokenize method to tokenize the text and the find method to identify the person in the text. The find method will use the tokenized String array as input and return an array of Span objects, as shown:

String tokens[] = tokenizer.tokenize(sentence);
Span nameSpans[] = nameFinder.find(tokens);

The Span class holds positional information about the entities found. The actual string entities are still in the tokens array.

The following for statement displays each person found in the sentence. The positional information and the person are displayed on separate lines:

for (int i = 0; i < nameSpans.length; i++) {
   System.out.println("Span: " + nameSpans[i].toString());
   System.out.println("Entity: "
       + tokens[nameSpans[i].getStart()]);
}

The output is as follows:

Span: [7..9) person
Entity: Fred

We will often work with multiple sentences. To demonstrate this, we will use the previously defined sentences string array. The previous for statement is replaced with the following sequence. The tokenize method is invoked against each sentence and then the entity information is displayed as earlier:

for (String sentence : sentences) {
   String tokens[] = tokenizer.tokenize(sentence);
   Span nameSpans[] = nameFinder.find(tokens);
   for (int i = 0; i < nameSpans.length; i++) {
       System.out.println("Span: " + nameSpans[i].toString());
       System.out.println("Entity: "
           + tokens[nameSpans[i].getStart()]);
   }
   System.out.println();
}

The output is as follows. There is an extra blank line between the two people detected because the second sentence did not contain a person:

Span: [0..1) person
Entity: Joe
Span: [7..9) person
Entity: Fred
 
 
Span: [0..1) person
Entity: Joe
Span: [19..20) person
Entity: Sally
Span: [26..27) person
Entity: Fred

Determining the accuracy of the entity

When the TokenNameFinderModel identifies entities in text, it computes a probability for that entity. We can access this information using the probs method as shown in the following line of code. This method returns an array of doubles, which corresponds to the elements of the nameSpans array:

double[] spanProbs = nameFinder.probs(nameSpans);

Add this statement to the previous example immediately after the use of the find method. Then add the next statement at the end of the nested for statement:

System.out.println("Probability: " + spanProbs[i]);

When the example is executed, you will get the following output. The probability fields reflect the confidence level of the entity assignment. For the first entity, the model is 80.529 percent confident that “Joe” is a person:

Span: [0..1) person
Entity: Joe
Probability: 0.8052914774025202
Span: [7..9) person
Entity: Fred
Probability: 0.9042160889302772
 
Span: [0..1) person
Entity: Joe
Probability: 0.9620970782763985
Span: [19..20) person
Entity: Sally
Probability: 0.964568603518126
Span: [26..27) person
Entity: Fred
Probability: 0.990383039618594

Using other entity types

OpenNLP supports a number of pretrained models, as listed in the following table. These models can be downloaded from http://opennlp.sourceforge.net/models-1.5/. The prefix, en, specifies English as the language, and ner indicates that the model is for NER.

English finder models Filename
Location name finder model en-ner-location.bin
Money name finder model en-ner-money.bin
Organization name finder model en-ner-organization.bin
Percentage name finder model en-ner-percentage.bin
Person name finder model en-ner-person.bin
Time name finder model en-ner-time.bin

If we modify the statement to use a different model file, we can see how they work against the sample sentences:

InputStream modelStream = new FileInputStream(
   new File(getModelDir(), "en-ner-time.bin"));

When the en-ner-money.bin model is used, the index in the tokens array in the earlier code sequence has to be increased by one. Otherwise, all that is returned is the dollar sign.
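The effect of that off-by-one adjustment can be seen with a toy token array. The tokens and span start below are illustrative, not actual model output:

```java
public class MoneyTokenDemo {
    public static void main(String[] args) {
        // The tokenizer splits "$2.45" into two tokens: "$" and "2.45"
        String tokens[] = {"he", "paid", "$", "2.45", "for", "an", "ale"};
        int start = 2; // hypothetical span start for the money entity
        // tokens[start] alone would print only the dollar sign
        System.out.println("Entity: " + tokens[start + 1]); // prints "Entity: 2.45"
    }
}
```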

The various outputs are shown in the following table.

Model Output
en-ner-location.bin Span: [4..5) location

Entity: Boston

Probability: 0.8656908776583051

Span: [5..6) location

Entity: Vermont

Probability: 0.9732488014011262

en-ner-money.bin Span: [14..16) money

Entity: 2.45

Probability: 0.7200919701507937

en-ner-organization.bin Span: [16..17) organization

Entity: IBM

Probability: 0.9256970736336729

en-ner-time.bin The model was not able to detect time in this text sequence

The model failed to find the time entities in the sample text. This illustrates that the model did not have enough confidence that it found any time entities in the text.

Processing multiple entity types

We can also handle multiple entity types at the same time. This involves creating instances of the NameFinderME class based on each model within a loop and applying the model against each sentence, keeping track of the entities as they are found.

We will illustrate this process with the following example. It requires rewriting the previous try block to create the InputStream instance within the block, as shown here:

try {
   InputStream tokenStream = new FileInputStream(
       new File(getModelDir(), "en-token.bin"));
   TokenizerModel tokenModel = new TokenizerModel(tokenStream);
   Tokenizer tokenizer = new TokenizerME(tokenModel);
   ...
} catch (Exception ex) {
   // Handle exceptions
}

Within the try block, we will define a string array to hold the names of the model files. As shown here, we will use models for people, locations, and organizations:

String modelNames[] = {"en-ner-person.bin",
   "en-ner-location.bin", "en-ner-organization.bin"};

An ArrayList instance is created to hold the entities as they are discovered:

ArrayList<String> list = new ArrayList<>();

A for-each statement is used to load one model at a time and then to create an instance of the NameFinderME class:

for(String name : modelNames) {
   TokenNameFinderModel entityModel = new TokenNameFinderModel(
       new FileInputStream(new File(getModelDir(), name)));
   NameFinderME nameFinder = new NameFinderME(entityModel);
   ...
}

Previously, we did not try to identify which sentences the entities were found in. This is not hard to do, but we need to use a simple for statement instead of a for-each statement to keep track of the sentence indexes. This is shown in the following example, where the previous example has been modified to use the integer variable index to keep track of the sentences. Otherwise, the code works the same way as earlier:

for (int index = 0; index < sentences.length; index++) {
   String tokens[] = tokenizer.tokenize(sentences[index]);
   Span nameSpans[] = nameFinder.find(tokens);
   for (Span span : nameSpans) {
       list.add("Sentence: " + index
           + " Span: " + span.toString()
           + " Entity: " + tokens[span.getStart()]);
   }
}

The entities discovered are then displayed:

for(String element : list) {
   System.out.println(element);
}

The output is as follows:

Sentence: 0 Span: [0..1) person Entity: Joe
Sentence: 0 Span: [7..9) person Entity: Fred
Sentence: 2 Span: [0..1) person Entity: Joe
Sentence: 2 Span: [19..20) person Entity: Sally
Sentence: 2 Span: [26..27) person Entity: Fred
Sentence: 1 Span: [4..5) location Entity: Boston
Sentence: 2 Span: [5..6) location Entity: Vermont
Sentence: 2 Span: [16..17) organization Entity: IBM

Using the Stanford API for NER

We will demonstrate the CRFClassifier class as used to perform NER. This class implements what is known as a linear chain Conditional Random Field (CRF) sequence model.

To demonstrate the use of the CRFClassifier class, we will start with a declaration of the classifier file string, as shown here:

String model = getModelDir() +
   "/english.conll.4class.distsim.crf.ser.gz";

The classifier is then created using the model:

CRFClassifier<CoreLabel> classifier =
   CRFClassifier.getClassifierNoExceptions(model);

The classify method takes a single string representing the text to be processed. To use the sentences text, we need to convert it to a simple string:

String sentence = "";
for (String element : sentences) {
   sentence += element;
}

The classify method is then applied to the text:

List<List<CoreLabel>> entityList = classifier.classify(sentence);

The classify method returns a List of List instances of CoreLabel objects: the outer list holds one inner list per sentence, and each inner list holds the CoreLabel objects for the words of that sentence. The CoreLabel class represents a word with additional information attached to it. In the outer for-each statement in the following code sequence, the reference variable internalList represents one sentence of the text. In the inner for-each statement, each word in that inner list is displayed. The word method returns the word and the get method returns the word's category.

The words and their types are then displayed:

for (List<CoreLabel> internalList : entityList) {
   for (CoreLabel coreLabel : internalList) {
       String word = coreLabel.word();
       String category = coreLabel.get(
           CoreAnnotations.AnswerAnnotation.class);
       System.out.println(word + ":" + category);
   }
}

Part of the output follows. It has been truncated because every word is displayed. The O represents the “Other” category:

Joe:PERSON
was:O
the:O
last:O
person:O
to:O
see:O
Fred:PERSON
.:O
He:O
...
look:O
for:O
Fred:PERSON

To filter out the words that are not relevant, replace the println statement with the following statements. This will eliminate the other categories:

if (!"O".equals(category)) {
   System.out.println(word + ":" + category);
}

The output is simpler now:

Joe:PERSON
Fred:PERSON
Boston:LOCATION
McKenzie:PERSON
Joe:PERSON
Vermont:LOCATION
IBM:ORGANIZATION
Sally:PERSON
Fred:PERSON

Using LingPipe for NER

We will demonstrate how named entity models and the ExactDictionaryChunker class are used to perform NER analysis.

Using LingPipe's named entity models

LingPipe has a few named entity models that we can use with chunking. These files consist of a serialized object that can be read from a file and then applied to text. These objects implement the Chunker interface. The chunking process results in a series of Chunking objects that identify the entities of interest.

A list of the NER models is found in the following table. These models can be downloaded from http://alias-i.com/lingpipe/web/models.html:

Genre Corpus File
English News MUC-6 ne-en-news-muc6.AbstractCharLmRescoringChunker
English Genes GeneTag ne-en-bio-genetag.HmmChunker
English Genomics GENIA ne-en-bio-genia.TokenShapeChunker

We will use the model found in the ne-en-news-muc6.AbstractCharLmRescoringChunker file to demonstrate how this class is used. We start with a try-catch block to deal with exceptions as shown in the following example. The file is opened and used with the AbstractExternalizable class’ static readObject method to create an instance of a Chunker class. This method will read in the serialized model:

try {
   File modelFile = new File(getModelDir(),
       "ne-en-news-muc6.AbstractCharLmRescoringChunker");
     Chunker chunker = (Chunker)
       AbstractExternalizable.readObject(modelFile);
   ...
} catch (IOException | ClassNotFoundException ex) {
   // Handle exception
}

The Chunker and Chunking interfaces provide methods that work with a set of chunks of text. The chunk method returns an object that implements the Chunking interface. The following sequence displays the chunks found in each sentence of the text:

for (int i = 0; i < sentences.length; i++) {
   Chunking chunking = chunker.chunk(sentences[i]);
   System.out.println("Chunking=" + chunking);
}

The output of this sequence is as follows:

Chunking=Joe was the last person to see Fred.  : [0-3:PERSON@-Infinity, 31-35:ORGANIZATION@-Infinity]
Chunking=He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale.  : [14-20:LOCATION@-Infinity, 24-32:PERSON@-Infinity]
Chunking=Joe wanted to go to Vermont for the day to visit a cousin who works at IBM, but Sally and he had to look for Fred : [0-3:PERSON@-Infinity, 20-27:ORGANIZATION@-Infinity, 71-74:ORGANIZATION@-Infinity, 109-113:ORGANIZATION@-Infinity]

Instead, we can use methods of the Chunk class to extract specific pieces of information as illustrated here. We will replace the previous for statement with the following for-each statement. This calls a displayChunkSet method:

for (String sentence : sentences) {
   displayChunkSet(chunker, sentence);
}
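The displayChunkSet method invoked here, and again in the dictionary example later, is a helper rather than part of LingPipe. A minimal sketch, assuming the output format shown below, iterates over the chunk set and prints each chunk's type, covered text, and score:

```java
// A sketch of the displayChunkSet helper assumed by these examples. It
// chunks the text, then prints each chunk's type, covered text, and score.
public static void displayChunkSet(Chunker chunker, String text) {
    Chunking chunking = chunker.chunk(text);
    for (Chunk chunk : chunking.chunkSet()) {
        System.out.println("Type: " + chunk.type() + " Entity: ["
            + text.substring(chunk.start(), chunk.end())
            + "] Score: " + chunk.score());
    }
}
```

The Chunker, Chunking, and Chunk types used here come from LingPipe's com.aliasi.chunk package.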

The output that follows shows the result. However, it does not always match the entity type correctly.

Type: PERSON Entity: [Joe] Score: -Infinity
Type: ORGANIZATION Entity: [Fred] Score: -Infinity
Type: LOCATION Entity: [Boston] Score: -Infinity
Type: PERSON Entity: [McKenzie] Score: -Infinity
Type: PERSON Entity: [Joe] Score: -Infinity
Type: ORGANIZATION Entity: [Vermont] Score: -Infinity
Type: ORGANIZATION Entity: [IBM] Score: -Infinity
Type: ORGANIZATION Entity: [Fred] Score: -Infinity

Using the ExactDictionaryChunker class

The ExactDictionaryChunker class provides an easy way to create a dictionary of entities and their types, which can be used to find them later in text. A MapDictionary object stores the entries, and the ExactDictionaryChunker instance extracts chunks based on that dictionary.

The AbstractDictionary interface supports basic operations for entities, categories, and scores. The score is used in the matching process. The MapDictionary and TrieDictionary classes implement the AbstractDictionary interface. The TrieDictionary class stores information using a character trie structure, which uses less memory; it is a good choice when memory is a concern. We will use the MapDictionary class for our example.

To illustrate this approach, we start with a declaration of the MapDictionary class:

private static MapDictionary<String> dictionary;

The dictionary will contain the entities that we are interested in finding. We need to initialize it, as performed in the following initializeDictionary method. The DictionaryEntry constructor used here accepts three arguments:

  • String: The name of the entity
  • String: The category of the entity
  • Double: A score for the entity

The score is used when determining matches. A few entities are declared and added to the dictionary:

private static void initializeDictionary() {
   dictionary = new MapDictionary<String>();
   dictionary.addEntry(
       new DictionaryEntry<String>("Joe","PERSON",1.0));
   dictionary.addEntry(
       new DictionaryEntry<String>("Fred","PERSON",1.0));
   dictionary.addEntry(
       new DictionaryEntry<String>("Boston","PLACE",1.0));
   dictionary.addEntry(
       new DictionaryEntry<String>("pub","PLACE",1.0));
   dictionary.addEntry(
       new DictionaryEntry<String>("Vermont","PLACE",1.0));
   dictionary.addEntry(
       new DictionaryEntry<String>("IBM","ORGANIZATION",1.0));
   dictionary.addEntry(
       new DictionaryEntry<String>("Sally","PERSON",1.0));
}

An ExactDictionaryChunker instance will use this dictionary. The arguments of the ExactDictionaryChunker class are detailed here:

  • Dictionary: It is a dictionary containing the entities
  • TokenizerFactory: It is a tokenizer used by the chunker
  • boolean: If it is true, the chunker should return all matches
  • boolean: If it is true, matches are case sensitive

Matches can be overlapping. For example, in the phrase “The First National Bank”, the entity “bank” could be used by itself or in conjunction with the rest of the phrase. The third parameter determines if all of the matches are returned.

In the following sequence, the dictionary is initialized. We then create an instance of the ExactDictionaryChunker class using the Indo-European tokenizer, where we return all matches and ignore the case of the tokens:

initializeDictionary();
ExactDictionaryChunker dictionaryChunker
   = new ExactDictionaryChunker(dictionary,
       IndoEuropeanTokenizerFactory.INSTANCE, true, false);

The dictionaryChunker object is used with each sentence, as shown in the following code sequence. We will use the displayChunkSet method:

for (String sentence : sentences) {
   System.out.println("\nTEXT=" + sentence);
   displayChunkSet(dictionaryChunker, sentence);
}

On execution, we get the following output:

TEXT=Joe was the last person to see Fred. 
Type: PERSON Entity: [Joe] Score: 1.0
Type: PERSON Entity: [Fred] Score: 1.0
 
TEXT=He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale. 
Type: PLACE Entity: [Boston] Score: 1.0
Type: PLACE Entity: [pub] Score: 1.0
 
TEXT=Joe wanted to go to Vermont for the day to 
visit a cousin who works at IBM, but Sally and he had to look for Fred
Type: PERSON Entity: [Joe] Score: 1.0
Type: PLACE Entity: [Vermont] Score: 1.0
Type: ORGANIZATION Entity: [IBM] Score: 1.0
Type: PERSON Entity: [Sally] Score: 1.0
Type: PERSON Entity: [Fred] Score: 1.0

This does a pretty good job but it requires a lot of effort to create the dictionary for a large vocabulary.

Training a model

We will use OpenNLP to demonstrate how a model is trained. The training file used must:

  • Use <START:type> and <END> tags to demarcate the entities
  • Have one sentence per line

We will use the following model file named en-ner-person.train:

<START:person> Joe <END> was the last person to see <START:person> Fred <END> .
He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale.
<START:person> Joe <END> wanted to go to Vermont for the day to visit a cousin who works at IBM, but <START:person> Sally <END> and he had to look for <START:person> Fred <END> .

Several methods of this example are capable of throwing exceptions. These statements will be placed in a try-with-resource block as shown here, where the model’s output stream is created:

try (OutputStream modelOutputStream = new BufferedOutputStream(
       new FileOutputStream(new File("modelFile")));) {
   ...
} catch (IOException ex) {
   // Handle exception
}

Within the block, we create an ObjectStream instance using the PlainTextByLineStream class. This class's constructor takes a FileInputStream instance and returns each line as a String object. The en-ner-person.train file is used as the input file, as shown here. The UTF-8 string specifies the character encoding used:

ObjectStream<String> lineStream = new PlainTextByLineStream(
   new FileInputStream("en-ner-person.train"), "UTF-8");

The lineStream object contains streams that are annotated with tags delineating the entities in the text. These need to be converted to NameSample objects so that the model can be trained. This conversion is performed by the NameSampleDataStream class, as shown here. A NameSample object holds the names of the entities found in the text:

ObjectStream<NameSample> sampleStream =
   new NameSampleDataStream(lineStream);

The train method can now be executed as follows:

TokenNameFinderModel model = NameFinderME.train(
   "en", "person", sampleStream,
   Collections.emptyMap(), 100, 5);

The arguments of the method are as detailed in the following table:

Parameter Meaning
"en" Language code
"person" Entity type
sampleStream Sample data
Collections.emptyMap() Resources
100 The number of iterations
5 The cutoff

The model is then serialized to an output file:

model.serialize(modelOutputStream);

The output of this sequence is as follows. It has been shortened to conserve space. Basic information about the model creation is detailed:

Indexing events using cutoff of 5
 
 Computing event counts... done. 53 events
 Indexing... done.
Sorting and merging events... done. Reduced 53 events to 46.
Done indexing.
Incorporating indexed data for training... 
 Number of Event Tokens: 46
     Number of Outcomes: 2
   Number of Predicates: 34
...done.
Computing model parameters ...
Performing 100 iterations.
 1: ... loglikelihood=-36.73680056967707 0.05660377358490566
 2: ... loglikelihood=-17.499660626361216 0.9433962264150944
 3: ... loglikelihood=-13.216835449617108 0.9433962264150944
 4: ... loglikelihood=-11.461783667999262 0.9433962264150944
 5: ... loglikelihood=-10.380239416084963 0.9433962264150944
 6: ... loglikelihood=-9.570622475692486 0.9433962264150944
 7: ... loglikelihood=-8.919945779143012 0.9433962264150944
...
 99: ... loglikelihood=-3.513810438211968 0.9622641509433962
100: ... loglikelihood=-3.507213816708068 0.9622641509433962

Evaluating a model

The model can be evaluated using the TokenNameFinderEvaluator class. The evaluation process uses marked up sample text to perform the evaluation. For this simple example, a file called en-ner-person.eval was created that contained the following text:

<START:person> Bill <END> went to the farm to see <START:person> Sally <END> .
Unable to find <START:person> Sally <END> he went to town.
There he saw <START:person> Fred <END> who had seen <START:person> Sally <END> at the book store with <START:person> Mary <END> .

The following code is used to perform the evaluation. The previous model is wrapped in a NameFinderME instance, which serves as the argument of the TokenNameFinderEvaluator constructor. A NameSampleDataStream instance is created based on the evaluation file. The TokenNameFinderEvaluator class's evaluate method performs the evaluation:

TokenNameFinderEvaluator evaluator =
   new TokenNameFinderEvaluator(new NameFinderME(model));  
lineStream = new PlainTextByLineStream(
   new FileInputStream("en-ner-person.eval"), "UTF-8");
sampleStream = new NameSampleDataStream(lineStream);
evaluator.evaluate(sampleStream);

To determine how well the model worked with the evaluation data, the getFMeasure method is executed. The results are then displayed:

FMeasure result = evaluator.getFMeasure();
System.out.println(result.toString());

The following output displays the precision, recall, and F-measure. The precision indicates that 50 percent of the entities found exactly match the evaluation data. The recall is the percentage of entities defined in the corpus that were found in the same location. The F-measure is the harmonic mean of precision and recall, defined as: F1 = 2 * Precision * Recall / (Precision + Recall)

Precision: 0.5
Recall: 0.25
F-Measure: 0.3333333333333333
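As a quick sanity check, plugging the precision and recall reported above into the formula reproduces the F-measure:

```java
public class FMeasureCheck {
    public static void main(String[] args) {
        double precision = 0.5;
        double recall = 0.25;
        // Harmonic mean of precision and recall
        double fMeasure = 2 * precision * recall / (precision + recall);
        System.out.println(fMeasure); // prints 0.3333333333333333
    }
}
```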

The data and evaluation sets should be much larger to create a better model. The intent here was to demonstrate the basic approach used to train and evaluate an NER model.

Summary

We investigated several techniques for performing NER. Regular expressions is one approach that is supported by both core Java classes and NLP APIs. This technique is useful for many applications and there are a large number of regular expression libraries available.

Dictionary-based approaches are also possible and work well for some applications. However, they require considerable effort to populate at times. We used LingPipe’s MapDictionary class to illustrate this approach.
