
Note: This article is an excerpt from a book written by Richard M. Reese and Jennifer L. Reese titled Java for Data Science. This book provides in-depth understanding of important tools and proven techniques used across data science projects in a Java environment.

In this article, we show a Java implementation of an Information Extraction (IE) task that identifies what a document is about. Entities extracted this way can be used to enhance search retrieval and to boost the ranking of a document in search results.

To begin with, let’s understand what Named Entity Recognition (NER) is. NER is the task of finding and classifying elements of a document or text, such as people, locations, and things. Given a text segment, we may want to identify all the names of people present. However, this is not always easy because a name such as Rob may also be used as a verb.

In this section, we will demonstrate how to use OpenNLP’s TokenNameFinderModel class to find names and locations in text. While there are other entities we may want to find, this example will demonstrate the basics of the technique. We begin with names.

Most names occur within a single sentence. We do not want to search across sentence boundaries because an entity such as a state might be identified where none exists. Consider the following sentences:

Jim headed north. Dakota headed south.

If we ignored the period, then the state of North Dakota might be identified as a location, when in fact it is not present.
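One way to respect sentence boundaries is to split the text into sentences first and run the name finder on each sentence separately. The following is a minimal sketch that uses the JDK's `BreakIterator` rather than OpenNLP; the class name `SentenceSplitSketch` and the method `splitSentences` are hypothetical names chosen for illustration, and this is not the approach used in the book.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitSketch {

    // Splits text into sentences so that each one can be tokenized
    // and searched for entities on its own.
    static List<String> splitSentences(String text) {
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.US);
        boundaries.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            // Drop the whitespace that separates one sentence from the next
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : splitSentences("Jim headed north. Dakota headed south.")) {
            System.out.println(s);
        }
    }
}
```

Each returned sentence can then be tokenized and passed to the name finder on its own, so that "north" and "Dakota" never appear in the same token array. OpenNLP also ships a trained sentence detector (`SentenceDetectorME`), which may handle abbreviations better than this rule-based splitter.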

Using OpenNLP to perform NER

We start our example with a try-with-resources block, which opens the model files and handles exceptions. OpenNLP uses models that have been trained on different sets of data. In this example, the en-token.bin and en-ner-person.bin files contain the models for the tokenization of English text and for English name elements, respectively. These files can be downloaded from http://opennlp.sourceforge.net/models-1.5/. The I/O streams used here are standard Java:

    try (InputStream tokenStream =
            new FileInputStream(new File("en-token.bin"));
         InputStream personModelStream = new FileInputStream(
            new File("en-ner-person.bin"));) {
        ...
    } catch (Exception ex) {
        // Handle exceptions
    }

An instance of the TokenizerModel class is initialized using the token stream. This instance is then used to create the actual TokenizerME tokenizer. We will use this instance to tokenize our sentence:

    TokenizerModel tm = new TokenizerModel(tokenStream);
    TokenizerME tokenizer = new TokenizerME(tm);

The TokenNameFinderModel class is used to hold a model for name entities. It is initialized using the person model stream. An instance of the NameFinderME class is created using this model since we are looking for names:

    TokenNameFinderModel tnfm =
        new TokenNameFinderModel(personModelStream);
    NameFinderME nf = new NameFinderME(tnfm);

To demonstrate the process, we will use the following sentence. We then convert it to a series of tokens using the tokenizer’s tokenize method:

    String sentence = "Mrs. Wilson went to Mary's house for dinner.";
    String[] tokens = tokenizer.tokenize(sentence);

The Span class holds information regarding the positions of entities. The find method will return the position information, as shown here:

Span[] spans = nf.find(tokens);

This array holds information about person entities found in the sentence. We then display this information as shown here:

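    // Assumes a static import: import static java.lang.System.out;
    for (int i = 0; i < spans.length; i++) {
        // Prints the span followed by the first token it covers
        out.println(spans[i] + " - " + tokens[spans[i].getStart()]);
    }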

The output for this sequence is as follows. Notice that it identifies the last name of Mrs. Wilson but not the “Mrs.”:

    [1..2) person - Wilson
    [4..5) person - Mary

Once these entities have been extracted, we can use them for specialized analysis.

Identifying location entities

We can also find other types of entities such as dates and locations. In the following example, we find locations in a sentence. It is very similar to the previous person example, except that an en-ner-location.bin file is used for the model:

    try (InputStream tokenStream =
            new FileInputStream("en-token.bin");
         InputStream locationModelStream = new FileInputStream(
            new File("en-ner-location.bin"));) {
        TokenizerModel tm = new TokenizerModel(tokenStream);
        TokenizerME tokenizer = new TokenizerME(tm);
        TokenNameFinderModel tnfm =
            new TokenNameFinderModel(locationModelStream);
        NameFinderME nf = new NameFinderME(tnfm);

        String sentence = "Enid is located north of Oklahoma City.";
        String[] tokens = tokenizer.tokenize(sentence);
        Span[] spans = nf.find(tokens);
        for (int i = 0; i < spans.length; i++) {
            out.println(spans[i] + " - " +
                tokens[spans[i].getStart()]);
        }
    } catch (Exception ex) {
        // Handle exceptions
    }

With the sentence defined previously, the model was only able to find the second city, as shown here. The span [5..7) covers both tokens of Oklahoma City, but the loop prints only the first token of each span. Missing Enid is likely due to the ambiguity of the name, which is both the name of a city and a person’s name:

[5..7) location - Oklahoma
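To print an entity's full text rather than only its first token, the tokens inside the half-open range can be joined. The sketch below is self-contained and uses plain `int` indexes in place of OpenNLP's `Span`; the class name `SpanTextSketch`, the method `spanText`, and the token array are assumptions made for illustration, with the tokens chosen to be consistent with the span indexes reported above.

```java
import java.util.Arrays;

class SpanTextSketch {

    // Joins tokens[start] through tokens[end - 1] with single spaces.
    // The end index is exclusive, matching the "[5..7)" notation.
    static String spanText(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        // Assumed tokenization of the example sentence
        String[] tokens = {"Enid", "is", "located", "north", "of",
                           "Oklahoma", "City", "."};
        System.out.println(spanText(tokens, 5, 7));
    }
}
```

In the OpenNLP loop, the same helper could be called as `spanText(tokens, spans[i].getStart(), spans[i].getEnd())`. OpenNLP's `Span` class also provides a static `spansToStrings` helper that serves a similar purpose.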

Suppose we use the following sentence:

sentence = "Pond Creek is located north of Oklahoma City.";

Then we get this output:

    [1..2) location - Creek
    [6..8) location - Oklahoma

Unfortunately, the model recognized only Creek and missed the full name of the town, Pond Creek. NER is a useful tool for many applications, but like many techniques, it is not foolproof. The accuracy of the NER approach presented, and of many other NLP techniques, will vary depending on factors such as the accuracy of the model, the language being used, and the type of entity.
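A rough way to put a number on such misses is to compare the spans a model returns against hand-labelled spans and compute exact-match precision and recall. The sketch below does this for the Pond Creek sentence. The expected spans are an assumption (a hand labelling in which Pond Creek is [0..2) and Oklahoma City is [6..8)), and the class `SpanScoreSketch` and its methods are hypothetical names, not part of OpenNLP or of the book's code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SpanScoreSketch {

    // Number of predicted spans that exactly match an expected span.
    static int exactMatches(Set<String> predicted, Set<String> expected) {
        Set<String> common = new HashSet<>(predicted);
        common.retainAll(expected);
        return common.size();
    }

    // Share of predicted spans that are correct.
    static double precision(Set<String> predicted, Set<String> expected) {
        return predicted.isEmpty() ? 0.0
            : (double) exactMatches(predicted, expected) / predicted.size();
    }

    // Share of expected spans that were found.
    static double recall(Set<String> predicted, Set<String> expected) {
        return expected.isEmpty() ? 0.0
            : (double) exactMatches(predicted, expected) / expected.size();
    }

    public static void main(String[] args) {
        // Spans are written as "start..end" with an exclusive end index.
        // Predicted spans come from the output shown above.
        Set<String> predicted = new HashSet<>(Arrays.asList("1..2", "6..8"));
        // Expected spans are assumed hand labels for the same sentence.
        Set<String> expected = new HashSet<>(Arrays.asList("0..2", "6..8"));
        System.out.println("precision = " + precision(predicted, expected));
        System.out.println("recall = " + recall(predicted, expected));
    }
}
```

Under these assumed labels, one of the two predicted spans matches exactly, so both precision and recall are 0.5. Exact matching is strict: the partially correct span for Creek earns no credit, which is a design choice rather than the only reasonable scoring rule.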

With this, we have worked through one of the core tasks of natural language processing using Java and Apache OpenNLP. To learn what else you can do with Java in the domain of data science, check out the book Java for Data Science.
