
In this article by Anurag Shrivastava, author of Hadoop Blueprints, we will learn how to build a text analytics system which detects specific events in random news headlines.

The Internet has become the main source of news in the world. There are thousands of websites which constantly publish and update news stories from around the world. Not every news item is relevant to everyone, but some news items are critical for certain people or businesses. For example, if you were a major car manufacturer based in Germany with suppliers located in India, then you would be interested in news from that region which could affect your supply chain.


Road accidents in India are a major social and economic problem. They cause a large number of fatalities and result in significant economic losses. In this example, we will build a system which detects whether a news item refers to a road accident event. Let us define what we mean by that in the next paragraph.

A road accident event may or may not result in fatal injuries. One or more vehicles and pedestrians may be involved in the accident. A non-road-accident news item is everything else which cannot be categorized as a road accident event. It could be a trend analysis related to road accidents or something totally unrelated.

Technology stack

To build this system, we will use the following technologies:

  • Data storage: HDFS
  • Data processing: Hadoop MapReduce
  • Query engine: Hive and Hive UDF
  • Data ingestion: Curl and HDFS copy
  • Event detection: OpenNLP

The event detection system is a machine learning based natural language processing (NLP) system. The NLP component brings the intelligence to detect events in the random headline sentences from the news items.

OpenNLP

OpenNLP is an open source natural language processing framework from the Apache Software Foundation. You can download version 1.6.0 from https://opennlp.apache.org/ to run the examples in this article. It is capable of detecting entities, document categories, parts of speech, and so on in text written by humans. We will use the document categorization feature of OpenNLP in our system. Document categorization requires you to train OpenNLP with the help of sample text; as a result of training, we get a model. This resulting model is used to categorize new text.
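To show how a trained model is applied, here is a minimal sketch, assuming the OpenNLP 1.6.0 Java API; the class name and the sample headline are only placeholders. We will wire the same two calls, categorize() and getBestCategory(), into a Hive UDF later in this article.

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;

public class CategorizeHeadline {
    public static void main(String[] args) throws Exception {
        // Load a previously trained document categorization model.
        try (InputStream is = new FileInputStream("en-doccat.bin")) {
            DoccatModel model = new DoccatModel(is);
            DocumentCategorizerME categorizer = new DocumentCategorizerME(model);

            // Score the headline against all categories and print the best one.
            double[] outcomes = categorizer.categorize("Two die as car skids off highway");
            System.out.println(categorizer.getBestCategory(outcomes));
        }
    }
}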

Our training data looks as follows:

r	1.46 lakh lives lost on Indian roads last year – The Hindu.
r	Indian road accident data | OpenGovernmentData (OGD) platform…
r	400 people die everyday in road accidents in India: Report – India TV.
n	Top Indian female biker dies in road accident during country-wide tour.
n	Thirty die in road accidents in north India mountains—World—Dunya…
n	India’s top woman biker Veenu Paliwal dies in road accident: India…
r	Accidents on India’s deadly roads cost the economy over $8 billion…
n	Thirty die in road accidents in north India mountains (The Express)

The first column can take two values:

  • n indicates that the news item is a road accident event
  • r indicates that the news item is not a road accident event (everything else)

This training set has a total of 200 lines. Please note that OpenNLP requires at least 15000 lines in the training set to deliver good results. Because we do not have that much training data, we will start with a small set but remain aware of the limitations of our model. You will see that even with a small training dataset, this model works reasonably well.

Let us train and build our model:

$ opennlp DoccatTrainer -model en-doccat.bin -lang en -data roadaccident.train.prn -encoding UTF-8

Here the file roadaccident.train.prn contains the training data, and the output file en-doccat.bin contains the model which we will use in our data pipeline. The training data file is a plain text file, which you can expand with a bigger corpus of knowledge to make the model smarter. We have built our model using the command-line utility, but it is also possible to build the model programmatically.
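For reference, a rough sketch of the programmatic route, assuming the OpenNLP 1.6.0 training API, could look like the following; the class name is illustrative and the file names match the ones used above.

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainDoccatModel {
    public static void main(String[] args) throws Exception {
        // Read the training file line by line; each line is "<category> <text>".
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("roadaccident.train.prn")), "UTF-8");
        ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines);

        // Train a document categorization model with the default parameters.
        DoccatModel model = DocumentCategorizerME.train(
                "en", samples, TrainingParameters.defaultParams(), new DoccatFactory());

        // Serialize the model to en-doccat.bin so it can be used in the data pipeline.
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream("en-doccat.bin"))) {
            model.serialize(out);
        }
    }
}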

Next we will build the data pipeline as follows:

Fetch RSS feeds

This component will fetch RSS news feeds from popular news websites. In this case, we will just use one news feed from Google News. We can always add more sites after our first RSS feed has been integrated. The whole RSS feed can be downloaded using the following command:

$ curl "https://news.google.com/news?cf=all&hl=en&ned=in&topic=n&output=rss"

The previous command downloads the news headlines for India. You can customize the RSS feed for your region by visiting the Google News site at https://news.google.com.

Scheduler

Our scheduler will fetch the RSS feed once every 6 hours. Let us assume that in a 6-hour interval, we have a good likelihood of fetching fresh news items.

We will wrap our feed fetching commands in a shell script and invoke it using cron. The script is as follows:

$ cat feedfetch.sh
NAME="newsfeed-"`date +%Y-%m-%dT%H.%M.%S`
curl "https://news.google.com/news?cf=all&hl=en&ned=in&topic=n&output=rss" > $NAME
hadoop fs -put $NAME /xml/rss/newsfeeds

The cron job setup line will be as follows:

0 */6 * * * /home/hduser/feedfetch.sh

Please edit your cron job table using the following command and add the setup line in it:

$ crontab -e

Loading data in HDFS

To load data into HDFS, we will use the HDFS put command, which copies the downloaded RSS feed to a directory in HDFS. Let us create the directory in HDFS where our feed fetcher script will store the RSS feeds:

$ hadoop fs -mkdir -p /xml/rss/newsfeeds

Query using Hive

First, we will create an external table in Hive over the RSS feeds. Using XPath-based SELECT queries, we will extract the news headlines from the RSS feeds. These headlines will then be passed to a UDF to detect their categories:

CREATE EXTERNAL TABLE IF NOT EXISTS rssnews(
        document STRING)
    COMMENT 'RSS Feeds from media'
    STORED AS TEXTFILE
    location '/xml/rss/newsfeeds';

The following query parses the XML to retrieve the titles, that is the headlines, and explodes them into a single-column result:

SELECT explode(xpath(document, '//item/title/text()')) FROM rssnews;

The sample output of the above command on my system is as follows:

hive> select explode(xpath(document, '//item/title/text()')) from rssnews;
Query ID = hduser_20161010134407_dcbcfd1c-53ac-4c87-976e-275a61ac3e8d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475744961620_0016, Tracking URL = http://localhost:8088/proxy/application_1475744961620_0016/
Kill Command = /home/hduser/hadoop-2.7.1/bin/hadoop job  -kill job_1475744961620_0016
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-10-10 14:46:14,022 Stage-1 map = 0%,  reduce = 0%
2016-10-10 14:46:20,464 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.69 sec
MapReduce Total cumulative CPU time: 4 seconds 690 msec
Ended Job = job_1475744961620_0016
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 4.69 sec   HDFS Read: 120671 HDFS Write: 1713 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 690 msec
OK
China dispels hopes of early breakthrough on NSG, sticks to its guns on Azhar - The Hindu
Pampore attack: Militants holed up inside govt building; combing operations intensify - Firstpost
CPI(M) worker hacked to death in Kannur - The Hindu
Akhilesh Yadav's comment on PM Modi's Lucknow visit shows Samajwadi Party's insecurity: BJP - The Indian Express
PMO maintains no data about petitions personally read by PM - Daily News & Analysis
AIADMK launches social media campaign to put an end to rumours regarding Amma's health - Times of India
Pakistan, India using us to play politics: Former Baloch CM - Times of India
Indian soldier, who recited patriotic poem against Pakistan, gets death threat - Zee News
This Dussehra effigies of 'terrorism' to go up in flames - Business Standard
'Personal reasons behind Rohith's suicide': Read commission's report - Hindustan Times
Time taken: 5.56 seconds, Fetched: 10 row(s)

Hive UDF

Our Hive User Defined Function (UDF) calls a method categorizeDoc, which takes a news headline and suggests whether it refers to a road accident event, as we explained earlier. The code is as follows:

package com.mycompany.app;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.util.InvalidFormatException;

@Description(
    name = "getCategory",
    value = "_FUNC_(string) - gets the category of a document"
)
public final class MyUDF extends UDF {

    public Text evaluate(Text input) {
        if (input == null) return null;
        try {
            return new Text(categorizeDoc(input.toString()));
        } catch (Exception ex) {
            ex.printStackTrace();
            return new Text("Sorry Failed: >> " + input.toString());
        }
    }

    public String categorizeDoc(String doc) throws InvalidFormatException, IOException {
        // Load the trained model from the current working directory.
        InputStream is = new FileInputStream("./en-doccat.bin");
        DoccatModel model = new DoccatModel(is);
        is.close();

        // Score the headline against all categories and return the best one.
        DocumentCategorizerME classificationME = new DocumentCategorizerME(model);
        double[] classDistribution = classificationME.categorize(doc);
        return classificationME.getBestCategory(classDistribution);
    }
}

The function categorizeDoc takes a single string as input. It loads the model, which we created earlier, from the file en-doccat.bin in the local directory. Finally, it calls the classifier, which returns the result to the calling function. The class MyUDF extends the Hive UDF class and calls categorizeDoc for each input line. If the call succeeds, the predicted category is returned to the calling program; otherwise, a message is returned indicating that the category detection has failed.
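One possible refinement, not part of the original code: categorizeDoc opens and parses en-doccat.bin on every call, so the UDF reloads the model for each row. A minimal sketch of caching the model in a static field inside MyUDF (using the same imports as the class above) could look like this:

// Cached-model variant of categorizeDoc; these members would replace the
// original method inside MyUDF. Illustrative only.
private static DocumentCategorizerME categorizer;

private static synchronized DocumentCategorizerME getCategorizer() throws IOException {
    if (categorizer == null) {
        // Load and parse the model only once per JVM instead of once per row.
        try (InputStream is = new FileInputStream("./en-doccat.bin")) {
            categorizer = new DocumentCategorizerME(new DoccatModel(is));
        }
    }
    return categorizer;
}

public String categorizeDoc(String doc) throws IOException {
    DocumentCategorizerME me = getCategorizer();
    double[] classDistribution = me.categorize(doc);
    return me.getBestCategory(classDistribution);
}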

The pom.xml file to build the above class is as follows:

$ cat pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.mycompany</groupId>
    <artifactId>app</artifactId>
    <version>1.0</version>
        <packaging>jar</packaging>
        <properties>
            <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
            <maven.compiler.source>1.7</maven.compiler.source>
            <maven.compiler.target>1.7</maven.compiler.target>
        </properties>

        <dependencies>
            <dependency>
                <groupId>junit</groupId>
                <artifactId>junit</artifactId>
                <version>4.12</version>
                <scope>test</scope>
            </dependency>

            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-client</artifactId>
                <version>2.7.1</version>
                <type>jar</type>
            </dependency>
            <dependency>
                <groupId>org.apache.hive</groupId>
                <artifactId>hive-exec</artifactId>
                <version>2.0.0</version>
                <type>jar</type>
            </dependency>
            <dependency>
                <groupId>org.apache.opennlp</groupId>
                <artifactId>opennlp-tools</artifactId>
                <version>1.6.0</version>
            </dependency>

        </dependencies>

<build>
    <pluginManagement>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-surefire-plugin</artifactId>
          <version>2.8</version>
        </plugin>

        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <archive>
                    <manifest>
                        <mainClass>com.mycompany.app.App</mainClass>
                    </manifest>
                </archive>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
</project>

You can build the jar with all the dependencies in it using the following command:

$  mvn clean compile assembly:single

The resulting jar file app-1.0-jar-with-dependencies.jar can be found in the target directory.

Let us use this jar file in Hive to categorize the news headlines as follows:

  1. Copy the jar file to the bin subdirectory in the Hive root:
    $ cp app-1.0-jar-with-dependencies.jar $HIVE_ROOT/bin
  2. Copy the trained model to the bin subdirectory in the Hive root:
    $ cp en-doccat.bin $HIVE_ROOT/bin

Run the categorization queries

  1. Run Hive:
    $ hive
  2. Add jar file in Hive:
    hive> ADD JAR ./app-1.0-jar-with-dependencies.jar ;
  3. Create a temporary categorization function catDoc:
    hive> CREATE TEMPORARY FUNCTION catDoc as 'com.mycompany.app.MyUDF';
  4. Create a table headlines to hold the headlines extracted from the RSS feed:
    hive> create table headlines( headline  string);
  5. Insert the extracted headlines in the table headlines:
    hive> insert overwrite table headlines select explode(xpath(document, '//item/title/text()')) from rssnews;

Let’s test our UDF by manually passing a real news headline to it from a newspaper website:

hive> select catDoc("8 die as SUV falls into river while crossing bridge in Ghazipur") ;
OK
n

The output is n, which means this is indeed a headline about a road accident incident. This is reasonably good, so now let us run this function for all the headlines:

hive> select headline, catDoc(*) from headlines;
OK
China dispels hopes of early breakthrough on NSG, sticks to its guns on Azhar - The Hindu  r
Pampore attack: Militants holed up inside govt building; combing operations intensify - Firstpost  r
Akhilesh Yadav Backs Rahul Gandhi's 'Dalali' Remark - NDTV  r
PMO maintains no data about petitions personally read by PM Narendra Modi - Economic Times	n
Mobile Internet Services Suspended In Protest-Hit Nashik - NDTV  n
Pakistan, India using us to play politics: Former Baloch CM - Times of India  r
CBI arrests Central Excise superintendent for taking bribe - Economic Times  n
Be extra vigilant during festivals: Centre's advisory to states - Times of India  r
CPI-M worker killed in Kerala - Business Standard  n
Burqa-clad VHP activist thrashed for sneaking into Muslim women gathering - The Hindu  r
Time taken: 0.121 seconds, Fetched: 10 row(s)

You can see that our headline detection function works and outputs r or n. In the above example, we see several false positives where a headline has been incorrectly identified as a road accident. Better training of our model can improve the quality of the results.

Further reading

The book Hadoop Blueprints covers several case studies where we apply Hadoop, HDFS, data ingestion tools such as Flume and Sqoop, query and visualization tools such as Hive and Zeppelin, and machine learning tools such as BigML and Spark to build solutions. You will discover, for example, how to build a fraud detection system or a data lake using Hadoop.

Summary

In this article, we learned how to build a text analytics system that detects specific events in random news headlines. Along the way, we also covered how to apply Hadoop, HDFS, Hive, and OpenNLP to the problem.
