In this part and the following one, we'll see how to efficiently extract data such as hashtags and emoticons from tweets. We need this because we want to know which topics are most discussed, and also to gauge the mood across the tweets. We'll then join that information to derive people's sentiments.
We’ll start with hashtags; to do so, we need to do the following:
So, I have some bad news and good news:
The following is the Hive-processing workflow that we are going to apply to our tweets:
Hive-processing workflow
The preceding diagram describes the workflow to be followed to extract the hashtags. The steps are basically as follows:
This kind of processing is really useful if we want a feel for the top tweeted topics, which are most often represented by a word cloud chart like the one shown in the following diagram:
Topic word cloud sample
Let’s do this by creating a new CH04_01_HIVE_PROCESSING_HASH_TAGS job under a new Chapter4 folder. This job will contain six components:
The following would be the steps to create a new job:
Name | Value
custom_udf_jar | PATH_TO_THE_JAR, for example, /Users/bahaaldine/here/is/the/jar/extractHashTags.jar
This new context variable is simply the path to the Hive UDF JAR file provided with the source files.
Adding a Custom UDF JAR to Hive classpath.
We use the CREATE TEMPORARY FUNCTION query to create a new extract_patterns function in the Hive UDF catalog, passing the implementation class contained in our package.
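The registration might look like the following sketch; the implementation class name below is a placeholder (the real one ships in the provided package), and the JAR path is the value of the custom_udf_jar context variable shown inline for clarity:

```sql
-- Make the UDF JAR available on the Hive classpath
ADD JAR /Users/bahaaldine/here/is/the/jar/extractHashTags.jar;

-- Register the function; replace the class name with the one
-- contained in the package provided with the source files
CREATE TEMPORARY FUNCTION extract_patterns
  AS 'org.talend.demo.ExtractPatterns';
```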
Name | Type
hash_tags_id | String
day_of_week | String
day_of_month | String
time | String
month | String
hash_tags_label | String
CREATE EXTERNAL TABLE hash_tags (
  hash_tags_id string,
  day_of_week string,
  day_of_month string,
  time string,
  month string,
  hash_tags_label string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
LOCATION '/user/"+context.hive_user+"/packt/chp04/hashtags'
insert into table hash_tags
select
  concat(formatted_tweets.day_of_week, formatted_tweets.day_of_month,
         formatted_tweets.time, formatted_tweets.month) as hash_id,
  formatted_tweets.day_of_week,
  formatted_tweets.day_of_month,
  formatted_tweets.time,
  formatted_tweets.month,
  hash_tags_label
from formatted_tweets
LATERAL VIEW explode(
  extract_patterns(formatted_tweets.content, '#(\\w+)')
) hashTable as hash_tags_label
Let's analyze the query from the end to the beginning. The last part of the query uses the extract_patterns function to parse all hashtags in formatted_tweets.content based on the regex #(\w+).
In Talend, all strings are Java string objects, so every backslash must be escaped. Hive also requires special characters to be escaped, which is why the Talend job ends up with four backslashes.
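The two escaping layers can be illustrated with a small Python sketch; the replace call is just a toy stand-in for the real Java and Hive string unescaping:

```python
import re

# What you type in the Talend component: four backslashes
java_source = r'#(\\\\w+)'

# Java unescapes the string literal: each pair of backslashes becomes one
hive_literal = java_source.replace('\\\\', '\\')   # -> #(\\w+)

# Hive unescapes the SQL string literal the same way
regex = hive_literal.replace('\\\\', '\\')         # -> #(\w+)

# The regex that is finally compiled matches hashtags as intended
print(re.findall(regex, 'Hello #world'))           # ['world']
```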
The extract_patterns function returns an array, which we pass to the explode Hive UDF to obtain one value per element. The lateral view statement then creates an on-the-fly view called hashTable with a single column, hash_tags_label. Take a breath; we are almost done.
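The extract-then-explode behavior can be simulated in a few lines of Python; the real UDF is implemented in Java, and the function and sample tweets here are purely illustrative:

```python
import re

def extract_patterns(content, pattern):
    # Sketch of the custom UDF: return all capture-group matches as an array
    return re.findall(pattern, content)

tweets = [
    {"day_of_week": "Mon", "content": "Big data with #Hive and #Talend"},
    {"day_of_week": "Tue", "content": "No tags here"},
]

# LATERAL VIEW explode(...) emits one output row per array element,
# joined with the columns of the source row
rows = [
    (t["day_of_week"], tag)
    for t in tweets
    for tag in extract_patterns(t["content"], r"#(\w+)")
]
print(rows)  # [('Mon', 'Hive'), ('Mon', 'Talend')]
```

Note that the tweet without hashtags contributes no rows, which matches the default behavior of explode in a lateral view.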
Going one level up, we select all the required columns for our new hash_tags table, concatenate data to build hash_id, and dynamically select a runtime-built column called hash_tags_label provided by the lateral view.
Finally, all the selected data is inserted in the hash_tags table.
We just need to run the job, and then, using the following query, we will check in Hive if the new table contains our hashtags:
hive> select * from hash_tags;
The following diagram shows the complete hashtags-extracting job structure:
Hive processing job
By now, you should have a good overview of how to use Apache Hive features with Talend, from ELT mode to lateral views, by way of custom Hive user-defined functions. From a use-case perspective, we have now reached the point where we need to reveal some added-value data from our Hive-processed data.