Here are the initial steps to be followed:
Create the init.hql file in the current directory. This file sets up the database, specifies our Hive settings, and creates our input table.
create database if not exists ch3 ;
use ch3 ;
set hive.exec.mode.local.auto=true ;
create table if not exists athlete(
  name string,
  id string,
  demonstration_events_competed_in array<string>,
  demonstration_medals_won array<string>,
  country array<string>,
  medals_won array<string>)
row format delimited
  fields terminated by '\t'
  collection items terminated by ',' ;
load data local inpath 'data/olympic_athlete.tsv' overwrite into table athlete ;
Create the script.hql in the current directory. This file creates our output table and executes a query.
create table if not exists top_athletes(
  name string,
  num_medals int) ;
insert overwrite table top_athletes
  select name, size(medals_won) as num_medals
  from athlete
  where size(medals_won) >= ${threshold}
  order by num_medals desc, name asc ;
Follow these steps to complete the task:
We start by running our initialization and query scripts from the command line:
$ hive -v --hivevar threshold=10 -i init.hql -f script.hql
We write the header of our top_athletes table to a file using the standard cut and paste Unix tools:
$ hive -S -e "use ch3; describe top_athletes;" | cut -f 1 | paste -s - > output.tsv
We write the data from the top_athletes table to our output file by executing explicit SQL statements:
$ hive -S -e "use ch3; select * from top_athletes ;" >> output.tsv
We can verify the contents of our file by simply using the cat command:
$ cat output.tsv
name	num_medals
Michael Phelps	22
Larissa Latynina	14
Jenny Thompson	12
Nikolai Andrianov	12
Matt Biondi	11
Ole Einar Bjørndalen	11
Ryan Lochte	11
Boris Shakhlin	10
Takashi Ono	10
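Beyond reading the file by eye, a quick structural check can confirm that every row has the two expected columns. The following is only a sketch: it assumes the file is tab-separated, and it works on a temporary file holding a few sample rows copied from the listing above rather than on the real output.tsv.

```shell
# Sample rows copied from the listing above, written to a temporary file
# so the real output.tsv is left untouched.
sample=$(mktemp)
printf 'name\tnum_medals\nMichael Phelps\t22\nLarissa Latynina\t14\n' > "$sample"

# Count the lines that do not have exactly two tab-separated fields.
bad_rows=$(awk -F '\t' 'NF != 2 { n++ } END { print n + 0 }' "$sample")
echo "rows with unexpected field count: $bad_rows"

rm -f "$sample"
```

Pointing the same awk command at output.tsv would apply the check to the real results.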
From the command line, Hive supports three mutually exclusive modes. In addition to the interactive mode, we can pass commands to Hive using two different flags.
Flag | Argument | Effect |
-f | filename | Execute the contents of the file |
-e | SQL statements | Execute the argument as input |
(none) | (none) | Run Hive interactively |
For all three of these cases, Hive allows using the -i filename parameter for passing an initialization script. This script will be executed first, then the session will continue with the file contents, explicit statements, or interactive session.
Hive supports using the -i flag multiple times, so we can make our initialization scripts even more reusable by splitting them according to their functionality. For example, we could have one initialization script with common settings used by all jobs running on a particular cluster and a second initialization script for all jobs that use particular table definitions.
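As a rough illustration of splitting initialization by functionality, the sketch below assembles a command line with one -i flag per initialization file. The file names cluster_settings.hql and ch3_tables.hql are hypothetical and stand in for a cluster-wide settings script and a table-definition script. The command is only printed, so the sketch runs without a Hive installation.

```shell
# Hypothetical initialization files; only script.hql comes from this recipe.
set --
for init in cluster_settings.hql ch3_tables.hql; do
  # Append one "-i <file>" pair per initialization script.
  set -- "$@" -i "$init"
done

# Assemble the full command line and print it instead of running it.
cmd="hive $* -f script.hql"
echo "$cmd"
```

Replacing the final echo with an actual invocation (for example, hive "$@" -f script.hql placed before cmd is built) would run the query with both scripts applied first.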
In this example, we first use our two files to create the input and output tables and run a query. By using initialization scripts, we can separate the context of our query (the database, data loading, and Hive configuration) from the logic and output. Different processes can share the same context without needing to duplicate the contents of the initialization file across all of their scripts.
We also make our script.hql file reusable with different thresholds through variable substitution. Hive will automatically replace any occurrences of ${variable} with the values passed to the hive command via the --hivevar parameter. The -d and --define parameters are synonyms for --hivevar, and all of these parameters may be specified multiple times if necessary.
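To make the reuse concrete, here is a minimal sketch of a wrapper that builds the same command for several thresholds. The function name run_top_athletes and the threshold values 5 and 15 are made up for illustration. The function prints the command rather than executing it, so it can be tried without Hive.

```shell
# Hypothetical helper: print the hive command for a given threshold.
# Remove the leading "echo" to execute the command instead of printing it.
run_top_athletes() {
  threshold=$1
  echo hive -v --hivevar "threshold=${threshold}" -i init.hql -f script.hql
}

# Same scripts, different thresholds.
for t in 5 10 15; do
  run_top_athletes "$t"
done
```

Note that script.hql uses insert overwrite, so if these commands were really executed, each run would replace the contents of top_athletes with the results for its own threshold.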
The -v flag simply puts Hive into verbose mode, so each statement is echoed to the console as it is executed. Combining the verbose flag, variable substitution, and our scripts gives us the first shell command we executed:
$ hive -v --hivevar threshold=10 -i init.hql -f script.hql
We then execute two explicit commands, first to describe the columns of the top_athletes table, and then to output its contents. These are redirected to our output file on the local filesystem.
The -S flag puts Hive into silent mode, so only the output of our queries will be written to the files. This helps us capture only the table contents to our output file.
$ hive -S -e "use ch3; describe top_athletes;" | cut -f 1 | paste -s - > output.tsv
$ hive -S -e "use ch3; select * from top_athletes ;" >> output.tsv
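The header pipeline can be tried in isolation. The sketch below feeds cut and paste a simulated describe result instead of calling Hive. It assumes one line per column with tab-separated fields (name, type, comment) and no padding. Real output may differ between Hive versions, so treat the simulated text as an assumption.

```shell
# Simulated "describe top_athletes" output (assumed layout):
# one line per column, fields separated by tabs.
describe_output=$(printf 'name\tstring\t\nnum_medals\tint\t\n')

# cut -f 1 keeps the first tab-separated field of each line;
# paste -s joins the resulting lines into one tab-separated line.
header=$(printf '%s\n' "$describe_output" | cut -f 1 | paste -s -)

printf '%s\n' "$header"
```

The result is a single line with the two column names separated by a tab, which matches the first line of output.tsv.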
In Hive versions 0.10.0 and later, we can use the --database ch3 flag instead of specifying use ch3; as part of the query. Alternatively, we could refer to the table by its full name, ch3.top_athletes.
In this recipe, we learned how Hive supports use cases such as periodic ETL jobs by rerunning the top athletes query in batch mode from the command line.