Hadoop is one of the first names that come to mind when we talk about Big Data. I won’t go into detail about it, since there is plenty of information out there; moreover, as somebody once said, “If you decided to use Hadoop for your data warehouse, then you probably have a good reason for it.” Let’s not forget: it is primarily a distributed filesystem, not a relational database.
That said, there are many cases when we may need to use this technology for number crunching, for example, together with MicroStrategy for analysis and reporting.
There are two main ways to leverage Hadoop data from MicroStrategy: the first is Hive and the second is Impala. Both work as SQL bridges to the underlying Hadoop structures, converting standard SELECT statements into jobs. The connection is handled by a proprietary 32-bit ODBC driver available for free from the Cloudera website.
In my tests, Impala was considerably faster than Hive, so I will show you how to use it from our MicroStrategy virtual machine.
Please note that I am using Version 9.3.0 for consistency with the rest of the book. If you’re serious about Big Data and Hadoop, I strongly recommend upgrading to 9.3.1 for enhanced performance and easier setup. See MicroStrategy knowledge base document TN43588: Post-Certification of Cloudera Impala 1.0 with MicroStrategy 9.3.1.
The ODBC driver is the same for both Hive and Impala; only the driver settings change.
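To illustrate the idea of one driver with two sets of settings, here is a minimal Python sketch that builds two ODBC connection strings. The DSN names HADOOP_HIVE and HADOOP_IMPALA and the user name are hypothetical; they stand for two data sources defined on the same driver with different settings. Only the standard ODBC keywords DSN and UID are used, since the driver-specific options live inside each DSN.

```python
def odbc_connection_string(dsn, **extra):
    """Build an ODBC connection string from a DSN name plus optional keywords."""
    # DSN always comes first; any extra keywords follow in alphabetical order.
    parts = [("DSN", dsn)] + sorted(extra.items())
    return ";".join(f"{key}={value}" for key, value in parts)


# Hypothetical DSN names: both would point at the same Cloudera driver,
# each configured with its own settings (Hive or Impala).
HIVE_CONNECTION = odbc_connection_string("HADOOP_HIVE")
IMPALA_CONNECTION = odbc_connection_string("HADOOP_IMPALA", UID="cloudera")

if __name__ == "__main__":
    print(HIVE_CONNECTION)
    print(IMPALA_CONNECTION)
```

From the application side, switching between Hive and Impala then amounts to choosing a different DSN.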
To show how we can connect to a Hadoop database, I will use two virtual machines: one with MicroStrategy Suite and the second with Cloudera Hadoop distribution, specifically, a virtual appliance that is available for download from their website.
The configuration of the Hadoop cluster is out of scope; moreover, I am not a Hadoop expert. I’ll simply give some hints. Feel free to use any other configuration or vendor; the procedure and ODBC parameters should be similar.
Start by going to http://at5.us/AppAU1
The Cloudera VM download is almost 3 GB (cloudera-quickstart-vm-4.3.0-vmware.tar.gz) and features the CDH4 version. After unpacking the archive, you’ll find a cloudera-quickstart-vm-4.3.0-vmware.ovf file that can be opened with VMware, as shown in the screen capture:
Accept the defaults and click on Import to generate the cloudera-quickstart-vm-4.3.0-vmware virtual machine.
Before starting the Cloudera appliance, change the network card settings from NAT to Bridged since we need to access the database from another VM:
Leave the remaining parameters at their defaults and start the machine.
After a while, you’ll be presented with the graphical interface of CentOS Linux. If the network has started correctly, the machine should have received an IP address from your network’s DHCP server. We need a fixed rather than dynamic address in the Hadoop VM, so:
When we first start Hadoop, there are no tables in the database, so we create the samples:
Next, we open the MicroStrategy virtual machine and download the 32-bit Cloudera ODBC Driver for Apache Hive, Version 2.0 from http://at5.us/AppAU2.
Download the ClouderaHiveODBCSetup_v2_00.exe file and save it in C:\install.
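Before configuring the driver, it can be worth confirming that the Hadoop VM is reachable from the MicroStrategy VM over the bridged network. The following is a minimal Python sketch, not part of the original procedure. The IP address is a placeholder for whatever fixed address you assigned, and port 21050 is an assumption (it is the port commonly documented for Impala connections from ODBC 2.x clients); check your own configuration.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    HADOOP_VM = "192.168.1.50"  # hypothetical fixed address of the Cloudera VM
    IMPALA_PORT = 21050         # assumption: verify against your Impala setup
    if port_is_open(HADOOP_VM, IMPALA_PORT):
        print("Impala port reachable")
    else:
        print("Cannot reach Impala: check the Bridged setting and the IP address")
```

If the check fails, the usual suspects are the network card still being set to NAT or the VM having picked up a different address.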
We install the ODBC driver:
From here, the procedure to create objects is the same as in any other project:
Table: sample_08
Column: code
Table: sample_08
Column: description
There you go; you just created your first Hadoop report.
Executing Hadoop reports is no different from running reports against any other standard DBMS. The ODBC driver handles the communication with the Cloudera machine, and Impala manages the creation of jobs to retrieve data. From the MicroStrategy perspective, it is just another SELECT query that returns a dataset.
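To make the "just another SELECT" point concrete, here is a small Python sketch of a query over the two columns mapped above. It is only an approximation of what the report requests, not the SQL that MicroStrategy actually generates. The function accepts any DB-API 2.0 connection; so that the sketch runs without a cluster, the demo uses an in-memory SQLite table as a stand-in, filled with made-up placeholder rows.

```python
import sqlite3

# Table and column names are the ones mapped in the project above.
REPORT_SQL = "SELECT code, description FROM sample_08"


def run_report(connection, limit=10):
    """Run the report query on a DB-API 2.0 connection; return up to `limit` rows."""
    cursor = connection.cursor()
    try:
        cursor.execute(REPORT_SQL)
        return [tuple(row) for row in cursor.fetchmany(limit)]
    finally:
        cursor.close()


def demo_connection():
    """In-memory SQLite stand-in for the Hadoop table (placeholder data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sample_08 (code TEXT, description TEXT)")
    conn.executemany(
        "INSERT INTO sample_08 VALUES (?, ?)",
        [("A", "first placeholder"), ("B", "second placeholder")],
    )
    return conn


if __name__ == "__main__":
    for row in run_report(demo_connection()):
        print(row)
```

Against the real cluster, the same function could be given a connection opened through the ODBC driver, for example with the third-party pyodbc package and a hypothetical DSN: pyodbc.connect("DSN=HADOOP_IMPALA", autocommit=True). Since the driver is 32-bit, this would presumably need a 32-bit Python interpreter.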
Impala and Hive do not support the full ANSI SQL syntax, so in some cases you may receive an error if a specific feature is not implemented:
See the Cloudera documentation for details.
Vertica Analytic Database is grid-based, column-oriented, and designed to manage large, fast-growing volumes of data while providing rapid query performance. Its storage organization favors SELECT statements over UPDATE and DELETE, and it achieves high compression by storing columns of homogeneous data type together.
The Community (free) Edition allows up to three hosts and 1 TB of data, which is sufficient for small to medium BI projects with MicroStrategy. There are several clients available for different operating systems, including 32-bit and 64-bit ODBC drivers for Windows.