(For more resources related to this topic, see here.)
Each ecosystem component has its own security challenges and needs to be configured uniquely based on its architecture to secure them. Each of these ecosystem components has end users directly accessing the component or a backend service accessing the Hadoop core components (HDFS and MapReduce).
The following are the topics that we’ll be covering in this article:
- Configuring authentication and authorization for the following Hadoop ecosystem components:
- Best practices in configuring secured Hadoop components
Configuring Kerberos for Hadoop ecosystem components
The Hadoop ecosystem is growing continuously and maturing with increasing enterprise adoption. In this section, we look at some of the most important Hadoop ecosystem components, their architecture, and how they can be secured.
Hive provides the ability to run SQL queries over the data stored in the HDFS. Hive provides the Hive query engine that converts Hive queries provided by the user to a pipeline of MapReduce jobs that are submitted to Hadoop (JobTracker or ResourceManager) for execution. The results of the MapReduce executions are then presented back to the user or stored in HDFS. The following figure shows a high-level interaction of a business user working with Hive to run Hive queries on Hadoop:
There are multiple ways a Hadoop user can interact with Hive and run Hive queries; these are as follows:
- The user can directly run the Hive queries using Command Line Interface (CLI). The CLI connects to the Hive metastore using the metastore server and invokes Hive query engine directly to execute Hive query on the cluster.
- Custom applications written in Java and other languages interacts with Hive using the HiveServer. HiveServer, internally, uses the metastore server and the Hive Query Engine to execute the Hive query on the cluster.
To secure Hive in the Hadoop ecosystem, the following interactions should be secured:
- User interaction with Hive CLI or HiveServer
- User roles and privileges needs to be enforced to ensure users have access to only authorized data
- The interaction between Hive and Hadoop (JobTracker or ResourceManager) has to be secured and the user roles and privileges should be propagated to Hadoop jobs
To ensure secure Hive user interaction, there is a need to ensure that the user is authenticated by HiveServer or CLI before running any jobs on the cluster. The user has to first use the kinit command to fetch the Kerberos ticket. This ticket is stored in the credential cache and used to authenticate with Kerberos-enabled systems.
Once the user is authenticated, Hive submits the job to Hadoop (JobTracker or ResourceManager). Hive needs to impersonate the user to execute MapReduce on the cluster. From Hive Version 0.11 onwards, HiveServer2 was introduced. The earlier HiveServer had serious security limitations related to user authentication.
HiveServer2 supports Kerberos and LDAP authentication for the user authentication.
When HiveServer2 is configured to have LDAP authentication, Hive users are managed using the LDAP store. Hive asks the users to submit the MapReduce jobs to Hadoop. Thus, if we configure HiveServer2 to use LDAP, only the user authentication between the client and HiveServer2 is addressed. The interaction of Hive with Hadoop is insecure, and Hive MapReduce will be able to access other users’ data in the Hadoop cluster.
On the other hand, when we use Kerberos authentication for Hive users with HiveServer2, the same user is impersonated to execute MapReduce on the Hadoop cluster. So it is recommended that in a production environment, we configure HiveServer2 with Kerberos to have a seamless authentication and access control for the users submitting Hive queries. The credential store for Kerberos KDC can be configured to be LDAP so that we can centrally manage the user credentials of the end users.
To set up a secured Hive interactions, we need to do the following steps:
One of the key steps in securing Hive interaction is to ensure that the Hive user is impersonated in Hadoop, as Hive executes a MapReduce job on the Hadoop cluster. To achieve this goal, we need to add the hive.server2.enable.impersonation configuration in hive-site.xml, and hadoop.proxyuser.hive.hosts and hadoop. proxyuser.hive.groups in core-site.xml.
<property> <name>hive.server2.authentication</name> <value>KERBEROS</value> </property> <property> <name>hive.server2.authentication.kerberos.principal</name> <value>hive/[email protected]</value> </property> <property> <name>hive.server2.authentication.kerberos.keytab</name> <value>/etc/hive/conf/hive.keytab</value> </property> <property> <name>hive.server2.enable.impersonation</name> <description>Enable user impersonation for HiveServer2</description> <value>true</value> </property>
Securing Hive using Sentry
In the previous section, we saw how Hive authentication can be enforced using Kerberos and the user privileges that are enforced by using user impersonation in Hadoop by the superuser.
Sentryis the one of the latest entrant in the Hadoop ecosystem that provides finegrained user authorization for the data that is stored in Hive. Sentry provides finegrained, role-based authorization to Hive and Impala. Sentry uses HiveServer2 and metastore server to execute the queries on the Hadoop platform. However, the user impersonation is turned off in HiveServer2 when Sentry is used. Sentry enforces user privileges on the Hadoop data using the Hive metastore. Sentry supports authorization policies per database/schema. This could be leveraged to enforce user management policies.
More details on Sentry are available at the following URL:
In this article we learned how to configure Kerberos for Hadoop ecosystem components. We also looked at how to secure Hive using Sentry.
Resources for Article:
- Advanced Hadoop MapReduce Administration [Article]
- Managing a Hadoop Cluster [Article]
- Making Big Data Work for Hadoop and Solr [Article]