IBM Support

Ranger: Things to consider for a Hive + HDFS Use Case

Technical Blog Post


Abstract
Ranger is the new tool in the Hadoop arsenal for managing security for many services through one UI. This article explores what to consider for a Hive + HDFS use case in the context of Ranger.

Hive + HDFS Use Case
Hadoop systems see a wide range of data access patterns. Some systems are set up to allow access to HDFS data only via SQL engines like Hive, with other services reaching the data through Hive. Some separate the data that is open to SQL access from the data that is not. Others need to allow unrestricted access to the same data via different services.

Based on the access pattern in use, there are two streams of use cases:
1. Non-Impersonation Use Case
For better security control, it is desirable to clamp down access at the lower levels and control it at the higher service levels. For a Hive + HDFS use case, this means disabling impersonation by setting the hive.server2.enable.doAs property to false and enabling SQL Standard Authorization in Hive. End users connecting to Hive are then vetted by Hive and allowed read/write access to data only if that permission is granted to the user in Hive. The actual data residing in HDFS is accessible only to service users like hive.

2. Impersonation Use Case
If multiple services need access to the data residing in HDFS, then HDFS becomes the central place for all the security controls, and the other services impersonate the end users while accessing the data. This gives consistent security controls across the services that access the same data. Read/write permissions to the data are granted through rwx permissions or ACLs in HDFS.

The above applies regardless of whether Ranger is available in the system.
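
As a rough illustration of the two use cases, the sketch below renders the relevant hive-site.xml properties from Python. Only hive.server2.enable.doAs is named in this article. The three authorization properties are the ones the Hive documentation associates with SQL Standard Based Authorization and are included here as an assumption; the helper name to_hive_site_fragment is hypothetical. Verify the exact property set against your Hive version before using it.

```python
from xml.sax.saxutils import escape

# Non-impersonation: Hive vets end users itself.
# hive.server2.enable.doAs is taken from the article; the remaining
# properties are assumed from the Hive SQL Standard Authorization docs.
NON_IMPERSONATION = {
    "hive.server2.enable.doAs": "false",
    "hive.security.authorization.enabled": "true",
    "hive.security.authorization.manager": (
        "org.apache.hadoop.hive.ql.security.authorization.plugin."
        "sqlstd.SQLStdHiveAuthorizerFactory"
    ),
    "hive.security.authenticator.manager": (
        "org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator"
    ),
}

# Impersonation: HiveServer2 runs queries as the end user, so HDFS
# permissions and ACLs decide what that user may read or write.
IMPERSONATION = {
    "hive.server2.enable.doAs": "true",
}


def to_hive_site_fragment(props):
    """Render a dict of properties as <property> elements for hive-site.xml."""
    lines = []
    for name, value in props.items():
        lines.append("<property>")
        lines.append(f"  <name>{escape(name)}</name>")
        lines.append(f"  <value>{escape(value)}</value>")
        lines.append("</property>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_hive_site_fragment(NON_IMPERSONATION))
```

The fragment is meant to be pasted inside the <configuration> element of hive-site.xml, or applied through whatever configuration UI the cluster manager provides.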

Why Ranger?
Ranger is a framework for setting up security controls and auditing data access for many services through one UI. Without Ranger, these controls have to be set up individually in each service; Ranger lets you do the per-service setup from one place. Auditing is a valuable addition for the services Ranger is enabled for, since it comes at no additional cost.

Which Ranger plugins to enable?
Ranger has plugins for various services. Enabling a plugin allows both setting access controls and auditing access via that service. The list of services that support Ranger will keep growing, but at the time of writing, these are the services that have a Ranger plugin:
HDFS, YARN, Hive, HBase, Knox.

Having a plugin for a service does not mean you have to enable the plugin. Enabling a plugin lets you set up security controls for access at that service and allows for auditing that access.

The HDFS Ranger plugin allows access control settings in HDFS as well as in Ranger. This means that a given access is allowed if either HDFS has the permissions or ACLs for the access, or Ranger has a policy defined for the access.

On the other hand, when the Hive Ranger plugin is enabled, a given access is allowed only if a Ranger policy allows it. One important thing to note is that the Hive Ranger plugin applies only to access via HiveServer2, not to Hive CLI, which switches to SQLStandardAuthorization when the Hive plugin is enabled. Hive CLI is used when you invoke the hive shell, or when a program or another service invokes Hive using the client APIs.
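
The difference between the two plugins can be summarized as a small decision model. This is a toy sketch of the semantics described above, not Ranger's actual evaluation code; the function and parameter names are hypothetical, and each boolean input stands for the outcome of a check that the real system performs.

```python
def hdfs_access_allowed(hdfs_permits: bool, ranger_permits: bool) -> bool:
    """HDFS plugin enabled: either source of permission is sufficient.

    hdfs_permits   -- rwx permissions or ACLs in HDFS allow the access
    ranger_permits -- a Ranger HDFS policy allows the access
    """
    return hdfs_permits or ranger_permits


def hive_access_allowed(
    via_hiveserver2: bool, ranger_permits: bool, sqlstd_permits: bool
) -> bool:
    """Hive plugin enabled: the deciding authority depends on the entry point.

    Access via HiveServer2 is decided by Ranger policies alone.
    Access via Hive CLI is outside the plugin and, per the article,
    falls under SQL Standard Authorization instead.
    """
    if via_hiveserver2:
        return ranger_permits
    return sqlstd_permits


if __name__ == "__main__":
    # HDFS: native permissions are enough even without a Ranger policy.
    print(hdfs_access_allowed(hdfs_permits=True, ranger_permits=False))
    # Hive via HiveServer2: without a Ranger policy the access is denied,
    # whatever SQL Standard Authorization grants exist.
    print(hive_access_allowed(True, ranger_permits=False, sqlstd_permits=True))
```

The practical consequence is that enabling the HDFS plugin does not take away existing access, while enabling the Hive plugin can, if the matching Ranger policies are not in place.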

In our Hive + HDFS use case, which plugin you enable depends on whether impersonation is used.

1. Non-Impersonation Use Case
This use case is characterized by access controls set up at the Hive level. If these accesses need to be audited, or if more granular access control is needed, consider enabling the Hive Ranger plugin. If all the relevant access is via Hive, you might choose not to enable the HDFS Ranger plugin.
On the other hand, if there is a separate set of data that is access controlled at the HDFS layer, and these accesses need to be audited, then enabling the HDFS Ranger plugin might be better.

2. Impersonation Use Case
If the use case requires all access control to be done in HDFS, then it might make sense to enable the HDFS Ranger plugin and set up read/write permissions through Ranger policies instead of ACLs.
Enabling the Hive Ranger plugin requires disabling impersonation and setting up SQL Standard Authorization for Hive CLI. If this conflicts with the needs of the use case, the Hive Ranger plugin is best left disabled. Since all access is vetted in HDFS, auditing happens at the HDFS layer, which also provides an audit trail for access via Hive.

When enabling Ranger for both HDFS and Hive, enable one plugin at a time and verify all the access paths for the Hive use cases after each step. The suggestion is to enable the HDFS plugin first, make sure Hive still works well, and then enable the Hive plugin.
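
The guidance in this section can be condensed into a small helper. This is only a sketch of the reasoning above, with hypothetical names (UseCase, recommend_plugins) and simplified yes/no inputs; real deployments will have more factors to weigh.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    impersonation: bool
    # Non-impersonation only: auditing or finer-grained control at Hive.
    hive_audit_or_fine_grained: bool = False
    # Non-impersonation only: separate data controlled at the HDFS layer
    # whose accesses need to be audited.
    hdfs_controlled_data_needs_audit: bool = False


def recommend_plugins(use_case: UseCase) -> list:
    """Return the Ranger plugins to enable, in the suggested enablement order."""
    plugins = []
    if use_case.impersonation:
        # All access is vetted in HDFS. The Hive plugin would require
        # disabling impersonation, so it is left out.
        plugins.append("HDFS")
    else:
        # HDFS is listed first because the article suggests enabling it
        # before Hive when both plugins are wanted.
        if use_case.hdfs_controlled_data_needs_audit:
            plugins.append("HDFS")
        if use_case.hive_audit_or_fine_grained:
            plugins.append("Hive")
    return plugins


if __name__ == "__main__":
    print(recommend_plugins(UseCase(impersonation=True)))
    print(recommend_plugins(UseCase(False, True, True)))
```

An empty result for a non-impersonation system means neither auditing nor finer-grained control was requested, in which case Hive's own authorization may be sufficient.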

Conclusion
Not every Ranger plugin needs to be enabled for every use case. Choose which plugins to enable or disable based on the usage pattern, the auditing needs, and the required granularity of control.
