As in many software projects, security in Hadoop was initially an afterthought. Early Apache Hadoop use models were based on clusters of machines processing large amounts of public data (crawled web pages) inside a private data center. Because the collected data was already public, protecting the data and its results was not a priority of the implementation.
But as Hadoop has grown, so have the use models to which it is being applied. Today, not only are private data sets being processed, but Hadoop is being applied to multi-tenant scenarios in which varying datasets are being processed by a variety of users (each with different needs for the raw and processed data). Hadoop is also increasingly applied to sensitive data sets where data must be encrypted to avoid leaks. For this reason, security is now considered an integral part of a Hadoop cluster. In this article, I explore some of the interesting work going on with Hadoop for security. I look at data security, perimeter security, and data access security.
Big data, privacy, and data vulnerability
With big data comes big responsibility. As data grows, and as applications and frameworks grow to exploit that data, new vulnerabilities appear that must be addressed. One of the most famous examples of privacy issues related to big data occurred in 2006. As part of a contest to improve its movie recommendation service, Netflix released a large data set for researchers to use in their search for more accurate algorithms. The data set consisted of Netflix internal data that was anonymized to remove information that could directly identify a customer. Unfortunately, researchers were able to identify individual users by correlating their anonymized reviews in the Netflix data with public reviews on the Internet Movie Database.
This case illustrated that even anonymized data can be traced back to real users by correlating the information with other public sources. Netflix responded by excluding certain information from the data set while retaining other information, such as the users' zip code, age, and gender, but this approach was also flawed. Researchers at Carnegie Mellon University had shown that with just a zip code, age, and gender, a person can be identified 87% of the time.
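The attack boils down to a join on quasi-identifiers. The sketch below illustrates the idea with hypothetical data: every name, record, and field here is invented for illustration and has no connection to the actual Netflix data set.

```python
# Illustrative sketch: "anonymized" records can be re-identified by joining
# quasi-identifiers (zip code, age, gender) against a public source.
# All names and records below are hypothetical.

anonymized_reviews = [
    {"user": "u1", "zip": "60601", "age": 34, "gender": "F", "title": "Movie A"},
    {"user": "u2", "zip": "94103", "age": 51, "gender": "M", "title": "Movie B"},
]

public_profiles = [
    {"name": "Alice Example", "zip": "60601", "age": 34, "gender": "F"},
    {"name": "Bob Example", "zip": "94110", "age": 51, "gender": "M"},
]

def reidentify(reviews, profiles):
    """Link each 'anonymous' review to any public profile that shares
    the same quasi-identifier triple (zip, age, gender)."""
    matches = {}
    for r in reviews:
        key = (r["zip"], r["age"], r["gender"])
        hits = [p["name"] for p in profiles
                if (p["zip"], p["age"], p["gender"]) == key]
        if hits:
            matches[r["user"]] = hits
    return matches

print(reidentify(anonymized_reviews, public_profiles))
# u1 links uniquely to "Alice Example"; u2 has no match (different zip).
```

The point is that none of the joined fields directly identifies anyone; it is their combination, correlated with an external source, that does.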
This case illustrates one danger of big data: even after attempts to anonymize the data, it is still possible to leak sensitive or private information through external correlation. When data sets fall under regulatory requirements (such as the Health Insurance Portability and Accountability Act [HIPAA]), more must be done to avoid the serious legal consequences that can result.
Let's start with a description of the problem, and then dive into Cloudera's solution, called Sentry.
Security within the Hadoop ecosystem
Today, Hadoop supports strong security at the file system level. Recall that the Hadoop Distributed File System (HDFS) is implemented within another native file system (such as the third extended file system [ext3]). Access controls for Hadoop are implemented by using file-based permissions that follow the UNIX® permissions model. Although this model provides file-level permissions within the HDFS, it lacks more fine-grained access controls.
As an example, consider a file within the HDFS that contains movie reviews for a set of users. This data consists of a user ID, zip code, gender, age, movie title, and review. In Hadoop, access is an all-or-nothing model: if you can access the file through the permissions model, you can access all fields within it. What's needed is a more fine-grained model of access, where highly privileged roles are granted access to all data within the file and lower-privileged roles see only a subset of its fields (such as all data except the user ID and zip code). Lower-privileged access minimizes the possibility of leaking user information, and role-based access to individual fields makes it possible to restrict access within files instead of all-or-nothing file access.
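The difference between the two models can be sketched in a few lines. This is a conceptual illustration only; the clearance scheme and field sets below are hypothetical, not part of Hadoop or Sentry:

```python
# Sketch of field-level (column) access control. The Hadoop permissions
# model grants all-or-nothing access to a file; this filter instead
# exposes only the fields a role is cleared to see.

RECORD = {"user_id": "u42", "zip": "60601", "gender": "F",
          "age": 34, "title": "Movie A", "review": "Great film."}

# Hypothetical per-role visible-field sets (illustrative, not Sentry syntax).
ROLE_FIELDS = {
    "admin":   {"user_id", "zip", "gender", "age", "title", "review"},
    "analyst": {"gender", "age", "title", "review"},  # no user ID or zip
}

def project(record, role):
    """Return only the fields the given role is allowed to read."""
    allowed = ROLE_FIELDS.get(role, set())
    return {k: v for k, v in record.items() if k in allowed}

print(project(RECORD, "analyst"))
# The analyst sees the review but not the re-identifying user_id and zip.
```

An unknown role maps to the empty field set, so the default is to deny everything rather than fall back to full access.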
The overall problem of data security within Hadoop becomes even more difficult when you consider its implementation. Hadoop, and its underlying file system, is a complex distributed system with many points of contact. Given its complexity and scale, the application of security to this system is a challenge by itself. Any security implementation must integrate with the overall architecture to ensure proper security coverage.
Authorization frameworks and RBACs
Sentry supports the model for role-based access described previously, called Role-Based Access Control (RBAC), over relational forms of data (databases, tables, views, and so on). The RBAC model provides several features necessary for secure enterprise big data environments. The first is secure authorization, which enforces access to data for authenticated users. Users are assigned to roles, and roles are granted privileges on data. This approach allows the model to scale: administrators categorize users into roles rather than assign detailed privileges to each user. It also simplifies permission management, reduces the load on administrators, and minimizes the potential for errors and unintended access.
Further, administration of privileges can be delegated to multiple administrators at the database or schema level, and fine-grained access controls to data and metadata can be applied within databases. For example, one role might permit select on data, where another role might permit insert (at the server, database, and table scopes). Per my Netflix example, this means that for less strict security levels, roles can be defined that restrict visibility into personally identifiable fields (such as the user ID and zip code).
Finally, Sentry implements authentication by using the existing and proven Kerberos authentication protocol, which is integrated into Hadoop.
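The user-to-group-to-role-to-privilege chain described in this section can be sketched in a few lines of Python. This is a minimal illustration of the RBAC idea; the names, data layout, and function are hypothetical and do not reflect Sentry's actual interfaces:

```python
# Sketch of role-based authorization in the spirit of Sentry: users belong
# to groups, groups map to roles, and roles hold privileges scoped to
# (server, database, table) with an action such as select or insert.

USER_GROUPS = {"ted": ["analysts"], "dana": ["admins"]}
GROUP_ROLES = {"analysts": ["analyst_role"], "admins": ["admin_role"]}
ROLE_PRIVS = {
    "analyst_role": [("server1", "moviedb", "reviews", "select")],
    "admin_role":   [("server1", "moviedb", "reviews", "select"),
                     ("server1", "moviedb", "reviews", "insert")],
}

def authorized(user, server, db, table, action):
    """True if any role reachable from the user's groups grants the
    requested action at this (server, database, table) scope."""
    for group in USER_GROUPS.get(user, []):
        for role in GROUP_ROLES.get(group, []):
            if (server, db, table, action) in ROLE_PRIVS.get(role, []):
                return True
    return False

print(authorized("ted", "server1", "moviedb", "reviews", "select"))   # True
print(authorized("ted", "server1", "moviedb", "reviews", "insert"))   # False
```

Because privileges attach to roles rather than users, granting a new analyst access is a one-line group-membership change instead of a per-user privilege assignment.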
Sentry with HDFS, Hive, and Impala
Figure 1 presents the basic architecture of Sentry. As you'll soon discover, it was designed for extensibility to support a wide variety of Hadoop-based applications and portability for varying forms of data providers.
Figure 1. Basic architecture of Sentry
Today, Cloudera has implemented support for many important open source Structured Query Language query engines, including Apache Hive (through the HiveServer2 thrift-based Remote Procedure Call interface) and Cloudera Impala. Each application is secured through a set of bindings implemented for that particular application. These bindings work with the policy engine to evaluate and validate predefined security policies and, when access is approved, work through a policy abstraction to gain access to the underlying data. Today, a file-based abstraction is provided that integrates support for the HDFS or access to the local file system for the security policies.
So, what does this mean for Hive and Impala? Sentry permits fine-grained authorization with the ability to define security controls over a server, database, table, and view, including the ability to specify select privileges for views and tables, insert privileges on tables, and transform privileges on servers. Each database or schema can have separate authorization policies. Sentry also provides support for Hive's metastore architecture.
To support greater extensibility, Sentry can secure new applications such as Apache Pig (through a set of Pig bindings) and enable access to new abstractions for access to security policies (such as a database). All are implemented as pluggable interfaces.
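The pluggable design described above, with per-application bindings in front of a common policy engine and interchangeable policy back ends, can be sketched as follows. The interfaces here are hypothetical illustrations of the architecture, not Sentry's real classes:

```python
# Sketch of the binding / policy-engine / policy-provider layering:
# each application (Hive, Impala, Pig, ...) supplies a "binding" that
# turns its requests into a common authorization question, and the
# policy engine reads policies through an interchangeable provider
# (file-backed, database-backed, and so on).

from abc import ABC, abstractmethod

class PolicyProvider(ABC):
    """Abstraction over where policies live (HDFS file, local file, DB)."""
    @abstractmethod
    def privileges_for(self, role):
        ...

class InMemoryProvider(PolicyProvider):
    """Simplest possible provider, standing in for a file or database."""
    def __init__(self, policies):
        self._policies = policies
    def privileges_for(self, role):
        return self._policies.get(role, set())

class PolicyEngine:
    def __init__(self, provider):
        self.provider = provider
    def allowed(self, role, resource, action):
        return (resource, action) in self.provider.privileges_for(role)

class SqlBinding:
    """Binding for a hypothetical SQL engine: maps a query check onto
    the shared policy engine."""
    def __init__(self, engine):
        self.engine = engine
    def check_query(self, role, table, operation):
        return self.engine.allowed(role, table, operation)

provider = InMemoryProvider({"analyst_role": {("reviews", "select")}})
binding = SqlBinding(PolicyEngine(provider))
print(binding.check_query("analyst_role", "reviews", "select"))  # True
```

Supporting a new application means writing another binding; supporting a new policy store means writing another provider. Neither change touches the engine in the middle, which is the point of the abstraction.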
Sentry is available today as part of the Cloudera CDH version 4.3 release for use with Hive and Impala version 1.1. You can also download it separately from the Cloudera website as an add-on. Sentry is released under an Apache 2 license.
Other aspects of security within Hadoop
Sentry provides a role-based authorization framework, but that's not the only security innovation coming to Hadoop. Let's look at some of the other work going on for securing and controlling access to big data.
Project Rhino is an open source effort by Intel to enhance Hadoop with additional protection mechanisms. The goal is to fill gaps representing insecurity within the Hadoop stack and provide multicomponent security within the Hadoop ecosystem. To that end, Intel has several development items addressing a variety of topics related to security and focusing on crypto capabilities.
Among the variety of work being implemented under Rhino, some of the most interesting cover new crypto capabilities for encryption and decryption of files over a number of use models. For example, the addition of a common abstraction layer for crypto codecs implements an application programming interface (API) through which multiple crypto codecs can be registered and used within the framework. To support this capability, a key distribution and management framework is also in the works.
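The shape of such a codec registry can be sketched briefly. This is a hypothetical illustration of the "register and look up by name" pattern, not Rhino's actual API, and the XOR codec is deliberately trivial; a real codec would wrap a vetted cipher such as AES:

```python
# Sketch of a common crypto codec abstraction: codecs register under a
# name, and callers encrypt or decrypt through the registry without
# knowing which implementation is in use.

CODECS = {}

def register_codec(name, codec):
    CODECS[name] = codec

class XorCodec:
    """Toy, NON-secure codec used only to show the plug-in shape."""
    def __init__(self, key):
        self.key = key
    def encrypt(self, data):
        return bytes(b ^ self.key[i % len(self.key)]
                     for i, b in enumerate(data))
    def decrypt(self, data):
        return self.encrypt(data)  # XOR is its own inverse

register_codec("xor", XorCodec(b"secret"))

codec = CODECS["xor"]
ciphertext = codec.encrypt(b"movie review data")
print(codec.decrypt(ciphertext))  # b'movie review data'
```

With this shape in place, swapping in a hardware-accelerated cipher is a registration change rather than a change to every caller, which is also why the companion key distribution and management framework matters: the codecs need keys delivered to them securely.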
A Hadoop cryptographic file system (called Hadoop CFS) is also under construction; it will provide low-level cryptographic services for files within the HDFS. At this level, any Hadoop user (from MapReduce applications to Hive, Apache HBase, and Pig) can transparently exploit the new data security.
Other services under construction include transparent encryption of snapshots and commit logs on disk and new Pig capabilities to support encryption-aware load and store functions.
Apache Knox Gateway
The Apache Knox Gateway provides a perimeter security solution for Hadoop. Where Sentry provides fine-grained access controls to data, the Knox Gateway provides controlled access to Hadoop services. The goal of the Knox Gateway is to provide a single point of secure access for Hadoop clusters. The solution is implemented as a gateway (or small cluster of gateways) that exposes access to Hadoop clusters through a Representational State Transfer (REST)-ful API. The gateway provides a firewall between users and Hadoop clusters (see Figure 2) and can manage access to clusters that run different versions of Hadoop.
Figure 2. Perimeter security with the Apache Knox Gateway
The Knox Gateway is a complementary security solution to Sentry that provides the outer level of access security. As a gateway in a demilitarized zone, Knox Gateway provides controlled access to one or more Hadoop clusters segregated by network firewalls.
Where the Apache Knox Gateway and Sentry provide perimeter and data access security, one missing element is HDFS data access from MapReduce tasks. One solution, used by Oozie, relies on the concept of delegation tokens. A delegation token supports a two-party authentication protocol: a user first authenticates through Kerberos and receives a delegation token; the user can then provide that token to the JobTracker so that the resulting Hadoop jobs for that user can rely on the token for secure access to data within the HDFS.
Oozie, a workflow scheduler system that manages Hadoop jobs, uses delegation tokens when submitting jobs to Hadoop. An authenticated user provides a job to Oozie, which results in a request for a delegation token from the JobTracker. As part of job submission, that delegation token is provided for the future Hadoop work that accesses the HDFS. Any resulting MapReduce tasks for the job then use the associated delegation token to secure their access to data within the HDFS.
Delegation tokens rely on a two-party authentication protocol that is simpler and more efficient than the three-party protocol that Kerberos uses. This difference minimizes Kerberos traffic, which improves scaling and reduces load on the Kerberos infrastructure.
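The core of the two-party scheme is a shared secret between the token issuer and the services that verify tokens, typically realized with an HMAC. The sketch below shows that idea in miniature; the field names and master-key handling are illustrative only, and real Hadoop tokens carry additional metadata (renewer, issue and expiry dates, and so on):

```python
# Sketch of the two-party delegation-token idea: after an initial
# Kerberos authentication (not shown), the issuer hands the user a token
# identifier plus a password computed as HMAC(master_key, identifier).
# Any service holding the same master key can verify the token locally,
# without another round trip to the Kerberos infrastructure.

import hashlib
import hmac

MASTER_KEY = b"cluster-master-key"  # shared by token issuer and verifiers

def issue_token(owner):
    """Issue (identifier, password) for an already-authenticated user."""
    identifier = f"owner={owner}".encode()
    password = hmac.new(MASTER_KEY, identifier, hashlib.sha256).digest()
    return identifier, password

def verify_token(identifier, password):
    """Recompute the HMAC and compare in constant time."""
    expected = hmac.new(MASTER_KEY, identifier, hashlib.sha256).digest()
    return hmac.compare_digest(expected, password)

ident, pw = issue_token("ted")
print(verify_token(ident, pw))         # True
print(verify_token(ident, b"forged"))  # False
```

A forged or tampered token fails verification because its password no longer matches the HMAC over the identifier, which is why MapReduce tasks can safely carry the token for the lifetime of a job.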
Sentry, contributed to the Apache Incubator by Cloudera, is a great step in the direction of an extensible authorization framework. As Hadoop clusters grow and their multi-tenancy increases, Sentry will provide the basis for protecting sensitive data and minimize the potential for leaks that were previously possible. For health care, financial, or government deployments that must comply with strict data regulations (such as HIPAA or the Sarbanes-Oxley Act [SOX]), Sentry is a welcome addition to the Hadoop ecosystem. And although Sentry won't solve all of the issues that a complex system such as Hadoop presents, it's a step in the right direction. In conjunction with other Hadoop security projects (such as the Apache Knox Gateway and Project Rhino), Hadoop is edging closer to an enterprise-capable secure platform.
- The Sentry main page at Cloudera provides a basic introduction to the framework and links for download and installation (using its Cloudera Distribution for Hadoop, or CDH).
- Hadoop, the open source system for scalable distributed computing, is a top-level Apache project. At the Hadoop site, you can learn about not only Hadoop but the collection of other projects that extend and enhance big data processing with Hadoop. One such project, Apache Oozie, provides a workflow scheduler to manage Hadoop jobs. An open item for Hadoop is the Cryptographic File System. This is covered by an HDFS ticket for future development.
- Delegation tokens implement a two-party authentication protocol to add security for job submissions to a Hadoop cluster.
- Differential privacy refers to the goal of providing accurate queries from a statistical database while minimizing the chance of identifying its records. Netflix, in its open search for a more accurate recommendation algorithm, discovered this issue when it released data whose records could be tied back to known people by using external data for correlation.
- For another perspective on role-based access control, check out the "Anatomy of Security Enhanced Linux (SELinux)" (developerWorks, May 2012), which describes the SELinux security architecture within Linux®.
- Sensitive data comes in many forms, but a growing list of regulations exists to ensure that data is kept private. These regulations include HIPAA for medical data and SOX for financial data.
- Kerberos is a network authentication protocol designed to provide strong authentication for client-server applications using secret-key cryptography.
- The Apache Knox Gateway relies on the REST architectural style for its interface. You can learn more about this interface style in "Understand Representational State Transfer (REST) in Ruby" (developerWorks, August 2012).
- Get more information on security topics in the Security site on developerWorks.
Get products and technologies
- Project Rhino is a collection of enhancements to the Hadoop ecosystem to improve its security so that it can be applied to new markets that have data security and compliance challenges.
- The Apache Knox Gateway is a gateway solution to provide perimeter security to one or more Hadoop clusters. It implements a single point of secure access.