August 20, 2021 | By Maya Anderson and Gal Lushi | 4 min read

Protecting the confidentiality and integrity of data is important to enterprises in many fields, such as healthcare, finance, transportation and energy.

IBM Cloud enables enterprises to use a new open standard for big data security: Parquet Modular Encryption (PME). IBM Research initiated and led joint work with the Apache Parquet community to address critical issues in securing the confidentiality and integrity of sensitive data, without degrading the performance of analytic systems [2], [3].

Apache Parquet is the industry-leading standard for formatting, storing and efficiently processing big data [12]. Parquet Modular Encryption encrypts Parquet files module by module: the footer, page headers, column indexes, offset indexes, pages and so on. It thus not only enables granular control of the data based on access to per-column encryption keys, but also preserves all the benefits of efficient analytics on Parquet. This includes column projection and predicate push-down, where entire file parts can be skipped if the metadata indicates that the part has no matching values.

PME has already reached some major milestones. Both Java and C++ implementations of Parquet with PME have been released, and the upcoming Apache Spark [13] 3.2 release will use the Java implementation [1].
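In Spark 3.2, PME can be switched on through Hadoop configuration properties, per the Spark columnar-encryption documentation. A configuration sketch in spark-defaults style — the key IDs and base64 key material here are illustrative, and InMemoryKMS is a mock client intended only for testing:

```
spark.hadoop.parquet.crypto.factory.class        org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory
spark.hadoop.parquet.encryption.kms.client.class org.apache.parquet.crypto.keytools.mocks.InMemoryKMS
spark.hadoop.parquet.encryption.key.list         keyA:AAECAwQFBgcICQoLDA0ODw== , keyB:AAECAAECAAECAAECAAECAA==

# Per-write options on the DataFrame writer then select the keys:
#   parquet.encryption.footer.key    keyA
#   parquet.encryption.column.keys   keyB:ssn
```

In production, the mock KMS client class would be replaced with one that talks to a real key management service.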

Parquet encryption in OSS

The Java implementation of the Parquet encryption standard has just been released by the Apache Parquet community in parquet-mr 1.12 [5]. The Apache Spark community is now working on integrating it into the upcoming Spark 3.2 release. In addition, after the C++ implementation was released in Apache Arrow [6], IBM Research began working with the Apache Arrow community to expose PME in PyArrow as well.

In short, PME has been a successful and effective open-source community effort, initiated and led by IBM Research. We would like to thank the Apache Parquet community for the fruitful collaboration.

Parquet encryption in IBM products

PME is already available to IBM customers in IBM® Analytics Engine [7] and in IBM Cloud Pak® for Data [8], along with an example notebook [9]. It is available with the Spark 2.3.x, Spark 2.4.x and Spark 3.x flavors, and it will also be available with open-source Apache Spark starting from release 3.2.

PME is integrated into Xskipper too, which is an open-source library for data skipping. Xskipper stores the indexes in separate Parquet files, regardless of the original storage format. With PME integration, Xskipper encrypts indexes with a per-index granularity — this enables a user to create data-skipping indexes over sensitive data without leaking information, while still allowing each user to use only the subset of indexes available to them. Xskipper with PME is already available in IBM Analytics Engine [10] and IBM Cloud Pak for Data [11].

IBM Research is also investigating its integration in additional IBM products.

Parquet encryption in CyberKit4SME use cases

An interesting financial use case for PME arose in the European Horizon 2020 project CyberKit4SME [4]. A small financial institution buys Foreign Exchange (ForEx) tick data that records every price change, roughly once per second for every currency pair. The institution gives orders to traders to buy or sell currencies based on analytics models that run on the ForEx data. Confidentiality clearly matters, since this detailed data has been paid for; integrity matters just as much, since financial decisions are made based on the data, and any missing or erroneous values can skew a decision and possibly cause significant financial losses. Moreover, storing the data should be cheap and easy for the SME partner. Saving the data in encrypted Parquet files protects both its confidentiality and its integrity; it is affordable because of Parquet's excellent compression, and analytics queries on these files perform well.

Another interesting use case for PME is smart transportation, also from the CyberKit4SME project [4], where data is collected from cars (e.g., position, acceleration and velocity). This data is used to build and train machine-learning models in TensorFlow, which are then used in smart cars to make real-time decisions.

The data collected from the cars contains sensitive information, so it must be stored in a way that is compact and encrypted. That said, various personas should be able to run Python scripts on this data to analyze it and to train models. PME allows you to store large amounts of data in a compact way encrypted with different encryption keys (e.g., according to sensitivity levels) and to give access to the encryption keys based on security clearance or some other enterprise policy. Access control is achieved by controlling access to the keys without creating multiple replicas of the table — the physical data files remain accessible to a large set of people, but they can only read data for which they have access to keys.

For example, consider two users running queries on the same table, which has five columns encrypted with PME. The first user's access token grants access to four of the columns, and they select three of those; the second user's token grants access to three columns, and they select two. This can be achieved with one key for the least sensitive columns 1, 3 and 5, another key for the more sensitive column 2, and yet another key for the most sensitive column 4.
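The key-based access control described above can be modeled with a short, stdlib-only sketch. The XOR "cipher" here is not secure and stands in only for illustration (real PME uses AES-GCM with wrapped data keys); the column and key names are our own:

```python
# Toy model: each column encrypted with its own master key; a reader can
# decrypt only the columns whose keys they hold. NOT real encryption.
import hashlib

def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    # Keystream derived from SHA-256 of the key (illustration only).
    stream = hashlib.sha256(key).digest()
    return bytes(p ^ stream[i % len(stream)] for i, p in enumerate(plaintext))

toy_decrypt = toy_encrypt  # XOR with the same keystream is its own inverse

# One master key per sensitivity level, as in the example above.
master_keys = {"k_low": b"low-key", "k_mid": b"mid-key", "k_high": b"high-key"}
column_key_ids = {"col1": "k_low", "col2": "k_mid", "col3": "k_low",
                  "col4": "k_high", "col5": "k_low"}

# "Write" the table: every column encrypted under its assigned key.
plain = {c: f"data-{c}".encode() for c in column_key_ids}
encrypted = {c: toy_encrypt(master_keys[k], plain[c])
             for c, k in column_key_ids.items()}

def read_columns(available_keys, wanted):
    """Return decrypted values only for columns whose keys the user holds."""
    out = {}
    for col in wanted:
        key_id = column_key_ids[col]
        if key_id in available_keys:
            out[col] = toy_decrypt(available_keys[key_id], encrypted[col])
    return out

# First user holds the low and mid keys: columns 1, 2, 3, 5 are readable,
# column 4 is not, even though the same physical file is accessible to all.
user1_keys = {k: master_keys[k] for k in ("k_low", "k_mid")}
visible = read_columns(user1_keys, ["col1", "col2", "col4"])
```

The point of the sketch is the single shared file: access is narrowed purely by which keys a user can unwrap, with no per-user table replicas.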


Make sure to try out Parquet Modular Encryption in IBM products [7], [8] and in the upcoming Apache Spark release. We look forward to your feedback.


[1] Data and AI Summit: Data Security at Scale through Spark and Parquet Encryption

[2] Parquet Modular Encryption: Developing a new open standard for big data security

[3] Structured Data and Hybrid Clouds: Getting Value From Your Data While Remaining Secure and Compliant

[4] CyberKit4SME H2020 Project: The CyberKit4SME project has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No 883188.

[5] Apache Parquet 1.12.0 release

[6] Apache Arrow 4.0.0 release

[7] Parquet Encryption in IBM Analytics Engine

[8] Parquet Encryption in Cloud Pak for Data

[9] Example Notebook with Parquet Modular Encryption

[10] Xskipper Index Encryption in IBM Analytics Engine

[11] Xskipper Index Encryption in IBM Cloud Pak for Data

[12] Apache Parquet columnar storage format

[13] Apache Spark – unified analytics engine for large-scale data processing
