August 20, 2021
By Maya Anderson and Gal Lushi
4 min read

Protecting the confidentiality and integrity of data is important to enterprises in many fields, such as healthcare, finance, transportation and energy.

IBM Cloud enables enterprises to use a new open standard for big data security – Parquet Modular Encryption (PME). IBM Research initiated and led joint work with the Apache Parquet community to address critical issues in securing the confidentiality and integrity of sensitive data, without degrading the performance of analytic systems [2], [3].

Apache Parquet is the industry-leading standard for the formatting, storage and efficient processing of big data. Parquet Modular Encryption encrypts Parquet files module by module: the footer, page headers, column indexes, offset indexes, pages and so on. As a result, it not only enables granular control of the data based on access to per-column encryption keys, but also preserves all the benefits of efficient analytics on Parquet. These include column projection and predicate push-down, where entire file parts can be skipped if the metadata indicates that the part has no matching values.
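To make the skipping idea concrete, here is a minimal Python sketch of predicate push-down over per-part min/max metadata. It is an illustration only, not the Parquet reader's implementation: the RowGroup class, the scan_greater_than function and the sample values are hypothetical names introduced for this example. Under PME, the metadata modules listed above are encrypted as well, so a check like this would run only after the reader has decrypted them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for one file part of a single integer column."""
    min_value: int
    max_value: int
    values: List[int]


def make_group(values: List[int]) -> RowGroup:
    # Record min/max metadata at write time, alongside the values.
    return RowGroup(min(values), max(values), list(values))


def scan_greater_than(row_groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of parts actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in row_groups:
        # Metadata check: if even the maximum cannot match, skip the whole part.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


ROW_GROUPS = [
    make_group([1, 5, 9]),
    make_group([12, 15, 18]),
    make_group([20, 25, 31]),
]

if __name__ == "__main__":
    found, read = scan_greater_than(ROW_GROUPS, 16)
    print(f"matches={found}, parts read={read} of {len(ROW_GROUPS)}")
```

In this sketch the first part is never read for the query "value > 16", because its recorded maximum already rules out a match.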

PME has already reached some major milestones. Both Java and C++ implementations of Parquet with PME have been released, and the upcoming Spark 3.2 release is going to use this Java implementation [1].

Parquet encryption in OSS

The Apache Parquet community has just released the Java implementation of the Parquet encryption standard in parquet-mr 1.12 [5]. The Apache Spark community is now working on integrating it into the upcoming Spark 3.2 release. In addition, after the C++ implementation was released in Apache Arrow [6], IBM Research began working with the Apache Arrow community to expose PME in PyArrow as well.

In short, PME has been a successful and effective open-source community effort, initiated and led by IBM Research. We would like to thank the Apache Parquet community for the fruitful collaboration.

Parquet encryption in IBM products

PME is already available to IBM customers in IBM® Analytics Engine [7] and in IBM Cloud Pak® for Data [8], along with an example notebook [9]. It is available with flavors of Spark 2.3.x, Spark 2.4.x and Spark 3.x, and it will be available with the open-source Apache Spark 3.2 release.

PME is also integrated into Xskipper, an open-source library for data skipping. Xskipper stores its indexes in separate Parquet files, regardless of the original storage format. With PME integration, Xskipper encrypts indexes with per-index granularity. This enables a user to create data-skipping indexes over sensitive data without leaking information, while still allowing each user to use only the subset of indexes available to them. Xskipper with PME is already available in IBM Analytics Engine [10] and IBM Cloud Pak for Data [11].

IBM Research is also investigating the integration of PME in additional IBM products.

Parquet encryption in CyberKit4SME use cases

An interesting financial use case that we encountered for PME is part of the European Horizon H2020 project CyberKit4SME [4]. Here, a small financial institute buys Foreign Exchange (ForEx) tick data that records every price change, about once per second, for every pair of currencies. The financial institute gives orders to traders to buy or sell currencies based on the analytics models that run on the ForEx data. Confidentiality is important because this detailed data has been paid for, and integrity is important too because financial decisions are made based on the data. Any missing or erroneous data can affect a decision and possibly result in great financial losses. Moreover, storing the data should be cheap and easy for the SME partner. Saving the data in encrypted Parquet files addresses these needs: it helps protect the confidentiality and integrity of the data, it is affordable because of the excellent compression of Apache Parquet, and the performance of analytics queries running on these Parquet files is very good.

Another interesting use case for PME is a smart transportation use case from the European Horizon H2020 project CyberKit4SME [4], where data is collected from cars (e.g., positions, acceleration and velocity). This data is used to build and train machine-learning models using TensorFlow, which are then used in smart cars to make real-time decisions.

The data collected from the cars contains sensitive information, so it must be stored in a way that is compact and encrypted. At the same time, various personas should be able to run Python scripts on this data to analyze it and to train models. PME allows you to store large amounts of data compactly, encrypted with different encryption keys (e.g., according to sensitivity levels), and to give access to the encryption keys based on security clearance or some other enterprise policy. Access control is achieved by controlling access to the keys, without creating multiple replicas of the table. The physical data files remain accessible to a large set of people, but each person can read only the data for which they have access to the keys.
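As a rough sketch of how per-column keys might be declared when writing such a table, the helper below builds the two write options that the open-source parquet-mr implementation reads, parquet.encryption.column.keys and parquet.encryption.footer.key, as documented for columnar encryption in Apache Spark 3.2. The helper name pme_write_options, the key IDs and the column names are hypothetical, and the property names used in IBM products may differ, so check the product documentation [7], [8] before relying on them.

```python
from typing import Dict, List

# Crypto factory class shipped with parquet-mr 1.12; in Spark it is set on the
# Hadoop configuration under "parquet.crypto.factory.class". A KMS client class
# ("parquet.encryption.kms.client.class") is also required and is deployment-specific.
CRYPTO_FACTORY_CLASS = "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory"


def pme_write_options(column_keys: Dict[str, List[str]], footer_key: str) -> Dict[str, str]:
    """Build PME write options from a {master key ID: [column names]} mapping."""
    seen = set()
    for columns in column_keys.values():
        for column in columns:
            if column in seen:
                raise ValueError(f"column {column!r} is assigned to more than one key")
            seen.add(column)
    # Expected format: "keyId:colA,colB;otherKeyId:colC"
    spec = ";".join(
        f"{key_id}:{','.join(columns)}" for key_id, columns in column_keys.items()
    )
    return {
        "parquet.encryption.column.keys": spec,
        "parquet.encryption.footer.key": footer_key,
    }


# Hypothetical key IDs and column names for the car data described above.
OPTIONS = pme_write_options(
    {"key_low": ["speed"], "key_high": ["latitude", "longitude"]},
    footer_key="key_footer",
)

if __name__ == "__main__":
    for name, value in OPTIONS.items():
        print(f"{name} = {value}")
    # With PySpark these could be passed as df.write.options(**OPTIONS).parquet(path),
    # assuming the crypto factory and a KMS client are already configured.
```

Grouping columns by sensitivity level keeps the number of master keys small, which in turn keeps the access policy easy to reason about.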

For example, consider two different users who run queries on the same table, which has five columns encrypted with PME. The first user selects three columns out of the four available to them based on permissions granted with their access token, and the second user selects two columns out of the three available to them based on their access token. That might be achieved by using one key for the least sensitive columns 1, 3 and 5, another key for the more sensitive column 2 and yet another key for the most sensitive column 4.
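The key assignment in this example can be sketched in a few lines of Python. This is a toy model of the access logic only; it performs no encryption and is not how a Parquet reader enforces access. The key IDs, column names and the assumption that the first user holds the keys for columns 1, 2, 3 and 5 are all hypothetical choices made for illustration.

```python
from typing import Dict, Iterable, List

# One key for the least sensitive columns, one each for the more sensitive ones.
KEY_TO_COLUMNS: Dict[str, List[str]] = {
    "key_low": ["column_1", "column_3", "column_5"],
    "key_mid": ["column_2"],
    "key_high": ["column_4"],
}

# Assumed key grants: the first user can reach four columns, the second three.
USER_1_KEYS = ["key_low", "key_mid"]
USER_2_KEYS = ["key_low"]


def readable_columns(granted_keys: Iterable[str]) -> List[str]:
    """Columns a user can decrypt, given the key IDs they were granted."""
    columns = set()
    for key_id in granted_keys:
        columns.update(KEY_TO_COLUMNS.get(key_id, []))
    return sorted(columns)


def select_columns(granted_keys: Iterable[str], requested: List[str]) -> List[str]:
    """Allow a projection only if every requested column is decryptable."""
    allowed = set(readable_columns(granted_keys))
    denied = [column for column in requested if column not in allowed]
    if denied:
        raise PermissionError(f"no key available for: {', '.join(denied)}")
    return list(requested)


if __name__ == "__main__":
    print(readable_columns(USER_1_KEYS))
    print(select_columns(USER_1_KEYS, ["column_1", "column_2", "column_5"]))
    print(select_columns(USER_2_KEYS, ["column_1", "column_3"]))
```

Both users read the same physical files; only the set of keys they can obtain differs.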


In short, make sure to try out Parquet Modular Encryption in IBM products [7], [8] and in the upcoming Apache Spark release. We look forward to your feedback.


[1] Data and AI Summit: Data Security at Scale through Spark and Parquet Encryption

[2] Parquet Modular Encryption: Developing a new open standard for big data security

[3] Structured Data and Hybrid Clouds: Getting Value From Your Data While Remaining Secure and Compliant

[4] CyberKit4SME H2020 Project: The CyberKit4SME project has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No 883188.

[5] Apache Parquet 1.12.0 release

[6] Apache Arrow 4.0.0 release

[7] Parquet Encryption in IBM Analytics Engine

[8] Parquet Encryption in Cloud Pak for Data

[9] Example Notebook with Parquet Modular Encryption

[10] Xskipper Index Encryption in IBM Analytics Engine

[11] Xskipper Index Encryption in IBM Cloud Pak for Data

[12] Apache Parquet columnar storage format

[13] Apache Spark – unified analytics engine for large-scale data processing
