Parquet Modular Encryption: The New Open Standard for Big Data Security Reaches a Milestone

Protecting the confidentiality and integrity of data is important to enterprises in many fields, such as healthcare, finance, transportation and energy.

IBM Cloud enables enterprises to use a new open standard for big data security – Parquet Modular Encryption (PME). IBM Research initiated and led joint work with the Apache Parquet community to address critical issues in securing the confidentiality and integrity of sensitive data, without degrading the performance of analytic systems [2], [3].

Apache Parquet is the industry-leading standard for the formatting, storage and efficient processing of big data. Parquet Modular Encryption encrypts Parquet files module by module: the footer, page headers, column indexes, offset indexes, pages and so on. As a result, it not only enables granular control of the data based on access to per-column encryption keys, but also preserves all the benefits of efficient analytics on Parquet. These include column projection and predicate push-down, where entire file parts can be skipped if the metadata indicates that the part has no matching values.
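
To illustrate the kind of skipping that predicate push-down relies on, here is a minimal sketch in Python. It is a simplified model, not the Parquet reader itself: the names RowGroupStats and row_groups_to_read are hypothetical, and the min/max values are invented for the example. The idea is that a part of the file is read only if its metadata range can overlap the range requested by the query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroupStats:
    """Min/max metadata for one column in one file part (hypothetical model)."""
    row_group: int
    min_value: int
    max_value: int


def row_groups_to_read(stats, lower, upper):
    """Return the ids of parts whose [min, max] range overlaps [lower, upper]."""
    return [
        s.row_group
        for s in stats
        if s.max_value >= lower and s.min_value <= upper
    ]


# Invented metadata for three parts of a file.
stats = [
    RowGroupStats(row_group=0, min_value=1, max_value=100),
    RowGroupStats(row_group=1, min_value=101, max_value=200),
    RowGroupStats(row_group=2, min_value=201, max_value=300),
]

# A query such as "value BETWEEN 150 AND 180" only needs part 1.
selected = row_groups_to_read(stats, 150, 180)
print(selected)  # [1]
```

With PME, a reader that holds the relevant keys can still consult this kind of metadata, so parts that cannot match are skipped without being read.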

PME has already reached some major milestones. Both Java and C++ implementations of Parquet with PME have been released, and the upcoming Spark 3.2 release is going to use this Java implementation [1].

Parquet encryption in OSS

The Apache Parquet community has just released the Java implementation of the Parquet encryption standard in parquet-mr 1.12 [5]. Now, the Apache Spark community is working on integrating it into its upcoming Spark 3.2 release. In addition, after the C++ implementation was released in Apache Arrow [6], IBM Research began working with the Apache Arrow community to expose PME in PyArrow as well.

In short, PME has been a successful and effective open-source community effort, initiated and led by IBM Research. We would like to thank the Apache Parquet community for the fruitful collaboration.

Parquet encryption in IBM products

PME is already available to IBM customers in IBM® Analytics Engine [7] and in IBM Cloud Pak® for Data [8], and an example notebook is provided [9]. It is available with flavors of Spark 2.3.x, Spark 2.4.x and Spark 3.x, and it is going to be available with the open-source version of Apache Spark 3.2.

PME is also integrated into Xskipper, an open-source library for data skipping. Xskipper stores its indexes in separate Parquet files, regardless of the original storage format. With PME integration, Xskipper encrypts indexes with per-index granularity. This enables a user to create data-skipping indexes over sensitive data without leaking information, while still allowing each user to use only the subset of indexes available to them. Xskipper with PME is already available in IBM Analytics Engine [10] and IBM Cloud Pak for Data [11].

IBM Research is also investigating its integration in additional IBM products.

Parquet encryption in CyberKit4SME use cases

One financial use case for PME comes from the European Horizon H2020 project CyberKit4SME [4]. Here, a small financial institution buys foreign exchange (ForEx) tick data, which records price changes about once per second for every pair of currencies. The institution instructs traders to buy or sell currencies based on the analytics models that run on the ForEx data. Confidentiality is important because this detailed data has been paid for, and integrity is important because financial decisions are made based on the data: any missing or erroneous data can affect a decision and possibly result in large financial losses. Moreover, storing the data should be cheap and easy for the SME partner. Saving the data in encrypted Parquet files meets these needs. Encryption helps protect the confidentiality and integrity of the data, storage is affordable because of the excellent compression of Apache Parquet, and the performance of analytics queries running on these Parquet files is very good.

Another interesting use case for PME is a smart transportation use case from the European Horizon H2020 project CyberKit4SME [4], where data is collected from cars (e.g., positions, acceleration and velocity). This data is used to build and train machine-learning models using TensorFlow, which are then used in smart cars to make real-time decisions.

The data collected from the cars contains sensitive information, so it must be stored in a way that is compact and encrypted. That said, various personas should be able to run Python scripts on this data to analyze it and to train models. PME allows you to store large amounts of data in a compact way encrypted with different encryption keys (e.g., according to sensitivity levels) and to give access to the encryption keys based on security clearance or some other enterprise policy. Access control is achieved by controlling access to the keys without creating multiple replicas of the table — the physical data files remain accessible to a large set of people, but they can only read data for which they have access to keys.
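
As a rough sketch of how keys could be assigned per sensitivity level, the Python snippet below builds the column-to-key property string used by the properties-driven configuration of the Java implementation (the parquet.encryption.column.keys and parquet.encryption.footer.key properties). The key IDs, the column names and the helper column_keys_property are hypothetical and chosen only for illustration; the snippet only assembles configuration values and does not encrypt anything itself.

```python
def column_keys_property(keys_to_columns):
    """Build 'keyId:col,col;keyId:col' from a {key id: [columns]} mapping."""
    return ";".join(
        f"{key_id}:{','.join(columns)}"
        for key_id, columns in keys_to_columns.items()
    )


# Hypothetical master key ids and column names, grouped by sensitivity.
keys_to_columns = {
    "key_low": ["velocity", "acceleration"],
    "key_high": ["position"],
}

# Options that a writer could be given; "key_footer" is also hypothetical.
write_options = {
    "parquet.encryption.footer.key": "key_footer",
    "parquet.encryption.column.keys": column_keys_property(keys_to_columns),
}

print(write_options["parquet.encryption.column.keys"])
# key_low:velocity,acceleration;key_high:position
```

In a Spark environment with PME enabled, options like these would typically be passed to the DataFrame writer, for example df.write.options(**write_options).parquet(path), after a crypto factory and a KMS client have been configured for the deployment. The exact setup depends on the platform, so treat this as an outline rather than a recipe.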

For example, in the diagram below, two different users run queries on the same table that has five columns encrypted with PME. The first user selects three columns out of the four available to them based on permissions granted with their access token, and the second user selects two columns out of the three available to them based on their access token. That might be achieved by using one key for the least sensitive columns 1, 3 and 5, another key for the more sensitive column 2 and yet another key for the most sensitive column 4:
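
The key assignment in this example can be modeled in a few lines of Python. This is only a sketch of the access logic, not of PME itself: the key names, the helper functions and the assumption that the first user holds the key for column 2 (rather than column 4) are all hypothetical.

```python
# Column-to-key assignment from the example: one key per sensitivity level.
COLUMN_KEYS = {
    "column1": "key_low",
    "column2": "key_medium",
    "column3": "key_low",
    "column4": "key_high",
    "column5": "key_low",
}


def readable_columns(user_keys):
    """Columns a user can decrypt, given the set of keys they can access."""
    return [col for col, key in COLUMN_KEYS.items() if key in user_keys]


def select(user_keys, requested):
    """Return the requested columns, or fail if any key is unavailable."""
    allowed = set(readable_columns(user_keys))
    denied = [col for col in requested if col not in allowed]
    if denied:
        raise PermissionError(f"no key access for: {', '.join(denied)}")
    return list(requested)


# Assumption: the first user holds two keys, the second user only one.
first_user = {"key_low", "key_medium"}
second_user = {"key_low"}

print(readable_columns(first_user))   # ['column1', 'column2', 'column3', 'column5']
print(readable_columns(second_user))  # ['column1', 'column3', 'column5']
print(select(first_user, ["column1", "column2", "column5"]))
print(select(second_user, ["column1", "column3"]))
```

Both users read the same physical table. What differs is the set of keys each one can obtain, which is why no per-user replicas of the data are needed.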

Summary

In short, make sure to try out Parquet Modular Encryption in IBM products [7], [8] and in the upcoming Apache Spark release. We look forward to your feedback.

References

[1] Data and AI Summit: Data Security at Scale through Spark and Parquet Encryption

[2] Parquet Modular Encryption: Developing a new open standard for big data security

[3] Structured Data and Hybrid Clouds: Getting Value From Your Data While Remaining Secure and Compliant

[4] CyberKit4SME H2020 Project: The CyberKit4SME project has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No 883188.

[5] Apache Parquet 1.12.0 release

[6] Apache Arrow 4.0.0 release

[7] Parquet Encryption in IBM Analytics Engine

[8] Parquet Encryption in Cloud Pak for Data

[9] Example Notebook with Parquet Modular Encryption

[10] Xskipper Index Encryption in IBM Analytics Engine

[11] Xskipper Index Encryption in IBM Cloud Pak for Data

[12] Apache Parquet columnar storage format

[13] Apache Spark - unified analytics engine for large-scale data processing
