
Parquet Modular Encryption: Developing a new open standard for big data security


Finding the best way to protect sensitive big data is an issue that is top of mind for businesses and enterprises. Many storage solutions can encrypt the data, but that could still leave it exposed to the whims of the storage administrator. This becomes even more complex in the cloud, where the administrator is not the data owner, and the public storage is shared by many users.

The IBM Cloud delivers top security tools for enterprises, including the ability to keep their own keys that control access to their data, whether it is stored in public clouds or in their private infrastructure. To maintain trust, we need to safeguard both the privacy and the integrity of the data. Integrity makes the data tamper-proof: existing values cannot be changed or manipulated by unauthorized parties, even when those parties can't see what's inside.

To help advance data security in the cloud, IBM Research has initiated and currently leads joint work with the Apache Parquet community to address critical issues in securing confidentiality and integrity of sensitive data. Apache Parquet is the industry-leading standard for the formatting, storage, and efficient processing of big data. Working together, we proposed the Parquet Modular Encryption (PME) as a solution to address the issues of privacy and integrity for sensitive Parquet data, in a way that won’t degrade the performance of analytic systems. This solution is now part of the open standard for big data storage and is already providing data protection capabilities inside the Parquet format implementations.

Some security scenarios we are solving

One scenario we are addressing with PME involves a use case for connected cars that we’re working on as part of the EU-funded RestAssured project to deliver secure data processing in the cloud. Connected cars transmit a range of information, including location, speed, acceleration, and driver identity, alongside other factors like tire pressure, gear position, and momentum. Our scenario is specifically interested in the values for car speed, which can allow insurance companies to offer customers usage-based coverage. All data from the connected cars is stored in the cloud of the car manufacturer or telco provider. With the massive amount of customer data generated by connected cars, we wanted to give the insurance provider access only to the information needed to offer a discount for careful drivers.

Using PME, the data owners can specify that the platform has permission to share car speed and timestamp, but not location. Our standard’s columnar-based permissions enforce fine-grained access control in the data storage, and allow the insurance company to get the data for speed and timestamp. For example, this means they can automatically give reduced prices to people who don’t drive faster than 20 percent above the speed limit at night.
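Column-level permissions of this kind are expressed by mapping keys to columns. As a minimal sketch, here is how such a mapping looks in the PyArrow implementation of Parquet Modular Encryption; the key IDs and column names are illustrative assumptions for the connected-car scenario, not part of any actual deployment, and a full setup would also wire in a KMS client via a `CryptoFactory`:

```python
import pyarrow.parquet.encryption as pe

# Illustrative key IDs and column names (assumptions, not a real deployment).
# The footer key protects file metadata and overall file integrity; each
# column key controls access to the columns listed under it.
encryption_config = pe.EncryptionConfiguration(
    footer_key="footer_key_id",
    column_keys={
        "telemetry_key_id": ["speed", "timestamp"],  # shared with the insurer
        "location_key_id": ["location"],             # withheld from the insurer
    },
)
```

A reader holding only `telemetry_key_id` can decrypt and query the speed and timestamp columns, while the location column remains opaque ciphertext to it.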


Strong security yet still fast and easy access to data

We started by looking at Apache Parquet, which has become the de facto standard for big data storage due to its ability to encode and compress stored data, and to apply advanced filtering methods when fetching the data from storage. For example, Parquet allows users to retrieve just the columns they need from the table. This serves to drastically reduce the time and resources needed to find and process critical information. The format also allows users to leverage the min and max values for each part (chunk or page) of a column and retrieve from storage only those parts that are relevant to their search. In short, we can start filtering right inside the data storage before we start processing the data.
PME works with individual modules, such as pages, and therefore enables the standard Parquet selective column retrieval and filtering by min/max values. Moreover, it allows users to encrypt different columns with different keys. For example, we can encrypt the entire table in a private cloud and then send it to the public cloud with access restricted to only certain columns. This makes our security approach well-suited for working in hybrid cloud scenarios.

PME applies encryption to the minimal Parquet units (pages) rather than to individual values. Since a page packs many values, page encryption is two or more orders of magnitude faster than straightforward per-value encryption. Page-based encryption also has a negligible size overhead (~0.003 percent), unlike value-based encryption, which can easily double the data size through added crypto metadata, and bloat it even further because encrypted data is hard to compress. Our approach taps into the hardware cipher acceleration available in most CPUs, which can be leveraged in C++ applications, or in Java applications starting with Java 9; this acceleration is effective only with page-based encryption, not when encrypting single values. We ran Parquet benchmarks with and without our solution’s encryption. On a Java 11 platform, the results showed a single-digit percentage degradation in throughput when encrypting full tables, and below 1 percent degradation when encrypting only a sensitive subset of table columns, which is often the case (see our Strata NY talk: https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/77144).
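The size-overhead claim can be checked with back-of-the-envelope arithmetic, assuming the usual AES-GCM crypto metadata of a 12-byte nonce and a 16-byte authentication tag per encrypted unit; the page and value sizes below are illustrative assumptions:

```python
# Crypto-metadata overhead per encrypted unit under AES-GCM:
# a 12-byte nonce plus a 16-byte authentication tag.
NONCE, TAG = 12, 16
overhead_per_unit = NONCE + TAG  # 28 bytes

# Encrypting a whole page: overhead is amortized over many values.
page_size = 1 << 20  # an illustrative 1 MiB Parquet page
page_overhead_pct = 100 * overhead_per_unit / page_size

# Encrypting a single value: overhead dwarfs the payload.
value_size = 8       # an illustrative 64-bit value
value_overhead_pct = 100 * overhead_per_unit / value_size

print(f"per-page:  {page_overhead_pct:.4f}%")   # ~0.0027%
print(f"per-value: {value_overhead_pct:.0f}%")  # 350%
```

And this is before accounting for the compression loss of per-value encryption, since ciphertext is effectively incompressible.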

Why not give it a try?

PME is deployed in the IBM Cloud. If you are interested in trying it out, simply launch a cloud Spark instance. This is a product-level service, so if you try it and like it, you are welcome to start using it for secure data analytics in your organization.

Feel free to contact us if you have ideas or requirements for additions to this technology at GIDON@il.ibm.com.

We also invite you to join the Apache Parquet community and work with us on this open standard. Many thanks to our partners who work together with us in the community to make this standard a success: Uber, Netflix, Cloudera, Emotiv, Vertica, Ursa Labs and Apple.


Senior Technical Staff Member, Secure Analytics Research, IBM Research
