Computer programs take data as input and produce data as output. By design, the value of the output data should be greater than that of the input.
The input data can be structured data—such as a table of customer records, including their purchases—or unstructured data, such as photos of license plates from an electronic toll road. Often, the structured data is extracted from the unstructured data and organized in a table based on the type of information, like the license plate numbers, location of the camera that took the picture, and time of the picture.
For more on structured and unstructured data, see "Structured vs. Unstructured Data: What's the Difference?"
The relational data model, which was introduced fifty years ago by Edgar Codd, is a key means of organizing structured data, and SQL (the language for querying relational data) is as popular as ever. Because most of the higher-value business data today is still structured data, this blog focuses on tabular/rectangular structured data.
If the world were simple, our story would end here. But the world is not simple. People need to run computer programs in different locations and data must be treated appropriately. Looking a little deeper, three issues stand out.
The reality of data security requirements in hybrid cloud
First, ensuring compliance with various regulations and policies means limiting who can see what data, for what purpose, and where they can see the data. For instance, given a table of customers, the marketing team may be able to see the names, addresses, and purchase history of the customers, while the financing team may be able to see the customer IDs, credit scores, and purchase history. Ensuring that all data usage is compliant is a challenge.
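To make the example concrete, here is a minimal Python sketch of such a policy. The table contents, the `COLUMN_POLICY` mapping, and the `view_for` helper are hypothetical names chosen for illustration, not part of any product or library. The point is only that each team receives a different projection of the same table.

```python
# Hypothetical policy: which columns each team may see (illustrative names).
COLUMN_POLICY = {
    "marketing": {"name", "address", "purchase_history"},
    "financing": {"customer_id", "credit_score", "purchase_history"},
}

# Made-up customer records, used only to exercise the policy.
customers = [
    {"customer_id": 1, "name": "Ada", "address": "1 Main St",
     "credit_score": 710, "purchase_history": ["book"]},
    {"customer_id": 2, "name": "Lin", "address": "9 Oak Ave",
     "credit_score": 655, "purchase_history": ["lamp", "desk"]},
]


def view_for(team, rows):
    """Return copies of the rows restricted to the columns the team may see."""
    allowed = COLUMN_POLICY[team]
    return [
        {column: value for column, value in row.items() if column in allowed}
        for row in rows
    ]


marketing_view = view_for("marketing", customers)
financing_view = view_for("financing", customers)
print(marketing_view[0])
print(financing_view[0])
```

Note that `view_for` produces a filtered copy of the data for every team, so enforcing the policy this way multiplies the number of copies that have to be tracked.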
Second, there are copies of data, copies of copies of data, and even copies of copies of copies of data. These copies may be slight variants that were created to comply with regulations (e.g., to give the marketing team the data without the customer IDs or credit scores). They may even be bit-by-bit identical copies made to bring the data closer to the computation. With all these copies of the data floating around, it can be difficult to ensure that the right version of the data is given to a program and that the regulations and policies are followed consistently across the versions.
Third, people want to take advantage of computational resources wherever they may exist, independent of where the input data was created. For example, a computer program may need to take advantage of a set of GPUs that are only available in a cloud, or it may need to integrate with some cloud-native service. After all, this is the promise of hybrid cloud.
In the world of hybrid cloud, technologies such as Kubernetes have made it very easy to deploy a program where it makes the most sense. If it makes sense to run in the corporate data center, use Kubernetes to deploy it there. If it makes sense to run in a public cloud, use Kubernetes to deploy it there.
But moving the program is not even half the story. The program needs its input data where the execution is taking place. This involves not only making a copy of the data (say, from the corporate data center to a public cloud) but also controlling which portion of a data set can be seen where. For instance, corporate regulations may forbid making purchase histories visible on a public cloud.
In short, getting more value from data is hard because of three key challenges:
- Ensuring data is used in a way that is compliant with policy, and in particular, ensuring someone only sees the data they are supposed to see.
- Reducing the number of copies of the data, in particular the copies made to let different users see different views of the data as required by compliance.
- Ensuring that the data can be used in any location, while also guaranteeing that only that portion of the data that should be visible in a particular location is indeed visible.
While enabling any program to run on any data regardless of location will require more than addressing these three challenges, a good solution to them will provide huge benefits.
Deriving value from data with operational controls
One of the first questions to answer is how the data should be stored. Tabular data can be stored in a database-specific format or in one of several open formats. Storing in a database-specific format prevents the data from being easily moved between locations since we need an instance of the database running wherever we want to use the data. This is, to a large degree, orthogonal to whether the database is proprietary or open.
The alternative to storing the data in a database is to store it in an open format. There is a range of open formats available for storing tabular data. The most common and widely known is almost certainly comma separated values (CSV). In CSV, the data is listed row by row, with each row on its own line, and a comma between the items in different columns.
A CSV file carries little or no description of the table's structure, so one needs to know separately the names of the columns and the type of data held in each column. The row-oriented layout also means a program cannot read only the columns it needs from the file (say, only the columns with customer names, gender, and purchase history); it has to scan every row. The row orientation also inhibits compression, since compression depends upon similarity, which is more likely to occur within a column than within a row. One benefit of CSV is its simplicity and the fact that it can be easily edited by a human.
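A small sketch using Python's standard `csv` module illustrates both limitations. The sample data and the `read_column` helper are made up for this example. Even though only one column is wanted, every row is read and parsed, and the values come back as strings because the file itself says nothing about types.

```python
import csv
import io

# Illustrative CSV text; the header line is a convention, not a typed schema.
CSV_TEXT = (
    "customer_id,name,credit_score\n"
    "1,Ada,710\n"
    "2,Lin,655\n"
    "3,Sam,590\n"
)


def read_column(csv_text, column):
    """Collect one column; every full row is still read and parsed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column] for row in reader]


scores = read_column(CSV_TEXT, "credit_score")
print(scores)  # the values are strings, not integers

# The caller has to know from elsewhere that this column holds integers.
typed_scores = [int(score) for score in scores]
print(typed_scores)
```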
An alternative to the CSV row-oriented format is a column-oriented format such as Apache Parquet or Apache ORC. Although these formats are not as simple as CSV, both are fully open with a standard definition supported by the Apache Software Foundation.
With a columnar structure, data is laid out in the file column by column. Conceptually, if we have a table with five columns (e.g., customer ID, customer name, address, purchase history, and credit rating), the file will contain all of the customer IDs in the order of the rows, followed by all of the customer names in the order of the rows, and so on. (In practice, there are some additional details on how Parquet and ORC files are stored, but they are not significant for purposes of this discussion.)
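The conceptual layout can be sketched in a few lines of Python. This is a deliberately simplified stand-in and not the actual Parquet or ORC file format: the JSON encoding, the `write_columnar` and `read_column` helpers, and the in-memory `index` are all assumptions made for illustration. The index plays a role loosely analogous to the metadata a real columnar file keeps about where each column's data is located.

```python
import json

# Made-up table: each tuple is one row.
rows = [
    (1, "Ada", 710),
    (2, "Lin", 655),
    (3, "Sam", 590),
]
column_names = ["customer_id", "name", "credit_score"]


def write_columnar(column_names, rows):
    """Lay the table out column by column and record where each column lives."""
    body = b""
    index = {}
    for position, name in enumerate(column_names):
        chunk = json.dumps([row[position] for row in rows]).encode("utf-8")
        index[name] = (len(body), len(chunk))  # (offset, length) of this column
        body += chunk
    return body, index


def read_column(body, index, name):
    """Decode only the byte range that holds the requested column."""
    offset, length = index[name]
    return json.loads(body[offset:offset + length])


body, index = write_columnar(column_names, rows)
names = read_column(body, index, "name")
print(names)
```

Because each column occupies one consecutive byte range, `read_column` never touches the bytes that belong to the other columns.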
This format has several benefits. First, since all of the values in a column are in a consecutive range, it is possible to retrieve only the columns needed for a particular computation. Second, since values in a column are more likely a priori to be similar to each other than values from different columns, this makes the file much more compressible. While one can debate which format—Parquet or ORC—is the “better” one, it is clear that columnar formats have many benefits over row-based formats. Furthermore, we’re seeing a huge uptake in the industry of these formats for data processing and in particular in data lake settings. We have been working with Parquet, so we will focus on the Apache Parquet format.
A breakthrough for Apache Parquet: Modular Encryption for secure analytics
Now, let’s look at how we can control who can see what data and where they can see it. One way to control who can see data is to make a separate copy for each user and each place where they are using data. This introduces many problems:
- A business can lose control over the data entirely due to the challenge of managing so many copies.
- It is expensive to make, store, and manage many copies of the data.
- The more copies of the data we have, the greater the risk of data leakage, since the data can be leaked if any of the copies are breached.
An alternative approach is to encrypt the data, only giving ‘the keys’ to people who should be seeing the data, according to policies. Encryption is a well-accepted and common approach to protecting access to data. But, if we encrypt the entire Parquet file with a single key, we don’t quite address the challenges:
- We don’t have a way to give different users access to different columns in the table or provide access to only some columns in a given location. This is because once someone has the key, they can decrypt the entire file and see all of the columns.
- We lose some of the benefits of the columnar formats like Parquet, in particular support for projection. With the entire file encrypted with a single key, there is no way to read only the elements for a particular set of columns. The entire file needs to be decrypted to retrieve a particular column.
The solution lies in encrypting different columns with different keys. To this end, we worked with the Apache Parquet community to add a feature called Modular Encryption. The basic idea of Parquet Modular Encryption is to have a standard approach of assigning a different encryption key to each column.
Parquet Modular Encryption addresses each of the three key challenges associated with taking input data and turning it into higher value output data:
- We can ensure compliance and that someone only sees the data they are supposed to see by giving them only the keys for specific columns.
- We can avoid making lots of copies of different subsets of the data based upon what columns can be seen. Rather, we can have a single master Parquet file; access to different subsets of the data is handled by giving different users different sets of keys.
- We don’t need to make a separate version of the data for use in each location. If certain columns of data should not be visible (for example, in a public cloud), we simply ensure that the keys for those columns are not shared.
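The idea of one key per column can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the helper names, the sample table, and the key grants are hypothetical, and the "cipher" is a throwaway XOR keystream built from SHA-256 that must not be used to protect real data. Parquet Modular Encryption itself relies on AES-GCM-based ciphers and standard key management rather than anything like this; the sketch only shows how a single encrypted copy can serve readers who hold different sets of keys.

```python
import hashlib
import json
import secrets


def _keystream(key, length):
    """Toy keystream from SHA-256 in counter mode. NOT real cryptography."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def _xor(data, key):
    """XOR data with the keystream; applying it twice restores the input."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


def encrypt_table(columns, column_keys):
    """Encrypt each column's values under that column's own key."""
    return {
        name: _xor(json.dumps(values).encode("utf-8"), column_keys[name])
        for name, values in columns.items()
    }


def read_table(encrypted, available_keys):
    """Decrypt only the columns for which the reader holds a key."""
    return {
        name: json.loads(_xor(blob, available_keys[name]))
        for name, blob in encrypted.items()
        if name in available_keys
    }


# Made-up table, stored column by column.
columns = {
    "customer_id": [1, 2],
    "name": ["Ada", "Lin"],
    "credit_score": [710, 655],
    "purchase_history": [["book"], ["lamp", "desk"]],
}
column_keys = {name: secrets.token_bytes(16) for name in columns}
encrypted = encrypt_table(columns, column_keys)  # the single shared copy

# Hypothetical key grants: a marketing reader, and a public cloud location
# from which the purchase history key is withheld.
marketing_keys = {c: column_keys[c] for c in ("name", "purchase_history")}
public_cloud_keys = {c: column_keys[c] for c in ("customer_id", "credit_score")}

print(read_table(encrypted, marketing_keys))
print(sorted(read_table(encrypted, public_cloud_keys)))
```

Both readers work from the same `encrypted` object; what each one can see is determined entirely by the keys it was given.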
Similar features have also been developed for the Apache ORC format.
To learn more about Parquet Modular Encryption and how it can be used with IBM Cloud, see “Parquet Modular Encryption: Developing a new open standard for big data security.”