Faster, More Secure and More Cost-Effective: Hybrid Analytics

Secure your data in flight and at rest while enabling high-performance analytics.

Your organization has tons of really interesting data that your sales, marketing, system administration and human resources departments have been managing and curating in dedicated systems-of-record across your organization for years.

Your data scientists tell you that they have some really cool ideas for algorithms and that if they could only get their hands on the data, they could create analytics applications that would increase sales by a zillion percent.

Your data engineers tell you that you should combine all the information in a data lake, but they won't know how big it needs to be until they see the applications.

Your purchasing people tell you that they have a 12-month lead time between conceiving and building a data center.

The visionaries are telling you that you should move to the cloud, but your governance people are telling you this data is very sensitive — do you really want to take the risk?

The above summary describes a very common pattern that we see again and again in our own organization and in our customers' systems. You need to move data quickly and securely to the location in which it is to be processed, and once there, it needs to be stored securely and processed efficiently.

Our demonstration shows how we at IBM Research have been working to address some of the pain points described in this scenario.

We show a use-case in which an insurance company wants to bring Pay-As-You-Drive data from a system-of-record, where it is stored for auditing and billing purposes, to a public cloud instance where it can be used for multiple analytics purposes. We demonstrate how both the movement and the processing can be performed securely and efficiently by adding value-added capabilities that work seamlessly with industry standards.

Figure: some of the architectural components used in the demo.

Let's delve deeper into the secret sauce that is making this happen.

Efficient in motion

You need to move your data from the constrained environment where it is generated (e.g., on-prem systems-of-record) to an elastic environment where it can be crunched (e.g., multiple public cloud instances). The elasticity of the cloud allows you to expand as your data scientists come up with ever more clever and processing-intensive workflows. As the value of the analytics is dependent on the freshness of the data, the more frequently you can move it, the better. Ultimately, you would like to do it in real-time. You want to use multiple cloud vendors because you don't want lock-in. In addition, some cloud offerings are just more appropriate for particular types of analytics.

Change Data Capture (CDC) systems allow for the streaming of changes from source systems to many target systems. This has multiple benefits: 

  • The target systems are updated in near real-time
  • Only the changes are moved, not the whole data set
  • The downstream systems can also be kept consistent with respect to each other, as they all read from the same stream of changes (see the example event below)
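
To make this concrete, a CDC change event typically carries the row state before and after the change, together with metadata about the operation and its origin. The sketch below is a simplified, hypothetical Debezium-style event for a single updated Pay-As-You-Drive record; the field names and values are illustrative only, and real events also carry schema information and richer source metadata.

```python
# A simplified, hypothetical Debezium-style change event for one updated row.
# Field names and values are illustrative; real events include schema details
# and richer source metadata.
change_event = {
    "op": "u",                    # operation: c = create, u = update, d = delete
    "ts_ms": 1638364800000,       # when the connector processed the change
    "source": {                   # where the change originated
        "connector": "db2",
        "db": "INSURANCE",
        "table": "TRIPS",
    },
    "before": {"trip_id": 42, "distance_km": 10.0},  # row state before the update
    "after":  {"trip_id": 42, "distance_km": 12.5},  # row state after the update
}

# Downstream consumers apply only this delta, which keeps every target system
# in near real-time sync with the source and with each other.
```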

Debezium is an open-source distributed platform for CDC that allows database tables in many database systems (e.g., MySQL, Postgres, SQL Server) to be streamed as sets of changes. It is available as part of Red Hat Integration. For a high-level view of how we used Debezium for ingestion to a data lake, see "Data Orchestration with Debezium." For a detailed description of that system, see "Building and Operating a Large-Scale Enterprise Data Analytics Platform."

IBM Research worked with Red Hat to extend Debezium to Db2 and to add the connectors needed to read the stream of changes and write it to target systems in a customizable form. As the demo shows, this produces data that can be stored securely and queried efficiently.
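
As a rough sketch of what deployment can look like, the connector is registered with a Kafka Connect cluster through its REST API. The endpoint, credentials and table list below are placeholders, and the exact set of configuration properties depends on the Debezium version, so treat this as an outline rather than a reference; the Debezium Db2 connector documentation has the authoritative list.

```python
import json

import requests  # assumes the requests package is available

# Hypothetical Kafka Connect endpoint and connection details -- replace with your own.
connect_url = "http://localhost:8083/connectors"

connector = {
    "name": "payd-db2-connector",
    "config": {
        "connector.class": "io.debezium.connector.db2.Db2Connector",
        "database.hostname": "db2.example.com",
        "database.port": "50000",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.dbname": "INSURANCE",
        # Stream only the tables that hold the Pay-As-You-Drive records.
        "table.include.list": "PAYD.TRIPS",
    },
}

# Register the connector; Kafka Connect then starts streaming row-level changes
# from Db2 into Kafka topics.
resp = requests.post(connect_url, data=json.dumps(connector),
                     headers={"Content-Type": "application/json"})
resp.raise_for_status()
```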

Secure in motion

Often, the most interesting data is also the most sensitive. When you move the data from the protective shell of your on-prem system to the various cloud instances where you are doing the processing, you need to make sure that the data is securely transported and its integrity is not compromised. Compliance issues related to the governance of data stores also need to be considered when using data movement tools, taking into account how the data is stored as it is transported.

When using CDC, Apache Kafka is often the preferred method of data movement because it is fault tolerant, highly scalable and offered as a managed service by the major cloud vendors. Data in Kafka needs to be secured just as it would be in any data store. To achieve this, IBM Research has extended Kafka so that clients have full control of the keys used to encrypt their data sets at a per-topic level, meaning that, for example, different parts of the organization can use and manage different keys for data sets with different sensitivity levels. This fully integrates with Key Management Systems, so customers can use preferred industry patterns like Bring-Your-Own-Key. It is achieved by an encrypting/decrypting proxy that sits in front of a standard Kafka cluster and is transparent to producers and consumers. Our fully managed Kafka offering, IBM Event Streams, already supports customer-managed keys at the cluster level and intends to enhance this capability with this latest innovation from IBM Research.
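
Because the proxy fronts a standard Kafka cluster, applications keep using ordinary Kafka clients and simply point at the proxy's bootstrap address. The sketch below uses the open-source kafka-python client against a hypothetical proxy endpoint and topic; key selection and the actual encryption happen in the proxy and the Key Management System, not in application code.

```python
from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical address of the encrypting proxy that sits in front of the Kafka
# cluster. To the application it looks like any other Kafka endpoint.
producer = KafkaProducer(bootstrap_servers="kafka-encrypting-proxy.example.com:9092")

# The proxy encrypts the record with the key configured for this topic before it
# reaches the brokers, so producer code does not change at all.
producer.send("payd-trips",
              key=b"trip-42",
              value=b'{"trip_id": 42, "distance_km": 12.5}')
producer.flush()
```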

Efficient at rest

Once the data is on the public cloud, you need to process it as efficiently as possible. Data engineers typically use SQL, so the faster their SQL queries run, the better. According to today's best practices, big data (storage) and analytics (compute) are typically managed by independent services in the cloud. Big data is typically stored in object storage (e.g., IBM Cloud Object Storage), which provides unbounded capacity where you pay for what you use. However, this means that large amounts of data often need to be sent across the network, resulting in slow queries and high cost. In the demo, we use a technique called "data skipping," which minimizes the amount of data sent across the network, resulting in dramatic performance acceleration and cost reduction.

In the demo, as the data is brought to the cloud, it is organized geospatially to allow highly efficient querying using IBM Research's Data Skipping and Geospatial technologies. These build on industry standards like Parquet and Spark and intelligently structure and scan the data so that SQL queries don't need to read its irrelevant parts. In the demo, the location of the vehicle is given as geospatial coordinates, and the combination of data skipping and geospatial technologies improves query performance.

Data skipping stores summary metadata for each object (or file) in a dataset. For each column in the object, the metadata might, for example, store the minimum and maximum values of that column. This metadata can be used during query evaluation to skip over objects that have no relevant data. In real-world examples, we saw reductions of one or more orders of magnitude in query time and cost. See below for product and open-source availability.
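
To illustrate the general pattern (rather than the IBM libraries themselves), the PySpark sketch below lays the data out so that each written object covers a narrow range of coordinates, which keeps the per-object min/max metadata selective; a query with a spatial predicate then only needs the objects whose ranges overlap it. The paths, column names and the simple one-degree bucketing are assumptions for illustration; the data skipping and geospatial libraries listed below provide richer index types and true geodetic functions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("payd-data-skipping-sketch").getOrCreate()

# Hypothetical Pay-As-You-Drive trip records with vehicle coordinates.
trips = spark.read.parquet("s3a://payd-bucket/trips-raw/")

# Cluster the data geographically (coarse one-degree latitude/longitude cells)
# so that every written object covers a tight coordinate range. Tight ranges are
# what make min/max summary metadata effective for skipping.
trips = (trips
         .withColumn("lat_cell", F.floor(F.col("latitude")))
         .withColumn("lon_cell", F.floor(F.col("longitude"))))

(trips
 .repartition("lat_cell", "lon_cell")
 .sortWithinPartitions("latitude", "longitude")
 .write.mode("overwrite")
 .parquet("s3a://payd-bucket/trips-geo/"))

# A bounding-box query now touches only the objects whose latitude/longitude
# ranges overlap the box; all other objects can be skipped without being read.
geo = spark.read.parquet("s3a://payd-bucket/trips-geo/")
in_box = geo.where((F.col("latitude").between(40.0, 41.0)) &
                   (F.col("longitude").between(-74.5, -73.5)))
in_box.select("trip_id", "latitude", "longitude").show()
```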

IBM's geospatial toolkit has been integrated into multiple products and services — see below.

Secure at rest

Of course, now that you have very sensitive data on the public cloud, you would like to control who is allowed to see that data and for what purpose — just as you do for your on-prem system. You want as much of this access control as possible to remain under your supervision, with an audit trail to show to auditors, and you want the ability to revoke access to stay in your hands as well.

In the demo, we show two different actors accessing the exact same data sets. They are allowed different levels of access, so they see the data in different forms and can perform different types of queries. This data-centric security is achieved using Parquet Modular Encryption (PME), a new open standard for data encryption and integrity verification developed jointly with the Apache Parquet community in an effort that IBM Research initiated and led. PME is fully integrated with Key Management Systems, meaning that customers can use a common key policy for securing both the data in motion and the data at rest.

Under the covers, PME allows for the encryption of sensitive columns when writing Parquet files and the decryption of those columns when reading the encrypted files. Encrypting data at the column level lets you decide which columns to encrypt and how to control access to each column. PME is available in IBM Cloud Analytics Engine and IBM Cloud Pak® for Data, and it is being integrated into IBM Cloud SQL Query.
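
As an example of what this looks like from Spark, the sketch below follows the Parquet columnar-encryption configuration documented for Apache Spark's Parquet data source. It assumes a Spark build with Parquet Modular Encryption available and uses the in-memory mock KMS and base64 test keys that exist only for experimentation; in a real deployment the KMS client class would point at your Key Management System and keys would never appear in configuration. Column names and key names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pme-sketch").getOrCreate()
hconf = spark.sparkContext._jsc.hadoopConfiguration()

# Activate Parquet Modular Encryption through the properties-driven crypto factory.
hconf.set("parquet.crypto.factory.class",
          "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")

# For illustration only: a mock in-memory KMS with explicit base64 test keys.
# A production setup would set kms.client.class to a real KMS integration.
hconf.set("parquet.encryption.kms.client.class",
          "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
hconf.set("parquet.encryption.key.list",
          "colKey:AAECAwQFBgcICQoLDA0ODw==, footerKey:AAECAAECAAECAAECAAECAA==")

df = spark.createDataFrame(
    [(42, "ABC-1234", 12.5)], ["trip_id", "license_plate", "distance_km"])

# Encrypt only the sensitive column; access to the rest of the file can be
# granted independently of access to colKey.
(df.write
   .option("parquet.encryption.column.keys", "colKey:license_plate")
   .option("parquet.encryption.footer.key", "footerKey")
   .mode("overwrite")
   .parquet("/tmp/trips_encrypted"))

# Reading back decrypts transparently for clients authorized to use the keys.
spark.read.parquet("/tmp/trips_encrypted").show()
```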

For more information, see the availability details below.

Availability in IBM products

For each capability, we list the IBM products in which it is available, documentation on its usage in that product and, where appropriate, the relevant open-source repositories.

Data Skipping

  • IBM Cloud Pak for Data: Docs
  • IBM Watson Studio: Docs
  • IBM Analytics Engine: Docs
  • IBM Cloud SQL Query: Docs
  • Open Source: xskipper

Parquet Encryption

  • IBM Cloud Pak for Data: Docs
  • IBM Watson Studio: Docs
  • IBM Analytics Engine: Docs
  • Open Source: apache-parquet
  • IBM Cloud SQL Query: (Planned)

Spatio-Temporal

  • IBM Cloud Pak for Data: Docs
  • IBM Watson Studio: Docs
  • IBM Analytics Engine: Docs
  • IBM Cloud SQL Query: Docs

Kafka Per-Topic Encryption

  • IBM Event Streams: (Planned)
  • Open Source: strimzi

Debezium Db2 Connector
