In today’s interconnected world, an incredible amount of data is being sensed, uploaded and stored for analysis. While the resulting enormous data sets are a valuable engine of innovation, they also present new challenges to data analysis for researchers and businesses. It is critical that data privacy is maintained and that a framework is in place to provide data privacy guarantees, especially when working with large scale data sets.
In our demonstration “PRIMA: an End-to-End Framework for Privacy at Scale” for the IEEE International Conference on Data Engineering in Paris, we present an end-to-end framework that we developed to manage privacy risks – especially for large scale data sets. The framework helps researchers and decision makers map out and execute a data privacy strategy. It does this through a comprehensive workflow structure that is especially useful when working with large and sensitive data sets.
Traditional approaches to data anonymization rely on manual analysis and on tools with limited scalability. At IBM Research – Ireland we have built a framework that supports the strategy, design and enforcement of data privacy at scale. The framework assists in navigating the enormous number of possible combinations of data anonymization settings. Its reporting structure supports a review of the chosen privacy settings in terms of data utility and risk.
The role of the framework is to assist the decision-making process with step-by-step visual feedback. It is a production-grade system that can run components such as vulnerability analysis, anonymization, and risk and information loss measurement on arbitrarily large datasets.
In addition, the framework contains a library of tools that software developers can use to embed de-identification components in their applications and to extend their functionality.
Full automation of the privacy assessment process is difficult to achieve because the business and industry contexts in which data sets are accessed vary widely. For example, the needs and access entitlements of a marketing executive will differ from those of an economist or finance auditor, yet all may need to use the same customer datasets. As a result, our framework allows its owner or manager to adapt the privacy process settings to comply with their own business needs and end user requirements.
In constructing the framework, we reviewed the fundamental questions in designing a privacy strategy: How can we locate the personal and sensitive data? What are the privacy vulnerabilities of that data? Once these questions are answered, the next step is to protect the data against privacy risks while maintaining its usability for the business use cases.
Equally important is the need to quantify the utility lost in the anonymization process and to assess the potential risk associated with cross-referencing the resulting datasets. With these questions in mind, a privacy workflow for data de-identification was developed (see Figure 1).
Figure 1: Six major components of our privacy workflow
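To make the ideas of risk and information loss measurement concrete, here is a minimal Python sketch. It is not the framework's implementation: the toy records, the function names, and the choice of k-anonymity as the risk measure and a simple averaged generalization ratio as the loss measure are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy records: (age, zip_code, diagnosis).
# Age and zip code are treated as quasi-identifiers (an assumption).
RECORDS = [
    (34, "75001", "flu"),
    (36, "75002", "asthma"),
    (38, "75003", "flu"),
    (51, "75011", "diabetes"),
    (55, "75012", "flu"),
    (58, "75013", "asthma"),
]


def generalize(record, age_bin, zip_digits):
    """Coarsen the quasi-identifiers: bin the age, mask trailing zip digits."""
    age, zip_code, diagnosis = record
    low = (age // age_bin) * age_bin
    masked = zip_code[:zip_digits] + "*" * (len(zip_code) - zip_digits)
    return ((low, low + age_bin - 1), masked, diagnosis)


def k_anonymity(records):
    """Size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter((r[0], r[1]) for r in records)
    return min(groups.values())


def reidentification_risk(records):
    """Worst-case chance of singling out one record within its group."""
    return 1.0 / k_anonymity(records)


def information_loss(age_bin, zip_digits, age_span=100, zip_len=5):
    """Illustrative loss in [0, 1]: average coarsening of the two attributes."""
    age_loss = (age_bin - 1) / (age_span - 1)
    zip_loss = (zip_len - zip_digits) / zip_len
    return (age_loss + zip_loss) / 2


anonymized = [generalize(r, age_bin=10, zip_digits=3) for r in RECORDS]
print("k before:", k_anonymity(RECORDS))
print("k after:", k_anonymity(anonymized))
print("risk after:", reidentification_risk(anonymized))
print("loss:", round(information_loss(10, 3), 3))
```

In this sketch, coarser settings raise k and lower the worst-case risk, at the cost of a higher loss score. A real assessment would use measures suited to the data and the use case.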
The framework's reporting element provides step-by-step visual feedback, offering answers to challenging questions such as how to discover privacy vulnerabilities and protect data against them, or how to help a user determine the best data privacy strategy for their organization.
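As an illustration of what discovering privacy vulnerabilities can mean in practice, the following Python sketch flags combinations of columns whose values single out a large share of records. The sample rows, column names, threshold and function names are hypothetical, and the brute-force search shown here is a simple stand-in, not the framework's own vulnerability analysis.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample rows; the column names are assumptions for illustration.
ROWS = [
    {"gender": "F", "birth_year": 1980, "zip": "75001", "diagnosis": "flu"},
    {"gender": "F", "birth_year": 1980, "zip": "75002", "diagnosis": "asthma"},
    {"gender": "M", "birth_year": 1980, "zip": "75001", "diagnosis": "flu"},
    {"gender": "M", "birth_year": 1975, "zip": "75002", "diagnosis": "flu"},
]


def unique_fraction(rows, columns):
    """Fraction of rows whose values on `columns` appear exactly once."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


def find_vulnerabilities(rows, candidate_columns, threshold=0.5, max_size=2):
    """List column combinations that single out at least `threshold` of rows."""
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(candidate_columns, size):
            fraction = unique_fraction(rows, combo)
            if fraction >= threshold:
                found.append((combo, fraction))
    return found


for combo, fraction in find_vulnerabilities(ROWS, ["gender", "birth_year", "zip"]):
    print(combo, fraction)
```

Exhaustive enumeration like this grows quickly with the number of columns, which is one reason scalable tooling matters for large data sets.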
In this video, our team demonstrates how the framework works by using a healthcare data set as an example. It also outlines the challenges a researcher or decision maker may face and how our framework and tool set could assist them in designing a privacy strategy for sensitive data sets.
Our end-to-end framework for helping researchers and decision makers design and enforce a data privacy strategy for their use cases offers two major contributions compared to existing approaches. First, it provides a comprehensive workflow accompanied by a risk/utility exploration framework that supports informed decision making through detailed reporting and fast exploration of the anonymization space. Second, it is designed and implemented to scale with data, bringing data privacy practice in line with the modern era of enormous and linked datasets.
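The idea of exploring the anonymization space can be sketched in a few lines of Python. The example below tries every combination of two hypothetical settings (an age bin width and the number of zip code digits kept), reports the smallest group size and an illustrative loss score for each, and picks the least lossy setting that meets a required k. The data, setting names and loss formula are assumptions for illustration only; the framework's actual exploration method is not described here.

```python
from collections import Counter
from itertools import product

# Hypothetical quasi-identifier pairs: (age, zip_code).
ROWS = [
    (34, "75001"), (36, "75002"), (38, "75003"),
    (51, "75011"), (55, "75012"), (58, "75013"),
]


def smallest_group(rows, age_bin, zip_digits):
    """Smallest group size after binning ages and truncating zip codes."""
    groups = Counter((age // age_bin, z[:zip_digits]) for age, z in rows)
    return min(groups.values())


def loss(age_bin, zip_digits, max_age_bin=20, zip_len=5):
    """Illustrative loss in [0, 1]; 0 means the data is left unchanged."""
    age_loss = (age_bin - 1) / (max_age_bin - 1)
    zip_loss = (zip_len - zip_digits) / zip_len
    return (age_loss + zip_loss) / 2


def explore(rows, age_bins, zip_digit_options, k_required):
    """Evaluate every setting; return the full report and the best choice."""
    report = []
    for age_bin, zip_digits in product(age_bins, zip_digit_options):
        report.append({
            "age_bin": age_bin,
            "zip_digits": zip_digits,
            "k": smallest_group(rows, age_bin, zip_digits),
            "loss": loss(age_bin, zip_digits),
        })
    acceptable = [r for r in report if r["k"] >= k_required]
    best = min(acceptable, key=lambda r: r["loss"]) if acceptable else None
    return report, best


report, best = explore(ROWS, [1, 5, 10, 20], [5, 4, 3], k_required=3)
for row in report:
    print(row)
print("chosen:", best)
```

Even with two settings the grid has twelve candidates, and each added attribute multiplies that count, which is why reporting and fast exploration are useful when choosing among them.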