Turn compliance into competitive advantage

01


A data privacy and security plan for the modern enterprise

How are enterprises bringing together data access, security and compliance?

As organizations increasingly look to data and AI to help drive innovation and competitive advantage, business leaders are faced with two seemingly contradictory imperatives: make all enterprise data accessible and ensure that information is secure and compliant. As strategies and solutions evolve, enterprises are coming to realize that this is a false dichotomy: the same tools that help businesses stay compliant with the latest data privacy regulations can also help them make better use of their data.

38% said a barrier they faced in seeking closer alignment to GDPR was the complexity of aligning the IT landscape.1

Historically, the data privacy and security landscape has been plagued by piecemeal approaches that employ a web of disparate point solutions which, only when cobbled together, provide the necessary view and understanding. At best, this approach creates a complicated architecture that requires additional time and resources to integrate manually. At worst, it fails to deliver a complete view of data and its usage, leaving blind spots that can open the business up to unknown risks and increase the potential for the losses and fines associated with breaches or noncompliance.

“When executives were asked to rate the top challenges organizations face while preparing for the CCPA, legacy IT (42%) emerged as critical.”1

To combat this issue, leading businesses are looking to more holistic strategies and solutions that provide visibility across the entire data and AI lifecycle — from building and securing a trusted, compliant data foundation to optimizing AI models and their impact on the business and, ultimately, auditing and regulating compliance. Organizations need a unified solution from which they can view the impact of sensitive data and universally enforce policies.

In this paper, we dig deeper into each of those key areas to explore the vital data privacy and security elements at each stage of the journey.

1 Championing Data Protection and Privacy (PDF, 2.7 MB), Capgemini Research Institute, 2019

02


Access data more securely across multiple sources

Use data virtualization to connect data from all locations and make it accessible through a single access point

One of the most difficult parts of data privacy and security is managing the diverse landscape of data stored across multiple siloed environments. In the past, the inability to ensure compliance from one environment or business unit to another has left many parts of the business reluctant to share data with colleagues. In these scenarios, compliance became a hindrance to the business rather than an advantage, further entrenching the disjointed repositories of data and forcing IT teams to protect and secure each on an individual basis.

The answer to this siloed landscape is data virtualization. Data virtualization allows users to access data from a single access point no matter where inside or outside the organization it resides. This access occurs without moving the data or using extract, transform, load (ETL) processes, so the risk of data corruption and loss is reduced significantly. Moreover, the queries sent from this single access point to the data repositories are protected with Secure Sockets Layer (SSL) and Transport Layer Security (TLS) encryption using standard protocols. So, even though the data itself doesn’t move, you can be sure the communications are secured as well.
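As a rough illustration of what securing the channel between the access point and a repository can look like, the following Python sketch builds a client-side TLS configuration with the standard library. The function name and the choice of TLS 1.2 as the minimum version are assumptions made for illustration; they are not settings taken from any particular data virtualization product.

```python
import ssl

def make_client_tls_context() -> ssl.SSLContext:
    """Build a client-side TLS context that verifies the server certificate."""
    # create_default_context() turns on certificate verification and
    # hostname checking for client connections.
    context = ssl.create_default_context()
    # Assumption for this sketch: refuse any protocol older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context

tls_context = make_client_tls_context()
print(tls_context.verify_mode, tls_context.minimum_version)
```

A context like this would typically be passed to whatever database driver or HTTP client the access point uses to reach each repository.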

Forward-looking implementations of data virtualization also provide schema folding — automatically detecting common schemas across repositories and making them appear as a single schema. For example, if a similar sales table existed across 20 databases, it would appear to the user as if it were just one table that could be queried. This method heightens users’ ability to use more complete sets of data for greater accuracy when developing insights or models.
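The idea of presenting many identical tables as one can be sketched with Python's built-in sqlite3 module. This is only a toy model under stated assumptions: three in-memory databases stand in for separate repositories, the region names and the `fold` helper are hypothetical, and a real data virtualization layer would detect and federate schemas in its own way.

```python
import sqlite3

# Autocommit mode, because SQLite cannot ATTACH inside an open transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)

# Hypothetical repositories, each holding a sales table with the same shape.
regions = ["emea", "apac", "amer"]
for i, region in enumerate(regions):
    conn.execute(f"ATTACH DATABASE ':memory:' AS {region}")
    conn.execute(f"CREATE TABLE {region}.sales (order_id INTEGER, amount REAL)")
    conn.execute(f"INSERT INTO {region}.sales VALUES (?, ?)", (i + 1, 100.0 * (i + 1)))

def columns(conn, schema, table):
    """Return the (name, type) signature of a table in one attached database."""
    rows = conn.execute(f"PRAGMA {schema}.table_info({table})").fetchall()
    return tuple((row[1], row[2]) for row in rows)

def fold(conn, schemas, table):
    """Expose identically shaped tables as a single queryable view."""
    signatures = {columns(conn, schema, table) for schema in schemas}
    if len(signatures) != 1:
        raise ValueError("table definitions differ; cannot fold them")
    union = " UNION ALL ".join(
        f"SELECT '{schema}' AS source, * FROM {schema}.{table}" for schema in schemas
    )
    # A TEMP view is used because it may reference several attached databases.
    conn.execute(f"CREATE TEMP VIEW all_{table} AS {union}")

fold(conn, regions, "sales")
total = conn.execute("SELECT COUNT(*), SUM(amount) FROM all_sales").fetchone()
print(total)
```

The user queries `all_sales` as if it were one table, while each row still records which repository it came from.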

Yet, the most useful aspect of data virtualization from a data privacy and security viewpoint is that data can be governed at that single access point. So, instead of adding governance across myriad different places, it can be implemented at the place where users are receiving self-service data access. These different aspects of governance are covered in the next section.

Read our data virtualization paper

03


Organize the data with a common catalog and metadata

Build a trusted foundation of data quality

Quality data, or a lack thereof, can be the difference between insights that are confidently acted upon and ones that aren’t trusted. If low-quality data goes into AI models, it could even lead to regulatory noncompliance if the result is discriminatory. Some questions of data quality can be answered with metadata, such as the source of the data and how fresh it is.

However, an additional level of data cleansing is often helpful. Look for a data catalog with built-in data quality analysis and a data refinery. The data quality analysis can be used to make inferences and identify anomalies, while the refinery can be used to discover, cleanse and transform the data with data shaping operations.
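To make the two roles concrete, here is a minimal Python sketch in which a profiling step counts missing and invalid values and a refining step applies a few shaping operations. The sample rows, column names and helper functions are all hypothetical; a catalog's built-in tooling would offer far richer checks.

```python
# Hypothetical raw rows as they might arrive from a source system.
RAW_ROWS = [
    {"customer": " Alice ", "country": "us", "amount": "120.50"},
    {"customer": "Bob", "country": "US", "amount": ""},
    {"customer": "Bob", "country": "US", "amount": ""},
    {"customer": "Carla", "country": "de", "amount": "n/a"},
]

def is_number(text):
    """Return True if the text parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def profile(rows, column):
    """Quality analysis: count missing values and values that fail a numeric check."""
    values = [row[column].strip() for row in rows]
    missing = sum(1 for v in values if v == "")
    invalid = sum(1 for v in values if v and not is_number(v))
    return {"rows": len(values), "missing": missing, "invalid": invalid}

def refine(rows):
    """Refinery: trim whitespace, normalize country codes and drop exact duplicates."""
    seen, cleaned = set(), []
    for row in rows:
        shaped = {
            "customer": row["customer"].strip(),
            "country": row["country"].strip().upper(),
            "amount": row["amount"].strip(),
        }
        key = tuple(shaped.items())
        if key not in seen:
            seen.add(key)
            cleaned.append(shaped)
    return cleaned

amount_profile = profile(RAW_ROWS, "amount")
clean_rows = refine(RAW_ROWS)
print(amount_profile, len(clean_rows))
```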

One of the best examples of the complementary nature of data privacy and data use is how both are supported by governance. At its core, data governance is about organization: knowing where data comes from, what it is, who can access it and when it should be retired. While this information certainly is important for auditability, right to be forgotten requests and determining access rights, it also helps data users determine the most relevant, freshest and cleanest data so they can deliver the best insights. We’ll explore several critical components of governance, along with how they complement both data privacy and better data use.

31% said cataloging and inventorying their data was among the most significant challenges in getting ready for GDPR.2

A common catalog

One of the most difficult challenges when bringing together data from across the business is that people use different terms to refer to the same thing or may be using the same term to refer to two different things. Creating a common taxonomy makes sure that everyone can communicate more effectively. Doing so is important for data privacy because, if the wrong term is used, data that should be limited in access might accidentally be made available to the whole business. From a data use perspective, using multiple terms or terms that are incorrect for that business or industry can make finding data for models and understanding insights more difficult and time-consuming.

Data catalogs, particularly ones with access to AI tools, can help with this issue. Organizations should look for opportunities to use AI to search documents and text within the business and industry to pull out the unique terminology that’s most applicable to them. This process will make their taxonomy efforts much more applicable to the data they possess.

The number one challenge and area for adopting a privacy-centric approach was performing data discovery and ensuring data accuracy.3

Metadata

Metadata is at the heart of both privacy and use for the same reason — if you don’t know details about your data, how can you truly say who is meant to see it or how you might be able to use it? Metadata keeps track of the origins of the data, age of the data, privacy level, potential uses and much more.

While this information can be added manually, doing so can be an extremely cumbersome and time-intensive process. Fortunately, machine learning allows for automated metadata generation. Based on the existing data catalog and other business policies, data is reviewed and then automatically tagged with relevant metadata based on what the machine learning algorithm finds. Not only does this process help make data ready for use as it comes in, it also helps eliminate human errors that might occur when applying metadata manually. Moreover, it mitigates the problems with so-called “dark data,” which remains hidden or unused because little to no information is known about it after it is ingested.

Automated metadata generation is particularly important with regard to access and anonymization procedures. Consider, for example, an enterprise that wants to bring in a new data set that contains information about transactions that include item descriptions, quantity purchased, name, address and credit card number. When this data set is ingested, automated tagging would tag the item descriptions and quantity as general transaction data, the name and address as personal data, and the credit card number as financial data. This tagging allows policy enforcement at the point of access. So, if business users were to access the data set, they could see the general transaction data, but the personal and financial data would be automatically anonymized — another automation feature being introduced in the most up-to-date governance tools. As such, policies are easily enforced and even more sensitive data can be used in a nonidentifiable and compliant way. Of course, those individuals with the need and authority to access personal or financial data from this data set still can, and those access rights are acknowledged at the single access point for data, as well. Additional information about anonymization features is provided in the next section.
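The transaction example above can be sketched in a few lines of Python. Everything named here is hypothetical: the tagging rules are a simple lookup rather than a machine learning classifier, and the roles, policy table and `read_row` helper exist only to show tags driving enforcement at the point of access.

```python
import hashlib

# Hypothetical tagging rules: column name -> sensitivity tag.
TAG_RULES = {
    "item_description": "transaction",
    "quantity": "transaction",
    "name": "personal",
    "address": "personal",
    "credit_card_number": "financial",
}

# Hypothetical policy: which tags each role may see in clear text.
ROLE_POLICY = {
    "business_user": {"transaction"},
    "fraud_analyst": {"transaction", "personal", "financial"},
}

def tag_columns(columns):
    """Tag each column; unknown columns get a tag that no role is cleared for."""
    return {col: TAG_RULES.get(col, "unclassified") for col in columns}

def anonymize(value):
    """Replace a value with a short one-way hash token."""
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return "anon_" + digest[:12]

def read_row(row, role):
    """Return the row as the given role is allowed to see it."""
    tags = tag_columns(row.keys())
    allowed = ROLE_POLICY.get(role, set())
    return {
        col: (val if tags[col] in allowed else anonymize(val))
        for col, val in row.items()
    }

order = {
    "item_description": "desk lamp",
    "quantity": 2,
    "name": "Jane Doe",
    "address": "1 Example Street",
    "credit_card_number": "4111111111111111",
}
business_view = read_row(order, "business_user")
analyst_view = read_row(order, "fraud_analyst")
print(business_view)
```

A plain hash keeps the sketch short, but it is closer to pseudonymization than to true anonymization, so a production policy would need a stronger technique for values such as card numbers.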

Read “A comprehensive guide for the modern data catalog”

Read the “Gartner Magic Quadrant for Metadata Management Solutions”

Read the “2020 Gartner Magic Quadrant for Data Quality Solutions”

Read “The Forrester Wave™ for Machine Learning Data Catalogs, Q4 2020”

2 Maximizing the value of your data privacy investments, Cisco, January 2019
3 Privacy Gains: Business Benefits of Privacy Investment, Cisco, 2019

04


Deliver compliant outcomes with data access and lineage

Track where data comes from and how it’s used without overcomplicating data access

Data access and lineage apply most directly to privacy concerns and auditability, for obvious reasons. Privacy is all about data being used only by the people who need it, and lineage shows who has had that access in practice. However, these safeguards are also important for self-sufficient use of data. If data users are confused over what they can or can’t use, they may opt not to employ a valuable data set to which they should have access. Moreover, it takes time to sort usable data from unusable data.

17% are evaluating or implementing anonymization and pseudonymization as a solution.4

A much better option is to have access restrictions built directly into the single access point where users are getting their data so only the data they have authorization to use is visible. It removes any confusion they might otherwise have. Another helpful feature is dynamic masking of sensitive data so that data sets and models can be used and shared without exposing private data to those who shouldn’t have access. After access has been granted, the governance solution should also be able to create reports that analyze the flow of data from data sources through jobs and stages, and into databases, data files, business intelligence reports, models and other assets. This data lineage capability, alongside the access protections, should make auditing for internal or external purposes much easier.
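A lineage report of the kind described above boils down to walking a graph of recorded data flows. The following Python sketch assumes a hypothetical `LineageGraph` class and made-up asset names; it shows only the tracing step, not how a governance tool captures the flows in the first place.

```python
from collections import defaultdict

class LineageGraph:
    """Records data flows and traces what feeds a given asset."""

    def __init__(self):
        # Maps each asset to the set of assets that flow directly into it.
        self._upstream = defaultdict(set)

    def record_flow(self, source, target):
        """Note that data moved from source into target."""
        self._upstream[target].add(source)

    def trace(self, asset):
        """Return every asset that feeds `asset`, directly or indirectly."""
        found, stack = set(), [asset]
        while stack:
            current = stack.pop()
            for parent in self._upstream.get(current, ()):
                if parent not in found:
                    found.add(parent)
                    stack.append(parent)
        return found

# Hypothetical flows: sources -> jobs -> business intelligence report.
lineage = LineageGraph()
lineage.record_flow("crm_db.customers", "job.clean_customers")
lineage.record_flow("pos_db.sales", "job.join_sales")
lineage.record_flow("job.clean_customers", "job.join_sales")
lineage.record_flow("job.join_sales", "report.quarterly_revenue")

report_sources = lineage.trace("report.quarterly_revenue")
print(sorted(report_sources))
```

An auditor asking where a report's numbers came from would get back both source databases and both intermediate jobs.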

It’s worthwhile to look in a bit more detail at two of these stages as they relate to lineage. The first is the ingestion of data itself. Data lineage helps to answer the question: “Where did this data come from?” That answer is important because it speaks to the accuracy and relevancy of data for future insight. For example, data that comes directly from transactions and is the full data set may be more accurate than a sample of data pulled from social media. By the same principle, data from a predominantly South American population may lead to insights that would be incorrect to apply to an Eastern European market. Knowing the data’s origin tells users more accurately where it should be applied.

This, in turn, relates to the other stage in data lineage that is worthwhile to discuss here: AI model lineage. The increased interest in producing AI models for deeper and more robust insight necessitates a higher level of scrutiny from a data lineage standpoint. Understanding where data comes from is vital to make sure models are trained on data that’s applicable to where the model is used in production. However, tracing the lineage of the model itself can be just as important. Specifically, it means tracking when and how the model was created, as well as when and where it has been used, and the decisions that resulted. Such model lineage is part of a new push for explainable AI that’s receiving ever more scrutiny and regulation. Essentially, it’s not enough to know that a decision was made; enterprises must be able to explain why a decision was made and why that decision was correct.

Take, for instance, an AI model that determines whether home loans should be approved. An important consideration in these types of decisions is whether they are discriminatory, either in the data considered or in their results. The enterprise could reassure itself that decisions are unbiased by using data lineage to confirm that the model was trained on data representative of the population, that it was applied uniformly in all cases where such a decision needed to be made, and that the end results or decisions didn’t disproportionately harm a particular group of people. Or, if errors were found thanks to reporting or auditing of this lineage, they can be corrected quickly before greater regulatory or reputational harm comes to the business. Such reports and auditing are the subject of the next section.
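One simple way to check whether outcomes fall disproportionately on one group is to compare approval rates across groups. The Python sketch below uses invented decisions for two hypothetical groups. The 0.8 review threshold is an assumption borrowed from the "four-fifths" rule of thumb used in US employment selection guidance, not a lending standard, and a real fairness review would look at far more than this single ratio.

```python
from collections import defaultdict

REVIEW_THRESHOLD = 0.8  # assumption: borrowed from the four-fifths rule of thumb

def approval_rates(decisions):
    """decisions: iterable of (group, approved) pairs -> approval rate per group."""
    totals, approved = defaultdict(int), defaultdict(int)
    for group, ok in decisions:
        totals[group] += 1
        approved[group] += int(ok)
    return {group: approved[group] / totals[group] for group in totals}

def impact_ratio(rates):
    """Ratio of the lowest group approval rate to the highest."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest else 1.0

# Invented loan decisions recorded through model lineage for two groups.
decisions = (
    [("A", True)] * 8 + [("A", False)] * 2
    + [("B", True)] * 5 + [("B", False)] * 5
)
rates = approval_rates(decisions)
ratio = impact_ratio(rates)
needs_review = ratio < REVIEW_THRESHOLD
print(rates, round(ratio, 3), needs_review)
```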

Listen to the podcast “The digitization of governance in the age of AI”

05


Manage risk and compliance with reports and auditing

Simplify and democratize compliance and risk management across the organization

Data privacy and security can be a confusing process with a wide variety of regulations that differ by industry, location and even the type of data itself. Greater consciousness of data privacy in the public will continue to lead to more regulation in the coming years, which businesses will have to track and adhere to so they remain compliant. A holistic data privacy and security solution should provide capabilities that help businesses stay aware of these policies, implement them effectively and regularly audit their compliance. Automation is another crucial factor that helps eliminate manual effort, saves time and increases accuracy.

As a first step, solutions should be used to break down complex regulations into a catalog of requirements, understand how they affect the business specifically and create actionable tasks that the business can undertake. Regulatory information should be ingested automatically from sources like Thomson Reuters and Wolters Kluwer and automatically applied to terminology and workflows. Similar regulations should also be deduplicated. Any actions should then be made clear and measurable with terminology that’s specific to that industry or organization. And those actions should be grouped logically and assigned to specific owners within the system. In this way, an organization need not be an expert on every regulatory compliance initiative to act on them.

The best solutions should also go beyond these steps to simplify adherence to regulations for business users who may not be involved with managing risk on a day-to-day basis. A user interface (UI) should be implemented that mitigates or even eliminates the need for training by using AI-powered dashboard widgets to help contextualize information in the moment and suggest courses of action. A business user, guided by a virtual assistant within the UI, can then easily follow suggestions on how to handle potentially sensitive data or whether it should be used at all, and can ask the assistant for guidance if confused.

Once proper guidelines and processes have been established, automated data collection and constant monitoring for dashboards and audits must be implemented. Tracking and documenting all data privacy and IT incidents automatically to facilitate root cause analysis is vital, but so is preventative monitoring. Outlier detection, using machine learning and statistical modeling, can be used to flag anomalous activities and give them a high-risk score. This process, in turn, can be set up to trigger an alert that will signal the data security team to investigate. To save time, enterprises should look for embedded workflow features that work out of the box for a variety of use cases. For more complex workflows, drag and drop functionality should be available to make creation easier.
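As a minimal statistical example of the outlier detection described above, the Python sketch below scores each user's activity with a modified z-score based on the median and the median absolute deviation. The user names, access counts and the 3.5 alert threshold (a common rule of thumb for this score) are assumptions for illustration; production monitoring would typically combine many signals and learned models.

```python
import statistics

ALERT_THRESHOLD = 3.5  # assumption: common rule of thumb for modified z-scores

def risk_scores(access_counts):
    """Score each user by how far their activity sits from the median."""
    values = list(access_counts.values())
    median = statistics.median(values)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = statistics.median([abs(v - median) for v in values])
    scale = mad if mad > 0 else 1.0  # avoid dividing by zero for uniform activity
    return {
        user: 0.6745 * abs(count - median) / scale
        for user, count in access_counts.items()
    }

# Invented number of sensitive records read by each user in one day.
daily_record_reads = {"ana": 40, "ben": 42, "chen": 38, "dee": 41, "eli": 400}

scores = risk_scores(daily_record_reads)
alerts = sorted(user for user, score in scores.items() if score > ALERT_THRESHOLD)
print(alerts)
```

The list of flagged users is what would feed the alert that prompts the data security team to investigate.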

Auditing is equally important and should be done not only when a problem has been identified, but as part of routine data privacy and security practices. One-click audit reports are a clear benefit in this regard, allowing regular, quick check-ups. These reports can quickly show key stakeholders how the data is being used and by whom over certain time periods. Of course, for more in-depth audits, the ability to manage and monitor the audit’s execution, as well as the assignment and tracking of resources from a central system, is crucial. That’s why having a proper data lineage, as discussed earlier, and an audit trail is so essential. Data points such as modification of configuration data, user actions, privileged access and system events should all be accessible. Organizations should also consider the length of time the audit data is available. Whether performing an internal or external audit, lacking needed records is far from ideal.
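A quick audit report of who used which data over a time period can be sketched as a filter and a count over the audit trail. The log entries, field names and `audit_report` function below are hypothetical; a real platform would read these events from its own log store.

```python
from collections import Counter
from datetime import datetime

# Hypothetical audit trail entries covering user actions and configuration changes.
AUDIT_LOG = [
    {"time": datetime(2021, 3, 1, 9, 0), "user": "ana", "action": "read", "asset": "sales"},
    {"time": datetime(2021, 3, 2, 14, 30), "user": "ana", "action": "read", "asset": "sales"},
    {"time": datetime(2021, 3, 3, 11, 15), "user": "ben", "action": "update_config", "asset": "masking_policy"},
    {"time": datetime(2021, 4, 5, 8, 45), "user": "ben", "action": "read", "asset": "sales"},
]

def audit_report(log, start, end):
    """Count events per (user, asset, action) with start <= time < end."""
    in_window = [entry for entry in log if start <= entry["time"] < end]
    return Counter((e["user"], e["asset"], e["action"]) for e in in_window)

march_report = audit_report(AUDIT_LOG, datetime(2021, 3, 1), datetime(2021, 4, 1))
for (user, asset, action), count in sorted(march_report.items()):
    print(f"{user:5} {action:14} {asset:15} {count}")
```

How far back such a report can reach depends entirely on how long the underlying records are retained, which is why the retention period matters.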

One final consideration is the auditing of models, which was discussed more thoroughly in the previous section. Given the importance of models in driving insight and the increased scrutiny they are receiving for accuracy on a company and regulatory level, it behooves data professionals to check them often. A good place to start is by creating a comprehensive model inventory and maintaining it along with the purpose of each model. Then ownership, roles and responsibilities can be established for each model. From there, interactive dashboards can be set up to indicate model risk and further assessments can be conducted if anything looks amiss.

See how General Motors unified its audit, risk and control activities

Read our ebook about a fully integrated governance, risk and compliance platform (PDF, 7.2 MB)

Explore the Total Economic Impact study of one solution

06


Find your holistic data privacy and security solution

See what IBM Cloud Pak for Data has to offer for your data privacy and security needs

Data privacy and security can and should work together with the universal business desire to get as much value as possible out of data. Connecting disparate data through data virtualization helps introduce privacy and security at a single access point while offering users an easier way to self-serve the data they need. Strong governance makes the right data, quality data, easier to find for those who should have access to it, while allowing sensitive data to remain hidden unless appropriate. And the ability to conduct real-time monitoring and audits helps secure the systems and comply with regulations, but it also helps the business mitigate data loss through breaches and keep models accurate.

This holistic system can be found on the IBM Cloud Pak® for Data platform, which brings together a comprehensive solution that addresses all the needs discussed previously, with built-in data virtualization and the cataloging of the IBM Watson® Knowledge Catalog. The IBM Security™ Guardium® Insights solution and IBM OpenPages® software also provide monitoring and auditing capabilities as part of the IBM Cloud Pak for Data platform. To learn more about this solution, talk with one of our experts using the following link.