Explore the advanced analytics platform, Part 8
The Information Governance Model
This is the eighth tutorial in the Advanced Analytics Platform (AAP) series. In previous tutorials, we provided an overview of key use cases, algorithms, discovery patterns, and data/logic flows. These tutorials help you learn when to use AAP and how to assemble and integrate it into an end-to-end application architecture. Part 5, "Deep dive into discovery and visualization," focused on information architecture. It covered aspects of data modeling and ontology for data discovery. The "Customer profile hub" tutorial explored development of the customer profile data model. It described how to organize ingested data and derived attributes to get valuable insights about customer behaviors.
This tutorial builds on the previous tutorials by examining information governance of the data that is used in AAP. For example, assume that a Telecom service provider, using Call Detail Records (CDR) and some of the techniques that are described in previous tutorials, discovers that a customer is in a new international location. This discovery raises key questions about both the power and the risks of the information being governed:
- Can we use this information to sell Telecom and third-party services in this new international location (privacy preferences)?
- Is this a fraudulent event where someone is stealing the identity of the customer for unauthorized use (data theft)?
- Was the handset returned by one customer and sold to another customer without the customer records being properly updated (data quality)?
Data is one of the greatest assets an organization has, but data is also increasingly difficult to manage and control. Clean, trusted data helps organizations provide better service, drive customer loyalty, and spend less effort complying with regulatory policies. However, the data can also be considered an organization's greatest source of risk. Using information effectively brings with it the promise of increased innovation by optimizing people and processes through creative uses of information. Conversely, poor data management often means poor business decisions and results, with greater exposure to compliance violations and theft.
Big data brings extra considerations to information governance processes, tools, and organizations. Governance becomes even more important as we move toward low-latency decisions and high volumes of ungoverned external data. Several key governance questions need to be asked in big data environments, including this one: how do you apply information governance when real-time analytics and real-time decision-making force low-latency data curation?
From structured to unstructured data (including customer and employee data, metadata, trade secrets, email, video, and audio), organizations must find a way to govern data in alignment with business requirements without obstructing the free flow of information and innovation. The Capability Maturity Model (CMM) describes a framework and methodology to measure progress for data governance. This structured collection of elements offers a steady, measurable progression to the desired final maturity state.
According to CMM, the five levels to measure progress for data governance are:
- Maturity Level 1 (initial): Processes are usually ad hoc, and the environment is not stable.
- Maturity Level 2 (managed): Successes are repeatable, but the processes might not repeat for all projects in the organization.
- Maturity Level 3 (defined): The organization's standard processes are used to establish consistency across the organization.
- Maturity Level 4 (quantitatively managed): Organizations set quantitative quality goals for both process and maintenance.
- Maturity Level 5 (optimizing): Quantitative process improvement objectives for the organization are firmly established and continually revised to reflect changing business objectives and are used as criteria in managing process improvement.
The IBM Data Governance Maturity Model helps to educate other stakeholders about how to make the strategy more effective. The Maturity Model is based on input from the members of the IBM Data Governance Council. It defines the scope of who needs to be involved in governing and measuring the way businesses govern data across an organization.
The IBM Data Governance Maturity Model measures data governance competencies based on these 11 categories of data governance maturity:
- Data risk management and compliance: A methodology by which risks are identified, qualified, quantified, avoided, accepted, mitigated, or transferred out. The common infrastructure can have varying requirements; for example, requirements related to high availability or disaster recovery. These are also areas where big data technologies are not as mature.
- Value creation: A process by which data assets are qualified and quantified to enable the business to maximize the value that is created by data assets. As big data deals with large volume and velocity, the infrastructure cannot be easily replicated in silos. Business value across organization divisions can be pooled to create a common infrastructure to share across the different organizations such as marketing, care, and risk management.
- Organizational structures and awareness: The level of mutual responsibility between business and IT, and the recognition of fiduciary responsibility to govern data across divisions. Each organization might bring diverse external big data sources with varying levels of veracity. As these data sources are curated and mined for common identifiers and use, it is important to understand federated unification, which lets each organization maintain its own environment while staying connected through the federated definitions.
- Stewardship: A quality-control discipline that is designed to ensure the custodial care of data for asset enhancement, risk mitigation, and organizational control. As external data, such as social media, is accessed, it is important to extend stewardship roles to cover that external data. Stewards should also consider privacy issues, especially with social media and usage data.
- Policy: The written articulation of desired organizational behavior. Big data lakes and the curated data adhere to these policies by using a Governance, Risk, and Compliance (GRC) framework. For example, consider an organization that uses usage data in its CRM environment and has established a policy that requires periodic deletion of this data to maintain customer privacy. The big data governance program might keep anonymized usage data for a longer period but remove the links to CRM.
- Data quality management: Methods to measure, improve, and certify the quality and integrity of production, test, and archival data. Big data brings data quality issues that are associated with data-in-motion and data-at-rest. You can use data mining with CRM and big data sources to improve data quality. For example, a billing address for a subscriber might be different from their service location. Using CDR data, it's possible to update service location and use that data to improve service quality.
- Information lifecycle management (ILM): A systematic, policy-based approach to information collection, use, retention, and deletion. You can easily fill petabytes of Hadoop storage with high-volume big data. Though the cost is less than that of a traditional business intelligence environment, the cost of keeping petabytes of storage for a long time adds up. ILM policies are based on volume projections, business value, and cost. The policies let businesses decide where to store the data (online for analytics versus offline for regulatory compliance), how much data to store (how much aggregated versus raw data), and when to start deleting the data (old usage patterns that might not be valid after lifestyle changes).
- Information security and privacy: The policies, practices, and controls used by an organization to mitigate risk and protect data assets. This dimension covers both definition and execution of the policy. This is the most important governance dimension for big data. Even though private and sensitive data should be carefully protected, the potential to uncover and store private and sensitive data exists. In some cases, under opt-in, subscribers agree to the use of private data for specific use cases. In those situations, the data should not be available outside the limited use cases for which the opt-in was obtained. Behavior data that is inferred from usage information (work location, buddy list, and hangouts) might be as private as, or in some cases more private than, demographic data such as name, phone number, and credit card information.
- Data architecture: The architectural design of structured and unstructured data systems and applications that enables data availability and distribution to appropriate users. In a typical organization, heavy past investments in business intelligence must be preserved. This leads to a hybrid architecture in which the transactional and demographic data might remain in a traditional business intelligence environment, and a big data architecture is added to bring in conversation and usage data. Organizing the sharing of ETLs, master/reference data, and metadata is important in these hybrid situations. For high-velocity data and information, the data architecture must be designed to meet latency requirements.
- Classification and metadata: The methods and tools that are used to create common semantic definitions for business and IT terms, data models, and repositories. A common business glossary, data lineage, and physical data representations are examples of metadata integration between traditional and big data. This is an evolving area, and big data brings both new challenges (record level data lineage versus field level data lineage, for example) and new opportunities (use of ontology to understand external data).
- Audit information logging and reporting: The organizational processes for monitoring and measuring the data value, risks, and effectiveness of governance.
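The policy example in the list above (delete CRM-linked usage data periodically, but keep anonymized usage data longer) can be sketched in a few lines. This is only an illustration under stated assumptions: the `UsageRecord` fields, the `apply_retention` function, and the 90-day window are all hypothetical and are not taken from any IBM product or from this series.

```python
from dataclasses import dataclass, replace
from datetime import date, timedelta
from typing import List, Optional

# Hypothetical retention window; the tutorial does not prescribe a value.
CRM_LINK_RETENTION = timedelta(days=90)


@dataclass(frozen=True)
class UsageRecord:
    subscriber_id: Optional[str]  # link back to CRM; None once anonymized
    event_date: date
    bytes_used: int


def apply_retention(records: List[UsageRecord], today: date) -> List[UsageRecord]:
    """Keep every usage record, but remove the CRM link from old ones."""
    cutoff = today - CRM_LINK_RETENTION
    return [
        replace(r, subscriber_id=None) if r.event_date < cutoff else r
        for r in records
    ]
```

Records older than the window keep their usage values for aggregate analysis but can no longer be joined back to a customer in CRM.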
Figure 1 shows an overview of the IBM Data Governance Maturity Model.
Figure 1. IBM's Data Governance Maturity Model
Figure 2 shows the measurement of data governance of a Global Information Services provider. We don't always measure all data governance dimensions. For example, in Figure 2, eight out of 11 dimensions were considered important and were included in the assessment. For each measured dimension, both current and target maturities were computed. The difference between the two provided a measure of the gaps that a data governance program needs to close.
Figure 2. Illustrative data governance maturity – current and target
Big data and governance challenges
Big data solutions grapple with many data governance challenges. The source data comes from internal and external sources, and it requires governance in the following areas:
- Data quality and matching
- Master data indexing
- Identification and protection of data privacy
It can be challenging to go through a formal governance exercise on all sources. However, if the data is left ungoverned, significant downstream challenges arise. The downstream challenges are:
Governance on read: It is challenging to apply governance during data ingestion when the data is generated by external sources at high velocity. As a result, the data lake carries a fair amount of ungoverned data. The governance is then applied when the data is used. Unfortunately, this approach can result in mixing ungoverned data with highly governed data from Enterprise Data Warehouses (EDW) and other governed sources. Identify and curate the data before use, even if the usage is for data discovery and exploration purposes. Data scientists often assume that data quality issues are insignificant because the data sets are so large. That assumption might hold for population-level aggregation, but it might not hold when it's time to discover and define micro-segments.
Match in the lake: If big data is sourced from various systems, it often carries unmatched data. Unmatched data is not linked to a common identifier, such as a common subscriber ID. As the data grows, so does the effort that is required to match the data. Often the data arrives with varying levels of latency from different data sources, which makes it challenging to correlate during data ingestion. The alternative is to dump unmatched data in the lake with the hope of matching it there. However, the cost of matching is lower when the matching is done closer to the source.
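As a rough sketch of matching close to the source, the following function attaches a common subscriber ID to each record at ingestion time and sets aside the records it cannot match. The record layout, the `device_id` key, and the lookup table are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def match_to_subscriber(
    records: List[Record], device_to_subscriber: Dict[str, str]
) -> Tuple[List[Record], List[Record]]:
    """Split records into matched (with subscriber_id added) and unmatched."""
    matched: List[Record] = []
    unmatched: List[Record] = []
    for rec in records:
        subscriber_id = device_to_subscriber.get(str(rec.get("device_id")))
        if subscriber_id is None:
            unmatched.append(rec)  # route to a review queue, not the lake
        else:
            matched.append({**rec, "subscriber_id": subscriber_id})
    return matched, unmatched
```

Keeping the unmatched records in a separate list makes the size of the matching problem visible at the source instead of hiding it in the lake.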
Data relevance for analysis: Big data can include many attributes that are often duplicated across many observations. Similarly, external data sources, such as social media, might carry more data than is needed for insight development. If the entire raw data set is moved to the data lake, its size can grow rapidly, even for inexpensive Hadoop storage. It is not uncommon for a Telecom to have network usage probes that generate data at rates approaching thousands of gigabits per second. If stored for a week, this data can explode to petabytes. Storing raw data for extended time periods is not advisable in such a situation. Hold the data that is needed for analytics and discard or archive the rest.
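The storage claim above is easy to check with a little arithmetic. The sketch below assumes a sustained rate and decimal units (1 PB = 10^15 bytes); the function name is hypothetical.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds


def weekly_volume_petabytes(gigabits_per_second: float) -> float:
    """Raw volume, in decimal petabytes, of one week at a sustained rate."""
    bytes_per_second = gigabits_per_second * 1e9 / 8  # 8 bits per byte
    return bytes_per_second * SECONDS_PER_WEEK / 1e15


# A sustained 1,000 Gbps probe feed yields about 75.6 PB in one week.
print(round(weekly_volume_petabytes(1000), 1))
```

Even at the low end of "thousands of gigabits per second," one week of raw probe data reaches tens of petabytes.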
Privacy: Privacy policies typically define customer privacy in terms of Personally Identifiable Information (PII). However, a fair amount of private information can be inferred from other data. Take location (presence of a device at a certain latitude and longitude), for example. Raw location data might be considered as private as that person's credit card and Social Security data. Explicit customer permission is necessary for access to, and use of, such data.
Remember until contradicted: Most data becomes stale over time. In the US, approximately one-third of customers change their residence every year. This can affect their hangouts and interest in specific locations. Use new data that contradicts a past insight to build evidence for a change. The analytics system should be capable of assigning different weights to past insights based on elapsed time and contradicting evidence.
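One possible way to weight past insights by elapsed time and contradicting evidence is exponential decay. This is a minimal sketch, not the method used by AAP: the half-life of 180 days and both function names are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def insight_weight(age_days: float, half_life_days: float = 180.0) -> float:
    """Weight of an observation; halves every `half_life_days` (assumed)."""
    return 0.5 ** (age_days / half_life_days)


def belief_score(
    observations: Iterable[Tuple[float, bool]], half_life_days: float = 180.0
) -> float:
    """Combine (age_days, supports_insight) pairs into a score in [-1, 1].

    Positive means the insight still holds; negative means that recent
    contradicting evidence outweighs older supporting evidence.
    """
    total = 0.0
    signed = 0.0
    for age_days, supports in observations:
        weight = insight_weight(age_days, half_life_days)
        total += weight
        signed += weight if supports else -weight
    return signed / total if total else 0.0
```

For example, one supporting observation from a year ago and one contradicting observation from today produce a negative score, which signals that the old insight (such as a favorite hangout) should be revisited.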
Data transformation and quality in data lake-driven discovery
Data lakes are large repositories that contain vast amounts of data in raw format. Conversation and usage data accumulate in these data lakes and are analyzed for useful insight about subscribers. For example, behaviors and attitudes toward products and services can be discovered, as described in previous tutorials in this series.
Most of the usage data is structured. CDR data from network sources, as described in the "Customer profile hub" tutorial, is a good example. However, the CDR data might be sourced from various network sources, each with its own format. To analyze this data, first unify it so that a discovery or prediction engine can see all the data in the same way. Some data might be missing, and some might arrive with substantially longer delays. In addition, curate the data to remove noise.
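The unification step can be pictured as a set of small normalizers, one per source, that all emit the same record shape. The two source formats below (field names such as `dur_s` and `start_epoch`) are invented for illustration; real CDR layouts differ by vendor and network element.

```python
from datetime import datetime, timezone
from typing import Callable, Dict

Record = Dict[str, object]


def from_source_a(rec: Record) -> Record:
    """Hypothetical source A: ISO 8601 start time, duration in seconds."""
    return {
        "caller": rec["caller"],
        "callee": rec["callee"],
        "start": datetime.fromisoformat(str(rec["start"])),
        "duration_s": float(rec["dur_s"]),
    }


def from_source_b(rec: Record) -> Record:
    """Hypothetical source B: epoch start time, duration in milliseconds."""
    return {
        "caller": rec["a_number"],
        "callee": rec["b_number"],
        "start": datetime.fromtimestamp(float(rec["start_epoch"]), tz=timezone.utc),
        "duration_s": float(rec["duration_ms"]) / 1000.0,
    }


NORMALIZERS: Dict[str, Callable[[Record], Record]] = {
    "A": from_source_a,
    "B": from_source_b,
}


def unify(source: str, rec: Record) -> Record:
    """Convert a source-specific CDR into the common schema."""
    return NORMALIZERS[source](rec)
```

A record with a missing field raises `KeyError` here, which is one simple way to surface incomplete data for curation instead of passing it on silently.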
The features that are extracted for each entity have different lifespans. Some extracted features are ephemeral, because they relate to events that are about to happen or that are valid only for a limited time period. Examples are actions that users are about to take, such as going to the movies, buying a product, or eating. These actions are often shared in social media, and they have limited validity over time.
Gender, age, marital status, and ethnicity are examples of features with data that is valuable for a long time. Some of these features are difficult to infer or extract. If a user does not state age explicitly, inferring it remains an open research question, because age is difficult to infer based on language features only. The predictions and inferences for these features should also carry a metric for the confidence level of the predictor or of the extracted feature. In terms of governance, add a confidence-level metric to each such variable.
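A feature record that carries both a lifespan and a confidence level might look like the following sketch. The class and field names are hypothetical; the only ideas taken from the text are that features can be ephemeral or long-lived and that inferred values need a confidence metric.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class InferredFeature:
    name: str
    value: str
    confidence: float                # predictor confidence, 0.0 to 1.0
    observed_at: datetime
    ttl: Optional[timedelta] = None  # None marks a long-lived feature

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")

    def is_valid(self, now: datetime) -> bool:
        """Ephemeral features expire; long-lived ones stay valid."""
        return self.ttl is None or now <= self.observed_at + self.ttl
```

An ephemeral feature such as "plans to see a movie tonight" would get a short `ttl`, while an inferred age band would have no `ttl` but a modest `confidence` value.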
Social data is inherently unstructured, and most of these repositories are open to external manipulation. For example:
- External factors: Spam, publicity, link abuse
- Internal factors: Inaccuracies, self-reporting, formatting problems
There are several ways to evaluate and enhance the quality of the data that is contained in a data lake. They fall into two basic approaches:
- Community based
- Machine supervised methods
The community-based methods have proved successful in the past, but they rely on an active community to curate the data. Wikipedia and Yahoo Answers are good examples of large communities curating data.
In certain cases, it is possible to use automatic methods to detect possible quality issues, correct the issues, then automatically enhance the quality of the data source. An example of this is the multiple automated agents that inspect new Wikipedia articles to find possible spam, and multiple agents to detect robots posting on Facebook and Twitter.
Other common problems that affect data quality in social media data include sarcasm, neologisms (newly coined words), abbreviations, slang, and so on. Often, domain-specific ontologies are used for parsing the data to understand and translate these words, and also to keep up with trends and other changes.
Governance architecture and products
Figure 3 shows the four major components of big data information governance architecture. These components are:
Data sources: Includes all the raw data, landing zones, discovery zones, and harmonized zones. Store the data by using flat files, Hadoop, columnar, or relational data stores.
Information fabric: Provides the policies and design of the governance and the tools to organize the data. The primary repository for governance is the Information Governance Catalog. Additionally, other tools such as Streams, Spark, Optim, Guardium, and InfoServer provide the design and execution of governance.
Security: Enforces security by applying the policies that are set in the information fabric. It uses standard security tools such as LDAP, Kerberos, HTTPS, certificates, and so on.
Analytics, reporting, and consumption: Provides tools to monitor governance. It also provides tools for analyst and user consumption of the governance structure, by using R, ML, SPSS, and Cognos.
Figure 3. Information governance architecture for big data
For a company to use data to gain insights and make the right decisions, governance to manage the enterprise data is critical. In this tutorial, you learned about the governance framework, specific components for data quality, privacy, and the overall governance architecture. You learned which IBM tools support governance.
We now come to the end of this series on the Advanced Analytics Platform.
The intent of this series is to help you understand the need for an advanced analytics platform within the enterprise and how to design such a platform. This series started with an overview of the platform. You then learned how to use that platform to implement multiple use cases that run across various industries. Because you can implement the platform incrementally, the next two topics covered two stand-alone patterns, text analytics and location analytics, as starting points. The complexity of the system increases as all this data accumulates. How to discover the data accumulating in the data lake and visualize it to get further insights was the subject of Part 5. Next came analyzing large volumes of data in real time, followed by common data structures, in particular the creation of a 360-degree profile. This tutorial about the governance of data concludes the series. The governance of data is often ignored, but it is important. Making correct, effective decisions is vital to your business, and it's hard to make those decisions without data governance.