What is dark data?

2 October 2023

According to Gartner, dark data refers to the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes, such as analytics, business relationships and direct monetizing.¹

Most companies today store vast quantities of dark data. In Splunk’s global research survey of more than 1,300 business and IT decision makers, 60 percent of respondents reported that half or more of their organization’s data is considered dark. A full one-third of respondents reported this amount to be 75 percent or more.²

Dark data accumulates because organizations have embraced the idea that it’s valuable to store all the information they can possibly capture in big data lakes. This is partially due to the advent of inexpensive storage, which has made it easy to justify storing so much data—in the event that one day it becomes valuable.

In the end, most companies never use even a fraction of what they store. Typically, the storage repository doesn't label metadata appropriately, some of the data is in a format the integrated tools can't read, or the data isn't retrievable through a query.

Dark data is a major limiting factor in producing good data analysis because the quality of any data analysis depends on the body of information accessible to the analytics tools, both promptly and in full detail.

Dark data also creates liabilities, drives up storage costs and leads to missed opportunities when teams don't realize what data is available to them.

Why data goes dark

There are numerous causes for an organization’s data to go dark, including:

Lack of awareness: Data obtained in the course of normal business operations often goes dark because organizations are either unaware of its existence, or don’t understand its value or relevance.
Data stuck in silos: When different departments within an organization collect and store data independently, it can lead to data fragmentation and isolation. These data silos may not be accessible or visible to other teams, who would potentially find the data quite valuable.
Lack of data governance: Without a robust data governance framework in place, organizations may struggle to manage and track data across their ecosystem effectively. This causes data to become disorganized, lost and unusable.
Legacy systems: As organizations upgrade their software and hardware, older systems may be retired or become less relevant. Data stored in these legacy systems goes dark if it can’t be integrated with the organization’s modern analytics tools.
Incomplete data integration: Incomplete or ineffective data integration processes can result in data gaps and inconsistencies. This can leave certain datasets inaccessible or not properly linked to other data sources.
Changing business priorities: As business priorities evolve, certain datasets may become less relevant or fall out of focus. Data that was once actively used may be left in the dark as organizational objectives shift.
Limited resources and literacy: Organizations with limited resources may prioritize data collection and storage over data analysis. In addition, insufficient data literacy among employees can hinder the discovery and use of valuable data.
Data quality issues: Poor data quality, such as inaccurate or incomplete data, can lead to data being discounted or ignored. Data perceived as unreliable is less likely to be utilized, effectively rendering it dark.
Regulatory compliance purposes: Many compliance and governing standards force organizations to follow strict regulations for how long they must store sensitive data. They often wind up storing it long after the mandatory period because they fail to keep track of what sensitive data should be destroyed.
Redundant, obsolete, trivial (ROT) data: ROT is created when employees save multiple copies of the same information, outdated information and extraneous information that does not help the organization meet its goals.
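
The "redundant" part of ROT data is the easiest to detect automatically. As a minimal sketch, assuming the data lives as files under a single directory tree, the following Python groups byte-for-byte identical files by content hash. The function names are illustrative and not taken from any particular product.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_redundant_copies(root: Path) -> list[list[Path]]:
    """Group files under root whose contents are byte-for-byte identical."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    # Only groups with more than one file represent redundant copies.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

This catches exact copies only. Near-duplicates, such as the same report saved in two formats, would need fuzzier comparison.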

Types of dark data

Dark data may be structured, unstructured or semi-structured. The type affects how easily the data can be discovered for timely and complete analytics.

Structured data is information added to clearly defined spreadsheet or database fields before being stored.

Server log files, Internet of Things (IoT) sensor data, customer relationship management (CRM) databases and enterprise resource planning (ERP) systems are examples of dark data created from structured data sources.

Most forms of sensitive data, such as electronic bank statements, medical records and encrypted customer data, are typically in structured form, but they are difficult to view and categorize because of permission issues.

Unlike structured data, unstructured data includes information that can’t be organized in databases or spreadsheets for analysis without conversion, codification, tiering and structuring.

Email correspondences, PDFs, text documents, social media posts, call center recordings, chat logs and surveillance video footage are examples of dark data created from unstructured data sources.

Semi-structured data is unstructured data that contains some information in defined data fields. Although it is not as easy to discover as structured data, it can be searched or catalogued.

Examples include HTML code, invoices, graphs, tables and XML documents.
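
A rough first pass at sorting stored files into these three types might look like the sketch below. The extension-to-category mapping is an assumption, loosely based on the examples in this section. A real inventory would usually need to inspect file contents as well.

```python
from collections import Counter
from pathlib import Path

# Hypothetical mapping from file extension to data type, loosely following
# the examples in this article. Extend or correct it for a real environment.
CATEGORY_BY_EXTENSION = {
    ".log": "structured",        # server log files
    ".db": "structured",         # database files
    ".html": "semi-structured",
    ".xml": "semi-structured",
    ".pdf": "unstructured",
    ".txt": "unstructured",
    ".eml": "unstructured",      # email correspondence
    ".mp4": "unstructured",      # video footage
    ".wav": "unstructured",      # call center recordings
}


def categorize(filename: str) -> str:
    """Return the assumed data type for a file name, or 'unknown'."""
    return CATEGORY_BY_EXTENSION.get(Path(filename).suffix.lower(), "unknown")


def summarize(filenames: list[str]) -> Counter:
    """Count how many files fall into each assumed data type."""
    return Counter(categorize(name) for name in filenames)
```

Files that land in "unknown" are worth a closer look, since formats the tooling doesn't recognize are a common way for data to go dark.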

The costs of dark data

The costs of dark data can be significant and extend well beyond the direct financial cost of storage. Direct and indirect costs include:

Data storage costs

Storing data, even if it's not actively used, requires physical or digital storage infrastructure. This can include servers, data centers, cloud storage solutions and backup systems. The more data in your ecosystem, the more data storage capacity you need, which leads to increased infrastructure costs.

Liability costs

Governments around the world have introduced a host of privacy laws over the past several years. These laws apply to all data, even data that's sitting unused in analytics repositories, so organizations remain liable for dark data they never use.

Opportunity costs

Many companies lose out on opportunities by not using this data. While it’s good to get rid of dark data that’s actually not usable—due to risks and costs—it pays to first analyze what data is available to determine what might be usable.

Inefficiency costs

Managing large volumes of data, including dark data, can slow down data retrieval and analysis processes. Employees may spend more time searching for relevant information, leading to reduced productivity and increased labor costs.

Risk costs

Dark data can pose risks in terms of insufficient cybersecurity, data breaches, compliance violations and data loss. These risks can result in reputational damage and financial consequences.

Data quality issues and dark data

Sometimes dark data gets created because of data quality issues.

For example, an AI tool automatically generates a transcript from an audio recording but makes some transcription mistakes. Someone keeps the transcript anyway, intending to correct it at some point, but never does.

When organizations do attempt to clean poor quality data, they sometimes miss what’s causing the issue. Without the proper understanding, it’s impossible to ensure that the data quality issue won’t continue happening in the future.

This situation then becomes cyclical, because rather than simply employing deletion policies for dark data that sits around without ever getting used, organizations let it continue to sit and contribute to a growing data quality issue.

Fortunately, there are three steps for data quality management that organizations can take to help alleviate this issue:

Analyze and identify the “as is” situation: In order to prioritize issues, first identify all current issues, existing data standards and business impact.
Prevent bad data from recurring: Next, evaluate the root cause of each issue and apply resources to tackle the problem in a sustainable way so it won’t happen again.
Communicate often along the way: Share what’s happening, what the team is doing, the impact of that work and how those efforts connect to business goals.

How to shine a light on dark data

For all the costs and data quality issues of dark data, there are upsides. As Splunk puts it, “dark data may be one of an organization’s biggest untapped resources.”³

By taking a proactive approach to managing dark data, organizations can bring it to light. This not only reduces liabilities and costs, but also gives teams the resources they need to discover insights from hidden data.

When it comes to handling dark data and potentially using it to make better data-driven decisions, there are several best practices to follow:

Break down silos

Dark data often comes about because of silos within the organization. One team creates data that could be useful to another, but that other team doesn’t know about it. Breaking down those silos makes that data available to the team who needs it. It goes from sitting around to providing immense value.

Improve data management

It’s important to understand what data exists within the organization. This effort starts by classifying all data within the organization to get a complete and accurate view. From there, teams can begin to organize their data better with the goal of making it easier for individuals across teams to find and use what they need.
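
One way to start such a classification effort is a simple inventory of what exists and how long it has sat untouched. The sketch below assumes file-based storage and uses each file's last-modified time as a stand-in for "last used". Both are assumptions, and the one-year threshold is purely illustrative.

```python
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class InventoryEntry:
    path: Path
    size_bytes: int
    days_since_modified: float


def build_inventory(root: Path, now: Optional[float] = None) -> list[InventoryEntry]:
    """List every file under root with its size and age in days."""
    now = time.time() if now is None else now
    entries = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            stat = path.stat()
            age_days = (now - stat.st_mtime) / 86400  # seconds per day
            entries.append(InventoryEntry(path, stat.st_size, age_days))
    return entries


def stale_entries(entries: list[InventoryEntry],
                  max_age_days: float = 365.0) -> list[InventoryEntry]:
    """Return entries not modified within max_age_days (dark data candidates)."""
    return [entry for entry in entries if entry.days_since_modified > max_age_days]
```

The stale list is only a set of candidates for review. Whether a file is valuable, redundant or safe to delete still needs a human or policy decision.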

Set data governance policies

Introducing a data governance policy can help address the challenge over the long term. This policy should cover how all incoming data gets reviewed and offer clear guidelines for what should be retained (and organized to maintain clear data management), archived or destroyed. An important part of this policy is being strict about what data should be destroyed and when. Enforcing data governance and regularly reviewing practices can help minimize the amount of dark data that will never be used.
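
The retain, archive or destroy decision can be expressed as a small, testable rule. The following is a minimal sketch that assumes a purely age-based policy. The `RetentionRule` thresholds are hypothetical, and real policies typically also depend on data type and legal requirements.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Action(Enum):
    RETAIN = "retain"
    ARCHIVE = "archive"
    DESTROY = "destroy"


@dataclass(frozen=True)
class RetentionRule:
    """Hypothetical rule: archive after one age, destroy after a later one."""
    archive_after_days: int
    destroy_after_days: int


def decide(created: date, today: date, rule: RetentionRule) -> Action:
    """Return the action the rule prescribes for data created on a given date."""
    age_days = (today - created).days
    if age_days >= rule.destroy_after_days:
        return Action.DESTROY
    if age_days >= rule.archive_after_days:
        return Action.ARCHIVE
    return Action.RETAIN
```

Encoding the policy this way makes the destroy threshold explicit, which supports the strictness about deletion that the policy calls for.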

Use ML and AI tools to parse data

To help discover dark data, machine learning (ML) and artificial intelligence (AI) can do the heavy lifting of categorizing it and analyzing data that may contain valuable insights. In addition, ML automation can help with data privacy compliance by automatically redacting sensitive information from stored data.
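
To illustrate the idea of automated redaction, the sketch below uses plain regular expressions as a stand-in for a trained model. This is not ML, and the two patterns (email addresses and US Social Security number formats) are assumptions chosen for illustration. A production system would need far broader coverage.

```python
import re

# Simple patterns standing in for a trained entity-recognition model.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each match of a known sensitive pattern with a labeled marker."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text
```

Swapping the regular expressions for a model changes how sensitive spans are found, but the overall flow of detecting, replacing and storing the redacted copy stays the same.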

Footnotes

¹ Gartner Glossary, Gartner

² The State of Dark Data, Splunk, 2019

³ Dark Data: Discovery, Uses & Benefits of Hidden Data, Splunk, 03 August 2023
