What is dark data?

Dark data is the information that organizations accumulate but often never use for analytics or decision-making.

Most companies today store vast quantities of dark data. In Splunk’s global research survey of more than 1,300 business and IT decision makers, 60% of respondents reported that half or more of their organization’s data is considered dark. A full one-third of respondents reported this amount to be 75% or more.¹

Dark data accumulates because organizations have embraced the idea that it’s valuable to store all the information they can possibly capture in big data lakes. This is partially due to the advent of inexpensive storage, which has made it easy to justify storing so much data—in case one day it becomes valuable.

In the end, most companies use only a fraction of what they store, for three common reasons: the storage repository doesn’t label the data with appropriate metadata, some of the data is in a format the integrated tools can’t read, or the data isn’t retrievable through a query.

Dark data is a major limiting factor in producing good data analysis because the quality of any data analysis depends on the body of information accessible to the analytics tools, both promptly and in full detail.

Dark data also creates liabilities, significant storage costs and missed opportunities, because teams don’t realize what data is potentially available to them.

Why data goes dark

There are numerous causes for an organization’s data to go dark, including:

Lack of awareness: Data obtained during normal business operations often goes dark because organizations are either unaware of its existence, or don’t understand its value or relevance.
Data stuck in silos: When different departments within an organization collect and store data independently, it can lead to data fragmentation and isolation. These data silos might not be accessible or visible to other teams, who would potentially find the data valuable.
Lack of data governance: Without a robust data governance framework in place, organizations might struggle to manage and track data across their ecosystem effectively. This causes data to become disorganized, lost and unusable.
Legacy systems: As organizations upgrade their software and hardware, older systems might be retired or become less relevant. Data stored in these legacy systems goes dark if it can’t be integrated with the organization’s modern analytics tools.
Incomplete data integration: Incomplete or ineffective data integration processes can result in data gaps and inconsistencies. This can leave certain datasets inaccessible or not properly linked to other data sources.
Changing business priorities: As business priorities evolve, certain datasets might become less relevant or fall out of focus. Data that was once actively used might be left in the dark as organizational objectives shift.
Limited resources and literacy: Organizations with limited resources might prioritize data collection and storage over data analysis. In addition, insufficient data literacy among employees can hinder the discovery and use of valuable data.
Data quality issues: Poor data quality, such as inaccurate or incomplete data, can lead to data being discounted or ignored. Data perceived as unreliable is less likely to be used, effectively rendering it dark.
Regulatory compliance purposes: Many compliance standards and regulations dictate how long organizations must store sensitive data. Organizations often wind up storing it long after the mandatory period because they fail to track which sensitive data should be destroyed.
Redundant, obsolete, trivial (ROT) data: ROT is created when employees save multiple copies of the same information, outdated information and extraneous information that does not help the organization meet its goals.
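
As a rough illustration of how the "redundant" part of ROT data might be surfaced, the following Python sketch groups documents whose contents are byte-for-byte identical by comparing SHA-256 digests. The function name, the in-memory sample and the file names are hypothetical; a real inventory would read content from the organization's storage systems and would also need to handle near-duplicates, which this sketch does not.

```python
import hashlib
from collections import defaultdict


def find_duplicate_groups(documents: dict[str, bytes]) -> list[list[str]]:
    """Group document names whose contents are byte-for-byte identical."""
    by_digest: dict[str, list[str]] = defaultdict(list)
    for name, content in documents.items():
        # Identical content always produces the same SHA-256 digest.
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append(name)
    # Only groups with more than one member are redundant copies.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical in-memory sample standing in for files in storage.
sample = {
    "report_final.txt": b"Q3 revenue summary",
    "report_final_copy.txt": b"Q3 revenue summary",
    "notes.txt": b"Meeting notes",
}
duplicate_groups = find_duplicate_groups(sample)
```

Here the two report files share a digest and land in one group, while the notes file is left out because nothing else matches it.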

Types of dark data

Dark data can be structured, unstructured or semi-structured, and the type affects how easily it can be discovered for timely and complete data analytics initiatives.

Structured data is information added to clearly defined spreadsheet or database fields before being stored.

Server log files, Internet of Things (IoT) sensor data, customer relationship management (CRM) databases and enterprise resource planning (ERP) systems are examples of dark data created from structured data sources.

Although most forms of sensitive data, such as electronic bank statements, medical records and encrypted customer data, are typically in structured form, they are difficult to view and categorize because of permission issues.

Unlike structured data, unstructured data includes information that can’t be organized in databases or spreadsheets for analysis without conversion, codification, tiering and structuring.

Email correspondences, PDFs, text documents, social media posts, call center recordings, chat logs and surveillance video footage are examples of dark data created from unstructured data sources.

Semi-structured data is unstructured data that contains some information in defined data fields. Although it is not as easy to discover as structured data, it can be searched or cataloged.

Examples include HTML code, invoices, graphs, tables and XML documents.
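
A first-pass inventory of these three types could be sketched in Python as follows. The mapping from file extension to data type is an assumption made for illustration, loosely following the examples above; file extensions are only a coarse signal, and a real classification effort would inspect the content itself.

```python
from pathlib import PurePath

# Hypothetical extension-to-type mapping, loosely based on the examples above.
EXTENSION_TYPES = {
    ".csv": "structured",
    ".xlsx": "structured",
    ".html": "semi-structured",
    ".xml": "semi-structured",
    ".pdf": "unstructured",
    ".txt": "unstructured",
    ".eml": "unstructured",
    ".wav": "unstructured",
    ".mp4": "unstructured",
}


def classify(path: str) -> str:
    """Return a coarse data type for a file path, or 'unknown'."""
    suffix = PurePath(path).suffix.lower()
    return EXTENSION_TYPES.get(suffix, "unknown")


# Hypothetical file paths used only to exercise the function.
inventory = {
    p: classify(p)
    for p in ["exports/customers.csv", "site/index.HTML", "calls/2023-01-05.wav"]
}
```

Anything the mapping does not recognize is reported as "unknown" rather than guessed, which keeps unreviewed files visible instead of silently miscategorized.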

The costs of dark data

The costs of storing dark data can be significant and extend well beyond the direct financial cost of dark data storage. Direct and indirect costs include:

Data storage costs

Storing data, even if it’s not actively used, requires physical or digital storage infrastructure. This can include servers, data centers, cloud storage solutions and backup systems. The more data in your ecosystem, the more data storage capacity you need, which leads to increased infrastructure costs.

Liability costs

Governments around the world have introduced a host of privacy laws over the past several years. These laws apply to all data, even data that sits unused in analytics repositories, so dark data carries the same compliance obligations as data in active use.

Opportunity costs

Many companies lose out on opportunities by not using this data. While it’s good to get rid of dark data that’s not usable—due to risks and costs—it pays to first analyze what data is available to determine what might be usable.

Inefficiency costs

Managing large volumes of data, including dark data, can slow down data retrieval and analysis processes. Employees might spend more time searching for relevant information, leading to reduced productivity and increased labor costs.

Risk costs

Dark data can pose risks in terms of insufficient cybersecurity, data breaches, compliance violations and data loss. These risks can result in reputational damage and financial consequences.

Data quality issues and dark data

Sometimes dark data gets created because of data quality issues.

For example, an audio recording is transcribed automatically, but the AI that generates the transcript makes some mistakes. Someone keeps the transcript anyway, intending to correct it at some point, but never does.

When organizations do attempt to clean poor quality data, they sometimes miss what’s causing the issue. Without the proper understanding, it’s impossible to ensure that the data quality issue won’t continue happening in the future.

This situation then becomes cyclical, because rather than simply employing deletion policies for dark data that sits around without ever getting used, organizations let it continue to sit and contribute to a growing data quality issue.

Fortunately, there are three steps for data quality management that organizations can take to help alleviate this issue:

Analyze and identify the “as is” situation: In order to prioritize issues, first identify all current issues, existing data standards and business impact.
Prevent bad data from recurring: Next, evaluate the root cause of each issue and apply resources to tackle the problem in a sustainable way so it won’t happen again.
Communicate often along the way: Share what’s happening, what the team is doing, the impact of that work and how those efforts connect to business goals.

How to shine a light on dark data

For all the costs and data quality issues of dark data, there are upsides. As Splunk puts it, “dark data may be one of an organization’s biggest untapped resources.”²

By taking a proactive approach to managing dark data, organizations can bring it to light. This not only reduces liabilities and costs, but also gives teams the resources they need to discover insights from hidden data.

When handling dark data and potentially using it to make better data-driven decisions, there are several best practices to follow:

Break down silos

Dark data often comes about because of silos within the organization. One team creates data that could be useful to another, but that other team doesn’t know about it. Breaking down those silos makes that data available to the team who needs it. It goes from sitting around to providing immense value.

Improve data management

It’s important to understand what data exists within the organization. This effort starts by classifying all data within the organization to get a complete and accurate view. From there, teams can begin to organize their data better with the goal of making it easier for individuals across teams to find and use what they need.

Set data governance policies

Introducing a data governance policy can help address the challenge long term. This policy should cover how all incoming data gets reviewed and offer clear guidelines for what should be retained (and organized to maintain clear data management), archived or destroyed. An important part of this policy is being strict about what data should be destroyed and when. Enforcing data governance and regularly reviewing practices can help minimize the amount of dark data that won’t be used.
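
The retain, archive or destroy guideline could be expressed as a simple rule, as in this minimal Python sketch. The thresholds and the choice to measure age from the last use are hypothetical; actual values would come from the organization's own policy and any regulations that apply to the data in question.

```python
from datetime import date, timedelta

# Hypothetical thresholds; real values depend on policy and regulation.
ARCHIVE_AFTER = timedelta(days=365)
DESTROY_AFTER = timedelta(days=365 * 7)


def retention_action(last_used: date, today: date) -> str:
    """Return 'retain', 'archive' or 'destroy' based on time since last use."""
    age = today - last_used
    if age >= DESTROY_AFTER:
        return "destroy"
    if age >= ARCHIVE_AFTER:
        return "archive"
    return "retain"


# Example: a dataset last used about two and a half years before the review date.
action = retention_action(date(2022, 1, 1), date(2024, 6, 1))
```

Passing the review date in explicitly, rather than reading the system clock inside the function, makes the rule easy to audit and to rerun for a past or future date.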

Use ML and AI tools to parse data

To help discover dark data, machine learning (ML) and artificial intelligence (AI) can do the heavy lifting of categorizing it by analyzing data that might contain valuable insights. In addition, ML automation can support compliance with data privacy regulations by automatically redacting sensitive information from stored data.
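
To make the idea of automated redaction concrete, here is a minimal Python sketch. It is a rule-based stand-in that uses regular expressions, not an ML model, and the two patterns (email addresses and US Social Security number formatting) are simplified assumptions that would miss many real-world cases. An ML-based tool would typically replace or supplement the pattern step.

```python
import re

# Deliberately simple patterns for illustration; they miss many real formats.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Replace matched email addresses and SSN-formatted numbers with labels."""
    text = EMAIL_PATTERN.sub("[REDACTED EMAIL]", text)
    return US_SSN_PATTERN.sub("[REDACTED SSN]", text)


# Hypothetical record containing made-up personal details.
redacted = redact("Contact jane.doe@example.com, SSN 123-45-6789.")
```

Replacing matches with a labeled placeholder, instead of deleting them, keeps the surrounding text readable and shows reviewers what kind of information was removed.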

Footnotes

¹The State of Dark Data, Splunk, 2019

²Dark Data: Discovery, Uses & Benefits of Hidden Data, Splunk, 03 August 2023
