What is dark data?

According to Gartner, dark data refers to the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes, such as analytics, business relationships and direct monetizing.¹

Most companies today store vast quantities of dark data. In Splunk’s global research survey of more than 1,300 business and IT decision makers, 60 percent of respondents reported that half or more of their organization’s data is considered dark. A full one-third of respondents reported this amount to be 75 percent or more.²

Dark data accumulates because organizations have embraced the idea that it’s valuable to store all the information they can possibly capture in big data lakes. This is partially due to the advent of inexpensive storage, which has made it easy to justify storing so much data—in the event that one day it becomes valuable.

In the end, most companies use only a fraction of what they store, for three common reasons: the storage repository doesn't record metadata labels properly, some of the data is in a format the integrated tools can't read, or the data isn't retrievable through a query.

Dark data is a major limiting factor in producing good data analysis because the quality of any data analysis depends on the body of information accessible to the analytics tools, both promptly and in full detail.

Dark data also creates liabilities, significant storage costs and missed opportunities, because teams don't realize what data is potentially available to them.

Why data goes dark

There are numerous causes for an organization’s data to go dark, including:

  • Lack of awareness: Data obtained in the course of normal business operations often goes dark because organizations are either unaware of its existence, or don’t understand its value or relevance.

  • Data stuck in silos: When different departments within an organization collect and store data independently, it can lead to data fragmentation and isolation. These data silos may not be accessible or visible to other teams, who would potentially find the data quite valuable.

  • Lack of data governance: Without a robust data governance framework in place, organizations may struggle to manage and track data across their ecosystem effectively. This causes data to become disorganized, lost and unusable.

  • Legacy systems: As organizations upgrade their software and hardware, older systems may be retired or become less relevant. Data stored in these legacy systems goes dark if it can’t be integrated with the organization’s modern analytics tools.

  • Incomplete data integration: Incomplete or ineffective data integration processes can result in data gaps and inconsistencies. This can leave certain datasets inaccessible or not properly linked to other data sources.

  • Changing business priorities: As business priorities evolve, certain datasets may become less relevant or fall out of focus. Data that was once actively used may be left in the dark as organizational objectives shift.

  • Limited resources and literacy: Organizations with limited resources may prioritize data collection and storage over data analysis. In addition, insufficient data literacy among employees can hinder the discovery and use of valuable data.

  • Data quality issues: Poor data quality, such as inaccurate or incomplete data, can lead to data being discounted or ignored. Data perceived as unreliable is less likely to be utilized, effectively rendering it dark.

  • Regulatory compliance purposes: Many compliance standards and regulations dictate how long organizations must store sensitive data. Organizations often keep that data long after the mandatory period because they fail to track which sensitive data should be destroyed.

  • Redundant, obsolete, trivial (ROT) data: ROT is created when employees save multiple copies of the same information, outdated information and extraneous information that does not help the organization meet its goals.
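
One practical way to find the redundant part of ROT data is to hash file contents and group files that share a digest. The Python sketch below assumes the data lives as ordinary files under a single directory; the names `file_digest` and `find_duplicates` are illustrative and are not part of any product mentioned in this article.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files under root whose contents are byte-for-byte identical."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    # Only groups with more than one member represent redundant copies.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Files in the same group are exact copies, so all but one are candidates for removal. Outdated versions and near-duplicates would not be caught this way and need a different technique.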

Types of dark data

Dark data may be structured, unstructured or semi-structured. Each type differs in how easily it can be discovered for timely and complete data analytics initiatives.

Structured data is information added to clearly defined spreadsheet or database fields before being stored.

Server log files, Internet of Things (IoT) sensor data, customer relationship management (CRM) databases and enterprise resource planning (ERP) systems are examples of dark data created from structured data sources.

Although most forms of sensitive data, like electronic bank statements, medical records and encrypted customer data, are typically in structured form, they are difficult to view and categorize because of permission issues.

Unlike structured data, unstructured data includes information that can’t be organized in databases or spreadsheets for analysis without conversion, codification, tiering and structuring.

Email correspondences, PDFs, text documents, social media posts, call center recordings, chat logs and surveillance video footage are examples of dark data created from unstructured data sources.

Semi-structured data is unstructured data that contains some information in defined data fields. Although it is not as easy to discover as structured data, it can be searched or catalogued.

Examples include HTML code, invoices, graphs, tables and XML documents.

The costs of dark data

The costs of storing dark data can be significant and extend well beyond the direct financial cost of dark data storage. Direct and indirect costs include:

Data storage costs

Storing data, even if it's not actively used, requires physical or digital storage infrastructure. This can include servers, data centers, cloud storage solutions and backup systems. The more data in your ecosystem, the more data storage capacity you need, which leads to increased infrastructure costs.

Liability costs

Governments around the world have introduced a host of privacy laws over the past several years. These laws apply to all data, even data that’s sitting unused in analytics repositories.

Opportunity costs

Many companies lose out on opportunities by not using this data. While it’s good to get rid of dark data that’s actually not usable—due to risks and costs—it pays to first analyze what data is available to determine what might be usable.

Inefficiency costs

Managing large volumes of data, including dark data, can slow down data retrieval and analysis processes. Employees may spend more time searching for relevant information, leading to reduced productivity and increased labor costs.

Risk costs

Dark data can pose risks in terms of insufficient cybersecurity, data breaches, compliance violations and data loss. These risks can result in reputational damage and financial consequences.

Data quality issues and dark data

Sometimes dark data gets created because of data quality issues.

For example, an audio recording is automatically transcribed, but the AI that generated the transcript makes some mistakes. Someone keeps the transcript anyway, intending to fix the errors at some point, but never does.

When organizations do attempt to clean poor quality data, they sometimes miss what’s causing the issue. Without the proper understanding, it’s impossible to ensure that the data quality issue won’t continue happening in the future.

This situation then becomes cyclical, because rather than simply employing deletion policies for dark data that sits around without ever getting used, organizations let it continue to sit and contribute to a growing data quality issue.

Fortunately, there are three steps for data quality management that organizations can take to help alleviate this issue:

  1. Analyze and identify the “as is” situation: In order to prioritize issues, first identify all current issues, existing data standards and business impact.

  2. Prevent bad data from recurring: Next, evaluate the root cause of each issue and apply resources to tackle the problem in a sustainable way so it won’t happen again.

  3. Communicate often along the way: Share what’s happening, what the team is doing, the impact of that work and how those efforts connect to business goals.
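
For step 1, measuring how often each field is missing gives a first, quantifiable view of the “as is” situation. This is a minimal sketch, assuming records are already available as Python dictionaries; the function `missing_rates` and the sample field names are hypothetical, and missing values are only one of many possible data quality issues.

```python
def missing_rates(records: list[dict], fields: list[str]) -> dict[str, float]:
    """Return the share of records where each field is absent, None or blank."""
    if not records:
        return {field: 0.0 for field in fields}
    rates = {}
    for field in fields:
        missing = sum(
            1
            for record in records
            if record.get(field) is None or str(record.get(field)).strip() == ""
        )
        rates[field] = missing / len(records)
    return rates


# Hypothetical sample: three of the four records lack a usable email value.
sample = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 3},
    {"id": 4, "email": None},
]
rates = missing_rates(sample, ["id", "email"])
```

Fields with high missing rates are natural starting points for the root-cause work in step 2.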

How to shine a light on dark data

For all the costs and data quality issues of dark data, there are upsides. As Splunk puts it, “dark data may be one of an organization’s biggest untapped resources.”³

By taking a proactive approach to managing dark data, organizations can shine a light on it. This not only reduces liabilities and costs, but also gives teams the resources they need to discover insights from hidden data.

When it comes to handling dark data and potentially using it to make better data-driven decisions, there are several best practices to follow:

Break down silos

Dark data often comes about because of silos within the organization. One team creates data that could be useful to another, but that other team doesn’t know about it. Breaking down those silos makes that data available to the team that needs it. It goes from sitting around to providing immense value.

Improve data management

It’s important to understand what data exists within the organization. This effort starts by classifying all data within the organization to get a complete and accurate view. From there, teams can begin to organize their data better with the goal of making it easier for individuals across teams to find and use what they need.
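
A first pass at this kind of inventory can be sketched in a few lines. The example below assumes the data is stored as files on a file system and treats "not modified for a long time" as a rough signal that data may have gone dark; `InventoryEntry`, `build_inventory` and `stale_entries` are hypothetical names, and a real catalog would record far richer metadata such as owner, schema and sensitivity.

```python
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class InventoryEntry:
    path: str                   # path relative to the scanned root
    suffix: str                 # lowercase file extension, e.g. ".csv"
    size_bytes: int
    days_since_modified: float


def build_inventory(root: Path, now: Optional[float] = None) -> list[InventoryEntry]:
    """Walk root and record basic metadata for every file found."""
    now = time.time() if now is None else now
    entries = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            stat = path.stat()
            entries.append(
                InventoryEntry(
                    path=str(path.relative_to(root)),
                    suffix=path.suffix.lower(),
                    size_bytes=stat.st_size,
                    days_since_modified=(now - stat.st_mtime) / 86400,
                )
            )
    return entries


def stale_entries(entries: list[InventoryEntry], max_age_days: float) -> list[InventoryEntry]:
    """Return entries untouched for longer than max_age_days (an assumed threshold)."""
    return [entry for entry in entries if entry.days_since_modified > max_age_days]
```

The list of stale entries is only a set of candidates for review. Whether a given file is actually dark still depends on whether anyone knows about it and can use it.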

Set data governance policies

Introducing a data governance policy can help address the challenge over the long term. This policy should cover how all incoming data gets reviewed and offer clear guidelines for what should be retained (and organized to maintain clear data management), archived or destroyed. An important part of this policy is being strict about what data should be destroyed and when. Enforcing data governance and regularly reviewing practices can help minimize the amount of dark data that will never be used.
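
The retain, archive or destroy decision can be expressed as a small rule table. The sketch below is illustrative only: the categories and day counts in `POLICY` are made-up placeholders, not legal or regulatory guidance, and `retention_action` is a hypothetical helper.

```python
from datetime import date
from enum import Enum


class Action(Enum):
    RETAIN = "retain"
    ARCHIVE = "archive"
    DESTROY = "destroy"


# Hypothetical policy: category -> (archive after N days, destroy after N days).
POLICY = {
    "financial": (365, 2555),
    "logs": (30, 180),
}
DEFAULT_RULE = (90, 730)  # assumed fallback for categories not listed above


def retention_action(category: str, created: date, today: date) -> Action:
    """Decide what the policy says to do with data of a given category and age."""
    archive_after, destroy_after = POLICY.get(category, DEFAULT_RULE)
    age_days = (today - created).days
    if age_days >= destroy_after:
        return Action.DESTROY
    if age_days >= archive_after:
        return Action.ARCHIVE
    return Action.RETAIN
```

Keeping the rules in one table makes the policy easy to review regularly, and a scheduled job could apply `retention_action` to each cataloged dataset so that destruction deadlines are not forgotten.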

Use ML and AI tools to parse data

To help discover dark data, machine learning (ML) and artificial intelligence (AI) can do the heavy lifting of categorizing it by analyzing data that may contain valuable insights. In addition, ML automation can support compliance with data privacy regulations by automatically redacting sensitive information from stored data.
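
To make the idea of automated redaction concrete, here is a deliberately simple sketch. It swaps in rule-based regular expressions in place of an ML model, and it only covers email addresses and numbers formatted like US Social Security numbers; the patterns and the `redact` function are assumptions for illustration. An ML-based approach would aim to catch sensitive values that fixed patterns like these miss.

```python
import re

# Simplified patterns; real-world formats vary more than these allow for.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> tuple[str, int]:
    """Replace matched values with placeholders and return the redaction count."""
    text, email_count = EMAIL_PATTERN.subn("[REDACTED EMAIL]", text)
    text, ssn_count = US_SSN_PATTERN.subn("[REDACTED SSN]", text)
    return text, email_count + ssn_count


cleaned, count = redact("Contact jane.doe@example.com, SSN 123-45-6789.")
```

Returning the count alongside the cleaned text makes it possible to report how much sensitive content each stored document contained.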

Footnotes

1 Gartner Glossary, Gartner

2 The State of Dark Data, Splunk, 2019

3 Dark Data: Discovery, Uses & Benefits of Hidden Data, Splunk, 03 August 2023