According to Gartner, dark data refers to the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes, such as analytics, business relationships and direct monetizing.1
Most companies today store vast quantities of dark data. In Splunk’s global research survey of more than 1,300 business and IT decision makers, 60 percent of respondents reported that half or more of their organization’s data is considered dark. A full one-third of respondents reported this amount to be 75 percent or more.2
Dark data accumulates because organizations have embraced the idea that it’s valuable to store all the information they can possibly capture in big data lakes. This is partially due to the advent of inexpensive storage, which has made it easy to justify storing so much data—in the event that one day it becomes valuable.
In the end, most companies never use even a fraction of what they store because the storage reservoir doesn't document the metadata labels appropriately, some of the data is in a format the integrated tools can't read or the data isn't retrievable through a query.
Dark data is a major limiting factor in producing good data analysis because the quality of any data analysis depends on the body of information accessible to the analytics tools, both promptly and in full detail.
Other problems with dark data are that it creates liabilities, significant storage costs and missed opportunities due to teams not realizing what data is potentially available to them.
There are numerous causes for an organization’s data to go dark, including:
Dark data may be structured, unstructured or semi-structured, and each type differs in how readily it can be discovered for timely and complete data analytics initiatives.
Structured data is information added to clearly defined spreadsheet or database fields before being stored.
Server log files, Internet of Things (IoT) sensor data, customer relationship management (CRM) databases and enterprise resources planning (ERP) systems are examples of dark data created from structured data sources.
Although most forms of sensitive data, such as electronic bank statements, medical records and encrypted customer data, are typically in structured form, they are difficult to view and categorize because of permission issues.
Unlike structured data, unstructured data includes information that can’t be organized in databases or spreadsheets for analysis without conversion, codification, tiering and structuring.
Email correspondences, PDFs, text documents, social media posts, call center recordings, chat logs and surveillance video footage are examples of dark data created from unstructured data sources.
Semi-structured data is unstructured data that contains some information in defined data fields. Although it is not as easy to discover as structured data, it can be searched or catalogued.
Examples include HTML code, invoices, graphs, tables and XML documents.
The costs of storing dark data can be significant and extend well beyond the direct financial cost of dark data storage. Direct and indirect costs include:
Storing data, even if it's not actively used, requires physical or digital storage infrastructure. This can include servers, data centers, cloud storage solutions and backup systems. The more data in your ecosystem, the more data storage capacity you need, which leads to increased infrastructure costs.
Governments have introduced a host of global privacy laws over the past several years, which apply to all data—even data that’s sitting unused in analytics repositories.
Many companies lose out on opportunities by not using this data. While it’s good to get rid of dark data that’s actually not usable—due to risks and costs—it pays to first analyze what data is available to determine what might be usable.
Managing large volumes of data, including dark data, can slow down data retrieval and analysis processes. Employees may spend more time searching for relevant information, leading to reduced productivity and increased labor costs.
Dark data can pose risks in terms of insufficient cybersecurity, data breaches, compliance violations and data loss. These risks can result in reputational damage and financial consequences.
Sometimes dark data gets created because of data quality issues.
For example, an AI tool automatically generates a transcript from an audio recording but makes some transcription mistakes. Someone keeps the transcript anyway, intending to correct the errors at some point, but never does.
When organizations do attempt to clean poor quality data, they sometimes miss what’s causing the issue. Without the proper understanding, it’s impossible to ensure that the data quality issue won’t continue happening in the future.
This situation then becomes cyclical, because rather than simply employing deletion policies for dark data that sits around without ever getting used, organizations let it continue to sit and contribute to a growing data quality issue.
Fortunately, there are three steps for data quality management that organizations can take to help alleviate this issue:
For all the costs and data quality issues of dark data, there are upsides. As Splunk puts it, “dark data may be one of an organization’s biggest untapped resources.”3
By taking a proactive approach to managing dark data, organizations can shine a light on it. This not only reduces liabilities and costs, but also gives teams the resources they need to discover insights from hidden data.
When it comes to handling dark data and potentially using it to make better data-driven decisions, there are several best practices to follow:
Dark data often comes about because of silos within the organization. One team creates data that could be useful to another, but that other team doesn’t know about it. Breaking down those silos makes the data available to the team that needs it, so it goes from sitting idle to providing value.
It’s important to understand what data exists within the organization. This effort starts by classifying all data within the organization to get a complete and accurate view. From there, teams can begin to organize their data better with the goal of making it easier for individuals across teams to find and use what they need.
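As a rough illustration of that first classification pass, the sketch below sorts file paths into the structured, semi-structured and unstructured categories described earlier, using nothing but the file extension. The extension mapping and the function names `classify` and `inventory` are assumptions made for this example; a real inventory would likely also inspect content, ownership and metadata.

```python
from collections import Counter
from pathlib import PurePath

# Hypothetical extension-to-category mapping, loosely based on the examples
# in this article. Adjust it to the file types your organization stores.
CATEGORY_BY_EXTENSION = {
    ".csv": "structured",
    ".xlsx": "structured",
    ".html": "semi-structured",
    ".xml": "semi-structured",
    ".pdf": "unstructured",
    ".txt": "unstructured",
    ".eml": "unstructured",
    ".wav": "unstructured",
    ".mp4": "unstructured",
}


def classify(path: str) -> str:
    """Return the assumed data category for a path, or 'unknown'."""
    suffix = PurePath(path).suffix.lower()
    return CATEGORY_BY_EXTENSION.get(suffix, "unknown")


def inventory(paths) -> Counter:
    """Count how many paths fall into each category."""
    return Counter(classify(p) for p in paths)
```

Calling `inventory` on a directory listing gives a quick count per category, and a large `unknown` bucket is itself a hint about where dark data may be hiding.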
Introducing a data governance policy can help address the challenge over the long term. This policy should cover how all incoming data gets reviewed and offer clear guidelines for what should be retained (and organized to maintain clear data management), archived or destroyed. An important part of this policy is being strict about what data should be destroyed and when. Enforcing data governance and regularly reviewing practices can help minimize the amount of dark data that will never be used.
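The retain, archive or destroy decision can be expressed as a simple rule. The following is a minimal sketch, assuming the policy is driven only by how long a record has gone unaccessed; the thresholds `ARCHIVE_AFTER` and `DESTROY_AFTER` and the function `retention_action` are hypothetical, and an actual policy would also need to account for regulatory retention requirements.

```python
from datetime import date, timedelta

# Hypothetical thresholds chosen for illustration only.
ARCHIVE_AFTER = timedelta(days=365)       # idle for one year: archive
DESTROY_AFTER = timedelta(days=365 * 3)   # idle for three years: destroy


def retention_action(last_accessed: date, today: date) -> str:
    """Return 'retain', 'archive' or 'destroy' based on idle time."""
    idle = today - last_accessed
    if idle >= DESTROY_AFTER:
        return "destroy"
    if idle >= ARCHIVE_AFTER:
        return "archive"
    return "retain"
```

Keeping the thresholds in named constants makes the policy easy to review and adjust, which fits the point about regularly revisiting governance practices.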
To help discover dark data, machine learning (ML) and artificial intelligence (AI) can do the heavy lifting of categorizing it by analyzing data that may contain valuable insights. In addition, ML automation can help with compliance with data privacy regulations by automatically redacting sensitive information from stored data.
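To make the redaction idea concrete, here is a deliberately simple sketch that uses regular expressions rather than ML, which is a much cruder technique than the automation described above. The patterns cover only email addresses and US-style Social Security numbers, and the names `EMAIL`, `US_SSN` and `redact` are assumptions for this example; production systems typically rely on trained entity-recognition models and far broader pattern sets.

```python
import re

# Illustrative patterns only; they will miss many real-world formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Replace matched emails and SSN-like numbers with placeholder tags."""
    text = EMAIL.sub("[REDACTED_EMAIL]", text)
    text = US_SSN.sub("[REDACTED_SSN]", text)
    return text
```

For instance, `redact("Contact jane.doe@example.com, SSN 123-45-6789.")` returns `"Contact [REDACTED_EMAIL], SSN [REDACTED_SSN]."`.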
1 Gartner Glossary, Gartner
2 The State of Dark Data, Splunk, 2019
3 Dark Data: Discovery, Uses & Benefits of Hidden Data, Splunk, 03 August 2023