According to Gartner, dark data refers to the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes, such as analytics, business relationships and direct monetizing.1
Most companies today store vast quantities of dark data. In Splunk’s global research survey of more than 1,300 business and IT decision makers, 60 percent of respondents reported that half or more of their organization’s data is considered dark. A full one-third of respondents reported this amount to be 75 percent or more.2
Dark data accumulates because organizations have embraced the idea that it’s valuable to store all the information they can possibly capture in big data lakes. This is partly due to the advent of inexpensive storage, which has made it easy to justify storing so much data in case it becomes valuable one day.
In the end, most companies never use even a fraction of what they store because the storage reservoir doesn't document the metadata labels appropriately, some of the data is in a format the integrated tools can't read or the data isn't retrievable through a query.
Dark data is a major limiting factor in producing good data analysis because the quality of any data analysis depends on the body of information accessible to the analytics tools, both promptly and in full detail.
Dark data also creates liabilities, incurs significant storage costs and leads to missed opportunities because teams don't realize what data is potentially available to them.
There are numerous causes for an organization’s data to go dark, including:
In terms of its discoverability for timely and complete data analytics initiatives, dark data may be structured data, unstructured data or semi-structured data.
Structured data is information added to clearly defined spreadsheet or database fields before being stored.
Server log files, Internet of Things (IoT) sensor data, customer relationship management (CRM) databases and enterprise resources planning (ERP) systems are examples of dark data created from structured data sources.
Although most forms of sensitive data, like electronic bank statements, medical records and encrypted customer data, are typically in structured form, they are difficult to view and categorize because of permission issues.
Unlike structured data, unstructured data includes information that can’t be organized in databases or spreadsheets for analysis without conversion, codification, tiering and structuring.
Email correspondences, PDFs, text documents, social media posts, call center recordings, chat logs and surveillance video footage are examples of dark data created from unstructured data sources.
Semi-structured data is unstructured data that contains some information in defined data fields. Although it is not as easy to discover as structured data, it can be searched or cataloged.
Examples include HTML code, invoices, graphs, tables and XML documents.
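As a rough illustration of the three categories, the sketch below sorts files into structured, semi-structured or unstructured based on their file extension. The mapping table and the `categorize` function are hypothetical, the choice of extensions (including `.json`) is an assumption, and a real discovery tool would inspect file contents rather than rely on names alone.

```python
from pathlib import Path

# Hypothetical extension-to-category mapping (an assumption, not a standard).
# Adjust it to the file types found in your own environment.
CATEGORY_BY_EXTENSION = {
    ".csv": "structured",
    ".xlsx": "structured",
    ".html": "semi-structured",
    ".xml": "semi-structured",
    ".json": "semi-structured",
    ".pdf": "unstructured",
    ".txt": "unstructured",
    ".eml": "unstructured",
    ".wav": "unstructured",
    ".mp4": "unstructured",
}


def categorize(filename: str) -> str:
    """Return a rough data category for a file name, or 'unknown'."""
    suffix = Path(filename).suffix.lower()  # compare extensions case-insensitively
    return CATEGORY_BY_EXTENSION.get(suffix, "unknown")
```

For example, `categorize("invoice.XML")` returns `"semi-structured"`, while a file with an unlisted extension falls back to `"unknown"` so that it can be reviewed by hand.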
The costs of storing dark data can be significant and extend well beyond the direct financial cost of dark data storage. Direct and indirect costs include:
Storing data, even if it's not actively used, requires physical or digital storage infrastructure. This can include servers, data centers, cloud storage solutions and backup systems. The more data in your ecosystem, the more data storage capacity you need, which leads to increased infrastructure costs.
Governments have introduced a host of global privacy laws over the past several years, which apply to all data—even data that’s sitting unused in analytics repositories.
Many companies lose out on opportunities by not using this data. While it’s good to get rid of dark data that is truly unusable, given its risks and costs, it pays to first analyze what data is available to determine what might be usable.
Managing large volumes of data, including dark data, can slow down data retrieval and analysis processes. Employees may spend more time searching for relevant information, leading to reduced productivity and increased labor costs.
Dark data can pose risks in terms of insufficient cybersecurity, data breaches, compliance violations and data loss. These risks can result in reputational damage and financial consequences.
Sometimes dark data gets created because of data quality issues.
For example, a transcript of an audio recording is generated automatically, but the AI that created it makes some transcription mistakes. Someone keeps the transcript anyway, intending to correct it at some point, but never does.
When organizations do attempt to clean poor quality data, they sometimes miss what’s causing the issue. Without the proper understanding, it’s impossible to ensure that the data quality issue won’t continue happening in the future.
This situation then becomes cyclical, because rather than simply employing deletion policies for dark data that sits around without ever getting used, organizations let it continue to sit and contribute to a growing data quality issue.
Fortunately, there are three steps for data quality management that organizations can take to help alleviate this issue:
For all the costs and data quality issues of dark data, there are upsides. As Splunk puts it, “dark data may be one of an organization’s biggest untapped resources.”3
By taking a proactive approach to managing dark data, organizations can bring it to light. This not only reduces liabilities and costs, but also gives teams the resources they need to discover insights from hidden data.
When it comes to handling dark data and potentially using it to make better data-driven decisions, there are several best practices to follow:
Dark data often comes about because of silos within the organization. One team creates data that could be useful to another, but that other team doesn’t know about it. Breaking down those silos makes that data available to the team that needs it, so it goes from sitting unused to providing value.
It’s important to understand what data exists within the organization. This effort starts by classifying all data within the organization to get a complete and accurate view. From there, teams can begin to organize their data better with the goal of making it easier for individuals across teams to find and use what they need.
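A minimal sketch of a first inventory pass might look like the following. It walks a directory tree and lists files that have not been modified for a given number of days, as candidates for review. The function name `find_stale_files` and the use of modification time as a proxy for "unused" are assumptions made for illustration; a real inventory would also draw on access logs and catalog metadata.

```python
import time
from pathlib import Path


def find_stale_files(root, max_idle_days, now=None):
    """Return (path, size_bytes, idle_days) for files idle longer than max_idle_days.

    Idle time is measured from the last modification time, which is only a
    proxy for whether anyone still uses the file.
    """
    now = time.time() if now is None else now
    stale = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue  # skip directories and other non-file entries
        info = path.stat()
        idle_days = (now - info.st_mtime) / 86400  # 86,400 seconds per day
        if idle_days > max_idle_days:
            stale.append((str(path), info.st_size, idle_days))
    # Largest files first, since they contribute most to storage cost.
    stale.sort(key=lambda item: item[1], reverse=True)
    return stale
```

Calling `find_stale_files("/data/shared", 365)` (a hypothetical path) would return every file under that directory untouched for more than a year, which teams can then classify, retain or schedule for deletion.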
Introducing a data governance policy can help address the challenge over the long term. This policy should cover how all incoming data gets reviewed and offer clear guidelines for what should be retained (and organized to maintain clear data management), archived or destroyed. An important part of this policy is being strict about what data should be destroyed and when. Enforcing data governance and regularly reviewing practices can help minimize the amount of dark data that will never be used.
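The retain, archive or destroy decision can be expressed as a simple rule. The sketch below is illustrative only: the thresholds (`ARCHIVE_AFTER_DAYS`, `DESTROY_AFTER_DAYS`) and the function `retention_action` are hypothetical, and actual retention periods depend on an organization's regulatory and business requirements.

```python
from datetime import date

# Hypothetical thresholds; real values come from the governance policy itself.
ARCHIVE_AFTER_DAYS = 365        # archive data unused for one year
DESTROY_AFTER_DAYS = 365 * 7    # destroy data unused for roughly seven years


def retention_action(last_used: date, today: date) -> str:
    """Return 'retain', 'archive' or 'destroy' based on how long data sat unused."""
    idle_days = (today - last_used).days
    if idle_days >= DESTROY_AFTER_DAYS:
        return "destroy"
    if idle_days >= ARCHIVE_AFTER_DAYS:
        return "archive"
    return "retain"
```

Encoding the rule in one place makes it easier to apply the policy consistently and to audit what was destroyed and why.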
To help discover dark data, machine learning (ML) and artificial intelligence (AI) can do the heavy lifting of categorizing dark data by performing analysis on data that may contain valuable insights. In addition, ML automation can help with data privacy compliance regulations by automatically redacting sensitive information from stored data.
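To make the redaction idea concrete, here is a deliberately simple, rule-based stand-in. It uses regular expressions rather than machine learning, and the two patterns (email addresses and US-style Social Security numbers) are assumptions chosen for illustration. An ML-based system would detect many more kinds of sensitive information with far better coverage.

```python
import re

# Simplified patterns for illustration; they will miss many real-world variants.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Replace email addresses and SSN-like numbers with placeholder tokens."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    text = SSN_PATTERN.sub("[SSN]", text)
    return text
```

For instance, `redact("Contact jane.doe@example.com, SSN 123-45-6789.")` returns `"Contact [EMAIL], SSN [SSN]."`.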
1 Gartner Glossary, Gartner
2 The State of Dark Data, Splunk, 2019
3 Dark Data: Discovery, Uses & Benefits of Hidden Data, Splunk, 03 August 2023