What is bad data?

Bad data, defined

Bad data refers to information that compromises decision-making because it is inaccurate, incomplete, inconsistent, outdated, duplicate, invalid or biased.

The causes of bad data vary. Sometimes it stems from poor data architecture; other times it’s the result of human error. Regardless of origin, when organizations unintentionally use bad data, the consequences can range from minor inconveniences, such as sending tax documents to the wrong address, to severe risks such as regulatory noncompliance, reputational damage and financial losses.

A unique danger of bad data lies in its stealth. Unlike a system outage, the effects of bad data can go undetected until significant damage is done. Organizations can unknowingly operate on bad data for years. For example, a sales team would notice immediately if their Salesforce dashboard didn’t load, but it would take them much longer to realize that the data displayed was wrong.

As big data volumes skyrocket and business leaders increasingly rely on data to power artificial intelligence (AI) and decision-making, maximizing data quality is more important than ever. Through strong data governance, data quality management practices and data observability tools, organizations can help ensure their data assets fuel growth—rather than become invisible liabilities.


Types of bad data

Bad data can be broadly categorized using the key dimensions of data quality:

  • Inaccurate data
  • Incomplete data
  • Inconsistent data
  • Outdated data
  • Duplicate data
  • Invalid data
  • Biased data

Inaccurate data

Data accuracy measures how closely data reflects true, real-world events and values. When data is inaccurate, it contains errors and is unreliable for decision-making. For instance, inaccurate customer data (such as incorrect pricing records) can distort a company’s understanding of its audience and lead to misguided actions that erode customer satisfaction.
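
One straightforward accuracy check is to reconcile recorded values against a trusted reference source, where one exists. Below is a minimal sketch in Python with pandas; the SKUs and prices are invented for illustration:

```python
import pandas as pd

# Hypothetical recorded prices vs. a trusted reference source
recorded = pd.Series({"sku_1": 19.99, "sku_2": 24.99, "sku_3": 7.50})
reference = pd.Series({"sku_1": 19.99, "sku_2": 29.99, "sku_3": 7.50})

# Accuracy check: flag records that disagree with the reference
mismatches = recorded[recorded != reference]
print(mismatches)  # sku_2    24.99
```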

Incomplete data

Incomplete data is missing necessary records and values—gaps that impact data processing and data analysis. A large gap can even introduce bias, as analysis results may not be representative of the true dataset. For example, if most of the entries in a customer database are missing contact information, sales teams will miss opportunities to engage their customers.
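
Gaps like these are easy to quantify. A minimal sketch with pandas that measures the share of missing values per column in a hypothetical customer table (the data and the 50% threshold are assumptions for illustration):

```python
import pandas as pd

# Hypothetical customer table with gaps in contact information
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "name": ["Ada", "Grace", "Alan", "Edsger"],
    "email": ["ada@example.com", None, None, "e@example.com"],
    "phone": [None, "555-0102", None, None],
})

# Share of missing values per column
missing_ratio = customers.isna().mean()
print(missing_ratio)

# Flag columns where more than half the entries are missing
print(missing_ratio[missing_ratio > 0.5].index.tolist())  # ['phone']
```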

Inconsistent data

Inconsistent data lacks standardization and is largely incompatible across different datasets and systems. Discrepancies in date formats, naming conventions and units of measurement can lead to confusion among users, create data silos within specific platforms and introduce errors in reporting or analysis.
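
Date formats are a classic example. A minimal sketch that normalizes hypothetical mixed-format dates with pandas (format="mixed" requires pandas 2.x; the sample values are invented):

```python
import pandas as pd

# Hypothetical dates arriving in different formats from different systems
raw_dates = pd.Series(["2024-01-31", "01/31/2024", "31 Jan 2024"])

# format="mixed" (pandas 2.x) parses each entry independently;
# errors="coerce" turns anything unparseable into NaT for review
normalized = pd.to_datetime(raw_dates, format="mixed", errors="coerce")
print(normalized.dt.strftime("%Y-%m-%d").tolist())
# ['2024-01-31', '2024-01-31', '2024-01-31']
```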

Outdated data

Outdated data is information that is no longer current, which can lead decision-makers to act on information that no longer represents real-world conditions. Data freshness is a metric that indicates how often database information is updated; long gaps between updates result in stale data.
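
A freshness check can be as simple as comparing last-update timestamps against an agreed maximum age. A minimal sketch, where the table names, timestamps and 30-day threshold are all assumptions for illustration:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical last-update timestamps for two tables
last_updated = {
    "customers": datetime(2025, 1, 10, tzinfo=timezone.utc),
    "orders": datetime(2025, 3, 1, tzinfo=timezone.utc),
}

max_age = timedelta(days=30)  # assumed freshness threshold
now = datetime(2025, 3, 15, tzinfo=timezone.utc)  # fixed for reproducibility

for table, updated_at in last_updated.items():
    age = now - updated_at
    status = "STALE" if age > max_age else "fresh"
    print(f"{table}: last updated {age.days} days ago -> {status}")
```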

Duplicate data

Duplicate data (or redundant data) refers to repeated entries in a dataset, where unique data should appear only once. It can skew analysis by overrepresenting certain data values or trends. (It’s important to note that there are valid use cases for intentional data redundancy in database design, such as helping to ensure high availability, data integrity and consistency.)
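
A small sketch of how duplicates can skew an aggregate, and how deduplicating on a business key corrects it (the orders table and the order_id key are hypothetical):

```python
import pandas as pd

# Hypothetical orders table where order 1002 was ingested twice
orders = pd.DataFrame({
    "order_id": [1001, 1002, 1002, 1003],
    "amount": [250.0, 99.0, 99.0, 410.0],
})

# The naive total double-counts the repeated order
print(orders["amount"].sum())  # 858.0

# Deduplicating on the business key corrects the aggregate
deduped = orders.drop_duplicates(subset="order_id", keep="first")
print(deduped["amount"].sum())  # 759.0
```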

Invalid data

Invalid data is information that does not conform to system or business rules (such as permitted value ranges, required formats and defined data types). Examples include data that contains an unsupported special character or phone numbers formatted without required hyphens.
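
Rules like these can be checked before data is loaded. A minimal sketch with pandas, where the NNN-NNN-NNNN phone format and the 0 to 100 discount range are assumed business rules:

```python
import pandas as pd

# Hypothetical records to validate before loading
records = pd.DataFrame({
    "phone": ["555-010-1234", "5550109876", "555-010-4321"],
    "discount_pct": [10, 150, 25],
})

# Assumed business rules: phones must match NNN-NNN-NNNN,
# discounts must fall between 0 and 100 percent
phone_ok = records["phone"].str.match(r"^\d{3}-\d{3}-\d{4}$")
range_ok = records["discount_pct"].between(0, 100)

# Rows that fail any rule are routed for correction, not loaded
invalid = records[~(phone_ok & range_ok)]
print(invalid)  # row 1 fails both rules
```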

Biased data

Although bias is not itself a data quality dimension, it is an important factor for stakeholders to consider as it influences several of the dimensions. Biased data is skewed or unrepresentative of actual events, populations and conditions. It can lead to unfair, inaccurate and unreliable outcomes, and when used in machine learning (ML) and AI systems, can result in serious consequences for individuals, organizations and society.
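
One simple representativeness check compares group proportions in a dataset against known population shares. A minimal sketch, where the regions, counts, shares and the 10-percentage-point threshold are all assumptions for illustration:

```python
import pandas as pd

# Hypothetical training-sample counts vs. known population shares
sample_counts = pd.Series({"region_a": 900, "region_b": 80, "region_c": 20})
population_share = pd.Series({"region_a": 0.50, "region_b": 0.30, "region_c": 0.20})

sample_share = sample_counts / sample_counts.sum()

# Flag groups underrepresented by more than 10 percentage points
gap = population_share - sample_share
print(gap[gap > 0.10].index.tolist())  # ['region_b', 'region_c']
```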


What is the impact of bad data?

Bad data is the antithesis of good data. While high-quality data promotes growth and innovation, poor-quality data slows progress.

Organizations rely on data for informed decisions, actionable insights and forecasting for internal operations as well as customer experiences. Decisions based on bad data can lead to missed opportunities, operational inefficiencies and damaged reputations. In industries such as finance or healthcare, where data helps inform high-stakes decisions, bad data can have severe or even catastrophic impacts.

Consider a clinical study containing inconsistent patient data. Researchers would struggle to compare results, which could delay the development of potential treatments. In finance, inaccurate or missing data can incur steep compliance costs. Inaccurate financial reports may lead to violations of regulations such as the Sarbanes-Oxley (SOX) Act, which can carry fines of up to USD 1 million and up to 10 years in prison.

The risks of bad data escalate in the context of artificial intelligence. When AI or ML models are trained on inaccurate, inconsistent or biased data, their outcomes reflect those errors. To help maximize investments in AI and ML, organizations must ensure their data is AI-ready.

Unity Technologies is a prime example of the consequences of bad data in AI and ML. In 2022, the video game company’s advertising placement algorithm ingested bad data from a large customer. The algorithm’s performance suffered to the degree that the company had to rebuild it. The incident contributed to a 37% drop in Unity’s stock and an estimated USD 110 million impact on the business.

On the other hand, good, accurate data can be a boon for AI initiatives. Research by the IBM Institute for Business Value found that organizations with trusted data realized nearly double the return on investment from their AI capabilities. The bottom line: Good data is a non-negotiable priority for any AI or data-driven strategy.

What causes bad data?

There is no one root cause of bad data. It can arise from technology, processes or people—and typically, it’s a combination of several. Some common causes of poor data quality include:

  • System failures
  • Data decay
  • Unreliable data collection
  • Weak data governance
  • Human error
  • Data integration or migration breakdowns

System failures

Poorly designed data architectures can lead to data silos, slow performance and software bugs that degrade data consistency and reliability. When systems fail, files can be corrupted or left incomplete, resulting in missing values and inaccuracies in downstream processes.

Data decay

Many types of business data (such as consumer behavior metrics) are subject to decay if not updated regularly. When databases are outdated, any insights or decisions based on the data are stale—and likely inaccurate.

Unreliable data collection

Bad data can originate at the point of collection, not just from poor-quality data sources or providers. Biases, inconsistent methods, faulty tools or inaccurate measurements during data entry and processing can all compromise data quality.

Weak data governance

As a discipline, data governance defines and implements policies, standards and procedures for the entire data lifecycle. When these practices are applied inconsistently or without accountability, data quality quickly erodes.

Human error

Human error is a frequent cause of bad data. Typos during manual data entry, inconsistent data coding, biases or misinterpretations can all lead to data inaccuracies. Human error is exacerbated by time pressures, inadequate training and poorly designed systems.

Data integration or migration breakdowns

Data migration or data integration without the proper processes, planning and technology can result in data loss, inconsistencies and inaccuracies. These issues often arise from mismatched data formats and structures or unobserved dependencies.

How to prevent bad data

In a perfect world, bad data would be caught at the source and never reach downstream systems or data analytics workflows. In reality, however, data quality can degrade at any point in the data lifecycle and for many different reasons.

Preventing bad data at all stages requires a comprehensive strategy that addresses risks at every phase. This strategy can incorporate the following practices:

  • Governance and strategy
  • Monitoring and visibility
  • Cleansing and remediation
  • Data skills and literacy

Governance and strategy

Establishing strong data governance is a critical first step in preventing bad data. It defines and enforces the policies, standards and procedures needed to maintain accurate, high-quality data throughout its lifecycle. Robust governance frameworks can help organizations identify and address inaccuracies before they influence decision-making and operational efficiency.

Effective data governance should complement and enhance an organization’s broader data strategy. It typically works alongside other disciplines—such as data management, data security and data architecture—to keep data consistent and reliable.

Monitoring and visibility

You can’t fix bad data if you don’t know it exists. Organizations can use several processes to gain visibility into and continuously monitor the health of their data:

  • Data lineage: Lineage tools provide a clear view of how data (and its metadata) moves and changes throughout its lifecycle, including its origin and ultimate destination. Visibility into data lineage supports root cause analysis and regulatory compliance.

  • Data audits: Regular review and analysis of enterprise data helps map the data environment. Audits help organizations discover, classify and monitor their data to uncover risks, inaccuracies and inconsistencies.

  • Data profiling: The data profiling process analyzes data to gain insight into its structure and quality so teams can plan remediation (a small sketch follows this list). It is typically performed by data engineers who use a range of business rules and analytical algorithms.

  • Data observability: Going beyond traditional monitoring, data observability tools use automation and intelligence to help identify, troubleshoot and resolve data issues in near-real time, before they have the chance to spread to business operations.
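
As a rough illustration of the profiling step mentioned above, the following sketch builds a basic per-column profile and layers simple rule checks on top (the table and rules are hypothetical):

```python
import pandas as pd

# Hypothetical extract to profile before planning remediation
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "signup_date": ["2024-01-05", "2024-13-01", "2024-02-11", None],
    "country": ["US", "us", "DE", "DE"],
})

# Basic per-column profile: type, null count, distinct count
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "nulls": df.isna().sum(),
    "distinct": df.nunique(),
})
print(profile)  # 'country' shows 3 distinct values, hinting at 'US' vs 'us'

# Rule-style checks layered on top of the profile
print("duplicate ids:", df["customer_id"].duplicated().sum())
# errors="coerce" turns missing and unparseable dates alike into NaT
print("bad dates:", pd.to_datetime(df["signup_date"], errors="coerce").isna().sum())
```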

Cleansing and remediation

With data errors and their root causes identified, bad data must then be corrected. Data cleansing processes work to address common data quality issues such as duplicate records, missing values, inconsistencies, syntax errors, irrelevant data and structural errors. Common techniques include standardization, addressing outliers and missing values, deduplication and data validation.
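
A minimal sketch of a cleansing pass that chains several of these techniques (standardization, deduplication and missing-value handling) on a hypothetical extract; the country mapping and drop rules are assumptions:

```python
import pandas as pd

# Hypothetical raw extract exhibiting several common quality issues
raw = pd.DataFrame({
    "email": [" ADA@EXAMPLE.COM", "grace@example.com", "grace@example.com", None],
    "country": ["usa", "USA", "U.S.A.", "DE"],
})

cleaned = (
    raw
    # Standardize: trim whitespace and lowercase emails
    .assign(email=lambda d: d["email"].str.strip().str.lower())
    # Standardize: map spellings to one canonical code (assumed mapping)
    .assign(country=lambda d: d["country"].replace(
        {"usa": "US", "USA": "US", "U.S.A.": "US"}))
    # Deduplicate on the cleaned email
    .drop_duplicates(subset="email", keep="first")
    # Handle missing values: drop rows with no email to contact
    .dropna(subset=["email"])
)
print(cleaned)
```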

Data teams increasingly use AI to automate and optimize several of these steps, especially tasks such as standardization and deduplication.

Data skills and literacy

Data-literate organizations have the skills to read, understand, use and communicate with data for better decision-making. The ability to critically evaluate data also improves overall data quality: Employees with even rudimentary data skills are better equipped to recognize bias, inconsistencies, inaccuracies or missing values.

