Data quality issues and challenges

Data quality issues, defined

Data quality issues are flaws in datasets that can compromise decision-making and other data-driven workflows at an organization. Common examples include duplicate data, inconsistent data, incomplete data and data silos.

 

While data quality issues have long dogged data analysis, addressing them has arguably taken on unprecedented significance in the modern, big data era. Enterprises increasingly rely on large, complex datasets to unlock key insights and gain a competitive advantage.

But any data quality issues within these datasets degrade the accuracy of analysis and business intelligence. This “bad data” results in missed opportunities, inefficiencies in business processes, financial losses and regulatory penalties.

The impact of poor data quality also extends to artificial intelligence (AI) initiatives: Machine learning algorithms must be paired with high-quality datasets to produce performant machine learning models.

Without good training data, the resulting models are more likely to make inaccurate, irrelevant predictions, imperiling AI-powered initiatives. Gartner predicts that “through 2026, organizations will abandon 60% of AI projects unsupported by AI-ready data.”

Data quality issues stem from a number of factors, including human error and data integration problems. Practices such as data audits, data observability and data governance can elevate the quality and usability of an organization’s data, improving business decisions and business outcomes.


What are the most common data quality issues?

While different organizations face different data quality problems, these are among the most common:

Inaccurate data

Inaccurate data refers to data points that fail to represent real-world values, diminishing the overall quality of the dataset that houses them. Inaccurate data not only hinders decision-making but can also prevent organizations from successfully implementing innovative tools, including AI-powered solutions.

For instance, inaccurate data on retail worker shifts led to poor performance for an AI scheduling tool used at over 6,000 stores. Researchers found that managers chose to manually override scheduling of 84% of the shifts in the AI-generated timetables.1

Data accuracy has also emerged as a key concern for organizations considering adopting agentic AI. Nearly half (49%) of executives cited data inaccuracies and bias as a barrier to embracing the technology, according to the IBM Institute for Business Value.2

Duplicate data

Data duplication occurs when data is replicated either purposely or inadvertently. In the case of the former, duplication can provide a safeguard in case of data loss: When data in one location is damaged or goes missing, systems can continue to function using copies of the data stored elsewhere.

However, unintentional duplicate data entries can cause a variety of problems. They can over-represent specific data points or trends, resulting in unreliable outputs and skewed forecasts. Duplicate data can also cause increased storage costs and slower performance. Unintentional duplication can occur because of flaws in data migration and integration processes—if, for instance, two different data sources contain the same record, both copies might remain when data from those sources is combined in a single dataset.
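
For illustration, here is a minimal sketch, assuming pandas and hypothetical column names, of how extra copies introduced during integration might be detected and removed:

```python
import pandas as pd

# Two hypothetical sources that both contain the record for customer 1002
source_a = pd.DataFrame({"customer_id": [1001, 1002], "email": ["a@example.com", "b@example.com"]})
source_b = pd.DataFrame({"customer_id": [1002, 1003], "email": ["b@example.com", "c@example.com"]})

# Naive integration keeps both copies of customer 1002
combined = pd.concat([source_a, source_b], ignore_index=True)

# Count the extra copies, then drop them, keeping the first occurrence of each key
duplicates = combined.duplicated(subset=["customer_id"]).sum()
deduplicated = combined.drop_duplicates(subset=["customer_id"], keep="first")
print(f"rows before: {len(combined)}, duplicates removed: {duplicates}, rows after: {len(deduplicated)}")
```

Simple key matching like this misses near-duplicates, such as the same customer spelled slightly differently, which dedicated record-matching tools are designed to catch.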

Inconsistent data

Inconsistent data contains discrepancies that contradict how real-world values relate to one another. For example, the number of employees in a company department should not exceed the total number of employees at that business.

Data can also be inconsistent when it represents the same types of values in different formats. For example, a dataset of street names can store full street names (Jones Street) or abbreviated names (Jones St.) but should not include both.
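
To make the street-name example concrete, here is a minimal sketch, assuming pandas and an illustrative abbreviation map, of standardizing mixed formats to a single representation:

```python
import pandas as pd

# Hypothetical column mixing full and abbreviated street names
addresses = pd.DataFrame({"street": ["Jones Street", "Jones St.", "Main St.", "Main Street"]})

# Illustrative abbreviation map; a real pipeline would use a fuller reference list
ABBREVIATIONS = {r"\bSt\.?$": "Street", r"\bAve\.?$": "Avenue", r"\bRd\.?$": "Road"}

addresses["street"] = addresses["street"].replace(ABBREVIATIONS, regex=True)
print(addresses["street"].unique())  # ['Jones Street' 'Main Street']
```

A production pipeline would typically rely on a complete reference list of abbreviations or an address-standardization service rather than a hand-built map like this one.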

Incomplete data

Data assets fall short on the data completeness dimension when tables are missing values or entire rows. Incomplete or empty data values can interrupt data integration processes and take up memory on source systems.

Incomplete data can sometimes lead researchers to delete records entirely, even when the fields those records do contain could make valuable contributions to their data analysis.3
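
A brief sketch of an alternative to deleting records outright: profile how much is missing per column, then keep the rows and fill gaps only where a defensible default exists. The column names and defaults below are illustrative, assuming pandas:

```python
import pandas as pd

# Hypothetical orders table with gaps in two columns
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount": [120.0, None, 80.0, None],
    "region": ["EU", "NA", None, "EU"],
})

# Profile completeness per column before deciding how to handle the gaps
print(orders.isna().mean())

# Keep every record: fill numeric gaps with the median and flag unknown categories
orders["amount"] = orders["amount"].fillna(orders["amount"].median())
orders["region"] = orders["region"].fillna("unknown")
print(orders)
```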

And as governments increasingly require greater transparency in regulated industries, incomplete data can result in costly penalties. Banking giant JPMorgan Chase was fined roughly USD 350 million by US banking regulators in 2024 for providing incomplete trading and order data to surveillance platforms.4

Invalid data

Invalid data falls outside the range of permitted values or violates data type, format or business rules. For example, if a customer data table includes an age column that permits values from 0 to 125, an entry of 200 would be invalid.

There is an overlap between invalid data issues and incomplete data issues: When a mandatory data field is left blank, that is also a case of invalid data. In addition, a table may be considered invalid if its data is not organized according to a specified schema.
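
A minimal sketch of the range and mandatory-field checks described above, using pandas and the illustrative 0-to-125 age rule:

```python
import pandas as pd

# Hypothetical customer table; age 200 is out of range, and a missing age
# violates a mandatory-field rule
customers = pd.DataFrame({"customer_id": [1, 2, 3], "age": [34, 200, None]})

# Business rule from the example above: age must be present and between 0 and 125
age_in_range = customers["age"].between(0, 125)  # missing values evaluate to False here
invalid_rows = customers[~age_in_range]
print(invalid_rows)  # flags customer 2 (out of range) and customer 3 (missing value)
```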

Outdated data

Data freshness and timeliness are critical for smart decision-making, especially as information delays and lags increasingly carry negative consequences. When data is not regularly updated, its quality declines over time, a phenomenon known as data decay.

Outdated information can produce outcomes that don’t serve present-day circumstances. One study found that 85% of companies blamed stale data for bad decision-making and lost revenue.5
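
One simple way to make staleness measurable is to compare each record's last-update timestamp against a freshness threshold. The sketch below assumes pandas, a hypothetical updated_at column and an illustrative 90-day threshold:

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

# Hypothetical table in which updated_at records the last refresh of each row
records = pd.DataFrame({
    "account_id": [1, 2, 3],
    "updated_at": pd.to_datetime(["2025-01-02", "2024-03-15", "2025-01-10"], utc=True),
})

# Illustrative freshness rule: anything not refreshed in the last 90 days is stale
threshold = datetime.now(timezone.utc) - timedelta(days=90)
records["is_stale"] = records["updated_at"] < threshold
print(records[records["is_stale"]])
```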

Mislabeled data

Data labeling is the process of identifying raw data—such as images, text files or videos—and assigning one or more labels to specify its context for machine learning models. These labels help the models interpret the data for better predictions.

However, large-scale datasets compiled from web scraping or crowdsourcing platforms are often mislabeled. This phenomenon, known as label noise, can reduce the accuracy of model predictions.6
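
One common mitigation for crowdsourced label noise, offered here purely as an illustration, is to collect several labels per item and keep the majority vote. The annotations below are hypothetical:

```python
from collections import Counter

# Hypothetical crowdsourced labels: three annotators per image
annotations = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
    "img_003": ["cat", "bird", "cat"],
}

# Keep the most frequent label per item to dampen the effect of individual mistakes
consolidated = {item: Counter(labels).most_common(1)[0][0] for item, labels in annotations.items()}
print(consolidated)  # {'img_001': 'cat', 'img_002': 'dog', 'img_003': 'cat'}
```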

Biased data

Biased data is data skewed by one or more human biases, such as cognitive bias, confirmation bias, historical bias and sampling bias. Data bias has emerged as a major quality issue for AI model training, contributing to inaccurate outputs that can result in a host of consequences, including legal liability, discrimination and poor customer service.

For example, biased data may have undermined patient care during the COVID-19 pandemic, when AI-powered analysis of data from pulse oximeters (devices that measure blood oxygen saturation) helped inform treatment decisions.

The devices’ sensors, however, worked less effectively on people with darker skin, thicker skin or smaller fingers—raising concerns about the reliability of the analysis.7

Data silos

Data silos are isolated collections of data that prevent data sharing among systems and business units. This isolation can precipitate other data issues, such as inconsistency.

Silos can also prevent organizations from leveraging relevant data for specific use cases. For instance, a marketer planning customer outreach would benefit from a database that contains customer names, contact information and records of previous outreach attempts, rather than a database of names and contact information alone.


Common causes of data quality issues

Data quality issues are an age-old problem. In the mid-1800s, British nursing pioneer and healthcare reformer Florence Nightingale described military medical statistics as so riddled with inaccuracies that they existed in “a state of great confusion, so that it is hardly possible to obtain correct results.”8

In Nightingale’s time, human error and biases would have caused the bulk of data quality issues. But today the landscape of causes is more complex, with risks emerging at multiple points of a data pipeline.

Poor data collection

Flawed data collection processes resulting from human error, biases or technological glitches can lead to missing data and the procurement of unreliable data. Problems with automated data collection processes, in particular, have been blamed for data mislabeling.9

Data entry errors

Even amid ubiquitous digitization, manual data entry is still commonplace—and with it, data entry errors. In research studies, for instance, manual data entry error rates have ranged from as low as 0.55% to as high as 26.9%.10 Data entry errors in government operations, in particular, have made headlines in recent years, causing problems ranging from bungled tax assessments to buildings mistakenly flagged for demolition.11, 12

Corrupt data transmission

The transmission of data between systems presents another opportunity for quality issues to arise. Research suggests that large, modern datasets are susceptible to corruption because much data transfer infrastructure was originally designed for smaller datasets. A four-year study of a scientific research network found corruption in 1 in every 121 file transfers.13
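
A common guard against silent corruption in transit is to compare checksums computed before and after a transfer. The sketch below simulates a transfer with a local file copy; the file name and contents are illustrative:

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading in chunks to handle large files."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Illustrative stand-in for a file transfer: the sender records a digest,
# the receiver recomputes it after the copy and compares the two
source = Path("trades_2024.csv")
source.write_text("trade_id,amount\n1,100\n")
sent_digest = sha256_of(source)

destination = Path("landing_trades_2024.csv")
shutil.copy(source, destination)  # stand-in for the actual transfer

if sha256_of(destination) != sent_digest:
    raise ValueError("Checksum mismatch: file was corrupted in transit")
print("Transfer verified")
```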

Inadequate integration

Different data sources often have different data types, data formats, schemas and other characteristics. When integrating these various sources, insufficient data transformation—the conversion of raw data into a unified format—can lead to data duplication and inconsistency. Unification is even more challenging for unstructured data, which lacks a predefined format and requires additional preprocessing to achieve the consistency and standardization needed for integration.
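
A brief sketch of the kind of transformation step described here: mapping two hypothetical sources with different column names and date formats onto one shared schema before combining them, assuming pandas:

```python
import pandas as pd

# Two hypothetical sources with different column names and date formats
crm = pd.DataFrame({"CustomerID": ["A1"], "signup": ["03/15/2024"]})
billing = pd.DataFrame({"cust_id": ["A2"], "signup_date": ["2024-04-01"]})


def to_shared_schema(df: pd.DataFrame, id_col: str, date_col: str, date_format: str) -> pd.DataFrame:
    """Map one source onto the shared schema: one id column, one parsed date column."""
    return pd.DataFrame({
        "customer_id": df[id_col],
        "signup_date": pd.to_datetime(df[date_col], format=date_format),
    })


unified = pd.concat([
    to_shared_schema(crm, "CustomerID", "signup", "%m/%d/%Y"),
    to_shared_schema(billing, "cust_id", "signup_date", "%Y-%m-%d"),
], ignore_index=True)
print(unified)
```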

Lack of synchronization

When high-quality data is successfully integrated and stored in a target system, the next challenge is maintaining consistency and avoiding data decay. If data synchronization—the continuous propagation of record changes and updates throughout a system—isn’t taking place, datasets can quickly become outdated, limiting their usability.
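
One widely used pattern for keeping a target system current is incremental propagation based on a last-updated watermark; the sketch below uses simple in-memory dictionaries as stand-ins for source and target systems:

```python
from datetime import datetime

# Hypothetical source and target, keyed by record id, each row carrying an updated_at timestamp
source = {
    "r1": {"value": "new address", "updated_at": datetime(2025, 1, 10)},
    "r2": {"value": "old phone", "updated_at": datetime(2024, 6, 1)},
}
target = {"r2": {"value": "old phone", "updated_at": datetime(2024, 6, 1)}}

# Watermark: the newest change the target has already received
watermark = datetime(2024, 6, 1)

# Propagate only records changed since the watermark, then advance it
changed = {key: row for key, row in source.items() if row["updated_at"] > watermark}
target.update(changed)
watermark = max(row["updated_at"] for row in source.values())
print(sorted(target), watermark)
```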

Data poisoning

While many causes of data quality issues are inadvertent, some are intentional—namely, those involving malicious actors who set their sights on undermining data quality. This has become an especially salient concern in AI model training: A type of cyberattack known as data poisoning manipulates or corrupts training data, which in turn can subtly or drastically alter model behavior.

How to fix data quality issues

Organizations can prevent and address data quality problems through data quality management, data quality monitoring and data quality tools that support these practices. Key measures include:

  • Implementing data governance
  • Detecting issues
  • Correcting errors
  • Validating data
  • Monitoring data 

Implementing data governance

Through the discipline of data governance, organizations set policies and standards necessary for collecting, storing and maintaining high-quality data.

Data governance software solutions support the enforcement of these policies by making it easy for users to discover, understand and access data. Leading solutions include searchable catalogs of data assets, as well as metadata enrichment and quality check capabilities—features that ensure that metadata includes information on data rules, data definitions and data lineage.

Detecting issues

Data profiling and auditing tools evaluate the structure and context of data, and establish a baseline against which to measure remediation. The right tool can identify issues such as inconsistencies, duplicate records and general anomalies.
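
A minimal example of the kind of baseline a profiling step might establish, assuming pandas and an illustrative table: per-column missing-value shares, duplicate row counts and value ranges.

```python
import pandas as pd

# Illustrative table with a missing value, an out-of-range age and one duplicate row
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "age": [34, 200, 200, None],
    "email": ["a@x.com", "b@x.com", "b@x.com", None],
})

# A simple baseline profile: per-column missing-value share, duplicate rows, value ranges
profile = {
    "missing_share": df.isna().mean().round(2).to_dict(),
    "duplicate_rows": int(df.duplicated().sum()),
    "age_range": (df["age"].min(), df["age"].max()),
}
print(profile)
```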

Correcting errors

Data cleansing (also known as data cleaning) is the correction of errors and inconsistencies in raw datasets through methods such as standardization, data deduplication and addressing missing values.

While data engineers and other data management professionals can perform data cleaning techniques through manual methods, AI can be used to automate and optimize data cleaning processes such as standardizing data and consolidating duplicates.

Validating data

Data validation is rule-based verification that data is clean, accurate and meets specific data quality requirements (such as range constraints) that make it ready for use. Data validation capabilities are commonly included in data integration platforms.
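
A sketch of rule-based validation as a gate before data is used, with illustrative rules expressed as simple predicates; the rule names and thresholds are assumptions, not any particular platform's API:

```python
import pandas as pd

# Illustrative rules: each maps a rule name to a predicate that should hold for every row
RULES = {
    "age_in_range": lambda df: df["age"].between(0, 125),
    "email_present": lambda df: df["email"].notna(),
}


def validate(df: pd.DataFrame) -> dict:
    """Return the number of violating rows per rule; an empty dict means the data passes."""
    failures = {name: int((~rule(df)).sum()) for name, rule in RULES.items()}
    return {name: count for name, count in failures.items() if count > 0}


incoming = pd.DataFrame({"age": [34, 200], "email": ["a@x.com", None]})
failures = validate(incoming)
if failures:
    raise ValueError(f"Validation failed: {failures}")  # gate the load on clean data
```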

Monitoring data

While it’s critical for organizations to ensure the quality of data entering their systems, it’s also important to continue monitoring it throughout its lifecycle.

Leading data observability tools allow the assessment of data across an organization’s data ecosystem and the management of incidents through a single dashboard. Such tools enable automated monitoring, root cause analysis, logging, data lineage, service level agreement (SLA) tracking and the routing of real-time data anomaly alerts to data stakeholders.

Alice Gomstyn

Staff Writer

IBM Think

Alexandra Jonker

Staff Editor

IBM Think

Footnotes