The devil is in the data: How data quality metrics help enterprises get ahead

Product manager leading a meeting, explaining data on a screen with graphics.

Authors

Alice Gomstyn, Staff Writer, IBM Think

Alexandra Jonker, Staff Editor, IBM Think

Cultivating a vibrant data environment can help enterprises accelerate growth, according to new research by the IBM Institute for Business Value. But how can organizations know if their data is, in fact, vibrant and primed to fuel growth?

Using data quality metrics can help.

Data quality metrics are quantitative measures for evaluating the quality of data. Organizations can leverage data quality metrics to track and monitor data quality over time, helping identify high-quality data fit for data-driven decision-making and artificial intelligence (AI) use cases.

Metrics vary by organization and may reflect traditional data quality dimensions such as accuracy, timeliness and uniqueness, as well as characteristics specific to modern data pipelines, such as pipeline duration. Through data quality metrics, dimensions of data quality can be mapped to numerical values.

Data quality tools powered by automation and machine learning can help data engineers evaluate data quality metrics and identify data quality issues in real time. This enables organizations and their data teams to take the necessary steps to optimize the trustworthiness and reliability of their datasets and data pipelines.

Why are data quality metrics important?

Maintaining high-quality, reliable data is an objective for many modern organizations—and for good reason.

Good data contributes to valuable business intelligence, operational efficiency, optimized workflows, regulatory compliance, customer satisfaction, enterprise growth and progress on key performance indicators (KPIs). High data quality is also critical for effective AI initiatives, as AI models require training on reliable, accurate data to deliver useful outputs.

But to reap such rewards, organizations must ensure their data is truly high quality. That’s where data quality metrics play a key role: they help organizations ascertain the quality of their data by mapping data quality dimensions to numerical values, such as scores.1

Through data quality assessments, organizations can determine the usability of their data for business decisions and AI model training. Low-quality data identified through data quality measures can often be improved through data remediation efforts.

Key dimensions of data quality

Six traditional dimensions tracked through data quality metrics are:

  • Data accuracy: Data correctly represents real-world events and values.
  • Data completeness: Data contains all necessary records with no missing values.
  • Data consistency: Data is coherent and standardized across the organization, ensuring that data records in different datasets are compatible.
  • Data timeliness: Data values are up to date, allowing organizations to avoid making decisions based on stale information.
  • Data uniqueness: Data is free from redundancies or duplicate records, which can distort analysis.
  • Data validity: Data conforms to business rules, such as falling within permitted ranges for certain data values and meeting specified data format standards (see the sketch after this list for a simple uniqueness and validity check).

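To make a couple of these dimensions concrete, the sketch below scores uniqueness and validity for a small, hypothetical pandas DataFrame. The column names, permitted range and format rule are assumptions chosen for illustration, not recommendations.

```python
import pandas as pd

# Hypothetical orders data; the column names and business rules below are
# assumptions chosen for illustration.
orders = pd.DataFrame({
    "order_id": ["A-001", "A-002", "A-002", "A-003"],
    "quantity": [3, -1, 5, 120],
})

# Uniqueness: share of records that are not duplicates of an earlier record.
uniqueness = 1 - orders.duplicated(subset="order_id").mean()

# Validity: share of records that satisfy the assumed business rules
# (quantity within a permitted range, order_id matching an expected format).
in_range = orders["quantity"].between(1, 100)
well_formed = orders["order_id"].str.match(r"[A-Z]-\d{3}$")
validity = (in_range & well_formed).mean()

print(f"Uniqueness: {uniqueness:.2f}, Validity: {validity:.2f}")
```
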
Common dimensions of data quality can often be measured through simple ratios, such as the ratio of the number of preferred outcomes (the number of accurate data points, valid data entries, etc.) to the total number of outcomes.2

For example, a basic way to calculate data completeness is:

Completeness = (number of complete data elements) / (total number of data elements)

Alternatively, using an inverse metric focused on bad data is also an option:

Completeness = 1 – [(missing data elements) / (total number of data elements)]

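A minimal sketch of both calculations, assuming the data lives in a pandas DataFrame and that missing values appear as nulls, might look like this (the sample records are hypothetical):

```python
import pandas as pd

# Hypothetical customer records with some missing values.
customers = pd.DataFrame({
    "name": ["Ada", "Grace", None, "Alan"],
    "email": ["ada@example.com", None, None, "alan@example.com"],
})

total_elements = customers.size                    # total number of data elements
complete_elements = customers.notna().sum().sum()  # elements that are present
missing_elements = customers.isna().sum().sum()    # elements that are missing

# Direct ratio: complete elements over total elements.
completeness = complete_elements / total_elements

# Equivalent inverse metric: 1 minus the share of missing elements.
completeness_inverse = 1 - missing_elements / total_elements

print(f"Completeness: {completeness:.2f} (inverse form: {completeness_inverse:.2f})")
```
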
Other methods for measuring dimensions require more complex calculations.

For example, formulas for calculating data timeliness might rely on variables such as the data’s age, delivery time (when data is delivered), input time (when data is received) and volatility (the amount of time that data is valid).

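One possible formulation, sketched below, treats currency as the data’s age plus the lag between input and delivery, and scores timeliness as the fraction of the data’s valid lifetime (its volatility) that remains. The exact formula is an assumption for illustration; organizations define timeliness in ways that fit their own context.

```python
from datetime import datetime, timedelta

def timeliness(age: timedelta, input_time: datetime, delivery_time: datetime,
               volatility: timedelta) -> float:
    """Score timeliness on a 0-1 scale using one possible formulation:
    currency = (delivery_time - input_time) + age, and
    timeliness = max(0, 1 - currency / volatility)."""
    currency = (delivery_time - input_time) + age
    return max(0.0, 1 - currency / volatility)

# Hypothetical example: data is 2 hours old when received, delivered 1 hour
# later, and considered valid for 12 hours in total.
score = timeliness(
    age=timedelta(hours=2),
    input_time=datetime(2025, 1, 1, 9, 0),
    delivery_time=datetime(2025, 1, 1, 10, 0),
    volatility=timedelta(hours=12),
)
print(f"Timeliness: {score:.2f}")  # 1 - 3/12 = 0.75
```

Under this formulation, data delivered instantly with no age scores 1, while data older than its volatility window scores 0.
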
Additional data quality metrics

In addition to data metrics representing traditional data quality dimensions, other key metrics can help organizations keep their data pipelines running smoothly. Examples include:

  • Data freshness: Sometimes used interchangeably with data timeliness, data freshness refers specifically to the frequency with which data is updated in a system. Data staleness occurs when there are significant gaps between data updates.
  • Data lineage: Data lineage, the process of observing and tracing touchpoints along the data journey, can help organizations confirm accuracy and consistency in data.
  • Null counts: Data engineers and analysts may track the number or percentage of nulls in a column. Rising null counts could indicate issues such as missing values and data drift (a short monitoring sketch follows this list).
  • Schema changes: Frequent schema changes, such as column data type changes or new columns, might indicate an unreliable data source.
  • Pipeline failures: Pipeline failures can cause data health issues such as schema changes, missing data operations and stale data.
  • Pipeline duration: Complex data pipelines typically take similar amounts of time to complete different runs. Major changes in duration could result in the processing of stale data.

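As a rough illustration of how some of these pipeline-oriented metrics might be monitored, the sketch below compares a new data batch and pipeline run against a previously captured baseline to flag rising null percentages, schema changes and unusual run durations. The thresholds, column names and baseline values are assumptions for the example, not defaults from any particular tool.

```python
import pandas as pd

# Hypothetical new batch from a pipeline run, plus a baseline captured earlier.
batch = pd.DataFrame({"user_id": [1, 2, None, 4], "country": ["US", None, None, "DE"]})
baseline_null_pct = {"user_id": 0.0, "country": 0.10}   # assumed historical null shares
baseline_schema = {"user_id": "int64", "country": "object"}
baseline_duration_minutes = 30.0                        # assumed typical run time

# Null counts: flag columns whose null share rose noticeably above the baseline.
null_pct = batch.isna().mean()
for column, share in null_pct.items():
    expected = baseline_null_pct.get(column, 0.0)
    if share > expected + 0.05:
        print(f"Null alert: {column} is {share:.0%} null (baseline {expected:.0%})")

# Schema changes: flag new columns or changed data types.
current_schema = {col: str(dtype) for col, dtype in batch.dtypes.items()}
for column, dtype in current_schema.items():
    if baseline_schema.get(column) != dtype:
        print(f"Schema alert: {column} is now {dtype} (baseline {baseline_schema.get(column)})")

# Pipeline duration: flag runs that deviate sharply from the typical duration.
run_duration_minutes = 55.0
if abs(run_duration_minutes - baseline_duration_minutes) > 0.5 * baseline_duration_minutes:
    print(f"Duration alert: run took {run_duration_minutes} min (baseline {baseline_duration_minutes} min)")
```

In practice, the baseline would be refreshed from historical runs rather than hard-coded, and alerts would route to an incident management workflow rather than standard output.
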
Learn more about the top data quality metrics for your environment.

Data quality metrics in key data processes

Data quality metrics support key data processes such as data governance, data observability and data quality management.

Data governance

Data governance is a data management discipline that helps ensure data integrity and data security by defining and implementing policies, quality standards and procedures for data collection, ownership, storage, processing and use. Data quality metrics such as data consistency and completeness help organizations assess progress toward meeting standards set through governance practices.

Data observability

Data observability is the practice of monitoring and managing data to help ensure its quality, availability and reliability across various processes, systems and pipelines within an organization. Data quality metrics tracked through data observability practices include data freshness, null counts and schema changes.

Data quality management

Data quality management or DQM is a collection of practices for enhancing and maintaining the quality of an organization’s data. A core DQM practice is data profiling, which entails reviewing the structure and content of existing data to evaluate its quality and establish a baseline against which to measure remediation. Data quality is evaluated according to data quality dimensions and metrics.

Poor data quality revealed through profiling can be addressed through another DQM practice: data cleansing. Data cleansing, also known as data cleaning, is the correction of data errors and inconsistencies in raw datasets. Cleansing data is an essential first step to data transformation, which converts raw data into a usable format for analysis.

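As a small, hypothetical example of these two DQM steps, the sketch below first profiles a raw dataset to establish a baseline (null shares, duplicate rows and data types) and then applies basic cleansing before any downstream transformation. The dataset and cleaning rules are illustrative assumptions.

```python
import pandas as pd

# Hypothetical raw dataset with duplicates, missing values and inconsistent types.
raw = pd.DataFrame({
    "customer_id": ["001", "002", "002", "003"],
    "signup_date": ["2024-01-05", "2024-02-10", "2024-02-10", "not a date"],
    "spend": ["120.50", "80", "80", None],
})

# Profiling: capture a baseline of quality indicators before cleansing.
profile = {
    "null_share": raw.isna().mean().to_dict(),
    "duplicate_rows": int(raw.duplicated().sum()),
    "dtypes": raw.dtypes.astype(str).to_dict(),
}
print("Profile baseline:", profile)

# Cleansing: remove duplicates and coerce columns to consistent, usable types.
clean = (
    raw.drop_duplicates()
       .assign(
           signup_date=lambda df: pd.to_datetime(df["signup_date"], errors="coerce"),
           spend=lambda df: pd.to_numeric(df["spend"], errors="coerce"),
       )
)
print(clean.dtypes)
```
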
Tools for tracking data quality metrics

Software solutions can provide real-time data quality monitoring, including tracking performance on data quality metrics. Leading solutions might include features such as:

Comprehensive dashboards

An aggregated display of an organization's pipelines and data assets enables data incident management across the data stack.

Real-time monitoring

Monitoring for data quality checks and service level agreement (SLA) rule violations related to missed data deliveries, schema changes and anomalies.

Customized alerts

Customized, automated notifications delivered to data stakeholders through tools and platforms such as Slack, PagerDuty and email.

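As an illustrative sketch, a simple custom alert might post a message to a Slack incoming webhook when a data quality check falls below a threshold. The webhook URL, metric and threshold below are assumptions, and commercial tools typically add routing, deduplication and escalation on top of this.

```python
import json
import urllib.request

# Hypothetical webhook URL and check result; real deployments would load these
# from configuration and from the monitoring system's check output.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"
completeness_score = 0.82
COMPLETENESS_THRESHOLD = 0.95

if completeness_score < COMPLETENESS_THRESHOLD:
    payload = {
        "text": (
            f":warning: Data quality alert: completeness is {completeness_score:.0%}, "
            f"below the {COMPLETENESS_THRESHOLD:.0%} threshold."
        )
    }
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:  # sends the alert to Slack
        print("Alert delivered:", response.status)
```
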
Trend-level graphs

Graphs of rows and operations written and read each day can help enterprises identify important trends and problematic patterns.

End-to-end lineage

End-to-end data lineage shows dependent datasets and pipelines that are affected by data quality issues.

Related solutions

IBM watsonx.data intelligence

Discover, govern and share your data—wherever it resides—to fuel AI that delivers accurate, timely and relevant insights.

Discover watsonx.data intelligence

IBM data intelligence solutions

Transform raw data into actionable insights swiftly, unify data governance, quality, lineage and sharing, and empower data consumers with reliable and contextualized data.

Discover data intelligence solutions

Data and analytics consulting services

Unlock the value of enterprise data with IBM Consulting, building an insight-driven organization that delivers business advantage.

Discover analytics services

Take the next step

IBM® watsonx.data® optimizes workloads for price and performance while enforcing consistent governance across sources, formats and teams.

Explore IBM watsonx.data
Explore data quality solutions