What is a data contract?

By Alice Gomstyn, Alexandra Jonker

Data contracts, defined

Data contracts are formal agreements between data producers and data consumers that define the quality, structure, semantics and availability of data. The creation and enforcement of these agreements can help enterprises strengthen their data-driven decision-making.

Like traditional business contracts, data contracts include terms and conditions governing what’s being delivered from one party to another. In a data contract, this can include components such as data quality rules, schema definitions, service level agreements, data producer information and server information.

Where data contracts truly differ, however, is that they are written in code, so the agreements can be enforced through automation rather than manual processes.

The impact of data contracts on data engineering has been compared to the impact of application programming interfaces (APIs) on software development. APIs define rules that enable software applications to communicate with each other, while data contracts define rules that enable data consumers to successfully integrate and use data from various sources.

And, just as APIs are credited with improving productivity and accelerating innovation in software development, the successful implementation of data contracts can yield an array of benefits to enterprises and data users.

The most critical of these is the prevention of data pipeline failures: Without data contracts, upstream changes in data production can result in disastrous consequences for downstream users. Data contracts can ensure such breaking changes are identified and addressed before they impact data consumers.

Other benefits of data contracts include improvements to data quality, data governance and scalability. Data contracts also provide foundational support to data products and data mesh architectures that enable business users to find and unlock value from data across the organization.

There are a variety of tools and platforms that help businesses define and enforce data contracts, including data quality tools and data governance platforms.

Why are data contracts important?

Brittle and broken data pipelines are the bane of many data engineers. One study found that more than half of engineers encounter pipeline failures in their data systems at least once a month.¹

Too often, as one data architect noted, pipelines are “held together with duct tape and desperation.” When they fail, they can undermine decision-making and artificial intelligence (AI) initiatives in a disastrous fashion.

Data contracts can help prevent such consequences by targeting a frequent source of pipeline failures: misalignment between data producers and data consumers. Misalignment occurs when new data provided by data producers doesn’t meet the expectations of consumers, who might rely on specific data types, schemas and other constraints to fit their use cases.

Understandably, downstream consumers can “be doubtful about the stability of the data that they find,” according to Jean-Georges Perrin, a lifetime IBM Champion. “To create trust, the data producer or the data owner needs to show and guarantee a promise,” Perrin wrote.

Such a promise—whether it’s about data quality, validation, access or structure—can be guaranteed through the implementation of data contracts. When data producers and data consumers align on and codify data requirements, this can prevent data quality issues before they affect downstream workflows.

Data contracts are emerging as especially salient for AI workflows, because ensuring the right data for model training and data analytics is crucial for accurate predictions.

“You get better data in your systems, so you don’t have garbage in, garbage out,” Perrin said.

Data producers and data consumers: Key differences and roles

Since misalignment between data producers and consumers largely drives the need for data contracts, it helps to take a closer look at both of these groups and their roles within data ecosystems.

Data producers are often software engineers who collect and store data as they build applications. Often this data includes transactional events, such as customer orders, which can vary tremendously in terms of schema, size, content and so on.²
Data consumers include technical data consumers (data engineers, data scientists and other data team members who use programming languages to transform and analyze data and to construct data pipelines) and non-technical data consumers, namely business users who use transformed data to inform decisions.

Data consumers rely on producers to make their data available for downstream use. But this reliance does not by itself establish a solid relationship between consumers and producers. That’s because producers tend to store data in the formats best suited for their applications—the data management needs of downstream consumers aren’t inherently part of their focus.

Consequently, when software engineers update applications and code, they might not consider how those updates impact the data that is ultimately delivered to data consumers.

When such changes are unexpected—even changes as relatively small as dropping a single column from a 1,000-column table—chaos can ensue for data consumers. These so-called breaking changes can disrupt data flows and imperil dataset compatibility.

As a result, scripts fail, dashboards become stale or inaccurate, and both humans and AI agents are deprived of the reliable data necessary for key decisions and operations.

By establishing definitive agreements between data producers and data consumers, data contracts can break the silos separating the two groups, preventing misalignment and supporting more functional data pipelines.

What do data contracts include?

Data contracts, as advocates note, turn implicit assumptions into explicit guarantees: They detail exactly what data producers are supposed to deliver to data consumers and how they’re expected to achieve that delivery. Key elements of data contracts include:

Fundamentals
Schema
Data quality
Support channels
Pricing
Team
Roles
Service level agreements (SLAs)
Infrastructure and servers

Fundamentals

Fundamentals, also known as demographics, encompass general information about the contract. This can include unique identifiers, the contract version (1.0 or 2.0, for instance), the contract status (such as “active” or “retired”), the intended purpose for the data, and legal limitations for data use.

Schema

A schema definition details how data is organized. It specifies objects (data structures such as tables and documents), object properties (such as the columns in a table) and metadata for included data types, such as timestamps and string length limits. Schema registries, which are centralized repositories for managing schemas, can help support data contracts.
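
A schema definition of this kind can be checked mechanically. The following is a minimal Python sketch, not the format of any particular contract standard: the `ORDERS_SCHEMA` dictionary, its property names and the `validate_record` helper are all hypothetical and exist only to illustrate the idea.

```python
from datetime import datetime

# Hypothetical schema for an "orders" object: property name -> expected type
# plus optional metadata such as a string length limit.
ORDERS_SCHEMA = {
    "order_id": {"type": str, "max_length": 12},
    "amount": {"type": float},
    "created_at": {"type": datetime},
}


def validate_record(record, schema):
    """Return a list of human-readable schema violations for one record."""
    errors = []
    for name, rules in schema.items():
        if name not in record:
            errors.append(f"missing property: {name}")
            continue
        value = record[name]
        if not isinstance(value, rules["type"]):
            errors.append(f"{name}: expected {rules['type'].__name__}")
            continue
        max_length = rules.get("max_length")
        if max_length is not None and len(value) > max_length:
            errors.append(f"{name}: longer than {max_length} characters")
    return errors


good = {"order_id": "A-1001", "amount": 19.99, "created_at": datetime(2024, 1, 5)}
bad = {"order_id": "A-1001-TOO-LONG-ID", "amount": "19.99"}

print(validate_record(good, ORDERS_SCHEMA))  # no violations
print(validate_record(bad, ORDERS_SCHEMA))   # three violations
```

Returning a list of violations, rather than raising on the first one, lets a producer see every mismatch in a single run.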

Data quality

Data contracts define rules and parameters to ensure high-quality data. They can address multiple data quality metrics, such as accuracy, completeness, validity and null counts. Additionally, custom rules can allow for quality checks by data quality tools.
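
As a rough illustration of how such rules might be evaluated, the Python sketch below computes a null count, a completeness ratio and a validity check over a handful of rows. The sample rows, column names and thresholds are assumptions made up for this example; dedicated data quality tools express the same ideas in their own syntax.

```python
# Hypothetical rows and thresholds, used only for illustration.
rows = [
    {"customer_id": "c1", "email": "a@example.com", "age": 34},
    {"customer_id": "c2", "email": None, "age": 29},
    {"customer_id": "c3", "email": "c@example.com", "age": -5},
    {"customer_id": "c4", "email": "d@example.com", "age": 41},
]


def null_count(rows, column):
    """Number of rows in which the column is missing or None."""
    return sum(1 for row in rows if row.get(column) is None)


def completeness(rows, column):
    """Share of rows in which the column is populated (0.0 to 1.0)."""
    return 1 - null_count(rows, column) / len(rows)


def invalid_count(rows, column, is_valid):
    """Number of non-null values that fail the validity predicate."""
    return sum(
        1 for row in rows
        if row.get(column) is not None and not is_valid(row[column])
    )


checks = {
    "email null count <= 1": null_count(rows, "email") <= 1,
    "email completeness >= 0.9": completeness(rows, "email") >= 0.9,
    "age within 0-120": invalid_count(rows, "age", lambda v: 0 <= v <= 120) == 0,
}
for rule, passed in checks.items():
    print(f"{'PASS' if passed else 'FAIL'}  {rule}")
```

With this sample data the first rule passes and the other two fail, because one of four emails is missing and one age is negative.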

Support channels

Data contracts can list sources for support—such as Slack channels, Teams channels, Discord chats, email distribution lists and websites—for data consumers who need help with their data contracts.

Pricing

The pricing section of a data contract lists what a data consumer is charged for a data product. It can include the currency being used and the unit of measure (such as megabytes or gigabytes) that can be used for calculating cost.
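
To make the arithmetic concrete, here is a small Python sketch that derives a charge from a price per unit of measure. The `pricing` dictionary, its key names, the rate and the decimal (power-of-ten) unit sizes are all assumptions for illustration only.

```python
from decimal import Decimal

# Hypothetical pricing terms: amount charged per unit of data delivered.
pricing = {
    "price_amount": Decimal("0.25"),
    "price_currency": "USD",
    "price_unit": "gigabyte",
}

# Assumption: decimal units (1 gigabyte = 10**9 bytes).
BYTES_PER_UNIT = {"megabyte": 10**6, "gigabyte": 10**9}


def cost_for(bytes_delivered, pricing):
    """Charge for a given volume, rounded to two decimal places."""
    units = Decimal(bytes_delivered) / Decimal(BYTES_PER_UNIT[pricing["price_unit"]])
    return (units * pricing["price_amount"]).quantize(Decimal("0.01"))


# 40 gigabytes at 0.25 per gigabyte
print(cost_for(40 * 10**9, pricing), pricing["price_currency"])
```

`Decimal` is used instead of floating point so that currency amounts are not affected by binary rounding.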

Team

Known in some older data contract templates as “stakeholders,” the team section of a data contract includes information on the members of the team that owns the data and their relationship to the data contract.

Roles

The roles section of a data contract lists the roles that give data consumers different types of access (such as read or write access) to a dataset.
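
A platform could use a roles section like this to decide whether a request is permitted. The Python sketch below is one hypothetical way to do so; the role names and the rule that write access implies read access are assumptions, not something a contract standard mandates.

```python
# Hypothetical roles section: role name -> access level it grants.
ROLES = {
    "analytics_reader": "read",
    "pipeline_writer": "write",
}

# Assumption for this sketch: write access implies read access.
IMPLIED = {"read": {"read"}, "write": {"read", "write"}}


def is_allowed(role, action):
    """True if the role exists in the contract and grants the action."""
    access = ROLES.get(role)
    return access is not None and action in IMPLIED[access]


print(is_allowed("analytics_reader", "read"))
print(is_allowed("analytics_reader", "write"))
```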

Service level agreements (SLAs)

Data contracts include descriptions of service-level agreements, which define the level of performance the provider is expected to attain. For example, SLA sections may list guarantees on when the data is available and how long it is retained.
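
Availability and retention guarantees lend themselves to simple automated checks. The following Python sketch assumes two hypothetical SLA terms, a maximum staleness of 24 hours and a retention period of 365 days; the names `SLA`, `is_fresh` and `is_within_retention` are invented for this example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLA terms.
SLA = {
    "max_staleness": timedelta(hours=24),  # data must be refreshed at least daily
    "retention": timedelta(days=365),      # data is kept for one year
}


def is_fresh(last_loaded_at, now, sla):
    """True if the most recent load is within the allowed staleness."""
    return now - last_loaded_at <= sla["max_staleness"]


def is_within_retention(record_date, now, sla):
    """True if a record is still inside the guaranteed retention window."""
    return now - record_date <= sla["retention"]


now = datetime(2024, 6, 2, 12, 0, tzinfo=timezone.utc)
print(is_fresh(datetime(2024, 6, 1, 18, 0, tzinfo=timezone.utc), now, SLA))
print(is_fresh(datetime(2024, 5, 31, 6, 0, tzinfo=timezone.utc), now, SLA))
```

Passing `now` in as an argument, rather than reading the clock inside the function, keeps the check deterministic and easy to test.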

Infrastructure and servers

Specifying the data’s server (such as Kafka, Microsoft SQL Server, Google BigQuery or IBM Db2) makes the data discoverable to data consumers while giving platform engineers the information they need to automate access. This section can also include information on different environments, such as development and production.

In addition to core data contract values, stakeholders can add custom properties to data contracts to meet their specific needs. Recent versions of popular data contract templates also include the ability to document relationships between properties, refer to external contracts and access a library of predefined data quality rules for consistent quality checks.

What is the format of data contracts?

Data contracts differ noticeably from other contracts in that they’re not written in plain language. Rather, they’re usually written in YAML or JSON, data serialization languages that are both human- and machine-readable. (Users unfamiliar with these languages can author data contracts in Excel and convert them to YAML files with the open source tool Data Contract CLI.)

The advantage of writing data contracts in a serialization language is that it enables automation in place of laborious manual processes. Machine-readable data contracts can be integrated into data platforms and enforcement tools. Organizations can deploy these solutions to test whether datasets adhere to data contract rules, allowing them to address issues before they result in pipeline failures.
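
The Python sketch below illustrates why machine readability matters: a contract stored as JSON is parsed with the standard library and then used to test a dataset's columns. The contract layout shown here is a simplified, hypothetical one, not the Open Data Contract Standard, and the helper names are invented for the example. JSON is used so that the snippet needs no third-party YAML parser.

```python
import json

# Simplified, hypothetical contract. Real contracts carry many more sections.
CONTRACT_JSON = """
{
  "id": "orders-contract",
  "version": "1.0.0",
  "status": "active",
  "schema": [
    {"name": "orders", "properties": [
      {"name": "order_id", "type": "string", "required": true},
      {"name": "amount", "type": "number", "required": true},
      {"name": "coupon", "type": "string", "required": false}
    ]}
  ]
}
"""

contract = json.loads(CONTRACT_JSON)


def required_properties(contract, object_name):
    """Names of the properties the contract marks as required for an object."""
    for obj in contract["schema"]:
        if obj["name"] == object_name:
            return [p["name"] for p in obj["properties"] if p["required"]]
    raise KeyError(object_name)


def missing_required(dataset_columns, contract, object_name):
    """Required properties that the dataset does not provide."""
    required = required_properties(contract, object_name)
    return [name for name in required if name not in dataset_columns]


# A dataset that lacks the required "amount" column.
print(missing_required(["order_id", "coupon"], contract, "orders"))
```

Because the rules live in the contract file rather than in the checking code, changing an expectation means editing the contract, not the program.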

How are data contracts designed?

Many data contracts are based on the Open Data Contract Standard (ODCS). As its name suggests, ODCS is an open source framework for standardizing data contracts. The standard is defined by Bitol, a Linux Foundation AI & Data sandbox project under the Apache 2.0 license, and is available on GitHub.

Proponents of the initiative say it helps facilitate innovation by allowing organizations to implement data contracts that support their data architecture without the risk of vendor lock-in.

What are the benefits of data contracts?

Data contracts can address multiple pain points in data pipelines, clearing the way for organizations to unlock value from their data assets. The benefits of data contracts include:

Improved data quality

As the old saying goes, what gets measured gets managed. Setting standards for accuracy, validity, timeliness and other data quality metrics in a data contract can enhance the quality of the data delivered while reducing latency.

Increased transparency for dependencies

Data contracts can delineate relationships between interconnected tasks, or dependencies. With clarification on such dependencies, data producers can avoid making breaking changes.

Better change management

Data contracts are version controlled, meaning that new versions of contracts are created to reflect important changes. This can help producers introduce modifications, such as schema changes, in a “safe” way that provides consumers time to accommodate those changes, reducing the risk of sudden pipeline breaks.
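
One way to picture this is a comparison between two contract versions that flags changes likely to break consumers. The Python sketch below makes several assumptions that the text above does not state: schemas are reduced to property-to-type mappings, removed properties and type changes count as breaking while added properties do not, and versions follow a major.minor.patch scheme in which breaking changes bump the major number.

```python
def breaking_changes(old_props, new_props):
    """List changes between two {property: type} mappings that would break consumers."""
    problems = []
    for name, old_type in old_props.items():
        if name not in new_props:
            problems.append(f"removed: {name}")
        elif new_props[name] != old_type:
            problems.append(f"type changed: {name} ({old_type} -> {new_props[name]})")
    return problems


def next_version(version, is_breaking):
    """Bump the major number for breaking changes, otherwise the minor number."""
    major, minor, _patch = (int(part) for part in version.split("."))
    return f"{major + 1}.0.0" if is_breaking else f"{major}.{minor + 1}.0"


v1 = {"order_id": "string", "amount": "number", "coupon": "string"}
v2 = {"order_id": "string", "amount": "string", "channel": "string"}

problems = breaking_changes(v1, v2)
print(problems)
print(next_version("1.2.0", bool(problems)))
```

A new major version signals to consumers that they need to adapt before switching, while the previous version can remain available during the transition.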

Enhanced collaboration

The process of creating a data contract facilitates communication and agreement between data producers and consumers. Once it’s created, the contract serves as a foundation for continued collaboration as versioning accommodates changing data and needs.

Greater accountability

Data contracts delineate the relationships between stakeholders and data, including who owns it, what roles can access it, and how users can get support. Contracts help clarify who is responsible for what, ensuring greater accountability.

Data governance

Data contracts can be considered data governance tools because they reflect and enforce critical governance goals, such as ensuring data quality, security and availability throughout the data lifecycle.

Easier scalability

By supporting data governance and collaboration, data contracts can help organizations and business units successfully share data even as data volumes increase at unprecedented rates.

Data contracts vs. data products and data mesh

Data contracts are often mentioned in discussions around data products and data mesh. This is for good reason: Data contracts play key roles in supporting both.

Let’s start with data products. A data product is a reusable, self-contained package that combines data, metadata, semantics and templates to support diverse business use cases. Data contracts can serve as “quality control” for data products, ensuring that the data within them is consistent, reliable and formatted correctly.

Perrin describes the relationship between data products and data contracts in familial terms: “Data contracts and data products are like inseparable cousins—always working together, always aligned, and always making sure things run smoothly.”

The significance of data contracts for data products also makes them important for the functionality of data mesh. A data mesh is a decentralized data architecture that organizes data by business domain—such as marketing, sales or customer service.

In a data mesh architecture, domain data producers use data products that allow business users to find and use data from different parts of an organization. As such, when data contracts ensure the performance of data products, they support the success of a data mesh as a whole.

How are data contracts implemented and enforced?

In the book Data Contracts, data experts Chad Sanderson, Mark Freeman and B.E. Schmidt delineate the following process for how data contracts work.³

1. Data consumers identify their data needs to meet business objectives.
2. Technical data consumers translate business requirements into technical requirements for data.
3. Data consumers request data contracts from data producers based on these requirements.
4. Data producers determine whether the requests are viable.
5. The data contract is written in code, such as YAML.
6. Data producers create a pull request (a way to propose changes to a repository) when they need to change a data asset.
7. Data contract-based checks are performed on the requested change as part of a CI/CD pipeline to ensure it doesn’t violate contract terms.
8. Data producers are alerted if the change violates the contract, triggering measures to address the violation.
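
The final steps of this process, the automated check and the alert, can be sketched as a small script that a CI/CD job might run against a pull request. This is a hypothetical illustration rather than the authors' implementation: the function name, the column lists and the use of standard error for the alert are all assumptions.

```python
import sys


def check_pull_request(contract_columns, proposed_columns):
    """Return 0 if the proposed schema still satisfies the contract, else 1."""
    violations = sorted(set(contract_columns) - set(proposed_columns))
    for column in violations:
        # Alert the producer. A real pipeline might post a pull request comment instead.
        print(f"contract violation: column '{column}' is missing", file=sys.stderr)
    return 1 if violations else 0


# Hypothetical inputs: the contract expects three columns,
# and the proposed change drops "amount".
contract_columns = ["order_id", "amount", "created_at"]
proposed_columns = ["order_id", "created_at"]

exit_code = check_pull_request(contract_columns, proposed_columns)
print(f"exit code: {exit_code}")
# In a CI job, the script would end with sys.exit(exit_code),
# so that a non-zero code fails the check and blocks the merge.
```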

Data contract solutions

Organizations can choose from a variety of tools and platforms to create and manage data contracts. They include:

Open source projects that serialize and deserialize data, such as Apache Avro and Google Protocol Buffers (protobuf)
Data quality and testing tools, such as Great Expectations and dbt
Schema registries to check for schema compatibility, such as Confluent’s registry
Data governance platforms with features such as data lineage tracking and data catalogs

An organization’s unique needs and existing data stack can help determine which solution, or combination of solutions, best supports its data contracts.

Authors

Alice Gomstyn

Staff Writer

IBM Think

Alexandra Jonker

Staff Editor

IBM Think
