A data mesh is a decentralized data architecture that organizes data by a specific business domain—for example, marketing, sales, customer service and more—to provide more ownership to the producers of a given data set.
The producers’ understanding of the domain data positions them to set data governance policies focused on documentation, quality, and access. This, in turn, enables self-service use across an organization. While this federated approach eliminates many of the operational bottlenecks associated with centralized, monolithic systems, it doesn't rule out traditional storage systems such as data lakes or data warehouses. It just means that their use shifts from a single, centralized data platform to multiple decentralized data repositories.
It's worth noting that a data mesh promotes the adoption of cloud native and cloud platform technologies to scale and achieve the goals of data management. The concept is commonly compared to microservices to help audiences understand its place within this landscape. Because this distributed architecture is particularly helpful in scaling data needs across an organization, a data mesh may not suit every type of business; smaller businesses may not reap its benefits, as their enterprise data may not be as complex as that of a larger organization.
Zhamak Dehghani, a director of technology at the IT consultancy ThoughtWorks, is credited with promoting the concept of the data mesh as a solution to the inherent challenges of centralized, monolithic data architectures, such as data accessibility and organization. Its adoption was further spurred by the COVID-19 pandemic, as organizations sought to drive cultural change and reduce complexity around data.
A data mesh involves a cultural shift in the way that companies think about their data. Instead of data acting as a by-product of a process, it becomes the product, and data producers act as data product owners. Historically, a centralized infrastructure team maintained data ownership across domains, but the product-thinking focus of a data mesh shifts this ownership to the producers, who are the subject matter experts. Their understanding of the primary data consumers and how those consumers leverage the domain's operational and analytical data allows them to design APIs with the consumers' best interests in mind.
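To make the product-thinking idea concrete, here is a minimal sketch of what a domain-owned data product API might look like, assuming Python; the class, fields and storage client are hypothetical illustrations, not part of any specific data mesh framework or vendor offering.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterator


@dataclass
class OrderRecord:
    """One row of the sales domain's published analytical data."""
    order_id: str
    customer_id: str
    order_date: date
    total_amount: float
    currency: str


class SalesOrdersDataProduct:
    """Hypothetical data product owned by the sales domain team.

    The domain team, as subject matter experts, decides what the product exposes
    and documents it for downstream consumers (analytics, data science and so on).
    """

    def __init__(self, storage_client):
        # storage_client stands in for whatever repository the domain team chose
        # (warehouse table, lake files and so on); the mesh does not prescribe it.
        self._storage = storage_client

    def read_orders(self, start: date, end: date) -> Iterator[OrderRecord]:
        """Serve analytical order data for a date range, in the published schema."""
        rows = self._storage.query(
            "SELECT order_id, customer_id, order_date, total_amount, currency "
            "FROM sales.orders WHERE order_date BETWEEN %s AND %s",
            (start, end),
        )
        for row in rows:
            yield OrderRecord(*row)
```

The point of the sketch is that the domain team, not a central platform team, decides on, documents and serves the schema its consumers rely on.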
While this domain-driven design also makes data producers responsible for documenting semantic definitions, cataloging metadata and setting policies for permissions and usage, a centralized data governance team still enforces these standards and procedures around the data. Additionally, while domain teams become responsible for their ETL data pipelines under a data mesh architecture, this doesn't eliminate the need for a centralized data engineering team; that team's responsibility simply becomes more focused on determining the best data infrastructure solutions for the data products being stored.
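As a sketch of how domain-level documentation and central enforcement can coexist, the snippet below assumes a hypothetical data product descriptor and an equally hypothetical governance check; the field names and values are illustrative only.

```python
# Hypothetical descriptor that a domain team might publish alongside its data
# product, plus a check a central governance team could run before registering
# the product in the organization's catalog.
product_descriptor = {
    "name": "sales.orders",
    "owner": "sales-domain-team",
    "description": "Completed customer orders, one row per order.",
    "semantic_definitions": {
        "total_amount": "Order total in the 'currency' field, after discounts, before tax.",
    },
    "schema": {"order_id": "string", "customer_id": "string", "total_amount": "decimal"},
    "access_policy": {"contains_pii": False, "allowed_roles": ["analyst", "data-scientist"]},
}

REQUIRED_FIELDS = {"name", "owner", "description", "semantic_definitions", "schema", "access_policy"}

def passes_governance_check(descriptor: dict) -> bool:
    """Enforce the organization-wide standard: a product must be documented,
    owned and carry an explicit access policy before it can be registered."""
    return REQUIRED_FIELDS.issubset(descriptor)

print(passes_governance_check(product_descriptor))  # True
```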
Similar to how a microservices architecture couples lightweight services together to provide functionality to a business- or consumer-facing application, a data mesh uses functional domains to set parameters around the data, enabling it to be treated as a product that users across the organization can access. In this way, a data mesh allows for more flexible data integration and interoperable functionality, where data from multiple domains can be immediately consumed by users for business analytics, data science experimentation and more.
As previously stated, a data mesh is a distributed data architecture in which data is organized by its domain to make it more accessible to users across an organization. A data lake is a low-cost storage environment, which typically houses petabytes of structured, semi-structured and unstructured data for business analytics, machine learning and other broad applications. A data mesh is an architectural approach to data, of which a data lake can be a part. However, a central data lake is more typically used as a dumping ground for data, because it frequently ingests data that does not yet have a defined purpose. As a result, it can devolve into a data swamp, that is, a data lake that lacks the data quality and data governance practices needed to provide meaningful insights.
A data fabric is an architectural concept focused on automating data integration, data engineering and governance in the data value chain between data providers and data consumers. A data fabric is based on the notion of “active metadata,” which uses knowledge graphs, semantics and artificial intelligence/machine learning technology to discover patterns in various types of metadata (for example, system logs and social metadata) and applies this insight to automate and orchestrate the data value chain (for example, enabling a data consumer to find a data product and then have that data product provisioned to them automatically). A data fabric is complementary to a data mesh rather than mutually exclusive with it. In fact, a data fabric makes a data mesh better, because it can automate key parts of the mesh, such as creating data products faster, enforcing global governance and making it easier to orchestrate the combination of multiple data products.
Data democratization: Data mesh architectures facilitate self-service applications from multiple data sources, broadening access to data beyond more technical resources, such as data scientists, data engineers and developers. By making data more discoverable and accessible through this domain-driven design, a data mesh reduces data silos and operational bottlenecks, enabling faster decision-making and freeing technical users to prioritize tasks that better utilize their skillsets.
Cost efficiencies: This distributed architecture moves away from batch data processing and instead promotes the adoption of cloud data platforms and streaming pipelines to collect data in real time. Cloud storage provides an additional cost advantage by allowing data teams to spin up large clusters as needed and pay only for the resources they specify. If you need additional compute power to run a job in a few hours instead of a few days, you can do this on a cloud data platform simply by purchasing additional compute nodes (a simple cost sketch after this list illustrates the idea). This also improves visibility into storage and compute costs, enabling better budgeting and resource allocation for engineering teams.
Less technical debt: A centralized data infrastructure accrues more technical debt because of the complexity of the system and the cross-team collaboration required to maintain it. As data accumulates within a single repository, it also begins to slow the overall system down. By distributing the data pipeline by domain ownership, data teams can better meet the demands of their data consumers and reduce the strain on the storage system. They can also make data more accessible by publishing APIs for consumers to interface with, reducing the overall volume of individual requests.
Interoperability: Under a data mesh model, data owners agree upfront on how to standardize domain-agnostic data fields, which facilitates interoperability. This way, when a domain team structures its respective datasets, it applies the relevant rules to enable data linkage across domains quickly and easily. Commonly standardized elements include field types, metadata and schema flags (see the schema sketch after this list). Consistency across domains enables data consumers to interface with APIs more easily and build applications that serve their business needs more appropriately.
Security and compliance: Data mesh architectures promote stronger governance practices, as they help enforce data standards for domain-agnostic data and access controls for sensitive data. This helps organizations follow government regulations, such as HIPAA, and the structure of this data ecosystem supports compliance by enabling data audits (see the audit sketch after this list). Log and trace data in a data mesh architecture embed observability into the system, allowing auditors to understand which users are accessing specific data and how often they access it.
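For the cost-efficiency point above, here is a simple cost sketch with entirely made-up numbers (the node counts, runtimes and per-node-hour price are illustrative, not drawn from any particular provider): the same job consumes the same node-hours whether it runs narrow and slow or wide and fast, so buying speed need not mean buying extra capacity.

```python
# Entirely hypothetical numbers: node count, runtime and price are illustrative.
price_per_node_hour = 0.50  # assumed cloud rate in USD

slow_run = {"nodes": 4, "hours": 48}   # a few days on a small cluster
fast_run = {"nodes": 48, "hours": 4}   # a few hours on a large, short-lived cluster

for name, run in (("slow", slow_run), ("fast", fast_run)):
    cost = run["nodes"] * run["hours"] * price_per_node_hour
    print(f"{name}: {run['nodes']} nodes x {run['hours']} h = ${cost:.2f}")

# Both runs consume 192 node-hours and cost $96.00; only the turnaround time differs.
```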
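For the interoperability point, the schema sketch below assumes a hypothetical set of domain-agnostic fields that every domain embeds in its own product schema; the field names and domains are invented for illustration.

```python
# Hypothetical organization-wide standard for domain-agnostic fields. Every domain
# embeds these in its own product schema so records can be linked across domains.
COMMON_FIELDS = {
    "record_id": "string",      # globally unique identifier
    "customer_id": "string",    # shared join key across domains
    "updated_at": "timestamp",  # ISO 8601, UTC
    "source_domain": "string",  # which domain published the record
}

# Each domain extends the common contract with its own fields.
sales_orders_schema = {**COMMON_FIELDS, "order_total": "decimal", "currency": "string"}
marketing_touch_schema = {**COMMON_FIELDS, "campaign_id": "string", "channel": "string"}

def conforms(schema: dict) -> bool:
    """Check that a domain schema carries every standardized field with the agreed type."""
    return all(schema.get(field) == field_type for field, field_type in COMMON_FIELDS.items())

print(conforms(sales_orders_schema), conforms(marketing_touch_schema))  # True True
```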
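And for the security and compliance point, the audit sketch below shows the kind of summary an auditor might derive from access logs; the log format and values are hypothetical.

```python
# Hypothetical access-log records emitted by data product APIs, and the kind of
# summary an auditor might derive: which users accessed which data, and how often.
from collections import Counter

access_log = [
    {"user": "analyst-1", "product": "sales.orders", "action": "read"},
    {"user": "analyst-1", "product": "sales.orders", "action": "read"},
    {"user": "ds-team-2", "product": "marketing.campaigns", "action": "read"},
]

usage = Counter((entry["user"], entry["product"]) for entry in access_log)
for (user, product), count in usage.items():
    print(f"{user} accessed {product} {count} time(s)")
```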
While distributed data mesh architectures are still gaining adoption, they're helping teams attain their goals of scalability for common big data use cases. These include:
Discovering, curating, trusting and accessing data through cataloging, quality assurance, governance, lineage tracing and data-sharing platforms.