A data mesh is a decentralized data architecture that organizes data by a specific business domain—for example, marketing, sales, customer service, and more—providing more ownership to the producers of a given dataset. The producers’ understanding of the domain data positions them to set data governance policies focused on documentation, quality, and access. This, in turn, enables self-service use across an organization. While this federated approach eliminates many operational bottlenecks associated with centralized, monolithic systems, it doesn't necessarily mean that you can't use traditional storage systems, like data lakes or data warehouses. It just means that their use has shifted from a single, centralized data platform to multiple decentralized data repositories.
It's worth noting that data mesh promotes the adoption of cloud-native and cloud platform technologies to scale and achieve the goals of data management. The concept is commonly compared to microservices to help audiences understand its place in this landscape. Because this distributed architecture is particularly helpful for scaling data needs across an organization, a data mesh may not suit every type of business: smaller businesses may not reap its benefits, since their enterprise data may not be as complex as that of a larger organization.
Zhamak Dehghani, a director of technology for IT consultancy firm ThoughtWorks, is credited with promoting the concept of data mesh as a solution to the inherent challenges of centralized, monolithic data structures, such as data accessibility and organization. Its adoption was further spurred by the COVID-19 pandemic in an effort to drive cultural change and reduce organizational complexity around data.
A data mesh involves a cultural shift in the way that companies think about their data. Instead of data acting as a by-product of a process, it becomes the product, and data producers act as data product owners. Historically, a centralized infrastructure team would maintain data ownership across domains, but the product thinking behind a data mesh model shifts this ownership to the producers, as they are the subject matter experts. Their understanding of the primary data consumers, and of how those consumers leverage the domain’s operational and analytical data, allows them to design APIs with the consumers' best interests in mind. While this domain-driven design also makes data producers responsible for documenting semantic definitions, cataloguing metadata and setting policies for permissions and usage, there is still a centralized data governance team to enforce these standards and procedures around the data. Additionally, while domain teams become responsible for their ETL data pipelines under a data mesh architecture, this doesn't eliminate the need for a centralized data engineering team. However, that team's responsibility becomes more focused on determining the best data infrastructure solutions for the data products being stored.
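The split of responsibilities described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DataProduct` class, its fields, the `governance_issues` checks, and the sample `orders` product are all hypothetical names invented here, not part of any data mesh standard or product. The domain team fills in the descriptor (semantic definition, schema metadata, access policy), and a central governance function checks that every descriptor meets the shared minimum.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor a domain team might publish for one data product."""
    name: str
    domain: str
    owner: str                 # the producing team, acting as product owner
    description: str           # semantic definition of the dataset
    schema: dict[str, str]     # column name -> type (catalogued metadata)
    allowed_roles: set[str] = field(default_factory=set)  # usage policy

    def can_read(self, role: str) -> bool:
        """Return True if the owner's policy permits this role to read the data."""
        return role in self.allowed_roles


def governance_issues(product: DataProduct) -> list[str]:
    """Checks a central governance team might apply to every domain's products."""
    issues = []
    if not product.description.strip():
        issues.append("missing semantic description")
    if not product.schema:
        issues.append("missing schema metadata")
    if not product.allowed_roles:
        issues.append("no access policy defined")
    return issues


# Example product owned by a (hypothetical) sales domain team.
orders = DataProduct(
    name="orders_daily",
    domain="sales",
    owner="sales-data-team",
    description="One row per confirmed order, refreshed daily.",
    schema={"order_id": "string", "customer_id": "string", "amount": "decimal"},
    allowed_roles={"analyst", "data_scientist"},
)
```

Calling `governance_issues(orders)` returns an empty list here because the sales team has supplied all three pieces of information, while `orders.can_read("analyst")` reflects the policy the domain team set. In practice these checks would more likely live in a data catalog or policy engine than in application code.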
Similar to how a microservices architecture couples lightweight services together to provide functionality to a business- or consumer-facing application, a data mesh uses functional domains as a way to set parameters around the data, enabling it to be treated as a product that can be accessed by users across the organization. In this way, a data mesh allows for more flexible data integration and interoperable functionality, where data from multiple domains can be immediately consumed by users for business analytics, data science experimentation and more.
As previously stated, a data mesh is a distributed data architecture, where data is organized by its domain to make it more accessible to users across an organization. A data lake is a low-cost storage environment, which typically houses petabytes of structured, semi-structured and unstructured data for business analytics, machine learning and other broad applications. A data mesh is an architectural approach to data, which a data lake can be a part of. However, a central data lake is more typically used as a dumping ground for data, as it is frequently used to ingest data that does not yet have a defined purpose. As a result, it can become a data swamp, that is, a data lake that lacks the appropriate data quality and data governance practices to provide insightful learnings.
A data fabric is an architecture concept that focuses on the automation of data integration, data engineering, and governance in a data value chain between data providers and data consumers. A data fabric is based on the notion of “active metadata,” which uses knowledge graphs, semantics, and AI/ML technology to discover patterns in various types of metadata (for example, system logs, social, etc.) and applies this insight to automate and orchestrate the data value chain (for example, enabling a data consumer to find a data product and then have that data product provisioned to them automatically). A data fabric is complementary to a data mesh rather than mutually exclusive with it. In fact, the data fabric makes the data mesh better because it can automate key parts of the data mesh, such as creating data products faster, enforcing global governance, and making it easier to orchestrate the combination of multiple data products.
Data democratization: Data mesh architectures facilitate self-service applications from multiple data sources, broadening access to data beyond more technical resources, such as data scientists, data engineers, and developers. By making data more discoverable and accessible, this domain-driven design reduces data silos and operational bottlenecks, enabling faster decision-making and freeing up technical users to prioritize tasks that better utilize their skillsets.
Cost efficiencies: This distributed architecture moves away from batch data processing and instead promotes the adoption of cloud data platforms and streaming pipelines to collect data in real time. Cloud storage provides an additional cost advantage by allowing data teams to spin up large clusters as needed, paying only for the storage specified. This means that if you need additional compute power to run a job in a few hours rather than a few days, you can easily get it on a cloud data platform by purchasing additional compute nodes. It also improves visibility into storage costs, enabling better budget and resource allocation for engineering teams.
Less technical debt: A centralized data infrastructure causes more technical debt due to the complexity of the system and the collaboration required to maintain it. As data accumulates within a repository, it also begins to slow down the overall system. By distributing the data pipeline by domain ownership, data teams can better meet the demands of their data consumers and reduce technical strains on the storage system. They can also make data more accessible by providing APIs for consumers to interface with, reducing the overall volume of individual requests.
Interoperability: Under a data mesh model, data owners agree on how to standardize domain-agnostic data fields upfront, which facilitates interoperability. This way, when a domain team is structuring their respective datasets, they are applying the relevant rules to enable data linkage across domains quickly and easily. Some fields commonly standardized are field type, metadata, schema flags, and more. Consistency across domains enables data consumers to interface with APIs more easily and develop applications to serve their business needs more appropriately.
Security and compliance: Data mesh architectures promote stronger governance practices as they help enforce data standards for domain-agnostic data and access controls for sensitive data. This helps organizations follow government regulations, like HIPAA restrictions, and the structure of this data ecosystem supports this compliance by enabling data audits. Log and trace data in a data mesh architecture embeds observability into the system, allowing auditors to understand which users are accessing specific data and the frequency of that access.
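Two of the benefits above, interoperability through standardized fields and auditability through access logs, lend themselves to a small illustration. The Python sketch below is a rough example under stated assumptions, not a reference implementation: the `STANDARD_FIELDS` contract, the helper functions, the sample `sales` and `support` records, and the log format are all hypothetical and chosen only to show the idea.

```python
from collections import Counter

# Hypothetical global standard agreed on upfront by all domain owners:
# domain-agnostic field name -> required Python type.
STANDARD_FIELDS = {"customer_id": str, "event_date": str}


def conforms(record: dict) -> bool:
    """True if the record carries every standardized field with the agreed type."""
    return all(
        name in record and isinstance(record[name], expected)
        for name, expected in STANDARD_FIELDS.items()
    )


def link_by_customer(left: list[dict], right: list[dict]) -> list[tuple[dict, dict]]:
    """Pair up records from two domains that share the same customer_id."""
    index: dict[str, list[dict]] = {}
    for row in right:
        index.setdefault(row["customer_id"], []).append(row)
    return [(l, r) for l in left for r in index.get(l["customer_id"], [])]


def access_frequency(log: list[tuple[str, str]]) -> Counter:
    """Count how often each (user, data product) pair appears in the access log."""
    return Counter(log)


def users_of(log: list[tuple[str, str]], product: str) -> set[str]:
    """Return the set of users who accessed the given data product."""
    return {user for user, accessed in log if accessed == product}


# Sample records from two hypothetical domains, both following the standard.
sales = [{"customer_id": "c1", "event_date": "2024-01-05", "amount": 120.0}]
support = [{"customer_id": "c1", "event_date": "2024-01-07", "ticket": "T-9"}]

# Sample access log entries as (user, data product) pairs.
access_log = [
    ("alice", "sales.orders_daily"),
    ("bob", "sales.orders_daily"),
    ("alice", "sales.orders_daily"),
    ("alice", "support.tickets"),
]
```

Because both domains use the same `customer_id` field with the same type, `link_by_customer(sales, support)` can pair their records without any per-domain translation. The same log that records each access lets an auditor ask who touched a product (`users_of`) and how often (`access_frequency`).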
While distributed data mesh architectures are still gaining adoption, they're helping teams attain their goals of scalability for common big data use cases. These include:
IBM supports the implementation of a data mesh with the IBM Data Fabric on Cloud Pak for Data. The IBM Data Fabric is a unified solution that contains all the capabilities needed to create data products and enable the governed and orchestrated access and use of these data products. The IBM Data Fabric enables the implementation of a data mesh on any platform (e.g., on-premises data lakes, cloud data warehouses, etc.), allowing true enterprise-level self-service and reuse of data products regardless of where the data is.