What is a data architecture?

A data architecture describes how data is managed, from collection through transformation, distribution, and consumption. It sets the blueprint for data and the way it flows through data storage systems. It is foundational to data processing operations and artificial intelligence (AI) applications.

The design of a data architecture should be driven by business requirements, which data architects and data engineers use to define the data model and the underlying data structures that support it. These designs typically facilitate a business need, such as a reporting or data science initiative.

As new data sources emerge from technologies such as the Internet of Things (IoT), a good data architecture ensures that data is manageable and useful, supporting data lifecycle management. More specifically, it can avoid redundant data storage, improve data quality through cleansing and deduplication, and enable new applications. Modern data architectures also provide mechanisms to integrate data across domains, such as between departments or geographies, breaking down data silos without the complexity of storing everything in one place.

Modern data architectures often leverage cloud platforms to manage and process data. While the cloud can be more costly, its compute scalability enables important data processing tasks to be completed rapidly. Its storage scalability also helps to cope with rising data volumes and to ensure that all relevant data is available for training AI applications.

Conceptual vs logical vs physical data models

The data architecture documentation includes three types of data model:

  • Conceptual data models: Also referred to as domain models, these offer a big-picture view of what the system will contain, how it will be organized, and which business rules are involved. Conceptual models are usually created as part of the process of gathering initial project requirements. Typically, they include entity classes (defining the types of things that are important for the business to represent in the data model), their characteristics and constraints, the relationships between them, and relevant security and data integrity requirements.
  • Logical data models: These are less abstract and provide greater detail about the concepts and relationships in the domain under consideration. They follow one of several formal data modeling notation systems, which indicate data attributes, such as data types and their corresponding lengths, and show the relationships among entities. Logical data models don’t specify any technical system requirements.
  • Physical data models: The physical data model is the most detailed and specific of the three. It defines the actual implementation of the database, including table structures, indexes, storage, and performance considerations. It focuses on the technical aspects of how the data will be stored and accessed, and it is used for database schema creation and optimization.
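
For illustration, a physical data model for a hypothetical HR domain might be sketched as DDL. The table and column names below are invented for the example; the point is that the physical model pins down types, constraints, and indexes that the conceptual and logical models leave open.

```python
import sqlite3

# Hypothetical physical data model for a simple HR domain:
# the conceptual entity "Employee" becomes a concrete table with
# typed columns, constraints, and an index for common lookups.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (
    dept_id   INTEGER PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE
);

CREATE TABLE employee (
    emp_id    INTEGER PRIMARY KEY,
    full_name TEXT NOT NULL,
    hire_date TEXT NOT NULL,          -- ISO 8601 date string
    dept_id   INTEGER NOT NULL REFERENCES department(dept_id)
);

-- Physical-level concern: speed up lookups by department.
CREATE INDEX idx_employee_dept ON employee(dept_id);
""")

tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
print(sorted(tables))
```

The same logical model could map to a very different physical model on another engine, which is exactly why the two are kept separate.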
Popular data architecture frameworks

A data architecture can draw from popular enterprise architecture frameworks, including TOGAF, DAMA-DMBOK 2, and the Zachman Framework for Enterprise Architecture.

The Open Group Architecture Framework (TOGAF)

This enterprise architecture methodology was developed in 1995 by The Open Group, of which IBM is a Platinum Member.

There are four pillars to the architecture:

  • Business architecture, which defines the enterprise’s organizational structure, business strategy, and processes.
  • Data architecture, which describes the conceptual, logical, and physical data assets and how they are stored and managed throughout their lifecycle.
  • Applications architecture, which represents the application systems, and how they relate to key business processes and each other.
  • Technical architecture, which describes the technology infrastructure (hardware, software, and networking) needed to support mission-critical applications.

As such, TOGAF provides a complete framework for designing and implementing an enterprise’s IT architecture, including its data architecture.

DAMA-DMBOK 2

DAMA International, originally founded as the Data Management Association International, is a not-for-profit organization dedicated to advancing data and information management. Its Data Management Body of Knowledge, DAMA-DMBOK 2, covers data architecture, as well as governance and ethics, data modeling and design, storage, security, and integration.

Zachman Framework for Enterprise Architecture

Originally developed by John Zachman at IBM in 1987, this framework uses a matrix of six layers, from contextual to detailed, mapped against six interrogatives (what, how, where, who, when, and why). It provides a formal way to organize and analyze data but does not include methods for doing so.

Types of data architectures and underlying components

A data architecture provides a high-level view of how different data management systems work together. These include a number of different data storage repositories, such as data lakes, data warehouses, data marts, and databases. Together, these components can form data architectures, such as data fabrics and data meshes, which are growing in popularity. These architectures place more focus on data as a product, creating more standardization around metadata and more democratization of data across organizations via APIs.

The following section delves deeper into each of these storage components and data architecture types:

Types of data management systems

  • Data warehouses: A data warehouse aggregates data from different relational data sources across an enterprise into a single, central, consistent repository. After extraction, the data flows through an ETL data pipeline, undergoing various data transformations to meet the predefined data model. Once loaded into the data warehouse, the data is available to support different business intelligence (BI) and data science applications.
  • Data marts: A data mart is a focused version of a data warehouse that contains a smaller subset of data important to and needed by a single team or a select group of users within an organization, such as the HR department. Since they contain a smaller subset of data, data marts enable a department or business line to discover more-focused insights more quickly than is possible when working with the broader data warehouse data set. Data marts originally emerged in response to the difficulties organizations had setting up data warehouses in the 1990s. Integrating data from across the organization at that time required a lot of manual coding and was impractically time consuming. The more limited scope of data marts made them easier and faster to implement than centralized data warehouses.
  • Data lakes: While data warehouses store processed data, a data lake houses raw data, typically petabytes of it. A data lake can store both structured and unstructured data, which distinguishes it from other data repositories. This flexibility is particularly useful for data scientists, data engineers, and developers, allowing them to access data for data discovery exercises and machine learning projects. Data lakes were originally created in response to the data warehouse’s inability to handle the growing volume, velocity, and variety of big data. While data lakes are slower than data warehouses, they are also cheaper, as there is little to no data preparation before ingestion. Today, they continue to evolve as part of data migration efforts to the cloud. Data lakes support a wide range of use cases because the business goals for the data do not need to be defined at the time of collection; two primary ones are data science exploration and data backup and recovery. Data scientists can use data lakes for proofs of concept. Machine learning applications benefit from the ability to store structured and unstructured data in the same place, which is not possible with a relational database system. Data lakes can also be used to test and develop big data analytics projects. When the application has been developed and the useful data has been identified, the data can be exported into a data warehouse for operational use, and automation can be used to scale the application. Because they scale at low cost, data lakes are also well suited to data backup and recovery, and to storing “just in case” data for which business needs have not yet been defined; storing the data now means it will be available later as new initiatives emerge.
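
The extract-transform-load flow behind a data warehouse can be sketched in miniature. The source records, table, and column names below are invented for the example; a real pipeline would read from operational systems and write to a dedicated warehouse engine.

```python
import sqlite3

# Minimal ETL sketch (illustrative, not a production pipeline):
# extract rows from a "source", transform them to fit a predefined
# model, then load them into a warehouse-style table.

# Extract: raw records, e.g. pulled from an operational database.
source_rows = [
    {"id": 1, "amount": "19.99", "region": " emea "},
    {"id": 2, "amount": "5.00",  "region": "AMER"},
]

# Transform: enforce the types and conventions the data model defines.
def transform(row):
    return (row["id"], float(row["amount"]), row["region"].strip().upper())

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL, region TEXT)")

# Load: write the conformed rows into the central repository.
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                      [transform(r) for r in source_rows])

total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(round(total, 2))  # 24.99
```

Because the transformation happens before loading, every BI query downstream sees data that already conforms to the model.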

Types of data architectures

Data fabrics: A data fabric is an architecture that focuses on the automation of data integration, data engineering, and governance in a data value chain between data providers and data consumers. A data fabric is based on the notion of “active metadata”, which uses knowledge graphs, semantics, data mining, and machine learning (ML) technology to discover patterns in various types of metadata (for example, system logs and social data). It then applies this insight to automate and orchestrate the data value chain. For example, it can enable a data consumer to find a data product and then have that data product provisioned to them automatically. The increased data access between data products and data consumers reduces data silos and provides a more complete picture of the organization’s data. Data fabrics are an emerging technology with enormous potential, and they can be used to enhance customer profiling, fraud detection, and preventive maintenance. According to Gartner, data fabrics reduce integration design time by 30%, deployment time by 30%, and maintenance by 70%.
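
The discover-then-provision loop can be sketched in miniature. The catalog entries, tags, and locations below are all hypothetical; a real data fabric would populate this catalog from harvested metadata and ML-discovered patterns rather than a hard-coded list.

```python
# Toy sketch of the "active metadata" idea behind a data fabric:
# a catalog holds metadata about data products, and a consumer
# discovers and "provisions" one by searching that metadata.

catalog = [
    {"name": "customer_profiles", "domain": "marketing",
     "tags": {"pii", "customers"}, "location": "s3://bucket/customers/"},
    {"name": "sensor_readings", "domain": "operations",
     "tags": {"iot", "timeseries"}, "location": "s3://bucket/sensors/"},
]

def discover(tag):
    """Find data products whose metadata carries the given tag."""
    return [p for p in catalog if tag in p["tags"]]

def provision(product):
    """Stand-in for automated provisioning: hand back a locator."""
    return product["location"]

matches = discover("iot")
print([provision(p) for p in matches])  # ['s3://bucket/sensors/']
```

The automation the fabric promises lives in keeping that metadata current and acting on it without manual integration work.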

Data meshes: A data mesh is a decentralized data architecture that organizes data by business domain. With a data mesh, the organization stops thinking of data as a by-product of a process and starts treating it as a product in its own right. Data producers act as data product owners. As subject matter experts, data producers can use their understanding of the data’s primary consumers to design APIs for them. These APIs can also be accessed from other parts of the organization, providing broader access to managed data.

More traditional storage systems such as data lakes and data warehouses can be used as multiple decentralized data repositories to realize a data mesh. A data mesh can also work with a data fabric, with the data fabric’s automation enabling new data products to be created more quickly or enforcing global governance.
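
A toy sketch of domain-owned data products exposing APIs follows; the domains, class names, and figures are invented for illustration, and in practice each product would be a service with its own storage rather than an in-process class.

```python
# Illustrative data-mesh sketch: each business domain owns a data
# product and exposes it through a small API, rather than pushing
# raw data into one central store.

class SalesDataProduct:
    """Owned by the sales domain; serves curated sales data."""
    def __init__(self):
        self._orders = [
            {"order_id": 1, "region": "EMEA", "amount": 120.0},
            {"order_id": 2, "region": "AMER", "amount": 80.0},
        ]

    # The domain team designs the API around its consumers' needs.
    def revenue_by_region(self, region):
        return sum(o["amount"] for o in self._orders
                   if o["region"] == region)

class HRDataProduct:
    """Owned by the HR domain; a separate, decentralized product."""
    def __init__(self):
        self._headcount = {"EMEA": 40, "AMER": 25}

    def headcount(self, region):
        return self._headcount.get(region, 0)

# Consumers anywhere in the organization combine products via APIs,
# e.g. revenue per employee across two domains.
sales, hr = SalesDataProduct(), HRDataProduct()
print(sales.revenue_by_region("EMEA") / hr.headcount("EMEA"))  # 3.0
```

Note that neither domain needed to copy its data into the other's store; the mesh composes products through their interfaces.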

Benefits of data architectures

A well-constructed data architecture can offer businesses a number of key benefits, including:

  • Reducing redundancy: There may be overlapping data fields across different sources, resulting in the risk of inconsistency, data inaccuracies, and missed opportunities for data integration. A good data architecture can standardize how data is stored, and potentially reduce duplication, enabling better quality and holistic analyses.
  • Improving data quality: Well-designed data architectures can solve some of the challenges of poorly managed data lakes, also known as “data swamps”. A data swamp lacks the data quality and data governance practices needed to provide insightful learnings. Data architectures can help enforce data governance and data security standards, providing the oversight needed for data pipelines to operate as intended. By improving data quality and governance, data architectures can ensure that data is stored in a way that makes it useful now and in the future.
  • Enabling integration: Data has often been siloed, as a result of technical limitations on data storage and organizational barriers within the enterprise. Today’s data architectures should aim to facilitate data integration across domains, so that different geographies and business functions have access to each other’s data. That leads to a better and more consistent understanding of common metrics (such as expenses, revenue, and their associated drivers). It also enables a more holistic view of customers, products, and geographies, to better inform decision-making.
  • Data lifecycle management: A modern data architecture can address how data is managed over time. Data typically becomes less useful as it ages and is accessed less frequently. Over time, data can be migrated to cheaper, slower storage types so it remains available for reports and audits, but without the expense of high-performance storage.
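
The redundancy-reduction benefit above can be illustrated with a small deduplication sketch; the records, sources, and matching key are made up, and real entity resolution is usually far more involved than a normalized email match.

```python
# Sketch of the "reducing redundancy" benefit: two sources hold
# overlapping customer records; standardizing on a key and
# deduplicating yields one consistent view.

crm_records = [
    {"email": "ana@example.com",  "name": "Ana"},
    {"email": "ben@example.com",  "name": "Ben"},
]
billing_records = [
    {"email": "ANA@example.com",  "name": "Ana M."},  # duplicate, differing case
    {"email": "cora@example.com", "name": "Cora"},
]

def deduplicate(*sources):
    """Merge sources, keyed on a normalized email address."""
    merged = {}
    for source in sources:
        for record in source:
            key = record["email"].strip().lower()
            merged.setdefault(key, record)  # first source wins
    return merged

customers = deduplicate(crm_records, billing_records)
print(sorted(customers))
# ['ana@example.com', 'ben@example.com', 'cora@example.com']
```

A data architecture makes this kind of standardization a design rule (one agreed key per entity) rather than an after-the-fact cleanup.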
Modern data architecture

As organizations build their roadmap for tomorrow’s applications, including AI, blockchain, and Internet of Things (IoT) workloads, they need a modern data architecture that can support their data requirements.

The top seven characteristics of a modern data architecture are:

  • Cloud-native and cloud-enabled, so that the data architecture can benefit from the elastic scaling and high availability of the cloud.
  • Robust, scalable, and portable data pipelines, which combine intelligent workflows, cognitive analytics, and real-time integration in a single framework.
  • Seamless data integration, using standard API interfaces to connect to legacy applications.
  • Real-time data enablement, including validation, classification, management, and governance.
  • Decoupled and extensible, so there are no dependencies between services and open standards enable interoperability.
  • Based on common data domains, events, and microservices.
  • Optimized to balance cost and simplicity.
IBM Solutions
IBM Cloud Pak for Data

IBM Cloud Pak for Data is an open, extensible data platform that provides a data fabric to make all data available for AI and analytics, on any cloud.

Explore IBM Cloud Pak for Data
IBM Watson® Studio

Build, run and manage AI models. Prepare data and build models on any cloud using open source code or visual modeling. Predict and optimize your outcomes.

Explore IBM Watson Studio
IBM® Db2® on Cloud

Learn about Db2 on Cloud, a fully managed SQL cloud database configured and optimized for robust performance.

Explore IBM Db2 on Cloud
Resources

Create a strong data foundation for AI

Read the smartpaper on how to create a robust data foundation for AI by focusing on three key data management areas: access, governance, and privacy and compliance.

Read the IBV report

Data fabric can help businesses investing in AI, machine learning, Internet of Things, and edge computing get more value from their data.
