A data architecture provides a high-level view of how different data management systems work together. These systems include a number of data storage repositories, such as data lakes, data warehouses, data marts, and databases. Together, these repositories can form data architectures, such as data fabrics and data meshes, which are growing in popularity. These architectures place more focus on data as products, creating more standardization around metadata and more democratization of data across organizations via APIs.

The following section delves deeper into each of these storage components and data architecture types:

Types of data management systems

Data warehouses: A data warehouse aggregates data from different relational data sources across an enterprise into a single, central, consistent repository. After extraction, the data flows through an ETL (extract, transform, load) data pipeline, undergoing various transformations to conform to the predefined data model. Once loaded into the data warehouse, the data supports different business intelligence (BI) and data science applications.
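The extract, transform, load flow described above can be sketched in a few lines. This is a minimal illustration only, assuming an in-memory SQLite database as a stand-in for the warehouse; the source rows, the `sales` table, and the `transform` and `run_etl` helpers are hypothetical names invented for the example.

```python
import sqlite3

# Hypothetical rows "extracted" from two source systems with different conventions.
crm_rows = [{"customer": " Alice ", "amount": "120.50", "currency": "usd"}]
erp_rows = [{"customer": "BOB", "amount": "80", "currency": "USD"}]


def transform(row):
    """Conform one raw source row to the predefined warehouse model."""
    return (
        row["customer"].strip().title(),  # normalize name formatting
        float(row["amount"]),             # amounts arrive as strings
        row["currency"].upper(),          # one casing for currency codes
    )


def run_etl(conn, sources):
    """Load transformed rows from every source into a single 'sales' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, currency TEXT)"
    )
    for rows in sources:
        conn.executemany(
            "INSERT INTO sales VALUES (?, ?, ?)", [transform(r) for r in rows]
        )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
run_etl(conn, [crm_rows, erp_rows])
warehouse_rows = conn.execute(
    "SELECT customer, amount, currency FROM sales ORDER BY customer"
).fetchall()
print(warehouse_rows)
```

Because the transformation runs before loading, every row in the repository already matches the model, which is what lets BI tools query it consistently.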

Data marts: A data mart is a focused version of a data warehouse that contains a smaller subset of data needed by a single team or a select group of users within an organization, such as the HR department. Because they contain a smaller subset of data, data marts enable a department or business line to discover more focused insights more quickly than is possible when working with the broader data warehouse data set. Data marts originally emerged in response to the difficulties organizations had setting up data warehouses in the 1990s. Integrating data from across the organization at that time required a lot of manual coding and was impractically time-consuming. The more limited scope of data marts made them easier and faster to implement than centralized data warehouses.
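One simple way to picture a data mart is as a department-specific slice of the warehouse. The sketch below assumes an in-memory SQLite database and models the mart as a SQL view; the `warehouse_facts` table, the `hr_mart` view, and the sample values are hypothetical, and a real data mart is often a separate physical store rather than a view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the enterprise warehouse
conn.execute(
    "CREATE TABLE warehouse_facts (department TEXT, metric TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_facts VALUES (?, ?, ?)",
    [
        ("HR", "headcount", 42.0),
        ("HR", "attrition_rate", 0.08),
        ("Sales", "revenue", 1250000.0),
        ("Finance", "opex", 300000.0),
    ],
)

# The "mart": only the subset of warehouse data the HR team needs.
conn.execute(
    "CREATE VIEW hr_mart AS "
    "SELECT metric, value FROM warehouse_facts WHERE department = 'HR'"
)

hr_rows = conn.execute("SELECT metric, value FROM hr_mart ORDER BY metric").fetchall()
print(hr_rows)
```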

Data lakes: While data warehouses store processed data, a data lake houses raw data, typically petabytes of it. A data lake can store both structured and unstructured data, which sets it apart from other data repositories. This flexibility in storage requirements is particularly useful for data scientists, data engineers, and developers, allowing them to access data for data discovery exercises and machine learning projects. Data lakes were originally created as a response to the data warehouse's failure to handle the growing volume, velocity, and variety of big data. While data lakes are slower than data warehouses, they are also cheaper because there is little to no data preparation before ingestion. Today, they continue to evolve as part of data migration efforts to the cloud.

Data lakes support a wide range of use cases because the business goals for the data do not need to be defined at the time of data collection. Two primary use cases are data science exploration and data backup and recovery.

Data scientists can use data lakes for proofs of concept. Machine learning applications benefit from the ability to store structured and unstructured data in the same place, which is not possible using a relational database system. Data lakes can also be used to test and develop big data analytics projects. Once the application has been developed and the useful data has been identified, the data can be exported into a data warehouse for operational use, and automation can be used to make the application scale.

Data lakes can also be used for data backup and recovery because they scale at a low cost. For the same reason, data lakes are good for storing "just in case" data, for which business needs have not yet been defined. Storing the data now means it will be available later as new initiatives emerge.
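The idea that a lake accepts raw data first and defers decisions about structure until the data is used can be illustrated as follows. This is a rough sketch, assuming a temporary local directory as a stand-in for lake storage; the `ingest` and `read_clicks` helpers and the file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path

lake = Path(tempfile.mkdtemp())  # stand-in for low-cost lake storage


def ingest(name: str, payload: bytes) -> Path:
    """Store bytes exactly as received; no schema or validation at write time."""
    path = lake / name
    path.write_bytes(payload)
    return path


# Semi-structured events and free text land side by side in the same store.
ingest(
    "clicks.jsonl",
    b'{"user": "u1", "page": "/home"}\n'
    b'{"user": "u2", "page": "/pricing", "ref": "ad"}\n',
)
ingest("support_note.txt", "Customer reports login failure.".encode("utf-8"))


def read_clicks(fields):
    """Apply a schema only at read time, keeping the fields this analysis needs."""
    records = []
    with open(lake / "clicks.jsonl", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                event = json.loads(line)
                records.append({f: event.get(f) for f in fields})
    return records


clicks = read_clicks(["user", "ref"])
print(clicks)
```

Nothing about the `ref` field had to be known when the events were stored; a later analysis simply asks for it, and records that lack it yield `None`.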

Types of data architectures

Data fabrics: A data fabric is an architecture that focuses on the automation of data integration, data engineering, and governance in a data value chain between data providers and data consumers. A data fabric is based on the notion of “active metadata,” which uses knowledge graphs, semantics, data mining, and machine learning (ML) technology to discover patterns in various types of metadata (for example, system logs, social, and more). It then applies this insight to automate and orchestrate the data value chain. For example, it can enable a data consumer to find a data product and then have that data product provisioned to them automatically. The increased data access between data products and data consumers reduces data silos and provides a more complete picture of the organization’s data. Data fabrics are an emerging technology with enormous potential, and they can be used to enhance customer profiling, fraud detection, and preventative maintenance. According to Gartner, data fabrics reduce integration design time by 30%, deployment time by 30%, and maintenance by 70%.
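The "find a data product, then have it provisioned automatically" example can be reduced to a toy metadata lookup. The sketch below is not a data fabric implementation: it assumes a hand-written in-memory catalog in place of active metadata discovered by ML, and the `DataProduct` class, the tags, and the `find` and `provision` functions are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    name: str
    tags: set                                    # metadata describing the product
    consumers: set = field(default_factory=set)  # who has been granted access


# Hypothetical catalog; a real fabric would build this from discovered metadata.
catalog = [
    DataProduct("customer_profiles", {"customer", "crm"}),
    DataProduct("payment_events", {"payments", "fraud"}),
]


def find(tag):
    """Discover data products whose metadata carries the given tag."""
    return [p for p in catalog if tag in p.tags]


def provision(consumer, tag):
    """Grant the consumer access to every matching product, with no manual step."""
    matches = find(tag)
    for product in matches:
        product.consumers.add(consumer)
    return [p.name for p in matches]


granted = provision("fraud-team", "fraud")
print(granted)
```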

Data meshes: A data mesh is a decentralized data architecture that organizes data by business domain. Adopting a data mesh requires the organization to stop thinking of data as a by-product of a process and start thinking of it as a product in its own right. Data producers act as data product owners. As subject matter experts, data producers can use their understanding of the data’s primary consumers to design APIs for them. These APIs can also be accessed from other parts of the organization, providing broader access to managed data.
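The domain-ownership model can be sketched as a registry of data products, each published by its owning team behind a small read API. This is an illustrative sketch only; the `DataProduct` class, the `publish` and `consume` functions, the `sales` domain, and the sample order are hypothetical, and a real mesh would expose these APIs over a network rather than as in-process functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class DataProduct:
    domain: str                       # business domain that owns the data
    owner: str                        # producing team acting as product owner
    read: Callable[[], List[dict]]    # the API the owner designs for consumers


def sales_orders_api() -> List[dict]:
    """API designed by the sales domain for its primary consumers."""
    return [{"order_id": 1, "total": 99.0}]


mesh: Dict[str, DataProduct] = {}     # no central store, just published products


def publish(product: DataProduct) -> None:
    mesh[product.domain] = product


def consume(domain: str) -> List[dict]:
    """Any part of the organization reads through the owner's API."""
    return mesh[domain].read()


publish(DataProduct("sales", "sales-data-team", sales_orders_api))
orders = consume("sales")
print(orders)
```

Consumers never touch the sales team's underlying storage directly, so the owning domain can change how the data is stored without breaking other teams.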

More traditional storage systems such as data lakes and data warehouses can be used as multiple decentralized data repositories to realize a data mesh. A data mesh can also work with a data fabric, with the data fabric’s automation enabling new data products to be created more quickly or enforcing global governance.