Power the agentic enterprise Watch the Think Keynote

What is a data warehouse?

What is a data warehouse?

A data warehouse aggregates data from various sources into a central data store optimized for querying and analysis. It typically uses extract, transform and load (ETL) or extract, load and transform (ELT) processes to clean, prepare and organize data for business intelligence (BI) and other data analytics use cases.
 

Data warehousing systems can integrate large amounts of data from a wide range of source systems, including operational databases, transactional systems and customer relationship management (CRM) platforms. Self-service analytics tools empower business users to explore and analyze this data for valuable insights.

The concept of a data warehouse emerged in the 1980s to integrate disparate data into a consistent format for analysis. As the number of new data sources surged, such as the World Wide Web, social media and the Internet of Things (IoT), the demand for larger storage capacity and faster analysis grew.

Data warehouses are configured and optimized for near real-time analytics, which means they are typically not ideal for storing massive amounts of raw, unstructured big data. As the amount of data in a warehouse grows, the cost and complexity of storage grows with it. Latency and performance issues can also arise.

In response, more flexible alternatives have evolved, such as cloud-native data warehouses and data lakehouses. For more information, see “data warehouse versus data lakehouse” for more information.

How does data warehousing work?

Data warehouses often use a three-tier architecture designed to transform data for analytics:

  • Bottom tier
  • Middle tier
  • Top tier

Bottom tier

Data flows from multiple source systems into a data warehouse server, where it is stored. Traditionally, the data moves through an extract, transform, load (ETL) data integration process, which uses automation to clean and organize data before loading it into the warehouse.

Because data warehouses primarily store structured data, data transformation occurs before data loading. Some modern warehouses use extract, load, transform (ELT) processes instead—which load data into the warehouse before transforming it. This method is commonly used in data lakes, which can store unstructured and semi-structured data without requiring standardized formats.

Middle tier

This tier contains the analytics engine, often powered by an online analytical processing (OLAP) system. While traditional relational databases (including many data warehouses) can store multidimensional data (sales figures might have several dimensions such as location, time and product), they are not optimized for multidimensional queries.

OLAP systems are designed for high-speed, complex queries and multidimensional analysis on large volumes of data. They use “cubes” (array-based multidimensional data structures) to enable faster, more flexible analysis across multiple dimensions. Common use cases include data mining, financial analysis, budgeting and forecast planning.

Diagram showing the structure of an OLAP cube An OLAP cube with dimensions for products, sales regions and quarters

OLAP versus OLTPOnline transaction processing (OLTP) systems capture and update large volumes of real-time transactions from many users. In contrast, OLAP systems analyze data that has already been captured.

There are three types of OLAP one might use in a data warehouse:

  • Multidimensional online analytical processing (MOLAP): Works directly with a multidimensional OLAP cube and is typically the fastest and most practical type of multidimensional data analysis.

  • Relational online analytical processing (ROLAP): Performs multidimensional data analysis that operates directly on data in relational tables, without first reorganizing it into a cube.

  • Hybrid online analytical processing (HOLAP): Functions to create the optimal division of labor between relational and multidimensional databases within a single OLAP architecture.

Top tier

The final layer of a data warehouse provides a front-end user interface for reporting, dashboards and ad hoc data analysis of business data. These self-service business intelligence tools allow users to generate reports based on historical data, visualize trends and identify workflow bottlenecks without technical data engineering expertise.

Data warehouse deployment models: On-premises, cloud and hybrid

Data warehouses have evolved significantly, moving from exclusively on‑premises systems to flexible cloud and hybrid models.

Traditional data warehouses

Historically, data warehouses were hosted on-premises using commodity hardware. These systems were organized in either massively parallel processing (MPP) or symmetric multiprocessing (SMP) architectures. They were also delivered as stand-alone appliances. These deployments require significant investment. However, they can be a strong choice for organizations in sectors with strict compliance, data security or data privacy standards.

Cloud data warehouses

Today, many data warehouses are built to run in the cloud. They offer the benefits of cloud computing such as data storage at petabyte scale, highly scalable compute and storage, and pay-as-you-go pricing. Cloud-based data warehouses are typically delivered as a fully managed software as a service (SaaS) offering, eliminating the need for upfront investment in hardware or software.

This service delivery also reduces the resources needed for infrastructure management so organizations can focus on analytics and insights. Cloud-based data warehouses have grown in popularity as organizations seek the agility to scale and reduce their on-premises data center footprints and legacy infrastructure costs.

Hybrid approaches

Some organizations might adopt a hybrid model, which combines the best of both on-premises and cloud data warehouses. This approach allows them to take advantage of the scalability and flexibility of cloud while retaining control of sensitive workloads that must remain on-premises.

What are the three schemas in a data warehouse?

In a data warehouse, schemas define how data is organized. There are three common schema structures: the star schema, the snowflake schema and the galaxy schema (also called a fact constellation schema).

These schemas are all dimensional data models designed to optimize data retrieval speeds in OLAP systems. Dimensional models can increase redundancy, making it easier to locate information for reporting and retrieval and improving query performance.

These schemas contain fact tables and dimension tables, defined as follows:

  • Fact tables: Store quantitative data such as products sold or revenue amounts

  • Dimension tables: Store contextual, descriptive information for facts, such as dates of sale and product category

Star schema

Star schemas consist of a single, central fact table surrounded by dimension tables. In a diagram, the fact table appears in the middle of a star pattern. The star schema is considered the simplest and most common type of schema, offering users faster querying speeds.

Graphic depicting a star schema Example of a star schema

Snowflake schema

A snowflake schema places a central fact table at the core, with numerous normalized dimension tables radiating outward and those dimensions extend further to other dimension tables through many‑to‑one relationships. This complex, branching pattern can resemble a snowflake. Snowflake schemas have low levels of data redundancy, but this benefit comes at the cost of slower query performance.

Example of a snowflake schema Example of a snowflake schema

Galaxy schema

Just as a galaxy contains many stars, a galaxy schema contains multiple star schemas. These schemas share dimension tables that are normalized to reduce redundancy. The galaxy schema is best suited for highly complex data warehouses, but users can experience lower performance.

Example of a galaxy schema Example of a galaxy schema

Components of data warehouse architecture

A typical data warehouse architecture has several components that work together to store, manage, process and deliver data for analytics.

  • ETL/ELT tools
  • API layer
  • Data layer (central database)
  • Metadata
  • Sandbox
  • Access tools

ETL/ELT tools

ETL tools extract data from source systems, transform it in a staging area and load it into the data warehouse. In ELT, the data is transformed after being loaded in the warehouse. A data processing framework tool, such as Apache Spark, can help manage the transformation of data.

API layer

A connectivity layer for application programming interfaces (APIs) can help the warehouse pull data from and integrate with operational systems. APIs can also provide access to visualization and advanced analytics tools.

Data layer (or central database)

The data layer (or central database) is the heart of the data warehouse. Here, the system integrates and stores data from various sources, such as business applications, email lists, websites or other databases. ETL or ELT data pipelines support this layer and a relational database management system (RDBMS) or cloud data warehouse platform powers it. Built-in data governance and security capabilities can partition data so users access what they need.

Metadata is data about data, describing the data that is stored in a system to make it searchable and usable for analytics. It includes technical metadata (such as table structure and data type) and descriptive metadata (such as author, creation date or file size). Metadata is key to effective data governance and data management.

Sandbox

Some data warehouses provide a sandbox, which is a walled-off testing environment containing a copy of the production data and relevant analysis tools. Data analysts and data scientists can test new analytical techniques in the sandbox without affecting live warehouse operations.

Access tools

Access tools connect to data warehouses and provide an accessible front end. Business users and data analysts can use dashboards, apps and data visualization tools to interact with data and extract insights. Examples of these tools include Tableau, Looker and Qlik.

Think Keynotes

Power the agentic enterprise

Understand how AI-ready data platforms enable real-time insights and execution, while supporting secure, sovereign deployment across environments.

Types of data warehouses

There are three main types of data warehouses:

  • Enterprise data warehouse (EDW)
  • Operational data store (ODS)
  • Data mart

Enterprise data warehouse (EDW)

An enterprise data warehouse (EDW) is a data warehouse that services the entire enterprise. It acts as a centralized information repository of historical data for all teams and subject areas. An enterprise data warehousing environment can also include an operational data store (ODS) and department-specific data marts.

Operational data store (ODS)

An operational data store (ODS) contains the most recent snapshot of operational data. An ODS is updated frequently, enabling quick access to near-real-time data. Organizations often use an ODS for day-to-day operational decision-making and real-time analysis. It can also be a data source for an EDW or other data systems.

Data mart

A data mart is a subset of an existing data warehouse (or other data sources) and contains data tailored to a specific business line or department rather than the entire enterprise. For example, a company can have a data mart that aligns with the marketing department. Those users can access more-focused insights for customer segmentation and campaign performance without navigating the broader enterprise data set.

Data warehouses versus other types of data storage

The terms data warehouse, database, data lake and data lakehouse are sometimes used interchangeably, but important differences exist.

Data warehouse versus database

A database is like a filing cabinet built primarily for automated data capture and fast transaction processing. It typically serves as the focused data store for a specific application. Data warehouses store data from any number of applications in an organization and are optimized for predictive analytics and other advanced analysis.

Data warehouse versus data lake

A data lake is a low-cost storage solution for massive amounts of raw data and uses a schema-on-read approach rather than a predefined schema. Data lakes can store structured data, unstructured data and semi-structured data, such as documents, videos, IoT logs and social media posts.

They can be built on big data platforms such as Apache Hadoop or cloud object storage services such as Amazon Simple Storage Service (Amazon S3). They don’t typically clean, validate or normalize data for analytics, as a warehouse does.

Data lakehouse versus data warehouse

data lakehouse combines the strengths of data warehouses and data lakes, delivering the low‑cost flexibility of a lake alongside the high performance of a warehouse. By combining the key features of lakes and warehouses into a single data platform, lakehouses can accelerate processing for large volumes of structured, unstructured and real-time data.

They also support machine learningdata science and artificial intelligence (AI) workloads more efficiently. Data lakehouses might also add features such as shared metadata and distributed structured query language (SQL) engines.

Data warehouse benefits

Data warehouses make insights and information available to users across an organization, offering many benefits, such as:

  • Improved data quality
  • Support for AI and machine learning
  • Enhanced decision support

Improved data quality

Through ELT or ETL processes, data warehouses prepare incoming data before it is stored in the warehouse. This preparation includes data quality methods such as data cleansing, standardization and deduplication. Robust data governance policies and practices can also help ensure data accuracy and integrity for all users.

By integrating high-quality data into a single store, organizations create a comprehensive and reliable single source of truth—which helps to eliminate data silos. This central repository enables business users to confidently access the organization’s pertinent data and use it for business decision‑making. An enterprise-grade data warehouse might also include support for open source formats, such as Apache Iceberg, Parquet and CSV, enabling further data access and sharing across the enterprise.

Support for AI and machine learning

Modern data warehouses can support various AI and machine learning workflows by providing clean, reliable data. Data scientists can use cleansed and validated warehouse data to build proprietary generative AI models or fine-tune existing models to better serve their unique business needs.

An AI-ready data warehouse should be able to collect, cleanse, organize and structure data and facilitate the flow of data to AI and machine learning platforms. However, not all modern data warehouses are optimized for AI workloads. Data lakehouses are increasingly becoming the data platforms of choice for AI infrastructure.

Enhanced decision support

A data warehouse centralizes and cleanses data from different sources to create a single source of truth, giving organizations a comprehensive, reliable view of enterprise data. Self-service BI tools enable users across the enterprise to access and run analytical queries on this aggregated data.

In this way, data warehouses make it possible for business users of any technical skill level to discover and report on themes, trends and aggregations. Business leaders can use these insights to make better-informed decisions and forecasts based on hard evidence in virtually every area of the organization, from business processes to financial management and inventory management.

Industry-specific data warehouse use cases

Data warehouses can serve industry-specific uses, such as:

Government

The analytical capabilities of a data warehouse can help governments better understand complex phenomena such as crime, demographic trends and traffic patterns.

Healthcare

The ability to centralize and analyze disparate data—such as billing and diagnostic codes, patient demographics, medications and test results—can give healthcare providers deeper insight. These insights help them understand patient outcomes, improve operational efficiency and more.

Travel and hospitality

Organizations can use historical data pertaining to travel and accommodation choices to more precisely target advertising and promotions to their customers.

Manufacturing

Large manufacturing companies that generate huge volumes of data can use data warehouse solutions to build out data marts tailored to the needs of each department.

Frequently asked questions about data warehouses

Do I need a data warehouse?

A data warehouse can be a smart choice if your organization aggregates large amounts of data from multiple operational systems, such as business applications (BI), websites and other databases. It is especially useful when you plan to perform complex historical analysis with BI tools or dashboards.

How do I optimize data warehouse costs?

For cost optimization, look to architectures that separate data and compute resources, allowing you to scale them separately. You can also use cost‑effective cloud object storage and AI‑powered workload management for automated resource distribution. Open data formats make it easier to share data across warehouses and lakehouses, reducing storage costs and complexity.

How do I handle data quality issues in a data warehouse?

Strong ETL/ELT processes for data cleansing and standardization, robust data governance policies and data observability to detect issues as they emerge can help with data quality problems. A “shift left” approach can also help to detect and resolve data quality issues closer to the root cause rather than downstream.

How is a data warehouse different from a database?

A database is built primarily for fast transaction processing and typically serves as the data storage for a specific application. A data warehouse aggregates large volumes of data from various sources, then cleans and prepares it for business intelligence, analytical queries and other advanced data analysis.

Who should own the data warehouse?

Data engineers build and maintain the infrastructure, while a chief data officer sets data strategy and oversees data management functions. Business intelligence teams manage semantic layers and dashboards and cross‑functional data governance teams help ensure data quality and security.

Authors

Alexandra Jonker

Staff Editor

IBM Think

Jim Holdsworth

Staff Writer

IBM Think

Matthew Kosinski

Staff Editor

IBM Think

Related solutions
IBM® watsonx.data®

Watsonx.data enables you to scale analytics and AI with all your data, wherever it resides, through an open, hybrid and governed data store.

Discover watsonx.data
Data lake solutions

Power your applications, analytics and AI with any data in an open data lakehouse

Discover data lake solutions
Data and AI consulting services

Successfully scale AI with the right strategy, data, security and governance in place.

Explore data and AI consulting services
Take the next step

Unify all your data for AI and analytics with IBM watsonx.data®. Put your data to work, wherever it resides, with the hybrid, open data lakehouse for AI and analytics.

  1. Discover watsonx.data
  2. Explore data for AI solutions