What is a data warehouse?

Aerial view of 3D rendered automated guided vehicles in smart warehouse

Authors

Alexandra Jonker

Staff Editor

IBM Think

Jim Holdsworth

Staff Writer

IBM Think

Matthew Kosinski

Staff Editor

IBM Think

What is a data warehouse?

A data warehouse aggregates data from various sources into a central data store optimized for querying and analysis. It typically uses extract, transform and load (ETL) or extract, load, and transform (ELT) processes to clean, prepare and organize data for business intelligence (BI) and other data analytics use cases.
 

Data warehousing systems can ingest large amounts of data from a wide range of source systems, including operational databases, transactional systems and customer relationship management (CRM) platforms. Self-service analytics tools empower business users to explore and analyze this data for valuable insights.

The concept of a data warehouse emerged in the 1980s to integrate disparate data into a consistent format for analysis. As the number of new data sources surged—such as the World Wide Web, social media and the Internet of Things (IoT)—the demand for larger storage capacity and faster analysis grew.

Data warehouses are configured and optimized for near real-time analytics, which means they are typically not ideal for storing massive amounts of raw, unstructured big data. As the amount of data in a warehouse grows, the cost and complexity of storage grows with it. Latency and performance issues can also arise.

In response, more flexible alternatives have evolved, such as cloud-native data warehouses and data lakehouses. (See “data warehouse vs. data lakehouse” for more information.)

How does data warehousing work?

Data warehouses often use a three-tier architecture designed to transform data for analytics:

  • Bottom tier
  • Middle tier
  • Top tier

Bottom tier

Data flows from multiple source systems into a data warehouse server, where it is stored. Traditionally, the data moves through an extract, transform, load (ETL) data integration process, which uses automation to clean and organize data before loading it into the warehouse.

Because data warehouses primarily store structured data, data transformation occurs before data loading. Some modern warehouses use extract, load, transform (ELT) processes instead—which load data into the warehouse before transforming it. This method is commonly used in data lakes, which can store unstructured and semi-structured data without requiring standardized formats.

Middle tier

This tier contains the analytics engine, often powered by an online analytical processing (OLAP) system. While traditional relational databases—including many data warehouses—can store multidimensional data (for example, sales figures might have several dimensions such as location, time and product), they are not optimized for multidimensional queries.

OLAP systems are designed for high-speed, complex queries and multidimensional analysis on large volumes of data. They use “cubes” (array-based multidimensional data structures) to enable faster, more flexible analysis across multiple dimensions. Common use cases include data mining, financial analysis, budgeting and forecast planning.

Diagram showing the structure of an OLAP cube An OLAP cube with dimensions for Products, Sales Regions and Quarters

OLAP vs. OLTPOnline transaction processing (OLTP) systems capture and update large volumes of real-time transactions from many users. In contrast, OLAP systems analyze data that has already been captured.

There are three types of OLAP one might use in a data warehouse:

  • Multidimensional online analytical processing (MOLAP): Works directly with a multidimensional OLAP cube and is typically the fastest and most practical type of multidimensional data analysis.

  • Relational online analytical processing (ROLAP): Performs multidimensional data analysis that operates directly on data in relational tables, without first reorganizing it into a cube.

  • Hybrid online analytical processing (HOLAP): Functions to create the optimal division of labor between relational and multidimensional databases within a single OLAP architecture.

Top tier

The final layer of a data warehouse provides a front-end user interface for reporting, dashboards and ad hoc data analysis of business data. These self-service business intelligence tools allow users to generate reports based on historical data, visualize trends and identify workflow bottlenecks without technical data engineering expertise.

The latest tech news, backed by expert insights

Stay up to date on the most important—and intriguing—industry trends on AI, automation, data and beyond with the Think newsletter. See the IBM Privacy Statement.

Thank you! You are subscribed.

Your subscription will be delivered in English. You will find an unsubscribe link in every newsletter. You can manage your subscriptions or unsubscribe here. Refer to our IBM Privacy Statement for more information.

Data warehouse deployment models: On-premises, cloud and hybrid

Data warehouses have evolved significantly, moving from exclusively on‑premises systems to flexible cloud and hybrid models.

Traditional data warehouses

Historically, data warehouses were hosted on-premises using commodity hardware. These systems were organized in either Massively Parallel Processing (MPP) or Symmetric Multiprocessing (SMP) architectures. They were also delivered as standalone appliances. These deployments require significant investment. However, they may be a strong choice for organizations in sectors with strict compliance, data security or data privacy standards.

Cloud data warehouses

Today, many data warehouses are built to run in the cloud. They offer the benefits of cloud computing such as data storage at petabyte scale, highly scalable compute and storage, and pay-as-you-go pricing. Cloud-based data warehouses are typically delivered as a fully managed software as a service (SaaS) offering, eliminating the need for upfront investment in hardware or software. This service delivery also reduces the resources needed for infrastructure management so organizations can focus on analytics and insights.

Cloud-based data warehouses have grown in popularity as organizations seek the agility to scale and reduce their on-premises data center footprints and legacy infrastructure spend.

Hybrid approaches

Some organizations might adopt a hybrid model, which combines the best of both on-premises and cloud data warehouses. This approach allows them to take advantage of the scalability and flexibility of cloud while retaining control of sensitive workloads that must remain on-premises.

What are the three schemas in a data warehouse?

In a data warehouse, schemas define how data is organized. There are three common schema structures: the star schema, the snowflake schema and the galaxy schema (also called a fact constellation schema).

These schemas are all dimensional data models designed to optimize data retrieval speeds in OLAP systems. Dimensional models can increase redundancy, making it easier to locate information for reporting and retrieval and improving query performance.

These schemas contain fact tables and dimension tables, defined below:

  • Fact tables: Store quantitative data such as products sold or revenue amounts

  • Dimension tables: Store contextual, descriptive information for facts, such as dates of sale and product category

Star schema

Star schemas consist of a single, central fact table surrounded by dimension tables. In a diagram, the fact table appears in the middle of a star pattern. The star schema is considered the simplest and most common type of schema, offering users faster querying speeds.

Graphic depicting a star schema Example of a star schema

Snowflake schema

A snowflake schema has a central fact table connected to many normalized dimension tables, which may then connect to other dimension tables through many-to-one relationships. This complex, branching pattern can resemble a snowflake. Snowflake schemas have low levels of data redundancy, but this comes at the cost of slower query performance.

Example of a snowflake schema Example of a snowflake schema

Galaxy schema

Just as our galaxy contains many stars, a galaxy schema contains many star schemas that share dimension tables that are normalized to avoid redundancy. The galaxy schema is best suited for highly complex data warehouses, but users may experience lower performance.

Example of a galaxy schema Example of a galaxy schema

Components of data warehouse architecture

A typical data warehouse architecture has several components that work together to store, manage, process and deliver data for analytics.

  • ETL/ELT tools
  • API layer
  • Data layer (central database)
  • Metadata
  • Sandbox
  • Access tools

ETL/ELT tools

ETL tools extract data from source systems, transform it in a staging area and load it into the data warehouse. In ELT, the data is transformed after being loaded in the warehouse. A data processing framework tool, such as Apache Spark, can help manage the transformation of data.

API layer

A connectivity layer for application programming interfaces (APIs) can help the warehouse pull data from and integrate with operational systems. APIs can also provide access to visualization and advanced analytics tools.

Data layer (or central database)

The data layer (or central database) is the heart of the data warehouse. Here, the system integrates and stores data from various sources, such as business applications, email lists, websites or other databases. This layer is supported by ETL or ELT data pipelines and uses a relational database management system (RDBMS) or cloud data warehouse platform. Built-in data governance and security capabilities can partition data so users only access what they need.

Metadata is data about data, describing the data that is stored in a system to make it searchable and usable for analytics. It includes technical metadata (such as table structure and data type) and descriptive metadata (such as author, creation date or file size). Metadata is key to effective data governance and data management.

Sandbox

Some data warehouses provide a sandbox, which is a walled-off testing environment containing a copy of the production data and relevant analysis tools. Data analysts and data scientists can experiment with new analytical techniques in the sandbox without impacting the live operations of the data warehouse for other users.

Access tools

Access tools connect to data warehouses and provide a business user-friendly front end. Business users and data analysts can use dashboards, apps and data visualization tools to interact with data and extract insights. Examples of these tools include Tableau, Looker and Qlik.

AI Academy

Is data management the secret to generative AI?

Explore why high-quality data is essential for the successful use of generative AI.

Types of data warehouses

There are three main types of data warehouses:

  • Enterprise data warehouse (EDW)
  • Operational data store (ODS)
  • Data mart

Enterprise data warehouse (EDW)

An enterprise data warehouse (EDW) is a data warehouse that services the entire enterprise. It acts as a centralized information repository of historical data for all teams and subject areas. An enterprise data warehousing environment may also include an operational data store (ODS) and department-specific data marts.

Operational data store (ODS)

An operational data store (ODS) contains the most recent snapshot of operational data. An ODS is updated frequently, enabling quick access to near-real-time data. Organizations often use an ODS for day-to-day operational decision making and real-time analysis. It can also be a data source for an EDW or other data systems.

Data mart

A data mart is a subset of an existing data warehouse (or other data sources) and contains data tailored to a specific business line or department rather than the entire enterprise. For example, a company may have a data mart that aligns with the marketing department. Those users can access more-focused insights for customer segmentation and campaign performance without navigating the broader enterprise data set.

Data warehouses vs. other types of data storage

The terms data warehouse, database, data lake and data lakehouse are sometimes used interchangeably, but important differences exist.

Data warehouse vs. database

A database is like a filing cabinet built primarily for automated data capture and fast transaction processing. It typically serves as the focused data store for a specific application. Data warehouses store data from any number of applications in an organization and are optimized for predictive analytics and other advanced analysis.

Data warehouse vs. data lake

A data lake is a low-cost storage solution for massive amounts of raw data and use a schema-on-read approach rather than a predefined schema. Data lakes can store structured data, unstructured data and semi-structured data, such as documents, videos, Internet of Things (IoT) logs and social media posts.

They can be built on big data platforms such as Apache Hadoop or cloud object storage services such as Amazon Simple Storage Service (Amazon S3). They don’t typically clean, validate or normalize data for analytics, as a warehouse does.

Data lakehouse vs. data warehouse

A data lakehouse merges aspects of data warehouses and data lakes—namely, the low-cost flexibility of a lake and the high performance of a warehouse. By combining the key features of lakes and warehouses into one data solution, lakehouses can help accelerate data processing for large amounts of structured, unstructured and real-time data streams to support machine learning, data science and artificial intelligence (AI) workloads.

Data lakehouses might also add features such as shared metadata and distributed structured query language (SQL) engines.

Data warehouse benefits

Data warehouses make insights and information available to users across an organization, offering many benefits, such as:

  • Improved data quality
  • Support for AI and machine learning
  • Enhanced decision support

Improved data quality

Through ELT or ETL processes, data warehouses prepare incoming data before it is stored in the warehouse. This preparation includes data quality methods such as data cleansing, standardization and deduplication. Robust data governance policies and practices can also help ensure data accuracy and integrity for all users.

By integrating high-quality data into a single store, organizations create a comprehensive and reliable single source of truth—which helps to eliminate data silos. This central repository enables business users to confidently access all of the organization’s pertinent data and use it for business decision-making. An enterprise-grade data warehouse might also include support for open source formats, such as Apache Iceberg, Parquet and CSV, enabling further data access and sharing across the enterprise.

Support for AI and machine learning

Modern data warehouses can support various AI and machine learning workflows by providing clean, reliable data. Data scientists can use cleansed and validated warehouse data to build proprietary generative AI models or fine-tune existing models to better serve their unique business needs.

An AI-ready data warehouse should be able to collect, cleanse, organize and structure data, as well as facilitate the flow of data to AI and machine learning platforms. However, not all modern data warehouses are optimized for AI workloads. Data lakehouses are increasingly becoming the data platforms of choice for AI infrastructure.

Enhanced decision support

A data warehouse centralizes and cleanses data from different sources to create a single source of truth, giving organizations a comprehensive, reliable view of enterprise data. Self-service BI tools enable users across the enterprise to access and run analytical queries on this aggregated data.

In this way, data warehouses make it possible for business users of any technical skill level to discover and report on themes, trends and aggregations. Business leaders can use these insights to make better-informed decisions and forecasts based on hard evidence in virtually every area of the organization, from business processes to financial management and inventory management.

Industry-specific data warehouse use cases

Data warehouses can serve industry-specific uses, such as:

Government

The analytical capabilities of a data warehouse can help governments better understand complex phenomena such as crime, demographic trends and traffic patterns.

Healthcare

The ability to centralize and analyze disparate data—such as billing and diagnostic codes, patient demographics, medications and test results—can help healthcare providers gain insight into patient outcomes, operational efficiencies and more.

Travel and hospitality

Organizations can use historical data pertaining to travel and accommodation choices to more precisely target advertising and promotions to their customers.

Manufacturing

Large manufacturing companies that generate huge volumes of data can use data warehouse solutions to build out data marts tailored to the needs of each department.

Related solutions
IBM StreamSets

Create and manage smart streaming data pipelines through an intuitive graphical interface, facilitating seamless data integration across hybrid and multicloud environments.

Explore StreamSets
IBM® watsonx.data™

Watsonx.data enables you to scale analytics and AI with all your data, wherever it resides, through an open, hybrid and governed data store.

Discover watsonx.data
Data and analytics consulting services

Unlock the value of enterprise data with IBM Consulting®, building an insight-driven organization that delivers business advantage.

Discover analytics services
Take the next step

Unify all your data for AI and analytics with IBM® watsonx.data™. Put your data to work, wherever it resides, with the hybrid, open data lakehouse for AI and analytics.

Discover watsonx.data Explore data management solutions