Data warehousing systems can ingest large amounts of data from a wide range of source systems, including operational databases, transactional systems and customer relationship management (CRM) platforms. Self-service analytics tools empower business users to explore and analyze this data for valuable insights.
The concept of a data warehouse emerged in the 1980s to integrate disparate data into a consistent format for analysis. As the number of new data sources surged—such as the World Wide Web, social media and the Internet of Things (IoT)—the demand for larger storage capacity and faster analysis grew.
Data warehouses are configured and optimized for near real-time analytics, which means they are typically not ideal for storing massive amounts of raw, unstructured big data. As the amount of data in a warehouse grows, the cost and complexity of storage grow with it. Latency and performance issues can also arise.
In response, more flexible alternatives have evolved, such as cloud-native data warehouses and data lakehouses. (See “data warehouse vs. data lakehouse” for more information.)
Data warehouses often use a three-tier architecture designed to transform data for analytics:
In the bottom tier, data flows from multiple source systems into a data warehouse server, where it is stored. Traditionally, the data moves through an extract, transform, load (ETL) data integration process, which uses automation to clean and organize data before loading it into the warehouse.
Because data warehouses primarily store structured data, data transformation occurs before data loading. Some modern warehouses use extract, load, transform (ELT) processes instead—which load data into the warehouse before transforming it. This method is commonly used in data lakes, which can store unstructured and semi-structured data without requiring standardized formats.
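As an illustration, the minimal Python sketch below walks through the three ETL steps against an in-memory SQLite table. The source records, column names and sales table are invented for the example; a production pipeline would read from real operational systems.

```python
import sqlite3

# Hypothetical source records, standing in for rows extracted from an
# operational system or CRM export.
source_rows = [
    {"customer": " Alice ", "region": "emea", "amount": "1200.50"},
    {"customer": "Bob", "region": "AMER", "amount": "980.00"},
]

def transform(row):
    # Clean and standardize fields before loading (the "T" in ETL).
    return (
        row["customer"].strip().title(),
        row["region"].upper(),
        float(row["amount"]),
    )

# Load the transformed rows into a warehouse table (SQLite stands in
# for the warehouse's relational engine here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", map(transform, source_rows))

print(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall())
```

An ELT variant would reverse the middle two steps: the raw strings would be inserted first and the cleanup would run as queries inside the warehouse.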
The middle tier contains the analytics engine, often powered by an online analytical processing (OLAP) system. While traditional relational databases—including many data warehouses—can store multidimensional data (for example, sales figures might have several dimensions such as location, time and product), they are not optimized for multidimensional queries.
OLAP systems are designed for high-speed, complex queries and multidimensional analysis on large volumes of data. They use “cubes” (array-based multidimensional data structures) to enable faster, more flexible analysis across multiple dimensions. Common use cases include data mining, financial analysis, budgeting and forecast planning.
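To make multidimensional analysis concrete, the short sketch below uses a pandas pivot table as a stand-in for one slice of an OLAP cube, rolling up invented revenue figures across location and quarter dimensions. It assumes pandas is installed; a real OLAP engine would perform the same kind of rollup at far larger scale.

```python
import pandas as pd

# Toy fact data with three dimensions: location, quarter and product.
sales = pd.DataFrame({
    "location": ["East", "East", "West", "West"],
    "quarter":  ["Q1", "Q2", "Q1", "Q2"],
    "product":  ["Widget", "Widget", "Gadget", "Widget"],
    "revenue":  [100, 150, 200, 120],
})

# A pivot table approximates a cube slice: revenue rolled up by
# location (rows) and quarter (columns).
cube_slice = sales.pivot_table(
    index="location", columns="quarter", values="revenue", aggfunc="sum"
)
print(cube_slice)
```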
OLAP vs. OLTP: Online transaction processing (OLTP) systems capture and update large volumes of real-time transactions from many users. In contrast, OLAP systems analyze data that has already been captured.
There are three types of OLAP one might use in a data warehouse: multidimensional OLAP (MOLAP), which stores data in optimized multidimensional cube structures; relational OLAP (ROLAP), which performs multidimensional analysis directly on data held in relational tables; and hybrid OLAP (HOLAP), which combines the two approaches.
The top tier, the final layer of a data warehouse, provides a front-end user interface for reporting, dashboards and ad hoc analysis of business data. These self-service business intelligence tools allow users to generate reports based on historical data, visualize trends and identify workflow bottlenecks without technical data engineering expertise.
Data warehouses have evolved significantly, moving from exclusively on-premises systems to flexible cloud and hybrid models.
Historically, data warehouses were hosted on-premises using commodity hardware, organized in either massively parallel processing (MPP) or symmetric multiprocessing (SMP) architectures, and often delivered as standalone appliances. These deployments required significant upfront investment. However, they can be a strong choice for organizations in sectors with strict compliance, data security or data privacy standards.
Today, many data warehouses are built to run in the cloud. They offer the benefits of cloud computing such as data storage at petabyte scale, highly scalable compute and storage, and pay-as-you-go pricing. Cloud-based data warehouses are typically delivered as a fully managed software as a service (SaaS) offering, eliminating the need for upfront investment in hardware or software. This service delivery also reduces the resources needed for infrastructure management so organizations can focus on analytics and insights.
Cloud-based data warehouses have grown in popularity as organizations seek the agility to scale and reduce their on-premises data center footprints and legacy infrastructure spend.
Some organizations might adopt a hybrid model, which combines the best of both on-premises and cloud data warehouses. This approach allows them to take advantage of the scalability and flexibility of cloud while retaining control of sensitive workloads that must remain on-premises.
In a data warehouse, schemas define how data is organized. There are three common schema structures: the star schema, the snowflake schema and the galaxy schema (also called a fact constellation schema).
These schemas are all dimensional data models designed to optimize data retrieval speeds in OLAP systems. Dimensional models deliberately introduce some redundancy, which makes information easier to locate for reporting and retrieval and improves query performance.
These schemas contain fact tables and dimension tables. Fact tables store quantitative measures of business events, such as sales amounts or order counts, along with keys linking each measure to its dimensions. Dimension tables store the descriptive attributes of those measures, such as product, customer, location and time.
Star schemas consist of a single, central fact table surrounded by dimension tables. In a diagram, the fact table appears in the middle of a star pattern. The star schema is considered the simplest and most common type of schema, offering users faster querying speeds.
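For illustration, the sketch below builds a tiny star schema in SQLite: one central fact table, two hypothetical dimension tables and a typical join query. All table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold descriptive attributes.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, quarter TEXT);

-- The central fact table holds measures plus a foreign key per dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    revenue    REAL
);

INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO dim_date VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO fact_sales VALUES (1, 1, 100), (2, 1, 200), (1, 2, 150);
""")

# A typical star-schema query: join the fact table to its dimensions.
for row in conn.execute("""
    SELECT d.quarter, p.name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_date d ON f.date_id = d.date_id
    GROUP BY d.quarter, p.name
"""):
    print(row)
```

In a snowflake schema, dim_product might itself be split into further normalized tables (for example, a separate category table), trading some query speed for less redundancy.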
A snowflake schema has a central fact table connected to many normalized dimension tables, which may then connect to other dimension tables through many-to-one relationships. This complex, branching pattern can resemble a snowflake. Snowflake schemas have low levels of data redundancy, but this comes at the cost of slower query performance.
Just as our galaxy contains many stars, a galaxy schema contains multiple star schemas that share normalized dimension tables to avoid redundancy. The galaxy schema is best suited for highly complex data warehouses, but users may experience slower query performance.
A typical data warehouse architecture has several components that work together to store, manage, process and deliver data for analytics.
ETL tools extract data from source systems, transform it in a staging area and load it into the data warehouse. In ELT, the data is transformed after being loaded into the warehouse. A data processing framework such as Apache Spark can help manage the transformation of data.
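As a sketch of where such a framework fits, the PySpark snippet below standardizes extracted records before the load step. It assumes a local Spark installation; the column names and records are invented.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("warehouse-transform").getOrCreate()

# Hypothetical extracted records awaiting transformation.
raw = spark.createDataFrame(
    [(" Alice ", "emea", 1200.5), ("Bob", "amer", 980.0)],
    ["customer", "region", "amount"],
)

# Standardize fields before the load step.
cleaned = raw.select(
    F.trim("customer").alias("customer"),
    F.upper("region").alias("region"),
    "amount",
)
cleaned.show()

spark.stop()
```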
A connectivity layer for application programming interfaces (APIs) can help the warehouse pull data from and integrate with operational systems. APIs can also provide access to visualization and advanced analytics tools.
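In practice, this connectivity layer often comes down to HTTP calls such as the sketch below. The endpoint URL, query parameter and response fields are hypothetical, and the requests library is assumed.

```python
import requests

# Hypothetical operational-system endpoint exposing recent orders.
API_URL = "https://api.example.com/v1/orders"

response = requests.get(API_URL, params={"updated_since": "2024-01-01"}, timeout=30)
response.raise_for_status()

# Each record would then flow into the warehouse's staging area.
for order in response.json():
    print(order["id"], order["total"])
```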
The data layer (or central database) is the heart of the data warehouse. Here, the system integrates and stores data from various sources, such as business applications, email lists, websites or other databases. This layer is supported by ETL or ELT data pipelines and uses a relational database management system (RDBMS) or cloud data warehouse platform. Built-in data governance and security capabilities can partition data so users only access what they need.
Metadata is data about data, describing the data that is stored in a system to make it searchable and usable for analytics. It includes technical metadata (such as table structure and data type) and descriptive metadata (such as author, creation date or file size). Metadata is key to effective data governance and data management.
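As a simple illustration, the hypothetical record below shows technical and descriptive metadata for a single warehouse table side by side; all field names are invented.

```python
# Hypothetical metadata record for one warehouse table.
table_metadata = {
    "technical": {
        "table": "fact_sales",
        "columns": {"revenue": "REAL", "product_id": "INTEGER"},
    },
    "descriptive": {
        "author": "data engineering team",
        "created": "2024-06-01",
        "description": "Daily sales facts by product and date",
    },
}

# A data catalog search might match on descriptive fields like this one.
print(table_metadata["descriptive"]["description"])
```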
Some data warehouses provide a sandbox, which is a walled-off testing environment containing a copy of the production data and relevant analysis tools. Data analysts and data scientists can experiment with new analytical techniques in the sandbox without impacting the live operations of the data warehouse for other users.
Access tools connect to data warehouses and provide a business user-friendly front end. Business users and data analysts can use dashboards, apps and data visualization tools to interact with data and extract insights. Examples of these tools include Tableau, Looker and Qlik.
There are three main types of data warehouses:
An enterprise data warehouse (EDW) is a data warehouse that services the entire enterprise. It acts as a centralized information repository of historical data for all teams and subject areas. An enterprise data warehousing environment may also include an operational data store (ODS) and department-specific data marts.
An operational data store (ODS) contains the most recent snapshot of operational data. An ODS is updated frequently, enabling quick access to near-real-time data. Organizations often use an ODS for day-to-day operational decision making and real-time analysis. It can also be a data source for an EDW or other data systems.
A data mart is a subset of an existing data warehouse (or other data sources) and contains data tailored to a specific business line or department rather than the entire enterprise. For example, a company may have a data mart that aligns with the marketing department. Those users can access more-focused insights for customer segmentation and campaign performance without navigating the broader enterprise data set.
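One common way to carve a data mart out of a warehouse is with a database view, as in the hypothetical SQLite sketch below; the enterprise table and its contents are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enterprise_sales (region TEXT, department TEXT, revenue REAL);
INSERT INTO enterprise_sales VALUES
    ('East', 'marketing', 100), ('West', 'finance', 250), ('East', 'marketing', 75);

-- A data mart as a focused slice of the enterprise warehouse:
-- marketing users see only their department's data.
CREATE VIEW marketing_mart AS
    SELECT region, revenue FROM enterprise_sales WHERE department = 'marketing';
""")

print(conn.execute("SELECT * FROM marketing_mart").fetchall())
```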
The terms data warehouse, database, data lake and data lakehouse are sometimes used interchangeably, but important differences exist.
A database is like a filing cabinet built primarily for automated data capture and fast transaction processing. It typically serves as the focused data store for a specific application. Data warehouses store data from any number of applications in an organization and are optimized for predictive analytics and other advanced analysis.
A data lake is a low-cost storage solution for massive amounts of raw data and uses a schema-on-read approach rather than a predefined schema. Data lakes can store structured data, unstructured data and semi-structured data, such as documents, videos, Internet of Things (IoT) logs and social media posts.
They can be built on big data platforms such as Apache Hadoop or cloud object storage services such as Amazon Simple Storage Service (Amazon S3). They don’t typically clean, validate or normalize data for analytics, as a warehouse does.
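Schema-on-read means structure is imposed only when the data is queried, not when it is written. The small sketch below illustrates the idea with invented raw JSON records of the kind an IoT feed might land in a lake.

```python
import json

# Raw, heterogeneous records as they might land in a data lake:
# no schema was enforced at write time, and fields vary per record.
raw_lines = [
    '{"device": "sensor-1", "temp_c": 21.5}',
    '{"device": "sensor-2", "temp_c": 19.0, "battery": 0.84}',
]

# Schema-on-read: structure is applied only now, at query time.
for line in raw_lines:
    record = json.loads(line)
    print(record["device"], record.get("temp_c"))
```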
A data lakehouse merges aspects of data warehouses and data lakes—namely, the low-cost flexibility of a lake and the high performance of a warehouse. By combining the key features of lakes and warehouses into one data solution, lakehouses can help accelerate data processing for large amounts of structured, unstructured and real-time data streams to support machine learning, data science and artificial intelligence (AI) workloads.
Data lakehouses might also add features such as shared metadata and distributed structured query language (SQL) engines.
Data warehouses make insights and information available to users across an organization, offering many benefits, such as:
Through ELT or ETL processes, data warehouses prepare incoming data before it is stored in the warehouse. This preparation includes data quality methods such as data cleansing, standardization and deduplication. Robust data governance policies and practices can also help ensure data accuracy and integrity for all users.
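As a minimal illustration of cleansing, standardization and deduplication, the pandas sketch below normalizes invented customer records and drops a duplicate; pandas is assumed to be installed.

```python
import pandas as pd

# Incoming records with inconsistent formatting and a duplicate email.
incoming = pd.DataFrame({
    "email": ["A@Example.com", "a@example.com ", "b@example.com"],
    "country": ["us", "US", "gb"],
})

# Standardize the fields, then deduplicate on the cleaned email.
cleaned = incoming.assign(
    email=incoming["email"].str.strip().str.lower(),
    country=incoming["country"].str.upper(),
).drop_duplicates(subset="email")

print(cleaned)
```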
By integrating high-quality data into a single store, organizations create a comprehensive and reliable single source of truth—which helps to eliminate data silos. This central repository enables business users to confidently access all of the organization’s pertinent data and use it for business decision-making. An enterprise-grade data warehouse might also include support for open source formats, such as Apache Iceberg, Parquet and CSV, enabling further data access and sharing across the enterprise.
Modern data warehouses can support various AI and machine learning workflows by providing clean, reliable data. Data scientists can use cleansed and validated warehouse data to build proprietary generative AI models or fine-tune existing models to better serve their unique business needs.
An AI-ready data warehouse should be able to collect, cleanse, organize and structure data, as well as facilitate the flow of data to AI and machine learning platforms. However, not all modern data warehouses are optimized for AI workloads. Data lakehouses are increasingly becoming the data platforms of choice for AI infrastructure.
A data warehouse centralizes and cleanses data from different sources to create a single source of truth, giving organizations a comprehensive, reliable view of enterprise data. Self-service BI tools enable users across the enterprise to access and run analytical queries on this aggregated data.
In this way, data warehouses make it possible for business users of any technical skill level to discover and report on themes, trends and aggregations. Business leaders can use these insights to make better-informed decisions and forecasts based on hard evidence in virtually every area of the organization, from business processes to financial management and inventory management.
Data warehouses can serve industry-specific uses, such as:
The analytical capabilities of a data warehouse can help governments better understand complex phenomena such as crime, demographic trends and traffic patterns.
The ability to centralize and analyze disparate data—such as billing and diagnostic codes, patient demographics, medications and test results—can help healthcare providers gain insight into patient outcomes, operational efficiencies and more.
Travel and hospitality organizations can use historical data pertaining to travel and accommodation choices to more precisely target advertising and promotions to their customers.
Large manufacturing companies that generate huge volumes of data can use data warehouse solutions to build out data marts tailored to the needs of each department.