Data warehousing systems can integrate large amounts of data from a wide range of source systems, including operational databases, transactional systems and customer relationship management (CRM) platforms. Self-service analytics tools empower business users to explore and analyze this data for valuable insights.
The concept of a data warehouse emerged in the 1980s to integrate disparate data into a consistent format for analysis. As the number of new data sources surged, such as the World Wide Web, social media and the Internet of Things (IoT), the demand for larger storage capacity and faster analysis grew.
Data warehouses are configured and optimized for near real-time analytics, which means they are typically not ideal for storing massive amounts of raw, unstructured big data. As the amount of data in a warehouse grows, the cost and complexity of storage grows with it. Latency and performance issues can also arise.
In response, more flexible alternatives have evolved, such as cloud-native data warehouses and data lakehouses. For more information, see "data warehouse versus data lakehouse."
Data warehouses often use a three-tier architecture designed to transform data for analytics:
Data flows from multiple source systems into a data warehouse server, where it is stored. Traditionally, the data moves through an extract, transform, load (ETL) data integration process, which uses automation to clean and organize data before loading it into the warehouse.
Because data warehouses primarily store structured data, data transformation occurs before data loading. Some modern warehouses use extract, load, transform (ELT) processes instead—which load data into the warehouse before transforming it. This method is commonly used in data lakes, which can store unstructured and semi-structured data without requiring standardized formats.
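The ETL flow described above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline; the field names and the in-memory "warehouse" are hypothetical stand-ins for real source systems and warehouse tables.

```python
# Minimal ETL sketch: extract rows from a source, transform them,
# then load the cleaned result into the warehouse store.
# All names and data here are illustrative.

def extract(source_rows):
    """Pull raw records from a source system (here, a list of dicts)."""
    return list(source_rows)

def transform(rows):
    """Clean and organize before loading (the 'T' precedes the 'L' in ETL)."""
    cleaned = []
    for row in rows:
        if row.get("amount") is None:  # drop incomplete records
            continue
        cleaned.append({
            "customer": row["customer"].strip().title(),  # standardize names
            "amount": round(float(row["amount"]), 2),     # normalize types
        })
    return cleaned

def load(rows, warehouse):
    """Append prepared rows to the warehouse store."""
    warehouse.extend(rows)
    return warehouse

warehouse = []
source = [{"customer": " alice ", "amount": "19.991"},
          {"customer": "bob", "amount": None}]
load(transform(extract(source)), warehouse)
print(warehouse)  # one cleaned row; the incomplete record was dropped
```

An ELT pipeline would simply reverse the last two steps: raw rows land in the store first, and the `transform` logic runs inside the warehouse afterward.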
This tier contains the analytics engine, often powered by an online analytical processing (OLAP) system. While traditional relational databases (including many data warehouses) can store multidimensional data (sales figures might have several dimensions such as location, time and product), they are not optimized for multidimensional queries.
OLAP systems are designed for high-speed, complex queries and multidimensional analysis on large volumes of data. They use “cubes” (array-based multidimensional data structures) to enable faster, more flexible analysis across multiple dimensions. Common use cases include data mining, financial analysis, budgeting and forecast planning.
OLAP versus OLTP: Online transaction processing (OLTP) systems capture and update large volumes of real-time transactions from many users. In contrast, OLAP systems analyze data that has already been captured.
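The idea behind an OLAP cube can be sketched with plain Python: pre-aggregate a measure (here, sales) across every combination of dimensions so that multidimensional queries become fast lookups instead of table scans. The data and dimension names are illustrative, and real OLAP engines use far more compact array structures.

```python
# Toy "cube": pre-aggregate sales across all dimension combinations.
# Data and dimension names are illustrative.
from collections import defaultdict
from itertools import combinations

facts = [
    {"location": "NY", "quarter": "Q1", "product": "widget", "sales": 100},
    {"location": "NY", "quarter": "Q2", "product": "widget", "sales": 150},
    {"location": "SF", "quarter": "Q1", "product": "gadget", "sales": 200},
]
dims = ("location", "quarter", "product")

# Build every roll-up: totals for each subset of dimensions.
cube = defaultdict(float)
for row in facts:
    for r in range(len(dims) + 1):
        for subset in combinations(dims, r):
            key = (subset, tuple(row[d] for d in subset))
            cube[key] += row["sales"]

# Slicing the cube is now a dictionary lookup, not a scan.
print(cube[(("location",), ("NY",))])                     # 250.0 — all NY sales
print(cube[(("quarter", "product"), ("Q1", "widget"))])   # 100.0
print(cube[((), ())])                                     # 450.0 — grand total
```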
There are three types of OLAP one might use in a data warehouse: multidimensional OLAP (MOLAP), which stores data in optimized cube structures; relational OLAP (ROLAP), which runs multidimensional queries directly against relational tables; and hybrid OLAP (HOLAP), which combines elements of both.
The final layer of a data warehouse provides a front-end user interface for reporting, dashboards and ad hoc data analysis of business data. These self-service business intelligence tools allow users to generate reports based on historical data, visualize trends and identify workflow bottlenecks without technical data engineering expertise.
Data warehouses have evolved significantly, moving from exclusively on‑premises systems to flexible cloud and hybrid models.
Historically, data warehouses were hosted on-premises on commodity hardware, organized in either massively parallel processing (MPP) or symmetric multiprocessing (SMP) architectures, or delivered as stand-alone appliances. These deployments require significant upfront investment. However, they can be a strong choice for organizations in sectors with strict compliance, data security or data privacy standards.
Today, many data warehouses are built to run in the cloud. They offer the benefits of cloud computing such as data storage at petabyte scale, highly scalable compute and storage, and pay-as-you-go pricing. Cloud-based data warehouses are typically delivered as a fully managed software as a service (SaaS) offering, eliminating the need for upfront investment in hardware or software.
This service delivery also reduces the resources needed for infrastructure management so organizations can focus on analytics and insights. Cloud-based data warehouses have grown in popularity as organizations seek the agility to scale and reduce their on-premises data center footprints and legacy infrastructure costs.
Some organizations might adopt a hybrid model, which combines the best of both on-premises and cloud data warehouses. This approach allows them to take advantage of the scalability and flexibility of cloud while retaining control of sensitive workloads that must remain on-premises.
In a data warehouse, schemas define how data is organized. There are three common schema structures: the star schema, the snowflake schema and the galaxy schema (also called a fact constellation schema).
These schemas are all dimensional data models designed to optimize data retrieval speeds in OLAP systems. Dimensional models denormalize data, accepting some redundancy in exchange for easier information lookup for reporting and better query performance.
These schemas contain fact tables and dimension tables: fact tables store the quantitative measures of business events (such as sales amounts or order counts), while dimension tables store the descriptive attributes (such as product, customer, location or time) that give those measures context.
Star schemas consist of a single, central fact table surrounded by dimension tables. In a diagram, the fact table appears in the middle of a star pattern. The star schema is considered the simplest and most common type of schema, offering users faster querying speeds.
A snowflake schema places a central fact table at the core, with numerous normalized dimension tables radiating outward; those dimensions extend further to other dimension tables through many-to-one relationships. This complex, branching pattern can resemble a snowflake. Snowflake schemas have low levels of data redundancy, but this benefit comes at the cost of slower query performance.
Just as a galaxy contains many stars, a galaxy schema contains multiple star schemas. These schemas share dimension tables that are normalized to reduce redundancy. The galaxy schema is best suited for highly complex data warehouses, but users can experience lower performance.
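A star schema can be made concrete with a small SQLite example: one fact table joined to two dimension tables, queried with a typical aggregate-and-describe pattern. All table and column names below are illustrative.

```python
# Sketch of a star schema in SQLite: one fact table (fact_sales) joined
# to two dimension tables (dim_product, dim_store). Names are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE fact_sales  (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        amount     REAL
    );
    INSERT INTO dim_product VALUES (1, 'widget', 'tools'), (2, 'gadget', 'toys');
    INSERT INTO dim_store   VALUES (10, 'NY'), (20, 'SF');
    INSERT INTO fact_sales  VALUES (1, 10, 100.0), (1, 20, 50.0), (2, 10, 75.0);
""")

# A typical star-schema query: aggregate the fact table, described by dimensions.
rows = con.execute("""
    SELECT p.category, s.city, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_store   s ON f.store_id   = s.store_id
    GROUP BY p.category, s.city
    ORDER BY p.category, s.city
""").fetchall()
print(rows)  # [('tools', 'NY', 100.0), ('tools', 'SF', 50.0), ('toys', 'NY', 75.0)]
```

A snowflake schema would further split `dim_product` into, say, a separate category table; a galaxy schema would add more fact tables sharing these same dimensions.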
A typical data warehouse architecture has several components that work together to store, manage, process and deliver data for analytics.
ETL tools extract data from source systems, transform it in a staging area and load it into the data warehouse. In ELT, the data is transformed after being loaded in the warehouse. A data processing framework tool, such as Apache Spark, can help manage the transformation of data.
A connectivity layer for application programming interfaces (APIs) can help the warehouse pull data from and integrate with operational systems. APIs can also provide access to visualization and advanced analytics tools.
The data layer (or central database) is the heart of the data warehouse. Here, the system integrates and stores data from various sources, such as business applications, email lists, websites or other databases. ETL or ELT data pipelines support this layer and a relational database management system (RDBMS) or cloud data warehouse platform powers it. Built-in data governance and security capabilities can partition data so users access what they need.
Metadata is data about data, describing the data that is stored in a system to make it searchable and usable for analytics. It includes technical metadata (such as table structure and data type) and descriptive metadata (such as author, creation date or file size). Metadata is key to effective data governance and data management.
Some data warehouses provide a sandbox, which is a walled-off testing environment containing a copy of the production data and relevant analysis tools. Data analysts and data scientists can test new analytical techniques in the sandbox without affecting live warehouse operations.
Access tools connect to data warehouses and provide an accessible front end. Business users and data analysts can use dashboards, apps and data visualization tools to interact with data and extract insights. Examples of these tools include Tableau, Looker and Qlik.
There are three main types of data warehouses:
An enterprise data warehouse (EDW) is a data warehouse that services the entire enterprise. It acts as a centralized information repository of historical data for all teams and subject areas. An enterprise data warehousing environment can also include an operational data store (ODS) and department-specific data marts.
An operational data store (ODS) contains the most recent snapshot of operational data. An ODS is updated frequently, enabling quick access to near-real-time data. Organizations often use an ODS for day-to-day operational decision-making and real-time analysis. It can also be a data source for an EDW or other data systems.
A data mart is a subset of an existing data warehouse (or other data sources) and contains data tailored to a specific business line or department rather than the entire enterprise. For example, a company can have a data mart that aligns with the marketing department. Those users can access more-focused insights for customer segmentation and campaign performance without navigating the broader enterprise data set.
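One common way to carve a data mart out of a warehouse is a department-scoped SQL view, sketched here in SQLite. The table, view and column names are hypothetical.

```python
# A data mart sketched as a SQL view over the warehouse: the marketing
# team sees only its slice of enterprise data. Names are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE warehouse_orders (order_id INTEGER, department TEXT,
                                   campaign TEXT, revenue REAL);
    INSERT INTO warehouse_orders VALUES
        (1, 'marketing', 'spring_promo', 120.0),
        (2, 'finance',   NULL,           999.0),
        (3, 'marketing', 'spring_promo',  80.0);

    -- The "data mart": a department-scoped subset of the warehouse.
    CREATE VIEW marketing_mart AS
        SELECT order_id, campaign, revenue
        FROM warehouse_orders
        WHERE department = 'marketing';
""")

total = con.execute(
    "SELECT SUM(revenue) FROM marketing_mart WHERE campaign = 'spring_promo'"
).fetchone()[0]
print(total)  # 200.0 — finance rows never appear in the mart
```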
The terms data warehouse, database, data lake and data lakehouse are sometimes used interchangeably, but important differences exist.
A database is like a filing cabinet built primarily for automated data capture and fast transaction processing. It typically serves as the focused data store for a specific application. Data warehouses store data from any number of applications in an organization and are optimized for predictive analytics and other advanced analysis.
A data lake is a low-cost storage solution for massive amounts of raw data and uses a schema-on-read approach rather than a predefined schema. Data lakes can store structured data, unstructured data and semi-structured data, such as documents, videos, IoT logs and social media posts.
They can be built on big data platforms such as Apache Hadoop or cloud object storage services such as Amazon Simple Storage Service (Amazon S3). They don’t typically clean, validate or normalize data for analytics, as a warehouse does.
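The schema-on-read behavior that distinguishes a lake from a warehouse can be sketched in Python: raw records land untouched, and structure is imposed only when the data is read for analysis. The record layout below is made up for illustration.

```python
# Schema-on-read sketch: raw records land in the "lake" as-is;
# a schema is applied only at read time. Field names are illustrative.
import json

# Raw, heterogeneous records as they might arrive in a data lake.
raw_lake = [
    '{"user": "alice", "event": "click", "ts": 1700000000}',
    '{"user": "bob", "event": "view"}',   # missing field: stored anyway
    'not even json',                      # malformed: stored anyway
]

def read_with_schema(lines):
    """Impose a schema at read time, skipping records that don't fit."""
    for line in lines:
        try:
            rec = json.loads(line)
            yield {"user": rec["user"], "event": rec["event"],
                   "ts": rec.get("ts")}   # optional field defaults to None
        except (json.JSONDecodeError, KeyError):
            continue                      # unusable rows are ignored, not rejected upfront

events = list(read_with_schema(raw_lake))
print(len(events))  # 2 — both parseable records, with or without "ts"
```

A warehouse, by contrast, enforces this structure *before* storage, which is why its ETL step rejects or repairs such rows at load time.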
A data lakehouse combines the strengths of data warehouses and data lakes, delivering the low‑cost flexibility of a lake alongside the high performance of a warehouse. By combining the key features of lakes and warehouses into a single data platform, lakehouses can accelerate processing for large volumes of structured, unstructured and real-time data.
They also support machine learning, data science and artificial intelligence (AI) workloads more efficiently. Data lakehouses might also add features such as shared metadata and distributed structured query language (SQL) engines.
Data warehouses make insights and information available to users across an organization, offering many benefits, such as:
Through ELT or ETL processes, data warehouses prepare incoming data before it is stored in the warehouse. This preparation includes data quality methods such as data cleansing, standardization and deduplication. Robust data governance policies and practices can also help ensure data accuracy and integrity for all users.
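These data quality steps can be illustrated with a short Python sketch covering cleansing, standardization and deduplication. The field names and the choice of email as a deduplication key are illustrative assumptions.

```python
# Sketch of data-quality steps during ETL: cleansing, standardization
# and deduplication. Field names and data are illustrative.

records = [
    {"email": " Alice@Example.com ", "country": "us"},
    {"email": "alice@example.com",   "country": "US"},  # duplicate after cleansing
    {"email": "bob@example.com",     "country": "gb"},
]

def cleanse(rec):
    """Trim whitespace and normalize case (cleansing + standardization)."""
    return {"email": rec["email"].strip().lower(),
            "country": rec["country"].strip().upper()}

seen, deduped = set(), []
for rec in map(cleanse, records):
    if rec["email"] not in seen:  # deduplicate on a business key
        seen.add(rec["email"])
        deduped.append(rec)

print(len(deduped))  # 2 distinct customers remain
```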
By integrating high-quality data into a single store, organizations create a comprehensive and reliable single source of truth—which helps to eliminate data silos. This central repository enables business users to confidently access the organization’s pertinent data and use it for business decision‑making. An enterprise-grade data warehouse might also include support for open source formats, such as Apache Iceberg, Parquet and CSV, enabling further data access and sharing across the enterprise.
Modern data warehouses can support various AI and machine learning workflows by providing clean, reliable data. Data scientists can use cleansed and validated warehouse data to build proprietary generative AI models or fine-tune existing models to better serve their unique business needs.
An AI-ready data warehouse should be able to collect, cleanse, organize and structure data and facilitate the flow of data to AI and machine learning platforms. However, not all modern data warehouses are optimized for AI workloads. Data lakehouses are increasingly becoming the data platforms of choice for AI infrastructure.
A data warehouse centralizes and cleanses data from different sources to create a single source of truth, giving organizations a comprehensive, reliable view of enterprise data. Self-service BI tools enable users across the enterprise to access and run analytical queries on this aggregated data.
In this way, data warehouses make it possible for business users of any technical skill level to discover and report on themes, trends and aggregations. Business leaders can use these insights to make better-informed decisions and forecasts based on hard evidence in virtually every area of the organization, from business processes to financial management and inventory management.
Data warehouses can serve industry-specific uses, such as:
The analytical capabilities of a data warehouse can help governments better understand complex phenomena such as crime, demographic trends and traffic patterns.
The ability to centralize and analyze disparate data—such as billing and diagnostic codes, patient demographics, medications and test results—can give healthcare providers deeper insight. These insights help them understand patient outcomes, improve operational efficiency and more.
Organizations can use historical data pertaining to travel and accommodation choices to more precisely target advertising and promotions to their customers.
Large manufacturing companies that generate huge volumes of data can use data warehouse solutions to build out data marts tailored to the needs of each department.
A data warehouse can be a smart choice if your organization aggregates large amounts of data from multiple operational systems, such as business applications, websites and other databases. It is especially useful when you plan to perform complex historical analysis with business intelligence (BI) tools or dashboards.
For cost optimization, look to architectures that separate data and compute resources, allowing you to scale them separately. You can also use cost‑effective cloud object storage and AI‑powered workload management for automated resource distribution. Open data formats make it easier to share data across warehouses and lakehouses, reducing storage costs and complexity.
Strong ETL/ELT processes for data cleansing and standardization, robust data governance policies and data observability to detect issues as they emerge can help with data quality problems. A “shift left” approach can also help to detect and resolve data quality issues closer to the root cause rather than downstream.
A database is built primarily for fast transaction processing and typically serves as the data storage for a specific application. A data warehouse aggregates large volumes of data from various sources, then cleans and prepares it for business intelligence, analytical queries and other advanced data analysis.
Data engineers build and maintain the infrastructure, while a chief data officer sets data strategy and oversees data management functions. Business intelligence teams manage semantic layers and dashboards and cross‑functional data governance teams help ensure data quality and security.