What is a data warehouse?
Explore IBM's data warehouse solution Subscribe for AI updates
Illustration with collage of pictograms of clouds, pie chart, graph pictograms on the following
What is a data warehouse?

A data warehouse, or enterprise data warehouse (EDW), is a system that aggregates data from different sources into a single, central, consistent data store to support data analysis, data mining, artificial intelligence (AI) and machine learning.

 

A data warehouse system enables an organization to run powerful analytics on large amounts of data (petabytes and petabytes) in ways that a standard database cannot.

Data warehousing systems have been a part of business intelligence (BI) solutions for over three decades, but they have evolved recently with the emergence of new data types and data hosting methods. Traditionally, a data warehouse was hosted on-premises—often on a mainframe computer—and its functionality was focused on extracting data from other sources, cleansing and preparing the data, and loading and maintaining the data in a relational database. More recently, a data warehouse might be hosted on a dedicated appliance or in the cloud, and most data warehouses have added analytics capabilities and data visualization and presentation tools.

Build responsible AI workflows with AI governance

Learn the building blocks and best practices to help your teams accelerate responsible AI.

Related content

Register for the ebook on Presto

Data warehouse architecture

Generally speaking, data warehouses have a three-tier architecture, which consists of a:
 

  • Bottom tier:  The bottom tier consists of a data warehouse server, usually a relational database system, which collects, cleanses, and transforms data from multiple data sources through a process known as Extract, Transform, and Load (ETL) or a process known as Extract, Load, and Transform (ELT). For most organizations that use ETL, the process relies on automation, and is efficient, well-defined, continuous and batch-driven.
     

  • Middle tier:  The middle tier consists of an OLAP (online analytical processing) server which enables fast query speeds. Three types of OLAP models can be used in this tier, which are known as ROLAP, MOLAP and HOLAP. The type of OLAP model used is dependent on the type of database system that exists.
     

  • Top tier:  The top tier is represented by some kind of front-end user interface or reporting tool, which enables end users to conduct ad-hoc data analysis on their business data.

A short history of data warehouse architecture

Most data warehouses will be built around a relational database system, either on-premise or in the cloud, where data is both stored and processed. Other components would include a metadata management system and an API connectivity layer enabling the warehouse to pull data from organizational sources and provide access to analytics and visualization tools.

A typical data warehouse has four main components: a central database, ETL tools, metadata and access tools. All of these components are engineered for speed so that you can get results quickly and analyze data on the fly.

The data warehouse has been around for decades. Born in the 1980s, it addressed the need to optimize analytics on data. As companies’ business applications began to grow and generate/store more data, they needed data warehouse systems that could both manage the data and analyze it. At a high level, database admins could pull data from their operational systems and add a schema to it via transformation before loading it into their data warehouse.

As data warehouse architecture evolved and grew in popularity, more people within a company started using it to access data–and the data warehouse made it easy to do so with structured data. This is where metadata became important. Reporting and dashboarding became a key use case, and SQL (structured query language) became the de facto way of interacting with that data.

Components of data warehouse architecture

Let's take a closer look at each component.

ETL

When database analysts want to move data from a data source into their data warehouse, this is the process they use. In short, ETL converts data into a usable format so that once it’s in the data warehouse, it can be analyzed/queried/etc. 

Metadata

Metadata is data about data. Basically, it describes all of the data that’s stored in a system to make it searchable. Some examples of metadata include authors, dates or locations of an article, create date of a file, the size of a file, etc. Think of it like the titles of a column in a spreadsheet. Metadata allows you to organize your data to make it usable, so you can analyze it to create dashboards and reports.

SQL query processing

SQL is the de facto standard language for querying your data. This is the language that analysts use to pull out insights from their data stored in the data warehouse. Typically data warehouses have proprietary SQL query processing technologies tightly coupled with the compute. This allows for very high performance when it comes to your analytics. One thing to note, however, is that the cost of a data warehouse can start getting expensive the more data and SQL compute resources you have.

Data layer

The data layer is the access layer that allows users to actually get to the data. This is typically where you’d find a data mart. This layer partitions segments of your data out depending on who you want to give access to, so you can get very granular across your organization. For instance, you may not want to give your sales team access to your HR team’s data, and vice versa.

Governance and security

This is related to the data layer in that you need to be able to provide fine-grained access and security policies across all your organization’s data. Typically data warehouses have very good data governance and security capabilities built in, so you don’t need to do a lot of custom data engineering work to include this. It’s important to plan for governance and security as you add more data to your warehouse and as your company grows.

+ Data warehouse access tools

While access tools are external to your data warehouse, they can be seen as its business-user friendly front end. This is where you’d find your reporting and visualization tools, used by data analysts and business users to interact with the data, extract insights and create visualizations that the rest of the business can consume. Examples of these tools include Tableau, Looker and Qlik.

Understanding OLAP and OLTP in data warehouses

OLAP (online analytical processing) is software for performing multidimensional analysis at high speeds on large volumes of data from unified, centralized data store, such as a data warehouse. OLTP (online transactional processing), enables the real-time execution of large numbers of database transactions by large numbers of people, typically over the internet. The main difference between OLAP and OLTP is in the name: OLAP is analytical in nature, and OLTP is transactional. 

OLAP tools are designed for multidimensional analysis of data in a data warehouse, which contains both historical and transactional data. Common uses of OLAP include data mining and other business intelligence apps, complex analytical calculations, and predictive scenarios, as well as business reporting functions like financial analysis, budgeting, and forecast planning.

OLTP is designed to support transaction-oriented applications by processing recent transactions as quickly and accurately as possible. Common uses of OLTP include ATMs, e-commerce software, credit card payment data processing, online bookings, reservation systems, and record-keeping tools.

For a deep dive into the differences between these approaches, check out "OLAP vs. OLTP: What's the Difference?

Schemas in data warehouses

Schemas are ways in which data is organized within a database or data warehouse. There are two main types of schema structures, the star schema and the snowflake schema, which will impact the design of your data model.

Star schema: This schema consists of one fact table which can be joined to a number of denormalized dimension tables. It is considered the simplest and most common type of schema, and its users benefit from its faster speeds while querying.

Snowflake schema: While not as widely adopted, the snowflake schema is another organization structure in data warehouses. In this case, the fact table is connected to a number of normalized dimension tables, and these dimension tables have child tables. Users of a snowflake schema benefit from its low levels of data redundancy, but it comes at a cost to query performance. 

Data warehouse vs. database, data lake, and data mart

Data warehouse, database, data lake, and data mart are all terms that tend to be used interchangeably. While the terms are similar, important differences exist:

Data warehouse vs. data lake
 

Using a data pipeline, a data warehouse gathers raw data from multiple sources into a central repository, structured using predefined schemas designed for data analytics. A data lake is a data warehouse without the predefined schemas. As a result, it enables more types of analytics than a data warehouse. Data lakes are commonly built on big data platforms such as Apache Hadoop.

Data warehouse vs. data mart
 

A data mart is a subset of a data warehouse that contains data specific to a particular business line or department. Because they contain a smaller subset of data, data marts enable a department or business line to discover more-focused insights more quickly than possible when working with the broader data warehouse data set.

Data warehouse vs. database
 

A database is built primarily for fast queries and transaction processing, not analytics. A database typically serves as the focused data store for a specific application, whereas a data warehouse stores data from any number (or even all) of the applications in your organization.

A database focuses on updating real-time data while a data warehouse has a broader scope, capturing current and historical data for predictive analytics, machine learning, and other advanced types of analysis.

Types of data warehouses

Cloud data warehouse
 

A cloud data warehouse is a data warehouse specifically built to run in the cloud, and it is offered to customers as a managed service. Cloud-based data warehouses have grown more popular over the last five to seven years as more companies use cloud computing services and seek to reduce their on-premises data center footprint.

With a cloud data warehouse, the physical data warehouse infrastructure is managed by the cloud company, meaning that the customer doesn’t have to make an upfront investment in hardware or software and doesn’t have to manage or maintain the data warehouse solution.

Data warehouse software (on-premises/license)
 

A business can purchase a data warehouse license and then deploy a data warehouse on their own on-premises infrastructure. Although this is typically more expensive than a cloud data warehouse service, it might be a better choice for government entities, financial institutions, or other organizations that want more control over their data or need to comply with strict security or data privacy standards or regulations.

Data warehouse appliance
 

A data warehouse appliance is a pre-integrated bundle of hardware and software—CPUs, storage, operating system, and data warehouse software—that a business can connect to its network and start using as-is. A data warehouse appliance sits somewhere between cloud and on-premises implementations in terms of upfront cost, speed of deployment, ease of scalability, and data management control.

Benefits of a data warehouse

A data warehouse provides a foundation for the following:

  • Better data quality: A data warehouse centralizes data from a variety of data sources, such as transactional systems, operational databases, and flat files. It then cleanses the operational data, eliminates duplicates, and standardizes it to create a single source of the truth.

  • Faster, business insights: Data from disparate sources limit the ability of decision makers to set business strategies with confidence. Data warehouses enable data integration, allowing business users to leverage all of a company’s data into each business decision. Data warehouse data makes it possible to report on themes, trends, aggregations, and other relationships among data collected from an engineering lifecycle management (ELM) app.

  • Smarter decision-making:  A data warehouse supports large-scale BI functions such as data mining (finding unseen patterns and relationships in data), artificial intelligence, and machine learning—tools data professionals and business leaders can use to get hard evidence for making smarter decisions in virtually every area of the organization, from business processes to financial management and inventory management.

  • Gaining and growing competitive advantage: All of the above combine to help an organization finding more opportunities in data, more quickly than is possible from disparate data stores.
Challenges with data warehouse architecture

As companies start housing more data and needing more advanced analytics and a wide range of data, the data warehouse starts to become expensive and not so flexible. If you want to analyze unstructured or semi-structured data, the data warehouse won’t work. We’re seeing more companies moving to the data lakehouse architecture, which helps to address the above. The open data lakehouse allows you to run warehouse workloads on all kinds of data in an open and flexible architecture. This data can also be used by data scientists and engineers who study data to gain business insights. Instead of a tightly coupled system, the data lakehouse is much more flexible and also can manage unstructured and semi-structured data like photos, videos, IoT data and more.

The data lakehouse can also support your data science, ML and AI workloads in addition to your reporting and dashboarding workloads. If you are looking to upgrade from data warehouse architecture, then developing an open data lakehouse is the way to go.

Related solutions
Data warehouse solutions

IBM data warehouse solutions offer performance and flexibility to support structured and unstructured data for analytics workloads including machine learning.

Explore data warehouse solutions
Db2 Warehouse on Cloud

Explore the capabilities of a fully managed, elastic cloud data warehouse built for high-performance analytics and AI.

Explore Db2 Warehouse on Cloud
IBM Cloud Pak® for Data

IBM Cloud Pak® for Data is a modular set of integrated software components for data analysis, organization and management across business silos, on premises and in clouds.

Explore IBM Cloud Pak® for Data
Resources Finding the right enterprise data warehouse to meet the data and AI challenge

AI can present a number of challenges that enterprise data warehouses and data marts can help overcome. Discover how to assess the total value such a solution can provide.

How to choose the right data warehouse for AI

To choose an enterprise data warehouse, businesses should consider the impact of AI, key warehouse differentiators, and the variety of deployment models. This ebook helps do just that.

The Data Differentiator

A guide to building a data-driven organization and driving business advantage.

Take the next step

Scale AI workloads for all your data, anywhere, with IBM watsonx.data, a fit-for-purpose data store built on an open data lakehouse architecture.

Explore watsonx.data Book a live demo