Data warehousing systems can ingest large amounts of data from a wide range of source systems, including operational databases, transactional systems and customer relationship management (CRM) platforms. Self-service analytics tools empower business users to explore and analyze this data for valuable insights.
The concept of a data warehouse emerged in the 1980s to integrate disparate data into a consistent format for analysis. As the number of new data sources surged—such as the World Wide Web, social media and the Internet of Things (IoT)—the demand for larger storage capacity and faster analysis grew.
Data warehouses are configured and optimized for near real-time analytics, which means they are typically not ideal for storing massive amounts of raw, unstructured big data. As the amount of data in a warehouse grows, the cost and complexity of storage grow with it. Latency and performance issues can also arise.
In response, more flexible alternatives have evolved, such as cloud-native data warehouses and data lakehouses. (See “data warehouse vs. data lakehouse” for more information.)
Data warehouses often use a three-tier architecture designed to transform data for analytics:
In the bottom tier, data flows from multiple source systems into a data warehouse server, where it is stored. Traditionally, the data moves through an extract, transform, load (ETL) data integration process, which uses automation to clean and organize data before loading it into the warehouse.
Because data warehouses primarily store structured data, data transformation occurs before data loading. Some modern warehouses use extract, load, transform (ELT) processes instead—which load data into the warehouse before transforming it. This method is commonly used in data lakes, which can store unstructured and semi-structured data without requiring standardized formats.
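As an illustration, the minimal Python sketch below walks through the three ETL steps against an in-memory SQLite table. The source records, column names and sales table are invented for the example; a production pipeline would read from real operational systems.

```python
import sqlite3

# Hypothetical source records, standing in for rows extracted from an
# operational system or CRM export.
source_rows = [
    {"customer": " Alice ", "region": "emea", "amount": "1200.50"},
    {"customer": "Bob", "region": "AMER", "amount": "980.00"},
]

def transform(row):
    # Clean and standardize fields before loading (the "T" in ETL).
    return (
        row["customer"].strip().title(),
        row["region"].upper(),
        float(row["amount"]),
    )

# Load the transformed rows into a warehouse table (SQLite stands in
# for the warehouse's relational engine here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", map(transform, source_rows))

print(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall())
```

An ELT variant would reverse the middle two steps: the raw strings would be inserted first and the cleanup would run as queries inside the warehouse.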
The middle tier contains the analytics engine, often powered by an online analytical processing (OLAP) system. While traditional relational databases—including many data warehouses—can store multidimensional data (for example, sales figures might have several dimensions such as location, time and product), they are not optimized for multidimensional queries.
OLAP systems are designed for high-speed, complex queries and multidimensional analysis on large volumes of data. They use “cubes” (array-based multidimensional data structures) to enable faster, more flexible analysis across multiple dimensions. Common use cases include data mining, financial analysis, budgeting and forecast planning.
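To make multidimensional analysis concrete, the short sketch below uses a pandas pivot table as a stand-in for one slice of an OLAP cube, rolling up invented revenue figures across location and quarter dimensions. It assumes pandas is installed; a real OLAP engine would perform the same kind of rollup at far larger scale.

```python
import pandas as pd

# Toy fact data with three dimensions: location, quarter and product.
sales = pd.DataFrame({
    "location": ["East", "East", "West", "West"],
    "quarter":  ["Q1", "Q2", "Q1", "Q2"],
    "product":  ["Widget", "Widget", "Gadget", "Widget"],
    "revenue":  [100, 150, 200, 120],
})

# A pivot table approximates a cube slice: revenue rolled up by
# location (rows) and quarter (columns).
cube_slice = sales.pivot_table(
    index="location", columns="quarter", values="revenue", aggfunc="sum"
)
print(cube_slice)
```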
OLAP vs. OLTP: Online transaction processing (OLTP) systems capture and update large volumes of real-time transactions from many users. In contrast, OLAP systems analyze data that has already been captured.
There are three types of OLAP one might use in a data warehouse: multidimensional OLAP (MOLAP), which stores data in optimized multidimensional cube structures; relational OLAP (ROLAP), which performs multidimensional analysis directly on data held in relational tables; and hybrid OLAP (HOLAP), which combines the two approaches.
The top tier, the final layer of a data warehouse, provides a front-end user interface for reporting, dashboards and ad hoc analysis of business data. These self-service business intelligence tools allow users to generate reports based on historical data, visualize trends and identify workflow bottlenecks without technical data engineering expertise.
Data warehouses have evolved significantly, moving from exclusively on-premises systems to flexible cloud and hybrid models.
Historically, data warehouses were hosted on-premises using commodity hardware, organized in either massively parallel processing (MPP) or symmetric multiprocessing (SMP) architectures, and often delivered as standalone appliances. These deployments required significant upfront investment. However, they can be a strong choice for organizations in sectors with strict compliance, data security or data privacy standards.
Today, many data warehouses are built to run in the cloud. They offer the benefits of cloud computing such as data storage at petabyte scale, highly scalable compute and storage, and pay-as-you-go pricing. Cloud-based data warehouses are typically delivered as a fully managed software as a service (SaaS) offering, eliminating the need for upfront investment in hardware or software. This service delivery also reduces the resources needed for infrastructure management so organizations can focus on analytics and insights.
Cloud-based data warehouses have grown in popularity as organizations seek the agility to scale and reduce their on-premises data center footprints and legacy infrastructure spend.
Some organizations might adopt a hybrid model, which combines the best of both on-premises and cloud data warehouses. This approach allows them to take advantage of the scalability and flexibility of cloud while retaining control of sensitive workloads that must remain on-premises.
In a data warehouse, schemas define how data is organized. There are three common schema structures: the star schema, the snowflake schema and the galaxy schema (also called a fact constellation schema).
These schemas are all dimensional data models designed to optimize data retrieval speeds in OLAP systems. Dimensional models deliberately introduce some redundancy, which makes information easier to locate for reporting and retrieval and improves query performance.
These schemas contain fact tables and dimension tables. Fact tables store quantitative measures of business events, such as sales amounts or order counts, along with keys linking each measure to its dimensions. Dimension tables store the descriptive attributes of those measures, such as product, customer, location and time.
Star schemas consist of a single, central fact table surrounded by dimension tables. In a diagram, the fact table appears in the middle of a star pattern. The star schema is considered the simplest and most common type of schema, offering users faster querying speeds.
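For illustration, the sketch below builds a tiny star schema in SQLite: one central fact table, two hypothetical dimension tables and a typical join query. All table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold descriptive attributes.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, quarter TEXT);

-- The central fact table holds measures plus a foreign key per dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    revenue    REAL
);

INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO dim_date VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO fact_sales VALUES (1, 1, 100), (2, 1, 200), (1, 2, 150);
""")

# A typical star-schema query: join the fact table to its dimensions.
for row in conn.execute("""
    SELECT d.quarter, p.name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_date d ON f.date_id = d.date_id
    GROUP BY d.quarter, p.name
"""):
    print(row)
```

In a snowflake schema, dim_product might itself be split into further normalized tables (for example, a separate category table), trading some query speed for less redundancy.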
A snowflake schema has a central fact table connected to many normalized dimension tables, which may then connect to other dimension tables through many-to-one relationships. This complex, branching pattern can resemble a snowflake. Snowflake schemas have low levels of data redundancy, but this comes at the cost of slower query performance.
Just as our galaxy contains many stars, a galaxy schema contains multiple star schemas that share normalized dimension tables to avoid redundancy. The galaxy schema is best suited for highly complex data warehouses, but users may experience slower query performance.
A typical data warehouse architecture has several components that work together to store, manage, process and deliver data for analytics.
ETL tools extract data from source systems, transform it in a staging area and load it into the data warehouse. In ELT, the data is transformed after being loaded into the warehouse. A data processing framework such as Apache Spark can help manage the transformation of data.
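As a sketch of where such a framework fits, the PySpark snippet below standardizes extracted records before the load step. It assumes a local Spark installation; the column names and records are invented.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("warehouse-transform").getOrCreate()

# Hypothetical extracted records awaiting transformation.
raw = spark.createDataFrame(
    [(" Alice ", "emea", 1200.5), ("Bob", "amer", 980.0)],
    ["customer", "region", "amount"],
)

# Standardize fields before the load step.
cleaned = raw.select(
    F.trim("customer").alias("customer"),
    F.upper("region").alias("region"),
    "amount",
)
cleaned.show()

spark.stop()
```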
A connectivity layer for application programming interfaces (APIs) can help the warehouse pull data from and integrate with operational systems. APIs can also provide access to visualization and advanced analytics tools.
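In practice, this connectivity layer often comes down to HTTP calls such as the sketch below. The endpoint URL, query parameter and response fields are hypothetical, and the requests library is assumed.

```python
import requests

# Hypothetical operational-system endpoint exposing recent orders.
API_URL = "https://api.example.com/v1/orders"

response = requests.get(API_URL, params={"updated_since": "2024-01-01"}, timeout=30)
response.raise_for_status()

# Each record would then flow into the warehouse's staging area.
for order in response.json():
    print(order["id"], order["total"])
```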
The data layer (or central database) is the heart of the data warehouse. Here, the system integrates and stores data from various sources, such as business applications, email lists, websites or other databases. This layer is supported by ETL or ELT data pipelines and uses a relational database management system (RDBMS) or cloud data warehouse platform. Built-in data governance and security capabilities can partition data so users only access what they need.
Metadata is data about data, describing the data that is stored in a system to make it searchable and usable for analytics. It includes technical metadata (such as table structure and data type) and descriptive metadata (such as author, creation date or file size). Metadata is key to effective data governance and data management.
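As a simple illustration, the hypothetical record below shows technical and descriptive metadata for a single warehouse table side by side; all field names are invented.

```python
# Hypothetical metadata record for one warehouse table.
table_metadata = {
    "technical": {
        "table": "fact_sales",
        "columns": {"revenue": "REAL", "product_id": "INTEGER"},
    },
    "descriptive": {
        "author": "data engineering team",
        "created": "2024-06-01",
        "description": "Daily sales facts by product and date",
    },
}

# A data catalog search might match on descriptive fields like this one.
print(table_metadata["descriptive"]["description"])
```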
Some data warehouses provide a sandbox, which is a walled-off testing environment containing a copy of the production data and relevant analysis tools. Data analysts and data scientists can experiment with new analytical techniques in the sandbox without impacting the live operations of the data warehouse for other users.
Access tools connect to data warehouses and provide a business user-friendly front end. Business users and data analysts can use dashboards, apps and data visualization tools to interact with data and extract insights. Examples of these tools include Tableau, Looker and Qlik.
There are three main types of data warehouses:
An enterprise data warehouse (EDW) is a data warehouse that services the entire enterprise. It acts as a centralized information repository of historical data for all teams and subject areas. An enterprise data warehousing environment may also include an operational data store (ODS) and department-specific data marts.
An operational data store (ODS) contains the most recent snapshot of operational data. An ODS is updated frequently, enabling quick access to near-real-time data. Organizations often use an ODS for day-to-day operational decision making and real-time analysis. It can also be a data source for an EDW or other data systems.
A data mart is a subset of an existing data warehouse (or other data sources) and contains data tailored to a specific business line or department rather than the entire enterprise. For example, a company may have a data mart that aligns with the marketing department. Those users can access more-focused insights for customer segmentation and campaign performance without navigating the broader enterprise data set.
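One common way to carve a data mart out of a warehouse is with a database view, as in the hypothetical SQLite sketch below; the enterprise table and its contents are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enterprise_sales (region TEXT, department TEXT, revenue REAL);
INSERT INTO enterprise_sales VALUES
    ('East', 'marketing', 100), ('West', 'finance', 250), ('East', 'marketing', 75);

-- A data mart as a focused slice of the enterprise warehouse:
-- marketing users see only their department's data.
CREATE VIEW marketing_mart AS
    SELECT region, revenue FROM enterprise_sales WHERE department = 'marketing';
""")

print(conn.execute("SELECT * FROM marketing_mart").fetchall())
```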
The terms data warehouse, database, data lake and data lakehouse are sometimes used interchangeably, but important differences exist.
A database is like a filing cabinet built primarily for automated data capture and fast transaction processing. It typically serves as the focused data store for a specific application. Data warehouses store data from any number of applications in an organization and are optimized for predictive analytics and other advanced analysis.
A data lake is a low-cost storage solution for massive amounts of raw data and uses a schema-on-read approach rather than a predefined schema. Data lakes can store structured data, unstructured data and semi-structured data, such as documents, videos, Internet of Things (IoT) logs and social media posts.
They can be built on big data platforms such as Apache Hadoop or cloud object storage services such as Amazon Simple Storage Service (Amazon S3). They don’t typically clean, validate or normalize data for analytics, as a warehouse does.
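Schema-on-read means structure is imposed only when the data is queried, not when it is written. The small sketch below illustrates the idea with invented raw JSON records of the kind an IoT feed might land in a lake.

```python
import json

# Raw, heterogeneous records as they might land in a data lake:
# no schema was enforced at write time, and fields vary per record.
raw_lines = [
    '{"device": "sensor-1", "temp_c": 21.5}',
    '{"device": "sensor-2", "temp_c": 19.0, "battery": 0.84}',
]

# Schema-on-read: structure is applied only now, at query time.
for line in raw_lines:
    record = json.loads(line)
    print(record["device"], record.get("temp_c"))
```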
A data lakehouse merges aspects of data warehouses and data lakes—namely, the low-cost flexibility of a lake and the high performance of a warehouse. By combining the key features of lakes and warehouses into one data solution, lakehouses can help accelerate data processing for large amounts of structured, unstructured and real-time data streams to support machine learning, data science and artificial intelligence (AI) workloads.
Data lakehouses might also add features such as shared metadata and distributed structured query language (SQL) engines.
Data warehouses make insights and information available to users across an organization, offering many benefits, such as:
Through ELT or ETL processes, data warehouses prepare incoming data before it is stored in the warehouse. This preparation includes data quality methods such as data cleansing, standardization and deduplication. Robust data governance policies and practices can also help ensure data accuracy and integrity for all users.
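As a minimal illustration of cleansing, standardization and deduplication, the pandas sketch below normalizes invented customer records and drops a duplicate; pandas is assumed to be installed.

```python
import pandas as pd

# Incoming records with inconsistent formatting and a duplicate email.
incoming = pd.DataFrame({
    "email": ["A@Example.com", "a@example.com ", "b@example.com"],
    "country": ["us", "US", "gb"],
})

# Standardize the fields, then deduplicate on the cleaned email.
cleaned = incoming.assign(
    email=incoming["email"].str.strip().str.lower(),
    country=incoming["country"].str.upper(),
).drop_duplicates(subset="email")

print(cleaned)
```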
By integrating high-quality data into a single store, organizations create a comprehensive and reliable single source of truth—which helps to eliminate data silos. This central repository enables business users to confidently access all of the organization’s pertinent data and use it for business decision-making. An enterprise-grade data warehouse might also include support for open source formats, such as Apache Iceberg, Parquet and CSV, enabling further data access and sharing across the enterprise.
Modern data warehouses can support various AI and machine learning workflows by providing clean, reliable data. Data scientists can use cleansed and validated warehouse data to build proprietary generative AI models or fine-tune existing models to better serve their unique business needs.
An AI-ready data warehouse should be able to collect, cleanse, organize and structure data, as well as facilitate the flow of data to AI and machine learning platforms. However, not all modern data warehouses are optimized for AI workloads. Data lakehouses are increasingly becoming the data platforms of choice for AI infrastructure.
A data warehouse centralizes and cleanses data from different sources to create a single source of truth, giving organizations a comprehensive, reliable view of enterprise data. Self-service BI tools enable users across the enterprise to access and run analytical queries on this aggregated data.
In this way, data warehouses make it possible for business users of any technical skill level to discover and report on themes, trends and aggregations. Business leaders can use these insights to make better-informed decisions and forecasts based on hard evidence in virtually every area of the organization, from business processes to financial management and inventory management.
Data warehouses can serve industry-specific uses, such as:
The analytical capabilities of a data warehouse can help governments better understand complex phenomena such as crime, demographic trends and traffic patterns.
The ability to centralize and analyze disparate data—such as billing and diagnostic codes, patient demographics, medications and test results—can help healthcare providers gain insight into patient outcomes, operational efficiencies and more.
Travel and hospitality organizations can use historical data pertaining to travel and accommodation choices to more precisely target advertising and promotions to their customers.
Large manufacturing companies that generate huge volumes of data can use data warehouse solutions to build out data marts tailored to the needs of each department.