A core component of business intelligence, a data warehouse pulls together data from many different sources into a single data repository for sophisticated analytics and decision support.
What is a data warehouse?
A data warehouse is a system that aggregates data from different sources into a single, central, consistent data store to support business analytics, data mining, artificial intelligence (AI), and machine learning. A data warehouse enables an organization to run powerful analytics on huge volumes (petabytes and petabytes) of historical data in ways that a standard database cannot.
Data warehouses have been a part of business intelligence (BI) solutions for over three decades, but they have evolved significantly in recent years. Traditionally, a data warehouse was hosted on-premises—often on a mainframe computer—and its functionality was focused on extracting data from other sources, cleansing and preparing the data, and loading and maintaining the data in a relational store. More recently, a data warehouse might be hosted on a dedicated appliance or in the cloud, and most data warehouses have added analytics capabilities and data visualization and presentation tools.
Benefits of a data warehouse
A data warehouse provides a foundation for the following:
- More consistent, higher-quality data: A data warehouse brings together data from multiple different sources, then cleanses it, eliminates duplicates, and standardizes it to create a single source of the truth.
- Faster, unlimited insight: Disparate data sources limit the data that can be used to support any given decision. A data warehouse makes it easier to put the weight of all a company’s data behind every decision.
- Smarter decision-making supported by cutting-edge tools: A data warehouse supports large-scale BI functions such as data mining (finding unseen patterns and relationships in data), artificial intelligence, and machine learning—tools data professionals and business leaders can use to get hard evidence for making smarter decisions in virtually every area of the organization, from business processes to financial management and inventory management
- Gaining and growing competitive advantage: All of the above combine to help an organization finding more opportunities in data, more quickly than is possible from disparate data stores.
Data warehouse architecture
Generally speaking, data warehouses have a three-tier architecture:
- The extraction tier collects, cleanses, and normalizes/transforms the data from multiple sources by using a process known as Extract, Transform, and Load (ETL) or a process known as Extract, Load, and Transform (ELT).
- The data store tier is typically a relational data store, but with schemas that support analytical processing.
- The analytics tier (or client layer) can include everything from standard querying tools to analytics, data mining, AI or machine learning capabilities, and presentation visualization tools.
ETL vs. ELT
ETL (Extract, Transform, and Load) and ELT (Extract, Load, and Transform) are methods for extracting data from its original source and integrating it into the data warehouse. The difference between the two—and the appropriateness of one or the other for your data warehouse—lies in where the data is transformed:
- ETL extracts data from various data source systems, transforms it using an intermediate transformation engine, and then loads it into the data warehouse system. Because ETL transforms data before writing it to the warehouse, it’s a better choice for loading smaller data volumes and for on-premises data warehouse solutions.
- ELT extracts data from one or multiple remote sources and then loads it into the target data warehouse without any other formatting. The transformation of data in an ELT process happens within the target database. For this reason, ETL moves data to the warehouse faster, making it a better choice for larger data volumes or cloud-based data warehouse solutions. (Also, because it doesn’t transform data in transit, ELT is the only method suitable for loading a data lake.)
Data warehouse vs. database, data lake, and data mart
Confusion often arises between the terms data warehouse, database, data lake, and data mart. While the terms are similar, important differences exist:
Data warehouse vs. data lake
A data warehouse gathers data from multiple sources into a central repository, structured using predefined schemas designed for data analytics. A data lake is a basically a data warehouse without the predefined schemas. As a result, it enables more types of analytics than a data warehouse. Data lakes are commonly built on big data platforms such as Apache Hadoop.
See the following video for more information on data lakes:
Data warehouse vs. data mart
A data mart is a subset of a data warehouse that contains data specific to a particular business line or department. Because they contain a smaller subset of data, data marts enable a department or business line to discover more-focused insights more quickly than possible when working with the broader data warehouse data set.
Data warehouse vs. database
A database is built primarily for fast queries and transactional processing, not analytics. A database typically serves as the focused data store for a specific application, whereas a data warehouse stores data from any number (or even all) of the applications in your organization.
Also important: While a database captures and stores data from a single (usually current) point in time, a data warehouse encompasses current and historical data required for predictive analytics, machine learning, and other advanced analysis.
Cloud data warehouse
A cloud data warehouse is a data warehouse specifically built to run in the cloud, and it is offered to customers as a managed service. Cloud-based data warehouses have grown more popular over the last five to seven years as more companies use cloud services and seek to reduce their on-premises data center footprint.
With a cloud data warehouse, the physical data warehouse infrastructure is managed by the cloud company, meaning that the customer doesn’t have to make an upfront investment in hardware or software and doesn’t have to manage or maintain the data warehouse solution.
Data warehouse software (on-premises/license)
A business can purchase a data warehouse license and then deploy a data warehouse on their own on-premises infrastructure. Although this is typically more expensive than a cloud data warehouse service, it might be a better choice for government entities, financial institutions, or other organizations that want more control over their data or need to comply with strict security or data privacy standards or regulations.
Data warehouse appliance
A data warehouse appliance is a pre-integrated bundle of hardware and software—CPUs, storage, operating system, and data warehouse software—that a business can connect to its network and start using as-is. A data warehouse appliance sits somewhere between cloud and on-premises implementations in terms of upfront cost, speed of deployment, ease of scalability, and management control.
Data warehouse and IBM cloud
IBM offers on-premises, cloud, and integrated appliance data warehouse solutions—all built on a data analytics and artificial intelligence foundation optimized for predictive insight and data-driven decision making. All three are part of the IBM DB2 family of products, offering a common SQL engine to streamline queries and machine learning capabilities that enhance data management performance.
IBM Db2 Warehouse on Cloud is a fully managed, elastic cloud data warehouse that delivers independent scaling of storage and compute, featuring a highly optimized columnar data store, actionable compression, and in-memory processing to supercharge your analytics and machine learning workloads. Smaller data marts and spin ups can add Flex One, an elastic data warehouse built for high-performance analytics, deployable on multiple cloud providers, starting at 40 GB of storage.
Netezza Performance Server, the next evolution of the IBM Netezza appliance, builds on the hyper-converged architecture of the IBM Cloud Pak for Data System to provide a cloud native decision support system for your enterprise’s most complex analytics.
And IBM Watson Studio, a data science and machine-learning offering, empowers organizations to tap into data assets and inject predictions into business processes and modern applications.
For more information on data warehouses, sign up for an IBMid and create your IBM Cloud account.