 | Data warehousing – An overview
In order to quickly respond to the market and keep a competitive
advantage, organizations need an integrated and historic view of their
businesses. To support this requirement, an IT platform can be created
to consolidate the data from the different heterogeneous systems
across the company's business functions and operations, and even
external sources, into one or more common databases called a data
warehouse.
Data warehouses are designed with the business-context in mind (its
terms, data, and performance measures) and typically take the form of
multi-dimensional databases to better support decision-support
applications.
In the broadest sense of the term, a data warehouse refers to a
database that contains very large stores of historical data. The data
is stored as a series of snapshots and aggregated data at a specific
time, which allows a user to reconstruct history, make trend and
projection analysis, discover gaps and problems, and make accurate
comparisons between different time periods using Business Intelligence
(BI) tools available in the market, like IBM Cognos. In addition, by
using Data Mining tools, the business analysts can analyze the
patterns in the data values and even predict business behavior based
on different variables. A data warehouse includes processes in place
to clean, integrate, and transform the data that it retrieves before
it is loaded into the warehouse database so the repository contains a
truthful integrated business view of a specific function or view
across multiple functions within the organization.
The term data warehouse can mean different things to different people.
A data-warehousing environment can encompass any of the following
forms:
- Data warehouse
- Data mart
- Operational data store
- Repository
Please, refer to the "Overview of Data Warehousing" page in the IDS
V11.50 Information Center for further explanation on these types of
repositories (see Resources).
Figure 3 summarizes the software technologies
and tools that are typically present or needed in the infrastructure
of an end-to-end data warehousing solution. It also illustrates how
the data flows from the sources to the targets in the process of
creating and maintaining the data warehouse. As we move from the
operational to the analytical world, the database systems' purpose and
design change paradigms from OLTP to OLAP. The data warehouse itself
can comprise one or more databases of the repositories types (ODS,
data staging repository, data mart, data warehouse) mentioned earlier.
Some of these could be the final data warehouse version that the
front-end BI tools will use, and some could be used as staging
databases in more complex data consolidation projects. Depending on
the business need, you may want to put different warehouse
repositories available to different business intelligence tools
(reporting, dashboard, mashups, OLAP analysis and data mining). Last
but not least, the extract-transform-load (ETL) or
extract-load-transform (ELT) tools are needed to automate the initial
and periodic tasks of consolidating and summarizing data from the
different data sources (typically, OLTP database systems, external
repositories and files) into the warehousing databases. Depending on
the DW project, the ETL/ELT tools and processes can be deployed in
several places and can be working in several stages to deal with the
complexity of data extraction, transformation, cleansing, aggregation,
and loading.
Figure 3. Overview
of a typical Data Warehousing Infrastructure
The Informix Warehouse role in
an Informix-based DW solution
To be effective, a data warehouse has to contain trusted information.
It also has to be integrated, subject-oriented, non-volatile, and
time-variant. Therefore, the design of the data warehouse and the
periodic process to populate it is very important. Designers must make
sure the design is consistent with the business queries that will be
conducted within the warehouse. Another important aspect of data
warehouses is data acquisition, which can be defined as transferring
data from a source to the target warehouse. Data acquisition is one of
the most expensive parts of building a data warehouse. This process
will often be conducted with an extract-transform-load (ETL) or
an extract-load-transform (ELT) tool.
It is here, in the data warehouse design and ELT processes, where
Informix Warehouse Feature helps. This tutorial series explains by
example how to use the tools available within the Informix Warehouse
Feature to accomplish these tasks for Informix warehouse repositories.
Once the initial data has been transferred to the data warehouse, the
process must be repeated consistently to incrementally update the
warehouse according to the new data and changes in the operational
sources. Data acquisition is a continuous process, and the goal is to
make sure the warehouse is updated on a regular basis. When this
update occurs in real time, the process is then called change data
capture. This has become a separate field, and there are a number of
products currently available to deal with it, including IBM InfoSphere
Change Data Capture® (CDC), which can use Informix as either
source or target system.
If very sophisticated and specialized ETL and data cleansing tools are
needed to satisfy all types of data integration needs, the IBM
InfoSphere DataStage® and QualityStage® products, which
also work with Informix, are ideal for this purpose. InfoSphere CDC,
DataStage, and QualityStage are products in the IBM InfoSphere®
Information Server offering.
Differences between OLTP and OLAP: A change in approaches
IT systems are divided into transactional (OLTP) and
analytical (OLAP). In general, you can assume that OLTP
systems provide source data to data warehouses — the OLAP
systems — that will help analyze it.
OLTP (On-line Transaction Processing) is characterized by a
large number of short on-line transactions (INSERT, UPDATE, DELETE)
predefined and pre-optimized within the application. The main emphasis
for OLTP systems is put on very fast query and transaction processing
to support the mission-critical operation of the organization. ERP,
SCM, CRM systems to support Manufacturing, Sales, Finance/Accounting,
Distribution, Payroll, and Marketing functions are typical examples of
OLTP systems.
OLAP (On-line Analytical Processing) is characterized by
relatively low volume of transactions. Queries are often very complex,
performed ad-hoc, and involve aggregations. For OLAP systems, a
response time is an effectiveness measure because the users of these
systems are typically business analysts and executives who need a
strategic view of several functions of the business.
The design of a data warehouse / OLAP database is fundamentally
different than a OLTP database. The data warehouse is specifically
designed in the form of dimensional databases, where the tables will
be structured in cubes, star, or snowflakes schemas to facilitate fast
query times and multi-dimensional analysis. In these schemas, the
measures of the business (for example, cost, revenue) are stored in
what are called the fact tables, whereas the data based on
which we analyze those measures (for example, time, geography,
product, customer) is stored in what are called the dimension
tables. For more information, you can consult the "IBM
Informix Database Design and Implementation Guide – Dimensional
Databases" in the IDS V11.50 Information Center (see
Resources).
Table 1 summarizes the major differences between
OLTP and OLAP systems design. (Click
here to see a detailed
version of this comparison.) The detailed table is partially based on
information from the book Building the Data Warehouse (William
H. Inmon; Wiley, John & Sons, Incorporated, March 2002; ISBN:
9780471081302) and on www.rainmakerworks.com.
Table 1. Differences between OLTP
and OLAP systems
| Characteristic | Online Transactional Systems (OLTP)
Operational Systems | Online Analytical Processing (OLAP) Decision
Support Systems (DSS) |
|---|
| Main Purpose | To support day-to-day operations, control, and run fundamental
business tasks | To support managerial, strategic planning and problem solving,
and decision support needs | | Queries | Short transactions; relatively simple SQL | Longer transactions; complex SQL with analytics | | Updates | Random updates; few rows accessed | Sequential/bulk updates; many rows accessed | | Processing speed and response times | Sub-second response time | Seconds to minutes response time | | Database model | ER Modeling; minimizes redundancy | Dimensional Modeling; okay to have redundancy | | Data normalization | Normalized data (5NF); minimizes duplicates | De-normalized data (3NF); duplicates are okay | | Indexes | Few indexes; avoids index maintenance cost in writes | Okay to have more indexes; mostly read-only operations | | Workload predictability and tuning | Pre-compiled queries; repeated execution of queries | Ad-hoc queries; unpredictable load |
After this brief review of data warehousing concepts and OLAP design,
and putting IDS and the Informix Warehouse Feature in context, let's
take a look at the fictional business scenario you'll use in this
tutorial and the rest of series to illustrate the capabilities of the
Informix Warehouse tools.
|  |