Lesson 2. DW as an Enterprise Memory
Slide 8. Kimball and Inmon Schools
Slide 9. Bill Inmon vs. Ralph Kimball
Slide 10. DW: Similarities and Differences of Inmon and Kimball
Slide 5. Evolution of OLTP and OLAP understandings
Slide 6. Inmon’s Definition of Data Warehouse
Slide 7. Kimball’s DW definition
Slide 1. DW as an Enterprise Memory
Slide 2. Predecessors of Data Warehouse
Slide 3. Generations of Data Warehouses
Recommended EDW Architecture
of an enterprise data warehouse (EDW) should satisfy many functional
and nonfunctional requirements that depend on the specific tasks solved
by the EDW. As there is no generic bank, airline, or oil company, so
there is no single solution for the EDW to fit all occasions. But the
basic principles that EDW must follow can still be formulated.
and foremost it is the data quality that can be understood as complete,
accurate and reproducible data, delivered in time where they are
needed. Data quality is difficult to measure directly, but it can be
judged by the decisions made. That is, data quality requires investment,
and it can generate profits in turn.
Secondly, it is the
security and reliability of data storage. The value of information
stored in EDW can be compared to the market value of the company.
Unauthorized access to EDW is a threat with serious consequences, and
therefore adequate protection measures must be taken.
Thirdly, the data must be available to the employees to the extent necessary and sufficient to carry out their duties.
Fourthly, employees should have a unified understanding of the data, so a single semantic space is required.
Fifthly, it is necessary, if possible, to resolve conflicts in data encoding in the source systems. Pic. 4. Recommended EDW Architecture
The proposed architecture follows the examined principles of modular design - "
unsinkable compartments”. The strategy of "
divide and rule"
is applicable not only in politics. By separating the architecture into
modules, we also concentrate in them certain functionality to give
power over the unruly IT elements.
ETL tools provide complete,
reliable and accurate information gathering from data sources by means
of algorithms concentrated in ETL for the collection, processing, data
conversion and interaction with metadata and master data management
Metadata management system is the principal "
keeper of wisdom"
which you can ask for advice. Metadata management system supports the
relevance of business metadata, technical, operational and project
The master data system is an arbitrator for conflict resolution of data encoding.
Data Warehouse (CDW) has only the workload of reliable and secure data
storage. Depending on the tasks, the reliability of CDW can be up to
99,999%, to ensure smooth functioning with no more than 5 minutes of
downtime per year. CDW’s software and hardware tools can protect data
from unauthorized access, sabotage and natural disasters. Data structure
in the CDW is optimized solely for the purpose of ensuring effective
Data sample, restructuring, and delivery tools
(SRD) in this architecture are the only users of the CDW, taking on the
whole job of data marts filling and, thereby, reducing the user queries
workload on the CDW.
Data marts contain data in formats and
structures that are optimized for tasks of specific data mart users. At
present, when even a laptop can be equipped with a terabyte disk drive,
the problems associated with multiple data duplication in the data mart
do not matter. The main advantages of this architecture are:
- comfortable user’s operation with the necessary amount of data,
- the possibility to restore quickly the contents from the CDW in case of data marts failover,
- off-line data access when connection with the CDW is lost.
architecture allows a separate design, development, operation and
refinement of individual EDW components without a radical overhaul of
the whole system. This means that the beginning of work on the
establishment of EDW does not require hyper effort or hyper investments.
To start it is enough to implement a data warehouse with limited
capabilities, and following the proposed principles, to develop a
prototype that is working and truly useful for users. Then you need to
identify the bottlenecks and to evolve the required components.
of this architecture along with the triple strategy for data
integration, metadata, and master data , allows to reduce time and
budgeting needed for EDW implementation and to develop it in accordance
with changing business requirements.
directions of integration
10:15 – 11:15
DW as an
justification for DW
11:15 – 11:30
11:30 – 12:15
strategy for DW projects
data management with practical examples
management on practice
data gathering and analysis system
strategy on practice
09:30 – 10:15
Стратегические направления интеграции
10:15 – 11:15
ХД как корпоративная память
Экономическое обоснование ХД
11:15 – 11:30
11:30 – 12:15
Архитектуры хранилищ данных
12:15 – 13:00
Тройная стратегия создания ХД
13:00 – 14:00
14:00 – 14:45
Ведение НСИ на практических примерах
14:45 – 15:30
Управление метаданными на практике
15:30 – 16:15
Управление качеством данных
Система сбора и анализа первичных данных
16:15 – 17:00
Тройная стратегия на практике
Data Warehouse with Integration Bus
acceptance of service - oriented architecture (SOA)  has led to an
idea to use SOA in solutions for enterprise data warehousing instead of
ETL tools to extract, transform, load to a central data warehouse, and
instead of SRD tools to sample, restructure and deliver data to the data
Integration bus, which underpins the SOA, is designed
for web-services and applications integration, and provides intellectual
message routing, protocol mediation and message transformation between
service consumer and service provider applications.
glance, the functionality of service bus allows us to replace the ETL
and SRD with integration bus. Indeed, ETL performs mediation between the
central data warehouse (CDW) and data sources, and SRD is the mediator
between the CDW and data marts. It would seem that the replacement of
the ETL and SRD with integration bus can benefit from the flexibility
provided by the bus for application integration.
the CDW, the operational data store (ODS), the temporary storage area,
metadata and master data management systems call the bus as independent
applications with queries to update the data from data sources.
of all, the load on data sources will increase by many times, since the
same information will be repeatedly transmitted by requests of the CDW,
ODS, the temporary storage area and metadata and master data management
systems. An obvious solution is to create a data store for the
integration bus to cache queries.
Pic. 3. Data Warehouse with Integration Bus
the data gathering procedures, previously centralized in the ETL, will
now be scattered over the application requesting the data. The
discrepancy of various data gathering procedures for the CDW, ODS,
metadata and master data management systems will arise sooner or later.
Data collected by different methods at different time intervals,
processed by different algorithms contradict each other. Thereby the
main goal of creating the CDW as a single source of consistent
non-contradictory data will be destroyed.
The consequences of
SRD replacement with the integration bus are not so dramatic. CDW must
be transformed into a service in order to respond to data marts’
requests for data, directed through the integration bus. This means that
the data warehouse must conform to the most common style of web -
services and support HTTP / HTTPS protocols and SOAP / XML message
format. This approach works well for short messages, but usually data
marts require a large amount of data to pass through an integration bus.
The task can be solved by using the binary objects
transmission. The necessary data restructuring can not be performed by
the integration bus, and must be carried out either in the CDW, or in
the data marts. Data restructuring inside CDW is an unusual
functionality for CDW, as it must be aware of all data marts, and has to
carry an additional workload, irrelevant to its main goal: reliable
data storage. Data restructuring inside data marts requires a direct
access from DM to CDW. In many cases it’s unacceptable for security
reasons. This function can be realized by a proxy service that receives
data and transmits it to the data marts after the restructuring. So, we
return to the idea of SRD tool just supplied with bus interface.
integration bus can be used in the EDW architecture as a transport
medium between the data sources and the ETL and between SRD and data
marts in those cases where the components of EDW are separated
geographically and are behind firewalls in accordance with strict
requirements for data protection. In this case, it is sufficient for
interoperability that the exchange was enabled over HTTP / HTTPS
protocols. All data collection, transformation and dissemination logic
should still be concentrated in ETL and SRD.
N., Bose S., Fiammante M, Jones K., Shah R. “Service-Oriented
Architecture Compass: Business Value, Planning, and Enterprise Roadmap”,
IBM Press, 2005.
DW with intermediate application data marts
The following assumptions were the rationales for this architecture’s invention.
companies still deploy and operate independent disparate application
data marts. Data quality in these data marts can meet the requirements
of analysts who are working with DM.
stakeholders are confident that enterprise data warehouse implementation
is a deadly technical trick with unpredictable consequences. As a
matter of fact, the difficulties of EDW development and implementation
are not technical, but are associated with poor project organization and
with the lack of involvement of experts - future EDW users. However
project team tries to avoid nonsignificant technology issues and to
simplify up-to-the-minute tasks, instead of improving project
- The requirement for quick results. The
necessity to report on a quarterly basis causes a need for quick
tangible results. That’s why project team is not immune to the
temptation to develop and implement a restricted solution with no
relation to other tasks.
Following these principles
either accidentally or deliberately, companies start data integration
with introducing the separate independent data marts, in the hope that
the data they contain will be easily, simply and quickly integrated when
required. The reality is much more complicated. Although the quality of
data in data marts can satisfy their users, this information is not
consistent with data from other DMs. So reports, prepared for the top
management and decision makers, can not be reduced to an uncontroversial
The same indicators can be calculated by different
algorithms based on different data sets for various periods of time.
Figures with the same name may conceal different entities, and vice
versa, the same entity may have different names in various DMs and
reports. Pic. 2. DW with intermediate application data marts
is a lack of common data sense. Users of independent data marts speak
different business languages, and each DM contains its own metadata.
problem lies in the difference of master data, used in the independent
data marts. The differences in the data encoding, used codifier,
dictionary, classifiers, identifiers, indices, glossaries make it
impossible to combine these data without serious analysis, design and
development of master data management tools.
organization already has approved plans, budget and timeline for EDW
which is based on independent data marts. Management expects to get
results quickly and inexpensively. Developers provided with a scarce
budget, are forced to implement cheapest solutions. This is a proven
recipe for creation a repository of inconsistent reports. Such
repository contradicts the idea of data warehousing as a single and sole
source of purified, coherent and consistent historical data.
neither the company management nor the repository users are inclined to
trust the information contained therein. Therefore, the total
rebuilding of DW is required that usually implies that new EDW should be
created, which stores report figures indexes, rather than full reports.
This allows to aggregate figures indexes into consistent reports.
EDW rebuilding is impossible without metadata and master data
management systems. Both systems will impact only the central data
warehouse (CDW), as independent data marts contain their own metadata
and master data.
As a result, management and experts can get
coherent and consistent records, but they can not trace the data origin,
due to discontinuity in the metadata data management between
independent data marts and CDW.
Thus, the desire to achieve
immediate results and to demonstrate rapid progress leads to denial of
unified, end-to-end management of metadata and master data. The result
of this approach is the semantic islands, where users speak a variety of
Nevertheless, this architecture can be
implemented, where a single data model is not necessary, or is
impossible, and where a relatively small amount of data must be
transferred to CDW without knowledge of their origin and initial
components. For example, an international company, operating in
different countries, has already implemented several national data
warehouses that follow local legal requirements, business constrain and
financial accounting rules. CDW can require only piece of information
from the national DWs for corporate reporting. There is no need to
develop a unified data model, because it would not be demanded at the
Certainly, similar scheme requires a high degree
of confidence in national data, and can be used, if intentional or
unintentional distortion of the data will not lead to serious financial
consequences for the entire organization.
of three articles is devoted to architectures of data warehouse (DW) and
their predecessors. The abundance of various approaches, methods and
recommendations makes a mess of concepts, advantages and drawbacks,
limitations and applicability of specific architecture solutions. The
first article  is concerned with the evolution of OLAP role
understanding, of DW architecture components, of virtual DW and
independent data marts. The second article  considers the Centralized
DW (CDW) with ETL (Extract, Transform, Load), CDW with ELT (Extract,
Load, Transform), CDW with operational data store, and extended model
with data marts. This article discusses DW with intermediate Application
Data Marts, centralized ETL with parallel DW and data marts; DW with
integration bus, and the recommended DW architecture.
Centralized ETL with parallel DW and data martsPic.1. Centralized ETL with parallel DW and data marts
this case the whole architecture of EDW is built around the ETL
(Extract, Transform and Load) system. Information from disparate sources
goes to ETL which purifies and harmonizes data and loads it to a
central data warehouse (CDW), to operational data store (ODS), if any,
and, if necessary, to temporary storage area. This is a common practice
in EDW development. But downloading data from the ETL to the data marts
directly is unusual.
In practice, this architecture is a result
of users’ requirement to access analytical data as soon as possible,
without time delay. Operational data store does not solve the problem,
as users may be located in distant regions, and they require territorial
data marts. The security limitations on the deployment of heterogeneous
information in ODS may be another rationale for this architecture.
architecture has a trouble spot: one of the problems of its operation
is data recovery difficulty after a crash of data marts, directly
supplied from ETL. The point is that ETL tools are not designed for long
term storage of extracted and cleaned data. Transactional systems tend
to focus on ongoing operations. Therefore, in case of data losses in
data marts directly associated with ETL, one has to either extract
information from the transactional systems’ backup, or organize
historical archives of data sources systems. These archives require
funds for their development and operational support, and they are
redundant, with a corporate standpoint, since they duplicate functions
of EDW, but they are designed only to support a limited number of data
As another approach, sometimes these data marts are
connected both to ETL directly and to data warehouse, which leads to
confusion and misalignment of the results of analytical work. The reason
is that data coming in EDW, as a rule, pass additional checks for
consistency with the already loaded data. For example, financial
document can be loaded with the requisites, almost coinciding with the
document received by the EDW before. The ETL system, not having the
information about all downloaded data, can not reveal whether the new
document is a mistake or a result of legitimate correction.
inside the data warehouse, data verification procedures could reveal
such uncertainty. The new data will be discarded in case of errors.
Contrary, if it is a required correction, the changes will affect both
these numbers, and the corresponding aggregate figures.
the information, loaded into data mart from ETL directly, may contradict
the data received from the EDW. Sometimes, to solve this contradiction,
the identical data verification algorithms are implemented in the data
marts and in EDW. The disadvantage is the need to support and to
synchronize the same algorithms in the EDW and in data marts, fed from
the ETL directly.
To sum up, we can say that parallel data marts
lead to additional data processing, to organization and maintenance of
excess operating archives, to support of duplicate applications and to
decentralized data processing, which causes information mismatch.
the parallel data marts can be implemented in cases where rapid access
to analytical information is more important than disadvantages of this