Technological building blocks of a Data-Driven Enterprise


The goal of this part of our series is to introduce core concepts and building blocks that are relevant for becoming data-driven. We do not claim to author the new Data and AI almanac; however, some terms we use may require introduction. The main purpose here is to give a one-stop shop for all concepts and definitions used throughout our series; it will be expanded as we add new articles and receive feedback. The structure is oriented towards a taxonomy for intuitive navigation.

The History of Managing Data

As soon as data became available in computer systems, the need arose to analyze it. To separate the analytics workload from the operational systems, decision support systems were born, which tried to optimize the data for analytics. Since different decision support systems started to provide different answers to the same query, the next step was the creation of the enterprise data warehouse by Bill Inmon and Ralph Kimball, which was meant to contain the single point of truth for all data in the enterprise. To achieve this, elaborate ETL (Extract – Transform – Load) processes were created to consolidate the different data sources into a consistent view. But several new developments made the goals of the enterprise data warehouse more difficult to achieve: The amount of data was growing very fast, actually too fast; new types of data, semi-structured and unstructured data, became important for analysis; and data had to be analyzed as soon as it was created in the sources. The enterprise data warehouse was no longer agile enough for this. The data lake, usually based on Hadoop technology, tried to address this: Data of any kind was copied and stored in the original source format on cheap storage to make it available for downstream analysis and machine learning. A zoo of tools was created to handle new requirements and new kinds of data. But new problems appeared: The large number of tools in the data lake was difficult to integrate, and insufficient or missing governance created data swamps.

Therefore, new approaches have recently shown up: Data lakehouses try to combine the best of data warehouses and data lakes while avoiding the problems of these traditional approaches. Virtual approaches try to gain agility by providing a logical view of the data in the source systems for analytics, and distributed approaches like Data Mesh and Data Fabric try to provide high-quality data without the need to bring it into a central repository first.

Too many technical terms? Ok, then let’s try to paint a picture of some of the most important technology concepts for Data-Driven Enterprises and explain each of them step by step.

Technologies to Consider as a Data-Driven Enterprise

Overview of the components of a Data-Driven Enterprise

Data Repositories

Each data-driven enterprise needs, at the very basic level, an option to store data for operational as well as analytical purposes. The systems used for storing this data are called data repositories or data stores. In the next sections we discuss the main kinds of technologies for data repositories.


Relational Database Management Systems (RDBMS)

The data repository which has been dominating the past 30 to 40 years is the Relational Database Management System (RDBMS). RDBMS were not the first kind of database management system invented: hierarchical and network database management systems had existed before. The main benefit of the relational model, which was invented by Edgar F. Codd in the 1970s, is that it allows the storage of data in a way that it can be used as a general-purpose system by many different applications. A database management system not only supports the storage of data (which a file system also does) but supports, as the name implies, the full management of this data, which includes defining a schema for the data and inserting, changing, and querying the data. The language SQL, standardized by ISO and IEC, is nowadays the main language used for these purposes.
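
The full management cycle of schema definition, modification, and query can be illustrated with Python's built-in sqlite3 module; the table and column names below are invented for illustration:

```python
import sqlite3

# In-memory relational database as a stand-in for a full RDBMS.
conn = sqlite3.connect(":memory:")

# DDL: define a schema for the data.
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")

# DML: insert and change data.
conn.execute("INSERT INTO customer (id, name, country) VALUES (1, 'Acme GmbH', 'DE')")
conn.execute("UPDATE customer SET country = 'AT' WHERE id = 1")

# Query: the same general-purpose store serves many different applications.
row = conn.execute("SELECT name, country FROM customer WHERE id = 1").fetchone()
print(row)  # ('Acme GmbH', 'AT')
```

The same SQL statements would work, with minor dialect differences, against most relational systems.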


NoSQL Databases

Over the past years, the advent of microservices architectures for applications has led to the introduction of many new special-purpose data repositories. They are usually called NoSQL systems to distinguish them from RDBMS, which use SQL as their main query language. Since many NoSQL systems have added SQL interfaces over time, NoSQL is nowadays mostly interpreted as "not only SQL". There are different types of NoSQL databases, for example document-oriented NoSQL databases, key-value stores, column-family stores, time-series databases, graph databases, and many more. Many NoSQL systems chose to support a different subset of the CAP theorem capabilities of consistency, availability, and partition tolerance than RDBMS. They often sacrifice the possibility to join data for better scalability. Therefore, they are sometimes called "no join" systems.
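
The restricted interface of a key-value store can be sketched in a few lines; this toy class only illustrates the trade-off, not any particular NoSQL product, and all names are invented:

```python
import json

class KeyValueStore:
    """Minimal sketch of a NoSQL key-value store: values are opaque
    documents addressed by key; there is no join and no ad-hoc query."""
    def __init__(self):
        self._data = {}

    def put(self, key, document):
        # Documents are stored as serialized blobs, schema-free.
        self._data[key] = json.dumps(document)

    def get(self, key):
        raw = self._data.get(key)
        return json.loads(raw) if raw is not None else None

store = KeyValueStore()
store.put("order:42", {"customer": "cust:7", "total": 99.5})
print(store.get("order:42")["customer"])  # cust:7
```

Combining an order with its customer record would require a second lookup in application code, which is exactly the join work that such systems push to the client in exchange for scalability.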

Multi-Model Databases

The latest trend is Multi-Model database systems (sometimes also called Multi-Modal database systems). They try to get the advantages of RDBMS and NoSQL systems by combining the diverse capabilities of several NoSQL systems and RDBMS in a single data repository, while avoiding the disadvantages of the zoo of NoSQL systems with its many different backup/restore, availability, and disaster recovery mechanisms.

Object Storage

Data repositories often have the requirement to scale the resources used for compute and storage independently. When compute and storage are separated, several compute/processing engines can share the same data. A storage type very well suited for that is object storage, since it can be accessed from many compute nodes at the same time. Object storage uses global identifiers for accessing data objects and is capable of storing metadata alongside the actual data object. Object storage services often provide a high degree of system or service availability, which other storage types may not be capable of. At the same time, object storage does not support all operations that file systems usually support. For example, it is not possible to update an existing object within an object storage; instead, a new object is created with a new global identifier or version.
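
The write-once behavior can be modelled with a toy object store; the identifiers and the versioning scheme below are invented and only illustrate the principle:

```python
import uuid

class ObjectStore:
    """Toy model of object storage: objects are immutable; writing to an
    existing key creates a new version rather than updating in place."""
    def __init__(self):
        self._objects = {}   # (key, version_id) -> (data, metadata)
        self._latest = {}    # key -> version_id of the newest write

    def put(self, key, data, metadata=None):
        version_id = uuid.uuid4().hex  # new global identifier per write
        self._objects[(key, version_id)] = (data, metadata or {})
        self._latest[key] = version_id
        return version_id

    def get(self, key, version_id=None):
        vid = version_id or self._latest[key]
        return self._objects[(key, vid)]

store = ObjectStore()
v1 = store.put("raw/orders.csv", b"id,total\n1,9.5\n", {"format": "csv"})
v2 = store.put("raw/orders.csv", b"id,total\n1,9.5\n2,3.0\n")
assert v1 != v2  # the "update" produced a new version, not a changed object
print(store.get("raw/orders.csv", v1)[0])  # the old version is still readable
```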


External data may not only be stored on object storage but also locally in files on the servers of a data repository. These files are used for exchanging data between systems and are often stored in the same formats as data on object storage (e.g., CSV, Parquet, ORC, Avro). The filesystem on which these files are located may be a POSIX filesystem but may also have a different interface, e.g., HDFS.

Data repository workload

Data repositories are designed for various kinds of workloads. At a high level, usually the following three kinds of workloads are identified:


OLTP (OnLine Transaction Processing)

OLTP systems are designed for processing the operative workload of an enterprise. The typical operation on these systems is the lookup or modification of a small set of data records. The typical duration of such an operation is measured in milliseconds, but there are many of them which need to be processed concurrently. These operations are executed within so-called transactions, which ensure the ACID properties (Atomicity, Consistency, Isolation, Durability).
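
The atomicity property can be demonstrated with sqlite3, whose connection context manager commits on success and rolls back on error; the account transfer is a made-up example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()

# A transfer must be atomic: either both updates happen, or neither.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE account SET balance = balance - 150 WHERE id = 1")
        cur = conn.execute("SELECT balance FROM account WHERE id = 1")
        if cur.fetchone()[0] < 0:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE account SET balance = balance + 150 WHERE id = 2")
except ValueError:
    pass  # the rollback restored both balances

print(conn.execute("SELECT balance FROM account ORDER BY id").fetchall())
# [(100,), (0,)]
```

Because the failed transfer was rolled back as a whole, no money was created or lost, which is exactly what the "A" in ACID guarantees.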

OLAP (Decision Support Systems)

Decision support systems are used for analyzing data. Sometimes the abbreviation OLAP (OnLine Analytical Processing) is used for decision support systems to highlight the difference from OLTP, but sometimes OLAP is also used in a narrower sense for a subset of decision support systems.

Decision support systems include data warehouses, data marts, data lakes, and data lakehouses.

General architecture of a decision support system


In the past, OLTP for data processing and OLAP for data analytics have been implemented in separate systems. This was done so that analytical workloads, which require read access to many data records, do not negatively interfere with the performance of operational systems supporting OLTP transactions on single or small sets of records. A system which tries to handle OLTP and OLAP workloads at the same time is often called an HTAP system. HTAP stands for Hybrid Transactional and Analytical Processing.

Usages of Decision Support Systems

Business Intelligence (BI)

Business Intelligence uses analytical tools to analyze data and deliver actionable information that helps executives and managers make informed business decisions. Organizations collect raw data from enterprise IT systems (e.g., operational data) and external sources, store it in data warehouses, prepare it for analysis, run queries against the data, and create reports and data visualizations in the form of BI dashboards. Reports and dashboards help business users with operational decisions and strategic planning by identifying trends and patterns (e.g., inefficiencies) in the data (comparing against metrics/KPIs). Commonly used tools include Cognos, MicroStrategy, Tableau, QlikView, and Microsoft Power BI.

Advanced Analytics

Advanced Analytics (AA) and Business Intelligence (BI) are related approaches to data analysis. However, AA goes a step further than BI. BI is a reactive approach and is about reporting the status of what has happened, whereas AA is a more proactive approach and has a stronger focus on predicting the future through the use of statistical algorithms.

Predictive Analytics

Predictive analytics uses statistical algorithms to forecast future trends. It helps identify patterns, potential risks, and opportunities. It can be seen as a branch of Advanced Analytics.

Prescriptive Analytics

Prescriptive Analytics provides recommendations/actions to decision-makers about what should be done to improve the business.

Artificial Intelligence (AI)/Machine Learning (ML)

Artificial Intelligence (AI) refers to the ability to mimic human intelligence. The terms "artificial intelligence" and "machine learning" are still used interchangeably by many people, without detailing the differences between these terms. Generally, we can say that Machine Learning (ML) is a subfield of AI. More broadly, ML leverages statistical methods and deep learning algorithms to solve certain problems. There are a variety of ML techniques, including deep learning, natural language processing (NLP), computer vision, etc. These techniques allow computers to learn from any kind of data.

Implementations of Decision Support Systems

There are several kinds of decision support systems which are described in the following:

Data Warehouse (DWH)

“A Data Warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management’s decision-making process.” – Bill Inmon

Early analytics, and even some BI applications today, directly access OLTP/operational systems. That leads to issues like impacting OLTP performance. Also, OLTP systems usually lack features like temporal data models (that enable time travel queries) or need harmonization with other sources. The motivation behind the DWH was to have a "single point of truth" for analytics that is decoupled from operational systems. DWHs gained a big boost in adoption as requirements for auditing arose with regulations like Basel II.

Data in a DWH follows strict data schemas and is usually of high quality in order to be credible, trustworthy, and appropriate for BI usage.

The required data harmonization is usually done in the "transform" part of Extract-Transform-Load (ETL): ETL pipelines are used to collect, transform, and then load data into a DWH.

Most DWHs are updated in daily batches, sometimes even weekly or monthly. DWH implementations that are updated in near real-time (so-called "active DWHs") are rare, given the complexity that results from the required subsequent updates when changing a single data set. Near real-time requirements are often addressed with an ODS (see below).

DWH Architecture. Most DWH deployments implement a layered information architecture.

High Level DWH architecture

The staging area stores data extracted from the source systems without any transformation logic. Only operational metadata, such as processing timestamps, etc., is added. Therefore, the data model is aligned to the original sources. The staging area is usually the place where data quality checks are performed (e.g., matching data types, looking for missing values, duplicates, …) before the data is processed downstream.

The Core-DWH acts as the single source of truth. It is modelled in an enterprise-wide canonical model with full history. When data is transferred from staging to core, it is transformed from source-system centricity to the enterprise model. The data model is usually in (bi-)temporal third normal form (3NF) or, more recently, "Data Vault". As it represents the ground truth, it is modelled against the enterprise's business entities and is not optimized for analytical performance. As such, the Core-DWH is enterprise-aligned, but use-case-agnostic.

Data marts are optimized for the data consumer: They are denormalized for performance and usually represent a view for a specific usage, e.g., regulatory reports. ETL from core to mart is usually not complex, as the data is already aligned in the core; it is mainly optimized for performance (denormalization, aggregation) and enriched with derived values (e.g., KPI calculations). The most popular modelling schemes for marts are the Star, Snowflake, and Galaxy schemas.
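
A minimal star schema, one fact table joined to one dimension table, can be sketched with sqlite3; the tables and figures below are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Dimension table (descriptive attributes) and fact table (measures).
conn.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT)")
conn.execute("CREATE TABLE fact_sales (product_key INTEGER, day TEXT, revenue REAL)")
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "Laptop", "IT"), (2, "Desk", "Furniture")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, "2023-01-01", 1200.0), (1, "2023-01-02", 800.0),
                  (2, "2023-01-01", 300.0)])

# Typical mart query: aggregate a measure along a dimension attribute.
rows = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f JOIN dim_product p USING (product_key)
    GROUP BY p.category ORDER BY p.category
""").fetchall()
print(rows)  # [('Furniture', 300.0), ('IT', 2000.0)]
```

In a real mart the fact table would reference several dimensions (date, customer, region, …), forming the star shape around it.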

DWH Infrastructure. DWHs are usually operated on optimized databases (Db2 Warehouse, Oracle) on optimized hardware or even DWH appliances like Teradata, Netezza, or Exadata. DWH infrastructure often consists of very high-performance and expensive compute and storage. Recent architectures separate compute from storage to allow for cheaper storage solutions and for independent scaling of compute and storage.

Operational Data Store (ODS)

An operational data store (ODS) is a type of database that is designed to integrate data from multiple sources in real-time or near-real-time, for the purpose of supporting operational or transactional business processes. It stores frequently updated data that needs to be quickly accessed and organized to support high-speed, transactional access. Unlike a traditional data warehouse, an ODS focuses on real-time data, serving as a staging area for data before it is moved to a data warehouse or analytical system. An ODS is optimized for transactional data such as customer transactions, inventory updates, and order processing.

Data Lake

The data lake tried to address several challenges of data warehouses when "Big Data" with its characteristics of variety, volume, and velocity came up. Data was no longer just structured relational data, but also semi-structured and unstructured data. Data lakes, often based on Apache Hadoop technology, store all this data in its original format and do not require a defined schema to store data, a characteristic known as "schema-on-read." Therefore, data is available sooner for analysis. This flexibility in storage requirements is particularly useful for data scientists, data engineers, and developers, allowing them to access data for data discovery exercises and machine learning projects.
Data Lakes usually leverage cheap storage solutions to allow for cost-effective usage of a large volume of data. 
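
Schema-on-read can be sketched in a few lines: the raw records below are invented, and the schema is only applied when the data is queried, not when it is stored:

```python
import json

# Raw events land in the lake in their source format, schema-free.
raw_lines = [
    '{"user": "u1", "action": "click", "ts": 1}',
    '{"user": "u2", "action": "buy", "amount": 19.9, "ts": 2}',
    'not even valid json',  # lakes happily store malformed records too
]

def read_with_schema(lines, fields):
    """Schema-on-read: the schema is applied at query time, not at load time."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # bad records surface only when the data is actually read
        yield {f: record.get(f) for f in fields}

print(list(read_with_schema(raw_lines, ["user", "action"])))
# [{'user': 'u1', 'action': 'click'}, {'user': 'u2', 'action': 'buy'}]
```

A data warehouse would have rejected the malformed record at load time (schema-on-write); the lake defers that decision to each consumer.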

The main advantages of data lakes over data warehouses are 

  • their greater flexibility regarding the types of data supported, 
  • the earlier access to data by providing data in their source format,
  • and the cost efficient scalability.

But data lakes have also significant disadvantages compared to data warehouses:

  • The complexity of a data lake is usually much higher than the complexity of a data warehouse, because of the many different tools and technologies used in a data lake.
  • This means many additional skills have to be acquired by the people operating and using a data lake in comparison to people doing the same for a data warehouse.
  • Governance becomes a more important topic in a data lake because of the heterogeneous data. If there is not sufficient governance, a data lake is in danger of mutating into a data swamp.
  • The performance of queries is often lower compared to data warehouses when the same amount of compute resources (cores) is used.

Apache Hadoop. Apache Hadoop is an open source framework that is used to store very large datasets. Hadoop’s clustering technology allows it to process and analyze extremely large data sets in parallel. A main aspect of Hadoop is the Hadoop Distributed File System (HDFS), a distributed file system that runs on standard hardware. MapReduce is a core component of Hadoop and allows converting input data to an aggregated output. Resources are managed with YARN (Yet Another Resource Negotiator), which allows scheduling tasks and monitoring cluster usage. HDFS is a highly available, distributed file system for storing very large amounts of data (millions of files) in a cluster of nodes. The NameNode handles requests and metadata and stores the location of data on DataNodes. Data is broken into blocks, and blocks are redundantly stored on DataNodes. Block size and the number of replicas can be configured.
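
The MapReduce idea can be sketched as a toy word count; in the real framework, the map and reduce tasks are distributed across the cluster nodes and the shuffle happens over the network:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit (key, value) pairs for each input record.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all values by key (done by the framework in Hadoop).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values per key.
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data big clusters", "big data"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
print(reduce_phase(shuffle(pairs)))  # {'big': 3, 'data': 2, 'clusters': 1}
```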

Data Lakehouse

The data lakehouse is an emerging data management architecture that tries to combine the advantages of data lakes like open, flexible, and low-cost storage with the advantages of data warehouses like performance and transactional support, while at the same time trying to avoid the disadvantages of data lakes and data warehouses. This enables all data (structured, semi-structured, and unstructured) to reside in commodity storage to enable machine learning, business intelligence, and artificial intelligence in one solution without vendor lock-in. Data lakehouses get the data warehousing functionality like transaction support, time travel queries, etc. by using a translation layer which maps files (local or on object storage) to tables. There are three main such translation layers:

  • Delta Lake was incubated and is mostly used by Databricks
  • Iceberg was originally incubated by Netflix and is now used by several systems
  • Hudi was incubated by Uber and is mostly used with Hadoop systems

Data Access

Data access can be provided in several ways. Traditionally, each data repository had provided its own interface for data access. SQL, or some variety of it, is the standard query language for relational and many non-relational systems. NoSQL systems also have APIs and languages (e.g., Gremlin for graph databases) optimized for the specific kind of repository. Some data repositories have proprietary drivers which have to be installed on the application side. Many systems, especially those that support SQL, have JDBC and ODBC drivers. Recently, most data repositories started to support REST interfaces. The main advantage of a REST interface is that most current programming environments already support REST, and no repository-specific drivers have to be installed on clients. Therefore, REST is also a popular choice for implementing data services.

Data Virtualization

A special kind of data access is data virtualization. It provides uniform access to many diverse data repositories without the requirement to copy the data to a central repository first. Data virtualization allows combining parts from different data repositories in several ways: it can join the data or concatenate the data into a single table. Often data virtualization contains mechanisms like caching for improving performance.
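
The idea can be sketched by joining records from two simulated repositories at query time; the data structures and names below are invented and only stand in for real connectors:

```python
# Two independent "repositories": rows from a relational system and
# documents from a key-value store, both simulated in memory.
orders_rdbms = [
    {"order_id": 1, "customer_id": "c1", "total": 50.0},
    {"order_id": 2, "customer_id": "c2", "total": 20.0},
]
customers_nosql = {
    "c1": {"name": "Acme GmbH"},
    "c2": {"name": "Globex"},
}

def virtual_join(orders, customers):
    """Join data from both repositories at query time, without first
    copying either of them into a central store."""
    for order in orders:
        customer = customers.get(order["customer_id"], {})
        yield {"order_id": order["order_id"],
               "customer_name": customer.get("name"),
               "total": order["total"]}

print(list(virtual_join(orders_rdbms, customers_nosql)))
```

A real virtualization engine would additionally push filters down to the sources and cache intermediate results, but the logical view it exposes is the same.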

Data Classes

There are several types of data stored in data repositories.  

Reference Data

Reference data provides lists of code values (reference data values) for a specific domain. Examples are country codes or currency symbols. These codes are typically sets of allowed values that are associated with data fields. Besides the codes, reference data may hold a description or structure information (parent/child references), for example: Country-Code = “DE”, Country-Name = ”Germany”, Parent = ”EU”. Typical usage scenarios for reference data are data quality checks or serving as the source for a dimension table.
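
A reference-data-based quality check can be sketched as follows, reusing the country-code example from above (the record fields are invented):

```python
# Reference data: allowed code values for a domain, with description and
# parent/child structure, as in the country-code example.
country_codes = {
    "DE": {"name": "Germany", "parent": "EU"},
    "FR": {"name": "France", "parent": "EU"},
    "US": {"name": "United States", "parent": None},
}

def check_country(record):
    """Typical usage: a data quality check against the reference data.
    Returns an error message, or None if the record passes."""
    code = record.get("country")
    if code not in country_codes:
        return f"invalid country code: {code!r}"
    return None

print(check_country({"customer": "Acme", "country": "DE"}))  # None
print(check_country({"customer": "Oops", "country": "XX"}))
```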

The business description of the referenced domain is usually stored in a glossary (see business metadata).

Master Data 

Master data describes business entities and their relationships. Master data is by nature of rather small volume and is not updated often. Master data gives the context to transactional data and is therefore referenced from transactional data. E.g., a purchase order references a customer master data record which holds additional information like full name, address, or consent information. Good master data is free from redundancy, so matching records is a core capability of Master Data Management (MDM).
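
Record matching can be sketched with a deliberately crude normalization; real MDM systems use much richer matching rules, and the customer records below are invented:

```python
def normalize(name):
    # Crude normalization; real MDM matching uses far richer rules
    # (phonetic codes, address parsing, probabilistic scoring, ...).
    return "".join(ch for ch in name.lower() if ch.isalnum())

def match_records(records):
    """Group customer records that likely describe the same real-world
    entity, a core capability of Master Data Management."""
    groups = {}
    for record in records:
        groups.setdefault(normalize(record["name"]), []).append(record)
    return [group for group in groups.values() if len(group) > 1]

customers = [
    {"id": 1, "name": "Acme GmbH "},
    {"id": 2, "name": "ACME GmbH"},
    {"id": 3, "name": "Globex"},
]
print(match_records(customers))  # records 1 and 2 are duplicate candidates
```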


Metadata

Metadata is data describing raw data. A popular example is image metadata: aperture, shutter speed, and location provide additional context and information for an image. Metadata in the Data and AI space is mostly not only “data about data” but also data about data processing. An ETL job could be seen as metadata describing how data is processed. Including metadata about data processing allows data platforms to support capabilities like data lineage or data observability.

In the context of data management, metadata is often divided into:

Technical Metadata

Data describing the physical format (file, database) or operational information of data. This is especially useful or even required to process the data. Additional examples are timestamps (date created, last changed, …) and size.

Business Metadata

Describes the business context of data. A glossary structuring business terms is a prominent example. The context to data objects is implemented by linking physical data objects (Tables, Files, Messages, …) to the business glossary.

Governance information like data owner or steward falls into this category, too. Data mappings (e.g., source-to-target mappings) that not only link objects but also define mapping rules may fit best here as well.

Operational Metadata

Operational metadata describes how data was processed. 

Examples are: Duration of ETL jobs, number of records processed, number of errors or warnings.

Applications of operational metadata cover data provenance (where does the data come from, how was it generated?) and data observability.

Accessing metadata: Data Catalogs

A data catalog is a central interface to the metadata assets. The catalog often does not hold data assets directly but proxies: a metadata representation linked to the original assets. Besides directly browsing metadata (e.g., the business glossary) it also holds tools to enrich assets with metadata (e.g. data quality information from data profiling) or automation for cataloging and linking (meta-)data assets.
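
The proxy idea can be sketched with a toy catalog class; method names and asset identifiers are invented for illustration:

```python
class DataCatalog:
    """Sketch of a catalog holding metadata proxies, not the data itself.
    Each proxy links a physical asset to business terms and quality info."""
    def __init__(self):
        self._assets = {}

    def register(self, asset_id, location, business_terms=()):
        self._assets[asset_id] = {
            "location": location,                  # link to the original asset
            "business_terms": set(business_terms), # glossary linkage
            "quality": None,                       # filled in by enrichment
        }

    def enrich(self, asset_id, **metadata):
        # E.g., attach data quality information from data profiling.
        self._assets[asset_id].update(metadata)

    def search(self, term):
        # Browse by business term rather than by physical location.
        return [aid for aid, meta in self._assets.items()
                if term in meta["business_terms"]]

catalog = DataCatalog()
catalog.register("sales.orders", "s3://bucket/orders/", ["order", "revenue"])
catalog.enrich("sales.orders", quality={"completeness": 0.98})
print(catalog.search("revenue"))  # ['sales.orders']
```

Note that the catalog only stores the location string; the actual order data never enters it.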

Data Integration

ETL stands for extract, transform, and load and is a data integration process that combines data from multiple data sources into an analytical data store like a data warehouse or data lake, or that provisions business applications with analytical information. The latter is also known as “reverse ETL”.

Extract: Retrieve data from various sources. Extracts may be full or delta. Full extracts copy entire tables, while delta extracts only transfer changed data from the source table. Delta handling can be based on unique IDs or timestamps, or use specific tools like CDC (Change Data Capture) that operate on database logs.

Transform: Transform the landed source data into the target data model, i.e., the enterprise canonical model in the case of a DWH core. When applicable, data quality problems can be fixed (e.g., filling in missing values, correcting errors, …). Calculations are applied, e.g., to derive complex figures or to harmonize different measurements.

Load: Load the transformed data into the target tables. This usually supports incremental loading logic, which needs to cover that existing data is updated, new data appended, and obsolete data deleted. The load is usually optimized for the underlying data store implementation and may, e.g., contain sophisticated partition handling.
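
The three steps can be sketched as one small pipeline; the source fields and the delta criterion below are invented for illustration:

```python
# Source rows as they might arrive from an operational system.
source_rows = [
    {"id": 1, "amount": "12.5", "currency": "eur", "changed": "2023-05-02"},
    {"id": 2, "amount": None,   "currency": "USD", "changed": "2023-05-03"},
]

def extract(rows, since):
    # Delta extract: only rows changed after the last successful load.
    return [r for r in rows if r["changed"] > since]

def transform(rows):
    # Map to the target model: fix quality issues, harmonize codes.
    for r in rows:
        yield {"order_id": r["id"],
               "amount": float(r["amount"]) if r["amount"] is not None else 0.0,
               "currency": r["currency"].upper()}

def load(target, rows):
    # Incremental load: update existing keys, append new ones.
    for r in rows:
        target[r["order_id"]] = r

warehouse = {}
load(warehouse, transform(extract(source_rows, since="2023-05-02")))
print(sorted(warehouse))  # [2] -- only the delta was loaded
```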

ETL Challenges

ETL complexity grows with the number of sources and data structures to be transformed. The efforts for ETL usually dominate the overall costs when building or migrating a DWH. Complex ETL is sensitive to changes in the underlying sources or targets and therefore introduces lengthy release cycles in the DWH landscape.


ELT

With ELT, instead of transforming all data before storing it in a DWH, data is loaded into the data lake in raw or source format and kept there until needed for a specific application. In ELT, the “T” part can be performed within the data store engine (e.g., database, Hadoop, Spark), so that the data does not need to cross system boundaries and can leverage the power of parallel processing engines.

Popular tools are: DataStage, Talend, Informatica, AWS Glue, Azure Synapse, dbt

Data Observability

Data observability aims to ensure transparency of the health and state of data in an enterprise. It helps to cope with the complexity of modern distributed data processing architectures that span different data stores, processing engines, and platforms. At its core, data observability collects events from data processing and connects, aggregates, or correlates them to detect anomalies and to discover trends.

The following is a set of disciplines in data observability:

  • Monitoring — a dashboard that provides an operational view of your pipeline or system
  • Alerting — both for expected events and anomalies
  • Tracking — the ability to set and track specific events
  • Comparisons — monitoring over time, with alerts for anomalies
  • Analysis — automated issue detection that adapts to your pipeline and data health
  • Logging — a record of an event in a standardized format for faster resolution
  • SLA Tracking — the ability to measure data quality and pipeline metadata against pre-defined standards

Data observability needs to cover data at rest and in motion. Tracking and monitoring ensure transparency over the current state. Alerting is required for events that need a timely response.
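
Anomaly detection on operational events can be sketched with a simple z-score over record counts per run; real observability tools use more robust methods, and the counts below are invented:

```python
from statistics import mean, stdev

def detect_anomalies(run_counts, threshold=3.0):
    """Flag pipeline runs whose record count deviates strongly from the
    history; a simple z-score stands in for real anomaly detection."""
    mu, sigma = mean(run_counts), stdev(run_counts)
    return [i for i, count in enumerate(run_counts)
            if sigma > 0 and abs(count - mu) / sigma > threshold]

# Records processed per nightly ETL run; the last run looks broken.
counts = [1000, 1020, 980, 1010, 990, 10]
print(detect_anomalies(counts, threshold=2.0))  # [5]
```

An alert on run 5 would fire long before a business user notices an empty dashboard, which is the point of observing the pipeline rather than only the data at rest.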

Architectures / Methodologies

Data Mesh

Centralized approaches like data warehouses and data lakes rely on a central IT team to own and manage data. This creates a bottleneck as the central team fails to keep up with an ever-increasing amount of data analytics use cases. Data Mesh addresses this challenge by democratizing data access and data ownership, aiming to scale and speed up data analytics use cases. First introduced by Zhamak Dehghani in 2019, Data Mesh consists of four core principles. The first principle, domain-oriented decentralized data ownership and architecture, emphasizes the need to leverage domain knowledge to improve data quality. Furthermore, it addresses the challenge of data ownership across the full lifecycle, as domains have end-to-end responsibility for the data they produce, leading to the second principle: data as a product. Following the Data Mesh approach, domains provide data in data products consisting of the data itself, rich metadata, and accessibility options similar to APIs in microservices. Data products can then be accessed by other domains, for example, using a data catalog to list data products. In accordance with the microservice analogy, domains rely on a self-service data infrastructure platform that provides domain-agnostic functionality, tools, and systems to create, execute, and maintain interoperable data products across all domains, empowering domains and providing more agility. Finally, Data Mesh aims to ensure standardization, interoperability, security, and compliance using federated computational governance. However, experience and recent publications have shown that using a federated governance approach to ensure compliance and security remains challenging. As scale is a key goal, a Data Mesh may not be the ideal approach for organizations operating in a very confined context.

Data Fabric

Data fabric is an architectural concept that facilitates the end-to-end integration of different data pipelines and cloud environments through the use of intelligent and automated systems. In general, it helps organizations meet their business needs and gain competitive advantage. Gartner defines data fabric as follows: “Data fabric is a design concept that serves as an integrated layer (fabric) of data and connecting processes. A data fabric utilizes continuous analytics over existing, discoverable and inferenced metadata assets to support the design, deployment and utilization of integrated and reusable data across all environments, including hybrid and multi-cloud platforms.”

The diversity of components of data platforms, along with the data growth, has made the unification and governance of data environments an increasing priority, as this growth has created significant challenges, such as data silos, security risks, and general bottlenecks to decision making. Data management teams are addressing these challenges head-on with data fabric solutions. They are leveraging them to unify their disparate data systems, embed governance, strengthen security and privacy measures, and provide more data accessibility to workers, particularly their business users.

The key elements of a data fabric are:

  • A knowledge graph that eases discovery of all assets and gives context by automatically cataloging and augmenting data assets.
  • An integration layer that supports various integration styles, from streaming to batch, and that should allow access to data without moving it; mostly implemented by a data virtualization engine.
  • Self-service data usage that utilizes the knowledge graph for discovery.
  • As the data fabric is distributed by nature, unified DataOps tooling that ensures consistent lifecycle management for the various assets.
  • Multimodal governance to centrally manage policies, stewardship and data quality.

AI-driven automation is a cross-cutting pattern for all those elements, e.g., to keep the knowledge graph up to date, to augment it with business and technical context, or to enforce policy-driven access.

Principal Information Architecture Technical Specialist

Dominik Kreuzberger

Customer Success Manager Architect - Data & AI

Sascha Slomka

Senior Client Engineering Solution Architect, IBM Technology, DACH
