An evolutionary history of enterprise data architectures
Biologists and geneticists have repeatedly observed an evolutionary pattern in which two nearly identical species inhabit separate, geographically distant regions. A good example is the pair formed by Hodgson’s Frogmouth and the Sri Lanka Frogmouth, which inhabit opposite corners of India. Often, the situation came about because climatic change fragmented what was once a continuous habitat into multiple ecosystems, and the isolated populations embarked upon disparate evolutionary paths.
A similar phenomenon has taken place within the history of enterprise data management. From their very beginnings, enterprise data platforms were intended to serve as central repositories that could make data available throughout the enterprise. However, changing business climates and evolutionary pressures isolated analytical and transactional data to the point that they almost seem like different species.
To unlock the full value of business data in the modern enterprise, however, we must cultivate more inclusive and accommodating data architectures, ones where all data can again be managed, seen and valued as a common, shared entity.
Forming Islands: The Evolution of Data Platforms
Enterprise data management was born out of the need to centralize and integrate data from multiple, separate transactional systems. First-generation data platforms, known as data warehouses, were built to make integrated transactional data available for specific analytic purposes.
Within a decade, however, enterprise data had greatly increased in volume and variety, and data warehouses could no longer meet storage and management needs. As a result, many turned to a second generation of data platforms, called data lakes.
Neither data warehouses nor data lakes were enough to solve the problem, however. The gap between the transactional world (most data’s point of origin) and the analytical world (where data is transformed into insights) has continued to widen. Though enterprise data is fundamentally the same everywhere, how it is managed for analytic and transactional purposes has grown more and more distinct.
This dichotomy created multiple issues:
- When data travels between transactional and analytical systems, it must be ingested, integrated, and transformed, often multiple times. This means that there’s always a time lag between data’s creation and when it becomes usable for analytics. As a result, analytics leverage information that’s inherently outdated, providing an unreliable basis for business decision-making.
- Enterprise analytics solutions mainly produce inert insights that human experts must interpret and act on. This reliance on humans introduces a second time lag, between insight and action.
- Working with data requires specialized skills. Data architectures and algorithms remain complex, and gaining insights from data demands expertise in fields like application integration, data engineering, machine learning (ML), DataOps and MLOps. Typical business users don’t possess these skills, creating a gap — and often leading to mistrust — between business and technology organizations within the enterprise.
Lakehouses: Solving Immediate Problems with Data Lakes
While traditional data warehouses made use of an Extract-Transform-Load (ETL) process to ingest data, data lakes instead rely on an Extract-Load-Transform (ELT) process. Data extracted from multiple sources is loaded into cheap BLOB storage, then transformed and persisted into a data warehouse, which uses expensive block storage.
This storage architecture is inflexible and inefficient. Transformation must be performed continuously to keep the BLOB and data warehouse storage in sync, adding costs. And continuous transformation is still time-consuming. By the time the data is ready for analysis, the insights it can yield will be stale relative to the current state of transactional systems.
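To make the ELT pattern concrete, here is a minimal, purely illustrative sketch in Python. The "orders" data, the temp directory standing in for BLOB storage, and the dictionary standing in for warehouse block storage are all invented for the example; a real pipeline would use an object store and a warehouse engine.

```python
import json
import pathlib
import tempfile

# Illustrative stand-ins: a temp directory plays the role of cheap BLOB
# storage, a dict plays the role of expensive warehouse block storage.
blob = pathlib.Path(tempfile.mkdtemp())
warehouse = {}

# Extract: pull raw records from a transactional source (hard-coded here).
raw_orders = [
    {"id": 1, "amount": "19.99", "status": "shipped"},
    {"id": 2, "amount": "5.00", "status": "cancelled"},
]

# Load: persist the raw, untransformed data first -- the "L" before the "T".
(blob / "orders.json").write_text(json.dumps(raw_orders))

# Transform: later (and repeatedly, to stay in sync), reshape the raw data
# into a cleaned warehouse table. This recurring step is the cost and the
# source of staleness described above.
loaded = json.loads((blob / "orders.json").read_text())
warehouse["orders"] = [
    {"id": r["id"], "amount": float(r["amount"])}
    for r in loaded
    if r["status"] == "shipped"
]

print(warehouse["orders"])  # -> [{'id': 1, 'amount': 19.99}]
```

Note that the raw copy in BLOB storage and the transformed copy in the warehouse are now two artifacts that must be kept in sync, which is exactly the inefficiency discussed next.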
Furthermore, data warehouse storage cannot support workloads like Artificial Intelligence (AI) or ML, which require huge amounts of data for model training. For these workloads, data lake vendors usually recommend extracting data into flat files to be used solely for model training and testing purposes. This adds an additional ETL step, making the data even more stale.
A third generation of data platforms, Lakehouses, was created to solve these problems. The data warehouse storage layer is removed from Lakehouse architectures; instead, continuous data transformation is performed directly within the BLOB storage. Multiple APIs are added so that different types of workloads can all use the same storage buckets. This is an architecture that’s well suited for the cloud, since Amazon S3 or Azure Data Lake Storage Gen2 can provide the requisite storage.
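The core Lakehouse idea, one shared storage layer serving different workload types through different access paths, can be sketched as follows. This is a hypothetical toy: a temp directory stands in for an S3/ADLS bucket, and two small functions stand in for the BI and ML "APIs" over the same open-format files.

```python
import csv
import pathlib
import tempfile

# Illustrative stand-in: a temp directory plays the role of a cloud bucket.
bucket = pathlib.Path(tempfile.mkdtemp())

# A single copy of the data lives in an open file format in the bucket.
(bucket / "events.csv").write_text(
    "user,action,value\nalice,click,1\nbob,click,3\nalice,buy,10\n"
)

def read_table(name):
    """Shared access layer: every workload reads the same bucket files."""
    with open(bucket / f"{name}.csv", newline="") as f:
        return list(csv.DictReader(f))

rows = read_table("events")

# "BI API": aggregate for reporting, reading directly from the bucket.
revenue = sum(int(r["value"]) for r in rows if r["action"] == "buy")

# "ML API": the same files feed model training as feature vectors,
# with no extra ETL hop into separate warehouse storage.
features = [[1 if r["action"] == "buy" else 0, int(r["value"])] for r in rows]

print(revenue)   # -> 10
print(features)  # -> [[0, 1], [0, 3], [1, 10]]
```

The design point is that the analytics and ML paths read the same stored bytes, so there is no second storage tier to keep in sync.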
But these solutions are only half measures. While Lakehouse architectures dissolve the silos within a data lake, they’re still centralized and monolithic. If anything, the Lakehouse merely adds another node — albeit an efficient one — to the enterprise data platform landscape. Still more revolutionary changes are needed to truly unify data management across the modern enterprise.
The Enterprise Data Fabric: Transforming Data Management
The Data Fabric represents a fourth generation of data platform architecture. The purpose of the Data Fabric is to make data available wherever and whenever it is needed, abstracting away the technological complexities involved in data movement, transformation and integration so that anyone can use the data.
By nature, Data Fabrics are:
- Distributed. A Data Fabric comprises a network of data platforms, all interacting with one another to provide greater value. These platforms are spread across the enterprise’s hybrid and multicloud computing ecosystem.
- Heterogeneous. Each node in a Data Fabric can differ from the others. A Data Fabric can consist of multiple data warehouses, data lakes, IoT/edge devices and transactional databases. It can include technologies ranging from Oracle, Teradata and Apache Hadoop to Snowflake on Azure, Redshift on AWS or MS SQL in the on-premises data center, to name just a few.
- Able to support the holistic data lifecycle. The Data Fabric embraces all phases of the data-information-insight lifecycle. One node of the fabric may provide raw data to another that, in turn, performs analytics. These analytics can be exposed as REST APIs within the fabric, so that they can be consumed by transactional systems of record for decision-making.
- Designed to bring together the analytical and transactional worlds. In the Data Fabric, everything is a node and the nodes interact with one another through a variety of mechanisms. Some of these require data movement, while others enable data access without movement. The underlying idea is that, in this architecture, data silos and the analytical/transactional divide will eventually disappear.
- Inherently secure. Security and governance policies are enforced whenever data travels or is accessed throughout the Data Fabric. Just as Istio enforces security policies on service-to-service traffic in Kubernetes, the Data Fabric will apply policies to data according to similar principles, in real time.
- Supportive of data discoverability. In a Data Fabric, data assets can be published into categories, creating an enterprise-wide data marketplace. This marketplace provides a search mechanism, utilizing metadata and a knowledge graph to enable asset discovery. This enables access to data at all stages of its value lifecycle.
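The marketplace-style discovery described above can be sketched in a few lines. This is a hypothetical toy catalog: the asset names, node names and metadata tags are invented, and a real implementation would sit on a metadata store and knowledge graph rather than an in-memory list.

```python
# Illustrative catalog of published data assets with searchable metadata.
catalog = [
    {"name": "sales_orders", "node": "warehouse-emea", "tags": ["sales", "orders", "raw"]},
    {"name": "churn_scores", "node": "lakehouse-us", "tags": ["ml", "customers", "insight"]},
    {"name": "sensor_stream", "node": "edge-plant-7", "tags": ["iot", "raw"]},
]

def discover(keyword):
    """Return the names of assets whose name or metadata tags match the keyword."""
    kw = keyword.lower()
    return [a["name"] for a in catalog if kw in a["name"] or kw in a["tags"]]

# Assets at every lifecycle stage are discoverable: raw data and derived insights.
print(discover("raw"))  # -> ['sales_orders', 'sensor_stream']
print(discover("ml"))   # -> ['churn_scores']
```

Because raw sources, curated tables and model outputs are all published into the same catalog, consumers can find data at any stage of its value lifecycle.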
The advent of the Data Fabric is opening new opportunities to transform enterprise cultures and operating models. Because Data Fabrics are distributed but inclusive, their use promotes federated but unified governance. This will make the data more trustworthy and reliable. The marketplace will make it easier for stakeholders across the business to discover and use data to innovate. Diverse teams will find it easier to collaborate, and to manage shared data assets with a sense of common purpose.
As data architectures, both Data Fabrics and Lakehouses are still maturing. Their future promise is bright, though. These emerging technologies may someday enable the transactional and analytical worlds to merge into a single sphere where access to data will be democratized, and data-driven insights will flow freely and fast.