Date: 2024-08-29
There’s an evolutionary paradox that biologists and geneticists have observed multiple times in nature: two nearly identical species inhabiting separate, geographically distant regions. A good example is the species pair of Hodgson’s Frogmouth and Sri Lanka Frogmouth, which inhabit two different corners of India. Often, this situation arises because climatic change fragmented what was once a continuous habitat into multiple ecosystems, and the separated populations embarked upon disparate evolutionary paths.
A similar phenomenon has taken place within the history of enterprise data management. From their very beginnings, enterprise data platforms were intended to serve as central repositories that could make data available throughout the enterprise. However, changing business climates and evolutionary pressures isolated analytical and transactional data to the point that they almost seem like different species.
To unlock the full value of business data in the modern enterprise, however, we must cultivate more inclusive and accommodating data architectures: ones where all data can again be managed, seen and valued as a common and shared entity.
Enterprise data management was born out of the need to centralize and integrate data from multiple, separate transactional systems. First-generation data platforms, known as data warehouses, were built to make integrated transactional data available for specific analytic purposes.
Within a decade, however, enterprise data had greatly increased in volume and variety, and data warehouses could no longer meet storage and management needs. As a result, many turned to a second generation of data platforms, called data lakes.
Neither data warehouses nor data lakes were enough to solve the problem, however. The gap between the transactional world (most data’s point of origin) and the analytical world (where data is transformed into insights) has continued to widen. Though enterprise data is fundamentally the same everywhere, how it is managed for analytic and transactional purposes grew more and more distinct.
This dichotomy created multiple issues:
While traditional data warehouses made use of an Extract-Transform-Load (ETL) process to ingest data, data lakes instead rely on an Extract-Load-Transform (ELT) process. Extracted data from multiple sources is loaded into cheap BLOB storage, then transformed and persisted into a data warehouse, which uses expensive block storage.
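The difference between the two ingestion orders can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory lists stand in for BLOB storage and for the warehouse, and the names `extract`, `transform`, `run_etl` and `run_elt` are hypothetical, not any vendor's API.

```python
from typing import Dict, List

Record = Dict[str, object]


def extract(sources: List[List[Record]]) -> List[Record]:
    """Flatten the records coming from several (hypothetical) source systems."""
    return [record for source in sources for record in source]


def transform(records: List[Record]) -> List[Record]:
    """Example cleanup: drop rows without an amount, upper-case the currency code."""
    return [
        {**r, "currency": str(r["currency"]).upper()}
        for r in records
        if r.get("amount") is not None
    ]


def run_etl(sources: List[List[Record]], warehouse: List[Record]) -> None:
    # ETL: transform in flight, so only cleaned rows ever reach the warehouse.
    warehouse.extend(transform(extract(sources)))


def run_elt(
    sources: List[List[Record]],
    blob_store: List[Record],
    warehouse: List[Record],
) -> None:
    # ELT: land the raw data in cheap storage first...
    blob_store.extend(extract(sources))
    # ...then rebuild the warehouse copy from everything that has landed.
    warehouse[:] = transform(blob_store)
```

In the ELT variant the raw rows and the transformed rows live in two places, so every new load means running the transformation again to bring the second copy up to date.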
This storage architecture is inflexible and inefficient. Transformation must be performed continuously to keep the BLOB and data warehouse storage in sync, adding costs. And continuous transformation is still time-consuming. By the time the data is ready for analysis, the insights it can yield will be stale relative to the current state of transactional systems.
Furthermore, data warehouse storage cannot support workloads like artificial intelligence (AI) or machine learning (ML), which require huge amounts of data for model training. For these workloads, data lake vendors usually recommend extracting data into flat files to be used solely for model training and testing purposes. This adds yet another ETL step, making the data even more stale.
A third generation of data platforms, Lakehouses, was created to solve these problems. The data warehouse storage layer is removed from Lakehouse architectures; instead, continuous data transformation is performed within the BLOB storage. Multiple APIs are added so that different types of workloads can all use the same storage buckets. This is an architecture that’s well suited for the cloud, since Amazon S3 or Azure Data Lake Storage Gen2 (ADLS Gen2) can provide the requisite storage.
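The idea of several APIs sharing one set of storage buckets can be illustrated with a small sketch. Everything here is an assumption made for the example: `ObjectStore` is an in-memory stand-in for a cloud bucket, and `total_by_region` and `training_batches` are hypothetical access paths for a reporting workload and an ML workload respectively.

```python
import json
from typing import Dict, Iterator, List


class ObjectStore:
    """In-memory stand-in for a cloud object storage bucket (hypothetical)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, rows: List[dict]) -> None:
        # Each object holds a JSON-encoded list of rows.
        self._objects[key] = json.dumps(rows).encode("utf-8")

    def scan(self, prefix: str) -> Iterator[dict]:
        # Yield every row from every object whose key starts with the prefix.
        for key in sorted(self._objects):
            if key.startswith(prefix):
                yield from json.loads(self._objects[key])


def total_by_region(store: ObjectStore, prefix: str) -> Dict[str, float]:
    """Reporting-style access: aggregate amounts per region."""
    totals: Dict[str, float] = {}
    for row in store.scan(prefix):
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


def training_batches(
    store: ObjectStore, prefix: str, batch_size: int
) -> Iterator[List[List[float]]]:
    """ML-style access: stream feature vectors in fixed-size batches."""
    batch: List[List[float]] = []
    for row in store.scan(prefix):
        batch.append([float(row["amount"])])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch
```

Both functions read the same objects directly, so neither workload needs its own exported copy of the data.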
But these solutions are only half measures. While Lakehouse architectures dissolve the silos within a data lake, they’re still centralized and monolithic. If anything, the Lakehouse merely adds another node — albeit an efficient one — to the enterprise data platform landscape. Still more revolutionary changes are needed to truly unify data management across the modern enterprise.
The Data Fabric represents a fourth generation of data platform architecture. The purpose of the Data Fabric is to make data available wherever and whenever it is needed, abstracting away the technological complexities involved in data movement, transformation and integration so that anyone can use the data.
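One way to picture this abstraction is a catalog that maps logical dataset names to whatever code knows how to fetch the data. The sketch below is a deliberately simplified illustration, not a description of any Data Fabric product; `FabricCatalog`, `register` and `read` are hypothetical names chosen for the example.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


class FabricCatalog:
    """Maps logical dataset names to reader functions (hypothetical sketch)."""

    def __init__(self) -> None:
        self._readers: Dict[str, Callable[[], Iterable[Row]]] = {}

    def register(self, name: str, reader: Callable[[], Iterable[Row]]) -> None:
        # The owning team decides where the data lives and how it is fetched.
        self._readers[name] = reader

    def read(self, name: str) -> List[Row]:
        # Consumers ask for data by name only; location and format stay hidden.
        if name not in self._readers:
            raise KeyError(f"unknown dataset: {name}")
        return list(self._readers[name]())
```

A consumer would call `catalog.read("orders")` in the same way whether the registered reader queries a transactional database, scans object storage or calls a remote service.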
By nature, Data Fabrics are:
The advent of the Data Fabric is opening new opportunities to transform enterprise cultures and operating models. Because Data Fabrics are distributed but inclusive, their use promotes federated but unified governance. This will make the data more trustworthy and reliable. The marketplace will make it easier for stakeholders across the business to discover and use data to innovate. Diverse teams will find it easier to collaborate, and to manage shared data assets with a sense of common purpose.
As data architectures, both Data Fabrics and Lakehouses are still maturing. Their future promise is bright, though. These emerging technologies may someday enable the transactional and analytical worlds to merge into a single sphere where access to data will be democratized, and data-driven insights will flow freely and fast.
