October 14, 2016 | Written by: Tony Giordano
New Technologies, Enhanced Processes
There is an old saying that “the more things change, the more they stay the same.” Big Data is far more than just Hadoop. Spark, Python, Scala, Cassandra, and Atlas are open source Big Data technologies that have expanded the ability of analytics to scale in velocity and volume, enabling real-time, digital insight. While many of these technologies are relatively new, many of the data management best practices developed during the “Relational Age” are just as applicable and necessary, if not more so, in the new age of Big Data technologies.
The “Two Waves” of Big Data
All technologies have a natural life cycle: emergence, adoption, maturity, and finally obsolescence. Typically, as a technology matures, processes are created (or re-purposed) to better manage it. As the relational age of analytics comes to a close and is replaced by Big Data, we are seeing the same mistakes being made again.
The First Wave of Big Data
While Big Data technologies were created to meet the needs of emerging Internet-based organisations that required a data platform able to scale linearly with volumes of web-based unstructured data, their adoption in the wider market was very much a response to business needs that traditional relational technologies had left unmet. Two basic challenges drove that adoption:
The need for raw data for analytics – IT simply did not allow the business to access raw data, on the assumption that unconformed raw data does not provide an enterprise view and is burdened with data quality issues. Meanwhile, the Data Scientists were secretly getting dumps of source system data so that they could perform the work they needed to do.
The time to get things done – Business communities grew weary of waiting 3-6 months for a new report and saw Big Data as a solution: simply land the data in Hadoop and start using it!
As these Big Data environments were being stood up, there was a massive movement to stop applying all the data modelling, data integration design, and information governance rules and processes that seemed to take so long. There was one problem with this approach: it had been tried twice before.
During the advent of the Relational Database, and later with the advent of Data Warehousing, both the business and IT built environments where they simply loaded data from source systems into tables that looked exactly like the source structures and went to work analysing the data. On more than one occasion, when someone was asked why the data warehouse data model looked exactly like the source system's data model, the answer would be “it’s on Oracle.”
Over time, as the discipline matured, atomic data warehouse data models and dimensional data mart data models emerged to accommodate the different uses of data for reporting and operational analytics. Process disciplines such as data quality and metadata management emerged to further enhance and qualify an analytics data environment.
The Second Wave of Big Data
Organisations that built these first wave Big Data environments quickly found out that their Data Lakes were turning into “Data Swamps.” Ignoring data management techniques in Big Data environments was creating the same mess that had been seen twice before.
For example, all the master data management work that was performed in the relational world is even more important in Big Data. Source file dumps do not integrate customer keys from transactional data, much less integrate all the digital sources, such as 1st and 3rd party cookies and Facebook and Google IDs. You do not get a 360 view of a customer in a first generation Big Data environment.
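As a rough illustration of the key integration that a raw file dump does not give you, the sketch below maps source-specific identifiers to a single master customer key. The crosswalk table, source names, identifiers, and the build_customer_view function are all hypothetical, invented for this example. In practice a master data management process would maintain the crosswalk.

```python
from collections import defaultdict

# Hypothetical crosswalk: (source system, source identifier) -> master customer key.
# In practice this mapping would be maintained by a master data management process.
CROSSWALK = {
    ("billing", "B-1001"): "CUST-1",
    ("web_cookie", "ck_9f3a"): "CUST-1",
    ("crm", "C-77"): "CUST-2",
}

def build_customer_view(events, crosswalk):
    """Group events by master customer key; return unmatched events separately."""
    view = defaultdict(list)
    unmatched = []
    for event in events:
        key = crosswalk.get((event["source"], event["source_id"]))
        if key is None:
            # No known master key: the event cannot join the customer view.
            unmatched.append(event)
        else:
            view[key].append(event)
    return dict(view), unmatched

# Illustrative events from three made-up sources.
events = [
    {"source": "billing", "source_id": "B-1001", "action": "invoice_paid"},
    {"source": "web_cookie", "source_id": "ck_9f3a", "action": "page_view"},
    {"source": "crm", "source_id": "C-77", "action": "support_call"},
    {"source": "web_cookie", "source_id": "ck_0000", "action": "page_view"},
]

view, unmatched = build_customer_view(events, CROSSWALK)
```

Without the crosswalk, the billing record and the cookie record for the same person stay in separate files with nothing to join them on, which is the gap described above.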
This is leading to the second wave of Big Data. All the data management processes we threw out as unneeded are now being reviewed to determine how they should be applied in the Big Data environment. Much of the discussion on the “modern” data lake revolves around re-introducing these data management practices.
The Goldilocks Hypothesis
Is the new world the same as the old?
No. Digital and real-time analytics are very different from the batch, transactional paradigm of the relational world.
For Big Data we should apply the “Goldilocks Hypothesis”: let’s put processes in place that preserve the flexibility of the first wave, and implement data management processes that provide data formatted as needed, such as a 360 view of the customer. Not too much, not too little.
How is this done? By providing a very pragmatic three-tiered architecture that accommodates both the need for a flexible data environment for data science and sufficient data management processes for conformed, qualified, defined data.
Our GBS Data Lake Architecture represents this approach in three distinct layers.
Raw Layer – The Raw Layer stores and manages raw data from the multiple sources (both transactional and other). This layer is used for two main purposes:
1. to stage data for the subsequent big data layers
2. as a source and platform for data science usage
Conformed Layer – This layer contains lightly modelled data, typically in an industry format, to increase operational, analytic, and cognitive throughput. It focuses on the processes and environments that deal with the capture, qualification, processing, and movement of data in order to prepare it for end user consumption or storage in the User Layer.
User Layer – This layer contains the file structures, databases, data appliances, and other data persistence and related components that provide most of the storage for the data supporting a robust data environment. The Analytic Stores are not a replacement for, or a replica of, the transactional databases, which reside on the Data Sources component. They are a complementary set of data repositories that reshape data into the formats necessary for making tactical and strategic decisions and managing a business. These persistent structures could be represented by conceptual, logical, and physical data models and data model types (e.g. 3NF, star/snowflake schemas, unstructured, etc.) and by multiple technologies such as SQL, NoSQL, In Memory, and specialised data platforms.
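To make the hand-off between the three layers concrete, here is a minimal sketch in plain Python. It is not the GBS implementation: the sample extract, field names, and the functions load_raw, conform, and total_by_customer are all hypothetical, and a real data lake would run these steps on a distributed platform rather than on in-memory lists.

```python
import csv
import io

# Hypothetical raw extract, landed exactly as the source system produced it.
RAW_EXTRACT = """cust_id,amt,country
 1001 ,25.50,gb
1002,not_a_number,US
1001,10.00,GB
"""

def load_raw(text):
    """Raw layer: keep every record as delivered, with no typing or cleansing."""
    return list(csv.DictReader(io.StringIO(text)))

def conform(raw_rows):
    """Conformed layer: rename, type, and standardise; set aside rows that fail."""
    conformed, rejected = [], []
    for row in raw_rows:
        try:
            conformed.append({
                "customer_id": int(row["cust_id"].strip()),
                "amount": float(row["amt"]),
                "country": row["country"].strip().upper(),
            })
        except ValueError:
            # The raw record is kept for inspection instead of being silently dropped.
            rejected.append(row)
    return conformed, rejected

def total_by_customer(conformed_rows):
    """User layer: reshape conformed data into a structure built for one question."""
    totals = {}
    for row in conformed_rows:
        totals[row["customer_id"]] = totals.get(row["customer_id"], 0.0) + row["amount"]
    return totals

raw = load_raw(RAW_EXTRACT)
conformed, rejected = conform(raw)
totals = total_by_customer(conformed)
```

The raw rows stay available for data science work, the conformed rows carry consistent names and types, and the user-layer structure answers a specific business question. The rejected row is kept rather than discarded, so data quality problems remain visible.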
Does this mean we do everything the same? No. There are definite differences due to the immediacy and volumes of Big Data technologies, for example:
Data Modelling: Physical data models are Spark Frameworks rather than tables. The design favours virtualisation over moving data into a structure.
Ingestion Curation: Data Quality is not a physical job in DataStage, Informatica, or Ab Initio; it is a Python-based container.
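As a small illustration of data quality expressed as Python code rather than as a job in an ETL tool, the sketch below checks records for required fields. The function name check_quality, the field names, and the sample records are assumptions made for this example, and the packaging of such code into a container is not shown.

```python
def check_quality(records, required_fields):
    """Return (passed, failed); each failed entry pairs the record with its issues."""
    passed, failed = [], []
    for record in records:
        # A field counts as missing if it is absent, None, or an empty string.
        issues = [
            f"missing {field}"
            for field in required_fields
            if record.get(field) in (None, "")
        ]
        if issues:
            failed.append((record, issues))
        else:
            passed.append(record)
    return passed, failed

# Illustrative records; the second and third each lack a required value.
records = [
    {"customer_id": "1001", "email": "a@example.com"},
    {"customer_id": "", "email": "b@example.com"},
    {"customer_id": "1003"},
]

passed, failed = check_quality(records, ["customer_id", "email"])
```

Because the rule is an ordinary function, the same check can be applied at ingestion time or reused later in the pipeline.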
The moral of the story is that while technology is constantly evolving, the best practices for managing data remain.