The biblical Moses learned that after forty years of effort, he could not reach his final destination across the Jordan River. Hopefully your IT projects haven’t taken forty years, but how often have you gotten to the figurative Jordan River and not been able to cross? With operational or transactional systems, even those developed in phases, there is a tangible release into production of some functionality that works, but with informational or analytical systems, it is not really complete until the loop is closed between information and action, and that involves some degree of change in the way people work, not a simple proposition. That leaves the feeling of being left behind on the river bank.
It is abundantly clear that all businesses are ramping up their ability to analyze data. But the practice of “analytics” is hindered by legacy approaches to gathering and modeling data, approaches derived from existing tools and architectures that are now largely obsolete. Some database vendors see embedding analytical processing inside the database as one way to get the data warehouse, for example, to budge from a reporting role to an interactive, operational, or predictive analytics role. But is that really enough?
Effective, iterative analytics in an organization is an ongoing process, not an ad hoc exercise. To be sure, good work is done by a single analyst using advanced descriptive, predictive and optimizing quantitative methods, but this is just the beginning. For these efforts to pay off in an organization, analytical models have to be woven into a series of steps where they become production artifacts, and that requires a whole set of capabilities far beyond those found in standalone statistical modeling packages. What is needed is support for the whole lifecycle of analytical applications: discovery, functional design, development, testing, version control, collaboration, production/optimization, profiling and monitoring, and, not least, presentation and visualization of results. With the right platform and tools, many of these steps can be automated, easing the effort required to build and deploy a highly advanced analytic application on big data. Without these capabilities integrated into the optimization and workload management tools of the database platform, analytical work is left stranded at the Jordan River.
A relational database provides the platform for data integrity, security, availability, reliability, scalability, and task management. When data is extracted from the relational database to an intermediate platform for reformatting and statistical processing, all of these qualities are lost. But until now, relational databases have lacked the breadth of functionality to support large-scale analytical processing beyond what could be done, often laboriously, in SQL. Lost in the fog of history is that relational databases were put into service on a widespread basis as transaction processing engines, not analytical engines.¹ Their ability to support even ordinary data warehousing tasks, such as high-speed bulk loading with adequate indexing and aggregation, or processing queries that generate large intermediate results, multiple joins and sort-merges, was severely constrained. As a result, database designers using traditional relational databases were forced to resort to a wide variety of design techniques, including modifications at the physical schema level; disabling of key database features such as referential integrity; extremely complicated designs using multiple schemas for historical and detailed data, current operational data and data marts; aggregation; and even restrictions on the use of system resources and data.
The dilemma was how to have a platform as feature-rich as a relational database for capturing, storing and provisioning large amounts of data, yet one that could handle unpredictable usage patterns, scale to data warehouse-size volumes, and give knowledge workers an engine for their full range of analytical processes, from scheduled reporting, to ad hoc analysis, to advanced analytics. While extant database vendors scratched their collective heads and searched for common ground, market demand allowed a new crop of relational databases to emerge, built to serve analytical rather than transactional requirements. It may not be a mathematical proof, but it is widely believed that a single database platform cannot serve both transactional and analytical needs; the transactional orientation is buried too deeply in the system’s DNA. Though some argue that a relational database could have two separate query optimizers, one for transactional and one for analytical queries, that is not an adequate solution because of the way the database operates, all the way down to the silicon.
Clearly, the solution is a new type of database appliance: an analytical platform that combines massive power and flexibility for traditional (analytical) relational database functions with a means to embed non-relational data, languages, and development tools in the same platform. One example of this is MapReduce, a programming framework for handling very large sets of data by using the parallel processing capabilities of advanced database platforms. That only works if the relational database is capable of ingesting functions like MapReduce, not merely “calling” them from a user-defined function (UDF).
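To make the programming model concrete, here is a minimal MapReduce-style word count sketched in plain Python. The three phases run serially here; on a parallel database platform each phase would be distributed across many nodes. The function names and sample data are illustrative only, not any vendor’s API:

```python
from collections import defaultdict

def map_phase(documents):
    """The 'map' step: emit a (word, 1) pair for every word seen."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    """The 'shuffle' step: group emitted values by key.

    In a real framework this grouping happens across the network,
    routing all pairs for a given key to the same reducer node.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """The 'reduce' step: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["the"], counts["fox"])  # prints: 3 2
```

Because the map and reduce functions are pure and operate on independent keys, the framework is free to parallelize them, which is precisely what makes the model attractive on large database platforms.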
Where We Are Today
Relational databases designed to support the analysis of data, rather than the capturing of it as transactions, deployed on highly scalable platforms, are now capable of hosting other complementary applications within their environment. An emergent technology, advanced in-database analytics, greatly simplifies the task of applying quantitative algorithms to the tremendous volume of data in these databases, vastly improves the productivity of the professionals who conceive and execute data mining, predictive models and process optimization, and will increase the practice and importance of advanced analytics in all kinds of organizations. Given today’s volatile, complicated environment, it is imperative to employ real-time, rapid-response technology solutions to meet the growing demands of the market. This is impossible without the application of advanced quantitative methods. There are many instances where public, private, charitable, and even religious organizations have successfully utilized quantitative methods to manage and improve their operations. Despite the recent chatter about “Competing on Analytics,” this has been going on for decades. However, with the exception of those companies whose whole business is data, or newer entities that emerged as a result of the Internet and developed internal processes on more current architecture, analytics is not widely used within organizations and is practiced in focused areas for only a small number of problems. The reasons are both technical and cultural, but the technical barriers are falling quickly.
Information technology today provides not only the means to capture data in excruciating detail; it can also manage and process that data at volumes, speeds and levels of reliability that are staggering. As a result, the practice of advanced analytics is changing rapidly. A recent innovation, the analytical database or data analytics platform, offers new opportunities to expand the practice of analytics by embedding quantitative methods, and provides the services around them to manage the process within highly functional, performant machines. In current practice, as depicted in Figure 1 below, analysts spend a considerable amount of time finding, cleaning and moving data from sources to analytical servers that lack the power and management features of a fully formed relational database.
The emergence of in-database analytics is an opportunity for organizations to apply the rigor of mathematics to the terabytes of data they collect. An analytic database appliance is an extremely powerful and scalable solution. With the data in the database tightly bound and available to the analytical engine (application environment that hosts the analytics application), not only can analysts create more models and run more versions of them, they can do so with the entire process controlled by the relational database environment for performance, load balancing, security and availability.
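The shift from extract-then-analyze to computation that runs inside the engine can be illustrated at small scale with SQLite’s user-defined functions. This is only a stand-in: a scalar UDF in SQLite is far simpler than the deeply integrated analytical functions described here, and the scoring model below is a toy, but it shows the direction of travel: the query, not the analyst’s extract script, drives the data through the model:

```python
import math
import sqlite3

def churn_score(tenure_months, support_calls):
    """Toy logistic-style risk score (a hypothetical model, for
    illustration): more support calls and shorter tenure -> higher risk."""
    z = 0.3 * support_calls - 0.1 * tenure_months
    return 1.0 / (1.0 + math.exp(-z))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER, tenure_months INTEGER, support_calls INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 24, 1), (2, 3, 7), (3, 60, 0)],
)

# Register the model as a SQL function: scoring now happens inside the
# database engine, row by row during the query, instead of after a bulk
# extract to a separate analytical server.
conn.create_function("churn_score", 2, churn_score)

rows = conn.execute(
    "SELECT id, churn_score(tenure_months, support_calls) AS risk "
    "FROM customers ORDER BY risk DESC"
).fetchall()
print(rows[0][0])  # customer 2 (short tenure, many calls) ranks riskiest
```

The design point is that the database’s own machinery for security, concurrency and workload management stays in control of the whole computation, which is exactly the property lost when data is shipped out to a standalone statistical server.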
In-database analytics is fostering the creation of new analytic applications, by both in-house developers and third parties, that are portable and trusted, lowering the time spent explaining and defending models. As a result, analytics will begin to pervade organizations, without the heavy cost of greatly expanding the number of quantitative analysts.
There are two challenges. The first is to design a relational database that is capable of supporting analytical work. This trail was blazed before, using proprietary and expensive platforms, but more recently, companies like Aster Data have introduced high-performance relational database engines that run in-database analytics inside an application environment, largely solving this problem at more attractive price points (Figure 2). The second challenge, though, is constructing these database engines not only to address the broadest range of analytical work (described below), most of which is decidedly non-relational, but also to incorporate all of the features needed to avoid the stuck-at-the-Jordan-River problem: not just supporting the number crunching, but providing the tools to support the entire lifecycle of analytical work in all its diversity.
About The Author
Neil Raden is an active consultant and widely published author and speaker and also the founder of Hired Brains, Inc. Hired Brains provides consulting, systems integration and implementation services in Business Intelligence, Decision Automation and Advanced Analytics for clients worldwide. Hired Brains Research provides consulting, market research, product marketing and advisory services to the software industry.
Neil was a contributing author to one of the first (1995) books on designing data warehouses and he is more recently the co-author with James Taylor of Smart (Enough) Systems: How to Deliver Competitive Advantage by Automating Hidden Decisions, Prentice-Hall, 2007. He welcomes your comments at firstname.lastname@example.org or at his blog at Intelligent Enterprise magazine at http://www.intelligententerprise.com/blog/nraden.html.
Advanced In-database Analytics Done Right