Modificado el por DavidBirmingham
I recently did a Virtual Enzee presentation and listed the Top Ten requirements for scalable bulk data processing inside a Netezza machine.
I'll come back periodically and elaborate on them
1.Platforms easily scale for increasing stress
We have a Netezza machine, so what could go wrong? I was asked a desperate question by an Enzee as to how to get more power out of their machine. After nearly two days of struggling with them I finally asked how big their machine was. It was a TwinFin-3. The answer I gave them, they clearly did not like and even sought solace on the shoulder of another. Who told them the same thing. Get a bigger box. TwinFin-3 is a dev box, not a production box.
Stress comes in many forms. Constantly changing requirements. The need for functional and physical agility. As these things increase, we need a platform that will work with us, not against us.
2.Human intervention eliminated wherever possible (no eyeball-based actions)
This means ALL aspects, not just operational ones. Everything from table maintenance to application development. AUTOMATE!
It is humorous to hear testers offer up their methods, with naive blurbs like "open the application and examine the contents". No, with billions of rows there is no such thing. We must use statistical checking that operates on sets, such as summaries, counts-of etc. No longer can be "eyeball" the data.
Likewise with runtime processes. Define a table with 200 columns and try to put an ELT query against it. 200 columns in the insert phrase, 200 entries in the select phrase, and to maintain it we have to keep them in sync with "eyeballs". No, this doesn't scale.
3.Architecture-centric platforms express applications with patterns
Oddly application developers, like those who develop using stored procedures, whip out a bunch of application-centric 'code" and when the smoke clears, they see repeatable patterns all over it. Unfortunately, they can't take the patterns anywhere because they are hard-wired.
The more architectural approach is to harness the patterns as capabilities and allow our applications to express from them. The application is then an expression of the capabilitis not the center of gravity.
4.Deliberately simple to leverage and operate
Large-scale systems can have mind-numbing characteristics for the un-initiated. It is incumbent upon us to deliberately simplify their interface points to it. Simple utilities, fewer keystrokes to achieve mundane goals, automation for rote tasks..
5.Built for administrative recovery, not reactionary recovery
This can be as simple as, when data arrives and has errors, we don't come to a full stop. We cordon off the error records into an adminstrative /logical status and report them for later remediation. In systems of scale, we cannot halt the processing of tens of millons/billions of records just because a few stragglers are misbehaving. The time it will take to process the data is the problem. If we are 20 minutes away from the process being complete, then we are always 20 minutes away if we have fully stopped the flow for the sake of a few records. If we allow the process to proceed with error-capture, we will close the 20 minutes and then the admins have more breathng room to fix the problem without the scrutiny or pressure of the clock.
6.Data and metadata-driven
The environment can no longer be driven by application code. It has to be driven by an architectural harness that responds and adapts to data and metadata. This is a non-trivial endeavor, of course, but entirely possible to achieve.
What does this look like? The data model is arguable the most volatile component of the solution. Changes in it can destablize a solution. We need ways and utilities to buffer ourselves from the impact of change all-the-while enabling the change. It won't do to tell the users that the data model is frozen for 6 months because we fear impact to our tightly-woven application code (e.g. stored procs)
7.Blended/hybrid approaches quickly adapt and scale
One doesn't have to make an exclusive choice between ETL and ELT. People really want to leverage the power inside the machine but feel constrained that doing so may obviate the ETL tool. Not so - both of these technologies have a major role to play and we should balance them for the best-of-breed solution
8.Template-driven applications: SQL is an artifact, not the center-of-gravity
In the VIrtual Enzee I offered several examples of templates for SQL transforms (insert-into-select-from), views (to avoid nesting) and stored procedures (to build from a template rather than editing them in a SQL tool)
Why do this? The developer puts application logic into the template. At run time, or installation time in case of the SP or View, we formulate the product from the template. This allows us to automatically include non-optional aspects like operational controls, inline status reporting and other elements that we don't want the developer to worry about, much less hand-craft on their own.
Need another bit of operational control? Add it to the template factory and don't worry about the application logic
More importantly, we can generate a template from the catalog and by definition it is tied to the catalog. It is therefore easy to compare the already-deployed templates to changes in the data model. Since 90 percent of all new columns are invariably pass-through columns, running an impact analysis like this captures over 90 percent of the issues in one shot.
9.Inefficiencies are our number one enemy
One of our clients had a TwinFin 48 they were planning to use for their development phase and then cutover internally to production. I asked them to dial back the developers so that it had the effective power of a TwinFin 12. They were a bit stunned at my request until I noted: The TwinFin 12 has a lot of power for development, but a TwinFin 48 will hide bad data models and sloppy code. Lots of power can make any lousy code/data model look spectacular.
Many cases of Netezza machine under stress, upon review we find that many of their inefficient practices have been going on for years, some since the box arrived. But the machine was so powerful it masks the inefficiency, like allowing the box to eat itself from the inside out
Preserve capacity at the processing level, not by guarding the data storage level. Do not be afraid to spin off replica data structures (even large ones) just for a different distribution, if it means that the machine can close its queries faster.
10.Operational integrity drives functional integrity
We understand this as a matter of quality control. Hamburgers from a national chain should taste the same no matter where we buy them. This is not accidental. The end user data is only as good as the processes that are delivering it.
If we make it so the operators have a difficult time handling it, or the admins don't understand it, or the troubleshooters can't get things done, they will start to grouse about the quality of their existence.
On the flip side, I know folks that we radically simplfied things for, and when we showed them the various utilities they would need to keep things in order, they balked. "Do we have to know all this stuff? Why is there so much stuff to know?" And yet, we have reduced a thousand things down to one, but they cannot grasp how much more complex it would be without our having simplified it.
We know that Netezza embraces simplicity. We just have to be mindful to maintain this spirit when we build things around it.
At the functional/capability level,we need to drive operational integrity into the data itself, outfitting the tables and rows with additional columns for the sole purposes of operational control. Otherwise the functional model is pretty much out-in-the-open and we won't have a way to manage the tables in a consistent, harnessed, repeatable form.
Modificado el por DavidBirmingham
I recently bumped up against a Proof-Of-Concept where "Two MPP Powerhouses went Toe-To-Toe" - and I was fairly excited to see that there might be a contender in the ring, stalking the PureData Analytics/Netezza machine. These POC's are always fun to watch. I am sure not quite as gratuitous as gladiators in the Colosseum, but engaging nonetheless.
In this corner...
The contender, dressed in white, was an MPP, er - clustered servers posing as an MPP. Now let's level-set on what an MPP is, and what it is not.
As we can see with the above high intensity graphics - 6 - 2 Cylinder Fiats versus a 12-Cylinger Jag. Now here's the trick question so don't be shy: Which one really accelerates to close the distance faster? Take your time, I'll be right here.
It's no mystery that orchestrated, optimized and purpose-built hardware beats general-purpose commodity hardware every time it's tried. The contender was a cluster of commodity servers posing as an MPP. When we tried to launch scanning analytic queries on it, we could practically hear the whirrrrr-click of the machines as they quietly, well, went silent. For a very long time. I wondered if they would ever offer up the answer. Unlike the Hitchhiker's Guide where they had to wait a million years to get the answer to the ultimate question, we decided to kick off the same query on the PureData/Netezza machine.
I hit the "enter' button while I was standing at the keyboard, then recalled that I needed to check on something else and lowered myself into the chair, but before I could sit - Netezza had the answer. No, it wasn't "42" but something a bit more actionable.
We left the bulding that day satisfied that the Jaguar had in fact smoked the competition. I probably should mention that even as we left the building, the "other guys" still had not come back with an answer. Sad indeed.
Scalability for scanning/bulk operations is a result of strong architecture. It cannot be cobbled together with general-purpose parts. The cluster of servers posing as an MPP had failed. Cluster Failure. Send it back.
Modificado el por DavidBirmingham
Sometimes the average Netezza user gets a bit tripped-up on how an MPP works and how co-located joining operates. They see the "distribute on" phrase and immediately translate "partition" or "index" when Netezza has neither. In fact, those concepts and practices don't even have an equivalent in Netezza. This confusion is simply borne on the notion that Netezza-is-like-other-databases-so-fill-in-the-blank. And this mistake won't lead to functional problems. They will still get the right answer, and get it pretty fast. But it could be soooo much faster.
As an example, we might have a traditional star-schema for our reporting users. We might have a fact table that records customer transactions, along with dimensions of a customer table, a vendor table, a product table etc. If we look at the size of the tables, we find that the product and vendor tables are relatively small compared to the customer, and the fact table dwarfs them all. A typical default would be to distribute each of these tables on their own integer ID, such as customer_id, vendor_id etc. and then putting in a transaction fact record id (transaction_id) that is separate from the others, even though the transaction record contains the ID fields from the other tables.
Then the users will attempt to join the customer and the transaction fact using the customer_id. Functionally this will deliver the correct answer but let's take a look under-the-covers what the performance characteristics will be. As a note, the machine is filled with SBlades, each containing 8 CPUs. For example, if we have a TwinFin-12, this is 12 SBlades with 8 CPUs, or 96 CPUs. They are interconnected with a proprietary, high-speed Ethernet configured to optimize inter-CPU cross-talk.
Also whenever we put a table into the machine, it logically exists in one place, the catalog, but physically exists on disks assigned to the CPUs. A simplistic explanation would be that if we have 100 CPU/disk combinations and load 100,000 rows to a table that is distributed on "random", each of the disks would receive exactly 1000 records. When we query the table, the same query is sent to all 100 CPUs and they only operate on their local portion of the data. So in essence, every table is co-located with every other table on the machine. This does not mean however, that they will act in co-location on the CPU. The way we get them to act in co-location (that is, joining them local to the CPU) is to distribute them on the same key.
But because our noted tables are not distributed on the same key, they cannot co-locate the join. This means that the requested data from the customer table will be shipped to the fact table. What does this look like? Because the customer table has no connection to the transaction_id, the machine must ship all customer records to all blades (redistribution) so that the CPUs there can attempt to join on the body of the customer table. We can see how inefficient this is. This is not a drawback of the Netezza machine. It is a misapplication of the machine's capabilities.
Symptoms: One query might run "fine". But two of them run slow. Several of them even slower. Results are inconsistent when other activities are running on the machine. We can see why this is the case, because the processing is competing for the fabric. Why is this important to understand? The inter-CPU fabric is a fairly finite resource and if we allow data to fly over it in an inefficient manner, it will quickly saturate the fabric. All the queries start fighting over it.
Taking a step back, let's try something else. We distribute the transaction_fact on the customer_id, not the transaction_id. Keep in mind that the transaction_id only exists on the transaction table so using it for distribution will never engage co-location. Once we have both tables distributed on the customer_id, let's look at the results now:
When the query initiates, the host will recognize that the data is co-located and the data will start to join without ever leaving the CPU where the two table portions are co-located. The join result is all that rises from the CPU, and no data is shipped around the machine to affect the answer. This is the most efficient and scalable way to deal with big-data in the box.
Now another question arises: If the vendor and product dimensions are not co-located with the transaction_fact, how then will we avoid this redistribution of data? The answer is simple: they are small tables so their impact is negligible. Keep in mind that we want to co-locate the big-ticket-or-most-active tables. I say that because we have sites that are similar in nature where the customer is as large as two of the other dimensions, but is not the most active dimension. We want to center our performance model on the most-active datasets.
This effect can rear its head in counter-intuitive ways. Take for example the two tables - fact_order_header and fact_order_detail. These two tables are both quite monstrous even though the detail table is somewhat larger. Fact_order_header is distributed on the order_header_id and the fact_order_detail is distributed on the order_detail_id. The fact_order_detail also contains the order_header_id, however.
In the above examples, the order header was being joined to the detail, along with a number of other keys. This achieved the correct functional answer, but because they were not using the same distribution key, the join was not co-located. So we suggested putting the order_detail table on the same distribution as the order-header (order_header_id). Since the tables were already being joined on this column, this was a perfect fit. The join received an instant boost and was scalable, no longer saturating the inter-CPU fabric.
The problem was in how the data architects thought about the distribution keys. They were using key-based thinking (like primary and foreign keys) and not MPP-based thinking. In key-based thinking, functionality flows from parent-to-child, but in MPP-based thinking, there is no overriding functional flow of keys - it's all about physics. This is not to say that "function doesn't matter" but we cannot put together the tables on a highly physical machine and expect it to behave at highest performance unless we regard the physics and protect the physics as an asset. Addressing the functionality alone might provide the right functional answer, but not the most scalable performance.
Last week (4/3/13) IBM did a product launch of the new Hadoop Appliance and the DB2 BLU Acceleration. The BLU model is columnar and they ran with the Netezza model of "simplify-load-and-go" so the total instructions to get data into the machine and act on it is now dirt-simple.
The Hadoop appliance also ran with part of the Netezza model. The Hadoop appliance takes the MPP approach in-a-box so that it's a self-contained appliance without having to stand up a gaggle-of-servers for the same purpose. Keep in mind that these appliances consume less power and generate less heat than the aggregate of their distributed counterparts on the raised floor.
I contrast this to the average hapless soul who wants to do Hadoop and calls upon his management to roll out a gaggle of servers to make it work, and cobbles together the necessary parts and software to make it all happen, painstakingly tuning the environment because that's-what-engineers-do. Then someone says, hey, we could have saved all that money (labor is not free, and neither is hardware) and bought a PureData appliance for Hadoop that has scalable power and a simplified interface - AND integrates to the other environments like PureData Netezza and PureData DB2 for a self-contained operational and administrative experience. We don't need to pay or hire our engineers to home-grow the core substrate. Now they can concentrate on what we hired them for: solving business problems rather than engineer technologies.
The bane of the above model is simply this: we will roll all of it out once, for one application. Repeating it for another application starts us from scratch again because rarely do our engineers roll out such environments with reusable patterns and modules. It is a custom-tuned and rarefied atmosphere for one business purpose. This is true of most application/solution development. The engineers do not focus on the parts they intend to leverage or reuse for the next application. It is all very application-centric all-the-way-to-the-Hadoop servers. One may argue that the Hadoop servers are reusable, but we know in application development that an app-server is rolled out per-application. So while the app-server might itself be similarly configured to other app servers, it is still a separate machine. At some point in this game, the "mission critical" card will be played and all other Hadoop projects will need their own hardware - er - their own gaggle of multiple servers. This is when the instances start to reproduce like rabbits. Would we rather just trade-in all those servers, or forego their purchase altogether and install an appliance? Even if it's one appliance per application instance, it's better than a farm of servers that stretch across a raised floor so wide that we can see the earth curve? Tempting no?
Orrrr - we could continue to do it the hard way. Many years ago I was impressed with the notion of "Eccentric Innovation" in that managers who were running out of capacity would act in desperation to stand up home-grown skunkworks (innovations) that were cobbled together by their most "creative" engineers who they did not hire for engineering or their ability to innovate - and ended up with an eccentric innovation - one that they would not have purchased off-the-shelf if given the choice, but that they instead paid several-times-more for and now they own it and only a handful of people on the planet can actually operate it. It's a very tense existence.
In the appliance genre, it sort of looks like this: If I give you a four-slice toaster, you will likely not use all four-slots except on busy mornings or if you have a big family. However, if I give you a 400-slice toaster, your problem is no longer toasting bread, but "bread-management" - keeping the toaster busy by pushing and pulling bread to and from it, and boosting your bread-movement infrastructure. No different for the Hadoop platform. No sooner will it roll out and people will start to use it, but will they use it enough to justify its expense? The total-cost-of-ownership is a glaring, almost blinding problem with a "common" Hadoop rollout but the costs of labor and upkeep are intangible. Appliances may have a tangible up-front expense but their low-maintenance and scalability mitigate total-cost-of-ownership issues.
And - of course - do we want a swarm of engineers running the Hadoop farm or do we want appliances in a lights-out ops center, quietly solving the world's problems before bedtime?
Many months ago I sat for some interviews essentially distilling the content of the Best Practice Sessions we executed in "deep dive" form at the Enzee Universe in Boston for several years.
Of course, time was always short and we were never able to touch on all the subjects in all the topics. As an example, the topic of Migration was jam-packed into one hour with a follow-on Q&A of 30 minutes. As one can see by the content of the monologues, there was always three or more hours of material we could never get to.
Now available on Amazon in a four-part "mini-series".
As with all analysis of implementations, please accept the following as a composite commentary (much like the Case Studies in Netezza Transformation). The names have been changed largely to protect the guilty. The innocent have already been punished.
So for those of you who may recognize shadows of your own environments in the discussion below, you now have plenty of time to get them cleaned up before anyone finds out about it! But honestly, don't admit to anyone that you are "doing it this way". Just fix it. What happens in the underground, stays in the underground!
I cannot (today) count how many on-site assessments I have executed or the variety of their outcomes. I have to say that on balance, most technology folks are pretty sharp and have things on track. I can usually advise them on how to make things better. This is of course exactly what they are hoping for. What manager wants to hear that they've done most it of it wrong? Or that their investment in the technology and the people, are a bust? No managers I know take their responsibilities so lightly. Some, however, inherit a mess from their predecessor and are flummoxed as to how to unravel it. They don't want me to "put lipstick on the pig", so to speak, but to provide a roadmap on how to dig out of the ditch (or hole, or rat's nest) and move things forward in a healthy direction.
Working with a 10400 Mustang, pre-TwinFin Era, one of our recently arrived data warehouse aficionados took our leadership aside and said, "What they are describing is a reporting system, like a data mart. But we aren't using any technologies to help them with this. We need to have a talk with them about standing up Microsoft SQLServer so we can put a data mart on it and..."
Stop right there. Yikes. He was so full of passion! It was really, really hard to talk him back from the ledge. So I finally said, "If you mention this plan to the client, even once, we will have to remove you from the project." And his eyes went wide like he'd been hit with a two-by-four across the forehead. "Why?" was his impassioned plea. Time to educate him on what Netezza does, right?
Netezza is a data warehouse appliance. It circumscribes and simplifies the data warehouse disciplines. It also makes some strong assumptions about the potential users of the appliance, not the least of which is what-problems-it-solves-well and what-problems-it-does-not-solve-at-all. (World Peace, Global Warming, Time Travel, Cloning of IT Staff Members, and getting the Dallas Cowboys to the Superbowl, to name a few).
Example: What if you were going about doing some-regular-task manually-and-tediously, and someone then showed you a device that would automate it? You might count your blessings and move forward with a skip in your step. But when you share the device's features with someone unfamiliar with the manual, tedious nature of an existence without it, they scratch their heads and say "I don't get it."
I am reminded of a joke where a lumberjack is in the market for a new saw. A powered-chain-saw salesman asks him how many trees he cuts down in a day with his manual saws, and the man says "30". Ahh, says the salesman, with one of these you could cut 100 or more in a single day. The lumberjack doesn't believe him, so the salesman tells him, Look, take this one for a test drive. Use it all day tomorrow and if it doesn't at least double your output, bring it back, no harm done, no questions asked. The lumberjack agrees but returns two days later, clearly disgruntled about the chain-saw's performance. "I was only able to cut down 10 trees with this lousy thing." To which the salesman balked, and wondered if it might not be defective. So he beckoned the lumberjack to follow him outside to their testing area, where he threw a log across two sawhorses and pulled the starter cord on the chain saw. When it roared to life, the lumberjack took a step back and shouted over the sound of the motor - "WHAT'S THAT NOISE?"
Clearly a little product-orientation was in order, no?
A CTO once lamented to me, "Well, we did the best we could with what we had." - Well sure. Don't we all? I don't know of anyone who borrows or rents help to do it poorly. Nor do they take their best people to deliberately make something sub-standard. The problem is, without a baseline knowledge of what the machine can do, how it is typically deployed, and what to embrace or avoid about it, then it's really no different than the lumberjack's problem. He did the best with what he had, didn't he?
What were the outcomes? Poor perception of the product by the user. An objective lack of productivity. General grousing about something that is not well-understood. Where have we seen this before? Give a call to practically any help desk of any product, especially a technology product, and they will bend your ear with "howling" examples of users who mis-applied the product - and some would say - were just plain stupid about it.
Underwriters Laboratory (UL) has a standard policy of quickly adjudicating claims against them no matter how frivolous. Seems that just having the "UL" on the product makes them a lightning rod for litigation. One man took his name-brand lawn mower, which also sported the UL sticker, picked it up while it was still running, and attempted to use it as a hedge-trimmer. He slipped, the lawnmower fell on him, and he sued the maker of the lawnmower and UL. Three boys found a giant bullfrog and decided to kill it by setting it on fire. They grabbed a gas can from the shed, doused the bullfrog with it and tossed a lighted match on the hapless creature. Someone should have told the lads about the volatility of gasoline fumes, because the flames climbed the can's fume-trail directly into its mouth and detonated the can's contents, seriously wounding and severely burning all three boys. They sued the makers of the gas can and UL, which also had a label on the can. Perhaps someone should send this one in to A Thousand Ways to Die. For bullfrogs.
But this is not a dissertation on A Thousand Ways to Fail with Netezza because frankly, it's really hard to fail with a machine this powerful. This is why I say that when we encounter howlers like the guy with the lawnmower or frog-immolation, we're clearly off the beaten path. Why is it then, that the "beaten path" pops up more often than it should? Or for that matter, pops up at all? Aren't data warehousing folks a little smarter than that?
Of course they are. In fact, I don't recall encountering any experienced data warehousing folks who have had a bad experience - quite the opposite. However, as for the folks who have never built a data warehouse but have a lot of experience in "applications" - well in this zone it can get a little choppy.
The point is, across the fruited plain we have exceptions to every rule. My sincere hope is that your project is not inadvertently caught in the "crosshairs of fate".
Rat's Nest Number One.
Upon arrival on site I knew something was wrong. People were squeezed into their cubes, boxes were stacked against walls in every room. The whole office just felt so crowded. And then they introduce me to "the machine". In this case, their production machine was the lowest-powered machine that Netezza had to offer, just short of a Skimmer. The admin at the desk barks at me for parking in the street and not in the garage underground. It's my first day, and nobody said anything about parking. The difference in cost was exactly $1, and if you're like I am and travel a lot, this kind of difference is not worth discussing. Except for here. It's tongue-lashing time. All of these things added up to some significant red flags, moving in a direction of a road lined with red flags.
Case in point, this is not a company doing things expediently or frugally. They are cheap. They will do things according to the lowest denominator of cost and skill, not because they have balanced priorities. They would rather save a few dollars on training or even a rent-an-architect, and allow the least-of-their-staff to painfully slog through the nuances of data warehousing on an immutable deadline. It's the immutable deadline I can't fathom. Here's why:
In a solution implementation, we have cost, duration and quality. Pick two. Whichever two you pick will shortchange the third. Every time it's tried. Well, these folks were shortchanging all three in the blind naivete that it was valid and workable. Without time or resources, quality is always the first, most expedient of the three to fall on its face. Doing it on-the-cheap? Well, what does this say of their readiness for prime-time? Data warehouses have an ongoing cost-of-ownership. It's not trivial. Those who want to play cheap should find another profession, one that does not cherish quality.
I was told by the client that their current back data processing environment used Netezza stored procedures. Another big red flag. Stored procs invite black-boxed code and we cannot capture data lineage through them. Netezza stored procs are ideal for the front-end. Never for the back-end. They are hand-crafted and rather ugly to maintain (this would be true on any platform).
On this particular platform, they had decided not to use monolithic stored procs (a proc with a lot of serialized operations in it) but use modular ones. So modular in fact, that each inbound data stream had its own dedicated "receivor" stored proc, followed by three more role-based stored procs plus another two - one to validate and one to push the data into the final target table. All 150 incoming record descriptions/filestreams had these 6 stored procedures assigned to them. With one catch - they were the same "role" of stored proc, but all of them were different. That's right, to intake 150 tables we saw 6 stored procs each, for a grand total of 900 stored procedures, and this was for just one of several data sources!
Many of you OO aficionados see something screaming out at you, that this should have been one general-purpose loader with six phases of operation, serving all 150 streams. Adding another source and another 100 streams, no problem, they go through the same loader and phases. Need more phases of operation? No problem, just add them to the loader and everyone benefits. Forever. It's a beautiful thing.
Of course, this means a deliberate instantiation of some reusable infrastructure. Many app-developer folks are not familiar with how to do this. After all, with 150 tables incoming,we could expect those definitions to remain pretty stable. But if the same stored procedures are facing the internal data model(s) (and they must), then we have worse than a cut-and-paste rat's nest, we have a hand-crafted rat's nest. If the data model must change, we may effectively invalidate most if not all of the stored procedures. Can you even imagine having to review - and re-review 900 stored procedures so -- oh never mind.
It therefore did not surprise me to learn that they had "frozen" the data model so that the stored procedures could have higher durability. We know this isn't realistic either, because the business will start to drive more requirements into the solution and the model must change to accommodate it, even if it's just attribution of existing tables. How do we keep these things from impacting the existing code? We have no choice but to freeze the data model. But this isn't really a choice, it's more like a un-necessary evil. Their stored procedure implementation only guarantees one thing: their functional code base will be in flux and unstable for the duration of the solution's lifetime.
I made a valiant attempt to explain the rather problematic issues concerning their implementation. (Problematic here, is a polite and professional term for rat's nest without having to say so). I have to admit however, that "rat's nest" may have done disservice to the rats. They also wanted me to "jot down" a list of "enhancements" that would make their solution better, stronger, faster - all that stuff. I could not think of a profesisional term for "burn it to the ground and start over".
Perhaps I could have told them the bullfrog story.
Rat's Nest Number Two
In keeping on our theme with stored procedures - recall - stored procedures in Netezza were originally concieved for supporting the front end BI tools. Not back-end data processing. In fact, pushing the back-end data processing under higher programmatic control is something new - even to ETL tools. That Netezza supports it very well is a bonus. Actually, Netezza does it better than any of the other databases, because the ease of manufacturing an intermediate table, using it and tossing it, is amazingly simple and easy to manage. Other machines, not so much.
But when I say "ELT" or back-end processing, it's still a SQL statement. We have options, like hand-crafting the SQL in script. Been there, done that. Or hand-crafting a stored procedure. Not really interested in doing that again. And then we have generated-SQL from a template or metadata-driven framework.
ETL tools have pushdown, but it's still pretty weak. At least, too weak for power-users like moi. I have no doubt that they will step up, eventually.
In this example, we have the opposite problem from the first. Just as much of a rat's nest, it is a monolithic stored procedure rather than a gaggle of modular ones. The monolithic stored procedure often runs for an hour or more, executes hundreds of SQL statements along the way, and has a lot of detailed steering logic embedded amongst them. It is a veritable nightmare to code and debug, and even worse to troubleshoot. I hear that some developers have been invited to padded cells afterwards, but I think those are just exaggerations. It can't be that bad, can it?
Given choice between only the two, I would choose the gaggle-of-modular over the monolithic. I mean, if you were implementing it and not me. I always have the choice to say no. I don't work for your company, after all. You may not have a good choice. Your uber-architects and their hired guns have told you it's stored-procedures-or-nothing. So it's time to pick a poison I suppose. I'll take hemlock for $400, Alex.
As these stored-proc programmers stared back at me with hollow eyes, I thought I had entered some macabre Tim Burton flick and all we needed was some spooky music, fog-machines and strange howling in the distance to make it complete. They spoke in muted, muffled tones and their questions seemed to drift. Had they slept in like, the last 48 hours? They all looked sooo tired. This is what a monolithic stored procedured does to your staff. Now watch it drain the lifeblood from your operations staff. It is the virtual/technical equivalent of leeches, and you thought we'd left those behind in the Middle Ages (for technology, that would be the 1990's)
Stored procedures don't have single-step capability. When we add another function to it, we have to test all of the functions at once, because it has to run end-to-end. We can creatively work around this in the beginning, but eventually we have to integrate it. When one test takes over an hour, or two, and the answer is buried in the mountain of carefully crafted NZPL-SQL code, at some point we have to wonder what we signed up for. (that would be, we signed up to do it wrong). Ouch.
Stored procedures cannot be parallelized (unlike their more modular counterparts) and as such is a glaringly missed opportunity. They are doomed to be serialized forever.
Now, our framework (that we consult and use as a problem-solving platform for Netezza - nzDIF) handled 100 percent of all data lineage no matter how many intermediate tables, databases or machines are involved in the overall flows and handoff of work. You won't get this with any other product, nor with anythng a stored procedure has to offer. This would be true of any stored procedure on any platform.
This is because on a transactional platform, procs are meant to handle multiple operations on singleton entities so data lineage simply is not an issue. On a Netezza platform, procs are meant to serve the BI platform, not the back end, so likewise data lineage is not much of an issue. Stored procs for the front end are largely summary/filters for pre-existing datasets. We want the lineage on those datasets, not the on-demand operations that consume them. "Could" we expect data lineage from stored procedures in Netezza? Why? The only reason would be to support back-end processing, and stored procedures are not for back-end processing. It's sort of a Catch-22.
Rule #10 in play here
And let's not forget Rule #10, shall we? Recently emblazoned in glowing letters on the catacomb walls of the Underground, Rule #10 is very simple: Never do bulk data processing in a general-purpose RDBMS engine running on a general-purpose platform.
Now, I just had to get Rule #10 into the forefront because this underscores the primary reason why stored procedures are bad for back end data processing. If we have a rule in place against using SQL for bulk processing on a general-purpose platform/engine, then any experience we may have with bulk processing through stored procedures on such a platform is itself a violation and not a marketable skill in the Enzee Universe. More importantly, it institutionalizes the violation and makes it so much worse. We could "maybe" dig ourselves from a ditch if we're using hand-crafted out-in-the-open SQL (also not recommended) but when ensconced behind the fortress of stored procedures, we have to first storm the fortress before we can loot it. Easier said that done.
All that said, folks who have instantiated stored-procedure-based data processing on general purpose platforms have already been doing it the wrong way, so why bring those practices into the Netezza machine? Just sayin'
Rat's Nest Number Three
Ahh, you thought we were done and coming into the home stretch, eh? Well, we're almost there.
This particular rat's nest only appears in places where people have churned a lot of contractors, consultants and other aficionados and hired guns through the company's various revolving doors. And as the person who inherits it rightly recognizes if as a rat's nest, or a hairball, something comes to mind that rushes through their brains like a river of water "Wow, and I deliberately signed up for this. What could I possibly have been thinking?"
Ahh, not to worry, this syndrome is rare, and shall pass. Breathing deeply will override the spooky breathing cadence of the Dark Lord of Expectations, and shall give you extraordinary confidence on how to resolve this problem.
This condition is entirely severable from the technology itself. Any shop that allows the contractors to establish their own standards without oversight is just signin' up for a world-of-hurt somewhere down the line. Fortunately for us, the Netezza machine is like a monster truck with gumbo-mudder tires. No matter how mired-in-clay it may be, we need only fire the engines and punch the accelerator to regain control and be underway in no time.
The first step, like any 12 step program, is to recognize that a problem exists. Bandaging a hemmorraging wound will not heal it. This will only forestall the inevitability of bleeding out. If we are to be proper stewards, bandaging has its benefits while we treat the larger wounds.
First and foremost, commit to some form of data management logistics. And this is not by purchasing a data backup tool. This is a committment to flow-based, insert-only architecture as the rule, with updates and deletes as the exceptions. After all, if we were using an ETL tool, we wouldn't be able to update or delete a flow of work. We can only integrate and filter the data while it's on its way elsewhere, but that elsewhere will always be an insert-only target - because it's a file set and and not a database. Only when we reach the book-end of the database can we perform updates and deletes, and these are largely to support things like constraints and slowly-changing dimensions. We just need to avoid invoking an insert/delete/update protocol for all tables at all times. Center on a theme and accommodate the exceptions. We must have rules, and this is one of them.
Commit to some form of rules-driven architecture. That is, when we encounter a new condition or potential fork in the logic, consider shaping it with a rule (one that we can switch on/off or modify from afar) rather than hard-wired SQL or hard-coded solutions. Is this easy to do? Of course not. Nothing is easy about data warehousing or large-scale flow mechanics, silly rabbit.
Netezza has simplified the harder, tedious and repeatable parts so that we can actually address the issues we never had time for before. The "next level" was never in view, or even on the radar because we were always immersed in the operational weeds of the implementation. With a Netezza machine, lots of that is behind us, but before us stands the new challenge. It goes something like this:
If I were to give you a 400-slice toaster, your problem is no longer toasting bread, but bread management. Keeping the toaster busy has now become a daunting problem of bread logistics, not machine capacity. The problem domain has shifted into a zone that lots of folks don't have any experience with. Time to step up.
Tossing around the term "Big Data" these days seems to elicit a wide variety of feedback, concern, conjecture, etc. Those in the "old school" of Big Data wrestled with billions of rows of structured data. The buzz now is for unstructured Big Data, and for folks in this zone, it's as though the "other" form of Big Data never existed. After all, they were never exposed to it. Their introduction to Big Data, as though it was brand new, was big "unstructured" data. I recently posted out on Linked-In a tongue-in cheek rendition of the branding problem Big Data is having. I have received so many emails on it that I thought it deserved a little more exposure, if only for pure entertainment purposes. Here we go:
Big (unstructured) Data was so-dubbed for lack of a better word. Unfortunately it has suffered the same ambiguity as Kleenex (don't we ask for a Kleenex when we mean any-old-tissue?) or Coke (don't kids ask for a Coke when they really mean any-old-soda, and didn't Coca-Cola have to go on a market scourge to avoid losing the meaning of the word? If you ask for a "Coke" the person at the counter is required to correct your choice if they don't serve it) and don't we use the word "cellophane" to mean - oh wait - cellophane really is a distinct product that never protected the meaning of its name, so it really is lost.
I would honestly rather protect the meaning the term Big Data to mean Big Structured Data without having to say so. Alas, I fear it has already been hijacked forever.
Give me a chicken sandwich, french fries and a Coke, with a side of Data.
We only serve Pepsi, will that be fine?
Yeah sure. And supersize the data to Big Data.
Will that be Big Structured Data or Big Unstructured Data?
What's the difference?
One is like a burger and the other is like a salad.
One we make from a stack with a secret sauce. The other we just toss together at the last minute.
Does it come with dressing?
As much as you can handle.
Make it so.
Unstructured it is. That will be $20.13 at the first window.
I tossed the clothing into the washer, grabbed the non-chlorine bleach, popped off the cap and poured some into the tub with the clothing. My wife, horrified, asked "What are you doing, didn't you measure it?" To which I say "Of course, it was three bloops."
"Three bloops," she recoils in further horror, "are you kidding me?"
"Well, no, look it requires exactly one cup of bleach for this size of load," I explained, then grabbed the measuring cup and turned the bleach bottle on top of it and let the contents "bloop" three times. It made exactly one cup of liquid. It would be one cup of liquid no matter how many times I repeated it. I simply took the shortcut and measured the bloops. An inexact measurement but just as effective. Of course, she wants me to use the measuring cup every time, and cannot imagine going through life with a bloop-here or a bloop-there. I mean, all those recipes in the kitchen with exact measurements. Do we reduce those to inexact quantities too?
A great mathematician once told me that he never figures the exact to-the-penny tip for a wait-staff of a restaurant. If he wants to give fifteen percent, he mentally calculates ten percent by chopping a zero off the whole-dollar amount, then divides this value approximately in half, adds it to the first then upwardly rounds to the nearest dollar. This method is horrifying to accountants, whom I have seen use calculators to figure exact tip amounts. He also told me that when figuring Celsius temperature, he simply subtracted 32 and divided by two. Is this the right "formula"? No, it's supposed to be 5/9ths right? But if all he wants to know is whether to grab a sweater, coat or neither, this is close enough.
I made some chili later that evening. This is a simple reciple. We brown and drain two pounds of ground meat. We then toss this into a two quart container and follow it with three large cans of diced tomatoes. Plus three packets of chili mix. Do I need to read the labels, really? Sure, it calls for a cup of this or several ounces of that. But lets face it, we're after a certain taste and I know that these ingredients in these proportions deliver it. This particular evening my daughter and I were making the chili together and she carefully followed the instructions, but we were left over with half a packet of chili mix plus half a can of diced tomatoes. We can see the pattern here, right? So to her horror I tossed the additional tomatoes into the container along with the mix and started stirring. I don't know if this is a "guy" thing or not. My youngest son was also horrified at my abject disregard for the instructions on the packet. I have tampered with the forces of the universe, you see. Do not deviate from the recipe, lest the earth open up and swallow us all. Or something like that.
Anyhow, when we served up the chili, they ate as voraciously as always and all was well. As a throwback from the days of yore, I add mustard to my helping of chili and use Frito's Scoops to spoon it out. Anyone familiar with Frito-Pie fully understands the connection.
Now one might well ask, what on earth has this to do with enterprise architecture or anything akin to it? In science, aren't we supposed to cross every T and make sure nothing is amiss? Yes and no. Implementations drill on detail. Architecture, not so much.
I listened to a crew of IT admins debate the required size of their new environment. One of them quipped that if we need more space or CPUs, we would need 6 weeks of lead time to order it. After sizing the environment, all agreed that we were at least within 10 percent of the necessary sizing, and this should be good enough. All except for one, who was concerned that if it was too small, we would have to order more. But you have six weeks to decide, right? Well, no, we need to order the hardware now so that it will be here in six weeks when the rollout needs it. Yeah, said another, but if we run into an issue between now and then we just order more. We don't need all of it right way. We're only using about twenty percent of the capacity to begin with. What's the big deal?
And yet, they continued in their paralysis, unwilling to make a committment until one of our rank simply forced them to. In the blink of an eye, all was clarified when everyone present was willing to admit that such decisions have inexact quantities all over them. It's an educated guess. Like a hypothesis. So we're still using science, but we have to make a call and get moving. People depend on it, and time has expired to delay any further.
We see a lot of this kind of in-exactness in data warehousing. Capacity planning especially has percentages and utilization wriitten all over it. A case in point is that we need to be seriously considering capacity upgrades when a system reaches 60 percent of its current capacity. Why is this? Because a red line exists at the 80 percent mark, and above that is reserved for system recovery and workspace. It is not okay to presume that the "last 20 percent" of capacity is available for regular use. It is the red zone. But if we go to the boss and say we are reviewing capacity when the current utilization is only 10 percent above the halfway mark (60 percent) - they may well ask - isn't this a bit premature?
Well, it honestly depends on how long it takes to get the upgrade/transition for capacity underway. If the assessment takes a few weeks, and the designation, procurement, delivery and installation take many more weeks, we have to consider how fast the data is growing. Will the data grow into another 10 percent of the machine by the time we're able to install the upgrade? Okay, then, we're still outside the red zone. But every percentage point above this is one tick closer to the red zone. And we don't want to cross it. Many times I have personally witnessed "perfectly operational" systems simply hang one day. Out of the blue. The processing capacity required for an intermittent spike did not have room to finish. Or an error recovery needed more spill space than it had left to give. Simple things often lead to catastrophic failure when chaos has no place to go. Or for that matter, when the machine cannot dispatch the chaos because it is already too overwhelmed.
A colleague relates that in his data processing shop, the disk space had long since breached capacity and regularly spilled over to tape drives during the evening's processing window. As he described it to me, their environment was using tape drives for runtime workspace! And the CIO could not stop complaining about how long the jobs were taking, but simply refused to buy any additional disk space for the environment. In his mind, the final storage needed exactly 80 percent of the capacity and they were not in the red zone. But weren't they?
In a data processing scenario, the "understood" quantity of disk space runs anywhere from 6x to 8x of the final product's size. So if we are targeting a 1 TB warehouse, we would need between 6 TB and 8 TB of workspace to support it. Whether this is actually hosted on the physical database machine is immaterial if the disk space is shared between database and the external, flat-file world, which is often the case. I recall one instance where I specified 300 gb for a 20 gb warehouse and the manager, himself a warehousing aficionado, raised a strong objection to such a need. When we did the math, I was actually being pretty conservative in the estimate, seeing that we needed to support Development, Testing and Production workspaces, you see. With 60 gb between them, and 6x needed to support each - voila! We have easily breached 300 gb. The punch-line of course, was that they needed to order an additional 150 gb to finalize the project. Oh well, it's just an educated guess.
But if he was willing to give me so much grief over 300gb, imagine what I would have heard for 450gb? The point being, it won't always be true that our bosses will give us, for all configuration lifecycle environment, upwards of 40x what we need in the end - but it certainly gives us insight as to why "all that disk space" seems to evaporate within a couple of months of the project's inception!
Set-based operations, big structured data handling, and now big-data on-the-grid, we will find even more "inexactness" to wade through. I had an interesting conversation just this week with someone who could-not-believe the data being returned by his big-data cluster. Something had to be amiss, he asserted, because "he just knew" that things had-to-be-different. Basically, he'd spent millions on marketing and brand recognition and had expected measureable lift for his product. When it did not arrive, it basically meant that all those millions were spent for nothing. Either that, or he was looking in the wrong place. I asked him if sales ever changed after one of these marketing pushes, and he said no, the marketing pushes were traditionally geared to keep product loyalists from defecting.
So I asked a very impertinent question - how do you know if the marketing pushes are doing anything at all? Wouldn't it be odd to just forego the next marketing push and see-what-happens? This was interesting to him, but simply out of his hands. The engine to create the marketing collateral and the waves of market "push" were ensconced as science in the highest echelons of the company. Asking them to forego even one cycle, and the risk involved in such a thing, could be suicide.
So let's measure it, I suggested. If the quantities are a science and we can know for certain where the lift is, or is not going, we can measure it as a trend. So he set up a number of market "trolls" as it were, to cast the net for information on their products and various trending for competitor products. He pulled these stats daily for the month prior to the marketing push, through the push and for one week thereafter. I warned him that if it measures "nothing" we have nothing to report on. We really need to report on "something" so we can show what directly affects loyalty to the product. He knew of several interesting 'anti'-quantities that could show us, almost in negative terms, whether the marketing push had any value. These are proprietary so I will not share them here. Nonetheless, by measuring these anti-quantities we could see a loyalty trend in a different way. Not when people re-aligned with their products, but when they dis-aligned with them.
This was an interesting graph. It showed that the loyalty to their brands had less to do with their marketing pushes and more to do with the sale-event discounts associated with their competitors. In a comedy of errors, their marketing pushes just-so-happened to be timed when their competitor sale-events were ebbing off, offering the illusion that loyalty was being restored when it fact it was simply re-aligning to its normal center. What if, he mused, they simply delayed the marketing push for some point after the customer loyalty naturally aligned-to-center? This could in fact pull in even more loyal customers, or let them know whether the loyalty push had any value at all.
Last year I ran into my colleague again, some two years after we had first characterized the situation. He told me that after showing all of the metrics, the marketing folks pooh-pooh'd his findings. All except for one, an ambitious soul who had recently been promoted to the second-in-command of the marketing department. According to legend, this person worked with my colleague to distil the right answers, inexact though they might be. Some 8 months later, they finally agreed to offset the marketing push by four weeks to see what the results would be. Three weeks into this cycle, with the upward trends behaving normally even though no marketing push was underway, gave them what they needed to know. The marketing pushes were at best, mis-timed and at worst, completely worthless. They decided to forego the marketing push entirely. Six months later, the trends remained in place without any effort on their part. With a primary benefit: They had saved many millions of dollars in marketing expenses. And by this time they had already begun the process of running an entirely different kind of marketing push, this time on the edge of the peak rather than in the trough of the lull.
So we can see that watching "trends" or "patterns" of gross movement gives us insight into how to attack (or retreat from) the marketplace in ways that make us more competitive. These gross movements are inexact. While we cannot conjure up successfull marketing potions with three-bloops of elixir, the approach to success is not so different. Patterns, swaths, wakes, edges, trends, peaks etc are all inexact measurements we derive from the existing information. But we need to do it on a scale that is impossible with commodity, general-purpose technologies. More importantly, while the detail data drives the final results, incrementally more information may not "move the needle" at all. In fact, just like the measurement of three-bloops - there's very little in how that measurement system will deviate from the center in any signficant manner. It is this "significance" we care about, and why the inexact results and processes to derive them may not be lockstep-perfect, but they tell us what we need to know.
So the CEO is glaring at his white-board, or rather, white-wall ever since he had an entire wall converted to white-board surface - and its border-to-border scribblings, ad-hoc graphics and spider-web of interconnected projects. He wonders when these things will finally fall "under control".
He chats at board meetings, his various coffee-klatches and among his business acquaintances, peers and lieutenants. He has dreams. Big ones. Who doesn't? But we know that a corporate dreamer is no different than the several kids lying in the grass staring at the clouds or stars, their craniums barely touching one another as their bodies are configured in a pinwheel. They dream of what those clouds look like, or what it might feel to fly among the galaxies in a starship. If only they had the power to fly. The power to reach lightspeed.
But in a lot of ways, they already can. Research has shown, in various ways, that the "speed of thought" in the human brain requires countless millions of signals in any given fraction of a second. The permutations of thought-pathways are themselves mind-boggling, and yet all of this takes place rather seamlessly and instantaneously. People think just as quickly at the supermarket as they do in their homes or when driving. Martial artists don't particularly think faster, but have trained their reflexes to second-guess their brain's immediate intentions. What does all this mean?
One could assert that the human brain is already processing data far faster than mere electronics could possibly accomplish, meaning that the human brain actually thinks, on the aggregate, faster and with more directed power than the speed of light itself. Okay, that's pretty heavy, but think about those kids wondering if they could travel at the speed of light or beyond, when the mere act of dreaming has already taken them there?
How to capture this speed-of-thought capacity? When analysts stand up their uber-powered environment, they want to query their data, receive a response, formulate a ripost and fire it back again, as fast as they are able to think. Well, what machine could adequately keep up with the human brain? Not that it's directly bound to the brain or anything, but we know that if an analyst has to wait for ten minutes, or an hour, for a return-on-query, that their thought processes are not willing to keep themselves on hold for that long. The human brain moves on, and has formulated many more questions even before the machine returns the first one. Underpowered technology like this cannot serve analytics in any significant capacity.
Now let's go back in time to NASD (securities regulatory agency for NASDAQ), which in 2006 had a processing day that exceeded 28 hours. That's right, they would start a new processing day before the prior one finished, and they would catch up on the weekends. No matter how the CEO rearranged his thoughts or white-board priorities, it would not change the primary problem: a lack of capacity. They installed a Netezza machine and recharacterized their workload on it, effectively reducing their processing day to less than 2 hours. Simpler processes. More accurate answers - and some answers that had been unavailable before. Capacity in hand, the business forces saw an opportunity arise that would have been impossible without this capacity, and so NASD merged with NYSE's counterpart to form FINRA. This is not to say that Netezza was solely and completely responsible for the merger, but definitely played into the ability (or lack thereof) to make decisions toward this goal. Face it, without capacity, FINRA would never have seen the light of day.
The CEO doesn't really care about the length of a batch window or SLA compliance. He/She cares about capacity. This would true of any CEO on the planet. Whether it's a hospital, retail chain or head of a consulting firm, they want to capture more business and maintain good service with their current base. This requires capacity and "bandwidth" to do so. That CEO might have the best-laid plans on the planet, but without capacity, he's just a dreamer. Just another dreamer.
Is this what they thought of the CEO of NASD, or any other corporate exec who knows what the answer is, the path to sucess, favor and recognition in the marketplace? Knowing what the answer is, and having no power to fulfill it, well, looks a lot like just-another dreamer. What enabled this dream? Why dream at all if there's no capacity to make it happen?
So an architect gets the PureData/Netezza machine in-house, and handily squares-away that first project. Then another, and another. Unbeknownst to folks at her level, the CEO is happy to erase that first project from the white-board, and the next, and so forth. In reasonably short order, the project work has been assimilated by the time the CEO erases the final piece of the puzzle and steps away from the white board, now an empty canvas.
Then something interesting starts to happen. The "drag" on the CEO's mental cycles has abated, the engines are still firing on all cylinders but there's no overhead - all those outstanding projects - to distract them. Thoughts accelerate into the light-speed zone. A hand reaches for the EXPO dry-erase marker, and like the light-saber jumps into the Jedi's hand on-demand, the marker meets his hand in the air and the ideas start to flow.
He starts to dream.
And now we have a very powerful situation - a dreamer who is actively dreaming, with the power to make the dreams come true. Which is more valuable? The visionary with nothing more than gossamer-thin hopes for a brighter tomorrow? Or the dreamer with the capacity to make dreams come true?
Is a Puredata/Netezza machine a dream machine? Is it what dreams are made of? This might sound a bit surreal, but what other machine has the capacity to respond at the speed-of-thought, the ability to assimiliate - with a little assistance - the dreamer's thought processes? And so like Jackie Chan-in-a-box, it is able to reflexively anticipate the competitors next move. Whether that's the corporate adversary, the competitive company, the market forces or the various risks in every new endeavor - we need a machine that can defend us like a blackbelt, can anticipate like a reflex, and can convert mere analytics into a jump to lightspeed, leaving our competitors in the proverbial dust, stardust tho it may be. How's that for thinking big?
I have noted in prior posts several of the "banes" of in-the-box data processing, not the least of which is harnessing the mechanics and nuances of the SQL statement itself. After all, the engine of in-the-box is a series of insert/select SQL statements. I've also noted that we need to squeeze the latency out of the inter-query handoff and management. These are important factors for efficiency, scalability and adaptability.
But this article deals primarily with "adaptive" SQL, that is, the ability to surgically and dynamically control the SQL, the paths of flow between SQL statements, their timings and the ability to conditionally execute them.
I am drawing a contrast between this approach and the common "wired" ETL application. In the wired application of an ETL tool, all components are known and flow-paths predefined. If we want to shut off a particular component or flow, we'd better make that decision at startup because we won't get to do this later. A benefit here is that if we add or change a flow-path, the ETL tool's dependency analysis will (usually) detect it and give it a thumbs-up or thumbs-down. We can (and do) perform this kind of design-time analysis, but what of dynamic run-time analysis?
Case in point: One group performs trickle-feed of data from a change-data-capture, so on any given loadng cycle, we don't know which files will show up. Not to worry in an ETL tool, since we would just build a separate mini-app to deal with the issues. The mini-app would key on the arrival of a specific file, process the file and present results to the database. This is a very typical implementation. But with hundreds of potential files, it's also logistically very daunting and hard to get the various streams to inter-operate. In fact, an ETL tool quickly reduces to "sphagetti-graphics" and the graphical user interface is just in-our-way at that point.
Case in point: One group has multiple query paths/flows where sql statements build one-to-the-next for the final outcome. These can follow a wide range of paths not unlike a labyrinth depending on a variety of different factors. The problem is, these factors aren't known until run-time and only appear in fleeting form as the data is processed. How do we capture these elements and use them as steering logic? In an ETL tool, our options are limited to none. In this particular case, three primary paths of logic were available each time the flows ran. Sometimes all three paths ran end-to-end. Sometimes only one, or two would run, or perhaps none-at-all. The starting conditions and unfolding data conditions determined the execution path.
But we have another name for this don't we? Isn't this just plain vanilla "computer programming"? Where the data shows up and we use the encountered-data and encountered-conditions to guide the IF-THEN-ELSE logic to conclusion? The problem you see, is that we are so accustomed to using IF-THEN-ELSE at the ROW/COLUMN level, we cannot imagine what this would look like at the SET level. Ahh, the conditional logic driving SETS is unique and distinct from that which drives basic elements. But then again, we can only scale in sets, not the basic elements. THis is where the dynamic nature of conditional-sets is invaluable.
But this isn't really about conditional sets, either. Only that conditional sets are a necessary capability and we have to account for them along with many other subtle nuances. Let's follow:
We have an external file and we load this into an intermediate/staging table (TABLE-A) in preparation for processing.
Now we build another target intermediate table (TABLE-B) and an insert/select statement to move / shape the data logically and physically from TABLE-A to TABLE-B.
From here we have several more similar operations, so we build intermediate tables for their results as well, such as TABLE-C, TABLE-D and TABLE-E
TABLE-A >>> TABLE-B >>> TABLE-C >>> TABLE-D >>> TABLE-E
Now let's say we have another chain of work starting from TABLE-V:
TABLE-V >>> TABLE-W >>> TABLE-X >>> TABLE-Y >>> TABLE-Z
Now something interesting happens, in that the developers sense a pattern that allows them to reuse certain logic if they only put these quantities into a couple of working tables, which we will call TABLE-G and TABLE-H, and now the flows look like this:
TABLE-A >>> TABLE-B >>>>>>>>TABLE-C >>> TABLE-D >>> TABLE-E
TABLE-V >>> TABLE-W >>>>>>>>TABLE-X >>> TABLE-Y >>> TABLE-Z
Notice how TABLE-G is feeding TABLE-C and TABLE-H is feeding TABLE-X, so that each of them have a 2-table dependency.
Now we get to the end of the chain of work and learn that TABLE-Z has to leverage some data in TABLE-C! We don't want to rebuild TABLE-C just for TABLE-Z, but in an ETL Tool this data would be bound/locked inside a flow. We could redirect the flow to TABLE-Z, unless the flow to TABLE-Z is entirely conditional and we don't know it until we encounter TABLE-C. What if, for example, the results of TABLE-C are conditional and if the condition is realized, none of the components following TABLE-C are executed. However, we could have TABLE-Z see this absence as acceptable and continue on.
Okay, that's a lot of stuff that might have your head spinning about now, but the simplicity in resolving the above is already in our hands. In any flow model, upstream components essentially have a "parent" relationship to a downstream "child" component. This parent-child relationship pervades flows (and especially trees) and as we can readily see, the above chain-of-events looks a lot like a tree (more so than a flow).
More importantly, each node of the tree is a checkpointed stop. We must build the intermediate table, process data into it and move on, but once we persist the data, we have a checkpointed operation. This is why it behaves so beautifully as a flow and a tree.
Now let's say over the course of SDLC (regular maintenance), that a developer needs to add some more operations and connect other existing operations to their results. This is essentially just introducing new source tables in the where/join clause, but the table as to exist. In short, if we add a new table to the logic of TABLE-X, it will now be dependent upon its original tables and the new ones. (Its query will break if they are not present at run time).
It is easy enough (honestly) to perform a quick dependency-check over all of our queries to make sure that their various source tables are accounted for. In other words, an operation actually exists that will produce the table. What if we picked the wrong table or even misspelled it? At run-time we would know, but we would rather know before execution because it's a design-time issue. This may verify that logically we have a plan to create the dependent table, but it does not deal with the simple fact that conditional circumstances may forego the physical instantiation of the table. Transforms ultimately do not operate on intent, but on the presence of physical assets.
As another nuance, this creates a disparity between the design-time flow of data, and the run-time flow of data. If the run-time is governed (e.g. ETL tools) so that the dependencies and conditions are all evaluated at the start of the application, the design-time and run-time are more easily mated for review by an auditor or analyst. But if any part of it is dynamically conditional, we can see how this could practically nullify the design-time form of the flow. They would simply say, "I know what the flow would do by design, but I want to see what it actually did at run time, because the data isn't matching up". Aha - so "intent" counts for design review, but "intent" is not what puts physical data into the tables. Operational processes do that.
As noted above with the necessity for conditionality and reduction of inter-transform latency, we now have a need to weave together at run-time what the flows will actually do. The "source tables" for a given transform are found in the where/join phrase and these had better be present when the SQL launches or it will be a short ride indeed.
And now, what you did not expect - one of the most powerful ways to use a Netezza machine is to forego the "serialization" of these flows and allow them to launch asynchronously. We can certainly throttle how many are "live" at at time, but if any or all of them can launch independently, how on earth are we supposed to manage the case where one or two of them really are dependent on another one or more? Do we put these in a separate flow? Do we really want our developers to have to remember that if they put an additional dependency in a transform that they have to regard whether that preceding transform has actually executed successfully?
So that's the real trick, isn't it? If I have forty transforms and all of them could run asynchronously except for about ten of them, that can only run after their predecessor completes, I have several options to see to it that these secondary operations do not fail (because their predecessor has not executed yet).
I can serialize them in by putting them into separate flows (or branches). One of them kicks off and runs to completion while the next one waits. This is logically consistent but also inefficient. If those secondary transforms are co-located with the original set, the optimizer can run them when there is bandwidth rather than waiting until the end. It is also logistically unwieldy because a developer has to remember to that if a transform should gain a dependency, it has to be moved to the second flow.
I can fully serialize them into a list, but this is the most inefficient since it "boxcars" the transforms and does not leverage the extra machine cycles we could have used to shrink the duration.
I can link them via their target table and source table, such that this relationship is dynamically identified and the flow path dynamically realized. If a given transform does not run (conditional failure) or simply fails to execute, the dependency breakage is dynamically known. What does this do? What if a given transform is supposed to use an incoming (intake) table if it is present (data was loaded) otherwise use a target-table's contents (e.g. trickle-feed, change-data-capture problem). This allows the transform to do its work with consistency but also have the ability to dynamically change its sources based on availability.
Now, we know ETL tools don't do this. Other tools may attempt to rise to this level of dynamic pathing, but the bottom line is that if those tools don't provide this kind of latency-reduction, high-throughput, dynamically adaptable model, they will not be able to leverage the full bandwidth of the machine. Trust me on this - the difference is between using 90 percent of the machine or only 10 percent at a time. That machine packs the virtual joules to make it happen, so let's make it happen.
When we originally developed our framework to wrap around some of these necessary functions, we had not considered these nuances of dynamic interdependence and frankly, ELT was so new that it didn't really matter. The overhead to execute "raw" SQL was zero, but we could not effectively parallelize/async the queries without losing control. Running async chains of transforms necessitated detailed control, but nobody had a decent algorithm for it, so once again Brightlight had to pioneer this capability. Our architecture allowed us to easily integrate these things into the substrate of the framework as a transparent function. This is the primary benefit of a framework, that the developers can continue to build their applications without disruption, but we can upgrade and enhance the framework to provide stronger and deeper functionality. Whether our framework is right for all applications is not the issue, but whether the complete implementation is right for Netezza. It's a powerful machine and we should not arbitrarily leave any cycles on the table.
Imagine slowly running out of steam because of latent implementation inefficiencies, then ultimately asking for a Netezza upgrade that, if the inefficiencies weren't present, the upgrade wouldn't be necessary. This has happened with more than one of our sites and rather than upgrading to all-new-hardware, we installed, converted and bought back an enormous amount of capacity. They eventually upgraded the hardware much later on, but for the right reasons.
After completing another migration from a traditional, general-purpose RDBMS to the Netezza technology, I visited a friend who had several artifacts in his home that had to be the strangest things I'd ever seen.
Now I'd heard of genetically altered cat fur, you know, buying a cat online while picking the fur color of your choice (blue, lavender, teal - etc). Seems like an odd thing to do to a cat. I like cats and dogs both, so don't imagine that this blog essay attempts to take sides. Some folks are downright serious on their choice of pet, so I'll smooth that fur wherever I can.
Back to the hairless cats. I asked him "Where did you find these? And what happened to your other cats?" And he laughed, "That's a funny story. These are my cats, but I had to shave them." To this, I rolled my eyes, wondering where this was about to lead. He told the tale:
"We took a vacation down south and put the cats in kennels in the back of the truck. The round-trip took a toll on their fur and matted up everything from head to toe. Weeks later, the cat fur had not smoothed out. The kids had been brushing it out but it wasn't working. And if you think they look ridiculous shaved, you have no idea how silly they looked with their hair matted."
"So you shaved them?" I asked.
"Sure" he said, "Seems practical right? Just get rid of the matted hair altogether. Teasing it apart would have taken, well, years of time. Their fur will grow back out soon enough"
"Aren't there, you know, shampoos and stuff for that? I mean, shaving seems a little extreme."
"Tried all that. Bad thing about it, mats are bad for cats - they cause infections and all kinds of nasty side effects. Best to just shave it all and be done with it."
Laughing on the inside, I thought a bit about how we have to decompose and de-engineer an organically-grown data warehouse. Some would suggest porting (forklifting) the whole thing over "as is". Like taking a matted-hair cat and moving them from one house to another. It changes the venue for the cat, but doesn't help the cat at all. It's still sick and getting sicker from the mats. Such folks "tell a tale" of the success of their migration derring-do. But they are like nomads. Hunting the game until there's no more, then pulling up stakes to find another place to burn out. Forklift-migrations have value only to the ones who are doing the migrating, not the recipients. No sooner will they tie a bow on it than someone will request a change, and we will discover what we already knew: The original data model (now the new data model) isn't very resilient to change no matter where it is hosted.
We realize we have ported both the good and the bad from the old system, when we had the opportunity to port the good and leave the bad behind. We essentially are agreeing that we are about to standardize on the past and then accommodate the future, rather than a better approach: standardize on the future and accommodate the past.
Many years ago, at one site we had to carefully tease-apart the data and the stored procedured to find out what they were actually doing. Unfortunately we had carved up the work for several teams rather than reviewing it together. Had we done this, we would have discovered that the stored procs executed in chains of work, and that many of the chains were copy-pasted from one original chain that was too "matted" to risk breaking. So they copied-and-modified this chain to perform the new functionality. Enough of these and we see how the stored procedure doesn't benefit us (at all) for back-end data processing. In fact, we strongly suggest people use stored procs in Netezza for BI-adaptation and optimization, for the presentation layer. But not for the back end. Stored procs are not operationally viable for a wide range of reasons. It's even funny how folks move from one technology to another and try to replicate the stored procedure logic as a knee-jerk exercise, without realizing how flawed it really is. Perhaps the Netezza stored proc will run a lot faster. Trust me, performance is the least of your worries.
So once we converged the teams together, these themes started popping out like rabbits. By the end of the first day we are all laughing at the sheer level of redundancy in the back end. But not particularly surprised at the outcome. We'd seen it in lots of places before, but not so bad.
Of course, it never once occurred to us that we would port these hundreds of stored procs over to the new system. Rather we would functionally specify what they are doing now, and leverage tools to accommodate the vast majority of the functionality, only building what was left over. I mean, this is a standard functional port, why complicate things? Forklifting into a Netezza machine will certainly yield 10x performance, so why the beef? Without optimizing the data structures and processes to leverage Netezza's power, we might get 10x but leavel 100x on the table. Is this a good tradeoff?
Well, true to form, someone had the capacity to complicate things. He whipped out a spreadsheet and calculated the cost of the hundreds-of-stored-procs in the original system, not realizing we were planning to reduced these to maybe fifteen operations at most. Spreadsheet calculator in-hand, he estimated that it would take 24 people, 8 months, to handle on these stored procs. I sat back in my seat, stunned, because he was costing a project we weren't about to undertake. Rather, building out 15 or so operations would require a handful of people and 90 days at the outside. But also true to form, the project principals saw visions of sugar plums (another word for sales-comp) that got in the way of their better judgment. They actually went to the client with these inflated numbers, he rejected their proposal outright and gave the business to someone else. It's easy to lose a deal when the client sees the inflation on-the-page.
But what our "spreadsheet guy" missed, was that we weren't about to embark on a journey of finding a home for each stored proc (we already knew this had no value, and the client knew it too). He believed that we intended to bring the matted-cats into the house and put them on pillows, when we intended to pick the cats we wanted to keep, and shave them.
Okay, that's a strange analogy, but we had no intention whatsover of accepting all that convoluted spaghetti as the foundation for the go-forward system.
Netezza gives us the capacity - to simplify. We keep the parts we consider valuable (the cat) and get rid of all the mess that keeps the cat sick and unhappy. Taking only the functions we want, we then reconstruct (let the hair grow back out) only what we want to keep, and take the opportunity to apply some solid architectural principles and likewise capitalize on the strengths of the Netezza platform.
In the end, if we really have a platform that is standardized on the future, but accommodates the past, we also have something else that is even more powerful: A simpler, stronger engine that is ready to grow in functionality, adapting to our changing needs. The old system was never built with this kind of vision or priority, because the power wasn't there to affect it anyhow.
After review of a "high performance" ELT platform (that's ELT, database-transform-in-the-box) - I started asking hard questions about things they had not considered. It's a high-performance platform, isn't it?
Well, yes and no. It supports a "continuous" model, but the performance is all in the query and the data, right? Well, we'd like to think that for purposes of long-cycle queries anyhow. Here's the expectation:
In a general-purpose RDBMS, transformation-in-the-box is expensive. Each query can take minutes or even hours to complete. At this point, nobody really cares about the overhead to launch and shepherd the query, or to report its status when completed. All of these infrastructure issues are eclipsed by the duration of the query itself. If only one percent of the operation's duration is in the overhead, who cares if we spend time optimizing or minimizing it?
So in one scenario, the product would launch its queries end-to-end using a scheduler. Each of the queries would be packaged into its own little run-time, then the scheduler would kick off each one and wait for its closure, only then kicking off the next, etc. Some would call this reasonable, others sophisticated. After all, if the duration of each query is protracted, why do we care about inter-transform latency?
Contrast this to a Netezza-centric series of transforms. A general-purpose database, recall, requires us to dogpile lots of logic into each transform, protracting the duration as a matter of necessity. In a Netezza-centric scenario, we will see those ten-or-so general-purpose queries chopped apart into more efficient, tactical form, with each query building upon the last towards a final outcome (in a fraction of the time of the general-purpose equivalent).
Apart from the mechanics of how Netezza makes this happen, look at how the mechanics of the operation changes dramatically. I'll use a known working example of 42 Netezza transforms. When first we had ten-or-so-general-purpose queries running for an hour or more, we now have over forty MPP-queries, each of which runs in less than a second of duration (with some exceptions). So all 42 queries, if we could kick them off one-after-another, will execute in around 45 seconds. If in this case, the users (or process) kicking off the sequence wants less than one-minute turnaround, now we have to deal with squeezing out inter-transform latency.
In plain-vanilla terms, the mechanism using the scheduler noted above, put six-seconds of latency between each transform. What does this mean? With each transform running in one second, and six-seconds of overhead, what could run in less than a minute now runs in six minutes - we have dramaticallly breached our one-minute SLA!
Now David, get serious - who on earth wants to shave seconds off the inter-transform latency? Six seconds of delay between each transform seems perfectly reasonable! Sure, if the transform itself will take twenty or thirty minutes. But if it will take less than a second, our overhead for it is now the glaring culprit with a smoking gun, red-hands and all that. And for those Netezza users who want to eliminate this latency, don't pooh-pooh their needs. The fat is our problem to solve.
And what does this say for ETL tools that perform push-down to the machine? They will also have transitional latency as they hand-off control across components. Geared for the big-fat, long-duration queries of the general-purpose world, nobody has ever cared about the smidgeon of latency between them. Only now, the smidgeon is not so much, and looks a lot like a boulder next to a basketball.
Consider the following breakdown of a standard transform's parts. They look a lot like a CPU's fetch-decode-execute cycle (yeah, that's a little geeky, I know)
Startup Overhead Execution Shutdown
So imagine that we have a tool with controls (like a scheduler) with several seconds of latency in startup, formulating the query (recall, we want query-generation, not hand-coded queries) leading up to Execution, and then some minor overhead to accept the status and transition to the next operation. We'll call it at four (4) seconds of latency.
If we scale the above timeline with several inline / serialized transforms- we would see an effect like this:
|---x--x--| |----x--x--| |---x--x--| ----
See the "DDD" (for dead-time). We lose that time in the additional latency for transform management. This is akin to the startup/shutdown cost of a launcher, scheduler, or ETL tool shepherding a "component" to activate the underpinning SQL statement
Either way, take a look at the total time "between the x's " that is the actual execution time. If this time is several hours, the dead-time is a nit. If the execution time is proportional to the line-drawing above, we have far too much overhead.
So for 42 of these operations, we would have 42x4 seconds of latency plus the 1 second of execution, for a grand total of 210 seconds, or 3.5 minutes. When we seriously consider squeezing the fat from this operation, our primary problem is in the non-optional overhead.
Now take a look at the scenario below. I have kicked off five transforms asynchronously, with their execution times between-the-x's -
See how the first one hands off to the second one, like a baton in a relay race? Notice how each of the transforms has already incurred its overhead in parallel to the first transform, and are now merely waiting (see the "W") for their predecessor to trigger their execution.
What is the start-to-finish time of these five transforms if executed like "boxcars" in serialized mode, accepting the penalty for intertransform latency? From start to finish is around 25 seconds. But with the above example, the inter-transform latency has been practically removed except for the simple handshake as they hand-off. The first one launches and we incur the initial four-seconds of latency, a necessary penalty. Then each subsequent transform requires one second, for a grand total of 10 seconds of runtime. We have effectively moved from a model with 25 seconds of runtime to 10 seconds of runtime.
While this is around 36% of the original run time, it does not seem as dramatic as when we compare it to the original model of 42 transforms. That is, four seconds of overhead plus 42 seconds is 46 seconds of runtime. Add to this 1/10th second for the handoff delay (another 4 seconds), for 50 seconds of grand total run time.
Its original time was 42*5 seconds, or 210 seconds. This new time of 50 seconds is 24% of our original run time. That's a 300% improvement in run-time and of course, is well within the boundaries of the one-minute SLA.
Offsetting the transform run-times like this, so that the overhead is essentially invisible, is a common hardware-stablization approach for micro-electronics. Here's an example:
Many years ago I worked for a company that was repackaging a design into much smaller form. Essentually we were reducing a rack of hardware into a 1-foot cube. All of the large wire-wrapped boards had been designed-down to much smaller form. This is when instabilities were first detected.
In one particular case, the engineer described it to me as an engineering-101 mistake on the part of the original designer. Every hardware circuit is driven by signals that pass through conditioners and gates (transistors) so that the final outcome is a signal of some kind on a particular part of the board's interface(s). The mistake was in that the timing signal, a pulse sent to the hardware 60 times a second, was being applied at the first input of logic. So a general signal would "sit" on a given input location, the timing signal would "fire", opening the gate, and the signal would then traverse through the many other components and paths to reach the output path of the hardware. But here was the problem: the signal took too long to make it from point-A to point-B before all of the other signals had already left it behind. The solution to this: Put the timing-trigger signal on the output side of the board. This way, when the original signal first arrived, it would make its way across all the necessary components and paths and then present its signal to a final gate, which was triggered by the timer. So when the trigger hit, the signal was already present and passed through instantly without a problem.
This sort of "triggered-steering-logic" is the theme of the noted inter-transform handoff scenario above. The transforms navigate their overhead to a stopping point and wait for the signal. In this case, they are waiting for a "done" signal from the transform preceding them. When this signal hits, the next transform is queued and ready to immediately execute its query without further delay. The subsequent transforms fall like dominos.
But that's really only part of the story. These 42 transforms don't just run in a serialized stream. Some of them, after all, have no dependencies whatsoever. Others have dependencies in a discrete chain. Why serialize them when their dependencies are relatively few? Here's an example:
DEF JKL MNO VWX
Transforms A,B and C are dependent upon each other, so will serialize. In the meantime, the D,E,F and G,H,I transforms are independent of A,B,C, so can run side-by-side. J,K,L transforms are dependent on the outcomes of C,F and I, so will wait on them to complete, then finish the K,L transforms as serialized. But then another set-of-three transforms takes off. In ETL tools, we recognize this as branching or "component-parallelism". However, in an ETL tool the branches must be wired together and follow each other by design of the graphic on the canvas. ELT however, is much more dynamic than that.
Look at the effect: We're saving 6 seconds of time by not executing the A-I transforms serialized. Likewise the P-U transforms will now take 3 seconds, not 9 seconds, saving another 6 seconds. If we can find the otherwise independent transforms and move them to the front of the chain(s), the total time for the run is the longest chain of dependent transforms. In this case, we shrank 42 transforms into the same time frame as 20 serialized transforms. This shrank the total time from 50 seconds down to less than 28 seconds.
But David, you mean we have to carefully weave these transforms together so that we can squeeze the fat from the timeline? Actually no. We have a sniffer that examines the "filter clause" for the dependent tables, you know, what the given transform will actually join against. This allows the transforms to self-discover their own optimum path without having to deal with painful weaving. If we happen to add another transform, or change the filter phrase in a transform to include/exclude another table in the join, it will automatically realign the priorities based on the intrinsic dependencies of the transforms. And your ETL tool won't do this dynamically.
Then we have the intake protocol, that of transferring data from one machine to another, or loading from files. We have a couple of options, that of loading the data completely before taking actions in the transforms, or we can load the data asynchronously to the transforms. Just as the transforms will "wait" on a prior table in the chain-of-transforms, we can also make them "wait" for the arrival of a particular intake table. Rather than waiting for all of the intake tables to arrive, we can initiate a transform chain when its particular data set has arrived (or for that matter, not at all, in case its data never arrives).
An in our second example, the total time allocated for loading the data, executing it and ensconcing it to the proper target tables was less than three minutes. Since serialized, all-or-nothing loading easily absorbed over half of this time, we found it important to squeeze the fat from this process. By causing the entire chain to run asynchronously, when the given intake table is ready, its transform-chain would automatically launch. As fortune would have it, the longest processing chain also had small tables to load. So we could kick off those transforms very early. Some of the shortest chains also had the largest load files, so by the time saved by starting the loads as-early-as-possible.
In the end, what started out as a five-minute-plus operation shrank to 160 total seconds of time, well-inside the 3-minute SLA with some room for recovery. Here's how it looked:
Load Time = L
Transform Time = T
Original - all transforms launch when the all loads are complete
LLLL TTT TTTTT
Async form, transforms begin when their source tables are ready:
One may ask, why wouldn't we do it in the second form anyhow? It seems that this is the optimum way to process data. Well, one answer is simply : checkpointing. If we launch the transforms prior to all of the loads finishing, it is much harder to recover in case of load-failure. We would have to cancel the in-flight transforms and otherwise bring the processing to a halt. If we load it all first, then proceed, any errors can stop the processing immediately. No wasted cycles. Efficiency is important with these kinds of transform-chains in a database like Netezza. Still, if we need to shrink the total job time, the recovery-shutdown protocol (in case of load failure) is a necessary capability. Besides, our protocol does not itself finalize anything to the target database until the last transform is complete, lest the data start to arrive intermittently and out-of-context. So as long as the load time is balanced with overall transform time, shutting down before finalization is usually doable.
This of course invites the obvious question - are any of the ETL tools (or ELT offerings) attempting to squeeze out the fat in a similar manner? Methinks not, and for one reason: They, like the general-purpose databases, likewise consider themselves to be general-purpose tools. They are not philosophically committed to squeezing out latency because Netezza is the only platform that can benefit from it. The uptake in setting up and coordinating this sort of baton-like handoff, while not particularly difficult, is also non-trivial and represents a significant effort for a product tool to embrace. When compared to interfacing with the general-purpose-platforms - Netezza is not a large-enough user base to justify ramping-up this high-performance, zero-latency capability. In short, the product's features are market-driven.
The Netezza user base is growing, however, and with IBM pushing the hardware, it will grow more and faster.
Sitting at the top-end of this food chain are the big-ticket TwinFin 24/48/96 machines that do massive amounts of processing-in-the-box and want to move toward continuous models, processing data as-the-world turns. The age of the nightly batch cycle is waning and Netezza is stepping up to the vast opportunities within the "continuous" world. Latency in this world is like poison. Just because it appears to be acceptable to the general-purpose world, this is an illusion. If the general-purpose queries ever dropped into subsecond-duration, the tools facing them would need to re-tool. Actually they need to re-tool now - they just don't feel the pressure yet.
A while back we bought a new house and decided that we would keep our first house for a few additional months to muse about whether we would keep it for a rental home. Well, shortly afterward was Sept 11th, 2001 and we were boxed into keeping it for quite a while longer. In all this, opportunists came out of the woodwork to take our house for a "song".
We tired of the songbirds quickly, but some of them were just bizarre. One couple looked at the house and said they would buy it if we would put $20k of upgrades into it. New this, new that. Our answer was no, we're selling "as is". But they persistent and felt that their requests were reasonable. No, I said, you can spend your money fixing it up like you want it, because I'm not spending my money to fix it up like you want it. They were confused. We weren't "playing the game" you see. Oddly, they had a realtor working with them who kept chanting, "Ask for what you want. It never hurts to ask."
Well, when you're asking the wrong questions, and it's obvious, it definitely hurts to ask. When on site with a large financial services firm, we were being examined for the replacement of their current consulting firm. Their lament: "They ask all the wrong questions." Apparently we were asking the right questions, because they gave the business to us.
Recently I walked through a demo of some things we've done for other clients. Two minutes into the demo, they started asking questions. This or that? Those or these? But nothing whatsoever to do with the core problems the solution was geared for, or the core questions that everyone asks.
It's like this: I have a list of questions on left side of the white board and another list on the right side. The questions on the right side deal with things like very-large-scale backup and high-speed recovery, disaster recovery, hot-failover between two Netezza machines, real-time replication, continuous processing, and of course, heavy-lifting processing inside-the-box.
These are things that accelerate business logic, the rapid turnaround of business rules and features, catalog-aware design-time and run-time modules, developer productivity, testing accuracy and turnaround, rapid-testing and rollout, release management and patched releases - operational integrity, a firm harness around the functional data, and radical simplification of very complex problems - in short, a high-wall of protection around a high-speed delivery model. Something that shrinks the time-to-delivery as well as shrinks operatiional processing time and protects capacity, while protecting the data itself.
I knew better than to hold my breath for a question on migration -
Whew! Whenever I give this short list to a CIO, they are usually asking in terms of having rolled out a large-scale environment with Netezza and are experiencing pain in one or more of those areas. Of course, we focus on those areas. Painkiller is not enough. You have to remove the source of pain..
How do I know that these are the questions Enzees are asking? Because they are - to me personally, to the Netezza community forums, and are the hot-topics at every Enzee Universe. Good grief, that's the short list of the lightning-round question-and-answer sessions at the Best Practice forums.
But they did not hear any of this. They were asking all of the wrong questions. The reason, methinks, is that they weren't in any pain. They felt the freedom to ask about the inconsequential because they had never experienced the consequences, or have even foreseen the consequences. In other words, I am solving problems they don't have.
Or do they? Anyone who buys a large-capacity storage devicesis by definition biting off a data management challenge. One of our clients has over 200 TB (uncompressed) on their TwinFin. They currently do their backups to another TwinFin of the same size. They would like to perform their backups offline, but no commodity backup/recovery system on the planet can handle this kind of load with any aplomb. Some claim to, of course, but have not considered the true needs of the Netezza user. It's all Texas-Sized Data Warehousin'
For example, one of the off-the-shelf products claims that once the data is offloaded, they have "hooks" allowing the user to query the offline archives. That's right, the user query will be dispatched to their proprietary engine and it will fetch the answer back for you. But wait a second. That protocol is for a transactional system. Going out onto the file system to fetch a transaction might have some value, since the only alternative is to bulk-load the entire dataset back into the device to query it, when all we wanted was one transaction.
However, for a Netezza "scanning analytic" query, potentially spanning tens of terabytes in scale, such a mechanism is no more than a child's toy. What we need is the ability to rapidly reload the data back into the machine so we can query it there, inside Netezza's MPP. More importantly, once it's back online we can join it to other tables, that's right, down on the MPP where the CPUs burn brightest.
But this is simply an example of how one vendor has solved a problem that no Netezza user is asking. Netezza users need to scan and analyze the data in-the-box. Those tools provide data access outside-the-box, without considering that having the data outside-the-box is the problem we were trying to solve in the first place!
Another issue is how the ETL tools interact with Netezza. All of them do, some better than others. Some are certainly getting better than the others. The game is on, they know that data processing is migrating back into the machine. Netezza is leading the way, and they want a piece of the action. Can you blame them?
But hold on. Is bulk-data processing in an ETL tool the same as bulk-data processing inside a Netezza box? No, not really. "Going parallel" in an ETL tool means spreading work out across CPUs for a select set of operations. Not only this, we will compete with other processes for those same CPUs. The net is, we cannot put all the CPUs on the machine to work for us. Clearly some of the ETL tools are otherwise proud of their scalability. And they should be. Add a few more CPUs to that rack and watch the ETL tool scale with it. Nothing bad about that. Capture those business rules in a point-click and off-we-go. They have spent millions of dollars tuning their performance models, pouring the brain power of their brightest people into making their product go-parallel and push that data hard. Imagine them handing off this hard-won, multi million-dollar performance capability to an upstart appliance? Hmm, no, they'll be the last who are draggged kicking and screaming into the inexorable future.
Anyone who is kicking the tires of a large-scale storage and processing device like Netezza is also in for a subtle surprise: Once you are migrated, you will unleash the creativity of you staff to add functionality that was never possible before. This will generate business, which will generate more data. The extra capacity will naturally offer confidence that all-new-business-great-and-small are not only do-able, but with strength that isn't available with other platforms. No worries, sez aye, the Netezza machine will scale with you. And you have chosen well, grasshopper.
Then one day they look around and say - have we backed-up this data recently? How ahout disaster recovery? What about archiving old data to make room for new? What about the ability to make the offline data available for the ad-hoc folks? That is, available in time for them to actually use it? These simple questions raise fear in the hearts of the mightiest of IT champions when they know it should have been asked, and applied a long time before ---- Before the locomotive was moving at 90 miles an hour dragging 500 boxcars. Before the locomotive even left the station. The sheer logistics of performing a backup of data this size, and this transient, is mind-numbing. I've noted that one client, hosting over 150 TB (uncompressed) naively plugged their commodity backup tool into the Netezza machine. After over-a-week of whirr-click backup activity with no end in site, someone finally said "Kill it. If it takes a week to back it up, it will take even longer to restore it". This is a wise observation, but also tells us something:
The commodity tools expect us to accept that the restore-process will be slower than the backup-process. In the Netezza world, the backup and restore can both happen faster, and take up less storage than the commodity backup tool. If the commodity backup has to offload in uncompressed form, we need to provide generous workspace for it. If it's a Netezza-compressed backup, we only need to provide for the amount relevant to our compression ratio. Some sites get a 16x, others get almost twice that. The mileage varies because of the nature and compressibility of the data. Either way, offload is fast because it comes right off the disk without passing through the host. Likewise the reload, directly to the disk without the host. In a traditional RDBMS, offload/reload has to pass through the host to get on and off the device. For bulk analytic data, the missing middle-man is just the ticket for rapid-reload.
But they weren't asking this question either. The problems we had already solved and were bringing to the table as Netezza-centric solutions, the bread-and-butter, core-mission capabilities that people ask for all the time, wasn't even on their radar. The disconnect is simple: They have read white-papers on what people are doing after they get the back-end squared-away. The nice-to-haves. The critical parts, you know, the failure points that could mean the demise of the business, or perhaps their own paychecks? Not even on the table.
This is akin to someone about to engage in the construction of a large cargo ship. To be sure, some folks are concerned about the utility of the ship's interior. But when putting pen to paper, which is more important, that the break rooms have contemporary wallpaper, or that the ship can master tempestuous seas and high swells? Methinks the crew of said ship couldn't care less about the amenities if they got wind that the architects gave short shrift to things like hull stress and multiple watertight compartments. You know, things to keep the ship from being claimed by the sea when stress is high. Boy, those light fixtures sure look cool, don't they?
Back to basics. Netezza is an appliance. It can perform as a load-and-query device just like all the other load-and-query devices. A primary differentiator, one that more customers are experiencing all the time, is that Netezza is a powerful data processing platform. When leveraged this way, it also becomes a problem-solving platform. We simply wrapped some additional logic around these core capabilities in order to harness the logistics of multiple sequential (or asynchronous) queries being fired off to the device, managing its workload, intermediate tables and whatnot. The appliance makes such an endeavor simple, and further simplifies the user's interaction with the machine for purposes of pattern-based utilization. Change-data-capture, referential integrity checking etc are all far more effective inside the machine than outside the machine.
For the shops that would rather master these aspects in the ETL tool, hey, no harm done. Lots of people do it that way. But they all eventually get to the same point in the game: the ETL tool runs out of gas. Or to give it more gas will require a significantly larger power-plant. They then realize that all those CPUs in the Netezza machine are just sitting there, doing nothing most of the time. Enough CPUs respond to peak, which is about five percent of the time. The remaining 95 percent of time, the box is running less than half capacity with a big chunk of that completely idle. Wouldn't it be great to get a handle on all that power? Recover it as part of an ongoing capacity model?
The most important part of this approach is to decide you want to do it. Lots of options naturally come your way when you try to get creative in the direction of power- because power drives the creatviity further. Ideas are given wings that can fly higher and farther than any other traditional RDBMS could even dream about. Managers start having ideas too. By golly if that machine can do this, then why not that? And why not, indeed?
In fact, most of the time when dealing with a standard RDBMS, the managers will ask why-not? in the sincerest of spirits. Then forth from the mouths of IT come the exact, precise reasons why-not. They are good reasons, logical reasons, and effectively put a wide moat around the management's idea-factories. Soon their ideas fade. They forget how good the ideas were. They will never see the light of day. The underpowered nature of the machines constrain them.
Contrasted with the Netezza machine, the question of why-not is more rhetorical in nature. The person asking the question is not expecting an answer either, but not for the same reasons as the manager above. He's not expecting an answer because the IT folks are already on it. His answer to the question will not be verbal, but will be positively expressed in real functional terms. Likely in a very short period of time.
Why-not? becomes more of a water-cooler phrase. Almost like, "I'm trying do decide whether we should eat at the club or eat at the diner. How about today we eat at the club?" To which the respondent says - "why not?" This is a very different and rarefied existence than the one borne on the constraints of an underpowered machine.
Once you gather a head of steam on all this, you realize that not only is Netezza an enabler for large-scale, complex problem solving, it provides the impetus for us to construct a problem-solving platform upon it. The ability to capture larger patterns of set-based processing, express them in simpler terms, and have them available to all - as capabilities that leverage the capabilites of the Netezza machine.
In your own domain, if you have a Netezza machine in-house and you're using it for data processing, even if it's the most inefficient model on the planet, it still beats the socks off the plain-vanilla ETL tool's work for the same operations. If you have coupled an ETL tool with this approach and are getting the gains out of it, even though this process may have initially been painful, you have answered the question with the right answers. The tools may not be quite up to speed, but that's okay. Your compass has not betrayed you. Stay the course and the benefits will be magnanimous indeed.
And for those who continue to use the machine as a load-and query device, and have not forayed into the radically rewarding realm of ELT and data processing inside the machine, there is only one question to ask, and it's the right question:
So I published this book last Spring ('11) on how the Netezza machine is a change-agent. It initiates transformation upon people or products that happen to intersect with it. Most of the time this transformation makes the subject better. Sort of like how heavy-lifting of weights will make the body stronger. Or the pressure can crush the subject. Stress works that way. We could imagine the Netezza machine as the change-agent entering the environment. Everything brushing against it or interacting with it will have to step-up, beef-up or adapt. I sometimes hear the new players say things like "But if the Netezza machine could only.." That's like a Buck Private saying of his drill sargeant, "If he could only ..." No, the subject must consider that the Netezza machine is never the object of transformation but rather is the initiator of it. But it's not a harsh existence by any means. Products that can adapt are far-and-away better than before. Those that cannot adapt now, will eventually, or remain in their current tier.
Having been directly or indirectly alongside these sorts of product integrations and proof-of-concepts (POCs) numerous times, it's always an interesting ride. The vendor shows up ready-to-go with visions-of-sugarplums in their head. And the suits who show up with them, are salivating for the ink on the license agreement. In less than an hour into the POC, all of them have a very different opinion of their product than when they arrived. Their bravado is reduced to a shy, sort of sheepish spin. Throw them a bone, not everyone walks out of this ring intact. Some of them shake their fist at the Netezza machine. It is unimpressed. Others shake their fist at their own product. Alas, it is but virtual, inanimate matter. What is transforming now? The person in the seat.
So I have watched them scramble to make the product hit-the-mark. Patches? We don't need no stinkin' patches. Except for today, when they will be on the phone in high-intensity conversations with their "engine-room" begging for special releases while on-site. Alas such malaise could have been avoided if only they had connected their product - at least once - to a Netezza machine. In so many cases, they will claim that they have Netezza machines in-the-shop so they are prepared-and-all-that. It is revealed, sometimes within the first hour, that the product has never been connected to a Netezza machine. It doesn't even do the basics, or address the machine correctly. It is especially humorous to hear them speak in terms of scalability as though a terabyte is a high-water mark for them. One may well ask, why are we wasting our time with underpowered technology? Well, in point of fact, when placed next to the Netezza machine it's all underpowered, so really it's just a matter of degree.
Case in point, Enzees know that in order to copy data from one database to another, we have to connect to the target database (we can only write to the database we are connected to). And then use a fully-qualifed database/tablename to grab data from elsewhere - in the "select" phrase. Forsooth, their product wants to do it like "all the others do" and connect to the source, pushing data to the target. Staring numb at the white board in realization of this fundamental flaw, they mutter "If only Netezza could....". But that's not the point. They arrived on site, product CD in hand, without ever having performed even one test on real Netezza machine, or this issue (and others) would have hit them on the first operation. They would have pulled up a chair in their labs, started the process of integration and perhaps call the potential customer "Can we push the POC off until next week? We have some issues (insert fabricated storyline here) and need to do this later."
Cue swarming engineers. Transformation ensues.
Another case in point, many enterprise products are built to standards that are optimized for the target runtime server. That is, they fully intend to bring the data out of the machine, process it and send it back. One of my colleagues made a joke about Jim Carrey's "The Grinch" and the mayor's lament for a "Grinch-less" Christmas. Well, didn't the Grinch tell Cindy-Lou Who that in order to fix a light on the tree, he would take the tree, fix it and bring it back? Seems like a lot of hassle for one light? Why can't you fix it here and not take it anywhere? Enzees see the analogy unfolding. No, we don't want to take the data out, process it and put it back. We want "Grinch-less" processing, too. Fix the data where it already is.
Why do this? Well, in 6.0 version of the NPS Host, the compression engine could easily give us up to 32x compression on the disk. Or even a nominal 16x compression, meaning that our 80 terabytes is now 5TB of storage. And while we may have to de-compress it on the inside of the machine to process it, the machine is well-suited to moving these quantities around internally. Woe unto the light-of- heart who would pull the data out into-the-open, blooming it to its full un-compressed glory, on the network, CPU, the network again - just to process it and put it back.
Unprepared for the largesse of such data stores, our vendor contender's product centers on common scalar data types. Integer, character, varchar, date. No big deal. Connect to the Netezza machine and find out that the "common" database size is in the many billions and tens of billions of rows. A chocolate-and-vanilla software product without regard to a BigInt (8 byte) data type, cannot exceed the ceiling of 2 billion (that's the biggest a simple integer can hold). This does not bode well for integrating to a database with a minmum of ten billion records and that's just the smallest table. Having integers peppered throughout the software architecture by default - would require a sweeping overhaul to remediate. As the day wears on, we see them struggle with singleton inserts (a big No-No in Netezza circles) and lack of process control over the Netezza return states and status. These are not exotic or odd, but no two databases behave the same way. The moment that Netezza returned the row-count that it had successfully copied four billion rows, we watched the product crash because it could not store the row-count anywhere - the product had standardized on integers, not big integers, so the internal variable overflowed and tossed everything overboard. Quite unfortunately, this was a data-transfer product and performed destructive operations on the data (copy over there, delete the original source over here). So any hiccup meant that we could lose data, and lots of it.
Cue announceer: "And the not-ready-for-prime-time-players..."
Oh, and that "lose data and lots of it" needs to be underscored. In a database holding tens of billions of rows (hundreds of terabytes) of structured data, that is, each record in inventory, with fiducial, legal, contractual, perhaps even regulatory wrappers around it, and we're way, way past the coffin zone. Some of you recall the "coffin zone" is the point-of-no-return for an extreme rock-face climber. Cross that boundary and you can't climb down. But we're not climbing a rock face are we? The principle is the same. Lose that data and we'll get a visit from the grim reaper. He doesn't hold a sickle, just a pink slip in one hand and a cardboard box in the other (just big enough for empty a desk-full of personal belongings).
One test after another either fails or reveals another product flaw. When the smoke clears, the "rock solid offering" complete with sales-slicks and slick-salesmen, is beaten and battered and ready for the showers. The product engineers must now overhaul their technology (transform it) and fortify it for Netezza, or remain in their tier. The Netezza machine has spoken, reset itself into a resting-stance, presses a fist into a palm, graciously bows, and with a terse, gutteral challenge of a sensei master, says: "Your Kung Fu is not strong!"
Now it's transformation-fu.
Superficially, this can look like a common product-integration firefight. But this kind of scramble tells a larger tale: They weren't really ready for the POC. This would be similar to an "expert" big-city fireman, supremely trained and battle-hardened in the art of firefighting and all its risks, joining Red Adair's oil-well -fire-fighting team ( a niche to be sure) and finding that none of the equipment or procedures he is familiar with apply any longer. He will have to unlearn what he knows in order to be effective on a radically larger scale. He might have been a superhero back home, faster than a speeding bullet, able to leap tall (burning) buildings in a single bound, but when he shows up at Red Adair's place, they will tell him to exchange his clothing for a fireproof form and get rid of the cape. Nobody's a hero around an oil-field fire. Heroes leave the site horizontally, feet-first. No exceptions.
Enzees have experienced a similar transformation (with a different kind of fire). The most-oft-asked questions at conferences are just that flavor: How do we bring newbees into the fold? How do we get them from thinking in row-based/transactional solutions into set-based solutions? How do we help them understand how to use sweeping-query-scans to process billions of rows? Or use one-rule-multiple-row approaches versus cursor-based multiple-rule-per-row? How do we get testers into a model of testing with key-based summaries instead of eyeballs-on-rows (when rows are in billions)?
We were dealing with a backup problem at one site because of a lack of external disk space. Commodity tools often use external disk space for this purpose, until they are connected to a Netezza machine and their admin tool complains that they need to add "another hundred terabytes" of workspace. We gulp, realizing that the workspace is only today a grand total of ten terabytes in size. And you need another hundred! Yeesh, you big-data-people!
Most of the universe outside the Enzee universe will never have to address problems on this scale. It is not the machine itself that is the niche. It is the problem/solution domain. Most of the commodity products that are stepping up are doing so only because it's clear that Netezza is here to stay and they need to step into Netezza's domain. I suppose at some point they expected Netezza to give them a call to start the integration process, but the Netezza Enzee Universe already had all that under control. It's amazing how lots-of-power can simplify hard tasks to the end of ignoring commodity products entirely.
Another case in point, a product vendor "popped over" with a couple of his newbee product guys and spent two weeks trying to get their product to play in-scale with Netezza. Before throwing in the towel, they offered up the common litany of observations. "No indexes? What the?" and "Netezza needs to change X", or the favorite "Nobody stores this much data in one place." The short version is, you brought a knife to a gun fight, as Sean Connery would assert, or perhaps, you brought a pick-axe and a rope to scale Mt. Everest. What were you thinking? You see, most people who have never heard of Netezza (I know, there really are folks out there who don't know about it, strange as is seems) do not understand the scale of data inside its enclosure. Billions of records? Tens of billions of records? A half-trillion records? Is that all you got?
We will watch a switch flip over in their brains as they assess what they are trying to bite off. A small group will embrace the problem and work toward harnessing the Netezza machine in every way possible. Another group will provide a bolt-on adapter for Netezza to interface to their core product engine. While another, larger group will assess the expense of such things, the marketplace they currently address, and conclude that they will for now remain in their current tier. This is like a 180-lb fighter climbing into the ring with a heavyweight, and walking away realiizing that they need to add some muscle, some speed, and some toughness or just stay in their own weight class and be successful there.
Another case-in-point is the need for high-intensity data processing in-the-box in a continuous form, coupled with the need for the reporting environment to share the data once-processed, likewise coupled with the need for backup/restore/archive and perhaps a hot-swap failover protocol. We can do these things with smaller machines and their supporting vendor software products. But what about Netezza, with such daunting data sizes, adding the complexity of data processing?
At one site we had a TwinFin 48 (384 processors) and two TwinFin 24's (192 processors) with the '48 doing the heavy-lifting for both production roles. When it came time to get more hardware, the architects decided to get another '48 and split the roles, so that one of the machines would do hard-data-processing and simply replicate-final-results to the second '48, limiting its processing impact for any given movement. This was not the only part of their plan. They then set up replicators to make "hot" versions of each of these databases on the other server. This allowed them to store all of the data on both, providing a hot DR live/live configuration, but it would only cost them storage, not CPU power. Configured correctly, neither of the live databases would know the difference. Our replicators (nzDIF modules) seamlessly operated this using the Netezza compressed format to achieve an effective 6TB/hour inter-machine transfer rate, plenty of power for incremental/trickle feeds across the machines.
Some say "I want an enterprise product that I can use for all of my databases". Well, this is the problem isn't it? Netezza is not like "all of our other databases". Products that have a smashing time with the lower-volume environments start to think that a "big" version of one of those environments somehow qualifies their product to step-up. I am fond of noting that Ab Initio, at one site loading a commodity SMP RDBMS, was achieving fifteen million rows in two hours. Ab Initio can load data a lot faster than that (and is on record as the only technology that can feed Netezza's load capacity). So what was the problem? The choice of database hardware? Software? Disk space? Actually it was the mistaken belief that any of those can scale to the same heights as Netezza. I could not imagine, for example, that if fifteen million rows would take two hours, what about a billion rows (1300 hours? ). Netezza's cruising-speed is over a million rows a second from one stream, and can load multiple streams-at-a-time.
Many very popular enteprise products have not bothered to integrate with a Netezza machine, and many of those who have, provide some form of bolt-on adapter for it. It usually works, but because the problem domain is a niche, it's not on their "product radar". It's not "integrated as-one". What does this mean? Netezza's ecosystem, and now assimilated by IBM, through IBM's product genius and sheer integration muscle, will ultimately have a powerful stack for enterprise computing such that none of the other players will be able to catch up. If those vendors have not integrated by now, the goal-line to achieve it is even now racing ahead of them toward the horizon. Perhaps they won't catch up. Perhaps they won't keep up. Some products (e.g. nzDIF) are at the front-edge, but nzDIF is not a shrink-wrapped or download-and-go kind of toolkit. We use it to accelerate our clients and differentiate our approaches. It's a development platform, an operational environment and expert system (our best and brightest capture Netezza best practices directly into the core). This has certainly been a year where we've gotten the most requests for it. But there's only one way to get a copy.
Cue Red Adair.
"No capes!" - Edna Mode, clothing-designer-for-the-gods, Disney/Pixar's The Incredibles
When introducing the Netezza platform to a new environment, or even trying to leverage existing technologies to support it, very often the infrastructure admins will have a lot of questions, especially concerning backups and disaster recovery. Not the least of which are "how much" "how often" and such like. More often than not, every one our responses will be met with a common pattern, a sentence starting with the same two words:
Case in point, when we had a casual conversation with some overseers of backup technology as a precursor to "the big meeting" we almost - quite accidentally - shut down the conversation entirely. Just the mention of "billions" of rows, or speaking of the database in "Carl Sagan" scaled terms, caused them to want to scramble for budget and market surveys of technologies that were more scalable than their paltry nightly tape-backup routines. In this particular conversation, we were talking about nightly backups that were larger than the monthly backups of all of their other systems combined. Clearly we were about to pop the seams of their systems and they wanted a little runway to head off the problem in the right way.
But what is the "right way" to perform a backup where one of the tables is over 20 terabytes in size, the entire database is over 40 terabytes in size and the backup systems require two or even three weeks to extract and store this information for just one backup cycle? Quipped one admin "It takes so long to back up the system that we need to start the cycle again before the first cycle is even partially done." and another "Forget the backup. What about the restore? Will it take us three weeks to restore the data too? This seems unreasonable."
Yes it does seem unreasonable precisely because it is - quite unreasonable. As many of you may have already discovered, the Netezza platform is a change-agent, and will either transform or crush the environment where it is installed, so voracious are its processing needs and so mighty is its power to store mind-numbing quantities of information.
The aforementioned admins simply plugged their backup system into the Netezza machine, closed their eyes and flipped the switch, then helplessly watched the malaise unfold. It doesn't have to be so painful. These are very-large-scale systems that we are attempting to interface with smaller-scale systems. We might think that the backup system is the largest scaled system in our entire enclosure, but put a Netezza machine next to it and watch it scream like a little girl.
So here's the deal: No environment of this size shoiuld be handled in a manner that is logistically unworkable for the infrastructure hosting it. We can say all day that these lower-scaled technologies should work better or that Netezza should pony-up some stuff to bridge the difference, but we all know that it's not that easy. Netezza has simplified a lot of things, but simplification of things outside the Netezza machine - aren't we asking a bit much of one vendor?
To avoid pain and injury, think about the things that we need to accomplish that are daunting us, and solve the problem. The problem is not in the technology but in the largesse of the information. We would have the same problem on our home computers if we had a terabyte of data to backup onto a common 50-gig tape drive. We would need twenty tapes to store the data. The backup/restore technology works perfectly fine and reasonably well for a variety of large-scale purposes. We simply need to be creative about adapting it to the Netezza machine. Don't plug it in and hope for the best. Don't do monolithic backups. The data did not arrive in the machine in a monolithic manner so why are we trying to preserve it that way? Leave large-scale storage and retrieval to the Netezza machine and don't crush the supporting technologies with a mission they were never designed for.
Several equally viable schools of thought are in play here. What we are looking for is the most reliable one. Which one will instill the highest confidence with least complexity? The more complex a backup/restore solution becomes, the less operational confidence we have in it. If it cannot backup and restore in a reasonable time frame, we exist in a rather anxious frontier, wondering when the time will come that the restoration may be required and we put our faith in the notion that it either won't, or when it does all of the other collateral operational ssues will eclipse the importance of the restoration. In other words. future circumstance will get us off the hook. There is a better way, like a deterministic and testable means to truly backup and restore the system with high reliability and confidence.
On deck is the simplest form of the solution - another Netezza machine. Many of you already have a Disaster-Recovery machine in play. Trust me when I tell you that this should be fleshed out as a fully functional capability (discussed next) and then the need for a commodity backup/restore technology evaporates. Using another Netezza machine, especially when leveraging the Netezza-compressed information form, allows us to replicate terabytes of information in a matter of minutes. I don't have to point out that none of our secondary technologies can compete with this.
A second strategy requires a bit more thought, but it actually does leverage our current backup/restore technology in a manner that won't choke it. It won't change the fact that the restoration, while reliable, may be slow simply because moving many terabytes in and out of one of these secondary environments is inherently slow already.
A third strategy is a hybrid of the second, in that enormous SAN/NAS resources are deployed as the active storage mechanism for the data that is not on (or no longer on) the Netezza machine. This can be a very expensive proposition on its own. We know of sites that keep the data on SAN in load-ready form, and then load data on-demand to the Netezza machine, query the just-loaded data, return the result to the user and then drop the table. You may not have on-demand needs of this scale, but this shows that Netezza is ready to scale into it.
A fourth strategy is a hybrid of the first, that is we still use a Netezza machine to back up our other Netezza machine, but we use the more cost-effective Netezza High-Capacity server, which is less expensive than the common TwinFin (fewer CPUS, more disk drives) but otherwise behaves in every way identically to its more high-powered brethren. And honestly if we were to put apples-to-apples in a comparison between the cost of a big SAN plant to store these archives versus the High Capacity Server, the server wins hands-down. It's cheaper, simpler to operate, doesn't require any special adaptation and we can replicate data in terms guided by catalog metadata rather than adapting one technology to another.
So let's take these from the least viable to the most viable and compare them in context and contrast, and let the computer chips fall where they may.
Commodity backup/restore technology
If we want to leverage this, we need to understand that it cannot be used to perform monolothic operations. These are unmanageable for a lot of reasons:
To mitigate the above, we need to adapt the large-scale database to the backup technology by decoupling and downsizing the operation into manageable chunks. This is a direct application of the themes surrounding protracted data movement in any environment. The larger the data set, the more the need for checkpointed operations so that the overall event is an aggregation of smaller, manageable events. If any single event fails, it is independently restartable without having to start over. Case in point, if I have 100 steps to complete a task and they are all dependent upon one another, and the series should die on step 71, I still have 29 steps remaining that may have completed without incident, but I cannot run them without first completing step 71. This is what a monolithic backup buys us - an all-or-nothing dependency that is not manageable and I would argue is entirely artificial.
To continue this analogy, lets say that any one of these 100 steps only takes one minute. In the above, I am still 30 minutes away from completion. I arrive at 6am to find that step 71 died, and now I have to restart from step 1 and it will cost me another 100 minutes. Even if I could restart at step 71, I am still 30 minutes away from completion.
Contrast the above to a checkpointed, independent model. If we have 100 independent steps and the step 71 should die, the remaining 29 steps will still continue. We arrive at 6am to find that only one of the 100 steps died and we are only 1 minute away from full completion. The difference in the two models is very dramatic:
Monolithic means we are operationally reactive when failure occurs. The clock is ticking and we have to get things back on track and keep moving. Checkpointed means we are administratively responsive when failure occurs. We don't have to scramble to keep things going. In fact, in the above example, if step 71 should fail and the operator is notified, doesn't the operator have at least half an hour to initiate and close step 71 independently of the remaining 29 steps? Operators need breathing room, not an anxious existence.
Monolithic methods are supported de-facto by the backup/restore technology. If we want to perform a checkpointed operation, we have to adapt the backup/restore process to the physical or logical content of the information. We don't want to directly mate the backup technology itself, so we need to adapt it.
Logical Content Boundaries
This means we have to define logical content boundaries in the data. What's that? You don't have any logical content boundaries, not even for operational purposes? Well, per my constant harping on enhancing our solution data structures with operational content, such as job/audit ID and other quantities, perhaps we need to take a step back and underscore the value of these things because they exist for a variety of reasons. One of them is now upon us - the operatonal support of content-bounded backups. It is required for scalability and adaptability and is not particularly hard to apply or maintain.
A more important aspect of content boundary is the ability to identify old versus new data. If the data is carved out in manageable chunks, some will always be least-recent and some more-recent. Invariably the least-recent chunks will be identical in content no matter how many times we extract them for backup. This means we can extract/backup these only once and then focus our attention on the most active data. In a monolithic model, there is no distinction between least or most active, least or most recent. In large-scale databases, the least-recent data is the majority, so the monolithic backup is painfully redundant when it need not be.
Do we absolutely need content-bounded backups for all of our tables to function correctly? Of course not. But by applying this as a universal theme it allows us to treat all tables as part of a backup framework where all of them behave the same way. So part of this is in the capture of the data but a larger part is the operational consistency of the solution.
Many reference tables such as lookups will never grow larger and we know this. In fact, their data may remain static for many years. For the ones that are tied to the application and grow or change every day, these we will call solution tables. They are typically fed by an upstream source and are modified on a regular basis. Any of these tables can grow out of control. The reference data then represents a very low operational risk. Why then would we not simply fold the reference data into the larger body and treat all the tables the same? There is no operational penalty for it, but enormous benefit from being able to treat all tables the same way inside a common solution.
At this point, the backup/restore framework will address all of the tables the same way, but now we have the ability to leverage rules and conditions within the framework so that special handling is available if necessary. This is a common theme in large-scale processing: Handle everything as though it will grow, but accommodate exceptions with configuration rules. I'll forego this aspect for now and let' take a look at what we need in basic terms:
- An intermedaite landing zone: Some off-machine storage location that can hold the data while in transit - e.g a NAS or SAN volume. This will not be the same size as the database but some fraction of it. It is a workspace for intermediate files, not permanant storage
- Content boundaries: A means to define and manage content bthese are tags or quantities in the data itself. They need to be consistent for all tables so that the backups and restore operate the same way. Of course, if we don't have these, we need to apply them.
- A restoration database: a separate database that will be used for the restoration process, because we won't be restoring the data directly into the table from whence it came. Why not? There is an increasing likelihood that the data structures have changed since the last backup. If the structure has changed, we cannot reload the data into it. We need a way to shepherd the data back into the machine.
- Processes: that capture / backup / restore the data using content boundaries and common utilities. These are typically control scripts that are easily configured, deployed and maintained
- Archival Database (next): Physical content replicas of the masters that hold all of the original information in a form useful for backup and restore. Whether they keep the data active/online is immaterial. Their role is to interface with the backup/restore mechanisms so that any data once made offline can easily be made online through this database.
Setup formal archival databases. Whether these reside on the same server, or on a DR / High Capacity server (below) is immaterial. The point is that the data in the master tables will be actively rolled into these tables, which will form the backbone for all backup and restoration operations. We therefore replicate the masters (with streaming replication) into the archival stores and then at some interval perform the backup of the archives, not the master.
Foreshorten the Master Tables
Now that we have a means to define content boundaries - you did apply those boundaries right? We can now look at the database holistically for optimizations based on active data.
At one site we have a table with ninety billion records in three different fact tables spanning over ten years of information, However, the end-users and principals claimed (and we verified) that the most active data on any given day is the most recent six months. Anything prior to that, they would query perhaps once or twice a year for investigative purposes, but had not tied any reports to it.
So now we have an opportunity to get agreement from the users to shorten the master tables. This is especially necessary for those fact tables. In the end, these fact tables (above) were shortened to less than four billion rows each and these are kept trimmed on a regular basis. The original long-term data is held in another archival database that is the foundation for the backups and restores.
On a per-table basis, use the content boundaries to manufacture/compile a set of checkpointed instructions for each table to support extraction and restoration
Execute the extractions to create flat files
Each extraction will pull data to a Netezza-compressed flat file and also supply another file with the metadata instructions necessary to restore the file's contents. On a per-file basis, these instructions will include:
Original table definition, used to construct a template of the table in the restoration database
File load instructions used to get the content back into the machine
Transformation command, used to move data from the restoration database into its original home
- The files will be extracted (using content boundaries) onto the storage landing zone
- The smaller extractions can be performed in parallel, are independent and form their own checkpointed process
- The backup techology will begin its backup process once the second (metadata companion) file arrives for the set. The metadata companion file behaves as a trigger.
- Once the backup for the file-set is complete, the backup/restore deletes the file and its companion
- This decouples the backup/restore processing from the database (completely) so that it can focus solely on file-based backup and restore
- For restoration, both the content file(s) and companion metadata file(s) will be retrieved and placed on the storage landing zone
- The metadata will be used to reconstruct an empty intermediate version of the original table in the Restoration Database
- Load the table using nzload
- Then the metadata will be used (even a preformulated SQL statement) to copy the data into the original target
- Throw away the intermediate load table, delete the data file and its companion metadata file
- The restoration process can be run as many ways-parallel as supported by nzload
- The restoration process can also be surgical, with the more agile ability to restore data in smaller segments
The above protocol, while likely easy to pull off an administer, still has a number of moving parts that the Netezza based-equivalent (later) will not have.
The only difference between the above protocol and one using a SAN-based storage mechanism, is the absence of a formal backup/restore technology. Rather the SAN is the long-term storage location and we perform the incremental extractions onto it. Rather than delete the files, we keep them.
This has significant implications for the cost of the SAN. After all, if we intend to interface this to the Netezza machine, we would not want common NAS storage because it is too slow and the vendors actively disclaim their technology from being viable for data warehousing. The primary reason is that the network and CPUs are set up for load-balancing, like a transactional database, but not bulk onload/offload of the data.
Not only will we need enough SAN to backup the environment, but to also carry fathers and grandfathers if need be (this is a policy decision). With checkpointed extracts, the father/grandfather issue is largely moot. This is because once a checkpointed extract of older data is pulled and stored, it won't be changing and capturing another one just like it has no value.
In this approach we leverage another Netezza machine like a DR server, as our backup, archival and restore foundation. It can easily hold the information quantity. The difference here is in price, of course, since a fully-functional TwinFin is more expensive than most common SAN installations. However, the High Capacity Server (below) mitigates this pricing problem while delivering a consistent data experience.
One primary benefit of performing backups into the DR server is that it can automatically serve the role of a hot-swap server in case of failure in the primary server.
For this scenario to work however, we would want streaming replication between the active databases and the DR server so that the data is being reconciled while being processed. This allows us to have a fully functional hot-swap if the primary crashes, and we can continue uninterrupted while the primary is serviced. Word to the wise on this kind of scenario however, bringing back the primary means that it is out of sync, since the secondary took over for a span of time. So we would need to be able to reverse the streaming replication to make it whole.
Scenarios like this often embrace the practice of operationally swapping the two machines on defined boundaries, like once a month or once a quarter, where they actually switch roles each time. This allows the operations staff to gain confidence in the two machines as redundant to each other in every way. I have seen cases like this where the primary machine went down, the secondary machine kicked in seamlessly and all was well. I have also seen cases where the principals kept the DR server up to date but when it came time to operationally switch, some important piece (usually in the infrastructure between the devices) was missing causing the failover itself to fail. It is best to have a plan in place, but it's better to have tested the plan and that it actually works.
On a per-table basis, use the content boundaries to manufacture/compile a set of checkpointed instructions for each table to support extraction and restoration
Using a variation of the nzmigrate utility, perform the table-level extractions
Extract data in Netezza-compressed form to a flat file
Load the flat file into a restoration database in the second machine
Perform the transform-execution to copy data from the restoration database into the target table
The extractions, restoratons and transformations can all leverage simple scripts and catalog metadata, not become dependent on deeply hand-crafted code
Intermediate (SAN) landing zone requires less space because the data is being transferred in Netezza-compressed format and it is cleaned-up on-the-fly
Transfer is quicker because Netezza-compressed data is written to the data-slices directly instead of coming through the host.
A word on Netezza-compressed transfer. I wrote about this in Netezza Transformation but it is important to highlight here. We performed an experiment moving half a terabyte scattered across a hundred or more tables. This data was moved from its original home to a database in another machine. The first method used simple SQL-extract into an nzLoad component. This process took over an hour. The second method used transient external tables with compression, coupled with an nzload in compressed mode. The entire transfer took less than six minutes. This was because the compressed form of the data was already 14x compressed.
In other experiment using over 20x compression for the data, we were able to transfer ten terabytes in less than an hour. This kind of data transfer speaks well for the streaming replication necessary for DR server operation (above) but underscores the fact that even when transferring between Netezza machines, it's as though we haven't left the machine at all.
Netezza-based High-Capacity Server
This option is simply a form of the Netezza-based hybrid (above) but on a dedicated server designed to support backup and recovery.
The better part about this server is that is has more disk drives and fewer CPUs, making it far more cost effective for storage than common SAN devices. Couple this with the minimal overhead required for transferring data between machines, and the ability to surgically control the content with the content-boundaries and catalog metadata, and we get the best of all worlds with this device.
Not only that, but it is also scalable to support storage of all other Netezza devices in the shop as well as any non-Netezza device where we simply want to capture structured information for archival purposes. The High Capacity server is queryable also, meaning that even the ad-hoc folks will find some value in keeping the data online and available.
Lastly, in Spring 2010, as part of the safe-harbor presentation one of the principals at IBM Netezza announced plans for a replication server. I can only imagine that this device will deliver us from any additional hiccups associated with streaming replication that we might now be doing in script or other utility control language.
At Brightlight, our data integration framework (nzDIF) has the nz_migrate techniques built directly into the flow substrate of the processing controller, as well as the enforcement and maintenance of the aforementioned content boundaries. We are actively acquiring and applying best, most scalable and simplified approaches as a solution framework firmly lashed to a purpose-built machine. I am a big proponent of encouraging Enzees to take on these things themselves, or at least let us coach you on how to make it happen. The solutions are simple because the Netezza platform itself is simplified in its operation. Stand on the shoulders of genius - the air is good up here.