DavidBirmingham - Tags: performance, transformation, elt, netezza, sql, etl, transform
I have noted in prior posts several of the "banes" of in-the-box data processing, not the least of which is harnessing the mechanics and nuances of the SQL statement itself. After all, the engine of in-the-box is a series of insert/select SQL statements. I've also noted that we need to squeeze the latency out of the inter-query handoff and management. These are important factors for efficiency, scalability and adaptability.
But this article deals primarily with "adaptive" SQL, that is, the ability to surgically and dynamically control the SQL, the paths of flow between SQL statements, their timings and the ability to conditionally execute them.
I am drawing a contrast between this approach and the common "wired" ETL application. In the wired application of an ETL tool, all components are known and flow-paths predefined. If we want to shut off a particular component or flow, we'd better make that decision at startup because we won't get to do this later. A benefit here is that if we add or change a flow-path, the ETL tool's dependency analysis will (usually) detect it and give it a thumbs-up or thumbs-down. We can (and do) perform this kind of design-time analysis, but what of dynamic run-time analysis?
Case in point: One group performs trickle-feed of data from a change-data-capture, so on any given loading cycle, we don't know which files will show up. Not to worry in an ETL tool, since we would just build a separate mini-app to deal with the issues. The mini-app would key on the arrival of a specific file, process the file and present results to the database. This is a very typical implementation. But with hundreds of potential files, it's also logistically very daunting and hard to get the various streams to inter-operate. In fact, an ETL tool quickly reduces to "spaghetti-graphics" and the graphical user interface is just in-our-way at that point.
Case in point: One group has multiple query paths/flows where SQL statements build one-to-the-next for the final outcome. These can follow a wide range of paths, not unlike a labyrinth, depending on a variety of different factors. The problem is, these factors aren't known until run-time and only appear in fleeting form as the data is processed. How do we capture these elements and use them as steering logic? In an ETL tool, our options are limited to none. In this particular case, three primary paths of logic were available each time the flows ran. Sometimes all three paths ran end-to-end. Sometimes only one or two would run, or perhaps none at all. The starting conditions and unfolding data conditions determined the execution path.
But we have another name for this, don't we? Isn't this just plain vanilla "computer programming"? Where the data shows up and we use the encountered-data and encountered-conditions to guide the IF-THEN-ELSE logic to conclusion? The problem, you see, is that we are so accustomed to using IF-THEN-ELSE at the ROW/COLUMN level, we cannot imagine what this would look like at the SET level. Ahh, the conditional logic driving SETS is unique and distinct from that which drives basic elements. But then again, we can only scale in sets, not the basic elements. This is where the dynamic nature of conditional-sets is invaluable.
But this isn't really about conditional sets, either. Only that conditional sets are a necessary capability and we have to account for them along with many other subtle nuances. Let's follow:
We have an external file and we load this into an intermediate/staging table (TABLE-A) in preparation for processing.
Now we build another target intermediate table (TABLE-B) and an insert/select statement to move / shape the data logically and physically from TABLE-A to TABLE-B.
From here we have several more similar operations, so we build intermediate tables for their results as well, such as TABLE-C, TABLE-D and TABLE-E
TABLE-A >>> TABLE-B >>> TABLE-C >>> TABLE-D >>> TABLE-E
Now let's say we have another chain of work starting from TABLE-V:
TABLE-V >>> TABLE-W >>> TABLE-X >>> TABLE-Y >>> TABLE-Z
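To make the mechanics concrete, here is a minimal sketch of the first hop in such a chain - illustrative only, with hypothetical column names and the thinnest possible harness around the insert/select (a real framework wraps much more around this):

    # Minimal sketch of one link in the TABLE-A >>> TABLE-B chain.
    # "conn" is assumed to be an open DB-API connection to the Netezza database;
    # table and column names are hypothetical.

    def run_transform(conn, sql):
        """Execute one insert/select transform and return the number of rows affected."""
        cur = conn.cursor()
        cur.execute(sql)
        conn.commit()
        return cur.rowcount

    # TABLE-A >>> TABLE-B : move and shape the staged data into the next intermediate table.
    TABLE_B_SQL = """
        INSERT INTO TABLE_B (cust_id, txn_date, txn_total)
        SELECT a.cust_id, a.txn_date, SUM(a.txn_amount)
          FROM TABLE_A a
         GROUP BY a.cust_id, a.txn_date
    """

    # rows_loaded = run_transform(conn, TABLE_B_SQL)   # TABLE-B is now a checkpointed result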
Now something interesting happens, in that the developers sense a pattern that allows them to reuse certain logic if they only put these quantities into a couple of working tables, which we will call TABLE-G and TABLE-H, and now the flows look like this:
TABLE-A >>> TABLE-B >>>+
TABLE-G >>>>>>>>>>>>>>>+>>> TABLE-C >>> TABLE-D >>> TABLE-E

TABLE-V >>> TABLE-W >>>+
TABLE-H >>>>>>>>>>>>>>>+>>> TABLE-X >>> TABLE-Y >>> TABLE-Z
Notice how TABLE-G is feeding TABLE-C and TABLE-H is feeding TABLE-X, so that each of them has a 2-table dependency.
Now we get to the end of the chain of work and learn that TABLE-Z has to leverage some data in TABLE-C! We don't want to rebuild TABLE-C just for TABLE-Z, but in an ETL Tool this data would be bound/locked inside a flow. We could redirect the flow to TABLE-Z, unless the flow to TABLE-Z is entirely conditional and we don't know it until we encounter TABLE-C. What if, for example, the results of TABLE-C are conditional and if the condition is realized, none of the components following TABLE-C are executed. However, we could have TABLE-Z see this absence as acceptable and continue on.
Okay, that's a lot of stuff that might have your head spinning about now, but the simplicity in resolving the above is already in our hands. In any flow model, upstream components essentially have a "parent" relationship to a downstream "child" component. This parent-child relationship pervades flows (and especially trees) and as we can readily see, the above chain-of-events looks a lot like a tree (more so than a flow).
More importantly, each node of the tree is a checkpointed stop. We must build the intermediate table, process data into it and move on, but once we persist the data, we have a checkpointed operation. This is why it behaves so beautifully as a flow and a tree.
Now let's say over the course of SDLC (regular maintenance), a developer needs to add some more operations and connect other existing operations to their results. This is essentially just introducing new source tables in the where/join clause, but the table has to exist. In short, if we add a new table to the logic of TABLE-X, it will now be dependent upon its original tables and the new ones. (Its query will break if they are not present at run time).
It is easy enough (honestly) to perform a quick dependency-check over all of our queries to make sure that their various source tables are accounted for. In other words, an operation actually exists that will produce the table. What if we picked the wrong table or even misspelled it? At run-time we would know, but we would rather know before execution because it's a design-time issue. This may verify that logically we have a plan to create the dependent table, but it does not deal with the simple fact that conditional circumstances may forego the physical instantiation of the table. Transforms ultimately do not operate on intent, but on the presence of physical assets.
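A design-time check like that takes only a few lines. As an illustrative sketch (the registry structure and the table names are hypothetical), register each transform's target and the source tables named in its where/join clause, then flag any source that nothing plans to produce:

    # Hypothetical registry: target table -> source tables named in its where/join clause.
    TRANSFORMS = {
        "TABLE_B": {"TABLE_A"},
        "TABLE_C": {"TABLE_B", "TABLE_G"},
        "TABLE_D": {"TABLE_C"},
        "TABLE_E": {"TABLE_D"},
    }
    INTAKE_TABLES = {"TABLE_A", "TABLE_G"}      # loaded from external files or feeds

    def unproduced_sources(transforms, intake):
        """Return (target, source) pairs where no operation will ever produce the source."""
        producible = set(transforms) | intake
        return [(tgt, src) for tgt, srcs in transforms.items()
                for src in srcs if src not in producible]

    print(unproduced_sources(TRANSFORMS, INTAKE_TABLES))   # [] means the plan is complete

A misspelled or missing table surfaces here before anything executes - but as noted above, this only verifies intent; whether the physical table actually exists at run-time is a separate check.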
As another nuance, this creates a disparity between the design-time flow of data, and the run-time flow of data. If the run-time is governed (e.g. ETL tools) so that the dependencies and conditions are all evaluated at the start of the application, the design-time and run-time are more easily mated for review by an auditor or analyst. But if any part of it is dynamically conditional, we can see how this could practically nullify the design-time form of the flow. They would simply say, "I know what the flow would do by design, but I want to see what it actually did at run time, because the data isn't matching up". Aha - so "intent" counts for design review, but "intent" is not what puts physical data into the tables. Operational processes do that.
As noted above with the necessity for conditionality and reduction of inter-transform latency, we now have a need to weave together at run-time what the flows will actually do. The "source tables" for a given transform are found in the where/join phrase and these had better be present when the SQL launches or it will be a short ride indeed.
And now, what you did not expect - one of the most powerful ways to use a Netezza machine is to forego the "serialization" of these flows and allow them to launch asynchronously. We can certainly throttle how many are "live" at a time, but if any or all of them can launch independently, how on earth are we supposed to manage the case where one or two of them really are dependent on another one or more? Do we put these in a separate flow? Do we really want our developers to have to remember that if they put an additional dependency in a transform, they have to regard whether that preceding transform has actually executed successfully?
So that's the real trick, isn't it? If I have forty transforms and all of them could run asynchronously except for about ten of them, which can only run after their predecessors complete, I have several options to see to it that these secondary operations do not fail (because their predecessor has not executed yet).
I can serialize them by putting them into separate flows (or branches). One of them kicks off and runs to completion while the next one waits. This is logically consistent but also inefficient. If those secondary transforms are co-located with the original set, the optimizer can run them when there is bandwidth rather than waiting until the end. It is also logistically unwieldy because a developer has to remember that if a transform should gain a dependency, it has to be moved to the second flow.
I can fully serialize them into a list, but this is the most inefficient since it "boxcars" the transforms and does not leverage the extra machine cycles we could have used to shrink the duration.
I can link them via their target table and source table, such that this relationship is dynamically identified and the flow path dynamically realized. If a given transform does not run (conditional failure) or simply fails to execute, the dependency breakage is dynamically known. What does this do? What if a given transform is supposed to use an incoming (intake) table if it is present (data was loaded) otherwise use a target-table's contents (e.g. trickle-feed, change-data-capture problem). This allows the transform to do its work with consistency but also have the ability to dynamically change its sources based on availability.
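Here is an illustrative sketch of that last option's "use the intake table if it arrived, otherwise fall back to what we already have" behavior. It assumes a DB-API style connection and probes the catalog for physical presence (the _v_table view is one way to do that on Netezza; treat the view and the table names as assumptions):

    def table_exists(conn, table_name):
        """Ask the catalog whether the physical table is actually present right now."""
        cur = conn.cursor()
        cur.execute(
            "SELECT 1 FROM _v_table WHERE UPPER(tablename) = '%s'" % table_name.upper()
        )
        return cur.fetchone() is not None

    def pick_source(conn, intake_table, fallback_table):
        """Prefer the freshly arrived intake table; otherwise use the existing contents."""
        return intake_table if table_exists(conn, intake_table) else fallback_table

    # src = pick_source(conn, "STG_ORDERS_TODAY", "ORDERS_TARGET")   # hypothetical names
    # sql = "INSERT INTO ORDERS_WORK SELECT * FROM %s WHERE ..." % src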
Now, we know ETL tools don't do this. Other tools may attempt to rise to this level of dynamic pathing, but the bottom line is that if those tools don't provide this kind of latency-reduction, high-throughput, dynamically adaptable model, they will not be able to leverage the full bandwidth of the machine. Trust me on this - the difference is between using 90 percent of the machine or only 10 percent at a time. That machine packs the virtual joules to make it happen, so let's make it happen.
When we originally developed our framework to wrap around some of these necessary functions, we had not considered these nuances of dynamic interdependence and frankly, ELT was so new that it didn't really matter. The overhead to execute "raw" SQL was zero, but we could not effectively parallelize/async the queries without losing control. Running async chains of transforms necessitated detailed control, but nobody had a decent algorithm for it, so once again Brightlight had to pioneer this capability. Our architecture allowed us to easily integrate these things into the substrate of the framework as a transparent function. This is the primary benefit of a framework, that the developers can continue to build their applications without disruption, but we can upgrade and enhance the framework to provide stronger and deeper functionality. Whether our framework is right for all applications is not the issue, but whether the complete implementation is right for Netezza. It's a powerful machine and we should not arbitrarily leave any cycles on the table.
Imagine slowly running out of steam because of latent implementation inefficiencies, then ultimately asking for a Netezza upgrade that, if the inefficiencies weren't present, the upgrade wouldn't be necessary. This has happened with more than one of our sites and rather than upgrading to all-new-hardware, we installed, converted and bought back an enormous amount of capacity. They eventually upgraded the hardware much later on, but for the right reasons.
DavidBirmingham - Tags: port, performance, functional, migration, suspect, 100x, netezza
After completing another migration from a traditional, general-purpose RDBMS to the Netezza technology, I visited a friend who had several artifacts in his home that had to be the strangest things I'd ever seen.
Now I'd heard of genetically altered cat fur, you know, buying a cat online while picking the fur color of your choice (blue, lavender, teal - etc). Seems like an odd thing to do to a cat. I like cats and dogs both, so don't imagine that this blog essay attempts to take sides. Some folks are downright serious on their choice of pet, so I'll smooth that fur wherever I can.
Back to the hairless cats. I asked him "Where did you find these? And what happened to your other cats?" And he laughed, "That's a funny story. These are my cats, but I had to shave them." To this, I rolled my eyes, wondering where this was about to lead. He told the tale:
"We took a vacation down south and put the cats in kennels in the back of the truck. The round-trip took a toll on their fur and matted up everything from head to toe. Weeks later, the cat fur had not smoothed out. The kids had been brushing it out but it wasn't working. And if you think they look ridiculous shaved, you have no idea how silly they looked with their hair matted."
"So you shaved them?" I asked.
"Sure" he said, "Seems practical right? Just get rid of the matted hair altogether. Teasing it apart would have taken, well, years of time. Their fur will grow back out soon enough"
"Aren't there, you know, shampoos and stuff for that? I mean, shaving seems a little extreme."
"Tried all that. Bad thing about it, mats are bad for cats - they cause infections and all kinds of nasty side effects. Best to just shave it all and be done with it."
Laughing on the inside, I thought a bit about how we have to decompose and de-engineer an organically-grown data warehouse. Some would suggest porting (forklifting) the whole thing over "as is". Like taking a matted-hair cat and moving them from one house to another. It changes the venue for the cat, but doesn't help the cat at all. It's still sick and getting sicker from the mats. Such folks "tell a tale" of the success of their migration derring-do. But they are like nomads. Hunting the game until there's no more, then pulling up stakes to find another place to burn out. Forklift-migrations have value only to the ones who are doing the migrating, not the recipients. No sooner will they tie a bow on it than someone will request a change, and we will discover what we already knew: The original data model (now the new data model) isn't very resilient to change no matter where it is hosted.
We realize we have ported both the good and the bad from the old system, when we had the opportunity to port the good and leave the bad behind. We essentially are agreeing that we are about to standardize on the past and then accommodate the future, rather than a better approach: standardize on the future and accommodate the past.
Many years ago, at one site we had to carefully tease apart the data and the stored procedures to find out what they were actually doing. Unfortunately we had carved up the work for several teams rather than reviewing it together. Had we done this, we would have discovered that the stored procs executed in chains of work, and that many of the chains were copy-pasted from one original chain that was too "matted" to risk breaking. So they copied-and-modified this chain to perform the new functionality. Enough of these and we see how the stored procedure doesn't benefit us (at all) for back-end data processing. In fact, we strongly suggest people use stored procs in Netezza for BI-adaptation and optimization, for the presentation layer. But not for the back end. Stored procs are not operationally viable for a wide range of reasons. It's even funny how folks move from one technology to another and try to replicate the stored procedure logic as a knee-jerk exercise, without realizing how flawed it really is. Perhaps the Netezza stored proc will run a lot faster. Trust me, performance is the least of your worries.
So once we converged the teams together, these themes started popping out like rabbits. By the end of the first day we were all laughing at the sheer level of redundancy in the back end. But not particularly surprised at the outcome. We'd seen it in lots of places before, but never this bad.
Of course, it never once occurred to us that we would port these hundreds of stored procs over to the new system. Rather we would functionally specify what they were doing now, and leverage tools to accommodate the vast majority of the functionality, only building what was left over. I mean, this is a standard functional port, why complicate things? Forklifting into a Netezza machine will certainly yield 10x performance, so why the beef? Without optimizing the data structures and processes to leverage Netezza's power, we might get 10x but leave 100x on the table. Is this a good tradeoff?
Well, true to form, someone had the capacity to complicate things. He whipped out a spreadsheet and calculated the cost of the hundreds-of-stored-procs in the original system, not realizing we were planning to reduce these to maybe fifteen operations at most. Spreadsheet calculator in-hand, he estimated that it would take 24 people, 8 months, to handle these stored procs. I sat back in my seat, stunned, because he was costing a project we weren't about to undertake. Rather, building out 15 or so operations would require a handful of people and 90 days at the outside. But also true to form, the project principals saw visions of sugar plums (another word for sales-comp) that got in the way of their better judgment. They actually went to the client with these inflated numbers, he rejected their proposal outright and gave the business to someone else. It's easy to lose a deal when the client sees the inflation on-the-page.
But what our "spreadsheet guy" missed, was that we weren't about to embark on a journey of finding a home for each stored proc (we already knew this had no value, and the client knew it too). He believed that we intended to bring the matted-cats into the house and put them on pillows, when we intended to pick the cats we wanted to keep, and shave them.
Okay, that's a strange analogy, but we had no intention whatsoever of accepting all that convoluted spaghetti as the foundation for the go-forward system.
Netezza gives us the capacity - to simplify. We keep the parts we consider valuable (the cat) and get rid of all the mess that keeps the cat sick and unhappy. Taking only the functions we want, we then reconstruct (let the hair grow back out) only what we want to keep, and take the opportunity to apply some solid architectural principles and likewise capitalize on the strengths of the Netezza platform.
In the end, if we really have a platform that is standardized on the future, but accommodates the past, we also have something else that is even more powerful: A simpler, stronger engine that is ready to grow in functionality, adapting to our changing needs. The old system was never built with this kind of vision or priority, because the power wasn't there to effect it anyhow.
DavidBirmingham - Tags: time, continuous, large, feed, real, trickle, performance, netezza, scale
After review of a "high performance" ELT platform (that's ELT, database-transform-in-the-box) - I started asking hard questions about things they had not considered. It's a high-performance platform, isn't it?
Well, yes and no. It supports a "continuous" model, but the performance is all in the query and the data, right? Well, we'd like to think that for purposes of long-cycle queries anyhow. Here's the expectation:
In a general-purpose RDBMS, transformation-in-the-box is expensive. Each query can take minutes or even hours to complete. At this point, nobody really cares about the overhead to launch and shepherd the query, or to report its status when completed. All of these infrastructure issues are eclipsed by the duration of the query itself. If only one percent of the operation's duration is in the overhead, who cares if we spend time optimizing or minimizing it?
So in one scenario, the product would launch its queries end-to-end using a scheduler. Each of the queries would be packaged into its own little run-time, then the scheduler would kick off each one and wait for its closure, only then kicking off the next, etc. Some would call this reasonable, others sophisticated. After all, if the duration of each query is protracted, why do we care about inter-transform latency?
Contrast this to a Netezza-centric series of transforms. A general-purpose database, recall, requires us to dogpile lots of logic into each transform, protracting the duration as a matter of necessity. In a Netezza-centric scenario, we will see those ten-or-so general-purpose queries chopped apart into more efficient, tactical form, with each query building upon the last towards a final outcome (in a fraction of the time of the general-purpose equivalent).
Apart from the mechanics of how Netezza makes this happen, look at how the mechanics of the operation change dramatically. I'll use a known working example of 42 Netezza transforms. Where we first had ten-or-so general-purpose queries running for an hour or more, we now have over forty MPP-queries, each of which runs in less than a second of duration (with some exceptions). So all 42 queries, if we could kick them off one-after-another, will execute in around 45 seconds. If in this case the users (or process) kicking off the sequence want less than one-minute turnaround, now we have to deal with squeezing out inter-transform latency.
In plain-vanilla terms, the scheduler-based mechanism noted above put six seconds of latency between each transform. What does this mean? With each transform running in one second, and six seconds of overhead, what could run in less than a minute now runs closer to five minutes - we have dramatically breached our one-minute SLA!
Now David, get serious - who on earth wants to shave seconds off the inter-transform latency? Six seconds of delay between each transform seems perfectly reasonable! Sure, if the transform itself will take twenty or thirty minutes. But if it will take less than a second, our overhead for it is now the glaring culprit with a smoking gun, red-hands and all that. And for those Netezza users who want to eliminate this latency, don't pooh-pooh their needs. The fat is our problem to solve.
And what does this say for ETL tools that perform push-down to the machine? They will also have transitional latency as they hand-off control across components. Geared for the big-fat, long-duration queries of the general-purpose world, nobody has ever cared about the smidgeon of latency between them. Only now, the smidgeon is not so much, and looks a lot like a boulder next to a basketball.
Consider the following breakdown of a standard transform's parts. They look a lot like a CPU's fetch-decode-execute cycle (yeah, that's a little geeky, I know)
Startup >>> Overhead >>> Execution >>> Shutdown
So imagine that we have a tool with controls (like a scheduler) with several seconds of latency in startup, formulating the query (recall, we want query-generation, not hand-coded queries) leading up to Execution, and then some minor overhead to accept the status and transition to the next operation. We'll call it at four (4) seconds of latency.
If we scale the above timeline with several inline / serialized transforms, we would see an effect like this:

|DDDx--xDD|  |DDDx--xDD|  |DDDx--xDD|  ...

See the "DDD" (for dead-time)? We lose that time in the additional latency for transform management. This is akin to the startup/shutdown cost of a launcher, scheduler, or ETL tool shepherding a "component" to activate the underpinning SQL statement.
Either way, take a look at the total time "between the x's " that is the actual execution time. If this time is several hours, the dead-time is a nit. If the execution time is proportional to the line-drawing above, we have far too much overhead.
So for 42 of these operations, we would have 42 x 4 seconds of latency plus 42 x 1 second of execution, for a grand total of 210 seconds, or 3.5 minutes. When we seriously consider squeezing the fat from this operation, our primary problem is in the non-optional overhead.
Now take a look at the scenario below. I have kicked off five transforms asynchronously, with their execution times between-the-x's:

|DDDDx--x|
|DDDDWWWWx--x|
|DDDDWWWWWWWWx--x|
|DDDDWWWWWWWWWWWWx--x|
|DDDDWWWWWWWWWWWWWWWWx--x|
See how the first one hands off to the second one, like a baton in a relay race? Notice how each of the transforms has already incurred its overhead in parallel to the first transform, and are now merely waiting (see the "W") for their predecessor to trigger their execution.
What is the start-to-finish time of these five transforms if executed like "boxcars" in serialized mode, accepting the penalty for inter-transform latency? From start to finish is around 25 seconds. But with the above example, the inter-transform latency has been practically removed except for the simple handshake as they hand off. The first one launches and we incur the initial four seconds of latency, a necessary penalty. Then each subsequent transform requires one second, for a grand total of about nine seconds of runtime. We have effectively moved from a model with 25 seconds of runtime to nine seconds of runtime.
While this is around 36% of the original run time, it does not seem as dramatic as when we compare it to the original model of 42 transforms. That is, four seconds of overhead plus 42 seconds is 46 seconds of runtime. Add to this 1/10th second for the handoff delay (another 4 seconds), for 50 seconds of grand total run time.
Its original time was 42*5 seconds, or 210 seconds. This new time of 50 seconds is 24% of our original run time. That's a 300% improvement in run-time and of course, is well within the boundaries of the one-minute SLA.
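The arithmetic is simple enough to lay out in a few lines, using the 4-second startup, 1-second execution and 1/10th-second handoff figures from the example:

    N = 42            # transforms in the chain
    STARTUP = 4.0     # seconds of launch/query-formulation overhead per transform
    EXECUTE = 1.0     # seconds of actual query execution per transform
    HANDOFF = 0.1     # seconds for the "done" handshake between pre-queued transforms

    serialized = N * (STARTUP + EXECUTE)              # every transform pays full overhead
    pipelined  = STARTUP + N * (EXECUTE + HANDOFF)    # overhead absorbed in parallel, paid once

    print(serialized)   # 210.0 seconds (3.5 minutes)
    print(pipelined)    # ~50 seconds, comfortably inside the one-minute SLA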
Offsetting the transform run-times like this, so that the overhead is essentially invisible, is a common hardware-stabilization approach for micro-electronics. Here's an example:
Many years ago I worked for a company that was repackaging a design into much smaller form. Essentially we were reducing a rack of hardware into a 1-foot cube. All of the large wire-wrapped boards had been designed-down to much smaller form. This is when instabilities were first detected.
In one particular case, the engineer described it to me as an engineering-101 mistake on the part of the original designer. Every hardware circuit is driven by signals that pass through conditioners and gates (transistors) so that the final outcome is a signal of some kind on a particular part of the board's interface(s). The mistake was in that the timing signal, a pulse sent to the hardware 60 times a second, was being applied at the first input of logic. So a general signal would "sit" on a given input location, the timing signal would "fire", opening the gate, and the signal would then traverse through the many other components and paths to reach the output path of the hardware. But here was the problem: the signal took too long to make it from point-A to point-B before all of the other signals had already left it behind. The solution to this: Put the timing-trigger signal on the output side of the board. This way, when the original signal first arrived, it would make its way across all the necessary components and paths and then present its signal to a final gate, which was triggered by the timer. So when the trigger hit, the signal was already present and passed through instantly without a problem.
This sort of "triggered-steering-logic" is the theme of the noted inter-transform handoff scenario above. The transforms navigate their overhead to a stopping point and wait for the signal. In this case, they are waiting for a "done" signal from the transform preceding them. When this signal hits, the next transform is queued and ready to immediately execute its query without further delay. The subsequent transforms fall like dominos.
But that's really only part of the story. These 42 transforms don't just run in a serialized stream. Some of them, after all, have no dependencies whatsoever. Others have dependencies in a discrete chain. Why serialize them when their dependencies are relatively few? Here's an example:
ABC >>>+
DEF >>>+>>> JKL
GHI >>>+

MNO >>>+
PQR >>>+>>> VWX
STU >>>+
Transforms A,B and C are dependent upon each other, so will serialize. In the meantime, the D,E,F and G,H,I transforms are independent of A,B,C, so can run side-by-side. J,K,L transforms are dependent on the outcomes of C,F and I, so will wait on them to complete, then finish the K,L transforms as serialized. But then another set-of-three transforms takes off. In ETL tools, we recognize this as branching or "component-parallelism". However, in an ETL tool the branches must be wired together and follow each other by design of the graphic on the canvas. ELT however, is much more dynamic than that.
Look at the effect: We're saving 6 seconds of time by not executing the A-I transforms serialized. Likewise the P-U transforms will now take 3 seconds, not 9 seconds, saving another 6 seconds. If we can find the otherwise independent transforms and move them to the front of the chain(s), the total time for the run is the longest chain of dependent transforms. In this case, we shrank 42 transforms into the same time frame as 20 serialized transforms. This shrank the total time from 50 seconds down to less than 28 seconds.
But David, you mean we have to carefully weave these transforms together so that we can squeeze the fat from the timeline? Actually no. We have a sniffer that examines the "filter clause" for the dependent tables, you know, what the given transform will actually join against. This allows the transforms to self-discover their own optimum path without having to deal with painful weaving. If we happen to add another transform, or change the filter phrase in a transform to include/exclude another table in the join, it will automatically realign the priorities based on the intrinsic dependencies of the transforms. And your ETL tool won't do this dynamically.
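For illustration only, a crude sniffer of this kind can be built from the transforms' own SQL: pull the table names out of the FROM/JOIN phrases, then group the transforms into waves, where everything in a wave can launch asynchronously (a production version would consult the catalog rather than lean on a regular expression):

    import re

    def source_tables(sql):
        """Crudely sniff the tables named in the FROM/JOIN phrases of a transform's SQL."""
        return set(re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w]*)", sql, re.IGNORECASE))

    def dependency_waves(transform_sql):
        """Group transforms (target -> SQL) into waves; a wave's members share no dependencies."""
        deps = {tgt: {s.upper() for s in source_tables(sql) if s.upper() in transform_sql}
                for tgt, sql in transform_sql.items()}
        waves, done = [], set()
        while len(done) < len(deps):
            wave = sorted(t for t, d in deps.items() if t not in done and d <= done)
            if not wave:
                raise ValueError("circular dependency among transforms")
            waves.append(wave)
            done.update(wave)
        return waves

    # Example: TABLE_C joins TABLE_B and TABLE_G, so it waits for both; TABLE_B and
    # TABLE_G have no in-graph dependencies and launch together in the first wave.
    SQL = {
        "TABLE_B": "INSERT INTO TABLE_B SELECT * FROM TABLE_A",
        "TABLE_G": "INSERT INTO TABLE_G SELECT * FROM TABLE_A",
        "TABLE_C": "INSERT INTO TABLE_C SELECT * FROM TABLE_B b JOIN TABLE_G g ON b.k = g.k",
    }
    print(dependency_waves(SQL))    # [['TABLE_B', 'TABLE_G'], ['TABLE_C']]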
Then we have the intake protocol, that of transferring data from one machine to another, or loading from files. We have a couple of options, that of loading the data completely before taking actions in the transforms, or we can load the data asynchronously to the transforms. Just as the transforms will "wait" on a prior table in the chain-of-transforms, we can also make them "wait" for the arrival of a particular intake table. Rather than waiting for all of the intake tables to arrive, we can initiate a transform chain when its particular data set has arrived (or for that matter, not at all, in case its data never arrives).
And in our second example, the total time allocated for loading the data, executing it and ensconcing it to the proper target tables was less than three minutes. Since serialized, all-or-nothing loading easily absorbed over half of this time, we found it important to squeeze the fat from this process. By causing the entire chain to run asynchronously, when a given intake table is ready, its transform-chain would automatically launch. As fortune would have it, the longest processing chain also had small tables to load, so we could kick off those transforms very early. Some of the shortest chains also had the largest load files, so starting every load as early as possible meant the long chains were already well underway by the time the big loads completed.
In the end, what started out as a five-minute-plus operation shrank to 160 total seconds of time, well-inside the 3-minute SLA with some room for recovery. Here's how it looked:
Load Time = L
Transform Time = T
Original - all transforms launch when all the loads are complete:
LLLL TTT TTTTT
Async form - transforms begin when their source tables are ready:

LL   TTTTT
LLLL TTT
One may ask, why wouldn't we do it in the second form anyhow? It seems that this is the optimum way to process data. Well, one answer is simply : checkpointing. If we launch the transforms prior to all of the loads finishing, it is much harder to recover in case of load-failure. We would have to cancel the in-flight transforms and otherwise bring the processing to a halt. If we load it all first, then proceed, any errors can stop the processing immediately. No wasted cycles. Efficiency is important with these kinds of transform-chains in a database like Netezza. Still, if we need to shrink the total job time, the recovery-shutdown protocol (in case of load failure) is a necessary capability. Besides, our protocol does not itself finalize anything to the target database until the last transform is complete, lest the data start to arrive intermittently and out-of-context. So as long as the load time is balanced with overall transform time, shutting down before finalization is usually doable.
This of course invites the obvious question - are any of the ETL tools (or ELT offerings) attempting to squeeze out the fat in a similar manner? Methinks not, and for one reason: They, like the general-purpose databases, likewise consider themselves to be general-purpose tools. They are not philosophically committed to squeezing out latency because Netezza is the only platform that can benefit from it. The undertaking of setting up and coordinating this sort of baton-like handoff, while not particularly difficult, is also non-trivial and represents a significant effort for a product tool to embrace. When compared to interfacing with the general-purpose platforms, Netezza is not a large-enough user base to justify ramping up this high-performance, zero-latency capability. In short, the product's features are market-driven.
The Netezza user base is growing, however, and with IBM pushing the hardware, it will grow more and faster.
Sitting at the top-end of this food chain are the big-ticket TwinFin 24/48/96 machines that do massive amounts of processing-in-the-box and want to move toward continuous models, processing data as-the-world turns. The age of the nightly batch cycle is waning and Netezza is stepping up to the vast opportunities within the "continuous" world. Latency in this world is like poison. Just because it appears to be acceptable to the general-purpose world, this is an illusion. If the general-purpose queries ever dropped into subsecond-duration, the tools facing them would need to re-tool. Actually they need to re-tool now - they just don't feel the pressure yet.
DavidBirmingham - Tags: netezza, solution, problem, power, referential, enterprise
A while back we bought a new house and decided that we would keep our first house for a few additional months to muse about whether we would keep it for a rental home. Well, shortly afterward was Sept 11th, 2001 and we were boxed into keeping it for quite a while longer. In all this, opportunists came out of the woodwork to take our house for a "song".
We tired of the songbirds quickly, but some of them were just bizarre. One couple looked at the house and said they would buy it if we would put $20k of upgrades into it. New this, new that. Our answer was no, we're selling "as is". But they persisted and felt that their requests were reasonable. No, I said, you can spend your money fixing it up like you want it, because I'm not spending my money to fix it up like you want it. They were confused. We weren't "playing the game" you see. Oddly, they had a realtor working with them who kept chanting, "Ask for what you want. It never hurts to ask."
Well, when you're asking the wrong questions, and it's obvious, it definitely hurts to ask. When on site with a large financial services firm, we were being examined for the replacement of their current consulting firm. Their lament: "They ask all the wrong questions." Apparently we were asking the right questions, because they gave the business to us.
Recently I walked through a demo of some things we've done for other clients. Two minutes into the demo, they started asking questions. This or that? Those or these? But nothing whatsoever to do with the core problems the solution was geared for, or the core questions that everyone asks.
It's like this: I have a list of questions on the left side of the white board and another list on the right side. The questions on the right side deal with things like very-large-scale backup and high-speed recovery, disaster recovery, hot-failover between two Netezza machines, real-time replication, continuous processing, and of course, heavy-lifting processing inside-the-box.
These are things that accelerate business logic, the rapid turnaround of business rules and features, catalog-aware design-time and run-time modules, developer productivity, testing accuracy and turnaround, rapid-testing and rollout, release management and patched releases - operational integrity, a firm harness around the functional data, and radical simplification of very complex problems - in short, a high wall of protection around a high-speed delivery model. Something that shrinks the time-to-delivery as well as shrinks operational processing time and protects capacity, while protecting the data itself.
I knew better than to hold my breath for a question on migration -
Whew! Whenever I give this short list to a CIO, they are usually asking in terms of having rolled out a large-scale environment with Netezza and are experiencing pain in one or more of those areas. Of course, we focus on those areas. Painkiller is not enough. You have to remove the source of pain.
How do I know that these are the questions Enzees are asking? Because they are - to me personally, to the Netezza community forums, and are the hot-topics at every Enzee Universe. Good grief, that's the short list of the lightning-round question-and-answer sessions at the Best Practice forums.
But they did not hear any of this. They were asking all of the wrong questions. The reason, methinks, is that they weren't in any pain. They felt the freedom to ask about the inconsequential because they had never experienced the consequences, or even foreseen the consequences. In other words, I am solving problems they don't have.
Or do they? Anyone who buys a large-capacity storage device is by definition biting off a data management challenge. One of our clients has over 200 TB (uncompressed) on their TwinFin. They currently do their backups to another TwinFin of the same size. They would like to perform their backups offline, but no commodity backup/recovery system on the planet can handle this kind of load with any aplomb. Some claim to, of course, but have not considered the true needs of the Netezza user. It's all Texas-Sized Data Warehousin'.
For example, one of the off-the-shelf products claims that once the data is offloaded, they have "hooks" allowing the user to query the offline archives. That's right, the user query will be dispatched to their proprietary engine and it will fetch the answer back for you. But wait a second. That protocol is for a transactional system. Going out onto the file system to fetch a transaction might have some value, since the only alternative is to bulk-load the entire dataset back into the device to query it, when all we wanted was one transaction.
However, for a Netezza "scanning analytic" query, potentially spanning tens of terabytes in scale, such a mechanism is no more than a child's toy. What we need is the ability to rapidly reload the data back into the machine so we can query it there, inside Netezza's MPP. More importantly, once it's back online we can join it to other tables, that's right, down on the MPP where the CPUs burn brightest.
But this is simply an example of how one vendor has solved a problem that no Netezza user is asking. Netezza users need to scan and analyze the data in-the-box. Those tools provide data access outside-the-box, without considering that having the data outside-the-box is the problem we were trying to solve in the first place!
Another issue is how the ETL tools interact with Netezza. All of them do, some better than others. Some are certainly getting better than the others. The game is on, they know that data processing is migrating back into the machine. Netezza is leading the way, and they want a piece of the action. Can you blame them?
But hold on. Is bulk-data processing in an ETL tool the same as bulk-data processing inside a Netezza box? No, not really. "Going parallel" in an ETL tool means spreading work out across CPUs for a select set of operations. Not only this, we will compete with other processes for those same CPUs. The net is, we cannot put all the CPUs on the machine to work for us. Clearly some of the ETL tools are otherwise proud of their scalability. And they should be. Add a few more CPUs to that rack and watch the ETL tool scale with it. Nothing bad about that. Capture those business rules in a point-click and off-we-go. They have spent millions of dollars tuning their performance models, pouring the brain power of their brightest people into making their product go-parallel and push that data hard. Imagine them handing off this hard-won, multi-million-dollar performance capability to an upstart appliance? Hmm, no, they'll be the last who are dragged kicking and screaming into the inexorable future.
Anyone who is kicking the tires of a large-scale storage and processing device like Netezza is also in for a subtle surprise: Once you are migrated, you will unleash the creativity of your staff to add functionality that was never possible before. This will generate business, which will generate more data. The extra capacity will naturally offer confidence that all-new-business-great-and-small is not only do-able, but with strength that isn't available with other platforms. No worries, sez aye, the Netezza machine will scale with you. And you have chosen well, grasshopper.
Then one day they look around and say - have we backed up this data recently? How about disaster recovery? What about archiving old data to make room for new? What about the ability to make the offline data available for the ad-hoc folks? That is, available in time for them to actually use it? These simple questions raise fear in the hearts of the mightiest of IT champions when they know they should have been asked, and applied, a long time before - before the locomotive was moving at 90 miles an hour dragging 500 boxcars. Before the locomotive even left the station. The sheer logistics of performing a backup of data this size, and this transient, is mind-numbing. I've noted that one client, hosting over 150 TB (uncompressed), naively plugged their commodity backup tool into the Netezza machine. After over a week of whirr-click backup activity with no end in sight, someone finally said "Kill it. If it takes a week to back it up, it will take even longer to restore it". This is a wise observation, but also tells us something:
The commodity tools expect us to accept that the restore-process will be slower than the backup-process. In the Netezza world, the backup and restore can both happen faster, and take up less storage than the commodity backup tool. If the commodity backup has to offload in uncompressed form, we need to provide generous workspace for it. If it's a Netezza-compressed backup, we only need to provide for the amount relevant to our compression ratio. Some sites get a 16x, others get almost twice that. The mileage varies because of the nature and compressibility of the data. Either way, offload is fast because it comes right off the disk without passing through the host. Likewise the reload, directly to the disk without the host. In a traditional RDBMS, offload/reload has to pass through the host to get on and off the device. For bulk analytic data, the missing middle-man is just the ticket for rapid-reload.
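A rough sizing sketch, using the 200 TB figure and the compression range noted above (illustrative arithmetic only - real ratios depend on the data):

    uncompressed_tb = 200                 # data on the TwinFin, uncompressed
    for ratio in (16, 32):                # range of compression ratios quoted above
        print(f"{ratio}x compression -> about {uncompressed_tb / ratio:.1f} TB of backup workspace")
    # 16x -> 12.5 TB, 32x -> 6.2 TB, versus the full 200 TB for an uncompressed offload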
But they weren't asking this question either. The problems we had already solved and were bringing to the table as Netezza-centric solutions, the bread-and-butter, core-mission capabilities that people ask for all the time, weren't even on their radar. The disconnect is simple: They have read white-papers on what people are doing after they get the back-end squared away. The nice-to-haves. The critical parts, you know, the failure points that could mean the demise of the business, or perhaps their own paychecks? Not even on the table.
This is akin to someone about to engage in the construction of a large cargo ship. To be sure, some folks are concerned about the utility of the ship's interior. But when putting pen to paper, which is more important, that the break rooms have contemporary wallpaper, or that the ship can master tempestuous seas and high swells? Methinks the crew of said ship couldn't care less about the amenities if they got wind that the architects gave short shrift to things like hull stress and multiple watertight compartments. You know, things to keep the ship from being claimed by the sea when stress is high. Boy, those light fixtures sure look cool, don't they?
Back to basics. Netezza is an appliance. It can perform as a load-and-query device just like all the other load-and-query devices. A primary differentiator, one that more customers are experiencing all the time, is that Netezza is a powerful data processing platform. When leveraged this way, it also becomes a problem-solving platform. We simply wrapped some additional logic around these core capabilities in order to harness the logistics of multiple sequential (or asynchronous) queries being fired off to the device, managing its workload, intermediate tables and whatnot. The appliance makes such an endeavor simple, and further simplifies the user's interaction with the machine for purposes of pattern-based utilization. Change-data-capture, referential integrity checking etc are all far more effective inside the machine than outside the machine.
For the shops that would rather master these aspects in the ETL tool, hey, no harm done. Lots of people do it that way. But they all eventually get to the same point in the game: the ETL tool runs out of gas. Or to give it more gas will require a significantly larger power-plant. They then realize that all those CPUs in the Netezza machine are just sitting there, doing nothing most of the time. Enough CPUs exist to answer peak demand, which only arrives about five percent of the time. The remaining 95 percent of the time, the box is running at less than half capacity, with a big chunk of that completely idle. Wouldn't it be great to get a handle on all that power? Recover it as part of an ongoing capacity model?
The most important part of this approach is to decide you want to do it. Lots of options naturally come your way when you try to get creative in the direction of power - because power drives the creativity further. Ideas are given wings that can fly higher and farther than any other traditional RDBMS could even dream about. Managers start having ideas too. By golly, if that machine can do this, then why not that? And why not, indeed?
In fact, most of the time when dealing with a standard RDBMS, the managers will ask why-not? in the sincerest of spirits. Then forth from the mouths of IT come the exact, precise reasons why-not. They are good reasons, logical reasons, and effectively put a wide moat around the management's idea-factories. Soon their ideas fade. They forget how good the ideas were. They will never see the light of day. The underpowered nature of the machines constrain them.
Contrasted with the Netezza machine, the question of why-not is more rhetorical in nature. The person asking the question is not expecting an answer either, but not for the same reasons as the manager above. He's not expecting an answer because the IT folks are already on it. His answer to the question will not be verbal, but will be positively expressed in real functional terms. Likely in a very short period of time.
Why-not? becomes more of a water-cooler phrase. Almost like, "I'm trying to decide whether we should eat at the club or eat at the diner. How about today we eat at the club?" To which the respondent says - "why not?" This is a very different and rarefied existence than the one borne on the constraints of an underpowered machine.
Once you gather a head of steam on all this, you realize that not only is Netezza an enabler for large-scale, complex problem solving, it provides the impetus for us to construct a problem-solving platform upon it. The ability to capture larger patterns of set-based processing, express them in simpler terms, and have them available to all - as capabilities that leverage the capabilities of the Netezza machine.
In your own domain, if you have a Netezza machine in-house and you're using it for data processing, even if it's the most inefficient model on the planet, it still beats the socks off the plain-vanilla ETL tool's work for the same operations. If you have coupled an ETL tool with this approach and are getting the gains out of it, even though this process may have initially been painful, you have answered the question with the right answers. The tools may not be quite up to speed, but that's okay. Your compass has not betrayed you. Stay the course and the benefits will be magnanimous indeed.
And for those who continue to use the machine as a load-and-query device, and have not forayed into the radically rewarding realm of ELT and data processing inside the machine, there is only one question to ask, and it's the right question: Why not?
DavidBirmingham - Tags: twinfin, performance, enzee, underground, pressure, netezza, adaptation, transformation
So I published this book last Spring ('11) on how the Netezza machine is a change-agent. It initiates transformation upon people or products that happen to intersect with it. Most of the time this transformation makes the subject better. Sort of like how heavy-lifting of weights will make the body stronger. Or the pressure can crush the subject. Stress works that way. We could imagine the Netezza machine as the change-agent entering the environment. Everything brushing against it or interacting with it will have to step-up, beef-up or adapt. I sometimes hear the new players say things like "But if the Netezza machine could only.." That's like a Buck Private saying of his drill sergeant, "If he could only ..." No, the subject must consider that the Netezza machine is never the object of transformation but rather is the initiator of it. But it's not a harsh existence by any means. Products that can adapt are far-and-away better than before. Those that cannot adapt now, will eventually, or remain in their current tier.
Having been directly or indirectly alongside these sorts of product integrations and proof-of-concepts (POCs) numerous times, it's always an interesting ride. The vendor shows up ready-to-go with visions-of-sugarplums in their head. And the suits who show up with them are salivating for the ink on the license agreement. In less than an hour into the POC, all of them have a very different opinion of their product than when they arrived. Their bravado is reduced to a shy, sort of sheepish spin. Throw them a bone, not everyone walks out of this ring intact. Some of them shake their fist at the Netezza machine. It is unimpressed. Others shake their fist at their own product. Alas, it is but virtual, inanimate matter. What is transforming now? The person in the seat.
So I have watched them scramble to make the product hit-the-mark. Patches? We don't need no stinkin' patches. Except for today, when they will be on the phone in high-intensity conversations with their "engine-room" begging for special releases while on-site. Alas such malaise could have been avoided if only they had connected their product - at least once - to a Netezza machine. In so many cases, they will claim that they have Netezza machines in-the-shop so they are prepared-and-all-that. It is revealed, sometimes within the first hour, that the product has never been connected to a Netezza machine. It doesn't even do the basics, or address the machine correctly. It is especially humorous to hear them speak in terms of scalability as though a terabyte is a high-water mark for them. One may well ask, why are we wasting our time with underpowered technology? Well, in point of fact, when placed next to the Netezza machine it's all underpowered, so really it's just a matter of degree.
Case in point, Enzees know that in order to copy data from one database to another, we have to connect to the target database (we can only write to the database we are connected to), and then use a fully-qualified database/tablename to grab data from elsewhere - in the "select" phrase. Forsooth, their product wants to do it like "all the others do" and connect to the source, pushing data to the target. Staring numb at the white board in realization of this fundamental flaw, they mutter "If only Netezza could....". But that's not the point. They arrived on site, product CD in hand, without ever having performed even one test on a real Netezza machine, or this issue (and others) would have hit them on the first operation. They would have pulled up a chair in their labs, started the process of integration and perhaps called the potential customer: "Can we push the POC off until next week? We have some issues (insert fabricated storyline here) and need to do this later."
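For the record, the pattern Enzees expect looks roughly like this - connect to the target database and pull from the source with a fully-qualified name in the select (the database and table names here are placeholders):

    # Connected to the TARGET database. Netezza lets us read from another database with a
    # fully-qualified name, but we can only write into the database we are connected to.
    COPY_SQL = """
        INSERT INTO TARGET_TABLE
        SELECT *
          FROM SOURCE_DB..SOURCE_TABLE    -- fully-qualified database..table reference
    """

    def copy_across(conn):
        cur = conn.cursor()
        cur.execute(COPY_SQL)
        conn.commit()
        return cur.rowcount   # note: on billion-row copies this count can overflow a 32-bit integer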
Cue swarming engineers. Transformation ensues.
Another case in point, many enterprise products are built to standards that are optimized for the target runtime server. That is, they fully intend to bring the data out of the machine, process it and send it back. One of my colleagues made a joke about Jim Carrey's "The Grinch" and the mayor's lament for a "Grinch-less" Christmas. Well, didn't the Grinch tell Cindy-Lou Who that in order to fix a light on the tree, he would take the tree, fix it and bring it back? Seems like a lot of hassle for one light? Why can't you fix it here and not take it anywhere? Enzees see the analogy unfolding. No, we don't want to take the data out, process it and put it back. We want "Grinch-less" processing, too. Fix the data where it already is.
Why do this? Well, in version 6.0 of the NPS host, the compression engine could easily give us up to 32x compression on the disk. Or even a nominal 16x compression, meaning that our 80 terabytes is now 5TB of storage. And while we may have to de-compress it on the inside of the machine to process it, the machine is well-suited to moving these quantities around internally. Woe unto the light-of-heart who would pull the data out into-the-open, blooming it to its full un-compressed glory, on the network, CPU, the network again - just to process it and put it back.
Unprepared for the largesse of such data stores, our vendor contender's product centers on common scalar data types. Integer, character, varchar, date. No big deal. Connect to the Netezza machine and find out that the "common" database size is in the many billions and tens of billions of rows. A chocolate-and-vanilla software product without regard to a BigInt (8 byte) data type cannot exceed the ceiling of 2 billion (that's the biggest a simple integer can hold). This does not bode well for integrating to a database with a minimum of ten billion records, and that's just the smallest table. Having integers peppered throughout the software architecture by default would require a sweeping overhaul to remediate. As the day wears on, we see them struggle with singleton inserts (a big No-No in Netezza circles) and lack of process control over the Netezza return states and status. These are not exotic or odd, but no two databases behave the same way. The moment that Netezza returned the row-count that it had successfully copied four billion rows, we watched the product crash because it could not store the row-count anywhere - the product had standardized on integers, not big integers, so the internal variable overflowed and tossed everything overboard. Quite unfortunately, this was a data-transfer product and performed destructive operations on the data (copy over there, delete the original source over here). So any hiccup meant that we could lose data, and lots of it.
Cue announcer: "And the not-ready-for-prime-time-players..."
Oh, and that "lose data and lots of it" needs to be underscored. In a database holding tens of billions of rows (hundreds of terabytes) of structured data - each record in inventory, with fiduciary, legal, contractual, perhaps even regulatory wrappers around it - we're way, way past the coffin zone. Some of you recall that the "coffin zone" is the point-of-no-return for an extreme rock-face climber. Cross that boundary and you can't climb down. But we're not climbing a rock face, are we? The principle is the same. Lose that data and we'll get a visit from the grim reaper. He doesn't hold a sickle, just a pink slip in one hand and a cardboard box in the other (just big enough to empty a desk-full of personal belongings into).
One test after another either fails or reveals another product flaw. When the smoke clears, the "rock solid offering", complete with sales-slicks and slick-salesmen, is beaten and battered and ready for the showers. The product engineers must now overhaul their technology (transform it) and fortify it for Netezza, or remain in their tier. The Netezza machine has spoken, resets itself into a resting stance, presses a fist into a palm, graciously bows, and with the terse, guttural challenge of a sensei master says: "Your Kung Fu is not strong!"
Now it's transformation-fu.
Superficially, this can look like a common product-integration firefight. But this kind of scramble tells a larger tale: they weren't really ready for the POC. This would be similar to an "expert" big-city fireman, supremely trained and battle-hardened in the art of firefighting and all its risks, joining Red Adair's oil-well-firefighting team (a niche to be sure) and finding that none of the equipment or procedures he is familiar with apply any longer. He will have to unlearn what he knows in order to be effective on a radically larger scale. He might have been a superhero back home, faster than a speeding bullet, able to leap tall (burning) buildings in a single bound, but when he shows up at Red Adair's place, they will tell him to exchange his clothing for a fireproof suit and get rid of the cape. Nobody's a hero around an oil-field fire. Heroes leave the site horizontally, feet-first. No exceptions.
Enzees have experienced a similar transformation (with a different kind of fire). The most-oft-asked questions at conferences are of just that flavor: How do we bring newbees into the fold? How do we get them from thinking in row-based/transactional solutions into set-based solutions? How do we help them understand how to use sweeping-query-scans to process billions of rows? Or use one-rule-multiple-row approaches versus cursor-based multiple-rule-per-row approaches? How do we get testers into a model of testing with key-based summaries instead of eyeballs-on-rows (when the rows are in the billions)?
We were dealing with a backup problem at one site because of a lack of external disk space. Commodity tools often use external disk space for this purpose, until they are connected to a Netezza machine and their admin tool complains that they need to add "another hundred terabytes" of workspace. We gulp, realizing that today's workspace is a grand total of ten terabytes in size. And you need another hundred? Yeesh, you big-data-people!
Most of the universe outside the Enzee universe will never have to address problems on this scale. It is not the machine itself that is the niche. It is the problem/solution domain. Most of the commodity products that are stepping up are doing so only because it's clear that Netezza is here to stay and they need to step into Netezza's domain. I suppose at some point they expected Netezza to give them a call to start the integration process, but the Netezza Enzee Universe already had all that under control. It's amazing how lots-of-power can simplify hard tasks, to the point of ignoring commodity products entirely.
Another case in point, a product vendor "popped over" with a couple of his newbee product guys and spent two weeks trying to get their product to play in-scale with Netezza. Before throwing in the towel, they offered up the common litany of observations: "No indexes? What the?" and "Netezza needs to change X", or the favorite "Nobody stores this much data in one place." The short version is, you brought a knife to a gun fight, as Sean Connery would assert, or perhaps you brought a pick-axe and a rope to scale Mt. Everest. What were you thinking? You see, most people who have never heard of Netezza (I know, there really are folks out there who don't know about it, strange as it seems) do not understand the scale of data inside its enclosure. Billions of records? Tens of billions of records? A half-trillion records? Is that all you got?
We will watch a switch flip over in their brains as they assess what they are trying to bite off. A small group will embrace the problem and work toward harnessing the Netezza machine in every way possible. Another group will provide a bolt-on adapter for Netezza to interface to their core product engine. While another, larger group will assess the expense of such things, the marketplace they currently address, and conclude that they will for now remain in their current tier. This is like a 180-lb fighter climbing into the ring with a heavyweight, and walking away realizing that he needs to add some muscle, some speed, and some toughness - or just stay in his own weight class and be successful there.
Another case-in-point is the need for high-intensity data processing in-the-box in a continuous form, coupled with the need for the reporting environment to share the data once-processed, likewise coupled with the need for backup/restore/archive and perhaps a hot-swap failover protocol. We can do these things with smaller machines and their supporting vendor software products. But what about Netezza, with such daunting data sizes, adding the complexity of data processing?
At one site we had a TwinFin 48 (384 processors) and two TwinFin 24's (192 processors) with the '48 doing the heavy-lifting for both production roles. When it came time to get more hardware, the architects decided to get another '48 and split the roles, so that one of the machines would do hard-data-processing and simply replicate-final-results to the second '48, limiting its processing impact for any given movement. This was not the only part of their plan. They then set up replicators to make "hot" versions of each of these databases on the other server. This allowed them to store all of the data on both, providing a hot DR live/live configuration, but it would only cost them storage, not CPU power. Configured correctly, neither of the live databases would know the difference. Our replicators (nzDIF modules) seamlessly operated this using the Netezza compressed format to achieve an effective 6TB/hour inter-machine transfer rate, plenty of power for incremental/trickle feeds across the machines.
Some say "I want an enterprise product that I can use for all of my databases". Well, this is the problem isn't it? Netezza is not like "all of our other databases". Products that have a smashing time with the lower-volume environments start to think that a "big" version of one of those environments somehow qualifies their product to step up. I am fond of noting that Ab Initio, at one site loading a commodity SMP RDBMS, was achieving fifteen million rows in two hours. Ab Initio can load data a lot faster than that (and is on record as the only technology that can feed Netezza's load capacity). So what was the problem? The choice of database hardware? Software? Disk space? Actually it was the mistaken belief that any of those can scale to the same heights as Netezza. Consider: if fifteen million rows take two hours, what would a billion rows take at that rate (over 130 hours)? Netezza's cruising-speed is over a million rows a second from one stream, and it can load multiple streams at a time.
Many very popular enterprise products have not bothered to integrate with a Netezza machine, and many of those who have, provide some form of bolt-on adapter for it. It usually works, but because the problem domain is a niche, it's not on their "product radar". It's not "integrated as-one". What does this mean? Netezza's ecosystem, now assimilated by IBM, will through IBM's product genius and sheer integration muscle ultimately have a powerful stack for enterprise computing such that none of the other players will be able to catch up. If those vendors have not integrated by now, the goal-line to achieve it is even now racing ahead of them toward the horizon. Perhaps they won't catch up. Perhaps they won't keep up. Some products (e.g. nzDIF) are at the front-edge, but nzDIF is not a shrink-wrapped or download-and-go kind of toolkit. We use it to accelerate our clients and differentiate our approaches. It's a development platform, an operational environment and an expert system (our best and brightest capture Netezza best practices directly into the core). This has certainly been a year where we've gotten the most requests for it. But there's only one way to get a copy.
Cue Red Adair.
"No capes!" - Edna Mode, clothing-designer-for-the-gods, Disney/Pixar's The Incredibles
DavidBirmingham 2700043KNU Tags:  recovery twinfin disaster scalable infrastructure capacity high netezza scale archive backup 1 Comment 7,487 Views
When introducing the Netezza platform to a new environment, or even trying to leverage existing technologies to support it, very often the infrastructure admins will have a lot of questions, especially concerning backups and disaster recovery. Not the least of which are "how much", "how often" and such like. More often than not, every one of our responses will be met with a common pattern, a sentence starting with the same two words:
Case in point, when we had a casual conversation with some overseers of backup technology as a precursor to "the big meeting" we almost - quite accidentally - shut down the conversation entirely. Just the mention of "billions" of rows, or speaking of the database in "Carl Sagan" scaled terms, caused them to want to scramble for budget and market surveys of technologies that were more scalable than their paltry nightly tape-backup routines. In this particular conversation, we were talking about nightly backups that were larger than the monthly backups of all of their other systems combined. Clearly we were about to pop the seams of their systems and they wanted a little runway to head off the problem in the right way.
But what is the "right way" to perform a backup where one of the tables is over 20 terabytes in size, the entire database is over 40 terabytes in size and the backup systems require two or even three weeks to extract and store this information for just one backup cycle? Quipped one admin "It takes so long to back up the system that we need to start the cycle again before the first cycle is even partially done." and another "Forget the backup. What about the restore? Will it take us three weeks to restore the data too? This seems unreasonable."
Yes it does seem unreasonable precisely because it is - quite unreasonable. As many of you may have already discovered, the Netezza platform is a change-agent, and will either transform or crush the environment where it is installed, so voracious are its processing needs and so mighty is its power to store mind-numbing quantities of information.
The aforementioned admins simply plugged their backup system into the Netezza machine, closed their eyes and flipped the switch, then helplessly watched the malaise unfold. It doesn't have to be so painful. These are very-large-scale systems that we are attempting to interface with smaller-scale systems. We might think that the backup system is the largest scaled system in our entire enclosure, but put a Netezza machine next to it and watch it scream like a little girl.
So here's the deal: No environment of this size should be handled in a manner that is logistically unworkable for the infrastructure hosting it. We can say all day that these lower-scaled technologies should work better or that Netezza should pony-up some stuff to bridge the difference, but we all know that it's not that easy. Netezza has simplified a lot of things, but simplification of things outside the Netezza machine - aren't we asking a bit much of one vendor?
To avoid pain and injury, think about the things that we need to accomplish that are daunting us, and solve the problem. The problem is not in the technology but in the largesse of the information. We would have the same problem on our home computers if we had a terabyte of data to back up onto a common 50-gig tape drive. We would need twenty tapes to store the data. The backup/restore technology works perfectly fine and reasonably well for a variety of large-scale purposes. We simply need to be creative about adapting it to the Netezza machine. Don't plug it in and hope for the best. Don't do monolithic backups. The data did not arrive in the machine in a monolithic manner, so why are we trying to preserve it that way? Leave large-scale storage and retrieval to the Netezza machine and don't crush the supporting technologies with a mission they were never designed for.
Several equally viable schools of thought are in play here. What we are looking for is the most reliable one. Which one will instill the highest confidence with the least complexity? The more complex a backup/restore solution becomes, the less operational confidence we have in it. If it cannot back up and restore in a reasonable time frame, we exist in a rather anxious frontier, wondering when the time will come that the restoration may be required, and we put our faith in the notion that it either won't, or that when it does, all of the other collateral operational issues will eclipse the importance of the restoration. In other words, future circumstance will get us off the hook. There is a better way - a deterministic and testable means to truly back up and restore the system with high reliability and confidence.
On deck is the simplest form of the solution - another Netezza machine. Many of you already have a Disaster-Recovery machine in play. Trust me when I tell you that this should be fleshed out as a fully functional capability (discussed next) and then the need for a commodity backup/restore technology evaporates. Using another Netezza machine, especially when leveraging the Netezza-compressed information form, allows us to replicate terabytes of information in a matter of minutes. I don't have to point out that none of our secondary technologies can compete with this.
A second strategy requires a bit more thought, but it actually does leverage our current backup/restore technology in a manner that won't choke it. It won't change the fact that the restoration, while reliable, may be slow simply because moving many terabytes in and out of one of these secondary environments is inherently slow already.
A third strategy is a hybrid of the second, in that enormous SAN/NAS resources are deployed as the active storage mechanism for the data that is not on (or no longer on) the Netezza machine. This can be a very expensive proposition on its own. We know of sites that keep the data on SAN in load-ready form, and then load data on-demand to the Netezza machine, query the just-loaded data, return the result to the user and then drop the table. You may not have on-demand needs of this scale, but this shows that Netezza is ready to scale into it.
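A minimal sketch of that on-demand cycle, with illustrative names (the table, columns and SAN path are assumptions for the example, not any particular site's layout):

-- build a shell table, pull the load-ready file straight off the SAN, answer, release
CREATE TABLE ondemand_sales (
    sale_id   BIGINT,
    store_id  INTEGER,
    sale_date DATE,
    amount    NUMERIC(12,2)
);
INSERT INTO ondemand_sales
SELECT * FROM EXTERNAL '/san/loadready/sales_q2.dat'
USING (DELIMITER '|');
-- ...run the requesting user's query against ondemand_sales...
DROP TABLE ondemand_sales;    -- give the space back once the answer is delivered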
A fourth strategy is a hybrid of the first; that is, we still use a Netezza machine to back up our other Netezza machine, but we use the more cost-effective Netezza High-Capacity server, which is less expensive than the common TwinFin (fewer CPUs, more disk drives) but otherwise behaves in every way identically to its more high-powered brethren. And honestly, if we were to put apples-to-apples in a comparison between the cost of a big SAN plant to store these archives versus the High Capacity Server, the server wins hands-down. It's cheaper, simpler to operate, doesn't require any special adaptation, and we can replicate data in terms guided by catalog metadata rather than adapting one technology to another.
So let's take these from the least viable to the most viable and compare them in context and contrast, and let the computer chips fall where they may.
Commodity backup/restore technology
If we want to leverage this, we need to understand that it cannot be used to perform monolithic operations. These are unmanageable for a lot of reasons:
To mitigate the above, we need to adapt the large-scale database to the backup technology by decoupling and downsizing the operation into manageable chunks. This is a direct application of the themes surrounding protracted data movement in any environment. The larger the data set, the more the need for checkpointed operations so that the overall event is an aggregation of smaller, manageable events. If any single event fails, it is independently restartable without having to start over. Case in point, if I have 100 steps to complete a task and they are all dependent upon one another, and the series should die on step 71, I still have 29 steps remaining that may have completed without incident, but I cannot run them without first completing step 71. This is what a monolithic backup buys us - an all-or-nothing dependency that is not manageable and I would argue is entirely artificial.
To continue this analogy, let's say that any one of these 100 steps only takes one minute. In the above, I am still 30 minutes away from completion. I arrive at 6am to find that step 71 died, and now I have to restart from step 1 and it will cost me another 100 minutes. Even if I could restart at step 71, I am still 30 minutes away from completion.
Contrast the above to a checkpointed, independent model. If we have 100 independent steps and step 71 should die, the remaining 29 steps will still continue. We arrive at 6am to find that only one of the 100 steps died and we are only 1 minute away from full completion. The difference between the two models is very dramatic:
Monolithic means we are operationally reactive when failure occurs. The clock is ticking and we have to get things back on track and keep moving. Checkpointed means we are administratively responsive when failure occurs. We don't have to scramble to keep things going. In fact, in the above example, if step 71 should fail and the operator is notified, doesn't the operator have at least half an hour to initiate and close step 71 independently of the remaining 29 steps? Operators need breathing room, not an anxious existence.
Monolithic methods are supported de-facto by the backup/restore technology. If we want to perform a checkpointed operation, we have to adapt the backup/restore process to the physical or logical content of the information. We don't want to directly mate the backup technology itself, so we need to adapt it.
Logical Content Boundaries
This means we have to define logical content boundaries in the data. What's that? You don't have any logical content boundaries, not even for operational purposes? Well, per my constant harping on enhancing our solution data structures with operational content, such as job/audit ID and other quantities, perhaps we need to take a step back and underscore the value of these things, because they exist for a variety of reasons. One of them is now upon us - the operational support of content-bounded backups. It is required for scalability and adaptability and is not particularly hard to apply or maintain.
A more important aspect of content boundary is the ability to identify old versus new data. If the data is carved out in manageable chunks, some will always be least-recent and some more-recent. Invariably the least-recent chunks will be identical in content no matter how many times we extract them for backup. This means we can extract/backup these only once and then focus our attention on the most active data. In a monolithic model, there is no distinction between least or most active, least or most recent. In large-scale databases, the least-recent data is the majority, so the monolithic backup is painfully redundant when it need not be.
Do we absolutely need content-bounded backups for all of our tables to function correctly? Of course not. But by applying this as a universal theme it allows us to treat all tables as part of a backup framework where all of them behave the same way. So part of this is in the capture of the data but a larger part is the operational consistency of the solution.
Many reference tables such as lookups will never grow larger and we know this. In fact, their data may remain static for many years. For the ones that are tied to the application and grow or change every day, these we will call solution tables. They are typically fed by an upstream source and are modified on a regular basis. Any of these tables can grow out of control. The reference data then represents a very low operational risk. Why then would we not simply fold the reference data into the larger body and treat all the tables the same? There is no operational penalty for it, but enormous benefit from being able to treat all tables the same way inside a common solution.
At this point, the backup/restore framework will address all of the tables the same way, but now we have the ability to leverage rules and conditions within the framework so that special handling is available if necessary. This is a common theme in large-scale processing: handle everything as though it will grow, but accommodate exceptions with configuration rules. I'll forego this aspect for now and let's take a look at what we need in basic terms:
Set up formal archival databases. Whether these reside on the same server, or on a DR / High Capacity server (below), is immaterial. The point is that the data in the master tables will be actively rolled into these tables, which will form the backbone for all backup and restoration operations. We therefore replicate the masters (with streaming replication) into the archival stores and then at some interval perform the backup of the archives, not the masters.
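A minimal sketch of one such content-bounded roll into the archive, assuming (purely for illustration) that the solution tables carry an audit_id column, that a small control table remembers the highest audit_id already archived, and that the archive lives on the same machine (connected to the archive database; a cross-machine archive would go through the replication/transfer path instead):

-- pull only the chunks the archive hasn't seen yet
INSERT INTO sales_fact_archive
SELECT f.*
FROM   PROD_DB..sales_fact f
WHERE  f.audit_id > (SELECT last_audit_id
                     FROM   archive_control
                     WHERE  table_name = 'SALES_FACT');

-- remember where we left off for the next cycle
UPDATE archive_control
SET    last_audit_id = (SELECT MAX(audit_id) FROM PROD_DB..sales_fact)
WHERE  table_name = 'SALES_FACT';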
Foreshorten the Master Tables
Now that we have a means to define content boundaries (you did apply those boundaries, right?), we can look at the database holistically for optimizations based on active data.
At one site we had ninety billion records spread across three different fact tables spanning over ten years of information. However, the end-users and principals claimed (and we verified) that the most active data on any given day is the most recent six months. Anything prior to that, they would query perhaps once or twice a year for investigative purposes, but had not tied any reports to it.
So now we have an opportunity to get agreement from the users to shorten the master tables. This is especially necessary for those fact tables. In the end, these fact tables (above) were shortened to less than four billion rows each and these are kept trimmed on a regular basis. The original long-term data is held in another archival database that is the foundation for the backups and restores.
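A minimal sketch of the trimming step, with illustrative names and assuming the six-month boundary above (the rows must already be safely rolled into the archival database before they are removed from the master):

-- the archive already holds everything; keep only the active six months in the master
DELETE FROM sales_fact
WHERE  txn_date < ADD_MONTHS(CURRENT_DATE, -6);
GROOM TABLE sales_fact;    -- reclaim the space left behind by the deleted rows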
The above protocol, while likely easy to pull off and administer, still has a number of moving parts that the Netezza-based equivalent (later) will not have.
The only difference between the above protocol and one using a SAN-based storage mechanism, is the absence of a formal backup/restore technology. Rather the SAN is the long-term storage location and we perform the incremental extractions onto it. Rather than delete the files, we keep them.
This has significant implications for the cost of the SAN. After all, if we intend to interface this to the Netezza machine, we would not want common NAS storage because it is too slow and the vendors actively disclaim their technology from being viable for data warehousing. The primary reason is that the network and CPUs are set up for load-balancing, like a transactional database, but not bulk onload/offload of the data.
Not only will we need enough SAN to back up the environment, but also to carry fathers and grandfathers if need be (this is a policy decision). With checkpointed extracts, the father/grandfather issue is largely moot. This is because once a checkpointed extract of older data is pulled and stored, it won't be changing, and capturing another one just like it has no value.
In this approach we leverage another Netezza machine, such as a DR server, as our backup, archival and restore foundation. It can easily hold the information quantity. The difference here is in price, of course, since a fully-functional TwinFin is more expensive than most common SAN installations. However, the High Capacity Server (below) mitigates this pricing problem while delivering a consistent data experience.
One primary benefit of performing backups into the DR server is that it can automatically serve the role of a hot-swap server in case of failure in the primary server.
For this scenario to work, however, we would want streaming replication between the active databases and the DR server so that the data is being reconciled while being processed. This allows us to have a fully functional hot-swap if the primary crashes, and we can continue uninterrupted while the primary is serviced. Word to the wise on this kind of scenario, however: bringing back the primary means that it is out of sync, since the secondary took over for a span of time. So we would need to be able to reverse the streaming replication to make it whole.
Scenarios like this often embrace the practice of operationally swapping the two machines on defined boundaries, like once a month or once a quarter, where they actually switch roles each time. This allows the operations staff to gain confidence in the two machines as redundant to each other in every way. I have seen cases like this where the primary machine went down, the secondary machine kicked in seamlessly and all was well. I have also seen cases where the principals kept the DR server up to date, but when it came time to operationally switch, some important piece (usually in the infrastructure between the devices) was missing, causing the failover itself to fail. It is best to have a plan in place, but it's better to have tested the plan and proven that it actually works.
A word on Netezza-compressed transfer. I wrote about this in Netezza Transformation but it is important to highlight here. We performed an experiment moving half a terabyte scattered across a hundred or more tables. This data was moved from its original home to a database in another machine. The first method used simple SQL-extract into an nzLoad component. This process took over an hour. The second method used transient external tables with compression, coupled with an nzload in compressed mode. The entire transfer took less than six minutes. This was because the compressed form of the data was already 14x compressed.
In another experiment, using over 20x compression for the data, we were able to transfer ten terabytes in less than an hour. This kind of data transfer speaks well for the streaming replication necessary for DR server operation (above), but also underscores the fact that even when transferring between Netezza machines, it's as though we haven't left the machine at all.
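A minimal sketch of the compressed-transfer method described above, with illustrative file, database and table names (purpose-built utilities such as nz_migrate wrap the same idea):

-- on the source machine: unload in Netezza's internal compressed format
CREATE EXTERNAL TABLE '/xfer/sales_fact.nzc'
USING (FORMAT 'internal' COMPRESS true)
AS SELECT * FROM sales_fact;

-- on the target machine: reload the compressed file without ever expanding it to text
INSERT INTO sales_fact
SELECT * FROM EXTERNAL '/xfer/sales_fact.nzc'
USING (FORMAT 'internal' COMPRESS true);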
Netezza-based High-Capacity Server
This option is simply a form of the Netezza-based hybrid (above) but on a dedicated server designed to support backup and recovery.
The better part about this server is that it has more disk drives and fewer CPUs, making it far more cost effective for storage than common SAN devices. Couple this with the minimal overhead required for transferring data between machines, and the ability to surgically control the content with the content boundaries and catalog metadata, and we get the best of all worlds with this device.
Not only that, but it is also scalable to support storage of all other Netezza devices in the shop as well as any non-Netezza device where we simply want to capture structured information for archival purposes. The High Capacity server is queryable also, meaning that even the ad-hoc folks will find some value in keeping the data online and available.
Lastly, in Spring 2010, as part of the safe-harbor presentation one of the principals at IBM Netezza announced plans for a replication server. I can only imagine that this device will deliver us from any additional hiccups associated with streaming replication that we might now be doing in script or other utility control language.
At Brightlight, our data integration framework (nzDIF) has the nz_migrate techniques built directly into the flow substrate of the processing controller, as well as the enforcement and maintenance of the aforementioned content boundaries. We are actively acquiring and applying the best, most scalable and simplified approaches as a solution framework firmly lashed to a purpose-built machine. I am a big proponent of encouraging Enzees to take on these things themselves, or at least let us coach you on how to make it happen. The solutions are simple because the Netezza platform itself is simplified in its operation. Stand on the shoulders of genius - the air is good up here.
"What's this?" asked the CFO of the Data Warehousing Director, now holding a fresh-off-the-press request for more hardware.
"We've hit capacity," said the Director with a sigh, lowering himself wearily into a padded chair.
"What? That machine is barely a year old! And just last month you claimed it would be two years before we would need more capacity. I didn't allocate any budget for hardware because you said we wouldn't need it!"
"I was wrong. We just implemented an upgrade of the application and it started hammering the machine. We didn't know we had hit capacity. We thought we had enough."
"I've had enough," said the CFO, "I want some answers before I sign off on this."
And so began three weeks of malaise in searching for the right answer. An answer that nobody was particularly trained to find. Throwing in the towel, they sought outside help and eventually discovered some very interesting artifacts.
Now the reader may well assert, ahh, we have enterprise tools that allow us to see the machines on the network and in the operations center. Those tools should tell us everything we want to know. Well, sure, about the Linux host of the machine, but not about the health of the various databases, usage stats, trending, and most certainly not application-level nuances like misapplied or ignored distribution keys, mangled zone maps and tables that haven't been groomed. What's that, the nzPortal could give us some of this? Sure, on a per-machine basis. But the machine admins want to deal with machine health on one level, while the information architects want a completely different, even oblique level of information that may require deep immersion into the logistics of one or more enterprise applications.
What's the real problem here? Logistical complexity. The simple fact that as we add more technologies, applications and functionality around the machine, it becomes the fulcrum for the enterprise. Without management of the activities both in detail and in context, logistical complexity arises. We want to embrace the simplicity of management and administration, and on a purely administrative level (e.g. what the common DBA role fulfills) it's still a part-time job. But this may hide the fact that the application engineers and implementors have been given, or rather been delegated with, the additional responsibility of logistics. Perhaps they didn't realize that implementation and architectural logistics was now on their plate. After all, they only build applications. Deploying, operating and maintaining them has always been someone else's gig. In any other technology besides Netezza, the DBAs make it their business to know the data and its processing nuances. But in those technologies, there is no power in the machine to take us very high, so the possibility of logistical complexity is held in check by the lack of power in the machine. This constraint released, logistical complexity now becomes a very real threat.
A number of years ago these needs reached critical mass and the ninjas of the Enzee Community that had been addressing these issues as onsite consultants finally congealed these capabilities into a platform that can not only see these trends, nuances and capacity issues, it can do it across multiple machines and applications. In the spirit of Netezza's "driving toward simplicity", this platform plumbs the depths of Netezza's more complex interior and serves up the nuggets in visualized, actionable form. It is essentially the application-administrative level of business intelligence - for the machines.
In the aforementioned "submarine hunt" the principals learned an important lesson of very-powerful machines: they can make even the most horrible implementation look stellar. By correcting the discovered issues, they reduced the machine's load by over eighty percent. Imagine recovering eighty percent of a machine's capacity just by applying some simple fixes. One day the machine is overloaded and we're looking at upgrading for non-trivial costs, and the next day the machine is barely breaching twenty percent capacity and we won't be chatting with the CFO anytime soon. Years even.
Some of you have similar stories (I've heard many of them!) and sincerely want a better way to deal with multiple machines and a wide array of applications and users in a more holistic manner, with an eye on what counts in a Netezza machine, not just what a typical database does. This takes administration to a level that simplifies and clarifies the complex in a form that intersects with the language base and nomenclature of the Enzee Universe. More importantly, the Enzee Universe is an ecosystem where many of us have found, through experience of success or pitfall, what works and what most people are asking for (and are wondering why some of the leaders aren't taking the bull by the horns to solve it). Well, many members of the ecosystem have worked around the logistical issues without being privy to the expanding capabilities of core observational needs, codified and implemented by people in-the-trench.
The inception of this observation technology was both brilliant and simple. It would faithfully gather stats from each of the machines as a summarized, transmission-safe extract and drop it to disk. Their on-site consultants could install it, kick it off and then go about their other analytic activities. At the end of a day (or two) they could examine the summaries and these would reveal all. As scheduled operations using the machine's power to process the machine's own statistics, it has a very low footprint and can be brought under workload management. As structured extracts, it was only a matter of time before they would put a pretty face on it. Now it's a dashboard to the inner sanctum of the Netezza machines, from the people who have collectively installed and shepherded more Netezza installations into production than any other, with no close second.
The Observation Deck, from Brightlight Consulting.
Next Tuesday, August 30th, is a Webinar demonstration of this technology and I would like to cordially invite the Enzee Community (and all other interested parties!) to attend and take a deep look at what it can do, and what it could do for you. No pressure and no sales push. The need for this is obvious and we are simply demonstrating that it can be filled and supported. But the participants will be the judge as to its viability. It is the ecosystem's way, after all.
The webinar is on Tuesday, 8/30 at 11 am pacific / 2 pm eastern. You can get more details and sign up for it at: http://advancedmonitoring.eventbrite.com/
For a brief history of why ELT (that is, in-box-data-processing) is even a topic of discussion, we must recognize that the high-powered appliances such as Netezza have not only made such implementation viable, but even desirable.
Just so we level-set on what one of these looks like, it's a SQL statement. Usually an insert/select but can also include updates and the like. Many of you recognize these as multi-stage operations inside a stored procedure. The sentiment of course is that such an animal can perform better inside-the-machine than to take it out of the machine (through an ETL tool), process it, and put it back. This may be true for smaller data sets, but you aren't reading this because you have smaller data sets, are you? Netezza users are big-game hunters. Pea-shooters are for small animals, but if we want big game, we'd better bring a bigger game with us, put on a bigger game face, and bring bigger equipment.
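Just to make the shape of it concrete, here is a minimal sketch of a two-stage insert/select transform; the table and column names are purely illustrative:

-- stage 1: conform and enrich in one set-based pass
INSERT INTO stage2_orders
SELECT o.order_id, o.cust_id, o.order_date, c.region, o.amount * x.rate AS amount_usd
FROM   stage1_orders o
JOIN   customer_dim  c ON c.cust_id  = o.cust_id
JOIN   fx_rates      x ON x.currency = o.currency;

-- stage 2: summarize the conformed set for downstream consumers
INSERT INTO region_daily_sales
SELECT region, order_date, SUM(amount_usd)
FROM   stage2_orders
GROUP BY region, order_date;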
But is it really the size of the equipment? Netezza users know that the size and the architecture are keys to success. Dog-piling a bunch of CPUs together does not a power-frame make.
Okay, back to the storyline here - in-the-box SQL-transforms, in a traditional RDBMS platform, are the realm of small game. Once the game gets bigger, these transforms degrade in performance, and rather rapidly. Watch as a swarm of engineers tries to reconstruct and rebuild the procedures, the SQL, the tables and even upgrade the hardware in a futile attempt to achieve incrementally more power. Emphasis on incrementally-more. Not linearly more.
As they grow weary of this battle, the ETL tools start looking better and better. We reach a critical mass, and the ELT-in-the-box is summarily forsaken as we stand up the ETL tool.
Sometime after this transition, the data volumes once again overwhelm the environment. One thing leads to another, and one day the Netezza machine arrives on the floor. Hopes are high.
But notice the transition above - ELT was forsaken for ETL, likely never to return. But wait, now we really have the power to do the ELT SQL-transforms, but we've mentally and perhaps emotionally (yeah, verily even politically) moved away from SQL-transforms.
Some reading this might scratch their heads and think, What's he talking about? We've been doing ELT in the machines using (PLSQL, etc, name your approach here) for many years. Why would we shy away from it?
Why not use an ETL Tool? I mean, they handle push-down SQL-generation right?
I can perhaps summarize this situation with a single conversation I had with an ETL tool vendor who was hawking this capability for his own tool. After showing me the mechanics of making one of these little jewels operate with aplomb, I asked him, "So what about doing a few hundred of these in a sequence, or branching them into multiple sequences?" The vendor rep looked almost hurt. "Why would you want to do that?"
Well, if we're really talking about migrating transforms into the machine, this can grow into hundreds of operations rather quickly. This situation apparently overwhelms the logistical capability of even the most powerful ETL tool. But I have hope that they will solve this situation. Eventually.
I am not holding out hope that it will happen soon, or voluntarily. These tool vendors have invested millions in the performance boosting of their own products and will not likely toss this investment on the flames of the appliance movement, even if said flames are the exhaust flames of the appliance's rocket engines. This is why I say "eventually". They don't really have a marketable reason to embrace this approach.
Another problem of the ETL tool is that it is so divorced from the appliance's infrastructure that it cannot control the cross-environment logistics. This is especially true of the "virtual transaction" - that is - multiple flows arriving in multiple tables that are each in context of one another, yet are individually shared-nothing operations. If one of the flows should fail, how do we back out of this? Can we do a rollback of the tables where their flows succeeded? No we cannot. We could certainly implement an approach, but this is neither inherent nor intrinsic to the ETL tool. We need a shared-nothing virtual transaction that will control all of the flows in a common context, commit them in that context and faithfully roll back the results in that same context. ETL tools don't go there. Unless we implement the tool that way. As an application. Once implemented, how do we reuse this for the next application, and the next? We can see that it's not part of the tool itself.
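One illustrative way to emulate such a virtual transaction (a hedged sketch of the pattern, not a feature of any ETL tool; the table names and batch_id are assumptions): tag every flow's rows with a common batch identifier, and "commit" or "roll back" only at the batch level.

-- each shared-nothing flow lands its rows under the same batch_id
INSERT INTO orders_final  SELECT s.*, 20110815 AS batch_id FROM orders_stage  s;
INSERT INTO returns_final SELECT s.*, 20110815 AS batch_id FROM returns_stage s;

-- if any flow in the batch fails, back the whole context out of every table it touched
DELETE FROM orders_final  WHERE batch_id = 20110815;
DELETE FROM returns_final WHERE batch_id = 20110815;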
If next-generation "ELT" scenarios are to be successful, they need several very important capabilities that are simply non-optional and non-trivial:
I am sure the visitors of this blog have even more aspects of a "wish list" that they have implemented (perhaps painfully so) and want more control over the data, its processing logistics and error control and recovery. Feel free to add your own comments and suggestions here.
"You hate me don't you?" she whimpered, teary-eyed.
"What, the Lincoln Lawyer? You'll never get Matt to sign on -"
"No attorneys. People don't like to mix attorneys and bloodsucking, no matter how cute it sounds."
"Worked for Angel."
"Or not. He's in syndication now, right?"
"Are you saying," she swallowed hard, "That I'm a high maintenance character?"
"You would. I just don't see how Blade will change anything."
"He has more bandwidth and better technology. Lookit, how do you think Blade got so successful? He could find vamps practically anywhere, no matter how deep they tried to hide themselves."
"Yeah, how did he do that? And why can't we have writers that good?"
"Feature films, Spooki. Different realm altogether. Just ask Bella."
"Bella has hers, I have mine," she sighed, "So how does Blade do it? Haven't our writers gone to every extent to boost the story?"
"Simple. He knows where the vamps aren't. Not at the supermarket. Not at the cinema. Just by process of elimination, Blade can zone-in on his prey anywhere on the map. He's groomed for it. He doesn't have to go to every extent to find what he wants."
"There's just got to be some way we can reclaim our lead in all this. I can't believe we've been eclipsed."
"Hey, reclaim is a thing of the past. We groom our characters now. It's the only way to fly."
DavidBirmingham 2700043KNU Tags:  index partition cluster search netezza maps cbt zone mview 1 Comment 12,208 Views
Once again this entry will serve as an intro to the un-initiated. I get a lot of pings from folks who are new to the stuff and want to know their way around the machine from the beginning. Zone maps are bread-and-butter power tools of the Netezza technology. The companion capabilities are the materialized view and the clustered base table (new to version 6.0).
About three years ago I sat in on a special presentation to our firm from one of Netezza's competitors. They put up some PowerPoint slides to compare their technology to "the others". Their product was a data warehousing product and all the "others" were "OLTP" products. The usual suspects were in this list of course, but oddly, so was Netezza. But Netezza is not an OLTP machine in any sense of the word. How odd that this company did not know enough about its primary rival to discern this simple truth. In addition, this vendor made certain odd claims about Netezza that were either architectural false-assumptions or just misinformation. One of these was: "In a Netezza machine, the more data you add to a table, the slower the table's response will be." This is of course completely false. Netezza tables are like any data storage, with one major takeaway that should not be missed:
Netezza tables may behave inconsistently if the data is not properly configured and deployed. Notice I did not say "if the hardware is not properly ...." but the data itself. Netezza already has the hardware configuration under control and there's nothing we can do to affect it. We can however, align our data with the hardware architecture and will receive the benefit of a consistent experience no matter how much the data grows. It's hard to imagine this kind of consistency on any platform, so stay with me here.
A number of years ago I did some deep immersion in a health-care expert system for Value Health Sciences. The objective was to directly apply and enforce the codes found in the CPT code book used by physicians (those are the codes the doc will check on the sheet he hands to you before he leaves the office). Insurance companies have standard billing/invoicing arrangements with the physicians and these codes are the centerpiece. For VHS, their medical practitioners had set up rules for characterizing (and properly re-characterizing) a claim. For example, if a physician billed codes 1,2 and 4 but he was supposed to have billed these all as 1 and 5 (because code 5 bundles 2 and 4 to reduce cost) the system would have a rule in it that looked something like:
1,2,4 = 1, 5
The difficulty here of course, with some 100,000 codes and over 40,000 rules, is to organize them so that they are quickly accessible. By design, the first code in the rule (e.g. code "1" above) was considered an "index" code. In other words, any and all rules that would use the code "1" as an anchor would also make "1" the first code. This lent itself to the opportunity to "vector" the rules. That is, create an in-memory array, indexed by the given code, with a list of rules handily available at that vector point. So all we had to do was look at the claim (e.g. codes 1, 2, 4) and fetch the various candidate rules from their vector lists (the list for code 1 might have 3 rules, likewise for the others, reducing our total candidate rules from 40,000 to only 9). We could then run the rules and dispatch the claim. When I first received this model, the system was physically running through all 40,000 rules to find the right match. This vectored approach reduced the total rules, but also eliminated guesswork. And with all that, it boosted the power and created a very consistent turnaround time for each patient claim.
The vector list above is a common way in programming to organize information so that - if we have something in our hands already - like a filter or a vector based on a natural key - we can quickly find where we need to start looking and eliminate the places where we are not interested. While this sounds a lot like a common index-structure approach, it's actually a process-of-elimination approach. The two have similar qualities, so I will delineate them here at a high level.
An index is a surgical approach to finding data by logging its record-based keys and a record locator into a separately maintained structure. By finding keys, we find physical records. This is important for transactional systems that need fast record-level turnaround. It is not ideal, however, for analytics or data warehousing, where the vast majority of queries don't care about individual records at all, but about answers that span records. This alternative approach of vectoring is a way to bucket the information so that we're not attempting to specifically locate the record itself, but the bucket where the record may be located - that is, derive a group of candidates upon which we will take further action.
When we think about this, the candidate approach, as opposed to the surgical approach, is the only way to scale anything that has to deal with multiple-record sets (set-based processing). Engines that are geared to examine one record at a time, even if it is to aggregate or perform multi-record operations with the single records they retrieve, cannot compete with an engine that is designed, de facto, to treat the records as multiples from the outset. Transactional versus bulk. The transactional engine is inherently slower and non-scalable compared to the bulk engine (for set-based operations).
I noted in a prior entry that Netezza's anti-indexing strategy heavily leverages "where not to look" because no matter how large a table gets, this "vectoring" or "candidate" approach will guarantee that our response time is very consistent. How does this apply to zone maps?
Recall the diagram with the CPU/disk combinations. Each of these disks is divided into its smallest actionable physical portion (an extent) and the extent then becomes the currency-of-the-realm when we're talking about searching for data. The machine automatically tracks the data that is added to an extent, and when we generate-statistics, every extent is accounted for in a zone map. The map has basic statistics on the columns in the table at the extent level, such as the high/low values for a column in the given extent.
Extent   ColA   ColB   ColC
1        200    A      B
1        300    A      B
1        400    C      D
1        500    E      F
2        200    F      G
2        400    H      I
2        600    J      K
3        100    L      M
3        200    J      O
Note the above layouts of the extents. If we were to look at the high/low values of each, we would find zone maps like this:
Extent   ColA High   ColA Low
1        500         200
2        600         200
3        200         100
So now if we ask a question to the machine based on say, a ColA value of 200, the machine will search all three extents. It has no choice because the zone map is telling it that pertinent data exists in all three.
Now let's physically sort the data by ColA
Extent   ColA   ColB   ColC
1        100    L      M
1        200    A      B
1        200    F      G
1        200    J      O
2        300    A      B
2        400    C      D
2        400    H      I
3        500    E      F
3        600    J      K
Now if we take a look at the zone map, we would see something like:
Extent   ColA High   ColA Low
1        200         100
2        400         300
3        600         500
So that now if we query the table for values of 200, the machine searches only one extent, not all of them. (Now, this is a simple example, because a given value could extend across many extents). Likewise a given disk has thousands of extents upon which to store the data, so it can spread the information out into much smaller, manageable chunks.
The point, however, is that by organizing the data this way, the zone maps reduce the total searchable extents. Keep in mind that we don't have to sort the data each time we want to use it. For many warehouses, the data naturally arrives in this contiguous form (for example, transaction date in a retail model). The Netezza machine will track all integers and dates automatically without us having to do anything additional to enable it. So we might take a penalty once to sort the data at the outset, then benefit from the data's natural arrival and collation.
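A minimal sketch of that one-time penalty, with illustrative table and column names (the sort column is an assumption for the example):

-- re-create the table in sorted order so the zone maps line up, then refresh statistics
CREATE TABLE sales_fact_sorted AS
SELECT * FROM sales_fact ORDER BY txn_date;

ALTER TABLE sales_fact RENAME TO sales_fact_unsorted;
ALTER TABLE sales_fact_sorted RENAME TO sales_fact;
GENERATE STATISTICS ON sales_fact;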
This effectively creates the aforementioned vectored effect. It also guarantees a consistent experience because the data associated with the given value (say 200 above) isn't going anywhere. It will always be in the same location, tracked by the same zone map. Anytime we want to get back to the records associated with value 200, the zone map will serve them up. If the response time last week was 2 seconds, we could quadruple the size of the table and it will still be 2 seconds, because the data itself didn't go anywhere.
Contrast this to other heavy-lifting algorithms in software technologies. The Ab Initio platform, for example, will perform an in-memory Sort that will "intelligently" spill to disk when it runs out of memory. What this means is that it will take a predefined block of memory and attempt to perform all work inside it until it runs out. If so, it will manufacture 16 file locations within which to spill its overflow. These are organized with the same "vectored" approach as the sort keys would dictate. If these 16 files run out of space, they will further divide into 256 files to split the workload. Anyone working with the technology will tell you that this secondary split into 256 is to be avoided if at all possible because it negatively (and significantly) affects the performance of the sort. But we see the difference here. A general-purpose sort in software needs these 256 assets to functionally complete its mission, and adding more means less performance. Now compare this to Netezza which uses not just 256, but thousands of extents to divide the data out - and rather than experience the performance drag, will derive power and lift from this.
So let's apply something additional, especially for highly active tables. If we have a nightly batch window for processing, we likely have all the time we need to keep these zone maps healthy. But if we have an intraday model or even an aggressive micro-batch model, it is possible that the zone maps may be "violated". Note above that as long as the data arrives in contiguous chunks (like transaction date), the data will be set into the extents so that we minimize the total extents for a search. If a transaction date (for example) should appear outside of this contiguous range (say a record from last week's data has arrived), then the high/low range for that particular extent will be extended to include the errant record. It could mean that the given extent is also included in queries that don't relate to either date, simply because of its high/low range. That is, every query looking for this extent's primary transaction date, the errant transaction date, or any date in between them, will include this extent in the candidate list. While this is inconsequential in most cases, if it repeats over time it can slowly degrade the table's performance. Highly active micro-batch and intraday models have a tendency to see these effects if they remain unchecked or unguarded. But not to worry - the solution is simple.
What is daunting about this is that our highly-active-table might be a multi-billion-row beast that does not lend itself to re-sorting and in an intraday model, there's no time to re-sort the data anyhow.
A materialized view has the quality of being able to apply a virtual zone map to a table without physically re-sorting the information. All we have to do is pick the columns we want to zone map, and then form the materialized view with an order-by on those columns. Rebuilding one of these is usually very fast. In addition, we can be rebuilding it even while its prior version is active. At one site, we set up a process to rebuild the view thus:
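A minimal sketch of such a rotation, with illustrative view and table names (assuming the view simply orders on the column we want virtually zone-mapped):

-- build the fresh view while the prior one is still serving queries
CREATE MATERIALIZED VIEW sales_mv_new AS
SELECT txn_date FROM sales_fact ORDER BY txn_date;

-- retire the prior view; it waits for pending queries before it goes away
DROP VIEW sales_mv_prior;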
Note that the few moments during which two materialized views are applied to the table bears no risk at all. And if we drop the original materialized view, it will wait for all pending queries to stop - meaning that this won't break anything the users are doing (another necessity for highly active intraday models).
Important here is that the materialized views are not being used as surrogate tables, but as a means to virtualize the zone maps for faster response time on the table associated with the view. Another important aspect of the materialized view is that it can zone-map additional data types that the regular zone maps do not (e.g. varchars).
Let's say that we have a very active table and even re-creating the materialized view is too much uptake to maintain it. Not to worry, the materialized view has another quality that means we don't have to refresh it each time the table changes. The materialized view maintains its contents until the underpinning table changes. We then have the option to rebuild the view or use it "as is". This "as is" is usually agreeable for the short term, since when data is added to the underpinning table, the materialized view creates an extension to itself. Now whenever we query the table, it will examine the materialized view for the default zone-map information but will also examine the extension. Considering that this extra extension is very small, it is also insignificant in its overhead. So we can keep this in place for, say, all day long and then clean up the materialized view in the evening. The materialized view's performance will not drag over the course of a day unless we do some serious re-churning of the underpinning data (which is not normal for common micro-batch models).
And now - Clustered Base Tables
With all the above preliminary stuff out of the way, a discussion on CBTs is fairly straightforward. As of the 6.0 release, Netezza has provided this as a multi-dimensional capability for even more powerful searches. The data in the CBT is physically organized into extents based on keys. And when we perform a groom operation, the data is physically reshuffled so that the keys are physically co-located on the fewest extents possible. Like a self-enforcing zone map, only a lot more powerful. In a zone-map approach, the primary way to get the data contiguous is to re-sort the data, and the zone maps fall out for free. In a CBT, we organize the data and then execute a groom. The initial groom will take a while, but each subsequent groom will take inconsequential time. The data in the CBT can be organized on more than one key, and the keys can behave independently, like a multi-dimensional effect.
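A minimal sketch of defining and grooming one (6.0 syntax; the table and key names are illustrative):

CREATE TABLE claims_cbt (
    claim_id     BIGINT,
    member_id    BIGINT,
    provider_id  BIGINT,
    claim_date   DATE
)
DISTRIBUTE ON (claim_id)
ORGANIZE ON (member_id, claim_date);   -- the clustering keys

GROOM TABLE claims_cbt;   -- physically co-locates the keys on the fewest extents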
Case in point - let's say we have 200 extents of data and the particular set of data we want spans only 10 of those extents (a zone map). We will ignore the other 190 extents and focus all the machine's energy on the 10 where the data is located. But wait, what if we have an additional key upon which to search, and the combination of the first key (10 extents) and second key reduces the total extents to only 2? That is, the first key appears in 10 extents but the first and second key appear in only 2 extents of those 10. Now we search only 2 extents. Again, we have organized and configured data, not hardware, to achieve the goal. We have told Netezza where-not-to-look and now we have zoomed in on a mere 2 extents out of 200.
But let's say that we add a lot more data to the table over the year, expanding these 200 extents to many hundreds if not thousands more. Can we see now that even if this happens, we can go back to the table again and again, ask the same question as before, and only 2 of the extents will be used to search for the data? This of course dispels the vendor-promulgated propaganda that Netezza tables get slower when we add more data.
DavidBirmingham 2700043KNU Tags:  spu netezza distribution parallel skew 36 Comments 26,044 Views
One of the most significant questions to answer on the Netezza platform is that of "distribution" - what does it mean? how does it apply? and all that. If you are a Netezza aficionado, be warned that the following commentary serves as a bit of a primer for the un-initiated. However, feel free to make comments or suggestions to improve it, or pass it on if you like.
In the prior entry I offered up a simple graphic (right) as to how the Netezza architecture is laid out internally. In this depiction, each CPU is harnessing its own disk drive. In the original architecture, this combination was called a SPU, because the CPU was coupled with a disk drive and an FPGA (field-programmable gate array) that holds the architectural firmware "secret sauce" of the Netezza platform. The FPGA is still present and accounted for in the new architecture as the additional blade(s) on the blade server. So as not to over-complicate things, let's look at the logical and physical layout of the data itself, and this will be a bit clearer.
When we define a table on the Netezza host, it logically exists in the catalog once. However, with the depicted CPU/disk combinations (right) we can see 16 of these available to store the information physically. If the table is defined with a distribution of "random", then the data will be evenly divided between all 16 of them. Let's say we have a table arriving that has 16 million records. Once loaded, each of these SPUs would be in control of 1 million records. So the table exists logically once, and physically multiple times. For a plain straight query on this configuration, any question we ask the table will initiate the same query on all SPUs simultaneously. They will work on their portion of the data and return an answer. This Single-Instruction-Multiple-Data model is a bread-and-butter power strategy of Massively Parallel computing. It is powerful because at any given moment, the answer we want is only as far away as the slowest SPU. This means the total duration of the query is only as long as it takes a single SPU to scan its 1 million records. All of the SPUs move at this speed, in parallel, so they finish at the same time.
What if we put another table on the machine that contains, say, 32 million rows? (Trying to keep even numbers here for clarity.) The machine will load this table such that each SPU contains 2 million rows. If no other data is on the machine, we effectively have 3 million records per SPU loaded, but the data itself may be completely unrelated. Notice how we would not divide the data another way, like putting the 16 million records on 6 of the SPUs and the 32 million records on the remaining 10 SPUs. No, in a massively parallel context, we want all the SPUs working together for every query. The more SPUs are working, the faster the query will complete.
Now some understand the Massively Parallel model to be thus: multiple servers, each containing some domain of the information, and when a query is sent out it will find its way to the appropriate server and cause it to search its storage for the answer. This is a more stovepiped model than the Netezza architecture and will ultimately have logistics and hardware scalability as Achilles' heels, not strengths, for parallel processing. Many sites are successful with such models. But they are very application-centric and not open for reuse by other, unrelated applications. I noted above that the information we just loaded on the Netezza machine can be in unrelated tables, databases and application suites, because the Netezza machine is general-purpose when it comes to handling its applicability for data processing. It is purpose-built when we talk about scale.
But let's take the 16-million-record vs 32-million record model above and assume that one of the tables is a transactional table (the 32 million) and one is a dimensional table (the 16 million). The dimensional table contains a key called "store_id" that is represented also on the transactional table such that we can join them together in various contexts. Will the Netezza machine do this, and how will it? After all, the data for the tables is spread out across 16 SPUs.
Well, we have the option of joining these "as is" or we can apply an even more interesting approach, that of using a distribution key. Here's where we need to exercise some analysis, because it appears as though the STORE_ID is what we want for a distribution, but this might skew the data. What does this mean? When we assign a distribution, the Netezza machine will then hash the distribution key into one of 16 values. Every time that particular key appears, it will always be assigned the same value and land on the same SPU. So now we can redistribute both of these random tables on STORE_ID and be certain that for say, SPU #5, all of the same store_id's for the dimension table and for the transaction table are physically co-located on the same disk. You probably see where this is going now.
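A minimal sketch of that redistribution, using hypothetical table names and the create-table-as-select form (the tables could just as easily have been created with the key up front):

-- rebuild each randomly-distributed table with an explicit distribution key
CREATE TABLE store_dim_d AS SELECT * FROM store_dim DISTRIBUTE ON (store_id);
CREATE TABLE tran_fact_d AS SELECT * FROM tran_fact DISTRIBUTE ON (store_id);
-- every row carrying a given store_id now hashes to the same SPU in both tables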
Ahh, but what if we choose store_id and it turns out that a large portion of the IDs hash to a common SPU? What if we see, rather than 1 million records per SPU, we now see around 500k per SPU with one SPU radically overloaded with the remainder? This will make the queries run slow, because while all the SPUs but one will finish fast, in half the time of before, they will all be waiting on the one overworked SPU with all the extra data. This is called data skew and is detrimental to performance.
However, had we chosen a date or timestamp field, our skew may have been even worse. We might see that the data is physically distributed on the SPUs in a very even form. But when we go to query the data, we will likely use a date as part of the query, meaning that only a few SPUs, perhaps even one SPU, will actually participate in the answer while all the others sit idle. This is called process skew and is also detrimental to performance.
Those are just some things to watch out for. If we stay in the model of using business keys for distribution and using dates for zone maps (which I will save for another entry) we will usually have a great time with the data and few worries at all.
Back to the configuration on store_id's. If we now query these two tables using store_id in the query itself, Netezza will co-locate the join on the SPU and will only allow records to emit from this join if there is a match. This means that the majority of the work will be happening in the SPU itself before it ever attempts to do anything else with it. What if we additionally clamped the query by using a date-field on the transaction table? The transaction table would then be filtered on this column, further reducing the join load on the machine and returning the data even faster.
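A hedged sketch of that query shape, with hypothetical table and column names:

-- store_id is the distribution key, so the join co-locates on each SPU;
-- the date predicate further clamps the transaction scan via zone maps
SELECT d.store_id, SUM(t.amount) AS total_sales
FROM   tran_fact_d t
JOIN   store_dim_d d ON t.store_id = d.store_id
WHERE  t.tran_date BETWEEN '2011-06-01' AND '2011-06-30'
GROUP  BY d.store_id;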
So here is where we get lifting effects where other machines experience drag. For an SMP-based machine, adding joins can make the query run slower. On a Netezza machine, joins can serve as filter-effects, limiting the data returned by the query and increasing the machine's immediate information on where-not-to-look. Likewise the date-filter is another way we tell the machine where-not-to-look. As I have noted, the more clues we can give the machine as to where-not-to-look, the faster the query will behave.
I have personally witnessed cases where adding a join to a large table actually decreases the overall query time. Adding another one further decreases it. Adding another one even further and so on - five-joins later we were getting returns in less than five seconds where the original query was taking minutes. This filter-table and filter-join effect occurs as a free artifact of the Netezza architecture.
Distribution is simply a way to lay the data out on the disks so that we get as many SPUs working for us as possible on every single query. The more SPUs are working, the more hardware is being applied to the problem. The more hardware, the more the physics is working for us. Performance is in the physics, always and forever. Performance is never in the software.
While the above is powerful for read-query lift (the co-located read), there is another power-tool called the co-located write. If we want to perform an insert-select operation, or a create-table-as-select, we need to be mindful that certain rules apply when we create these tables. For example, if we want to perform a co-located read as above and drop the result into an intermediate table, it is ideal for the intermediate table to be distributed on the same key as the tables we are reading from. Why is this? Because the Netezza machine will simply read the data from one portion of the disk and ultimately land it on another portion of the same disk. Once again it never leaves the hardware in order to produce the answer. If we were to distribute the target table on another key, the data would have to leave the SPU to find a new SPU home that aligns with its distribution key. If we actually need to redistribute, this is fine. But if we don't, and this is an arbitrary configuration, we can buy back a lot of power just by co-locating the reads and writes for maximum hardware involvement.
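A sketch of the co-located write, again with hypothetical names; the only point is that the intermediate table carries the same distribution key as the tables feeding it:

-- the target inherits the sources' distribution key, so each SPU reads from
-- and writes to its own disk without pushing data across the fabric
CREATE TABLE tran_work AS
SELECT t.store_id, t.tran_date, t.amount
FROM   tran_fact_d t
JOIN   store_dim_d d ON t.store_id = d.store_id
DISTRIBUTE ON (store_id);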
So that's distribution at a 50,000 foot level. We will look at more details later, or hop over to the Netezza Community to my blog there where some of these details have already been vetted.
It is important to keep in mind that while distribution and co-location are keys to high(er) performance on the machine, the true hero here is the SPU architecture and how it applies to the problem at hand. We have seen cases where applying a distribution alone has provided dramatic effects, such as a 30x or 40x boost, because of the nature of the data. This is not typical, however, since co-located reads will usually provide only a percentage boost rather than an X-level boost. This is why we often suggest that people clear their heads of the urge to jump directly into a distribution discussion and instead put together a workable data model with a random distribution. Once loaded, profile the various key candidates to get a feel for which ones work best and which ones do not. We have seen some users struggle with the data only because they prematurely selected a distribution key that - unbeknownst to them - had very high skew and made the queries run too slow. This protracted their workstreams and made all kinds of things take longer than they should have.
So at inception, go simple, and then work your way up to the ideal.
On multiple distribution keys:
Many questions arise on how many distribution keys to use. Keep in mind that this is as much a physical choice as a functional one. If the chosen key provides good physical distribution, then there is no reason to use more keys. More granularity in an already good distribution has no value. However, a distribution key MUST be used in a join in order to achieve co-location. So if we use multiple keys, we are committing to using all of them in all joins, and this is rarely the case. I would strongly suggest that you center on a single distribution key and only move to more than one if you have high physical skew (again, a physical, not functional, reason). The distribution is not an index - it is a hash. In a multi-key distribution, the join does not first look at one column and then the next - it looks at all of the keys at once because they are hashed together for the distribution. Joined-at-the-hip, if you will.
On leaving out a distribution key:
One case had three tables joining on their main keys to get a functional answer. They were all distributed on the same key, which was a higher level of data than the keys used in the join. The developer apparently thought that because the distribution keys are declared, the machine will magically use them behind the scenes with no additional effort from us. This is not the case. If the distribution key(s) (all of them) are not mentioned in the join, the machine will not attempt co-location. In this particular case, using the higher-level key would have extended the join somewhat but would not have changed the functional answer. Simply adding this column to the join reduced that query's duration by 90 percent. So even if a particular distribution key does not "directly" participate in the functional answer, it must directly participate in the join so as to achieve co-location. And if this does not change the functional outcome, we get the performance and the right answer.
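To make that fix concrete, here is a hedged before-and-after sketch with hypothetical keys (acct_id standing in for the higher-level distribution key):

-- before: co-location is lost because the distribution key never appears in the join
SELECT t1.detail_key, COUNT(*)
FROM   t1 JOIN t2 ON t1.detail_key = t2.detail_key
GROUP  BY t1.detail_key;

-- after: mentioning the distribution key in the join enables co-location
-- without changing the functional answer
SELECT t1.detail_key, COUNT(*)
FROM   t1 JOIN t2 ON t1.detail_key = t2.detail_key
                 AND t1.acct_id    = t2.acct_id
GROUP  BY t1.detail_key;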
How it affects concurrency:
Many times people will ask: why is it that the query runs fast when it's running alone, but when it's running side-by-side with another instance of itself they both slow to a crawl? This is largely due to the lack of co-location in the query. When the query cannot co-locate, it must redistribute the data across the inter-blade network fabric so that all the CPUs are privy to all the data. This quickly saturates the fabric, so that when another query launches, they start fighting over fabric bandwidth, not CPU bandwidth. In fact, some have noted that the box appears to be asleep because the CPUs and drives aren't doing anything during this high-crunch cycle. That's right, and it's a telltale sign that the machine's resources are bottlenecked at the fabric and not at the CPU or I/O levels. By co-locating the joins, we keep the data off the fabric and down in the CPUs where it belongs, and the query will complete faster and can easily co-exist with other high-intensity queries.
Two words: load balancing -
Is our chosen platform designed for set-based bulk processing, or load balancing? Both at the software and hardware levels? The load-balancing engine (and attendant SMP hardware) are simply the wrong architecture for large-scale bulk processing. There's no way to "properly" configure the wrong architecture.
Let's level-set the difference here. In an SMP-based scenario, our engineers have to carefully configure the hardware to garner the very best performance from it. We don't have that option in Netezza, because the hardware is pre-configured. Rather in Netezza, we gain power by how we organize and configure the data. We don't really have this option in an SMP-based model, because the database engine software pre-defines how we will organize the information (through index structures) and we cannot affect our fate without the indexes. Let's see the contrasts summarized:
Netezza - no indexes, no hardware config, performance is derived from data configuration
SMP machine - high hardware config, index-dependent, no ability to affect performance with data configuration
In short, the two performance-tuning models are not only polar opposites; Netezza is also far more adaptable and flexible, because it is easier to reconfigure data than to reconfigure hardware.
I am continually impressed with the valiant attempts of various platform aficionados who assert, claim and champion the notion that "properly" configured SMP-based hardware is the only issue in evaluating competitive performance between platforms. In short, a properly configured <name your platform here> is just as viable as the IBM Netezza platform. Just name your components and off you go.
Of course, most folks who are making these claims are not hardware aficionados at all. Now, I appreciate software folks because at heart I am one of them, but I cut my teeth as a software engineer on solutions that aligned high-powered hardware with other high-powered hardware, all the while respecting the fact that the software I was creating was actually orchestrating and controlling the interaction between these gravity-bending machines, not physically moving the "payload" as it were. Nothing, absolutely nothing, could move the data faster than the hardware. We inherently know this, yet many seem to think that software products can overcome this issue by using RAM and other creative methods to accelerate the effect of software operation.
So before I launch into a more complex rant on this, a picture is worth 1000 words (at least). In Netezza Transformation I offered up some graphics for contrast and compare (and alluded to them in another blog entry here).
In the depiction (right) we have CPUs (on the top) and disk drives (on the bottom). The pipeline in between them is the general-purpose backplane of the hardware configuration, which may include a SAN interface, optical or 10GigE networking, or some other connection mechanism to transfer data between the server's CPUs and the SAN's disk drives. Even if these disk drives are local on the machine containing the CPUs, this backplane is still the connector between them.
Now we will load a data file containing 100 billion rows of information, some 25 terabytes in size. This is a medium data size for big-data aficionados. The data will necessarily enter the machine from an external network connection, into the software engine (running on the CPUs), which will deliver the data onto the assigned location of the disk drives. Seems like a very simplistic and remedial explanation, doesn't it?
Now we will query this data. Our ad-hoc and reporting users will want to scan and re-scan this information. Note how the bottleneck is now actually the hardware itself. The data must be pulled in total from the disk drives, through this backplane and into CPU memory before it can be examined or qualified. Even if we use index structures, the more complex the query, the more likely we will encounter a full-table scan. How long, do we suppose, would it take for this configuration to scan 25 terabytes? (Keep in mind that all 25 terabytes have to move through the backplane.)
A server-level MPP model would suggest placing two of these configurations side-by-side and coordinating them. In short, one of the server frames would contain some portion of the information while another frame contained the other. We could imagine placing multiples of these side-by-side to create an MPP grid of sorts. This is the essential secret-sauce of many Java grids and other grid-based solutions. Divide the data across a general-purpose grid and then coordinate the grid for loading and querying.
But notice how deeply we are burying the data under many layers of administered complexity. Sure, we can do this, but is it practical and sustainable? I've seen setups like this that served one application (and served it well), but it was an inaccessible island of capability that served no other masters. As general-purpose as all of its parts were, it had been purpose-built and purpose-deployed for a single solution that required the most heavy lifting at the time of its inception. Now that it is in place, other solutions around it are growing in their need for the capacity the grid provides, and none of them have access to the power within it. The grid becomes starved from the outside-in. No satellite solution can feed it or consume from it at the rate it can process data, and it has no extensibility to support their processing missions.
So now we come full circle, we have a "properly configured" one-trick-pony. Over time, the expense and risk of this pony will become self-evident. Parts will break. Data will get lost. Lots of moving parts, especially general-purpose moving parts that are out-in-the-open, only increases the total failure points of the entire solution. Debugging and troubleshooting at the system level become matters of high-engineering, not simple administration. As noted above, in the environment where I cut-my-teeth, I was surrounded by these high-end engineers because it was the nature of the project. I noticed that once the project went into a packaging mode to deploy and maintain, these engineers moved-on and cast their cares on the shoulders of the junior engineers who backfilled them. This was a struggling existence for them, because the complexity of the solution did not lend itself to simple maintenance by junior engineers.
The Caractacus Potts adventure begins! You know Potts from the movie Chitty-Chitty-Bang-Bang. We thrilled to his inventions, and laughed when they did not deliver. A "simple" machine to crack eggs, cook them and deliver-up breakfast worked fine for everyone but him, delivering a plate of two uncooked eggs still in their shells. The puzzled look on his face told us he recognized the problem but did not know where to look for resolution. With so many moving parts, it could be anywhere. This is a classic outcome of "eccentric innovation" and "eccentric engineering" a.k.a "skunkworks". More importantly, the innovation only solves one problem (e.g. egg-centric breakfast) not a general-purpose solution platform.
Well, let's keep it simple - what about a simple summary report? You know, national sales data summarized to the national, regional, district and store levels? Wouldn't this require a complete scan of the table to glean all of this? In the SMP-based depiction (above), how could we expect such a scan to operate? Software would pull the data in total from the disk drives, choking the backplane. Software would then summarize the aforementioned quantities, in memory if possible, and then deliver the result. Frankly, such a report could take many hours to execute, and keep the machine busy the whole time. Even if we "grid" this, the query would swamp the grid. On a tricked-out Sun E10k, we have about 12 GB/second of throughput. Putting some math to this, with 25 terabytes in tow, we could expect the table scan to complete in about 30 minutes even if the full complement of 64 processors is on board, because the solution is I/O-bound, not CPU-bound (no new CPUs will make it run faster). However, in reality the software engine and all its overhead drain the energy from the machine, and this query will run for hours, even if it's the only operation running on the machine. So I guess we really will need more CPUs to balance off the software drain of the engine itself. (Sigh.)
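As a rough sanity check on that figure (treating 25 terabytes as 25,000 GB): 25,000 GB / 12 GB/s is roughly 2,100 seconds, or about 35 minutes - the same ballpark as the 30-minute estimate, and that is the theoretical floor with the backplane fully saturated, before the software engine's own overhead is added.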
This is because: engines that run on SMP-based devices are inherently load-balancing engines, not bulk-processing engines. Their processes stop, negotiate and resume even if there's nothing else going on. Think of it like this: Where I live in the country, at 5am in the morning all of the traffic lights on the main road blink-yellow until about 6am. If I travel on that road before then, I can go the speed-limit for over half-an-hour before hitting the first traffic light in the next (larger) town. But if those traffic lights all operated normally, I could get stopped at each one, protracting my 30-minute journey by orders of magnitude as I wait on traffic lights even when no other traffic is present. SMP-based engines automatically thread this way, where a flow-based model does not. A load-balancing engine will force all of its processes to stop at a virtual traffic light, come up for air to make sure nothing else requires attention, then go back to work. Transactional models absolutely require this but it is anathema to bulk processing.
Now we contrast this with IBM Netezza, which is a purpose-built platform intended for general-purpose use across all solutions requiring such a platform. We don't have a one-trick pony. This means (ultimately) any form of data warehouse or bulk-processing solution, but more importantly anything that requires fast retrieval of data, especially while performing high-intensity, on-demand analytics on it.
In the IBM Netezza architecture (depicted right) each CPU has a shared-nothing disk drive and its own RAM. On the original Mustang Series 10400 (with 432 of these on board) we have a machine that costs far less than the Sun E10k noted above. Likewise we could scan those 100 billion rows in less than ten minutes, and it won't ever take any longer than ten minutes. If we boost the CPUs of the machine, say by adding another frame (200+ CPUs) to it, it will boost the machine's speed by another 50 percent. Queries that took 6 minutes will now take less than 4 minutes. It is a deterministic/predictable model, and adding more frames to the Netezza platform is simple and inexpensive compared to the sheer labor dollars of eccentric engineering.
As for the depiction (right) it has 16 CPU/Disk combinations. To use round numbers, let's say we put 1.5 terabytes on each disk for a grand total of 24 terabytes. With this configuration, for any given query, the path to conclusion is only 1.5 terabytes worth of scan-time away. Once we initiate the query, each CPU will run independently and will scan its 1.5 terabytes. All of them will complete simultaneously, meaning that the total duration of the query was no longer than it took to scan the 1.5 terabytes (they are all scanning in parallel). Now boost this to 400 CPUs, where each one now only has about 63 gigabytes share of the load. One scan of the entire 24TB table takes no longer than the time to scan 63 GB (they are all in parallel). We can measure our disk read-speed here and get very consistent estimates of how long a query should take.
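As a rough worked estimate - assuming, purely for illustration, a sequential read rate of around 110 MB/s per drive (a figure not taken from the configuration above) - 63 GB per SPU works out to roughly 63,000 MB / 110 MB/s, or about 570 seconds, call it 9-10 minutes. Since all SPUs scan in parallel, that is also the whole query's scan time, consistent with the less-than-ten-minutes figure mentioned earlier.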
Also keep in mind that (in a prior blog entry on indexes) I noted that we can radically reduce these operations to a fraction of their total scan times. But in the example above, a full summary of the data on sales boundaries, how much is that worth if we could do the sales based on a date? Or based on a range of dates? Perhaps even comparing last year's Independence-Day sales to this year's?
In an SMP-based configuration, the information engineers would suggest partitions. The partition (for an SMP engine) is an artificial performance prop that anticipates the user's query needs based on known use-cases. It bundles data (say on a date boundary) so queries against that date can be fenced by the partition boundary. The Netezza zone map, on the other hand, automatically tells the machine where to look, and where not to look, to go capture the information required by the query. No props, no use-case anticipation, just the flexibility we really need if we want to keep fire-breathing users happy without special engineering to anticipate their needs.
Zone maps allow the sales-related comparison above to arrive in mere seconds at the user's fingertips. An SMP machine, at best, even with partitions, indexes and other performance props, will require a maxed-out power frame (all CPUs on deck) and the best anticipatory information engineering to provide a consistent experience that even hopes to compete with Netezza. Even after all that, it won't come back in seconds, and it won't provide the nimble flexibility so sought-after by even the average data analyst.
The conclusion is that the overall cost of deployment, ownership and ease of maintenance for a Netezza machine utterly eclipse the potential promise of SMP-based solutions. For an analyst, all columns, tables and functions are "fair game" for query - all over the database, 24x7. A Netezza machine provides just that. On an SMP-based engine, the analyst has to agree with information engineers on their entry points and usage patterns, and these have to be engineered into the model in order to support the users. Once engineered, the solution will support only that user base. All other user bases will require their own engineering model. This is not sustainable, durable or manageable, which is why those who are steeped in it will gladly embrace a Netezza machine. Value is recognized on so many levels.
DavidBirmingham 2700043KNU Tags:  anti-indexing zone netezza index distribution maps indexes 10,139 Views
As far back as the 2007 Enzee Universe conference in Boston, people have scratched their heads at the notion of a large-scale database that actually boasts about the absence of index structures. After all, indexes are the mainstay of relational databases and we simply can't get by without them, right? This is a simple example of how the power and architecture of the technology frees us to think about data loading and storage, data retrieval and in-the-box processing in a completely different way.
Firstly, it's no rumor that Netezza has no indexes. And for those of us who can't stand dealing with them, this is a huge plus. One of the Enzees at the 2011 conference asked point-blank, "What does Netezza have that Oracle does not?" The clear answers will arise in this and later blog entries, but for now, the absence of index-structure maintenance will do just fine. What does Netezza have?
Or rather, the vendor has poured enormous thought and labor into the simplification of complex matters so that we don't have to deal with them. We can get to the business of - well - the business: the application data and information we want to spend more time with, if only we weren't dealing with the technical administration of the database engine. This is one of the greatest weaknesses of the traditional relational database when it comes to data warehousing in general and bulk processing in particular. It is also one of its greatest weaknesses in the space of analytics. After all, we want to apply the analytics by selecting subject areas and data sets to analyze, but if we have to stand up lots of infrastructure to support this mission we are regressing toward functional hardening rather than functional flexibility. Brittleness ensues and someone asks for a refactor, a re-architecture or whatever. When this request arrives in relation to a complex, engineered, multi-terabyte (or tens-of-terabytes) data store, many will see it as a mind-numbing proposal. For most infrastructures, the weight and complexity of the installation has overwhelmed the logistical capacity of the humans to ever hope of reeling it back in. There's not enough power in the machine to help. This could take a very long time to remediate.
Not so with an appliance. Such big-data issues are its bread-and-butter zone. If we have issues with a particular data model or information construct, the machine has the power (and then some) to get us out of the bad direction and into the new one. How did the direction get "bad" in the first place? The business changed its information priorities since the inception of the original model, and now the original no longer serves it well. The business has moved, because that's what businesses do. The analysts found opportunities in the nuggets of information-gold, and now want direct access to that gold rather than having to navigate to it or salivate while waiting for it.
One of the core features of the Netezza platform is how the data is distributed on the disk drives. Because it is an MPP, each of the disk drives is its own self-contained, shared-nothing hardware structure. Contrast this to the common SMP-based platform where the CPUs exist in one duck pond and the disk drives exist in another duck pond. The data is then couriered between duck ponds through throttled estuaries. If we issue a query, all of the data has to be pulled through this estuary and presented to the CPU ducks so that the data can be manipulated and examined in software. It is a software engine on a general-purpose platform.
However, imagine the Netezza platform, where each Intel CPU is mated with special hardware (an FPGA), a bounty of RAM to manipulate and cache data, a self-contained Linux operating system, and its own dedicated disk drive. Imagine also that the disk drive is itself divided into multiple sections, where the inner sections are used for internal processes like RAID, temp space and system data, but the outermost ring is where user data is stored, offering the benefit of the fastest disk speed for the oft-accessed information. All of these little attention-to-detail aspects cause each of the CPUs to run at a much more powerful factor than their common SMP counterparts. Why? Because their data is co-located with the CPU, whereas with the SMP we have to drag the data out of one duck pond to get it close to the CPUs that may operate on it. And with SMP there is no such thing as dedicated CPU-to-disk access. The SMP CPUs are shared-everything, and so are the disk drives - shared-everything at the hardware level but shared-nothing at the functional-logical level. Netezza's CPUs are shared-nothing at the functional-logical level and shared-nothing at the CPU/disk level.
In Netezza, let's imagine putting 100 of these CPU/disk combinations to work. When we form a table on the host, the table exists logically in the catalog in one place. But it exists physically in 100 places. If we load 100 million records into the machine distributed randomly, then each of the CPUs will "own" 1 million of those records. If we issue a query to the table, each of the CPUs will operate only on the million records it can see. So for any given query, we are not serially scanning 100M records. We are scanning 1M records, 100 times over, in parallel.
Now some may object that we're still scanning the data end-to-end, and for plain-vanilla queries this is true. However, I recall the first time I performed a join on two tables that were 100M records each, doing a one-to-one join on unique keys, on a machine with 24 of these CPUs, and the answer returned in 13 seconds. This was a plain-vanilla test, so your actual mileage may vary. However, I did the same test on the same machine with 1B records joined to 1B records and the answer returned in less than 300 seconds. Imagine attempting this kind of join on a traditional relational database and expecting a return in any time frame that would let us actually use the result. (And no, I did not use co-location for this test - a matter for another blog entry.)
We have an additional "for free" aspect of the machine called a zone map. If data is laid down in contiguous form (the way transactions arrive in a retail point-of-sale model for example) then the machine will automatically track these blocks and keep their locations in a "zone map". If we then query the database with a given date, the machine will ignore the places where it knows the data is not, and use the zone maps to locate the range of data where it needs to search.
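A minimal sketch of how that plays out in a query (table and column names are hypothetical); nothing special is declared - the date predicate alone lets the zone maps exclude extents:

-- zone maps track min/max values per extent for columns like tran_date,
-- so extents whose range falls entirely outside the predicate are never read
SELECT SUM(amount)
FROM   pos_tran
WHERE  tran_date BETWEEN '2011-07-01' AND '2011-07-04';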
As an example of this, we know of a 150-billion-row table that is over 25 terabytes in size, distributed and zoned such that over 95 percent of its queries return in 5 seconds or less. For a traditional RDBMS, it would take more time than that just to get its index scans squared away so that it could even approach the storage table. This is also why the Netezza machine itself can be scaled into the petabyte zone while maintaining the same performance. No indexes are in the way to load it. No indexes are necessary to search it. Imagine now: the two most daunting aspects of data warehousing and analytics - loading the data and keeping up with the user base - have been washed away with the elimination of indexes. (Don't we turn indexes off in a traditional SMP database so it will load faster, and don't we chase a rainbow trying to index and re-index the tables to keep up with user needs?) Not with Netezza. Without indexes on the columns, all columns are fair game for searching. Without indexes in the way of loading, we can deliver information into the machine, and reprocess it while it's there, with no penalty from the use of indexes.
By this measure, this is an anti-indexing strategy because Netezza operates on the basic principle of where-not-to-look. In other words, if we can tell it where the data is not, it can zone-in and find the data. When we think about it, this is how a common brick-and-mortar warehouse works. If we showed up and asked for a box of nails, the attendant knows that he doesn't have to look in the parts of the warehouse that carry lawn equipment, hammers, saws or window draperies. He knows where the nails don't exist.
Contrast this to the common SMP-based indexed database, which uses exactly the opposite approach. The indexes are searched for the specific key and then this key is applied to the primary data store. This is why indexed structures in general cannot scale in the same manner as an anti-indexed structure. Keep in mind, with a Netezza platform it won't matter how much data we put on that 25-terabyte table. We could double, triple it or more - and it would always provide the answer in a consistent time frame. This is because from query-to-query - it's still not-looking in the same places, no matter how big those places get. It will still continue to ignore them because the data's not there and it already knows that.
I've had people tell me that there is no real difference in the SMP versus the MPP. The SMP, "properly configured" (they claim), is just as good as any old MPP. However, there is no "proper" way to configure a general-purpose hardware architecture so that it will scale. The only way we could hope to coordinate these CPUs is in software, and the only way we can get data into the CPUs is by accessing the shared disk drives in another duck pond on completely different hardware. The SMP configuration is by definition the wrong configuration for scalable bulk processing of any kind, so there is no way to "properly configure" something that is already the inappropriate storage mechanism. This would be no different than claiming that a "properly configured" VW Bug (Herbie notwithstanding) could be just as fast as a stock car. The VW Bug is not the wrong platform for general-purpose transportation. But it's the wrong platform for a model requiring high scale and high performance, just as an SMP-based RDBMS cannot scale with the same factor (for set-based bulk processing) as a machine (Netezza) specifically built for this mission. Only a purpose-built machine can ultimately harness this level of data, and only an appliance can remove the otherwise mind-numbing complexity of keeping it scalable.
In the next weeks leading up to IOD (where I will be speaking on most-things Netezza) I'll offer up some additional insights on the internals of the architecture and how it differs from the traditional platforms.
This blog entry is presented to invite comments and feedback on the Enzee Universe Best Practices discussions and content for the 2011 Best Practices sessions on Monday June 20th and Tuesday June 21st 2011.
The long-awaited "SQL" to Netezza Underground has hit Amazon.Com. Netezza Transformation
This book tackles some deeper issues around transforming our data warehouse, our approaches and even our thinking to align with the arrival of our brand-new appliance
No, it's not the appliance we'll transform, but that the appliance will transform us. Once again, a little tongue-in-cheek irony
The book offers working examples of the stuff people ask me about most often, like ELT/SQL-Transforms, checkpointing, exception handling (primary/foreign key), windowing (SCDs and deduplication) as well as a cookbook on more details to watch for in a migration project.
And of course, it is replete with Case Study Shorts and a whole chapter on Case Studies. There's also an appendix at the end to offer up some simple scripting jump-start routines that can make bash so much easier.