Modified on by DavidBirmingham
One of the most significant questions to answer on the Netezza platform is that of "distribution" - what does it mean? how does it apply? and all that. If you are a Netezza aficionado, be warned that the following commentary serves as a bit of a primer for the un-initiated. However, feel free to make comments or suggestions to improve it, or pass it on if you like.
In the prior entry I offered up a simple graphic (right) as to how the Netezza architecture is laid out internally.
In this depiction, each CPU is harnessing its own disk drive. In the original architecture, this was called a SPU, because the CPU was coupled with a disk drive and an FPGA (field-programmable gate array) that holds the architectural firmware "secret sauce" of the Netezza platform. The FPGA is still present and accounted for in the new architecture as the additional blade(s) on the blade server. Not to over-complicate things, so let's look at the logical and physical layout of the data itself and this will be a bit clearer.
When we define a table on the Netezza host, it logically exists in the catalog once. However, with the depicted CPU/disk combinations (right) we can see 16 of these available to store the information physically. If the table is defined with a distribution of "random", then the data will be evenly divided between all 16 of them. Let's say we have a table arriving that has 16 million records. Once loaded, each of these SPUs would be in control over 1 million records. So the table exists logically once, and physically multiple times. For a plain straight query on this configuration, any question we ask the table will initate the same query on all SPUs simultaneously. They will work on their portion of the data and return it an answer. This Single-Instruction-Multiple-Data model is a bread-and-butter power strategy of Massively Parallel computing. It is powerful because at any given moment, the answer we want is only as far away as the slowest SPU. This would mean as fast as the SPU can scan 1 million records, this is the total duration of the query. All of the SPUs will move at this speed, in parallel, so will finish at the same time.
What if we put another table on the machine that contains say, 32 million rows? (trying to keep even numbers here for clarity). The machine will load this table such that each SPU will contain 2 million rows each. If no other data is on the machine, we effectively have 3 million records per SPU loaded, but the data itself may be completely unrelated. Notice how we would not divide the data another way, like putting 16 million records on 6 of the SPUs and the other 32 million records on the remaining 10 SPUs. No, in massively parallel context, we want all the SPUs working together for every query. The more SPUs are working, the faster the query will complete.
Now some understand the Massively Parallel model to be thus: Multiple servers, each containing some domain of the information, and when a query is sent out it will find its way to the appropriate server and cause it to search its storage for the answer. This is a more stovepiped model than the Netezza architecture and will ultimately have logistics and hardware scalability as Achilles's heels, not strengths for parallel processing. Many sites are successful with such models. But they are very application-centric and not open for reuse for other unrelated applications. I noted above that the information we just loaded on the Netezza machine can be in unrelated tables, and databases and application suites, because the Netezza machine is general-purpose when it comes to handling its applicability for data processing. It is purpose-built when we talk about scale.
But let's take the 16-million-record vs 32-million record model above and assume that one of the tables is a transactional table (the 32 million) and one is a dimensional table (the 16 million). The dimensional table contains a key called "store_id" that is represented also on the transactional table such that we can join them together in various contexts. Will the Netezza machine do this, and how will it? After all, the data for the tables is spread out across 16 SPUs.
Well, we have the option of joining these "as is" or we can apply an even more interesting approach, that of using a distribution key. Here's where we need to exercise some analysis, because it appears as though the STORE_ID is what we want for a distribution, but this might skew the data. What does this mean? When we assign a distribution, the Netezza machine will then hash the distribution key into one of 16 values. Every time that particular key appears, it will always be assigned the same value and land on the same SPU. So now we can redistribute both of these random tables on STORE_ID and be certain that for say, SPU #5, all of the same store_id's for the dimension table and for the transaction table are physically co-located on the same disk. You probably see where this is going now.
Ahh, but what if we choose store_id and it turns out that a large portion of the IDs hash to a common SPU? What if we see, rather than 1 million records per SPU, we now see around 500k per SPU with one SPU radically overloaded with the remainder? This will make the queries run slow, because while all the SPUs but one will finish fast, in half the time of before, they will all be waiting on the one overworked SPU with all the extra data. This is called data skew and is detrimental to performance.
However, had we chosen a date or timestamp field, our skew may have been even worse. We might see that the data is physically distributed on the SPUs in a very even form. But when we go to query the data, we will likely use a date as part of the query, meaning that only a few SPUs, perhaps even one SPU, will actually participate in the answer while all the others sit idle. This is called process skew and is also detrimental to performance.
Those are just some things to watch out for. If we stay in the model of using business keys for distribution and using dates for zone maps (which I will save for another entry) we will usually have a great time with the data and few worries at all.
Back to the configuration on store_id's. If we now query these two tables using store_id in the query itself, Netezza will co-locate the join on the SPU and will only allow records to emit from this join if there is a match. This means that the majority of the work will be happening in the SPU itself before it ever attempts to do anything else with it. What if we additionally clamped the query by using a date-field on the transaction table? The transaction table would then be filtered on this column, further reducing the join load on the machine and returning the data even faster.
So here is where we get lifting effects where other machines experience drag. For an SMP-based machine, adding joins can make the query run slower. On a Netezza machine, joins can serve as filter-effects, limiting the data returned by the query and increasing the machine's immediate information on where-not-to-look. Likewise the date-filter is another way we tell the machine where-not-to-look. As I have noted, the more clues we can give the machine as to where-not-to-look, the faster the query will behave.
I have personally witnessed cases where adding a join to a large table actually decreases the overall query time. Adding another one further decreases it. Adding another one even further and so on - five-joins later we were getting returns in less than five seconds where the original query was taking minutes. This filter-table and filter-join effect occurs as a free artifact of the Netezza architecture.
Distribution is simply a way to lay the data out on the disks so that we get as many SPUs working for us as possible on every single query. The more SPUs are working, the more hardware is being applied to the problem. The more hardware, the more the physics is working for us. Performance is in the physics, always and forever. Performance is never in the software.
While the above is powerful for read-query lift (co-located reading) there is another power-tool called the co-located write. If we want to perform an insert-select operation, or a create-table-as-select, we need to be mindful that certain rules apply when we create these tables. For example, if we want to perform a co-located read above and drop the result into an intermediate table, it is ideal for the intermediate table to be distributed on the same key as the tables we are reading from. Why is this? Because the Netezza machne will simply read the data from one portion of the disk and ultimately land it to another portion of the same disk. Once again it never leaves the hardware in order to affect the answer. If we were to distribute the target table on another key, the data will have to leave the SPU as it finds a new SPU home to align with its distribution key. If we actually need to redistribute, this is fine. But if we don't and this is an arbitrary configuration, we can buy back a lot of power just by co-locating the reads and writes for maximum hardware involvement.
So that's distribution at a 50,000 foot level. We will look at more details later, or hop over to the Netezza Community to my blog there where some of these details have already been vetted.
It is important to keep in mind that while distribution and co-location are keys to high(er) performance on the machine, the true hero here is the SPU architecture and how it applies to the problem at hand. We have seen cases where applying a distribution alone has provided dramatic effects, such as 30x and 40x boost because of the nature of the data. This is not typical however, since the co-located reads will usually provide only a percentage boost rather than an X-level boost. This is why we often suggest that people sort of clear-their-head of jumping directly into a distribution discussion and instead put together a workable data model with a random distribution. Once loaded, profile the various key candidates to get a feel for which ones work the best and which ones do not. We have seen some users struggle with the data only because they prematurely selected a distribution key that - unbeknownst to them - had a very high skew and made the queries run too slow. This protracted their workstreams and made all kinds of things take longer than they should have.
So at inception, go simple, and then work your way up to the ideal.
On multiple distribution keys:
Many question arise on how many distribution keys to use. Keep in mind that this is as much a physical choice as a functional one. If the chosen key provides good physical distribution, then there is no reason to use more keys. More granularity in a good distribution has no value. However, a distribution key MUST be used in a join in order to achieve co-location. So if we use multiple keys, we are committing to using all of them in all joins, and this is rarely the case. I would strongly suggest that you center on a single distribution key and only move to more than one if you have high physical skew (again, a physical not functional reason). The distribution is not an index - it is a hash. In multi-key distributions, the join does not first look at one column then the next - it looks at all three at once because they are hashed together for the distribution. Joined-at-the-hip if you will.
On leaving out a distribution key:
One case had three tables joining their main keys to get a functional answer. They were all distributed on the same key, which was a higher-level of data than the keys used in the join. The developer apparently thought that because the distribution keys are declared that the machine will magically use them behind the scenes with no additional effort from us. This is not the case. If the distribution key(s) (all of them) are not mentioned in the join, the machine will not attempt co-location. In this particular case, using the higher-level key would have extended the join somewhat but would not have changed the functional answer. Simply adding this column to the join reduced that query's duration by 90 percent. So even if a particular distribution key does not "directly" participate in the functional answer, it must directly participate in the join so as to achieve co-location. And if this does not change the functional outcome, we get the performance and the right answer.
How it affects concurrency:
Many times people will ask: Why is it that the query runs fast when it's running alone, but when it's running side-by-side with another instance of itself they both slow to a crawl? This is largely due to the lack of co-location in the query. When the query cannot co-locate, it must redistribute the data across the inter-blade network fabric so that all the CPUs are privy to all the data. This quickly saturates the fabric so that when another query launches, they start fighting over fabric bandwidth not the CPU bandwidth. In fact some have noted that the box appears to be asleep because the CPUs and drives aren't doing anything during this high-crunch cycle. That's right, and it's a telltale sign that your machine's resources are bottlenecked at the fabric and not the CPU or I/O levels. By co-locating the joins, we keep the data off the fabric and down in the CPUs where it belongs, and the query will close faster and can easily co-exist with other high-intensity queries.
Modified on by DavidBirmingham
For the past many months I have been diligently updating and upgrading the original 2008 Netezza Underground to address the many features of TwinFin, Striper and other offerings from IBM. I have recently been notified that it has passed final edit and is available on Amazon.com.
All I can say is "whew!" and many thanks to those who helped put it together. It's been a whirlwind.
Here is the URL
When I started the project I realized that a big part of the original book remains timeless. I didn't leave it "as is" though - practically every page and all the chapters have new material, case studies and such. I peppered the book with some additional graphics since the intrinsic points require a bit more reinforcement than mere words will suffice.
The original chapter on "Distribution Stuff" is now "Performance Stuff" and is twice as long, covering the various aspects of setting up tables, troubleshooting, page-level zone maps and a lot more.
Fortunately, this time around there is a better mechanism to contact me if you have questions or want to report any errata (hey, it could happen!) - you can reach me through this blog, on linked-in or directly through my primary email address at Brightlight Consulting:
For a brief history of why ELT (that is, in-box-data-processing) is even a topic of discussion, we must recognize that the high-powered appliances such as Netezza have not only made such implementation viable, but even desirable.
Just so we level-set on what one of these looks like, it's a SQL statement. Usually an insert/select but can also include updates and the like. Many of you recognize these as multi-stage operations inside a stored procedure. The sentiment of course is that such an animal can perform better inside-the-machine than to take it out of the machine (through an ETL tool), process it, and put it back. This may be true for smaller data sets, but you aren't reading this because you have smaller data sets, are you? Netezza users are big-game hunters. Pea-shooters are for small animals, but if we want big game, we'd better bring a bigger game with us, put on a bigger game face, and bring bigger equipment.
But is it really the size of the equipment? Netezza users know that the size and the architecture are keys to success. Dog-piling a bunch of CPUs together does not a power-frame make.
Okay, back to the storyline here - in-the-box SQL-transforms, in a traditional RDBMS platform, are the realm of small game. Once the game gets bigger, these transforms degrade in performance, and rather rapidly. Watch as a swarm of engineers tries to reconstruct and rebuild the procedures, the SQL, the tables and even upgrade the hardware in a futile attempt to achieve incrementally more power. Emphasis on incrementally-more. Not linearly more.
As they grow weary of this battle, the ETL tools start looking better and better. We reach a critical mass, and the ELT-in-the-box is summarily forsaken as we stand up the ETL tool.
Sometime after this transition, the data volumes once again overwhelm the environment. One thing leads to another, and one day the Netezza machine arrives on the floor. Hopes are high.
But notice the transition above - ELT was forsaken for ETL, likely never to return. But wait, now we really have the power to do the ELT SQL-transforms, but we've mentally and perhaps emotionally (yeah, verily even politically) moved away from SQL-transforms.
Some reading this might scratch their heads and think, What's he talking about? We've been doing ELT in the machines using (PLSQL, etc, name your approach here) for many years. Why would we shy away from it?
- Hand-crafted SQL statements. They are easy to deploy but difficult to maintain and re-deploy. The moment we have to change the data model, this becomes painfully obvious.
- Proliferation of one-off operations, in cut-and-paste form, increasing inefficiency and complexity of the overall implementation
- Lack of overarching control - stored procs run and go "black" for an hour or more without any clue as to their status
- They don't scale in a traditional (SMP-based) RDBMS - the larger the data gets, the more the implementation will suffer
- ETL tools don't support them in the quantity and logistical quality we really need, but they are getting there.
Why not use an ETL Tool? I mean, they handle push-down SQL-generation right?
I can perhaps summarize this situation in a single conversation I had with a ETL tool vendor who was hawking the capability for his own tool, and after showing me the mechanics of making one of these little jewels operate with aplomb, I asked him, "So what about doing a few hundred of these in a sequence, or branching them into multiple sequences?" The vendor rep looked almost hurt. "Why would you want to do that?"
Well, if we're really talking about migrating transforms into the machine, this can grow into hundreds of operations rather quickly. This situation apparently overwhelms the logistical capability of even the most powerful ETL tool. But I have hope that they will solve this situation. Eventually.
I am not holding out hope that it will happen soon, or voluntarily. These tool vendors have invested millions in the performance boosting of their own products and will not likely toss this investment on the flames of the appliance movement, even if said flames are the exhaust flames of the appliance's rocket engines. This is why I say "eventually". They don't really have a marketable reason to embrace this approach.
Another problem of the ETL tool is that it is so divorced from the appliance's infrastructure that it cannot control the cross-environment logistics. This is especially true of the "virtual transaction" - that is - multiple flows arriving in multiple tables that are each in context of one another, yet are individually shared-nothing operations. If one of the flows should fail, how do we back out of this? Can we do a rollback of the tables where their flows succeeded? No we cannot. We could certainly implement an approach, but this is neither inherent nor intrinsic to the ETL tool. We need a shared-nothing virtual transaction that will control all of the flow in a common context, commit them in that context and faithfully rollback the results in that same context. ETL tools don't go there. Unless we implement the tool that way. As an application. Once implemented, how do we reuse this for the next application, and the next? We can see that it's not part of the tool itself.
If next-generation "ELT" scenarios are to be successful, they need several very important capabilities that are simply non-optional and non-trivial:
- Catalog-aware - The entire application must be coupled to the catalog definitons so that changes in definitions will automatically drive the resolution of their impact. We don't want to go on a submarine hunt to find every place where a particular column is used. The tool should already know.
- Logistical control - the ELT management needs to be able to support hundreds of transforms with arbitrary branches and synchronization
- SQL-generation - from templates that are tied to the catalog - If the table changes, we can propagate the changed automatically into the transforms.
- Proper handling of shared-nothing operations into the target, including virtual transactions that we can replicate to other databases, or rollback completely.
- Treat internal appliance databases as "waypoints" rather than "endpoints". That is, multiple databases are in play and used for various roles, such as intake, intermediate workspace, and persistence. These resources should be configurable and relatively transparent to the tool.
- Support of major patterns such as CDC, Dedupe, column-level scrubs and transforms, masking, column remapping, soft-delete, slowly-changing-dimensions and of course, analytic functions etc.
I am sure the visitors of this blog have even more aspects of a "wish list" that they have implemented (perhaps painfully so) and want more control over the data, its processing logistics and error control and recovery. Feel free to add your own comments and suggestions here.
Modified on by DavidBirmingham
I hear a lot of feedback on the use of CDC to put data into a PureData for Analytics, Powered By Netezza Technology device. In the other machines (traditional database engines) the data flies into the box, the CDC is on-and-off the machine in seconds. But in my Netezza machine, the CDC seems to grind. I have it running every fifteen minutes, they say, and the prior CDC instance is still running when the next instance kicks off. This is totally unacceptable. Maybe we shouldn't be using CDC for this?
Or maybe they just don't have it configured correctly?
There are two major PDA principles in play here. One is strategic and the other is tactical. Many people can look at the tactical principle and accept it because it is testable, repeatable and measurable. The strategic one however, they will hold their judgment on because it does not fit their paradigm of what a database should do. I'll save the strategic one for last, because its implications are further reaching.
The CDC operation will accumulate records into a cache and then apply these at the designated time interval. This micro-batch scenario fits Netezza databases well. The secondary part of this is that the actual operation will include a delete/insert combination to cover all deletes, updates and inserts. So when the operation is complete, the contents of the Netezza table will be identical to the contents of the source table at that point in time (even though we expect some latency, that's okay).
The critical piece is this: An update operation on a Netezza table is under-the-covers a full-record-delete and full-record-insert. It does not update the record in place. A delete operation is just a subset of this combination. This is why the CDC's delete/insert combination is able to perfectly handle all deletes, updates and inserts. The missing understanding however, is the distribution key.
If we have a body of records that we need to perform a delete operation with against another, larger table, and the larger table is distributed on RANDOM, think about what the delete operation must do in a mechanical sense. It must take every record in the body of incoming records and ship it to every SPU so that the data is visible to all dataslices. It must do this because the data is random and it cannot know where to find a given record to apply the operation - it could literally be anywhere and if the record is not unique, could exist on every dataslice. It's random after all. This causes a delete operation (and by corollary an update operation) to grind as it attempts to locate its targets.
Contrast this to a table that is distributed on a key, and we actually use the key in the delete operation (such as a primary key). The incoming body of records is divided by key, and only that key's worth of data is shipped to the dataslice sharing that key - the operation is lightning-fast. This is why we say - never, ever perform a delete or update on a random table, or on a table that doesn't share the distribution key of the data we intend to affect. Deletes and Updates must be configured to co-locate, or they will grind.
Now back to the CDC operation. Whenever I hear that the CDC operation is grinding, my first question is: Do you have the target Netezza tables distributed on the same primary keys of the source table? The answer is invariably no (we will discover why in a moment). So then I ask them, what would it take to get the tables distributed on the primary key? How much effort would it be? And they invariably answer, well, not much, but it would break our solution.
Why is that?
Because they are reporting from the same tables that the CDC is affecting. And when reporting from these tables, the distribution key has to face the way the reporting users will use the tables, not the way CDC is using the tables. This conversation often closes with a "thank you very much" because now they understand the problem and see it as a shortcoming of Netezza or CDC, but not a shortcoming of how they have implemented the solution.
Which brings us to the strategic principle: There is no such thing as a general purpose database in Netezza.
What are we witnessing here? The CDC is writing to tables that should be configured and optimized for its use. They are not so, because the reporting users want them configured and optimized for their own use. They are using the same database for two purposes because they are steeped in the "normalization" protocol prevalent in general-purpose systems - that the databases should be multi-use or general-purpose.
But is this really true in the traditional databases? If we were using Oracle, DB2, SQLServer - to get better performance out of the data model wouldn't we reconfigure it into a star schema and aggressively index the most active tables? This moves away from the transactional flavor of the original tables to a strongly purpose-built flavor.
Why is it that we think this model is to be ignored when moving to Netezza? Oddly, Netezza is a Data Warehouse Appliance - it was designed to circunscribe and simplify the most prevalent practices of data warehousing - not the least of which - is the principle that there is no such thing as a general-purpose database. In a traditional engine we would never attempt to use transactional tables for reporting - they are too slow for set-based operations and deep-analytics. Yet over here in the Netezza machine, somehow this principle is either set-aside or ignored - or perhaps the solution implementors are unaware of it - and so these seemingly inexplicable grinding mysteries arise and people scratch their heads and wonder what's wrong with the machine.
And again, they never wonder what's wrong with their solution.
If we take a step back, what we will see are reports that leverage the CDC-based tables, but we will see a common theme, which I will cover in a part-2 of this article. The theme is one of "re integration" versus "pre-integration". That is, integration-on-demand rather than data that is pre-configured and pre-formulated into consumption-ready formats. What is a symptom of this? How about a proliferation of views that snap-together five or more tables with a prevalence of left-outer-joins? Or a prevalence of nested views (five, ten, fifteen levels deep) that attempt to reconfigure data on-demand (rather than pre-configure data for an integrate-once-consume-many approach?) Think also about the type of solution that performs real-time fetches from source systems, integrates the data on-the-fly and presents it to the user - this is another type of integration-on-demand that can radically debilitate source systems as they are hit-and-re-hit for the very same set-based extracts dozens or hundreds of times in a day.
I'll take a deep-dive on integration-on-demand in the next installment, but for now think about what our CDC-based solution has enticed us to do: We have now reconfigured the tables with a new distribution key that helps the reports run faster, but because this deviates from the primary-key design of the source tables (which CDC operates against) then the CDC operation will grind. And when it grinds, it will consume precious resources like the inter-SPU network fabric. The grinding isn't just a duration issue - it's inappropriately using resources that would otherwise be available to the reporting users.
What's missing here is a simple step after the CDC completes. Its a really simple step. It will cause the average "purist" data modeler and DBA to retch their lunch when they hear of it. It will cause the admins of "traditional" engines to look askance at the Netezza machine and wonder what they could have been thinking when they purchased it. But the ultimate users of the system, when they see the subsecond response of the reports and way their queries return in lightning fashion compared to the tens-of-minutes, or even hours - of the prior solution, these same DBAs, admins and modelers will want to embrace the mystery.
The mystery here is "scale". When dealing with tables that have tens of billions, or hundreds of billions of records, the standard purist protocols that rigorously and faithfully protect capacity in the traditionl engines - actually sacrifice capacity and performance in the Netezza engine. It's not that we want to set aside those protocols. We just want to adapt them for purposes of scale.
The "next step" we have to take is to formulate data structures that align with how the reporting users intend to query the data, then use the CDC data to fill them. It's not that the CDC product can do this for you. It gets the data to the box. This "next step" in the process is simply forwarding the CDC data to these newly formulated tables. When this happens, the pre-integration and pre-calculation opportunities are upon us, and we can use them to reduce the total workload of the on-demand query by including the pre-integration and pre-calculation into the new target tables. These tables are then consumption-ready, have far fewer joins (and the need for left-outer joins often fall by the way-side). After all, why perform the left-outer operations on-demand if we can perform them once, use Netezza's massively parallel power for it, and then when the users ask a question, the data they want is pre-formulated rather than re-formulated on demand.
This necessarily means we need to regard our databases in terms of "roles". Each role has a purpose - and we deliberately embrace the notion of purpose-built schemas, and deliver our solution from the enslavement of a general-purpose model. The CDC-facing tables with support CDC - we won't report from them. The reporting tables face the user - we won't CDC to them.
Keep in mind that this problem (of CDC to Netezza) can rear its head with other approaches also - such as streaming data with a replicator or ETL tool to simulate the same effect of CDC. Either way, the data arrives in the Netezza machine looking a lot like the source structures and aren't consumption-ready.
I worked with a group some years ago with a CDC-like solution, and they took the "next step" to heart, formulated a set of target tables that were separate from staging and then used an ETL tool to fill them. The protocol was simply this: The ETL tool sources the data and fills the staging tables, then the ETL sources the staging tables and fills the target tables. This provided the necessary separation, so functionally fulfilled the mission. The problem with the solution however, was that for the transformation leg, the ETL tool was yanking the data from the machine into the ETL tool's domain, reformulating it and then pushing it back onto the machine. The data actually met itself coming-and-going over the network. A fully parallelized table was being serialized, crunched and then re-paralellized into the machine. As the data grew, this operation became slower and slower. That's what we would expect right? The bottleneck now is the ETL tool. The proper way to do this, if an ETL tool must be involved, is to leverage it to send SQL statements to the machine, keep the data inside the box. The Netezza architecture can process data internally far faster than any ETL tool could ever hope - so why take it out and incur the additional penalty of network transportation?
The ETL tool aficionados will balk at such a suggestion because it is such a strong deviation from their paradigm. But this is why Netezza is a change-agent. It requires things that traditional engines do not because it solves problems in scales that traditional engines cannot. In fact, performing such transformations inside a traditional engine would be a very bad idea. The ETL tools are all configured and optimized to handle transformation one way - outside the box. This is because it is a general-purpose tool and works well with general-purpose engines. There is a theme here: the phrase "general purpose" has limited viability inside the Netezza machine. If we embrace this reality with a full head of steam, the Netezza technology can provide all of our users with a breathtaking experience and we will have a scalable and extensible back-end solution.
Modified on by DavidBirmingham
Sometimes the average Netezza user gets a bit tripped-up on how an MPP works and how co-located joining operates. They see the "distribute on" phrase and immediately translate "partition" or "index" when Netezza has neither. In fact, those concepts and practices don't even have an equivalent in Netezza. This confusion is simply borne on the notion that Netezza-is-like-other-databases-so-fill-in-the-blank. And this mistake won't lead to functional problems. They will still get the right answer, and get it pretty fast. But it could be soooo much faster.
As an example, we might have a traditional star-schema for our reporting users. We might have a fact table that records customer transactions, along with dimensions of a customer table, a vendor table, a product table etc. If we look at the size of the tables, we find that the product and vendor tables are relatively small compared to the customer, and the fact table dwarfs them all. A typical default would be to distribute each of these tables on their own integer ID, such as customer_id, vendor_id etc. and then putting in a transaction fact record id (transaction_id) that is separate from the others, even though the transaction record contains the ID fields from the other tables.
Then the users will attempt to join the customer and the transaction fact using the customer_id. Functionally this will deliver the correct answer but let's take a look under-the-covers what the performance characteristics will be. As a note, the machine is filled with SBlades, each containing 8 CPUs. For example, if we have a TwinFin-12, this is 12 SBlades with 8 CPUs, or 96 CPUs. They are interconnected with a proprietary, high-speed Ethernet configured to optimize inter-CPU cross-talk.
Also whenever we put a table into the machine, it logically exists in one place, the catalog, but physically exists on disks assigned to the CPUs. A simplistic explanation would be that if we have 100 CPU/disk combinations and load 100,000 rows to a table that is distributed on "random", each of the disks would receive exactly 1000 records. When we query the table, the same query is sent to all 100 CPUs and they only operate on their local portion of the data. So in essence, every table is co-located with every other table on the machine. This does not mean however, that they will act in co-location on the CPU. The way we get them to act in co-location (that is, joining them local to the CPU) is to distribute them on the same key.
But because our noted tables are not distributed on the same key, they cannot co-locate the join. This means that the requested data from the customer table will be shipped to the fact table. What does this look like? Because the customer table has no connection to the transaction_id, the machine must ship all customer records to all blades (redistribution) so that the CPUs there can attempt to join on the body of the customer table. We can see how inefficient this is. This is not a drawback of the Netezza machine. It is a misapplication of the machine's capabilities.
Symptoms: One query might run "fine". But two of them run slow. Several of them even slower. Results are inconsistent when other activities are running on the machine. We can see why this is the case, because the processing is competing for the fabric. Why is this important to understand? The inter-CPU fabric is a fairly finite resource and if we allow data to fly over it in an inefficient manner, it will quickly saturate the fabric. All the queries start fighting over it.
Taking a step back, let's try something else. We distribute the transaction_fact on the customer_id, not the transaction_id. Keep in mind that the transaction_id only exists on the transaction table so using it for distribution will never engage co-location. Once we have both tables distributed on the customer_id, let's look at the results now:
When the query initiates, the host will recognize that the data is co-located and the data will start to join without ever leaving the CPU where the two table portions are co-located. The join result is all that rises from the CPU, and no data is shipped around the machine to affect the answer. This is the most efficient and scalable way to deal with big-data in the box.
Now another question arises: If the vendor and product dimensions are not co-located with the transaction_fact, how then will we avoid this redistribution of data? The answer is simple: they are small tables so their impact is negligible. Keep in mind that we want to co-locate the big-ticket-or-most-active tables. I say that because we have sites that are similar in nature where the customer is as large as two of the other dimensions, but is not the most active dimension. We want to center our performance model on the most-active datasets.
This effect can rear its head in counter-intuitive ways. Take for example the two tables - fact_order_header and fact_order_detail. These two tables are both quite monstrous even though the detail table is somewhat larger. Fact_order_header is distributed on the order_header_id and the fact_order_detail is distributed on the order_detail_id. The fact_order_detail also contains the order_header_id, however.
In the above examples, the order header was being joined to the detail, along with a number of other keys. This achieved the correct functional answer, but because they were not using the same distribution key, the join was not co-located. So we suggested putting the order_detail table on the same distribution as the order-header (order_header_id). Since the tables were already being joined on this column, this was a perfect fit. The join received an instant boost and was scalable, no longer saturating the inter-CPU fabric.
The problem was in how the data architects thought about the distribution keys. They were using key-based thinking (like primary and foreign keys) and not MPP-based thinking. In key-based thinking, functionality flows from parent-to-child, but in MPP-based thinking, there is no overriding functional flow of keys - it's all about physics. This is not to say that "function doesn't matter" but we cannot put together the tables on a highly physical machine and expect it to behave at highest performance unless we regard the physics and protect the physics as an asset. Addressing the functionality alone might provide the right functional answer, but not the most scalable performance.
After completing another migration from a traditional, general-purpose RDBMS to the Netezza technology, I visited a friend who had several artifacts in his home that had to be the strangest things I'd ever seen.
Now I'd heard of genetically altered cat fur, you know, buying a cat online while picking the fur color of your choice (blue, lavender, teal - etc). Seems like an odd thing to do to a cat. I like cats and dogs both, so don't imagine that this blog essay attempts to take sides. Some folks are downright serious on their choice of pet, so I'll smooth that fur wherever I can.
Back to the hairless cats. I asked him "Where did you find these? And what happened to your other cats?" And he laughed, "That's a funny story. These are my cats, but I had to shave them." To this, I rolled my eyes, wondering where this was about to lead. He told the tale:
"We took a vacation down south and put the cats in kennels in the back of the truck. The round-trip took a toll on their fur and matted up everything from head to toe. Weeks later, the cat fur had not smoothed out. The kids had been brushing it out but it wasn't working. And if you think they look ridiculous shaved, you have no idea how silly they looked with their hair matted."
"So you shaved them?" I asked.
"Sure" he said, "Seems practical right? Just get rid of the matted hair altogether. Teasing it apart would have taken, well, years of time. Their fur will grow back out soon enough"
"Aren't there, you know, shampoos and stuff for that? I mean, shaving seems a little extreme."
"Tried all that. Bad thing about it, mats are bad for cats - they cause infections and all kinds of nasty side effects. Best to just shave it all and be done with it."
Laughing on the inside, I thought a bit about how we have to decompose and de-engineer an organically-grown data warehouse. Some would suggest porting (forklifting) the whole thing over "as is". Like taking a matted-hair cat and moving them from one house to another. It changes the venue for the cat, but doesn't help the cat at all. It's still sick and getting sicker from the mats. Such folks "tell a tale" of the success of their migration derring-do. But they are like nomads. Hunting the game until there's no more, then pulling up stakes to find another place to burn out. Forklift-migrations have value only to the ones who are doing the migrating, not the recipients. No sooner will they tie a bow on it than someone will request a change, and we will discover what we already knew: The original data model (now the new data model) isn't very resilient to change no matter where it is hosted.
We realize we have ported both the good and the bad from the old system, when we had the opportunity to port the good and leave the bad behind. We essentially are agreeing that we are about to standardize on the past and then accommodate the future, rather than a better approach: standardize on the future and accommodate the past.
Many years ago, at one site we had to carefully tease-apart the data and the stored procedured to find out what they were actually doing. Unfortunately we had carved up the work for several teams rather than reviewing it together. Had we done this, we would have discovered that the stored procs executed in chains of work, and that many of the chains were copy-pasted from one original chain that was too "matted" to risk breaking. So they copied-and-modified this chain to perform the new functionality. Enough of these and we see how the stored procedure doesn't benefit us (at all) for back-end data processing. In fact, we strongly suggest people use stored procs in Netezza for BI-adaptation and optimization, for the presentation layer. But not for the back end. Stored procs are not operationally viable for a wide range of reasons. It's even funny how folks move from one technology to another and try to replicate the stored procedure logic as a knee-jerk exercise, without realizing how flawed it really is. Perhaps the Netezza stored proc will run a lot faster. Trust me, performance is the least of your worries.
So once we converged the teams together, these themes started popping out like rabbits. By the end of the first day we are all laughing at the sheer level of redundancy in the back end. But not particularly surprised at the outcome. We'd seen it in lots of places before, but not so bad.
Of course, it never once occurred to us that we would port these hundreds of stored procs over to the new system. Rather we would functionally specify what they are doing now, and leverage tools to accommodate the vast majority of the functionality, only building what was left over. I mean, this is a standard functional port, why complicate things? Forklifting into a Netezza machine will certainly yield 10x performance, so why the beef? Without optimizing the data structures and processes to leverage Netezza's power, we might get 10x but leavel 100x on the table. Is this a good tradeoff?
Well, true to form, someone had the capacity to complicate things. He whipped out a spreadsheet and calculated the cost of the hundreds-of-stored-procs in the original system, not realizing we were planning to reduced these to maybe fifteen operations at most. Spreadsheet calculator in-hand, he estimated that it would take 24 people, 8 months, to handle on these stored procs. I sat back in my seat, stunned, because he was costing a project we weren't about to undertake. Rather, building out 15 or so operations would require a handful of people and 90 days at the outside. But also true to form, the project principals saw visions of sugar plums (another word for sales-comp) that got in the way of their better judgment. They actually went to the client with these inflated numbers, he rejected their proposal outright and gave the business to someone else. It's easy to lose a deal when the client sees the inflation on-the-page.
But what our "spreadsheet guy" missed, was that we weren't about to embark on a journey of finding a home for each stored proc (we already knew this had no value, and the client knew it too). He believed that we intended to bring the matted-cats into the house and put them on pillows, when we intended to pick the cats we wanted to keep, and shave them.
Okay, that's a strange analogy, but we had no intention whatsover of accepting all that convoluted spaghetti as the foundation for the go-forward system.
Netezza gives us the capacity - to simplify. We keep the parts we consider valuable (the cat) and get rid of all the mess that keeps the cat sick and unhappy. Taking only the functions we want, we then reconstruct (let the hair grow back out) only what we want to keep, and take the opportunity to apply some solid architectural principles and likewise capitalize on the strengths of the Netezza platform.
In the end, if we really have a platform that is standardized on the future, but accommodates the past, we also have something else that is even more powerful: A simpler, stronger engine that is ready to grow in functionality, adapting to our changing needs. The old system was never built with this kind of vision or priority, because the power wasn't there to affect it anyhow.
When introducing the Netezza platform to a new environment, or even trying to leverage existing technologies to support it, very often the infrastructure admins will have a lot of questions, especially concerning backups and disaster recovery. Not the least of which are "how much" "how often" and such like. More often than not, every one our responses will be met with a common pattern, a sentence starting with the same two words:
Case in point, when we had a casual conversation with some overseers of backup technology as a precursor to "the big meeting" we almost - quite accidentally - shut down the conversation entirely. Just the mention of "billions" of rows, or speaking of the database in "Carl Sagan" scaled terms, caused them to want to scramble for budget and market surveys of technologies that were more scalable than their paltry nightly tape-backup routines. In this particular conversation, we were talking about nightly backups that were larger than the monthly backups of all of their other systems combined. Clearly we were about to pop the seams of their systems and they wanted a little runway to head off the problem in the right way.
But what is the "right way" to perform a backup where one of the tables is over 20 terabytes in size, the entire database is over 40 terabytes in size and the backup systems require two or even three weeks to extract and store this information for just one backup cycle? Quipped one admin "It takes so long to back up the system that we need to start the cycle again before the first cycle is even partially done." and another "Forget the backup. What about the restore? Will it take us three weeks to restore the data too? This seems unreasonable."
Yes it does seem unreasonable precisely because it is - quite unreasonable. As many of you may have already discovered, the Netezza platform is a change-agent, and will either transform or crush the environment where it is installed, so voracious are its processing needs and so mighty is its power to store mind-numbing quantities of information.
The aforementioned admins simply plugged their backup system into the Netezza machine, closed their eyes and flipped the switch, then helplessly watched the malaise unfold. It doesn't have to be so painful. These are very-large-scale systems that we are attempting to interface with smaller-scale systems. We might think that the backup system is the largest scaled system in our entire enclosure, but put a Netezza machine next to it and watch it scream like a little girl.
So here's the deal: No environment of this size shoiuld be handled in a manner that is logistically unworkable for the infrastructure hosting it. We can say all day that these lower-scaled technologies should work better or that Netezza should pony-up some stuff to bridge the difference, but we all know that it's not that easy. Netezza has simplified a lot of things, but simplification of things outside the Netezza machine - aren't we asking a bit much of one vendor?
To avoid pain and injury, think about the things that we need to accomplish that are daunting us, and solve the problem. The problem is not in the technology but in the largesse of the information. We would have the same problem on our home computers if we had a terabyte of data to backup onto a common 50-gig tape drive. We would need twenty tapes to store the data. The backup/restore technology works perfectly fine and reasonably well for a variety of large-scale purposes. We simply need to be creative about adapting it to the Netezza machine. Don't plug it in and hope for the best. Don't do monolithic backups. The data did not arrive in the machine in a monolithic manner so why are we trying to preserve it that way? Leave large-scale storage and retrieval to the Netezza machine and don't crush the supporting technologies with a mission they were never designed for.
Several equally viable schools of thought are in play here. What we are looking for is the most reliable one. Which one will instill the highest confidence with least complexity? The more complex a backup/restore solution becomes, the less operational confidence we have in it. If it cannot backup and restore in a reasonable time frame, we exist in a rather anxious frontier, wondering when the time will come that the restoration may be required and we put our faith in the notion that it either won't, or when it does all of the other collateral operational ssues will eclipse the importance of the restoration. In other words. future circumstance will get us off the hook. There is a better way, like a deterministic and testable means to truly backup and restore the system with high reliability and confidence.
On deck is the simplest form of the solution - another Netezza machine. Many of you already have a Disaster-Recovery machine in play. Trust me when I tell you that this should be fleshed out as a fully functional capability (discussed next) and then the need for a commodity backup/restore technology evaporates. Using another Netezza machine, especially when leveraging the Netezza-compressed information form, allows us to replicate terabytes of information in a matter of minutes. I don't have to point out that none of our secondary technologies can compete with this.
A second strategy requires a bit more thought, but it actually does leverage our current backup/restore technology in a manner that won't choke it. It won't change the fact that the restoration, while reliable, may be slow simply because moving many terabytes in and out of one of these secondary environments is inherently slow already.
A third strategy is a hybrid of the second, in that enormous SAN/NAS resources are deployed as the active storage mechanism for the data that is not on (or no longer on) the Netezza machine. This can be a very expensive proposition on its own. We know of sites that keep the data on SAN in load-ready form, and then load data on-demand to the Netezza machine, query the just-loaded data, return the result to the user and then drop the table. You may not have on-demand needs of this scale, but this shows that Netezza is ready to scale into it.
A fourth strategy is a hybrid of the first, that is we still use a Netezza machine to back up our other Netezza machine, but we use the more cost-effective Netezza High-Capacity server, which is less expensive than the common TwinFin (fewer CPUS, more disk drives) but otherwise behaves in every way identically to its more high-powered brethren. And honestly if we were to put apples-to-apples in a comparison between the cost of a big SAN plant to store these archives versus the High Capacity Server, the server wins hands-down. It's cheaper, simpler to operate, doesn't require any special adaptation and we can replicate data in terms guided by catalog metadata rather than adapting one technology to another.
So let's take these from the least viable to the most viable and compare them in context and contrast, and let the computer chips fall where they may.
Commodity backup/restore technology
If we want to leverage this, we need to understand that it cannot be used to perform monolothic operations. These are unmanageable for a lot of reasons:
To mitigate the above, we need to adapt the large-scale database to the backup technology by decoupling and downsizing the operation into manageable chunks. This is a direct application of the themes surrounding protracted data movement in any environment. The larger the data set, the more the need for checkpointed operations so that the overall event is an aggregation of smaller, manageable events. If any single event fails, it is independently restartable without having to start over. Case in point, if I have 100 steps to complete a task and they are all dependent upon one another, and the series should die on step 71, I still have 29 steps remaining that may have completed without incident, but I cannot run them without first completing step 71. This is what a monolithic backup buys us - an all-or-nothing dependency that is not manageable and I would argue is entirely artificial.
To continue this analogy, lets say that any one of these 100 steps only takes one minute. In the above, I am still 30 minutes away from completion. I arrive at 6am to find that step 71 died, and now I have to restart from step 1 and it will cost me another 100 minutes. Even if I could restart at step 71, I am still 30 minutes away from completion.
Contrast the above to a checkpointed, independent model. If we have 100 independent steps and the step 71 should die, the remaining 29 steps will still continue. We arrive at 6am to find that only one of the 100 steps died and we are only 1 minute away from full completion. The difference in the two models is very dramatic:
Monolithic means we are operationally reactive when failure occurs. The clock is ticking and we have to get things back on track and keep moving. Checkpointed means we are administratively responsive when failure occurs. We don't have to scramble to keep things going. In fact, in the above example, if step 71 should fail and the operator is notified, doesn't the operator have at least half an hour to initiate and close step 71 independently of the remaining 29 steps? Operators need breathing room, not an anxious existence.
Monolithic methods are supported de-facto by the backup/restore technology. If we want to perform a checkpointed operation, we have to adapt the backup/restore process to the physical or logical content of the information. We don't want to directly mate the backup technology itself, so we need to adapt it.
Logical Content Boundaries
This means we have to define logical content boundaries in the data. What's that? You don't have any logical content boundaries, not even for operational purposes? Well, per my constant harping on enhancing our solution data structures with operational content, such as job/audit ID and other quantities, perhaps we need to take a step back and underscore the value of these things because they exist for a variety of reasons. One of them is now upon us - the operatonal support of content-bounded backups. It is required for scalability and adaptability and is not particularly hard to apply or maintain.
A more important aspect of content boundary is the ability to identify old versus new data. If the data is carved out in manageable chunks, some will always be least-recent and some more-recent. Invariably the least-recent chunks will be identical in content no matter how many times we extract them for backup. This means we can extract/backup these only once and then focus our attention on the most active data. In a monolithic model, there is no distinction between least or most active, least or most recent. In large-scale databases, the least-recent data is the majority, so the monolithic backup is painfully redundant when it need not be.
Do we absolutely need content-bounded backups for all of our tables to function correctly? Of course not. But by applying this as a universal theme it allows us to treat all tables as part of a backup framework where all of them behave the same way. So part of this is in the capture of the data but a larger part is the operational consistency of the solution.
Many reference tables such as lookups will never grow larger and we know this. In fact, their data may remain static for many years. For the ones that are tied to the application and grow or change every day, these we will call solution tables. They are typically fed by an upstream source and are modified on a regular basis. Any of these tables can grow out of control. The reference data then represents a very low operational risk. Why then would we not simply fold the reference data into the larger body and treat all the tables the same? There is no operational penalty for it, but enormous benefit from being able to treat all tables the same way inside a common solution.
At this point, the backup/restore framework will address all of the tables the same way, but now we have the ability to leverage rules and conditions within the framework so that special handling is available if necessary. This is a common theme in large-scale processing: Handle everything as though it will grow, but accommodate exceptions with configuration rules. I'll forego this aspect for now and let' take a look at what we need in basic terms:
- An intermedaite landing zone: Some off-machine storage location that can hold the data while in transit - e.g a NAS or SAN volume. This will not be the same size as the database but some fraction of it. It is a workspace for intermediate files, not permanant storage
- Content boundaries: A means to define and manage content bthese are tags or quantities in the data itself. They need to be consistent for all tables so that the backups and restore operate the same way. Of course, if we don't have these, we need to apply them.
- A restoration database: a separate database that will be used for the restoration process, because we won't be restoring the data directly into the table from whence it came. Why not? There is an increasing likelihood that the data structures have changed since the last backup. If the structure has changed, we cannot reload the data into it. We need a way to shepherd the data back into the machine.
- Processes: that capture / backup / restore the data using content boundaries and common utilities. These are typically control scripts that are easily configured, deployed and maintained
- Archival Database (next): Physical content replicas of the masters that hold all of the original information in a form useful for backup and restore. Whether they keep the data active/online is immaterial. Their role is to interface with the backup/restore mechanisms so that any data once made offline can easily be made online through this database.
Setup formal archival databases. Whether these reside on the same server, or on a DR / High Capacity server (below) is immaterial. The point is that the data in the master tables will be actively rolled into these tables, which will form the backbone for all backup and restoration operations. We therefore replicate the masters (with streaming replication) into the archival stores and then at some interval perform the backup of the archives, not the master.
Foreshorten the Master Tables
Now that we have a means to define content boundaries - you did apply those boundaries right? We can now look at the database holistically for optimizations based on active data.
At one site we have a table with ninety billion records in three different fact tables spanning over ten years of information, However, the end-users and principals claimed (and we verified) that the most active data on any given day is the most recent six months. Anything prior to that, they would query perhaps once or twice a year for investigative purposes, but had not tied any reports to it.
So now we have an opportunity to get agreement from the users to shorten the master tables. This is especially necessary for those fact tables. In the end, these fact tables (above) were shortened to less than four billion rows each and these are kept trimmed on a regular basis. The original long-term data is held in another archival database that is the foundation for the backups and restores.
On a per-table basis, use the content boundaries to manufacture/compile a set of checkpointed instructions for each table to support extraction and restoration
Execute the extractions to create flat files
Each extraction will pull data to a Netezza-compressed flat file and also supply another file with the metadata instructions necessary to restore the file's contents. On a per-file basis, these instructions will include:
Original table definition, used to construct a template of the table in the restoration database
File load instructions used to get the content back into the machine
Transformation command, used to move data from the restoration database into its original home
- The files will be extracted (using content boundaries) onto the storage landing zone
- The smaller extractions can be performed in parallel, are independent and form their own checkpointed process
- The backup techology will begin its backup process once the second (metadata companion) file arrives for the set. The metadata companion file behaves as a trigger.
- Once the backup for the file-set is complete, the backup/restore deletes the file and its companion
- This decouples the backup/restore processing from the database (completely) so that it can focus solely on file-based backup and restore
- For restoration, both the content file(s) and companion metadata file(s) will be retrieved and placed on the storage landing zone
- The metadata will be used to reconstruct an empty intermediate version of the original table in the Restoration Database
- Load the table using nzload
- Then the metadata will be used (even a preformulated SQL statement) to copy the data into the original target
- Throw away the intermediate load table, delete the data file and its companion metadata file
- The restoration process can be run as many ways-parallel as supported by nzload
- The restoration process can also be surgical, with the more agile ability to restore data in smaller segments
The above protocol, while likely easy to pull off an administer, still has a number of moving parts that the Netezza based-equivalent (later) will not have.
The only difference between the above protocol and one using a SAN-based storage mechanism, is the absence of a formal backup/restore technology. Rather the SAN is the long-term storage location and we perform the incremental extractions onto it. Rather than delete the files, we keep them.
This has significant implications for the cost of the SAN. After all, if we intend to interface this to the Netezza machine, we would not want common NAS storage because it is too slow and the vendors actively disclaim their technology from being viable for data warehousing. The primary reason is that the network and CPUs are set up for load-balancing, like a transactional database, but not bulk onload/offload of the data.
Not only will we need enough SAN to backup the environment, but to also carry fathers and grandfathers if need be (this is a policy decision). With checkpointed extracts, the father/grandfather issue is largely moot. This is because once a checkpointed extract of older data is pulled and stored, it won't be changing and capturing another one just like it has no value.
In this approach we leverage another Netezza machine like a DR server, as our backup, archival and restore foundation. It can easily hold the information quantity. The difference here is in price, of course, since a fully-functional TwinFin is more expensive than most common SAN installations. However, the High Capacity Server (below) mitigates this pricing problem while delivering a consistent data experience.
One primary benefit of performing backups into the DR server is that it can automatically serve the role of a hot-swap server in case of failure in the primary server.
For this scenario to work however, we would want streaming replication between the active databases and the DR server so that the data is being reconciled while being processed. This allows us to have a fully functional hot-swap if the primary crashes, and we can continue uninterrupted while the primary is serviced. Word to the wise on this kind of scenario however, bringing back the primary means that it is out of sync, since the secondary took over for a span of time. So we would need to be able to reverse the streaming replication to make it whole.
Scenarios like this often embrace the practice of operationally swapping the two machines on defined boundaries, like once a month or once a quarter, where they actually switch roles each time. This allows the operations staff to gain confidence in the two machines as redundant to each other in every way. I have seen cases like this where the primary machine went down, the secondary machine kicked in seamlessly and all was well. I have also seen cases where the principals kept the DR server up to date but when it came time to operationally switch, some important piece (usually in the infrastructure between the devices) was missing causing the failover itself to fail. It is best to have a plan in place, but it's better to have tested the plan and that it actually works.
On a per-table basis, use the content boundaries to manufacture/compile a set of checkpointed instructions for each table to support extraction and restoration
Using a variation of the nzmigrate utility, perform the table-level extractions
Extract data in Netezza-compressed form to a flat file
Load the flat file into a restoration database in the second machine
Perform the transform-execution to copy data from the restoration database into the target table
The extractions, restoratons and transformations can all leverage simple scripts and catalog metadata, not become dependent on deeply hand-crafted code
Intermediate (SAN) landing zone requires less space because the data is being transferred in Netezza-compressed format and it is cleaned-up on-the-fly
Transfer is quicker because Netezza-compressed data is written to the data-slices directly instead of coming through the host.
A word on Netezza-compressed transfer. I wrote about this in Netezza Transformation but it is important to highlight here. We performed an experiment moving half a terabyte scattered across a hundred or more tables. This data was moved from its original home to a database in another machine. The first method used simple SQL-extract into an nzLoad component. This process took over an hour. The second method used transient external tables with compression, coupled with an nzload in compressed mode. The entire transfer took less than six minutes. This was because the compressed form of the data was already 14x compressed.
In other experiment using over 20x compression for the data, we were able to transfer ten terabytes in less than an hour. This kind of data transfer speaks well for the streaming replication necessary for DR server operation (above) but underscores the fact that even when transferring between Netezza machines, it's as though we haven't left the machine at all.
Netezza-based High-Capacity Server
This option is simply a form of the Netezza-based hybrid (above) but on a dedicated server designed to support backup and recovery.
The better part about this server is that is has more disk drives and fewer CPUs, making it far more cost effective for storage than common SAN devices. Couple this with the minimal overhead required for transferring data between machines, and the ability to surgically control the content with the content-boundaries and catalog metadata, and we get the best of all worlds with this device.
Not only that, but it is also scalable to support storage of all other Netezza devices in the shop as well as any non-Netezza device where we simply want to capture structured information for archival purposes. The High Capacity server is queryable also, meaning that even the ad-hoc folks will find some value in keeping the data online and available.
Lastly, in Spring 2010, as part of the safe-harbor presentation one of the principals at IBM Netezza announced plans for a replication server. I can only imagine that this device will deliver us from any additional hiccups associated with streaming replication that we might now be doing in script or other utility control language.
At Brightlight, our data integration framework (nzDIF) has the nz_migrate techniques built directly into the flow substrate of the processing controller, as well as the enforcement and maintenance of the aforementioned content boundaries. We are actively acquiring and applying best, most scalable and simplified approaches as a solution framework firmly lashed to a purpose-built machine. I am a big proponent of encouraging Enzees to take on these things themselves, or at least let us coach you on how to make it happen. The solutions are simple because the Netezza platform itself is simplified in its operation. Stand on the shoulders of genius - the air is good up here.
Once again this entry will serve as an intro to the un-initiated. I get a lot of pings from folks who are new to the stuff and want to know their way around the machine from the beginning. Zone maps are bread-and-butter power tools of the Netezza technology. The companion capabilities are the materialized view and the cluster base tables (which are new to v 6.0).
About three years ago I sat in on a special presentation to our firm from one of Netezza's competitors. They put up some power-point slides to compare their technology to "the others". Their product was a data warehousing product and all the "others" were "OLTP" products. The usual suspects were in this list of course, but oddly so was Netezza. But Netezza is not an OLTP machine in any sense of the word. How odd that this company did not know enough about its primary rival to discern this simple truth. In addition, this vendor made certain odd claims about Netezza that were either architectural false-assumptions or just misinformation. One of which was this "In a Netezza machine, the more data you add to a table, the slower the table's response will be." This is of course completely false. Netezza tables are like any data storage with one major takeaway that should not be missed:
Netezza tables may behave inconsistently if the data is not properly configured and deployed. Notice I did not say "if the hardware is not properly ...." but the data itself. Netezza already has the hardware configuration under control and there's nothing we can do to affect it. We can however, align our data with the hardware architecture and will receive the benefit of a consistent experience no matter how much the data grows. It's hard to imagine this kind of consistency on any platform, so stay with me here.
A number of years ago I did some deep immersion in a health-care expert system for Value Health Sciences. The objective was to directly apply and enforce the codes found in the CPT code book used by physicians (those are the codes the doc will check on the sheet he hands to you before he leaves the office). Insurance companies have standard billing/invoicing arrangements with the physicians and these codes are the centerpiece. For VHS, their medical practitioners had set up rules for characterizing (and properly re-characterizing) a claim. For example, if a physician billed codes 1,2 and 4 but he was supposed to have billed these all as 1 and 5 (because code 5 bundles 2 and 4 to reduce cost) the system would have a rule in it that looked something like:
1,2,4 = 1, 5
The difficulty here of course, with some 100,000 codes and over 40,000 rules, is to organize them so that they are quickly accessible. By design, the first code in the rule (eg. code "1" above) was considered an "index" code. In other words, any and all rules that would use the code "1" as an anchor would also make "1" the first code. This lended itself to the opportunity to "vector" the rules. That is, create an in-memory array, indexed by the given code, and a list of rules would be handily available at that vector point. So that all we had to do was look at the claim (e.g. codes 1,2,4) fetch the various candidate rules from their vector lists ( the list for code 1 might have 3 rules, likewise for the others, reducing our total candidate rules from 40,000 to only 9 rules). We could then run the rules and dispatch the claim. When I had first received this model, the system was physically running through all 40,000 rules to find the right match. This vectored approach reduced the total rules, but also eliminated guesswork. And with all that, boosted the power and created a very consistent turnaround time for each patient claim.
The vector list above is a common way in programming to organize information so that - if we have something in our hands already - like a filter or a vector based on a natural key - we can quickly find where we need to start looking and eliminate the places where we are not interested. While this sounds a lot like a common index-structure approach, it's actually a process-of-elimination approach. The two have similar qualities, so I will delineate them here at a high level.
An index is a surgical approach to finding data by logging its record-based keys and a record locator into a separately maintained structure. By finding keys, we find physical records. This is important for transactional systems that need fast record-level turnaround. It is not ideal however, for analytics or data warehousing where the vast majority of queries don't care about individual records at all, but about answers that span records. This alternative approach of vectoring, is a way to bucket the information so that we're not attempting to specifically locate the record itself, but the bucket where the record may be located, that is, derive a group of candidates upon which we will take further action.
When we think about this, the candidate approach, as opposed to the surgical approach is the only way to scale anything that has to deal with multiple-record sets (set-based processing). Engines that are geared to examine one-record-at-a-time even if it is to aggregate or perform multi-record operations with the single-records it retrieves, cannot compete with an engine that is designed to defacto treat the records as multiples from the outset. Transactional versus bulk. The transactional engine is inherently slower and non-scalable compared to the bulk engine (for set-based operations).
I noted in a prior entry that Netezza's anti-indexing strategy heavily leverages "where not to look" because no matter how large a table gets, this "vectoring" or "candidate" approach will guarantee that our response time is very consistent. How does this apply to zone maps?
Recall the diagram with the CPU/disk combinations. Each of these disks is divided into its smallest actionable physical portion (an extent) and the extent then becomes the currency-of-the-realm when we're talking about searching for data. The machine automatically tracks the data that is added to an extent, and when we generate-statistics, every extent is accounted for in a zone map. The map has basic statistics on the columns in the table at the extent level, such as the high/low values for a column in the given extent.
Extent ColA ColB ColC
1 200 A B
1 300 A B
1 400 C D
1 500 E F
2 200 F G
2 400 H I
2 600 J K
3 100 L M
3 200 J O
Note the above layouts of the extents. If we were to look at the high/low values of each, we would find zone maps like this:
Extent ColA High ColA Low
1 500 200
2 600 200
3 200 100
So now if we ask a question to the machine based on say, a ColA value of 200, the machine will search all three extents. It has no choice because the zone map is telling it that pertinent data exists in all three.
Now let's physically sort the data by ColA
Extent ColA ColB ColC
1 100 L M
1 200 A B
1 200 F G
1 200 J O
2 300 A B
2 400 C D
2 400 H I
3 500 E F
3 600 J K
Now if we take a look at the zone map, we would see something like:
Extent ColA High ColA Low
1 200 100
2 400 300
3 600 500
So that now if we query the table for values of 200, the machine searches only one extent, not all of them. (Now, this is a simple example, because a given value could extend across many extents). Likewise a given disk has thousands of extents upon which to store the data, so it can spread the information out into much smaller, manageable chunks.
The point, however, is that by organizing the data this way, the zone maps reduce the total searchable extents. Keep in mind that we don't have to sort the data each time we want to use it. For many warehouses, the data naturally arrives in this contiguous form (for example, transaction date in a retail model). The Netezza machine will track all integers and dates automatically without us having to do anything additional to enable it. So we might take a penalty once to sort the data at the outset, then benefit from the data's natural arrival and collation.
This effectively creates the aforementioned vectored effect. It also guarantees a consistent experience because the data associated with the given value (say 200 above) isn't going anywhere. It will always be in the same location, tracked by the same zone map. Anytime we want to get back to the records associated with value 200, the zone map will serve them up. If the response time last week was 2 seconds, we could quadruple the size of the table and it will still be 2 seconds, because the data itself didn't go anywhere.
Contrast this to other heavy-lifting algorithms in software technologies. The Ab Initio platform, for example, will perform an in-memory Sort that will "intelligently" spill to disk when it runs out of memory. What this means is that it will take a predefined block of memory and attempt to perform all work inside it until it runs out. If so, it will manufacture 16 file locations within which to spill its overflow. These are organized with the same "vectored" approach as the sort keys would dictate. If these 16 files run out of space, they will further divide into 256 files to split the workload. Anyone working with the technology will tell you that this secondary split into 256 is to be avoided if at all possible because it negatively (and significantly) affects the performance of the sort. But we see the difference here. A general-purpose sort in software needs these 256 assets to functionally complete its mission, and adding more means less performance. Now compare this to Netezza which uses not just 256, but thousands of extents to divide the data out - and rather than experience the performance drag, will derive power and lift from this.
So let's apply something additional, especially for highly active tables. If we have a nightly batch window for processing, we likely have all the time we need to keep these zone maps healthy. But if we have an intraday model or even an aggressive micro-batch model, it is possible that the zone maps may be "violated". Note above that as long as the data arrives in contiguous chunks (like transaction date) the data will be set into the extents so that we minimize the total extents for a search. If a transaction date (for example) should appear outside of this contiguous range (say a record from last week's data has arrived) then it means the high/low range for that particular extent will be extended to include the errant record. It could mean that the given extent is also included in queries that don't related to either date, because the extent is simply being included because of its high-low range. That is, every query looking for this extent's primary transaction date, the errant transaction date, or any date in between them, will include this extent in the candidate list. While this is inconsequential in most cases, if it repeats over time it can slowly degrade the table's performance. Highly active micro-batch and intraday models have a tendency to see these effects if they remain unchecked or unguarded. But not to worry - the solution is simple.
What is daunting about this is that our highly-active-table might be a multi-billion-row beast that does not lend itself to re-sorting and in an intraday model, there's no time to re-sort the data anyhow.
A materialized view has the quality of being able to apply a virtual zone map to a table without physically re-sorting the information. All we have to do is pick the columns we want to zone map, and then form the materialized view with an order-by on those columns. Rebuilding one of these is usually very fast. In addition, we can be rebuilding it even while its prior version is active. At one site, we set up a process to rebuild the view thus:
Note that for the few moment that two materialized views are applied to two tables bears no risk at all. And if we drop the original materialized view, it will wait for all pending queries to stop - meaning that this won't break anything the users are doing (another necessity for highly active intraday models)
Important here is that the materialized views are not being used as surrogate tables, but as a means to virtualize the zone maps for faster response time on the table associated with the view. Another important aspect of the materialized view is that it can zone-map additional data types that the regular zone maps do not (e.g. varchars).
Let's say that we have a very active table and even re-creating the materialized view is too much uptake to maintain it. Not to worry, the materialized view has another quality that means we don't have to refresh it each time the table changes. The materialized view maintains its contents until the underpinning table changes. We then have the option to rebuild the view or use it "as is". This "as is" is usually agreeable for the short term, since when data is added to the underpinning table, the materialized view creates an extension to itself. Now whenever we query the table, it will examine the materialized view for default zone-map information but will also examine the extent. Considering that this extra extent is very small it is also insignificant in its overhead. So we can keep this in place for say, all day long and then clean up the materialized view in the evening. The materialized view's performance will not drag over the course of a day unless we do some serious re-churning of the underpinning data (which is not normal for common micro-batch models).
And now - Cluster-based tables
With all the above preliminary stuff out of the way, a discussion on CBTs is fairly straightforward. As of the 6.0 release, Netezza has provided this as a multi-dimensional capability for even more powerful searches. The data in the CBT is physically organized into extents based on keys. And when we perform a groom operation, the data is physically reshuffled so that the keys are physically co-located on the fewest extents possible. Like a self-enforcing zone map. only a lot more powerful. In a zone map approach, the primary way to get the data contiguous is to re-sort the data and the zone maps fall out for free. In a CBT, we organize the data and the execute groom. The initial groom will take awhile, but each subsequent groom will take inconsequential time. The data in the CBT can be organized on more than one key, but the keys can behave independently like a multi-dimensional effect.
Case in point - let's say we have 200 extents of data and the particular set of data we want spans only 10 of those extents (a zone map). We will ignore the other 190 extents and focus all the machine's energy on the 10 where the data is located. But wait, what if we have an additional key upon which to search, and the combination of the first key (10 extents) and second key reduces the total extents to only 2? That is, the first key appears in 10 extents but the first and second key appear in only 2 extents of those 10. Now we search only 2 extents. Again, we have organized and configured data, not hardware, to achieve the goal. We have told Netezza where-not-to-look and now we have zoomed in on a mere 2 extents out of 200.
But let's say that we add a lot more data to the table over the year, expanding this 200 extents to many hundreds if not thousands more. Can we see now that even if this happens, we can go back to the table again and again, ask the the same question as before, and only 2 of the extents will be used to search for data. This of course dispels the vendor-promulgated propaganda that Netezza tables get slower when we add more data.
As the title suggests, one of the challenges of new Netezza users is in learning about the product, what it can (and doesn't) do, and how it applies in data warehousing. When I first published the book (Netezza Underground on Amazon.com) the impetus for the effort was just that - people asking me lots of questions leading to fairly repetitive and predictable answers. It's an appliance after all. We can apply it to a multiplicative array of solutions but on the inside, certain things stand out as immutable truths.
Of course, lots of folks are running around down here in the catacombs, convincing people that they need to read the glowing glyphs on the ancient stone walls as guides on their quest for more information. This is entirely unnecessary. The title of the book (and the blog) is a tongue-in-cheek nod to the way that some might spin the story on the machine in their own favor. Some might claim that we need lots of consulting hours to roll out the simplest solution. Consultants can help, of course (I'm one of them). But it depends on the spin and the tale that is told, that will determine the magnitude and expense of those consultants..
I'll try to provide a balance that delivers value without extraordinary expense. Yet another source of misinformation is Netezza's competitors, who like to toss "innocent" bombs here and there to direct people down the path toward their own product. All is fair in business and war, as they say, but as we bring these issues out of the darkness and into the light, the objective is to become more informed.
I am neither a Netezza nor IBM employee, so apart from compensation for actual work performed in rolling out solutions, for which people would pay me regardless of technology, I don't have any other relationship with these companies. I am a huge fan of the Netezza product and architecture and make no secret of this.
So some may have come here to ask questions about the technology or read what some seasoned experts have to say. We can do all that and a lot more. For now, let's look at the counter-intuitive nature of the product's internals.
I'll paint an imaginary picture first. Let's say we have two boxes. In one box we have thirty-two circles and in the second box we have thirty-two drums. The circles are CPUs and the drums are disk drives. They are in separate boxes, and now we draw a pipeline between them. Make it as large as you want. This depicts a standard SMP (Symmetric Multi-Processor) hardware architecture that is the common platform for data warehousing. With the exception of Teradata and Netezza, this hardware configuration is ubiquitous.
Now let's draw another mental picture. This time we'll have one box. Each of the circles will be mated with one of the drums. Now we have thirty-two circle/drum combinations. In the Netezza machine each of these is called a SPU (snippet processing unit) and represent the CPU coupled with a dedicated disk drive. Some additional hardware exists to coordinate and accelerate this combination, but this is a simplified mental depiction of the fundamental difference between the prior SMP configuration and Netezza's, an MPP (Massively Parallel Processor) configuration.
Now I used thirty-two as a simple example. In reality, Netezza can host hundreds of these cpu/disk combinations, the largest standalone frame containing over eight hundred of them, scaling to over a petabyte in storage capacity. Those of us who regularly operate on these machines are accustomed to loading, scanning and processing data by the terabyte, the smallest tables in the tens of billions of rows.
Some notes on the difference in their operation:
SMP: Typically used for transactional (OLTP) processing and is not purpose-built for data warehousing. It is a general-purpose platform. The typical bane of an OLTP engine is that it performs well on getting data inside, but is lousy on getting data out (in quantity, for reports and such).
MPP: (Netezza specifically) does not do transactional processing at all. It is purpose-built to inhale and exhale Libraries-of-Congress at a time.
SMP: Table exists logically in the database, but physically on the file system in a monolithic or contiguous table space.
MPP: Table exists logically in the database, but is physically spread out across all the SPUS. If we have 100 SPUs and want to load a table with 100,000 records, each SPU will receive 1000 records.
SMP: SQL statements are executed by the "machine" as a whole, sharing CPUs and drives. While the SQL operations may be logically and functionally "shared nothing" - the hardware is "shared-everything". In fact, CPUs could have other responsibilities too, which have no bearing on completing a SQL statement operation. In the above example, the SMP has to access the file system, draw the data into memory nearest the CPUs, then perform operations on the data. Copying data, for example, would mean drawing all the data from the tablespace into the CPUs and then pushing back down into another tablespace, but both tablespaces are on the same shared disk array.
MPP: SQL Statements are sent to all CPUs simultaneously, so all are working together to close the request. In the above example, each SPU only has to deal with 1000 records. Unlike an SMP, the data is already nearest the CPU so it doesn't have to go anywhere. The CPU can act directly on the data and coordinate as necessary with other CPUs. Copying data for example, means that the 1000 rows is copied to a locaton on the local drive. If 100 CPUs perform this simultaneously, the data is copied 100 times faster and it never leaves the disks.
SMP: Lots of overhead, knobs and tuning to make it go and keep it going. From the verbosity of the DDL to the declaration of index structures, table spaces and the like.
MPP: (Netezza) the overhead of adminstration is hidden from the user. In the above example, the user need only declare, load and utilize the table, not be concerned about managing the SPUs or disk allocation. The user's only concern is in aligning the data content with the hardware architecture, an easy task to perform and an easy state to maintain.
SMP: SQL-transforms, therefore, the insert/select operations that run so slow on an SMP platform, continue to run slow, and slower as data is added. They are not a viable option so usually would not occur to us to leverage them otherwise.
MPP: SQL-transforms leverage the MPP and do not slow as the data grows.
In fact, many people purchase the Netezza platform in context of its competitors, who don't (nor can they) use SQL transforms to affect data after arrival.
If we purchase Netezza as a load-and-query platform only, we have missed a special capability of the machine that differentiates it from the other products. If we leverage this power, we re-balance the workload of the overworked ETL tools, in some cases eliminating them altogether. Netezza calls this practice "bringing everything under air", that is, bring the data as-is into the machine and do the heavy-lifting transforms after it's inside.
The Brightlight Accelerator Framework is one example of a flow-based, run-time harness for deploying high-performance, metadata-driven SQL-transforms. While we have matured this capability over time, the consistency of the Netezza platform is the key to its success.
SMP:Scalability is through extreme engineering, leading to hardware upgrade.
MPP:Scalability is a function of data organization as it leverages the hardware's power. Upgrading hardware is therefore rarely the first option of choice.
SMP:Constrained by index structures and the shared-everything hardware architecture through which all data must pass
MPP:Constrained by human imagination, not index structures (there aren't any) and no shared hardware.
We have noted that with traditional SMP architectures, when the machine starts to run out of power, engineers will swarm the machine and start to instantiate exotic data structures and concepts that serve only as performance props, not the 'elegant' model once installed by the master modelers. As the box continues to decline, more engineering ensues, eliciting even stranger approaches to data storage and management. Soon it becomes so functionally brittle that it cannot handle any more functionality, so people resort to workarounds and bolt-ons. Functionality that should be a part of the solution is now outside of it, and functionality that never should have been inside (mainly for performance) has taken up permanent residence.
In a Netezza machine we have so much power available that traditional modeling approaches (e.g. 3NF and Dimensional) may have a defacto home, but now we can examine and deploy other concepts that may be more useful and scalable. Approaches that we never had the opportunity to examine before, because the modeling tool did not (and does not) support it and the natural constraints of the SMP-based engine artificially constrain us to index-based data management.
In addition, the Netezza machine operates on a counter-intuitive principle of finding information based on "where not to look". With this principle, let's say we have a terabyte of data and we know where our necessary information is, based on where-not-to-look. See how this won't change even if we add hundreds of terabytes to the system? If I add another 99 terabytes to the same system, the query will still return with the same answer in the same duration, because the data I'm looking for doesn't appear in those 99 terabytes and the machine already knows this.
For example, if we go over to Wal-Mart and we know where the kiosk is for the special-buys-of-the-day, we can find this easily. It's in a familiar location. What if tomorrow they expand the same Wal-Mart to ten times its current size? Will it affect how long it takes me to find that kiosk? Clearly not. Netezza operates the same way, in that no matter how much data we add, we don't have to worry about scanning through all of it to find the answer.
And when it boils down to basics, the where-not-to-look is the only way to scale. No other engine, especially not one that is completely dependent upon index structures to locate information, can scale to the same heights as Netezza.
Well, some of the stuff above is provocative and may elicit commentary. I'll continue to post as time progresses, so the data won't seem so much like underground information, after a while anyhow.
Modified on by DavidBirmingham
Many years ago we encountered an environment where the client wanted the old system refactored into the new. The "new" here being the Netezza platform and the "old" here being an overwhelmed RDBMS that couldn't hope to keep up with the workload. So the team landed on the ground with all hopes high. The client had purchased the equivalent of a 4-rack striper for production and a 1-rack Striper for development. Oddly, the same thing happened here as happens in many places. The 4-rack was dispatched to the protected production enclave and the 1-rack was dropped into the local data center with the developers salivating to get started. And get started they did.
The first team inherited about half a terabyte of raw data from the old system and started crunching on it. The second team, starting a week later, began testing on the work of the first team. A third team entered the fray, building out test cases and a wide array of number-crunching exercises. While these three teams dogpiled onto and hammered the 1-rack, the 4-rack sat elsewhere, humming with nothing to do.
We know that in any environment we encounter, with any technology we can name, the development machines are underpowered compared to the production environment. And while the production environment has a lot of growing priorities for ongoing projects, we don't have this scenario for our first project, do we? Our first project has a primary, overarching theme: it is a huge bubble of work that we need to muscle-through with as much power as possible. That "as much as possible" in our case, was the 4-rack sitting behind the smoked glass, mocking us.
And this is the irony - for a first project we have a huge "first-bubble" of work before us that will never appear again. the bubble includes all the data movement, management and backfilling of structures that we will execute only once, right? Really? I've been in places where these processes have to be executed dozens if not hundreds of times in a development or integration environment as a means to boil out any latent bugs prior to its maiden - and only - conversion voyage. But is this a maiden-and-only voyage? Hardly - typically the production guys will want to make several dry runs of the stuff too. We can multiply their need for dry runs with ours, because we have no intention of invoking such a large-scale movement of data without extensive testing.
And yet, we're doing it on the smaller machine. No doubt the 1-rack has some stuff - but I've seen cases where it might take us two weeks to wrap up a particularly heavy-lifting piece of logic. If we'd done this on the larger 4-rack, we would ahve finished it in days or less. Double the power, half the time-to-deliver (when the time is deliver is governed by testing)
In practically every case of a data warehouse conversion, the actual 'coding' and development itself is a nit compered to the timeline required for testing. I've noted this in a number of places and forms, in that the testing load for a data warehouse conversion is the largest and most protracted part of the effort. And if testing (as in our case) is largely loading, crunching and presenting the data, we need the strongest possible hardware to get past the first bubble. A data conversion project is a "testing" project more so than a "development" project, and with the volumes we'll ultimately crunch, hardware is king.
But I've had this conversation with more people than I can count. Why can't you deploy the production environment with all its power, for use in getting past the first bubble, then scratch the system and deploy for production? What is the danger here? I know plenty of people, some of them vendor product engineers, who would be happy to validate such a 'scratch' so that the production system arrives with nothing but its originally deployed default environment.
Yet another philosophy is that we would pre-configure the machine for production deployment, but nobody likes developers doing this kind of thing in a vacuum. They would rather see deployment/implementation scripts that "promote" the implementation. I'm a big fan of that, too, for the first and every following deployment. That's why I would prefer we used the production-destined system to get past the first-bubble-blues, then scratch it, and get the original environment standing up straight, and only then treat it as an operational production asset.
Most projects like this have a very short runway in their time-to-market, and we do a disservice to the hard-working folks who are doing their best to stand up this environment, They need all the power they can get, especially when they enter the testing cycle.
And for this, it's an 80/20 rule for every technical work product we will ever produce. Take a look sometime at what it takes to roll out a simple Java Bean, or a C# application, or a web site. Part of the time is spent in raw development, and part of it in testing. If I include the total number of minutes spent by the developer in unit testing, and then by hardcore testers in a UAT or QA environment, and it is clear that the total wall-clock hours spent in producing quality technology breaks into the 80/20 rule - 20 percent of the time is spent in development, and 80 percent in testing.
And if the majority of the time is spent in testing, what are we testing on Enzee space? The machine's ability to load, internally crunch and then publish the data. On a Netezza machine, this last operation is largely a function of the first two. But we have to test all the loading don't we? And when testing the full processing cycle we have to load-and-crunch in the same stream, no? What does it take to do this? Hardware, baby, and lots of it.
I can say that multiple small teams can get a lot of "ongoing" work done on a 1-rack, no doubt a very powerful environment. I can also say that a machine like this, for multiple teams in the first-bubble effort, will gaze longingly at the 4-rack in the hopes they can get to it soon, because so much testing is still before them, and they need the power to close.
What are some options to make this work? Typically the production controllers and operators don't like to see any "development" work in the machines that sit inside the production enclosure. They want tried-and-tested solutions that are production-ready while they're running. At the same time, they have no issues with allowing a pre-production instance into the environment because they know a pre-production instance is often necessary for performance testing. Here's the rub: the entire conversion and migration is one giant performance test! So designating the environment as pre-production isn't subtle, nuanced, disingenuous or sneaky - it accurately defines what we're trying to do. It's a performance-centric conversion of a pre-existing production solution, now de-engineered for the Netezza machine. As I noted, development is usually a nit, where the testing is the centerpiece of the work.
With that, Netezza gives us the power to close, to handily muscle-through this first-bubble without the blues - we only hurt ourselves with "policies" for the environment that are impractical for the first-bubble.
This brings us full-circle yet again to a common problem with environments assimilating a Netezza machine. The scales and protocols put pressure on policies, because those policies are geared for general-purpose environments. There's nothing wrong with the policies, they protect things inside those general-purpose environments. But the same policies that protect things in general-purpose realm actually sacrifice performance in the Netezza realm. Don't toss those policies - adapt them.