Modified on by DavidBirmingham
One of the most significant questions to answer on the Netezza platform is that of "distribution" - what does it mean? how does it apply? and all that. If you are a Netezza aficionado, be warned that the following commentary serves as a bit of a primer for the un-initiated. However, feel free to make comments or suggestions to improve it, or pass it on if you like.
In the prior entry I offered up a simple graphic (right) as to how the Netezza architecture is laid out internally.
In this depiction, each CPU is harnessing its own disk drive. In the original architecture, this was called a SPU, because the CPU was coupled with a disk drive and an FPGA (field-programmable gate array) that holds the architectural firmware "secret sauce" of the Netezza platform. The FPGA is still present and accounted for in the new architecture as the additional blade(s) on the blade server. Not to over-complicate things, so let's look at the logical and physical layout of the data itself and this will be a bit clearer.
When we define a table on the Netezza host, it logically exists in the catalog once. However, with the depicted CPU/disk combinations (right) we can see 16 of these available to store the information physically. If the table is defined with a distribution of "random", then the data will be evenly divided between all 16 of them. Let's say we have a table arriving that has 16 million records. Once loaded, each of these SPUs would be in control over 1 million records. So the table exists logically once, and physically multiple times. For a plain straight query on this configuration, any question we ask the table will initate the same query on all SPUs simultaneously. They will work on their portion of the data and return it an answer. This Single-Instruction-Multiple-Data model is a bread-and-butter power strategy of Massively Parallel computing. It is powerful because at any given moment, the answer we want is only as far away as the slowest SPU. This would mean as fast as the SPU can scan 1 million records, this is the total duration of the query. All of the SPUs will move at this speed, in parallel, so will finish at the same time.
What if we put another table on the machine that contains say, 32 million rows? (trying to keep even numbers here for clarity). The machine will load this table such that each SPU will contain 2 million rows each. If no other data is on the machine, we effectively have 3 million records per SPU loaded, but the data itself may be completely unrelated. Notice how we would not divide the data another way, like putting 16 million records on 6 of the SPUs and the other 32 million records on the remaining 10 SPUs. No, in massively parallel context, we want all the SPUs working together for every query. The more SPUs are working, the faster the query will complete.
Now some understand the Massively Parallel model to be thus: Multiple servers, each containing some domain of the information, and when a query is sent out it will find its way to the appropriate server and cause it to search its storage for the answer. This is a more stovepiped model than the Netezza architecture and will ultimately have logistics and hardware scalability as Achilles's heels, not strengths for parallel processing. Many sites are successful with such models. But they are very application-centric and not open for reuse for other unrelated applications. I noted above that the information we just loaded on the Netezza machine can be in unrelated tables, and databases and application suites, because the Netezza machine is general-purpose when it comes to handling its applicability for data processing. It is purpose-built when we talk about scale.
But let's take the 16-million-record vs 32-million record model above and assume that one of the tables is a transactional table (the 32 million) and one is a dimensional table (the 16 million). The dimensional table contains a key called "store_id" that is represented also on the transactional table such that we can join them together in various contexts. Will the Netezza machine do this, and how will it? After all, the data for the tables is spread out across 16 SPUs.
Well, we have the option of joining these "as is" or we can apply an even more interesting approach, that of using a distribution key. Here's where we need to exercise some analysis, because it appears as though the STORE_ID is what we want for a distribution, but this might skew the data. What does this mean? When we assign a distribution, the Netezza machine will then hash the distribution key into one of 16 values. Every time that particular key appears, it will always be assigned the same value and land on the same SPU. So now we can redistribute both of these random tables on STORE_ID and be certain that for say, SPU #5, all of the same store_id's for the dimension table and for the transaction table are physically co-located on the same disk. You probably see where this is going now.
Ahh, but what if we choose store_id and it turns out that a large portion of the IDs hash to a common SPU? What if we see, rather than 1 million records per SPU, we now see around 500k per SPU with one SPU radically overloaded with the remainder? This will make the queries run slow, because while all the SPUs but one will finish fast, in half the time of before, they will all be waiting on the one overworked SPU with all the extra data. This is called data skew and is detrimental to performance.
However, had we chosen a date or timestamp field, our skew may have been even worse. We might see that the data is physically distributed on the SPUs in a very even form. But when we go to query the data, we will likely use a date as part of the query, meaning that only a few SPUs, perhaps even one SPU, will actually participate in the answer while all the others sit idle. This is called process skew and is also detrimental to performance.
Those are just some things to watch out for. If we stay in the model of using business keys for distribution and using dates for zone maps (which I will save for another entry) we will usually have a great time with the data and few worries at all.
Back to the configuration on store_id's. If we now query these two tables using store_id in the query itself, Netezza will co-locate the join on the SPU and will only allow records to emit from this join if there is a match. This means that the majority of the work will be happening in the SPU itself before it ever attempts to do anything else with it. What if we additionally clamped the query by using a date-field on the transaction table? The transaction table would then be filtered on this column, further reducing the join load on the machine and returning the data even faster.
So here is where we get lifting effects where other machines experience drag. For an SMP-based machine, adding joins can make the query run slower. On a Netezza machine, joins can serve as filter-effects, limiting the data returned by the query and increasing the machine's immediate information on where-not-to-look. Likewise the date-filter is another way we tell the machine where-not-to-look. As I have noted, the more clues we can give the machine as to where-not-to-look, the faster the query will behave.
I have personally witnessed cases where adding a join to a large table actually decreases the overall query time. Adding another one further decreases it. Adding another one even further and so on - five-joins later we were getting returns in less than five seconds where the original query was taking minutes. This filter-table and filter-join effect occurs as a free artifact of the Netezza architecture.
Distribution is simply a way to lay the data out on the disks so that we get as many SPUs working for us as possible on every single query. The more SPUs are working, the more hardware is being applied to the problem. The more hardware, the more the physics is working for us. Performance is in the physics, always and forever. Performance is never in the software.
While the above is powerful for read-query lift (co-located reading) there is another power-tool called the co-located write. If we want to perform an insert-select operation, or a create-table-as-select, we need to be mindful that certain rules apply when we create these tables. For example, if we want to perform a co-located read above and drop the result into an intermediate table, it is ideal for the intermediate table to be distributed on the same key as the tables we are reading from. Why is this? Because the Netezza machne will simply read the data from one portion of the disk and ultimately land it to another portion of the same disk. Once again it never leaves the hardware in order to affect the answer. If we were to distribute the target table on another key, the data will have to leave the SPU as it finds a new SPU home to align with its distribution key. If we actually need to redistribute, this is fine. But if we don't and this is an arbitrary configuration, we can buy back a lot of power just by co-locating the reads and writes for maximum hardware involvement.
So that's distribution at a 50,000 foot level. We will look at more details later, or hop over to the Netezza Community to my blog there where some of these details have already been vetted.
It is important to keep in mind that while distribution and co-location are keys to high(er) performance on the machine, the true hero here is the SPU architecture and how it applies to the problem at hand. We have seen cases where applying a distribution alone has provided dramatic effects, such as 30x and 40x boost because of the nature of the data. This is not typical however, since the co-located reads will usually provide only a percentage boost rather than an X-level boost. This is why we often suggest that people sort of clear-their-head of jumping directly into a distribution discussion and instead put together a workable data model with a random distribution. Once loaded, profile the various key candidates to get a feel for which ones work the best and which ones do not. We have seen some users struggle with the data only because they prematurely selected a distribution key that - unbeknownst to them - had a very high skew and made the queries run too slow. This protracted their workstreams and made all kinds of things take longer than they should have.
So at inception, go simple, and then work your way up to the ideal.
On multiple distribution keys:
Many question arise on how many distribution keys to use. Keep in mind that this is as much a physical choice as a functional one. If the chosen key provides good physical distribution, then there is no reason to use more keys. More granularity in a good distribution has no value. However, a distribution key MUST be used in a join in order to achieve co-location. So if we use multiple keys, we are committing to using all of them in all joins, and this is rarely the case. I would strongly suggest that you center on a single distribution key and only move to more than one if you have high physical skew (again, a physical not functional reason). The distribution is not an index - it is a hash. In multi-key distributions, the join does not first look at one column then the next - it looks at all three at once because they are hashed together for the distribution. Joined-at-the-hip if you will.
On leaving out a distribution key:
One case had three tables joining their main keys to get a functional answer. They were all distributed on the same key, which was a higher-level of data than the keys used in the join. The developer apparently thought that because the distribution keys are declared that the machine will magically use them behind the scenes with no additional effort from us. This is not the case. If the distribution key(s) (all of them) are not mentioned in the join, the machine will not attempt co-location. In this particular case, using the higher-level key would have extended the join somewhat but would not have changed the functional answer. Simply adding this column to the join reduced that query's duration by 90 percent. So even if a particular distribution key does not "directly" participate in the functional answer, it must directly participate in the join so as to achieve co-location. And if this does not change the functional outcome, we get the performance and the right answer.
How it affects concurrency:
Many times people will ask: Why is it that the query runs fast when it's running alone, but when it's running side-by-side with another instance of itself they both slow to a crawl? This is largely due to the lack of co-location in the query. When the query cannot co-locate, it must redistribute the data across the inter-blade network fabric so that all the CPUs are privy to all the data. This quickly saturates the fabric so that when another query launches, they start fighting over fabric bandwidth not the CPU bandwidth. In fact some have noted that the box appears to be asleep because the CPUs and drives aren't doing anything during this high-crunch cycle. That's right, and it's a telltale sign that your machine's resources are bottlenecked at the fabric and not the CPU or I/O levels. By co-locating the joins, we keep the data off the fabric and down in the CPUs where it belongs, and the query will close faster and can easily co-exist with other high-intensity queries.
Modified on by DavidBirmingham
I hear a lot of feedback on the use of CDC to put data into a PureData for Analytics, Powered By Netezza Technology device. In the other machines (traditional database engines) the data flies into the box, the CDC is on-and-off the machine in seconds. But in my Netezza machine, the CDC seems to grind. I have it running every fifteen minutes, they say, and the prior CDC instance is still running when the next instance kicks off. This is totally unacceptable. Maybe we shouldn't be using CDC for this?
Or maybe they just don't have it configured correctly?
There are two major PDA principles in play here. One is strategic and the other is tactical. Many people can look at the tactical principle and accept it because it is testable, repeatable and measurable. The strategic one however, they will hold their judgment on because it does not fit their paradigm of what a database should do. I'll save the strategic one for last, because its implications are further reaching.
The CDC operation will accumulate records into a cache and then apply these at the designated time interval. This micro-batch scenario fits Netezza databases well. The secondary part of this is that the actual operation will include a delete/insert combination to cover all deletes, updates and inserts. So when the operation is complete, the contents of the Netezza table will be identical to the contents of the source table at that point in time (even though we expect some latency, that's okay).
The critical piece is this: An update operation on a Netezza table is under-the-covers a full-record-delete and full-record-insert. It does not update the record in place. A delete operation is just a subset of this combination. This is why the CDC's delete/insert combination is able to perfectly handle all deletes, updates and inserts. The missing understanding however, is the distribution key.
If we have a body of records that we need to perform a delete operation with against another, larger table, and the larger table is distributed on RANDOM, think about what the delete operation must do in a mechanical sense. It must take every record in the body of incoming records and ship it to every SPU so that the data is visible to all dataslices. It must do this because the data is random and it cannot know where to find a given record to apply the operation - it could literally be anywhere and if the record is not unique, could exist on every dataslice. It's random after all. This causes a delete operation (and by corollary an update operation) to grind as it attempts to locate its targets.
Contrast this to a table that is distributed on a key, and we actually use the key in the delete operation (such as a primary key). The incoming body of records is divided by key, and only that key's worth of data is shipped to the dataslice sharing that key - the operation is lightning-fast. This is why we say - never, ever perform a delete or update on a random table, or on a table that doesn't share the distribution key of the data we intend to affect. Deletes and Updates must be configured to co-locate, or they will grind.
Now back to the CDC operation. Whenever I hear that the CDC operation is grinding, my first question is: Do you have the target Netezza tables distributed on the same primary keys of the source table? The answer is invariably no (we will discover why in a moment). So then I ask them, what would it take to get the tables distributed on the primary key? How much effort would it be? And they invariably answer, well, not much, but it would break our solution.
Why is that?
Because they are reporting from the same tables that the CDC is affecting. And when reporting from these tables, the distribution key has to face the way the reporting users will use the tables, not the way CDC is using the tables. This conversation often closes with a "thank you very much" because now they understand the problem and see it as a shortcoming of Netezza or CDC, but not a shortcoming of how they have implemented the solution.
Which brings us to the strategic principle: There is no such thing as a general purpose database in Netezza.
What are we witnessing here? The CDC is writing to tables that should be configured and optimized for its use. They are not so, because the reporting users want them configured and optimized for their own use. They are using the same database for two purposes because they are steeped in the "normalization" protocol prevalent in general-purpose systems - that the databases should be multi-use or general-purpose.
But is this really true in the traditional databases? If we were using Oracle, DB2, SQLServer - to get better performance out of the data model wouldn't we reconfigure it into a star schema and aggressively index the most active tables? This moves away from the transactional flavor of the original tables to a strongly purpose-built flavor.
Why is it that we think this model is to be ignored when moving to Netezza? Oddly, Netezza is a Data Warehouse Appliance - it was designed to circunscribe and simplify the most prevalent practices of data warehousing - not the least of which - is the principle that there is no such thing as a general-purpose database. In a traditional engine we would never attempt to use transactional tables for reporting - they are too slow for set-based operations and deep-analytics. Yet over here in the Netezza machine, somehow this principle is either set-aside or ignored - or perhaps the solution implementors are unaware of it - and so these seemingly inexplicable grinding mysteries arise and people scratch their heads and wonder what's wrong with the machine.
And again, they never wonder what's wrong with their solution.
If we take a step back, what we will see are reports that leverage the CDC-based tables, but we will see a common theme, which I will cover in a part-2 of this article. The theme is one of "re integration" versus "pre-integration". That is, integration-on-demand rather than data that is pre-configured and pre-formulated into consumption-ready formats. What is a symptom of this? How about a proliferation of views that snap-together five or more tables with a prevalence of left-outer-joins? Or a prevalence of nested views (five, ten, fifteen levels deep) that attempt to reconfigure data on-demand (rather than pre-configure data for an integrate-once-consume-many approach?) Think also about the type of solution that performs real-time fetches from source systems, integrates the data on-the-fly and presents it to the user - this is another type of integration-on-demand that can radically debilitate source systems as they are hit-and-re-hit for the very same set-based extracts dozens or hundreds of times in a day.
I'll take a deep-dive on integration-on-demand in the next installment, but for now think about what our CDC-based solution has enticed us to do: We have now reconfigured the tables with a new distribution key that helps the reports run faster, but because this deviates from the primary-key design of the source tables (which CDC operates against) then the CDC operation will grind. And when it grinds, it will consume precious resources like the inter-SPU network fabric. The grinding isn't just a duration issue - it's inappropriately using resources that would otherwise be available to the reporting users.
What's missing here is a simple step after the CDC completes. Its a really simple step. It will cause the average "purist" data modeler and DBA to retch their lunch when they hear of it. It will cause the admins of "traditional" engines to look askance at the Netezza machine and wonder what they could have been thinking when they purchased it. But the ultimate users of the system, when they see the subsecond response of the reports and way their queries return in lightning fashion compared to the tens-of-minutes, or even hours - of the prior solution, these same DBAs, admins and modelers will want to embrace the mystery.
The mystery here is "scale". When dealing with tables that have tens of billions, or hundreds of billions of records, the standard purist protocols that rigorously and faithfully protect capacity in the traditionl engines - actually sacrifice capacity and performance in the Netezza engine. It's not that we want to set aside those protocols. We just want to adapt them for purposes of scale.
The "next step" we have to take is to formulate data structures that align with how the reporting users intend to query the data, then use the CDC data to fill them. It's not that the CDC product can do this for you. It gets the data to the box. This "next step" in the process is simply forwarding the CDC data to these newly formulated tables. When this happens, the pre-integration and pre-calculation opportunities are upon us, and we can use them to reduce the total workload of the on-demand query by including the pre-integration and pre-calculation into the new target tables. These tables are then consumption-ready, have far fewer joins (and the need for left-outer joins often fall by the way-side). After all, why perform the left-outer operations on-demand if we can perform them once, use Netezza's massively parallel power for it, and then when the users ask a question, the data they want is pre-formulated rather than re-formulated on demand.
This necessarily means we need to regard our databases in terms of "roles". Each role has a purpose - and we deliberately embrace the notion of purpose-built schemas, and deliver our solution from the enslavement of a general-purpose model. The CDC-facing tables with support CDC - we won't report from them. The reporting tables face the user - we won't CDC to them.
Keep in mind that this problem (of CDC to Netezza) can rear its head with other approaches also - such as streaming data with a replicator or ETL tool to simulate the same effect of CDC. Either way, the data arrives in the Netezza machine looking a lot like the source structures and aren't consumption-ready.
I worked with a group some years ago with a CDC-like solution, and they took the "next step" to heart, formulated a set of target tables that were separate from staging and then used an ETL tool to fill them. The protocol was simply this: The ETL tool sources the data and fills the staging tables, then the ETL sources the staging tables and fills the target tables. This provided the necessary separation, so functionally fulfilled the mission. The problem with the solution however, was that for the transformation leg, the ETL tool was yanking the data from the machine into the ETL tool's domain, reformulating it and then pushing it back onto the machine. The data actually met itself coming-and-going over the network. A fully parallelized table was being serialized, crunched and then re-paralellized into the machine. As the data grew, this operation became slower and slower. That's what we would expect right? The bottleneck now is the ETL tool. The proper way to do this, if an ETL tool must be involved, is to leverage it to send SQL statements to the machine, keep the data inside the box. The Netezza architecture can process data internally far faster than any ETL tool could ever hope - so why take it out and incur the additional penalty of network transportation?
The ETL tool aficionados will balk at such a suggestion because it is such a strong deviation from their paradigm. But this is why Netezza is a change-agent. It requires things that traditional engines do not because it solves problems in scales that traditional engines cannot. In fact, performing such transformations inside a traditional engine would be a very bad idea. The ETL tools are all configured and optimized to handle transformation one way - outside the box. This is because it is a general-purpose tool and works well with general-purpose engines. There is a theme here: the phrase "general purpose" has limited viability inside the Netezza machine. If we embrace this reality with a full head of steam, the Netezza technology can provide all of our users with a breathtaking experience and we will have a scalable and extensible back-end solution.
Modified on by DavidBirmingham
How many logical modelers does it take to screw in a lightbulb? None, it's a physical problem.
I watched in dismay as the DBAs, developers and query analyts threw queries at the machine like darts. They would formulate the query one way, tune it, reformulate it. Some forms were slightly faster but they needed orders-of-magnitude more power. Nothing seemed to get it off the ground.
Tufts of hair flew over the cubicle walls as seasoned techologists would first yelp softly when their stab-at-improvement didn't work, then yelp loudly as they pulled-out more hair. Yes, they were pulling-their-hair-out trying to solve the problem the wrong way. Of course, some of us got bald the old-fashioned way: general stress.
I took the original query as they had first written it, added two minor suggestions that were irrelevant to the functional answer to the query, and the operation boosted into the stratosphere.
What they saw however, was how I manipulated the query logic, not how I manipulated the physics. I can ask for a show of hands in any room of technologists and get mixed answers on the question "where is the seat of power?". Some will say software, others hardware, others a mix of the two, while those who still adhere to the musings of Theodoric of York, Medieval Data Scientist, would say that there's a small toad or dwarf living in the heart of the machine.
To make it even more abstract, users sit in a chair in physical reality. They live by physical clocks in that same reality, and "speed of thought" analytics is only enabled when we can leverage physics to the point of creating a time-dilation illusion: where only seconds have passed in the analyst's world while many hours have passed in reality. After all, when an analyst is immersed in the flow of a true speed-of-thought experience, they will hit the submit-key as fast as their fingers can touch the keyboard, synthesize answers, rinse and repeat. And do this for many hours seemingly without blinking. If the machine has a hiccup of some kind, or is slow to return, the illusion-bubble is popped and they re-enter the second-per-second reality that the rest of us live in. Perhaps their hair is a little grayer, their eyes a little more dilated, but they will swear that they have unmasked the Will-O-The-Wisp and are about to announce the Next Big Breakthrough. Anticipation is high.
But for those who don't have this speed-of-thought experience, chained to a stodgy commodity-technology, they will never know the true meaning of "wheels-up" in the analytics sense. They will never achieve this time-dilated immersion experience. The clock on the wall marks true real-time for them, and it is maddening.
Notice the allusions to physics rather than logic. We don't doubt that the analyst has the logic down-cold. But logic will not dilate time. Only physics can do that. Emmett Brown mastered it with a flux capacitor. We don't need to actually jump time, but a little dilation never hurt anybody.
The chief factor in query-turnaround within the Netezza machine is the way in which the logical structures have been physically implemented. We can have the same logical structure, same physical content, with wildly different physical implementations. The "distribute on" and "organize on" govern this factor through co-location at two levels: Co-location of multiple tables on a common dataslice and co-location (on a given dataslice) of commonly-keyed records on as few disk pages as possible (zone maps). The table can contain the same logical and physical content, but its implementation on Netezza physics can be radically different based on these two factors.
Take for example the case of a large-scale fund manager with thousands of instruments in his portfolio. As his business grows, he crosses the one-million mark, then two million. His analytics engine creates two million results for each analytics run, with dozens of analytics-runs every day, adding up quickly to billions of records in constant churn. His tables are distributed on his Instrument_ID, because in his world, everything is Instrument-centric. All of the operations for loading, assimilating and integrating data centers upon the Instrument and nothing else. They are organized on portfolio_date, because the portfolio's activity governs his operations.
His business-side counterpart on the other hand, sells products based on the portfolio. The products can serve to increase the portfolio, but many of the products are analytics-results from the portfolio itself. This is a product-centric view of the data. Everything about it requires the fact-table and supporting tables to be distributed on the Product_id plus being organized on the product-centric transaction_date. This aligns the logical content of the tables to the physics of the machine. It also aligns the contents of the tables with the intended query path of the user-base. One of them will enter-and-navigate via the Instrument, where the other will use the Product.
We can predict the product manager's conversation with the DBAs:
PM: "I need a version of the primary fact table in my reporting database."
DB: "You mean a different version of the fact? Like different columns?"
PM: "No, the same columns, same content."
DB: "Then use the one we have. We're not copying a hundred billion rows just so you can keep your own version."
PM: "Well, it's logically the same but the physical implementation is completely different."
DB: "Oh really? You mean that instead of doing an insert-into-select-from, we'll move the data over by carrier pigeon?"
PM: (staring) "No, the new table has a different distribution and organize, so it's a completely different physical table than the original"
DB: "You're just splitting semantic hairs with me. Data is data."
I watched this conversation from a healthy distance for nearly fourteen months before the DBA acquiesced and installed the necessary table. Prior to this, the PM had been required to manufacture summary tables and an assortment of other supporting tables in lieu of the necessary fact-table. The user experience suffered immensely during this interval, many of them openly questioning whether the Netezza acquisition had been a wise choice. But once the new distribution/organize was installed, user-queries that had been running in 30 minutes now ran in 3 seconds. Queries that had taken five minutes were now sub-second. Where before only five users at a time could be hitting the machine, now twenty or more enjoyed a stellar experience.
How does making a copy of the same data make such a difference? Because it's not really a copy. When we think "copy" we think "photocopy", that is "identical". A DBA will rarely imagine that using a different distribution and organize will create a version of the table that is, in physical terms, radically different from its counterpart. They see the table logically, just a reference in the catalog.
The physics of the Netezza machine is unleashed with logical data structures that are configured to leverage the physical power of the machine. Moreover, the physical implementation must be in synergy (and harmony) with how the users intend to consume them with logical queries. In the above case, the Instrument-centric consumer had a great experience because the tables were physically configured in a manner that logically dovetailed with his query-logic intentions. The Product-centric manager however, had a less-than-stellar experience because that same table had not been physically configured to logically dovetail with his query-logic intentions. The DBA had basically misunderstood that the Netezza high-performance experience rests on the synergy between the logical queries and physical data structures.
In short, each of these managers required a purpose-built form of the data. The DBA thinks in terms of general-purpose and reuse. To him, capacity-planning is about preserving disk storage. He would never imagine that sacrificing disk storage (to build the new table) would translate into an increase in the throughput of the machine. So while the DBA is already thinking in physical terms, he believes that users only think in logical terms. Physics has always been the DBA's problem. Who do those "logical" users think they are, coming to the DBA to offer up a lecture on physics?
In this regard, what if the DBA had built-out the new table but the PM's staff had not included the new distribution key in their query? Or did not leverage the optimized zone-maps as determined by the Organize-On? The result would be the same as before: a logical query that is not leveraging the physics of the machine. At this point, adding the distribution key to the query, or adding filter attributes, is not "tuning" but "debugging". Once those are in place, we don't have to "tune" anything else. Or rather, if the data structures are right, no query tuning is necessary. If the data structures are wrong, no query tuning will matter.
And this is why the aforementioned aficionados were losing their hair. They truly believed that the tried-and-true methods for query-tuning in an Oracle/SQLServer machine would be similar in Netezza. Alas - they are not.
What does all of this mean? When a logical query is submitted to the machine, it cannot manufacture power. It can only leverage or activate the power that was pre-configured for its use. This is why "query-tuning" doesn't work so well with Netezza. I once suggested "query tuning in Netezza is like using a steering wheel to make a car go faster." The actual power is under the hood, not in the user's hands. While the user can leverage it the wrong way, the user cannot, through business-logic queries, make the machine boost by orders-of-magnitude.
Where does the developer/user/analyst need to apply their labor? They already know how they want to navigate the data, so they need to work toward a purpose-built physical implementation, using a logical model to describe and enable it. Notice the reversal of roles: the traditional method is to use a logical model to "physicalize" a database. This is because in a commodity platform (and a load-balancing engine) the physics is all horizontal and shared-everything. We can affect the query turnaround time using logical query statements because we can use hints and such to tell the database how to behave on-demand.
We cannot tell Netezza how to "physically" behave on-demand in the same way. We can use logical query statements to leverage the physics as we have pre-configured it, but if the statement uses the tables outside of their pre-configured physics, the user will not experience the same capacity or turnaround no matter how they reconfigure or re-state the query logic.
All of this makes a case for purpose-built data models leading to purpose-built physical models, and the rejection of general-purpose data models in the Netezza machine. After all, it's a purpose-built machine quite unlike its general purpose, commodity counterparts in the marketplace. In those machines (e.g. Oracle, SQLServer) we have to engineer a purpose-built model (such as a star-schema) to overcome the physical limitations of the general-purpose platform. Why then would we move away from the general-purpose machine into a purpose-built machine, and attempt to embrace a general-purpose data model?
Could it be that the average Netezza user believes that the power of the machine gives it a magical ability to enable a general-purpose model in the way that the general-purpose machines could not? Ever see a third-normal-form model being used for reporting in a general-purpose machine? It's so ugly that they run-don't-walk toward engineering a purpose-built model, photocopying data from the general-purpose contents into the purpose-built form. No, the power of the Netezza machine doesn't give it magical abilities to overcome this problem. A third-normal form model doesn't work better in Netezza than a purpose-built model.
Enter the new solution aficionado who wants their solution to run as fast as the existing solution. They will be told, almost in reflex by the DBA that they have to make their solution work with the existing structures, even though they don't leverage physics in the way the new solution will need it. And this is the time to make a case for another purpose-built model. One that faces the new user-base with tables that are physically optimized to support that user base. Will all tables have to come over? Of course not. Will all of the data of the existing fact table(s) have to come over? Usually not, which is silver lining of the approach.
But think about this: The tables in Netezza are 4x compressed already. If we make another physical copy of the table, itself being 4x compressed, the data is still (on aggregate) 2x compressed across the two tables. That is, the data is doubled at 4x compression, so it only uses the same amount of space as the original table would have if it were only 2x compressed. In this perspective, it's still ahead of the storage -capacity curve. And in having their physics face the user base, we preserve machine capacity as well.
This is perhaps the one most-misunderstood tradeoff of requiring multiple user bases to use the same tables even though their physical form only supports one of the user-bases. And that is simply this: When we kick off queries that don't leverage the physics, we scan more data on the dataslices and we broadcast more data between the CPUs. This effectively drags the queries down and saturates the machine. The query drag might be something only experienced by the one-off user base, but left to itself the machine capacity saturation will affect all users including the ones using the primary distribution. Everyone suffers, and all for the preservation of some disk space. Trust me, if there is a question between happy and productive users versus burning some extra disk space, it's not a hard decision. Preserving disk storage in the heat of unhappy users is a bad tradeoff.
Or to make an absurd analogy, let's say we show up for work on day one, have a great day of onboarding and when we leave, we notice that our car is a little "whiney". Taking it to a shop, he tells us that someone has locked the car in first-gear and he can't fix it. We casually make this complaint the next day (it took us a little longer to get to work).
DBA: Oh sure, all new folks have their car put in first gear. It's a requirement.
US: (stunned) What the?
DBA: Well, if you had been here first, you could keep all the gears, but everyone we've added since then has to be in first gear for everything.
DBA: Yes, first-gear for your car, your development machine, even your career ladder. About the only thing that they don't put in first-gear are your work hours. Those are unlimited.
US: That's outrageous!
DBA: We can't give everyone all the gears they want. It's just not scalable.
The problem with working with tables that aren't physically configured as-we-intend-to-use-them is that using them will cause the machine to work much harder than it has to. Not only will our queries be slower, we can't run as many of them. And while we're running, those folks with high-gear solutions in the same machine will start to look like first-gear stuff too. The inefficiency of our work will steal energy from everyone. We cannot pretend that the machine has unlimited capacity. If our solution eats up a big portion of the capacity then there's less capacity for everyone else. Even if we use workload management, whatever we throttle the poorly-leveraged solution into will only make it worse, because if a first-gear solution needs anything, it most certainly uses more capacity than it would normally require.
Energy-loss is the real cost of a poor physical implementation. All solutions start out with a certain capacity limit (same number of CPUs, memory, disk storage) and it is important that we balance these factors to give the users the best possible experience. Throttling CPUs or disk space, or refusing to give up disk space merely to preserve disk capacity, only forestalls the inevitable. The solution's structures must be aligned with machine physics and the queries must be configured to leverage that physics.
The depiction(above) describes how the modeler's world (a logical world) in no wise intersects with the physical world, yet the physical world is what will drive the user's performance experience. The high-intensity physics of Netezza is not just something we "get for free", it is a resource we need to deliberately leverage and fiercely protect.
In the above, the "Logical data structure" is applied to the machine catalog (using query logic to create it). But once created, it doesn't have any content, so we will use more logic to push data into the physical data structure. The true test of this pairing is when we attempt to apply a logical query (top) and it uses the data structure logic/physics to access the machine physics (bottom). Can we now see why a logical query, all on its own, cannot manufacture or manipulate power? It is the physical data structure working in synergy with the logical query that unleashes Netezza's power. And this is why some discussions with modelers may require deeper communication about the necessity to leverage the deep-metal physics while we honor and protect the machine's power.
Modified on by DavidBirmingham
Sometimes the average Netezza user gets a bit tripped-up on how an MPP works and how co-located joining operates. They see the "distribute on" phrase and immediately translate "partition" or "index" when Netezza has neither. In fact, those concepts and practices don't even have an equivalent in Netezza. This confusion is simply borne on the notion that Netezza-is-like-other-databases-so-fill-in-the-blank. And this mistake won't lead to functional problems. They will still get the right answer, and get it pretty fast. But it could be soooo much faster.
As an example, we might have a traditional star-schema for our reporting users. We might have a fact table that records customer transactions, along with dimensions of a customer table, a vendor table, a product table etc. If we look at the size of the tables, we find that the product and vendor tables are relatively small compared to the customer, and the fact table dwarfs them all. A typical default would be to distribute each of these tables on their own integer ID, such as customer_id, vendor_id etc. and then putting in a transaction fact record id (transaction_id) that is separate from the others, even though the transaction record contains the ID fields from the other tables.
Then the users will attempt to join the customer and the transaction fact using the customer_id. Functionally this will deliver the correct answer but let's take a look under-the-covers what the performance characteristics will be. As a note, the machine is filled with SBlades, each containing 8 CPUs. For example, if we have a TwinFin-12, this is 12 SBlades with 8 CPUs, or 96 CPUs. They are interconnected with a proprietary, high-speed Ethernet configured to optimize inter-CPU cross-talk.
Also whenever we put a table into the machine, it logically exists in one place, the catalog, but physically exists on disks assigned to the CPUs. A simplistic explanation would be that if we have 100 CPU/disk combinations and load 100,000 rows to a table that is distributed on "random", each of the disks would receive exactly 1000 records. When we query the table, the same query is sent to all 100 CPUs and they only operate on their local portion of the data. So in essence, every table is co-located with every other table on the machine. This does not mean however, that they will act in co-location on the CPU. The way we get them to act in co-location (that is, joining them local to the CPU) is to distribute them on the same key.
But because our noted tables are not distributed on the same key, they cannot co-locate the join. This means that the requested data from the customer table will be shipped to the fact table. What does this look like? Because the customer table has no connection to the transaction_id, the machine must ship all customer records to all blades (redistribution) so that the CPUs there can attempt to join on the body of the customer table. We can see how inefficient this is. This is not a drawback of the Netezza machine. It is a misapplication of the machine's capabilities.
Symptoms: One query might run "fine". But two of them run slow. Several of them even slower. Results are inconsistent when other activities are running on the machine. We can see why this is the case, because the processing is competing for the fabric. Why is this important to understand? The inter-CPU fabric is a fairly finite resource and if we allow data to fly over it in an inefficient manner, it will quickly saturate the fabric. All the queries start fighting over it.
Taking a step back, let's try something else. We distribute the transaction_fact on the customer_id, not the transaction_id. Keep in mind that the transaction_id only exists on the transaction table so using it for distribution will never engage co-location. Once we have both tables distributed on the customer_id, let's look at the results now:
When the query initiates, the host will recognize that the data is co-located and the data will start to join without ever leaving the CPU where the two table portions are co-located. The join result is all that rises from the CPU, and no data is shipped around the machine to affect the answer. This is the most efficient and scalable way to deal with big-data in the box.
Now another question arises: If the vendor and product dimensions are not co-located with the transaction_fact, how then will we avoid this redistribution of data? The answer is simple: they are small tables so their impact is negligible. Keep in mind that we want to co-locate the big-ticket-or-most-active tables. I say that because we have sites that are similar in nature where the customer is as large as two of the other dimensions, but is not the most active dimension. We want to center our performance model on the most-active datasets.
This effect can rear its head in counter-intuitive ways. Take for example the two tables - fact_order_header and fact_order_detail. These two tables are both quite monstrous even though the detail table is somewhat larger. Fact_order_header is distributed on the order_header_id and the fact_order_detail is distributed on the order_detail_id. The fact_order_detail also contains the order_header_id, however.
In the above examples, the order header was being joined to the detail, along with a number of other keys. This achieved the correct functional answer, but because they were not using the same distribution key, the join was not co-located. So we suggested putting the order_detail table on the same distribution as the order-header (order_header_id). Since the tables were already being joined on this column, this was a perfect fit. The join received an instant boost and was scalable, no longer saturating the inter-CPU fabric.
The problem was in how the data architects thought about the distribution keys. They were using key-based thinking (like primary and foreign keys) and not MPP-based thinking. In key-based thinking, functionality flows from parent-to-child, but in MPP-based thinking, there is no overriding functional flow of keys - it's all about physics. This is not to say that "function doesn't matter" but we cannot put together the tables on a highly physical machine and expect it to behave at highest performance unless we regard the physics and protect the physics as an asset. Addressing the functionality alone might provide the right functional answer, but not the most scalable performance.
At far back as the 2007 Enzee Universe conference in Boston, people have scratched their head on the notion of a large-scale database that actually boasts about the absence of index structures. After all, indexes are the mainstay of relational databases and we simply can't get by without them, right? This is a simple example of how the power and architecture of the technology frees us to think about data loading and storage, data retrieval and in-the-box processing in a completely different way.
Firstly, it's no rumor that Netezza has no indexes. And for those of us who can't stand dealing with them, this is a huge plus. One of the Enzees at the 2011 conference asked point-blank - "What does Netezza have that Oracle does not?" and the clear answers will arise in this and later blog entries, but for now the absence of maintenance of index structures, will do just fine. What does Netezza have?
Or rather, the vendor has spent enormous thought labor into simplification of complex matters so that we don't have to deal with them. We can get to the business of - well - the business. The application data and information that we want to spend more time with if only we weren't dealing with the technical administration of the database engine. This is one of the greatest weaknesses of the traditional relational database when it comes to data warehousing in general and bulk processing in particular. It is also one of their greatest weaknesses in the space of analytics. After all, we want to apply the analytics by selecting subject areas and data sets to analyze, but if we have to stand up lots of infrastructure to support this mission we are regressing toward functional hardening rather than functional flexibility. Brittleness ensues and someone asks for a refactor, a re-architecture or whatever. When this request arrives in relation to a complex, engineered, multi-terabyte (or say tens-of-terabytes) data store, many will see it as a mind-numbing proposal. For most infrastructures, the weight and complexity of the installation has overwhelmed the logistical capacity of the humans to ever hope reeling it back in. There's not enough power in the machine to help. This could take a very long time to remediate.
Not so with an appliance. Such big-data issues are its bread-and-butter zone. If we have issues with a particular data model or information construct, the machine has the power (and then some) to get us out of the bad direction and into the new one. How did the direction get "bad" in the first place? The business changed its information priorities since the inception of the original model, and now the original no longer serves it well. The business has moved, because that's what businesses do. The analysts found opportunities in the nuggets of information-gold, and now want direct access to that gold rather than having to navigate to it or salivate while waiting for it.
One of the core features of the Netezza platform is how the data is distributed on the disk drives. Because it is an MPP, each of the disk drives is its own self-contained, shared-nothing hardware structure. Contrast this to the common SMP-based platform where the CPUs exist in one duck pond and the disk drives exist in another duck pond. The data is then couriered between duck ponds through throttled estuaries. If we issue a query, all of the data has to be pulled through this estuary and presented to the CPU ducks so that the data can be manipulated and examined in software. It is a software engine on a general-purpose platform.
However, imagine the Netezza platform where each Intel CPU is mated with special hardware (an FPGA), a bounty of RAM to manipulate and cache data, a self-contained Linux operating system, and its own dedicated disk drive. Imagine also that the disk drive is itself divided into multiple sections, where the inner sections are used for internal processes like RAID, temp space and system data but the outermost ring is where user data is stored, offering the benefit of the fastest disk speed for the oft-accessed information. All of these little attention-to-detail aspect cause each of the CPUs to run at a much more powerful factor than their common SMP counterparts. Why? Because their data is co-located with the CPU, where with the SMP we have to drag the data out of one duck pond to get it close to the CPUs that may operate on it. And with SMP there is no such thing as dedicated CPU-to-disk access. The SMP CPUs are shared-everything, but also the disk drives - shared-everything at the hardware level but shared-nothing at the functional-logical level. Netezza's CPUs are shared-nothing at the functional-logical level and shared-nothing at the CPU/disk level.
In Netezza, let's imagine putting 100 of these CPU/disk combinations to work. When we form a table on the host, the table exists logically in the catalog in one place. But it exists physically in 100 places. If we load 100 million records into the machine distributed randomly, then each of the CPUs will "own" 1 million of those records. If we issue a query to the table, each of the CPUs will operate only on the million records it can see. So for any given query, we are not serially scanning 100M records. We are scanning 1M records 100 times.
Now some may object, that we're still scanning the data end-to-end, and for plain-vanilla queries, this is true. However, I recall the first time I performed a join on two tables that were 100M records each doing a one-to-one join on unique keys, on a machine with 24 of these CPUs and the answer returned in 13 seconds. This was a plain vanilla test, so your actual mileage may vary. However, I did the same test on the same machine with 1B records joined to 1B records and the answer returned in less than 300 seconds. Imagine attempting this kind of join on a traditional relational database and expecting a return in any time to actually use the result. (and no, I did not use co-location for this test - a matter for another blog entry)
We have an additional "for free" aspect of the machine called a zone map. If data is laid down in contiguous form (the way transactions arrive in a retail point-of-sale model for example) then the machine will automatically track these blocks and keep their locations in a "zone map". If we then query the database with a given date, the machine will ignore the places where it knows the data is not, and use the zone maps to locate the range of data where it needs to search.
As as example of this, we know of a 150-billion-row table that is over 25 terabytes in size, distributed and zoned such that over 95 percent of its queries return in 5 seconds or less. For a traditional RDBMS, it would take more time than that just to get its index scans squared away so that it could even approach the storage table. This is also why the Netezza machine itself can be scaled into the petabyte zone while maintaining the same performance. No indexes are in the way to load it. No indexes are necessary to search it. Imagine now: the two most daunting aspects of data warehousing and analytics - loading the data and keeping up with the user base - have been washed away with the elimination of indexes. (Don't we turn indexes off in a traditional SMP database so it will load faster, and don't we chase a rainbow trying to index and re-index the tables to keep up with user needs?) Not with Netezza. Without indexes on the columns, all columns are fair game for searching. Without indexes in the way for loading, we can deliver information into the machine, a reprocess it while there, with no penalty from the use of indexes.
By this measure, this is an anti-indexing strategy because Netezza operates on the basic principle of where-not-to-look. In other words, if we can tell it where the data is not, it can zone-in and find the data. When we think about it, this is how a common brick-and-mortar warehouse works. If we showed up and asked for a box of nails, the attendant knows that he doesn't have to look in the parts of the warehouse that carry lawn equipment, hammers, saws or window draperies. He knows where the nails don't exist.
Contrast this to the common SMP-based indexed database, which uses exactly the opposite approach. The indexes are searched for the specific key and then this key is applied to the primary data store. This is why indexed structures in general cannot scale in the same manner as an anti-indexed structure. Keep in mind, with a Netezza platform it won't matter how much data we put on that 25-terabyte table. We could double, triple it or more - and it would always provide the answer in a consistent time frame. This is because from query-to-query - it's still not-looking in the same places, no matter how big those places get. It will still continue to ignore them because the data's not there and it already knows that.
I've had people tell me that there is no real difference in the SMP versus the MPP. The SMP, "properly configured" (they claim) is just as good as any old MPP. However, there is no "proper" way to configure a general purpose hardware architecture so that it will scale. The only way we could hope to coordinate these CPUs is in software, and the only way we can get data into the CPUs is by accessing the shared disk drives in another duck pond on completely different hardware. The SMP configuration is by definition the wrong configuration for scalable bulk processing of any kind, so there is no way to "properly configure" something that already the inappropriate storage mechanism. This would be no different than claiming that a "properly configured" VW Bug (Herbie notwithstanding) could be just as fast as a stock car. The VW Bug is not the wrong platform for general purpose transportation. But it's the wrong platform for model requiring high-scale and high-performance, just an an SMP-based RDBMS cannot scale with the same factor (for set-based bulk processing) as a machine (Netezza) specifically built for this mission. Only a purpose-built machine can ultimately harness this level of data, and only an appliance can remove the otherwise mind-numbing complexity of keeping it scalable.
In the next weeks leading up to IOD (where I will be speaking on most-things Netezza) I'll offer up some additional insights on the internals of the architecture and how it differs from the traditional platforms.
As the title suggests, one of the challenges of new Netezza users is in learning about the product, what it can (and doesn't) do, and how it applies in data warehousing. When I first published the book (Netezza Underground on Amazon.com) the impetus for the effort was just that - people asking me lots of questions leading to fairly repetitive and predictable answers. It's an appliance after all. We can apply it to a multiplicative array of solutions but on the inside, certain things stand out as immutable truths.
Of course, lots of folks are running around down here in the catacombs, convincing people that they need to read the glowing glyphs on the ancient stone walls as guides on their quest for more information. This is entirely unnecessary. The title of the book (and the blog) is a tongue-in-cheek nod to the way that some might spin the story on the machine in their own favor. Some might claim that we need lots of consulting hours to roll out the simplest solution. Consultants can help, of course (I'm one of them). But it depends on the spin and the tale that is told, that will determine the magnitude and expense of those consultants..
I'll try to provide a balance that delivers value without extraordinary expense. Yet another source of misinformation is Netezza's competitors, who like to toss "innocent" bombs here and there to direct people down the path toward their own product. All is fair in business and war, as they say, but as we bring these issues out of the darkness and into the light, the objective is to become more informed.
I am neither a Netezza nor IBM employee, so apart from compensation for actual work performed in rolling out solutions, for which people would pay me regardless of technology, I don't have any other relationship with these companies. I am a huge fan of the Netezza product and architecture and make no secret of this.
So some may have come here to ask questions about the technology or read what some seasoned experts have to say. We can do all that and a lot more. For now, let's look at the counter-intuitive nature of the product's internals.
I'll paint an imaginary picture first. Let's say we have two boxes. In one box we have thirty-two circles and in the second box we have thirty-two drums. The circles are CPUs and the drums are disk drives. They are in separate boxes, and now we draw a pipeline between them. Make it as large as you want. This depicts a standard SMP (Symmetric Multi-Processor) hardware architecture that is the common platform for data warehousing. With the exception of Teradata and Netezza, this hardware configuration is ubiquitous.
Now let's draw another mental picture. This time we'll have one box. Each of the circles will be mated with one of the drums. Now we have thirty-two circle/drum combinations. In the Netezza machine each of these is called a SPU (snippet processing unit) and represent the CPU coupled with a dedicated disk drive. Some additional hardware exists to coordinate and accelerate this combination, but this is a simplified mental depiction of the fundamental difference between the prior SMP configuration and Netezza's, an MPP (Massively Parallel Processor) configuration.
Now I used thirty-two as a simple example. In reality, Netezza can host hundreds of these cpu/disk combinations, the largest standalone frame containing over eight hundred of them, scaling to over a petabyte in storage capacity. Those of us who regularly operate on these machines are accustomed to loading, scanning and processing data by the terabyte, the smallest tables in the tens of billions of rows.
Some notes on the difference in their operation:
SMP: Typically used for transactional (OLTP) processing and is not purpose-built for data warehousing. It is a general-purpose platform. The typical bane of an OLTP engine is that it performs well on getting data inside, but is lousy on getting data out (in quantity, for reports and such).
MPP: (Netezza specifically) does not do transactional processing at all. It is purpose-built to inhale and exhale Libraries-of-Congress at a time.
SMP: Table exists logically in the database, but physically on the file system in a monolithic or contiguous table space.
MPP: Table exists logically in the database, but is physically spread out across all the SPUS. If we have 100 SPUs and want to load a table with 100,000 records, each SPU will receive 1000 records.
SMP: SQL statements are executed by the "machine" as a whole, sharing CPUs and drives. While the SQL operations may be logically and functionally "shared nothing" - the hardware is "shared-everything". In fact, CPUs could have other responsibilities too, which have no bearing on completing a SQL statement operation. In the above example, the SMP has to access the file system, draw the data into memory nearest the CPUs, then perform operations on the data. Copying data, for example, would mean drawing all the data from the tablespace into the CPUs and then pushing back down into another tablespace, but both tablespaces are on the same shared disk array.
MPP: SQL Statements are sent to all CPUs simultaneously, so all are working together to close the request. In the above example, each SPU only has to deal with 1000 records. Unlike an SMP, the data is already nearest the CPU so it doesn't have to go anywhere. The CPU can act directly on the data and coordinate as necessary with other CPUs. Copying data for example, means that the 1000 rows is copied to a locaton on the local drive. If 100 CPUs perform this simultaneously, the data is copied 100 times faster and it never leaves the disks.
SMP: Lots of overhead, knobs and tuning to make it go and keep it going. From the verbosity of the DDL to the declaration of index structures, table spaces and the like.
MPP: (Netezza) the overhead of adminstration is hidden from the user. In the above example, the user need only declare, load and utilize the table, not be concerned about managing the SPUs or disk allocation. The user's only concern is in aligning the data content with the hardware architecture, an easy task to perform and an easy state to maintain.
SMP: SQL-transforms, therefore, the insert/select operations that run so slow on an SMP platform, continue to run slow, and slower as data is added. They are not a viable option so usually would not occur to us to leverage them otherwise.
MPP: SQL-transforms leverage the MPP and do not slow as the data grows.
In fact, many people purchase the Netezza platform in context of its competitors, who don't (nor can they) use SQL transforms to affect data after arrival.
If we purchase Netezza as a load-and-query platform only, we have missed a special capability of the machine that differentiates it from the other products. If we leverage this power, we re-balance the workload of the overworked ETL tools, in some cases eliminating them altogether. Netezza calls this practice "bringing everything under air", that is, bring the data as-is into the machine and do the heavy-lifting transforms after it's inside.
The Brightlight Accelerator Framework is one example of a flow-based, run-time harness for deploying high-performance, metadata-driven SQL-transforms. While we have matured this capability over time, the consistency of the Netezza platform is the key to its success.
SMP:Scalability is through extreme engineering, leading to hardware upgrade.
MPP:Scalability is a function of data organization as it leverages the hardware's power. Upgrading hardware is therefore rarely the first option of choice.
SMP:Constrained by index structures and the shared-everything hardware architecture through which all data must pass
MPP:Constrained by human imagination, not index structures (there aren't any) and no shared hardware.
We have noted that with traditional SMP architectures, when the machine starts to run out of power, engineers will swarm the machine and start to instantiate exotic data structures and concepts that serve only as performance props, not the 'elegant' model once installed by the master modelers. As the box continues to decline, more engineering ensues, eliciting even stranger approaches to data storage and management. Soon it becomes so functionally brittle that it cannot handle any more functionality, so people resort to workarounds and bolt-ons. Functionality that should be a part of the solution is now outside of it, and functionality that never should have been inside (mainly for performance) has taken up permanent residence.
In a Netezza machine we have so much power available that traditional modeling approaches (e.g. 3NF and Dimensional) may have a defacto home, but now we can examine and deploy other concepts that may be more useful and scalable. Approaches that we never had the opportunity to examine before, because the modeling tool did not (and does not) support it and the natural constraints of the SMP-based engine artificially constrain us to index-based data management.
In addition, the Netezza machine operates on a counter-intuitive principle of finding information based on "where not to look". With this principle, let's say we have a terabyte of data and we know where our necessary information is, based on where-not-to-look. See how this won't change even if we add hundreds of terabytes to the system? If I add another 99 terabytes to the same system, the query will still return with the same answer in the same duration, because the data I'm looking for doesn't appear in those 99 terabytes and the machine already knows this.
For example, if we go over to Wal-Mart and we know where the kiosk is for the special-buys-of-the-day, we can find this easily. It's in a familiar location. What if tomorrow they expand the same Wal-Mart to ten times its current size? Will it affect how long it takes me to find that kiosk? Clearly not. Netezza operates the same way, in that no matter how much data we add, we don't have to worry about scanning through all of it to find the answer.
And when it boils down to basics, the where-not-to-look is the only way to scale. No other engine, especially not one that is completely dependent upon index structures to locate information, can scale to the same heights as Netezza.
Well, some of the stuff above is provocative and may elicit commentary. I'll continue to post as time progresses, so the data won't seem so much like underground information, after a while anyhow.
Modified on by DavidBirmingham
A "little" blurb about IBM Integrated Analytics System- IBM has had this version of the Analytics MPP in beta for a bit, and has recently announced (today).
Our team at Sirius was privileged to get an orientation and a little more than a sneak-peek, but access to our own beta machine to give it a go. The Enzee community will be pleased about a number of things in this release.
But even at the orientation, a number of things popped out that would "take your feet off the desk" so to speak. This is no ordinary release. IBM has integrated the Db2 BLU engine with the Netezza MPP, so now we can build and issue portable queries with a common, consistent experience. This is quite an achievement and many kudos to the engineers who have participated.
Power Systems vs Blade Servers
Under the hardware covers we now have Power Systems. This matters for a lot of good reasons, the best one being horizontal elasticity. Most Enzees know whenever we want to upgrade our hardware to more power, say from one rack to two, we have to bring the two-rack into the shop, copy the data from the one-rack to the two-rack, and off-we-go.
With IIAS, we simply add-a-frame, perform some simple configuration commands, the data rebalances to the new capacity, and off-we-go. Simple and painless. How cool is that?
In this hardware scenario, dataslices and zone maps (now synopsis tables) behave the same. Distribution and organization follow the same rules. Enzees will experience no changes in these core areas.
Db2 BLU experienced this radical speed also, but with Netezza-oriented MPPs, it's even more profound. Netezza tables are row-oriented, so the "read" operation takes a page from disk into the FPGA, where the desired columns are stripped from rows, and undesired rows are filtered from the stream.
In Columnar mode, the pages don't store rows, but columns. When we ask for certain columns, the other columns' pages aren't touched. So now we read even less of the disk itself, and even less data meets the CPU.
Oh, yeah, you heard that right. In current Netezza, an electro-mechanical drive will not only burn out, but even in operation, the read-head has to seek a page on the disk, fetch it, and multi-task with other read operations to optimize the physical read-head over the spinning disk. Even the spinning disk is optimized to carry the user data on the outer-third of the disk where it spins fastest. This can serialize queries at the read-head and is potential for concurrency bottlenecks.
With solid-state drives, the read-time is radically faster, and memory is fair-game by any process, without waiting. It's solid-state so mechanical breakdown is not even on the radar. More importantly, it reads and writes data at phenomenal speed compared to a hard drive.
Where before, the disk read speed was the number-one drag on a query, now we will likely see this re-balanced to other areas of the machine. Don't get me wrong, it still takes time to read memory, so we should not forego the use of zone maps - er - synopsis tables now - to reduce and filter the total amount we read. This is just good stewardship of the CPUs and other resources. Just because we "can" read a lot of data faster, doesn't mean we "should" - we should filter and reduce the total data arriving into the CPU for the most efficient query utility.
What it does not have, is the FPGA. This hardware-filtration workhorse, IBM is taking the step to remove, because the disk drives are so powerful, columnar tables are so data-efficient, and the CPUs are so much stronger, the hardware filtration power may well be superfluous. Let's shake that tree. We're all scientists, right? In retrospect, the purpose of the FPGA is to simulate column-oriented streams from row-based storage. But if we have columnar tables, the FPGA is less of a player.
Common SQL Engine
This means SQL statement portability and consistency across various platforms, and the libraries such as SQL Toolkit, Inza, Fluid Query etc are now baked-in to the engine and available at our fingertips when we power-up.
We'll have more consistent experience, and less guesswork for functional/operational behaviors.
And we'll experience seamless integration with the Data Science Experience (DSX), machine learning, and a bunch of other offerings IBM has already announced or is in the queue.
Ease of configuration
As we were moving through our POC, the IBM team assigned to us was amazingly helpful. They knew where all the hooks were to tweak this or tune that, because the common engine and underpinning metadata are pervasive and well-understood. Anytime we had a question or thought we'd encountered an issue, the solution was invariably a simple configuration change.
This speaks volumes for the level of effort, thought and insight poured into this release. IBM has thought-through the wide variety of priorities between these platforms and has provided the ones that matter most, seamlessly.
Expect to hear good things as this rollout proceeds.
Synopsis Tables vs Zone Maps
Essentially work the same way.
DashDb and DashDb-local have been rolled-up to Db2 Warehouse. Same SQL engine. Same built-in analytic libraries. Db2 Warehouse with the MPP hardware is the IIAS.
If we plan to rollout the Db2 Warehouse first and later upgrade to MPP hardware, we need to enable the queries and tables for this, and this simply means to leverage Synopsis Tables rather than indexes. Do this, and we'll be 90+ percent on the way to leveraging MPP hardware when we go there. If not, and the tables/queries are index-dependent, we'll need to review the environment (for the largest tables) to make sure we're leveraging the MPP correctly.
These are not hard to do and much of it can be automated.
A while back we bought a new house and decided that we would keep our first house for a few additional months to muse about whether we would keep it for a rental home. Well, shortly afterward was Sept 11th, 2001 and we were boxed into keeping it for quite a while longer. In all this, opportunists came out of the woodwork to take our house for a "song".
We tired of the songbirds quickly, but some of them were just bizarre. One couple looked at the house and said they would buy it if we would put $20k of upgrades into it. New this, new that. Our answer was no, we're selling "as is". But they persistent and felt that their requests were reasonable. No, I said, you can spend your money fixing it up like you want it, because I'm not spending my money to fix it up like you want it. They were confused. We weren't "playing the game" you see. Oddly, they had a realtor working with them who kept chanting, "Ask for what you want. It never hurts to ask."
Well, when you're asking the wrong questions, and it's obvious, it definitely hurts to ask. When on site with a large financial services firm, we were being examined for the replacement of their current consulting firm. Their lament: "They ask all the wrong questions." Apparently we were asking the right questions, because they gave the business to us.
Recently I walked through a demo of some things we've done for other clients. Two minutes into the demo, they started asking questions. This or that? Those or these? But nothing whatsoever to do with the core problems the solution was geared for, or the core questions that everyone asks.
It's like this: I have a list of questions on left side of the white board and another list on the right side. The questions on the right side deal with things like very-large-scale backup and high-speed recovery, disaster recovery, hot-failover between two Netezza machines, real-time replication, continuous processing, and of course, heavy-lifting processing inside-the-box.
These are things that accelerate business logic, the rapid turnaround of business rules and features, catalog-aware design-time and run-time modules, developer productivity, testing accuracy and turnaround, rapid-testing and rollout, release management and patched releases - operational integrity, a firm harness around the functional data, and radical simplification of very complex problems - in short, a high-wall of protection around a high-speed delivery model. Something that shrinks the time-to-delivery as well as shrinks operatiional processing time and protects capacity, while protecting the data itself.
I knew better than to hold my breath for a question on migration -
Whew! Whenever I give this short list to a CIO, they are usually asking in terms of having rolled out a large-scale environment with Netezza and are experiencing pain in one or more of those areas. Of course, we focus on those areas. Painkiller is not enough. You have to remove the source of pain..
How do I know that these are the questions Enzees are asking? Because they are - to me personally, to the Netezza community forums, and are the hot-topics at every Enzee Universe. Good grief, that's the short list of the lightning-round question-and-answer sessions at the Best Practice forums.
But they did not hear any of this. They were asking all of the wrong questions. The reason, methinks, is that they weren't in any pain. They felt the freedom to ask about the inconsequential because they had never experienced the consequences, or have even foreseen the consequences. In other words, I am solving problems they don't have.
Or do they? Anyone who buys a large-capacity storage devicesis by definition biting off a data management challenge. One of our clients has over 200 TB (uncompressed) on their TwinFin. They currently do their backups to another TwinFin of the same size. They would like to perform their backups offline, but no commodity backup/recovery system on the planet can handle this kind of load with any aplomb. Some claim to, of course, but have not considered the true needs of the Netezza user. It's all Texas-Sized Data Warehousin'
For example, one of the off-the-shelf products claims that once the data is offloaded, they have "hooks" allowing the user to query the offline archives. That's right, the user query will be dispatched to their proprietary engine and it will fetch the answer back for you. But wait a second. That protocol is for a transactional system. Going out onto the file system to fetch a transaction might have some value, since the only alternative is to bulk-load the entire dataset back into the device to query it, when all we wanted was one transaction.
However, for a Netezza "scanning analytic" query, potentially spanning tens of terabytes in scale, such a mechanism is no more than a child's toy. What we need is the ability to rapidly reload the data back into the machine so we can query it there, inside Netezza's MPP. More importantly, once it's back online we can join it to other tables, that's right, down on the MPP where the CPUs burn brightest.
But this is simply an example of how one vendor has solved a problem that no Netezza user is asking. Netezza users need to scan and analyze the data in-the-box. Those tools provide data access outside-the-box, without considering that having the data outside-the-box is the problem we were trying to solve in the first place!
Another issue is how the ETL tools interact with Netezza. All of them do, some better than others. Some are certainly getting better than the others. The game is on, they know that data processing is migrating back into the machine. Netezza is leading the way, and they want a piece of the action. Can you blame them?
But hold on. Is bulk-data processing in an ETL tool the same as bulk-data processing inside a Netezza box? No, not really. "Going parallel" in an ETL tool means spreading work out across CPUs for a select set of operations. Not only this, we will compete with other processes for those same CPUs. The net is, we cannot put all the CPUs on the machine to work for us. Clearly some of the ETL tools are otherwise proud of their scalability. And they should be. Add a few more CPUs to that rack and watch the ETL tool scale with it. Nothing bad about that. Capture those business rules in a point-click and off-we-go. They have spent millions of dollars tuning their performance models, pouring the brain power of their brightest people into making their product go-parallel and push that data hard. Imagine them handing off this hard-won, multi million-dollar performance capability to an upstart appliance? Hmm, no, they'll be the last who are draggged kicking and screaming into the inexorable future.
Anyone who is kicking the tires of a large-scale storage and processing device like Netezza is also in for a subtle surprise: Once you are migrated, you will unleash the creativity of you staff to add functionality that was never possible before. This will generate business, which will generate more data. The extra capacity will naturally offer confidence that all-new-business-great-and-small are not only do-able, but with strength that isn't available with other platforms. No worries, sez aye, the Netezza machine will scale with you. And you have chosen well, grasshopper.
Then one day they look around and say - have we backed-up this data recently? How ahout disaster recovery? What about archiving old data to make room for new? What about the ability to make the offline data available for the ad-hoc folks? That is, available in time for them to actually use it? These simple questions raise fear in the hearts of the mightiest of IT champions when they know it should have been asked, and applied a long time before ---- Before the locomotive was moving at 90 miles an hour dragging 500 boxcars. Before the locomotive even left the station. The sheer logistics of performing a backup of data this size, and this transient, is mind-numbing. I've noted that one client, hosting over 150 TB (uncompressed) naively plugged their commodity backup tool into the Netezza machine. After over-a-week of whirr-click backup activity with no end in site, someone finally said "Kill it. If it takes a week to back it up, it will take even longer to restore it". This is a wise observation, but also tells us something:
The commodity tools expect us to accept that the restore-process will be slower than the backup-process. In the Netezza world, the backup and restore can both happen faster, and take up less storage than the commodity backup tool. If the commodity backup has to offload in uncompressed form, we need to provide generous workspace for it. If it's a Netezza-compressed backup, we only need to provide for the amount relevant to our compression ratio. Some sites get a 16x, others get almost twice that. The mileage varies because of the nature and compressibility of the data. Either way, offload is fast because it comes right off the disk without passing through the host. Likewise the reload, directly to the disk without the host. In a traditional RDBMS, offload/reload has to pass through the host to get on and off the device. For bulk analytic data, the missing middle-man is just the ticket for rapid-reload.
But they weren't asking this question either. The problems we had already solved and were bringing to the table as Netezza-centric solutions, the bread-and-butter, core-mission capabilities that people ask for all the time, wasn't even on their radar. The disconnect is simple: They have read white-papers on what people are doing after they get the back-end squared-away. The nice-to-haves. The critical parts, you know, the failure points that could mean the demise of the business, or perhaps their own paychecks? Not even on the table.
This is akin to someone about to engage in the construction of a large cargo ship. To be sure, some folks are concerned about the utility of the ship's interior. But when putting pen to paper, which is more important, that the break rooms have contemporary wallpaper, or that the ship can master tempestuous seas and high swells? Methinks the crew of said ship couldn't care less about the amenities if they got wind that the architects gave short shrift to things like hull stress and multiple watertight compartments. You know, things to keep the ship from being claimed by the sea when stress is high. Boy, those light fixtures sure look cool, don't they?
Back to basics. Netezza is an appliance. It can perform as a load-and-query device just like all the other load-and-query devices. A primary differentiator, one that more customers are experiencing all the time, is that Netezza is a powerful data processing platform. When leveraged this way, it also becomes a problem-solving platform. We simply wrapped some additional logic around these core capabilities in order to harness the logistics of multiple sequential (or asynchronous) queries being fired off to the device, managing its workload, intermediate tables and whatnot. The appliance makes such an endeavor simple, and further simplifies the user's interaction with the machine for purposes of pattern-based utilization. Change-data-capture, referential integrity checking etc are all far more effective inside the machine than outside the machine.
For the shops that would rather master these aspects in the ETL tool, hey, no harm done. Lots of people do it that way. But they all eventually get to the same point in the game: the ETL tool runs out of gas. Or to give it more gas will require a significantly larger power-plant. They then realize that all those CPUs in the Netezza machine are just sitting there, doing nothing most of the time. Enough CPUs respond to peak, which is about five percent of the time. The remaining 95 percent of time, the box is running less than half capacity with a big chunk of that completely idle. Wouldn't it be great to get a handle on all that power? Recover it as part of an ongoing capacity model?
The most important part of this approach is to decide you want to do it. Lots of options naturally come your way when you try to get creative in the direction of power- because power drives the creatviity further. Ideas are given wings that can fly higher and farther than any other traditional RDBMS could even dream about. Managers start having ideas too. By golly if that machine can do this, then why not that? And why not, indeed?
In fact, most of the time when dealing with a standard RDBMS, the managers will ask why-not? in the sincerest of spirits. Then forth from the mouths of IT come the exact, precise reasons why-not. They are good reasons, logical reasons, and effectively put a wide moat around the management's idea-factories. Soon their ideas fade. They forget how good the ideas were. They will never see the light of day. The underpowered nature of the machines constrain them.
Contrasted with the Netezza machine, the question of why-not is more rhetorical in nature. The person asking the question is not expecting an answer either, but not for the same reasons as the manager above. He's not expecting an answer because the IT folks are already on it. His answer to the question will not be verbal, but will be positively expressed in real functional terms. Likely in a very short period of time.
Why-not? becomes more of a water-cooler phrase. Almost like, "I'm trying do decide whether we should eat at the club or eat at the diner. How about today we eat at the club?" To which the respondent says - "why not?" This is a very different and rarefied existence than the one borne on the constraints of an underpowered machine.
Once you gather a head of steam on all this, you realize that not only is Netezza an enabler for large-scale, complex problem solving, it provides the impetus for us to construct a problem-solving platform upon it. The ability to capture larger patterns of set-based processing, express them in simpler terms, and have them available to all - as capabilities that leverage the capabilites of the Netezza machine.
In your own domain, if you have a Netezza machine in-house and you're using it for data processing, even if it's the most inefficient model on the planet, it still beats the socks off the plain-vanilla ETL tool's work for the same operations. If you have coupled an ETL tool with this approach and are getting the gains out of it, even though this process may have initially been painful, you have answered the question with the right answers. The tools may not be quite up to speed, but that's okay. Your compass has not betrayed you. Stay the course and the benefits will be magnanimous indeed.
And for those who continue to use the machine as a load-and query device, and have not forayed into the radically rewarding realm of ELT and data processing inside the machine, there is only one question to ask, and it's the right question:
So I published this book last Spring ('11) on how the Netezza machine is a change-agent. It initiates transformation upon people or products that happen to intersect with it. Most of the time this transformation makes the subject better. Sort of like how heavy-lifting of weights will make the body stronger. Or the pressure can crush the subject. Stress works that way. We could imagine the Netezza machine as the change-agent entering the environment. Everything brushing against it or interacting with it will have to step-up, beef-up or adapt. I sometimes hear the new players say things like "But if the Netezza machine could only.." That's like a Buck Private saying of his drill sargeant, "If he could only ..." No, the subject must consider that the Netezza machine is never the object of transformation but rather is the initiator of it. But it's not a harsh existence by any means. Products that can adapt are far-and-away better than before. Those that cannot adapt now, will eventually, or remain in their current tier.
Having been directly or indirectly alongside these sorts of product integrations and proof-of-concepts (POCs) numerous times, it's always an interesting ride. The vendor shows up ready-to-go with visions-of-sugarplums in their head. And the suits who show up with them, are salivating for the ink on the license agreement. In less than an hour into the POC, all of them have a very different opinion of their product than when they arrived. Their bravado is reduced to a shy, sort of sheepish spin. Throw them a bone, not everyone walks out of this ring intact. Some of them shake their fist at the Netezza machine. It is unimpressed. Others shake their fist at their own product. Alas, it is but virtual, inanimate matter. What is transforming now? The person in the seat.
So I have watched them scramble to make the product hit-the-mark. Patches? We don't need no stinkin' patches. Except for today, when they will be on the phone in high-intensity conversations with their "engine-room" begging for special releases while on-site. Alas such malaise could have been avoided if only they had connected their product - at least once - to a Netezza machine. In so many cases, they will claim that they have Netezza machines in-the-shop so they are prepared-and-all-that. It is revealed, sometimes within the first hour, that the product has never been connected to a Netezza machine. It doesn't even do the basics, or address the machine correctly. It is especially humorous to hear them speak in terms of scalability as though a terabyte is a high-water mark for them. One may well ask, why are we wasting our time with underpowered technology? Well, in point of fact, when placed next to the Netezza machine it's all underpowered, so really it's just a matter of degree.
Case in point, Enzees know that in order to copy data from one database to another, we have to connect to the target database (we can only write to the database we are connected to). And then use a fully-qualifed database/tablename to grab data from elsewhere - in the "select" phrase. Forsooth, their product wants to do it like "all the others do" and connect to the source, pushing data to the target. Staring numb at the white board in realization of this fundamental flaw, they mutter "If only Netezza could....". But that's not the point. They arrived on site, product CD in hand, without ever having performed even one test on real Netezza machine, or this issue (and others) would have hit them on the first operation. They would have pulled up a chair in their labs, started the process of integration and perhaps call the potential customer "Can we push the POC off until next week? We have some issues (insert fabricated storyline here) and need to do this later."
Cue swarming engineers. Transformation ensues.
Another case in point, many enterprise products are built to standards that are optimized for the target runtime server. That is, they fully intend to bring the data out of the machine, process it and send it back. One of my colleagues made a joke about Jim Carrey's "The Grinch" and the mayor's lament for a "Grinch-less" Christmas. Well, didn't the Grinch tell Cindy-Lou Who that in order to fix a light on the tree, he would take the tree, fix it and bring it back? Seems like a lot of hassle for one light? Why can't you fix it here and not take it anywhere? Enzees see the analogy unfolding. No, we don't want to take the data out, process it and put it back. We want "Grinch-less" processing, too. Fix the data where it already is.
Why do this? Well, in 6.0 version of the NPS Host, the compression engine could easily give us up to 32x compression on the disk. Or even a nominal 16x compression, meaning that our 80 terabytes is now 5TB of storage. And while we may have to de-compress it on the inside of the machine to process it, the machine is well-suited to moving these quantities around internally. Woe unto the light-of- heart who would pull the data out into-the-open, blooming it to its full un-compressed glory, on the network, CPU, the network again - just to process it and put it back.
Unprepared for the largesse of such data stores, our vendor contender's product centers on common scalar data types. Integer, character, varchar, date. No big deal. Connect to the Netezza machine and find out that the "common" database size is in the many billions and tens of billions of rows. A chocolate-and-vanilla software product without regard to a BigInt (8 byte) data type, cannot exceed the ceiling of 2 billion (that's the biggest a simple integer can hold). This does not bode well for integrating to a database with a minmum of ten billion records and that's just the smallest table. Having integers peppered throughout the software architecture by default - would require a sweeping overhaul to remediate. As the day wears on, we see them struggle with singleton inserts (a big No-No in Netezza circles) and lack of process control over the Netezza return states and status. These are not exotic or odd, but no two databases behave the same way. The moment that Netezza returned the row-count that it had successfully copied four billion rows, we watched the product crash because it could not store the row-count anywhere - the product had standardized on integers, not big integers, so the internal variable overflowed and tossed everything overboard. Quite unfortunately, this was a data-transfer product and performed destructive operations on the data (copy over there, delete the original source over here). So any hiccup meant that we could lose data, and lots of it.
Cue announceer: "And the not-ready-for-prime-time-players..."
Oh, and that "lose data and lots of it" needs to be underscored. In a database holding tens of billions of rows (hundreds of terabytes) of structured data, that is, each record in inventory, with fiducial, legal, contractual, perhaps even regulatory wrappers around it, and we're way, way past the coffin zone. Some of you recall the "coffin zone" is the point-of-no-return for an extreme rock-face climber. Cross that boundary and you can't climb down. But we're not climbing a rock face are we? The principle is the same. Lose that data and we'll get a visit from the grim reaper. He doesn't hold a sickle, just a pink slip in one hand and a cardboard box in the other (just big enough for empty a desk-full of personal belongings).
One test after another either fails or reveals another product flaw. When the smoke clears, the "rock solid offering" complete with sales-slicks and slick-salesmen, is beaten and battered and ready for the showers. The product engineers must now overhaul their technology (transform it) and fortify it for Netezza, or remain in their tier. The Netezza machine has spoken, reset itself into a resting-stance, presses a fist into a palm, graciously bows, and with a terse, gutteral challenge of a sensei master, says: "Your Kung Fu is not strong!"
Now it's transformation-fu.
Superficially, this can look like a common product-integration firefight. But this kind of scramble tells a larger tale: They weren't really ready for the POC. This would be similar to an "expert" big-city fireman, supremely trained and battle-hardened in the art of firefighting and all its risks, joining Red Adair's oil-well -fire-fighting team ( a niche to be sure) and finding that none of the equipment or procedures he is familiar with apply any longer. He will have to unlearn what he knows in order to be effective on a radically larger scale. He might have been a superhero back home, faster than a speeding bullet, able to leap tall (burning) buildings in a single bound, but when he shows up at Red Adair's place, they will tell him to exchange his clothing for a fireproof form and get rid of the cape. Nobody's a hero around an oil-field fire. Heroes leave the site horizontally, feet-first. No exceptions.
Enzees have experienced a similar transformation (with a different kind of fire). The most-oft-asked questions at conferences are just that flavor: How do we bring newbees into the fold? How do we get them from thinking in row-based/transactional solutions into set-based solutions? How do we help them understand how to use sweeping-query-scans to process billions of rows? Or use one-rule-multiple-row approaches versus cursor-based multiple-rule-per-row? How do we get testers into a model of testing with key-based summaries instead of eyeballs-on-rows (when rows are in billions)?
We were dealing with a backup problem at one site because of a lack of external disk space. Commodity tools often use external disk space for this purpose, until they are connected to a Netezza machine and their admin tool complains that they need to add "another hundred terabytes" of workspace. We gulp, realizing that the workspace is only today a grand total of ten terabytes in size. And you need another hundred! Yeesh, you big-data-people!
Most of the universe outside the Enzee universe will never have to address problems on this scale. It is not the machine itself that is the niche. It is the problem/solution domain. Most of the commodity products that are stepping up are doing so only because it's clear that Netezza is here to stay and they need to step into Netezza's domain. I suppose at some point they expected Netezza to give them a call to start the integration process, but the Netezza Enzee Universe already had all that under control. It's amazing how lots-of-power can simplify hard tasks to the end of ignoring commodity products entirely.
Another case in point, a product vendor "popped over" with a couple of his newbee product guys and spent two weeks trying to get their product to play in-scale with Netezza. Before throwing in the towel, they offered up the common litany of observations. "No indexes? What the?" and "Netezza needs to change X", or the favorite "Nobody stores this much data in one place." The short version is, you brought a knife to a gun fight, as Sean Connery would assert, or perhaps, you brought a pick-axe and a rope to scale Mt. Everest. What were you thinking? You see, most people who have never heard of Netezza (I know, there really are folks out there who don't know about it, strange as is seems) do not understand the scale of data inside its enclosure. Billions of records? Tens of billions of records? A half-trillion records? Is that all you got?
We will watch a switch flip over in their brains as they assess what they are trying to bite off. A small group will embrace the problem and work toward harnessing the Netezza machine in every way possible. Another group will provide a bolt-on adapter for Netezza to interface to their core product engine. While another, larger group will assess the expense of such things, the marketplace they currently address, and conclude that they will for now remain in their current tier. This is like a 180-lb fighter climbing into the ring with a heavyweight, and walking away realiizing that they need to add some muscle, some speed, and some toughness or just stay in their own weight class and be successful there.
Another case-in-point is the need for high-intensity data processing in-the-box in a continuous form, coupled with the need for the reporting environment to share the data once-processed, likewise coupled with the need for backup/restore/archive and perhaps a hot-swap failover protocol. We can do these things with smaller machines and their supporting vendor software products. But what about Netezza, with such daunting data sizes, adding the complexity of data processing?
At one site we had a TwinFin 48 (384 processors) and two TwinFin 24's (192 processors) with the '48 doing the heavy-lifting for both production roles. When it came time to get more hardware, the architects decided to get another '48 and split the roles, so that one of the machines would do hard-data-processing and simply replicate-final-results to the second '48, limiting its processing impact for any given movement. This was not the only part of their plan. They then set up replicators to make "hot" versions of each of these databases on the other server. This allowed them to store all of the data on both, providing a hot DR live/live configuration, but it would only cost them storage, not CPU power. Configured correctly, neither of the live databases would know the difference. Our replicators (nzDIF modules) seamlessly operated this using the Netezza compressed format to achieve an effective 6TB/hour inter-machine transfer rate, plenty of power for incremental/trickle feeds across the machines.
Some say "I want an enterprise product that I can use for all of my databases". Well, this is the problem isn't it? Netezza is not like "all of our other databases". Products that have a smashing time with the lower-volume environments start to think that a "big" version of one of those environments somehow qualifies their product to step-up. I am fond of noting that Ab Initio, at one site loading a commodity SMP RDBMS, was achieving fifteen million rows in two hours. Ab Initio can load data a lot faster than that (and is on record as the only technology that can feed Netezza's load capacity). So what was the problem? The choice of database hardware? Software? Disk space? Actually it was the mistaken belief that any of those can scale to the same heights as Netezza. I could not imagine, for example, that if fifteen million rows would take two hours, what about a billion rows (1300 hours? ). Netezza's cruising-speed is over a million rows a second from one stream, and can load multiple streams-at-a-time.
Many very popular enteprise products have not bothered to integrate with a Netezza machine, and many of those who have, provide some form of bolt-on adapter for it. It usually works, but because the problem domain is a niche, it's not on their "product radar". It's not "integrated as-one". What does this mean? Netezza's ecosystem, and now assimilated by IBM, through IBM's product genius and sheer integration muscle, will ultimately have a powerful stack for enterprise computing such that none of the other players will be able to catch up. If those vendors have not integrated by now, the goal-line to achieve it is even now racing ahead of them toward the horizon. Perhaps they won't catch up. Perhaps they won't keep up. Some products (e.g. nzDIF) are at the front-edge, but nzDIF is not a shrink-wrapped or download-and-go kind of toolkit. We use it to accelerate our clients and differentiate our approaches. It's a development platform, an operational environment and expert system (our best and brightest capture Netezza best practices directly into the core). This has certainly been a year where we've gotten the most requests for it. But there's only one way to get a copy.
Cue Red Adair.
"No capes!" - Edna Mode, clothing-designer-for-the-gods, Disney/Pixar's The Incredibles
Modified on by DavidBirmingham
I'm wont to say the Netezza technology does a bit more than suggest it will run fast - it's an explicit promise practically nailed to the cabinet.
One may well ask (I'm glad you did) we did everything we were supposed to do, but it still runs slow. Why?
Many sites I visit, I hear something similar, and the irony is, they aren't far from the mark. It's not like we'll have to revamp everything, or overhaul the model, or perform some massive architectural rework threatening the vast spacetime continuum-
No, it's usually a tweak here and there. Sometimes not.
Here's the deal. It's a data warehouse appliance. It circumscribes decades of tried-and-true data warehousing, making it easier than ever to rollout fairly standardized warehouse design. Problem is, many folks who buy one have never been exposed to data warehouse principles. No wonder when we pop-the-hood, we don't find what we expect to find - an implemented data warehouse. What we find is an oddity.
Stay with me, it'll be painless, I promise. In the data warehouse world, we embraced a common, simple truth: 3NF Third-Normal-Form (transaction-style) schemas are lousy for reporting. They require too many joins and drag the transactional operation to its knees. Formulating optimized tables in the same machine won't help, because reporting is set-based, and transactional is entity-based, and the very engine we're running on favors the latter and disfavors the former. Hence the need to move the data off the machine and reformulate it for better performance.
The spirit in play here, is the acknowledgement that the 3NF model doesn't work, and the embracing of a model that does. This typically involves a dimensional-style model. It means the tables of the transactional model are merged, consolidated etc into fewer tables. It also means data types, data values and the like are all scrubbed to a known value, and their types aligned, to avoid any kind of on-demand drag. After all, would we rather zap a field into upper case on the back-end, or run the upper() function in the on-demand where-clause? Running it in the where-clause creates egregious drag. Zapping it in the back-end costs us nothing.
But once these sorts of schemas are moved into Netezza technology, something seems to happen to the mind. One might think it's okay to take those source tables, CDC-them over to the Netezza machine, slap some reports on top of these tables, and call it a day.
Not so fast. (Oops, that's what we're trying to fix - the system that's not so fast - I make myself laugh sometimes)
Bringing the source's 3NF model into Netezza is fine for Staging. Even for a central Hub schema to consolidate all sources. What the 3NF model is notoriously poor at, is set-based operations such as analytic queries for reporting. These were poor in our source system and they are poor over here, too. They are always poor for reporting. No amount of high-powered hardware will change this.
But that's okay. If we have the data inside the machine, we're not far from finishing-up the next step. It's just a lot of people now who aren't taking the next step. They leave the original source structures "as is" and wonder why they can't get a faster query. I mean, the high-powered hardware's there, right?
Pre-integration vs Re-Integration
Pre-Calculation vs Re-calculation
Pre-formulation vs re-formulation
Pre-scrub vs re-scrub
See a pattern here?
Saw a chat between several folks describing the following configuration:
Sources -> Hadoop - > Atomic EDW -> Dimensional Analytics
And claimed that the Atomic EDW and Dimensional model were no longer necessary. We can go straight to the interface to Hadoop, and push all our queries against it. While this might be functionally true, how does it dovetail with the above assertions about pre-integration and pre-calculation?
A configuration without the Atomic EDW and Dimensional (data marts) - requires the user queries to re-calculate and re-integrate the very same data, on demand, repeatedly.
For example, if it takes three minutes to bring all the "raw" data together, what if we brought the data together and take the three-minute-hit just once? Then the user queries are benefitting from this pre-integration because they are not experiencing integration-on-demand.
It goes like this: We walk into a bike shop and take a look at all the cool bikes, various frames and options, and pick one. Rather than take it off the rack, the owner says he will have one ready in a week. What the? Just sell me that one, No, he says, we will build if from scratch, shower it with love, have reruns of the Dick Van Dyke show playing in the background, and it's all good. You'll love it. So you come back a week later to pick up the bike, and a lackey does the transaction for you while the owner is having the same speech with another customer. Doesn't seem very efficient, does it?
A 3NF model in Netezza practically requires a complete data rebuild every time we issue the query. The same tables are completely re-scanned, their data re-joined, their quantities re-calculated - hmm - we just did this five minutes ago when we ran the query last time, didn't we? On-demand re-integration/re-calculation is where we end up dissipating our power, and wasting our time.
If we were to take the next step with the raw data, reformulate new tables that pre-integrate and pre-calculate the necessary values, we have a consumption-ready model. When our queries hit it, the data streams from the machine faster, because all the heavy-lifting is already done. Integrate-once/consume-many.
At a site where a "raw" model like this prevailed, a very complex view was placed on the tables, joining dozens of them and performing a wide variety of work. As an experiment, we simply persisted the contents of the view without any filters. The same raw data as experienced by all on-demand queries would be pre-integrated into a single table. Where queries on the original view would take ten minutes or longer, the queries on this persisted form took under twenty seconds. We can see why - all the heavy lifting had been persisted - integrate once / consume many. This of course is not operationally practical since all those source tables were being trickle-fed every five minutes.
But the spirit of the operation remains. We should trickle-feed the inbound data to staging, and then forward the changes to a persisted location optimized for reporting.
I didn't mention that this view was used for over 200 "location" reports. Meaning that the same query ran over 200 times - that's 200 times 10 minutes - or 2000 minutes - 33 hours if running linearly. Because this query saturated the machine with data, they couldn't run anything in parallel. Their reports were always behind. After we converted it to the 20-second model, they could run the reports in a little over an hour if running linearly. Once we applied zone maps for their dates and their LOCATION_ID, the queries became sub-second, and could run dozens of them in parallel. The entire reporting cycle, over 200 reports finished in less than three minutes.
Isn't this the sort of performance we were looking for? The pre-integration and the attention to machine physics reduced a 33-hour job to under 3 minutes.
Now can we see how not-addressing inefficiency will dissipate the machine's power rather than harness it?
A symptom of a poorly implemented model can often be found in the shape of the query. If the query is complex, has a lot of work happening on-demand, or is performing a lot of left-outer joins, it simply means the data isn't consumption-ready. If we open our BI tool and find that a lot of columns are wrapped with NVL or COALESCE into known defaults, it's time to apply those defaults to the data and not clutter the business logic with scrubbing that should have already been done.
Let's look at a query template, or rather, the structure of a query attempting to leverage these transaction-facing tables.
Select (some stuff here)
from (Sub-select here) sub-select here)
Join master (columns, where-clause stuff )
Join secondary (join columns, where-clause stuff )
It's the presence of deep sub-selects, multi-nested views, and the left-outer than are all more than telltale - the data is in a raw state and not ready for mass consumption.
Our teams have remediated countless models from this malaise. Not particularly hard to do, but at all these sites, their folks were too busy to engage the problem full-time, and with the necessary confidence to get it right. Others took our advice to create "sandbox" databases, pull some data inside and formulate tables that worked for them, as a prototype. Once the prototype showed promise, they had evidence to discuss changes with the data modelers to eventually make it operational. This is exactly the process you'd see us follow on your site, because nothing works better than your own data to answer your performance questions.
By the same token, "some" of these sites knew what to do, but didn't do it. One site's leader said to me, "We went to Enzee Universe and took a lot of notes. We hired a guy to put this together and he said we didn't need any of that." Followed by a long awkward pause. I said "Is that why I'm here now?"" and he said, "Exactly." He was worried I was going to tell him he had to rebuild from scratch, but we know that's never the case. Performance optimization is centered in the big-ticket, heavy-lifting tables. If we reduce the noise around those, everything starts to hum like a sewing machine.
The takeaway in all this - if performance is an issue (don't feel you have a lot of data but the system seems stressed) the problem isn't your queries - it's your data model. Performance in Netezza is solely derived from hardware, and the hardware's physics is unlocked by optimized data structures, not efficient queries. Tuning a query in Netezza is a lot like using a steering wheel to make a car go faster. A query is logical but a performance is physical.
Get the data structures in a place where they leverage machine physics, and get the queries in a place where they unlock that physics, and cut 'em off the chain.