Modified by DavidBirmingham
One of the most significant questions to answer on the Netezza platform is that of "distribution" - what does it mean? how does it apply? and all that. If you are a Netezza aficionado, be warned that the following commentary serves as a bit of a primer for the un-initiated. However, feel free to make comments or suggestions to improve it, or pass it on if you like.
In the prior entry I offered up a simple graphic (right) as to how the Netezza architecture is laid out internally.
In this depiction, each CPU is harnessing its own disk drive. In the original architecture, this was called a SPU, because the CPU was coupled with a disk drive and an FPGA (field-programmable gate array) that holds the architectural firmware "secret sauce" of the Netezza platform. The FPGA is still present and accounted for in the new architecture as the additional blade(s) on the blade server. Not to over-complicate things, so let's look at the logical and physical layout of the data itself and this will be a bit clearer.
When we define a table on the Netezza host, it logically exists in the catalog once. However, with the depicted CPU/disk combinations (right) we can see 16 of these available to store the information physically. If the table is defined with a distribution of "random", then the data will be evenly divided between all 16 of them. Let's say we have a table arriving that has 16 million records. Once loaded, each of these SPUs would be in control over 1 million records. So the table exists logically once, and physically multiple times. For a plain straight query on this configuration, any question we ask the table will initate the same query on all SPUs simultaneously. They will work on their portion of the data and return it an answer. This Single-Instruction-Multiple-Data model is a bread-and-butter power strategy of Massively Parallel computing. It is powerful because at any given moment, the answer we want is only as far away as the slowest SPU. This would mean as fast as the SPU can scan 1 million records, this is the total duration of the query. All of the SPUs will move at this speed, in parallel, so will finish at the same time.
What if we put another table on the machine that contains say, 32 million rows? (trying to keep even numbers here for clarity). The machine will load this table such that each SPU will contain 2 million rows each. If no other data is on the machine, we effectively have 3 million records per SPU loaded, but the data itself may be completely unrelated. Notice how we would not divide the data another way, like putting 16 million records on 6 of the SPUs and the other 32 million records on the remaining 10 SPUs. No, in massively parallel context, we want all the SPUs working together for every query. The more SPUs are working, the faster the query will complete.
Now some understand the Massively Parallel model to be thus: Multiple servers, each containing some domain of the information, and when a query is sent out it will find its way to the appropriate server and cause it to search its storage for the answer. This is a more stovepiped model than the Netezza architecture and will ultimately have logistics and hardware scalability as Achilles's heels, not strengths for parallel processing. Many sites are successful with such models. But they are very application-centric and not open for reuse for other unrelated applications. I noted above that the information we just loaded on the Netezza machine can be in unrelated tables, and databases and application suites, because the Netezza machine is general-purpose when it comes to handling its applicability for data processing. It is purpose-built when we talk about scale.
But let's take the 16-million-record vs 32-million record model above and assume that one of the tables is a transactional table (the 32 million) and one is a dimensional table (the 16 million). The dimensional table contains a key called "store_id" that is represented also on the transactional table such that we can join them together in various contexts. Will the Netezza machine do this, and how will it? After all, the data for the tables is spread out across 16 SPUs.
Well, we have the option of joining these "as is" or we can apply an even more interesting approach, that of using a distribution key. Here's where we need to exercise some analysis, because it appears as though the STORE_ID is what we want for a distribution, but this might skew the data. What does this mean? When we assign a distribution, the Netezza machine will then hash the distribution key into one of 16 values. Every time that particular key appears, it will always be assigned the same value and land on the same SPU. So now we can redistribute both of these random tables on STORE_ID and be certain that for say, SPU #5, all of the same store_id's for the dimension table and for the transaction table are physically co-located on the same disk. You probably see where this is going now.
Ahh, but what if we choose store_id and it turns out that a large portion of the IDs hash to a common SPU? What if we see, rather than 1 million records per SPU, we now see around 500k per SPU with one SPU radically overloaded with the remainder? This will make the queries run slow, because while all the SPUs but one will finish fast, in half the time of before, they will all be waiting on the one overworked SPU with all the extra data. This is called data skew and is detrimental to performance.
However, had we chosen a date or timestamp field, our skew may have been even worse. We might see that the data is physically distributed on the SPUs in a very even form. But when we go to query the data, we will likely use a date as part of the query, meaning that only a few SPUs, perhaps even one SPU, will actually participate in the answer while all the others sit idle. This is called process skew and is also detrimental to performance.
Those are just some things to watch out for. If we stay in the model of using business keys for distribution and using dates for zone maps (which I will save for another entry) we will usually have a great time with the data and few worries at all.
Back to the configuration on store_id's. If we now query these two tables using store_id in the query itself, Netezza will co-locate the join on the SPU and will only allow records to emit from this join if there is a match. This means that the majority of the work will be happening in the SPU itself before it ever attempts to do anything else with it. What if we additionally clamped the query by using a date-field on the transaction table? The transaction table would then be filtered on this column, further reducing the join load on the machine and returning the data even faster.
So here is where we get lifting effects where other machines experience drag. For an SMP-based machine, adding joins can make the query run slower. On a Netezza machine, joins can serve as filter-effects, limiting the data returned by the query and increasing the machine's immediate information on where-not-to-look. Likewise the date-filter is another way we tell the machine where-not-to-look. As I have noted, the more clues we can give the machine as to where-not-to-look, the faster the query will behave.
I have personally witnessed cases where adding a join to a large table actually decreases the overall query time. Adding another one further decreases it. Adding another one even further and so on - five-joins later we were getting returns in less than five seconds where the original query was taking minutes. This filter-table and filter-join effect occurs as a free artifact of the Netezza architecture.
Distribution is simply a way to lay the data out on the disks so that we get as many SPUs working for us as possible on every single query. The more SPUs are working, the more hardware is being applied to the problem. The more hardware, the more the physics is working for us. Performance is in the physics, always and forever. Performance is never in the software.
While the above is powerful for read-query lift (co-located reading) there is another power-tool called the co-located write. If we want to perform an insert-select operation, or a create-table-as-select, we need to be mindful that certain rules apply when we create these tables. For example, if we want to perform a co-located read above and drop the result into an intermediate table, it is ideal for the intermediate table to be distributed on the same key as the tables we are reading from. Why is this? Because the Netezza machne will simply read the data from one portion of the disk and ultimately land it to another portion of the same disk. Once again it never leaves the hardware in order to affect the answer. If we were to distribute the target table on another key, the data will have to leave the SPU as it finds a new SPU home to align with its distribution key. If we actually need to redistribute, this is fine. But if we don't and this is an arbitrary configuration, we can buy back a lot of power just by co-locating the reads and writes for maximum hardware involvement.
So that's distribution at a 50,000 foot level. We will look at more details later, or hop over to the Netezza Community to my blog there where some of these details have already been vetted.
It is important to keep in mind that while distribution and co-location are keys to high(er) performance on the machine, the true hero here is the SPU architecture and how it applies to the problem at hand. We have seen cases where applying a distribution alone has provided dramatic effects, such as 30x and 40x boost because of the nature of the data. This is not typical however, since the co-located reads will usually provide only a percentage boost rather than an X-level boost. This is why we often suggest that people sort of clear-their-head of jumping directly into a distribution discussion and instead put together a workable data model with a random distribution. Once loaded, profile the various key candidates to get a feel for which ones work the best and which ones do not. We have seen some users struggle with the data only because they prematurely selected a distribution key that - unbeknownst to them - had a very high skew and made the queries run too slow. This protracted their workstreams and made all kinds of things take longer than they should have.
So at inception, go simple, and then work your way up to the ideal.
On multiple distribution keys:
Many question arise on how many distribution keys to use. Keep in mind that this is as much a physical choice as a functional one. If the chosen key provides good physical distribution, then there is no reason to use more keys. More granularity in a good distribution has no value. However, a distribution key MUST be used in a join in order to achieve co-location. So if we use multiple keys, we are committing to using all of them in all joins, and this is rarely the case. I would strongly suggest that you center on a single distribution key and only move to more than one if you have high physical skew (again, a physical not functional reason). The distribution is not an index - it is a hash. In multi-key distributions, the join does not first look at one column then the next - it looks at all three at once because they are hashed together for the distribution. Joined-at-the-hip if you will.
On leaving out a distribution key:
One case had three tables joining their main keys to get a functional answer. They were all distributed on the same key, which was a higher-level of data than the keys used in the join. The developer apparently thought that because the distribution keys are declared that the machine will magically use them behind the scenes with no additional effort from us. This is not the case. If the distribution key(s) (all of them) are not mentioned in the join, the machine will not attempt co-location. In this particular case, using the higher-level key would have extended the join somewhat but would not have changed the functional answer. Simply adding this column to the join reduced that query's duration by 90 percent. So even if a particular distribution key does not "directly" participate in the functional answer, it must directly participate in the join so as to achieve co-location. And if this does not change the functional outcome, we get the performance and the right answer.
How it affects concurrency:
Many times people will ask: Why is it that the query runs fast when it's running alone, but when it's running side-by-side with another instance of itself they both slow to a crawl? This is largely due to the lack of co-location in the query. When the query cannot co-locate, it must redistribute the data across the inter-blade network fabric so that all the CPUs are privy to all the data. This quickly saturates the fabric so that when another query launches, they start fighting over fabric bandwidth not the CPU bandwidth. In fact some have noted that the box appears to be asleep because the CPUs and drives aren't doing anything during this high-crunch cycle. That's right, and it's a telltale sign that your machine's resources are bottlenecked at the fabric and not the CPU or I/O levels. By co-locating the joins, we keep the data off the fabric and down in the CPUs where it belongs, and the query will close faster and can easily co-exist with other high-intensity queries.
Once again this entry will serve as an intro to the un-initiated. I get a lot of pings from folks who are new to the stuff and want to know their way around the machine from the beginning. Zone maps are bread-and-butter power tools of the Netezza technology. The companion capabilities are the materialized view and the cluster base tables (which are new to v 6.0).
About three years ago I sat in on a special presentation to our firm from one of Netezza's competitors. They put up some power-point slides to compare their technology to "the others". Their product was a data warehousing product and all the "others" were "OLTP" products. The usual suspects were in this list of course, but oddly so was Netezza. But Netezza is not an OLTP machine in any sense of the word. How odd that this company did not know enough about its primary rival to discern this simple truth. In addition, this vendor made certain odd claims about Netezza that were either architectural false-assumptions or just misinformation. One of which was this "In a Netezza machine, the more data you add to a table, the slower the table's response will be." This is of course completely false. Netezza tables are like any data storage with one major takeaway that should not be missed:
Netezza tables may behave inconsistently if the data is not properly configured and deployed. Notice I did not say "if the hardware is not properly ...." but the data itself. Netezza already has the hardware configuration under control and there's nothing we can do to affect it. We can however, align our data with the hardware architecture and will receive the benefit of a consistent experience no matter how much the data grows. It's hard to imagine this kind of consistency on any platform, so stay with me here.
A number of years ago I did some deep immersion in a health-care expert system for Value Health Sciences. The objective was to directly apply and enforce the codes found in the CPT code book used by physicians (those are the codes the doc will check on the sheet he hands to you before he leaves the office). Insurance companies have standard billing/invoicing arrangements with the physicians and these codes are the centerpiece. For VHS, their medical practitioners had set up rules for characterizing (and properly re-characterizing) a claim. For example, if a physician billed codes 1,2 and 4 but he was supposed to have billed these all as 1 and 5 (because code 5 bundles 2 and 4 to reduce cost) the system would have a rule in it that looked something like:
1,2,4 = 1, 5
The difficulty here of course, with some 100,000 codes and over 40,000 rules, is to organize them so that they are quickly accessible. By design, the first code in the rule (eg. code "1" above) was considered an "index" code. In other words, any and all rules that would use the code "1" as an anchor would also make "1" the first code. This lended itself to the opportunity to "vector" the rules. That is, create an in-memory array, indexed by the given code, and a list of rules would be handily available at that vector point. So that all we had to do was look at the claim (e.g. codes 1,2,4) fetch the various candidate rules from their vector lists ( the list for code 1 might have 3 rules, likewise for the others, reducing our total candidate rules from 40,000 to only 9 rules). We could then run the rules and dispatch the claim. When I had first received this model, the system was physically running through all 40,000 rules to find the right match. This vectored approach reduced the total rules, but also eliminated guesswork. And with all that, boosted the power and created a very consistent turnaround time for each patient claim.
The vector list above is a common way in programming to organize information so that - if we have something in our hands already - like a filter or a vector based on a natural key - we can quickly find where we need to start looking and eliminate the places where we are not interested. While this sounds a lot like a common index-structure approach, it's actually a process-of-elimination approach. The two have similar qualities, so I will delineate them here at a high level.
An index is a surgical approach to finding data by logging its record-based keys and a record locator into a separately maintained structure. By finding keys, we find physical records. This is important for transactional systems that need fast record-level turnaround. It is not ideal however, for analytics or data warehousing where the vast majority of queries don't care about individual records at all, but about answers that span records. This alternative approach of vectoring, is a way to bucket the information so that we're not attempting to specifically locate the record itself, but the bucket where the record may be located, that is, derive a group of candidates upon which we will take further action.
When we think about this, the candidate approach, as opposed to the surgical approach is the only way to scale anything that has to deal with multiple-record sets (set-based processing). Engines that are geared to examine one-record-at-a-time even if it is to aggregate or perform multi-record operations with the single-records it retrieves, cannot compete with an engine that is designed to defacto treat the records as multiples from the outset. Transactional versus bulk. The transactional engine is inherently slower and non-scalable compared to the bulk engine (for set-based operations).
I noted in a prior entry that Netezza's anti-indexing strategy heavily leverages "where not to look" because no matter how large a table gets, this "vectoring" or "candidate" approach will guarantee that our response time is very consistent. How does this apply to zone maps?
Recall the diagram with the CPU/disk combinations. Each of these disks is divided into its smallest actionable physical portion (an extent) and the extent then becomes the currency-of-the-realm when we're talking about searching for data. The machine automatically tracks the data that is added to an extent, and when we generate-statistics, every extent is accounted for in a zone map. The map has basic statistics on the columns in the table at the extent level, such as the high/low values for a column in the given extent.
Extent ColA ColB ColC
1 200 A B
1 300 A B
1 400 C D
1 500 E F
2 200 F G
2 400 H I
2 600 J K
3 100 L M
3 200 J O
Note the above layouts of the extents. If we were to look at the high/low values of each, we would find zone maps like this:
Extent ColA High ColA Low
1 500 200
2 600 200
3 200 100
So now if we ask a question to the machine based on say, a ColA value of 200, the machine will search all three extents. It has no choice because the zone map is telling it that pertinent data exists in all three.
Now let's physically sort the data by ColA
Extent ColA ColB ColC
1 100 L M
1 200 A B
1 200 F G
1 200 J O
2 300 A B
2 400 C D
2 400 H I
3 500 E F
3 600 J K
Now if we take a look at the zone map, we would see something like:
Extent ColA High ColA Low
1 200 100
2 400 300
3 600 500
So that now if we query the table for values of 200, the machine searches only one extent, not all of them. (Now, this is a simple example, because a given value could extend across many extents). Likewise a given disk has thousands of extents upon which to store the data, so it can spread the information out into much smaller, manageable chunks.
The point, however, is that by organizing the data this way, the zone maps reduce the total searchable extents. Keep in mind that we don't have to sort the data each time we want to use it. For many warehouses, the data naturally arrives in this contiguous form (for example, transaction date in a retail model). The Netezza machine will track all integers and dates automatically without us having to do anything additional to enable it. So we might take a penalty once to sort the data at the outset, then benefit from the data's natural arrival and collation.
This effectively creates the aforementioned vectored effect. It also guarantees a consistent experience because the data associated with the given value (say 200 above) isn't going anywhere. It will always be in the same location, tracked by the same zone map. Anytime we want to get back to the records associated with value 200, the zone map will serve them up. If the response time last week was 2 seconds, we could quadruple the size of the table and it will still be 2 seconds, because the data itself didn't go anywhere.
Contrast this to other heavy-lifting algorithms in software technologies. The Ab Initio platform, for example, will perform an in-memory Sort that will "intelligently" spill to disk when it runs out of memory. What this means is that it will take a predefined block of memory and attempt to perform all work inside it until it runs out. If so, it will manufacture 16 file locations within which to spill its overflow. These are organized with the same "vectored" approach as the sort keys would dictate. If these 16 files run out of space, they will further divide into 256 files to split the workload. Anyone working with the technology will tell you that this secondary split into 256 is to be avoided if at all possible because it negatively (and significantly) affects the performance of the sort. But we see the difference here. A general-purpose sort in software needs these 256 assets to functionally complete its mission, and adding more means less performance. Now compare this to Netezza which uses not just 256, but thousands of extents to divide the data out - and rather than experience the performance drag, will derive power and lift from this.
So let's apply something additional, especially for highly active tables. If we have a nightly batch window for processing, we likely have all the time we need to keep these zone maps healthy. But if we have an intraday model or even an aggressive micro-batch model, it is possible that the zone maps may be "violated". Note above that as long as the data arrives in contiguous chunks (like transaction date) the data will be set into the extents so that we minimize the total extents for a search. If a transaction date (for example) should appear outside of this contiguous range (say a record from last week's data has arrived) then it means the high/low range for that particular extent will be extended to include the errant record. It could mean that the given extent is also included in queries that don't related to either date, because the extent is simply being included because of its high-low range. That is, every query looking for this extent's primary transaction date, the errant transaction date, or any date in between them, will include this extent in the candidate list. While this is inconsequential in most cases, if it repeats over time it can slowly degrade the table's performance. Highly active micro-batch and intraday models have a tendency to see these effects if they remain unchecked or unguarded. But not to worry - the solution is simple.
What is daunting about this is that our highly-active-table might be a multi-billion-row beast that does not lend itself to re-sorting and in an intraday model, there's no time to re-sort the data anyhow.
A materialized view has the quality of being able to apply a virtual zone map to a table without physically re-sorting the information. All we have to do is pick the columns we want to zone map, and then form the materialized view with an order-by on those columns. Rebuilding one of these is usually very fast. In addition, we can be rebuilding it even while its prior version is active. At one site, we set up a process to rebuild the view thus:
Note that for the few moment that two materialized views are applied to two tables bears no risk at all. And if we drop the original materialized view, it will wait for all pending queries to stop - meaning that this won't break anything the users are doing (another necessity for highly active intraday models)
Important here is that the materialized views are not being used as surrogate tables, but as a means to virtualize the zone maps for faster response time on the table associated with the view. Another important aspect of the materialized view is that it can zone-map additional data types that the regular zone maps do not (e.g. varchars).
Let's say that we have a very active table and even re-creating the materialized view is too much uptake to maintain it. Not to worry, the materialized view has another quality that means we don't have to refresh it each time the table changes. The materialized view maintains its contents until the underpinning table changes. We then have the option to rebuild the view or use it "as is". This "as is" is usually agreeable for the short term, since when data is added to the underpinning table, the materialized view creates an extension to itself. Now whenever we query the table, it will examine the materialized view for default zone-map information but will also examine the extent. Considering that this extra extent is very small it is also insignificant in its overhead. So we can keep this in place for say, all day long and then clean up the materialized view in the evening. The materialized view's performance will not drag over the course of a day unless we do some serious re-churning of the underpinning data (which is not normal for common micro-batch models).
And now - Cluster-based tables
With all the above preliminary stuff out of the way, a discussion on CBTs is fairly straightforward. As of the 6.0 release, Netezza has provided this as a multi-dimensional capability for even more powerful searches. The data in the CBT is physically organized into extents based on keys. And when we perform a groom operation, the data is physically reshuffled so that the keys are physically co-located on the fewest extents possible. Like a self-enforcing zone map. only a lot more powerful. In a zone map approach, the primary way to get the data contiguous is to re-sort the data and the zone maps fall out for free. In a CBT, we organize the data and the execute groom. The initial groom will take awhile, but each subsequent groom will take inconsequential time. The data in the CBT can be organized on more than one key, but the keys can behave independently like a multi-dimensional effect.
Case in point - let's say we have 200 extents of data and the particular set of data we want spans only 10 of those extents (a zone map). We will ignore the other 190 extents and focus all the machine's energy on the 10 where the data is located. But wait, what if we have an additional key upon which to search, and the combination of the first key (10 extents) and second key reduces the total extents to only 2? That is, the first key appears in 10 extents but the first and second key appear in only 2 extents of those 10. Now we search only 2 extents. Again, we have organized and configured data, not hardware, to achieve the goal. We have told Netezza where-not-to-look and now we have zoomed in on a mere 2 extents out of 200.
But let's say that we add a lot more data to the table over the year, expanding this 200 extents to many hundreds if not thousands more. Can we see now that even if this happens, we can go back to the table again and again, ask the the same question as before, and only 2 of the extents will be used to search for data. This of course dispels the vendor-promulgated propaganda that Netezza tables get slower when we add more data.
At far back as the 2007 Enzee Universe conference in Boston, people have scratched their head on the notion of a large-scale database that actually boasts about the absence of index structures. After all, indexes are the mainstay of relational databases and we simply can't get by without them, right? This is a simple example of how the power and architecture of the technology frees us to think about data loading and storage, data retrieval and in-the-box processing in a completely different way.
Firstly, it's no rumor that Netezza has no indexes. And for those of us who can't stand dealing with them, this is a huge plus. One of the Enzees at the 2011 conference asked point-blank - "What does Netezza have that Oracle does not?" and the clear answers will arise in this and later blog entries, but for now the absence of maintenance of index structures, will do just fine. What does Netezza have?
Or rather, the vendor has spent enormous thought labor into simplification of complex matters so that we don't have to deal with them. We can get to the business of - well - the business. The application data and information that we want to spend more time with if only we weren't dealing with the technical administration of the database engine. This is one of the greatest weaknesses of the traditional relational database when it comes to data warehousing in general and bulk processing in particular. It is also one of their greatest weaknesses in the space of analytics. After all, we want to apply the analytics by selecting subject areas and data sets to analyze, but if we have to stand up lots of infrastructure to support this mission we are regressing toward functional hardening rather than functional flexibility. Brittleness ensues and someone asks for a refactor, a re-architecture or whatever. When this request arrives in relation to a complex, engineered, multi-terabyte (or say tens-of-terabytes) data store, many will see it as a mind-numbing proposal. For most infrastructures, the weight and complexity of the installation has overwhelmed the logistical capacity of the humans to ever hope reeling it back in. There's not enough power in the machine to help. This could take a very long time to remediate.
Not so with an appliance. Such big-data issues are its bread-and-butter zone. If we have issues with a particular data model or information construct, the machine has the power (and then some) to get us out of the bad direction and into the new one. How did the direction get "bad" in the first place? The business changed its information priorities since the inception of the original model, and now the original no longer serves it well. The business has moved, because that's what businesses do. The analysts found opportunities in the nuggets of information-gold, and now want direct access to that gold rather than having to navigate to it or salivate while waiting for it.
One of the core features of the Netezza platform is how the data is distributed on the disk drives. Because it is an MPP, each of the disk drives is its own self-contained, shared-nothing hardware structure. Contrast this to the common SMP-based platform where the CPUs exist in one duck pond and the disk drives exist in another duck pond. The data is then couriered between duck ponds through throttled estuaries. If we issue a query, all of the data has to be pulled through this estuary and presented to the CPU ducks so that the data can be manipulated and examined in software. It is a software engine on a general-purpose platform.
However, imagine the Netezza platform where each Intel CPU is mated with special hardware (an FPGA), a bounty of RAM to manipulate and cache data, a self-contained Linux operating system, and its own dedicated disk drive. Imagine also that the disk drive is itself divided into multiple sections, where the inner sections are used for internal processes like RAID, temp space and system data but the outermost ring is where user data is stored, offering the benefit of the fastest disk speed for the oft-accessed information. All of these little attention-to-detail aspect cause each of the CPUs to run at a much more powerful factor than their common SMP counterparts. Why? Because their data is co-located with the CPU, where with the SMP we have to drag the data out of one duck pond to get it close to the CPUs that may operate on it. And with SMP there is no such thing as dedicated CPU-to-disk access. The SMP CPUs are shared-everything, but also the disk drives - shared-everything at the hardware level but shared-nothing at the functional-logical level. Netezza's CPUs are shared-nothing at the functional-logical level and shared-nothing at the CPU/disk level.
In Netezza, let's imagine putting 100 of these CPU/disk combinations to work. When we form a table on the host, the table exists logically in the catalog in one place. But it exists physically in 100 places. If we load 100 million records into the machine distributed randomly, then each of the CPUs will "own" 1 million of those records. If we issue a query to the table, each of the CPUs will operate only on the million records it can see. So for any given query, we are not serially scanning 100M records. We are scanning 1M records 100 times.
Now some may object, that we're still scanning the data end-to-end, and for plain-vanilla queries, this is true. However, I recall the first time I performed a join on two tables that were 100M records each doing a one-to-one join on unique keys, on a machine with 24 of these CPUs and the answer returned in 13 seconds. This was a plain vanilla test, so your actual mileage may vary. However, I did the same test on the same machine with 1B records joined to 1B records and the answer returned in less than 300 seconds. Imagine attempting this kind of join on a traditional relational database and expecting a return in any time to actually use the result. (and no, I did not use co-location for this test - a matter for another blog entry)
We have an additional "for free" aspect of the machine called a zone map. If data is laid down in contiguous form (the way transactions arrive in a retail point-of-sale model for example) then the machine will automatically track these blocks and keep their locations in a "zone map". If we then query the database with a given date, the machine will ignore the places where it knows the data is not, and use the zone maps to locate the range of data where it needs to search.
As as example of this, we know of a 150-billion-row table that is over 25 terabytes in size, distributed and zoned such that over 95 percent of its queries return in 5 seconds or less. For a traditional RDBMS, it would take more time than that just to get its index scans squared away so that it could even approach the storage table. This is also why the Netezza machine itself can be scaled into the petabyte zone while maintaining the same performance. No indexes are in the way to load it. No indexes are necessary to search it. Imagine now: the two most daunting aspects of data warehousing and analytics - loading the data and keeping up with the user base - have been washed away with the elimination of indexes. (Don't we turn indexes off in a traditional SMP database so it will load faster, and don't we chase a rainbow trying to index and re-index the tables to keep up with user needs?) Not with Netezza. Without indexes on the columns, all columns are fair game for searching. Without indexes in the way for loading, we can deliver information into the machine, a reprocess it while there, with no penalty from the use of indexes.
By this measure, this is an anti-indexing strategy because Netezza operates on the basic principle of where-not-to-look. In other words, if we can tell it where the data is not, it can zone-in and find the data. When we think about it, this is how a common brick-and-mortar warehouse works. If we showed up and asked for a box of nails, the attendant knows that he doesn't have to look in the parts of the warehouse that carry lawn equipment, hammers, saws or window draperies. He knows where the nails don't exist.
Contrast this to the common SMP-based indexed database, which uses exactly the opposite approach. The indexes are searched for the specific key and then this key is applied to the primary data store. This is why indexed structures in general cannot scale in the same manner as an anti-indexed structure. Keep in mind, with a Netezza platform it won't matter how much data we put on that 25-terabyte table. We could double, triple it or more - and it would always provide the answer in a consistent time frame. This is because from query-to-query - it's still not-looking in the same places, no matter how big those places get. It will still continue to ignore them because the data's not there and it already knows that.
I've had people tell me that there is no real difference in the SMP versus the MPP. The SMP, "properly configured" (they claim) is just as good as any old MPP. However, there is no "proper" way to configure a general purpose hardware architecture so that it will scale. The only way we could hope to coordinate these CPUs is in software, and the only way we can get data into the CPUs is by accessing the shared disk drives in another duck pond on completely different hardware. The SMP configuration is by definition the wrong configuration for scalable bulk processing of any kind, so there is no way to "properly configure" something that already the inappropriate storage mechanism. This would be no different than claiming that a "properly configured" VW Bug (Herbie notwithstanding) could be just as fast as a stock car. The VW Bug is not the wrong platform for general purpose transportation. But it's the wrong platform for model requiring high-scale and high-performance, just an an SMP-based RDBMS cannot scale with the same factor (for set-based bulk processing) as a machine (Netezza) specifically built for this mission. Only a purpose-built machine can ultimately harness this level of data, and only an appliance can remove the otherwise mind-numbing complexity of keeping it scalable.
In the next weeks leading up to IOD (where I will be speaking on most-things Netezza) I'll offer up some additional insights on the internals of the architecture and how it differs from the traditional platforms.
When introducing the Netezza platform to a new environment, or even trying to leverage existing technologies to support it, very often the infrastructure admins will have a lot of questions, especially concerning backups and disaster recovery. Not the least of which are "how much" "how often" and such like. More often than not, every one our responses will be met with a common pattern, a sentence starting with the same two words:
Case in point, when we had a casual conversation with some overseers of backup technology as a precursor to "the big meeting" we almost - quite accidentally - shut down the conversation entirely. Just the mention of "billions" of rows, or speaking of the database in "Carl Sagan" scaled terms, caused them to want to scramble for budget and market surveys of technologies that were more scalable than their paltry nightly tape-backup routines. In this particular conversation, we were talking about nightly backups that were larger than the monthly backups of all of their other systems combined. Clearly we were about to pop the seams of their systems and they wanted a little runway to head off the problem in the right way.
But what is the "right way" to perform a backup where one of the tables is over 20 terabytes in size, the entire database is over 40 terabytes in size and the backup systems require two or even three weeks to extract and store this information for just one backup cycle? Quipped one admin "It takes so long to back up the system that we need to start the cycle again before the first cycle is even partially done." and another "Forget the backup. What about the restore? Will it take us three weeks to restore the data too? This seems unreasonable."
Yes it does seem unreasonable precisely because it is - quite unreasonable. As many of you may have already discovered, the Netezza platform is a change-agent, and will either transform or crush the environment where it is installed, so voracious are its processing needs and so mighty is its power to store mind-numbing quantities of information.
The aforementioned admins simply plugged their backup system into the Netezza machine, closed their eyes and flipped the switch, then helplessly watched the malaise unfold. It doesn't have to be so painful. These are very-large-scale systems that we are attempting to interface with smaller-scale systems. We might think that the backup system is the largest scaled system in our entire enclosure, but put a Netezza machine next to it and watch it scream like a little girl.
So here's the deal: No environment of this size shoiuld be handled in a manner that is logistically unworkable for the infrastructure hosting it. We can say all day that these lower-scaled technologies should work better or that Netezza should pony-up some stuff to bridge the difference, but we all know that it's not that easy. Netezza has simplified a lot of things, but simplification of things outside the Netezza machine - aren't we asking a bit much of one vendor?
To avoid pain and injury, think about the things that we need to accomplish that are daunting us, and solve the problem. The problem is not in the technology but in the largesse of the information. We would have the same problem on our home computers if we had a terabyte of data to backup onto a common 50-gig tape drive. We would need twenty tapes to store the data. The backup/restore technology works perfectly fine and reasonably well for a variety of large-scale purposes. We simply need to be creative about adapting it to the Netezza machine. Don't plug it in and hope for the best. Don't do monolithic backups. The data did not arrive in the machine in a monolithic manner so why are we trying to preserve it that way? Leave large-scale storage and retrieval to the Netezza machine and don't crush the supporting technologies with a mission they were never designed for.
Several equally viable schools of thought are in play here. What we are looking for is the most reliable one. Which one will instill the highest confidence with least complexity? The more complex a backup/restore solution becomes, the less operational confidence we have in it. If it cannot backup and restore in a reasonable time frame, we exist in a rather anxious frontier, wondering when the time will come that the restoration may be required and we put our faith in the notion that it either won't, or when it does all of the other collateral operational ssues will eclipse the importance of the restoration. In other words. future circumstance will get us off the hook. There is a better way, like a deterministic and testable means to truly backup and restore the system with high reliability and confidence.
On deck is the simplest form of the solution - another Netezza machine. Many of you already have a Disaster-Recovery machine in play. Trust me when I tell you that this should be fleshed out as a fully functional capability (discussed next) and then the need for a commodity backup/restore technology evaporates. Using another Netezza machine, especially when leveraging the Netezza-compressed information form, allows us to replicate terabytes of information in a matter of minutes. I don't have to point out that none of our secondary technologies can compete with this.
A second strategy requires a bit more thought, but it actually does leverage our current backup/restore technology in a manner that won't choke it. It won't change the fact that the restoration, while reliable, may be slow simply because moving many terabytes in and out of one of these secondary environments is inherently slow already.
A third strategy is a hybrid of the second, in that enormous SAN/NAS resources are deployed as the active storage mechanism for the data that is not on (or no longer on) the Netezza machine. This can be a very expensive proposition on its own. We know of sites that keep the data on SAN in load-ready form, and then load data on-demand to the Netezza machine, query the just-loaded data, return the result to the user and then drop the table. You may not have on-demand needs of this scale, but this shows that Netezza is ready to scale into it.
A fourth strategy is a hybrid of the first, that is we still use a Netezza machine to back up our other Netezza machine, but we use the more cost-effective Netezza High-Capacity server, which is less expensive than the common TwinFin (fewer CPUS, more disk drives) but otherwise behaves in every way identically to its more high-powered brethren. And honestly if we were to put apples-to-apples in a comparison between the cost of a big SAN plant to store these archives versus the High Capacity Server, the server wins hands-down. It's cheaper, simpler to operate, doesn't require any special adaptation and we can replicate data in terms guided by catalog metadata rather than adapting one technology to another.
So let's take these from the least viable to the most viable and compare them in context and contrast, and let the computer chips fall where they may.
Commodity backup/restore technology
If we want to leverage this, we need to understand that it cannot be used to perform monolothic operations. These are unmanageable for a lot of reasons:
To mitigate the above, we need to adapt the large-scale database to the backup technology by decoupling and downsizing the operation into manageable chunks. This is a direct application of the themes surrounding protracted data movement in any environment. The larger the data set, the more the need for checkpointed operations so that the overall event is an aggregation of smaller, manageable events. If any single event fails, it is independently restartable without having to start over. Case in point, if I have 100 steps to complete a task and they are all dependent upon one another, and the series should die on step 71, I still have 29 steps remaining that may have completed without incident, but I cannot run them without first completing step 71. This is what a monolithic backup buys us - an all-or-nothing dependency that is not manageable and I would argue is entirely artificial.
To continue this analogy, lets say that any one of these 100 steps only takes one minute. In the above, I am still 30 minutes away from completion. I arrive at 6am to find that step 71 died, and now I have to restart from step 1 and it will cost me another 100 minutes. Even if I could restart at step 71, I am still 30 minutes away from completion.
Contrast the above to a checkpointed, independent model. If we have 100 independent steps and the step 71 should die, the remaining 29 steps will still continue. We arrive at 6am to find that only one of the 100 steps died and we are only 1 minute away from full completion. The difference in the two models is very dramatic:
Monolithic means we are operationally reactive when failure occurs. The clock is ticking and we have to get things back on track and keep moving. Checkpointed means we are administratively responsive when failure occurs. We don't have to scramble to keep things going. In fact, in the above example, if step 71 should fail and the operator is notified, doesn't the operator have at least half an hour to initiate and close step 71 independently of the remaining 29 steps? Operators need breathing room, not an anxious existence.
Monolithic methods are supported de-facto by the backup/restore technology. If we want to perform a checkpointed operation, we have to adapt the backup/restore process to the physical or logical content of the information. We don't want to directly mate the backup technology itself, so we need to adapt it.
Logical Content Boundaries
This means we have to define logical content boundaries in the data. What's that? You don't have any logical content boundaries, not even for operational purposes? Well, per my constant harping on enhancing our solution data structures with operational content, such as job/audit ID and other quantities, perhaps we need to take a step back and underscore the value of these things because they exist for a variety of reasons. One of them is now upon us - the operatonal support of content-bounded backups. It is required for scalability and adaptability and is not particularly hard to apply or maintain.
A more important aspect of content boundary is the ability to identify old versus new data. If the data is carved out in manageable chunks, some will always be least-recent and some more-recent. Invariably the least-recent chunks will be identical in content no matter how many times we extract them for backup. This means we can extract/backup these only once and then focus our attention on the most active data. In a monolithic model, there is no distinction between least or most active, least or most recent. In large-scale databases, the least-recent data is the majority, so the monolithic backup is painfully redundant when it need not be.
Do we absolutely need content-bounded backups for all of our tables to function correctly? Of course not. But by applying this as a universal theme it allows us to treat all tables as part of a backup framework where all of them behave the same way. So part of this is in the capture of the data but a larger part is the operational consistency of the solution.
Many reference tables such as lookups will never grow larger and we know this. In fact, their data may remain static for many years. For the ones that are tied to the application and grow or change every day, these we will call solution tables. They are typically fed by an upstream source and are modified on a regular basis. Any of these tables can grow out of control. The reference data then represents a very low operational risk. Why then would we not simply fold the reference data into the larger body and treat all the tables the same? There is no operational penalty for it, but enormous benefit from being able to treat all tables the same way inside a common solution.
At this point, the backup/restore framework will address all of the tables the same way, but now we have the ability to leverage rules and conditions within the framework so that special handling is available if necessary. This is a common theme in large-scale processing: Handle everything as though it will grow, but accommodate exceptions with configuration rules. I'll forego this aspect for now and let' take a look at what we need in basic terms:
- An intermedaite landing zone: Some off-machine storage location that can hold the data while in transit - e.g a NAS or SAN volume. This will not be the same size as the database but some fraction of it. It is a workspace for intermediate files, not permanant storage
- Content boundaries: A means to define and manage content bthese are tags or quantities in the data itself. They need to be consistent for all tables so that the backups and restore operate the same way. Of course, if we don't have these, we need to apply them.
- A restoration database: a separate database that will be used for the restoration process, because we won't be restoring the data directly into the table from whence it came. Why not? There is an increasing likelihood that the data structures have changed since the last backup. If the structure has changed, we cannot reload the data into it. We need a way to shepherd the data back into the machine.
- Processes: that capture / backup / restore the data using content boundaries and common utilities. These are typically control scripts that are easily configured, deployed and maintained
- Archival Database (next): Physical content replicas of the masters that hold all of the original information in a form useful for backup and restore. Whether they keep the data active/online is immaterial. Their role is to interface with the backup/restore mechanisms so that any data once made offline can easily be made online through this database.
Setup formal archival databases. Whether these reside on the same server, or on a DR / High Capacity server (below) is immaterial. The point is that the data in the master tables will be actively rolled into these tables, which will form the backbone for all backup and restoration operations. We therefore replicate the masters (with streaming replication) into the archival stores and then at some interval perform the backup of the archives, not the master.
Foreshorten the Master Tables
Now that we have a means to define content boundaries - you did apply those boundaries right? We can now look at the database holistically for optimizations based on active data.
At one site we have a table with ninety billion records in three different fact tables spanning over ten years of information, However, the end-users and principals claimed (and we verified) that the most active data on any given day is the most recent six months. Anything prior to that, they would query perhaps once or twice a year for investigative purposes, but had not tied any reports to it.
So now we have an opportunity to get agreement from the users to shorten the master tables. This is especially necessary for those fact tables. In the end, these fact tables (above) were shortened to less than four billion rows each and these are kept trimmed on a regular basis. The original long-term data is held in another archival database that is the foundation for the backups and restores.
On a per-table basis, use the content boundaries to manufacture/compile a set of checkpointed instructions for each table to support extraction and restoration
Execute the extractions to create flat files
Each extraction will pull data to a Netezza-compressed flat file and also supply another file with the metadata instructions necessary to restore the file's contents. On a per-file basis, these instructions will include:
Original table definition, used to construct a template of the table in the restoration database
File load instructions used to get the content back into the machine
Transformation command, used to move data from the restoration database into its original home
- The files will be extracted (using content boundaries) onto the storage landing zone
- The smaller extractions can be performed in parallel, are independent and form their own checkpointed process
- The backup techology will begin its backup process once the second (metadata companion) file arrives for the set. The metadata companion file behaves as a trigger.
- Once the backup for the file-set is complete, the backup/restore deletes the file and its companion
- This decouples the backup/restore processing from the database (completely) so that it can focus solely on file-based backup and restore
- For restoration, both the content file(s) and companion metadata file(s) will be retrieved and placed on the storage landing zone
- The metadata will be used to reconstruct an empty intermediate version of the original table in the Restoration Database
- Load the table using nzload
- Then the metadata will be used (even a preformulated SQL statement) to copy the data into the original target
- Throw away the intermediate load table, delete the data file and its companion metadata file
- The restoration process can be run as many ways-parallel as supported by nzload
- The restoration process can also be surgical, with the more agile ability to restore data in smaller segments
The above protocol, while likely easy to pull off an administer, still has a number of moving parts that the Netezza based-equivalent (later) will not have.
The only difference between the above protocol and one using a SAN-based storage mechanism, is the absence of a formal backup/restore technology. Rather the SAN is the long-term storage location and we perform the incremental extractions onto it. Rather than delete the files, we keep them.
This has significant implications for the cost of the SAN. After all, if we intend to interface this to the Netezza machine, we would not want common NAS storage because it is too slow and the vendors actively disclaim their technology from being viable for data warehousing. The primary reason is that the network and CPUs are set up for load-balancing, like a transactional database, but not bulk onload/offload of the data.
Not only will we need enough SAN to backup the environment, but to also carry fathers and grandfathers if need be (this is a policy decision). With checkpointed extracts, the father/grandfather issue is largely moot. This is because once a checkpointed extract of older data is pulled and stored, it won't be changing and capturing another one just like it has no value.
In this approach we leverage another Netezza machine like a DR server, as our backup, archival and restore foundation. It can easily hold the information quantity. The difference here is in price, of course, since a fully-functional TwinFin is more expensive than most common SAN installations. However, the High Capacity Server (below) mitigates this pricing problem while delivering a consistent data experience.
One primary benefit of performing backups into the DR server is that it can automatically serve the role of a hot-swap server in case of failure in the primary server.
For this scenario to work however, we would want streaming replication between the active databases and the DR server so that the data is being reconciled while being processed. This allows us to have a fully functional hot-swap if the primary crashes, and we can continue uninterrupted while the primary is serviced. Word to the wise on this kind of scenario however, bringing back the primary means that it is out of sync, since the secondary took over for a span of time. So we would need to be able to reverse the streaming replication to make it whole.
Scenarios like this often embrace the practice of operationally swapping the two machines on defined boundaries, like once a month or once a quarter, where they actually switch roles each time. This allows the operations staff to gain confidence in the two machines as redundant to each other in every way. I have seen cases like this where the primary machine went down, the secondary machine kicked in seamlessly and all was well. I have also seen cases where the principals kept the DR server up to date but when it came time to operationally switch, some important piece (usually in the infrastructure between the devices) was missing causing the failover itself to fail. It is best to have a plan in place, but it's better to have tested the plan and that it actually works.
On a per-table basis, use the content boundaries to manufacture/compile a set of checkpointed instructions for each table to support extraction and restoration
Using a variation of the nzmigrate utility, perform the table-level extractions
Extract data in Netezza-compressed form to a flat file
Load the flat file into a restoration database in the second machine
Perform the transform-execution to copy data from the restoration database into the target table
The extractions, restoratons and transformations can all leverage simple scripts and catalog metadata, not become dependent on deeply hand-crafted code
Intermediate (SAN) landing zone requires less space because the data is being transferred in Netezza-compressed format and it is cleaned-up on-the-fly
Transfer is quicker because Netezza-compressed data is written to the data-slices directly instead of coming through the host.
A word on Netezza-compressed transfer. I wrote about this in Netezza Transformation but it is important to highlight here. We performed an experiment moving half a terabyte scattered across a hundred or more tables. This data was moved from its original home to a database in another machine. The first method used simple SQL-extract into an nzLoad component. This process took over an hour. The second method used transient external tables with compression, coupled with an nzload in compressed mode. The entire transfer took less than six minutes. This was because the compressed form of the data was already 14x compressed.
In other experiment using over 20x compression for the data, we were able to transfer ten terabytes in less than an hour. This kind of data transfer speaks well for the streaming replication necessary for DR server operation (above) but underscores the fact that even when transferring between Netezza machines, it's as though we haven't left the machine at all.
Netezza-based High-Capacity Server
This option is simply a form of the Netezza-based hybrid (above) but on a dedicated server designed to support backup and recovery.
The better part about this server is that is has more disk drives and fewer CPUs, making it far more cost effective for storage than common SAN devices. Couple this with the minimal overhead required for transferring data between machines, and the ability to surgically control the content with the content-boundaries and catalog metadata, and we get the best of all worlds with this device.
Not only that, but it is also scalable to support storage of all other Netezza devices in the shop as well as any non-Netezza device where we simply want to capture structured information for archival purposes. The High Capacity server is queryable also, meaning that even the ad-hoc folks will find some value in keeping the data online and available.
Lastly, in Spring 2010, as part of the safe-harbor presentation one of the principals at IBM Netezza announced plans for a replication server. I can only imagine that this device will deliver us from any additional hiccups associated with streaming replication that we might now be doing in script or other utility control language.
At Brightlight, our data integration framework (nzDIF) has the nz_migrate techniques built directly into the flow substrate of the processing controller, as well as the enforcement and maintenance of the aforementioned content boundaries. We are actively acquiring and applying best, most scalable and simplified approaches as a solution framework firmly lashed to a purpose-built machine. I am a big proponent of encouraging Enzees to take on these things themselves, or at least let us coach you on how to make it happen. The solutions are simple because the Netezza platform itself is simplified in its operation. Stand on the shoulders of genius - the air is good up here.
As the title suggests, one of the challenges of new Netezza users is in learning about the product, what it can (and doesn't) do, and how it applies in data warehousing. When I first published the book (Netezza Underground on Amazon.com) the impetus for the effort was just that - people asking me lots of questions leading to fairly repetitive and predictable answers. It's an appliance after all. We can apply it to a multiplicative array of solutions but on the inside, certain things stand out as immutable truths.
Of course, lots of folks are running around down here in the catacombs, convincing people that they need to read the glowing glyphs on the ancient stone walls as guides on their quest for more information. This is entirely unnecessary. The title of the book (and the blog) is a tongue-in-cheek nod to the way that some might spin the story on the machine in their own favor. Some might claim that we need lots of consulting hours to roll out the simplest solution. Consultants can help, of course (I'm one of them). But it depends on the spin and the tale that is told, that will determine the magnitude and expense of those consultants..
I'll try to provide a balance that delivers value without extraordinary expense. Yet another source of misinformation is Netezza's competitors, who like to toss "innocent" bombs here and there to direct people down the path toward their own product. All is fair in business and war, as they say, but as we bring these issues out of the darkness and into the light, the objective is to become more informed.
I am neither a Netezza nor IBM employee, so apart from compensation for actual work performed in rolling out solutions, for which people would pay me regardless of technology, I don't have any other relationship with these companies. I am a huge fan of the Netezza product and architecture and make no secret of this.
So some may have come here to ask questions about the technology or read what some seasoned experts have to say. We can do all that and a lot more. For now, let's look at the counter-intuitive nature of the product's internals.
I'll paint an imaginary picture first. Let's say we have two boxes. In one box we have thirty-two circles and in the second box we have thirty-two drums. The circles are CPUs and the drums are disk drives. They are in separate boxes, and now we draw a pipeline between them. Make it as large as you want. This depicts a standard SMP (Symmetric Multi-Processor) hardware architecture that is the common platform for data warehousing. With the exception of Teradata and Netezza, this hardware configuration is ubiquitous.
Now let's draw another mental picture. This time we'll have one box. Each of the circles will be mated with one of the drums. Now we have thirty-two circle/drum combinations. In the Netezza machine each of these is called a SPU (snippet processing unit) and represent the CPU coupled with a dedicated disk drive. Some additional hardware exists to coordinate and accelerate this combination, but this is a simplified mental depiction of the fundamental difference between the prior SMP configuration and Netezza's, an MPP (Massively Parallel Processor) configuration.
Now I used thirty-two as a simple example. In reality, Netezza can host hundreds of these cpu/disk combinations, the largest standalone frame containing over eight hundred of them, scaling to over a petabyte in storage capacity. Those of us who regularly operate on these machines are accustomed to loading, scanning and processing data by the terabyte, the smallest tables in the tens of billions of rows.
Some notes on the difference in their operation:
SMP: Typically used for transactional (OLTP) processing and is not purpose-built for data warehousing. It is a general-purpose platform. The typical bane of an OLTP engine is that it performs well on getting data inside, but is lousy on getting data out (in quantity, for reports and such).
MPP: (Netezza specifically) does not do transactional processing at all. It is purpose-built to inhale and exhale Libraries-of-Congress at a time.
SMP: Table exists logically in the database, but physically on the file system in a monolithic or contiguous table space.
MPP: Table exists logically in the database, but is physically spread out across all the SPUS. If we have 100 SPUs and want to load a table with 100,000 records, each SPU will receive 1000 records.
SMP: SQL statements are executed by the "machine" as a whole, sharing CPUs and drives. While the SQL operations may be logically and functionally "shared nothing" - the hardware is "shared-everything". In fact, CPUs could have other responsibilities too, which have no bearing on completing a SQL statement operation. In the above example, the SMP has to access the file system, draw the data into memory nearest the CPUs, then perform operations on the data. Copying data, for example, would mean drawing all the data from the tablespace into the CPUs and then pushing back down into another tablespace, but both tablespaces are on the same shared disk array.
MPP: SQL Statements are sent to all CPUs simultaneously, so all are working together to close the request. In the above example, each SPU only has to deal with 1000 records. Unlike an SMP, the data is already nearest the CPU so it doesn't have to go anywhere. The CPU can act directly on the data and coordinate as necessary with other CPUs. Copying data for example, means that the 1000 rows is copied to a locaton on the local drive. If 100 CPUs perform this simultaneously, the data is copied 100 times faster and it never leaves the disks.
SMP: Lots of overhead, knobs and tuning to make it go and keep it going. From the verbosity of the DDL to the declaration of index structures, table spaces and the like.
MPP: (Netezza) the overhead of adminstration is hidden from the user. In the above example, the user need only declare, load and utilize the table, not be concerned about managing the SPUs or disk allocation. The user's only concern is in aligning the data content with the hardware architecture, an easy task to perform and an easy state to maintain.
SMP: SQL-transforms, therefore, the insert/select operations that run so slow on an SMP platform, continue to run slow, and slower as data is added. They are not a viable option so usually would not occur to us to leverage them otherwise.
MPP: SQL-transforms leverage the MPP and do not slow as the data grows.
In fact, many people purchase the Netezza platform in context of its competitors, who don't (nor can they) use SQL transforms to affect data after arrival.
If we purchase Netezza as a load-and-query platform only, we have missed a special capability of the machine that differentiates it from the other products. If we leverage this power, we re-balance the workload of the overworked ETL tools, in some cases eliminating them altogether. Netezza calls this practice "bringing everything under air", that is, bring the data as-is into the machine and do the heavy-lifting transforms after it's inside.
The Brightlight Accelerator Framework is one example of a flow-based, run-time harness for deploying high-performance, metadata-driven SQL-transforms. While we have matured this capability over time, the consistency of the Netezza platform is the key to its success.
SMP:Scalability is through extreme engineering, leading to hardware upgrade.
MPP:Scalability is a function of data organization as it leverages the hardware's power. Upgrading hardware is therefore rarely the first option of choice.
SMP:Constrained by index structures and the shared-everything hardware architecture through which all data must pass
MPP:Constrained by human imagination, not index structures (there aren't any) and no shared hardware.
We have noted that with traditional SMP architectures, when the machine starts to run out of power, engineers will swarm the machine and start to instantiate exotic data structures and concepts that serve only as performance props, not the 'elegant' model once installed by the master modelers. As the box continues to decline, more engineering ensues, eliciting even stranger approaches to data storage and management. Soon it becomes so functionally brittle that it cannot handle any more functionality, so people resort to workarounds and bolt-ons. Functionality that should be a part of the solution is now outside of it, and functionality that never should have been inside (mainly for performance) has taken up permanent residence.
In a Netezza machine we have so much power available that traditional modeling approaches (e.g. 3NF and Dimensional) may have a defacto home, but now we can examine and deploy other concepts that may be more useful and scalable. Approaches that we never had the opportunity to examine before, because the modeling tool did not (and does not) support it and the natural constraints of the SMP-based engine artificially constrain us to index-based data management.
In addition, the Netezza machine operates on a counter-intuitive principle of finding information based on "where not to look". With this principle, let's say we have a terabyte of data and we know where our necessary information is, based on where-not-to-look. See how this won't change even if we add hundreds of terabytes to the system? If I add another 99 terabytes to the same system, the query will still return with the same answer in the same duration, because the data I'm looking for doesn't appear in those 99 terabytes and the machine already knows this.
For example, if we go over to Wal-Mart and we know where the kiosk is for the special-buys-of-the-day, we can find this easily. It's in a familiar location. What if tomorrow they expand the same Wal-Mart to ten times its current size? Will it affect how long it takes me to find that kiosk? Clearly not. Netezza operates the same way, in that no matter how much data we add, we don't have to worry about scanning through all of it to find the answer.
And when it boils down to basics, the where-not-to-look is the only way to scale. No other engine, especially not one that is completely dependent upon index structures to locate information, can scale to the same heights as Netezza.
Well, some of the stuff above is provocative and may elicit commentary. I'll continue to post as time progresses, so the data won't seem so much like underground information, after a while anyhow.
Modified by DavidBirmingham
For the past many months I have been diligently updating and upgrading the original 2008 Netezza Underground to address the many features of TwinFin, Striper and other offerings from IBM. I have recently been notified that it has passed final edit and is available on Amazon.com.
All I can say is "whew!" and many thanks to those who helped put it together. It's been a whirlwind.
Here is the URL
When I started the project I realized that a big part of the original book remains timeless. I didn't leave it "as is" though - practically every page and all the chapters have new material, case studies and such. I peppered the book with some additional graphics since the intrinsic points require a bit more reinforcement than mere words will suffice.
The original chapter on "Distribution Stuff" is now "Performance Stuff" and is twice as long, covering the various aspects of setting up tables, troubleshooting, page-level zone maps and a lot more.
Fortunately, this time around there is a better mechanism to contact me if you have questions or want to report any errata (hey, it could happen!) - you can reach me through this blog, on linked-in or directly through my primary email address at Brightlight Consulting:
Last week (4/3/13) IBM did a product launch of the new Hadoop Appliance and the DB2 BLU Acceleration. The BLU model is columnar and they ran with the Netezza model of "simplify-load-and-go" so the total instructions to get data into the machine and act on it is now dirt-simple.
The Hadoop appliance also ran with part of the Netezza model. The Hadoop appliance takes the MPP approach in-a-box so that it's a self-contained appliance without having to stand up a gaggle-of-servers for the same purpose. Keep in mind that these appliances consume less power and generate less heat than the aggregate of their distributed counterparts on the raised floor.
I contrast this to the average hapless soul who wants to do Hadoop and calls upon his management to roll out a gaggle of servers to make it work, and cobbles together the necessary parts and software to make it all happen, painstakingly tuning the environment because that's-what-engineers-do. Then someone says, hey, we could have saved all that money (labor is not free, and neither is hardware) and bought a PureData appliance for Hadoop that has scalable power and a simplified interface - AND integrates to the other environments like PureData Netezza and PureData DB2 for a self-contained operational and administrative experience. We don't need to pay or hire our engineers to home-grow the core substrate. Now they can concentrate on what we hired them for: solving business problems rather than engineer technologies.
The bane of the above model is simply this: we will roll all of it out once, for one application. Repeating it for another application starts us from scratch again because rarely do our engineers roll out such environments with reusable patterns and modules. It is a custom-tuned and rarefied atmosphere for one business purpose. This is true of most application/solution development. The engineers do not focus on the parts they intend to leverage or reuse for the next application. It is all very application-centric all-the-way-to-the-Hadoop servers. One may argue that the Hadoop servers are reusable, but we know in application development that an app-server is rolled out per-application. So while the app-server might itself be similarly configured to other app servers, it is still a separate machine. At some point in this game, the "mission critical" card will be played and all other Hadoop projects will need their own hardware - er - their own gaggle of multiple servers. This is when the instances start to reproduce like rabbits. Would we rather just trade-in all those servers, or forego their purchase altogether and install an appliance? Even if it's one appliance per application instance, it's better than a farm of servers that stretch across a raised floor so wide that we can see the earth curve? Tempting no?
Orrrr - we could continue to do it the hard way. Many years ago I was impressed with the notion of "Eccentric Innovation" in that managers who were running out of capacity would act in desperation to stand up home-grown skunkworks (innovations) that were cobbled together by their most "creative" engineers who they did not hire for engineering or their ability to innovate - and ended up with an eccentric innovation - one that they would not have purchased off-the-shelf if given the choice, but that they instead paid several-times-more for and now they own it and only a handful of people on the planet can actually operate it. It's a very tense existence.
In the appliance genre, it sort of looks like this: If I give you a four-slice toaster, you will likely not use all four-slots except on busy mornings or if you have a big family. However, if I give you a 400-slice toaster, your problem is no longer toasting bread, but "bread-management" - keeping the toaster busy by pushing and pulling bread to and from it, and boosting your bread-movement infrastructure. No different for the Hadoop platform. No sooner will it roll out and people will start to use it, but will they use it enough to justify its expense? The total-cost-of-ownership is a glaring, almost blinding problem with a "common" Hadoop rollout but the costs of labor and upkeep are intangible. Appliances may have a tangible up-front expense but their low-maintenance and scalability mitigate total-cost-of-ownership issues.
And - of course - do we want a swarm of engineers running the Hadoop farm or do we want appliances in a lights-out ops center, quietly solving the world's problems before bedtime?
Modified by DavidBirmingham
Sometimes the average Netezza user gets a bit tripped-up on how an MPP works and how co-located joining operates. They see the "distribute on" phrase and immediately translate "partition" or "index" when Netezza has neither. In fact, those concepts and practices don't even have an equivalent in Netezza. This confusion is simply borne on the notion that Netezza-is-like-other-databases-so-fill-in-the-blank. And this mistake won't lead to functional problems. They will still get the right answer, and get it pretty fast. But it could be soooo much faster.
As an example, we might have a traditional star-schema for our reporting users. We might have a fact table that records customer transactions, along with dimensions of a customer table, a vendor table, a product table etc. If we look at the size of the tables, we find that the product and vendor tables are relatively small compared to the customer, and the fact table dwarfs them all. A typical default would be to distribute each of these tables on their own integer ID, such as customer_id, vendor_id etc. and then putting in a transaction fact record id (transaction_id) that is separate from the others, even though the transaction record contains the ID fields from the other tables.
Then the users will attempt to join the customer and the transaction fact using the customer_id. Functionally this will deliver the correct answer but let's take a look under-the-covers what the performance characteristics will be. As a note, the machine is filled with SBlades, each containing 8 CPUs. For example, if we have a TwinFin-12, this is 12 SBlades with 8 CPUs, or 96 CPUs. They are interconnected with a proprietary, high-speed Ethernet configured to optimize inter-CPU cross-talk.
Also whenever we put a table into the machine, it logically exists in one place, the catalog, but physically exists on disks assigned to the CPUs. A simplistic explanation would be that if we have 100 CPU/disk combinations and load 100,000 rows to a table that is distributed on "random", each of the disks would receive exactly 1000 records. When we query the table, the same query is sent to all 100 CPUs and they only operate on their local portion of the data. So in essence, every table is co-located with every other table on the machine. This does not mean however, that they will act in co-location on the CPU. The way we get them to act in co-location (that is, joining them local to the CPU) is to distribute them on the same key.
But because our noted tables are not distributed on the same key, they cannot co-locate the join. This means that the requested data from the customer table will be shipped to the fact table. What does this look like? Because the customer table has no connection to the transaction_id, the machine must ship all customer records to all blades (redistribution) so that the CPUs there can attempt to join on the body of the customer table. We can see how inefficient this is. This is not a drawback of the Netezza machine. It is a misapplication of the machine's capabilities.
Symptoms: One query might run "fine". But two of them run slow. Several of them even slower. Results are inconsistent when other activities are running on the machine. We can see why this is the case, because the processing is competing for the fabric. Why is this important to understand? The inter-CPU fabric is a fairly finite resource and if we allow data to fly over it in an inefficient manner, it will quickly saturate the fabric. All the queries start fighting over it.
Taking a step back, let's try something else. We distribute the transaction_fact on the customer_id, not the transaction_id. Keep in mind that the transaction_id only exists on the transaction table so using it for distribution will never engage co-location. Once we have both tables distributed on the customer_id, let's look at the results now:
When the query initiates, the host will recognize that the data is co-located and the data will start to join without ever leaving the CPU where the two table portions are co-located. The join result is all that rises from the CPU, and no data is shipped around the machine to affect the answer. This is the most efficient and scalable way to deal with big-data in the box.
Now another question arises: If the vendor and product dimensions are not co-located with the transaction_fact, how then will we avoid this redistribution of data? The answer is simple: they are small tables so their impact is negligible. Keep in mind that we want to co-locate the big-ticket-or-most-active tables. I say that because we have sites that are similar in nature where the customer is as large as two of the other dimensions, but is not the most active dimension. We want to center our performance model on the most-active datasets.
This effect can rear its head in counter-intuitive ways. Take for example the two tables - fact_order_header and fact_order_detail. These two tables are both quite monstrous even though the detail table is somewhat larger. Fact_order_header is distributed on the order_header_id and the fact_order_detail is distributed on the order_detail_id. The fact_order_detail also contains the order_header_id, however.
In the above examples, the order header was being joined to the detail, along with a number of other keys. This achieved the correct functional answer, but because they were not using the same distribution key, the join was not co-located. So we suggested putting the order_detail table on the same distribution as the order-header (order_header_id). Since the tables were already being joined on this column, this was a perfect fit. The join received an instant boost and was scalable, no longer saturating the inter-CPU fabric.
The problem was in how the data architects thought about the distribution keys. They were using key-based thinking (like primary and foreign keys) and not MPP-based thinking. In key-based thinking, functionality flows from parent-to-child, but in MPP-based thinking, there is no overriding functional flow of keys - it's all about physics. This is not to say that "function doesn't matter" but we cannot put together the tables on a highly physical machine and expect it to behave at highest performance unless we regard the physics and protect the physics as an asset. Addressing the functionality alone might provide the right functional answer, but not the most scalable performance.
I tossed the clothing into the washer, grabbed the non-chlorine bleach, popped off the cap and poured some into the tub with the clothing. My wife, horrified, asked "What are you doing, didn't you measure it?" To which I say "Of course, it was three bloops."
"Three bloops," she recoils in further horror, "are you kidding me?"
"Well, no, look it requires exactly one cup of bleach for this size of load," I explained, then grabbed the measuring cup and turned the bleach bottle on top of it and let the contents "bloop" three times. It made exactly one cup of liquid. It would be one cup of liquid no matter how many times I repeated it. I simply took the shortcut and measured the bloops. An inexact measurement but just as effective. Of course, she wants me to use the measuring cup every time, and cannot imagine going through life with a bloop-here or a bloop-there. I mean, all those recipes in the kitchen with exact measurements. Do we reduce those to inexact quantities too?
A great mathematician once told me that he never figures the exact to-the-penny tip for a wait-staff of a restaurant. If he wants to give fifteen percent, he mentally calculates ten percent by chopping a zero off the whole-dollar amount, then divides this value approximately in half, adds it to the first then upwardly rounds to the nearest dollar. This method is horrifying to accountants, whom I have seen use calculators to figure exact tip amounts. He also told me that when figuring Celsius temperature, he simply subtracted 32 and divided by two. Is this the right "formula"? No, it's supposed to be 5/9ths right? But if all he wants to know is whether to grab a sweater, coat or neither, this is close enough.
I made some chili later that evening. This is a simple reciple. We brown and drain two pounds of ground meat. We then toss this into a two quart container and follow it with three large cans of diced tomatoes. Plus three packets of chili mix. Do I need to read the labels, really? Sure, it calls for a cup of this or several ounces of that. But lets face it, we're after a certain taste and I know that these ingredients in these proportions deliver it. This particular evening my daughter and I were making the chili together and she carefully followed the instructions, but we were left over with half a packet of chili mix plus half a can of diced tomatoes. We can see the pattern here, right? So to her horror I tossed the additional tomatoes into the container along with the mix and started stirring. I don't know if this is a "guy" thing or not. My youngest son was also horrified at my abject disregard for the instructions on the packet. I have tampered with the forces of the universe, you see. Do not deviate from the recipe, lest the earth open up and swallow us all. Or something like that.
Anyhow, when we served up the chili, they ate as voraciously as always and all was well. As a throwback from the days of yore, I add mustard to my helping of chili and use Frito's Scoops to spoon it out. Anyone familiar with Frito-Pie fully understands the connection.
Now one might well ask, what on earth has this to do with enterprise architecture or anything akin to it? In science, aren't we supposed to cross every T and make sure nothing is amiss? Yes and no. Implementations drill on detail. Architecture, not so much.
I listened to a crew of IT admins debate the required size of their new environment. One of them quipped that if we need more space or CPUs, we would need 6 weeks of lead time to order it. After sizing the environment, all agreed that we were at least within 10 percent of the necessary sizing, and this should be good enough. All except for one, who was concerned that if it was too small, we would have to order more. But you have six weeks to decide, right? Well, no, we need to order the hardware now so that it will be here in six weeks when the rollout needs it. Yeah, said another, but if we run into an issue between now and then we just order more. We don't need all of it right way. We're only using about twenty percent of the capacity to begin with. What's the big deal?
And yet, they continued in their paralysis, unwilling to make a committment until one of our rank simply forced them to. In the blink of an eye, all was clarified when everyone present was willing to admit that such decisions have inexact quantities all over them. It's an educated guess. Like a hypothesis. So we're still using science, but we have to make a call and get moving. People depend on it, and time has expired to delay any further.
We see a lot of this kind of in-exactness in data warehousing. Capacity planning especially has percentages and utilization wriitten all over it. A case in point is that we need to be seriously considering capacity upgrades when a system reaches 60 percent of its current capacity. Why is this? Because a red line exists at the 80 percent mark, and above that is reserved for system recovery and workspace. It is not okay to presume that the "last 20 percent" of capacity is available for regular use. It is the red zone. But if we go to the boss and say we are reviewing capacity when the current utilization is only 10 percent above the halfway mark (60 percent) - they may well ask - isn't this a bit premature?
Well, it honestly depends on how long it takes to get the upgrade/transition for capacity underway. If the assessment takes a few weeks, and the designation, procurement, delivery and installation take many more weeks, we have to consider how fast the data is growing. Will the data grow into another 10 percent of the machine by the time we're able to install the upgrade? Okay, then, we're still outside the red zone. But every percentage point above this is one tick closer to the red zone. And we don't want to cross it. Many times I have personally witnessed "perfectly operational" systems simply hang one day. Out of the blue. The processing capacity required for an intermittent spike did not have room to finish. Or an error recovery needed more spill space than it had left to give. Simple things often lead to catastrophic failure when chaos has no place to go. Or for that matter, when the machine cannot dispatch the chaos because it is already too overwhelmed.
A colleague relates that in his data processing shop, the disk space had long since breached capacity and regularly spilled over to tape drives during the evening's processing window. As he described it to me, their environment was using tape drives for runtime workspace! And the CIO could not stop complaining about how long the jobs were taking, but simply refused to buy any additional disk space for the environment. In his mind, the final storage needed exactly 80 percent of the capacity and they were not in the red zone. But weren't they?
In a data processing scenario, the "understood" quantity of disk space runs anywhere from 6x to 8x of the final product's size. So if we are targeting a 1 TB warehouse, we would need between 6 TB and 8 TB of workspace to support it. Whether this is actually hosted on the physical database machine is immaterial if the disk space is shared between database and the external, flat-file world, which is often the case. I recall one instance where I specified 300 gb for a 20 gb warehouse and the manager, himself a warehousing aficionado, raised a strong objection to such a need. When we did the math, I was actually being pretty conservative in the estimate, seeing that we needed to support Development, Testing and Production workspaces, you see. With 60 gb between them, and 6x needed to support each - voila! We have easily breached 300 gb. The punch-line of course, was that they needed to order an additional 150 gb to finalize the project. Oh well, it's just an educated guess.
But if he was willing to give me so much grief over 300gb, imagine what I would have heard for 450gb? The point being, it won't always be true that our bosses will give us, for all configuration lifecycle environment, upwards of 40x what we need in the end - but it certainly gives us insight as to why "all that disk space" seems to evaporate within a couple of months of the project's inception!
Set-based operations, big structured data handling, and now big-data on-the-grid, we will find even more "inexactness" to wade through. I had an interesting conversation just this week with someone who could-not-believe the data being returned by his big-data cluster. Something had to be amiss, he asserted, because "he just knew" that things had-to-be-different. Basically, he'd spent millions on marketing and brand recognition and had expected measureable lift for his product. When it did not arrive, it basically meant that all those millions were spent for nothing. Either that, or he was looking in the wrong place. I asked him if sales ever changed after one of these marketing pushes, and he said no, the marketing pushes were traditionally geared to keep product loyalists from defecting.
So I asked a very impertinent question - how do you know if the marketing pushes are doing anything at all? Wouldn't it be odd to just forego the next marketing push and see-what-happens? This was interesting to him, but simply out of his hands. The engine to create the marketing collateral and the waves of market "push" were ensconced as science in the highest echelons of the company. Asking them to forego even one cycle, and the risk involved in such a thing, could be suicide.
So let's measure it, I suggested. If the quantities are a science and we can know for certain where the lift is, or is not going, we can measure it as a trend. So he set up a number of market "trolls" as it were, to cast the net for information on their products and various trending for competitor products. He pulled these stats daily for the month prior to the marketing push, through the push and for one week thereafter. I warned him that if it measures "nothing" we have nothing to report on. We really need to report on "something" so we can show what directly affects loyalty to the product. He knew of several interesting 'anti'-quantities that could show us, almost in negative terms, whether the marketing push had any value. These are proprietary so I will not share them here. Nonetheless, by measuring these anti-quantities we could see a loyalty trend in a different way. Not when people re-aligned with their products, but when they dis-aligned with them.
This was an interesting graph. It showed that the loyalty to their brands had less to do with their marketing pushes and more to do with the sale-event discounts associated with their competitors. In a comedy of errors, their marketing pushes just-so-happened to be timed when their competitor sale-events were ebbing off, offering the illusion that loyalty was being restored when it fact it was simply re-aligning to its normal center. What if, he mused, they simply delayed the marketing push for some point after the customer loyalty naturally aligned-to-center? This could in fact pull in even more loyal customers, or let them know whether the loyalty push had any value at all.
Last year I ran into my colleague again, some two years after we had first characterized the situation. He told me that after showing all of the metrics, the marketing folks pooh-pooh'd his findings. All except for one, an ambitious soul who had recently been promoted to the second-in-command of the marketing department. According to legend, this person worked with my colleague to distil the right answers, inexact though they might be. Some 8 months later, they finally agreed to offset the marketing push by four weeks to see what the results would be. Three weeks into this cycle, with the upward trends behaving normally even though no marketing push was underway, gave them what they needed to know. The marketing pushes were at best, mis-timed and at worst, completely worthless. They decided to forego the marketing push entirely. Six months later, the trends remained in place without any effort on their part. With a primary benefit: They had saved many millions of dollars in marketing expenses. And by this time they had already begun the process of running an entirely different kind of marketing push, this time on the edge of the peak rather than in the trough of the lull.
So we can see that watching "trends" or "patterns" of gross movement gives us insight into how to attack (or retreat from) the marketplace in ways that make us more competitive. These gross movements are inexact. While we cannot conjure up successfull marketing potions with three-bloops of elixir, the approach to success is not so different. Patterns, swaths, wakes, edges, trends, peaks etc are all inexact measurements we derive from the existing information. But we need to do it on a scale that is impossible with commodity, general-purpose technologies. More importantly, while the detail data drives the final results, incrementally more information may not "move the needle" at all. In fact, just like the measurement of three-bloops - there's very little in how that measurement system will deviate from the center in any signficant manner. It is this "significance" we care about, and why the inexact results and processes to derive them may not be lockstep-perfect, but they tell us what we need to know.
The long-awaited "SQL" to Netezza Underground has hit Amazon.Com. Netezza Transformation
This book tackles some deeper issues around transforming our data warehouse, our approaches and even our thinking to align with the arrival of our brand-new appliance
No, it's not the appliance we'll transform, but that the appliance will transform us. Once again, a little tongue-in-cheek irony
The book offers working examples of the stuff people ask me about most often, like ELT/SQL-Transforms, checkpointing, exception handling (primary/foreign key), windowing (SCDs and deduplication) as well as a cookbook on more details to watch for in a migration project.
And of course, is replete with Case Study Short and a whole chapter on Case Studies. There's also an appendix at the end to offer up some simple scripting jump-start routines that can make bash so much easier.
Modified by DavidBirmingham
In this case, a tortoise-brained hare is an animal that is capable of going fast but wonders why he can't.
Over the last six months or so I have seen a "trending" situation with those who use the Organize-On. This is some pretty cool functionality so it's important that we get it right. Generally speaking, folks who understand how zone-maps work will have a splendid time with Organize. Others, not so much. The machine is supposed to be fast, so why are all my queries so slow?
All I have to do is add the Organize On keys and I'm good to go. Answer: No, you have to groom first, as this actually applies the Organize to the physical data. Response: Really? when we did this it took forever! Answer: Yes, the first groom may take a bit of time but every groom after that will be painless.
We CTAS a table every night for maintenance and the groom is always running slow. Answer: if you are using Organize On, don't use a CTAS for regular maintenance. With an Organize, the table is physically broken apart and spread across multiple additional extent-pages to provide for easier groom maintenance. A CTAS re-compresses it. So another groom is required to blow it back out again.
We like to use Materialized Views to optimize our tables but now the Organize doesn't let us. How can we get this back? Answer: You don't, because the Organize essentially replaces the Materialized View and does it so much better. Asking this question means you might not understand zone maps as well as you think you do! Just sayin'
We have included the distribution key in the Organize, along with some other join keys. Answer: (Sigh) -Remove them. Join-only keys do not belong in the Organize. The Organize is for filter attributes. Uh, perhaps you don't understand zone maps? Just sayin'
We have some lookup tables where we applied an Organize. But this didn't seem to matter. Answer- No it won't because if the table is not big enough the system won't use zone maps anyhow.
We were told to turn off Page-Level zone maps but this caused a 5x reduction in performance. Answer: Page-levels radically reduce disk I/O which is our number-one enemy. Turn them back on and go home happy.
It is clear that our outlying cast of folks who have never been introduced to zone maps think that the Organize is either a clustered-index or some other kind of indexing scheme. Fair enough, is this because of the deceptive naming? Like the Materialized View? (okay, don't get me started). The Organize keys are not keys in the same sense as indexes and are certainly not used as join keys.
They are filter-attributes. By this we mean that we fully intend to use these columns with constant values, lists or selected-lists as constraints.
Zone Map Primer
Here I will use my favorite example because it is time-tested and describes the capability. If I pay a visit to Wal-Mart and I'm looking for batteries, I might visit the customer kiosk and the lady tells me that the batteries are "on the end-cap, Aisle Three". This is an indexing model because she told me exactly where to look. This does not scale because when we get to billions of rows we spend more time searching the index than retrieving records.
In a Netezza model however, she would say "It's on an end-cap but I'm not sure which one." So I go to Aisle 1, then 2, then 3 and jackpot, I buy the batteries and go home. The most important part of her answer however, was in what she "did not say". She did not say that the batteries were in Men's or Women's clothing, Automotive or Electronics. In constraining the search to a particular location, she also told me "where not to look". This aspect of "where not to look" is critical to understanding zone maps. It is also critical to stratospheric scalability. Clearly if this Wal-Mart were to quintuple in size, my battery-buying-duration experience would not change in the least.
Now for the geeky part:
Each Netezza disk has 120K physical extents. Each extent has 3MB of space, divided across 24 pages(blocks) of 128k each. If we are running prior to OS 7.x, we will be zone-mapping at the extent-level. If we are 7.x or higher, we will use the page-level. Run don't walk to 7.x and page-level zone maps! They will radically reduce I/O on the extent and this is critically important. A set of records found only on a single page can return in 1/24th of the time than if it has to scan the full extent (96 percent faster). In a real-world experiment, the zone maps on a Twinfin-18 were changed from page-level to extent-level and the same battery of queries executed against it. It performed 5x to 10x slower than with the page-level turned on. Do not underestimating the influence of disk I/O on query turnaround. Netezza is a very physical machine.
Here's another internal example: Compression in Netezza is a nominal 4x. I have seen it much higher. Let's say we have an uncompressed record that is 10,000 bytes in size. When we read it, we will read all 10,000 bytes. If the data is compressed however, we will read only 2500 bytes, a 75% reduction in disk I/O. Netezza is the only platform where compression boosts the power in both reading and writing data because it reduces disk I/O tremendously.
Zone maps are a table that Netezza holds in memory for supporting a table. Each table has its own set, describing the contents of the extents/pages allocated to the table on a given dataslice. The contents of the zone maps are built automatically when we load up our table. It will collect the information from the records stored on the given extent/page for all integers, dates and timestamps. It will store the high and low value of each, representing the table's record "ranges" for that extent.
Thus when we execute a query using one of these columns as a filter-attribute, it will go to the zone maps first and cherry-pick only those extents containing the data we want, excluding all others. this means that the data on those other zone maps won't even see the light of day for the query-in-play. If we use an Organize, we can use a wider range of data types and be more deliberate with zone-map management.
Let's say we want to search-on the transaction-date on our fact table. If we have not physically organized the data, the same transaction-date value may appear across many hundreds or thousands of extents, affecting the high/low range of many zone maps. If however, we were to physically co-locate the records with common dates into tighter physical groups, the records will physically appear inside fewer zone maps. These are the extents/pages that the machine will cherry-pick for scanning and will completely scan each one. We want it to scan as few of these as possible.
In times past, we had three primary ways of doing this.
A brute-force sort, which is pretty egregious when the table gets very large.
A software program that separates the key's distinct values and executes a block-select from the original table to the new, selecting only once for each distinct value. This physically co-locates like-keys. (the data does not have to be sorted, only like-keys co-located)
A materialized view, which would manufacture a virtual zone map (which is why it's not allowed with the Organize)
Let's say that our table's data takes up 400 extents on each data slice. If our transaction-date appears in 300 of these (it is poorly organized) then when a query runs like so:
Select count(*) from mytable where transaction_date = '2014-01-01';
All 300 extents will be searched for this information. However if we Organize on this transaction-date and then groom - the data will be physically shuffled around to co-locate the records with the same transaction-date on as few extents as possible. Let's further say that once this happens, the given date appears in only one extent. What we have just done is optimized the table 300x. We have eliminated 299 other locations to look for data. This is important because scanning a 3MB extent is a lot of work. If we are scanning 299 additional extents for each dataslice, we're really doing a lot of extra work for nothing. If we translate this to a page-level problem, we may have 24x300 pages originally containing the keys. If with Organize we reduce this to a single page, we have further reduced our scanning load by another 96 percent of the extent containing the page.
The important factor in the example is the "out-in-the-open" value of "2014-01-01". It is a "filter attribute". It is not being joined to a table with this value, Doing it this way means that the FPGA/CPU will discover which zone maps have a high/low boundary that contains this value and will retrieve a candidate list of them, If there is only one as opposed to 300, we have radically reduced out workload. Netezza will literally exclude those extents from being examined at all. We have told the machine where-not-to-look. If we join this value however, such as using a time dimension, applying the date value to time dimensions and joining the time dimension to our fact table, we will require the system to fetch the record in order to examine it, at which time it will determine whether to keep it or toss it. this sort of thing can initiate a full table scan, nullifying the zone map entirely.
We don't want this to happen. We want the data to stay on disk and never see the light of day if it's not participating in the query.
To show how dramatic this can be, we were at one site hosting over 100 billion rows in the primary fact table. A full scan of this table took 8 minutes (which is not too shabby in itself, just sayin' ). The reporting users knew that if a query ever exceeded a few minutes in duration, it was probably ignoring the zone maps. This is because once-Organized, this table would return the average query in sub-second response. Think about that,100 billion rows in sub-second response.
This is why paying attention to zone maps is such a big deal. Optimizing distribution can get us boost in the single-digits (2x, 3x etc) on a given query. Optimizing zone maps can get us 1000x boost and higher.
The Organize-On accepts one or more keys which will be applied to physically co-locate records of like-valued keys, then it will update the zone maps. Here is a test to see if we understand the application:
Take one of your largest tables. CTAS the table to another database, order by the distribution key, or by the hidden "rowid" column to make sure that the given filter key is not ordered. This could take a bit of time, of course. Then perform a query using one of your date parameters as in the example above, and time the query. Now perform an
Alter Table tabname organize on (date column name here).
Then perform groom. Once completed, execute the same query and get a timing. We can see that a several-minute query can go sub-second very easily.
What's more the additional keys in the Organize are (more or less) independently organized. They will all enjoy a much faster turnaround than not using the Organize. If records arrive on the table out-of-order, no worries. Run the groom again. Subsequent runs of groom will always be shorter in duration than the first one.
Clearly however, if the zone-map is intended to apply filter-attributes to guarantee the exclusion of extents/pages completely, we cannot use a join-key. Or at least, not a join-only key. This also means that the distribution key is out (unless we plan to call-up an individual record based on the distribution key). Also, generally do not mix high cardinality keys with low cardinality keys. Netezza finds its strength somewhere in the middle. We will find that it disfavors the low-cardinality keys when high-cardinality ones are in-the-mix.
A good way to tell which of the filter attributes for a table are "high-traffic" is to turn on query history and then examine the table/view associated with "column access statistics - $vhist_column_access_stats. This will provide the number of times the column participated in a query and with which table(s) the base table interacted with. Perform a descending sort on the NUM_WHERE column and this will reveal all. In this short list we will see filter-attributes that are most useful. Don't use any of the join-only columns or the distribution key. These may adversely affect the multi-key algorithm's output and might not optimize any zone maps for this table.
At one site, we noted several inappropriate keys in the Organize, and simply by removing them and "grooming" again, the table experienced a 100x boost. The inappropriate keys were washing out the effectiveness of the other keys.
Many of us have seen this skyline, with buildings of various heights stabbing toward the sky. Compare this to the distribution graph that is part of the Netezza Administration GUI application. Normally this should be very flat (but a jagged-edge is usually okay). Clearly a Manhattan skyline is forbidden.
Or is it? if we have a table that is very skewed (like a Manhattan skyline) but the data can be easily "horizontally" sliced with zone maps, our round-trip time for a "tall" data-slice is no different than a "short" one. All we need to look out for is process-skew (too many horizontal slices on one dataslice)
Measuring the madness
Okay, David this is all very interesting but how can I know which extents or pages or whatever is being used by the keys? Well, there are a couple of handy hidden columns on each row that can help tell-the-tale. One is the _PAGEID and one is the _EXTENTID.
select count(*) , datasliceid dsid, _pageid pid from fact_customer group by datasliceid, _pageid
will tell us how many distinct pages are being used for each data slice.
select count(*), datasliceid, _pageid from fact_customer where transaction_date = '2013-01-01' group by datasliceid, _pageid order by datasliceid, _pageid;
In the above, the "count" should be reasonably even for each dataslice.
If the total count is radically more than "1", then let's organize on transaction_id and then groom. Now try the metrics again to see if it did not reduce the total pages.
I am sure with these two columns in-hand you can think of a variety of creative ways to use them. The conclusion of it all is to get the records with common key values packed as closely as we can so they take up as few extents / pages as possible.
It's a wrap
So now that the Organize seems a little better, you know, organized, maybe this will provide a bit of guidance on how to set up your own Organize and zone maps.
And don't forget to groom when changing the Organize keys.
We won't need to groom every time we do an operation. I would suggest a groom on both a schedule and a threshold. Pick a threshold ( a lot of folks like five percent). When the total deleted rows gets above five percent of the total non-deleted rows, or the total pages per unique data point gets above an unacceptable threshold, it's time to groom. But grooming on every operation is expensive, has marginal value and actually may throw away records we wanted to keep (in case of emergency rollback).
For a brief history of why ELT (that is, in-box-data-processing) is even a topic of discussion, we must recognize that the high-powered appliances such as Netezza have not only made such implementation viable, but even desirable.
Just so we level-set on what one of these looks like, it's a SQL statement. Usually an insert/select but can also include updates and the like. Many of you recognize these as multi-stage operations inside a stored procedure. The sentiment of course is that such an animal can perform better inside-the-machine than to take it out of the machine (through an ETL tool), process it, and put it back. This may be true for smaller data sets, but you aren't reading this because you have smaller data sets, are you? Netezza users are big-game hunters. Pea-shooters are for small animals, but if we want big game, we'd better bring a bigger game with us, put on a bigger game face, and bring bigger equipment.
But is it really the size of the equipment? Netezza users know that the size and the architecture are keys to success. Dog-piling a bunch of CPUs together does not a power-frame make.
Okay, back to the storyline here - in-the-box SQL-transforms, in a traditional RDBMS platform, are the realm of small game. Once the game gets bigger, these transforms degrade in performance, and rather rapidly. Watch as a swarm of engineers tries to reconstruct and rebuild the procedures, the SQL, the tables and even upgrade the hardware in a futile attempt to achieve incrementally more power. Emphasis on incrementally-more. Not linearly more.
As they grow weary of this battle, the ETL tools start looking better and better. We reach a critical mass, and the ELT-in-the-box is summarily forsaken as we stand up the ETL tool.
Sometime after this transition, the data volumes once again overwhelm the environment. One thing leads to another, and one day the Netezza machine arrives on the floor. Hopes are high.
But notice the transition above - ELT was forsaken for ETL, likely never to return. But wait, now we really have the power to do the ELT SQL-transforms, but we've mentally and perhaps emotionally (yeah, verily even politically) moved away from SQL-transforms.
Some reading this might scratch their heads and think, What's he talking about? We've been doing ELT in the machines using (PLSQL, etc, name your approach here) for many years. Why would we shy away from it?
- Hand-crafted SQL statements. They are easy to deploy but difficult to maintain and re-deploy. The moment we have to change the data model, this becomes painfully obvious.
- Proliferation of one-off operations, in cut-and-paste form, increasing inefficiency and complexity of the overall implementation
- Lack of overarching control - stored procs run and go "black" for an hour or more without any clue as to their status
- They don't scale in a traditional (SMP-based) RDBMS - the larger the data gets, the more the implementation will suffer
- ETL tools don't support them in the quantity and logistical quality we really need, but they are getting there.
Why not use an ETL Tool? I mean, they handle push-down SQL-generation right?
I can perhaps summarize this situation in a single conversation I had with a ETL tool vendor who was hawking the capability for his own tool, and after showing me the mechanics of making one of these little jewels operate with aplomb, I asked him, "So what about doing a few hundred of these in a sequence, or branching them into multiple sequences?" The vendor rep looked almost hurt. "Why would you want to do that?"
Well, if we're really talking about migrating transforms into the machine, this can grow into hundreds of operations rather quickly. This situation apparently overwhelms the logistical capability of even the most powerful ETL tool. But I have hope that they will solve this situation. Eventually.
I am not holding out hope that it will happen soon, or voluntarily. These tool vendors have invested millions in the performance boosting of their own products and will not likely toss this investment on the flames of the appliance movement, even if said flames are the exhaust flames of the appliance's rocket engines. This is why I say "eventually". They don't really have a marketable reason to embrace this approach.
Another problem of the ETL tool is that it is so divorced from the appliance's infrastructure that it cannot control the cross-environment logistics. This is especially true of the "virtual transaction" - that is - multiple flows arriving in multiple tables that are each in context of one another, yet are individually shared-nothing operations. If one of the flows should fail, how do we back out of this? Can we do a rollback of the tables where their flows succeeded? No we cannot. We could certainly implement an approach, but this is neither inherent nor intrinsic to the ETL tool. We need a shared-nothing virtual transaction that will control all of the flow in a common context, commit them in that context and faithfully rollback the results in that same context. ETL tools don't go there. Unless we implement the tool that way. As an application. Once implemented, how do we reuse this for the next application, and the next? We can see that it's not part of the tool itself.
If next-generation "ELT" scenarios are to be successful, they need several very important capabilities that are simply non-optional and non-trivial:
- Catalog-aware - The entire application must be coupled to the catalog definitons so that changes in definitions will automatically drive the resolution of their impact. We don't want to go on a submarine hunt to find every place where a particular column is used. The tool should already know.
- Logistical control - the ELT management needs to be able to support hundreds of transforms with arbitrary branches and synchronization
- SQL-generation - from templates that are tied to the catalog - If the table changes, we can propagate the changed automatically into the transforms.
- Proper handling of shared-nothing operations into the target, including virtual transactions that we can replicate to other databases, or rollback completely.
- Treat internal appliance databases as "waypoints" rather than "endpoints". That is, multiple databases are in play and used for various roles, such as intake, intermediate workspace, and persistence. These resources should be configurable and relatively transparent to the tool.
- Support of major patterns such as CDC, Dedupe, column-level scrubs and transforms, masking, column remapping, soft-delete, slowly-changing-dimensions and of course, analytic functions etc.
I am sure the visitors of this blog have even more aspects of a "wish list" that they have implemented (perhaps painfully so) and want more control over the data, its processing logistics and error control and recovery. Feel free to add your own comments and suggestions here.
Two words: load balancing -
Is our chosen platform designed for set-based bulk processing, or load balancing? Both at the software and hardware levels? The load-balancing engine (and attendant SMP hardware) are simply the wrong architecture for large-scale bulk processing. There's no way to "properly" configure the wrong architecture.
Let's level-set the difference here. In an SMP-based scenario, our engineers have to carefully configure the hardware to garner the very best performance from it. We don't have that option in Netezza, because the hardware is pre-configured. Rather in Netezza, we gain power by how we organize and configure the data. We don't really have this option in an SMP-based model, because the database engine software pre-defines how we will organize the information (through index structures) and we cannot affect our fate without the indexes. Let's see the contrasts summarized:
Netezza - no indexes, no hardware config, performance is derived from data configuration
SMP machine - high hardware config, index-depedent, no ability to affect performance with data configuration
In short, the two performance tuning models are not only polar opposites, Netezza is far more adaptable and flexible because it is easier to reconfigure data than to reconfigure hardware.
I am continually impressed with the valiant attempts of various platform aficionados who assert, claim and champion the notion that "properly" configured SMP-based hardware is the the only issue in evaluating competitive performance between platforms. In short. a properly configured <name your platform here> is just as viable as the IBM Netezza platform. Just name your components and off you go.
Of course, most folks who are making these claims are not hardware aficionados at all. Now, I appreciate software folks because at heart I am one of them, but I cut my teeth as an software engineer on solutions that aligned high-powered hardware with other high-powered hardware, all the while respecting the fact that the software I was creating was actually orchestrating and controlling the interaction between these gravity-bending machines, not physically moving the "payload" as it were. Nothing, absolutely nothing could move the data faster than the hardware. We inherently know this, yet many seem to think that software products can overcome this issue by using RAM and other creative methods to accelerate the effect of software operation.
So before I launch into a more complex rant on this, a picture is worth 1000 words (at least). In Netezza Transformation I offered up some graphics for contrast and compare (and alluded to them in another blog entry here).
In the depiction (right) we have CPUs (on the top) and disk drives (on the bottom). The pipeline in between them is the
general-purpose backplane of the hardware configuration, which may include a SAN interface, optical or 10gigE networking or other connection
mechanism to transfer data between the server's CPUs and the SAN's disk drives. Even if these disk drives are local on the machine containing the CPUs, this backplane is still the connector between them.
Now we will load a data file containing 100 billion rows of information, some 25 terabytes in size. This is a medium data size for big-data aficionados. The data will necessarily enter the machine from an external network connection, into the software engine (runnning on the CPUs) which will deliver the data onto the assigned location of the disk drives. Seems like a very simplistic and remedial explanation doesn't it?
Now we will query this data. Our ad-hoc and reporting users will want to scan and re-scan this information. Note how now the bottleneck is actually the hardware itself. The data must be pulled in-total from the disk drives, through this backplane and into CPU memory before it can be examined or qualifed. Even if we use index structures, the more complex the query, the more likely we will encounter a full-table-scan. How long, do we suppose, it would take for this configuration to scan 25 terabytes? (Keep in mind that all 25 terabytes has to move through the backplane).
A server-level MPP model would suggest placing two of these configurations side-by-side and coordinating them. In short, one of the server frames would contain some portion of the information while another frame contained the other. We could imagine placing multiples of these side-by-side to create an MPP grid of sorts. This is the essential secret-sauce of many Java grids and other grid-based solutions. Divide the data across a general-purpose grid and then coordinate the grid for loading and querying.
But notice how deeply we are burying the data under many layers of administered complexity. Sure, we can do this, but is it practical and sustainable? I've seen setups like this that served one application (and served it well) but it was an inaccessible island of capability that served no other masters. As general-purpose as all of its parts were, it had been purpose-built and purpose-deployed for a single solution that required the most heavy lifting at the time of its inception. Now that it is in place, other solutions around it are growing in the capacity needs to serve the grid, and none of them have access to the power within the grid. The grid becomes starved from the outside-in. No satellite solutiion can feed it or consume it at the rate it can process data, and it has no extensibility to support their processing missions.
So now we come full circle, we have a "properly configured" one-trick-pony. Over time, the expense and risk of this pony will become self-evident. Parts will break. Data will get lost. Lots of moving parts, especially general-purpose moving parts that are out-in-the-open, only increases the total failure points of the entire solution. Debugging and troubleshooting at the system level become matters of high-engineering, not simple administration. As noted above, in the environment where I cut-my-teeth, I was surrounded by these high-end engineers because it was the nature of the project. I noticed that once the project went into a packaging mode to deploy and maintain, these engineers moved-on and cast their cares on the shoulders of the junior engineers who backfilled them. This was a struggling existence for them, because the complexity of the solution did not lend itself to simple maintenance by junior engineers.
The Caractacus Potts adventure begins! You know Potts from the movie Chitty-Chitty-Bang-Bang. We thrilled to his inventions, and laughed when they did not deliver. A "simple" machine to crack eggs, cook them and deliver-up breakfast worked fine for everyone but him, delivering a plate of two uncooked eggs still in their shells. The puzzled look on his face told us he recognized the problem but did not know where to look for resolution. With so many moving parts, it could be anywhere. This is a classic outcome of "eccentric innovation" and "eccentric engineering" a.k.a "skunkworks". More importantly, the innovation only solves one problem (e.g. egg-centric breakfast) not a general-purpose solution platform.
Well, let's keep it simple- what about a simple summary report? You know, national sales data summarized to the national, regional, district and store levels? Wouldn't this require a complete scan of the table to glean all of this? In the SMP-based depiction (above) how could be expect such a scan to operate? Software would pull the data in-total from the disk drives, choking the backplane. Software would then summarize the aforementioned quantities, in memory if possible and then deliver the result. Frankly, such a report could take many hours to execute, and keep the machine busy the whole time. Even if we "grid" this, the query would swamp the grid. On a tricked-out Sun E10k, we have about 12 gb/second throughput. Putting some math to this, with 25 terabytes in tow, we could expect the table scan to complete in about 30 minutes even if the full complement of 64 processors is on board, because the solution is I/O bound, not CPU-bound (no new CPUs will make it run faster). However, in reality the software engine and all its overhead drain the energy from the machine and this query will run for hours, even if it's the only operation running on the machine. So I guess we really will need more CPUs to balance off the software drain of the engine itself. (sigh).
This is because: engines that run on SMP-based devices are inherently load-balancing engines, not bulk-processing engines. Their processes stop, negotiate and resume even if there's nothing else going on. Think of it like this: Where I live in the country, at 5am in the morning all of the traffic lights on the main road blink-yellow until about 6am. If I travel on that road before then, I can go the speed-limit for over half-an-hour before hitting the first traffic light in the next (larger) town. But if those traffic lights all operated normally, I could get stopped at each one, protracting my 30-minute journey by orders of magnitude as I wait on traffic lights even when no other traffic is present. SMP-based engines automatically thread this way, where a flow-based model does not. A load-balancing engine will force all of its processes to stop at a virtual traffic light, come up for air to make sure nothing else requires attention, then go back to work. Transactional models absolutely require this but it is anathema to bulk processing.
Now we contrast this with IBM Netezza, which is a purpose-built platform for general-purpose across all solutions requiring such a platform. We don't have a one-trick pony. This would mean (ultimately) any form of data warehouse or bulk-processing solution, but more importantly anything that requires fast-retrieval of data, especially while performing high-intensity on-demand analytics on it.
In the IBM Netezza architecture (depicted right) each CPU has a shared-nothing disk drive and its own RAM. On the original Mustang Series 10400 (with 432 of these on board) we have a machine that costs far less than the Sun E10k noted above. Likewise we could scan those 100 billion rows in less than ten minutes. It won't ever take any longer than ten minutes. If we boost the CPUs of the machine, say by adding another frame to it (200+) CPUs, it will boost the machine's speed by another 50 percent. Queries that took 6 minutes now will take less than 4 minutes. It is a deterministic/predictable model, and adding more frames to the Netezza platform is simple and inexpensive compared to the sheer labor dollars of eccentric engineering.
As for the depiction (right) it has 16 CPU/Disk combinations. To use round numbers, let's say we put 1.5 terabytes on each disk for a grand total of 24 terabytes. With this configuration, for any given query, the path to conclusion is only 1.5 terabytes worth of scan-time away. Once we initiate the query, each CPU will run independently and will scan its 1.5 terabytes. All of them will complete simultaneously, meaning that the total duration of the query was no longer than it took to scan the 1.5 terabytes (they are all scanning in parallel). Now boost this to 400 CPUs, where each one now only has about 63 gigabytes share of the load. One scan of the entire 24TB table takes no longer than the time to scan 63 GB (they are all in parallel). We can measure our disk read-speed here and get very consistent estimates of how long a query should take.
Also keep in mind that (in a prior blog entry on indexes) I noted that we can radically reduce these operations to a fraction of their total scan times. But in the example above, full summary of the data on sales-boundaries, how much is that worth if we could do the sales base on a date? Or based on a range of dates? Perhaps even comparing last-years Independence-Day sales to this years?
In an SMP-based configuration, the information engineers would suggest partitions. The partition (for an SMP engine) is an artificial performance prop that anticipates the user's query needs based on known use-cases. It bundles data (say on a date boundary) so queries against that date can be fenced by the partition boundary. The Netezza zone map, on the other hand, automatically tells the machine where to look, and where not to look, to go capture the information required by the query. No props, no use-case anticipation, just the flexibility we really need if we want to keep fire-breathing users happy without special engineering to anticipate their needs.
Zone maps allow the sales-related comparison above to arrive in mere seconds to the user's fingertips. On an SMP machine, at best, even with partitions, indexes and other performance props, will require a maxed-out power frame (all CPUs on deck) and the best anticipatory information engineering to provide a consistent experience that even hopes to compete with Netezza. Even after all that, it won't come back in seconds, and won't provide the nimble flexibility so sought-after by even the average data analyst.
The conclusion is that the overall cost of deployment, ownership and ease of maintenance for a Netezza machine utterly eclipse the potential promise of SMP-based solutions. For an analyst, all columns, tables and functions are "fair game" for query - all over the database, 24x7. A Netezza machine provides just that. On an SMP-based engine, the analyst has to agree with information engineers on their entry points and usage patterns, and these have to be engineered into the model in order to support the users. Once engineered, the solution will support only that user base. All other user bases will require their own engineering model. This is not sustainable, durable or manageable, which is why those who are steeped in it will gladly embrace a Netezza machine. Value is recognized on so many levels.
When processing in bulk, leading us to scale (that is, tens of billions of rows) we can trust Netezza to pull this off rather handily. With hardware and architecture in the bag for us, is there anything we can do wrong, or perhaps inefficiently, that would deny us entry to the soaring heights of performance so elusive on other platforms? After all, if the stratosphere is within reach, we want to go there as quickly as possible.
As it turns out, there are some things that Netezza doesn't do well. And all of them are transactional in nature. Not to worry, as we really don't want Netezza to foray into those realms. We can do it faster and simpler in set-based form. In fact, we find that many newbies have to shake off some transactional thinking that in later days they refer to as cobwebs. The trappings of transactional thinking that are artificial constraints, necessary evils to our existence on an SMP-based RDBMS (but are in our rear-view-mirror and rapidly fading once we cross over to Netezza)
Now don't get me wrong, many people will deploy an SMP-based RDBMS for their warehouses and have a smashing time of it for many years, perhaps far into the future with no scalability issues at all. We could even venture to assume that over half of all warehouses rolled out this way will likely never see a capacity hiccup, simply because they are rolled out firmly in the center of the "bell curve". (A little statistics lingo there)
Recall many years ago, in the prob&stat class we may have slept through, that the bell curve gives us the 80-20 rule or something akin to it. It also gives us a another rule, that twenty percent (or so) of warehouses will be at the small end, twenty percent at the top end, and the rest in the middle. So guess what? This puts eighty percent in the middle-of-the-road and not really on the Netezza radar for the moment. But perish the notion that Netezza intends to ignore that market. It plays very well in that zone too. In fact, Netezza can move-and-shake along the entire continuum of the solution spectrum.
But when it enters the upper twenty percent zone, the air starts to get thin. Nitrogen narcolepsy befalls the heroes-of-the-eighty-percent, and they start to fall away very quickly. When we get into the zone of searching, analyzing, crunching and even just summarizing data on the orders of billions, tens of billions, and hundreds of billions of rows, we have a number of additional rules that will impose themselves on our existence. These are no different than the laws of aerodynamics we would use to defeat the laws of gravitation. One set of rules has sway, while another set of rules is used to overcome the effect of the first. The rules we have to deal with are those concerned with scale. If we don't pay attention to scale, down to the lowest level possible, no amount of efficiency in software will bail us out. When it comes to scale, salvation is in the hardware. More importantly, the architecture of the hardware.
I often adjure prospective Enzees that when evaluating the TwinFin against other platforms, avoid the kool-aid softball questions that anyone can answer. When a proof-of-concept tournament is underway, the easy questions all float around in the eighty-percent zone where any of the technologies play in one form or another. Rather take a tough problem, one that only exists in the twenty-percent zone, and drop it in the lap of the tournament players. Take a look at three things: (1) what was the raw performance number, (2) what was the difficulty or complexity of the solution, (3) How many vendor engineers did it require and as a bonus (4) how long did it take the vendor engineers to give you an answer? Minutes, hours, days? How long they took to formulate an answer, and how many of them were required, is second only to the performance of the solution itself. After all, anyone can test-or-evaluate in the eighty-percent zone where everyone is a hero. Get the problem domain into the twenty-percent zone (where we'll be on day-one after the technology is installed) and get those answers now. Wouldn't it be a bit awkward to ask these questions when it's too late to do anything about it?
I have mentioned in other essays that the Netezza architecture, and its deep attention to details-of-scale are what come alongside us as our ally when the other heroes-of-eighty-percent simply gasp for air. One of these is the architecture of the disk drives themselves. The drives are of course dedicated to their own CPU, RAM, FPGA and all that, but the layout of the disk platter itself is also intriguing. As an aside, each Netezza disk shares information with another sister drive so that if it should crash, it can be hot-swapped and rebuilt from its sister drive very quickly, easily and automatically. So where is this drive data stored on the sister-drive? As the design would have it, on the inner rim of the sister drive's platter. Otherwise, each drive's outer rim holds user data and the middle ring holds temp space and system data. But we can see the wisdom in this model, that the outer rim is spinning fastest, so delivers the data to the user much faster than the inner rims would.
Does our eighty-percent-hero do this for us? Well of course not. The SMP-based model of Netezza's competitors share their drives like any other SMP device. The odd state of affairs is that they are shared-nothing at the functional level, but shared-everything at the hardware level. This dissipates the strength of the machine. Conversely for Netezza, it's shared-nothing all-the-way-down. This bodes well for query support, because it means every query gets the full-strength and undivided attention from the machine, even for the fractions of a second necessary to complete it, and thus each query is returned on a wave of strength, not a river of dissipation.
For those who want to roll out the eighty-percent models and be happy with it, hey I don't disparage anyone from making a living. Functionally speaking, those warehouses have some of the most advanced features on the planet - in the eighty-percent zone. Netezza's customers by definition have already breached the eighty-percent mark and will never revisit it again.
But true to form, when dealing with a non-scalable platform, our enemies rapidly reduce to size and complexity. That is, the larger the size of the solution, the less functionality we can derive from it. And the larger the complexity of the solution, its required storage capacity is stunted from growth. Guess what happens when we migrate either of these solutions to a Netezza machine? When the complex functionality is ported, it suddenly finds a breath of fresh air and data volumes start to grow, sometimes exponentially. If the data sizes are already too unwieldy and we move to Netezza, the additional capacity allows us to build more functionality which in turn leads to - you guessed it - even more growth of data volumes. The short answer is, when migrating or upgrading either of these animals, the new platform had better be able to immediately scale in orders of magnitude, not just incremental percentages. This is why so many migrations onto the eighty-percent-platforms sputter and fail. The initial migration is successful, but they quickly find themselves out of capacity in short order. This was never expected, so the migration is a bust.
But with Netezza, it has the capacity to take on the migrated functionality and grow with it - without any particular hiccups or worrisome meetings about its abilities. It just hums along, keeps all its promises and never complains.
Okay, so I'm a big fan of the technology. Mostly because it makes a migration easy and makes its customers happy. Who couldn't use a little more of that?
But for those heroes-of-the-eighty-percent, there's a place for them should they choose to don the necessary equipment for high-altitude flight. Of course, there aren't any heroes in the twenty percent zone, considering that the only true hero in that space is the TwinFin. And when flying above the clouds (with any technology), humility dictates that we set aside the hero-way - and we really have to trust the hardware, don't we?
I have noted in prior posts several of the "banes" of in-the-box data processing, not the least of which is harnessing the mechanics and nuances of the SQL statement itself. After all, the engine of in-the-box is a series of insert/select SQL statements. I've also noted that we need to squeeze the latency out of the inter-query handoff and management. These are important factors for efficiency, scalability and adaptability.
But this article deals primarily with "adaptive" SQL, that is, the ability to surgically and dynamically control the SQL, the paths of flow between SQL statements, their timings and the ability to conditionally execute them.
I am drawing a contrast between this approach and the common "wired" ETL application. In the wired application of an ETL tool, all components are known and flow-paths predefined. If we want to shut off a particular component or flow, we'd better make that decision at startup because we won't get to do this later. A benefit here is that if we add or change a flow-path, the ETL tool's dependency analysis will (usually) detect it and give it a thumbs-up or thumbs-down. We can (and do) perform this kind of design-time analysis, but what of dynamic run-time analysis?
Case in point: One group performs trickle-feed of data from a change-data-capture, so on any given loadng cycle, we don't know which files will show up. Not to worry in an ETL tool, since we would just build a separate mini-app to deal with the issues. The mini-app would key on the arrival of a specific file, process the file and present results to the database. This is a very typical implementation. But with hundreds of potential files, it's also logistically very daunting and hard to get the various streams to inter-operate. In fact, an ETL tool quickly reduces to "sphagetti-graphics" and the graphical user interface is just in-our-way at that point.
Case in point: One group has multiple query paths/flows where sql statements build one-to-the-next for the final outcome. These can follow a wide range of paths not unlike a labyrinth depending on a variety of different factors. The problem is, these factors aren't known until run-time and only appear in fleeting form as the data is processed. How do we capture these elements and use them as steering logic? In an ETL tool, our options are limited to none. In this particular case, three primary paths of logic were available each time the flows ran. Sometimes all three paths ran end-to-end. Sometimes only one, or two would run, or perhaps none-at-all. The starting conditions and unfolding data conditions determined the execution path.
But we have another name for this don't we? Isn't this just plain vanilla "computer programming"? Where the data shows up and we use the encountered-data and encountered-conditions to guide the IF-THEN-ELSE logic to conclusion? The problem you see, is that we are so accustomed to using IF-THEN-ELSE at the ROW/COLUMN level, we cannot imagine what this would look like at the SET level. Ahh, the conditional logic driving SETS is unique and distinct from that which drives basic elements. But then again, we can only scale in sets, not the basic elements. THis is where the dynamic nature of conditional-sets is invaluable.
But this isn't really about conditional sets, either. Only that conditional sets are a necessary capability and we have to account for them along with many other subtle nuances. Let's follow:
We have an external file and we load this into an intermediate/staging table (TABLE-A) in preparation for processing.
Now we build another target intermediate table (TABLE-B) and an insert/select statement to move / shape the data logically and physically from TABLE-A to TABLE-B.
From here we have several more similar operations, so we build intermediate tables for their results as well, such as TABLE-C, TABLE-D and TABLE-E
TABLE-A >>> TABLE-B >>> TABLE-C >>> TABLE-D >>> TABLE-E
Now let's say we have another chain of work starting from TABLE-V:
TABLE-V >>> TABLE-W >>> TABLE-X >>> TABLE-Y >>> TABLE-Z
Now something interesting happens, in that the developers sense a pattern that allows them to reuse certain logic if they only put these quantities into a couple of working tables, which we will call TABLE-G and TABLE-H, and now the flows look like this:
TABLE-A >>> TABLE-B >>>>>>>>TABLE-C >>> TABLE-D >>> TABLE-E
TABLE-V >>> TABLE-W >>>>>>>>TABLE-X >>> TABLE-Y >>> TABLE-Z
Notice how TABLE-G is feeding TABLE-C and TABLE-H is feeding TABLE-X, so that each of them have a 2-table dependency.
Now we get to the end of the chain of work and learn that TABLE-Z has to leverage some data in TABLE-C! We don't want to rebuild TABLE-C just for TABLE-Z, but in an ETL Tool this data would be bound/locked inside a flow. We could redirect the flow to TABLE-Z, unless the flow to TABLE-Z is entirely conditional and we don't know it until we encounter TABLE-C. What if, for example, the results of TABLE-C are conditional and if the condition is realized, none of the components following TABLE-C are executed. However, we could have TABLE-Z see this absence as acceptable and continue on.
Okay, that's a lot of stuff that might have your head spinning about now, but the simplicity in resolving the above is already in our hands. In any flow model, upstream components essentially have a "parent" relationship to a downstream "child" component. This parent-child relationship pervades flows (and especially trees) and as we can readily see, the above chain-of-events looks a lot like a tree (more so than a flow).
More importantly, each node of the tree is a checkpointed stop. We must build the intermediate table, process data into it and move on, but once we persist the data, we have a checkpointed operation. This is why it behaves so beautifully as a flow and a tree.
Now let's say over the course of SDLC (regular maintenance), that a developer needs to add some more operations and connect other existing operations to their results. This is essentially just introducing new source tables in the where/join clause, but the table as to exist. In short, if we add a new table to the logic of TABLE-X, it will now be dependent upon its original tables and the new ones. (Its query will break if they are not present at run time).
It is easy enough (honestly) to perform a quick dependency-check over all of our queries to make sure that their various source tables are accounted for. In other words, an operation actually exists that will produce the table. What if we picked the wrong table or even misspelled it? At run-time we would know, but we would rather know before execution because it's a design-time issue. This may verify that logically we have a plan to create the dependent table, but it does not deal with the simple fact that conditional circumstances may forego the physical instantiation of the table. Transforms ultimately do not operate on intent, but on the presence of physical assets.
As another nuance, this creates a disparity between the design-time flow of data, and the run-time flow of data. If the run-time is governed (e.g. ETL tools) so that the dependencies and conditions are all evaluated at the start of the application, the design-time and run-time are more easily mated for review by an auditor or analyst. But if any part of it is dynamically conditional, we can see how this could practically nullify the design-time form of the flow. They would simply say, "I know what the flow would do by design, but I want to see what it actually did at run time, because the data isn't matching up". Aha - so "intent" counts for design review, but "intent" is not what puts physical data into the tables. Operational processes do that.
As noted above with the necessity for conditionality and reduction of inter-transform latency, we now have a need to weave together at run-time what the flows will actually do. The "source tables" for a given transform are found in the where/join phrase and these had better be present when the SQL launches or it will be a short ride indeed.
And now, what you did not expect - one of the most powerful ways to use a Netezza machine is to forego the "serialization" of these flows and allow them to launch asynchronously. We can certainly throttle how many are "live" at at time, but if any or all of them can launch independently, how on earth are we supposed to manage the case where one or two of them really are dependent on another one or more? Do we put these in a separate flow? Do we really want our developers to have to remember that if they put an additional dependency in a transform that they have to regard whether that preceding transform has actually executed successfully?
So that's the real trick, isn't it? If I have forty transforms and all of them could run asynchronously except for about ten of them, that can only run after their predecessor completes, I have several options to see to it that these secondary operations do not fail (because their predecessor has not executed yet).
I can serialize them in by putting them into separate flows (or branches). One of them kicks off and runs to completion while the next one waits. This is logically consistent but also inefficient. If those secondary transforms are co-located with the original set, the optimizer can run them when there is bandwidth rather than waiting until the end. It is also logistically unwieldy because a developer has to remember to that if a transform should gain a dependency, it has to be moved to the second flow.
I can fully serialize them into a list, but this is the most inefficient since it "boxcars" the transforms and does not leverage the extra machine cycles we could have used to shrink the duration.
I can link them via their target table and source table, such that this relationship is dynamically identified and the flow path dynamically realized. If a given transform does not run (conditional failure) or simply fails to execute, the dependency breakage is dynamically known. What does this do? What if a given transform is supposed to use an incoming (intake) table if it is present (data was loaded) otherwise use a target-table's contents (e.g. trickle-feed, change-data-capture problem). This allows the transform to do its work with consistency but also have the ability to dynamically change its sources based on availability.
Now, we know ETL tools don't do this. Other tools may attempt to rise to this level of dynamic pathing, but the bottom line is that if those tools don't provide this kind of latency-reduction, high-throughput, dynamically adaptable model, they will not be able to leverage the full bandwidth of the machine. Trust me on this - the difference is between using 90 percent of the machine or only 10 percent at a time. That machine packs the virtual joules to make it happen, so let's make it happen.
When we originally developed our framework to wrap around some of these necessary functions, we had not considered these nuances of dynamic interdependence and frankly, ELT was so new that it didn't really matter. The overhead to execute "raw" SQL was zero, but we could not effectively parallelize/async the queries without losing control. Running async chains of transforms necessitated detailed control, but nobody had a decent algorithm for it, so once again Brightlight had to pioneer this capability. Our architecture allowed us to easily integrate these things into the substrate of the framework as a transparent function. This is the primary benefit of a framework, that the developers can continue to build their applications without disruption, but we can upgrade and enhance the framework to provide stronger and deeper functionality. Whether our framework is right for all applications is not the issue, but whether the complete implementation is right for Netezza. It's a powerful machine and we should not arbitrarily leave any cycles on the table.
Imagine slowly running out of steam because of latent implementation inefficiencies, then ultimately asking for a Netezza upgrade that, if the inefficiencies weren't present, the upgrade wouldn't be necessary. This has happened with more than one of our sites and rather than upgrading to all-new-hardware, we installed, converted and bought back an enormous amount of capacity. They eventually upgraded the hardware much later on, but for the right reasons.