Once again this entry will serve as an intro to the un-initiated. I get a lot of pings from folks who are new to the stuff and want to know their way around the machine from the beginning. Zone maps are bread-and-butter power tools of the Netezza technology. The companion capabilities are the materialized view and the cluster base tables (which are new to v 6.0).
About three years ago I sat in on a special presentation to our firm from one of Netezza's competitors. They put up some power-point slides to compare their technology to "the others". Their product was a data warehousing product and all the "others" were "OLTP" products. The usual suspects were in this list of course, but oddly so was Netezza. But Netezza is not an OLTP machine in any sense of the word. How odd that this company did not know enough about its primary rival to discern this simple truth. In addition, this vendor made certain odd claims about Netezza that were either architectural false-assumptions or just misinformation. One of which was this "In a Netezza machine, the more data you add to a table, the slower the table's response will be." This is of course completely false. Netezza tables are like any data storage with one major takeaway that should not be missed:
Netezza tables may behave inconsistently if the data is not properly configured and deployed. Notice I did not say "if the hardware is not properly ...." but the data itself. Netezza already has the hardware configuration under control and there's nothing we can do to affect it. We can however, align our data with the hardware architecture and will receive the benefit of a consistent experience no matter how much the data grows. It's hard to imagine this kind of consistency on any platform, so stay with me here.
A number of years ago I did some deep immersion in a health-care expert system for Value Health Sciences. The objective was to directly apply and enforce the codes found in the CPT code book used by physicians (those are the codes the doc will check on the sheet he hands to you before he leaves the office). Insurance companies have standard billing/invoicing arrangements with the physicians and these codes are the centerpiece. For VHS, their medical practitioners had set up rules for characterizing (and properly re-characterizing) a claim. For example, if a physician billed codes 1,2 and 4 but he was supposed to have billed these all as 1 and 5 (because code 5 bundles 2 and 4 to reduce cost) the system would have a rule in it that looked something like:
1,2,4 = 1, 5
The difficulty here of course, with some 100,000 codes and over 40,000 rules, is to organize them so that they are quickly accessible. By design, the first code in the rule (eg. code "1" above) was considered an "index" code. In other words, any and all rules that would use the code "1" as an anchor would also make "1" the first code. This lended itself to the opportunity to "vector" the rules. That is, create an in-memory array, indexed by the given code, and a list of rules would be handily available at that vector point. So that all we had to do was look at the claim (e.g. codes 1,2,4) fetch the various candidate rules from their vector lists ( the list for code 1 might have 3 rules, likewise for the others, reducing our total candidate rules from 40,000 to only 9 rules). We could then run the rules and dispatch the claim. When I had first received this model, the system was physically running through all 40,000 rules to find the right match. This vectored approach reduced the total rules, but also eliminated guesswork. And with all that, boosted the power and created a very consistent turnaround time for each patient claim.
The vector list above is a common way in programming to organize information so that - if we have something in our hands already - like a filter or a vector based on a natural key - we can quickly find where we need to start looking and eliminate the places where we are not interested. While this sounds a lot like a common index-structure approach, it's actually a process-of-elimination approach. The two have similar qualities, so I will delineate them here at a high level.
An index is a surgical approach to finding data by logging its record-based keys and a record locator into a separately maintained structure. By finding keys, we find physical records. This is important for transactional systems that need fast record-level turnaround. It is not ideal however, for analytics or data warehousing where the vast majority of queries don't care about individual records at all, but about answers that span records. This alternative approach of vectoring, is a way to bucket the information so that we're not attempting to specifically locate the record itself, but the bucket where the record may be located, that is, derive a group of candidates upon which we will take further action.
When we think about this, the candidate approach, as opposed to the surgical approach is the only way to scale anything that has to deal with multiple-record sets (set-based processing). Engines that are geared to examine one-record-at-a-time even if it is to aggregate or perform multi-record operations with the single-records it retrieves, cannot compete with an engine that is designed to defacto treat the records as multiples from the outset. Transactional versus bulk. The transactional engine is inherently slower and non-scalable compared to the bulk engine (for set-based operations).
I noted in a prior entry that Netezza's anti-indexing strategy heavily leverages "where not to look" because no matter how large a table gets, this "vectoring" or "candidate" approach will guarantee that our response time is very consistent. How does this apply to zone maps?
Recall the diagram with the CPU/disk combinations. Each of these disks is divided into its smallest actionable physical portion (an extent) and the extent then becomes the currency-of-the-realm when we're talking about searching for data. The machine automatically tracks the data that is added to an extent, and when we generate-statistics, every extent is accounted for in a zone map. The map has basic statistics on the columns in the table at the extent level, such as the high/low values for a column in the given extent.
Extent ColA ColB ColC
1 200 A B
1 300 A B
1 400 C D
1 500 E F
2 200 F G
2 400 H I
2 600 J K
3 100 L M
3 200 J O
Note the above layouts of the extents. If we were to look at the high/low values of each, we would find zone maps like this:
Extent ColA High ColA Low
1 500 200
2 600 200
3 200 100
So now if we ask a question to the machine based on say, a ColA value of 200, the machine will search all three extents. It has no choice because the zone map is telling it that pertinent data exists in all three.
Now let's physically sort the data by ColA
Extent ColA ColB ColC
1 100 L M
1 200 A B
1 200 F G
1 200 J O
2 300 A B
2 400 C D
2 400 H I
3 500 E F
3 600 J K
1 200 A B
1 200 F G
1 200 J O
2 300 A B
2 400 C D
2 400 H I
3 500 E F
3 600 J K
Now if we take a look at the zone map, we would see something like:
Extent ColA High ColA Low
1 200 100
2 400 300
3 600 500
So that now if we query the table for values of 200, the machine searches only one extent, not all of them. (Now, this is a simple example, because a given value could extend across many extents). Likewise a given disk has thousands of extents upon which to store the data, so it can spread the information out into much smaller, manageable chunks.
The point, however, is that by organizing the data this way, the zone maps reduce the total searchable extents. Keep in mind that we don't have to sort the data each time we want to use it. For many warehouses, the data naturally arrives in this contiguous form (for example, transaction date in a retail model). The Netezza machine will track all integers and dates automatically without us having to do anything additional to enable it. So we might take a penalty once to sort the data at the outset, then benefit from the data's natural arrival and collation.
This effectively creates the aforementioned vectored effect. It also guarantees a consistent experience because the data associated with the given value (say 200 above) isn't going anywhere. It will always be in the same location, tracked by the same zone map. Anytime we want to get back to the records associated with value 200, the zone map will serve them up. If the response time last week was 2 seconds, we could quadruple the size of the table and it will still be 2 seconds, because the data itself didn't go anywhere.
Contrast this to other heavy-lifting algorithms in software technologies. The Ab Initio platform, for example, will perform an in-memory Sort that will "intelligently" spill to disk when it runs out of memory. What this means is that it will take a predefined block of memory and attempt to perform all work inside it until it runs out. If so, it will manufacture 16 file locations within which to spill its overflow. These are organized with the same "vectored" approach as the sort keys would dictate. If these 16 files run out of space, they will further divide into 256 files to split the workload. Anyone working with the technology will tell you that this secondary split into 256 is to be avoided if at all possible because it negatively (and significantly) affects the performance of the sort. But we see the difference here. A general-purpose sort in software needs these 256 assets to functionally complete its mission, and adding more means less performance. Now compare this to Netezza which uses not just 256, but thousands of extents to divide the data out - and rather than experience the performance drag, will derive power and lift from this.
So let's apply something additional, especially for highly active tables. If we have a nightly batch window for processing, we likely have all the time we need to keep these zone maps healthy. But if we have an intraday model or even an aggressive micro-batch model, it is possible that the zone maps may be "violated". Note above that as long as the data arrives in contiguous chunks (like transaction date) the data will be set into the extents so that we minimize the total extents for a search. If a transaction date (for example) should appear outside of this contiguous range (say a record from last week's data has arrived) then it means the high/low range for that particular extent will be extended to include the errant record. It could mean that the given extent is also included in queries that don't related to either date, because the extent is simply being included because of its high-low range. That is, every query looking for this extent's primary transaction date, the errant transaction date, or any date in between them, will include this extent in the candidate list. While this is inconsequential in most cases, if it repeats over time it can slowly degrade the table's performance. Highly active micro-batch and intraday models have a tendency to see these effects if they remain unchecked or unguarded. But not to worry - the solution is simple.
What is daunting about this is that our highly-active-table might be a multi-billion-row beast that does not lend itself to re-sorting and in an intraday model, there's no time to re-sort the data anyhow.
A materialized view has the quality of being able to apply a virtual zone map to a table without physically re-sorting the information. All we have to do is pick the columns we want to zone map, and then form the materialized view with an order-by on those columns. Rebuilding one of these is usually very fast. In addition, we can be rebuilding it even while its prior version is active. At one site, we set up a process to rebuild the view thus:
- Rebuild the materialized view with a new name
- Wait for completion
- Drop the original materialized view
Note that for the few moment that two materialized views are applied to two tables bears no risk at all. And if we drop the original materialized view, it will wait for all pending queries to stop - meaning that this won't break anything the users are doing (another necessity for highly active intraday models)
Important here is that the materialized views are not being used as surrogate tables, but as a means to virtualize the zone maps for faster response time on the table associated with the view. Another important aspect of the materialized view is that it can zone-map additional data types that the regular zone maps do not (e.g. varchars).
Let's say that we have a very active table and even re-creating the materialized view is too much uptake to maintain it. Not to worry, the materialized view has another quality that means we don't have to refresh it each time the table changes. The materialized view maintains its contents until the underpinning table changes. We then have the option to rebuild the view or use it "as is". This "as is" is usually agreeable for the short term, since when data is added to the underpinning table, the materialized view creates an extension to itself. Now whenever we query the table, it will examine the materialized view for default zone-map information but will also examine the extent. Considering that this extra extent is very small it is also insignificant in its overhead. So we can keep this in place for say, all day long and then clean up the materialized view in the evening. The materialized view's performance will not drag over the course of a day unless we do some serious re-churning of the underpinning data (which is not normal for common micro-batch models).
And now - Cluster-based tables
With all the above preliminary stuff out of the way, a discussion on CBTs is fairly straightforward. As of the 6.0 release, Netezza has provided this as a multi-dimensional capability for even more powerful searches. The data in the CBT is physically organized into extents based on keys. And when we perform a groom operation, the data is physically reshuffled so that the keys are physically co-located on the fewest extents possible. Like a self-enforcing zone map. only a lot more powerful. In a zone map approach, the primary way to get the data contiguous is to re-sort the data and the zone maps fall out for free. In a CBT, we organize the data and the execute groom. The initial groom will take awhile, but each subsequent groom will take inconsequential time. The data in the CBT can be organized on more than one key, but the keys can behave independently like a multi-dimensional effect.
Case in point - let's say we have 200 extents of data and the particular set of data we want spans only 10 of those extents (a zone map). We will ignore the other 190 extents and focus all the machine's energy on the 10 where the data is located. But wait, what if we have an additional key upon which to search, and the combination of the first key (10 extents) and second key reduces the total extents to only 2? That is, the first key appears in 10 extents but the first and second key appear in only 2 extents of those 10. Now we search only 2 extents. Again, we have organized and configured data, not hardware, to achieve the goal. We have told Netezza where-not-to-look and now we have zoomed in on a mere 2 extents out of 200.
But let's say that we add a lot more data to the table over the year, expanding this 200 extents to many hundreds if not thousands more. Can we see now that even if this happens, we can go back to the table again and again, ask the the same question as before, and only 2 of the extents will be used to search for data. This of course dispels the vendor-promulgated propaganda that Netezza tables get slower when we add more data.