Modified on by DavidBirmingham
For the past many months I have been diligently updating and upgrading the original 2008 Netezza Underground to address the many features of TwinFin, Striper and other offerings from IBM. I have recently been notified that it has passed final edit and is available on Amazon.com.
All I can say is "whew!" and many thanks to those who helped put it together. It's been a whirlwind.
Here is the URL
When I started the project I realized that a big part of the original book remains timeless. I didn't leave it "as is" though - practically every page and all the chapters have new material, case studies and such. I peppered the book with some additional graphics since the intrinsic points require a bit more reinforcement than mere words can supply.
The original chapter on "Distribution Stuff" is now "Performance Stuff" and is twice as long, covering the various aspects of setting up tables, troubleshooting, page-level zone maps and a lot more.
Fortunately, this time around there is a better mechanism to contact me if you have questions or want to report any errata (hey, it could happen!) - you can reach me through this blog, on LinkedIn or directly through my primary email address at Brightlight Consulting:
Modified on by DavidBirmingham
In this case, a tortoise-brained hare is an animal that is capable of going fast but wonders why he can't.
Over the last six months or so I have seen a "trending" situation with those who use the Organize-On. This is some pretty cool functionality so it's important that we get it right. Generally speaking, folks who understand how zone-maps work will have a splendid time with Organize. Others, not so much. The machine is supposed to be fast, so why are all my queries so slow?
All I have to do is add the Organize On keys and I'm good to go. Answer: No, you have to groom first, as this actually applies the Organize to the physical data. Response: Really? When we did this it took forever! Answer: Yes, the first groom may take a bit of time but every groom after that will be painless.
We CTAS a table every night for maintenance and the groom is always running slow. Answer: if you are using Organize On, don't use a CTAS for regular maintenance. With an Organize, the table is physically broken apart and spread across multiple additional extent-pages to provide for easier groom maintenance. A CTAS re-compresses it. So another groom is required to blow it back out again.
We like to use Materialized Views to optimize our tables but now the Organize doesn't let us. How can we get this back? Answer: You don't, because the Organize essentially replaces the Materialized View and does it so much better. Asking this question means you might not understand zone maps as well as you think you do! Just sayin'
We have included the distribution key in the Organize, along with some other join keys. Answer: (Sigh) Remove them. Join-only keys do not belong in the Organize. The Organize is for filter attributes. Uh, perhaps you don't understand zone maps? Just sayin'
We have some lookup tables where we applied an Organize. But this didn't seem to matter. Answer: No, it won't, because if the table is not big enough the system won't use zone maps anyhow.
We were told to turn off Page-Level zone maps but this caused a 5x reduction in performance. Answer: Page-levels radically reduce disk I/O which is our number-one enemy. Turn them back on and go home happy.
It is clear that our outlying cast of folks who have never been introduced to zone maps think that the Organize is either a clustered-index or some other kind of indexing scheme. Fair enough. Is this because of the deceptive naming? Like the Materialized View? (okay, don't get me started). The Organize keys are not keys in the same sense as indexes and are certainly not used as join keys.
They are filter-attributes. By this we mean that we fully intend to use these columns with constant values, lists or selected-lists as constraints.
Zone Map Primer
Here I will use my favorite example because it is time-tested and describes the capability. If I pay a visit to Wal-Mart and I'm looking for batteries, I might visit the customer kiosk and the lady tells me that the batteries are "on the end-cap, Aisle Three". This is an indexing model because she told me exactly where to look. This does not scale because when we get to billions of rows we spend more time searching the index than retrieving records.
In a Netezza model however, she would say "It's on an end-cap but I'm not sure which one." So I go to Aisle 1, then 2, then 3 and jackpot, I buy the batteries and go home. The most important part of her answer however, was in what she "did not say". She did not say that the batteries were in Men's or Women's clothing, Automotive or Electronics. In constraining the search to a particular location, she also told me "where not to look". This aspect of "where not to look" is critical to understanding zone maps. It is also critical to stratospheric scalability. Clearly if this Wal-Mart were to quintuple in size, my battery-buying-duration experience would not change in the least.
Now for the geeky part:
Each Netezza disk has 120K physical extents. Each extent has 3MB of space, divided across 24 pages (blocks) of 128K each. If we are running prior to OS 7.x, we will be zone-mapping at the extent-level. If we are 7.x or higher, we will use the page-level. Run, don't walk, to 7.x and page-level zone maps! They will radically reduce I/O on the extent and this is critically important. A set of records found only on a single page can return in 1/24th of the time it would take to scan the full extent (about 96 percent less I/O). In a real-world experiment, the zone maps on a TwinFin-18 were changed from page-level to extent-level and the same battery of queries executed against it. It performed 5x to 10x slower than with the page-level turned on. Do not underestimate the influence of disk I/O on query turnaround. Netezza is a very physical machine.
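The extent and page figures above can be checked with a few lines of arithmetic. This is only a sketch of the numbers quoted in the text (24 pages of 128K per extent); the variable names are mine and nothing here talks to a real machine.

```python
# Figures quoted in the text: 24 pages (blocks) of 128 KB in each extent.
PAGE_KB = 128
PAGES_PER_EXTENT = 24

extent_kb = PAGE_KB * PAGES_PER_EXTENT      # 3072 KB
extent_mb = extent_kb / 1024                # 3.0 MB

# Share of the extent we avoid reading when a single page holds the answer.
saved_fraction = 1 - 1 / PAGES_PER_EXTENT

print(f"extent size: {extent_mb} MB")
print(f"I/O avoided by reading one page: {saved_fraction:.1%}")
```

Reading one page instead of the full extent skips 23 of the 24 pages, which is where the "96 percent" figure comes from.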
Here's another internal example: Compression in Netezza is a nominal 4x. I have seen it much higher. Let's say we have an uncompressed record that is 10,000 bytes in size. When we read it, we will read all 10,000 bytes. If the data is compressed however, we will read only 2500 bytes, a 75% reduction in disk I/O. Netezza is the only platform where compression boosts the power in both reading and writing data because it reduces disk I/O tremendously.
Zone maps are a table that Netezza holds in memory to support each user table. Each table has its own set, describing the contents of the extents/pages allocated to the table on a given dataslice. The contents of the zone maps are built automatically when we load up our table. The machine will collect the information from the records stored on the given extent/page for all integers, dates and timestamps. It will store the high and low value of each, representing the table's record "ranges" for that extent.
Thus when we execute a query using one of these columns as a filter-attribute, it will go to the zone maps first and cherry-pick only those extents containing the data we want, excluding all others. This means that the data on those other extents won't even see the light of day for the query-in-play. If we use an Organize, we can use a wider range of data types and be more deliberate with zone-map management.
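To make the high/low idea concrete, here is a toy model in Python. It is a sketch only, not how the machine stores anything internally: `ZoneMapEntry`, `build_zone_map` and `candidate_extents` are names I made up, and the dates are hypothetical values written as yyyymmdd integers.

```python
from dataclasses import dataclass


@dataclass
class ZoneMapEntry:
    extent_id: int
    low: int
    high: int


def build_zone_map(extents):
    """Record the lowest and highest value found in each extent."""
    return [ZoneMapEntry(i, min(vals), max(vals)) for i, vals in enumerate(extents)]


def candidate_extents(zone_map, value):
    """Return the ids of extents whose low/high range could contain the value."""
    return [z.extent_id for z in zone_map if z.low <= value <= z.high]


# Hypothetical data: each inner list holds the dates stored in one extent.
extents = [
    [20140101, 20140102, 20140103],
    [20140104, 20140105, 20140106],
    [20140107, 20140108, 20140109],
]
zone_map = build_zone_map(extents)
to_scan = candidate_extents(zone_map, 20140105)
print(to_scan)  # [1]
```

Only extent 1 is a candidate; extents 0 and 2 are never read. That is the "where not to look" part of the answer.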
Let's say we want to search-on the transaction-date on our fact table. If we have not physically organized the data, the same transaction-date value may appear across many hundreds or thousands of extents, affecting the high/low range of many zone maps. If however, we were to physically co-locate the records with common dates into tighter physical groups, the records will physically appear inside fewer zone maps. These are the extents/pages that the machine will cherry-pick for scanning and will completely scan each one. We want it to scan as few of these as possible.
In times past, we had three primary ways of doing this.
A brute-force sort, which is pretty egregious when the table gets very large.
A software program that separates the key's distinct values and executes a block-select from the original table to the new, selecting only once for each distinct value. This physically co-locates like-keys. (the data does not have to be sorted, only like-keys co-located)
A materialized view, which would manufacture a virtual zone map (which is why it's not allowed with the Organize)
Let's say that our table's data takes up 400 extents on each data slice. If our transaction-date appears in 300 of these (it is poorly organized) then when a query runs like so:
Select count(*) from mytable where transaction_date = '2014-01-01';
All 300 extents will be searched for this information. However if we Organize on this transaction-date and then groom - the data will be physically shuffled around to co-locate the records with the same transaction-date on as few extents as possible. Let's further say that once this happens, the given date appears in only one extent. What we have just done is optimized the table 300x. We have eliminated 299 other locations to look for data. This is important because scanning a 3MB extent is a lot of work. If we are scanning 299 additional extents for each dataslice, we're really doing a lot of extra work for nothing. If we translate this to a page-level problem, we may have 24x300 pages originally containing the keys. If with Organize we reduce this to a single page, we have further reduced our scanning load by another 96 percent of the extent containing the page.
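The effect of co-locating like dates can be sketched with a small simulation. The assumptions are mine: a hypothetical table of 400 extents with 100 rows each, dates represented as day numbers, and a plain sort standing in for what Organize plus groom achieves (the real groom co-locates, it does not necessarily sort).

```python
import random


def extents_touched(values, rows_per_extent, target):
    """Count the extents whose low/high range includes the target value."""
    touched = 0
    for start in range(0, len(values), rows_per_extent):
        chunk = values[start:start + rows_per_extent]
        if min(chunk) <= target <= max(chunk):
            touched += 1
    return touched


random.seed(0)
# Hypothetical table: day numbers 1..400, each appearing 100 times.
days = [d for d in range(1, 401) for _ in range(100)]
shuffled = days[:]
random.shuffle(shuffled)  # poorly organized: every date is scattered everywhere

unorganized = extents_touched(shuffled, 100, 200)
organized = extents_touched(sorted(shuffled), 100, 200)  # like dates co-located
print(unorganized, organized)
```

With the data scattered, nearly every extent has a low/high range that straddles the target date, so nearly every extent must be scanned. Once like dates sit together, only one extent qualifies.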
The important factor in the example is the "out-in-the-open" value of "2014-01-01". It is a "filter attribute". It is not being joined to a table with this value. Doing it this way means that the FPGA/CPU will discover which zone maps have a high/low boundary that contains this value and will retrieve a candidate list of them. If there is only one as opposed to 300, we have radically reduced our workload. Netezza will literally exclude those extents from being examined at all. We have told the machine where-not-to-look. If we join this value however, such as using a time dimension, applying the date value to the time dimension and joining the time dimension to our fact table, we will require the system to fetch the record in order to examine it, at which time it will determine whether to keep it or toss it. This sort of thing can initiate a full table scan, nullifying the zone map entirely.
We don't want this to happen. We want the data to stay on disk and never see the light of day if it's not participating in the query.
To show how dramatic this can be, we were at one site hosting over 100 billion rows in the primary fact table. A full scan of this table took 8 minutes (which is not too shabby in itself, just sayin' ). The reporting users knew that if a query ever exceeded a few minutes in duration, it was probably ignoring the zone maps. This is because once-Organized, this table would return the average query in sub-second response. Think about that: 100 billion rows in sub-second response.
This is why paying attention to zone maps is such a big deal. Optimizing distribution can get us a boost in the single-digits (2x, 3x etc) on a given query. Optimizing zone maps can get us a 1000x boost and higher.
The Organize-On accepts one or more keys which will be applied to physically co-locate records of like-valued keys, then it will update the zone maps. Here is a test to see if we understand the application:
Take one of your largest tables. CTAS the table to another database, order by the distribution key, or by the hidden "rowid" column to make sure that the given filter key is not ordered. This could take a bit of time, of course. Then perform a query using one of your date parameters as in the example above, and time the query. Now perform an
Alter Table tabname organize on (date column name here).
Then perform groom. Once completed, execute the same query and get a timing. We can see that a several-minute query can go sub-second very easily.
What's more, the additional keys in the Organize are (more or less) independently organized. They will all enjoy a much faster turnaround than not using the Organize. If records arrive on the table out-of-order, no worries. Run the groom again. Subsequent runs of groom will always be shorter in duration than the first one.
Clearly however, if the zone-map is intended to apply filter-attributes to guarantee the exclusion of extents/pages completely, we cannot use a join-key. Or at least, not a join-only key. This also means that the distribution key is out (unless we plan to call-up an individual record based on the distribution key). Also, generally do not mix high cardinality keys with low cardinality keys. Netezza finds its strength somewhere in the middle. We will find that it disfavors the low-cardinality keys when high-cardinality ones are in-the-mix.
A good way to tell which of the filter attributes for a table are "high-traffic" is to turn on query history and then examine the table/view associated with "column access statistics" ($vhist_column_access_stats). This will provide the number of times the column participated in a query and which table(s) the base table interacted with. Perform a descending sort on the NUM_WHERE column and this will reveal all. In this short list we will see filter-attributes that are most useful. Don't use any of the join-only columns or the distribution key. These may adversely affect the multi-key algorithm's output and might not optimize any zone maps for this table.
At one site, we noted several inappropriate keys in the Organize, and simply by removing them and "grooming" again, the table experienced a 100x boost. The inappropriate keys were washing out the effectiveness of the other keys.
Many of us have seen this skyline, with buildings of various heights stabbing toward the sky. Compare this to the distribution graph that is part of the Netezza Administration GUI application. Normally this should be very flat (but a jagged-edge is usually okay). Clearly a Manhattan skyline is forbidden.
Or is it? If we have a table that is very skewed (like a Manhattan skyline) but the data can be easily "horizontally" sliced with zone maps, our round-trip time for a "tall" data-slice is no different than a "short" one. All we need to look out for is process-skew (too many horizontal slices on one dataslice).
Measuring the madness
Okay, David this is all very interesting but how can I know which extents or pages or whatever is being used by the keys? Well, there are a couple of handy hidden columns on each row that can help tell-the-tale. One is the _PAGEID and one is the _EXTENTID.
select count(*), datasliceid dsid, _pageid pid from fact_customer group by datasliceid, _pageid;
will tell us how many distinct pages are being used for each data slice.
select count(*), datasliceid, _pageid from fact_customer where transaction_date = '2013-01-01' group by datasliceid, _pageid order by datasliceid, _pageid;
In the above, the "count" should be reasonably even for each dataslice.
If the total count is radically more than "1", then let's organize on transaction_date and then groom. Now try the metrics again to see whether it reduced the total pages.
I am sure with these two columns in-hand you can think of a variety of creative ways to use them. The conclusion of it all is to get the records with common key values packed as closely as we can so they take up as few extents / pages as possible.
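One way to read the output of the second query is to count distinct pages per dataslice for the chosen date. The sketch below does that in Python on made-up rows; `pages_per_dataslice` is my own helper, and the (datasliceid, pageid) pairs are hypothetical stand-ins for what the query would return, one pair per matching record.

```python
from collections import defaultdict


def pages_per_dataslice(rows):
    """rows: (datasliceid, pageid) pairs, one per record matching the filter."""
    pages = defaultdict(set)
    for dsid, pid in rows:
        pages[dsid].add(pid)
    return {dsid: len(pids) for dsid, pids in sorted(pages.items())}


# Hypothetical results for one date, before and after Organize plus groom.
before = [(1, 10), (1, 11), (1, 57), (2, 3), (2, 40), (2, 41)]
after = [(1, 10), (1, 10), (1, 10), (2, 3), (2, 3), (2, 3)]

print(pages_per_dataslice(before))  # {1: 3, 2: 3}
print(pages_per_dataslice(after))   # {1: 1, 2: 1}
```

The same records now sit on one page per dataslice instead of three, which is the reduction we are looking for.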
It's a wrap
So now that the Organize seems a little better, you know, organized, maybe this will provide a bit of guidance on how to set up your own Organize and zone maps.
And don't forget to groom when changing the Organize keys.
We won't need to groom every time we do an operation. I would suggest a groom on both a schedule and a threshold. Pick a threshold (a lot of folks like five percent). When the total of deleted rows gets above five percent of the total non-deleted rows, or the total pages per unique data point gets above an unacceptable threshold, it's time to groom. But grooming on every operation is expensive, has marginal value and actually may throw away records we wanted to keep (in case of emergency rollback).
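The threshold rule can be written down as a small decision function. This is a sketch under my own assumptions: `needs_groom` and its parameters are hypothetical names, the row counts would have to come from your own monitoring, and the five percent default is simply the figure mentioned above.

```python
def needs_groom(deleted_rows, live_rows, pages_per_key=None,
                deleted_threshold=0.05, max_pages_per_key=None):
    """Return True when either threshold described in the text is exceeded."""
    # Deleted rows measured against the non-deleted (live) rows.
    if live_rows > 0 and deleted_rows / live_rows > deleted_threshold:
        return True
    # Optional second trigger: pages used per unique filter value.
    if pages_per_key is not None and max_pages_per_key is not None:
        return pages_per_key > max_pages_per_key
    return False


print(needs_groom(deleted_rows=40_000, live_rows=1_000_000))   # False (4 percent)
print(needs_groom(deleted_rows=60_000, live_rows=1_000_000))   # True (6 percent)
print(needs_groom(10, 1_000_000, pages_per_key=30, max_pages_per_key=8))  # True
```

A scheduled job could call something like this per table and groom only the tables that trip a threshold.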
Modified on by DavidBirmingham
Sometimes the average Netezza user gets a bit tripped-up on how an MPP works and how co-located joining operates. They see the "distribute on" phrase and immediately translate "partition" or "index" when Netezza has neither. In fact, those concepts and practices don't even have an equivalent in Netezza. This confusion is simply born of the notion that Netezza-is-like-other-databases-so-fill-in-the-blank. And this mistake won't lead to functional problems. They will still get the right answer, and get it pretty fast. But it could be soooo much faster.
As an example, we might have a traditional star-schema for our reporting users. We might have a fact table that records customer transactions, along with dimensions of a customer table, a vendor table, a product table etc. If we look at the size of the tables, we find that the product and vendor tables are relatively small compared to the customer, and the fact table dwarfs them all. A typical default would be to distribute each of these tables on their own integer ID, such as customer_id, vendor_id etc. and then putting in a transaction fact record id (transaction_id) that is separate from the others, even though the transaction record contains the ID fields from the other tables.
Then the users will attempt to join the customer and the transaction fact using the customer_id. Functionally this will deliver the correct answer but let's take a look under-the-covers what the performance characteristics will be. As a note, the machine is filled with SBlades, each containing 8 CPUs. For example, if we have a TwinFin-12, this is 12 SBlades with 8 CPUs, or 96 CPUs. They are interconnected with a proprietary, high-speed Ethernet configured to optimize inter-CPU cross-talk.
Also whenever we put a table into the machine, it logically exists in one place, the catalog, but physically exists on disks assigned to the CPUs. A simplistic explanation would be that if we have 100 CPU/disk combinations and load 100,000 rows to a table that is distributed on "random", each of the disks would receive exactly 1000 records. When we query the table, the same query is sent to all 100 CPUs and they only operate on their local portion of the data. So in essence, every table is co-located with every other table on the machine. This does not mean however, that they will act in co-location on the CPU. The way we get them to act in co-location (that is, joining them local to the CPU) is to distribute them on the same key.
But because our noted tables are not distributed on the same key, they cannot co-locate the join. This means that the requested data from the customer table will be shipped to the fact table. What does this look like? Because the customer table has no connection to the transaction_id, the machine must ship all customer records to all blades (redistribution) so that the CPUs there can attempt to join on the body of the customer table. We can see how inefficient this is. This is not a drawback of the Netezza machine. It is a misapplication of the machine's capabilities.
Symptoms: One query might run "fine". But two of them run slow. Several of them even slower. Results are inconsistent when other activities are running on the machine. We can see why this is the case, because the processing is competing for the fabric. Why is this important to understand? The inter-CPU fabric is a fairly finite resource and if we allow data to fly over it in an inefficient manner, it will quickly saturate the fabric. All the queries start fighting over it.
Taking a step back, let's try something else. We distribute the transaction_fact on the customer_id, not the transaction_id. Keep in mind that the transaction_id only exists on the transaction table so using it for distribution will never engage co-location. Once we have both tables distributed on the customer_id, let's look at the results now:
When the query initiates, the host will recognize that the data is co-located and the data will start to join without ever leaving the CPU where the two table portions are co-located. The join result is all that rises from the CPU, and no data is shipped around the machine to affect the answer. This is the most efficient and scalable way to deal with big-data in the box.
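A toy model shows why the distribution key matters for the join. Everything here is an assumption for illustration: four dataslices, a simple modulo standing in for the machine's real hash, and made-up customer and transaction rows.

```python
N_SLICES = 4


def slice_for(key):
    # Stand-in for the machine's hash; any deterministic function will do here.
    return key % N_SLICES


def distribute(rows, key_name):
    """Place each row on a dataslice according to its distribution key."""
    slices = [[] for _ in range(N_SLICES)]
    for row in rows:
        slices[slice_for(row[key_name])].append(row)
    return slices


def local_matches(fact_slices, cust_slices):
    """Count fact rows whose customer row already sits on the same dataslice."""
    hits = 0
    for f_slice, c_slice in zip(fact_slices, cust_slices):
        local_ids = {c["customer_id"] for c in c_slice}
        hits += sum(1 for f in f_slice if f["customer_id"] in local_ids)
    return hits


# Hypothetical tables: 8 customers and 24 transactions.
customers = [{"customer_id": c} for c in range(1, 9)]
facts = [{"transaction_id": t, "customer_id": (t // 3) % 8 + 1} for t in range(1, 25)]

cust_slices = distribute(customers, "customer_id")
by_txn = local_matches(distribute(facts, "transaction_id"), cust_slices)
by_cust = local_matches(distribute(facts, "customer_id"), cust_slices)
print(by_txn, by_cust, len(facts))
```

When the fact table is distributed on customer_id, every fact row finds its customer on the same dataslice, so nothing has to travel. Distributed on transaction_id, most rows find no local partner and the customer data must be shipped across the fabric.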
Now another question arises: If the vendor and product dimensions are not co-located with the transaction_fact, how then will we avoid this redistribution of data? The answer is simple: they are small tables so their impact is negligible. Keep in mind that we want to co-locate the big-ticket-or-most-active tables. I say that because we have sites that are similar in nature where the customer is as large as two of the other dimensions, but is not the most active dimension. We want to center our performance model on the most-active datasets.
This effect can rear its head in counter-intuitive ways. Take for example the two tables - fact_order_header and fact_order_detail. These two tables are both quite monstrous even though the detail table is somewhat larger. Fact_order_header is distributed on the order_header_id and the fact_order_detail is distributed on the order_detail_id. The fact_order_detail also contains the order_header_id, however.
In the above examples, the order header was being joined to the detail, along with a number of other keys. This achieved the correct functional answer, but because they were not using the same distribution key, the join was not co-located. So we suggested putting the order_detail table on the same distribution as the order-header (order_header_id). Since the tables were already being joined on this column, this was a perfect fit. The join received an instant boost and was scalable, no longer saturating the inter-CPU fabric.
The problem was in how the data architects thought about the distribution keys. They were using key-based thinking (like primary and foreign keys) and not MPP-based thinking. In key-based thinking, functionality flows from parent-to-child, but in MPP-based thinking, there is no overriding functional flow of keys - it's all about physics. This is not to say that "function doesn't matter" but we cannot put together the tables on a highly physical machine and expect it to behave at highest performance unless we regard the physics and protect the physics as an asset. Addressing the functionality alone might provide the right functional answer, but not the most scalable performance.
Two words: load balancing -
Is our chosen platform designed for set-based bulk processing, or load balancing? Both at the software and hardware levels? The load-balancing engine (and attendant SMP hardware) are simply the wrong architecture for large-scale bulk processing. There's no way to "properly" configure the wrong architecture.
Let's level-set the difference here. In an SMP-based scenario, our engineers have to carefully configure the hardware to garner the very best performance from it. We don't have that option in Netezza, because the hardware is pre-configured. Rather in Netezza, we gain power by how we organize and configure the data. We don't really have this option in an SMP-based model, because the database engine software pre-defines how we will organize the information (through index structures) and we cannot affect our fate without the indexes. Let's see the contrasts summarized:
Netezza - no indexes, no hardware config, performance is derived from data configuration
SMP machine - high hardware config, index-dependent, no ability to affect performance with data configuration
In short, the two performance tuning models are not only polar opposites; Netezza is also far more adaptable and flexible because it is easier to reconfigure data than to reconfigure hardware.
I am continually impressed with the valiant attempts of various platform aficionados who assert, claim and champion the notion that "properly" configured SMP-based hardware is the only issue in evaluating competitive performance between platforms. In short, a properly configured <name your platform here> is just as viable as the IBM Netezza platform. Just name your components and off you go.
Of course, most folks who are making these claims are not hardware aficionados at all. Now, I appreciate software folks because at heart I am one of them, but I cut my teeth as a software engineer on solutions that aligned high-powered hardware with other high-powered hardware, all the while respecting the fact that the software I was creating was actually orchestrating and controlling the interaction between these gravity-bending machines, not physically moving the "payload" as it were. Nothing, absolutely nothing could move the data faster than the hardware. We inherently know this, yet many seem to think that software products can overcome this issue by using RAM and other creative methods to accelerate the effect of software operation.
So before I launch into a more complex rant on this, a picture is worth 1000 words (at least). In Netezza Transformation I offered up some graphics for contrast and compare (and alluded to them in another blog entry here).
In the depiction (right) we have CPUs (on the top) and disk drives (on the bottom). The pipeline in between them is the general-purpose backplane of the hardware configuration, which may include a SAN interface, optical or 10gigE networking or other connection mechanism to transfer data between the server's CPUs and the SAN's disk drives. Even if these disk drives are local on the machine containing the CPUs, this backplane is still the connector between them.
Now we will load a data file containing 100 billion rows of information, some 25 terabytes in size. This is a medium data size for big-data aficionados. The data will necessarily enter the machine from an external network connection, into the software engine (running on the CPUs) which will deliver the data onto the assigned location of the disk drives. Seems like a very simplistic and remedial explanation doesn't it?
Now we will query this data. Our ad-hoc and reporting users will want to scan and re-scan this information. Note how now the bottleneck is actually the hardware itself. The data must be pulled in-total from the disk drives, through this backplane and into CPU memory before it can be examined or qualified. Even if we use index structures, the more complex the query, the more likely we will encounter a full-table-scan. How long, do we suppose, it would take for this configuration to scan 25 terabytes? (Keep in mind that all 25 terabytes has to move through the backplane).
A server-level MPP model would suggest placing two of these configurations side-by-side and coordinating them. In short, one of the server frames would contain some portion of the information while another frame contained the other. We could imagine placing multiples of these side-by-side to create an MPP grid of sorts. This is the essential secret-sauce of many Java grids and other grid-based solutions. Divide the data across a general-purpose grid and then coordinate the grid for loading and querying.
But notice how deeply we are burying the data under many layers of administered complexity. Sure, we can do this, but is it practical and sustainable? I've seen setups like this that served one application (and served it well) but it was an inaccessible island of capability that served no other masters. As general-purpose as all of its parts were, it had been purpose-built and purpose-deployed for a single solution that required the most heavy lifting at the time of its inception. Now that it is in place, other solutions around it are growing in the capacity they need to serve the grid, and none of them have access to the power within the grid. The grid becomes starved from the outside-in. No satellite solution can feed it or consume it at the rate it can process data, and it has no extensibility to support their processing missions.
So now we come full circle, we have a "properly configured" one-trick-pony. Over time, the expense and risk of this pony will become self-evident. Parts will break. Data will get lost. Lots of moving parts, especially general-purpose moving parts that are out-in-the-open, only increases the total failure points of the entire solution. Debugging and troubleshooting at the system level become matters of high-engineering, not simple administration. As noted above, in the environment where I cut-my-teeth, I was surrounded by these high-end engineers because it was the nature of the project. I noticed that once the project went into a packaging mode to deploy and maintain, these engineers moved-on and cast their cares on the shoulders of the junior engineers who backfilled them. This was a struggling existence for them, because the complexity of the solution did not lend itself to simple maintenance by junior engineers.
The Caractacus Potts adventure begins! You know Potts from the movie Chitty-Chitty-Bang-Bang. We thrilled to his inventions, and laughed when they did not deliver. A "simple" machine to crack eggs, cook them and deliver-up breakfast worked fine for everyone but him, delivering a plate of two uncooked eggs still in their shells. The puzzled look on his face told us he recognized the problem but did not know where to look for resolution. With so many moving parts, it could be anywhere. This is a classic outcome of "eccentric innovation" and "eccentric engineering" a.k.a "skunkworks". More importantly, the innovation only solves one problem (e.g. egg-centric breakfast) not a general-purpose solution platform.
Well, let's keep it simple - what about a simple summary report? You know, national sales data summarized to the national, regional, district and store levels? Wouldn't this require a complete scan of the table to glean all of this? In the SMP-based depiction (above) how could we expect such a scan to operate? Software would pull the data in-total from the disk drives, choking the backplane. Software would then summarize the aforementioned quantities, in memory if possible and then deliver the result. Frankly, such a report could take many hours to execute, and keep the machine busy the whole time. Even if we "grid" this, the query would swamp the grid. On a tricked-out Sun E10k, we have about 12 GB/second throughput. Putting some math to this, with 25 terabytes in tow, we could expect the table scan to complete in about 35 minutes even if the full complement of 64 processors is on board, because the solution is I/O bound, not CPU-bound (no new CPUs will make it run faster). However, in reality the software engine and all its overhead drain the energy from the machine and this query will run for hours, even if it's the only operation running on the machine. So I guess we really will need more CPUs to balance off the software drain of the engine itself. (sigh).
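The back-of-envelope scan time is easy to reproduce. This sketch assumes the throughput figure means gigabytes (not gigabits) per second and uses decimal units; `scan_minutes` is a name I made up.

```python
def scan_minutes(table_tb, throughput_gb_per_s):
    """Best-case minutes to move a table through a pipe of the given throughput."""
    seconds = table_tb * 1000 / throughput_gb_per_s  # 1 TB taken as 1000 GB
    return seconds / 60


print(round(scan_minutes(25, 12), 1))  # 34.7
```

So the best case lands in the neighborhood of half an hour, and that is before any engine overhead is counted. More CPUs do not change the result because the pipe, not the processor, sets the limit.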
This is because engines that run on SMP-based devices are inherently load-balancing engines, not bulk-processing engines. Their processes stop, negotiate and resume even if there's nothing else going on. Think of it like this: Where I live in the country, at 5am all of the traffic lights on the main road blink-yellow until about 6am. If I travel on that road before then, I can go the speed-limit for over half-an-hour before hitting the first traffic light in the next (larger) town. But if those traffic lights all operated normally, I could get stopped at each one, protracting my 30-minute journey by orders of magnitude as I wait on traffic lights even when no other traffic is present. SMP-based engines automatically thread this way, where a flow-based model does not. A load-balancing engine will force all of its processes to stop at a virtual traffic light, come up for air to make sure nothing else requires attention, then go back to work. Transactional models absolutely require this but it is anathema to bulk processing.
Now we contrast this with IBM Netezza, which is a purpose-built platform that remains general-purpose across all solutions requiring such a platform. We don't have a one-trick pony. This would mean (ultimately) any form of data warehouse or bulk-processing solution, but more importantly anything that requires fast retrieval of data, especially while performing high-intensity on-demand analytics on it.
In the IBM Netezza architecture (depicted right) each CPU has a shared-nothing disk drive and its own RAM. On the original Mustang Series 10400 (with 432 of these on board) we have a machine that costs far less than the Sun E10k noted above. Likewise we could scan those 100 billion rows in less than ten minutes. It won't ever take any longer than ten minutes. If we boost the CPUs of the machine, say by adding another frame to it (200+ CPUs), it will boost the machine's speed by another 50 percent. Queries that took 6 minutes will now take about 4 minutes. It is a deterministic/predictable model, and adding more frames to the Netezza platform is simple and inexpensive compared to the sheer labor dollars of eccentric engineering.
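The predictability claim can be sketched in a couple of lines. This assumes scan time scales inversely with the number of CPU/disk pairs, and it assumes the extra frame adds exactly 216 CPUs (half of 432) to match the 50 percent figure. Both are my simplifications, not measured numbers.

```python
def scaled_minutes(baseline_minutes: float, baseline_cpus: int, new_cpus: int) -> float:
    """Query time after adding CPUs, assuming perfectly linear scaling."""
    return baseline_minutes * baseline_cpus / new_cpus


# 432 CPUs plus a hypothetical 216-CPU frame = 648 CPUs (1.5x the hardware)
after_upgrade = scaled_minutes(6, 432, 432 + 216)
print(after_upgrade)
```

A 6-minute query divided by 1.5 comes out at 4 minutes, which is where the estimate above comes from.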
As for the depiction (right) it has 16 CPU/Disk combinations. To use round numbers, let's say we put 1.5 terabytes on each disk for a grand total of 24 terabytes. With this configuration, for any given query, the path to conclusion is only 1.5 terabytes worth of scan-time away. Once we initiate the query, each CPU will run independently and will scan its 1.5 terabytes. All of them will complete simultaneously, meaning that the total duration of the query was no longer than it took to scan the 1.5 terabytes (they are all scanning in parallel). Now boost this to 400 CPUs, where each one now only has about a 60 gigabyte share of the load. One scan of the entire 24TB table takes no longer than the time to scan 60 GB (they are all in parallel). We can measure our disk read-speed here and get very consistent estimates of how long a query should take.
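Here is a small sketch of that estimate. The per-drive read rate of 100 MB/second is a made-up placeholder (measure your own, as noted above), and the units are decimal. The function names are mine.

```python
TB = 10**12  # decimal terabyte, in bytes (assumption)
GB = 10**9   # decimal gigabyte, in bytes (assumption)

# Hypothetical per-drive read rate; substitute a measured value.
ASSUMED_DISK_BYTES_PER_SEC = 100 * 10**6


def per_cpu_share_bytes(table_bytes: float, cpu_count: int) -> float:
    """Each CPU/disk pair holds an even slice of the table."""
    return table_bytes / cpu_count


def parallel_scan_seconds(table_bytes: float, cpu_count: int,
                          disk_bytes_per_sec: float) -> float:
    """All slices scan at once, so the query lasts as long as one slice."""
    return per_cpu_share_bytes(table_bytes, cpu_count) / disk_bytes_per_sec


for cpus in (16, 400):
    share_gb = per_cpu_share_bytes(24 * TB, cpus) / GB
    seconds = parallel_scan_seconds(24 * TB, cpus, ASSUMED_DISK_BYTES_PER_SEC)
    print(f"{cpus} CPUs: {share_gb:.0f} GB each, about {seconds / 60:.0f} minutes")
```

In decimal units the 400-CPU share works out to roughly 60 GB apiece. The absolute minutes depend entirely on the assumed read rate, but the ratio between the two configurations does not.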
Also keep in mind that (in a prior blog entry on indexes) I noted that we can radically reduce these operations to a fraction of their total scan times. But in the example above, a full summary of the data on sales-boundaries, how much is that worth if we could do the sales based on a date? Or based on a range of dates? Perhaps even comparing last year's Independence-Day sales to this year's?
In an SMP-based configuration, the information engineers would suggest partitions. The partition (for an SMP engine) is an artificial performance prop that anticipates the user's query needs based on known use-cases. It bundles data (say on a date boundary) so queries against that date can be fenced by the partition boundary. The Netezza zone map, on the other hand, automatically tells the machine where to look, and where not to look, to go capture the information required by the query. No props, no use-case anticipation, just the flexibility we really need if we want to keep fire-breathing users happy without special engineering to anticipate their needs.
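To make the zone map idea concrete, here is a toy sketch in Python. It is not Netezza's implementation, just the principle: keep the minimum and maximum value for each block of storage, and skip any block whose range cannot contain what the query wants. The `Extent` class, the function names and the sample data are all hypothetical, and the sketch assumes the rows landed on disk roughly in date order.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Extent:
    """A block of storage holding (sale_date, amount) rows."""
    rows: list


def build_zone_map(extents):
    """Record the (min_date, max_date) found in each extent."""
    return [(min(r[0] for r in e.rows), max(r[0] for r in e.rows))
            for e in extents]


def scan_with_zone_map(extents, zone_map, lo, hi):
    """Sum sales in [lo, hi], reading only extents whose range overlaps."""
    total = 0.0
    extents_read = 0
    for extent, (zone_min, zone_max) in zip(extents, zone_map):
        if zone_max < lo or zone_min > hi:
            continue  # nothing in this extent can match, so skip the read
        extents_read += 1
        total += sum(amount for d, amount in extent.rows if lo <= d <= hi)
    return total, extents_read


extents = [
    Extent([(date(2010, 6, 1), 10.0), (date(2010, 6, 30), 20.0)]),
    Extent([(date(2010, 7, 3), 30.0), (date(2010, 7, 4), 40.0),
            (date(2010, 7, 4), 5.0)]),
    Extent([(date(2010, 8, 1), 50.0), (date(2010, 8, 15), 60.0)]),
]
zone_map = build_zone_map(extents)
total, extents_read = scan_with_zone_map(
    extents, zone_map, date(2010, 7, 4), date(2010, 7, 4))
print(total, extents_read)
```

The Independence-Day query touches one extent out of three. Nobody had to declare a partition on the date column ahead of time, and the same map serves a range of dates just as well.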
Zone maps allow the sales-related comparison above to arrive in mere seconds at the user's fingertips. An SMP machine, at best, even with partitions, indexes and other performance props, will require a maxed-out power frame (all CPUs on deck) and the best anticipatory information engineering to provide a consistent experience that even hopes to compete with Netezza. Even after all that, it won't come back in seconds, and won't provide the nimble flexibility so sought-after by even the average data analyst.
The conclusion is that the overall cost of deployment, ownership and ease of maintenance for a Netezza machine utterly eclipse the potential promise of SMP-based solutions. For an analyst, all columns, tables and functions are "fair game" for query - all over the database, 24x7. A Netezza machine provides just that. On an SMP-based engine, the analyst has to agree with information engineers on their entry points and usage patterns, and these have to be engineered into the model in order to support the users. Once engineered, the solution will support only that user base. All other user bases will require their own engineering model. This is not sustainable, durable or manageable, which is why those who are steeped in it will gladly embrace a Netezza machine. Value is recognized on so many levels.
When processing in bulk, leading us to scale (that is, tens of billions of rows) we can trust Netezza to pull this off rather handily. With hardware and architecture in the bag for us, is there anything we can do wrong, or perhaps inefficiently, that would deny us entry to the soaring heights of performance so elusive on other platforms? After all, if the stratosphere is within reach, we want to go there as quickly as possible.
As it turns out, there are some things that Netezza doesn't do well. And all of them are transactional in nature. Not to worry, as we really don't want Netezza to foray into those realms. We can do it faster and simpler in set-based form. In fact, we find that many newbies have to shake off some transactional thinking that in later days they refer to as cobwebs. These are the trappings of transactional thinking: artificial constraints, necessary evils to our existence on an SMP-based RDBMS (but in our rear-view-mirror and rapidly fading once we cross over to Netezza).
Now don't get me wrong, many people will deploy an SMP-based RDBMS for their warehouses and have a smashing time of it for many years, perhaps far into the future with no scalability issues at all. We could even venture to assume that over half of all warehouses rolled out this way will likely never see a capacity hiccup, simply because they are rolled out firmly in the center of the "bell curve". (A little statistics lingo there)
Recall many years ago, in the prob&stat class we may have slept through, that the bell curve gives us the 80-20 rule or something akin to it. It also gives us another rule, that twenty percent (or so) of warehouses will be at the small end, twenty percent at the top end, and the rest in the middle. So guess what? This puts eighty percent in the middle-of-the-road and not really on the Netezza radar for the moment. But perish the notion that Netezza intends to ignore that market. It plays very well in that zone too. In fact, Netezza can move-and-shake along the entire continuum of the solution spectrum.
But when it enters the upper twenty percent zone, the air starts to get thin. Nitrogen narcolepsy befalls the heroes-of-the-eighty-percent, and they start to fall away very quickly. When we get into the zone of searching, analyzing, crunching and even just summarizing data on the orders of billions, tens of billions, and hundreds of billions of rows, we have a number of additional rules that will impose themselves on our existence. These are no different than the laws of aerodynamics we would use to defeat the laws of gravitation. One set of rules has sway, while another set of rules is used to overcome the effect of the first. The rules we have to deal with are those concerned with scale. If we don't pay attention to scale, down to the lowest level possible, no amount of efficiency in software will bail us out. When it comes to scale, salvation is in the hardware. More importantly, the architecture of the hardware.
I often adjure prospective Enzees that when evaluating the TwinFin against other platforms, avoid the kool-aid softball questions that anyone can answer. When a proof-of-concept tournament is underway, the easy questions all float around in the eighty-percent zone where any of the technologies play in one form or another. Rather take a tough problem, one that only exists in the twenty-percent zone, and drop it in the lap of the tournament players. Take a look at three things: (1) what was the raw performance number, (2) what was the difficulty or complexity of the solution, (3) How many vendor engineers did it require and as a bonus (4) how long did it take the vendor engineers to give you an answer? Minutes, hours, days? How long they took to formulate an answer, and how many of them were required, is second only to the performance of the solution itself. After all, anyone can test-or-evaluate in the eighty-percent zone where everyone is a hero. Get the problem domain into the twenty-percent zone (where we'll be on day-one after the technology is installed) and get those answers now. Wouldn't it be a bit awkward to ask these questions when it's too late to do anything about it?
I have mentioned in other essays that the Netezza architecture, and its deep attention to details-of-scale are what come alongside us as our ally when the other heroes-of-eighty-percent simply gasp for air. One of these is the architecture of the disk drives themselves. The drives are of course dedicated to their own CPU, RAM, FPGA and all that, but the layout of the disk platter itself is also intriguing. As an aside, each Netezza disk shares information with another sister drive so that if it should crash, it can be hot-swapped and rebuilt from its sister drive very quickly, easily and automatically. So where is this drive data stored on the sister-drive? As the design would have it, on the inner rim of the sister drive's platter. Otherwise, each drive's outer rim holds user data and the middle ring holds temp space and system data. But we can see the wisdom in this model, that the outer rim is spinning fastest, so delivers the data to the user much faster than the inner rims would.
Does our eighty-percent-hero do this for us? Well of course not. The SMP-based models of Netezza's competitors share their drives like any other SMP device. The odd state of affairs is that they are shared-nothing at the functional level, but shared-everything at the hardware level. This dissipates the strength of the machine. Conversely for Netezza, it's shared-nothing all-the-way-down. This bodes well for query support, because it means every query gets the full-strength and undivided attention from the machine, even for the fractions of a second necessary to complete it, and thus each query is returned on a wave of strength, not a river of dissipation.
For those who want to roll out the eighty-percent models and be happy with it, hey I don't disparage anyone from making a living. Functionally speaking, those warehouses have some of the most advanced features on the planet - in the eighty-percent zone. Netezza's customers by definition have already breached the eighty-percent mark and will never revisit it again.
But true to form, when dealing with a non-scalable platform, our enemies rapidly reduce to size and complexity. That is, the larger the size of the solution, the less functionality we can derive from it. And the larger the complexity of the solution, the more its storage capacity is stunted from growth. Guess what happens when we migrate either of these solutions to a Netezza machine? When the complex functionality is ported, it suddenly finds a breath of fresh air and data volumes start to grow, sometimes exponentially. If the data sizes are already too unwieldy and we move to Netezza, the additional capacity allows us to build more functionality which in turn leads to - you guessed it - even more growth of data volumes. The short answer is, when migrating or upgrading either of these animals, the new platform had better be able to immediately scale in orders of magnitude, not just incremental percentages. This is why so many migrations onto the eighty-percent-platforms sputter and fail. The initial migration is successful, but they quickly find themselves out of capacity. This was never expected, so the migration is a bust.
But with Netezza, it has the capacity to take on the migrated functionality and grow with it - without any particular hiccups or worrisome meetings about its abilities. It just hums along, keeps all its promises and never complains.
Okay, so I'm a big fan of the technology. Mostly because it makes a migration easy and makes its customers happy. Who couldn't use a little more of that?
But for those heroes-of-the-eighty-percent, there's a place for them should they choose to don the necessary equipment for high-altitude flight. Of course, there aren't any heroes in the twenty percent zone, considering that the only true hero in that space is the TwinFin. And when flying above the clouds (with any technology), humility dictates that we set aside the hero-way - and we really have to trust the hardware, don't we?
After completing another migration from a traditional, general-purpose RDBMS to the Netezza technology, I visited a friend who had several artifacts in his home that had to be the strangest things I'd ever seen.
Now I'd heard of genetically altered cat fur, you know, buying a cat online while picking the fur color of your choice (blue, lavender, teal - etc). Seems like an odd thing to do to a cat. I like cats and dogs both, so don't imagine that this blog essay attempts to take sides. Some folks are downright serious on their choice of pet, so I'll smooth that fur wherever I can.
Back to the hairless cats. I asked him "Where did you find these? And what happened to your other cats?" And he laughed, "That's a funny story. These are my cats, but I had to shave them." To this, I rolled my eyes, wondering where this was about to lead. He told the tale:
"We took a vacation down south and put the cats in kennels in the back of the truck. The round-trip took a toll on their fur and matted up everything from head to toe. Weeks later, the cat fur had not smoothed out. The kids had been brushing it out but it wasn't working. And if you think they look ridiculous shaved, you have no idea how silly they looked with their hair matted."
"So you shaved them?" I asked.
"Sure" he said, "Seems practical right? Just get rid of the matted hair altogether. Teasing it apart would have taken, well, years of time. Their fur will grow back out soon enough"
"Aren't there, you know, shampoos and stuff for that? I mean, shaving seems a little extreme."
"Tried all that. Bad thing about it, mats are bad for cats - they cause infections and all kinds of nasty side effects. Best to just shave it all and be done with it."
Laughing on the inside, I thought a bit about how we have to decompose and de-engineer an organically-grown data warehouse. Some would suggest porting (forklifting) the whole thing over "as is". Like taking a matted-hair cat and moving them from one house to another. It changes the venue for the cat, but doesn't help the cat at all. It's still sick and getting sicker from the mats. Such folks "tell a tale" of the success of their migration derring-do. But they are like nomads. Hunting the game until there's no more, then pulling up stakes to find another place to burn out. Forklift-migrations have value only to the ones who are doing the migrating, not the recipients. No sooner will they tie a bow on it than someone will request a change, and we will discover what we already knew: The original data model (now the new data model) isn't very resilient to change no matter where it is hosted.
We realize we have ported both the good and the bad from the old system, when we had the opportunity to port the good and leave the bad behind. We essentially are agreeing that we are about to standardize on the past and then accommodate the future, rather than a better approach: standardize on the future and accommodate the past.
Many years ago, at one site we had to carefully tease-apart the data and the stored procedures to find out what they were actually doing. Unfortunately we had carved up the work for several teams rather than reviewing it together. Had we reviewed it together, we would have discovered that the stored procs executed in chains of work, and that many of the chains were copy-pasted from one original chain that was too "matted" to risk breaking. So they copied-and-modified this chain to perform the new functionality. Enough of these and we see how the stored procedure doesn't benefit us (at all) for back-end data processing. In fact, we strongly suggest people use stored procs in Netezza for BI-adaptation and optimization, for the presentation layer. But not for the back end. Stored procs are not operationally viable for a wide range of reasons. It's even funny how folks move from one technology to another and try to replicate the stored procedure logic as a knee-jerk exercise, without realizing how flawed it really is. Perhaps the Netezza stored proc will run a lot faster. Trust me, performance is the least of your worries.
So once we converged the teams together, these themes started popping out like rabbits. By the end of the first day we were all laughing at the sheer level of redundancy in the back end. But we were not particularly surprised at the outcome. We'd seen it in lots of places before, but not so bad.
Of course, it never once occurred to us that we would port these hundreds of stored procs over to the new system. Rather we would functionally specify what they are doing now, and leverage tools to accommodate the vast majority of the functionality, only building what was left over. I mean, this is a standard functional port, why complicate things? Forklifting into a Netezza machine will certainly yield 10x performance, so why the beef? Without optimizing the data structures and processes to leverage Netezza's power, we might get 10x but leave 100x on the table. Is this a good tradeoff?
Well, true to form, someone had the capacity to complicate things. He whipped out a spreadsheet and calculated the cost of the hundreds-of-stored-procs in the original system, not realizing we were planning to reduce these to maybe fifteen operations at most. Spreadsheet calculator in-hand, he estimated that it would take 24 people, 8 months, to handle these stored procs. I sat back in my seat, stunned, because he was costing a project we weren't about to undertake. Rather, building out 15 or so operations would require a handful of people and 90 days at the outside. But also true to form, the project principals saw visions of sugar plums (another word for sales-comp) that got in the way of their better judgment. They actually went to the client with these inflated numbers, he rejected their proposal outright and gave the business to someone else. It's easy to lose a deal when the client sees the inflation on-the-page.
But what our "spreadsheet guy" missed, was that we weren't about to embark on a journey of finding a home for each stored proc (we already knew this had no value, and the client knew it too). He believed that we intended to bring the matted-cats into the house and put them on pillows, when we intended to pick the cats we wanted to keep, and shave them.
Okay, that's a strange analogy, but we had no intention whatsoever of accepting all that convoluted spaghetti as the foundation for the go-forward system.
Netezza gives us the capacity - to simplify. We keep the parts we consider valuable (the cat) and get rid of all the mess that keeps the cat sick and unhappy. Taking only the functions we want, we then reconstruct (let the hair grow back out) only what we want to keep, and take the opportunity to apply some solid architectural principles and likewise capitalize on the strengths of the Netezza platform.
In the end, if we really have a platform that is standardized on the future, but accommodates the past, we also have something else that is even more powerful: A simpler, stronger engine that is ready to grow in functionality, adapting to our changing needs. The old system was never built with this kind of vision or priority, because the power wasn't there to effect it anyhow.
I have noted in prior posts several of the "banes" of in-the-box data processing, not the least of which is harnessing the mechanics and nuances of the SQL statement itself. After all, the engine of in-the-box is a series of insert/select SQL statements. I've also noted that we need to squeeze the latency out of the inter-query handoff and management. These are important factors for efficiency, scalability and adaptability.
But this article deals primarily with "adaptive" SQL, that is, the ability to surgically and dynamically control the SQL, the paths of flow between SQL statements, their timings and the ability to conditionally execute them.
I am drawing a contrast between this approach and the common "wired" ETL application. In the wired application of an ETL tool, all components are known and flow-paths predefined. If we want to shut off a particular component or flow, we'd better make that decision at startup because we won't get to do this later. A benefit here is that if we add or change a flow-path, the ETL tool's dependency analysis will (usually) detect it and give it a thumbs-up or thumbs-down. We can (and do) perform this kind of design-time analysis, but what of dynamic run-time analysis?
Case in point: One group performs trickle-feed of data from a change-data-capture, so on any given loading cycle, we don't know which files will show up. Not to worry in an ETL tool, since we would just build a separate mini-app to deal with the issues. The mini-app would key on the arrival of a specific file, process the file and present results to the database. This is a very typical implementation. But with hundreds of potential files, it's also logistically very daunting and hard to get the various streams to inter-operate. In fact, an ETL tool quickly reduces to "spaghetti-graphics" and the graphical user interface is just in-our-way at that point.
Case in point: One group has multiple query paths/flows where SQL statements build one-to-the-next for the final outcome. These can follow a wide range of paths not unlike a labyrinth depending on a variety of different factors. The problem is, these factors aren't known until run-time and only appear in fleeting form as the data is processed. How do we capture these elements and use them as steering logic? In an ETL tool, our options are limited to none. In this particular case, three primary paths of logic were available each time the flows ran. Sometimes all three paths ran end-to-end. Sometimes only one, or two would run, or perhaps none-at-all. The starting conditions and unfolding data conditions determined the execution path.
But we have another name for this don't we? Isn't this just plain vanilla "computer programming"? Where the data shows up and we use the encountered-data and encountered-conditions to guide the IF-THEN-ELSE logic to conclusion? The problem you see, is that we are so accustomed to using IF-THEN-ELSE at the ROW/COLUMN level, we cannot imagine what this would look like at the SET level. Ahh, the conditional logic driving SETS is unique and distinct from that which drives basic elements. But then again, we can only scale in sets, not the basic elements. This is where the dynamic nature of conditional-sets is invaluable.
But this isn't really about conditional sets, either. Only that conditional sets are a necessary capability and we have to account for them along with many other subtle nuances. Let's follow:
We have an external file and we load this into an intermediate/staging table (TABLE-A) in preparation for processing.
Now we build another target intermediate table (TABLE-B) and an insert/select statement to move / shape the data logically and physically from TABLE-A to TABLE-B.
From here we have several more similar operations, so we build intermediate tables for their results as well, such as TABLE-C, TABLE-D and TABLE-E
TABLE-A >>> TABLE-B >>> TABLE-C >>> TABLE-D >>> TABLE-E
Now let's say we have another chain of work starting from TABLE-V:
TABLE-V >>> TABLE-W >>> TABLE-X >>> TABLE-Y >>> TABLE-Z
Now something interesting happens, in that the developers sense a pattern that allows them to reuse certain logic if they only put these quantities into a couple of working tables, which we will call TABLE-G and TABLE-H, and now the flows look like this:
TABLE-A >>> TABLE-B >>> [+ TABLE-G] >>> TABLE-C >>> TABLE-D >>> TABLE-E
TABLE-V >>> TABLE-W >>> [+ TABLE-H] >>> TABLE-X >>> TABLE-Y >>> TABLE-Z
Notice how TABLE-G is feeding TABLE-C and TABLE-H is feeding TABLE-X, so that each of them has a 2-table dependency.
Now we get to the end of the chain of work and learn that TABLE-Z has to leverage some data in TABLE-C! We don't want to rebuild TABLE-C just for TABLE-Z, but in an ETL Tool this data would be bound/locked inside a flow. We could redirect the flow to TABLE-Z, unless the flow to TABLE-Z is entirely conditional and we don't know it until we encounter TABLE-C. What if, for example, the results of TABLE-C are conditional and if the condition is realized, none of the components following TABLE-C are executed. However, we could have TABLE-Z see this absence as acceptable and continue on.
Okay, that's a lot of stuff that might have your head spinning about now, but the simplicity in resolving the above is already in our hands. In any flow model, upstream components essentially have a "parent" relationship to a downstream "child" component. This parent-child relationship pervades flows (and especially trees) and as we can readily see, the above chain-of-events looks a lot like a tree (more so than a flow).
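One way to picture this parent-child structure is as a plain dependency map, where each table lists the tables it needs. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to produce one valid build order for the example chains. Treating TABLE-G and TABLE-H as starting points with no parents of their own is my assumption, since the example does not say what feeds them.

```python
from graphlib import TopologicalSorter

# Each key is a "child" table; its value is the set of "parent" tables
# that must exist before the child can be built.
depends_on = {
    "TABLE-B": {"TABLE-A"},
    "TABLE-C": {"TABLE-B", "TABLE-G"},   # the 2-table dependency
    "TABLE-D": {"TABLE-C"},
    "TABLE-E": {"TABLE-D"},
    "TABLE-W": {"TABLE-V"},
    "TABLE-X": {"TABLE-W", "TABLE-H"},   # the 2-table dependency
    "TABLE-Y": {"TABLE-X"},
    "TABLE-Z": {"TABLE-Y", "TABLE-C"},   # the late cross-chain dependency
}

# static_order() yields every table, parents always ahead of children.
build_order = list(TopologicalSorter(depends_on).static_order())
print(build_order)
```

The exact order can vary between equally valid answers, but TABLE-C will always land ahead of TABLE-Z. Note that this only captures the design-time picture. It says nothing about whether a table is physically there at run time.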
More importantly, each node of the tree is a checkpointed stop. We must build the intermediate table, process data into it and move on, but once we persist the data, we have a checkpointed operation. This is why it behaves so beautifully as a flow and a tree.
Now let's say over the course of SDLC (regular maintenance), that a developer needs to add some more operations and connect other existing operations to their results. This is essentially just introducing new source tables in the where/join clause, but the table has to exist. In short, if we add a new table to the logic of TABLE-X, it will now be dependent upon its original tables and the new ones. (Its query will break if they are not present at run time).
It is easy enough (honestly) to perform a quick dependency-check over all of our queries to make sure that their various source tables are accounted for. In other words, an operation actually exists that will produce the table. What if we picked the wrong table or even misspelled it? At run-time we would know, but we would rather know before execution because it's a design-time issue. This may verify that logically we have a plan to create the dependent table, but it does not deal with the simple fact that conditional circumstances may forego the physical instantiation of the table. Transforms ultimately do not operate on intent, but on the presence of physical assets.
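A design-time check of this kind can be sketched in a few lines. The sketch assumes each transform's target and source tables have already been pulled out of its SQL and declared in a simple list (the parsing itself is out of scope here). The function name, the record layout and the misspelled `TABLE-CC` are all hypothetical.

```python
def find_missing_sources(transforms, external_tables):
    """Return {target: [sources that nothing is planned to produce]}."""
    planned = {t["target"] for t in transforms} | set(external_tables)
    missing = {}
    for t in transforms:
        absent = sorted(set(t["sources"]) - planned)
        if absent:
            missing[t["target"]] = absent
    return missing


transforms = [
    {"target": "TABLE-B", "sources": ["TABLE-A"]},
    {"target": "TABLE-C", "sources": ["TABLE-B", "TABLE-G"]},
    # TABLE-CC is a deliberate misspelling of TABLE-C
    {"target": "TABLE-X", "sources": ["TABLE-W", "TABLE-H", "TABLE-CC"]},
]
# Tables loaded from files or otherwise supplied from outside the flow
external = ["TABLE-A", "TABLE-G", "TABLE-H", "TABLE-W"]

problems = find_missing_sources(transforms, external)
print(problems)
```

This catches the typo before anything runs. What it cannot catch is a table that is planned but never physically shows up, which is exactly the gap between intent and physical assets.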
As another nuance, this creates a disparity between the design-time flow of data, and the run-time flow of data. If the run-time is governed (e.g. ETL tools) so that the dependencies and conditions are all evaluated at the start of the application, the design-time and run-time are more easily mated for review by an auditor or analyst. But if any part of it is dynamically conditional, we can see how this could practically nullify the design-time form of the flow. They would simply say, "I know what the flow would do by design, but I want to see what it actually did at run time, because the data isn't matching up". Aha - so "intent" counts for design review, but "intent" is not what puts physical data into the tables. Operational processes do that.
As noted above with the necessity for conditionality and reduction of inter-transform latency, we now have a need to weave together at run-time what the flows will actually do. The "source tables" for a given transform are found in the where/join phrase and these had better be present when the SQL launches or it will be a short ride indeed.
And now, what you did not expect - one of the most powerful ways to use a Netezza machine is to forego the "serialization" of these flows and allow them to launch asynchronously. We can certainly throttle how many are "live" at a time, but if any or all of them can launch independently, how on earth are we supposed to manage the case where one or two of them really are dependent on another one or more? Do we put these in a separate flow? Do we really want our developers to have to remember that if they put an additional dependency in a transform that they have to regard whether that preceding transform has actually executed successfully?
So that's the real trick, isn't it? If I have forty transforms and all of them could run asynchronously except for about ten of them, that can only run after their predecessor completes, I have several options to see to it that these secondary operations do not fail (because their predecessor has not executed yet).
I can serialize them by putting them into separate flows (or branches). One of them kicks off and runs to completion while the next one waits. This is logically consistent but also inefficient. If those secondary transforms are co-located with the original set, the optimizer can run them when there is bandwidth rather than waiting until the end. It is also logistically unwieldy because a developer has to remember that if a transform should gain a dependency, it has to be moved to the second flow.
I can fully serialize them into a list, but this is the most inefficient since it "boxcars" the transforms and does not leverage the extra machine cycles we could have used to shrink the duration.
I can link them via their target table and source table, such that this relationship is dynamically identified and the flow path dynamically realized. If a given transform does not run (conditional failure) or simply fails to execute, the dependency breakage is dynamically known. What does this do? What if a given transform is supposed to use an incoming (intake) table if it is present (data was loaded) otherwise use a target-table's contents (e.g. trickle-feed, change-data-capture problem). This allows the transform to do its work with consistency but also have the ability to dynamically change its sources based on availability.
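To give a feel for this third option, here is a minimal sketch. It is not our framework, just an illustration under stated assumptions. Each transform declares its target and, for each source "slot", a list of acceptable tables in order of preference. A transform waits while a preferred source might still arrive, falls back to the next choice once the preferred one is known to be absent, and is blocked if nothing usable exists. The class, the function names and `TABLE-C-HISTORY` are hypothetical, and the transforms run one at a time here for clarity where a real scheduler would launch every ready transform asynchronously.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Transform:
    target: str
    sources: List[List[str]]                    # per slot: tables in preference order
    condition: Callable[[], bool] = lambda: True  # run-time decision to execute


def resolve_slot(alternatives, present, pending):
    """Return ('use', table), ('wait', None) or ('missing', None)."""
    for table in alternatives:
        if table in present:
            return "use", table
        if table in pending:
            return "wait", None   # its producer has not been decided yet
    return "missing", None


def run_flows(transforms, present_at_start):
    present = set(present_at_start)
    pending = {t.target: t for t in transforms}
    outcome = {}
    progress = True
    while pending and progress:
        progress = False
        for target in list(pending):
            t = pending[target]
            slots = [resolve_slot(alts, present, pending) for alts in t.sources]
            states = [state for state, _ in slots]
            if "wait" in states:
                continue
            del pending[target]
            progress = True
            if "missing" in states:
                outcome[target] = ("blocked", [])
            elif not t.condition():
                outcome[target] = ("skipped", [])
            else:
                present.add(target)  # stands in for the insert/select
                outcome[target] = ("ran", [table for _, table in slots])
    return outcome


flows = [
    Transform("TABLE-B", [["TABLE-A"]]),
    Transform("TABLE-C", [["TABLE-B"], ["TABLE-G"]], condition=lambda: False),
    Transform("TABLE-D", [["TABLE-C"]]),
    Transform("TABLE-Z", [["TABLE-Y"], ["TABLE-C", "TABLE-C-HISTORY"]]),
]
start = {"TABLE-A", "TABLE-G", "TABLE-Y", "TABLE-C-HISTORY"}
result = run_flows(flows, start)
for target, (status, used) in result.items():
    print(target, status, used)
```

In this run TABLE-C decides not to execute, so TABLE-D is blocked, while TABLE-Z sees the absence as acceptable and carries on with its fallback. The developer never declared an execution order. The order falls out of which tables are physically present, so the same answer comes back even if the list of transforms is shuffled.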
Now, we know ETL tools don't do this. Other tools may attempt to rise to this level of dynamic pathing, but the bottom line is that if those tools don't provide this kind of latency-reduction, high-throughput, dynamically adaptable model, they will not be able to leverage the full bandwidth of the machine. Trust me on this - the difference is between using 90 percent of the machine or only 10 percent at a time. That machine packs the virtual joules to make it happen, so let's make it happen.
When we originally developed our framework to wrap around some of these necessary functions, we had not considered these nuances of dynamic interdependence and frankly, ELT was so new that it didn't really matter. The overhead to execute "raw" SQL was zero, but we could not effectively parallelize/async the queries without losing control. Running async chains of transforms necessitated detailed control, but nobody had a decent algorithm for it, so once again Brightlight had to pioneer this capability. Our architecture allowed us to easily integrate these things into the substrate of the framework as a transparent function. This is the primary benefit of a framework, that the developers can continue to build their applications without disruption, but we can upgrade and enhance the framework to provide stronger and deeper functionality. Whether our framework is right for all applications is not the issue, but whether the complete implementation is right for Netezza. It's a powerful machine and we should not arbitrarily leave any cycles on the table.
Imagine slowly running out of steam because of latent implementation inefficiencies, then ultimately asking for a Netezza upgrade that would not be necessary if the inefficiencies weren't present. This has happened with more than one of our sites and rather than upgrading to all-new-hardware, we installed, converted and bought back an enormous amount of capacity. They eventually upgraded the hardware much later on, but for the right reasons.
So I published this book last Spring ('11) on how the Netezza machine is a change-agent. It initiates transformation upon people or products that happen to intersect with it. Most of the time this transformation makes the subject better. Sort of like how heavy-lifting of weights will make the body stronger. Or the pressure can crush the subject. Stress works that way. We could imagine the Netezza machine as the change-agent entering the environment. Everything brushing against it or interacting with it will have to step-up, beef-up or adapt. I sometimes hear the new players say things like "But if the Netezza machine could only.." That's like a Buck Private saying of his drill sergeant, "If he could only ..." No, the subject must consider that the Netezza machine is never the object of transformation but rather is the initiator of it. But it's not a harsh existence by any means. Products that can adapt are far-and-away better than before. Those that cannot adapt now, will eventually, or remain in their current tier.
Having been directly or indirectly alongside these sorts of product integrations and proof-of-concepts (POCs) numerous times, it's always an interesting ride. The vendor shows up ready-to-go with visions-of-sugarplums in their head. And the suits who show up with them are salivating for the ink on the license agreement. Less than an hour into the POC, all of them have a very different opinion of their product than when they arrived. Their bravado is reduced to a shy, sort of sheepish spin. Throw them a bone, not everyone walks out of this ring intact. Some of them shake their fist at the Netezza machine. It is unimpressed. Others shake their fist at their own product. Alas, it is but virtual, inanimate matter. What is transforming now? The person in the seat.
So I have watched them scramble to make the product hit-the-mark. Patches? We don't need no stinkin' patches. Except for today, when they will be on the phone in high-intensity conversations with their "engine-room" begging for special releases while on-site. Alas such malaise could have been avoided if only they had connected their product - at least once - to a Netezza machine. In so many cases, they will claim that they have Netezza machines in-the-shop so they are prepared-and-all-that. It is revealed, sometimes within the first hour, that the product has never been connected to a Netezza machine. It doesn't even do the basics, or address the machine correctly. It is especially humorous to hear them speak in terms of scalability as though a terabyte is a high-water mark for them. One may well ask, why are we wasting our time with underpowered technology? Well, in point of fact, when placed next to the Netezza machine it's all underpowered, so really it's just a matter of degree.
Case in point, Enzees know that in order to copy data from one database to another, we have to connect to the target database (we can only write to the database we are connected to). And then use a fully-qualified database/tablename to grab data from elsewhere - in the "select" phrase. Forsooth, their product wants to do it like "all the others do" and connect to the source, pushing data to the target. Staring numb at the white board in realization of this fundamental flaw, they mutter "If only Netezza could....". But that's not the point. They arrived on site, product CD in hand, without ever having performed even one test on a real Netezza machine, or this issue (and others) would have hit them on the first operation. They would have pulled up a chair in their labs, started the process of integration and perhaps called the potential customer: "Can we push the POC off until next week? We have some issues (insert fabricated storyline here) and need to do this later."
Cue swarming engineers. Transformation ensues.
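The connect-to-the-target rule is easy to sketch. Here is a minimal Python illustration of the statement a tool would need to issue once it is connected to the target database. The function name and the database and table names are hypothetical, and the `database..table` shorthand for reaching the source is shown as an assumption about how the source would be addressed, not as the only valid form.

```python
def build_cross_db_copy(target_table: str, source_db: str, source_table: str) -> str:
    """Build an INSERT..SELECT meant to run while connected to the TARGET database.

    The write lands in the connected database; only the read in the SELECT
    phrase reaches across to the other database through a qualified name.
    """
    return (
        f"INSERT INTO {target_table} "
        f"SELECT * FROM {source_db}..{source_table}"
    )


# Hypothetical names, for illustration only.
stmt = build_cross_db_copy("customer", "staging_db", "customer")
print(stmt)
```

A product that connects to the source and tries to push to the target has this statement backwards, which is why the flaw shows up on the very first operation.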
Another case in point, many enterprise products are built to standards that are optimized for the target runtime server. That is, they fully intend to bring the data out of the machine, process it and send it back. One of my colleagues made a joke about Jim Carrey's "The Grinch" and the mayor's lament for a "Grinch-less" Christmas. Well, didn't the Grinch tell Cindy-Lou Who that in order to fix a light on the tree, he would take the tree, fix it and bring it back? Seems like a lot of hassle for one light? Why can't you fix it here and not take it anywhere? Enzees see the analogy unfolding. No, we don't want to take the data out, process it and put it back. We want "Grinch-less" processing, too. Fix the data where it already is.
Why do this? Well, in the 6.0 version of the NPS Host, the compression engine could easily give us up to 32x compression on the disk. Or even a nominal 16x compression, meaning that our 80 terabytes is now 5TB of storage. And while we may have to de-compress it on the inside of the machine to process it, the machine is well-suited to moving these quantities around internally. Woe unto the light-of-heart who would pull the data out into-the-open, blooming it to its full un-compressed glory, on the network, CPU, the network again - just to process it and put it back.
Unprepared for the largesse of such data stores, our vendor contender's product centers on common scalar data types. Integer, character, varchar, date. No big deal. Connect to the Netezza machine and find out that the "common" database size is in the many billions and tens of billions of rows. A chocolate-and-vanilla software product without regard to a BigInt (8 byte) data type, cannot exceed the ceiling of 2 billion (that's the biggest a simple integer can hold). This does not bode well for integrating to a database with a minimum of ten billion records and that's just the smallest table. Having integers peppered throughout the software architecture by default - would require a sweeping overhaul to remediate. As the day wears on, we see them struggle with singleton inserts (a big No-No in Netezza circles) and lack of process control over the Netezza return states and status. These are not exotic or odd, but no two databases behave the same way. The moment that Netezza returned the row-count that it had successfully copied four billion rows, we watched the product crash because it could not store the row-count anywhere - the product had standardized on integers, not big integers, so the internal variable overflowed and tossed everything overboard. Quite unfortunately, this was a data-transfer product and performed destructive operations on the data (copy over there, delete the original source over here). So any hiccup meant that we could lose data, and lots of it.
Cue announcer: "And the not-ready-for-prime-time-players..."
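To see why a four-billion row-count cannot live in a simple integer, here is a small Python sketch using the standard `ctypes` fixed-width types. The helper names are my own for illustration, and silent wraparound is only one way an overflow can surface (the product in this story crashed instead).

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2,147,483,647: the ceiling of a signed 4-byte integer


def store_as_int32(value: int) -> int:
    """Return what a signed 32-bit variable holds after 'storing' value."""
    return ctypes.c_int32(value).value


def store_as_int64(value: int) -> int:
    """Return what a signed 64-bit (BigInt-sized) variable holds."""
    return ctypes.c_int64(value).value


rows_copied = 4_000_000_000  # the row-count reported back by the copy

print(rows_copied > INT32_MAX)      # the count does not fit in 4 bytes
print(store_as_int32(rows_copied))  # wraps around to a negative number
print(store_as_int64(rows_copied))  # fits comfortably in 8 bytes
```

A negative row-count feeding a "copy, then delete the source" routine is exactly the kind of hiccup that loses data.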
Oh, and that "lose data and lots of it" needs to be underscored. Consider a database holding tens of billions of rows (hundreds of terabytes) of structured data, that is, each record in inventory, with fiduciary, legal, contractual, perhaps even regulatory wrappers around it, and we're way, way past the coffin zone. Some of you recall the "coffin zone" is the point-of-no-return for an extreme rock-face climber. Cross that boundary and you can't climb down. But we're not climbing a rock face are we? The principle is the same. Lose that data and we'll get a visit from the grim reaper. He doesn't hold a sickle, just a pink slip in one hand and a cardboard box in the other (just big enough to hold a desk-full of personal belongings).
One test after another either fails or reveals another product flaw. When the smoke clears, the "rock solid offering" complete with sales-slicks and slick-salesmen, is beaten and battered and ready for the showers. The product engineers must now overhaul their technology (transform it) and fortify it for Netezza, or remain in their tier. The Netezza machine has spoken. It resets itself into a resting-stance, presses a fist into a palm, graciously bows, and with the terse, guttural challenge of a sensei master, says: "Your Kung Fu is not strong!"
Now it's transformation-fu.
Superficially, this can look like a common product-integration firefight. But this kind of scramble tells a larger tale: They weren't really ready for the POC. This would be similar to an "expert" big-city fireman, supremely trained and battle-hardened in the art of firefighting and all its risks, joining Red Adair's oil-well-fire-fighting team (a niche to be sure) and finding that none of the equipment or procedures he is familiar with apply any longer. He will have to unlearn what he knows in order to be effective on a radically larger scale. He might have been a superhero back home, faster than a speeding bullet, able to leap tall (burning) buildings in a single bound, but when he shows up at Red Adair's place, they will tell him to exchange his clothing for a fireproof form and get rid of the cape. Nobody's a hero around an oil-field fire. Heroes leave the site horizontally, feet-first. No exceptions.
Enzees have experienced a similar transformation (with a different kind of fire). The most-oft-asked questions at conferences are just that flavor: How do we bring newbees into the fold? How do we get them from thinking in row-based/transactional solutions into set-based solutions? How do we help them understand how to use sweeping-query-scans to process billions of rows? Or use one-rule-multiple-row approaches versus cursor-based multiple-rule-per-row? How do we get testers into a model of testing with key-based summaries instead of eyeballs-on-rows (when rows are in billions)?
We were dealing with a backup problem at one site because of a lack of external disk space. Commodity tools often use external disk space for this purpose, until they are connected to a Netezza machine and their admin tool complains that they need to add "another hundred terabytes" of workspace. We gulp, realizing that the workspace today is only a grand total of ten terabytes in size. And you need another hundred! Yeesh, you big-data-people!
Most of the universe outside the Enzee universe will never have to address problems on this scale. It is not the machine itself that is the niche. It is the problem/solution domain. Most of the commodity products that are stepping up are doing so only because it's clear that Netezza is here to stay and they need to step into Netezza's domain. I suppose at some point they expected Netezza to give them a call to start the integration process, but the Netezza Enzee Universe already had all that under control. It's amazing how lots-of-power can simplify hard tasks to the end of ignoring commodity products entirely.
Another case in point, a product vendor "popped over" with a couple of his newbee product guys and spent two weeks trying to get their product to play in-scale with Netezza. Before throwing in the towel, they offered up the common litany of observations. "No indexes? What the?" and "Netezza needs to change X", or the favorite "Nobody stores this much data in one place." The short version is, you brought a knife to a gun fight, as Sean Connery would assert, or perhaps, you brought a pick-axe and a rope to scale Mt. Everest. What were you thinking? You see, most people who have never heard of Netezza (I know, there really are folks out there who don't know about it, strange as it seems) do not understand the scale of data inside its enclosure. Billions of records? Tens of billions of records? A half-trillion records? Is that all you got?
We will watch a switch flip over in their brains as they assess what they are trying to bite off. A small group will embrace the problem and work toward harnessing the Netezza machine in every way possible. Another group will provide a bolt-on adapter for Netezza to interface to their core product engine. While another, larger group will assess the expense of such things, the marketplace they currently address, and conclude that they will for now remain in their current tier. This is like a 180-lb fighter climbing into the ring with a heavyweight, and walking away realizing that they need to add some muscle, some speed, and some toughness or just stay in their own weight class and be successful there.
Another case-in-point is the need for high-intensity data processing in-the-box in a continuous form, coupled with the need for the reporting environment to share the data once-processed, likewise coupled with the need for backup/restore/archive and perhaps a hot-swap failover protocol. We can do these things with smaller machines and their supporting vendor software products. But what about Netezza, with such daunting data sizes, adding the complexity of data processing?
At one site we had a TwinFin 48 (384 processors) and two TwinFin 24's (192 processors) with the '48 doing the heavy-lifting for both production roles. When it came time to get more hardware, the architects decided to get another '48 and split the roles, so that one of the machines would do hard-data-processing and simply replicate-final-results to the second '48, limiting its processing impact for any given movement. This was not the only part of their plan. They then set up replicators to make "hot" versions of each of these databases on the other server. This allowed them to store all of the data on both, providing a hot DR live/live configuration, but it would only cost them storage, not CPU power. Configured correctly, neither of the live databases would know the difference. Our replicators (nzDIF modules) seamlessly operated this using the Netezza compressed format to achieve an effective 6TB/hour inter-machine transfer rate, plenty of power for incremental/trickle feeds across the machines.
Some say "I want an enterprise product that I can use for all of my databases". Well, this is the problem isn't it? Netezza is not like "all of our other databases". Products that have a smashing time with the lower-volume environments start to think that a "big" version of one of those environments somehow qualifies their product to step-up. I am fond of noting that Ab Initio, at one site loading a commodity SMP RDBMS, was achieving fifteen million rows in two hours. Ab Initio can load data a lot faster than that (and is on record as the only technology that can feed Netezza's load capacity). So what was the problem? The choice of database hardware? Software? Disk space? Actually it was the mistaken belief that any of those can scale to the same heights as Netezza. If fifteen million rows take two hours, I could not imagine what a billion rows would take (at that rate, roughly 133 hours?). Netezza's cruising-speed is over a million rows a second from one stream, and it can load multiple streams-at-a-time.
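The arithmetic behind that comparison is worth running once. This is a back-of-the-envelope Python sketch using only the two rates quoted above; the function and variable names are mine, and the Netezza figure assumes a single stream at exactly one million rows a second.

```python
def hours_to_load(rows: int, rows_per_hour: float) -> float:
    """Hours needed to load `rows` at a steady rate."""
    return rows / rows_per_hour


BILLION = 1_000_000_000

smp_rows_per_hour = 15_000_000 / 2        # fifteen million rows in two hours
netezza_rows_per_hour = 1_000_000 * 3600  # one stream at a million rows a second

smp_hours = hours_to_load(BILLION, smp_rows_per_hour)
netezza_minutes = hours_to_load(BILLION, netezza_rows_per_hour) * 60

print(f"commodity SMP: {smp_hours:.0f} hours")
print(f"Netezza, one stream: {netezza_minutes:.1f} minutes")
```

Days on one side, minutes on the other, and that is before adding a second load stream.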
Many very popular enterprise products have not bothered to integrate with a Netezza machine, and many of those who have, provide some form of bolt-on adapter for it. It usually works, but because the problem domain is a niche, it's not on their "product radar". It's not "integrated as-one". What does this mean? Netezza's ecosystem, now assimilated by IBM, will ultimately have, through IBM's product genius and sheer integration muscle, a powerful stack for enterprise computing such that none of the other players will be able to catch up. If those vendors have not integrated by now, the goal-line to achieve it is even now racing ahead of them toward the horizon. Perhaps they won't catch up. Perhaps they won't keep up. Some products (e.g. nzDIF) are at the front-edge, but nzDIF is not a shrink-wrapped or download-and-go kind of toolkit. We use it to accelerate our clients and differentiate our approaches. It's a development platform, an operational environment and expert system (our best and brightest capture Netezza best practices directly into the core). This has certainly been a year where we've gotten the most requests for it. But there's only one way to get a copy.
Cue Red Adair.
"No capes!" - Edna Mode, clothing-designer-for-the-gods, Disney/Pixar's The Incredibles
After review of a "high performance" ELT platform (that's ELT, database-transform-in-the-box) - I started asking hard questions about things they had not considered. It's a high-performance platform, isn't it?
Well, yes and no. It supports a "continuous" model, but the performance is all in the query and the data, right? Well, we'd like to think that for purposes of long-cycle queries anyhow. Here's the expectation:
In a general-purpose RDBMS, transformation-in-the-box is expensive. Each query can take minutes or even hours to complete. At this point, nobody really cares about the overhead to launch and shepherd the query, or to report its status when completed. All of these infrastructure issues are eclipsed by the duration of the query itself. If only one percent of the operation's duration is in the overhead, who cares if we spend time optimizing or minimizing it?
So in one scenario, the product would launch its queries end-to-end using a scheduler. Each of the queries would be packaged into its own little run-time, then the scheduler would kick off each one and wait for its closure, only then kicking off the next, etc. Some would call this reasonable, others sophisticated. After all, if the duration of each query is protracted, why do we care about inter-transform latency?
Contrast this to a Netezza-centric series of transforms. A general-purpose database, recall, requires us to dogpile lots of logic into each transform, protracting the duration as a matter of necessity. In a Netezza-centric scenario, we will see those ten-or-so general-purpose queries chopped apart into more efficient, tactical form, with each query building upon the last towards a final outcome (in a fraction of the time of the general-purpose equivalent).
Apart from the mechanics of how Netezza makes this happen, look at how the mechanics of the operation change dramatically. I'll use a known working example of 42 Netezza transforms. Where first we had ten-or-so-general-purpose queries running for an hour or more, we now have over forty MPP-queries, each of which runs in less than a second of duration (with some exceptions). So all 42 queries, if we could kick them off one-after-another, will execute in around 45 seconds. If in this case, the users (or process) kicking off the sequence wants less than one-minute turnaround, now we have to deal with squeezing out inter-transform latency.
In plain-vanilla terms, the mechanism using the scheduler noted above put six seconds of latency between each transform. What does this mean? With each transform running in one second, and six seconds of overhead, what could run in less than a minute now runs in nearly five minutes (42 x 7 = 294 seconds) - we have dramatically breached our one-minute SLA!
Now David, get serious - who on earth wants to shave seconds off the inter-transform latency? Six seconds of delay between each transform seems perfectly reasonable! Sure, if the transform itself will take twenty or thirty minutes. But if it will take less than a second, our overhead for it is now the glaring culprit with a smoking gun, red-hands and all that. And for those Netezza users who want to eliminate this latency, don't pooh-pooh their needs. The fat is our problem to solve.
And what does this say for ETL tools that perform push-down to the machine? They will also have transitional latency as they hand-off control across components. Geared for the big-fat, long-duration queries of the general-purpose world, nobody has ever cared about the smidgeon of latency between them. Only now, the smidgeon is not so much, and looks a lot like a boulder next to a basketball.
Consider the following breakdown of a standard transform's parts. They look a lot like a CPU's fetch-decode-execute cycle (yeah, that's a little geeky, I know)
Startup | Overhead | Execution | Shutdown
So imagine that we have a tool with controls (like a scheduler) with several seconds of latency in startup, formulating the query (recall, we want query-generation, not hand-coded queries) leading up to Execution, and then some minor overhead to accept the status and transition to the next operation. We'll call it at four (4) seconds of latency.
If we scale the above timeline with several inline / serialized transforms, we would see an effect like this:
|---x--x--| |----x--x--| |---x--x--| ----
See the "DDD" (for dead-time). We lose that time in the additional latency for transform management. This is akin to the startup/shutdown cost of a launcher, scheduler, or ETL tool shepherding a "component" to activate the underpinning SQL statement
Either way, take a look at the total time "between the x's", which is the actual execution time. If this time is several hours, the dead-time is a nit. If the execution time is proportional to the line-drawing above, we have far too much overhead.
So for 42 of these operations, each one carries 4 seconds of latency plus 1 second of execution, for a grand total of 42 x 5 = 210 seconds, or 3.5 minutes. When we seriously consider squeezing the fat from this operation, our primary problem is in the non-optional overhead.
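Put as a quick calculation, the point is about the share of each transform's wall-clock time that is dead-time. The sketch below is plain arithmetic on the figures above (4 seconds of latency, 1 second of execution); the thirty-minute query is the general-purpose comparison case, and the function names are my own.

```python
def serialized_seconds(n_transforms: int, exec_s: float, overhead_s: float) -> float:
    """Total run time when every transform pays its own overhead in line."""
    return n_transforms * (exec_s + overhead_s)


def overhead_share(exec_s: float, overhead_s: float) -> float:
    """Fraction of one transform's wall-clock time spent in overhead."""
    return overhead_s / (exec_s + overhead_s)


total = serialized_seconds(42, exec_s=1, overhead_s=4)
short_share = overhead_share(exec_s=1, overhead_s=4)       # Netezza-style transform
long_share = overhead_share(exec_s=30 * 60, overhead_s=4)  # thirty-minute query

print(f"42 transforms serialized: {total} seconds")
print(f"overhead share, 1-second query: {short_share:.0%}")
print(f"overhead share, 30-minute query: {long_share:.2%}")
```

The same four seconds is a rounding error next to a thirty-minute query and most of the run time next to a one-second query.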
Now take a look at the scenario below. I have kicked off five transforms asynchronously, with their execution times between-the-x's -
See how the first one hands off to the second one, like a baton in a relay race? Notice how each of the transforms has already incurred its overhead in parallel to the first transform, and is now merely waiting (see the "W") for its predecessor to trigger its execution.
What is the start-to-finish time of these five transforms if executed like "boxcars" in serialized mode, accepting the penalty for inter-transform latency? From start to finish is around 25 seconds. But with the above example, the inter-transform latency has been practically removed except for the simple handshake as they hand-off. The first one launches and we incur the initial four seconds of latency, a necessary penalty. Then each of the five transforms requires one second, for a grand total of 9 seconds of runtime. We have effectively moved from a model with 25 seconds of runtime to 9 seconds of runtime.
While this is around 36% of the original run time, it does not seem as dramatic as when we compare it to the original model of 42 transforms. That is, four seconds of overhead plus 42 seconds is 46 seconds of runtime. Add to this 1/10th second for the handoff delay (another 4 seconds), for 50 seconds of grand total run time.
Its original time was 42*5 seconds, or 210 seconds. This new time of 50 seconds is 24% of our original run time. That's better than a 4x speedup and of course, is well within the boundaries of the one-minute SLA.
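For anyone who wants to check the numbers, here is the same comparison as a short Python sketch. It models only what is described above: one visible overhead penalty up front, one second per transform, and a tenth of a second per handoff. The function names are mine and the model is deliberately simple.

```python
def serialized_seconds(n: int, exec_s: float, overhead_s: float) -> float:
    """Boxcar mode: every transform pays overhead, then executes."""
    return n * (exec_s + overhead_s)


def prestaged_seconds(n: int, exec_s: float, overhead_s: float, handoff_s: float) -> float:
    """Baton mode: overhead is paid once in view, the rest is hidden in parallel."""
    return overhead_s + n * exec_s + n * handoff_s


before = serialized_seconds(42, exec_s=1.0, overhead_s=4.0)
after = prestaged_seconds(42, exec_s=1.0, overhead_s=4.0, handoff_s=0.1)

print(f"serialized: {before:.0f}s, pre-staged: {after:.1f}s")
print(f"fraction of original run time: {after / before:.0%}")
print(f"speedup: {before / after:.1f}x")
```

The five-transform case works the same way: `prestaged_seconds(5, 1.0, 4.0, 0.0)` against `serialized_seconds(5, 1.0, 4.0)`.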
Offsetting the transform run-times like this, so that the overhead is essentially invisible, is a common hardware-stabilization approach for micro-electronics. Here's an example:
Many years ago I worked for a company that was repackaging a design into much smaller form. Essentially we were reducing a rack of hardware into a 1-foot cube. All of the large wire-wrapped boards had been designed-down to much smaller form. This is when instabilities were first detected.
In one particular case, the engineer described it to me as an engineering-101 mistake on the part of the original designer. Every hardware circuit is driven by signals that pass through conditioners and gates (transistors) so that the final outcome is a signal of some kind on a particular part of the board's interface(s). The mistake was in that the timing signal, a pulse sent to the hardware 60 times a second, was being applied at the first input of logic. So a general signal would "sit" on a given input location, the timing signal would "fire", opening the gate, and the signal would then traverse through the many other components and paths to reach the output path of the hardware. But here was the problem: the signal took too long to make it from point-A to point-B before all of the other signals had already left it behind. The solution to this: Put the timing-trigger signal on the output side of the board. This way, when the original signal first arrived, it would make its way across all the necessary components and paths and then present its signal to a final gate, which was triggered by the timer. So when the trigger hit, the signal was already present and passed through instantly without a problem.
This sort of "triggered-steering-logic" is the theme of the noted inter-transform handoff scenario above. The transforms navigate their overhead to a stopping point and wait for the signal. In this case, they are waiting for a "done" signal from the transform preceding them. When this signal hits, the next transform is queued and ready to immediately execute its query without further delay. The subsequent transforms fall like dominos.
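The handoff itself can be sketched with nothing more than threads and events. This is a toy Python model, not our framework's implementation: `time.sleep` stands in for both the startup overhead and the query, the timings are scaled down, and `run_chain` is a name made up for the illustration.

```python
import threading
import time


def run_chain(n_transforms: int, overhead_s: float, exec_s: float) -> tuple:
    """Launch every transform at once; each pays its overhead, then waits its turn."""
    done = [threading.Event() for _ in range(n_transforms)]
    order = []

    def transform(i: int) -> None:
        time.sleep(overhead_s)   # startup and query generation, paid in parallel
        if i > 0:
            done[i - 1].wait()   # the "W": wait for the predecessor's done signal
        time.sleep(exec_s)       # stand-in for the actual query
        order.append(i)
        done[i].set()            # pass the baton to the next transform

    threads = [threading.Thread(target=transform, args=(i,)) for i in range(n_transforms)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order, time.perf_counter() - start


order, elapsed = run_chain(n_transforms=5, overhead_s=0.2, exec_s=0.02)
print(order, round(elapsed, 2))
```

The transforms still finish strictly in order, but the elapsed time is roughly one overhead plus five executions rather than five of each.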
But that's really only part of the story. These 42 transforms don't just run in a serialized stream. Some of them, after all, have no dependencies whatsoever. Others have dependencies in a discrete chain. Why serialize them when their dependencies are relatively few? Here's an example:
DEF JKL MNO VWX
Transforms A,B and C are dependent upon each other, so will serialize. In the meantime, the D,E,F and G,H,I transforms are independent of A,B,C, so can run side-by-side. J,K,L transforms are dependent on the outcomes of C,F and I, so will wait on them to complete, then finish the K,L transforms as serialized. But then another set-of-three transforms takes off. In ETL tools, we recognize this as branching or "component-parallelism". However, in an ETL tool the branches must be wired together and follow each other by design of the graphic on the canvas. ELT however, is much more dynamic than that.
Look at the effect: We're saving 6 seconds of time by not executing the A-I transforms serialized. Likewise the P-U transforms will now take 3 seconds, not 9 seconds, saving another 6 seconds. If we can find the otherwise independent transforms and move them to the front of the chain(s), the total time for the run is the longest chain of dependent transforms. In this case, we shrank 42 transforms into the same time frame as 20 serialized transforms. This shrank the total time from 50 seconds down to less than 28 seconds.
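The "longest chain of dependent transforms" is what schedulers call the critical path, and it is simple to compute. The Python sketch below covers only the first twelve transforms described above (A through L), assumes one second apiece, and uses a helper name of my own choosing.

```python
from functools import lru_cache


def critical_path_seconds(deps: dict, seconds: dict) -> float:
    """Run time when independent transforms run side by side.

    deps maps a transform to the transforms it must wait on;
    seconds maps every transform to its execution time.
    """
    @lru_cache(maxsize=None)
    def finish(name: str) -> float:
        waits = (finish(d) for d in deps.get(name, ()))
        return seconds[name] + max(waits, default=0)

    return max(finish(name) for name in seconds)


# A-B-C, D-E-F and G-H-I are independent chains; J waits on C, F and I.
deps = {
    "B": ["A"], "C": ["B"],
    "E": ["D"], "F": ["E"],
    "H": ["G"], "I": ["H"],
    "J": ["C", "F", "I"], "K": ["J"], "L": ["K"],
}
seconds = {name: 1 for name in "ABCDEFGHIJKL"}

serialized = sum(seconds.values())
parallel = critical_path_seconds(deps, seconds)
print(f"serialized: {serialized}s, critical path: {parallel}s")
```

Twelve transforms run in the time of six, which is the 6 seconds saved on the A-I group.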
But David, you mean we have to carefully weave these transforms together so that we can squeeze the fat from the timeline? Actually no. We have a sniffer that examines the "filter clause" for the dependent tables, you know, what the given transform will actually join against. This allows the transforms to self-discover their own optimum path without having to deal with painful weaving. If we happen to add another transform, or change the filter phrase in a transform to include/exclude another table in the join, it will automatically realign the priorities based on the intrinsic dependencies of the transforms. And your ETL tool won't do this dynamically.
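To give a flavor of how self-discovery can work, here is a deliberately naive Python sketch. It is not our sniffer: it assumes each transform is keyed by the table it builds, uses a crude regular expression to find table names after FROM and JOIN, and leans on the standard-library `graphlib` module (Python 3.9 and later) to group transforms into waves. All table names are hypothetical.

```python
import re
from graphlib import TopologicalSorter

# Crude stand-in for a real SQL parser: a bare table name after FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_]\w*)", re.IGNORECASE)


def discover_waves(transforms: dict) -> list:
    """transforms maps target table -> SQL text; returns waves that can run together."""
    graph = {}
    for target, sql in transforms.items():
        referenced = {name.lower() for name in TABLE_REF.findall(sql)}
        # Only tables built by another transform create a dependency.
        graph[target] = (referenced & set(transforms)) - {target}

    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


transforms = {
    "stg_orders": "select * from raw_orders where status = 'OPEN'",
    "stg_customers": "select * from raw_customers",
    "order_facts": "select o.*, c.region from stg_orders o "
                   "join stg_customers c on o.cust_id = c.cust_id",
    "region_totals": "select region, count(*) from order_facts group by region",
}
waves = discover_waves(transforms)
print(waves)
```

Add a join to another table in one of the queries and the waves rearrange themselves on the next run, with no canvas to rewire.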
Then we have the intake protocol, that of transferring data from one machine to another, or loading from files. We have a couple of options, that of loading the data completely before taking actions in the transforms, or we can load the data asynchronously to the transforms. Just as the transforms will "wait" on a prior table in the chain-of-transforms, we can also make them "wait" for the arrival of a particular intake table. Rather than waiting for all of the intake tables to arrive, we can initiate a transform chain when its particular data set has arrived (or for that matter, not at all, in case its data never arrives).
And in our second example, the total time allocated for loading the data, executing it and ensconcing it to the proper target tables was less than three minutes. Since serialized, all-or-nothing loading easily absorbed over half of this time, we found it important to squeeze the fat from this process. By causing the entire chain to run asynchronously, when the given intake table is ready, its transform-chain would automatically launch. As fortune would have it, the longest processing chain also had small tables to load. So we could kick off those transforms very early. Some of the shortest chains also had the largest load files, so we saved time by starting those loads as-early-as-possible.
In the end, what started out as a five-minute-plus operation shrank to 160 total seconds of time, well-inside the 3-minute SLA with some room for recovery. Here's how it looked:
Load Time = L
Transform Time = T
Original - all transforms launch when all the loads are complete
LLLL TTT TTTTT
Async form, transforms begin when their source tables are ready:
One may ask, why wouldn't we do it in the second form anyhow? It seems that this is the optimum way to process data. Well, one answer is simply: checkpointing. If we launch the transforms prior to all of the loads finishing, it is much harder to recover in case of load-failure. We would have to cancel the in-flight transforms and otherwise bring the processing to a halt. If we load it all first, then proceed, any errors can stop the processing immediately. No wasted cycles. Efficiency is important with these kinds of transform-chains in a database like Netezza. Still, if we need to shrink the total job time, the recovery-shutdown protocol (in case of load failure) is a necessary capability. Besides, our protocol does not itself finalize anything to the target database until the last transform is complete, lest the data start to arrive intermittently and out-of-context. So as long as the load time is balanced with overall transform time, shutting down before finalization is usually doable.
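The two forms can be compared with a simple timing model. In the Python sketch below the load and transform durations are invented for illustration (they are not the figures from the site), loads are assumed to run one after another in the listed order, and the function names are mine.

```python
def load_first_seconds(chains: list) -> int:
    """Original form: every load completes, then all chains transform side by side."""
    total_load = sum(load for load, _ in chains)
    longest_transform = max(transform for _, transform in chains)
    return total_load + longest_transform


def async_seconds(chains: list) -> int:
    """Async form: each chain starts the moment its own intake table has landed."""
    clock = 0
    finish = 0
    for load, transform in chains:
        clock += load                           # this chain's intake table arrives
        finish = max(finish, clock + transform)
    return finish


# (load seconds, transform-chain seconds). The longest chain has the smallest
# load, and the shortest chain has the largest load file.
chains = [(10, 90), (20, 40), (60, 15)]

print(f"load-first: {load_first_seconds(chains)}s")
print(f"async:      {async_seconds(chains)}s")
```

Nothing in this model finalizes to the target until the last chain finishes, which is what keeps a shutdown on load-failure clean.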
This of course invites the obvious question - are any of the ETL tools (or ELT offerings) attempting to squeeze out the fat in a similar manner? Methinks not, and for one reason: They, like the general-purpose databases, likewise consider themselves to be general-purpose tools. They are not philosophically committed to squeezing out latency because Netezza is the only platform that can benefit from it. The uptake in setting up and coordinating this sort of baton-like handoff, while not particularly difficult, is also non-trivial and represents a significant effort for a product tool to embrace. When compared to interfacing with the general-purpose-platforms - Netezza is not a large-enough user base to justify ramping-up this high-performance, zero-latency capability. In short, the product's features are market-driven.
The Netezza user base is growing, however, and with IBM pushing the hardware, it will grow more and faster.
Sitting at the top-end of this food chain are the big-ticket TwinFin 24/48/96 machines that do massive amounts of processing-in-the-box and want to move toward continuous models, processing data as-the-world-turns. The age of the nightly batch cycle is waning and Netezza is stepping up to the vast opportunities within the "continuous" world. Latency in this world is like poison. It may appear to be acceptable to the general-purpose world, but this is an illusion. If the general-purpose queries ever dropped into subsecond-duration, the tools facing them would need to re-tool. Actually they need to re-tool now - they just don't feel the pressure yet.
Modified on by DavidBirmingham
In a Netezza shop experiencing some performance stress with their machine, we ask the usual questions as to the machine's configuration, its functional mission. Ultimately we pop-the-hood to find that the data structures and the queries are not in harmony. For starters, the structures don't look like Netezza structures, at least, not optimized for Netezza. We receive feedback that they "just" moved the data from their former (favorite technology here) and ran-with-it. They received the usual 10x boost as a door-prize and thought they were done. Lurking in their solution however, were latent inefficiencies that were causing the machine to work 10x to 20x harder to achieve the same outcome. And their queries were likewise 20x inefficient in how they leveraged the data structures.
More unfortunately, the power of the machine was masking this inefficiency. It's like the old adage, when a person first starts day-trading on the stock exchange, the worst thing that can happen to them is that they are successful. Why? It puts a false sense of security in their minds that gives them permission to take risks they would never take if they knew the real rules of the game. The 10x-boost for moving the data over is a "for-free" door prize, not the go-to configuration.
What are the real rules of the Netezza game? The first rule is that extraordinary power masks sloppy work. Netezza can make an ugly duckling look like a swan without actually being one. It can make an ugly model into a supermodel without the necessary adult beverages to assist the transformation. It can make sloppy queries look like something even Mary Poppins would approve of, practically perfect in every way and all that.
What's lurking under-the-hood is nothing short of a parasitic relationship between the model, the queries and the machine. We received the 10x boost door-prize and think we have succeeded. But we have only succeeded in instantiating the model and its data into the machine. We have not succeeded in leveraging the entire machine. And, uh, we paid for the entire machine. So why aren't we using it?
In our old environment, the index structures worked behind-the-scenes, transparently assisting each join. Our BI environment is set up to leverage those joins so we get good response times. The Netezza machine has no indexes so the BI queries (whether we want to admit it or not) are improperly structured to take advantage of the machine's physics.
"But that's how we've always done it..." or "But we don't do it that way..."
The short version: the former solution and (favorite technology here) are casting a long shadow across the raised floor onto the Netezza machine. People are forklifting "what they know" to the new machine when very little of it applies.
For example, in a star schema, the index structures are the primary performance center. The query will filter the dimensions first, gather indexes from the participating dimensions and then use these to attack the fact table. The engine does all this transparently. The result is a fast turnaround born of index-level performance. These are software-powered constructs in a general-purpose engine. The original concept of the star-schema was born of the necessity for a model that could overcome the performance weaknesses of its host platform. It is in fact an answer to the lack-of-power of commodity platforms. In short, just by configuring and loading a star schema on a commodity platform, we get a boost from using it over a more common 3NF schema.
The Netezza machine doesn't have indexes. So the common understanding of how a star-schema works doesn't apply. At all. Don't get me wrong, the star-schema has a lot of functional elegance and utility. It does not, however, inherently provide any form of performance boost for queries using it. It can simplify the consumer experience and certainly ease maintenance, but it is not inherently more performant than any other model. In fact, using such a model by default could hinder performance.
Why is this?
The primary performance boosters in Netezza are the distribution and the zone map. Where the distribution and co-location preserve resources so that more queries can run simultaneously with high throughput, zone maps boost query turnaround time. They work in synergy to increase overall throughput of the machine. How does installing a star-schema inherently optimize such things? It doesn't.
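To make the zone-map half of that concrete, here is a minimal Python sketch of the idea, not Netezza's actual implementation. It assumes a zone map is nothing more than a (min, max) pair kept per storage page, and that a filter can skip any page whose range cannot contain the requested values. The page size, the column values and the function names are all hypothetical, chosen only for illustration.

```python
import random

PAGE_SIZE = 1000  # rows per page; an illustrative number, not a Netezza constant


def build_zone_map(values, page_size=PAGE_SIZE):
    """Split values into fixed-size pages and record each page's (min, max)."""
    pages = [values[i:i + page_size] for i in range(0, len(values), page_size)]
    return [(min(page), max(page)) for page in pages]


def pages_to_scan(zone_map, low, high):
    """Count pages whose [min, max] range overlaps the filter range [low, high]."""
    return sum(1 for page_min, page_max in zone_map
               if page_max >= low and page_min <= high)


random.seed(0)
shuffled = list(range(100_000))   # stand-in for a filter column such as a date key
random.shuffle(shuffled)          # rows landed on disk in no useful order

unorganized = build_zone_map(shuffled)           # data as loaded
organized = build_zone_map(sorted(shuffled))     # same data, physically ordered

# Same filter against both layouts: WHERE key BETWEEN 42000 AND 42999
scan_unorganized = pages_to_scan(unorganized, 42_000, 42_999)
scan_organized = pages_to_scan(organized, 42_000, 42_999)

print("pages scanned, unorganized:", scan_unorganized)
print("pages scanned, organized:  ", scan_organized)
```

With the rows physically ordered on the filter column, the filter touches a single page. With the same rows in shuffled order, nearly every page has a range wide enough to overlap the filter, so almost nothing can be skipped. The table model is identical in both cases; only the physical arrangement of the data differs.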
Can we use a star-schema? Sure, and we should also commit to distributing the fact table on the same key as the most-active or largest dimension (they are often one-and-the-same). This will preserve concurrency for the large majority of queries. A better approach, however, is to specifically formulate a useful dimensional model that leverages the same distribution key for all participating tables. Common star-schemas do not do this by default, and if only two tables are distributed on the same key, all other joins to the other tables will be less performant. They will have to "broadcast" the dimensional data to the fact table. Clearly having all tables distributed on the same key will preserve concurrency, but this doesn't give us the monster-boost we're looking for. Distribution might get us up to 2x past the door-prize performance we get from moving to the machine. Zone maps are notorious for getting us 100x and 1000x boost.
At one site I watched as several analytic operations remanufactured the star-schema data into several other useful structures, each of which was distributed on a common key. At the end of the operation, these were joined in a co-located manner and the final result came back orders-of-magnitude faster than the same query on the master tables. I asked where they had derived the key, and they explained that it was a composite key that they had reformulated into a single key because their dimensional tables could all be distributed on it and maintain the same logical relationship. Looking over the table structures, they had a "flavor" of a star schema but certainly not a purist star. The question remained: if the existing star schema wasn't useful to them but their reformulated structure was, why weren't they using the reformulated one as the primary model and ditching the old one? The answer was simple, in that the existing star was seen as a general-purpose model and not to be outfitted or tuned for a specific user group. This is one of the commodity/general-purpose lines-of-thought that must be buried before entering the Netezza realm.
This is the primary takeaway from all that: The way we make an underpowered machine work faster is to contrive a star schema that makes the indexes work hard. We forget that the star schema is a performance contrivance in this regard. If we attempt to move this model to the Netezza machine because "it's what we do" then we may experience performance difficulties rather than a boost. A common theme exists here: people do what they know, what they are comfortable with, what they find easy-to-explain, and do not naturally push-the-envelope for something more useful and performant.
In Netezza, the star schema has functional value but (configured wrong) is a performance liability. We can mitigate this problem by simply reformulating the star to align with the machine's physics, and by adapting our "purist" modeling practices to something more practical and adaptable. After all, many modeling practices are in place specifically because doing otherwise makes a traditional platform behave poorly. If we forklift those practices to Netezza, we participate in casting-the-long-shadow of an underperforming platform onto the Netezza machine.
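The co-location point can be sketched the same way. The Python below is a toy model, not how the Netezza planner works: it assumes rows are assigned to data slices by hashing the distribution key, and it counts how many dimension rows would have to be shipped to another slice before each fact row could find its match locally. The slice count, the table contents and the helper names are hypothetical.

```python
from collections import defaultdict

NUM_SLICES = 4  # hypothetical number of data slices


def distribute(rows, key, num_slices=NUM_SLICES):
    """Assign each row to a slice by hashing its distribution key."""
    slices = defaultdict(list)
    for row in rows:
        slices[hash(row[key]) % num_slices].append(row)
    return slices


def rows_moved_for_join(fact_slices, dim_slices, join_key, num_slices=NUM_SLICES):
    """Count dimension rows each slice must fetch from elsewhere to join locally."""
    moved = 0
    for s in range(num_slices):
        needed = {row[join_key] for row in fact_slices.get(s, [])}
        local = {row[join_key] for row in dim_slices.get(s, [])}
        moved += len(needed - local)
    return moved


# Toy tables: 1000 sales rows referencing 100 customers.
fact = [{"sale_id": i, "customer_id": i % 100} for i in range(1000)]
dim = [{"customer_id": c, "region_id": c % 7} for c in range(100)]

fact_by_customer = distribute(fact, "customer_id")

# Case 1: dimension distributed on the same key as the fact table.
colocated_moved = rows_moved_for_join(
    fact_by_customer, distribute(dim, "customer_id"), "customer_id")

# Case 2: dimension distributed on a different key (region_id).
mismatched_moved = rows_moved_for_join(
    fact_by_customer, distribute(dim, "region_id"), "customer_id")

print("dimension rows moved, same key:     ", colocated_moved)
print("dimension rows moved, different key:", mismatched_moved)
```

When both tables share the distribution key, every join partner is already on the same slice and nothing moves. When the keys differ, dimension rows have to travel before the join can run. A real broadcast copies the dimension data to every slice, so this count understates the traffic, but the direction of the effect is the same.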
We have enormous freedom in Netezza to shape the data the way we want to use it and make it consumption-ready both in content and performance. We should not move from a general-purpose platform (using a purpose-built model like a star) into a purpose-built platform with a general-purpose model like a star. The odd part is that the star is an anomaly in a load-balancing, traditional database, but is seen as purpose-built for that platform. Exactly the opposite is true in Netezza. The machine is purpose-built and the star is only another general-purpose model that doesn't work as well as a model that is purpose-built for Netezza physics and for user needs.
The worst thing we can do of course is think-outside-the-box (the Netezza box). We really need to think-inside-the-box and shape the data structures and queries to get what we want. This mitigates the long-shadows. It's just a matter of adapting traditional thinking into something practical for the Netezza machine.