In a Netezza shop experiencing some performance stress with their machine, we ask the usual questions as to the machine's configuration, its functional mission. Ultimately we pop-the-hood to find that the data structures and the queries are not in harmony. For starters, the structures don't look like Netezza structures, at least, not optimized for Netezza. We receive feedback that they "just" moved the data from their former (favorite technology here) and ran-with-it. They received the usual 10x boost as a door-prize and thought they were done. Lurking in their solution however, were latent inefficiencies that were causing the machine to work 10x to 20x harder to achieve the same outcome. And their queries were likewise 20x inefficient in how they leveraged the data structures.
More unfortunately, the power of the machine was masking this inefficiency. It's like the old adage, when a person first starts day-trading on the stock exchange, the worst thing that can happen to them is that they are successful. Why? It put a false sense of security in their minds that gives them permission to take risks they would never take if they knew the real rules of the game. The 10x-boost for moving the data over is a "for-free" door prize not the go-to configuration.
What are the real rules of the Netezza game? The first rule is that extraordinary power masks sloppy work. Netezza can make an ugly duckling look like a swan without actually being one. It can make an ugly model into a supermodel without the necessary adult beverages to assist the transformation. It can make sloppy queries look like something even Mary Poppins would approve of, practically perfect in every way and all that.
What's lurking under-the-hood is nothing short of a parasitic relationship between the model, the queries and the machine. We received the 10x boost door-prize and think we have succeeded. But we have only succeeded in instantiating the model and its data into the machine. We have not succeeded in leveraging the entire machine. And, uh, we paid for the entire machine. So why aren't we using it?
In our old environment, the index structures worked behind-the-scenes, transparently assisting each join. Our BI environment is set up to leverage those joins so we get good response times. The Netezza machine has no indexes so the BI queries (whether we want to admit it or not) are improperly structured to take advantage of the machine's physics.
"But that's how we've always done it..." or "But we don't do it that way..."
The short version, the former solution and (favorite technology here) is casting a long shadow across the raised floor onto the Netezza machine. People are forklifting "what they know" to the new machine when very little of it applies.
For example, in a star schema, the index structures are the primary performance center. The query will filter the dimensions first, gather indexes from the participating dimensions and then use these to attack the fact table. The engine does all this transparently. The result is a fast turnaround born on index-level performance. These are software-powered constructs in a general-purpose engine. The original concept of the star-schema was borne on the necessity of a model that could overcome the performance weaknesses of its host platform. It is in fact an answer to the lack-of-power of commodity platforms. In short, just by configuring and loading a star schema on a commodity platform, we get boost from using it over a more common 3NF schema.
The Netezza machine doesn't have indexes. So the common understanding of how a star-schema works doesn't apply. At all. Don't get me wrong, the star-schema has a lot of functional elegance and utility. It does not however, inherently provide any form of performance boost for queries using it. It can simplify the consumer experience and certainly ease maintenance, but it is not inherently more performant than any other model. In fact, using such a model by default could hinder performance.
Why is this?
The primary performance boosters in Netezza are the distribution and the zone map. Where the distribution and co-location preserve resources so that more queries can run simultaneously with high throughput, zone maps boost query turnaround time. They work in synergy to increase overall throughput of the machine. How does installing a star-schema inherently optimize such things? It doesn't.
Can we use a star-schema? Sure, and we should also commit to distributing the fact table on the same key as the most-active or largest dimension (they are often one-and-the-same). This will preserve concurrency for the largest majority of queries. A better approach however, is to specifically formulate a useful dimensional model that leverages the same distribution key for all participating tables. Common star-schemas do not do this by default, and if only two tables are distributed on the same key, all other joins to the other tables will be less performant. They will have to "broadcast" the dimensional data to the fact table. Clearly having all tables distributed on the same key will preserve concurrency, but this doesn't give us the monster-boost we're looking for. Distrubution might get us up to 2x past the door-prize performance we get from moving to the machine. Zone maps are notorious for getting us 100x and 1000x boost.
At one site I watched as several analytic operations remanufactured the star-schema data into several other useful structures, each of which was distributed on a common key. At the end of the operation, these quere joined in co-located manner and the final result came back in orders-of-magnitude faster than the same query on the master tables. I asked where they had derived the key, and they explained that it was a composite key that they had reformulated into a single key because their dimensional tables could all be distributed on it and maintain the same logical relationship. Looking over the table structures, they had a "flavor" of a star schema but certainly not a purist star. The question remained, if the existing star schema wasn't useful to them but their reformulated structure was, why weren't they using the reformulated one as the primary model and ditch the old one? The answer was simple, in that the existing star was seen as a general-purpose model and not to be outfitted or tuned for a specific user group. This is one of the commodity/general-purpose lines-of-thought that must be buried before entering the Netezza realm.
This is the primary takeaway from all that: The way we make an underpowered machine work faster is is to contrive a star schema that makes the indexes work hard. We forget that the star schema is a performance contrivance in this regard. If we attempt to move this model to the Netezza machine because "it's what we do" then we may experience performance difficulties rather than a boost. A common theme exists here: people do what they are knowledgeable of, what they are comfortable with, what they find easy-to-explain and do not naturally push-the-envelope for something more useful and performant.
In Netezza, the star schema has functional value but (configured wrong) is a performance liability. We can mitigate this problem by simply reformulating the star to align with the machine's physics, and by adapting our "purist" modeling practices to something more practical and adaptable. After all, many modeling practices are in place specifically because doing otherwise makes a traditional platform behave poorly. If we forklift those practices to Netezza, we participate in casting-the-long-shadown of an underperforming platform onto the Netezza machine.
We have enormous freedom in Netezza to shape the data the way we want to use it and make it consumption-ready both in content and performance. We should not move from a general-purpose platform (using a purpose-built model like a star) into a purpose-built platform with a general-purpose model like a star. The odd part is that the star is an anomaly in a load-balancing, traditional database, but is seen as purpose-built for that platform. Exactly the opposite is true in Netezza. The machine is purpose-built and the star is only another general-purpose model that doesn't work as well as a model that is purpose-built for Netezza physics and for user needs.
The worst thing we can do of course is think-outside-the-box (the Netezza box). We really need to think-inside-the-box and shape the data structures and queries to get what we want. This mitigates the long-shadows. It's just a matter of adapting traditional thinking into something practical for the Netezza machine.