Two words: load balancing -
Is our chosen platform designed for set-based bulk processing, or load balancing? Both at the software and hardware levels? The load-balancing engine (and attendant SMP hardware) are simply the wrong architecture for large-scale bulk processing. There's no way to "properly" configure the wrong architecture.
Let's level-set the difference here. In an SMP-based scenario, our engineers have to carefully configure the hardware to garner the very best performance from it. We don't have that option in Netezza, because the hardware is pre-configured. Rather in Netezza, we gain power by how we organize and configure the data. We don't really have this option in an SMP-based model, because the database engine software pre-defines how we will organize the information (through index structures) and we cannot affect our fate without the indexes. Let's see the contrasts summarized:
Netezza - no indexes, no hardware config, performance is derived from data configuration
SMP machine - high hardware config, index-depedent, no ability to affect performance with data configuration
In short, the two performance tuning models are not only polar opposites, Netezza is far more adaptable and flexible because it is easier to reconfigure data than to reconfigure hardware.
I am continually impressed with the valiant attempts of various platform aficionados who assert, claim and champion the notion that "properly" configured SMP-based hardware is the the only issue in evaluating competitive performance between platforms. In short. a properly configured <name your platform here> is just as viable as the IBM Netezza platform. Just name your components and off you go.
Of course, most folks who are making these claims are not hardware aficionados at all. Now, I appreciate software folks because at heart I am one of them, but I cut my teeth as an software engineer on solutions that aligned high-powered hardware with other high-powered hardware, all the while respecting the fact that the software I was creating was actually orchestrating and controlling the interaction between these gravity-bending machines, not physically moving the "payload" as it were. Nothing, absolutely nothing could move the data faster than the hardware. We inherently know this, yet many seem to think that software products can overcome this issue by using RAM and other creative methods to accelerate the effect of software operation.
So before I launch into a more complex rant on this, a picture is worth 1000 words (at least). In Netezza Transformation I offered up some graphics for contrast and compare (and alluded to them in another blog entry here).
In the depiction (right) we have CPUs (on the top) and disk drives (on the bottom). The pipeline in between them is the
general-purpose backplane of the hardware configuration, which may include a SAN interface, optical or 10gigE networking or other connection mechanism to transfer data between the server's CPUs and the SAN's disk drives. Even if these disk drives are local on the machine containing the CPUs, this backplane is still the connector between them.
Now we will load a data file containing 100 billion rows of information, some 25 terabytes in size. This is a medium data size for big-data aficionados. The data will necessarily enter the machine from an external network connection, into the software engine (runnning on the CPUs) which will deliver the data onto the assigned location of the disk drives. Seems like a very simplistic and remedial explanation doesn't it?
Now we will query this data. Our ad-hoc and reporting users will want to scan and re-scan this information. Note how now the bottleneck is actually the hardware itself. The data must be pulled in-total from the disk drives, through this backplane and into CPU memory before it can be examined or qualifed. Even if we use index structures, the more complex the query, the more likely we will encounter a full-table-scan. How long, do we suppose, it would take for this configuration to scan 25 terabytes? (Keep in mind that all 25 terabytes has to move through the backplane).
A server-level MPP model would suggest placing two of these configurations side-by-side and coordinating them. In short, one of the server frames would contain some portion of the information while another frame contained the other. We could imagine placing multiples of these side-by-side to create an MPP grid of sorts. This is the essential secret-sauce of many Java grids and other grid-based solutions. Divide the data across a general-purpose grid and then coordinate the grid for loading and querying.
But notice how deeply we are burying the data under many layers of administered complexity. Sure, we can do this, but is it practical and sustainable? I've seen setups like this that served one application (and served it well) but it was an inaccessible island of capability that served no other masters. As general-purpose as all of its parts were, it had been purpose-built and purpose-deployed for a single solution that required the most heavy lifting at the time of its inception. Now that it is in place, other solutions around it are growing in the capacity needs to serve the grid, and none of them have access to the power within the grid. The grid becomes starved from the outside-in. No satellite solutiion can feed it or consume it at the rate it can process data, and it has no extensibility to support their processing missions.
So now we come full circle, we have a "properly configured" one-trick-pony. Over time, the expense and risk of this pony will become self-evident. Parts will break. Data will get lost. Lots of moving parts, especially general-purpose moving parts that are out-in-the-open, only increases the total failure points of the entire solution. Debugging and troubleshooting at the system level become matters of high-engineering, not simple administration. As noted above, in the environment where I cut-my-teeth, I was surrounded by these high-end engineers because it was the nature of the project. I noticed that once the project went into a packaging mode to deploy and maintain, these engineers moved-on and cast their cares on the shoulders of the junior engineers who backfilled them. This was a struggling existence for them, because the complexity of the solution did not lend itself to simple maintenance by junior engineers.
The Caractacus Potts adventure begins! You know Potts from the movie Chitty-Chitty-Bang-Bang. We thrilled to his inventions, and laughed when they did not deliver. A "simple" machine to crack eggs, cook them and deliver-up breakfast worked fine for everyone but him, delivering a plate of two uncooked eggs still in their shells. The puzzled look on his face told us he recognized the problem but did not know where to look for resolution. With so many moving parts, it could be anywhere. This is a classic outcome of "eccentric innovation" and "eccentric engineering" a.k.a "skunkworks". More importantly, the innovation only solves one problem (e.g. egg-centric breakfast) not a general-purpose solution platform.
Well, let's keep it simple- what about a simple summary report? You know, national sales data summarized to the national, regional, district and store levels? Wouldn't this require a complete scan of the table to glean all of this? In the SMP-based depiction (above) how could be expect such a scan to operate? Software would pull the data in-total from the disk drives, choking the backplane. Software would then summarize the aforementioned quantities, in memory if possible and then deliver the result. Frankly, such a report could take many hours to execute, and keep the machine busy the whole time. Even if we "grid" this, the query would swamp the grid. On a tricked-out Sun E10k, we have about 12 gb/second throughput. Putting some math to this, with 25 terabytes in tow, we could expect the table scan to complete in about 30 minutes even if the full complement of 64 processors is on board, because the solution is I/O bound, not CPU-bound (no new CPUs will make it run faster). However, in reality the software engine and all its overhead drain the energy from the machine and this query will run for hours, even if it's the only operation running on the machine. So I guess we really will need more CPUs to balance off the software drain of the engine itself. (sigh).
This is because: engines that run on SMP-based devices are inherently load-balancing engines, not bulk-processing engines. Their processes stop, negotiate and resume even if there's nothing else going on. Think of it like this: Where I live in the country, at 5am in the morning all of the traffic lights on the main road blink-yellow until about 6am. If I travel on that road before then, I can go the speed-limit for over half-an-hour before hitting the first traffic light in the next (larger) town. But if those traffic lights all operated normally, I could get stopped at each one, protracting my 30-minute journey by orders of magnitude as I wait on traffic lights even when no other traffic is present. SMP-based engines automatically thread this way, where a flow-based model does not. A load-balancing engine will force all of its processes to stop at a virtual traffic light, come up for air to make sure nothing else requires attention, then go back to work. Transactional models absolutely require this but it is anathema to bulk processing.
Now we contrast this with IBM Netezza, which is a purpose-built platform for general-purpose across all solutions requiring such a platform. We don't have a one-trick pony. This would mean (ultimately) any form of data warehouse or bulk-processing solution, but more importantly anything that requires fast-retrieval of data, especially while performing high-intensity on-demand analytics on it.
In the IBM Netezza architecture (depicted right) each CPU has a shared-nothing disk drive and its own RAM. On the original Mustang Series 10400 (with 432 of these on board) we have a machine that costs far less than the Sun E10k noted above. Likewise we could scan those 100 billion rows in less than ten minutes. It won't ever take any longer than ten minutes. If we boost the CPUs of the machine, say by adding another frame to it (200+) CPUs, it will boost the machine's speed by another 50 percent. Queries that took 6 minutes now will take less than 4 minutes. It is a deterministic/predictable model, and adding more frames to the Netezza platform is simple and inexpensive compared to the sheer labor dollars of eccentric engineering.
As for the depiction (right) it has 16 CPU/Disk combinations. To use round numbers, let's say we put 1.5 terabytes on each disk for a grand total of 24 terabytes. With this configuration, for any given query, the path to conclusion is only 1.5 terabytes worth of scan-time away. Once we initiate the query, each CPU will run independently and will scan its 1.5 terabytes. All of them will complete simultaneously, meaning that the total duration of the query was no longer than it took to scan the 1.5 terabytes (they are all scanning in parallel). Now boost this to 400 CPUs, where each one now only has about 63 gigabytes share of the load. One scan of the entire 24TB table takes no longer than the time to scan 63 GB (they are all in parallel). We can measure our disk read-speed here and get very consistent estimates of how long a query should take.
Also keep in mind that (in a prior blog entry on indexes) I noted that we can radically reduce these operations to a fraction of their total scan times. But in the example above, full summary of the data on sales-boundaries, how much is that worth if we could do the sales base on a date? Or based on a range of dates? Perhaps even comparing last-years Independence-Day sales to this years?
In an SMP-based configuration, the information engineers would suggest partitions. The partition (for an SMP engine) is an artificial performance prop that anticipates the user's query needs based on known use-cases. It bundles data (say on a date boundary) so queries against that date can be fenced by the partition boundary. The Netezza zone map, on the other hand, automatically tells the machine where to look, and where not to look, to go capture the information required by the query. No props, no use-case anticipation, just the flexibility we really need if we want to keep fire-breathing users happy without special engineering to anticipate their needs.
Zone maps allow the sales-related comparison above to arrive in mere seconds to the user's fingertips. On an SMP machine, at best, even with partitions, indexes and other performance props, will require a maxed-out power frame (all CPUs on deck) and the best anticipatory information engineering to provide a consistent experience that even hopes to compete with Netezza. Even after all that, it won't come back in seconds, and won't provide the nimble flexibility so sought-after by even the average data analyst.
The conclusion is that the overall cost of deployment, ownership and ease of maintenance for a Netezza machine utterly eclipse the potential promise of SMP-based solutions. For an analyst, all columns, tables and functions are "fair game" for query - all over the database, 24x7. A Netezza machine provides just that. On an SMP-based engine, the analyst has to agree with information engineers on their entry points and usage patterns, and these have to be engineered into the model in order to support the users. Once engineered, the solution will support only that user base. All other user bases will require their own engineering model. This is not sustainable, durable or manageable, which is why those who are steeped in it will gladly embrace a Netezza machine. Value is recognized on so many levels.
DavidBirmingham 2700043KNU Tags:  analytics performance pda puredata netezza amazon 6 Comments 5,994 Views
For the past many months I have been diligently updating and upgrading the original 2008 Netezza Underground to address the many features of TwinFin, Striper and other offerings from IBM. I have recently been notified that it has passed final edit and is available on Amazon.com.
All I can say is "whew!" and many thanks to those who helped put it together. It's been a whirlwind.
Here is the URL
When I started the project I realized that a big part of the original book remains timeless. I didn't leave it "as is" though - practically every page and all the chapters have new material, case studies and such. I peppered the book with some additional graphics since the intrinsic points require a bit more reinforcement than mere words will suffice.
The original chapter on "Distribution Stuff" is now "Performance Stuff" and is twice as long, covering the various aspects of setting up tables, troubleshooting, page-level zone maps and a lot more.
Fortunately, this time around there is a better mechanism to contact me if you have questions or want to report any errata (hey, it could happen!) - you can reach me through this blog, on linked-in or directly through my primary email address at Brightlight Consulting:
DavidBirmingham 2700043KNU Tags:  analytics performance physical logical netezza puredata 2,202 Views
How many logical modelers does it take to screw in a lightbulb? None, it's a physical problem.
I watched in dismay as the DBAs, developers and query analyts threw queries at the machine like darts. They would formulate the query one way, tune it, reformulate it. Some forms were slightly faster but they needed orders-of-magnitude more power. Nothing seemed to get it off the ground.
Tufts of hair flew over the cubicle walls as seasoned techologists would first yelp softly when their stab-at-improvement didn't work, then yelp loudly as they pulled-out more hair. Yes, they were pulling-their-hair-out trying to solve the problem the wrong way. Of course, some of us got bald the old-fashioned way: general stress.
I took the original query as they had first written it, added two minor suggestions that were irrelevant to the functional answer to the query, and the operation boosted into the stratosphere.
What they saw however, was how I manipulated the query logic, not how I manipulated the physics. I can ask for a show of hands in any room of technologists and get mixed answers on the question "where is the seat of power?". Some will say software, others hardware, others a mix of the two, while those who still adhere to the musings of Theodoric of York, Medieval Data Scientist, would say that there's a small toad or dwarf living in the heart of the machine.
To make it even more abstract, users sit in a chair in physical reality. They live by physical clocks in that same reality, and "speed of thought" analytics is only enabled when we can leverage physics to the point of creating a time-dilation illusion: where only seconds have passed in the analyst's world while many hours have passed in reality. After all, when an analyst is immersed in the flow of a true speed-of-thought experience, they will hit the submit-key as fast as their fingers can touch the keyboard, synthesize answers, rinse and repeat. And do this for many hours seemingly without blinking. If the machine has a hiccup of some kind, or is slow to return, the illusion-bubble is popped and they re-enter the second-per-second reality that the rest of us live in. Perhaps their hair is a little grayer, their eyes a little more dilated, but they will swear that they have unmasked the Will-O-The-Wisp and are about to announce the Next Big Breakthrough. Anticipation is high.
But for those who don't have this speed-of-thought experience, chained to a stodgy commodity-technology, they will never know the true meaning of "wheels-up" in the analytics sense. They will never achieve this time-dilated immersion experience. The clock on the wall marks true real-time for them, and it is maddening.
Notice the allusions to physics rather than logic. We don't doubt that the analyst has the logic down-cold. But logic will not dilate time. Only physics can do that. Emmett Brown mastered it with a flux capacitor. We don't need to actually jump time, but a little dilation never hurt anybody.
The chief factor in query-turnaround within the Netezza machine is the way in which the logical structures have been physically implemented. We can have the same logical structure, same physical content, with wildly different physical implementations. The "distribute on" and "organize on" govern this factor through co-location at two levels: Co-location of multiple tables on a common dataslice and co-location (on a given dataslice) of commonly-keyed records on as few disk pages as possible (zone maps). The table can contain the same logical and physical content, but its implementation on Netezza physics can be radically different based on these two factors.
Take for example the case of a large-scale fund manager with thousands of instruments in his portfolio. As his business grows, he crosses the one-million mark, then two million. His analytics engine creates two million results for each analytics run, with dozens of analytics-runs every day, adding up quickly to billions of records in constant churn. His tables are distributed on his Instrument_ID, because in his world, everything is Instrument-centric. All of the operations for loading, assimilating and integrating data centers upon the Instrument and nothing else. They are organized on portfolio_date, because the portfolio's activity governs his operations.
His business-side counterpart on the other hand, sells products based on the portfolio. The products can serve to increase the portfolio, but many of the products are analytics-results from the portfolio itself. This is a product-centric view of the data. Everything about it requires the fact-table and supporting tables to be distributed on the Product_id plus being organized on the product-centric transaction_date. This aligns the logical content of the tables to the physics of the machine. It also aligns the contents of the tables with the intended query path of the user-base. One of them will enter-and-navigate via the Instrument, where the other will use the Product.
We can predict the product manager's conversation with the DBAs:
PM: "I need a version of the primary fact table in my reporting database."
I watched this conversation from a healthy distance for nearly fourteen months before the DBA acquiesced and installed the necessary table. Prior to this, the PM had been required to manufacture summary tables and an assortment of other supporting tables in lieu of the necessary fact-table. The user experience suffered immensely during this interval, many of them openly questioning whether the Netezza acquisition had been a wise choice. But once the new distribution/organize was installed, user-queries that had been running in 30 minutes now ran in 3 seconds. Queries that had taken five minutes were now sub-second. Where before only five users at a time could be hitting the machine, now twenty or more enjoyed a stellar experience.
How does making a copy of the same data make such a difference? Because it's not really a copy. When we think "copy" we think "photocopy", that is "identical". A DBA will rarely imagine that using a different distribution and organize will create a version of the table that is, in physical terms, radically different from its counterpart. They see the table logically, just a reference in the catalog.
The physics of the Netezza machine is unleashed with logical data structures that are configured to leverage the physical power of the machine. Moreover, the physical implementation must be in synergy (and harmony) with how the users intend to consume them with logical queries. In the above case, the Instrument-centric consumer had a great experience because the tables were physically configured in a manner that logically dovetailed with his query-logic intentions. The Product-centric manager however, had a less-than-stellar experience because that same table had not been physically configured to logically dovetail with his query-logic intentions. The DBA had basically misunderstood that the Netezza high-performance experience rests on the synergy between the logical queries and physical data structures.
In short, each of these managers required a purpose-built form of the data. The DBA thinks in terms of general-purpose and reuse. To him, capacity-planning is about preserving disk storage. He would never imagine that sacrificing disk storage (to build the new table) would translate into an increase in the throughput of the machine. So while the DBA is already thinking in physical terms, he believes that users only think in logical terms. Physics has always been the DBA's problem. Who do those "logical" users think they are, coming to the DBA to offer up a lecture on physics?
In this regard, what if the DBA had built-out the new table but the PM's staff had not included the new distribution key in their query? Or did not leverage the optimized zone-maps as determined by the Organize-On? The result would be the same as before: a logical query that is not leveraging the physics of the machine. At this point, adding the distribution key to the query, or adding filter attributes, is not "tuning" but "debugging". Once those are in place, we don't have to "tune" anything else. Or rather, if the data structures are right, no query tuning is necessary. If the data structures are wrong, no query tuning will matter.
And this is why the aforementioned aficionados were losing their hair. They truly believed that the tried-and-true methods for query-tuning in an Oracle/SQLServer machine would be similar in Netezza. Alas - they are not.
What does all of this mean? When a logical query is submitted to the machine, it cannot manufacture power. It can only leverage or activate the power that was pre-configured for its use. This is why "query-tuning" doesn't work so well with Netezza. I once suggested "query tuning in Netezza is like using a steering wheel to make a car go faster." The actual power is under the hood, not in the user's hands. While the user can leverage it the wrong way, the user cannot, through business-logic queries, make the machine boost by orders-of-magnitude.
Where does the developer/user/analyst need to apply their labor? They already know how they want to navigate the data, so they need to work toward a purpose-built physical implementation, using a logical model to describe and enable it. Notice the reversal of roles: the traditional method is to use a logical model to "physicalize" a database. This is because in a commodity platform (and a load-balancing engine) the physics is all horizontal and shared-everything. We can affect the query turnaround time using logical query statements because we can use hints and such to tell the database how to behave on-demand.
We cannot tell Netezza how to "physically" behave on-demand in the same way. We can use logical query statements to leverage the physics as we have pre-configured it, but if the statement uses the tables outside of their pre-configured physics, the user will not experience the same capacity or turnaround no matter how they reconfigure or re-state the query logic.
All of this makes a case for purpose-built data models leading to purpose-built physical models, and the rejection of general-purpose data models in the Netezza machine. After all, it's a purpose-built machine quite unlike its general purpose, commodity counterparts in the marketplace. In those machines (e.g. Oracle, SQLServer) we have to engineer a purpose-built model (such as a star-schema) to overcome the physical limitations of the general-purpose platform. Why then would we move away from the general-purpose machine into a purpose-built machine, and attempt to embrace a general-purpose data model?
Could it be that the average Netezza user believes that the power of the machine gives it a magical ability to enable a general-purpose model in the way that the general-purpose machines could not? Ever see a third-normal-form model being used for reporting in a general-purpose machine? It's so ugly that they run-don't-walk toward engineering a purpose-built model, photocopying data from the general-purpose contents into the purpose-built form. No, the power of the Netezza machine doesn't give it magical abilities to overcome this problem. A third-normal form model doesn't work better in Netezza than a purpose-built model.
Enter the new solution aficionado who wants their solution to run as fast as the existing solution. They will be told, almost in reflex by the DBA that they have to make their solution work with the existing structures, even though they don't leverage physics in the way the new solution will need it. And this is the time to make a case for another purpose-built model. One that faces the new user-base with tables that are physically optimized to support that user base. Will all tables have to come over? Of course not. Will all of the data of the existing fact table(s) have to come over? Usually not, which is silver lining of the approach.
But think about this: The tables in Netezza are 4x compressed already. If we make another physical copy of the table, itself being 4x compressed, the data is still (on aggregate) 2x compressed across the two tables. That is, the data is doubled at 4x compression, so it only uses the same amount of space as the original table would have if it were only 2x compressed. In this perspective, it's still ahead of the storage -capacity curve. And in having their physics face the user base, we preserve machine capacity as well.
This is perhaps the one most-misunderstood tradeoff of requiring multiple user bases to use the same tables even though their physical form only supports one of the user-bases. And that is simply this: When we kick off queries that don't leverage the physics, we scan more data on the dataslices and we broadcast more data between the CPUs. This effectively drags the queries down and saturates the machine. The query drag might be something only experienced by the one-off user base, but left to itself the machine capacity saturation will affect all users including the ones using the primary distribution. Everyone suffers, and all for the preservation of some disk space. Trust me, if there is a question between happy and productive users versus burning some extra disk space, it's not a hard decision. Preserving disk storage in the heat of unhappy users is a bad tradeoff.
Or to make an absurd analogy, let's say we show up for work on day one, have a great day of onboarding and when we leave, we notice that our car is a little "whiney". Taking it to a shop, he tells us that someone has locked the car in first-gear and he can't fix it. We casually make this complaint the next day (it took us a little longer to get to work).
DBA: Oh sure, all new folks have their car put in first gear. It's a requirement.
The problem with working with tables that aren't physically configured as-we-intend-to-use-them is that using them will cause the machine to work much harder than it has to. Not only will our queries be slower, we can't run as many of them. And while we're running, those folks with high-gear solutions in the same machine will start to look like first-gear stuff too. The inefficiency of our work will steal energy from everyone. We cannot pretend that the machine has unlimited capacity. If our solution eats up a big portion of the capacity then there's less capacity for everyone else. Even if we use workload management, whatever we throttle the poorly-leveraged solution into will only make it worse, because if a first-gear solution needs anything, it most certainly uses more capacity than it would normally require.
Energy-loss is the real cost of a poor physical implementation. All solutions start out with a certain capacity limit (same number of CPUs, memory, disk storage) and it is important that we balance these factors to give the users the best possible experience. Throttling CPUs or disk space, or refusing to give up disk space merely to preserve disk capacity, only forestalls the inevitable. The solution's structures must be aligned with machine physics and the queries must be configured to leverage that physics.
The depiction(above) describes how the modeler's world (a logical world) in no wise intersects with the physical world, yet the physical world is what will drive the user's performance experience. The high-intensity physics of Netezza is not just something we "get for free", it is a resource we need to deliberately leverage and fiercely protect.
In the above, the "Logical data structure" is applied to the machine catalog (using query logic to create it). But once created, it doesn't have any content, so we will use more logic to push data into the physical data structure. The true test of this pairing is when we attempt to apply a logical query (top) and it uses the data structure logic/physics to access the machine physics (bottom). Can we now see why a logical query, all on its own, cannot manufacture or manipulate power? It is the physical data structure working in synergy with the logical query that unleashes Netezza's power. And this is why some discussions with modelers may require deeper communication about the necessity to leverage the deep-metal physics while we honor and protect the machine's power.
DavidBirmingham 2700043KNU Tags:  recovery twinfin disaster scalable infrastructure capacity high netezza scale archive backup 1 Comment 8,172 Views
When introducing the Netezza platform to a new environment, or even trying to leverage existing technologies to support it, very often the infrastructure admins will have a lot of questions, especially concerning backups and disaster recovery. Not the least of which are "how much" "how often" and such like. More often than not, every one our responses will be met with a common pattern, a sentence starting with the same two words:
Case in point, when we had a casual conversation with some overseers of backup technology as a precursor to "the big meeting" we almost - quite accidentally - shut down the conversation entirely. Just the mention of "billions" of rows, or speaking of the database in "Carl Sagan" scaled terms, caused them to want to scramble for budget and market surveys of technologies that were more scalable than their paltry nightly tape-backup routines. In this particular conversation, we were talking about nightly backups that were larger than the monthly backups of all of their other systems combined. Clearly we were about to pop the seams of their systems and they wanted a little runway to head off the problem in the right way.
But what is the "right way" to perform a backup where one of the tables is over 20 terabytes in size, the entire database is over 40 terabytes in size and the backup systems require two or even three weeks to extract and store this information for just one backup cycle? Quipped one admin "It takes so long to back up the system that we need to start the cycle again before the first cycle is even partially done." and another "Forget the backup. What about the restore? Will it take us three weeks to restore the data too? This seems unreasonable."
Yes it does seem unreasonable precisely because it is - quite unreasonable. As many of you may have already discovered, the Netezza platform is a change-agent, and will either transform or crush the environment where it is installed, so voracious are its processing needs and so mighty is its power to store mind-numbing quantities of information.
The aforementioned admins simply plugged their backup system into the Netezza machine, closed their eyes and flipped the switch, then helplessly watched the malaise unfold. It doesn't have to be so painful. These are very-large-scale systems that we are attempting to interface with smaller-scale systems. We might think that the backup system is the largest scaled system in our entire enclosure, but put a Netezza machine next to it and watch it scream like a little girl.
So here's the deal: No environment of this size shoiuld be handled in a manner that is logistically unworkable for the infrastructure hosting it. We can say all day that these lower-scaled technologies should work better or that Netezza should pony-up some stuff to bridge the difference, but we all know that it's not that easy. Netezza has simplified a lot of things, but simplification of things outside the Netezza machine - aren't we asking a bit much of one vendor?
To avoid pain and injury, think about the things that we need to accomplish that are daunting us, and solve the problem. The problem is not in the technology but in the largesse of the information. We would have the same problem on our home computers if we had a terabyte of data to backup onto a common 50-gig tape drive. We would need twenty tapes to store the data. The backup/restore technology works perfectly fine and reasonably well for a variety of large-scale purposes. We simply need to be creative about adapting it to the Netezza machine. Don't plug it in and hope for the best. Don't do monolithic backups. The data did not arrive in the machine in a monolithic manner so why are we trying to preserve it that way? Leave large-scale storage and retrieval to the Netezza machine and don't crush the supporting technologies with a mission they were never designed for.
Several equally viable schools of thought are in play here. What we are looking for is the most reliable one. Which one will instill the highest confidence with least complexity? The more complex a backup/restore solution becomes, the less operational confidence we have in it. If it cannot backup and restore in a reasonable time frame, we exist in a rather anxious frontier, wondering when the time will come that the restoration may be required and we put our faith in the notion that it either won't, or when it does all of the other collateral operational ssues will eclipse the importance of the restoration. In other words. future circumstance will get us off the hook. There is a better way, like a deterministic and testable means to truly backup and restore the system with high reliability and confidence.
On deck is the simplest form of the solution - another Netezza machine. Many of you already have a Disaster-Recovery machine in play. Trust me when I tell you that this should be fleshed out as a fully functional capability (discussed next) and then the need for a commodity backup/restore technology evaporates. Using another Netezza machine, especially when leveraging the Netezza-compressed information form, allows us to replicate terabytes of information in a matter of minutes. I don't have to point out that none of our secondary technologies can compete with this.
A second strategy requires a bit more thought, but it actually does leverage our current backup/restore technology in a manner that won't choke it. It won't change the fact that the restoration, while reliable, may be slow simply because moving many terabytes in and out of one of these secondary environments is inherently slow already.
A third strategy is a hybrid of the second, in that enormous SAN/NAS resources are deployed as the active storage mechanism for the data that is not on (or no longer on) the Netezza machine. This can be a very expensive proposition on its own. We know of sites that keep the data on SAN in load-ready form, and then load data on-demand to the Netezza machine, query the just-loaded data, return the result to the user and then drop the table. You may not have on-demand needs of this scale, but this shows that Netezza is ready to scale into it.
A fourth strategy is a hybrid of the first, that is we still use a Netezza machine to back up our other Netezza machine, but we use the more cost-effective Netezza High-Capacity server, which is less expensive than the common TwinFin (fewer CPUS, more disk drives) but otherwise behaves in every way identically to its more high-powered brethren. And honestly if we were to put apples-to-apples in a comparison between the cost of a big SAN plant to store these archives versus the High Capacity Server, the server wins hands-down. It's cheaper, simpler to operate, doesn't require any special adaptation and we can replicate data in terms guided by catalog metadata rather than adapting one technology to another.
So let's take these from the least viable to the most viable and compare them in context and contrast, and let the computer chips fall where they may.
Commodity backup/restore technology
If we want to leverage this, we need to understand that it cannot be used to perform monolothic operations. These are unmanageable for a lot of reasons:
To mitigate the above, we need to adapt the large-scale database to the backup technology by decoupling and downsizing the operation into manageable chunks. This is a direct application of the themes surrounding protracted data movement in any environment. The larger the data set, the more the need for checkpointed operations so that the overall event is an aggregation of smaller, manageable events. If any single event fails, it is independently restartable without having to start over. Case in point, if I have 100 steps to complete a task and they are all dependent upon one another, and the series should die on step 71, I still have 29 steps remaining that may have completed without incident, but I cannot run them without first completing step 71. This is what a monolithic backup buys us - an all-or-nothing dependency that is not manageable and I would argue is entirely artificial.
To continue this analogy, lets say that any one of these 100 steps only takes one minute. In the above, I am still 30 minutes away from completion. I arrive at 6am to find that step 71 died, and now I have to restart from step 1 and it will cost me another 100 minutes. Even if I could restart at step 71, I am still 30 minutes away from completion.
Contrast the above to a checkpointed, independent model. If we have 100 independent steps and the step 71 should die, the remaining 29 steps will still continue. We arrive at 6am to find that only one of the 100 steps died and we are only 1 minute away from full completion. The difference in the two models is very dramatic:
Monolithic means we are operationally reactive when failure occurs. The clock is ticking and we have to get things back on track and keep moving. Checkpointed means we are administratively responsive when failure occurs. We don't have to scramble to keep things going. In fact, in the above example, if step 71 should fail and the operator is notified, doesn't the operator have at least half an hour to initiate and close step 71 independently of the remaining 29 steps? Operators need breathing room, not an anxious existence.
Monolithic methods are supported de-facto by the backup/restore technology. If we want to perform a checkpointed operation, we have to adapt the backup/restore process to the physical or logical content of the information. We don't want to directly mate the backup technology itself, so we need to adapt it.
Logical Content Boundaries
This means we have to define logical content boundaries in the data. What's that? You don't have any logical content boundaries, not even for operational purposes? Well, per my constant harping on enhancing our solution data structures with operational content, such as job/audit ID and other quantities, perhaps we need to take a step back and underscore the value of these things because they exist for a variety of reasons. One of them is now upon us - the operatonal support of content-bounded backups. It is required for scalability and adaptability and is not particularly hard to apply or maintain.
A more important aspect of content boundary is the ability to identify old versus new data. If the data is carved out in manageable chunks, some will always be least-recent and some more-recent. Invariably the least-recent chunks will be identical in content no matter how many times we extract them for backup. This means we can extract/backup these only once and then focus our attention on the most active data. In a monolithic model, there is no distinction between least or most active, least or most recent. In large-scale databases, the least-recent data is the majority, so the monolithic backup is painfully redundant when it need not be.
Do we absolutely need content-bounded backups for all of our tables to function correctly? Of course not. But by applying this as a universal theme it allows us to treat all tables as part of a backup framework where all of them behave the same way. So part of this is in the capture of the data but a larger part is the operational consistency of the solution.
Many reference tables such as lookups will never grow larger and we know this. In fact, their data may remain static for many years. For the ones that are tied to the application and grow or change every day, these we will call solution tables. They are typically fed by an upstream source and are modified on a regular basis. Any of these tables can grow out of control. The reference data then represents a very low operational risk. Why then would we not simply fold the reference data into the larger body and treat all the tables the same? There is no operational penalty for it, but enormous benefit from being able to treat all tables the same way inside a common solution.
At this point, the backup/restore framework will address all of the tables the same way, but now we have the ability to leverage rules and conditions within the framework so that special handling is available if necessary. This is a common theme in large-scale processing: Handle everything as though it will grow, but accommodate exceptions with configuration rules. I'll forego this aspect for now and let' take a look at what we need in basic terms:
Setup formal archival databases. Whether these reside on the same server, or on a DR / High Capacity server (below) is immaterial. The point is that the data in the master tables will be actively rolled into these tables, which will form the backbone for all backup and restoration operations. We therefore replicate the masters (with streaming replication) into the archival stores and then at some interval perform the backup of the archives, not the master.
Foreshorten the Master Tables
Now that we have a means to define content boundaries - you did apply those boundaries right? We can now look at the database holistically for optimizations based on active data.
At one site we have a table with ninety billion records in three different fact tables spanning over ten years of information, However, the end-users and principals claimed (and we verified) that the most active data on any given day is the most recent six months. Anything prior to that, they would query perhaps once or twice a year for investigative purposes, but had not tied any reports to it.
So now we have an opportunity to get agreement from the users to shorten the master tables. This is especially necessary for those fact tables. In the end, these fact tables (above) were shortened to less than four billion rows each and these are kept trimmed on a regular basis. The original long-term data is held in another archival database that is the foundation for the backups and restores.
The above protocol, while likely easy to pull off an administer, still has a number of moving parts that the Netezza based-equivalent (later) will not have.
The only difference between the above protocol and one using a SAN-based storage mechanism, is the absence of a formal backup/restore technology. Rather the SAN is the long-term storage location and we perform the incremental extractions onto it. Rather than delete the files, we keep them.
This has significant implications for the cost of the SAN. After all, if we intend to interface this to the Netezza machine, we would not want common NAS storage because it is too slow and the vendors actively disclaim their technology from being viable for data warehousing. The primary reason is that the network and CPUs are set up for load-balancing, like a transactional database, but not bulk onload/offload of the data.
Not only will we need enough SAN to backup the environment, but to also carry fathers and grandfathers if need be (this is a policy decision). With checkpointed extracts, the father/grandfather issue is largely moot. This is because once a checkpointed extract of older data is pulled and stored, it won't be changing and capturing another one just like it has no value.
In this approach we leverage another Netezza machine like a DR server, as our backup, archival and restore foundation. It can easily hold the information quantity. The difference here is in price, of course, since a fully-functional TwinFin is more expensive than most common SAN installations. However, the High Capacity Server (below) mitigates this pricing problem while delivering a consistent data experience.
One primary benefit of performing backups into the DR server is that it can automatically serve the role of a hot-swap server in case of failure in the primary server.
For this scenario to work however, we would want streaming replication between the active databases and the DR server so that the data is being reconciled while being processed. This allows us to have a fully functional hot-swap if the primary crashes, and we can continue uninterrupted while the primary is serviced. Word to the wise on this kind of scenario however, bringing back the primary means that it is out of sync, since the secondary took over for a span of time. So we would need to be able to reverse the streaming replication to make it whole.
Scenarios like this often embrace the practice of operationally swapping the two machines on defined boundaries, like once a month or once a quarter, where they actually switch roles each time. This allows the operations staff to gain confidence in the two machines as redundant to each other in every way. I have seen cases like this where the primary machine went down, the secondary machine kicked in seamlessly and all was well. I have also seen cases where the principals kept the DR server up to date but when it came time to operationally switch, some important piece (usually in the infrastructure between the devices) was missing causing the failover itself to fail. It is best to have a plan in place, but it's better to have tested the plan and that it actually works.
A word on Netezza-compressed transfer. I wrote about this in Netezza Transformation but it is important to highlight here. We performed an experiment moving half a terabyte scattered across a hundred or more tables. This data was moved from its original home to a database in another machine. The first method used simple SQL-extract into an nzLoad component. This process took over an hour. The second method used transient external tables with compression, coupled with an nzload in compressed mode. The entire transfer took less than six minutes. This was because the compressed form of the data was already 14x compressed.
In other experiment using over 20x compression for the data, we were able to transfer ten terabytes in less than an hour. This kind of data transfer speaks well for the streaming replication necessary for DR server operation (above) but underscores the fact that even when transferring between Netezza machines, it's as though we haven't left the machine at all.
Netezza-based High-Capacity Server
This option is simply a form of the Netezza-based hybrid (above) but on a dedicated server designed to support backup and recovery.
The better part about this server is that is has more disk drives and fewer CPUs, making it far more cost effective for storage than common SAN devices. Couple this with the minimal overhead required for transferring data between machines, and the ability to surgically control the content with the content-boundaries and catalog metadata, and we get the best of all worlds with this device.
Not only that, but it is also scalable to support storage of all other Netezza devices in the shop as well as any non-Netezza device where we simply want to capture structured information for archival purposes. The High Capacity server is queryable also, meaning that even the ad-hoc folks will find some value in keeping the data online and available.
Lastly, in Spring 2010, as part of the safe-harbor presentation one of the principals at IBM Netezza announced plans for a replication server. I can only imagine that this device will deliver us from any additional hiccups associated with streaming replication that we might now be doing in script or other utility control language.
At Brightlight, our data integration framework (nzDIF) has the nz_migrate techniques built directly into the flow substrate of the processing controller, as well as the enforcement and maintenance of the aforementioned content boundaries. We are actively acquiring and applying best, most scalable and simplified approaches as a solution framework firmly lashed to a purpose-built machine. I am a big proponent of encouraging Enzees to take on these things themselves, or at least let us coach you on how to make it happen. The solutions are simple because the Netezza platform itself is simplified in its operation. Stand on the shoulders of genius - the air is good up here.
DavidBirmingham 2700043KNU Tags:  performance transformation elt netezza sql etl transform 3,474 Views
I have noted in prior posts several of the "banes" of in-the-box data processing, not the least of which is harnessing the mechanics and nuances of the SQL statement itself. After all, the engine of in-the-box is a series of insert/select SQL statements. I've also noted that we need to squeeze the latency out of the inter-query handoff and management. These are important factors for efficiency, scalability and adaptability.
But this article deals primarily with "adaptive" SQL, that is, the ability to surgically and dynamically control the SQL, the paths of flow between SQL statements, their timings and the ability to conditionally execute them.
I am drawing a contrast between this approach and the common "wired" ETL application. In the wired application of an ETL tool, all components are known and flow-paths predefined. If we want to shut off a particular component or flow, we'd better make that decision at startup because we won't get to do this later. A benefit here is that if we add or change a flow-path, the ETL tool's dependency analysis will (usually) detect it and give it a thumbs-up or thumbs-down. We can (and do) perform this kind of design-time analysis, but what of dynamic run-time analysis?
Case in point: One group performs trickle-feed of data from a change-data-capture, so on any given loadng cycle, we don't know which files will show up. Not to worry in an ETL tool, since we would just build a separate mini-app to deal with the issues. The mini-app would key on the arrival of a specific file, process the file and present results to the database. This is a very typical implementation. But with hundreds of potential files, it's also logistically very daunting and hard to get the various streams to inter-operate. In fact, an ETL tool quickly reduces to "sphagetti-graphics" and the graphical user interface is just in-our-way at that point.
Case in point: One group has multiple query paths/flows where sql statements build one-to-the-next for the final outcome. These can follow a wide range of paths not unlike a labyrinth depending on a variety of different factors. The problem is, these factors aren't known until run-time and only appear in fleeting form as the data is processed. How do we capture these elements and use them as steering logic? In an ETL tool, our options are limited to none. In this particular case, three primary paths of logic were available each time the flows ran. Sometimes all three paths ran end-to-end. Sometimes only one, or two would run, or perhaps none-at-all. The starting conditions and unfolding data conditions determined the execution path.
But we have another name for this don't we? Isn't this just plain vanilla "computer programming"? Where the data shows up and we use the encountered-data and encountered-conditions to guide the IF-THEN-ELSE logic to conclusion? The problem you see, is that we are so accustomed to using IF-THEN-ELSE at the ROW/COLUMN level, we cannot imagine what this would look like at the SET level. Ahh, the conditional logic driving SETS is unique and distinct from that which drives basic elements. But then again, we can only scale in sets, not the basic elements. THis is where the dynamic nature of conditional-sets is invaluable.
But this isn't really about conditional sets, either. Only that conditional sets are a necessary capability and we have to account for them along with many other subtle nuances. Let's follow:
We have an external file and we load this into an intermediate/staging table (TABLE-A) in preparation for processing.
Now we build another target intermediate table (TABLE-B) and an insert/select statement to move / shape the data logically and physically from TABLE-A to TABLE-B.
From here we have several more similar operations, so we build intermediate tables for their results as well, such as TABLE-C, TABLE-D and TABLE-E
TABLE-A >>> TABLE-B >>> TABLE-C >>> TABLE-D >>> TABLE-E
Now let's say we have another chain of work starting from TABLE-V:
TABLE-V >>> TABLE-W >>> TABLE-X >>> TABLE-Y >>> TABLE-Z
Now something interesting happens, in that the developers sense a pattern that allows them to reuse certain logic if they only put these quantities into a couple of working tables, which we will call TABLE-G and TABLE-H, and now the flows look like this:
TABLE-A >>> TABLE-B >>>>>>>>TABLE-C >>> TABLE-D >>> TABLE-E
TABLE-V >>> TABLE-W >>>>>>>>TABLE-X >>> TABLE-Y >>> TABLE-Z
Notice how TABLE-G is feeding TABLE-C and TABLE-H is feeding TABLE-X, so that each of them have a 2-table dependency.
Now we get to the end of the chain of work and learn that TABLE-Z has to leverage some data in TABLE-C! We don't want to rebuild TABLE-C just for TABLE-Z, but in an ETL Tool this data would be bound/locked inside a flow. We could redirect the flow to TABLE-Z, unless the flow to TABLE-Z is entirely conditional and we don't know it until we encounter TABLE-C. What if, for example, the results of TABLE-C are conditional and if the condition is realized, none of the components following TABLE-C are executed. However, we could have TABLE-Z see this absence as acceptable and continue on.
Okay, that's a lot of stuff that might have your head spinning about now, but the simplicity in resolving the above is already in our hands. In any flow model, upstream components essentially have a "parent" relationship to a downstream "child" component. This parent-child relationship pervades flows (and especially trees) and as we can readily see, the above chain-of-events looks a lot like a tree (more so than a flow).
More importantly, each node of the tree is a checkpointed stop. We must build the intermediate table, process data into it and move on, but once we persist the data, we have a checkpointed operation. This is why it behaves so beautifully as a flow and a tree.
Now let's say over the course of SDLC (regular maintenance), that a developer needs to add some more operations and connect other existing operations to their results. This is essentially just introducing new source tables in the where/join clause, but the table as to exist. In short, if we add a new table to the logic of TABLE-X, it will now be dependent upon its original tables and the new ones. (Its query will break if they are not present at run time).
It is easy enough (honestly) to perform a quick dependency-check over all of our queries to make sure that their various source tables are accounted for. In other words, an operation actually exists that will produce the table. What if we picked the wrong table or even misspelled it? At run-time we would know, but we would rather know before execution because it's a design-time issue. This may verify that logically we have a plan to create the dependent table, but it does not deal with the simple fact that conditional circumstances may forego the physical instantiation of the table. Transforms ultimately do not operate on intent, but on the presence of physical assets.
As another nuance, this creates a disparity between the design-time flow of data, and the run-time flow of data. If the run-time is governed (e.g. ETL tools) so that the dependencies and conditions are all evaluated at the start of the application, the design-time and run-time are more easily mated for review by an auditor or analyst. But if any part of it is dynamically conditional, we can see how this could practically nullify the design-time form of the flow. They would simply say, "I know what the flow would do by design, but I want to see what it actually did at run time, because the data isn't matching up". Aha - so "intent" counts for design review, but "intent" is not what puts physical data into the tables. Operational processes do that.
As noted above with the necessity for conditionality and reduction of inter-transform latency, we now have a need to weave together at run-time what the flows will actually do. The "source tables" for a given transform are found in the where/join phrase and these had better be present when the SQL launches or it will be a short ride indeed.
And now, what you did not expect - one of the most powerful ways to use a Netezza machine is to forego the "serialization" of these flows and allow them to launch asynchronously. We can certainly throttle how many are "live" at at time, but if any or all of them can launch independently, how on earth are we supposed to manage the case where one or two of them really are dependent on another one or more? Do we put these in a separate flow? Do we really want our developers to have to remember that if they put an additional dependency in a transform that they have to regard whether that preceding transform has actually executed successfully?
So that's the real trick, isn't it? If I have forty transforms and all of them could run asynchronously except for about ten of them, that can only run after their predecessor completes, I have several options to see to it that these secondary operations do not fail (because their predecessor has not executed yet).
I can serialize them in by putting them into separate flows (or branches). One of them kicks off and runs to completion while the next one waits. This is logically consistent but also inefficient. If those secondary transforms are co-located with the original set, the optimizer can run them when there is bandwidth rather than waiting until the end. It is also logistically unwieldy because a developer has to remember to that if a transform should gain a dependency, it has to be moved to the second flow.
I can fully serialize them into a list, but this is the most inefficient since it "boxcars" the transforms and does not leverage the extra machine cycles we could have used to shrink the duration.
I can link them via their target table and source table, such that this relationship is dynamically identified and the flow path dynamically realized. If a given transform does not run (conditional failure) or simply fails to execute, the dependency breakage is dynamically known. What does this do? What if a given transform is supposed to use an incoming (intake) table if it is present (data was loaded) otherwise use a target-table's contents (e.g. trickle-feed, change-data-capture problem). This allows the transform to do its work with consistency but also have the ability to dynamically change its sources based on availability.
Now, we know ETL tools don't do this. Other tools may attempt to rise to this level of dynamic pathing, but the bottom line is that if those tools don't provide this kind of latency-reduction, high-throughput, dynamically adaptable model, they will not be able to leverage the full bandwidth of the machine. Trust me on this - the difference is between using 90 percent of the machine or only 10 percent at a time. That machine packs the virtual joules to make it happen, so let's make it happen.
When we originally developed our framework to wrap around some of these necessary functions, we had not considered these nuances of dynamic interdependence and frankly, ELT was so new that it didn't really matter. The overhead to execute "raw" SQL was zero, but we could not effectively parallelize/async the queries without losing control. Running async chains of transforms necessitated detailed control, but nobody had a decent algorithm for it, so once again Brightlight had to pioneer this capability. Our architecture allowed us to easily integrate these things into the substrate of the framework as a transparent function. This is the primary benefit of a framework, that the developers can continue to build their applications without disruption, but we can upgrade and enhance the framework to provide stronger and deeper functionality. Whether our framework is right for all applications is not the issue, but whether the complete implementation is right for Netezza. It's a powerful machine and we should not arbitrarily leave any cycles on the table.
Imagine slowly running out of steam because of latent implementation inefficiencies, then ultimately asking for a Netezza upgrade that, if the inefficiencies weren't present, the upgrade wouldn't be necessary. This has happened with more than one of our sites and rather than upgrading to all-new-hardware, we installed, converted and bought back an enormous amount of capacity. They eventually upgraded the hardware much later on, but for the right reasons.
DavidBirmingham 2700043KNU Tags:  netezza solution problem power referential enterprise 3,381 Views
A while back we bought a new house and decided that we would keep our first house for a few additional months to muse about whether we would keep it for a rental home. Well, shortly afterward was Sept 11th, 2001 and we were boxed into keeping it for quite a while longer. In all this, opportunists came out of the woodwork to take our house for a "song".
We tired of the songbirds quickly, but some of them were just bizarre. One couple looked at the house and said they would buy it if we would put $20k of upgrades into it. New this, new that. Our answer was no, we're selling "as is". But they persistent and felt that their requests were reasonable. No, I said, you can spend your money fixing it up like you want it, because I'm not spending my money to fix it up like you want it. They were confused. We weren't "playing the game" you see. Oddly, they had a realtor working with them who kept chanting, "Ask for what you want. It never hurts to ask."
Well, when you're asking the wrong questions, and it's obvious, it definitely hurts to ask. When on site with a large financial services firm, we were being examined for the replacement of their current consulting firm. Their lament: "They ask all the wrong questions." Apparently we were asking the right questions, because they gave the business to us.
Recently I walked through a demo of some things we've done for other clients. Two minutes into the demo, they started asking questions. This or that? Those or these? But nothing whatsoever to do with the core problems the solution was geared for, or the core questions that everyone asks.
It's like this: I have a list of questions on left side of the white board and another list on the right side. The questions on the right side deal with things like very-large-scale backup and high-speed recovery, disaster recovery, hot-failover between two Netezza machines, real-time replication, continuous processing, and of course, heavy-lifting processing inside-the-box.
These are things that accelerate business logic, the rapid turnaround of business rules and features, catalog-aware design-time and run-time modules, developer productivity, testing accuracy and turnaround, rapid-testing and rollout, release management and patched releases - operational integrity, a firm harness around the functional data, and radical simplification of very complex problems - in short, a high-wall of protection around a high-speed delivery model. Something that shrinks the time-to-delivery as well as shrinks operatiional processing time and protects capacity, while protecting the data itself.
I knew better than to hold my breath for a question on migration -
Whew! Whenever I give this short list to a CIO, they are usually asking in terms of having rolled out a large-scale environment with Netezza and are experiencing pain in one or more of those areas. Of course, we focus on those areas. Painkiller is not enough. You have to remove the source of pain..
How do I know that these are the questions Enzees are asking? Because they are - to me personally, to the Netezza community forums, and are the hot-topics at every Enzee Universe. Good grief, that's the short list of the lightning-round question-and-answer sessions at the Best Practice forums.
But they did not hear any of this. They were asking all of the wrong questions. The reason, methinks, is that they weren't in any pain. They felt the freedom to ask about the inconsequential because they had never experienced the consequences, or have even foreseen the consequences. In other words, I am solving problems they don't have.
Or do they? Anyone who buys a large-capacity storage devicesis by definition biting off a data management challenge. One of our clients has over 200 TB (uncompressed) on their TwinFin. They currently do their backups to another TwinFin of the same size. They would like to perform their backups offline, but no commodity backup/recovery system on the planet can handle this kind of load with any aplomb. Some claim to, of course, but have not considered the true needs of the Netezza user. It's all Texas-Sized Data Warehousin'
For example, one of the off-the-shelf products claims that once the data is offloaded, they have "hooks" allowing the user to query the offline archives. That's right, the user query will be dispatched to their proprietary engine and it will fetch the answer back for you. But wait a second. That protocol is for a transactional system. Going out onto the file system to fetch a transaction might have some value, since the only alternative is to bulk-load the entire dataset back into the device to query it, when all we wanted was one transaction.
However, for a Netezza "scanning analytic" query, potentially spanning tens of terabytes in scale, such a mechanism is no more than a child's toy. What we need is the ability to rapidly reload the data back into the machine so we can query it there, inside Netezza's MPP. More importantly, once it's back online we can join it to other tables, that's right, down on the MPP where the CPUs burn brightest.
But this is simply an example of how one vendor has solved a problem that no Netezza user is asking. Netezza users need to scan and analyze the data in-the-box. Those tools provide data access outside-the-box, without considering that having the data outside-the-box is the problem we were trying to solve in the first place!
Another issue is how the ETL tools interact with Netezza. All of them do, some better than others. Some are certainly getting better than the others. The game is on, they know that data processing is migrating back into the machine. Netezza is leading the way, and they want a piece of the action. Can you blame them?
But hold on. Is bulk-data processing in an ETL tool the same as bulk-data processing inside a Netezza box? No, not really. "Going parallel" in an ETL tool means spreading work out across CPUs for a select set of operations. Not only this, we will compete with other processes for those same CPUs. The net is, we cannot put all the CPUs on the machine to work for us. Clearly some of the ETL tools are otherwise proud of their scalability. And they should be. Add a few more CPUs to that rack and watch the ETL tool scale with it. Nothing bad about that. Capture those business rules in a point-click and off-we-go. They have spent millions of dollars tuning their performance models, pouring the brain power of their brightest people into making their product go-parallel and push that data hard. Imagine them handing off this hard-won, multi million-dollar performance capability to an upstart appliance? Hmm, no, they'll be the last who are draggged kicking and screaming into the inexorable future.
Anyone who is kicking the tires of a large-scale storage and processing device like Netezza is also in for a subtle surprise: Once you are migrated, you will unleash the creativity of you staff to add functionality that was never possible before. This will generate business, which will generate more data. The extra capacity will naturally offer confidence that all-new-business-great-and-small are not only do-able, but with strength that isn't available with other platforms. No worries, sez aye, the Netezza machine will scale with you. And you have chosen well, grasshopper.
Then one day they look around and say - have we backed-up this data recently? How ahout disaster recovery? What about archiving old data to make room for new? What about the ability to make the offline data available for the ad-hoc folks? That is, available in time for them to actually use it? These simple questions raise fear in the hearts of the mightiest of IT champions when they know it should have been asked, and applied a long time before ---- Before the locomotive was moving at 90 miles an hour dragging 500 boxcars. Before the locomotive even left the station. The sheer logistics of performing a backup of data this size, and this transient, is mind-numbing. I've noted that one client, hosting over 150 TB (uncompressed) naively plugged their commodity backup tool into the Netezza machine. After over-a-week of whirr-click backup activity with no end in site, someone finally said "Kill it. If it takes a week to back it up, it will take even longer to restore it". This is a wise observation, but also tells us something:
The commodity tools expect us to accept that the restore-process will be slower than the backup-process. In the Netezza world, the backup and restore can both happen faster, and take up less storage than the commodity backup tool. If the commodity backup has to offload in uncompressed form, we need to provide generous workspace for it. If it's a Netezza-compressed backup, we only need to provide for the amount relevant to our compression ratio. Some sites get a 16x, others get almost twice that. The mileage varies because of the nature and compressibility of the data. Either way, offload is fast because it comes right off the disk without passing through the host. Likewise the reload, directly to the disk without the host. In a traditional RDBMS, offload/reload has to pass through the host to get on and off the device. For bulk analytic data, the missing middle-man is just the ticket for rapid-reload.
But they weren't asking this question either. The problems we had already solved and were bringing to the table as Netezza-centric solutions, the bread-and-butter, core-mission capabilities that people ask for all the time, wasn't even on their radar. The disconnect is simple: They have read white-papers on what people are doing after they get the back-end squared-away. The nice-to-haves. The critical parts, you know, the failure points that could mean the demise of the business, or perhaps their own paychecks? Not even on the table.
This is akin to someone about to engage in the construction of a large cargo ship. To be sure, some folks are concerned about the utility of the ship's interior. But when putting pen to paper, which is more important, that the break rooms have contemporary wallpaper, or that the ship can master tempestuous seas and high swells? Methinks the crew of said ship couldn't care less about the amenities if they got wind that the architects gave short shrift to things like hull stress and multiple watertight compartments. You know, things to keep the ship from being claimed by the sea when stress is high. Boy, those light fixtures sure look cool, don't they?
Back to basics. Netezza is an appliance. It can perform as a load-and-query device just like all the other load-and-query devices. A primary differentiator, one that more customers are experiencing all the time, is that Netezza is a powerful data processing platform. When leveraged this way, it also becomes a problem-solving platform. We simply wrapped some additional logic around these core capabilities in order to harness the logistics of multiple sequential (or asynchronous) queries being fired off to the device, managing its workload, intermediate tables and whatnot. The appliance makes such an endeavor simple, and further simplifies the user's interaction with the machine for purposes of pattern-based utilization. Change-data-capture, referential integrity checking etc are all far more effective inside the machine than outside the machine.
For the shops that would rather master these aspects in the ETL tool, hey, no harm done. Lots of people do it that way. But they all eventually get to the same point in the game: the ETL tool runs out of gas. Or to give it more gas will require a significantly larger power-plant. They then realize that all those CPUs in the Netezza machine are just sitting there, doing nothing most of the time. Enough CPUs respond to peak, which is about five percent of the time. The remaining 95 percent of time, the box is running less than half capacity with a big chunk of that completely idle. Wouldn't it be great to get a handle on all that power? Recover it as part of an ongoing capacity model?
The most important part of this approach is to decide you want to do it. Lots of options naturally come your way when you try to get creative in the direction of power- because power drives the creatviity further. Ideas are given wings that can fly higher and farther than any other traditional RDBMS could even dream about. Managers start having ideas too. By golly if that machine can do this, then why not that? And why not, indeed?
In fact, most of the time when dealing with a standard RDBMS, the managers will ask why-not? in the sincerest of spirits. Then forth from the mouths of IT come the exact, precise reasons why-not. They are good reasons, logical reasons, and effectively put a wide moat around the management's idea-factories. Soon their ideas fade. They forget how good the ideas were. They will never see the light of day. The underpowered nature of the machines constrain them.
Contrasted with the Netezza machine, the question of why-not is more rhetorical in nature. The person asking the question is not expecting an answer either, but not for the same reasons as the manager above. He's not expecting an answer because the IT folks are already on it. His answer to the question will not be verbal, but will be positively expressed in real functional terms. Likely in a very short period of time.
Why-not? becomes more of a water-cooler phrase. Almost like, "I'm trying do decide whether we should eat at the club or eat at the diner. How about today we eat at the club?" To which the respondent says - "why not?" This is a very different and rarefied existence than the one borne on the constraints of an underpowered machine.
Once you gather a head of steam on all this, you realize that not only is Netezza an enabler for large-scale, complex problem solving, it provides the impetus for us to construct a problem-solving platform upon it. The ability to capture larger patterns of set-based processing, express them in simpler terms, and have them available to all - as capabilities that leverage the capabilites of the Netezza machine.
In your own domain, if you have a Netezza machine in-house and you're using it for data processing, even if it's the most inefficient model on the planet, it still beats the socks off the plain-vanilla ETL tool's work for the same operations. If you have coupled an ETL tool with this approach and are getting the gains out of it, even though this process may have initially been painful, you have answered the question with the right answers. The tools may not be quite up to speed, but that's okay. Your compass has not betrayed you. Stay the course and the benefits will be magnanimous indeed.
And for those who continue to use the machine as a load-and query device, and have not forayed into the radically rewarding realm of ELT and data processing inside the machine, there is only one question to ask, and it's the right question:
Tossing around the term "Big Data" these days seems to elicit a wide variety of feedback, concern, conjecture, etc. Those in the "old school" of Big Data wrestled with billions of rows of structured data. The buzz now is for unstructured Big Data, and for folks in this zone, it's as though the "other" form of Big Data never existed. After all, they were never exposed to it. Their introduction to Big Data, as though it was brand new, was big "unstructured" data. I recently posted out on Linked-In a tongue-in cheek rendition of the branding problem Big Data is having. I have received so many emails on it that I thought it deserved a little more exposure, if only for pure entertainment purposes. Here we go:
Big (unstructured) Data was so-dubbed for lack of a better word. Unfortunately it has suffered the same ambiguity as Kleenex (don't we ask for a Kleenex when we mean any-old-tissue?) or Coke (don't kids ask for a Coke when they really mean any-old-soda, and didn't Coca-Cola have to go on a market scourge to avoid losing the meaning of the word? If you ask for a "Coke" the person at the counter is required to correct your choice if they don't serve it) and don't we use the word "cellophane" to mean - oh wait - cellophane really is a distinct product that never protected the meaning of its name, so it really is lost.
In a Netezza shop experiencing some performance stress with their machine, we ask the usual questions as to the machine's configuration, its functional mission. Ultimately we pop-the-hood to find that the data structures and the queries are not in harmony. For starters, the structures don't look like Netezza structures, at least, not optimized for Netezza. We receive feedback that they "just" moved the data from their former (favorite technology here) and ran-with-it. They received the usual 10x boost as a door-prize and thought they were done. Lurking in their solution however, were latent inefficiencies that were causing the machine to work 10x to 20x harder to achieve the same outcome. And their queries were likewise 20x inefficient in how they leveraged the data structures.
More unfortunately, the power of the machine was masking this inefficiency. It's like the old adage, when a person first starts day-trading on the stock exchange, the worst thing that can happen to them is that they are successful. Why? It put a false sense of security in their minds that gives them permission to take risks they would never take if they knew the real rules of the game. The 10x-boost for moving the data over is a "for-free" door prize not the go-to configuration.
What are the real rules of the Netezza game? The first rule is that extraordinary power masks sloppy work. Netezza can make an ugly duckling look like a swan without actually being one. It can make an ugly model into a supermodel without the necessary adult beverages to assist the transformation. It can make sloppy queries look like something even Mary Poppins would approve of, practically perfect in every way and all that.
What's lurking under-the-hood is nothing short of a parasitic relationship between the model, the queries and the machine. We received the 10x boost door-prize and think we have succeeded. But we have only succeeded in instantiating the model and its data into the machine. We have not succeeded in leveraging the entire machine. And, uh, we paid for the entire machine. So why aren't we using it?
In our old environment, the index structures worked behind-the-scenes, transparently assisting each join. Our BI environment is set up to leverage those joins so we get good response times. The Netezza machine has no indexes so the BI queries (whether we want to admit it or not) are improperly structured to take advantage of the machine's physics.
"But that's how we've always done it..." or "But we don't do it that way..."
The short version, the former solution and (favorite technology here) is casting a long shadow across the raised floor onto the Netezza machine. People are forklifting "what they know" to the new machine when very little of it applies.
For example, in a star schema, the index structures are the primary performance center. The query will filter the dimensions first, gather indexes from the participating dimensions and then use these to attack the fact table. The engine does all this transparently. The result is a fast turnaround born on index-level performance. These are software-powered constructs in a general-purpose engine. The original concept of the star-schema was borne on the necessity of a model that could overcome the performance weaknesses of its host platform. It is in fact an answer to the lack-of-power of commodity platforms. In short, just by configuring and loading a star schema on a commodity platform, we get boost from using it over a more common 3NF schema.
The Netezza machine doesn't have indexes. So the common understanding of how a star-schema works doesn't apply. At all. Don't get me wrong, the star-schema has a lot of functional elegance and utility. It does not however, inherently provide any form of performance boost for queries using it. It can simplify the consumer experience and certainly ease maintenance, but it is not inherently more performant than any other model. In fact, using such a model by default could hinder performance.
Why is this?
The primary performance boosters in Netezza are the distribution and the zone map. Where the distribution and co-location preserve resources so that more queries can run simultaneously with high throughput, zone maps boost query turnaround time. They work in synergy to increase overall throughput of the machine. How does installing a star-schema inherently optimize such things? It doesn't.
Can we use a star-schema? Sure, and we should also commit to distributing the fact table on the same key as the most-active or largest dimension (they are often one-and-the-same). This will preserve concurrency for the largest majority of queries. A better approach however, is to specifically formulate a useful dimensional model that leverages the same distribution key for all participating tables. Common star-schemas do not do this by default, and if only two tables are distributed on the same key, all other joins to the other tables will be less performant. They will have to "broadcast" the dimensional data to the fact table. Clearly having all tables distributed on the same key will preserve concurrency, but this doesn't give us the monster-boost we're looking for. Distrubution might get us up to 2x past the door-prize performance we get from moving to the machine. Zone maps are notorious for getting us 100x and 1000x boost.
At one site I watched as several analytic operations remanufactured the star-schema data into several other useful structures, each of which was distributed on a common key. At the end of the operation, these quere joined in co-located manner and the final result came back in orders-of-magnitude faster than the same query on the master tables. I asked where they had derived the key, and they explained that it was a composite key that they had reformulated into a single key because their dimensional tables could all be distributed on it and maintain the same logical relationship. Looking over the table structures, they had a "flavor" of a star schema but certainly not a purist star. The question remained, if the existing star schema wasn't useful to them but their reformulated structure was, why weren't they using the reformulated one as the primary model and ditch the old one? The answer was simple, in that the existing star was seen as a general-purpose model and not to be outfitted or tuned for a specific user group. This is one of the commodity/general-purpose lines-of-thought that must be buried before entering the Netezza realm.
This is the primary takeaway from all that: The way we make an underpowered machine work faster is is to contrive a star schema that makes the indexes work hard. We forget that the star schema is a performance contrivance in this regard. If we attempt to move this model to the Netezza machine because "it's what we do" then we may experience performance difficulties rather than a boost. A common theme exists here: people do what they are knowledgeable of, what they are comfortable with, what they find easy-to-explain and do not naturally push-the-envelope for something more useful and performant.
In Netezza, the star schema has functional value but (configured wrong) is a performance liability. We can mitigate this problem by simply reformulating the star to align with the machine's physics, and by adapting our "purist" modeling practices to something more practical and adaptable. After all, many modeling practices are in place specifically because doing otherwise makes a traditional platform behave poorly. If we forklift those practices to Netezza, we participate in casting-the-long-shadown of an underperforming platform onto the Netezza machine.
We have enormous freedom in Netezza to shape the data the way we want to use it and make it consumption-ready both in content and performance. We should not move from a general-purpose platform (using a purpose-built model like a star) into a purpose-built platform with a general-purpose model like a star. The odd part is that the star is an anomaly in a load-balancing, traditional database, but is seen as purpose-built for that platform. Exactly the opposite is true in Netezza. The machine is purpose-built and the star is only another general-purpose model that doesn't work as well as a model that is purpose-built for Netezza physics and for user needs.
The worst thing we can do of course is think-outside-the-box (the Netezza box). We really need to think-inside-the-box and shape the data structures and queries to get what we want. This mitigates the long-shadows. It's just a matter of adapting traditional thinking into something practical for the Netezza machine.
"We're charter members," claimed Bjorn, a tech from across-the-pond, "been working with the stuff for ages."
"Same here," asserted Jack, a DBA from Kansas, "You'd be surprised how fast this thing scoots around in first gear."
"Exactly," said Bjorn, "Those who master the roads in first gear are actually better off. You run at a speed that gets you around but not fast enough that you accidentally run into things."
"So it's a safe speed?" asked the interviewer.
"Indeed," said the pair, almost in unison.
"So what happens when you take it for a spin on the highway?"
The two glanced at each other, then to the interviewer, looked down or elsewhere and then one at a time eventually answered.
"Highway driving is problematic," said Jack. "We'd rather stay on the city roads."
"Yes, city roads," Bjorn agreed, "No need for the highway."
The Director of Integration watched the interview on video and made a smacking sound against his teeth, "Pathetic. We bought the car to drive on the highway most of the time. These guys are acting like a couple of children who are scared of the Big Bad Freeway."
"Well, it has too many lanes in it," smirked his assistant, "If they go on the highway, they might actually get somewhere."
"What's that logo on their shirts?" the Director noted, "Zoom in on that -"
"It's FGA," said the assistant.
"Is that a misspelling? I thought the machine used an FPGA?"
"Oh no, that's correct. It's the First Gear Association. But now they call themselves the First Gear Society. It's way past a club. Now it's a philosophy. Or maybe culture is a better word."
"You mean they actually run around justifying why they stay in first gear?"
"It's almost like a discipline sir," grinned the assistant, "They have rules and protocols for their followers. It's cultish if you ask me."
"I didn't ask you."
"So what are some of these so-called protocols?"
"Well for starters, they can't stand deep-metal. It's like garlic to a vampire."
"But deep-metal is - " he sighed, "Never mind, just keep going."
"If they find themselves going too slow, they try to get the machine to go faster in first gear. If they can't they blame the machine. They think that having a powerful machine is good enough. After all, a couple-hundred horsepower under the hood looks good no matter what gear you're in. And to the other chap's point. If you put it into a higher gear, it'll go so fast that it's hard to steer. They aren't very good at steering, so - "
"So it's a fear thing?"
"Not so much. They like rollercoaster rides, judging from some of their project schedules and deliverables anyhow. They like the whole livin-on-the-edge brinksmanship. As long as someone else is driving."
"Ahh, so just afraid to punch the accelerator themselves, eh?"
"They'll punch it, just in first gear. It's almost like any gears higher than that are a fearful thing."
"How do we get them into the deep-metal? It can't be that hard."
"Oh they go into the deep-metal on expeditions and stuff. They like popping the hood and exploring the architecture. But when they get behind the wheel, it's a first-gear-only driving experience.""Perfect."
"So tell me what to look for in a First-Gear-Society member. What can I expect?"
"Here's a short list," the assistant accessed his handheld and pulled an image onto its screen, showing it to the Director.
"Let me see if I understand this -"
Common symptoms are:
"We were doing just fine until a week or so ago, then everything started to slow down. Jobs are running a lot slower, we're missing all of our deadlines. Has the hardware worn out?"
"We just added some functionality to the solution and two days later the whole machine started to drag. We backed out the solution but the machine is still dragging. I think we broke something. There's a problem with the hardware."
"The machine is having a hardware problem. We've been running these applications for years and then one day everything slowed down. We upgraded to a bigger machine and we didn't get any lift, but a few days later things got worse. A lot worse. This is clearly a hardware problem."
"Wow, they think it's the hardware when they haven't even taken it out of first gear. Amazing."
"Here's the punch line - they honestly believe that they can get more power from the machine if they learn to steer it better."
"No kidding. Take a look at these quotes:"
"We reworked the logic in the queries to tighten things down. Every time we change the query, we get a few percentage points difference in performance, but we need a multiplied boost, like 10x or better."
"We put in summary tables to make things go faster but it didn't help. We have query engineers tightening down the queries so that the join logic is efficient."
"All of our query-tuning has failed. We are at a standstill."
"See all that? They really believe that they can make it go faster from the steering wheel. Like the machine they are sitting in is not the source of power."
"How do we convince them to take it out of first gear?"
"Not sure that there is a way, sir. They are so accustomed to believing that the steering wheel is the place to affect performance that they haven't bothered to examine where the power-plant actually resides."
"Under the hood."
"That's right, the power is in the hardware. The steering wheel can only direct the hardware to the right location, but cannot affect power."
"Well, they can certainly steer it poorly."
"Oh sure, like driving it through mud and all that. But that brings up another issue. Some of them like an extreme challenge, so rather than taking it out of first gear, they try to make the machine go faster by making the gears grind."
"Sure, a grinding gear just seems like it's working hard."
"But they don't see that the friction of the grinding is slowing them down?"
"Not in the least. Look here:"
"We built our tables to provide fast back-end data processing, but the distribution we use for back-end processing doesn't work so well for reporting. Every time they issue a query, it's terribly slow. If we copy the larger tables to another distribution, it makes back-end processing slow but the reporting fast. We can't seem to win here. So we're going to keep the distribution for the back-end and then find a way to build something else for the front end. They want us to make two versions of the big tables, each on different distributions. The experts tell us that this is common and the best thing to do, but it sounds crazy to us."
"See that? They don't even listen to the experts. They would rather grind the gears."
"But all they are saving is disk space. I mean, seriously?"
"Old ways die hard. They would rather preserve disk space and incur the wrath of the users."
"Or preserve disk space and then build-out a convoluted set of summary tables that is twice as hard to maintain."
"Yep, the simple, dumb photocopy of data into another distribution is just too simple. They have a need to engineer."
"Engineering in first gear is not really engineering. It's more like polishing the stick-shift."
"Here's another quote:"
"We have a lot of complex views, and those views join to other views. I've seen EXPLAIN plans for this hit fifty to one-hundred snippets deep. Netezza returns this stuff fast all the time. Only now it doesn't anymore.
"Ahh, so their underpinning tables finally reached a tipping point."
"Yeah, this is funny. They add a ten pound weight to the trunk of the car every day and three hundred days later, and 3000 pounds heavier, the car is starting to run slow. Imagine that."
"Running slow because they have it in first gear."
"Well of course, if they were to use a higher gear, all that weight would be practically weightless."
"Funny how the density of deep metal is a lot like anti-gravity."
"Well, the competition sees these things happening and they say, you know, with that particular vehicle, the more weight you add, the slower it gets. It's just the nature of the machine."
"But that's not true on deep-metal."
"Not at all. We know folks who have tons and tons of weight on the machine but it's light as a feather."
"They're using higher gears and the deep metal together."
"Absolutely. Nobody from the First Gear Society is allowed in their shop. They know better."
"Well that's a problem with nested views. The master query attempts to provide the filter attributes but if the nested view doesn't pass them along, the underpinning tables end up table-scanning."
"And as you know, Netezza is the best table-scanner in the business. It can scan tables, really big tables, in no time flat."
"But it's not supposed to be scanning the tables. It's supposed to be using the zone maps and filter attributes so that it doesn't have to scan the tables."
"Well, exactly, but the people in first gear don't know this. They can't really tell that their queries are getting fractionally slower with each passing day. Then one day it reaches a point of no return. Sort of like how rust slowly clogs up a pipe until one day the pipe just closes over. It doesn't happen in a single query or just because we added a new table or two. It's pervasive and pandemic across the entire implementation."
"I bet when they hear that, they go insane."
"Yeah it's because with any other machine, remediating something like this would be very hard and tedious. But with Netezza we can remediate it incrementally and get more lift with each change. It's just not hard to recover the capacity if you know what you're looking for."
"I think the bottom line is that a really powerful machine tends to hide inefficient work."
"Or like one aficionado put it: Netezza can make a really ugly data model look like a super-model, and can make really bad queries look great. They just don't realize that the ugly models and bad queries are sapping machine's strength like a giant parasite."
"Ugh, now there's a visual I can live without."
DavidBirmingham 2700043KNU Tags:  spu netezza distribution parallel skew 45 Comments 29,191 Views
One of the most significant questions to answer on the Netezza platform is that of "distribution" - what does it mean? how does it apply? and all that. If you are a Netezza aficionado, be warned that the following commentary serves as a bit of a primer for the un-initiated. However, feel free to make comments or suggestions to improve it, or pass it on if you like.
In the prior entry I offered up a simple graphic (right) as to how the Netezza architecture is laid out internally. In this depiction, each CPU is harnessing its own disk drive. In the original architecture, this was called a SPU, because the CPU was coupled with a disk drive and an FPGA (field-programmable gate array) that holds the architectural firmware "secret sauce" of the Netezza platform. The FPGA is still present and accounted for in the new architecture as the additional blade(s) on the blade server. Not to over-complicate things, so let's look at the logical and physical layout of the data itself and this will be a bit clearer.
When we define a table on the Netezza host, it logically exists in the catalog once. However, with the depicted CPU/disk combinations (right) we can see 16 of these available to store the information physically. If the table is defined with a distribution of "random", then the data will be evenly divided between all 16 of them. Let's say we have a table arriving that has 16 million records. Once loaded, each of these SPUs would be in control over 1 million records. So the table exists logically once, and physically multiple times. For a plain straight query on this configuration, any question we ask the table will initate the same query on all SPUs simultaneously. They will work on their portion of the data and return it an answer. This Single-Instruction-Multiple-Data model is a bread-and-butter power strategy of Massively Parallel computing. It is powerful because at any given moment, the answer we want is only as far away as the slowest SPU. This would mean as fast as the SPU can scan 1 million records, this is the total duration of the query. All of the SPUs will move at this speed, in parallel, so will finish at the same time.
What if we put another table on the machine that contains say, 32 million rows? (trying to keep even numbers here for clarity). The machine will load this table such that each SPU will contain 2 million rows each. If no other data is on the machine, we effectively have 3 million records per SPU loaded, but the data itself may be completely unrelated. Notice how we would not divide the data another way, like putting 16 million records on 6 of the SPUs and the other 32 million records on the remaining 10 SPUs. No, in massively parallel context, we want all the SPUs working together for every query. The more SPUs are working, the faster the query will complete.
Now some understand the Massively Parallel model to be thus: Multiple servers, each containing some domain of the information, and when a query is sent out it will find its way to the appropriate server and cause it to search its storage for the answer. This is a more stovepiped model than the Netezza architecture and will ultimately have logistics and hardware scalability as Achilles's heels, not strengths for parallel processing. Many sites are successful with such models. But they are very application-centric and not open for reuse for other unrelated applications. I noted above that the information we just loaded on the Netezza machine can be in unrelated tables, and databases and application suites, because the Netezza machine is general-purpose when it comes to handling its applicability for data processing. It is purpose-built when we talk about scale.
But let's take the 16-million-record vs 32-million record model above and assume that one of the tables is a transactional table (the 32 million) and one is a dimensional table (the 16 million). The dimensional table contains a key called "store_id" that is represented also on the transactional table such that we can join them together in various contexts. Will the Netezza machine do this, and how will it? After all, the data for the tables is spread out across 16 SPUs.
Well, we have the option of joining these "as is" or we can apply an even more interesting approach, that of using a distribution key. Here's where we need to exercise some analysis, because it appears as though the STORE_ID is what we want for a distribution, but this might skew the data. What does this mean? When we assign a distribution, the Netezza machine will then hash the distribution key into one of 16 values. Every time that particular key appears, it will always be assigned the same value and land on the same SPU. So now we can redistribute both of these random tables on STORE_ID and be certain that for say, SPU #5, all of the same store_id's for the dimension table and for the transaction table are physically co-located on the same disk. You probably see where this is going now.
Ahh, but what if we choose store_id and it turns out that a large portion of the IDs hash to a common SPU? What if we see, rather than 1 million records per SPU, we now see around 500k per SPU with one SPU radically overloaded with the remainder? This will make the queries run slow, because while all the SPUs but one will finish fast, in half the time of before, they will all be waiting on the one overworked SPU with all the extra data. This is called data skew and is detrimental to performance.
However, had we chosen a date or timestamp field, our skew may have been even worse. We might see that the data is physically distributed on the SPUs in a very even form. But when we go to query the data, we will likely use a date as part of the query, meaning that only a few SPUs, perhaps even one SPU, will actually participate in the answer while all the others sit idle. This is called process skew and is also detrimental to performance.
Those are just some things to watch out for. If we stay in the model of using business keys for distribution and using dates for zone maps (which I will save for another entry) we will usually have a great time with the data and few worries at all.
Back to the configuration on store_id's. If we now query these two tables using store_id in the query itself, Netezza will co-locate the join on the SPU and will only allow records to emit from this join if there is a match. This means that the majority of the work will be happening in the SPU itself before it ever attempts to do anything else with it. What if we additionally clamped the query by using a date-field on the transaction table? The transaction table would then be filtered on this column, further reducing the join load on the machine and returning the data even faster.
So here is where we get lifting effects where other machines experience drag. For an SMP-based machine, adding joins can make the query run slower. On a Netezza machine, joins can serve as filter-effects, limiting the data returned by the query and increasing the machine's immediate information on where-not-to-look. Likewise the date-filter is another way we tell the machine where-not-to-look. As I have noted, the more clues we can give the machine as to where-not-to-look, the faster the query will behave.
I have personally witnessed cases where adding a join to a large table actually decreases the overall query time. Adding another one further decreases it. Adding another one even further and so on - five-joins later we were getting returns in less than five seconds where the original query was taking minutes. This filter-table and filter-join effect occurs as a free artifact of the Netezza architecture.
Distribution is simply a way to lay the data out on the disks so that we get as many SPUs working for us as possible on every single query. The more SPUs are working, the more hardware is being applied to the problem. The more hardware, the more the physics is working for us. Performance is in the physics, always and forever. Performance is never in the software.
While the above is powerful for read-query lift (co-located reading) there is another power-tool called the co-located write. If we want to perform an insert-select operation, or a create-table-as-select, we need to be mindful that certain rules apply when we create these tables. For example, if we want to perform a co-located read above and drop the result into an intermediate table, it is ideal for the intermediate table to be distributed on the same key as the tables we are reading from. Why is this? Because the Netezza machne will simply read the data from one portion of the disk and ultimately land it to another portion of the same disk. Once again it never leaves the hardware in order to affect the answer. If we were to distribute the target table on another key, the data will have to leave the SPU as it finds a new SPU home to align with its distribution key. If we actually need to redistribute, this is fine. But if we don't and this is an arbitrary configuration, we can buy back a lot of power just by co-locating the reads and writes for maximum hardware involvement.
So that's distribution at a 50,000 foot level. We will look at more details later, or hop over to the Netezza Community to my blog there where some of these details have already been vetted.
It is important to keep in mind that while distribution and co-location are keys to high(er) performance on the machine, the true hero here is the SPU architecture and how it applies to the problem at hand. We have seen cases where applying a distribution alone has provided dramatic effects, such as 30x and 40x boost because of the nature of the data. This is not typical however, since the co-located reads will usually provide only a percentage boost rather than an X-level boost. This is why we often suggest that people sort of clear-their-head of jumping directly into a distribution discussion and instead put together a workable data model with a random distribution. Once loaded, profile the various key candidates to get a feel for which ones work the best and which ones do not. We have seen some users struggle with the data only because they prematurely selected a distribution key that - unbeknownst to them - had a very high skew and made the queries run too slow. This protracted their workstreams and made all kinds of things take longer than they should have.
So at inception, go simple, and then work your way up to the ideal.
On multiple distribution keys:
Many question arise on how many distribution keys to use. Keep in mind that this is as much a physical choice as a functional one. If the chosen key provides good physical distribution, then there is no reason to use more keys. More granularity in a good distribution has no value. However, a distribution key MUST be used in a join in order to achieve co-location. So if we use multiple keys, we are committing to using all of them in all joins, and this is rarely the case. I would strongly suggest that you center on a single distribution key and only move to more than one if you have high physical skew (again, a physical not functional reason). The distribution is not an index - it is a hash. In multi-key distributions, the join does not first look at one column then the next - it looks at all three at once because they are hashed together for the distribution. Joined-at-the-hip if you will.
On leaving out a distribution key:
One case had three tables joining their main keys to get a functional answer. They were all distributed on the same key, which was a higher-level of data than the keys used in the join. The developer apparently thought that because the distribution keys are declared that the machine will magically use them behind the scenes with no additional effort from us. This is not the case. If the distribution key(s) (all of them) are not mentioned in the join, the machine will not attempt co-location. In this particular case, using the higher-level key would have extended the join somewhat but would not have changed the functional answer. Simply adding this column to the join reduced that query's duration by 90 percent. So even if a particular distribution key does not "directly" participate in the functional answer, it must directly participate in the join so as to achieve co-location. And if this does not change the functional outcome, we get the performance and the right answer.
How it affects concurrency:
Many times people will ask: Why is it that the query runs fast when it's running alone, but when it's running side-by-side with another instance of itself they both slow to a crawl? This is largely due to the lack of co-location in the query. When the query cannot co-locate, it must redistribute the data across the inter-blade network fabric so that all the CPUs are privy to all the data. This quickly saturates the fabric so that when another query launches, they start fighting over fabric bandwidth not the CPU bandwidth. In fact some have noted that the box appears to be asleep because the CPUs and drives aren't doing anything during this high-crunch cycle. That's right, and it's a telltale sign that your machine's resources are bottlenecked at the fabric and not the CPU or I/O levels. By co-locating the joins, we keep the data off the fabric and down in the CPUs where it belongs, and the query will close faster and can easily co-exist with other high-intensity queries.