Two words: load balancing -
Is our chosen platform designed for set-based bulk processing, or load balancing? Both at the software and hardware levels? The load-balancing engine (and attendant SMP hardware) are simply the wrong architecture for large-scale bulk processing. There's no way to "properly" configure the wrong architecture.
Let's level-set the difference here. In an SMP-based scenario, our engineers have to carefully configure the hardware to garner the very best performance from it. We don't have that option in Netezza, because the hardware is pre-configured. Rather in Netezza, we gain power by how we organize and configure the data. We don't really have this option in an SMP-based model, because the database engine software pre-defines how we will organize the information (through index structures) and we cannot affect our fate without the indexes. Let's see the contrasts summarized:
Netezza - no indexes, no hardware config, performance is derived from data configuration
SMP machine - high hardware config, index-depedent, no ability to affect performance with data configuration
In short, the two performance tuning models are not only polar opposites, Netezza is far more adaptable and flexible because it is easier to reconfigure data than to reconfigure hardware.
I am continually impressed with the valiant attempts of various platform aficionados who assert, claim and champion the notion that "properly" configured SMP-based hardware is the the only issue in evaluating competitive performance between platforms. In short. a properly configured <name your platform here> is just as viable as the IBM Netezza platform. Just name your components and off you go.
Of course, most folks who are making these claims are not hardware aficionados at all. Now, I appreciate software folks because at heart I am one of them, but I cut my teeth as an software engineer on solutions that aligned high-powered hardware with other high-powered hardware, all the while respecting the fact that the software I was creating was actually orchestrating and controlling the interaction between these gravity-bending machines, not physically moving the "payload" as it were. Nothing, absolutely nothing could move the data faster than the hardware. We inherently know this, yet many seem to think that software products can overcome this issue by using RAM and other creative methods to accelerate the effect of software operation.
So before I launch into a more complex rant on this, a picture is worth 1000 words (at least). In Netezza Transformation I offered up some graphics for contrast and compare (and alluded to them in another blog entry here).
In the depiction (right) we have CPUs (on the top) and disk drives (on the bottom). The pipeline in between them is the
general-purpose backplane of the hardware configuration, which may include a SAN interface, optical or 10gigE networking or other connection mechanism to transfer data between the server's CPUs and the SAN's disk drives. Even if these disk drives are local on the machine containing the CPUs, this backplane is still the connector between them.
Now we will load a data file containing 100 billion rows of information, some 25 terabytes in size. This is a medium data size for big-data aficionados. The data will necessarily enter the machine from an external network connection, into the software engine (runnning on the CPUs) which will deliver the data onto the assigned location of the disk drives. Seems like a very simplistic and remedial explanation doesn't it?
Now we will query this data. Our ad-hoc and reporting users will want to scan and re-scan this information. Note how now the bottleneck is actually the hardware itself. The data must be pulled in-total from the disk drives, through this backplane and into CPU memory before it can be examined or qualifed. Even if we use index structures, the more complex the query, the more likely we will encounter a full-table-scan. How long, do we suppose, it would take for this configuration to scan 25 terabytes? (Keep in mind that all 25 terabytes has to move through the backplane).
A server-level MPP model would suggest placing two of these configurations side-by-side and coordinating them. In short, one of the server frames would contain some portion of the information while another frame contained the other. We could imagine placing multiples of these side-by-side to create an MPP grid of sorts. This is the essential secret-sauce of many Java grids and other grid-based solutions. Divide the data across a general-purpose grid and then coordinate the grid for loading and querying.
But notice how deeply we are burying the data under many layers of administered complexity. Sure, we can do this, but is it practical and sustainable? I've seen setups like this that served one application (and served it well) but it was an inaccessible island of capability that served no other masters. As general-purpose as all of its parts were, it had been purpose-built and purpose-deployed for a single solution that required the most heavy lifting at the time of its inception. Now that it is in place, other solutions around it are growing in the capacity needs to serve the grid, and none of them have access to the power within the grid. The grid becomes starved from the outside-in. No satellite solutiion can feed it or consume it at the rate it can process data, and it has no extensibility to support their processing missions.
So now we come full circle, we have a "properly configured" one-trick-pony. Over time, the expense and risk of this pony will become self-evident. Parts will break. Data will get lost. Lots of moving parts, especially general-purpose moving parts that are out-in-the-open, only increases the total failure points of the entire solution. Debugging and troubleshooting at the system level become matters of high-engineering, not simple administration. As noted above, in the environment where I cut-my-teeth, I was surrounded by these high-end engineers because it was the nature of the project. I noticed that once the project went into a packaging mode to deploy and maintain, these engineers moved-on and cast their cares on the shoulders of the junior engineers who backfilled them. This was a struggling existence for them, because the complexity of the solution did not lend itself to simple maintenance by junior engineers.
The Caractacus Potts adventure begins! You know Potts from the movie Chitty-Chitty-Bang-Bang. We thrilled to his inventions, and laughed when they did not deliver. A "simple" machine to crack eggs, cook them and deliver-up breakfast worked fine for everyone but him, delivering a plate of two uncooked eggs still in their shells. The puzzled look on his face told us he recognized the problem but did not know where to look for resolution. With so many moving parts, it could be anywhere. This is a classic outcome of "eccentric innovation" and "eccentric engineering" a.k.a "skunkworks". More importantly, the innovation only solves one problem (e.g. egg-centric breakfast) not a general-purpose solution platform.
Well, let's keep it simple- what about a simple summary report? You know, national sales data summarized to the national, regional, district and store levels? Wouldn't this require a complete scan of the table to glean all of this? In the SMP-based depiction (above) how could be expect such a scan to operate? Software would pull the data in-total from the disk drives, choking the backplane. Software would then summarize the aforementioned quantities, in memory if possible and then deliver the result. Frankly, such a report could take many hours to execute, and keep the machine busy the whole time. Even if we "grid" this, the query would swamp the grid. On a tricked-out Sun E10k, we have about 12 gb/second throughput. Putting some math to this, with 25 terabytes in tow, we could expect the table scan to complete in about 30 minutes even if the full complement of 64 processors is on board, because the solution is I/O bound, not CPU-bound (no new CPUs will make it run faster). However, in reality the software engine and all its overhead drain the energy from the machine and this query will run for hours, even if it's the only operation running on the machine. So I guess we really will need more CPUs to balance off the software drain of the engine itself. (sigh).
This is because: engines that run on SMP-based devices are inherently load-balancing engines, not bulk-processing engines. Their processes stop, negotiate and resume even if there's nothing else going on. Think of it like this: Where I live in the country, at 5am in the morning all of the traffic lights on the main road blink-yellow until about 6am. If I travel on that road before then, I can go the speed-limit for over half-an-hour before hitting the first traffic light in the next (larger) town. But if those traffic lights all operated normally, I could get stopped at each one, protracting my 30-minute journey by orders of magnitude as I wait on traffic lights even when no other traffic is present. SMP-based engines automatically thread this way, where a flow-based model does not. A load-balancing engine will force all of its processes to stop at a virtual traffic light, come up for air to make sure nothing else requires attention, then go back to work. Transactional models absolutely require this but it is anathema to bulk processing.
Now we contrast this with IBM Netezza, which is a purpose-built platform for general-purpose across all solutions requiring such a platform. We don't have a one-trick pony. This would mean (ultimately) any form of data warehouse or bulk-processing solution, but more importantly anything that requires fast-retrieval of data, especially while performing high-intensity on-demand analytics on it.
In the IBM Netezza architecture (depicted right) each CPU has a shared-nothing disk drive and its own RAM. On the original Mustang Series 10400 (with 432 of these on board) we have a machine that costs far less than the Sun E10k noted above. Likewise we could scan those 100 billion rows in less than ten minutes. It won't ever take any longer than ten minutes. If we boost the CPUs of the machine, say by adding another frame to it (200+) CPUs, it will boost the machine's speed by another 50 percent. Queries that took 6 minutes now will take less than 4 minutes. It is a deterministic/predictable model, and adding more frames to the Netezza platform is simple and inexpensive compared to the sheer labor dollars of eccentric engineering.
As for the depiction (right) it has 16 CPU/Disk combinations. To use round numbers, let's say we put 1.5 terabytes on each disk for a grand total of 24 terabytes. With this configuration, for any given query, the path to conclusion is only 1.5 terabytes worth of scan-time away. Once we initiate the query, each CPU will run independently and will scan its 1.5 terabytes. All of them will complete simultaneously, meaning that the total duration of the query was no longer than it took to scan the 1.5 terabytes (they are all scanning in parallel). Now boost this to 400 CPUs, where each one now only has about 63 gigabytes share of the load. One scan of the entire 24TB table takes no longer than the time to scan 63 GB (they are all in parallel). We can measure our disk read-speed here and get very consistent estimates of how long a query should take.
Also keep in mind that (in a prior blog entry on indexes) I noted that we can radically reduce these operations to a fraction of their total scan times. But in the example above, full summary of the data on sales-boundaries, how much is that worth if we could do the sales base on a date? Or based on a range of dates? Perhaps even comparing last-years Independence-Day sales to this years?
In an SMP-based configuration, the information engineers would suggest partitions. The partition (for an SMP engine) is an artificial performance prop that anticipates the user's query needs based on known use-cases. It bundles data (say on a date boundary) so queries against that date can be fenced by the partition boundary. The Netezza zone map, on the other hand, automatically tells the machine where to look, and where not to look, to go capture the information required by the query. No props, no use-case anticipation, just the flexibility we really need if we want to keep fire-breathing users happy without special engineering to anticipate their needs.
Zone maps allow the sales-related comparison above to arrive in mere seconds to the user's fingertips. On an SMP machine, at best, even with partitions, indexes and other performance props, will require a maxed-out power frame (all CPUs on deck) and the best anticipatory information engineering to provide a consistent experience that even hopes to compete with Netezza. Even after all that, it won't come back in seconds, and won't provide the nimble flexibility so sought-after by even the average data analyst.
The conclusion is that the overall cost of deployment, ownership and ease of maintenance for a Netezza machine utterly eclipse the potential promise of SMP-based solutions. For an analyst, all columns, tables and functions are "fair game" for query - all over the database, 24x7. A Netezza machine provides just that. On an SMP-based engine, the analyst has to agree with information engineers on their entry points and usage patterns, and these have to be engineered into the model in order to support the users. Once engineered, the solution will support only that user base. All other user bases will require their own engineering model. This is not sustainable, durable or manageable, which is why those who are steeped in it will gladly embrace a Netezza machine. Value is recognized on so many levels.