"What's this?" asked the CFO of the Data Warehousing Director, now holding a fresh-off-the-press request for more hardware.
"We've hit capacity," said the Director with a sigh, lowering himself wearily into a padded chair.
"What? That machine is barely a year old! And just last month you claimed it would be two years before we would need more capacity. I didn't allocate any budget for hardware because you said we wouldn't need it!"
"I was wrong. We just implemented an upgrade of the application and it started hammering the machine. We didn't know we had hit capacity. We thought we had enough."
"I've had enough," said the CFO, "I want some answers before I sign off on this."
And so began three weeks of malaise, searching for the right answer: an answer that nobody was particularly trained to find. Throwing in the towel, they sought outside help and eventually uncovered some very interesting artifacts.
- Firstly, the solution using the machine had been inefficient from the beginning. The machine's power had masked the inefficiency. The new implementation only created a tipping point.
- Secondly, they discovered that of all the SQL statements hitting the machine, over eighty percent were singleton operations. Of these, over half were singleton updates, and none of them used a distribution key to support the update. This information was generally available in the query history database, if one only knew where to look. And if one had the time. And the patience.
- Thirdly, many other inefficiencies were pervasive, including the use of big-fat transforms (as opposed to more numerous, leaner transforms with intermediate tables) and the inability to see the spikes in application activity.
- Lastly, much of this activity went across machines (several machines shared the load of processing) so that finding the nefarious operations was nothing short of a submarine hunt.
- Several of the large tables had been distributed on keys that were more suitable for zone maps than distribution, meaning that the vast majority of queries would only leverage one or two CPUs out of hundreds. Sort of like trying to pull liquid cement through a soda straw. This is process skew rather than data skew and is sometimes difficult to diagnose.
- Couple the above with the lack of proper controls for update/delete operations (resulting in an enormous count of aborted queries and their attendant instabilities) and the pernicious growth of unrecovered space, and we have hidden inefficiencies that require a different level of resolution and investigation.
- The cost to remediate these problems was not as substantial as a wholesale upgrade, but it could have been avoided, or at least mitigated, had the principals been aware of, or had visibility into, the lurking dangers and inefficiencies that were draining the lifeblood from the systems like parasitic insects on an unsuspecting host.
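The distribution-key findings above can be sketched in miniature. The following Python toy is purely illustrative: the slice count, table size and column names are hypothetical, and the hash is a stand-in for the appliance's actual distribution hash. It shows why a singleton update that carries the distribution key can be routed to a single data slice, why one that omits it engages every slice, and why distributing on a low-cardinality, zone-map-style column such as a date concentrates all the work on a handful of slices: the process skew described above.

```python
import hashlib

NUM_SLICES = 96  # hypothetical data-slice count

def slice_for(key):
    """Hash-distribute a key value to a data slice (stand-in for
    the appliance's real distribution hash)."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_SLICES

# A hypothetical table distributed on customer_id: rows spread evenly.
rows = [{"customer_id": i, "order_date": "2011-08-%02d" % ((i % 4) + 1)}
        for i in range(10_000)]

# Singleton update WITH the distribution key in the predicate:
# the work is routable to exactly one slice.
slices_hit = {slice_for(r["customer_id"]) for r in rows
              if r["customer_id"] == 4242}
print(len(slices_hit))        # 1: a single slice does the work

# Singleton update WITHOUT the distribution key: every slice
# must scan for possible matches.
slices_scanned = {slice_for(r["customer_id"]) for r in rows}
print(len(slices_scanned))    # all 96 slices engaged

# Now imagine the same table distributed on order_date instead, a
# low-cardinality column better suited to a zone map: queries land
# on only a few slices while the rest sit idle, i.e. process skew.
date_slices = {slice_for(r["order_date"]) for r in rows}
print(len(date_slices))       # at most 4 distinct slices ever used
```

The point of the sketch is that the choice of distribution key determines whether hundreds of CPUs share the work or one CPU pulls cement through the soda straw.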
Now the reader may well assert: ahh, we have enterprise tools that allow us to see the machines on the network and in the operations center. Those tools should tell us everything we want to know. Well, sure, about the Linux host of the machine, but not about the health of the various databases, usage stats, trending, and most certainly not application-level nuances like misapplied or ignored distribution keys, mangled zone maps and tables that haven't been groomed. What's that, the nzPortal could give us some of this? Sure, on a per-machine basis. But the machine admins want to deal with machine health on one level, while the information architects want a completely different, even oblique level of information that may require deep immersion into the logistics of one or more enterprise applications.
What's the real problem here? Logistical complexity. The simple fact is that as we add more technologies, applications and functionality around the machine, it becomes the fulcrum for the enterprise. Without management of the activities both in detail and in context, logistical complexity arises. We want to embrace the simplicity of management and administration, and on a purely administrative level (e.g. what the common DBA role fulfills) it's still a part-time job. But this may hide the fact that the application engineers and implementers have been given, or rather delegated, the additional responsibility of logistics. Perhaps they didn't realize that implementation and architectural logistics were now on their plate. After all, they only build applications. Deploying, operating and maintaining them has always been someone else's gig. In any other technology besides Netezza, the DBAs make it their business to know the data and its processing nuances. But in those technologies, there is no power in the machine to take us very high, so the possibility of logistical complexity is held in check by the lack of power in the machine. This constraint released, logistical complexity now becomes a very real threat.
A number of years ago these needs reached critical mass, and the ninjas of the Enzee Community who had been addressing these issues as onsite consultants finally congealed these capabilities into a platform that can not only see these trends, nuances and capacity issues, but can do so across multiple machines and applications. In the spirit of Netezza's "driving toward simplicity", this platform plumbs the depths of Netezza's more complex interior and serves up the nuggets in visualized, actionable form. It is essentially the application-administrative level of business intelligence - for the machines.
In the aforementioned "submarine hunt" the principals learned an important lesson of very-powerful machines: they can make even the most horrible implementation look stellar. By correcting the discovered issues, they reduced the machine's load by over eighty percent. Imagine recovering eighty percent of a machine's capacity just by applying some simple fixes. One day the machine is overloaded and we're looking at upgrading for non-trivial costs, and the next day the machine is barely breaching twenty percent capacity and we won't be chatting with the CFO anytime soon. Years even.
Some of you have similar stories (I've heard many of them!) and sincerely want a better way to deal with multiple machines and a wide array of applications and users in a more holistic manner, with an eye on what counts in a Netezza machine, not just what a typical database does. This takes administration to a level that simplifies and clarifies the complex in a form that intersects with the language base and nomenclature of the Enzee Universe. More importantly, the Enzee Universe is an ecosystem where many of us have found, through experience of success or pitfall, what works and what most people are asking for (while wondering why some of the leaders aren't taking the bull by the horns to solve it). Well, many members of the ecosystem have worked around the logistical issues without being privy to the expanding capabilities that address these core observational needs, codified and implemented by people in the trenches.
The inception of this observation technology was both brilliant and simple. It would faithfully gather stats from each of the machines as a summarized, transmission-safe extract and drop it to disk. Their on-site consultants could install it, kick it off and then go about their other analytic activities. At the end of a day (or two) they could examine the summaries, and these would reveal all. As scheduled operations using the machine's power to process the machine's own statistics, it has a very low footprint and can be brought under workload management. As structured extracts, it was only a matter of time before they would put a pretty face on it. Now it's a dashboard to the inner sanctum of the Netezza machines, from the people who have collectively installed and shepherded more Netezza installations into production than any other, with no close second.
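The collection pattern described here (summarize on-box, write a compact extract to disk, examine it later) can be sketched generically. Everything below is an illustrative assumption: the field names, the hourly rollup and the CSV format are hypothetical, not the product's actual design.

```python
import csv

def summarize(raw_stats):
    """Roll raw per-query records up into one summary row per hour,
    so the extract stays small and transmission-safe."""
    buckets = {}
    for rec in raw_stats:
        b = buckets.setdefault(rec["hour"], {"hour": rec["hour"],
                                             "queries": 0, "aborted": 0,
                                             "total_secs": 0.0})
        b["queries"] += 1
        b["aborted"] += 1 if rec["aborted"] else 0
        b["total_secs"] += rec["elapsed_secs"]
    return sorted(buckets.values(), key=lambda b: b["hour"])

def write_extract(summary, path):
    """Drop the summary to disk as a flat CSV extract."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["hour", "queries",
                                               "aborted", "total_secs"])
        writer.writeheader()
        writer.writerows(summary)

# Example run against fabricated stats (two queries in one hour,
# one of them aborted, plus one query in the next hour).
raw = [{"hour": "2011-08-30T11", "aborted": False, "elapsed_secs": 2.5},
       {"hour": "2011-08-30T11", "aborted": True,  "elapsed_secs": 40.0},
       {"hour": "2011-08-30T12", "aborted": False, "elapsed_secs": 1.0}]
summary = summarize(raw)
write_extract(summary, "stats_extract.csv")
print(summary[0]["queries"], summary[0]["aborted"])  # 2 1
```

A scheduler (or workload management, as the text notes) would run the rollup at low priority; the pretty face comes later, reading the extracts rather than the live system.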
The Observation Deck, from Brightlight Consulting.
Next Tuesday, August 30th is a Webinar demonstration of this technology, and I would like to cordially invite the Enzee Community (and all other interested parties!) to attend and take a deep look at what it can do, and what it could do for you. No pressure and no sales push. The need for this is obvious and we are simply demonstrating that it can be filled and supported. But the participants will be the judge as to its viability. It is the ecosystem's way, after all.
The webinar is on Tuesday, 8/30 at 11 am pacific / 2 pm eastern. You can get more details and sign up for it at: http://advancedmonitoring.eventbrite.com/