I recently did a Virtual Enzee presentation and listed the Top Ten requirements for scalable bulk data processing inside a Netezza machine.
I'll come back periodically and elaborate on them
We have a Netezza machine, so what could go wrong? I was asked a desperate question by an Enzee as to how to get more power out of their machine. After nearly two days of struggling with them I finally asked how big their machine was. It was a TwinFin-3. The answer I gave them, they clearly did not like and even sought solace on the shoulder of another. Who told them the same thing. Get a bigger box. TwinFin-3 is a dev box, not a production box.
Stress comes in many forms. Constantly changing requirements. The need for functional and physical agility. As these things increase, we need a platform that will work with us, not against us.
This means ALL aspects, not just operational ones. Everything from table maintenance to application development. AUTOMATE!
It is humorous to hear testers offer up their methods, with naive blurbs like "open the application and examine the contents". No, with billions of rows there is no such thing. We must use statistical checking that operates on sets, such as summaries, counts-of etc. No longer can be "eyeball" the data.
Likewise with runtime processes. Define a table with 200 columns and try to put an ELT query against it. 200 columns in the insert phrase, 200 entries in the select phrase, and to maintain it we have to keep them in sync with "eyeballs". No, this doesn't scale.
Oddly application developers, like those who develop using stored procedures, whip out a bunch of application-centric 'code" and when the smoke clears, they see repeatable patterns all over it. Unfortunately, they can't take the patterns anywhere because they are hard-wired.
The more architectural approach is to harness the patterns as capabilities and allow our applications to express from them. The application is then an expression of the capabilitis not the center of gravity.
Large-scale systems can have mind-numbing characteristics for the un-initiated. It is incumbent upon us to deliberately simplify their interface points to it. Simple utilities, fewer keystrokes to achieve mundane goals, automation for rote tasks..
This can be as simple as, when data arrives and has errors, we don't come to a full stop. We cordon off the error records into an adminstrative /logical status and report them for later remediation. In systems of scale, we cannot halt the processing of tens of millons/billions of records just because a few stragglers are misbehaving. The time it will take to process the data is the problem. If we are 20 minutes away from the process being complete, then we are always 20 minutes away if we have fully stopped the flow for the sake of a few records. If we allow the process to proceed with error-capture, we will close the 20 minutes and then the admins have more breathng room to fix the problem without the scrutiny or pressure of the clock.
The environment can no longer be driven by application code. It has to be driven by an architectural harness that responds and adapts to data and metadata. This is a non-trivial endeavor, of course, but entirely possible to achieve.
What does this look like? The data model is arguable the most volatile component of the solution. Changes in it can destablize a solution. We need ways and utilities to buffer ourselves from the impact of change all-the-while enabling the change. It won't do to tell the users that the data model is frozen for 6 months because we fear impact to our tightly-woven application code (e.g. stored procs)
One doesn't have to make an exclusive choice between ETL and ELT. People really want to leverage the power inside the machine but feel constrained that doing so may obviate the ETL tool. Not so - both of these technologies have a major role to play and we should balance them for the best-of-breed solution
In the VIrtual Enzee I offered several examples of templates for SQL transforms (insert-into-select-from), views (to avoid nesting) and stored procedures (to build from a template rather than editing them in a SQL tool)
Why do this? The developer puts application logic into the template. At run time, or installation time in case of the SP or View, we formulate the product from the template. This allows us to automatically include non-optional aspects like operational controls, inline status reporting and other elements that we don't want the developer to worry about, much less hand-craft on their own.
Need another bit of operational control? Add it to the template factory and don't worry about the application logic
More importantly, we can generate a template from the catalog and by definition it is tied to the catalog. It is therefore easy to compare the already-deployed templates to changes in the data model. Since 90 percent of all new columns are invariably pass-through columns, running an impact analysis like this captures over 90 percent of the issues in one shot.
One of our clients had a TwinFin 48 they were planning to use for their development phase and then cutover internally to production. I asked them to dial back the developers so that it had the effective power of a TwinFin 12. They were a bit stunned at my request until I noted: The TwinFin 12 has a lot of power for development, but a TwinFin 48 will hide bad data models and sloppy code. Lots of power can make any lousy code/data model look spectacular.
Many cases of Netezza machine under stress, upon review we find that many of their inefficient practices have been going on for years, some since the box arrived. But the machine was so powerful it masks the inefficiency, like allowing the box to eat itself from the inside out
Preserve capacity at the processing level, not by guarding the data storage level. Do not be afraid to spin off replica data structures (even large ones) just for a different distribution, if it means that the machine can close its queries faster.
We understand this as a matter of quality control. Hamburgers from a national chain should taste the same no matter where we buy them. This is not accidental. The end user data is only as good as the processes that are delivering it.
If we make it so the operators have a difficult time handling it, or the admins don't understand it, or the troubleshooters can't get things done, they will start to grouse about the quality of their existence.
On the flip side, I know folks that we radically simplfied things for, and when we showed them the various utilities they would need to keep things in order, they balked. "Do we have to know all this stuff? Why is there so much stuff to know?" And yet, we have reduced a thousand things down to one, but they cannot grasp how much more complex it would be without our having simplified it.
We know that Netezza embraces simplicity. We just have to be mindful to maintain this spirit when we build things around it.
At the functional/capability level,we need to drive operational integrity into the data itself, outfitting the tables and rows with additional columns for the sole purposes of operational control. Otherwise the functional model is pretty much out-in-the-open and we won't have a way to manage the tables in a consistent, harnessed, repeatable form.