Modified on by DavidBirmingham
I'm wont to say the Netezza technology does a bit more than suggest it will run fast - it's an explicit promise practically nailed to the cabinet.
One may well ask (I'm glad you did): "We did everything we were supposed to do, but it still runs slow. Why?"
At many sites I visit, I hear something similar, and the irony is, they aren't far from the mark. It's not like we'll have to revamp everything, or overhaul the model, or perform some massive architectural rework threatening the vast spacetime continuum -
No, it's usually a tweak here and there. Sometimes not.
Here's the deal. It's a data warehouse appliance. It encapsulates decades of tried-and-true data warehousing, making it easier than ever to roll out a fairly standardized warehouse design. Problem is, many folks who buy one have never been exposed to data warehouse principles. No wonder when we pop the hood, we don't find what we expect to find - an implemented data warehouse. What we find is an oddity.
Stay with me, it'll be painless, I promise. In the data warehouse world, we embraced a common, simple truth: Third-Normal-Form (3NF, transaction-style) schemas are lousy for reporting. They require too many joins and drag the transactional operation to its knees. Formulating optimized tables in the same machine won't help, because reporting is set-based, and transactional is entity-based, and the very engine we're running on favors the latter and disfavors the former. Hence the need to move the data off the machine and reformulate it for better performance.
The spirit in play here, is the acknowledgement that the 3NF model doesn't work, and the embracing of a model that does. This typically involves a dimensional-style model. It means the tables of the transactional model are merged, consolidated etc into fewer tables. It also means data types, data values and the like are all scrubbed to a known value, and their types aligned, to avoid any kind of on-demand drag. After all, would we rather zap a field into upper case on the back-end, or run the upper() function in the on-demand where-clause? Running it in the where-clause creates egregious drag. Zapping it in the back-end costs us nothing.
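To make the upper-case point concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in (it is not Netezza, and the table and column names are made up for illustration). Both queries return the same rows; the difference is that the first one runs upper() against every row on every query, while the second pays that cost once at load time.

```python
import sqlite3

# Hypothetical table and column names, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_raw (id INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO customer_raw VALUES (?, ?)",
    [(1, "tx"), (2, "Tx"), (3, "TX"), (4, "ny")],
)

# On-demand scrubbing: upper() is evaluated for every row, on every query.
on_demand = conn.execute(
    "SELECT id FROM customer_raw WHERE upper(state) = 'TX' ORDER BY id"
).fetchall()

# Back-end scrubbing: upper() is applied once while building the reporting table.
conn.execute(
    "CREATE TABLE customer_clean AS "
    "SELECT id, upper(state) AS state FROM customer_raw"
)
pre_scrubbed = conn.execute(
    "SELECT id FROM customer_clean WHERE state = 'TX' ORDER BY id"
).fetchall()

print(on_demand == pre_scrubbed)
```

The where-clause on the clean table is a plain comparison, which is the shape of predicate that lets the machine filter cheaply.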
But once these sorts of schemas are moved into Netezza technology, something seems to happen to the mind. One might think it's okay to take those source tables, CDC-them over to the Netezza machine, slap some reports on top of these tables, and call it a day.
Not so fast. (Oops, that's what we're trying to fix - the system that's not so fast - I make myself laugh sometimes)
Bringing the source's 3NF model into Netezza is fine for Staging. Even for a central Hub schema to consolidate all sources. What the 3NF model is notoriously poor at, is set-based operations such as analytic queries for reporting. These were poor in our source system and they are poor over here, too. They are always poor for reporting. No amount of high-powered hardware will change this.
But that's okay. If we have the data inside the machine, we're not far from finishing-up the next step. It's just a lot of people now who aren't taking the next step. They leave the original source structures "as is" and wonder why they can't get a faster query. I mean, the high-powered hardware's there, right?
Pre-integration vs re-integration
Pre-calculation vs re-calculation
Pre-formulation vs re-formulation
Pre-scrub vs re-scrub
See a pattern here?
I saw a chat between several folks describing the following configuration:
Sources -> Hadoop -> Atomic EDW -> Dimensional Analytics
They claimed that the Atomic EDW and Dimensional model were no longer necessary. We can go straight to the interface to Hadoop, and push all our queries against it. While this might be functionally true, how does it dovetail with the above assertions about pre-integration and pre-calculation?
A configuration without the Atomic EDW and Dimensional (data marts) - requires the user queries to re-calculate and re-integrate the very same data, on demand, repeatedly.
For example, if it takes three minutes to bring all the "raw" data together, what if we brought the data together and took the three-minute hit just once? Then the user queries benefit from this pre-integration because they are not experiencing integration-on-demand.
It goes like this: We walk into a bike shop and take a look at all the cool bikes, various frames and options, and pick one. Rather than take it off the rack, the owner says he will have one ready in a week. What the? Just sell me that one. No, he says, we will build it from scratch, shower it with love, have reruns of the Dick Van Dyke show playing in the background, and it's all good. You'll love it. So you come back a week later to pick up the bike, and a lackey does the transaction for you while the owner is giving the same speech to another customer. Doesn't seem very efficient, does it?
A 3NF model in Netezza practically requires a complete data rebuild every time we issue the query. The same tables are completely re-scanned, their data re-joined, their quantities re-calculated - hmm - we just did this five minutes ago when we ran the query last time, didn't we? On-demand re-integration/re-calculation is where we end up dissipating our power, and wasting our time.
If we were to take the next step with the raw data, reformulate new tables that pre-integrate and pre-calculate the necessary values, we have a consumption-ready model. When our queries hit it, the data streams from the machine faster, because all the heavy-lifting is already done. Integrate-once/consume-many.
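A small sketch of integrate-once/consume-many, again using Python's sqlite3 as a stand-in rather than Netezza. The schema, the view name and the table name are all hypothetical. The view re-joins and re-aggregates on every query; the persisted table carries the same answer with the heavy lifting already done.

```python
import sqlite3

# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, location_id INTEGER, qty INTEGER);
    CREATE TABLE locations (location_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 10, 5), (2, 10, 7), (3, 20, 1);
    INSERT INTO locations VALUES (10, 'Dallas'), (20, 'Austin');

    -- The "raw" approach: every query against the view re-joins and re-sums.
    CREATE VIEW v_location_totals AS
        SELECT l.location_id, l.name, SUM(o.qty) AS total_qty
        FROM orders o JOIN locations l ON l.location_id = o.location_id
        GROUP BY l.location_id, l.name;

    -- Integrate once: persist the view's output as a plain table.
    CREATE TABLE location_totals AS SELECT * FROM v_location_totals;
""")

query = "SELECT name, total_qty FROM {} WHERE location_id = 10"
from_view = conn.execute(query.format("v_location_totals")).fetchall()
from_table = conn.execute(query.format("location_totals")).fetchall()

print(from_view == from_table)
```

In a real deployment the persisted table would need a refresh process as source data changes; this sketch leaves that part out.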
At a site where a "raw" model like this prevailed, a very complex view was placed on the tables, joining dozens of them and performing a wide variety of work. As an experiment, we simply persisted the contents of the view without any filters. The same raw data as experienced by all on-demand queries would be pre-integrated into a single table. Where queries on the original view would take ten minutes or longer, the queries on this persisted form took under twenty seconds. We can see why - all the heavy lifting had been persisted - integrate once / consume many. This of course is not operationally practical since all those source tables were being trickle-fed every five minutes.
But the spirit of the operation remains. We should trickle-feed the inbound data to staging, and then forward the changes to a persisted location optimized for reporting.
I didn't mention that this view was used for over 200 "location" reports. Meaning that the same query ran over 200 times - that's 200 times 10 minutes - or 2000 minutes - about 33 hours if running linearly. Because this query saturated the machine with data, they couldn't run anything in parallel. Their reports were always behind. After we converted it to the 20-second model, they could run the reports in a little over an hour if running linearly. Once we applied zone maps for their dates and their LOCATION_ID, the queries became sub-second, and they could run dozens of them in parallel. The entire reporting cycle, over 200 reports, finished in less than three minutes.
Isn't this the sort of performance we were looking for? The pre-integration and the attention to machine physics reduced a 33-hour job to under 3 minutes.
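For anyone who wants to check the arithmetic, here it is spelled out in a few lines of Python. The figures come straight from the story above; the variable names are only illustrative, and the 3-minute figure is treated as an upper bound since the cycle finished in under that.

```python
# Figures taken from the story above; names are illustrative only.
reports = 200
minutes_per_report_raw = 10         # original view, per report
seconds_per_report_persisted = 20   # persisted table, per report

raw_total_minutes = reports * minutes_per_report_raw                    # 2000
raw_total_hours = raw_total_minutes / 60                                # a bit over 33
persisted_total_minutes = reports * seconds_per_report_persisted / 60   # a bit over an hour

# The final cycle ran in under 3 minutes, so dividing by 3 gives a floor on the speedup.
speedup_at_least = raw_total_minutes / 3

print(f"raw: {raw_total_hours:.1f} hours")
print(f"persisted: {persisted_total_minutes:.1f} minutes")
print(f"speedup factor exceeds {speedup_at_least:.0f}")
```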
Now can we see how not-addressing inefficiency will dissipate the machine's power rather than harness it?
A symptom of a poorly implemented model can often be found in the shape of the query. If the query is complex, has a lot of work happening on-demand, or is performing a lot of left-outer joins, it simply means the data isn't consumption-ready. If we open our BI tool and find that a lot of columns are wrapped with NVL or COALESCE into known defaults, it's time to apply those defaults to the data and not clutter the business logic with scrubbing that should have already been done.
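Here is a hedged sketch of what "apply those defaults to the data" might look like, using Python's sqlite3 as a stand-in. The table names and the default values ('UNKNOWN' and 0) are assumptions for illustration. The report query against the clean table needs no COALESCE at all.

```python
import sqlite3

# Hypothetical table names and default values, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_raw (id INTEGER, region TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO sales_raw VALUES (?, ?, ?)",
    [(1, "EAST", 5), (2, None, None)],
)

# What the BI tool ends up generating when the data is not consumption-ready.
cluttered = conn.execute(
    "SELECT id, COALESCE(region, 'UNKNOWN'), COALESCE(qty, 0) "
    "FROM sales_raw ORDER BY id"
).fetchall()

# Apply the defaults once, on the way into the reporting table.
conn.execute(
    "CREATE TABLE sales_clean AS "
    "SELECT id, COALESCE(region, 'UNKNOWN') AS region, COALESCE(qty, 0) AS qty "
    "FROM sales_raw"
)
clean = conn.execute(
    "SELECT id, region, qty FROM sales_clean ORDER BY id"
).fetchall()

print(cluttered == clean)
```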
Let's look at a query template, or rather, the structure of a query attempting to leverage these transaction-facing tables.
Select (some stuff here)
from (sub-select here (sub-select here))
Join master (columns, where-clause stuff)
Join secondary (join columns, where-clause stuff)
It's the presence of deep sub-selects, multi-nested views, and left-outer joins that is more than telltale - the data is in a raw state and not ready for mass consumption.
Our teams have remediated countless models from this malaise. Not particularly hard to do, but at all these sites, their folks were too busy to engage the problem full-time, and with the necessary confidence to get it right. Others took our advice to create "sandbox" databases, pull some data inside and formulate tables that worked for them, as a prototype. Once the prototype showed promise, they had evidence to discuss changes with the data modelers to eventually make it operational. This is exactly the process you'd see us follow on your site, because nothing works better than your own data to answer your performance questions.
By the same token, "some" of these sites knew what to do, but didn't do it. One site's leader said to me, "We went to Enzee Universe and took a lot of notes. We hired a guy to put this together and he said we didn't need any of that." Followed by a long awkward pause. I said "Is that why I'm here now?" and he said, "Exactly." He was worried I was going to tell him he had to rebuild from scratch, but we know that's never the case. Performance optimization is centered in the big-ticket, heavy-lifting tables. If we reduce the noise around those, everything starts to hum like a sewing machine.
The takeaway in all this - if performance is an issue (you don't feel you have a lot of data but the system seems stressed) the problem isn't your queries - it's your data model. Performance in Netezza is solely derived from hardware, and the hardware's physics is unlocked by optimized data structures, not efficient queries. Tuning a query in Netezza is a lot like using a steering wheel to make a car go faster. A query is logical, but performance is physical.
Get the data structures in a place where they leverage machine physics, and get the queries in a place where they unlock that physics, and cut 'em off the chain.
A "little" blurb about IBM Integrated Analytics System (nicknamed Sailfish) - IBM has had the Sailfish version of the Analytics MPP in beta for a bit, and has now announced it (today).
Our team at Sirius was privileged to get an orientation and more than a sneak peek: access to our own beta machine to give it a go. The Enzee community will be pleased about a number of things in this release.
But even at the orientation, a number of things popped out that would "take your feet off the desk" so to speak. This is no ordinary release. IBM has integrated the Db2 BLU engine with the Netezza MPP, so now we can build and issue portable queries with a common, consistent experience. This is quite an achievement and many kudos to the engineers who have participated.
Power Systems vs Blade Servers
Under the hardware covers we now have Power Systems. This matters for a lot of good reasons, the best one being horizontal elasticity. Most Enzees know whenever we want to upgrade our hardware to more power, say from one rack to two, we have to bring the two-rack into the shop, copy the data from the one-rack to the two-rack, and off-we-go.
With Sailfish, we simply add-a-frame, perform some simple configuration commands, the data rebalances to the new capacity, and off-we-go. Simple and painless. How cool is that?
In this hardware scenario, dataslices and zone maps (now synopsis tables) behave the same. Distribution and organization follow the same rules. Enzees will experience no changes in these core areas.
Db2 BLU experienced this radical speed also, but with Netezza-oriented MPPs, it's even more profound. Netezza tables are row-oriented, so the "read" operation takes a page from disk into the FPGA, where the desired columns are stripped from rows, and undesired rows are filtered from the stream.
In Columnar mode, the pages don't store rows, but columns. When we ask for certain columns, the other columns' pages aren't touched. So now we read even less of the disk itself, and even less data meets the CPU.
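A toy illustration of the difference, in plain Python. This is only a sketch of the idea, not how Db2 BLU or Netezza actually lay out pages, and the three-column table is made up. Summing one column from row storage touches every value in every row; summing it from column storage touches only that column.

```python
# Three rows of a hypothetical table with columns (id, name, amount).
rows = [(1, "a", 10.0), (2, "b", 20.0), (3, "c", 30.0)]

# Row-oriented: each row is read in full, then the unwanted columns are stripped.
row_values_read = 0
total_from_rows = 0.0
for row in rows:
    row_values_read += len(row)   # the whole row comes off "disk"
    total_from_rows += row[2]     # only the amount is actually wanted

# Column-oriented: each column is stored on its own, so we read just one of them.
ids, names, amounts = (list(col) for col in zip(*rows))
columns = {"id": ids, "name": names, "amount": amounts}
col_values_read = len(columns["amount"])
total_from_columns = sum(columns["amount"])

print(row_values_read, col_values_read, total_from_rows == total_from_columns)
```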
Oh, yeah, you heard that right. In current Netezza, an electro-mechanical drive will not only burn out, but even in operation, the read-head has to seek a page on the disk, fetch it, and multi-task with other read operations to optimize the physical read-head over the spinning disk. Even the spinning disk is optimized to carry the user data on the outer third of the disk where it spins fastest. This can serialize queries at the read-head and is a potential source of concurrency bottlenecks.
With solid-state drives, the read-time is radically faster, and memory is fair game for any process, without waiting. It's solid-state so mechanical breakdown is not even on the radar. More importantly, it reads and writes data at phenomenal speed compared to a hard drive.
Where before, the disk read speed was the number-one drag on a query, now we will likely see this re-balanced to other areas of the machine. Don't get me wrong, it still takes time to read memory, so we should not forego the use of zone maps - er - synopsis tables now - to reduce and filter the total amount we read. This is just good stewardship of the CPUs and other resources. Just because we "can" read a lot of data faster, doesn't mean we "should" - we should filter and reduce the total data arriving into the CPU for the most efficient query utility.
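For those who haven't seen the mechanism, here is a minimal sketch of the zone map idea in plain Python: keep the minimum and maximum value per page, and skip any page whose range cannot contain the values we want. The page layout and the scan function are hypothetical simplifications, not the actual Netezza or Db2 implementation.

```python
# Hypothetical pages of a single column, e.g. date keys stored in load order.
pages = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]

# The "zone map": one (min, max) pair per page, kept alongside the data.
zone_map = [(min(page), max(page)) for page in pages]

def scan(lo, hi):
    """Return matching values and the number of pages actually read."""
    hits, pages_read = [], 0
    for page, (page_min, page_max) in zip(pages, zone_map):
        if page_max < lo or page_min > hi:
            continue  # the range cannot match, so the page is never read
        pages_read += 1
        hits.extend(v for v in page if lo <= v <= hi)
    return hits, pages_read

hits, pages_read = scan(6, 7)
print(hits, pages_read)
```

Note that the skipping only pays off when similar values land on the same pages, which is why the way the data is organized matters as much as the map itself.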
What it does not have is the FPGA. IBM is taking the step to remove this hardware-filtration workhorse because the disk drives are so powerful, columnar tables are so data-efficient, and the CPUs are so much stronger, that the hardware filtration power may well be superfluous. Let's shake that tree. We're all scientists, right?
Common SQL Engine
This means SQL statement portability and consistency across various platforms, and the libraries such as SQL Toolkit, Inza, Fluid Query etc are now baked-in to the engine and available at our fingertips when we power-up.
We'll have a more consistent experience, and less guesswork for functional/operational behaviors.
And we'll experience seamless integration with the Data Science Experience (DSX), machine learning, and a bunch of other offerings IBM has already announced or is in the queue.
Ease of configuration
As we were moving through our POC, the IBM team assigned to us was amazingly helpful. They knew where all the hooks were to tweak this or tune that, because the common engine and underpinning metadata are pervasive and well-understood. Anytime we had a question or thought we'd encountered an issue, the solution was invariably a simple configuration change.
This speaks volumes for the level of effort, thought and insight poured into this release. IBM has thought-through the wide variety of priorities between these platforms and has provided the ones that matter most, seamlessly.
Expect to hear good things as this rollout proceeds.
Synopsis Tables vs Zone Maps
They essentially work the same way.
DashDb and DashDb-local have been rolled-up to Db2 Warehouse. Same SQL engine. Same built-in analytic libraries. Db2 Warehouse is not Sailfish, but we can put the Sailfish hardware under it, and voila, it has the MPP power of the Netezza Technology.
If we plan to roll out the Db2 Warehouse first and later upgrade to Sailfish hardware, we need to enable the queries and tables for this, and this simply means leveraging Synopsis Tables rather than indexes. Do this, and we'll be 90+ percent on the way to leveraging Sailfish hardware when we go there. If not, and the tables/queries are index-dependent, we'll need to review the environment (for the largest tables) to make sure we're leveraging the MPP correctly.
These are not hard to do and much of it can be automated.
Hey, I just couldn't resist!
Please note that in the following commentary, I am attempting to unpack the POTUS 2016 election analytics via the polls, not take political sides. I am one of those "independent/undecideds" that everyone complains about, partly because no particular party "represents" me. Call me a rogue!
Storytime: - Can Mind have an effect over Matter - is Telekinesis possible?
Set up a random number generator in a computer so that it generates hundreds of numbers per second, all of them integers from 0 through 99. Have it collect as many as it can in one minute. When it's done, the average of all numbers generated should hover around 50 (the averages of many such runs form a Gaussian distribution, or bell curve, centered near 50). Once this is working, set up an observation and then focus your mind on something in the computer, something imaginary even, and "declare" this object in your mind to be the CPU or the generator itself.
Now start the recording and focus your mind and think - or even say out loud - "aim high". Continue these thoughts with as much intensity as possible. Perform this observation many times to get a lot of data. When the experiment is done and the observations are collated, the majority if not all of the averages will be above the value of 50. Conversely, when it's repeated with the phrase "aim low" the final results will show a downward skew below 50.
How is this possible? Is the mind truly able to affect an electronic random number generator? Does the concentration of the mind somehow force this outcome?
When independent observers watched the researchers repeat this and collected data both for "aim high" and "aim low" - an interesting pattern emerged. At the end of the test, an observer said "I noticed you kept resetting the test until you got it calibrated, what was that all about?" "Well," said the researcher, "when I'm trying to take a reading I have to make sure everything is the same. If I see that the average isn't moving I reset and refocus." "How many times do you do this?" "Well, I don't really keep track of that."
The problem here is that the researcher was throwing out valuable information. It was showing that the return numbers really were no better than random chance but the researcher wanted to believe so much that a lack of rigor in the test protocols was a-okay. The researcher was throwing out observations that proved the hypothesis wrong and keeping those that agreed with the hypothesis.
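We can reproduce the effect without any telekinesis at all. The sketch below is a simulation under stated assumptions: 1000 numbers per run, 200 kept runs, and a "reset" rule that throws away any run whose average is not above 49.5 (the true mean of the integers 0 through 99). None of those specifics come from the original experiment; they are only there to show how quietly discarding runs manufactures an "aim high" result from pure chance.

```python
import random

def run_average(rng, n=1000):
    """Average of n uniform random integers from 0 through 99."""
    return sum(rng.randrange(100) for _ in range(n)) / n

rng = random.Random(2016)  # fixed seed so the sketch is repeatable

# Honest protocol: every run is recorded, whatever it shows.
honest = [run_average(rng) for _ in range(200)]

# "Reset and refocus" protocol: runs that are not trending high are thrown out.
biased = []
resets = 0
while len(biased) < 200:
    avg = run_average(rng)
    if avg > 49.5:
        biased.append(avg)
    else:
        resets += 1  # the discarded evidence nobody kept track of

honest_mean = sum(honest) / len(honest)
biased_mean = sum(biased) / len(biased)
print(f"honest mean {honest_mean:.2f}, biased mean {biased_mean:.2f}, resets {resets}")
```

Every kept run in the biased set sits above the true mean by construction, which is exactly the "majority if not all" result described in the story.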
So now that the election is done and the outcome is known, let's have a little fun with the analytics. You know, the pollsters. They claimed to have the answer (predictive analytics) - but most of them were dead wrong. Why was that?
Many of the news stories today start out with "The <winning candidate name here> in a surprising victory" - or "stunning upset", or "the nation changed its mind in the eleventh hour". These are all just weak ways of avoiding the admission that they got it wrong. Not only did they have it wrong in the end, they had it wrong all along.
The Electoral College complicates the analytics, so before we get started, a lot of foreigners read this blog and I have received "casual" questions in the past about the Electoral College, so here's a quick primer on it.
Ironically, the Broadway show "Hamilton" has been in post-election news, but the players in this production have all admitted that they didn't vote. The creator of the Electoral College was Alexander Hamilton, who also advocated that the POTUS be called a "king", and advocated steep immigration restrictions and even closed borders. This is ironic in the light of recent statements from the "Hamilton" cast. I wonder if they know these things about the play's namesake?
The election of a President Of The United States (POTUS) is unlike many other elections, in that it's not a popular vote. It is a collection of results of fifty-one separate elections (50 states and the District of Columbia).
There's a strong reason for this. The POTUS is elected by states that are each independently sovereign of one another. The United States is just that- a union of sovereign states, each with their own governments and laws.
In America, state sovereignty is protected in a variety of ways. The FBI cannot insinuate itself into every criminal investigation. There are strong lines of jurisdiction that determine whether it's a state or federal crime, all borne on the sovereignty of the state. The federal government can't arbitrarily send troops into a state without the governor's permission. The same is true for a variety of federal agencies. For example, when Katrina hit Louisiana, President Bush was on-site the next day with an offer to the governor to send in troops to keep order and distribute food and provisions for survivors. The governor hesitated on this offer. A day or two later the levees in New Orleans broke and stranded millions. The fact remains, the troops could not enter Louisiana without permission. The states have rights.
The Constitution protects the rights of the states to maintain their sovereignty. A popular vote for "anything federal" would violate this sovereignty and put all voters, regardless of their geography, into a common pool. A candidate that made a lot of promises to the population centers would win hands-down. Moreover, election fraud in one location affects the whole. The POTUS election is this way by design, to keep democracy at bay, so that cheating is more complicated, the sovereignty of each state is honored, and unbalanced power is not given to population centers, which would leave the majority of the country out of the race entirely.
Even as this is written, the state of California has laws that allow non-citizens to vote for local elections, and their totals are driving a "popular vote" for the loser because some 4 million votes were illegally cast for POTUS. Those haven't been excluded yet. They will likely line-up the popular vote with the electoral vote.
"Popular vote" would become a way for "big city dwellers" to tyrannize the country. Our founding fathers understood tyranny quite well, and the many forms it can take.
But what if the outcome were decided by popular vote? Consider that in states that are lopsided, like California, New York and Texas, voters for one candidate are always represented while voters for the other are not. Many voters for the other candidate don't go to the polls at all because they know it's a waste of time. But if this were a popular vote - those folks would have a vote that counted and be encouraged to go vote. Such things come into play when the votes are counted differently and would significantly affect how many additional votes are cast.
What if the candidates tie in Electoral votes or don't reach the necessary 270? The contest goes to the House of Representatives. Many people mistakenly believe that the House would cast 435 votes for President. This is not the case. The representatives of each state must "caucus" and have their own internal election, and only one vote from that state is cast for POTUS - 50 votes in all. Why? Because the states elect the President, not the people. It's a running theme.
This is a bit of a startling realization for some, that even the Father of the Constitution, James Madison, commented that "democracy is evil" - primarily because it allows the will of the many to trample the rights of the few. The founders deliberately put in place a variety of checks and balances that keep democracy at bay. As such, America is a Representative Republic, not akin to any form of democracy at all. This is confusing to some, who've been told America is a democracy and the Electoral College is "undemocratic" and if it's not democratic it must be bad.
History has proven that "the majority is often wrong" - and this was proven out in the POTUS pre-election polling - all but a few were completely wrong.
Recall that many hundreds of years ago, a chap named Galileo challenged the majority "opinion" on celestial mechanics. Oddly, the Aristotelians in the universities - all of academia - stood in opposition to him. Even the leaders of his own faith disowned him. He was a lone voice in a sea of consensus - and all they had to do was look into the telescope. In all walks of life, we often find that a small number of people "have it right" while the majority is wrong. The founders of America wisely recognized this "herd effect" of humanity and didn't want it to have any power. One of the reasons for this is that the "herd effect" is often driven by a majority of the least-informed and most-fearful.
Should population centers control the presidency? This question was asked and answered by the founding fathers in the form of the Electoral College.
This system however, radically complicates the prediction of an overall outcome. Each state's activity has to be predicted and modeled, and depends on a lot of very dynamic factors.
Even further - in the 2016 election - the media, the newspapers, the consultants and a wide range of politicians, including many politicians in the same party as the eventual winner - all claimed that the eventual winner would lose in a landslide and be the most devastating loss in history. All of them were wrong. Instead, the very opposite became reality. The naysayers were proven wrong, the "favored" candidate lost and that candidate's party was decimated.
The problem here isn't that the nation "suddenly turned at the last moment" as some have suggested, but the pollsters themselves fell ill to a common problem in these kinds of analytics. They had it wrong from the beginning and their polls failed to reflect reality. There's an obvious reason we'll get to a little later.
Here was a chance for predictive analytics to shine like no other - a time-boxed, measurable event and outcome. This is why those of us who champion analytics are so appalled by the epic-fail. It could have been a shining moment for analytics, but instead it was just a fizzle. The worst part is that people without any predictive computing also predicted what would happen in the aftermath of the election - and they were also right. No computer algorithms required. Perhaps the nature of a "learning machine" needs to be couched in terms of what is being learned and how we know it's true?
Many of us who want excellence in analytics were absolutely appalled at the abject lack of accuracy of the presidential polling. It seemed a lot like a circus and the election night seemed like their version of damage control more so than reporting an expected outcome.
After all, if the polls were scientific and on-the-money, when the polls closed and the votes were reported, the science should have already reflected the reality of the outcome. No surprises at all. Why was it such a shocker? An upset? A Cinderella story?
Think about that for a moment. We hire someone to do analytics for us, find market opportunities and reduce risk, and when we act on their carefully crafted predictions, it goes south and we lose millions of dollars. Would this be a good testimony to the accuracy of our analysts, or an abject failure? Would we ever trust them again? Would we throw out the baby with the bathwater, so to speak, and forsake the value of analytics altogether? Some have done this, and later revisited it with a bit more scrutiny (and are happy with their results now).
When it comes to our companies, our livelihoods, etc - trusting the data and the analysts is hard because it's personal. It's not like a video game where we get unlimited retries. The one failure may put us out of business. The one success may make us handsomely solvent for decades.
Let's get some things out of the way - we have a few broad reasons why they got it so wrong.
- On the dark side - they knew the results and were lying. I don't see any benefit for the pollster or the candidate in this. In fact, showing a candidate artificially high in the polls could spur the opposition voters to the polls!
- On the lighter side - they were incompetent - I can't accept that so many were off because of this - that's just me talkin'
- On the analytic side - they were victims of "confirmation bias". This explains it much better. They were sincere, but sincerely wrong. They were simply too personally invested in the outcome.
- Reuters and their partner IPSOS, recently shared that much of their polling data favored the eventual winner, including the most controversial issues - by over 83%. They don't see their failure to report these known facts as having any influence on the election. In fact, they completely changed their time-honored measurement system in a manner that favored the eventual loser. Why would they do this? It's really simple: they could not bring themselves to believe that they were wrong.
Many who have attempted to find a reason for the failure fall into one of two major buckets - those who look at it scientifically and those who look at it politically. The political view has a problem in that the presence of bias - of the modelers and analysts - tends to taint the model and inject a "confirmation bias". That is, the polls say what the analyst expected them to say, but the analyst may not realize that their bias has already affected the outcome. Many who look at it scientifically are using politics first, so the confirmation bias creeps in again.
Case in point, an analyst may ask the question, "If the election were held today, who would you vote for?" And it's interesting how many of those voters polled were dishonest in their response. How we know this will follow shortly. Taking these responses at face value, the analyst sees that the numbers trend as they expected, so they report the results. Another form of confirmation bias is when the analyst has a hypothesis in hand, such as "We know that nobody could possibly want to vote for Candidate A for so many important reasons, thus..." and this hypothesis is what drives their questions and likewise the rest of the analysis. They only ask questions in this context and only hear answers in this context.
This is a lot like the circular question of "Do you still kick your dog?" If someone answers "yes", they're a reprobate. If they answer "no", it implies they used to kick their dog...
Unbeknownst to many, this circularity is one of the major flaws of the scientific method. This is why the scientific method can't be used to "prove" anything. It can be used to falsify, but not to prove. Bias rears its head in common scientific experiments and even in police forensics.
"What happened to the evidence you collected in the bathroom of the crime scene?"
"We analyzed it and threw it out as irrelevant."
"You threw it out as irrelevant? Why?"
"Because it didn't match the suspect."
(real conversation between a prosecutor and CSIs in a triple murder)
In scientific circles - an intern for a professor at an Ivy League university reported that the professor had collected a wide array of samples from a recent field trip, and the intern had expected to spend the weekend collating them as part of his duties. But when the labeled/bagged samples arrived, only a fraction of them made it into triage. He asked the professor what happened to the rest of it, and the professor said it had to be discarded because "it didn't fit the profile".
This "doesn't match the suspect" or "doesn't fit the profile" is a problem because it means the analysts have a hypothesis concerning a specific outcome and have thrown out evidence suggesting, or even proving, that their hypothesis is wrong. The professor, in throwing out evidence that didn't match a profile, is aligning to his original hypothesis and discarding evidence that disagrees with it. In the case of the CSI, if they throw out all but the data that "matches the suspect" they are by definition throwing out the case against the real criminal. If any of that evidence could exonerate the "suspect" but the "suspect" is falsely convicted, the real criminal goes free.
In the case of Reuters/IPSOS - their hypothesis was that the eventual loser would win - has to win - in a total landslide. Any data that contradicted this simply had to be erroneous. Yet it was not - it was speaking the truth and they ignored it.
So the flaw of the scientific method is that the scientist can inadvertently use the hypothesis as the filter through which all evidence is examined, rather than its intended purpose - a springboard for investigation to be either confirmed, rejected or modified based on the evidence discovered. Since many researchers apply for grants based on "the hypothesis", they must have a reasonable confidence that the hypothesis has merit in order to influence donors for funding. If the scientist has enough failures, those funding sources will dry up.
Thomas Edison claimed to have failed over 1000 times in inventing the lightbulb. His explanation was that in each case, he eliminated a candidate with the expectation that a final candidate would succeed - claiming 1 percent genius, 99 percent perspiration. Unfortunately, donors these days aren't so forgiving. Scientists like Edison, Newton and Lavoisier had independent means of supporting themselves while performing science. Today, scientists expect to be financially supported even while they do the work. That's reasonable, but it increases the cost of science, and tends to tempt the scientist into fudging the data to match the hypothesis. He has bills to pay, kids in college, vacations to pay for. A dependency on money, and lots of it, has been shown to affect their judgment.
As for polling, the polling firms sell their results to the candidates and it's collectively worth billions of dollars. Down-ballot candidates in 2012, 2014 and 2016 spent over a billion dollars in polling alone.
Some reason this way: if you're a pollster in the business of selling polls, and the first poll you take shows your "guy" waaay ahead of the pack, do you share this or do you show the race to be "very close"? This "very close" will cause your client to be nervous and want to purchase another polling result as soon as possible. You have a vested interest - even a conflict of interest - in whether or not you deliver the right polling results because it's your job to sell more results. Since everyone knows that the only polls that count - are the polls in the final week - hey - why not massage the numbers? Who's it gonna hurt?
In the book "Wrong: Why Experts Keep Failing Us" - the author cites one case after another of scientists eliminating, adding or fudging data. One researcher went so far as to use a magic marker to put stripes in the fur of a lab rat so that it could pass certain acceptance criteria. He notes that even though doctors have formally accepted a correct form to administer Cardio-Pulmonary Resuscitation (CPR), the average training manual and even the Red Cross still use the former method.
Periodically in the news will appear a case of fraud or collusion in the area of climate science. If the science is there, why the fraud? For example, the IPCC announced that global warming has been in a "pause" since 1997. The current trend is more toward cooling, as exemplified by the glacier forming in Mt. St. Helens' caldera. Many scientists are asking the impertinent question: Is the "pause" really a "pause" or is it something else, like "reality"?
In May of 2015 NASA announced that the polar ice caps are not receding, in fact the Antarctic sea ice is expanding.
For folks to claim that "the majority of climate scientists believe..." isn't relevant, because many centuries ago the majority of scientists believed the Sun revolved around the Earth. Still, science is science, and not subject to consensus or democratic vote. One "uncovered" memo after another has revealed that the "result" of climate science can be purchased regardless of the contradicting data. So why does anyone trust it? After all, doesn't Earth itself do more damage to itself, than mankind could possibly keep up with?
In the case of Presidential polling, there's no compelling reason to deliberately get it wrong. Some like Reuters may withhold information, but this isn't the same as false reporting. Statistics show that false polls don't do a lot to suppress or affect voter turnout or sentiment. In fact, one could make the case that a false poll strongly in favor of one candidate could strengthen the resolve of a person voting for the opponent. Likewise if a candidate is seen as strong in the polls, some of the candidate's supporters may not go vote under the assumption that their vote wouldn't count all that much. Both of these dynamics are partially in play in 2016 but nobody can measure it, so nobody knows for certain its impact.
I worked with a firm some ten years ago that had so much cash rolling in, they could have wallpapered the offices with 100-dollar bills and not missed any of it. Their business was so dramatically profitable that their investors loved them, their customers loved them, but one love was lost and had not returned - working for the corporate offices was drudgery. They had not upgraded their computing systems in many years, so many of the employees spent countless hours, every day, pulling data and collating spreadsheets.
In the meantime, a lack of visibility across their corporation made their expenses invisible. They were hemorrhaging cash in most departments and could not see it, and didn't care because so much more cash arrived to replace it, and then some. Such situations always have a day of reckoning. We were onsite because that day had arrived and the senior management may as well have been in witness protection they were so frightened.
Analytics matters. It tells you where you stand.
Unless it's presidential polling. Which is just bizarre. I hear that it's - uh - like the most powerful position in the world? I could be wrong about that, but for the presidential polling to be so completely off?
The reason we as voters don't particularly care about the presidential polling is that it's rarely accurate. We're sort of "inoculated" to it by now. We hear all sorts of stories of pollsters using the polls to shape opinion rather than reflect it. We want to believe that they're doing it for the right reasons - to be accurate and regarded as reliable, not as agenda-driven political hacks. Hope springs eternal, but at the end of every election season we see how completely wrong they were. At the beginning of every election season, the lessons of four-years-prior are forgotten and we find ourselves watching the polls ever-so-hopefully.
Pardon the analogy, but isn't this a lot like Lucy and Charlie Brown, where she holds the football for him and rips it away at the last moment? She promises each time she won't do that, but always does. Why does Charlie keep coming back? Well, because it's funny, and it's Peanuts, and we know it's make-believe.
But presidential polling isn't make believe and the outcome has real-world consequences.
Only three of the mainstream pollsters were even remotely accurate. All the rest (dozens of them) couldn't have been more off if they had just manufactured the numbers from thin air. In fact, any of us could lick a finger, test the political wind, and produce a poll that was more accurate than ninety-percent of those claiming to use analytical science. Just embarrassing.
And if it's science, why were they so wrong? I mean, so completely wrong?
This is why the final election outcome was such a "shocker". Expectations. Just the setting of false expectations is enough to set someone off. Tell your wife you have a romantic weekend planned and then at the last minute get called into a non-optional emergency meeting at a client site - uh - yeah - set those expectations carefully! In one particular case, I had set my client's expectations that I would be unavailable. They didn't even bother to call.
Nate Silver, famed analytics guru of many past elections, put his private formula alchemy to work and at the beginning of the election night, already had one candidate favored with 70 percent chance of winning. No margins, you see, just whether the candidate would win. As the polls closed in each time zone, Silver changed his percentages. They went to 60, then 50 and dipped below 50, moving downward "as the world turned" and polls closed by the hour. The opponent likewise rose in the other direction. Betting odds were a reversal of fortune for many.
People look at Silver's messaging in real-time, analyze it and proclaim, "The frontrunner is losing ground" or "the underdog is gaining ground". Why doesn't anyone see such sentiments as odd? The reason I say this is simply:
Friends of mine go to the tracks on occasion, probably more often than their wives would like, and spend time betting on dogs or horses. At the beginning of the day, the track officials publish the betting odds, much like Silver published "odds" during the course of the campaign. But those odds at the track are only good before the race begins. The officials don't change the odds after the starting bell.
And what happens when the gate opens and the horses charge forth? Seems to me they're just like Olympic runners in a starting block. They all have the same starting point - zero - and all of them have to gain ground faster than the opponents - to break the ribbon first. I mean, everyone gets that, it's why we watch races. I still recall many Summer Olympics ago, one of my favorite runners (Gail Devers) was in the 100-meter hurdles. She was well ahead when she hit the last hurdle and tripped over it. She landed on her knees and tried to recover, but finished fifth. She was favored to win, too. That really was a case of gaining ground only to lose it later. Many recall the Winter Olympic snowboarder who was favored to win second place. The frontrunner and backrunner tangled right out of the gate and the hero ran well ahead of them. When she hit the last hill, she decided to "hot dog" and brought her board up to touch it, lost her balance and spilled out. She hurriedly tried to make it right but the other two sped past her, an opportunity lost.
The lesson in all this, is that once the polls close - the outcome is prescribed. They can't gain or lose ground. At the voting precincts, there is no frontrunner or underdog after the polls close. It's all over but the counting.
The odd part about this being applied to the 2016 presidential race, is that when the polls closed and votes were reported, one jumped out in front and stayed out in front, and ultimately won the contest. Anything prior to the votes being counted - predictions, polling, exit polling - didn't matter any longer - because most of them got it completely wrong. The only question people have later is - if most of the pollsters were completely wrong, how do we know the others aren't just a fluke?
Isn't that what we'd say about any other kind of contest? If fifty people make a prediction for an outcome and only three get it right, we chalk it up to random chance, not the skills of the predictor. The point being - if science is directly applied, we move closer to the expected outcome. If no science is applied, we can't expect better than random chance.
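To put a rough number on that intuition, here is a minimal sketch. It assumes each predictor is guessing independently with the same small chance of being right. The 5 percent figure is purely hypothetical, picked for illustration and not measured from any real pollster, and the function name is mine.

```python
from math import comb

def prob_at_least(n, k, p):
    """Chance that at least k of n independent guessers are right,
    when each one is right with probability p (binomial model)."""
    # Add up the chance of fewer than k successes, then take the complement.
    fewer = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - fewer

# Hypothetical: fifty predictors, each with a 5 percent chance of a lucky call.
chance = prob_at_least(50, 3, 0.05)
print(f"Chance that three or more get it right by luck: {chance:.2f}")
```

Under those assumed numbers, three lucky winners out of fifty is close to a coin toss, so three hits alone don't tell us whether anyone applied science. Change the assumed per-guesser probability and the answer moves with it, which is the whole point: the count of winners means little without knowing how they got there.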
So for Silver to claim "gaining ground" or "losing ground" after the polls close, is ridiculous. It's Silver's form of "damage control" after being proven so completely wrong. Moreover, he was repeatedly proven completely wrong about the winning candidate from the time the candidate announced a presidential bid. Every prediction he made - polling to the primaries - crashed and burned. He and others expected their front-runner to win in a landslide of 500 or more electoral points. Epic fail. How embarrassing is that?
Not for him, per se, but for analytics in general. It is because of Silver's past success that advanced analytics has gained ground in the marketplace, but when folks like Silver so completely and visibly fail, it sets back analytics and in some ways can bring shame to those who sold their analytics based on Silver's success with it.
To Silver's credit, part of his "damage control" was his explanation of how close he called the "popular vote". Well, Nate, that's reeeeeal nice. But as noted above, the popular vote doesn't mean squat. This is the United States. Each state is a sovereign entity not beholden to the other states, and the states do not participate in a popular-vote-based election. It is in their best interests, and always has been, to avoid being pooled with the popular vote.
A similar effect happened with the 2004 POTUS election, where one candidate was handily whipping the incumbent - based on exit polls alone - but when they started counting votes, the incumbent immediately jumped in front and never fell behind even once. Various groups in favor of the challenger cried foul - but strangely did not point a finger at the exit pollsters. The outcome of the election would have been the very same without their reports. And since their reports were so completely wrong, why report them at all? The more sinister among us would claim they were trying to affect the outcome. Voters are a little smarter than that, so it's hard to swallow. Especially if it's across fifty sovereign states, each with their own vested interests - this just makes it a lot harder to cheat.
Another strange effect happened with the 1980 election, where the polls leading into the election night had the incumbent ten points ahead, but as the night unfolded, the challenger took the race by a total landslide. Only many years later was it learned that in the week prior to the election, the incumbent was taken aside and told that he wasn't ten points ahead, but ten points behind. No way, no how would he recover these ten points in a few days. This was a "brace yourself" moment, so that nothing about the election night was unexpected for the candidate. I suspect that the same numbers were available to the challenger as well. This was the first time that the loser conceded the race even before the polls closed on the west coast.
In the 2016 election, one candidate was behind in most of the prior predictive polls, and led in only a few, while the other candidate enjoyed a comfortable "lead" throughout the campaign season. The underdog claimed that the polls could not be trusted. People laughed. The polls that showed him ahead were ridiculed. They however, were closer to the mark than anyone realized. Oddly, these same polls were the most accurate ones four years ago in 2012. Why weren't they trusted this time? Confirmation bias. They disagree with our hypothesis - so there's just no way they can be right.
Since both candidates were "seeking low ground" in their rhetoric - in the most bizarre and tumultuous race ever, the candidates dished out their fair share of mud. As a result, many voters were uncomfortable in openly committing to either candidate. Of course, each candidate had their own "openly loyal base" but knew they could not win with the base alone. They had to reach the "independents" and "undecideds" - but how to find them? How to know what they really think?
One of the pollsters used an interesting question - "How are your neighbors voting?" - and in the final analysis, this one question unlocked the hidden information. He had found the true sentiment of "undecided" voters. While a person may not feel comfortable sharing their own opinion, it was easy to share the opinion of an "imaginary neighbor". This pollster's numbers were more accurate, state by state, than those of any of the other pollsters. He called it, but he had to use a little subterfuge to make it happen.
This subterfuge it seems, is the hallmark theme of politics. The politicians keep a public face and a private face. One candidate was revealed to have told donors to expect public responses to the voters that were incongruous with the private responses to the donors - not to worry, it's just politics.
On a personal note, my father was a District Attorney (an elected office) for over 30 years in East Texas, and was in office at the time of my wedding (to my wife of now 30 years) - and my wife's parents had taken-on the expense of the wedding reception. Dad told them to invite an additional 300 guests to the wedding and reception, which would have blown their budget sky-high. They objected but Dad said - no worries, it's just politics. None of them will actually show up, but it's bad form not to invite them. Sure enough, none of them showed up. But this is a bit of a subterfuge in itself, is it not?
Politics is an art of partial truths and partial subterfuge. Anyone who reveals their agenda from the outset is considered a poor political player. One must have a public agenda and a private agenda, if they want to "get anything done". At least, that's the "common wisdom".
This is why many technologists often divorce themselves from politics entirely. If I took a poll of technologists nationwide, of those eligible to vote I would find that only a small percentage are actually registered to vote. Technologists often watch politics like a sporting event, if they watch sporting events at all.
But this politics-as-usual problem - the subterfuge and hidden agendas - had apparently wearied the American voter. So when a candidate stepped forward with no political experience at all - nobody knew how to measure it. They still don't. Even now they are attempting to describe the candidate's victory within a political paradigm, and nothing they come up with is accurate. I read one dissertation that attempted to do the same-old shoe-horn of analyzing based on demographics, when the winning candidate had clearly appealed to a populist voter base that cross-sectioned a wide array of disparate demographics.
No wonder they were so completely wrong- they were looking in the wrong place, and asking the wrong questions. Conspiracy theories emerge. Are they that incompetent? Are they deliberately lying? Either way, how can we trust them?
Silver moves into "damage control" in the next days with "we were only 2 percentage points off" - well no, claiming that one candidate had over 70 percent chance of success is a lot more than 2 percentage points - but since he doesn't do polls himself but bases his information on existing polls and other information - he was effectively drinking from a poisoned well. If he had taken a superficial look at the candidates he would have seen why. One was a career politician and one had never run for office, and didn't understand the first thing about politics - so didn't conform to the common model. This not only confounded the opponent, it confounded the media - and the analytics.
I kept after my kids to pay attention to this election season because they'll tell their grandchildren about it - this will never happen again in our lifetimes.
In large part, this was unlike any other presidential political season for these very reasons, but the pollsters treated the second candidate like the first, attempting to shoe-horn everything into a common "career politician" model. One time after another, the non-politician candidate beat the predictions and nobody could understand why. In the end, the "political class" of consultants and the "establishment" were very afraid. Here they had set up a system through which all political candidates had to arrive, but this candidate proved that none of it was necessary. It relegated the established political engine to irrelevancy.
Moreover, the winner is about to enter office without any obligations to donors or other influencers- and no fingerprints on any of the problems taking place in government now. This was not the case with any of the other candidates.
Has anyone ever witnessed something so strange? The candidate stuck to several core issues that resonated with voters in all demographics and gathered more diversity under the candidacy than anyone prior. The losing candidate on the other hand, kept going through one re-invention after another, as if a phoenix rising from flames. Voters saw this as phony.
How does this apply to our internal corporate issues? Does dirty politics play a role in how numbers are reported? Do we seriously think that if the eeevil political player down-the-hall is able to manipulate numbers to his/her advantage, that they will be altruistic and avoid the urge to cheat? No doubt many avoid the urge, but there is an ever-present propensity to cheat, to capture numbers and spin the story to one's favor. It's just human nature.
When Inmon coined the phrase "Single version of the truth" - this is exactly the problem it addressed - to make sure, in certain, objective and scientific terms, that everyone was reporting from the same place, same totals, same everything, so that nobody could cheat. It's bad enough that someone would cheat to pad their numbers and look better, it's even worse when an underperformer pads their numbers to look even marginally acceptable. Corporate heads wanted neither, but a single place to go where everything was laid bare, the good, bad and the ugly.
In Red Storm Rising, Tom Clancy tells a story of the Russian Politburo and their analysts, who would arrive with three reports in-hand. One was the worst-case, one the best-case and one the middle-ground. When the analysts arrived, they would attempt to "read" the sentiment of the Politburo members - before choosing which report to proffer. The Politburo was known for being harsh with people who disagreed with them. So the analysts would attempt to discern the sentiment and intersect it with a report that aligned with Politburo sentiment rather than challenged it. In this particular storyline of Clancy's, this sentiment was ill-placed, the report was the "best case" and the outcome was disastrous.
Sometimes the folks asking the questions are their own worst enemy. They ask strongly biased questions in the wake of harsh outcomes for dissenters. If a person wants to keep from getting their head lopped off, they do whatever, say whatever to avoid this outcome, but it's not doing the decision-maker any good. If anything, it's misleading the decision-maker down a disastrous path. Unless of course, the decision-maker is just so good at what they do, they don't care about the opinions of others anyhow. Except to see who is loyal to them, of course.
This is the dichotomy presented to us by presidential polling versus our internal analysts. The pollsters and the analysts both have a loyalty or bias, so it is incumbent upon us to either determine that bias, or guide that bias in our favor. In business we want our analysts loyal to our goals and success. We want their honest answer as to where we're headed and whether or not it's a good idea, how to steer toward success and how to avoid danger, and the most effective way is to join-at-the-hip. Their fate is our fate - they have a vested interest in helping us get it right.
Consider the story of the king who went to visit a wise old soothsayer, who told the king, "You will cross a river, and a great king will be defeated." So the king mustered his troops, crossed the river and his entire army was routed. The soothsayer was right - a king had been defeated - but I'll bet the king asking the question would have wanted a more specific answer.
Conversely, who are the pollsters loyal to? Clearly at least one pollster was chasing "the answer" and found it - and reported it regardless of how many others laughed at him for it. The others, who had it wrong, were loyal to something else. We don't need to know what that something else was, just that they weren't pursuing the truth. And if they were pursuing the truth but were that-far-off - they certainly weren't pursuing it well. With the stakes so high, wouldn't we want the person to pursue it well? Don't we want them to be loyal to us?
And "loyal to us" isn't the Politburo gambit of "reading" us to tell us what we want to hear. We want them to tell us what we need to know.
Or are we really okay with analysts who "found what they were looking for" and upon "finding it" proclaimed "see I told you so" even as the company was entering Chapter 11? One particular very-large energy company in Texas (Enron) had one and only one analyst telling them they were on the wrong path. He was right, and could say "see I told you so" - but the decision-makers weren't listening.
This is a lesson for the analysts out there - just like only a few 2016 pollsters got it right while the others laughed at them - you as an analyst might be up against similar odds - and feel like Galileo in conflict with the greatest academics of his time. Your chief analysts may tell you that you're wrong - that you're making a bad career move to disagree with them. What if they are using the "common metrics" and you have found an outlier, a significant anomaly that creates tension in all the common answers? If the data is on your side, you have some decisions to make.
Many years ago I worked with a chap who built hardware parts for computers. One IEEE-certified schematic for a device showed that he needed a much larger wire than was necessary. The wire could hold more power than common house-current, but the device was powered by a nine-volt battery. The larger wire seemed like overkill, but the specification called for it. He took his case for the larger wire to the boss, who told him that he needed to use a smaller wire. The engineer stood his ground on the side of the schematic - and claimed that he'd taken an engineering oath not to follow instructions from people in opposition to a schematic specification. A battle ensued that lasted for many days until the young engineer tendered his resignation. A contractor was called in to fill his shoes, and he noted the same issue with the size of the wire. He called the vendor who said "This has already been published in errata. Do you not have a copy of it?" The contractor said no so they faxed the same. Lo and behold, the new schematic had specified a smaller wire. Why didn't the first engineer think to do this instead of sticking to the original data - in the face of such a glaring anomaly?
What does all this mean? We can stand our ground with bad data in our hands and be sincerely wrong. Or we can look at other aspects of the problem and regard glaring inconsistencies as problems to solve rather than to ignore. Nate Silver ignored the "poisoned well" of the data he was using, as did the other pollsters. The ones who got it right stuck to their answers even through ridicule - because they knew this race was different in too many ways to count, and required more than just the common metrics.
An old joke goes like this: Some analysts got together to determine the "meaning" of "two plus two" - and brought in a mathematician. His answer was "four - what's your point?"
They brought in a philosopher, who answered with "Well, "two" in one universe might mean different things than in ours, the same for the meaning of "plus" or even "four", so can you be more specific?" They thanked him for his time.
Then they brought in the attorney. Upon hearing the question, he rose from his seat, shut the door, seated himself and leaned into them, "What do we want it to be?"
Modified on by DavidBirmingham
Many years ago we encountered an environment where the client wanted the old system refactored into the new. The "new" here being the Netezza platform and the "old" here being an overwhelmed RDBMS that couldn't hope to keep up with the workload. So the team landed on the ground with all hopes high. The client had purchased the equivalent of a 4-rack Striper for production and a 1-rack Striper for development. Oddly, the same thing happened here as happens in many places. The 4-rack was dispatched to the protected production enclave and the 1-rack was dropped into the local data center with the developers salivating to get started. And get started they did.
The first team inherited about half a terabyte of raw data from the old system and started crunching on it. The second team, starting a week later, began testing on the work of the first team. A third team entered the fray, building out test cases and a wide array of number-crunching exercises. While these three teams dogpiled onto and hammered the 1-rack, the 4-rack sat elsewhere, humming with nothing to do.
We know that in any environment we encounter, with any technology we can name, the development machines are underpowered compared to the production environment. And while the production environment has a lot of growing priorities for ongoing projects, we don't have this scenario for our first project, do we? Our first project has a primary, overarching theme: it is a huge bubble of work that we need to muscle-through with as much power as possible. That "as much as possible" in our case, was the 4-rack sitting behind the smoked glass, mocking us.
And this is the irony - for a first project we have a huge "first-bubble" of work before us that will never appear again. The bubble includes all the data movement, management and backfilling of structures that we will execute only once, right? Really? I've been in places where these processes have to be executed dozens if not hundreds of times in a development or integration environment as a means to boil out any latent bugs prior to its maiden - and only - conversion voyage. But is this a maiden-and-only voyage? Hardly - typically the production guys will want to make several dry runs of the stuff too. We can multiply their need for dry runs with ours, because we have no intention of invoking such a large-scale movement of data without extensive testing.
And yet, we're doing it on the smaller machine. No doubt the 1-rack has some stuff - but I've seen cases where it might take us two weeks to wrap up a particularly heavy-lifting piece of logic. If we'd done this on the larger 4-rack, we would have finished it in days or less. Double the power, half the time-to-deliver (when the time to deliver is governed by testing).
In practically every case of a data warehouse conversion, the actual 'coding' and development itself is a nit compared to the timeline required for testing. I've noted this in a number of places and forms, in that the testing load for a data warehouse conversion is the largest and most protracted part of the effort. And if testing (as in our case) is largely loading, crunching and presenting the data, we need the strongest possible hardware to get past the first bubble. A data conversion project is a "testing" project more so than a "development" project, and with the volumes we'll ultimately crunch, hardware is king.
But I've had this conversation with more people than I can count. Why can't you deploy the production environment with all its power, for use in getting past the first bubble, then scratch the system and deploy for production? What is the danger here? I know plenty of people, some of them vendor product engineers, who would be happy to validate such a 'scratch' so that the production system arrives with nothing but its originally deployed default environment.
Yet another philosophy is that we would pre-configure the machine for production deployment, but nobody likes developers doing this kind of thing in a vacuum. They would rather see deployment/implementation scripts that "promote" the implementation. I'm a big fan of that, too, for the first and every following deployment. That's why I would prefer we used the production-destined system to get past the first-bubble-blues, then scratch it, and get the original environment standing up straight, and only then treat it as an operational production asset.
Most projects like this have a very short runway in their time-to-market, and we do a disservice to the hard-working folks who are doing their best to stand up this environment. They need all the power they can get, especially when they enter the testing cycle.
And for this, it's an 80/20 rule for every technical work product we will ever produce. Take a look sometime at what it takes to roll out a simple Java Bean, or a C# application, or a web site. Part of the time is spent in raw development, and part of it in testing. If I include the total number of minutes spent by the developer in unit testing, and then by hardcore testers in a UAT or QA environment, it is clear that the total wall-clock hours spent in producing quality technology break into the 80/20 rule - 20 percent of the time is spent in development, and 80 percent in testing.
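Here's a back-of-the-napkin sketch of what that split implies for the first bubble. It assumes the testing portion (the load-and-crunch work) shrinks in direct proportion to hardware power while development time stays put. That linear scaling is my simplifying assumption, not a measured Netezza result, and the ten-week baseline and the function name are hypothetical.

```python
def weeks_to_deliver(baseline_weeks, testing_share, power_multiple):
    """Estimate delivery time when only the testing share speeds up.

    baseline_weeks: total time on the smaller machine (hypothetical figure)
    testing_share:  fraction of that time spent testing (0.8 for an 80/20 split)
    power_multiple: how much stronger the bigger machine is (assumed linear)
    """
    development = baseline_weeks * (1 - testing_share)  # hardware doesn't help here
    testing = baseline_weeks * testing_share / power_multiple
    return development + testing

# Hypothetical ten-week effort on the 1-rack, with the 80/20 split.
for multiple in (1, 2, 4):
    weeks = weeks_to_deliver(10, 0.8, multiple)
    print(f"{multiple}x the power: about {weeks:.1f} weeks")
```

With testing at 80 percent of the effort, nearly all of the extra muscle lands where the time is actually being spent. If the split were reversed, the same hardware would barely move the delivery date.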
And if the majority of the time is spent in testing, what are we testing on Enzee space? The machine's ability to load, internally crunch and then publish the data. On a Netezza machine, this last operation is largely a function of the first two. But we have to test all the loading don't we? And when testing the full processing cycle we have to load-and-crunch in the same stream, no? What does it take to do this? Hardware, baby, and lots of it.
I can say that multiple small teams can get a lot of "ongoing" work done on a 1-rack, no doubt a very powerful environment. I can also say that multiple teams on a machine like this, in the first-bubble effort, will gaze longingly at the 4-rack in the hopes they can get to it soon, because so much testing is still before them, and they need the power to close.
What are some options to make this work? Typically the production controllers and operators don't like to see any "development" work in the machines that sit inside the production enclosure. They want tried-and-tested solutions that are production-ready while they're running. At the same time, they have no issues with allowing a pre-production instance into the environment because they know a pre-production instance is often necessary for performance testing. Here's the rub: the entire conversion and migration is one giant performance test! So designating the environment as pre-production isn't subtle, nuanced, disingenuous or sneaky - it accurately defines what we're trying to do. It's a performance-centric conversion of a pre-existing production solution, now de-engineered for the Netezza machine. As I noted, development is usually a nit, where the testing is the centerpiece of the work.
With that, Netezza gives us the power to close, to handily muscle-through this first-bubble without the blues - we only hurt ourselves with "policies" for the environment that are impractical for the first-bubble.
This brings us full-circle yet again to a common problem with environments assimilating a Netezza machine. The scales and protocols put pressure on policies, because those policies are geared for general-purpose environments. There's nothing wrong with the policies; they protect things inside those general-purpose environments. But the same policies that protect things in the general-purpose realm actually sacrifice performance in the Netezza realm. Don't toss those policies - adapt them.
Modified on by DavidBirmingham
I hear a lot of feedback on the use of CDC to put data into a PureData for Analytics, Powered By Netezza Technology device. In the other machines (traditional database engines) the data flies into the box, the CDC is on-and-off the machine in seconds. But in my Netezza machine, the CDC seems to grind. I have it running every fifteen minutes, they say, and the prior CDC instance is still running when the next instance kicks off. This is totally unacceptable. Maybe we shouldn't be using CDC for this?
Or maybe they just don't have it configured correctly?
There are two major PDA principles in play here. One is strategic and the other is tactical. Many people can look at the tactical principle and accept it because it is testable, repeatable and measurable. The strategic one however, they will hold their judgment on because it does not fit their paradigm of what a database should do. I'll save the strategic one for last, because its implications are further reaching.
The CDC operation will accumulate records into a cache and then apply these at the designated time interval. This micro-batch scenario fits Netezza databases well. The secondary part of this is that the actual operation will include a delete/insert combination to cover all deletes, updates and inserts. So when the operation is complete, the contents of the Netezza table will be identical to the contents of the source table at that point in time (even though we expect some latency, that's okay).
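The delete/insert combination is easy to see in miniature. What follows is only a sketch, and it assumes several things that are not part of any CDC product: Python's built-in sqlite3 stands in for the target database, the table `target`, the function `apply_micro_batch` and the `(op, key, value)` change format are all hypothetical names made up for illustration.

```python
import sqlite3

def apply_micro_batch(conn, changes):
    """Apply a cached batch of changes as one delete pass plus one insert pass.

    `changes` is a list of (op, key, value) tuples with op in {"I", "U", "D"}.
    This format is an assumption for the sketch, not a real CDC interface.
    """
    # Keep only the last change per key; later changes supersede earlier ones.
    latest = {}
    for op, key, value in changes:
        latest[key] = (op, value)

    with conn:  # one transaction for the whole batch
        # Delete every affected key: covers deletes and the old half of updates.
        conn.executemany("DELETE FROM target WHERE id = ?",
                         [(k,) for k in latest])
        # Re-insert the survivors: covers inserts and the new half of updates.
        conn.executemany(
            "INSERT INTO target (id, val) VALUES (?, ?)",
            [(k, v) for k, (op, v) in latest.items() if op != "D"],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO target VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c")])
conn.commit()

# One update, one delete and one insert arrive in the same micro-batch.
batch = [("U", 1, "a2"), ("D", 2, None), ("I", 4, "d")]
apply_micro_batch(conn, batch)

rows = conn.execute("SELECT id, val FROM target ORDER BY id").fetchall()
print(rows)  # [(1, 'a2'), (3, 'c'), (4, 'd')]
```

Nothing is ever updated in place here. Every change is expressed as a delete on the key followed by an insert, and the table still ends up matching the source.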
The critical piece is this: An update operation on a Netezza table is under-the-covers a full-record-delete and full-record-insert. It does not update the record in place. A delete operation is just a subset of this combination. This is why the CDC's delete/insert combination is able to perfectly handle all deletes, updates and inserts. The missing understanding however, is the distribution key.
If we have a body of records that we need to perform a delete operation with against another, larger table, and the larger table is distributed on RANDOM, think about what the delete operation must do in a mechanical sense. It must take every record in the body of incoming records and ship it to every SPU so that the data is visible to all dataslices. It must do this because the data is random and it cannot know where to find a given record to apply the operation - it could literally be anywhere and if the record is not unique, could exist on every dataslice. It's random after all. This causes a delete operation (and by corollary an update operation) to grind as it attempts to locate its targets.
Contrast this to a table that is distributed on a key, and we actually use the key in the delete operation (such as a primary key). The incoming body of records is divided by key, and only that key's worth of data is shipped to the dataslice sharing that key - the operation is lightning-fast. This is why we say - never, ever perform a delete or update on a random table, or on a table that doesn't share the distribution key of the data we intend to affect. Deletes and Updates must be configured to co-locate, or they will grind.
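A toy simulation makes the mechanical difference concrete. This is a sketch under loud assumptions: the eight dataslices, the modulo "hash" in `slice_for` and every function name here are hypothetical, and the real appliance's hashing and data movement are far more sophisticated. The only thing being counted is how many times an incoming key has to be shipped to a dataslice.

```python
import random

NUM_SLICES = 8  # hypothetical dataslice count

def slice_for(key):
    # Stand-in for hash distribution on a key; the real hash differs.
    return key % NUM_SLICES

def build_target(keys, distributed_on_key, rng):
    """Place each target row on a dataslice, by key or at random."""
    slices = [set() for _ in range(NUM_SLICES)]
    for k in keys:
        idx = slice_for(k) if distributed_on_key else rng.randrange(NUM_SLICES)
        slices[idx].add(k)
    return slices

def delete_keys(slices, incoming, distributed_on_key):
    """Delete incoming keys from the target; return how many key shipments it took."""
    shipped = 0
    for k in incoming:
        # Co-located: only the owning slice needs to see the key.
        # Random: the row could be anywhere, so every slice must see it.
        destinations = [slice_for(k)] if distributed_on_key else range(NUM_SLICES)
        for idx in destinations:
            shipped += 1
            slices[idx].discard(k)
    return shipped

rng = random.Random(0)
target_keys = range(10_000)
incoming = range(0, 10_000, 10)  # 1,000 keys to delete

random_slices = build_target(target_keys, False, rng)
keyed_slices = build_target(target_keys, True, rng)

shipped_random = delete_keys(random_slices, incoming, False)
shipped_keyed = delete_keys(keyed_slices, incoming, True)

remaining_random = sum(len(s) for s in random_slices)
remaining_keyed = sum(len(s) for s in keyed_slices)
print(shipped_random, shipped_keyed, remaining_random, remaining_keyed)
# 8000 1000 9000 9000
```

Both deletes remove the same 1,000 rows, but the random table needs one shipment per key per dataslice while the keyed table needs one shipment per key. In this toy the gap is exactly the number of dataslices, so it widens as the machine gets bigger.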
Now back to the CDC operation. Whenever I hear that the CDC operation is grinding, my first question is: Do you have the target Netezza tables distributed on the same primary keys as the source tables? The answer is invariably no (we will discover why in a moment). So then I ask them, what would it take to get the tables distributed on the primary key? How much effort would it be? And they invariably answer, well, not much, but it would break our solution.
Why is that?
Because they are reporting from the same tables that the CDC is affecting. And when reporting from these tables, the distribution key has to face the way the reporting users will use the tables, not the way CDC is using the tables. This conversation often closes with a "thank you very much" because now they understand the problem and see it as a shortcoming of Netezza or CDC, but not a shortcoming of how they have implemented the solution.
Which brings us to the strategic principle: There is no such thing as a general purpose database in Netezza.
What are we witnessing here? The CDC is writing to tables that should be configured and optimized for its use. They are not so, because the reporting users want them configured and optimized for their own use. They are using the same database for two purposes because they are steeped in the "normalization" protocol prevalent in general-purpose systems - that the databases should be multi-use or general-purpose.
But is this really true in the traditional databases? If we were using Oracle, DB2, SQLServer - to get better performance out of the data model wouldn't we reconfigure it into a star schema and aggressively index the most active tables? This moves away from the transactional flavor of the original tables to a strongly purpose-built flavor.
Why is it that we think this model is to be ignored when moving to Netezza? Oddly, Netezza is a Data Warehouse Appliance - it was designed to circumscribe and simplify the most prevalent practices of data warehousing, not the least of which is the principle that there is no such thing as a general-purpose database. In a traditional engine we would never attempt to use transactional tables for reporting - they are too slow for set-based operations and deep-analytics. Yet over here in the Netezza machine, somehow this principle is either set-aside or ignored - or perhaps the solution implementors are unaware of it - and so these seemingly inexplicable grinding mysteries arise and people scratch their heads and wonder what's wrong with the machine.
And again, they never wonder what's wrong with their solution.
If we take a step back, what we will see are reports that leverage the CDC-based tables, but we will see a common theme, which I will cover in a part-2 of this article. The theme is one of "re-integration" versus "pre-integration". That is, integration-on-demand rather than data that is pre-configured and pre-formulated into consumption-ready formats. What is a symptom of this? How about a proliferation of views that snap-together five or more tables with a prevalence of left-outer-joins? Or a prevalence of nested views (five, ten, fifteen levels deep) that attempt to reconfigure data on-demand (rather than pre-configure data for an integrate-once-consume-many approach)? Think also about the type of solution that performs real-time fetches from source systems, integrates the data on-the-fly and presents it to the user - this is another type of integration-on-demand that can radically debilitate source systems as they are hit-and-re-hit for the very same set-based extracts dozens or hundreds of times in a day.
I'll take a deep-dive on integration-on-demand in the next installment, but for now think about what our CDC-based solution has enticed us to do: We have now reconfigured the tables with a new distribution key that helps the reports run faster, but because this deviates from the primary-key design of the source tables (which CDC operates against) then the CDC operation will grind. And when it grinds, it will consume precious resources like the inter-SPU network fabric. The grinding isn't just a duration issue - it's inappropriately using resources that would otherwise be available to the reporting users.
What's missing here is a simple step after the CDC completes. It's a really simple step. It will cause the average "purist" data modeler and DBA to retch their lunch when they hear of it. It will cause the admins of "traditional" engines to look askance at the Netezza machine and wonder what they could have been thinking when they purchased it. But when the ultimate users of the system see the subsecond response of the reports and the way their queries return in lightning fashion, compared to the tens-of-minutes or even hours of the prior solution, these same DBAs, admins and modelers will want to embrace the mystery.
The mystery here is "scale". When dealing with tables that have tens of billions, or hundreds of billions of records, the standard purist protocols that rigorously and faithfully protect capacity in the traditional engines - actually sacrifice capacity and performance in the Netezza engine. It's not that we want to set aside those protocols. We just want to adapt them for purposes of scale.
The "next step" we have to take is to formulate data structures that align with how the reporting users intend to query the data, then use the CDC data to fill them. It's not that the CDC product can do this for you. It gets the data to the box. This "next step" in the process is simply forwarding the CDC data to these newly formulated tables. When this happens, the pre-integration and pre-calculation opportunities are upon us, and we can use them to reduce the total workload of the on-demand query by including the pre-integration and pre-calculation into the new target tables. These tables are then consumption-ready and have far fewer joins (and the need for left-outer joins often falls by the wayside). After all, why perform the left-outer operations on-demand if we can perform them once, use Netezza's massively parallel power for it, and then when the users ask a question, the data they want is pre-formulated rather than re-formulated on demand?
This necessarily means we need to regard our databases in terms of "roles". Each role has a purpose - and we deliberately embrace the notion of purpose-built schemas, and deliver our solution from the enslavement of a general-purpose model. The CDC-facing tables will support CDC - we won't report from them. The reporting tables face the user - we won't CDC to them.
Keep in mind that this problem (of CDC to Netezza) can rear its head with other approaches also - such as streaming data with a replicator or ETL tool to simulate the same effect of CDC. Either way, the data arrives in the Netezza machine looking a lot like the source structures and isn't consumption-ready.
I worked with a group some years ago with a CDC-like solution, and they took the "next step" to heart, formulated a set of target tables that were separate from staging and then used an ETL tool to fill them. The protocol was simply this: The ETL tool sources the data and fills the staging tables, then the ETL sources the staging tables and fills the target tables. This provided the necessary separation, so functionally fulfilled the mission. The problem with the solution however, was that for the transformation leg, the ETL tool was yanking the data from the machine into the ETL tool's domain, reformulating it and then pushing it back onto the machine. The data actually met itself coming-and-going over the network. A fully parallelized table was being serialized, crunched and then re-parallelized into the machine. As the data grew, this operation became slower and slower. That's what we would expect, right? The bottleneck now is the ETL tool. The proper way to do this, if an ETL tool must be involved, is to leverage it to send SQL statements to the machine and keep the data inside the box. The Netezza architecture can process data internally far faster than any ETL tool could ever hope - so why take it out and incur the additional penalty of network transportation?
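Here is a small sketch of the two roles and the forwarding step between them. Treat all of it as assumption: sqlite3 is only a stand-in for the appliance, and the tables `cdc_orders`, `cdc_customer` and `rpt_sales_by_region`, plus the function `refresh_reporting`, are hypothetical names. The point is the shape of the work, with the left-outer join and the aggregation performed once after the CDC lands rather than on every report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- CDC-facing role: shaped like the source, never queried by reports.
    CREATE TABLE cdc_orders (
        order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    CREATE TABLE cdc_customer (
        customer_id INTEGER PRIMARY KEY, region TEXT);
    -- Reporting role: shaped for the users' questions, never written by CDC.
    CREATE TABLE rpt_sales_by_region (
        region TEXT PRIMARY KEY, order_count INTEGER, total_amount REAL);
""")
conn.executemany("INSERT INTO cdc_customer VALUES (?, ?)",
                 [(1, "EAST"), (2, "WEST")])
conn.executemany("INSERT INTO cdc_orders VALUES (?, ?, ?)",
                 [(10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0), (13, 3, 20.0)])

def refresh_reporting(conn):
    """The 'next step': pre-integrate and pre-calculate after CDC completes."""
    with conn:
        conn.execute("DELETE FROM rpt_sales_by_region")
        conn.execute("""
            INSERT INTO rpt_sales_by_region (region, order_count, total_amount)
            SELECT COALESCE(c.region, 'UNKNOWN'), COUNT(*), SUM(o.amount)
            FROM cdc_orders o
            LEFT OUTER JOIN cdc_customer c ON c.customer_id = o.customer_id
            GROUP BY COALESCE(c.region, 'UNKNOWN')
        """)

refresh_reporting(conn)

# The on-demand query is now a join-free read of a consumption-ready table.
report = conn.execute(
    "SELECT region, order_count, total_amount "
    "FROM rpt_sales_by_region ORDER BY region"
).fetchall()
print(report)  # [('EAST', 2, 150.0), ('UNKNOWN', 1, 20.0), ('WEST', 1, 75.0)]
```

On the real machine the reporting table would also carry its own distribution key, chosen for how the users query it, while the CDC-facing tables keep the source's primary key.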
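The two transformation styles can be put side by side in a few lines. Again this is a sketch with stated assumptions: sqlite3 is an in-process stand-in, so no real network is involved, and the `staging` and `target` tables and the function names are hypothetical. The return value simply counts the rows that would have to cross between the database and the tool.

```python
import sqlite3

def setup():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO staging VALUES (?, ?)",
                     [(i, f"name{i}") for i in range(1000)])
    conn.commit()
    return conn

def transform_outside(conn):
    # Outside the box: pull every row out, reshape it in the tool, push it back.
    rows = conn.execute("SELECT id, name FROM staging").fetchall()
    transformed = [(i, n.upper()) for i, n in rows]
    with conn:
        conn.executemany("INSERT INTO target VALUES (?, ?)", transformed)
    return len(rows) + len(transformed)  # rows that crossed the boundary

def transform_inside(conn):
    # Inside the box: the tool sends one SQL statement; the rows never leave.
    with conn:
        conn.execute("INSERT INTO target SELECT id, UPPER(name) FROM staging")
    return 0

outside_conn, inside_conn = setup(), setup()
moved_outside = transform_outside(outside_conn)
moved_inside = transform_inside(inside_conn)

query = "SELECT id, name FROM target ORDER BY id"
same_result = (outside_conn.execute(query).fetchall()
               == inside_conn.execute(query).fetchall())
print(moved_outside, moved_inside, same_result)  # 2000 0 True
```

The target tables end up identical either way. The difference is that the first approach moves every row twice, and that count grows with the data, while the second moves only the statement.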
The ETL tool aficionados will balk at such a suggestion because it is such a strong deviation from their paradigm. But this is why Netezza is a change-agent. It requires things that traditional engines do not because it solves problems in scales that traditional engines cannot. In fact, performing such transformations inside a traditional engine would be a very bad idea. The ETL tools are all configured and optimized to handle transformation one way - outside the box. This is because it is a general-purpose tool and works well with general-purpose engines. There is a theme here: the phrase "general purpose" has limited viability inside the Netezza machine. If we embrace this reality with a full head of steam, the Netezza technology can provide all of our users with a breathtaking experience and we will have a scalable and extensible back-end solution.
Within a few minutes, Neon and Ruth Guardian arrived at the office of the Architect. The label "Most Recent" was alongside the open door. Inside, they could see a high-back leather chair and the back of a man's head. Beyond him, an array of computer screens in a grid, filled and scrolling with operational information.
Neon knocked and entered. He wondered why such a complex command console was required for such a simple data warehouse. "Mr. Recent?"
The Architect turned around to greet him, "Hello, Neon - I've been expecting you." He clicked his remote mouse toward the wall and several screens blinked. "But I'm afraid that Most Recent is a status, not a name." He then pointed to the floor, "Please remain standing. You won't be long."
Neon glanced toward Ruth and wanted to roll his eyes, but refrained. "I have a few questions," Neon said, pulling up a chair.
The Architect confidently smiled and said, "I suspect that you do, but keep in mind that you are still irrevocably a consultant. Some of my answers will make sense, and some won't."
Guardian leaned into Neon's ear and whispered, "Actually none of his answers will make sense."
The Architect continued with the tiniest smile, "The question you ask first will be the most important to you, but is also the most irrelevant."
Neon paused, then asked, "Why didn't you protect the company and its interests when you designed the data warehouse?"
"Okay, I was wrong - that is a very good question. Let me think about that for a moment."
"Take your time."
"I suppose the question is not about the data warehouse itself, but the outcome of its applications."
"No, it's the data warehouse itself." Neon focused.
"But the data warehouse is practically perfect in every way." He was noticeably uncomfortable.
"It's full of junk," Guardian sighed.
The Architect continued, "Look at the algorithms, the flows, the data management, the sheer muscle in the machinery. Infinitely capable and infinitely scalable." He grinned and laughed to himself, "Null, missing and invalid values are simply scattered anomalies within what is otherwise a harmony of mathematical precision."
"But you're supposed to correct those," Neon challenged, "the data warehouse process has a responsibility to scrub bad data and either exclude it or make it right."
"What if the data doesn't want to be excluded? What if we don't know how to make it right? What if it cannot be made right? What if excluded data finds a way to come back?"
"The data isn't like an animal," Neon huffed, "it doesn't have a personality. It doesn't feel pain or make decisions. It doesn't - " he paused, leaning backward.
"It doesn't," The Architect agreed, "and it isn't." He shared a long, intense eye-to-eye gaze with Neon, "you still don't understand, do you? All data is merely binary one's and zero's recorded on media, and the metadata to describe it is no different. Data or metadata - what is the difference? - it isn't real. It is a collection of magnetic signals that record the semblance of something in the real world"
There is no data Neon recalled. And if metadata itself is data, then it, too is not real. There is no data, no metadata. What then - is reality? "This is just a philosophical crock," he concluded.
"When data flows through the systems," The Architect explained, "it is not really flowing, only signals representing the data are changing in memory, organized such that the mathematical states cascade in value from one machine to the next - but it's not like water. It doesn't have - physicality."
It's not real in the physical sense, Neon's brain was on fire, but it represents reality, and affects reality - like - sorcery?
"He's doing it too," Guardian said, "talking in circles and not accepting responsibility."
"Don't reduce," Neon focused, "the data is junk, no matter how you choose to describe it, or how capable the technology is. Magnetic state leaves one place and lands in another, and you took no responsibility to make it right - is that about it?"
"Anomalies in the source data reduce its quality," the Architect asserted, "to a level that is beyond correction, but not beyond control."
"They are beyond neither," Neon corrected, "correction and control are simply a function of diligence."
"Diligence and control are easy," he said, "but correction is the realm of the data stewards. What if I correct the data? From what incorrect state to what correct state? And who decides? Who controls?"
It suddenly dawned on Neon that a major missing piece had nothing to do with the architecture, data, or technologies involved. It had to do with governance. No controlling authority existed to control quality of process or result.
"But you're not controlling it at all. Data quality is not beholden to anomalies," Guardian pointed out, "Increasing the quality - eliminating junk - is a deliberate act of will."
"Choice?" the Architect clarified, "is not relevant. The data flows. The programs run. Everything is doing what it is supposed to do. Everything is fulfilling its purpose."
"And that would be?"
"This data mart here," Neon pointed to a screen, "What is it's purpose?"
"Interesting - ", the Architect raised an eyebrow," you asked that question much sooner than any of the others. Impressive." He cleared his throat and continued, "the mart was designed to deliver business facts and dimensions."
"He didn't answer your question," Guardian said, "ask it a different way."
"I don't decide purpose any more than you choose functionality," said the Architect, "Only the users know the purpose. Why do you need to know?"
Neon was growing impatient, "And who are the users of this data mart?"
"The warehouse has developed over the years through several iterations," the Architect continued. "In each case, the basic data remained the same, only the architects and technology changed. We stabilized the staff and the technology, so now all things are safe from harm."
"All things," Neon said - "like - "
"Jobs, careers," the Architect said, "and their security. So we are not beholden to fate. You do understand fate, don't you Neon?'
"But the data -"
"The data supports our existence, but is not the reason for it."
"The technology is about to change. We're tossing out the hand-crafted scripts and replacing them with a framework environment." Neon asserted.
"Netezza?" the Architect rubbed his chin, "will change nothing. We will implement it in the same manner as all before it."
"The changing of the technology might change more than you think. It may change how you approach the problem. And reduce anomalies to zero."
"A problem is simply an opportunity without a solution. The technology serves us, and we are its master. What is the significance of technology without the technologists?"
Neon recalled his conversation with Miro the Virginian, and muttered to himself, "the slaves."
"Pardon?" the Architect queried.
"Change," Neon observed, crossing his arms, "the problem is change. Your architecture is not adaptive, so cannot account for change. It exists with the fear of change, not the adaptation to it. Your technologists are slaves to their own creation - and you are their slavemaster."
Ruth's eyes widened. She'd never seen a consultant speak to the Architect this way before.
Neon continued, "Hmm. From where I'm sitting, the only thing more significant than the data warehouse itself is -"
"Me," interjected the Architect. "Data changes shape, is delivered different ways, but we continue."
"I was about to say that the only thing more significant than the warehouse is its failure," Neon sighed, "to preserve the security and integrity of the data and fulfill known purposes of the users."
"Users, schmoozers," the Architect snickered, "Developers implement a data warehouse. Users get what they ask for."
"And not what they need?"
"Who can say what they need?"
"I can," Guardian offered, "they need to be successful. The data isn't helping them. You seem to think the data is secondary."
"Secondary only to my own importance, which cannot be questioned."
"I am an engine of rhetoric. I ask rhetorical questions that are not meant to be answered, and I give rhetorical answers that are not meant to be questioned," he stared down his inquisitor, irritated and unamused. "I am the Architect."
"Do you really?", the Architect leaned forward, now oblivious to the screens flashing all around him, "and what exactly do you see?"
"Someone on his way out," Neon said, rising, "you have a choice. Pick the door that cooperates with me and save your skin. Or pick the door that defends this nonsense and - "
"The loss will be for Red Corporation either way," the Architect said, "the loss of a sublime data architecture in favor of something simple - and maintainable by simpletons," he cleared his throat, "or the loss of very important staff and the business knowledge between their ears."
"It can be better - "
"Optimism," The architect observed, "your greatest strength - and your greatest weakness."
Neon sighed, "I'm a realistic optimist. I think things can't get any worse than they are now - "
Neon turned to leave and Ruth followed.
"I've built a lot of these," the Architect called after him, "and I will build many more. I've gotten very efficient at it!"
Neon and Ruth made a quick exit. Walking down the hallway back to their offices, Neon wondered what Ruth's role in this mess had been. "Why didn't you stop them?"
"I wasn't here," Ruth said, "to guide the effort." Ruth thought about it for a moment and said, "The leaders need good data to make sound decisions. It protects their careers, the company and ultimately my job and the reason for it. I hunt for the bad data and kill it. I hunt for the good data and preserve it."
"So your something of a data hunter?"
"More like a data protector."
"We shouldn't borrow lines from another script."
"Oh, sorry, couldn't resist. In any case, I protect things."
"I protect, " she paused for effect, "that which matters most."
Modified on by DavidBirmingham
Neon browsed the list of strange and cryptic entries rolling on the screen. "These are back doors," he said under his breath.
"Yes," said Ruth Guardian, the data quality expert for Red Corp, "Each one represents a potential hole in security for private information."
"It's worse than that."
"If this data content gets to the desktop of a decision maker," he sighed, "it could take the company in a dangerous direction."
Her eyes widened.
"And if the wrong or incomplete data appears on the customer website, the customer will assume - and rightly so - that you're not protecting their information."
"Who's in charge of applications here?"
"His name is Miro," Ruth informed, rising, "Follow me." They exited and moved quickly down the hallway. "He's was born in Virginia, but I think he spent some time overseas in his youth."
"Miro the Virginian doesn't talk like a Southerner, then?" Neon smiled.
"Maybe the South of France," she laughed.
Miro was standing on a platform watching three teams of application developers, slinging code and user screens with abandon. Miro looked like the conductor of - some strange orchestra.
"Need some of your time," Neon approached behind him.
"I can take time," Miro said casually, "if I don't take the time, how will I ever have the time?" He laughed at his own joke.
Neon peered over the shoulder of an application programmer - the complexity of the environment was mind-crushing. But something was odd. The complexity seemed to be - artificial.
"You know why I'm here."
"Yes, your arrival was heralded in company newsletters," Miro said, "but the question is - do you know why you are here?"
"I am here to reduce instability," Neon said.
"But that is a task, not a goal. What is the reason for you to do this?" He smiled and turned away, "You are here because you were sent here by the company leaders. They sent you, and you obeyed. This is the reality."
"The reality is that the environment is unstable."
"No, the only reality - is causality. Dirt breaks the screens, non?" Miro waved his hand, "We have dirt. We have screens. We have to clean the dirt. More dirt, more breakage. More code, less breakage. It's a cause and effect. Dirt is the cause. It creates one of two effects - more code or more breakage. Cause and effect."
Neon held up his hands, "I get it, alright? You have to harden the presentation layer because of all the dirt and disconnection in the data layer - "
"Watch this," Miro smirked as delivery people entered the room with boxes of pizza and two portable soda fountains. They quickly refilled the tankers on each developer's desktop and left the pizza in the middle of the room. "On the hour - every hour."
One of the developers spun around, popped open the first pizza box and hungrily wolfed down two big pieces. He washed it down by gulping his soda loudly.
"You see, " Miro pointed out, "the food and soda are like opiates. It doesn't matter how much work they have, as long as they are fed and watered." He held his hand to his ear and cupped it as though expecting to hear -
BUUUURP! The developer belched and returned to work.
"Satisfaction comes in many forms," Miro grinned, "Cause and effect. The food is the cause and the work is the effect. They don't have to understand it."
Neon recognized them as slaves of the same opiates that once were his master. "All you're doing is dealing with symptoms, not root causes," Neon asserted, "The root cause is bad data - and you've just found ways to manage the symptoms - like over-coding."
"Overcoding?" Miro corrected, "We only code what is needed to make things work. It fulfills its purpose."
Neon held up a finger as if to say hold that thought - then reached down, hit three keys - and all of the screens in the room went BSOD - Blue Screen of Death.
Miro gasped, the developers screamed and began a recovery fire drill. Miro quickly composed himself, raised an eyebrow and said "Okay, you have some skills."
"The environment is brittle because of the overcoding. Less code, fewer failure points. Cause and effect. The extra code is not necessary - it's a symptom -"
"Oh? How will you solve the problem otherwise?" Miro challenged.
Neon rolled his eyes and muttered, "Clean the data first."
"It is so simple for you, consultant," Miro huffed, "But you know nothing. The data is what it is. If you think you can change it - you must approach the Architect. He will not suffer the likes of you - "
"But all of the code, and all of the work," Neon observed, "gets in the way of serving what the users really need."
"You know what the users need?"
"Yes, well -"
"No, I don't think you do." Miro asserted mysteriously, "You think you know what they need, but do you really?"
"He's talking in circles," Ruth observed, "welcome to my world."
Neon turned to leave with Ruth on his heels.
"This isn't over," Miro snapped.
"Oh yes it is, " Neon retorted, "The leaders turned this over to me and I will put an end to it."
"I survived your predecessors, consultant " Miro called after him, "and I will survive you!"
Yep, Neon thought, I've heard that before, too.
Ruth quickly joined Neon in the hallway, "Now what?"
"Can I speak to the data architect?"
"Which one? There have been six before you. The most recent -"
"Just let me talk to him."
Modified on by DavidBirmingham
Neon strode through the crowd, half-gliding as his long black raincoat floated behind him. Orpheus had called - and now it was time to meet with the database vendor. He made his way confidently toward the large black building.
Orpheus joined Neon outside the building and both of them entered. Orpheus led him to a large lobby full of people, signed him into the visitor list, turned and said, "You're here to see Oracle. Whatever they tell you is only for you to know and nobody else. It's a non-disclosure thing." With that, he departed and left Neon amidst the others.
Neon found his way to a chair and sat down to wait quietly. Next to him was a young man who looked like he'd just graduated college. He was tapping feverishly on the keyboard of a well-worn laptop.
"Why are you here?" the kid asked.
"To see Oracle," Neon muttered, closing his eyes, "and you?"
"Metadata," the kid whispered, almost hypnotized by his own work. "If you notice," he tapped several keys and filled the screen with data, "how the information moves when the metadata changes..."
He was right. As he moved his mouse, the metadata changed, controlling the information on the screen almost like water in a pool.
"That's interesting," Neon said, "how are you doing that?"
"Making the information move like that?"
"There is no information," the kid said, "only metadata."
"But there has to be -"
"Think about it," said the kid mysteriously, "When we see an "A" on the screen, it's not really an "A", but the way the screen has been programmed to represent the digital code for "A". And below that, the "A" is a hexadecimal value, which is another representation for binary digits. And we know that binary digits are representations of electrical signals that can only be "on" or "off". Ultimately, each level is a representation of the one below it, so there is no true data, just representations of electrical signals. It's all metadata."
"There is no information," Neon whispered.
"Mr. Neon," an admin called out, and Neon left the young man to his own mysterious devices. "Oracle is down the hall to the left."
Neon entered the conference room where only a small, unassuming little woman stood pouring herself some coffee. "Oracle?"
"Shortly," she said, not turning around, "they're in the next room wrapping up. But we can talk for now."
Neon waited in the uncomfortable silence while the woman slowly stirred her coffee.
"You don't drink coffee do you?" she asked nicely.
"Not really, no."
"How did I know that?" she asked, motioning for him to sit down. She answered her own question, "Your eyes are naturally bright. You don't need the stuff."
They moved toward the large conference table while she continued, "Ever wanted to predict the future?" She sipped her coffee and paused for effect. "Or see things others cannot?"
"What do you mean?"
"Predictive analytics. Remember the little girl on the beach in Indonesia? The water rushed out to sea, and everyone thought it was odd, but still they ignored it. She knew a tsunami was coming because she'd studied it in school. She sounded the warning, saved lives - and I noticed - that none of the other tabloid prophets had anything to say about it."
Neon smiled, "That's because you can't really predict the future."
"What if you could?" she smiled, "in the same way that she did - she obviously knew what was about to happen before it happened."
"That's different. That's not predicting, that's analytics."
"In the business world, and in technology, there's no difference."
Neon paused for a moment, "Go on."
"People go about their business, everything is routine. Then something odd happens one day. Maybe it's deja-vu. Maybe something more. Ever been somewhere that the odd things were so glaringly out of place to you, but nobody else seemed to notice?"
Neon was waiting for her to make her point. Perhaps like information, there is no point, he thought.
"Have a seat," she invited, "I've got something to show you."
She produced a laptop, popped it open and said, "Look." Filling the screen were various utility scripts.
Neon's eyes got wide, "Is this what I think it is?"
"Yes," she said, "From the underground." She went on to explain that while all of the scripts had been built for utility, she'd done it under the license of a prior customer. The customer had received value for their money, and she had walked out with the reusable collateral.
"So these weren't developed with a bootleg copy of Linux?" Neon queried.
"Would it matter?" she said, "Either way, they're from the underground. How much would something like this be worth to you?"
Neon thought for a moment, "Functionally I'm sure there's value, but we'd have to retrofit these into our current naming conventions, configuration management - I mean, it would be easier to build them from scratch than accepting them from you."
"But you have to know what to build," she said mysteriously," you have to know ahead of time what you're going to need. What if these contain ideas about things you didn't know you needed, but really do?"
"Cut to the chase, just give me the insights and I'll apply them."
"Insights are valuable too," she said, "like seeing into the future. You will need some of these functions, and you won't know it until it's too late."
"It will never be too late," Neon said, "we can always add it later."
"And if you never do, you'll suffer when there could be a better way. Ignorance will enslave you."
"I'll take that risk."
"That's not the Netezza way," she said, standing, "the Netezza way - adaptive architecture - is to add it to the mix - keep it in the arsenal - knowing that you may need it because so many people often do."
"That's a little extreme - "
"Is it?" she leaned forward, "What must you have in place before developing? What are the top three pitfalls of development? What are the top three missed opportunities if you start wrong? Do you know how to apply metadata so that data doesn't really matter anymore?"
Neon shifted his weight in the chair.
"Like the little girl - predicting the future from prior knowledge. Harness the information, and harness the future."
"Nobody can amass that much information," he shook his head.
"It's not about the information," she said, "it's about how the information is used."
Neon whispered in recollection, "There is no information... only -"
"What transforms raw data into actionable information? Metadata - Behavioral and structural metadata are not the realm of tacticians," she said, "they only understand raw data. Rise above the data - connecting it with metadata - and become an information alchemist. You'll have the persona of a shaman, or a sorcerer, not simply a developer - once you master the power of metadata, and its power over data to create information."
"But data drives the processes - "
"But metadata determines which processes to drive - "
Neon's eyes widened, "I'll take that cup of coffee now."
She retrieved the coffee, using the time to explain, "In the corporate layoffs of earlier years, many IT professionals developed a deep distrust for corporate IT, to the point that most of them had gone underground, to a "1099-culture" of contractors that popped on and off the grid like ghosts. This created an underground of disavowed professionals who sometimes take extreme measures to be successful."
"What do you mean, ghosts?" Neon asked.
"You see those folks out there?" she pointed out a small window to a view of a hallway, "they work here and fulfill a purpose. Sometimes they are replaced by someone who is better or different than they are - happens all the time." She took a long draw of her coffee, "then there are those who don't really have a purpose, or their purpose is lost. They might leave the company and reappear as consultants or contractors."
"Happens all the time," Neon observed.
"Yes, but the legends that you hear - of vampire firms that suck the life blood and money from a company, or ghosts that appear on and off the books to fix spot problems, or aliens that come from far away places", she shook her head, "are part of an underground - or an underworld. In many cases these are people who have lost their original purpose - or who are trying to find one - some are even outcasts who live in the underground and supplant or assist people - " she pointed to the hallway, "like them."
Neon reflected, "Go on -"
"So there's nothing really wrong with being a ghost," she said, "I'm something of one myself. The point is - the underground is a dangerous place for unprotected hex," she grinned.
Neon raised an eyebrow.
"I mean software - and software licenses. There are bootleggers and pirates, crackers and hackers all over the globe. Vendors have wisely chosen to lock down their own licensing model to keep the vampires at bay. A form of garlic, if you will."
"And the vampires reciprocate with a model of entry by invitation only," Neon smiled.
"Or we'd have bootleg AI products," Neon finished, "But this makes it even harder for a consultant to succeed - they always have to be working for a licensed customer in order to get exposure to the product - making them even more desperate to get a copy of it themselves."
"Bingo," she said, taking another sip, "it's a pickle, no doubt about it."
"So how do we solve it?"
"What's to solve? People work their way around it," she said, then explained how one consulting firm put their daytime workforce in play and then replaced them with a night-time workforce to do unit testing for the daytime work force. The night work force was "free" to the company, but they were getting trained in a live environment.
"So they found a way," Neon leaned back in his chair.
"They always do," she said, sipping her coffee and maintaining a long silence.
"I interviewed a fellow the other day," she began, "for the third time."
"Why so many -"
"Wasn't my intention," she smiled, "he just used different names each time. The third attempt he had someone in the room with him coaching him on the answers. Desperate times call for desperate measures." she took a long sip of the coffee. "It's one thing to have the answers - it's another thing to actually know them."
"Seems like it would be easier to get a job somewhere else doing something else?"
"Where else?" she said, "and what else? The tactical underground lives for the information - bootleg or otherwise. Anything involving information will always give them work to do. They live for the "I", so to speak, the "I" in IT".
"But you said there is no information- "
"Bingo. They are living for something that is not real. The information will always change, faster than you can blink."
"So when we say there is no information, it's not that the information isn't there, only that it's passing through so quickly - "
"Like water in the plumbing. The plumbing is the metadata - it carries the water - regulates it - manages it. Which is more important, the water or the plumbing - and who makes more impact - the plumber or the one who is constantly focused on the water?"
"Here's another dilemma," she continued, "a team with ill-trained Netezza people arrives on site and does damage trying to understand how to make the technology fit into how they do things. It's how they do things that's more important to them, because they are predicting they may need a backup - like another technology and another team - to backfill if they fail."
"They are predicting their own failure," Neon said.
"Predicting is hard to do," she smiled, "but planning it is not. You can't fool me - some teams do such a bad job that I can't be convinced their failure wasn't deliberate. Netezza is so easy to use it would be like failing to cross a country road. You have to trip yourself a bunch of times - deliberately - you see what I mean?"
"Sounds almost - insidious."
She sipped her coffee, "You've seen the so-called support systems in the underground - Is this sort of like the blind leading the blind - or perhaps the blind misleading the blind?"
"Never thought about it that way, " Neon mused.
"Think about where you are now," she continued, "I bet you'd like to learn more - can you share information like: How much did you spend on your development staff? Your consulting staff? That much? That little? What was the outcome? How long did it take? that long? That quick? How do you schedule work and scope projects? etc. etc. " she waved her hand in the air, "the list goes on and on - "
"What's the answer?"
"Collaboration," she said slowly, then paused, "and someone to make it happen." She leaned forward - "a facilitator. In gaming systems, it's a sprite - something that exists only to bridge the player into the game." She leaned forward, "the real game."
Neon's eyes sparkled, "So if end users collaborate on each other's work, they could trade ideas and watch each other's backs." He stopped himself, "But that is a huge amount of information -"
"You don't learn very fast, do you?" She smiled. "It's not about the information, it's about the facilitation. As far as you're concerned, they need a broker. Someone to bridge them into the game."
Neon's cell phone beeped and he turned around to check the display. It was Ruth Guardian, and it was time to leave. He turned to the woman to apologize, but she and her laptop were gone. On her chair was a business card with the Oracle logo, and on the back was scribbled a cell phone number.
He looked around quickly, and noted that the door on the far end of the conference room was slowly closing, telling him that she'd exited through it. He darted toward the door and opened it quickly.
The hallway was empty.
Musing over his recent adventure with the white rabbit, Neon sat silently, thinking Dilbertesque thoughts - still steaming from a reprimand by his pointy-haired boss (someone who strangely looked like an elf from Lord of the Rings, but without whispering most of his words).
A courier dropped off a package containing a cell phone (with a design that not even Mr. Spock's agent would have agreed to).
The cell phone rang -
"Neon do you know who this is?"
Neon whispered, disbelieving - "Orpheus?"
"Yeeesss. I've been wanting to meet you, but we haven't much time."
"What do you mean?"
"You must quickly take this phone across the building. Don't let anyone see you."
Neon complied, ducking low as he darted through and between the cubicles.
"When you get to cube 4184, look inside. There's a consultant there from one of my competitors working on a project that should have been awarded to me".
"Yes I see him."
"Take him out."
"What?" Neon whispered as the consultant looked into his eyes.
"Okay, tell him to go to the nearest window and jump. I'll be waiting to catch him."
"Wha -?" Neon said again in disbelief.
"Okay, tell him to take a red pill, then stick his finger in the nearest mirror. We'll drop a harness in the water and lift him out."
"You've lost me."
"Just make sure he takes the pill, we'll do the rest"
"Wait - "
"The pill, Neon, and wash your hands afterward."
"Look - "
"No you look, do I have to spell it out? He's a competitor. Keywords are jump. Water. Long walk off a short pier. I'm throwing you some bones here."
"You want to get rid of him?"
"No, not get rid of him, just cancel his contract. Punch his ticket, you know - he'll find plenty of work to do elsewhere."
"Like where else?"
"Where do you want him to go?"
"Out? Work with me here..."
"I don't have time for this - he's got work to do and so do I"
"Yes but his work should be my work - and that's the way it works."
"I'm still lost. He's giving them the same thing you would - in fact exactly the same thing."
"Yes but with a major difference, mine is half the cost for the same value."
Neon thought about this for a moment. "That's impossible," he whispered.
"And in half the time, or less," Orpheus added.
"So where are you?"
"Everywhere and anywhere, but mostly in Chicago."
"No, I mean - forget it. Look, Orpheus, I'm uncomfortable with this conversation..."
BZZZZT! Orpheus sent an electric shock through the phone.
"Ow!" Neon dropped the phone, shook his hand and retrieved it from the floor.
"How's that for discomfort? Now get to it."
"Okay - look - I can't do this -"
"What - next thing you'll tell me is that someone becomes 'special' by doing an action flick with a runaway bus opposite Dennis Hopper and Sandra Bullock. Is that what I'm about to hear?"
Neon was silent. That one really hurt.
"Look, if you hang up the phone, a bunch of those same guys will lock you in a room, put fake flesh-colored putty in your mouth and then let an electronic shrimp-creature crawl into your navel -"
"Your only other way to escape is to jump out on a rickety scaffold in high winds - but there's an Infinity with the motor running waiting for you - 20 stories down."
"We're done here"
"We'll see...." Orpheus snorted, "Just wait until you meet the Mad Hatter".
"What is with you and this Alice In Wonderland thing?"
"Seeya, Alice - "
So Neon slowly walked back to his desk, not understanding just how competitive this business could be. The words of Orpheus rang in his mind as he flopped into his chair.
Charging more money for no better value? Less time and less money for the same result? Could that be possible? If so - the stakes were very high indeed. More importantly, the more that others knew that the value was the same, the less traction the competitor would have. The less traction, the less money, and they would implode. They would have a vested interest in shutting down and silencing anyone who got in their way - or who could reveal their secret.
Within moments, several members of the competitive firm appeared in Neon's cubicle, one of them holding a ball of putty. The following ordeal was horrific.
BEEP BEEP BEEP
Neon awoke from his dream state clutching his stomach. Electronic shrimp creature, indeed. It was all just a dream, wasn't it?
The phone rang, "Yes?" Neon asked. His head hurt and his mouth tasted like - putty.
"Infinity, is that you?"
"Yes, You must meet me downstairs, I have a statement of work for you, and a new contract"
"Does it involve surfing?"
"No, it - "
"Time travel? I don't want to do another time travel - "
"Shut up and meet me downstairs - "
Neon met Infinity at the coffee shop on the first floor. They stood in line, made small talk, ordered their respective cups of coffee and found a table toward the rear. Neon broke the ice, "Whatcha got?"
Infinity took a light sip of her brew and smiled. "There's something you need to know," she said, "that thing about the shrimp creature may have felt like a dream, but just in case, eat some creole tonight to wash it all down. Can't be too careful."
Neon just stared at her. What a strange person.
"Someone wants to meet you," Infinity said.
Neon felt a light touch as he looked up and heard "Hello, Neon," - it was a voice he instantly recognized -
"Yes, and it's a pleasure to meet you, Neon," he smiled, "I've been waiting a long time." He pulled up a chair next to Infinity, who simply smiled, rose, and departed without a word.
"I have two contracts here," Orpheus said, pulling out the portfolios and laying them side by side. "They are both very difficult projects, with especially difficult managers in charge of them. Both of them are a real pill."
Neon scanned the first several pages of each statement of work.
"These two projects, they are very similar, yes?" Orpheus asked.
"Practically identical, except for one thing," Neon flipped to the back pages of each - "The Blue project says it will take 10 months with 10 people. Yet the second one, for Mr. Red, will take 5 months and 5 people. How is this possible?"
"Which one do you believe?" Orpheus asked, "The Red or Blue?"
"Do you believe in fate, Neon?"
"I know precisely what you mean. Why not?"
"Because I am the master of my destiny."
"Is that really true? Think of the project you are on now - are you its master, or its servant?"
"Well, I - "
"Do you answer support calls in the middle of the night? Is the environment unstable? Do you feel the wheels could fly off at any moment?"
"Well, sometimes but - "
"And this feels like control to you? It sounds like the warehouse is your master, and you are its servant. You are not in control - your fate is tied to the health of the fact tables and the Service Level Agreement."
Neon sank in his chair, "go on."
"There is a question in your mind, Neon," Orpheus said ominously, "like a splinter - a question you cannot answer yourself. What is the question?"
"There is only one question," Neon affirmed, "and only one way to solve the same problem in half the time - without killing myself."
"What is the question?"
"How do I effectively apply an adaptive approach?"
"Almost - "
"Okay, how do I effectively apply an adaptive approach with PureData For Analytics?"
"Bingo," Orpheus laughed, "Next, we'll go up on the rooftop and jump from building to building."
Neon's eyes widened, "what?"
"PSYCHE!" Orpheus pointed his finger, "are you always this gullible? No - but you're right. The only way to produce the same quality in the half the time and half the people - creating a more stable and resilient environemnt - is with an adaptive model."
Neon glared. What a strange person.
"So which will it be?" Orpheus added, "the Red or the Blue?"
Neon held both projects in his hands, "If I pick the Blue, I get more revenue and put more people to work. If I pick Red I get less revenue and less people working."
"Yes, yes," Orpheus was excited - Neon was catching on.
"So as a consultant I am predisposed to pick the Blue project."
"But as a master of destiny you are predisposed to pick the project that will lead to mastery of the data warehouse, rather than being its servant."
Neon stared at the two projects, then took a big gulp of coffee.
"The other approaches - we'll call them application-centric - are a dream world," Orpheus continued, "designed to control and master the architect and all his developers. It turns a good developer into this," he held up a tiny cage with a rock inside it.
"A pet rock?"
"It was the best I could do on short notice - work with me. It means you are the servant of the warehouse, instead of its master. The warehouse should be in the cage, not you. Take the Blue project, and you will remain a servant."
"And the Red project?"
"I'll show you how deep the rabbit hole goes," Orpheus grinned.
"Again with the Alice in Wonder -"
"Okay, Okay, I'll show you how the adaptive approach builds automatic resilience into the design."
"Red or Blue, my choice?"
"Free will, Neon."
"No, that's a line from another script I'm looking at."
"Oh, okay, well then - you are free to choose. And remember," Orpheus leaned close and whispered, "I am only offering you productivity." He finished his coffee, "and the opportunity to be the master of your destiny."
Neon downed his coffee also, snatched up the Red folder and said, "I'm in."
"Good," said Orpheus, "follow me please."
They made their way to the street where a stretch limousine awaited. As they approached, Neon noted his distorted image on the vehicle's smoked glass window.
Once inside the vehicle, the driver said "Hello, Neon."
"He knows me," Neon quizzed Orpheus.
"Zephyr is my best architect," Orpheus laughed, "he knows the adaptive approach better than anyone."
"Stick with me Dorothy," said Zephyr, "and Kansas will be bye-bye."
As the car moved forward, it meandered through traffic, passing a black van, and Neon could see his reflection again in the car window glass - from the inside. It was crystal clear now. "Can you see that?" He said.
"The reflection?" Orpheus smiled, "depends on which side of the glass you are on. From the outside, it's distorted - but on the inside it's a clear picture." He leaned forward to Neon, "More to the point - the image on the outside was of a servant. Now its an adaptive architect. Things will be clearer every time you view the looking glass."
"Again with the Alice in - "
"Okay, okay - I'll put it like this: One day you'll look in the mirror and see what you've become - an adaptive architect - and you'll no longer be a servant."
Neon was fixated on the window.
"CopperTop?" Orpheus asked, handing him a battery for his beeper.
"Energizer," Neon said flatly, "Rabbits are a theme with me. It's personal."
Neon had followed the girl called White Rabbit to the rave party. Now standing against a wall, alone and his paranoia rising, he wondered what he could have been thinking. If she's the rabbit, then I'm in the rabbit hole. There's no telling how far down this goes.
A dark-haired athletic woman with a "don't try it" expression sauntered towards him and said, "You're Neon."
"How do you know that name?" Neon crossed his arms and glanced around, feeling exposed.
"I'm Infinity," she said with a curt smile.
"The one that cracked the NSA mainframe?"
"That's the one."
"I thought you were a guy."
"I get that a lot," she moved closer, and closer still, then leaned into his face and said with a sultry tone, her breath against his cheek, "Why are you here, Neon?"
"I'm not sure."
"Of course you are," she whispered into his ear, "You want the answer to the question."
"Do you know the question?"
"What the heck is a zone map?"
BEEP BEEP BEEP the alarm clock awoke Neon from his restless nightmare.
- excerpt from The Waitrix
I fell asleep dreaming, even as I hit the pillow, glad to be free of that whining analyst on the third floor. Every time he sends me a report, it turns out to be worse than wrong. It's too wrong to be worth the paper it's printed on, yet his supervisors say he's the next Einstein. When I ask him personally about the paper's details, he just scratches his head. When I ask about his conclusions, he scratches it again. Whenever other people ask him for things, he just bobs his head like everything is under control. So between bobbing his head and scratching it, I nickname him Bob Scratchit. Many can appreciate the irony.
But I had acted on his analysis. Woe came to me in waves for many weeks. Nobody wanted to sit with me in the lunch room, return my phone calls or give me access to the data necessary to see what-went-wrong. I became a scourge and a pariah. I blame Scratchit. He started it. My partner had also listened to and acted on Scratchit's advice. He was practically perp-walked out the door and doesn't look particularly good in an orange jumpsuit. I need to watch myself, because Scratchit always has an agenda, and even if his conclusions are wrong, I get the impression that he's toying with me. And that's kinda scary.
Later that evening I startled awake as my laptop screen flashed from the other side of the room. I rose to inquire, and on the screen was the smiling face of my partner Jake, his carefully coiffed Bob Marley locks draping his shoulders. He hadn't changed much, but looked a little tired, and I wondered how he was able to find me and impose himself on my evening, like a ghost from my workplace.
"Knock, knock!" Jake said with a grin, "Anyone home?"
I touched the camera-On button, "Hi Jake, yes I'm here."
"Ahh, good to see you," he said, "Knees'er shakin'?"
"I can tell your knees-er shakin'. Don't be scared."
"Not scared. Just annoyed. It's late, can't this wait?"
"You haven't changed. Still keep puttin' things off 'till later. That's gonna change, my friend."
"I'm sending you three WebEx appointments."
"What are they about?"
"You'll see," Jake said mysteriously. "Come here every night by 8pm for the appointments." He
seemed to pause, pain written on his face, "Don't put it off, and don't take my word for it." he beeped out.
In that moment, the clock chimed for 8pm and startled me. The WebEx session fired up to reveal
the smiling face of-
"Jack Blaines," I said before he could get a word out.
"That's right, mate," the native Australian grinned broadly.
"But, I recall you from what, twenty years ago, when data warehousing was barely a concept and -"
"That's right mate, we were all part of the inception."
"You haven't aged a day - "
"Nope, and where I am, data warehousing hasn't moved an inch either. Here, we had to make the most of what we had."
"The most of data warehousing past, you mean."
"That's right, do more with less. In this time, we don't have any network bandwidth because, well, we don't have any networks. But we still have to do bulk processing and analytics."
"The hard way."
"Here's a takeaway - software only guides the hardware, but the hardware does the work. I'll bet that never changes even for you."
"Well, if I'm honest - "
Just then a series of pictures started flashing across the screen, of machines and technologies that had long ago entered the fossil graveyard of computing history.
"Look - gotta go - hard stop at midnight. see-ya."
"Wait, I have some more quest - "
But the WebEx evaporated, leaving only a screen with the usual icons. Why would Jake connect me with Blaines? Data warehousing of yesteryear wasn't going to help me.
As I drifted off to sleep, the screen fired-up again and another WebEx started. This time it was the smiling image of Hunter Hardwick, an aficionado of all-things-warehousing.
"Hell0 my friend," he called to me, "I'm so happy we could be together for this celebration." he then raised a glass filled with a bubbly drink.
"What are we celebrating?"
"A toast!" Hardwick ignored me, "A toast to the present-state of data warehousing."
"Hasn't data warehousing sort of - you know - been eclipsed by Big Data?"
"Oh not hardly! The Big Data technologies are still a valuable part of the warehouse, but the warehouse itself isn't going away."
"Wait a second, you're speaking of the warehouse as something larger, not just a data store."
"Well of course, the warehouse is a living environment, not unlike a brick-and-mortar warehouse, with workers, forklifts, loading docks, storage carrels and operational protocols to tie it all together."
"Okay, I see that, but why are you making a toast to data warehousing's present-state? Wouldn't you want to toast to something more profound?"
"What's not to like? Data warehousing appliances, fluid queries, various options for large-scale storage, big-ticket network bandwidth - the world is my oyster. Glasses up!"
Something overcame me and my eyes grew heavy. As much as I wanted to celebrate the present, my body could not go on. I must have slept for hours before the computer screen fired-up again and started flashing. The third WebEx foretold by my former partner Jake, now attempted to materialize on the display.
"Are you there?" I heard my own voice, am I still dreaming? "Is this thing on?"
"Yes I'm here," I said groggily, wondering if this was a recording of me.
"That's what the bad comedians say in Vegas, you know," my voice jested.
"Is this thing on?" my voice chuckled, "Hey, I'm calling you from your future," but the signal was breaking up severely, like a bad cell-phone connection, and the image on the screen was full of static. I could see a silhouette of what might be an image of me, but too fuzzy to make out.
"You mean, I'm calling me from my future," I corrected, trying to gather my senses.
"Whatever," my voice laughed, "You're gonna love it here."
"In the future?"
"Sure, you have a lot to look forward to."
"Tell me more," I sat up.
"Well, networks here are all wireless, with practically unlimited bandwidth."
"Uh, no kidding -?"
"Absolutely, With wireless networks, everything is infinitely elastic. I can transfer terabytes in micro-seconds. Wires are pretty much a thing of the past."
"Well, except for power I suppose."
"Nope, wireless power too. Tesla finally won the argument. Changed the entire architecture of data centers. Now we don't have raised floors, we have raised machines. We can use the horizontal and vertical space in the building. IBM's Websphere machine really is a sphere now, most machines now use their entire outer-cabinet real-estate for something. No more boxes and racks."
"And storage is all in the cloud, nobody has real data centers hosting storage anymore."
"So centralized data centers all over the world?"
"No, as storage became denser and faster, we can put a petabyte on a flash stick. No big deal."
"To your point though, with the use of quantum entanglement and basically tossing out all of Einstein's constraining math, we can transfer data instantly. Storage isn't really a priority anymore."
"Pretty cool," I grinned, "But unlimited power, how -"
"No gas-powered cars anymore either. They all run on electromagnetic quantum engines. Basically generate their own fuel by capturing particles that leap-in-and-out of existence."
"That's - well, incredible."
"Yeah, when I go to work, I fly-by-wireless in my hovercar. No need to refuel - and it gets all its power wirelessly too," my voice yawned, "So yesterday."
"How far away is work?"
"I still live in Texas and can be at our Colorado center within minutes."
"But that's - "
"And I visit our site at Maria Tranquility within the hour."
"Funny, naming a site after a place on the Moon."
"You named a site after a location on the Moon."
"That's because it's actually on the Moon."
"Oh, no way!"
"YES way, dude. Dozens of shuttles and private craft leave every hour. Our biggest problem now is traffic control to the lunar surface."
"And data warehousing, well, it has undergone several facelifts. What was impossible in your time is literally yesterday's news."
"That's a pretty big claim."
"The boast of data warehousing's future was for real," my voice said, "actually, the folks in your time thought they were being visionary, but actually underplayed, or should I say, underestimated the innovators."
"What are your saying?"
"Their predictions came true," my voice said, "but fell short. Their boast of data warehousing's future fell radically short of what actually came to be."
"Like I said, you have a lot to look forward to. But when you think about it, even with predictive analytics, if you can see the future, or even accurately predict it, you've already changed it."
"If you know the outcome, doesn't it affect your behavior? If a corporate exec is very sure of the outcome, won't this affect how he markets his products, or would be just market them the same-old-way because its always been that way? And once he makes this change, it can radically affect the outcome of the prediction, with one problem - the future's not here yet so we won't know if we were right. And if it turns out wrong, we can't go back and replay it because we were the ones who changed the future by acting on it."
"So you're saying that if I can see the future, it's already in my past?"
"Something like that."
"I'd like to think - "
"Stop just thinking about it and start knowing it. We know that if someone tells us that a certain lottery number is a winner, won't we go out and buy a ticket when we wouldn't before? And when that corporate exec unleashes his campaign upon the masses, won't it affect the customer behavior? His predictive model cannot tell anyone what will happen when he acts on the predictions."
"You're right, it's not a problem here. Not yet anyhow."
"Look," my voice said, holding a square object against his silhouette, "Can you see this?"
"No, it's too fuzzy."
"It's a book," the voice said, "Changes a lot of things."
"Who's the author?"
"Someone you know," my voice said mysteriously.
"When will it arrive on bookshelves?" I implored, "And what's the title? I'd like to get a copy."
"That's kind of up to you, But there are people nipping at your heels even now. Those ideas you had for the future really work. Don't let someone else publish them before you do."
"Wait, what are you saying? Is that my book?"
"It could have been. You trained the author on everything in it. You just delayed publication because you thought you had more time."
"But I don't, you're saying."
"These ideas are already in your head, and this author was over a decade late in delivering them," my voice paused, "Why don't you just deliver them now?"
"Where do I start?"
"Start where you would start," my voice was breaking up even more, "Good luck."
The image faded and my voice trailed off. Now the screen was filled with static. I stared at it for long moments, trying to assimilate what I had just experienced.
The most of data warehousing's past, the toast of its present, and the boast of its future, were these just a dream, or should I -
The phone rang.
"Hello, this is Jim Stone, how are you today?" said the pleasant voice of the robo-caller.
I know that if I say anything except "fine, how are you?" the robot will politely say "Goodbye" and hang up.
So I scream into the receiver, "They have me trapped in a box filled with scorpions! You have to help me!"
"Goodbye," it said pleasantly, and disconnected.
BEEP BEEP BEEP
The alarm clock woke me up.
Modified on by DavidBirmingham
How many logical modelers does it take to screw in a lightbulb? None, it's a physical problem.
I watched in dismay as the DBAs, developers and query analysts threw queries at the machine like darts. They would formulate the query one way, tune it, reformulate it. Some forms were slightly faster but they needed orders-of-magnitude more power. Nothing seemed to get it off the ground.
Tufts of hair flew over the cubicle walls as seasoned technologists would first yelp softly when their stab-at-improvement didn't work, then yelp loudly as they pulled-out more hair. Yes, they were pulling-their-hair-out trying to solve the problem the wrong way. Of course, some of us got bald the old-fashioned way: general stress.
I took the original query as they had first written it, added two minor suggestions that were irrelevant to the functional answer to the query, and the operation boosted into the stratosphere.
What they saw however, was how I manipulated the query logic, not how I manipulated the physics. I can ask for a show of hands in any room of technologists and get mixed answers on the question "where is the seat of power?". Some will say software, others hardware, others a mix of the two, while those who still adhere to the musings of Theodoric of York, Medieval Data Scientist, would say that there's a small toad or dwarf living in the heart of the machine.
To make it even more abstract, users sit in a chair in physical reality. They live by physical clocks in that same reality, and "speed of thought" analytics is only enabled when we can leverage physics to the point of creating a time-dilation illusion: where only seconds have passed in the analyst's world while many hours have passed in reality. After all, when an analyst is immersed in the flow of a true speed-of-thought experience, they will hit the submit-key as fast as their fingers can touch the keyboard, synthesize answers, rinse and repeat. And do this for many hours seemingly without blinking. If the machine has a hiccup of some kind, or is slow to return, the illusion-bubble is popped and they re-enter the second-per-second reality that the rest of us live in. Perhaps their hair is a little grayer, their eyes a little more dilated, but they will swear that they have unmasked the Will-O-The-Wisp and are about to announce the Next Big Breakthrough. Anticipation is high.
But for those who don't have this speed-of-thought experience, chained to a stodgy commodity-technology, they will never know the true meaning of "wheels-up" in the analytics sense. They will never achieve this time-dilated immersion experience. The clock on the wall marks true real-time for them, and it is maddening.
Notice the allusions to physics rather than logic. We don't doubt that the analyst has the logic down-cold. But logic will not dilate time. Only physics can do that. Emmett Brown mastered it with a flux capacitor. We don't need to actually jump time, but a little dilation never hurt anybody.
The chief factor in query-turnaround within the Netezza machine is the way in which the logical structures have been physically implemented. We can have the same logical structure, same physical content, with wildly different physical implementations. The "distribute on" and "organize on" govern this factor through co-location at two levels: Co-location of multiple tables on a common dataslice and co-location (on a given dataslice) of commonly-keyed records on as few disk pages as possible (zone maps). The table can contain the same logical and physical content, but its implementation on Netezza physics can be radically different based on these two factors.
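To make the "organize on" half of this concrete, here is a toy sketch in Python (not Netezza code, and not how the machine is implemented). It assumes a zone map is nothing more than a min/max pair kept per disk page, with a made-up page size of 100 rows. The names build_zone_map and pages_to_scan are hypothetical, invented purely for illustration.

```python
from typing import List, Tuple

PAGE_SIZE = 100  # rows per page; an arbitrary figure for illustration


def build_zone_map(values: List[int], page_size: int = PAGE_SIZE) -> List[Tuple[int, int]]:
    """Return the (min, max) of each page of consecutive values."""
    return [
        (min(values[i:i + page_size]), max(values[i:i + page_size]))
        for i in range(0, len(values), page_size)
    ]


def pages_to_scan(zone_map: List[Tuple[int, int]], low: int, high: int) -> int:
    """Count the pages whose min/max range overlaps the filter [low, high]."""
    return sum(1 for page_min, page_max in zone_map if page_min <= high and page_max >= low)


# 10,000 rows carrying a date-like key 0..99, with 100 rows per key value.
# "Organized": rows with the same key sit together on the same page.
keys_organized = sorted(k for k in range(100) for _ in range(100))
# Same rows, but interleaved so that every page holds every key value.
keys_scattered = [k for _ in range(100) for k in range(100)]

organized_pages = pages_to_scan(build_zone_map(keys_organized), 42, 42)
scattered_pages = pages_to_scan(build_zone_map(keys_scattered), 42, 42)

print("pages read, organized layout:", organized_pages)
print("pages read, scattered layout:", scattered_pages)
```

Both lists hold exactly the same values, so the logical content is identical. The only difference is where the rows physically land, and in this toy that alone decides whether a filter on one key value reads a single page or every page.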
Take for example the case of a large-scale fund manager with thousands of instruments in his portfolio. As his business grows, he crosses the one-million mark, then two million. His analytics engine creates two million results for each analytics run, with dozens of analytics-runs every day, adding up quickly to billions of records in constant churn. His tables are distributed on his Instrument_ID, because in his world, everything is Instrument-centric. All of the operations for loading, assimilating and integrating data centers upon the Instrument and nothing else. They are organized on portfolio_date, because the portfolio's activity governs his operations.
His business-side counterpart on the other hand, sells products based on the portfolio. The products can serve to increase the portfolio, but many of the products are analytics-results from the portfolio itself. This is a product-centric view of the data. Everything about it requires the fact-table and supporting tables to be distributed on the Product_id plus being organized on the product-centric transaction_date. This aligns the logical content of the tables to the physics of the machine. It also aligns the contents of the tables with the intended query path of the user-base. One of them will enter-and-navigate via the Instrument, where the other will use the Product.
We can predict the product manager's conversation with the DBAs:
PM: "I need a version of the primary fact table in my reporting database."
DB: "You mean a different version of the fact? Like different columns?"
PM: "No, the same columns, same content."
DB: "Then use the one we have. We're not copying a hundred billion rows just so you can keep your own version."
PM: "Well, it's logically the same but the physical implementation is completely different."
DB: "Oh really? You mean that instead of doing an insert-into-select-from, we'll move the data over by carrier pigeon?"
PM: (staring) "No, the new table has a different distribution and organize, so it's a completely different physical table than the original"
DB: "You're just splitting semantic hairs with me. Data is data."
I watched this conversation from a healthy distance for nearly fourteen months before the DBA acquiesced and installed the necessary table. Prior to this, the PM had been required to manufacture summary tables and an assortment of other supporting tables in lieu of the necessary fact-table. The user experience suffered immensely during this interval, many of them openly questioning whether the Netezza acquisition had been a wise choice. But once the new distribution/organize was installed, user-queries that had been running in 30 minutes now ran in 3 seconds. Queries that had taken five minutes were now sub-second. Where before only five users at a time could be hitting the machine, now twenty or more enjoyed a stellar experience.
How does making a copy of the same data make such a difference? Because it's not really a copy. When we think "copy" we think "photocopy", that is "identical". A DBA will rarely imagine that using a different distribution and organize will create a version of the table that is, in physical terms, radically different from its counterpart. They see the table logically, just a reference in the catalog.
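A rough way to see why the "copy" is not a photocopy is to simulate it. The Python sketch below is only an illustration under loud assumptions: eight dataslices, a simple modulo standing in for whatever hash the machine really uses, and invented instrument and product keys. The helper names (slice_for, place, local_fraction) are hypothetical.

```python
N_SLICES = 8  # hypothetical dataslice count


def slice_for(key: int, n_slices: int = N_SLICES) -> int:
    """Stand-in for the hash that assigns a row to a dataslice."""
    return key % n_slices


# Rows are (instrument_id, product_id): 1,000 instruments spread over 50 products.
rows = [(i, i // 20) for i in range(1000)]


def place(table_rows, dist_col):
    """Group rows by dataslice according to the chosen distribution column."""
    slices = {s: [] for s in range(N_SLICES)}
    for row in table_rows:
        slices[slice_for(row[dist_col])].append(row)
    return slices


def local_fraction(slices, join_col):
    """Share of rows already on the slice where a table distributed on
    join_col would keep the matching key (no data movement needed)."""
    total = sum(len(v) for v in slices.values())
    local = sum(
        1 for s, v in slices.items() for row in v if slice_for(row[join_col]) == s
    )
    return local / total


by_instrument = place(rows, 0)  # the original, Instrument-centric table
by_product = place(rows, 1)     # the "copy", distributed on the Product key

instrument_copy_local = local_fraction(by_instrument, 1)
product_copy_local = local_fraction(by_product, 1)

print("product join, instrument-distributed table:", instrument_copy_local)
print("product join, product-distributed table:   ", product_copy_local)
```

Same rows, same columns, same content. In the product-distributed version every row is already sitting where a product-keyed join needs it, while in the instrument-distributed version only a small fraction is, and the rest would have to travel between slices.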
The physics of the Netezza machine is unleashed with logical data structures that are configured to leverage the physical power of the machine. Moreover, the physical implementation must be in synergy (and harmony) with how the users intend to consume them with logical queries. In the above case, the Instrument-centric consumer had a great experience because the tables were physically configured in a manner that logically dovetailed with his query-logic intentions. The Product-centric manager however, had a less-than-stellar experience because that same table had not been physically configured to logically dovetail with his query-logic intentions. The DBA had basically misunderstood that the Netezza high-performance experience rests on the synergy between the logical queries and physical data structures.
In short, each of these managers required a purpose-built form of the data. The DBA thinks in terms of general-purpose and reuse. To him, capacity-planning is about preserving disk storage. He would never imagine that sacrificing disk storage (to build the new table) would translate into an increase in the throughput of the machine. So while the DBA is already thinking in physical terms, he believes that users only think in logical terms. Physics has always been the DBA's problem. Who do those "logical" users think they are, coming to the DBA to offer up a lecture on physics?
In this regard, what if the DBA had built-out the new table but the PM's staff had not included the new distribution key in their query? Or did not leverage the optimized zone-maps as determined by the Organize-On? The result would be the same as before: a logical query that is not leveraging the physics of the machine. At this point, adding the distribution key to the query, or adding filter attributes, is not "tuning" but "debugging". Once those are in place, we don't have to "tune" anything else. Or rather, if the data structures are right, no query tuning is necessary. If the data structures are wrong, no query tuning will matter.
And this is why the aforementioned aficionados were losing their hair. They truly believed that the tried-and-true methods for query-tuning in an Oracle/SQLServer machine would be similar in Netezza. Alas - they are not.
What does all of this mean? When a logical query is submitted to the machine, it cannot manufacture power. It can only leverage or activate the power that was pre-configured for its use. This is why "query-tuning" doesn't work so well with Netezza. I once suggested "query tuning in Netezza is like using a steering wheel to make a car go faster." The actual power is under the hood, not in the user's hands. While the user can leverage it the wrong way, the user cannot, through business-logic queries, make the machine boost by orders-of-magnitude.
Where does the developer/user/analyst need to apply their labor? They already know how they want to navigate the data, so they need to work toward a purpose-built physical implementation, using a logical model to describe and enable it. Notice the reversal of roles: the traditional method is to use a logical model to "physicalize" a database. This is because in a commodity platform (and a load-balancing engine) the physics is all horizontal and shared-everything. We can affect the query turnaround time using logical query statements because we can use hints and such to tell the database how to behave on-demand.
We cannot tell Netezza how to "physically" behave on-demand in the same way. We can use logical query statements to leverage the physics as we have pre-configured it, but if the statement uses the tables outside of their pre-configured physics, the user will not experience the same capacity or turnaround no matter how they reconfigure or re-state the query logic.
All of this makes a case for purpose-built data models leading to purpose-built physical models, and the rejection of general-purpose data models in the Netezza machine. After all, it's a purpose-built machine quite unlike its general purpose, commodity counterparts in the marketplace. In those machines (e.g. Oracle, SQLServer) we have to engineer a purpose-built model (such as a star-schema) to overcome the physical limitations of the general-purpose platform. Why then would we move away from the general-purpose machine into a purpose-built machine, and attempt to embrace a general-purpose data model?
Could it be that the average Netezza user believes that the power of the machine gives it a magical ability to enable a general-purpose model in the way that the general-purpose machines could not? Ever see a third-normal-form model being used for reporting in a general-purpose machine? It's so ugly that they run-don't-walk toward engineering a purpose-built model, photocopying data from the general-purpose contents into the purpose-built form. No, the power of the Netezza machine doesn't give it magical abilities to overcome this problem. A third-normal form model doesn't work better in Netezza than a purpose-built model.
Enter the new solution aficionado who wants their solution to run as fast as the existing solution. They will be told, almost in reflex by the DBA, that they have to make their solution work with the existing structures, even though they don't leverage physics in the way the new solution will need it. And this is the time to make a case for another purpose-built model. One that faces the new user-base with tables that are physically optimized to support that user base. Will all tables have to come over? Of course not. Will all of the data of the existing fact table(s) have to come over? Usually not, which is the silver lining of the approach.
But think about this: The tables in Netezza are 4x compressed already. If we make another physical copy of the table, itself being 4x compressed, the data is still (on aggregate) 2x compressed across the two tables. That is, the data is doubled at 4x compression, so it only uses the same amount of space as the original table would have if it were only 2x compressed. In this perspective, it's still ahead of the storage-capacity curve. And in having their physics face the user base, we preserve machine capacity as well.
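The arithmetic is simple enough to check in a few lines of Python. This is just a sketch: the 4x figure is the one quoted above, the 100-unit raw size is an arbitrary assumption, and effective_compression is a hypothetical helper name.

```python
def effective_compression(raw_size: float, compression_ratio: float, copies: int) -> float:
    """Raw size of the data divided by the disk actually consumed
    when we keep `copies` compressed versions of it."""
    stored = copies * raw_size / compression_ratio
    return raw_size / stored


one_copy = effective_compression(100.0, 4.0, 1)    # the original table alone
two_copies = effective_compression(100.0, 4.0, 2)  # original plus the purpose-built version

print("one copy:  ", one_copy, "x")
print("two copies:", two_copies, "x")
```

Two copies at 4x occupy the same disk as a single copy at 2x, which is the point: the second physical form costs storage, but it does not put us behind where an uncompressed platform would already be.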
This is perhaps the one most-misunderstood tradeoff of requiring multiple user bases to use the same tables even though their physical form only supports one of the user-bases. And that is simply this: When we kick off queries that don't leverage the physics, we scan more data on the dataslices and we broadcast more data between the CPUs. This effectively drags the queries down and saturates the machine. The query drag might be something only experienced by the one-off user base, but left to itself the machine capacity saturation will affect all users including the ones using the primary distribution. Everyone suffers, and all for the preservation of some disk space. Trust me, if there is a question between happy and productive users versus burning some extra disk space, it's not a hard decision. Preserving disk storage in the heat of unhappy users is a bad tradeoff.
Or to make an absurd analogy, let's say we show up for work on day one, have a great day of onboarding and when we leave, we notice that our car is a little "whiney". We take it to a shop, where the mechanic tells us that someone has locked the car in first gear and he can't fix it. We casually make this complaint the next day (it took us a little longer to get to work).
DBA: Oh sure, all new folks have their car put in first gear. It's a requirement.
US: (stunned) What the?
DBA: Well, if you had been here first, you could keep all the gears, but everyone we've added since then has to be in first gear for everything.
DBA: Yes, first-gear for your car, your development machine, even your career ladder. About the only thing that they don't put in first-gear are your work hours. Those are unlimited.
US: That's outrageous!
DBA: We can't give everyone all the gears they want. It's just not scalable.
The problem with working with tables that aren't physically configured as-we-intend-to-use-them is that using them will cause the machine to work much harder than it has to. Not only will our queries be slower, we can't run as many of them. And while we're running, those folks with high-gear solutions in the same machine will start to look like first-gear stuff too. The inefficiency of our work will steal energy from everyone. We cannot pretend that the machine has unlimited capacity. If our solution eats up a big portion of the capacity then there's less capacity for everyone else. Even if we use workload management, whatever we throttle the poorly-leveraged solution into will only make it worse, because if a first-gear solution needs anything, it most certainly uses more capacity than it would normally require.
Energy-loss is the real cost of a poor physical implementation. All solutions start out with a certain capacity limit (same number of CPUs, memory, disk storage) and it is important that we balance these factors to give the users the best possible experience. Throttling CPUs or disk space, or refusing to give up disk space merely to preserve disk capacity, only forestalls the inevitable. The solution's structures must be aligned with machine physics and the queries must be configured to leverage that physics.
The depiction (above) describes how the modeler's world (a logical world) in no wise intersects with the physical world, yet the physical world is what will drive the user's performance experience. The high-intensity physics of Netezza is not just something we "get for free", it is a resource we need to deliberately leverage and fiercely protect.
In the above, the "Logical data structure" is applied to the machine catalog (using query logic to create it). But once created, it doesn't have any content, so we will use more logic to push data into the physical data structure. The true test of this pairing is when we attempt to apply a logical query (top) and it uses the data structure logic/physics to access the machine physics (bottom). Can we now see why a logical query, all on its own, cannot manufacture or manipulate power? It is the physical data structure working in synergy with the logical query that unleashes Netezza's power. And this is why some discussions with modelers may require deeper communication about the necessity to leverage the deep-metal physics while we honor and protect the machine's power.
Modified on by DavidBirmingham
For the past many months I have been diligently updating and upgrading the original 2008 Netezza Underground to address the many features of TwinFin, Striper and other offerings from IBM. I have recently been notified that it has passed final edit and is available on Amazon.com.
All I can say is "whew!" and many thanks to those who helped put it together. It's been a whirlwind.
Here is the URL
When I started the project I realized that a big part of the original book remains timeless. I didn't leave it "as is" though - practically every page and all the chapters have new material, case studies and such. I peppered the book with some additional graphics since the intrinsic points require a bit more reinforcement than mere words can provide.
The original chapter on "Distribution Stuff" is now "Performance Stuff" and is twice as long, covering the various aspects of setting up tables, troubleshooting, page-level zone maps and a lot more.
Fortunately, this time around there is a better mechanism to contact me if you have questions or want to report any errata (hey, it could happen!) - you can reach me through this blog, on linked-in or directly through my primary email address at Brightlight Consulting:
Modified on by DavidBirmingham
In a Netezza shop experiencing some performance stress with their machine, we ask the usual questions as to the machine's configuration, its functional mission. Ultimately we pop-the-hood to find that the data structures and the queries are not in harmony. For starters, the structures don't look like Netezza structures, at least, not optimized for Netezza. We receive feedback that they "just" moved the data from their former (favorite technology here) and ran-with-it. They received the usual 10x boost as a door-prize and thought they were done. Lurking in their solution however, were latent inefficiencies that were causing the machine to work 10x to 20x harder to achieve the same outcome. And their queries were likewise 20x inefficient in how they leveraged the data structures.
More unfortunately, the power of the machine was masking this inefficiency. It's like the old adage, when a person first starts day-trading on the stock exchange, the worst thing that can happen to them is that they are successful. Why? It puts a false sense of security in their minds that gives them permission to take risks they would never take if they knew the real rules of the game. The 10x-boost for moving the data over is a "for-free" door prize, not the go-to configuration.
What are the real rules of the Netezza game? The first rule is that extraordinary power masks sloppy work. Netezza can make an ugly duckling look like a swan without actually being one. It can make an ugly model into a supermodel without the necessary adult beverages to assist the transformation. It can make sloppy queries look like something even Mary Poppins would approve of, practically perfect in every way and all that.
What's lurking under-the-hood is nothing short of a parasitic relationship between the model, the queries and the machine. We received the 10x boost door-prize and think we have succeeded. But we have only succeeded in instantiating the model and its data into the machine. We have not succeeded in leveraging the entire machine. And, uh, we paid for the entire machine. So why aren't we using it?
In our old environment, the index structures worked behind-the-scenes, transparently assisting each join. Our BI environment is set up to leverage those joins so we get good response times. The Netezza machine has no indexes so the BI queries (whether we want to admit it or not) are improperly structured to take advantage of the machine's physics.
"But that's how we've always done it..." or "But we don't do it that way..."
The short version, the former solution and (favorite technology here) is casting a long shadow across the raised floor onto the Netezza machine. People are forklifting "what they know" to the new machine when very little of it applies.
For example, in a star schema, the index structures are the primary performance center. The query will filter the dimensions first, gather indexes from the participating dimensions and then use these to attack the fact table. The engine does all this transparently. The result is a fast turnaround born of index-level performance. These are software-powered constructs in a general-purpose engine. The original concept of the star-schema was born of the necessity for a model that could overcome the performance weaknesses of its host platform. It is in fact an answer to the lack-of-power of commodity platforms. In short, just by configuring and loading a star schema on a commodity platform, we get boost from using it over a more common 3NF schema.
The Netezza machine doesn't have indexes. So the common understanding of how a star-schema works doesn't apply. At all. Don't get me wrong, the star-schema has a lot of functional elegance and utility. It does not however, inherently provide any form of performance boost for queries using it. It can simplify the consumer experience and certainly ease maintenance, but it is not inherently more performant than any other model. In fact, using such a model by default could hinder performance.
Why is this?
The primary performance boosters in Netezza are the distribution and the zone map. Where the distribution and co-location preserve resources so that more queries can run simultaneously with high throughput, zone maps boost query turnaround time. They work in synergy to increase overall throughput of the machine. How does installing a star-schema inherently optimize such things? It doesn't.
Can we use a star-schema? Sure, and we should also commit to distributing the fact table on the same key as the most-active or largest dimension (they are often one-and-the-same). This will preserve concurrency for the largest majority of queries. A better approach however, is to specifically formulate a useful dimensional model that leverages the same distribution key for all participating tables. Common star-schemas do not do this by default, and if only two tables are distributed on the same key, all other joins to the other tables will be less performant. They will have to "broadcast" the dimensional data to the fact table. Clearly having all tables distributed on the same key will preserve concurrency, but this doesn't give us the monster-boost we're looking for. Distribution might get us up to 2x past the door-prize performance we get from moving to the machine. Zone maps are notorious for getting us 100x and 1000x boost.
At one site I watched as several analytic operations remanufactured the star-schema data into several other useful structures, each of which was distributed on a common key. At the end of the operation, these were joined in co-located manner and the final result came back orders-of-magnitude faster than the same query on the master tables. I asked where they had derived the key, and they explained that it was a composite key that they had reformulated into a single key because their dimensional tables could all be distributed on it and maintain the same logical relationship. Looking over the table structures, they had a "flavor" of a star schema but certainly not a purist star. The question remained, if the existing star schema wasn't useful to them but their reformulated structure was, why weren't they using the reformulated one as the primary model and ditch the old one? The answer was simple, in that the existing star was seen as a general-purpose model and not to be outfitted or tuned for a specific user group. This is one of the commodity/general-purpose lines-of-thought that must be buried before entering the Netezza realm.
This is the primary takeaway from all that: The way we make an underpowered machine work faster is to contrive a star schema that makes the indexes work hard. We forget that the star schema is a performance contrivance in this regard. If we attempt to move this model to the Netezza machine because "it's what we do" then we may experience performance difficulties rather than a boost. A common theme exists here: people do what they are knowledgeable of, what they are comfortable with, what they find easy-to-explain and do not naturally push-the-envelope for something more useful and performant.
In Netezza, the star schema has functional value but (configured wrong) is a performance liability. We can mitigate this problem by simply reformulating the star to align with the machine's physics, and by adapting our "purist" modeling practices to something more practical and adaptable. After all, many modeling practices are in place specifically because doing otherwise makes a traditional platform behave poorly. If we forklift those practices to Netezza, we participate in casting-the-long-shadow of an underperforming platform onto the Netezza machine.
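As a back-of-the-napkin aid, here is a small Python sketch that walks a star and flags which joins could be co-located. It is a deliberate simplification and an assumption on my part: it treats a join as co-located only when both tables are distributed on the join column, and it ignores everything else a real optimizer weighs. The table names, column names and the classify_joins helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Table:
    name: str
    distribute_on: str  # the column named in the table's "distribute on"


def classify_joins(fact: Table, joins: List[Tuple[Table, str]]) -> Dict[str, str]:
    """For each (dimension, join column) pair, report whether the join can
    stay on the dataslice or needs dimension data moved to the fact."""
    report = {}
    for dim, join_col in joins:
        colocated = fact.distribute_on == join_col and dim.distribute_on == join_col
        report[dim.name] = "co-located" if colocated else "broadcast/redistribute"
    return report


# A made-up star: the fact shares its key with the largest dimension only.
fact = Table("sales_fact", distribute_on="customer_id")
star = [
    (Table("customer_dim", distribute_on="customer_id"), "customer_id"),
    (Table("product_dim", distribute_on="product_id"), "product_id"),
    (Table("date_dim", distribute_on="date_id"), "date_id"),
]

report = classify_joins(fact, star)
for name, verdict in report.items():
    print(f"{name}: {verdict}")
```

In this made-up star only the customer join stays on the dataslice. Reformulating the model so that every participating table carries the same distribution key is what turns the other lines into "co-located" as well.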
We have enormous freedom in Netezza to shape the data the way we want to use it and make it consumption-ready both in content and performance. We should not move from a general-purpose platform (using a purpose-built model like a star) into a purpose-built platform with a general-purpose model like a star. The odd part is that the star is an anomaly in a load-balancing, traditional database, but is seen as purpose-built for that platform. Exactly the opposite is true in Netezza. The machine is purpose-built and the star is only another general-purpose model that doesn't work as well as a model that is purpose-built for Netezza physics and for user needs.
The worst thing we can do of course is think-outside-the-box (the Netezza box). We really need to think-inside-the-box and shape the data structures and queries to get what we want. This mitigates the long-shadows. It's just a matter of adapting traditional thinking into something practical for the Netezza machine.
Modified on by DavidBirmingham
About a year ago I engaged to assess a Netezza-centric data processing environment. They had used stored procedures to build-out their business processing inside the machine using SQL-transforms. As you know, I'm a big fan of the SQL-transforms approach, but I'm not a big fan of how they implemented it. Stored procedures for back-end processing are a bad idea on any platform. But even if they had done it without stored procs, the implementation was a "total miss". I mean, it could not have been more "off" if they'd done it outside the machine entirely.
I received word some months ago that while their shop remains a strong Netezza environment, for this particular application they intended to go in a different direction, with a different technology. This was unfortunate since I had told them exactly what the problem was - not the hardware but the way it had been deployed. But they were in denial! Forklifting their application onto the new machine, they attempted to tweak and tune it. They actually received marginally more lift at the outset, but then it rapidly degraded when more data started arriving. Now I'm in dialog with them to discuss "what went wrong".
What went wrong began, quite literally, many years prior.
It's like this: If rust starts to build in a water pipe, we won't know it until the water pressure starts to slowly degrade. Eventually it becomes a drip and then one day it's closed off altogether. We could attempt tracing it to a single cause, but it would only be another straw on the proverbial camel's back. What "really" happened was that we treated the machine in a sloppy manner. Or rather, we saw that it had incredible power but we weren't particularly good stewards of it. Netezza makes an ugly model look great, a good model look stellar, a marginal model look like a superstar, and can make the sloppiest query look like the most eligible bachelor in town. Power tends to make people starry-eyed.
Time and again we coach people on a migration. They say "Wow, we just went from a five-hour query on that Oracle machine to a five-minute query on the Netezza machine. Sold!" and they move everything over "as-is" from the other machine. Never mind that those old data structures were optimized for a different technology entirely and never mind that the data processes running against them were likewise optimized with the older structures in mind. They were both optimized in context of a machine that could not handle the load to begin with. They just didn't know it yet. Now standing in the Netezza machine's shadow, it's painfully obvious what shortcomings the old machine had. Not the least of which being a load-balancing transactional engine, which is always the wrong technology for anything using a SQL-transform.
The bottom line: what if just a little tuning of that five-minute query could make it a five-second query? What if we received a 10x boost moving data "as is" from the old machine, but if we had engaged a little data-tuning we could have received 100x? In short, how many "x" have we left on the cutting-room floor? Enzees have learned (some the hard way!) that performance in a Netezza machine is found in the hardware. This hardware has to actually arrive in massively parallel form, not marginally parallel form. So we know that that expecting production performance from the Emulator or the TwinFin-3 is a quixotic existence. This ultimately leads to two universal maxims:
We don't tune queries, we configure data structures. The data structures unlock hardware power.
We use queries to activate the data structures. In Netezza, "query-tuning" is a lot like using a steering wheel to make the car go faster. It just doesn't work that way.
This "additional boost" or "leftover power" is an important question, especially for the aforementioned Netezza-centric application. Even if we had kept the entire application in stored procedures, their implementation could not have been more wrong. They had, of course, outsourced the whole development project to a firm in a distant country, who had given them a marginal development team at best. This team proceeded to treat the Netezza machine like "any other database" and completely missed the performance optimizations that make one of these machines a source of legend.
What that team did, was pull two hundred million records from a master store and use this body as a working data set even though only twenty columns were being processed at any given point. Dragging over two-hundred columns (90 percent dead weight) through every processing query (many dozens of them), and without regarding distribution to manage co-located reads and co-located writes, turned a twenty-minute operation into a fifty-five hour operation. We showed them with a simple prototype how a given fifteen hour process could be reduced to four minutes. The point is, they were ridiculously inefficient in the use of the machine. Nobody in the leadership of the company would accept that they were as much as 20x inefficient.
A major "miss" is in believing that Netezza is a traditional database. It is not. It is an Asymmetric Massively Parallel Processor (AMPP) with a database "façade" on the front end. Anything resembling a "database" is purely for purposes of interacting with the outside world, "adapting" the MPP layer for common utility. This is why it is firmly positioned as an "appliance". If the internal workins' of the machine were directly exposed, it would cause more trouble than not. Interfacing to the MPP "as a database" is where the resemblance ends. This is the first mistake made by so many new users. They plug-in their favorite tools and such, all of which interface (superficially) just fine. Then they wonder why the machine doesn't do what they wanted. Or why they aren't experiencing the legendary performance.
When an inefficient model is first deployed, we could imagine that even in taking up 10x more machine resources than necessary, it still runs with extraordinary speed. But let's say we have 100 points of power available to us. The application requires at least five points of power but we are running at 50 points of power (10x inefficient). We still haven't breached any limits. As data grows and as we add functionality, this five points of power rises to 8 points, where we are now using 80 points of power in the machine. Wow, that 100 points of power is getting eaten up fast. We're not all that far away from breaching the max. But data always grows and functionality always rises, and one day we breach the 10th point of power. And at 10x inefficiency, we have finally hit the ceiling of the machine. It spontaneously starts running slow. All the time. Nobody can explain it.
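The arithmetic in that story can be sketched in a few lines. This is illustration only: the "points of power" are the made-up units from the paragraph above, and the function name and the constant are hypothetical.

```python
# Illustrative arithmetic only: "points of power" are made-up units,
# and machine_load / MACHINE_CAPACITY are hypothetical names.
MACHINE_CAPACITY = 100  # total points of power available


def machine_load(required_points, inefficiency):
    """Points actually consumed when the work costs `inefficiency` times what it should."""
    return required_points * inefficiency


for required in (5, 8, 10):
    wasteful = machine_load(required, inefficiency=10)
    tuned = machine_load(required, inefficiency=1)
    status = "ceiling hit" if wasteful >= MACHINE_CAPACITY else "still fine"
    print(f"need {required:>2} -> using {wasteful:>3} of {MACHINE_CAPACITY} "
          f"({status}); a tuned model would use {tuned}")
```

The same growth from five to ten points barely registers on the tuned model, which is why the slowdown seems to come out of nowhere on the wasteful one.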
The odd thing, is that the Netezza machine is a purpose-built appliance. Why then did we allow our people to migrate a schema optimized for a general-purpose machine into a purpose-built machine? Moreover, why did we continue to maintain that the so-called purpose-built model in the old machine was really a general-purpose model in disguise? Did we use general-purpose techniques? Why?
Did we load data into one set of structures expecting them to be the one-stop-shop for all consumers? A common mistake in Netezza-centric implementations is assuming that one data structure can serve many disparate constituents. The larger the data structure, the more we will need to configure the structure for a constituent's utility, and we may need redundant data stores to serve disparate needs. Is this reprehensible? Which is better: building summary tables, or just declaring another identical table with a different distribution/organize and then loading the two tables simultaneously? If we go the summary-table route, the cost is in maintaining the special rules to build them along with the latency penalty for their construction. It seems counter-intuitive to just re-manufacture a table, but the only cost is disk space. On these larger platforms, preserving disk space while the users are at-the-gates with spears and torches doesn't seem to be a good tradeoff.
The point: Don't waste an opportunity to build exactly the data model you need to support the user base. Don't settle for a contrived, purist, general-purpose model. If the modelers say "we don't do that", this is a sure sign that we're leaving something very special on the cutting-room-floor. It's a purpose-built machine, so create a purpose-built model, and like the Enzees say, "give the machine some love."
When capacity seems to be topping-off, with a few, simply-applied techniques we can easily recover that capacity. It's just very annoying that it's caught up with us and nobody can seem to explain why. It's because they are looking in the wrong place. If we had concerned ourselves with the mechanics of the machine and its primary power-brokers, the distribution and the zone-maps, and avoided the most significant sources of drain, like nested views and back-end stored procedures, we might be closer to resolution. If not, it may be that resolution would require rework or retrofit. In the above 50+ hour operation, the only answer would be to overhaul the working queries end-to-end. We wouldn't need to do much to the functional mechanics, just streamline the way the queries perform them.
What does this streamlining look like? Well, if we already knew that, we wouldn't be having any problems. We would have been streamlining all along and most of our capacity would still be well-preserved. People do it all the time.
Symptoms of a PDA system under stress:
The thing is slowing down and we haven't changed anything. It's slowing down and we're doing what we've always done.
It's the last application/table/load we implemented. Things went south from there. But then we backed out the application and it's still bad. I think we broke something. It must be the hardware.
More users are querying the machine than ever before. I think we've reached a hardware limit.
We could not get the base tables to return fast enough, so we built summary tables, but these are a pain to maintain and we don't like the latency in their readiness. We were told that we would never have to use these.
It's sort of funny how deterministic the machine is. "We've been doing the same thing all along" and now it's not working right? Perhaps we weren't doing the right thing all along, regardless of how consistent our "wrongness" was? This is how we know it's not a hardware problem. In fact, if our folks are blaming the hardware first, it's the first sign of denial that the implementation itself is flawed. If our people build contrived structures like summaries, it's a sure sign that the data model is flawed. It's also a sure sign that we're trying to implement a general-purpose schema rather than a purpose-built one. If our people spend a lot of time swarming around query-tuning, it's a sure sign that our data structures aren't ready for consumption. Nested views never have a good ending in Netezza.
At one site, the users had to link together eight or nine different tables to get a very common and repeatable result. If the users must have the deep-knowledge of these tables and their results are repeatable, we need to take their work and manufacture a set of core tables that require less joining and are more consumption-ready. Consolidation, denormalization and shrinking the total tables in the model are actually performance boosters. Why is that? The more tables are involved in the mix, the more we have to deliberately touch the disk through joining. If we have fewer tables and more data on larger tables, zone maps allow us to exclude whole swaths of the tables altogether. We stay off the disks because the zone maps are optimized for it.
It sure "seems" right to put everything into a third-normal-form and make the model "look like the business", but nobody is reporting on the business. They're analyzing, not organizing. We should be the ones organizing the data for analysis, not requiring the analysts to organize the data "on demand" through piling tables together in their queries.
"We're charter members," claimed Bjorn, a tech from across-the-pond, "been working with the stuff for ages."
"Same here," asserted Jack, a DBA from Kansas, "You'd be surprised how fast this thing scoots around in first gear."
"Exactly," said Bjorn, "Those who master the roads in first gear are actually better off. You run at a speed that gets you around but not fast enough that you accidentally run into things."
"So it's a safe speed?" asked the interviewer.
"Indeed," said the pair, almost in unison.
"So what happens when you take it for a spin on the highway?"
The two glanced at each other, then to the interviewer, looked down or elsewhere and then one at a time eventually answered.
"Highway driving is problematic," said Jack. "We'd rather stay on the city roads."
"Yes, city roads," Bjorn agreed, "No need for the highway."
The Director of Integration watched the interview on video and made a smacking sound against his teeth, "Pathetic. We bought the car to drive on the highway most of the time. These guys are acting like a couple of children who are scared of the Big Bad Freeway."
"Well, it has too many lanes in it," smirked his assistant, "If they go on the highway, they might actually get somewhere."
"What's that logo on their shirts?" the Director noted, "Zoom in on that -"
"It's FGA," said the assistant.
"Is that a misspelling? I thought the machine used an FPGA?"
"Oh no, that's correct. It's the First Gear Association. But now they call themselves the First Gear Society. It's way past a club. Now it's a philosophy. Or maybe culture is a better word."
"You mean they actually run around justifying why they stay in first gear?"
"It's almost like a discipline sir," grinned the assistant, "They have rules and protocols for their followers. It's cultish if you ask me."
"I didn't ask you."
"So what are some of these so-called protocols?"
"Well for starters, they can't stand deep-metal. It's like garlic to a vampire."
"But deep-metal is - " he sighed, "Never mind, just keep going."
"If they find themselves going too slow, they try to get the machine to go faster in first gear. If they can't they blame the machine. They think that having a powerful machine is good enough. After all, a couple-hundred horsepower under the hood looks good no matter what gear you're in. And to the other chap's point. If you put it into a higher gear, it'll go so fast that it's hard to steer. They aren't very good at steering, so - "
"So it's a fear thing?"
"Not so much. They like rollercoaster rides, judging from some of their project schedules and deliverables anyhow. They like the whole livin-on-the-edge brinksmanship. As long as someone else is driving."
"Ahh, so just afraid to punch the accelerator themselves, eh?"
"They'll punch it, just in first gear. It's almost like any gears higher than that are a fearful thing."
"How do we get them into the deep-metal? It can't be that hard."
"Oh they go into the deep-metal on expeditions and stuff. They like popping the hood and exploring the architecture. But when they get behind the wheel, it's a first-gear-only driving experience."
"So tell me what to look for in a First-Gear-Society member. What can I expect?"
"Here's a short list," the assistant accessed his handheld and pulled an image onto its screen, showing it to the Director.
"Let me see if I understand this -"
Common symptoms are:
"We were doing just fine until a week or so ago, then everything started to slow down. Jobs are running a lot slower, we're missing all of our deadlines. Has the hardware worn out?"
"We just added some functionality to the solution and two days later the whole machine started to drag. We backed out the solution but the machine is still dragging. I think we broke something. There's a problem with the hardware."
"The machine is having a hardware problem. We've been running these applications for years and then one day everything slowed down. We upgraded to a bigger machine and we didn't get any lift, but a few days later things got worse. A lot worse. This is clearly a hardware problem."
"Wow, they think it's the hardware when they haven't even taken it out of first gear. Amazing."
"Here's the punch line - they honestly believe that they can get more power from the machine if they learn to steer it better."
"No kidding. Take a look at these quotes:"
"We reworked the logic in the queries to tighten things down. Every time we change the query, we get a few percentage points difference in performance, but we need a multiplied boost, like 10x or better."
"We put in summary tables to make things go faster but it didn't help. We have query engineers tightening down the queries so that the join logic is efficient."
"All of our query-tuning has failed. We are at a standstill."
"See all that? They really believe that they can make it go faster from the steering wheel. Like the machine they are sitting in is not the source of power."
"How do we convince them to take it out of first gear?"
"Not sure that there is a way, sir. They are so accustomed to believing that the steering wheel is the place to affect performance that they haven't bothered to examine where the power-plant actually resides."
"Under the hood."
"That's right, the power is in the hardware. The steering wheel can only direct the hardware to the right location, but cannot affect power."
"Well, they can certainly steer it poorly."
"Oh sure, like driving it through mud and all that. But that brings up another issue. Some of them like an extreme challenge, so rather than taking it out of first gear, they try to make the machine go faster by making the gears grind."
"Sure, a grinding gear just seems like it's working hard."
"But they don't see that the friction of the grinding is slowing them down?"
"Not in the least. Look here:"
"We built our tables to provide fast back-end data processing, but the distribution we use for back-end processing doesn't work so well for reporting. Every time they issue a query, it's terribly slow. If we copy the larger tables to another distribution, it makes back-end processing slow but the reporting fast. We can't seem to win here. So we're going to keep the distribution for the back-end and then find a way to build something else for the front end. They want us to make two versions of the big tables, each on different distributions. The experts tell us that this is common and the best thing to do, but it sounds crazy to us."
"See that? They don't even listen to the experts. They would rather grind the gears."
"But all they are saving is disk space. I mean, seriously?"
"Old ways die hard. They would rather preserve disk space and incur the wrath of the users."
"Or preserve disk space and then build-out a convoluted set of summary tables that is twice as hard to maintain."
"Yep, the simple, dumb photocopy of data into another distribution is just too simple. They have a need to engineer."
"Engineering in first gear is not really engineering. It's more like polishing the stick-shift."
"Here's another quote:"
"We have a lot of complex views, and those views join to other views. I've seen EXPLAIN plans for this hit fifty to one-hundred snippets deep. Netezza returns this stuff fast all the time. Only now it doesn't anymore.
"Ahh, so their underpinning tables finally reached a tipping point."
"Yeah, this is funny. They add a ten pound weight to the trunk of the car every day and three hundred days later, and 3000 pounds heavier, the car is starting to run slow. Imagine that."
"Running slow because they have it in first gear."
"Well of course, if they were to use a higher gear, all that weight would be practically weightless."
"Funny how the density of deep metal is a lot like anti-gravity."
"Well, the competition sees these things happening and they say, you know, with that particular vehicle, the more weight you add, the slower it gets. It's just the nature of the machine."
"But that's not true on deep-metal."
"Not at all. We know folks who have tons and tons of weight on the machine but it's light as a feather."
"They're using higher gears and the deep metal together."
"Absolutely. Nobody from the First Gear Society is allowed in their shop. They know better."
"Well that's a problem with nested views. The master query attempts to provide the filter attributes but if the nested view doesn't pass them along, the underpinning tables end up table-scanning."
"And as you know, Netezza is the best table-scanner in the business. It can scan tables, really big tables, in no time flat."
"But it's not supposed to be scanning the tables. It's supposed to be using the zone maps and filter attributes so that it doesn't have to scan the tables."
"Well, exactly, but the people in first gear don't know this. They can't really tell that their queries are getting fractionally slower with each passing day. Then one day it reaches a point of no return. Sort of like how rust slowly clogs up a pipe until one day the pipe just closes over. It doesn't happen in a single query or just because we added a new table or two. It's pervasive and pandemic across the entire implementation."
"I bet when they hear that, they go insane."
"Yeah it's because with any other machine, remediating something like this would be very hard and tedious. But with Netezza we can remediate it incrementally and get more lift with each change. It's just not hard to recover the capacity if you know what you're looking for."
"I think the bottom line is that a really powerful machine tends to hide inefficient work."
"Or like one aficionado put it: Netezza can make a really ugly data model look like a super-model, and can make really bad queries look great. They just don't realize that the ugly models and bad queries are sapping machine's strength like a giant parasite."
"Ugh, now there's a visual I can live without."
In this case, a tortoise-brained hare is an animal that is capable of going fast but wonders why he can't.
Over the last six months or so I have seen a "trending" situation with those who use the Organize-On. This is some pretty cool functionality so it's important that we get it right. Generally speaking, folks who understand how zone-maps work will have a splendid time with Organize. Others, not so much. The machine is supposed to be fast, so why are all my queries so slow?
All I have to do is add the Organize On keys and I'm good to go. Answer: No, you have to groom first, as this actually applies the Organize to the physical data. Response: Really? When we did this it took forever! Answer: Yes, the first groom may take a bit of time but every groom after that will be painless.
We CTAS a table every night for maintenance and the groom is always running slow. Answer: if you are using Organize On, don't use a CTAS for regular maintenance. With an Organize, the table is physically broken apart and spread across multiple additional extent-pages to provide for easier groom maintenance. A CTAS re-compresses it. So another groom is required to blow it back out again.
We like to use Materialized Views to optimize our tables but now the Organize doesn't let us. How can we get this back? Answer: You don't, because the Organize essentially replaces the Materialized View and does it so much better. Asking this question means you might not understand zone maps as well as you think you do! Just sayin'
We have included the distribution key in the Organize, along with some other join keys. Answer: (Sigh) -Remove them. Join-only keys do not belong in the Organize. The Organize is for filter attributes. Uh, perhaps you don't understand zone maps? Just sayin'
We have some lookup tables where we applied an Organize. But this didn't seem to matter. Answer: No, it won't, because if the table is not big enough the system won't use zone maps anyhow.
We were told to turn off Page-Level zone maps but this caused a 5x reduction in performance. Answer: Page-levels radically reduce disk I/O which is our number-one enemy. Turn them back on and go home happy.
It is clear that our outlying cast of folks who have never been introduced to zone maps think that the Organize is either a clustered-index or some other kind of indexing scheme. Fair enough, is this because of the deceptive naming? Like the Materialized View? (okay, don't get me started). The Organize keys are not keys in the same sense as indexes and are certainly not used as join keys.
They are filter-attributes. By this we mean that we fully intend to use these columns with constant values, lists or selected-lists as constraints.
Zone Map Primer
Here I will use my favorite example because it is time-tested and describes the capability. If I pay a visit to Wal-Mart and I'm looking for batteries, I might visit the customer kiosk and the lady tells me that the batteries are "on the end-cap, Aisle Three". This is an indexing model because she told me exactly where to look. This does not scale because when we get to billions of rows we spend more time searching the index than retrieving records.
In a Netezza model however, she would say "It's on an end-cap but I'm not sure which one." So I go to Aisle 1, then 2, then 3 and jackpot, I buy the batteries and go home. The most important part of her answer however, was in what she "did not say". She did not say that the batteries were in Men's or Women's clothing, Automotive or Electronics. In constraining the search to a particular location, she also told me "where not to look". This aspect of "where not to look" is critical to understanding zone maps. It is also critical to stratospheric scalability. Clearly if this Wal-Mart were to quintuple in size, my battery-buying-duration experience would not change in the least.
Now for the geeky part:
Each Netezza disk has 120K physical extents. Each extent has 3MB of space, divided across 24 pages (blocks) of 128KB each. If we are running prior to OS 7.x, we will be zone-mapping at the extent-level. If we are 7.x or higher, we will use the page-level. Run don't walk to 7.x and page-level zone maps! They will radically reduce I/O on the extent and this is critically important. A set of records found only on a single page can return in 1/24th of the time it would take to scan the full extent (about 96 percent less I/O). In a real-world experiment, the zone maps on a TwinFin-18 were changed from page-level to extent-level and the same battery of queries executed against it. It performed 5x to 10x slower than with the page-level turned on. Do not underestimate the influence of disk I/O on query turnaround. Netezza is a very physical machine.
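For those who like to check the numbers, here is a small sketch of the extent and page geometry quoted above. The constants come straight from the text; the function name is hypothetical, and the model assumes I/O cost is simply proportional to the pages read.

```python
# Geometry as stated in the text: 3MB extents split into 24 pages.
EXTENT_BYTES = 3 * 1024 * 1024
PAGES_PER_EXTENT = 24

page_bytes = EXTENT_BYTES // PAGES_PER_EXTENT


def io_saved_fraction(pages_needed, pages_per_extent=PAGES_PER_EXTENT):
    """Fraction of an extent's I/O avoided when only `pages_needed` pages are read.

    Assumes cost is proportional to pages read (a simplification).
    """
    return 1 - pages_needed / pages_per_extent


print(page_bytes // 1024, "KB per page")
print(f"{io_saved_fraction(1):.1%} of the extent is never read")
```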
Here's another internal example: Compression in Netezza is a nominal 4x. I have seen it much higher. Let's say we have an uncompressed record that is 10,000 bytes in size. When we read it, we will read all 10,000 bytes. If the data is compressed however, we will read only 2500 bytes, a 75% reduction in disk I/O. Netezza is the only platform where compression boosts the power in both reading and writing data because it reduces disk I/O tremendously.
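The compression example works the same way. A minimal sketch, assuming the nominal 4x ratio from the text and hypothetical function names:

```python
def bytes_read(uncompressed_bytes, compression_ratio):
    """Bytes that actually come off the disk for a record of the given size."""
    return uncompressed_bytes / compression_ratio


def io_reduction(compression_ratio):
    """Fraction of disk I/O avoided at the given compression ratio."""
    return 1 - 1 / compression_ratio


print(bytes_read(10_000, 4))       # bytes read for the 10,000-byte record
print(f"{io_reduction(4):.0%}")    # share of I/O we no longer pay for
```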
Zone maps are a table that Netezza holds in memory for supporting a table. Each table has its own set, describing the contents of the extents/pages allocated to the table on a given dataslice. The contents of the zone maps are built automatically when we load up our table. It will collect the information from the records stored on the given extent/page for all integers, dates and timestamps. It will store the high and low value of each, representing the table's record "ranges" for that extent.
Thus when we execute a query using one of these columns as a filter-attribute, it will go to the zone maps first and cherry-pick only those extents containing the data we want, excluding all others. This means that the data on those other extents won't even see the light of day for the query-in-play. If we use an Organize, we can use a wider range of data types and be more deliberate with zone-map management.
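To make the high/low idea concrete, here is a toy model of a zone map. It is a sketch of the concept only, not Netezza's internals: the function names are hypothetical, dates are stored as plain integers, and each "extent" holds just three rows.

```python
from typing import List, Tuple


def build_zone_map(extents: List[List[int]]) -> List[Tuple[int, int]]:
    """One (low, high) pair per extent, like the ranges described above."""
    return [(min(rows), max(rows)) for rows in extents]


def candidate_extents(zone_map: List[Tuple[int, int]], value: int) -> List[int]:
    """Indexes of extents whose range could hold `value`; all others are skipped."""
    return [i for i, (low, high) in enumerate(zone_map) if low <= value <= high]


# Dates as integers (YYYYMMDD); four tiny extents of a hypothetical table.
extents = [
    [20140101, 20140102, 20140103],
    [20140104, 20140105, 20140106],
    [20140101, 20140215, 20140301],  # wide range: poorly organized
    [20140302, 20140303, 20140304],
]
zone_map = build_zone_map(extents)
print(candidate_extents(zone_map, 20140105))
```

Notice that the third extent gets scanned for 20140105 even though it holds no such row. Its range is simply too wide to rule it out, which is the whole argument for physically co-locating like values.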
Let's say we want to search-on the transaction-date on our fact table. If we have not physically organized the data, the same transaction-date value may appear across many hundreds or thousands of extents, affecting the high/low range of many zone maps. If however, we were to physically co-locate the records with common dates into tighter physical groups, the records will physically appear inside fewer zone maps. These are the extents/pages that the machine will cherry-pick for scanning and will completely scan each one. We want it to scan as few of these as possible.
In times past, we had three primary ways of doing this.
A brute-force sort, which is pretty egregious when the table gets very large.
A software program that separates the key's distinct values and executes a block-select from the original table to the new, selecting only once for each distinct value. This physically co-locates like-keys. (the data does not have to be sorted, only like-keys co-located)
A materialized view, which would manufacture a virtual zone map (which is why it's not allowed with the Organize)
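The second approach, co-locating like-keys without sorting, can be sketched in memory like so. This is an analogue of the idea, not the actual program: the function name and sample rows are hypothetical, and a real implementation would issue one block-select per distinct key rather than loop over a list.

```python
from collections import defaultdict


def colocate(records, key):
    """Group records by key value, keeping first-seen key order (no sort)."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    # dicts preserve insertion order, so keys come out in first-seen order
    return [record for group in groups.values() for record in group]


rows = [
    ("2014-01-03", "a"),
    ("2014-01-01", "b"),
    ("2014-01-03", "c"),
    ("2014-01-01", "d"),
]
for row in colocate(rows, key=lambda r: r[0]):
    print(row)
```

The dates come out unsorted (the 3rd before the 1st), but every row for a given date now sits next to its siblings, and that is all the zone maps care about.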
Let's say that our table's data takes up 400 extents on each data slice. If our transaction-date appears in 300 of these (it is poorly organized) then when a query runs like so:
Select count(*) from mytable where transaction_date = '2014-01-01';
All 300 extents will be searched for this information. However if we Organize on this transaction-date and then groom - the data will be physically shuffled around to co-locate the records with the same transaction-date on as few extents as possible. Let's further say that once this happens, the given date appears in only one extent. What we have just done is optimized the table 300x. We have eliminated 299 other locations to look for data. This is important because scanning a 3MB extent is a lot of work. If we are scanning 299 additional extents for each dataslice, we're really doing a lot of extra work for nothing. If we translate this to a page-level problem, we may have 24x300 pages originally containing the keys. If with Organize we reduce this to a single page, we have further reduced our scanning load by another 96 percent of the extent containing the page.
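Here is a rough simulation of that before-and-after. The sizes are hypothetical (400 extents of 100 rows, one dataslice), a plain sort stands in for Organize-plus-groom, and the exact "before" count depends on the shuffle, so treat it as a sketch of the effect rather than a measurement.

```python
import random


def extents_touched(values, target, rows_per_extent):
    """Count extents whose low/high range includes `target`."""
    touched = 0
    for start in range(0, len(values), rows_per_extent):
        chunk = values[start:start + rows_per_extent]
        if min(chunk) <= target <= max(chunk):
            touched += 1
    return touched


random.seed(7)                       # fixed seed so the sketch is repeatable
ROWS_PER_EXTENT = 100                # hypothetical extent size
values = [day for day in range(1, 401) for _ in range(ROWS_PER_EXTENT)]
random.shuffle(values)               # arrival order: poorly organized

target = 200                         # the "date" we filter on
before = extents_touched(values, target, ROWS_PER_EXTENT)
after = extents_touched(sorted(values), target, ROWS_PER_EXTENT)
print(f"extents scanned before: {before}, after: {after}")
```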
The important factor in the example is the "out-in-the-open" value of "2014-01-01". It is a "filter attribute". It is not being joined to a table with this value. Doing it this way means that the FPGA/CPU will discover which zone maps have a high/low boundary that contains this value and will retrieve a candidate list of them. If there is only one as opposed to 300, we have radically reduced our workload. Netezza will literally exclude the other extents from being examined at all. We have told the machine where-not-to-look. If we join this value however, such as using a time dimension, applying the date value to time dimensions and joining the time dimension to our fact table, we will require the system to fetch the record in order to examine it, at which time it will determine whether to keep it or toss it. This sort of thing can initiate a full table scan, nullifying the zone map entirely.
We don't want this to happen. We want the data to stay on disk and never see the light of day if it's not participating in the query.
To show how dramatic this can be, we were at one site hosting over 100 billion rows in the primary fact table. A full scan of this table took 8 minutes (which is not too shabby in itself, just sayin'). The reporting users knew that if a query ever exceeded a few minutes in duration, it was probably ignoring the zone maps. This is because once-Organized, this table would return the average query in sub-second response. Think about that: 100 billion rows in sub-second response.
This is why paying attention to zone maps is such a big deal. Optimizing distribution can get us a boost in the single-digits (2x, 3x etc) on a given query. Optimizing zone maps can get us a 1000x boost and higher.
The Organize-On accepts one or more keys which will be applied to physically co-locate records of like-valued keys, then it will update the zone maps. Here is a test to see if we understand the application:
Take one of your largest tables. CTAS the table to another database, order by the distribution key, or by the hidden "rowid" column to make sure that the given filter key is not ordered. This could take a bit of time, of course. Then perform a query using one of your date parameters as in the example above, and time the query. Now perform an
Alter Table tabname organize on (date column name here).
Then perform groom. Once completed, execute the same query and get a timing. We can see that a several-minute query can go sub-second very easily.
What's more, the additional keys in the Organize are (more or less) independently organized. They will all enjoy a much faster turnaround than not using the Organize. If records arrive on the table out-of-order, no worries. Run the groom again. Subsequent runs of groom will always be shorter in duration than the first one.
Clearly however, if the zone-map is intended to apply filter-attributes to guarantee the exclusion of extents/pages completely, we cannot use a join-key. Or at least, not a join-only key. This also means that the distribution key is out (unless we plan to call-up an individual record based on the distribution key). Also, generally do not mix high cardinality keys with low cardinality keys. Netezza finds its strength somewhere in the middle. We will find that it disfavors the low-cardinality keys when high-cardinality ones are in-the-mix.
A good way to tell which of the filter attributes for a table are "high-traffic" is to turn on query history and then examine the table/view associated with "column access statistics" ($v_hist_column_access_stats). This will provide the number of times the column participated in a query and which table(s) the base table interacted with. Perform a descending sort on the NUM_WHERE column and this will reveal all. In this short list we will see the filter-attributes that are most useful. Don't use any of the join-only columns or the distribution key. These may adversely affect the multi-key algorithm's output and might not optimize any zone maps for this table.
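The selection rule can be sketched as follows. Everything here is hypothetical: the rows stand in for what a query against the column access statistics might return, the column names and counts are invented, and the cap of four keys is an assumption of the sketch.

```python
def organize_candidates(stats, join_only, distribution_key, top_n=4):
    """Rank columns by WHERE-clause usage, dropping join-only and distribution keys."""
    usable = [
        row for row in stats
        if row["column"] not in join_only and row["column"] != distribution_key
    ]
    usable.sort(key=lambda row: row["num_where"], reverse=True)
    return [row["column"] for row in usable[:top_n]]


# Invented sample data, shaped like (column name, NUM_WHERE count).
stats = [
    {"column": "customer_id", "num_where": 950},
    {"column": "transaction_date", "num_where": 900},
    {"column": "store_id", "num_where": 400},
    {"column": "product_key", "num_where": 30},
    {"column": "region_code", "num_where": 250},
]
print(organize_candidates(
    stats,
    join_only={"product_key"},
    distribution_key="customer_id",
))
```

The busiest column in the sample (customer_id) is deliberately thrown out because it is the distribution key, which is exactly the trap described above.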
At one site, we noted several inappropriate keys in the Organize, and simply by removing them and "grooming" again, the table experienced a 100x boost. The inappropriate keys were washing out the effectiveness of the other keys.
Many of us have seen the Manhattan skyline, with buildings of various heights stabbing toward the sky. Compare this to the distribution graph that is part of the Netezza Administration GUI application. Normally this should be very flat (but a jagged-edge is usually okay). Clearly a Manhattan skyline is forbidden.
Or is it? If we have a table that is very skewed (like a Manhattan skyline) but the data can be easily "horizontally" sliced with zone maps, our round-trip time for a "tall" data-slice is no different than a "short" one. All we need to look out for is process-skew (too many horizontal slices on one dataslice).
Measuring the madness
Okay, David, this is all very interesting, but how can I know which extents or pages or whatever are being used by the keys? Well, there are a couple of handy hidden columns on each row that can help tell-the-tale. One is the _PAGEID and one is the _EXTENTID.
select count(*), datasliceid dsid, _pageid pid from fact_customer group by datasliceid, _pageid;
will return one row per page per data slice, which tells us how many distinct pages are being used for each data slice.
select count(*), datasliceid, _pageid from fact_customer where transaction_date = '2013-01-01' group by datasliceid, _pageid order by datasliceid, _pageid;
In the above, the "count" should be reasonably even for each dataslice.
If the total number of pages returned is radically more than "1", then let's organize on transaction_date (the column we are filtering on) and then groom. Now run the metrics again to see whether the total pages dropped.
I am sure with these two columns in-hand you can think of a variety of creative ways to use them. The conclusion of it all is to get the records with common key values packed as closely as we can so they take up as few extents / pages as possible.
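To make the "as few pages as possible" goal concrete, here is a small Python simulation. It is only a sketch: the page size of 100 rows and the table contents are made up, and a plain sort stands in for organize-plus-groom, which is not how the machine's multi-key algorithm actually works.

```python
import random

PAGE_ROWS = 100  # hypothetical number of rows per page

def pages_touched(rows, key):
    """Count distinct pages holding at least one row with the given key."""
    return len({i // PAGE_ROWS for i, value in enumerate(rows) if value == key})

random.seed(7)
dates = [f"2013-01-{d:02d}" for d in range(1, 31)]
table = [random.choice(dates) for _ in range(30_000)]  # rows in arrival order

before = pages_touched(table, "2013-01-01")
after = pages_touched(sorted(table), "2013-01-01")     # stand-in for organize + groom
print("pages before:", before, "pages after:", after)
```

In arrival order the rows for one date are smeared across nearly every page. Once like values are packed together, the same rows fit into a handful of pages, which is what lets the zone maps exclude everything else.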
It's a wrap
So now that the Organize seems a little better, you know, organized, maybe this will provide a bit of guidance on how to set up your own Organize and zone maps.
And don't forget to groom when changing the Organize keys.
We won't need to groom every time we do an operation. I would suggest a groom on both a schedule and a threshold. Pick a threshold (a lot of folks like five percent). When the total deleted rows gets above five percent of the total non-deleted rows, or the total pages per unique data point gets above an unacceptable threshold, it's time to groom. But grooming on every operation is expensive, has marginal value and actually may throw away records we wanted to keep (in case of emergency rollback).
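The threshold test itself is simple enough to sketch in a few lines of Python. The function name, the parameters and the way the row and page counts are obtained are all assumptions for illustration. Only the five percent figure comes from the discussion above.

```python
def needs_groom(live_rows, deleted_rows, pages_per_key=None,
                deleted_threshold=0.05, max_pages_per_key=None):
    """Return True when either groom threshold has been crossed."""
    if live_rows == 0:
        return deleted_rows > 0
    # Deleted rows measured against the non-deleted rows
    if deleted_rows / live_rows > deleted_threshold:
        return True
    # Optional second trigger: too many pages per unique data point
    if pages_per_key is not None and max_pages_per_key is not None:
        return pages_per_key > max_pages_per_key
    return False

print(needs_groom(1_000_000, 60_000))   # six percent deleted
print(needs_groom(1_000_000, 40_000))   # four percent deleted
```

A scheduled job could call a check like this per table and groom only the tables that return True, rather than grooming after every operation.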
I recently did a Virtual Enzee presentation and listed the Top Ten requirements for scalable bulk data processing inside a Netezza machine.
I'll come back periodically and elaborate on them.
1. Platforms easily scale for increasing stress
We have a Netezza machine, so what could go wrong? I was asked a desperate question by an Enzee as to how to get more power out of their machine. After nearly two days of struggling with them I finally asked how big their machine was. It was a TwinFin-3. The answer I gave them, they clearly did not like and even sought solace on the shoulder of another. Who told them the same thing. Get a bigger box. TwinFin-3 is a dev box, not a production box.
Stress comes in many forms. Constantly changing requirements. The need for functional and physical agility. As these things increase, we need a platform that will work with us, not against us.
2. Human intervention eliminated wherever possible (no eyeball-based actions)
This means ALL aspects, not just operational ones. Everything from table maintenance to application development. AUTOMATE!
It is humorous to hear testers offer up their methods, with naive blurbs like "open the application and examine the contents". No, with billions of rows there is no such thing. We must use statistical checking that operates on sets, such as summaries, counts-of etc. No longer can we "eyeball" the data.
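A set-based check can be as plain as comparing summaries between what was sent and what landed. The Python sketch below is illustrative only. The column names and sample rows are made up, and in practice the summaries would come from aggregate queries run inside the machine rather than from rows pulled into memory.

```python
def profile(rows, numeric_cols):
    """Summarize a set of rows as a row count plus one sum per numeric column."""
    summary = {"row_count": len(rows)}
    for col in numeric_cols:
        summary[f"sum_{col}"] = sum(r[col] for r in rows)
    return summary

def differences(source_profile, target_profile):
    """Return only the measures that disagree, as (source, target) pairs."""
    return {
        k: (v, target_profile.get(k))
        for k, v in source_profile.items()
        if v != target_profile.get(k)
    }

source = [{"amount": 10, "qty": 1}, {"amount": 25, "qty": 2}, {"amount": 5, "qty": 1}]
target = [{"amount": 10, "qty": 1}, {"amount": 25, "qty": 2}]  # one row went missing

mismatch = differences(profile(source, ["amount", "qty"]),
                       profile(target, ["amount", "qty"]))
print(mismatch)
```

An empty result means the sets agree on every measure. Anything else points the tester at a specific measure without anyone opening the data.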
Likewise with runtime processes. Define a table with 200 columns and try to put an ELT query against it. 200 columns in the insert phrase, 200 entries in the select phrase, and to maintain it we have to keep them in sync with "eyeballs". No, this doesn't scale.
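One way out of the 200-column eyeball problem is to generate both phrases from a single column list, so they cannot drift apart. Here is a minimal Python sketch. The function name, the table names and the `overrides` idea are hypothetical, and in a real harness the column list would come from the catalog.

```python
def build_insert_select(target, source, columns, overrides=None):
    """Build an insert-into-select-from statement from one column list.

    overrides maps a column name to the expression that should feed it.
    Every other column passes straight through.
    """
    overrides = overrides or {}
    select_items = [
        f"{overrides[c]} AS {c}" if c in overrides else c
        for c in columns
    ]
    return (
        f"INSERT INTO {target} ({', '.join(columns)})\n"
        f"SELECT {', '.join(select_items)}\n"
        f"FROM {source};"
    )

columns = [f"col_{i:03d}" for i in range(1, 201)]  # stand-in for 200 catalog columns
sql = build_insert_select("tgt_table", "src_table", columns,
                          overrides={"col_002": "upper(col_002)"})
print(sql[:80])
```

The developer maintains only the handful of overrides that carry real logic. The other columns are pass-through and stay in sync by construction.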
3. Architecture-centric platforms express applications with patterns
Oddly, application developers, like those who develop using stored procedures, whip out a bunch of application-centric "code" and when the smoke clears, they see repeatable patterns all over it. Unfortunately, they can't take the patterns anywhere because they are hard-wired.
The more architectural approach is to harness the patterns as capabilities and allow our applications to express from them. The application is then an expression of the capabilities, not the center of gravity.
4. Deliberately simple to leverage and operate
Large-scale systems can have mind-numbing characteristics for the un-initiated. It is incumbent upon us to deliberately simplify the interface points to these systems. Simple utilities, fewer keystrokes to achieve mundane goals, automation for rote tasks.
5. Built for administrative recovery, not reactionary recovery
This can be as simple as, when data arrives and has errors, we don't come to a full stop. We cordon off the error records into an administrative/logical status and report them for later remediation. In systems of scale, we cannot halt the processing of tens of millions/billions of records just because a few stragglers are misbehaving. The time it will take to process the data is the problem. If we are 20 minutes away from the process being complete, then we are always 20 minutes away if we have fully stopped the flow for the sake of a few records. If we allow the process to proceed with error-capture, we will close the 20 minutes and then the admins have more breathing room to fix the problem without the scrutiny or pressure of the clock.
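The cordon-off idea looks roughly like this in Python. It is a sketch under assumptions: the validation rule, the field names and the `error_reason` column are invented for illustration, and in the machine this would be a set-based insert into an error table rather than a row-by-row loop.

```python
def process_with_error_capture(records, validate):
    """Keep the flow moving: good rows proceed, bad rows are set aside with a reason."""
    good, quarantined = [], []
    for rec in records:
        reason = validate(rec)
        if reason is None:
            good.append(rec)
        else:
            quarantined.append({**rec, "error_reason": reason})
    return good, quarantined

def validate(rec):
    """Hypothetical rule: an amount must be present and non-negative."""
    if rec.get("amount") is None:
        return "missing amount"
    if rec["amount"] < 0:
        return "negative amount"
    return None

records = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": None},
    {"id": 3, "amount": -4},
    {"id": 4, "amount": 7},
]
good, quarantined = process_with_error_capture(records, validate)
print(len(good), "loaded,", len(quarantined), "held for remediation")
```

The load completes for the good rows, and the held rows carry their own reason so the admins can work them later without the clock running.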
6. Data and metadata-driven
The environment can no longer be driven by application code. It has to be driven by an architectural harness that responds and adapts to data and metadata. This is a non-trivial endeavor, of course, but entirely possible to achieve.
What does this look like? The data model is arguably the most volatile component of the solution. Changes in it can destabilize a solution. We need ways and utilities to buffer ourselves from the impact of change all-the-while enabling the change. It won't do to tell the users that the data model is frozen for 6 months because we fear impact to our tightly-woven application code (e.g. stored procs).
7. Blended/hybrid approaches quickly adapt and scale
One doesn't have to make an exclusive choice between ETL and ELT. People really want to leverage the power inside the machine but feel constrained that doing so may obviate the ETL tool. Not so - both of these technologies have a major role to play and we should balance them for the best-of-breed solution.
8. Template-driven applications: SQL is an artifact, not the center-of-gravity
In the Virtual Enzee I offered several examples of templates for SQL transforms (insert-into-select-from), views (to avoid nesting) and stored procedures (to build from a template rather than editing them in a SQL tool).
Why do this? The developer puts application logic into the template. At run time, or installation time in case of the SP or View, we formulate the product from the template. This allows us to automatically include non-optional aspects like operational controls, inline status reporting and other elements that we don't want the developer to worry about, much less hand-craft on their own.
Need another bit of operational control? Add it to the template factory and don't worry about the application logic.
More importantly, we can generate a template from the catalog and by definition it is tied to the catalog. It is therefore easy to compare the already-deployed templates to changes in the data model. Since 90 percent of all new columns are invariably pass-through columns, running an impact analysis like this captures over 90 percent of the issues in one shot.
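The impact analysis described here boils down to a set comparison between the columns a deployed template knows about and the columns the catalog now reports. A minimal Python sketch follows. The function name, the column names and the idea of holding both lists in memory are assumptions for illustration.

```python
def impact(template_columns, catalog_columns):
    """Compare a deployed template's columns with the current catalog."""
    deployed, current = set(template_columns), set(catalog_columns)
    return {
        "new_in_catalog": sorted(current - deployed),        # candidates for pass-through
        "missing_from_catalog": sorted(deployed - current),  # template references a dropped column
    }

template_columns = ["customer_id", "customer_name", "region"]
catalog_columns = ["customer_id", "customer_name", "region_code", "loyalty_tier"]

report = impact(template_columns, catalog_columns)
print(report)
```

New columns that are plain pass-through can be added to the template automatically. Only the dropped or renamed ones need a person to look at them.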
9. Inefficiencies are our number one enemy
One of our clients had a TwinFin 48 they were planning to use for their development phase and then cutover internally to production. I asked them to dial back the developers so that it had the effective power of a TwinFin 12. They were a bit stunned at my request until I noted: The TwinFin 12 has a lot of power for development, but a TwinFin 48 will hide bad data models and sloppy code. Lots of power can make any lousy code/data model look spectacular.
In many cases of a Netezza machine under stress, upon review we find that many of the inefficient practices have been going on for years, some since the box arrived. But the machine was so powerful it masked the inefficiency, like allowing the box to eat itself from the inside out.
Preserve capacity at the processing level, not by guarding the data storage level. Do not be afraid to spin off replica data structures (even large ones) just for a different distribution, if it means that the machine can close its queries faster.
10. Operational integrity drives functional integrity
We understand this as a matter of quality control. Hamburgers from a national chain should taste the same no matter where we buy them. This is not accidental. The end user data is only as good as the processes that are delivering it.
If we make it so the operators have a difficult time handling it, or the admins don't understand it, or the troubleshooters can't get things done, they will start to grouse about the quality of their existence.
On the flip side, I know folks that we radically simplified things for, and when we showed them the various utilities they would need to keep things in order, they balked. "Do we have to know all this stuff? Why is there so much stuff to know?" And yet, we have reduced a thousand things down to one, but they cannot grasp how much more complex it would be without our having simplified it.
We know that Netezza embraces simplicity. We just have to be mindful to maintain this spirit when we build things around it.
At the functional/capability level, we need to drive operational integrity into the data itself, outfitting the tables and rows with additional columns for the sole purposes of operational control. Otherwise the functional model is pretty much out-in-the-open and we won't have a way to manage the tables in a consistent, harnessed, repeatable form.
I recently bumped up against a Proof-Of-Concept where "Two MPP Powerhouses went Toe-To-Toe" - and I was fairly excited to see that there might be a contender in the ring, stalking the PureData Analytics/Netezza machine. These POC's are always fun to watch. I am sure not quite as gratuitous as gladiators in the Colosseum, but engaging nonetheless.
In this corner...
The contender, dressed in white, was an MPP, er - clustered servers posing as an MPP. Now let's level-set on what an MPP is, and what it is not.
As we can see with the above high intensity graphics - six 2-cylinder Fiats versus a 12-cylinder Jag. Now here's the trick question so don't be shy: Which one really accelerates to close the distance faster? Take your time, I'll be right here.
It's no mystery that orchestrated, optimized and purpose-built hardware beats general-purpose commodity hardware every time it's tried. The contender was a cluster of commodity servers posing as an MPP. When we tried to launch scanning analytic queries on it, we could practically hear the whirrrrr-click of the machines as they quietly, well, went silent. For a very long time. I wondered if they would ever offer up the answer. Unlike the Hitchhiker's Guide where they had to wait a million years to get the answer to the ultimate question, we decided to kick off the same query on the PureData/Netezza machine.
I hit the "enter" button while I was standing at the keyboard, then recalled that I needed to check on something else and lowered myself into the chair, but before I could sit - Netezza had the answer. No, it wasn't "42" but something a bit more actionable.
We left the building that day satisfied that the Jaguar had in fact smoked the competition. I probably should mention that even as we left the building, the "other guys" still had not come back with an answer. Sad indeed.
Scalability for scanning/bulk operations is a result of strong architecture. It cannot be cobbled together with general-purpose parts. The cluster of servers posing as an MPP had failed. Cluster Failure. Send it back.
Sometimes the average Netezza user gets a bit tripped-up on how an MPP works and how co-located joining operates. They see the "distribute on" phrase and immediately translate "partition" or "index" when Netezza has neither. In fact, those concepts and practices don't even have an equivalent in Netezza. This confusion is simply born of the notion that Netezza-is-like-other-databases-so-fill-in-the-blank. And this mistake won't lead to functional problems. They will still get the right answer, and get it pretty fast. But it could be soooo much faster.
As an example, we might have a traditional star-schema for our reporting users. We might have a fact table that records customer transactions, along with dimensions of a customer table, a vendor table, a product table etc. If we look at the size of the tables, we find that the product and vendor tables are relatively small compared to the customer, and the fact table dwarfs them all. A typical default would be to distribute each of these tables on their own integer ID, such as customer_id, vendor_id etc. and then putting in a transaction fact record id (transaction_id) that is separate from the others, even though the transaction record contains the ID fields from the other tables.
Then the users will attempt to join the customer and the transaction fact using the customer_id. Functionally this will deliver the correct answer but let's take a look under-the-covers at what the performance characteristics will be. As a note, the machine is filled with SBlades, each containing 8 CPUs. For example, if we have a TwinFin-12, this is 12 SBlades with 8 CPUs each, or 96 CPUs. They are interconnected with a proprietary, high-speed Ethernet configured to optimize inter-CPU cross-talk.
Also whenever we put a table into the machine, it logically exists in one place, the catalog, but physically exists on disks assigned to the CPUs. A simplistic explanation would be that if we have 100 CPU/disk combinations and load 100,000 rows to a table that is distributed on "random", each of the disks would receive exactly 1000 records. When we query the table, the same query is sent to all 100 CPUs and they only operate on their local portion of the data. So in essence, every table is co-located with every other table on the machine. This does not mean however, that they will act in co-location on the CPU. The way we get them to act in co-location (that is, joining them local to the CPU) is to distribute them on the same key.
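The simplistic explanation above can be played out in a few lines of Python. This is a toy model, not the machine's actual placement logic: round-robin stands in for "random" distribution, and a simple modulo on an integer key stands in for whatever hash the machine really uses.

```python
from collections import Counter

SLICES = 100  # the 100 CPU/disk combinations from the example above

def distribute_random(n_rows):
    """Round-robin stand-in for 'distribute on random': rows per slice."""
    return Counter(i % SLICES for i in range(n_rows))

def slice_for(key):
    """Hypothetical hash for 'distribute on (key)': modulo on an integer key."""
    return key % SLICES

rows_per_slice = distribute_random(100_000)
print(min(rows_per_slice.values()), max(rows_per_slice.values()))

# Two different tables distributed on customer_id agree on where a customer lives
customer_home = slice_for(12345)
fact_home = slice_for(12345)
print(customer_home == fact_home)
```

Random distribution spreads the rows evenly but says nothing about where a given customer lands. Distributing two tables on the same key is what guarantees that matching rows sit on the same slice.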
But because our noted tables are not distributed on the same key, they cannot co-locate the join. This means that the requested data from the customer table will be shipped to the fact table. What does this look like? Because the customer table has no connection to the transaction_id, the machine must ship all customer records to all blades (redistribution) so that the CPUs there can attempt to join on the body of the customer table. We can see how inefficient this is. This is not a drawback of the Netezza machine. It is a misapplication of the machine's capabilities.
Symptoms: One query might run "fine". But two of them run slow. Several of them even slower. Results are inconsistent when other activities are running on the machine. We can see why this is the case, because the processing is competing for the fabric. Why is this important to understand? The inter-CPU fabric is a fairly finite resource and if we allow data to fly over it in an inefficient manner, it will quickly saturate the fabric. All the queries start fighting over it.
Taking a step back, let's try something else. We distribute the transaction_fact on the customer_id, not the transaction_id. Keep in mind that the transaction_id only exists on the transaction table so using it for distribution will never engage co-location. Once we have both tables distributed on the customer_id, let's look at the results now:
When the query initiates, the host will recognize that the data is co-located and the data will start to join without ever leaving the CPU where the two table portions are co-located. The join result is all that rises from the CPU, and no data is shipped around the machine to effect the answer. This is the most efficient and scalable way to deal with big-data in the box.
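To put a rough number on the difference, the Python sketch below counts how many fact rows sit on a different slice than their matching customer row under each distribution choice. Everything in it is a made-up toy: eight slices, a modulo hash, and randomly generated keys. It counts misplaced rows rather than modeling how the machine actually moves data.

```python
import random

SLICES = 8  # hypothetical, small on purpose

def slice_for(key):
    """Toy hash: modulo on an integer key."""
    return key % SLICES

random.seed(1)
facts = [
    {"transaction_id": t, "customer_id": random.randint(1, 1000)}
    for t in range(1, 10_001)
]

def rows_off_slice(facts, fact_dist_col):
    """Fact rows not stored on the slice that holds their customer row.

    The customer table is assumed to be distributed on customer_id.
    """
    return sum(
        1 for f in facts
        if slice_for(f[fact_dist_col]) != slice_for(f["customer_id"])
    )

on_transaction_id = rows_off_slice(facts, "transaction_id")
on_customer_id = rows_off_slice(facts, "customer_id")
print(on_transaction_id, on_customer_id)
```

With the fact table on transaction_id, most rows have their join partner somewhere else and something has to travel over the fabric. With both tables on customer_id, the count is zero.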
Now another question arises: If the vendor and product dimensions are not co-located with the transaction_fact, how then will we avoid this redistribution of data? The answer is simple: they are small tables so their impact is negligible. Keep in mind that we want to co-locate the big-ticket-or-most-active tables. I say that because we have sites that are similar in nature where the customer is as large as two of the other dimensions, but is not the most active dimension. We want to center our performance model on the most-active datasets.
This effect can rear its head in counter-intuitive ways. Take for example the two tables - fact_order_header and fact_order_detail. These two tables are both quite monstrous even though the detail table is somewhat larger. Fact_order_header is distributed on the order_header_id and the fact_order_detail is distributed on the order_detail_id. The fact_order_detail also contains the order_header_id, however.
In the above examples, the order header was being joined to the detail, along with a number of other keys. This achieved the correct functional answer, but because they were not using the same distribution key, the join was not co-located. So we suggested putting the order_detail table on the same distribution as the order-header (order_header_id). Since the tables were already being joined on this column, this was a perfect fit. The join received an instant boost and was scalable, no longer saturating the inter-CPU fabric.
The problem was in how the data architects thought about the distribution keys. They were using key-based thinking (like primary and foreign keys) and not MPP-based thinking. In key-based thinking, functionality flows from parent-to-child, but in MPP-based thinking, there is no overriding functional flow of keys - it's all about physics. This is not to say that "function doesn't matter" but we cannot put together the tables on a highly physical machine and expect it to behave at highest performance unless we regard the physics and protect the physics as an asset. Addressing the functionality alone might provide the right functional answer, but not the most scalable performance.
Last week (4/3/13) IBM did a product launch of the new Hadoop Appliance and the DB2 BLU Acceleration. The BLU model is columnar and they ran with the Netezza model of "simplify-load-and-go" so the total instructions to get data into the machine and act on it is now dirt-simple.
The Hadoop appliance also ran with part of the Netezza model. The Hadoop appliance takes the MPP approach in-a-box so that it's a self-contained appliance without having to stand up a gaggle-of-servers for the same purpose. Keep in mind that these appliances consume less power and generate less heat than the aggregate of their distributed counterparts on the raised floor.
I contrast this to the average hapless soul who wants to do Hadoop and calls upon his management to roll out a gaggle of servers to make it work, and cobbles together the necessary parts and software to make it all happen, painstakingly tuning the environment because that's-what-engineers-do. Then someone says, hey, we could have saved all that money (labor is not free, and neither is hardware) and bought a PureData appliance for Hadoop that has scalable power and a simplified interface - AND integrates to the other environments like PureData Netezza and PureData DB2 for a self-contained operational and administrative experience. We don't need to pay or hire our engineers to home-grow the core substrate. Now they can concentrate on what we hired them for: solving business problems rather than engineering technologies.
The bane of the above model is simply this: we will roll all of it out once, for one application. Repeating it for another application starts us from scratch again because rarely do our engineers roll out such environments with reusable patterns and modules. It is a custom-tuned and rarefied atmosphere for one business purpose. This is true of most application/solution development. The engineers do not focus on the parts they intend to leverage or reuse for the next application. It is all very application-centric all-the-way-to-the-Hadoop servers. One may argue that the Hadoop servers are reusable, but we know in application development that an app-server is rolled out per-application. So while the app-server might itself be similarly configured to other app servers, it is still a separate machine. At some point in this game, the "mission critical" card will be played and all other Hadoop projects will need their own hardware - er - their own gaggle of multiple servers. This is when the instances start to reproduce like rabbits. Would we rather just trade-in all those servers, or forego their purchase altogether and install an appliance? Even if it's one appliance per application instance, it's better than a farm of servers that stretch across a raised floor so wide that we can see the earth curve. Tempting, no?
Orrrr - we could continue to do it the hard way. Many years ago I was impressed with the notion of "Eccentric Innovation" in that managers who were running out of capacity would act in desperation to stand up home-grown skunkworks (innovations) that were cobbled together by their most "creative" engineers who they did not hire for engineering or their ability to innovate - and ended up with an eccentric innovation - one that they would not have purchased off-the-shelf if given the choice, but that they instead paid several-times-more for and now they own it and only a handful of people on the planet can actually operate it. It's a very tense existence.
In the appliance genre, it sort of looks like this: If I give you a four-slice toaster, you will likely not use all four-slots except on busy mornings or if you have a big family. However, if I give you a 400-slice toaster, your problem is no longer toasting bread, but "bread-management" - keeping the toaster busy by pushing and pulling bread to and from it, and boosting your bread-movement infrastructure. No different for the Hadoop platform. No sooner will it roll out and people will start to use it, but will they use it enough to justify its expense? The total-cost-of-ownership is a glaring, almost blinding problem with a "common" Hadoop rollout but the costs of labor and upkeep are intangible. Appliances may have a tangible up-front expense but their low-maintenance and scalability mitigate total-cost-of-ownership issues.
And - of course - do we want a swarm of engineers running the Hadoop farm or do we want appliances in a lights-out ops center, quietly solving the world's problems before bedtime?
Many months ago I sat for some interviews essentially distilling the content of the Best Practice Sessions we executed in "deep dive" form at the Enzee Universe in Boston for several years.
Of course, time was always short and we were never able to touch on all the subjects in all the topics. As an example, the topic of Migration was jam-packed into one hour with a follow-on Q&A of 30 minutes. As one can see by the content of the monologues, there were always three or more hours of material we could never get to.
Now available on Amazon in a four-part "mini-series".
As with all analysis of implementations, please accept the following as a composite commentary (much like the Case Studies in Netezza Transformation). The names have been changed largely to protect the guilty. The innocent have already been punished.
So for those of you who may recognize shadows of your own environments in the discussion below, you now have plenty of time to get them cleaned up before anyone finds out about it! But honestly, don't admit to anyone that you are "doing it this way". Just fix it. What happens in the underground, stays in the underground!
I cannot (today) count how many on-site assessments I have executed or the variety of their outcomes. I have to say that on balance, most technology folks are pretty sharp and have things on track. I can usually advise them on how to make things better. This is of course exactly what they are hoping for. What manager wants to hear that they've done most of it wrong? Or that their investment in the technology and the people is a bust? No managers I know take their responsibilities so lightly. Some, however, inherit a mess from their predecessor and are flummoxed as to how to unravel it. They don't want me to "put lipstick on the pig", so to speak, but to provide a roadmap on how to dig out of the ditch (or hole, or rat's nest) and move things forward in a healthy direction.
Working with a 10400 Mustang, pre-TwinFin Era, one of our recently arrived data warehouse aficionados took our leadership aside and said, "What they are describing is a reporting system, like a data mart. But we aren't using any technologies to help them with this. We need to have a talk with them about standing up Microsoft SQLServer so we can put a data mart on it and..."
Stop right there. Yikes. He was so full of passion! It was really, really hard to talk him back from the ledge. So I finally said, "If you mention this plan to the client, even once, we will have to remove you from the project." And his eyes went wide like he'd been hit with a two-by-four across the forehead. "Why?" was his impassioned plea. Time to educate him on what Netezza does, right?
Netezza is a data warehouse appliance. It circumscribes and simplifies the data warehouse disciplines. It also makes some strong assumptions about the potential users of the appliance, not the least of which is what-problems-it-solves-well and what-problems-it-does-not-solve-at-all. (World Peace, Global Warming, Time Travel, Cloning of IT Staff Members, and getting the Dallas Cowboys to the Superbowl, to name a few).
Example: What if you were going about doing some-regular-task manually-and-tediously, and someone then showed you a device that would automate it? You might count your blessings and move forward with a skip in your step. But when you share the device's features with someone unfamiliar with the manual, tedious nature of an existence without it, they scratch their heads and say "I don't get it."
I am reminded of a joke where a lumberjack is in the market for a new saw. A powered-chain-saw salesman asks him how many trees he cuts down in a day with his manual saws, and the man says "30". Ahh, says the salesman, with one of these you could cut 100 or more in a single day. The lumberjack doesn't believe him, so the salesman tells him, Look, take this one for a test drive. Use it all day tomorrow and if it doesn't at least double your output, bring it back, no harm done, no questions asked. The lumberjack agrees but returns two days later, clearly disgruntled about the chain-saw's performance. "I was only able to cut down 10 trees with this lousy thing." To which the salesman balked, and wondered if it might not be defective. So he beckoned the lumberjack to follow him outside to their testing area, where he threw a log across two sawhorses and pulled the starter cord on the chain saw. When it roared to life, the lumberjack took a step back and shouted over the sound of the motor - "WHAT'S THAT NOISE?"
Clearly a little product-orientation was in order, no?
A CTO once lamented to me, "Well, we did the best we could with what we had." - Well sure. Don't we all? I don't know of anyone who borrows or rents help to do it poorly. Nor do they take their best people to deliberately make something sub-standard. The problem is, without a baseline knowledge of what the machine can do, how it is typically deployed, and what to embrace or avoid about it, then it's really no different than the lumberjack's problem. He did the best with what he had, didn't he?
What were the outcomes? Poor perception of the product by the user. An objective lack of productivity. General grousing about something that is not well-understood. Where have we seen this before? Give a call to practically any help desk of any product, especially a technology product, and they will bend your ear with "howling" examples of users who mis-applied the product - and some would say - were just plain stupid about it.
Underwriters Laboratories (UL) has a standard policy of quickly adjudicating claims against them no matter how frivolous. Seems that just having the "UL" on the product makes them a lightning rod for litigation. One man took his name-brand lawn mower, which also sported the UL sticker, picked it up while it was still running, and attempted to use it as a hedge-trimmer. He slipped, the lawnmower fell on him, and he sued the maker of the lawnmower and UL. Three boys found a giant bullfrog and decided to kill it by setting it on fire. They grabbed a gas can from the shed, doused the bullfrog with it and tossed a lighted match on the hapless creature. Someone should have told the lads about the volatility of gasoline fumes, because the flames climbed the can's fume-trail directly into its mouth and detonated the can's contents, seriously wounding and severely burning all three boys. They sued the makers of the gas can and UL, which also had a label on the can. Perhaps someone should send this one in to A Thousand Ways to Die. For bullfrogs.
But this is not a dissertation on A Thousand Ways to Fail with Netezza because frankly, it's really hard to fail with a machine this powerful. This is why I say that when we encounter howlers like the guy with the lawnmower or frog-immolation, we're clearly off the beaten path. Why is it then, that the "beaten path" pops up more often than it should? Or for that matter, pops up at all? Aren't data warehousing folks a little smarter than that?
Of course they are. In fact, I don't recall encountering any experienced data warehousing folks who have had a bad experience - quite the opposite. However, as for the folks who have never built a data warehouse but have a lot of experience in "applications" - well in this zone it can get a little choppy.
The point is, across the fruited plain we have exceptions to every rule. My sincere hope is that your project is not inadvertently caught in the "crosshairs of fate".
Rat's Nest Number One.
Upon arrival on site I knew something was wrong. People were squeezed into their cubes, boxes were stacked against walls in every room. The whole office just felt so crowded. And then they introduce me to "the machine". In this case, their production machine was the lowest-powered machine that Netezza had to offer, just short of a Skimmer. The admin at the desk barks at me for parking in the street and not in the garage underground. It's my first day, and nobody said anything about parking. The difference in cost was exactly $1, and if you're like I am and travel a lot, this kind of difference is not worth discussing. Except for here. It's tongue-lashing time. All of these things added up to some significant red flags, moving in a direction of a road lined with red flags.
Case in point, this is not a company doing things expediently or frugally. They are cheap. They will do things according to the lowest denominator of cost and skill, not because they have balanced priorities. They would rather save a few dollars on training or even a rent-an-architect, and allow the least-of-their-staff to painfully slog through the nuances of data warehousing on an immutable deadline. It's the immutable deadline I can't fathom. Here's why:
In a solution implementation, we have cost, duration and quality. Pick two. Whichever two you pick will shortchange the third. Every time it's tried. Well, these folks were shortchanging all three in the blind naivete that it was valid and workable. Without time or resources, quality is always the first, most expedient of the three to fall on its face. Doing it on-the-cheap? Well, what does this say of their readiness for prime-time? Data warehouses have an ongoing cost-of-ownership. It's not trivial. Those who want to play cheap should find another profession, one that does not cherish quality.
I was told by the client that their current back data processing environment used Netezza stored procedures. Another big red flag. Stored procs invite black-boxed code and we cannot capture data lineage through them. Netezza stored procs are ideal for the front-end. Never for the back-end. They are hand-crafted and rather ugly to maintain (this would be true on any platform).
On this particular platform, they had decided not to use monolithic stored procs (a proc with a lot of serialized operations in it) but use modular ones. So modular in fact, that each inbound data stream had its own dedicated "receivor" stored proc, followed by three more role-based stored procs plus another two - one to validate and one to push the data into the final target table. All 150 incoming record descriptions/filestreams had these 6 stored procedures assigned to them. With one catch - they were the same "role" of stored proc, but all of them were different. That's right, to intake 150 tables we saw 6 stored procs each, for a grand total of 900 stored procedures, and this was for just one of several data sources!
Many of you OO aficionados see something screaming out at you, that this should have been one general-purpose loader with six phases of operation, serving all 150 streams. Adding another source and another 100 streams, no problem, they go through the same loader and phases. Need more phases of operation? No problem, just add them to the loader and everyone benefits. Forever. It's a beautiful thing.
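To make the one-loader idea concrete, here is a minimal sketch in Python. It is only an illustration: the phase names, the in-memory TARGETS dictionary and the row format are all made up for the example, and a real loader would issue SQL against the machine rather than shuffle dictionaries. The point is the shape: six phases defined once, every stream passing through the same list.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Phase = Callable[[str, List[Row]], List[Row]]

# Stand-in for the final target tables, keyed by stream name (hypothetical).
TARGETS: Dict[str, List[Row]] = {}

def receive(stream: str, rows: List[Row]) -> List[Row]:
    # Phase 1: take a private copy of the inbound rows.
    return [dict(r) for r in rows]

def trim_strings(stream: str, rows: List[Row]) -> List[Row]:
    # Phase 2 (made-up role): strip whitespace from string values.
    return [{k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
            for r in rows]

def upper_strings(stream: str, rows: List[Row]) -> List[Row]:
    # Phase 3 (made-up role): zap string values into upper case on the back end.
    return [{k: v.upper() if isinstance(v, str) else v for k, v in r.items()}
            for r in rows]

def tag_source(stream: str, rows: List[Row]) -> List[Row]:
    # Phase 4 (made-up role): stamp each row with the stream it came from.
    return [{**r, "source_stream": stream} for r in rows]

def validate(stream: str, rows: List[Row]) -> List[Row]:
    # Phase 5: drop rows that lack an "id" (an assumed validation rule).
    return [r for r in rows if r.get("id") is not None]

def publish(stream: str, rows: List[Row]) -> List[Row]:
    # Phase 6: push the surviving rows into the stream's target.
    TARGETS.setdefault(stream, []).extend(rows)
    return rows

# One list of phases serves every stream. Add a phase here and all streams get it.
PHASES: List[Phase] = [receive, trim_strings, upper_strings,
                       tag_source, validate, publish]

def load(stream: str, rows: List[Row], phases: List[Phase] = PHASES) -> List[Row]:
    # Run one inbound stream through every phase, in order.
    for phase in phases:
        rows = phase(stream, rows)
    return rows
```

Calling load("customers", rows) and load("orders", rows) exercises the same six functions. Stream number 151 costs a metadata entry, not six more stored procedures.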
Of course, this means a deliberate instantiation of some reusable infrastructure. Many app-developer folks are not familiar with how to do this. After all, with 150 tables incoming, we could expect those definitions to remain pretty stable. But if the same stored procedures are facing the internal data model(s) (and they must), then we have worse than a cut-and-paste rat's nest, we have a hand-crafted rat's nest. If the data model must change, we may effectively invalidate most if not all of the stored procedures. Can you even imagine having to review - and re-review - 900 stored procedures so -- oh never mind.
It therefore did not surprise me to learn that they had "frozen" the data model so that the stored procedures could have higher durability. We know this isn't realistic either, because the business will start to drive more requirements into the solution and the model must change to accommodate it, even if it's just attribution of existing tables. How do we keep these things from impacting the existing code? We have no choice but to freeze the data model. But this isn't really a choice, it's more like an unnecessary evil. Their stored procedure implementation only guarantees one thing: their functional code base will be in flux and unstable for the duration of the solution's lifetime.
I made a valiant attempt to explain the rather problematic issues concerning their implementation. (Problematic here, is a polite and professional term for rat's nest without having to say so). I have to admit however, that "rat's nest" may have done disservice to the rats. They also wanted me to "jot down" a list of "enhancements" that would make their solution better, stronger, faster - all that stuff. I could not think of a professional term for "burn it to the ground and start over".
Perhaps I could have told them the bullfrog story.
Rat's Nest Number Two
In keeping with our theme of stored procedures - recall - stored procedures in Netezza were originally conceived for supporting the front end BI tools. Not back-end data processing. In fact, pushing the back-end data processing under higher programmatic control is something new - even to ETL tools. That Netezza supports it very well is a bonus. Actually, Netezza does it better than any of the other databases, because the ease of manufacturing an intermediate table, using it and tossing it, is amazingly simple and easy to manage. Other machines, not so much.
But when I say "ELT" or back-end processing, it's still a SQL statement. We have options, like hand-crafting the SQL in script. Been there, done that. Or hand-crafting a stored procedure. Not really interested in doing that again. And then we have generated-SQL from a template or metadata-driven framework.
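For a feel of what "generated SQL from a template" can look like, here is a small sketch using Python's standard string.Template. The table names, column lists and the metadata layout are hypothetical, and a real framework would carry far more (distribution keys, filters, lineage tags). It shows only the mechanism: one template, many statements.

```python
from string import Template

# Hypothetical metadata: one entry per stream, describing staging and target.
STREAMS = [
    {"target": "dim_customer", "staging": "stg_customer",
     "columns": ["customer_id", "name"]},
    {"target": "fact_order", "staging": "stg_order",
     "columns": ["order_id", "customer_id", "amount"]},
]

# One template serves every stream. Only the metadata differs.
INSERT_TEMPLATE = Template(
    "INSERT INTO $target ($cols)\n"
    "SELECT $cols\n"
    "FROM $staging;"
)

def generate_sql(meta: dict) -> str:
    # Render the insert-select statement for a single stream.
    cols = ", ".join(meta["columns"])
    return INSERT_TEMPLATE.substitute(
        target=meta["target"], staging=meta["staging"], cols=cols
    )

def generate_all(streams: list) -> list:
    # Render one statement per stream, in metadata order.
    return [generate_sql(m) for m in streams]
```

A change to the template ripples through every generated statement at once, which is the whole attraction over hand-crafting each one.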
ETL tools have pushdown, but it's still pretty weak. At least, too weak for power-users like moi. I have no doubt that they will step up, eventually.
In this example, we have the opposite problem from the first. Just as much of a rat's nest, it is a monolithic stored procedure rather than a gaggle of modular ones. The monolithic stored procedure often runs for an hour or more, executes hundreds of SQL statements along the way, and has a lot of detailed steering logic embedded amongst them. It is a veritable nightmare to code and debug, and even worse to troubleshoot. I hear that some developers have been invited to padded cells afterwards, but I think those are just exaggerations. It can't be that bad, can it?
Given a choice between only the two, I would choose the gaggle-of-modular over the monolithic. I mean, if you were implementing it and not me. I always have the choice to say no. I don't work for your company, after all. You may not have a good choice. Your uber-architects and their hired guns have told you it's stored-procedures-or-nothing. So it's time to pick a poison I suppose. I'll take hemlock for $400, Alex.
As these stored-proc programmers stared back at me with hollow eyes, I thought I had entered some macabre Tim Burton flick and all we needed was some spooky music, fog-machines and strange howling in the distance to make it complete. They spoke in muted, muffled tones and their questions seemed to drift. Had they slept in like, the last 48 hours? They all looked sooo tired. This is what a monolithic stored procedure does to your staff. Now watch it drain the lifeblood from your operations staff. It is the virtual/technical equivalent of leeches, and you thought we'd left those behind in the Middle Ages (for technology, that would be the 1990's).
Stored procedures don't have single-step capability. When we add another function to it, we have to test all of the functions at once, because it has to run end-to-end. We can creatively work around this in the beginning, but eventually we have to integrate it. When one test takes over an hour, or two, and the answer is buried in the mountain of carefully crafted NZPL-SQL code, at some point we have to wonder what we signed up for. (that would be, we signed up to do it wrong). Ouch.
Stored procedures cannot be parallelized (unlike their more modular counterparts) and as such are a glaringly missed opportunity. They are doomed to be serialized forever.
Now, our framework (that we consult and use as a problem-solving platform for Netezza - nzDIF) handles 100 percent of all data lineage no matter how many intermediate tables, databases or machines are involved in the overall flows and handoff of work. You won't get this with any other product, nor with anything a stored procedure has to offer. This would be true of any stored procedure on any platform.
This is because on a transactional platform, procs are meant to handle multiple operations on singleton entities so data lineage simply is not an issue. On a Netezza platform, procs are meant to serve the BI platform, not the back end, so likewise data lineage is not much of an issue. Stored procs for the front end are largely summary/filters for pre-existing datasets. We want the lineage on those datasets, not the on-demand operations that consume them. "Could" we expect data lineage from stored procedures in Netezza? Why? The only reason would be to support back-end processing, and stored procedures are not for back-end processing. It's sort of a Catch-22.
Rule #10 in play here
And let's not forget Rule #10, shall we? Recently emblazoned in glowing letters on the catacomb walls of the Underground, Rule #10 is very simple: Never do bulk data processing in a general-purpose RDBMS engine running on a general-purpose platform.
Now, I just had to get Rule #10 into the forefront because this underscores the primary reason why stored procedures are bad for back end data processing. If we have a rule in place against using SQL for bulk processing on a general-purpose platform/engine, then any experience we may have with bulk processing through stored procedures on such a platform is itself a violation and not a marketable skill in the Enzee Universe. More importantly, it institutionalizes the violation and makes it so much worse. We could "maybe" dig ourselves from a ditch if we're using hand-crafted out-in-the-open SQL (also not recommended) but when ensconced behind the fortress of stored procedures, we have to first storm the fortress before we can loot it. Easier said than done.
All that said, folks who have instantiated stored-procedure-based data processing on general purpose platforms have already been doing it the wrong way, so why bring those practices into the Netezza machine? Just sayin'
Rat's Nest Number Three
Ahh, you thought we were done and coming into the home stretch, eh? Well, we're almost there.
This particular rat's nest only appears in places where people have churned a lot of contractors, consultants and other aficionados and hired guns through the company's various revolving doors. And as the person who inherits it rightly recognizes it as a rat's nest, or a hairball, something comes to mind that rushes through their brains like a river of water: "Wow, and I deliberately signed up for this. What could I possibly have been thinking?"
Ahh, not to worry, this syndrome is rare, and shall pass. Breathing deeply will override the spooky breathing cadence of the Dark Lord of Expectations, and shall give you extraordinary confidence on how to resolve this problem.
This condition is entirely severable from the technology itself. Any shop that allows the contractors to establish their own standards without oversight is just signin' up for a world-of-hurt somewhere down the line. Fortunately for us, the Netezza machine is like a monster truck with gumbo-mudder tires. No matter how mired-in-clay it may be, we need only fire the engines and punch the accelerator to regain control and be underway in no time.
The first step, like any 12 step program, is to recognize that a problem exists. Bandaging a hemorrhaging wound will not heal it. This will only forestall the inevitability of bleeding out. If we are to be proper stewards, bandaging has its benefits while we treat the larger wounds.
First and foremost, commit to some form of data management logistics. And this is not by purchasing a data backup tool. This is a commitment to flow-based, insert-only architecture as the rule, with updates and deletes as the exceptions. After all, if we were using an ETL tool, we wouldn't be able to update or delete a flow of work. We can only integrate and filter the data while it's on its way elsewhere, but that elsewhere will always be an insert-only target - because it's a file set and not a database. Only when we reach the book-end of the database can we perform updates and deletes, and these are largely to support things like constraints and slowly-changing dimensions. We just need to avoid invoking an insert/delete/update protocol for all tables at all times. Center on a theme and accommodate the exceptions. We must have rules, and this is one of them.
Commit to some form of rules-driven architecture. That is, when we encounter a new condition or potential fork in the logic, consider shaping it with a rule (one that we can switch on/off or modify from afar) rather than hard-wired SQL or hard-coded solutions. Is this easy to do? Of course not. Nothing is easy about data warehousing or large-scale flow mechanics, silly rabbit.
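As a small illustration of shaping a fork in the logic with a rule rather than hard-wiring it, consider the sketch below. The rule names and the row fields are invented for the example, and in practice the rule list would live in a table or config file so it can be changed "from afar" without touching code.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rule:
    name: str                       # label used to find and toggle the rule
    enabled: bool                   # the on/off switch
    apply: Callable[[dict], dict]   # the transformation itself

def run_rules(row: dict, rules: List[Rule]) -> dict:
    # Apply each enabled rule in order and skip the ones switched off.
    for rule in rules:
        if rule.enabled:
            row = rule.apply(row)
    return row

# Hypothetical rules. The second one ships disabled until the business asks for it.
RULES = [
    Rule("upper_state", True,
         lambda r: {**r, "state": r["state"].upper()}),
    Rule("default_country", False,
         lambda r: {**r, "country": r.get("country") or "US"}),
]
```

Flipping RULES[1].enabled to True changes the behavior of every flow that uses the list, with no code edited and nothing redeployed.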
Netezza has simplified the harder, tedious and repeatable parts so that we can actually address the issues we never had time for before. The "next level" was never in view, or even on the radar because we were always immersed in the operational weeds of the implementation. With a Netezza machine, lots of that is behind us, but before us stands the new challenge. It goes something like this:
If I were to give you a 400-slice toaster, your problem is no longer toasting bread, but bread management. Keeping the toaster busy has now become a daunting problem of bread logistics, not machine capacity. The problem domain has shifted into a zone that lots of folks don't have any experience with. Time to step up.
Tossing around the term "Big Data" these days seems to elicit a wide variety of feedback, concern, conjecture, etc. Those in the "old school" of Big Data wrestled with billions of rows of structured data. The buzz now is for unstructured Big Data, and for folks in this zone, it's as though the "other" form of Big Data never existed. After all, they were never exposed to it. Their introduction to Big Data, as though it was brand new, was big "unstructured" data. I recently posted out on LinkedIn a tongue-in-cheek rendition of the branding problem Big Data is having. I have received so many emails on it that I thought it deserved a little more exposure, if only for pure entertainment purposes. Here we go:
Big (unstructured) Data was so-dubbed for lack of a better word. Unfortunately it has suffered the same ambiguity as Kleenex (don't we ask for a Kleenex when we mean any-old-tissue?) or Coke (don't kids ask for a Coke when they really mean any-old-soda, and didn't Coca-Cola have to go on a market scourge to avoid losing the meaning of the word? If you ask for a "Coke" the person at the counter is required to correct your choice if they don't serve it) and don't we use the word "cellophane" to mean - oh wait - cellophane really is a distinct product that never protected the meaning of its name, so it really is lost.
I would honestly rather protect the meaning of the term Big Data to mean Big Structured Data without having to say so. Alas, I fear it has already been hijacked forever.
Give me a chicken sandwich, french fries and a Coke, with a side of Data.
We only serve Pepsi, will that be fine?
Yeah sure. And supersize the data to Big Data.
Will that be Big Structured Data or Big Unstructured Data?
What's the difference?
One is like a burger and the other is like a salad.
One we make from a stack with a secret sauce. The other we just toss together at the last minute.
Does it come with dressing?
As much as you can handle.
Make it so.
Unstructured it is. That will be $20.13 at the first window.
I tossed the clothing into the washer, grabbed the non-chlorine bleach, popped off the cap and poured some into the tub with the clothing. My wife, horrified, asked "What are you doing, didn't you measure it?" To which I say "Of course, it was three bloops."
"Three bloops," she recoils in further horror, "are you kidding me?"
"Well, no, look it requires exactly one cup of bleach for this size of load," I explained, then grabbed the measuring cup and turned the bleach bottle on top of it and let the contents "bloop" three times. It made exactly one cup of liquid. It would be one cup of liquid no matter how many times I repeated it. I simply took the shortcut and measured the bloops. An inexact measurement but just as effective. Of course, she wants me to use the measuring cup every time, and cannot imagine going through life with a bloop-here or a bloop-there. I mean, all those recipes in the kitchen with exact measurements. Do we reduce those to inexact quantities too?
A great mathematician once told me that he never figures the exact to-the-penny tip for a wait-staff of a restaurant. If he wants to give fifteen percent, he mentally calculates ten percent by chopping a zero off the whole-dollar amount, then divides this value approximately in half, adds it to the first then upwardly rounds to the nearest dollar. This method is horrifying to accountants, whom I have seen use calculators to figure exact tip amounts. He also told me that when figuring Celsius temperature, he simply subtracted 32 and divided by two. Is this the right "formula"? No, it's supposed to be 5/9ths right? But if all he wants to know is whether to grab a sweater, coat or neither, this is close enough.
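For the curious, the two shortcuts are easy to compare against the exact figures. The sketch below is my reading of his method (the function names are made up), with the exact formulas alongside.

```python
import math

def rough_tip(bill: float) -> int:
    # Chop a zero off the whole-dollar amount (10 percent), add half of that
    # again (about 5 percent), then round up to the nearest dollar.
    ten_percent = int(bill) / 10
    return math.ceil(ten_percent + ten_percent / 2)

def exact_tip(bill: float) -> float:
    # The accountant's version: 15 percent, to the penny.
    return round(bill * 0.15, 2)

def rough_celsius(fahrenheit: float) -> float:
    # The shortcut: subtract 32 and divide by two.
    return (fahrenheit - 32) / 2

def exact_celsius(fahrenheit: float) -> float:
    # The textbook formula: subtract 32 and multiply by 5/9.
    return (fahrenheit - 32) * 5 / 9
```

On a $47.80 bill the shortcut leaves $8 where the calculator says $7.17, and at 50 degrees Fahrenheit the shortcut reads 9 Celsius where the formula says 10. Either way, it's sweater weather.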
I made some chili later that evening. This is a simple recipe. We brown and drain two pounds of ground meat. We then toss this into a two quart container and follow it with three large cans of diced tomatoes. Plus three packets of chili mix. Do I need to read the labels, really? Sure, it calls for a cup of this or several ounces of that. But let's face it, we're after a certain taste and I know that these ingredients in these proportions deliver it. This particular evening my daughter and I were making the chili together and she carefully followed the instructions, but we were left over with half a packet of chili mix plus half a can of diced tomatoes. We can see the pattern here, right? So to her horror I tossed the additional tomatoes into the container along with the mix and started stirring. I don't know if this is a "guy" thing or not. My youngest son was also horrified at my abject disregard for the instructions on the packet. I have tampered with the forces of the universe, you see. Do not deviate from the recipe, lest the earth open up and swallow us all. Or something like that.
Anyhow, when we served up the chili, they ate as voraciously as always and all was well. As a throwback from the days of yore, I add mustard to my helping of chili and use Frito's Scoops to spoon it out. Anyone familiar with Frito-Pie fully understands the connection.
Now one might well ask, what on earth has this to do with enterprise architecture or anything akin to it? In science, aren't we supposed to cross every T and make sure nothing is amiss? Yes and no. Implementations drill on detail. Architecture, not so much.
I listened to a crew of IT admins debate the required size of their new environment. One of them quipped that if we need more space or CPUs, we would need 6 weeks of lead time to order it. After sizing the environment, all agreed that we were at least within 10 percent of the necessary sizing, and this should be good enough. All except for one, who was concerned that if it was too small, we would have to order more. But you have six weeks to decide, right? Well, no, we need to order the hardware now so that it will be here in six weeks when the rollout needs it. Yeah, said another, but if we run into an issue between now and then we just order more. We don't need all of it right away. We're only using about twenty percent of the capacity to begin with. What's the big deal?
And yet, they continued in their paralysis, unwilling to make a commitment until one of our rank simply forced them to. In the blink of an eye, all was clarified when everyone present was willing to admit that such decisions have inexact quantities all over them. It's an educated guess. Like a hypothesis. So we're still using science, but we have to make a call and get moving. People depend on it, and time has expired to delay any further.
We see a lot of this kind of in-exactness in data warehousing. Capacity planning especially has percentages and utilization written all over it. A case in point is that we need to be seriously considering capacity upgrades when a system reaches 60 percent of its current capacity. Why is this? Because a red line exists at the 80 percent mark, and above that is reserved for system recovery and workspace. It is not okay to presume that the "last 20 percent" of capacity is available for regular use. It is the red zone. But if we go to the boss and say we are reviewing capacity when the current utilization is only 10 percent above the halfway mark (60 percent) - they may well ask - isn't this a bit premature?
Well, it honestly depends on how long it takes to get the upgrade/transition for capacity underway. If the assessment takes a few weeks, and the designation, procurement, delivery and installation take many more weeks, we have to consider how fast the data is growing. Will the data grow into another 10 percent of the machine by the time we're able to install the upgrade? Okay, then, we're still outside the red zone. But every percentage point above this is one tick closer to the red zone. And we don't want to cross it. Many times I have personally witnessed "perfectly operational" systems simply hang one day. Out of the blue. The processing capacity required for an intermittent spike did not have room to finish. Or an error recovery needed more spill space than it had left to give. Simple things often lead to catastrophic failure when chaos has no place to go. Or for that matter, when the machine cannot dispatch the chaos because it is already too overwhelmed.
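The reasoning above reduces to a small back-of-the-envelope check. This sketch assumes growth is roughly linear, measured in fraction-of-capacity per week, which is itself an educated guess. The function names and the status labels are made up, while the 60 and 80 percent marks come straight from the discussion above.

```python
REVIEW_AT = 0.60   # start seriously considering an upgrade
RED_LINE = 0.80    # above this is reserved for recovery and workspace

def projected_utilization(current: float, weekly_growth: float,
                          lead_time_weeks: float) -> float:
    # Where utilization lands by the time an upgrade could be installed,
    # assuming linear growth (an assumption, not a law).
    return current + weekly_growth * lead_time_weeks

def capacity_status(current: float, weekly_growth: float,
                    lead_time_weeks: float) -> str:
    # Classify the situation using the 60/80 percent marks.
    if current >= RED_LINE:
        return "in red zone"
    if projected_utilization(current, weekly_growth, lead_time_weeks) >= RED_LINE:
        return "order now"
    if current >= REVIEW_AT:
        return "review"
    return "ok"
```

At 60 percent utilization, growing one point a week with a ten week lead time, we land at 70 percent and the answer is "review". At 65 percent and two points a week, the same lead time puts us at 85 percent, and the answer is "order now".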
A colleague relates that in his data processing shop, the disk space had long since breached capacity and regularly spilled over to tape drives during the evening's processing window. As he described it to me, their environment was using tape drives for runtime workspace! And the CIO could not stop complaining about how long the jobs were taking, but simply refused to buy any additional disk space for the environment. In his mind, the final storage needed exactly 80 percent of the capacity and they were not in the red zone. But weren't they?
In a data processing scenario, the "understood" quantity of disk space runs anywhere from 6x to 8x of the final product's size. So if we are targeting a 1 TB warehouse, we would need between 6 TB and 8 TB of workspace to support it. Whether this is actually hosted on the physical database machine is immaterial if the disk space is shared between database and the external, flat-file world, which is often the case. I recall one instance where I specified 300 gb for a 20 gb warehouse and the manager, himself a warehousing aficionado, raised a strong objection to such a need. When we did the math, I was actually being pretty conservative in the estimate, seeing that we needed to support Development, Testing and Production workspaces, you see. With 60 gb between them, and 6x needed to support each - voila! We have easily breached 300 gb. The punch-line of course, was that they needed to order an additional 150 gb to finalize the project. Oh well, it's just an educated guess.
But if he was willing to give me so much grief over 300 gb, imagine what I would have heard for 450 gb? The point being, it won't always be true that our bosses will give us, for all configuration lifecycle environments, upwards of 40x what we need in the end - but it certainly gives us insight as to why "all that disk space" seems to evaporate within a couple of months of the project's inception!
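The math behind that 20 gb example is simple enough to sketch. The function below is hypothetical. It just multiplies the final size by the number of environments and by the 6x to 8x rule of thumb, with three environments (Development, Testing, Production) assumed as in the story.

```python
def workspace_range(final_size: int, environments: int = 3,
                    low_factor: int = 6, high_factor: int = 8) -> tuple:
    # Return (low, high) workspace estimates in the same unit as final_size.
    base = final_size * environments
    return (base * low_factor, base * high_factor)

# The 20 gb warehouse, across three environments:
low_gb, high_gb = workspace_range(20)

# The 1 TB warehouse, looking at a single environment only:
low_tb, high_tb = workspace_range(1, environments=1)
```

For the 20 gb warehouse this comes to between 360 and 480 gb, so the original 300 gb request was indeed on the light side, and the eventual 450 gb lands comfortably inside the range.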
With set-based operations, big structured data handling, and now big-data on-the-grid, we will find even more "inexactness" to wade through. I had an interesting conversation just this week with someone who could-not-believe the data being returned by his big-data cluster. Something had to be amiss, he asserted, because "he just knew" that things had-to-be-different. Basically, he'd spent millions on marketing and brand recognition and had expected measurable lift for his product. When it did not arrive, it basically meant that all those millions were spent for nothing. Either that, or he was looking in the wrong place. I asked him if sales ever changed after one of these marketing pushes, and he said no, the marketing pushes were traditionally geared to keep product loyalists from defecting.
So I asked a very impertinent question - how do you know if the marketing pushes are doing anything at all? Wouldn't it be odd to just forego the next marketing push and see-what-happens? This was interesting to him, but simply out of his hands. The engine to create the marketing collateral and the waves of market "push" were ensconced as science in the highest echelons of the company. Asking them to forego even one cycle, and the risk involved in such a thing, could be suicide.
So let's measure it, I suggested. If the quantities are a science and we can know for certain where the lift is, or is not going, we can measure it as a trend. So he set up a number of market "trolls" as it were, to cast the net for information on their products and various trending for competitor products. He pulled these stats daily for the month prior to the marketing push, through the push and for one week thereafter. I warned him that if it measures "nothing" we have nothing to report on. We really need to report on "something" so we can show what directly affects loyalty to the product. He knew of several interesting 'anti'-quantities that could show us, almost in negative terms, whether the marketing push had any value. These are proprietary so I will not share them here. Nonetheless, by measuring these anti-quantities we could see a loyalty trend in a different way. Not when people re-aligned with their products, but when they dis-aligned with them.
This was an interesting graph. It showed that the loyalty to their brands had less to do with their marketing pushes and more to do with the sale-event discounts associated with their competitors. In a comedy of errors, their marketing pushes just-so-happened to be timed when their competitor sale-events were ebbing off, offering the illusion that loyalty was being restored when in fact it was simply re-aligning to its normal center. What if, he mused, they simply delayed the marketing push for some point after the customer loyalty naturally aligned-to-center? This could in fact pull in even more loyal customers, or let them know whether the loyalty push had any value at all.
Last year I ran into my colleague again, some two years after we had first characterized the situation. He told me that after showing all of the metrics, the marketing folks pooh-pooh'd his findings. All except for one, an ambitious soul who had recently been promoted to the second-in-command of the marketing department. According to legend, this person worked with my colleague to distil the right answers, inexact though they might be. Some 8 months later, they finally agreed to offset the marketing push by four weeks to see what the results would be. Three weeks into this cycle, with the upward trends behaving normally even though no marketing push was underway, gave them what they needed to know. The marketing pushes were at best, mis-timed and at worst, completely worthless. They decided to forego the marketing push entirely. Six months later, the trends remained in place without any effort on their part. With a primary benefit: They had saved many millions of dollars in marketing expenses. And by this time they had already begun the process of running an entirely different kind of marketing push, this time on the edge of the peak rather than in the trough of the lull.
So we can see that watching "trends" or "patterns" of gross movement gives us insight into how to attack (or retreat from) the marketplace in ways that make us more competitive. These gross movements are inexact. While we cannot conjure up successful marketing potions with three-bloops of elixir, the approach to success is not so different. Patterns, swaths, wakes, edges, trends, peaks etc are all inexact measurements we derive from the existing information. But we need to do it on a scale that is impossible with commodity, general-purpose technologies. More importantly, while the detail data drives the final results, incrementally more information may not "move the needle" at all. In fact, just like the measurement of three-bloops - there's very little in how that measurement system will deviate from the center in any significant manner. It is this "significance" we care about, and why the inexact results and processes to derive them may not be lockstep-perfect, but they tell us what we need to know.