Modified on by DavidBirmingham
One of the most significant questions to answer on the Netezza platform is that of "distribution" - what does it mean? how does it apply? and all that. If you are a Netezza aficionado, be warned that the following commentary serves as a bit of a primer for the un-initiated. However, feel free to make comments or suggestions to improve it, or pass it on if you like.
In the prior entry I offered up a simple graphic (right) as to how the Netezza architecture is laid out internally.
In this depiction, each CPU is harnessing its own disk drive. In the original architecture, this was called a SPU, because the CPU was coupled with a disk drive and an FPGA (field-programmable gate array) that holds the architectural firmware "secret sauce" of the Netezza platform. The FPGA is still present and accounted for in the new architecture as the additional blade(s) on the blade server. Not to over-complicate things, so let's look at the logical and physical layout of the data itself and this will be a bit clearer.
When we define a table on the Netezza host, it logically exists in the catalog once. However, with the depicted CPU/disk combinations (right) we can see 16 of these available to store the information physically. If the table is defined with a distribution of "random", then the data will be evenly divided between all 16 of them. Let's say we have a table arriving that has 16 million records. Once loaded, each of these SPUs would be in control over 1 million records. So the table exists logically once, and physically multiple times. For a plain straight query on this configuration, any question we ask the table will initate the same query on all SPUs simultaneously. They will work on their portion of the data and return it an answer. This Single-Instruction-Multiple-Data model is a bread-and-butter power strategy of Massively Parallel computing. It is powerful because at any given moment, the answer we want is only as far away as the slowest SPU. This would mean as fast as the SPU can scan 1 million records, this is the total duration of the query. All of the SPUs will move at this speed, in parallel, so will finish at the same time.
What if we put another table on the machine that contains say, 32 million rows? (trying to keep even numbers here for clarity). The machine will load this table such that each SPU will contain 2 million rows each. If no other data is on the machine, we effectively have 3 million records per SPU loaded, but the data itself may be completely unrelated. Notice how we would not divide the data another way, like putting 16 million records on 6 of the SPUs and the other 32 million records on the remaining 10 SPUs. No, in massively parallel context, we want all the SPUs working together for every query. The more SPUs are working, the faster the query will complete.
Now some understand the Massively Parallel model to be thus: Multiple servers, each containing some domain of the information, and when a query is sent out it will find its way to the appropriate server and cause it to search its storage for the answer. This is a more stovepiped model than the Netezza architecture and will ultimately have logistics and hardware scalability as Achilles's heels, not strengths for parallel processing. Many sites are successful with such models. But they are very application-centric and not open for reuse for other unrelated applications. I noted above that the information we just loaded on the Netezza machine can be in unrelated tables, and databases and application suites, because the Netezza machine is general-purpose when it comes to handling its applicability for data processing. It is purpose-built when we talk about scale.
But let's take the 16-million-record vs 32-million record model above and assume that one of the tables is a transactional table (the 32 million) and one is a dimensional table (the 16 million). The dimensional table contains a key called "store_id" that is represented also on the transactional table such that we can join them together in various contexts. Will the Netezza machine do this, and how will it? After all, the data for the tables is spread out across 16 SPUs.
Well, we have the option of joining these "as is" or we can apply an even more interesting approach, that of using a distribution key. Here's where we need to exercise some analysis, because it appears as though the STORE_ID is what we want for a distribution, but this might skew the data. What does this mean? When we assign a distribution, the Netezza machine will then hash the distribution key into one of 16 values. Every time that particular key appears, it will always be assigned the same value and land on the same SPU. So now we can redistribute both of these random tables on STORE_ID and be certain that for say, SPU #5, all of the same store_id's for the dimension table and for the transaction table are physically co-located on the same disk. You probably see where this is going now.
Ahh, but what if we choose store_id and it turns out that a large portion of the IDs hash to a common SPU? What if we see, rather than 1 million records per SPU, we now see around 500k per SPU with one SPU radically overloaded with the remainder? This will make the queries run slow, because while all the SPUs but one will finish fast, in half the time of before, they will all be waiting on the one overworked SPU with all the extra data. This is called data skew and is detrimental to performance.
However, had we chosen a date or timestamp field, our skew may have been even worse. We might see that the data is physically distributed on the SPUs in a very even form. But when we go to query the data, we will likely use a date as part of the query, meaning that only a few SPUs, perhaps even one SPU, will actually participate in the answer while all the others sit idle. This is called process skew and is also detrimental to performance.
Those are just some things to watch out for. If we stay in the model of using business keys for distribution and using dates for zone maps (which I will save for another entry) we will usually have a great time with the data and few worries at all.
Back to the configuration on store_id's. If we now query these two tables using store_id in the query itself, Netezza will co-locate the join on the SPU and will only allow records to emit from this join if there is a match. This means that the majority of the work will be happening in the SPU itself before it ever attempts to do anything else with it. What if we additionally clamped the query by using a date-field on the transaction table? The transaction table would then be filtered on this column, further reducing the join load on the machine and returning the data even faster.
So here is where we get lifting effects where other machines experience drag. For an SMP-based machine, adding joins can make the query run slower. On a Netezza machine, joins can serve as filter-effects, limiting the data returned by the query and increasing the machine's immediate information on where-not-to-look. Likewise the date-filter is another way we tell the machine where-not-to-look. As I have noted, the more clues we can give the machine as to where-not-to-look, the faster the query will behave.
I have personally witnessed cases where adding a join to a large table actually decreases the overall query time. Adding another one further decreases it. Adding another one even further and so on - five-joins later we were getting returns in less than five seconds where the original query was taking minutes. This filter-table and filter-join effect occurs as a free artifact of the Netezza architecture.
Distribution is simply a way to lay the data out on the disks so that we get as many SPUs working for us as possible on every single query. The more SPUs are working, the more hardware is being applied to the problem. The more hardware, the more the physics is working for us. Performance is in the physics, always and forever. Performance is never in the software.
While the above is powerful for read-query lift (co-located reading) there is another power-tool called the co-located write. If we want to perform an insert-select operation, or a create-table-as-select, we need to be mindful that certain rules apply when we create these tables. For example, if we want to perform a co-located read above and drop the result into an intermediate table, it is ideal for the intermediate table to be distributed on the same key as the tables we are reading from. Why is this? Because the Netezza machne will simply read the data from one portion of the disk and ultimately land it to another portion of the same disk. Once again it never leaves the hardware in order to affect the answer. If we were to distribute the target table on another key, the data will have to leave the SPU as it finds a new SPU home to align with its distribution key. If we actually need to redistribute, this is fine. But if we don't and this is an arbitrary configuration, we can buy back a lot of power just by co-locating the reads and writes for maximum hardware involvement.
So that's distribution at a 50,000 foot level. We will look at more details later, or hop over to the Netezza Community to my blog there where some of these details have already been vetted.
It is important to keep in mind that while distribution and co-location are keys to high(er) performance on the machine, the true hero here is the SPU architecture and how it applies to the problem at hand. We have seen cases where applying a distribution alone has provided dramatic effects, such as 30x and 40x boost because of the nature of the data. This is not typical however, since the co-located reads will usually provide only a percentage boost rather than an X-level boost. This is why we often suggest that people sort of clear-their-head of jumping directly into a distribution discussion and instead put together a workable data model with a random distribution. Once loaded, profile the various key candidates to get a feel for which ones work the best and which ones do not. We have seen some users struggle with the data only because they prematurely selected a distribution key that - unbeknownst to them - had a very high skew and made the queries run too slow. This protracted their workstreams and made all kinds of things take longer than they should have.
So at inception, go simple, and then work your way up to the ideal.
On multiple distribution keys:
Many question arise on how many distribution keys to use. Keep in mind that this is as much a physical choice as a functional one. If the chosen key provides good physical distribution, then there is no reason to use more keys. More granularity in a good distribution has no value. However, a distribution key MUST be used in a join in order to achieve co-location. So if we use multiple keys, we are committing to using all of them in all joins, and this is rarely the case. I would strongly suggest that you center on a single distribution key and only move to more than one if you have high physical skew (again, a physical not functional reason). The distribution is not an index - it is a hash. In multi-key distributions, the join does not first look at one column then the next - it looks at all three at once because they are hashed together for the distribution. Joined-at-the-hip if you will.
On leaving out a distribution key:
One case had three tables joining their main keys to get a functional answer. They were all distributed on the same key, which was a higher-level of data than the keys used in the join. The developer apparently thought that because the distribution keys are declared that the machine will magically use them behind the scenes with no additional effort from us. This is not the case. If the distribution key(s) (all of them) are not mentioned in the join, the machine will not attempt co-location. In this particular case, using the higher-level key would have extended the join somewhat but would not have changed the functional answer. Simply adding this column to the join reduced that query's duration by 90 percent. So even if a particular distribution key does not "directly" participate in the functional answer, it must directly participate in the join so as to achieve co-location. And if this does not change the functional outcome, we get the performance and the right answer.
How it affects concurrency:
Many times people will ask: Why is it that the query runs fast when it's running alone, but when it's running side-by-side with another instance of itself they both slow to a crawl? This is largely due to the lack of co-location in the query. When the query cannot co-locate, it must redistribute the data across the inter-blade network fabric so that all the CPUs are privy to all the data. This quickly saturates the fabric so that when another query launches, they start fighting over fabric bandwidth not the CPU bandwidth. In fact some have noted that the box appears to be asleep because the CPUs and drives aren't doing anything during this high-crunch cycle. That's right, and it's a telltale sign that your machine's resources are bottlenecked at the fabric and not the CPU or I/O levels. By co-locating the joins, we keep the data off the fabric and down in the CPUs where it belongs, and the query will close faster and can easily co-exist with other high-intensity queries.
Modified on by DavidBirmingham
For the past many months I have been diligently updating and upgrading the original 2008 Netezza Underground to address the many features of TwinFin, Striper and other offerings from IBM. I have recently been notified that it has passed final edit and is available on Amazon.com.
All I can say is "whew!" and many thanks to those who helped put it together. It's been a whirlwind.
Here is the URL
When I started the project I realized that a big part of the original book remains timeless. I didn't leave it "as is" though - practically every page and all the chapters have new material, case studies and such. I peppered the book with some additional graphics since the intrinsic points require a bit more reinforcement than mere words will suffice.
The original chapter on "Distribution Stuff" is now "Performance Stuff" and is twice as long, covering the various aspects of setting up tables, troubleshooting, page-level zone maps and a lot more.
Fortunately, this time around there is a better mechanism to contact me if you have questions or want to report any errata (hey, it could happen!) - you can reach me through this blog, on linked-in or directly through my primary email address at Brightlight Consulting:
For a brief history of why ELT (that is, in-box-data-processing) is even a topic of discussion, we must recognize that the high-powered appliances such as Netezza have not only made such implementation viable, but even desirable.
Just so we level-set on what one of these looks like, it's a SQL statement. Usually an insert/select but can also include updates and the like. Many of you recognize these as multi-stage operations inside a stored procedure. The sentiment of course is that such an animal can perform better inside-the-machine than to take it out of the machine (through an ETL tool), process it, and put it back. This may be true for smaller data sets, but you aren't reading this because you have smaller data sets, are you? Netezza users are big-game hunters. Pea-shooters are for small animals, but if we want big game, we'd better bring a bigger game with us, put on a bigger game face, and bring bigger equipment.
But is it really the size of the equipment? Netezza users know that the size and the architecture are keys to success. Dog-piling a bunch of CPUs together does not a power-frame make.
Okay, back to the storyline here - in-the-box SQL-transforms, in a traditional RDBMS platform, are the realm of small game. Once the game gets bigger, these transforms degrade in performance, and rather rapidly. Watch as a swarm of engineers tries to reconstruct and rebuild the procedures, the SQL, the tables and even upgrade the hardware in a futile attempt to achieve incrementally more power. Emphasis on incrementally-more. Not linearly more.
As they grow weary of this battle, the ETL tools start looking better and better. We reach a critical mass, and the ELT-in-the-box is summarily forsaken as we stand up the ETL tool.
Sometime after this transition, the data volumes once again overwhelm the environment. One thing leads to another, and one day the Netezza machine arrives on the floor. Hopes are high.
But notice the transition above - ELT was forsaken for ETL, likely never to return. But wait, now we really have the power to do the ELT SQL-transforms, but we've mentally and perhaps emotionally (yeah, verily even politically) moved away from SQL-transforms.
Some reading this might scratch their heads and think, What's he talking about? We've been doing ELT in the machines using (PLSQL, etc, name your approach here) for many years. Why would we shy away from it?
- Hand-crafted SQL statements. They are easy to deploy but difficult to maintain and re-deploy. The moment we have to change the data model, this becomes painfully obvious.
- Proliferation of one-off operations, in cut-and-paste form, increasing inefficiency and complexity of the overall implementation
- Lack of overarching control - stored procs run and go "black" for an hour or more without any clue as to their status
- They don't scale in a traditional (SMP-based) RDBMS - the larger the data gets, the more the implementation will suffer
- ETL tools don't support them in the quantity and logistical quality we really need, but they are getting there.
Why not use an ETL Tool? I mean, they handle push-down SQL-generation right?
I can perhaps summarize this situation in a single conversation I had with a ETL tool vendor who was hawking the capability for his own tool, and after showing me the mechanics of making one of these little jewels operate with aplomb, I asked him, "So what about doing a few hundred of these in a sequence, or branching them into multiple sequences?" The vendor rep looked almost hurt. "Why would you want to do that?"
Well, if we're really talking about migrating transforms into the machine, this can grow into hundreds of operations rather quickly. This situation apparently overwhelms the logistical capability of even the most powerful ETL tool. But I have hope that they will solve this situation. Eventually.
I am not holding out hope that it will happen soon, or voluntarily. These tool vendors have invested millions in the performance boosting of their own products and will not likely toss this investment on the flames of the appliance movement, even if said flames are the exhaust flames of the appliance's rocket engines. This is why I say "eventually". They don't really have a marketable reason to embrace this approach.
Another problem of the ETL tool is that it is so divorced from the appliance's infrastructure that it cannot control the cross-environment logistics. This is especially true of the "virtual transaction" - that is - multiple flows arriving in multiple tables that are each in context of one another, yet are individually shared-nothing operations. If one of the flows should fail, how do we back out of this? Can we do a rollback of the tables where their flows succeeded? No we cannot. We could certainly implement an approach, but this is neither inherent nor intrinsic to the ETL tool. We need a shared-nothing virtual transaction that will control all of the flow in a common context, commit them in that context and faithfully rollback the results in that same context. ETL tools don't go there. Unless we implement the tool that way. As an application. Once implemented, how do we reuse this for the next application, and the next? We can see that it's not part of the tool itself.
If next-generation "ELT" scenarios are to be successful, they need several very important capabilities that are simply non-optional and non-trivial:
- Catalog-aware - The entire application must be coupled to the catalog definitons so that changes in definitions will automatically drive the resolution of their impact. We don't want to go on a submarine hunt to find every place where a particular column is used. The tool should already know.
- Logistical control - the ELT management needs to be able to support hundreds of transforms with arbitrary branches and synchronization
- SQL-generation - from templates that are tied to the catalog - If the table changes, we can propagate the changed automatically into the transforms.
- Proper handling of shared-nothing operations into the target, including virtual transactions that we can replicate to other databases, or rollback completely.
- Treat internal appliance databases as "waypoints" rather than "endpoints". That is, multiple databases are in play and used for various roles, such as intake, intermediate workspace, and persistence. These resources should be configurable and relatively transparent to the tool.
- Support of major patterns such as CDC, Dedupe, column-level scrubs and transforms, masking, column remapping, soft-delete, slowly-changing-dimensions and of course, analytic functions etc.
I am sure the visitors of this blog have even more aspects of a "wish list" that they have implemented (perhaps painfully so) and want more control over the data, its processing logistics and error control and recovery. Feel free to add your own comments and suggestions here.
Modified on by DavidBirmingham
I hear a lot of feedback on the use of CDC to put data into a PureData for Analytics, Powered By Netezza Technology device. In the other machines (traditional database engines) the data flies into the box, the CDC is on-and-off the machine in seconds. But in my Netezza machine, the CDC seems to grind. I have it running every fifteen minutes, they say, and the prior CDC instance is still running when the next instance kicks off. This is totally unacceptable. Maybe we shouldn't be using CDC for this?
Or maybe they just don't have it configured correctly?
There are two major PDA principles in play here. One is strategic and the other is tactical. Many people can look at the tactical principle and accept it because it is testable, repeatable and measurable. The strategic one however, they will hold their judgment on because it does not fit their paradigm of what a database should do. I'll save the strategic one for last, because its implications are further reaching.
The CDC operation will accumulate records into a cache and then apply these at the designated time interval. This micro-batch scenario fits Netezza databases well. The secondary part of this is that the actual operation will include a delete/insert combination to cover all deletes, updates and inserts. So when the operation is complete, the contents of the Netezza table will be identical to the contents of the source table at that point in time (even though we expect some latency, that's okay).
The critical piece is this: An update operation on a Netezza table is under-the-covers a full-record-delete and full-record-insert. It does not update the record in place. A delete operation is just a subset of this combination. This is why the CDC's delete/insert combination is able to perfectly handle all deletes, updates and inserts. The missing understanding however, is the distribution key.
If we have a body of records that we need to perform a delete operation with against another, larger table, and the larger table is distributed on RANDOM, think about what the delete operation must do in a mechanical sense. It must take every record in the body of incoming records and ship it to every SPU so that the data is visible to all dataslices. It must do this because the data is random and it cannot know where to find a given record to apply the operation - it could literally be anywhere and if the record is not unique, could exist on every dataslice. It's random after all. This causes a delete operation (and by corollary an update operation) to grind as it attempts to locate its targets.
Contrast this to a table that is distributed on a key, and we actually use the key in the delete operation (such as a primary key). The incoming body of records is divided by key, and only that key's worth of data is shipped to the dataslice sharing that key - the operation is lightning-fast. This is why we say - never, ever perform a delete or update on a random table, or on a table that doesn't share the distribution key of the data we intend to affect. Deletes and Updates must be configured to co-locate, or they will grind.
Now back to the CDC operation. Whenever I hear that the CDC operation is grinding, my first question is: Do you have the target Netezza tables distributed on the same primary keys of the source table? The answer is invariably no (we will discover why in a moment). So then I ask them, what would it take to get the tables distributed on the primary key? How much effort would it be? And they invariably answer, well, not much, but it would break our solution.
Why is that?
Because they are reporting from the same tables that the CDC is affecting. And when reporting from these tables, the distribution key has to face the way the reporting users will use the tables, not the way CDC is using the tables. This conversation often closes with a "thank you very much" because now they understand the problem and see it as a shortcoming of Netezza or CDC, but not a shortcoming of how they have implemented the solution.
Which brings us to the strategic principle: There is no such thing as a general purpose database in Netezza.
What are we witnessing here? The CDC is writing to tables that should be configured and optimized for its use. They are not so, because the reporting users want them configured and optimized for their own use. They are using the same database for two purposes because they are steeped in the "normalization" protocol prevalent in general-purpose systems - that the databases should be multi-use or general-purpose.
But is this really true in the traditional databases? If we were using Oracle, DB2, SQLServer - to get better performance out of the data model wouldn't we reconfigure it into a star schema and aggressively index the most active tables? This moves away from the transactional flavor of the original tables to a strongly purpose-built flavor.
Why is it that we think this model is to be ignored when moving to Netezza? Oddly, Netezza is a Data Warehouse Appliance - it was designed to circunscribe and simplify the most prevalent practices of data warehousing - not the least of which - is the principle that there is no such thing as a general-purpose database. In a traditional engine we would never attempt to use transactional tables for reporting - they are too slow for set-based operations and deep-analytics. Yet over here in the Netezza machine, somehow this principle is either set-aside or ignored - or perhaps the solution implementors are unaware of it - and so these seemingly inexplicable grinding mysteries arise and people scratch their heads and wonder what's wrong with the machine.
And again, they never wonder what's wrong with their solution.
If we take a step back, what we will see are reports that leverage the CDC-based tables, but we will see a common theme, which I will cover in a part-2 of this article. The theme is one of "re integration" versus "pre-integration". That is, integration-on-demand rather than data that is pre-configured and pre-formulated into consumption-ready formats. What is a symptom of this? How about a proliferation of views that snap-together five or more tables with a prevalence of left-outer-joins? Or a prevalence of nested views (five, ten, fifteen levels deep) that attempt to reconfigure data on-demand (rather than pre-configure data for an integrate-once-consume-many approach?) Think also about the type of solution that performs real-time fetches from source systems, integrates the data on-the-fly and presents it to the user - this is another type of integration-on-demand that can radically debilitate source systems as they are hit-and-re-hit for the very same set-based extracts dozens or hundreds of times in a day.
I'll take a deep-dive on integration-on-demand in the next installment, but for now think about what our CDC-based solution has enticed us to do: We have now reconfigured the tables with a new distribution key that helps the reports run faster, but because this deviates from the primary-key design of the source tables (which CDC operates against) then the CDC operation will grind. And when it grinds, it will consume precious resources like the inter-SPU network fabric. The grinding isn't just a duration issue - it's inappropriately using resources that would otherwise be available to the reporting users.
What's missing here is a simple step after the CDC completes. Its a really simple step. It will cause the average "purist" data modeler and DBA to retch their lunch when they hear of it. It will cause the admins of "traditional" engines to look askance at the Netezza machine and wonder what they could have been thinking when they purchased it. But the ultimate users of the system, when they see the subsecond response of the reports and way their queries return in lightning fashion compared to the tens-of-minutes, or even hours - of the prior solution, these same DBAs, admins and modelers will want to embrace the mystery.
The mystery here is "scale". When dealing with tables that have tens of billions, or hundreds of billions of records, the standard purist protocols that rigorously and faithfully protect capacity in the traditionl engines - actually sacrifice capacity and performance in the Netezza engine. It's not that we want to set aside those protocols. We just want to adapt them for purposes of scale.
The "next step" we have to take is to formulate data structures that align with how the reporting users intend to query the data, then use the CDC data to fill them. It's not that the CDC product can do this for you. It gets the data to the box. This "next step" in the process is simply forwarding the CDC data to these newly formulated tables. When this happens, the pre-integration and pre-calculation opportunities are upon us, and we can use them to reduce the total workload of the on-demand query by including the pre-integration and pre-calculation into the new target tables. These tables are then consumption-ready, have far fewer joins (and the need for left-outer joins often fall by the way-side). After all, why perform the left-outer operations on-demand if we can perform them once, use Netezza's massively parallel power for it, and then when the users ask a question, the data they want is pre-formulated rather than re-formulated on demand.
This necessarily means we need to regard our databases in terms of "roles". Each role has a purpose - and we deliberately embrace the notion of purpose-built schemas, and deliver our solution from the enslavement of a general-purpose model. The CDC-facing tables with support CDC - we won't report from them. The reporting tables face the user - we won't CDC to them.
Keep in mind that this problem (of CDC to Netezza) can rear its head with other approaches also - such as streaming data with a replicator or ETL tool to simulate the same effect of CDC. Either way, the data arrives in the Netezza machine looking a lot like the source structures and aren't consumption-ready.
I worked with a group some years ago with a CDC-like solution, and they took the "next step" to heart, formulated a set of target tables that were separate from staging and then used an ETL tool to fill them. The protocol was simply this: The ETL tool sources the data and fills the staging tables, then the ETL sources the staging tables and fills the target tables. This provided the necessary separation, so functionally fulfilled the mission. The problem with the solution however, was that for the transformation leg, the ETL tool was yanking the data from the machine into the ETL tool's domain, reformulating it and then pushing it back onto the machine. The data actually met itself coming-and-going over the network. A fully parallelized table was being serialized, crunched and then re-paralellized into the machine. As the data grew, this operation became slower and slower. That's what we would expect right? The bottleneck now is the ETL tool. The proper way to do this, if an ETL tool must be involved, is to leverage it to send SQL statements to the machine, keep the data inside the box. The Netezza architecture can process data internally far faster than any ETL tool could ever hope - so why take it out and incur the additional penalty of network transportation?
The ETL tool aficionados will balk at such a suggestion because it is such a strong deviation from their paradigm. But this is why Netezza is a change-agent. It requires things that traditional engines do not because it solves problems in scales that traditional engines cannot. In fact, performing such transformations inside a traditional engine would be a very bad idea. The ETL tools are all configured and optimized to handle transformation one way - outside the box. This is because it is a general-purpose tool and works well with general-purpose engines. There is a theme here: the phrase "general purpose" has limited viability inside the Netezza machine. If we embrace this reality with a full head of steam, the Netezza technology can provide all of our users with a breathtaking experience and we will have a scalable and extensible back-end solution.
Modified on by DavidBirmingham
Sometimes the average Netezza user gets a bit tripped-up on how an MPP works and how co-located joining operates. They see the "distribute on" phrase and immediately translate "partition" or "index" when Netezza has neither. In fact, those concepts and practices don't even have an equivalent in Netezza. This confusion is simply borne on the notion that Netezza-is-like-other-databases-so-fill-in-the-blank. And this mistake won't lead to functional problems. They will still get the right answer, and get it pretty fast. But it could be soooo much faster.
As an example, we might have a traditional star-schema for our reporting users. We might have a fact table that records customer transactions, along with dimensions of a customer table, a vendor table, a product table etc. If we look at the size of the tables, we find that the product and vendor tables are relatively small compared to the customer, and the fact table dwarfs them all. A typical default would be to distribute each of these tables on their own integer ID, such as customer_id, vendor_id etc. and then putting in a transaction fact record id (transaction_id) that is separate from the others, even though the transaction record contains the ID fields from the other tables.
Then the users will attempt to join the customer and the transaction fact using the customer_id. Functionally this will deliver the correct answer but let's take a look under-the-covers what the performance characteristics will be. As a note, the machine is filled with SBlades, each containing 8 CPUs. For example, if we have a TwinFin-12, this is 12 SBlades with 8 CPUs, or 96 CPUs. They are interconnected with a proprietary, high-speed Ethernet configured to optimize inter-CPU cross-talk.
Also whenever we put a table into the machine, it logically exists in one place, the catalog, but physically exists on disks assigned to the CPUs. A simplistic explanation would be that if we have 100 CPU/disk combinations and load 100,000 rows to a table that is distributed on "random", each of the disks would receive exactly 1000 records. When we query the table, the same query is sent to all 100 CPUs and they only operate on their local portion of the data. So in essence, every table is co-located with every other table on the machine. This does not mean however, that they will act in co-location on the CPU. The way we get them to act in co-location (that is, joining them local to the CPU) is to distribute them on the same key.
But because our noted tables are not distributed on the same key, they cannot co-locate the join. This means that the requested data from the customer table will be shipped to the fact table. What does this look like? Because the customer table has no connection to the transaction_id, the machine must ship all customer records to all blades (redistribution) so that the CPUs there can attempt to join on the body of the customer table. We can see how inefficient this is. This is not a drawback of the Netezza machine. It is a misapplication of the machine's capabilities.
Symptoms: One query might run "fine". But two of them run slow. Several of them even slower. Results are inconsistent when other activities are running on the machine. We can see why this is the case, because the processing is competing for the fabric. Why is this important to understand? The inter-CPU fabric is a fairly finite resource and if we allow data to fly over it in an inefficient manner, it will quickly saturate the fabric. All the queries start fighting over it.
Taking a step back, let's try something else. We distribute the transaction_fact on the customer_id, not the transaction_id. Keep in mind that the transaction_id only exists on the transaction table so using it for distribution will never engage co-location. Once we have both tables distributed on the customer_id, let's look at the results now:
When the query initiates, the host will recognize that the data is co-located and the data will start to join without ever leaving the CPU where the two table portions are co-located. The join result is all that rises from the CPU, and no data is shipped around the machine to affect the answer. This is the most efficient and scalable way to deal with big-data in the box.
Now another question arises: If the vendor and product dimensions are not co-located with the transaction_fact, how then will we avoid this redistribution of data? The answer is simple: they are small tables so their impact is negligible. Keep in mind that we want to co-locate the big-ticket-or-most-active tables. I say that because we have sites that are similar in nature where the customer is as large as two of the other dimensions, but is not the most active dimension. We want to center our performance model on the most-active datasets.
This effect can rear its head in counter-intuitive ways. Take for example the two tables - fact_order_header and fact_order_detail. These two tables are both quite monstrous even though the detail table is somewhat larger. Fact_order_header is distributed on the order_header_id and the fact_order_detail is distributed on the order_detail_id. The fact_order_detail also contains the order_header_id, however.
In the above examples, the order header was being joined to the detail, along with a number of other keys. This achieved the correct functional answer, but because they were not using the same distribution key, the join was not co-located. So we suggested putting the order_detail table on the same distribution as the order-header (order_header_id). Since the tables were already being joined on this column, this was a perfect fit. The join received an instant boost and was scalable, no longer saturating the inter-CPU fabric.
The problem was in how the data architects thought about the distribution keys. They were using key-based thinking (like primary and foreign keys) and not MPP-based thinking. In key-based thinking, functionality flows from parent-to-child, but in MPP-based thinking, there is no overriding functional flow of keys - it's all about physics. This is not to say that "function doesn't matter" but we cannot put together the tables on a highly physical machine and expect it to behave at highest performance unless we regard the physics and protect the physics as an asset. Addressing the functionality alone might provide the right functional answer, but not the most scalable performance.
After completing another migration from a traditional, general-purpose RDBMS to the Netezza technology, I visited a friend who had several artifacts in his home that had to be the strangest things I'd ever seen.
Now I'd heard of genetically altered cat fur, you know, buying a cat online while picking the fur color of your choice (blue, lavender, teal - etc). Seems like an odd thing to do to a cat. I like cats and dogs both, so don't imagine that this blog essay attempts to take sides. Some folks are downright serious on their choice of pet, so I'll smooth that fur wherever I can.
Back to the hairless cats. I asked him "Where did you find these? And what happened to your other cats?" And he laughed, "That's a funny story. These are my cats, but I had to shave them." To this, I rolled my eyes, wondering where this was about to lead. He told the tale:
"We took a vacation down south and put the cats in kennels in the back of the truck. The round-trip took a toll on their fur and matted up everything from head to toe. Weeks later, the cat fur had not smoothed out. The kids had been brushing it out but it wasn't working. And if you think they look ridiculous shaved, you have no idea how silly they looked with their hair matted."
"So you shaved them?" I asked.
"Sure" he said, "Seems practical right? Just get rid of the matted hair altogether. Teasing it apart would have taken, well, years of time. Their fur will grow back out soon enough"
"Aren't there, you know, shampoos and stuff for that? I mean, shaving seems a little extreme."
"Tried all that. Bad thing about it, mats are bad for cats - they cause infections and all kinds of nasty side effects. Best to just shave it all and be done with it."
Laughing on the inside, I thought a bit about how we have to decompose and de-engineer an organically-grown data warehouse. Some would suggest porting (forklifting) the whole thing over "as is". Like taking a matted-hair cat and moving them from one house to another. It changes the venue for the cat, but doesn't help the cat at all. It's still sick and getting sicker from the mats. Such folks "tell a tale" of the success of their migration derring-do. But they are like nomads. Hunting the game until there's no more, then pulling up stakes to find another place to burn out. Forklift-migrations have value only to the ones who are doing the migrating, not the recipients. No sooner will they tie a bow on it than someone will request a change, and we will discover what we already knew: The original data model (now the new data model) isn't very resilient to change no matter where it is hosted.
We realize we have ported both the good and the bad from the old system, when we had the opportunity to port the good and leave the bad behind. We essentially are agreeing that we are about to standardize on the past and then accommodate the future, rather than a better approach: standardize on the future and accommodate the past.
Many years ago, at one site we had to carefully tease-apart the data and the stored procedured to find out what they were actually doing. Unfortunately we had carved up the work for several teams rather than reviewing it together. Had we done this, we would have discovered that the stored procs executed in chains of work, and that many of the chains were copy-pasted from one original chain that was too "matted" to risk breaking. So they copied-and-modified this chain to perform the new functionality. Enough of these and we see how the stored procedure doesn't benefit us (at all) for back-end data processing. In fact, we strongly suggest people use stored procs in Netezza for BI-adaptation and optimization, for the presentation layer. But not for the back end. Stored procs are not operationally viable for a wide range of reasons. It's even funny how folks move from one technology to another and try to replicate the stored procedure logic as a knee-jerk exercise, without realizing how flawed it really is. Perhaps the Netezza stored proc will run a lot faster. Trust me, performance is the least of your worries.
So once we converged the teams together, these themes started popping out like rabbits. By the end of the first day we are all laughing at the sheer level of redundancy in the back end. But not particularly surprised at the outcome. We'd seen it in lots of places before, but not so bad.
Of course, it never once occurred to us that we would port these hundreds of stored procs over to the new system. Rather we would functionally specify what they are doing now, and leverage tools to accommodate the vast majority of the functionality, only building what was left over. I mean, this is a standard functional port, why complicate things? Forklifting into a Netezza machine will certainly yield 10x performance, so why the beef? Without optimizing the data structures and processes to leverage Netezza's power, we might get 10x but leavel 100x on the table. Is this a good tradeoff?
Well, true to form, someone had the capacity to complicate things. He whipped out a spreadsheet and calculated the cost of the hundreds-of-stored-procs in the original system, not realizing we were planning to reduced these to maybe fifteen operations at most. Spreadsheet calculator in-hand, he estimated that it would take 24 people, 8 months, to handle on these stored procs. I sat back in my seat, stunned, because he was costing a project we weren't about to undertake. Rather, building out 15 or so operations would require a handful of people and 90 days at the outside. But also true to form, the project principals saw visions of sugar plums (another word for sales-comp) that got in the way of their better judgment. They actually went to the client with these inflated numbers, he rejected their proposal outright and gave the business to someone else. It's easy to lose a deal when the client sees the inflation on-the-page.
But what our "spreadsheet guy" missed, was that we weren't about to embark on a journey of finding a home for each stored proc (we already knew this had no value, and the client knew it too). He believed that we intended to bring the matted-cats into the house and put them on pillows, when we intended to pick the cats we wanted to keep, and shave them.
Okay, that's a strange analogy, but we had no intention whatsover of accepting all that convoluted spaghetti as the foundation for the go-forward system.
Netezza gives us the capacity - to simplify. We keep the parts we consider valuable (the cat) and get rid of all the mess that keeps the cat sick and unhappy. Taking only the functions we want, we then reconstruct (let the hair grow back out) only what we want to keep, and take the opportunity to apply some solid architectural principles and likewise capitalize on the strengths of the Netezza platform.
In the end, if we really have a platform that is standardized on the future, but accommodates the past, we also have something else that is even more powerful: A simpler, stronger engine that is ready to grow in functionality, adapting to our changing needs. The old system was never built with this kind of vision or priority, because the power wasn't there to affect it anyhow.
When introducing the Netezza platform to a new environment, or even trying to leverage existing technologies to support it, very often the infrastructure admins will have a lot of questions, especially concerning backups and disaster recovery. Not the least of which are "how much" "how often" and such like. More often than not, every one our responses will be met with a common pattern, a sentence starting with the same two words:
Case in point, when we had a casual conversation with some overseers of backup technology as a precursor to "the big meeting" we almost - quite accidentally - shut down the conversation entirely. Just the mention of "billions" of rows, or speaking of the database in "Carl Sagan" scaled terms, caused them to want to scramble for budget and market surveys of technologies that were more scalable than their paltry nightly tape-backup routines. In this particular conversation, we were talking about nightly backups that were larger than the monthly backups of all of their other systems combined. Clearly we were about to pop the seams of their systems and they wanted a little runway to head off the problem in the right way.
But what is the "right way" to perform a backup where one of the tables is over 20 terabytes in size, the entire database is over 40 terabytes in size and the backup systems require two or even three weeks to extract and store this information for just one backup cycle? Quipped one admin "It takes so long to back up the system that we need to start the cycle again before the first cycle is even partially done." and another "Forget the backup. What about the restore? Will it take us three weeks to restore the data too? This seems unreasonable."
Yes it does seem unreasonable precisely because it is - quite unreasonable. As many of you may have already discovered, the Netezza platform is a change-agent, and will either transform or crush the environment where it is installed, so voracious are its processing needs and so mighty is its power to store mind-numbing quantities of information.
The aforementioned admins simply plugged their backup system into the Netezza machine, closed their eyes and flipped the switch, then helplessly watched the malaise unfold. It doesn't have to be so painful. These are very-large-scale systems that we are attempting to interface with smaller-scale systems. We might think that the backup system is the largest scaled system in our entire enclosure, but put a Netezza machine next to it and watch it scream like a little girl.
So here's the deal: No environment of this size shoiuld be handled in a manner that is logistically unworkable for the infrastructure hosting it. We can say all day that these lower-scaled technologies should work better or that Netezza should pony-up some stuff to bridge the difference, but we all know that it's not that easy. Netezza has simplified a lot of things, but simplification of things outside the Netezza machine - aren't we asking a bit much of one vendor?
To avoid pain and injury, think about the things that we need to accomplish that are daunting us, and solve the problem. The problem is not in the technology but in the largesse of the information. We would have the same problem on our home computers if we had a terabyte of data to backup onto a common 50-gig tape drive. We would need twenty tapes to store the data. The backup/restore technology works perfectly fine and reasonably well for a variety of large-scale purposes. We simply need to be creative about adapting it to the Netezza machine. Don't plug it in and hope for the best. Don't do monolithic backups. The data did not arrive in the machine in a monolithic manner so why are we trying to preserve it that way? Leave large-scale storage and retrieval to the Netezza machine and don't crush the supporting technologies with a mission they were never designed for.
Several equally viable schools of thought are in play here. What we are looking for is the most reliable one. Which one will instill the highest confidence with least complexity? The more complex a backup/restore solution becomes, the less operational confidence we have in it. If it cannot backup and restore in a reasonable time frame, we exist in a rather anxious frontier, wondering when the time will come that the restoration may be required and we put our faith in the notion that it either won't, or when it does all of the other collateral operational ssues will eclipse the importance of the restoration. In other words. future circumstance will get us off the hook. There is a better way, like a deterministic and testable means to truly backup and restore the system with high reliability and confidence.
On deck is the simplest form of the solution - another Netezza machine. Many of you already have a Disaster-Recovery machine in play. Trust me when I tell you that this should be fleshed out as a fully functional capability (discussed next) and then the need for a commodity backup/restore technology evaporates. Using another Netezza machine, especially when leveraging the Netezza-compressed information form, allows us to replicate terabytes of information in a matter of minutes. I don't have to point out that none of our secondary technologies can compete with this.
A second strategy requires a bit more thought, but it actually does leverage our current backup/restore technology in a manner that won't choke it. It won't change the fact that the restoration, while reliable, may be slow simply because moving many terabytes in and out of one of these secondary environments is inherently slow already.
A third strategy is a hybrid of the second, in that enormous SAN/NAS resources are deployed as the active storage mechanism for the data that is not on (or no longer on) the Netezza machine. This can be a very expensive proposition on its own. We know of sites that keep the data on SAN in load-ready form, and then load data on-demand to the Netezza machine, query the just-loaded data, return the result to the user and then drop the table. You may not have on-demand needs of this scale, but this shows that Netezza is ready to scale into it.
A fourth strategy is a hybrid of the first, that is we still use a Netezza machine to back up our other Netezza machine, but we use the more cost-effective Netezza High-Capacity server, which is less expensive than the common TwinFin (fewer CPUS, more disk drives) but otherwise behaves in every way identically to its more high-powered brethren. And honestly if we were to put apples-to-apples in a comparison between the cost of a big SAN plant to store these archives versus the High Capacity Server, the server wins hands-down. It's cheaper, simpler to operate, doesn't require any special adaptation and we can replicate data in terms guided by catalog metadata rather than adapting one technology to another.
So let's take these from the least viable to the most viable and compare them in context and contrast, and let the computer chips fall where they may.
Commodity backup/restore technology
If we want to leverage this, we need to understand that it cannot be used to perform monolothic operations. These are unmanageable for a lot of reasons:
To mitigate the above, we need to adapt the large-scale database to the backup technology by decoupling and downsizing the operation into manageable chunks. This is a direct application of the themes surrounding protracted data movement in any environment. The larger the data set, the more the need for checkpointed operations so that the overall event is an aggregation of smaller, manageable events. If any single event fails, it is independently restartable without having to start over. Case in point, if I have 100 steps to complete a task and they are all dependent upon one another, and the series should die on step 71, I still have 29 steps remaining that may have completed without incident, but I cannot run them without first completing step 71. This is what a monolithic backup buys us - an all-or-nothing dependency that is not manageable and I would argue is entirely artificial.
To continue this analogy, lets say that any one of these 100 steps only takes one minute. In the above, I am still 30 minutes away from completion. I arrive at 6am to find that step 71 died, and now I have to restart from step 1 and it will cost me another 100 minutes. Even if I could restart at step 71, I am still 30 minutes away from completion.
Contrast the above to a checkpointed, independent model. If we have 100 independent steps and the step 71 should die, the remaining 29 steps will still continue. We arrive at 6am to find that only one of the 100 steps died and we are only 1 minute away from full completion. The difference in the two models is very dramatic:
Monolithic means we are operationally reactive when failure occurs. The clock is ticking and we have to get things back on track and keep moving. Checkpointed means we are administratively responsive when failure occurs. We don't have to scramble to keep things going. In fact, in the above example, if step 71 should fail and the operator is notified, doesn't the operator have at least half an hour to initiate and close step 71 independently of the remaining 29 steps? Operators need breathing room, not an anxious existence.
Monolithic methods are supported de-facto by the backup/restore technology. If we want to perform a checkpointed operation, we have to adapt the backup/restore process to the physical or logical content of the information. We don't want to directly mate the backup technology itself, so we need to adapt it.
Logical Content Boundaries
This means we have to define logical content boundaries in the data. What's that? You don't have any logical content boundaries, not even for operational purposes? Well, per my constant harping on enhancing our solution data structures with operational content, such as job/audit ID and other quantities, perhaps we need to take a step back and underscore the value of these things because they exist for a variety of reasons. One of them is now upon us - the operatonal support of content-bounded backups. It is required for scalability and adaptability and is not particularly hard to apply or maintain.
A more important aspect of content boundary is the ability to identify old versus new data. If the data is carved out in manageable chunks, some will always be least-recent and some more-recent. Invariably the least-recent chunks will be identical in content no matter how many times we extract them for backup. This means we can extract/backup these only once and then focus our attention on the most active data. In a monolithic model, there is no distinction between least or most active, least or most recent. In large-scale databases, the least-recent data is the majority, so the monolithic backup is painfully redundant when it need not be.
Do we absolutely need content-bounded backups for all of our tables to function correctly? Of course not. But by applying this as a universal theme it allows us to treat all tables as part of a backup framework where all of them behave the same way. So part of this is in the capture of the data but a larger part is the operational consistency of the solution.
Many reference tables such as lookups will never grow larger and we know this. In fact, their data may remain static for many years. For the ones that are tied to the application and grow or change every day, these we will call solution tables. They are typically fed by an upstream source and are modified on a regular basis. Any of these tables can grow out of control. The reference data then represents a very low operational risk. Why then would we not simply fold the reference data into the larger body and treat all the tables the same? There is no operational penalty for it, but enormous benefit from being able to treat all tables the same way inside a common solution.
At this point, the backup/restore framework will address all of the tables the same way, but now we have the ability to leverage rules and conditions within the framework so that special handling is available if necessary. This is a common theme in large-scale processing: Handle everything as though it will grow, but accommodate exceptions with configuration rules. I'll forego this aspect for now and let' take a look at what we need in basic terms:
- An intermedaite landing zone: Some off-machine storage location that can hold the data while in transit - e.g a NAS or SAN volume. This will not be the same size as the database but some fraction of it. It is a workspace for intermediate files, not permanant storage
- Content boundaries: A means to define and manage content bthese are tags or quantities in the data itself. They need to be consistent for all tables so that the backups and restore operate the same way. Of course, if we don't have these, we need to apply them.
- A restoration database: a separate database that will be used for the restoration process, because we won't be restoring the data directly into the table from whence it came. Why not? There is an increasing likelihood that the data structures have changed since the last backup. If the structure has changed, we cannot reload the data into it. We need a way to shepherd the data back into the machine.
- Processes: that capture / backup / restore the data using content boundaries and common utilities. These are typically control scripts that are easily configured, deployed and maintained
- Archival Database (next): Physical content replicas of the masters that hold all of the original information in a form useful for backup and restore. Whether they keep the data active/online is immaterial. Their role is to interface with the backup/restore mechanisms so that any data once made offline can easily be made online through this database.
Setup formal archival databases. Whether these reside on the same server, or on a DR / High Capacity server (below) is immaterial. The point is that the data in the master tables will be actively rolled into these tables, which will form the backbone for all backup and restoration operations. We therefore replicate the masters (with streaming replication) into the archival stores and then at some interval perform the backup of the archives, not the master.
Foreshorten the Master Tables
Now that we have a means to define content boundaries - you did apply those boundaries right? We can now look at the database holistically for optimizations based on active data.
At one site we have a table with ninety billion records in three different fact tables spanning over ten years of information, However, the end-users and principals claimed (and we verified) that the most active data on any given day is the most recent six months. Anything prior to that, they would query perhaps once or twice a year for investigative purposes, but had not tied any reports to it.
So now we have an opportunity to get agreement from the users to shorten the master tables. This is especially necessary for those fact tables. In the end, these fact tables (above) were shortened to less than four billion rows each and these are kept trimmed on a regular basis. The original long-term data is held in another archival database that is the foundation for the backups and restores.
On a per-table basis, use the content boundaries to manufacture/compile a set of checkpointed instructions for each table to support extraction and restoration
Execute the extractions to create flat files
Each extraction will pull data to a Netezza-compressed flat file and also supply another file with the metadata instructions necessary to restore the file's contents. On a per-file basis, these instructions will include:
Original table definition, used to construct a template of the table in the restoration database
File load instructions used to get the content back into the machine
Transformation command, used to move data from the restoration database into its original home
- The files will be extracted (using content boundaries) onto the storage landing zone
- The smaller extractions can be performed in parallel, are independent and form their own checkpointed process
- The backup techology will begin its backup process once the second (metadata companion) file arrives for the set. The metadata companion file behaves as a trigger.
- Once the backup for the file-set is complete, the backup/restore deletes the file and its companion
- This decouples the backup/restore processing from the database (completely) so that it can focus solely on file-based backup and restore
- For restoration, both the content file(s) and companion metadata file(s) will be retrieved and placed on the storage landing zone
- The metadata will be used to reconstruct an empty intermediate version of the original table in the Restoration Database
- Load the table using nzload
- Then the metadata will be used (even a preformulated SQL statement) to copy the data into the original target
- Throw away the intermediate load table, delete the data file and its companion metadata file
- The restoration process can be run as many ways-parallel as supported by nzload
- The restoration process can also be surgical, with the more agile ability to restore data in smaller segments
The above protocol, while likely easy to pull off an administer, still has a number of moving parts that the Netezza based-equivalent (later) will not have.
The only difference between the above protocol and one using a SAN-based storage mechanism, is the absence of a formal backup/restore technology. Rather the SAN is the long-term storage location and we perform the incremental extractions onto it. Rather than delete the files, we keep them.
This has significant implications for the cost of the SAN. After all, if we intend to interface this to the Netezza machine, we would not want common NAS storage because it is too slow and the vendors actively disclaim their technology from being viable for data warehousing. The primary reason is that the network and CPUs are set up for load-balancing, like a transactional database, but not bulk onload/offload of the data.
Not only will we need enough SAN to backup the environment, but to also carry fathers and grandfathers if need be (this is a policy decision). With checkpointed extracts, the father/grandfather issue is largely moot. This is because once a checkpointed extract of older data is pulled and stored, it won't be changing and capturing another one just like it has no value.
In this approach we leverage another Netezza machine like a DR server, as our backup, archival and restore foundation. It can easily hold the information quantity. The difference here is in price, of course, since a fully-functional TwinFin is more expensive than most common SAN installations. However, the High Capacity Server (below) mitigates this pricing problem while delivering a consistent data experience.
One primary benefit of performing backups into the DR server is that it can automatically serve the role of a hot-swap server in case of failure in the primary server.
For this scenario to work however, we would want streaming replication between the active databases and the DR server so that the data is being reconciled while being processed. This allows us to have a fully functional hot-swap if the primary crashes, and we can continue uninterrupted while the primary is serviced. Word to the wise on this kind of scenario however, bringing back the primary means that it is out of sync, since the secondary took over for a span of time. So we would need to be able to reverse the streaming replication to make it whole.
Scenarios like this often embrace the practice of operationally swapping the two machines on defined boundaries, like once a month or once a quarter, where they actually switch roles each time. This allows the operations staff to gain confidence in the two machines as redundant to each other in every way. I have seen cases like this where the primary machine went down, the secondary machine kicked in seamlessly and all was well. I have also seen cases where the principals kept the DR server up to date but when it came time to operationally switch, some important piece (usually in the infrastructure between the devices) was missing causing the failover itself to fail. It is best to have a plan in place, but it's better to have tested the plan and that it actually works.
On a per-table basis, use the content boundaries to manufacture/compile a set of checkpointed instructions for each table to support extraction and restoration
Using a variation of the nzmigrate utility, perform the table-level extractions
Extract data in Netezza-compressed form to a flat file
Load the flat file into a restoration database in the second machine
Perform the transform-execution to copy data from the restoration database into the target table
The extractions, restoratons and transformations can all leverage simple scripts and catalog metadata, not become dependent on deeply hand-crafted code
Intermediate (SAN) landing zone requires less space because the data is being transferred in Netezza-compressed format and it is cleaned-up on-the-fly
Transfer is quicker because Netezza-compressed data is written to the data-slices directly instead of coming through the host.
A word on Netezza-compressed transfer. I wrote about this in Netezza Transformation but it is important to highlight here. We performed an experiment moving half a terabyte scattered across a hundred or more tables. This data was moved from its original home to a database in another machine. The first method used simple SQL-extract into an nzLoad component. This process took over an hour. The second method used transient external tables with compression, coupled with an nzload in compressed mode. The entire transfer took less than six minutes. This was because the compressed form of the data was already 14x compressed.
In other experiment using over 20x compression for the data, we were able to transfer ten terabytes in less than an hour. This kind of data transfer speaks well for the streaming replication necessary for DR server operation (above) but underscores the fact that even when transferring between Netezza machines, it's as though we haven't left the machine at all.
Netezza-based High-Capacity Server
This option is simply a form of the Netezza-based hybrid (above) but on a dedicated server designed to support backup and recovery.
The better part about this server is that is has more disk drives and fewer CPUs, making it far more cost effective for storage than common SAN devices. Couple this with the minimal overhead required for transferring data between machines, and the ability to surgically control the content with the content-boundaries and catalog metadata, and we get the best of all worlds with this device.
Not only that, but it is also scalable to support storage of all other Netezza devices in the shop as well as any non-Netezza device where we simply want to capture structured information for archival purposes. The High Capacity server is queryable also, meaning that even the ad-hoc folks will find some value in keeping the data online and available.
Lastly, in Spring 2010, as part of the safe-harbor presentation one of the principals at IBM Netezza announced plans for a replication server. I can only imagine that this device will deliver us from any additional hiccups associated with streaming replication that we might now be doing in script or other utility control language.
At Brightlight, our data integration framework (nzDIF) has the nz_migrate techniques built directly into the flow substrate of the processing controller, as well as the enforcement and maintenance of the aforementioned content boundaries. We are actively acquiring and applying best, most scalable and simplified approaches as a solution framework firmly lashed to a purpose-built machine. I am a big proponent of encouraging Enzees to take on these things themselves, or at least let us coach you on how to make it happen. The solutions are simple because the Netezza platform itself is simplified in its operation. Stand on the shoulders of genius - the air is good up here.
Once again this entry will serve as an intro to the un-initiated. I get a lot of pings from folks who are new to the stuff and want to know their way around the machine from the beginning. Zone maps are bread-and-butter power tools of the Netezza technology. The companion capabilities are the materialized view and the cluster base tables (which are new to v 6.0).
About three years ago I sat in on a special presentation to our firm from one of Netezza's competitors. They put up some power-point slides to compare their technology to "the others". Their product was a data warehousing product and all the "others" were "OLTP" products. The usual suspects were in this list of course, but oddly so was Netezza. But Netezza is not an OLTP machine in any sense of the word. How odd that this company did not know enough about its primary rival to discern this simple truth. In addition, this vendor made certain odd claims about Netezza that were either architectural false-assumptions or just misinformation. One of which was this "In a Netezza machine, the more data you add to a table, the slower the table's response will be." This is of course completely false. Netezza tables are like any data storage with one major takeaway that should not be missed:
Netezza tables may behave inconsistently if the data is not properly configured and deployed. Notice I did not say "if the hardware is not properly ...." but the data itself. Netezza already has the hardware configuration under control and there's nothing we can do to affect it. We can however, align our data with the hardware architecture and will receive the benefit of a consistent experience no matter how much the data grows. It's hard to imagine this kind of consistency on any platform, so stay with me here.
A number of years ago I did some deep immersion in a health-care expert system for Value Health Sciences. The objective was to directly apply and enforce the codes found in the CPT code book used by physicians (those are the codes the doc will check on the sheet he hands to you before he leaves the office). Insurance companies have standard billing/invoicing arrangements with the physicians and these codes are the centerpiece. For VHS, their medical practitioners had set up rules for characterizing (and properly re-characterizing) a claim. For example, if a physician billed codes 1,2 and 4 but he was supposed to have billed these all as 1 and 5 (because code 5 bundles 2 and 4 to reduce cost) the system would have a rule in it that looked something like:
1,2,4 = 1, 5
The difficulty here of course, with some 100,000 codes and over 40,000 rules, is to organize them so that they are quickly accessible. By design, the first code in the rule (eg. code "1" above) was considered an "index" code. In other words, any and all rules that would use the code "1" as an anchor would also make "1" the first code. This lended itself to the opportunity to "vector" the rules. That is, create an in-memory array, indexed by the given code, and a list of rules would be handily available at that vector point. So that all we had to do was look at the claim (e.g. codes 1,2,4) fetch the various candidate rules from their vector lists ( the list for code 1 might have 3 rules, likewise for the others, reducing our total candidate rules from 40,000 to only 9 rules). We could then run the rules and dispatch the claim. When I had first received this model, the system was physically running through all 40,000 rules to find the right match. This vectored approach reduced the total rules, but also eliminated guesswork. And with all that, boosted the power and created a very consistent turnaround time for each patient claim.
The vector list above is a common way in programming to organize information so that - if we have something in our hands already - like a filter or a vector based on a natural key - we can quickly find where we need to start looking and eliminate the places where we are not interested. While this sounds a lot like a common index-structure approach, it's actually a process-of-elimination approach. The two have similar qualities, so I will delineate them here at a high level.
An index is a surgical approach to finding data by logging its record-based keys and a record locator into a separately maintained structure. By finding keys, we find physical records. This is important for transactional systems that need fast record-level turnaround. It is not ideal however, for analytics or data warehousing where the vast majority of queries don't care about individual records at all, but about answers that span records. This alternative approach of vectoring, is a way to bucket the information so that we're not attempting to specifically locate the record itself, but the bucket where the record may be located, that is, derive a group of candidates upon which we will take further action.
When we think about this, the candidate approach, as opposed to the surgical approach is the only way to scale anything that has to deal with multiple-record sets (set-based processing). Engines that are geared to examine one-record-at-a-time even if it is to aggregate or perform multi-record operations with the single-records it retrieves, cannot compete with an engine that is designed to defacto treat the records as multiples from the outset. Transactional versus bulk. The transactional engine is inherently slower and non-scalable compared to the bulk engine (for set-based operations).
I noted in a prior entry that Netezza's anti-indexing strategy heavily leverages "where not to look" because no matter how large a table gets, this "vectoring" or "candidate" approach will guarantee that our response time is very consistent. How does this apply to zone maps?
Recall the diagram with the CPU/disk combinations. Each of these disks is divided into its smallest actionable physical portion (an extent) and the extent then becomes the currency-of-the-realm when we're talking about searching for data. The machine automatically tracks the data that is added to an extent, and when we generate-statistics, every extent is accounted for in a zone map. The map has basic statistics on the columns in the table at the extent level, such as the high/low values for a column in the given extent.
Extent ColA ColB ColC
1 200 A B
1 300 A B
1 400 C D
1 500 E F
2 200 F G
2 400 H I
2 600 J K
3 100 L M
3 200 J O
Note the above layouts of the extents. If we were to look at the high/low values of each, we would find zone maps like this:
Extent ColA High ColA Low
1 500 200
2 600 200
3 200 100
So now if we ask a question to the machine based on say, a ColA value of 200, the machine will search all three extents. It has no choice because the zone map is telling it that pertinent data exists in all three.
Now let's physically sort the data by ColA
Extent ColA ColB ColC
1 100 L M
1 200 A B
1 200 F G
1 200 J O
2 300 A B
2 400 C D
2 400 H I
3 500 E F
3 600 J K
Now if we take a look at the zone map, we would see something like:
Extent ColA High ColA Low
1 200 100
2 400 300
3 600 500
So that now if we query the table for values of 200, the machine searches only one extent, not all of them. (Now, this is a simple example, because a given value could extend across many extents). Likewise a given disk has thousands of extents upon which to store the data, so it can spread the information out into much smaller, manageable chunks.
The point, however, is that by organizing the data this way, the zone maps reduce the total searchable extents. Keep in mind that we don't have to sort the data each time we want to use it. For many warehouses, the data naturally arrives in this contiguous form (for example, transaction date in a retail model). The Netezza machine will track all integers and dates automatically without us having to do anything additional to enable it. So we might take a penalty once to sort the data at the outset, then benefit from the data's natural arrival and collation.
This effectively creates the aforementioned vectored effect. It also guarantees a consistent experience because the data associated with the given value (say 200 above) isn't going anywhere. It will always be in the same location, tracked by the same zone map. Anytime we want to get back to the records associated with value 200, the zone map will serve them up. If the response time last week was 2 seconds, we could quadruple the size of the table and it will still be 2 seconds, because the data itself didn't go anywhere.
Contrast this to other heavy-lifting algorithms in software technologies. The Ab Initio platform, for example, will perform an in-memory Sort that will "intelligently" spill to disk when it runs out of memory. What this means is that it will take a predefined block of memory and attempt to perform all work inside it until it runs out. If so, it will manufacture 16 file locations within which to spill its overflow. These are organized with the same "vectored" approach as the sort keys would dictate. If these 16 files run out of space, they will further divide into 256 files to split the workload. Anyone working with the technology will tell you that this secondary split into 256 is to be avoided if at all possible because it negatively (and significantly) affects the performance of the sort. But we see the difference here. A general-purpose sort in software needs these 256 assets to functionally complete its mission, and adding more means less performance. Now compare this to Netezza which uses not just 256, but thousands of extents to divide the data out - and rather than experience the performance drag, will derive power and lift from this.
So let's apply something additional, especially for highly active tables. If we have a nightly batch window for processing, we likely have all the time we need to keep these zone maps healthy. But if we have an intraday model or even an aggressive micro-batch model, it is possible that the zone maps may be "violated". Note above that as long as the data arrives in contiguous chunks (like transaction date) the data will be set into the extents so that we minimize the total extents for a search. If a transaction date (for example) should appear outside of this contiguous range (say a record from last week's data has arrived) then it means the high/low range for that particular extent will be extended to include the errant record. It could mean that the given extent is also included in queries that don't related to either date, because the extent is simply being included because of its high-low range. That is, every query looking for this extent's primary transaction date, the errant transaction date, or any date in between them, will include this extent in the candidate list. While this is inconsequential in most cases, if it repeats over time it can slowly degrade the table's performance. Highly active micro-batch and intraday models have a tendency to see these effects if they remain unchecked or unguarded. But not to worry - the solution is simple.
What is daunting about this is that our highly-active-table might be a multi-billion-row beast that does not lend itself to re-sorting and in an intraday model, there's no time to re-sort the data anyhow.
A materialized view has the quality of being able to apply a virtual zone map to a table without physically re-sorting the information. All we have to do is pick the columns we want to zone map, and then form the materialized view with an order-by on those columns. Rebuilding one of these is usually very fast. In addition, we can be rebuilding it even while its prior version is active. At one site, we set up a process to rebuild the view thus:
Note that for the few moment that two materialized views are applied to two tables bears no risk at all. And if we drop the original materialized view, it will wait for all pending queries to stop - meaning that this won't break anything the users are doing (another necessity for highly active intraday models)
Important here is that the materialized views are not being used as surrogate tables, but as a means to virtualize the zone maps for faster response time on the table associated with the view. Another important aspect of the materialized view is that it can zone-map additional data types that the regular zone maps do not (e.g. varchars).
Let's say that we have a very active table and even re-creating the materialized view is too much uptake to maintain it. Not to worry, the materialized view has another quality that means we don't have to refresh it each time the table changes. The materialized view maintains its contents until the underpinning table changes. We then have the option to rebuild the view or use it "as is". This "as is" is usually agreeable for the short term, since when data is added to the underpinning table, the materialized view creates an extension to itself. Now whenever we query the table, it will examine the materialized view for default zone-map information but will also examine the extent. Considering that this extra extent is very small it is also insignificant in its overhead. So we can keep this in place for say, all day long and then clean up the materialized view in the evening. The materialized view's performance will not drag over the course of a day unless we do some serious re-churning of the underpinning data (which is not normal for common micro-batch models).
And now - Cluster-based tables
With all the above preliminary stuff out of the way, a discussion on CBTs is fairly straightforward. As of the 6.0 release, Netezza has provided this as a multi-dimensional capability for even more powerful searches. The data in the CBT is physically organized into extents based on keys. And when we perform a groom operation, the data is physically reshuffled so that the keys are physically co-located on the fewest extents possible. Like a self-enforcing zone map. only a lot more powerful. In a zone map approach, the primary way to get the data contiguous is to re-sort the data and the zone maps fall out for free. In a CBT, we organize the data and the execute groom. The initial groom will take awhile, but each subsequent groom will take inconsequential time. The data in the CBT can be organized on more than one key, but the keys can behave independently like a multi-dimensional effect.
Case in point - let's say we have 200 extents of data and the particular set of data we want spans only 10 of those extents (a zone map). We will ignore the other 190 extents and focus all the machine's energy on the 10 where the data is located. But wait, what if we have an additional key upon which to search, and the combination of the first key (10 extents) and second key reduces the total extents to only 2? That is, the first key appears in 10 extents but the first and second key appear in only 2 extents of those 10. Now we search only 2 extents. Again, we have organized and configured data, not hardware, to achieve the goal. We have told Netezza where-not-to-look and now we have zoomed in on a mere 2 extents out of 200.
But let's say that we add a lot more data to the table over the year, expanding this 200 extents to many hundreds if not thousands more. Can we see now that even if this happens, we can go back to the table again and again, ask the the same question as before, and only 2 of the extents will be used to search for data. This of course dispels the vendor-promulgated propaganda that Netezza tables get slower when we add more data.
As the title suggests, one of the challenges of new Netezza users is in learning about the product, what it can (and doesn't) do, and how it applies in data warehousing. When I first published the book (Netezza Underground on Amazon.com) the impetus for the effort was just that - people asking me lots of questions leading to fairly repetitive and predictable answers. It's an appliance after all. We can apply it to a multiplicative array of solutions but on the inside, certain things stand out as immutable truths.
Of course, lots of folks are running around down here in the catacombs, convincing people that they need to read the glowing glyphs on the ancient stone walls as guides on their quest for more information. This is entirely unnecessary. The title of the book (and the blog) is a tongue-in-cheek nod to the way that some might spin the story on the machine in their own favor. Some might claim that we need lots of consulting hours to roll out the simplest solution. Consultants can help, of course (I'm one of them). But it depends on the spin and the tale that is told, that will determine the magnitude and expense of those consultants..
I'll try to provide a balance that delivers value without extraordinary expense. Yet another source of misinformation is Netezza's competitors, who like to toss "innocent" bombs here and there to direct people down the path toward their own product. All is fair in business and war, as they say, but as we bring these issues out of the darkness and into the light, the objective is to become more informed.
I am neither a Netezza nor IBM employee, so apart from compensation for actual work performed in rolling out solutions, for which people would pay me regardless of technology, I don't have any other relationship with these companies. I am a huge fan of the Netezza product and architecture and make no secret of this.
So some may have come here to ask questions about the technology or read what some seasoned experts have to say. We can do all that and a lot more. For now, let's look at the counter-intuitive nature of the product's internals.
I'll paint an imaginary picture first. Let's say we have two boxes. In one box we have thirty-two circles and in the second box we have thirty-two drums. The circles are CPUs and the drums are disk drives. They are in separate boxes, and now we draw a pipeline between them. Make it as large as you want. This depicts a standard SMP (Symmetric Multi-Processor) hardware architecture that is the common platform for data warehousing. With the exception of Teradata and Netezza, this hardware configuration is ubiquitous.
Now let's draw another mental picture. This time we'll have one box. Each of the circles will be mated with one of the drums. Now we have thirty-two circle/drum combinations. In the Netezza machine each of these is called a SPU (snippet processing unit) and represent the CPU coupled with a dedicated disk drive. Some additional hardware exists to coordinate and accelerate this combination, but this is a simplified mental depiction of the fundamental difference between the prior SMP configuration and Netezza's, an MPP (Massively Parallel Processor) configuration.
Now I used thirty-two as a simple example. In reality, Netezza can host hundreds of these cpu/disk combinations, the largest standalone frame containing over eight hundred of them, scaling to over a petabyte in storage capacity. Those of us who regularly operate on these machines are accustomed to loading, scanning and processing data by the terabyte, the smallest tables in the tens of billions of rows.
Some notes on the difference in their operation:
SMP: Typically used for transactional (OLTP) processing and is not purpose-built for data warehousing. It is a general-purpose platform. The typical bane of an OLTP engine is that it performs well on getting data inside, but is lousy on getting data out (in quantity, for reports and such).
MPP: (Netezza specifically) does not do transactional processing at all. It is purpose-built to inhale and exhale Libraries-of-Congress at a time.
SMP: Table exists logically in the database, but physically on the file system in a monolithic or contiguous table space.
MPP: Table exists logically in the database, but is physically spread out across all the SPUS. If we have 100 SPUs and want to load a table with 100,000 records, each SPU will receive 1000 records.
SMP: SQL statements are executed by the "machine" as a whole, sharing CPUs and drives. While the SQL operations may be logically and functionally "shared nothing" - the hardware is "shared-everything". In fact, CPUs could have other responsibilities too, which have no bearing on completing a SQL statement operation. In the above example, the SMP has to access the file system, draw the data into memory nearest the CPUs, then perform operations on the data. Copying data, for example, would mean drawing all the data from the tablespace into the CPUs and then pushing back down into another tablespace, but both tablespaces are on the same shared disk array.
MPP: SQL Statements are sent to all CPUs simultaneously, so all are working together to close the request. In the above example, each SPU only has to deal with 1000 records. Unlike an SMP, the data is already nearest the CPU so it doesn't have to go anywhere. The CPU can act directly on the data and coordinate as necessary with other CPUs. Copying data for example, means that the 1000 rows is copied to a locaton on the local drive. If 100 CPUs perform this simultaneously, the data is copied 100 times faster and it never leaves the disks.
SMP: Lots of overhead, knobs and tuning to make it go and keep it going. From the verbosity of the DDL to the declaration of index structures, table spaces and the like.
MPP: (Netezza) the overhead of adminstration is hidden from the user. In the above example, the user need only declare, load and utilize the table, not be concerned about managing the SPUs or disk allocation. The user's only concern is in aligning the data content with the hardware architecture, an easy task to perform and an easy state to maintain.
SMP: SQL-transforms, therefore, the insert/select operations that run so slow on an SMP platform, continue to run slow, and slower as data is added. They are not a viable option so usually would not occur to us to leverage them otherwise.
MPP: SQL-transforms leverage the MPP and do not slow as the data grows.
In fact, many people purchase the Netezza platform in context of its competitors, who don't (nor can they) use SQL transforms to affect data after arrival.
If we purchase Netezza as a load-and-query platform only, we have missed a special capability of the machine that differentiates it from the other products. If we leverage this power, we re-balance the workload of the overworked ETL tools, in some cases eliminating them altogether. Netezza calls this practice "bringing everything under air", that is, bring the data as-is into the machine and do the heavy-lifting transforms after it's inside.
The Brightlight Accelerator Framework is one example of a flow-based, run-time harness for deploying high-performance, metadata-driven SQL-transforms. While we have matured this capability over time, the consistency of the Netezza platform is the key to its success.
SMP:Scalability is through extreme engineering, leading to hardware upgrade.
MPP:Scalability is a function of data organization as it leverages the hardware's power. Upgrading hardware is therefore rarely the first option of choice.
SMP:Constrained by index structures and the shared-everything hardware architecture through which all data must pass
MPP:Constrained by human imagination, not index structures (there aren't any) and no shared hardware.
We have noted that with traditional SMP architectures, when the machine starts to run out of power, engineers will swarm the machine and start to instantiate exotic data structures and concepts that serve only as performance props, not the 'elegant' model once installed by the master modelers. As the box continues to decline, more engineering ensues, eliciting even stranger approaches to data storage and management. Soon it becomes so functionally brittle that it cannot handle any more functionality, so people resort to workarounds and bolt-ons. Functionality that should be a part of the solution is now outside of it, and functionality that never should have been inside (mainly for performance) has taken up permanent residence.
In a Netezza machine we have so much power available that traditional modeling approaches (e.g. 3NF and Dimensional) may have a defacto home, but now we can examine and deploy other concepts that may be more useful and scalable. Approaches that we never had the opportunity to examine before, because the modeling tool did not (and does not) support it and the natural constraints of the SMP-based engine artificially constrain us to index-based data management.
In addition, the Netezza machine operates on a counter-intuitive principle of finding information based on "where not to look". With this principle, let's say we have a terabyte of data and we know where our necessary information is, based on where-not-to-look. See how this won't change even if we add hundreds of terabytes to the system? If I add another 99 terabytes to the same system, the query will still return with the same answer in the same duration, because the data I'm looking for doesn't appear in those 99 terabytes and the machine already knows this.
For example, if we go over to Wal-Mart and we know where the kiosk is for the special-buys-of-the-day, we can find this easily. It's in a familiar location. What if tomorrow they expand the same Wal-Mart to ten times its current size? Will it affect how long it takes me to find that kiosk? Clearly not. Netezza operates the same way, in that no matter how much data we add, we don't have to worry about scanning through all of it to find the answer.
And when it boils down to basics, the where-not-to-look is the only way to scale. No other engine, especially not one that is completely dependent upon index structures to locate information, can scale to the same heights as Netezza.
Well, some of the stuff above is provocative and may elicit commentary. I'll continue to post as time progresses, so the data won't seem so much like underground information, after a while anyhow.
Modified on by DavidBirmingham
Hey, I just couldn't resist!
Please note that in the following commentary, I am attempting to unpack the POTUS 2016 election analytics via the polls, not take political sides. I am one of those "independent/undecideds" that everyone complains about, partly because no particular party "represents" me. Call me a rogue!
Setup a random number generator in a computer so that it generates hundreds of numbers per second, all of them integers from 0 thu 99. Have it collect as many as it can in one minute. When it's done, the average of all numbers generated should hover around 50 (due to the Gaussian distribution (bell curve)). Once this is working, set up an observation and then focus your mind on something in the computer, something imaginary even, and "declare" this object in your mind to be the CPU or the generator itself.
Now start the recording and focus your mind and think - or even say out loud - "aim high". Continue these thoughts with as much intensity as possible. Perform this observation many times to get a lot of data. When the experiment is done and the observations are collated, the majority if not all of the averages will be above the value of 50. Conversely, when it's repeated with the phrase "aim low" the final results will show a downward skew below 50.
How is this possible? Is the mind truly able to affect an electronic random number generator? Does the concentration of the mind somehow force this outcome?
When independent observers watched the researchers repeat this and collected data both for "aim high" and "aim low" - an interesting pattern emerged. At the end of the test, an observer said "I noticed you kept resetting the test until you got it calibrated, what was that all about?" Well, said the researcher, when I'm trying to take a reading I have to make sure everything is the same. If I see that the average isn't moving I reset and refocus." "How many times do you do this?" Well, I don't really keep track of that."
The problem here is that the researcher was throwing out valuable information. It was showing that the return numbers really were no better than random chance but the researcher wanted to believe so much that a lack of rigor in the test protocols was a-okay. The researcher was throwing out observations that proved the hypothesis wrong and keeping those that agreed with the hypothesis.
So now that the election is done and the outcome is known, let's have a little fun with the analytics. You know, the pollsters. They claimed to have the answer (predictive analytics) - but most of them were dead wrong. Why was that?
Many of the news stories today start out with "The <winning candidate name here> in a surprising victory" - or "stunning upset", or "the nation changed its mind in the eleventh hour" - are all just weak ways of failing to admit that they got it wrong. Not only did they have it wrong in the end, they had it wrong all along.
The Electoral College complicates the analytics, so before we get started, a lot of foreigners read this blog and I have received "casual" questions in the past about the Electoral College, so here's a quick primer on it.
Ironically, the Broadway show "Hamilton" has been in post-election news, but the players in this production have all admitted that they didn't vote. The creator of the Electoral College was Alexander Hamilton, who also advocated that the POTUS be called a "king", and advocated steep immigration restrictions and even closed borders. This is ironic in the light of recent statements from the "Hamilton" cast. I wonder if they know these things about the play's namesake?
The election of a President Of The United States (POTUS) is unlike many other elections, in that it's not a popular vote. It is a collection of results of fifty-one separate elections (50 states and the District of Columbia).
There's a strong reason for this. The POTUS is elected by states that are each independently sovereign of one another. The United States is just that- a union of sovereign states, each with their own governments and laws.
In America, state sovereignty is protected in a variety of ways. The FBI cannot insinuate itself into every criminal investigation. There are strong lines of jurisdiction that determine whether its a state or federal crime, all borne on the sovereignty of the state. The federal government can't arbitrarily send troops into a state without the governor's permission. The same is true for a variety of federal agencies. For example, when Katrina hit Louisiana. President Bush was on-site the next day with an offer to the governor to send in troops to keep order and distribute food and provisions for survivors. The governor hesitated on this offer. A day or two later the levees in New Orleans broke and stranded millions. The fact remains, the troops could not enter Louisiana without permission. The states have rights.
The Constitution protects the rights of the states to maintain their sovereignty. A popular vote for "anything federal" would violate this sovereignty and put all voters, regardless of their geography, into a common pool. A candidate that made a lot of promises to the population centers would win hands-down. Moreover, election fraud in one location affects the whole. The POTUS election is this way by design, to keep democracy at bay, so that cheating is more complicated, the sovereignty of each state is honored, unbalanced power is not given to population centers to leave the majority of the country out of the race entirely.
Even as this is written, the state of California has laws that allow non-citizens to vote for local elections, and their totals are driving a "popular vote" for the loser because some 4 million votes were illegally cast for POTUS. Those haven't been excluded yet. They will likely line-up the popular vote with the electoral vote.
"Popular vote" would become a way for "big city dwellers" to tyrannize the country. Our founding fathers understood tyranny quite well, and the many forms it can take.
But what if the outcome were decided by popular vote? Consider that in states that are lopsided, like California, New York and Texas, voters for one candidate are always represented while voters for the other are not. Many voters for the other candidate don't go to the polls at all because they know it's a waste of time. But if this were a popular vote - those folks would have a vote that counted and be encouraged to go vote. Such things come into play when the votes are counted differently and would significantly affect how many additional votes are cast.
What if the candidates tie in Electoral votes or don't reach the necessary 270? The contest goes to the House of Representatives. Many people mistakenly believe that the House would cast 435 votes for President. This is not the case. The representatives of each state must "caucus" and have their own internal election, and only one vote from that state is cast for POTUS - 50 votes in all. Why? Because the states elect the President, not the people. It's a running theme.
This is a bit of a startling realization for some, that even the Father of the Constitution, James Madison, commented that "democracy is evil" - primarily because it allows the will of the many to trample the rights of the few. The founders deliberately put in place a variety of checks and balances that keep democracy at bay. As such, America is a Representative Republic, not akin to any form of democracy at all. This is confusing to some, who've been told America is a democracy and the Electoral College is "undemocratic" and if it's not democratic it must be bad.
History has proven that "the majority is often wrong" - and this was proven out in the POTUS pre-election polling - all but a few were completely wrong.
Recall that many hundreds of years ago, a chap named Galileo challenged the majority "opinion" on celestial mechanics. Oddly, the Aristotelians in the universities - all of academia - stood in opposition to him. Even the leaders of his own faith disowned him. He was a lone voice in in a sea of consensus - and all they had to do was look into the telescope. In all walks of life, we often find that a small number of people "have it right" while the majority is wrong. The founders of America wisely recognized this "herd effect" of humanity and didn't want it to have any power. One of the reasons for this is that the "herd effect" is often driven by a majority of the least-informed and most-fearful.
Should population centers control the presidency? This question was asked and answered by the founding fathers in the form of the Electoral College.
This system however, radically complicates the prediction of an overall outcome. Each state's activity has to be predicted and modeled, and depends on a lot of very dynamic factors.
Even further - in the 2016 election - the media, the newspapers, the consultants and a wide range of politicians, including many politicians in the same party as the eventual winner - all claimed that the eventual winner would lose in a landslide and be the most devastating loss in history. All of them were wrong. Instead, the very opposite became reality. The naysayers were proven wrong, the "favored" candidate lost and that candidate's party was decimated.
The problem here isn't that the nation "suddenly turned at the last moment" as some have suggested, but the pollsters themselves fell ill to a common problem in these kinds of analytics. They had it wrong from the beginning and their polls failed to reflect reality. There's an obvious reason we'll get to a little later.
Here was a chance for predictive analytics to shine like no other - a time-boxed, measurable event and outcome. This is why those of us who champion analytics are so appalled by the epic-fail. It could have been a shining moment but instead just a fizzle. For analytics. The worst part is that people without any predictive computing also predicted what would happen in the aftermath of the election - and they were also right. No computer algorithms required. Perhaps the nature of a "learning machine" needs to be couched in terms of what is being learned and how we know it's true?
Many of us who want excellence in analytics were absolutely appalled at the abject lack of accuracy of the presidential polling. It seemed a lot like a circus and the election night seemed like their version of damage control more so than reporting an expected outcome.
After all, if the polls were scientific and on-the-money, when the polls closed and the votes were reported, the science should have already reflected the reality of the outcome. No surprises at all. Why was it such a shocker? An upset? A cinderella story?
Think about that for a moment. We hire someone to do analytics for us, find market opportunities and reduce risk, and when we act on their carefully crafted predictions, it goes south and we lose millions of dollars. Would this be a good testimony to the accuracy of our analysts, or an abject failure? Would we ever trust them again? Would we throw out the baby with the bathwater, so to speak, and forsake the value of analytics altogether? Some have done this, and later revisited it with a bit more scrutiny ( and are happy with their results now).
When it comes to our companies, our livelihoods, etc - trusting the data and the analysts is hard because it's personal. It's not like a video game where we get unlimited retries. The one failure may put us out of business. The one success may make us handsomely solvent for decades.
Let's get some things out of the way - we have a few broad reasons why they got it so wrong.
- On the dark side - they knew the results and were lying. I don't see any benefit for the pollster or the candidate in this. In fact, showing a candidate artificially high in the polls could spur the opposition voters to the polls!
- On the lighter side - they were incompetent - I can't accept that so many were off because of this - that's just me talkin'
- On the analytic side - they were victims of "confirmation bias". This explains it much better. They were sincere, but sincerely wrong. They were simply too personally invested in the outcome.
- Reuters and their partner IPSOS, recently shared that much of their polling data favored the eventual winner, including the most controversial issues - by over 83%. They don't see their failure to report these known facts as having any influence on the election. In fact, they completely changed their time-honored measurement system in a manner that favored the eventual loser. Why would they do this? It's really simple: they could not bring themselves to believe that they were wrong.
Many who have attempted to find a reason for the failure fall into one of two major buckets - those who look at it scientifically and those who look at it politically. The political view has a problem in that the presence of bias - of the modelers and analysts - tends to taint the model and inject a "confirmation bias". That is, the polls say what the analyst expected them to say, but the analyst may not realize that their bias has already affected the outcome. Many who look at it scientifically are using politics first, so the confirmation bias creeps in again.
Case in point, an analyst may ask the question, "If the election were held today, who would you vote for?" And it's interesting how many of those voters polled were dishonest in their response. How we know this will follow shortly. Taking these responses at face value, the analyst sees that the numbers trend as they expected, so they report the results. Another form of confirmation bias is when the analyst has a hypothesis in hand, such as "We know that nobody could possibly want to vote for Candidate A for so many important reasons, thus..." and this hypothesis is what drives their questions and likewise the rest of the analysis. They only ask questions in this context and only hear answers in this context.
This is a lot like the circular question of "Do you still kick your dog?" If someone answers "yes", they're a reprobate. If they answer "no", it implies they used to kick their dog...
Unbeknownst to many, this circularity is one of the major flaws of the scientific method. This is why the scientific method can't be used to "prove" anything. It can be used to falsify, but not to prove. Bias rears its head in common scientific experiments and even in police forensics.
"What happened to the evidence you collected in the bathroom of the crime scene?"
"We analyzed it and threw it out as irrelevant."
"You threw it out as irrelevant? Why?"
"Because it didn't match the suspect."
(real conversation between a prosecutor and CSIs in a triple murder)
In scientific circles - a professor at an Ivy league university had an intern who reported that the professor had collected a wide array of samples from a recent field trip and the intern had expected to spend the weekend collating them as part of his duties to the professor. But when the labeled/bagged samples arrived, only a fraction of them made it into triage. He asked the professor what happened to the rest of it, and the professor said it had to be discarded because "it didn't fit the profile".
This "doesn't match the suspect" or "doesn't fit the profile" is a problem because it means the analysts have a hypothesis concerning a specific outcome and have thrown out evidence suggesting, or even proving, that their hypothesis is wrong. The professor, in throwing out evidence that didn't match a profile, is aligning to his original hypothesis and discarding evidence that disagrees with it. In the case of the CSI, if they throw out all but the data that "matches the suspect" they are by definition throwing out the case against the real criminal. If any of that evidence could exonerate the "suspect" but the "suspect" is falsely convicted, the real criminal goes free.
In the case of Reuters/IPSOS - their hypothesis was that the eventual loser would win - has to win - in a total landslide. Any data that contradicted this simply had to be erroneous. Yet it was not - it was speaking the truth and they ignored it.
So the flaw of the scientific method is that the scientist can inadvertently use the hypothesis as the filter through which all evidence is examined, rather than it's intended purpose - a springboard for investigation to be either confirmed, rejected or modified based on the evidence discovered. Since many researchers apply for grants based on "the hypothesis", they must have a reasonable confidence that the hypothesis has merit in order to influence donors for funding. If the scientist has enough failures, those funding sources will dry up.
Thomas Edison claimed to have failed over 1000 times in inventing the lightbulb. His explanation was that in each case, he eliminated a candidate with the expectation that a final candidate would succeed - claiming 1 percent genius, 99 percent perspiration. Unfortunately, donors these days aren't so forgiving. Scientists like Edison, Newton and Lavoisier had independent means of supporting themselves while performing science. Today, scientists expect to be financially supported even while they do the work. That's reasonable, but it increases the cost of science, and tends to tempt the scientist into fudging the data to match the hypothesis. He has bills to pay, kids in college, vacations to pay for. A dependency on money, and lots of it, has been shown to affect their judgment.
As for polling, the polling firms sell their results to the candidates and it's collectively worth billions of dollars. Down-ballot candidates in 2012, 2014 and 2016 spent over a billion dollars in polling alone.
Some reason this way: if you're a pollster in the business of selling polls, and the first poll you take shows your "guy" waaay ahead of the pack, do you share this or do you show the race to be "very close"? This "very close" will cause your client to be nervous and want to purchase another polling result as soon as possible. You have a vested interest - even a conflict of interest - in whether or not you deliver the right polling results because it's your job to sell more results. Since everyone knows that the only polls that count - are the polls in the final week - hey - why not massage the numbers? Who's it gonna hurt?
In the book "Wrong: Why Experts Can't be Trusted" - the author cites one case after another of scientists eliminating, adding or fudging data. One researcher went so far as to use a magic marker to put stripes in the fur of a lab rat so that it could pass certain acceptance criteria. He notes that even though doctors have formally accepted a correct form to administer Cardio-Pulmonary Resuscitation (CPR) the average training manual and even the Red Cross still use the former method.
Periodically in the news will appear a case of fraud or collusion in the area of climate science. If the science is there, why the fraud? For example, the IPCC announced that global warming has been in a "pause" since 1997. The current trend is more toward cooling, as exemplified by the glacier forming in Mt. St. Helens' caldera. Many scientists are asking the impertinent question: Is the "pause" really a "pause" or is it something else, like "reality"?
In May of 2015 NASA announced that the polar ice caps are not receding, in fact the Antarctic sea ice is expanding.
For folks to claim that "the majority of climate scientists believe..." isn't relevant, because many centuries ago the majority of scientists believed the Sun revolved around the Earth. Still, science is science, and not subject to consensus or democratic vote. One "uncovered" memo after another has revealed that the "result" of climate science can be purchased regardless of the contradicting data. So why does anyone trust it? After all, doesn't Earth itself do more damage to itself, than mankind could possibly keep up with?
In the case of Presidential polling, there's no compelling reason to deliberately get it wrong. Some like Reuters may withhold information, but this isn't the same as false reporting. Statistics show that false polls don't do a lot to suppress or affect voter turnout or sentiment. In fact, one could make the case that a false poll strongly in favor of one candidate could strengthen the resolve of a person voting for the opponent. Likewise if a candidate is seen as strong in the polls, some of the candidate's supporters may not go vote under the assumption that their vote wouldn't count all that much. Both of these dynamics are partially in play in 2016 but nobody can measure it, so nobody knows for certain its impact.
I worked with a firm some ten years ago that had so much cash rolling in, they could have wallpapered the offices with 100-dollar bills and not missed any of it. Their business was so dramatically profitable that their investors loved them, their customers loved them, but one love was lost and had not returned - working for the corporate offices was drudgery. They had not upgraded their computing systems in many years, so many of the employees spent countless hours, every day, pulling data and collating spreadsheets.
In the meantime, a lack of visibility across their corporation made their expenses invisible. They were hemorrhaging cash in most departments and could not see it, and didn't care because so much more cash arrived to replace it, and then some. Such situations always have a day of reckoning. We were onsite because that day had arrived and the senior management may as well have been in witness protection they were so frightened.
Analytics matters. It tells you where you stand.
Unless it's presidential polling. Which is just bizarre. I hear that it's - uh - like the most powerful position in the world? I could be wrong about that, but for the presidential polling to be so completely off?
The reason we as voters don't particularly care about the presidential polling is that it's rarely accurate. We're sort of "inoculated" to it by now. We hear all sorts of stories of pollsters using the polls to shape opinion rather than reflect it. We want to believe that they're doing it for the right reasons - to be accurate and regarded as reliable, not as agenda-driven political hacks. Hope springs eternal, but at the end of every election season we see how completely wrong they were. At the beginning of every election season, the lessons of four-years-prior are forgotten and we find ourselves watching the polls ever-so-hopefully.
Pardon the analogy, but isn't this a lot like Lucy and Charlie Brown, where she holds the football for him and rips it away at the last moment? She promises each time she won't do that, but always does. Why does Charlie keep coming back? Well, because it's funny, and it's Peanuts, and we know it's make-believe.
But presidential polling isn't make believe and the outcome has real-world consequences.
Only three of the mainstream pollsters were even remotely accurate. All the rest (dozens of them) couldn't have been more off if they had just manufactured the numbers from thin air. In fact, any of us could lick a finger, test the political wind, and produce a poll that was more accurate than ninety-percent of those claiming to use analytical science. Just embarrassing.
And if it's science, why were they so wrong? I mean, so completely wrong?
This is why the final election outcome was such a "shocker". Expectations. Just the setting of false expectations is enough to set someone off. Tell your wife you have a romantic weekend planned and then at the last minute get called into a non-optional emergency meeting at a client site - uh - yeah - set those expectations carefully! In one particular case, I had set my client's expectations that I would be unavailable. They didn't even bother to call.
Nate Silver, famed analytics guru of many past elections, put his private formula alchemy to work and at the beginning of the election night, already had one candidate favored with 70 percent chance of winning. No margins, you see, just whether the candidate would win. As the polls closed in each time zone, Silver changed his percentages. They went to 60, then 50 and dipped below 50, moving downward "as the world turned" and polls closed by the hour. The opponent likewise rose in the other direction. Betting odds were a reversal of fortune for many.
People look at Silver's messaging in real-time, analyze it and proclaim, "The frontrunner is losing ground" or "the underdog is gaining ground". Why doesn't anyone see such sentiments as odd? The reason I say this is simply:
Friends of mine go to the tracks on occasion, probably more often than their wives would like, and spend time betting on dogs or horses. At the beginning of the day, the track officials publish the betting odds, much like Silver published "odds" during the course of the campaign. But those odds at the track are only good before the race begins. The officials don't change the odds after the starting bell.
And what happens when the gate opens and the horses charge forth? Seems to me they're just like Olympic runners in a starting block. They all have the same starting point - zero - and all of them have to gain ground faster than the opponents - to break the ribbon first. I mean, everyone gets that, it's why we watch races. I still recall many Summer Olympics ago, one of my favorite runners (Gail Devers) was in the 100-meter hurdle. She was at least five hurdles ahead when she hit the last hurdle and tripped over it. She landed on her knees and tried to recover, but finished third. She was favored to win, too. That really was a case of gaining ground only to lose it later. Many recall the Winter Olympic snowboarder who was favored to win second place. The frontrunner and backrunner tangled right out of the gate and the hero ran well ahead of them. When she hit the last hill, she decided to "hot dog" and brought her board up to touch it, lost her balance and spilled out. She hurriedly tried to make it right but the other two sped past her, an opportunity lost.
The lesson in all this, is that once the polls close - the outcome is prescribed. They can't gain or lose ground. At the voting precincts, there is no frontrunner or underdog after the polls close. It's all over but the counting.
The odd part about this being applied to the 2016 presidential race, is that when the polls closed and votes were reported, one jumped out in front and stayed out in front, and ultimately won the contest. Anything prior to the votes being counted - predictions, polling, exit polling - didn't matter any longer - because most of them got it completely wrong. The only question people have later is - if most of the pollsters were completely wrong, how do we know the others aren't just a fluke?
Isn't that what we'd say about any other kind of contest? If fifty people make a prediction for an outcome and only three get it right, we chalk it up to random chance, not the skills of the predictor. The point being - if science is directly applied, we move closer to the expected outcome. If no science is applied, we can't expect better than random chance.
So for Silver to claim "gaining ground" or "losing ground" after the polls close, is ridiculous. It's Silver's form of "damage control" after being proven so completely wrong. Moreover, he was repeatedly proven completely wrong about the winning candidate from the time the candidate announced a presidential bid. Every prediction he made - polling to the primaries - crashed and burned. He and others expected their front-runner to win in a landslide of 500 or more electoral points. Epic fail. How embarrassing is that?
Not for him, per se, but for analytics in general. It is because of Silver's past prior success that advanced analytics has gained ground in the marketplace, but when folks like Silver so completely and visibly fail, it sets-back analytics and in some ways can bring shame to those who sold their analytics based on Silver's success with it.
To Silver's credit, part of his "damage control" was his explanation of how close he called the "popular vote". Well, Nate, that's reeeeeal nice. But as noted above, the popular vote doesn't mean squat. This is the United States, each state is a sovereign entity not beholden to the other states, and do not participate in a popular-vote-based election. It is in their best interests, and always has been, to avoid being pooled with the popular vote.
A similar effect happened with the 2004 POTUS election, where one candidate was handily whipping the incumbent - based on exit polls alone - but when they started counting votes, the incumbent immediately jumped in front and never fell behind even once. Various groups in favor of the challenger cried foul - but strangely did not point a finger at the exit pollsters. The outcome of the election would have been the very same without their reports. And since their reports were so completely wrong, why report them at all? The more sinister among us would claim they were trying to affect the outcome. Voters are a little smarter than that, so it's hard to swallow. Especially if its across fifty sovereign states, each with the own vested interests - this just makes it a lot harder to cheat.
Another strange effect happened with the 1980 election, where the polls leading into the election night had the incumbent ten points ahead, but as the night unfolded, the challenger took the race by a total landslide. Only many years later was it learned that in the week prior to the election, the incumbent was taken aside and told that he wasn't ten points ahead, but ten point behind. No way, no how would he recover this ten points in a few days. This was a "brace yourself" moment, so that nothing about the election night was unexpected for the candidate. I suspect that the same numbers were available to the challenger as well. This was the first time that the loser conceded the race even before the polls closed on the west coast
In the 2016 election, one candidate was behind in most of the prior predictive polls, and led in only a few, while the other candidate enjoyed a comfortable "lead" throughout the campaign season. The underdog claimed that the polls could not be trusted. People laughed. The polls that showed him ahead were ridiculed. They however, were closer to the mark than anyone realized. Oddly, these same polls were the most accurate ones four years ago in 2012. Why weren't they trusted this time? Confirmation bias. They disagree with our hypothesis - so there's just no way they can be right.
Since both candidates were "seeking low ground" in their rhetoric - in the most bizarre and tumultuous race ever, the candidates dished out their fair share of mud. As a result, many voters were uncomfortable in openly committing to either candidate. Of course, each candidate had their own "openly loyal base" but knew they could not win with the base alone. They had to reach the "independents" and "undecideds" - but how to find them? How to know what they really think?
One of the pollsters used an interesting question - "How are your neighbors voting?" - and this seemed to unlock a wealth of information. In the final analysis, this one question unlocked the hidden information. He had found the true sentiment of "undecided" voters. While a person may not feel comfortable sharing their own opinion, it was easy to share the opinion of an "imaginary neighbor". This pollster's numbers were the most accurate, state by state than any of the other pollsters. He called it, but he had to use a little subterfuge to make it happen.
This subterfuge it seems, is the hallmark theme of politics. The politicians keep a public face and a private face. One candidate was revealed to have told donors to expect public responses to the voters that were incongruous with the private responses to the donors - not to worry, it's just politics.
On a personal note, my father was a District Attorney (an elected office) for over 30 years in East Texas, and was in office at the time of my wedding (to my wife of now 30 years) - and my wife's parents had taken-on the expense of the wedding reception. Dad told them to invite an additional 300 guests to the wedding and reception, which would have blown their budget sky-high. They objected but Dad said - no worries, it's just politics. None of them will actually show up, but it's bad form not to invite them. Sure enough, none of them showed up. But this is a bit of a subterfuge in itself, is it not?
Politics is an art of partial truths and partial subterfuge. Anyone who reveals their agenda from the outset is considered a poor political player. One must have a public agenda and a private agenda, if they want to "get anything done". At least, that's the "common wisdom".
This is why many technologists often divorce themselves from politics entirely. If I took a poll of technologists nationwide, of those eligible to vote I would find that only a small percentage are actually registered to vote. Technologists often watch politics like a sporting event, if they watch sporting events at all.
But this politics-as-usual problem - the subterfuge and hidden agendas - had apparently wearied the American voter. So when a candidate stepped forward with no political experience at all - nobody knew how to measure it. They still don't. Even now they are attempting to describe the candidate's victory within a political paradigm, and nothing they come up with is accurate. I read one dissertation that attempted to do the same-old shoe-horn of analyzing based on demographics, when the winning candidate had clearly appealed to a populist voter base that cross-sectioned a wide array of disparate demographics.
No wonder they were so completely wrong- they were looking in the wrong place, and asking the wrong questions. Conspiracy theories emerge. Are they that incompetent? Are they deliberately lying? Either way, how can we trust them?
Silver moves into "damage control" in the next days with "we were only 2 percentage points off" - well no, claiming that one candidate had over 70 percent chance of success is a lot more than 2 percentage points - but since he doesn't do polls himself but bases his information on existing polls and other information - he was effectively drinking from a poisoned well. If he had taken a superficial look at the candidates he would have seen why. One was a career politician and one had never run for office, and didn't understand the first thing about politics - so didn't conform to the common model. This not only confounded the opponent, it confounded the media - and the analytics.
I kept after my kids to pay attention to this election season because they'll tell their grandchildren about it - this will never happen again in our lifetimes.
In large part, this was unlike any other presidential political season for these very reasons, but the pollsters treated the second candidate like the first, attempting to shoe-horn everything into a common "career politician" model. One time after another, the non-politician candidate beat the predictions and nobody could understand why. In the end, the "political class" of consultants and the "establishment" were very afraid. Here they had set up a system through which all political candidates had to arrive, but this candidate proved that none of it was necessary. It relegated the established political engine to irrelevancy.
Moreover, the winner is about to enter office without any obligations to donors or other influencers- and no fingerprints on any of the problems taking place in government now. This was not the case with any of the other candidates.
Has anyone ever witnessed something so strange? The candidate stuck to several core issues that resonated with voters in all demographics and gathered more diversity under the candidacy than anyone prior. The losing candidate on the other hand, kept going through one re-invention after another, as if a phoenix rising from flames. Voters saw this as phony.
How does this apply to our internal corporate issues? Does dirty politics play a role in how numbers are reported? Do we seriously think that if the eeevil political player down-the-hall is able to manipulate numbers to his/her advantage, that they will be altruistic and avoid the urge to cheat? No doubt many avoid the urge, but there is an ever-present propensity to cheat, to capture numbers and spin the story to one's favor. It's just human nature.
When Inmon coined the phrase "Single version of the truth" - this is exactly the problem it addressed - to make sure, in certain, objective and scientific terms, that everyone was reporting from the same place, same totals, same everything, so that nobody could cheat. It's bad enough that someone would cheat to pad their numbers and look better, it's even worse when an underperformer pads their numbers to look even marginally acceptable. Corporate heads wanted neither, but a single place to go where everything was laid bare, the good, bad and the ugly.
In Red Storm Rising, Tom Clancy tells a story of the Russian Politburo and their analysts, who would arrive with three reports in-hand. One was the worst-case, one the best-case and one the middle-ground. When the analysts arrived, they would attempt to "read" the sentiment of the Politburo members - before choosing which report to proffer. The Politburo was known for being harsh with people who disagreed with them. So the analysts would attempt to discern the sentiment and intersect it with a report that aligned with Politburo sentiment rather than challenged it. In this particular storyline of Clancy's, this sentiment was ill-placed, the report was the "best case" and the outcome was disastrous.
Sometimes the folks asking the questions are their own worst enemy. They ask strongly biased questions in the wake of harsh outcomes for dissenters. If a person wants to keep from getting their head lopped off, they do whatever, say whatever to avoid this outcome, but it's not doing the decision-maker any good. If anything, it's misleading the decision-maker down a disastrous path. Unless of course, the decision-maker is just so good at what they do, they don't care about the opinions of others anyhow. Except to see who is loyal to them, of course.
This is the dichotomy presented to us by presidential polling versus our internal analysts. The pollsters and the analysts both have a loyalty or bias, so it is incumbent upon us to either determine that bias, or guide that bias in our favor. In business we want our analysts loyal to our goals and success. We want their honest answer as to where we're headed and whether or not it's a good idea, how to steer toward success and how to avoid danger, and the most effective way is to join-at-the-hip. Their fate is our fate - they have a vested interest in helping us get it right.
Consider the story of the king who went to visit a wise old soothsayer, who told the king, "You will cross a river, and a great king will be defeated." So the king mustered his troops, crossed the river and his entire army was routed. The soothsayer was right - a king had been defeated - but I'll bet the king asking the question would have wanted a more specific answer.
Conversely, who are the pollsters loyal to? Clearly at least one pollster was chasing "the answer" and found it - and reported it regardless of how many other laughed at him for it. The others, who had it wrong, were loyal to something else. We don't need to know what that something else was, just that they weren't pursuing the truth. And if they were pursuing the truth but were that-far-off - they certainly weren't pursuing it well. With the stakes so high, wouldn't we want the person to pursue it well? Don't we want them to be loyal to us?
And "loyal to us" isn't the Politburo gambit of "reading" us to tell us what we want to hear. We want them to tell us what we need to know.
Or are we really okay with analysts who "found what they were looking for" and upon "finding it" proclaimed "see I told you so" even as the company was entering Chapter 11? One particular very-large energy company in Texas (Enron) had one and only one analyst telling them they were on the wrong path. He was right, and could say "see I told you so" - but the decision-makers weren't listening.
This is a lesson for the analysts out there - just like only a few 2016 pollsters got it right while the others laughed at them - you as an analyst might be up against similar odds - and feel like Galileo in conflict with the greatest academics of his time. Your chief analysts may tell you that you're wrong - that you're making a bad career move to disagree with them. What if they are using the "common metrics" and you have found an outlier, a significant anomaly that creates tension in all the common answers? If the data is on your side, you have some decisions to make.
Many years ago I worked with a chap who built hardware parts for computers. One IEEE-certified schematic for a device showed that he needed a much larger wire than was necessary. The wire could hold more power than common house-current, but the device was powered by a nine-volt battery. The larger wire seemed like overkill, but the specification called for it. He took his case for the larger wire to the boss, who told him that he needed to use a smaller wire. The engineer stood his ground on the side of the schematic - and claimed that he'd taken an engineering oath not to follow instructions from people in opposition to a schematic specification. A battle ensued that lasted for many days until the young engineer tendered his resignation. A contractor was called in to fill his shoes, and he noted the same issue with the size of the wire. He called the vendor who said "This has already been published in errata. Do you not have a copy of it?" The contractor said no so they faxed the same. Lo and behold, the new schematic had specified a smaller wire. Why didn't the first engineer think to do this instead of sticking to the original data - in the face of such a glaring anomaly?
What does all this mean? We can stand our ground with bad data in our hands and be sincerely wrong. Or we can look at other aspects of the problem and regard glaring inconsistencies as problems to solve rather to ignore. Nate Silver ignored the "poisoned well" of the data he was using, as did the other pollsters. The ones who got it right, stuck to their answers even through ridicule - because they knew this race was different in too many ways to count, and required more than just the common metrics.
An old joke goes like this: Some analysts got together to determine the "meaning" of "two plus two" - and brought in a mathematician. His answer was "four - what's your point?"
They brought in a philosopher, who answered with "Well, "two" in one universe might mean different things than in ours, the same for the meaning of "plus" or even "four", so can you be more specific?" They thanked him for his time.
Then they brought in the attorney. Upon hearing the question, he rose from his seat, shut the door, seated himself and leaned into them, "What do we want it to be?"
Modified on by DavidBirmingham
Many years ago we encountered an environment where the client wanted the old system refactored into the new. The "new" here being the Netezza platform and the "old" here being an overwhelmed RDBMS that couldn't hope to keep up with the workload. So the team landed on the ground with all hopes high. The client had purchased the equivalent of a 4-rack striper for production and a 1-rack Striper for development. Oddly, the same thing happened here as happens in many places. The 4-rack was dispatched to the protected production enclave and the 1-rack was dropped into the local data center with the developers salivating to get started. And get started they did.
The first team inherited about half a terabyte of raw data from the old system and started crunching on it. The second team, starting a week later, began testing on the work of the first team. A third team entered the fray, building out test cases and a wide array of number-crunching exercises. While these three teams dogpiled onto and hammered the 1-rack, the 4-rack sat elsewhere, humming with nothing to do.
We know that in any environment we encounter, with any technology we can name, the development machines are underpowered compared to the production environment. And while the production environment has a lot of growing priorities for ongoing projects, we don't have this scenario for our first project, do we? Our first project has a primary, overarching theme: it is a huge bubble of work that we need to muscle-through with as much power as possible. That "as much as possible" in our case, was the 4-rack sitting behind the smoked glass, mocking us.
And this is the irony - for a first project we have a huge "first-bubble" of work before us that will never appear again. the bubble includes all the data movement, management and backfilling of structures that we will execute only once, right? Really? I've been in places where these processes have to be executed dozens if not hundreds of times in a development or integration environment as a means to boil out any latent bugs prior to its maiden - and only - conversion voyage. But is this a maiden-and-only voyage? Hardly - typically the production guys will want to make several dry runs of the stuff too. We can multiply their need for dry runs with ours, because we have no intention of invoking such a large-scale movement of data without extensive testing.
And yet, we're doing it on the smaller machine. No doubt the 1-rack has some stuff - but I've seen cases where it might take us two weeks to wrap up a particularly heavy-lifting piece of logic. If we'd done this on the larger 4-rack, we would ahve finished it in days or less. Double the power, half the time-to-deliver (when the time is deliver is governed by testing)
In practically every case of a data warehouse conversion, the actual 'coding' and development itself is a nit compered to the timeline required for testing. I've noted this in a number of places and forms, in that the testing load for a data warehouse conversion is the largest and most protracted part of the effort. And if testing (as in our case) is largely loading, crunching and presenting the data, we need the strongest possible hardware to get past the first bubble. A data conversion project is a "testing" project more so than a "development" project, and with the volumes we'll ultimately crunch, hardware is king.
But I've had this conversation with more people than I can count. Why can't you deploy the production environment with all its power, for use in getting past the first bubble, then scratch the system and deploy for production? What is the danger here? I know plenty of people, some of them vendor product engineers, who would be happy to validate such a 'scratch' so that the production system arrives with nothing but its originally deployed default environment.
Yet another philosophy is that we would pre-configure the machine for production deployment, but nobody likes developers doing this kind of thing in a vacuum. They would rather see deployment/implementation scripts that "promote" the implementation. I'm a big fan of that, too, for the first and every following deployment. That's why I would prefer we used the production-destined system to get past the first-bubble-blues, then scratch it, and get the original environment standing up straight, and only then treat it as an operational production asset.
Most projects like this have a very short runway in their time-to-market, and we do a disservice to the hard-working folks who are doing their best to stand up this environment, They need all the power they can get, especially when they enter the testing cycle.
And for this, it's an 80/20 rule for every technical work product we will ever produce. Take a look sometime at what it takes to roll out a simple Java Bean, or a C# application, or a web site. Part of the time is spent in raw development, and part of it in testing. If I include the total number of minutes spent by the developer in unit testing, and then by hardcore testers in a UAT or QA environment, and it is clear that the total wall-clock hours spent in producing quality technology breaks into the 80/20 rule - 20 percent of the time is spent in development, and 80 percent in testing.
And if the majority of the time is spent in testing, what are we testing on Enzee space? The machine's ability to load, internally crunch and then publish the data. On a Netezza machine, this last operation is largely a function of the first two. But we have to test all the loading don't we? And when testing the full processing cycle we have to load-and-crunch in the same stream, no? What does it take to do this? Hardware, baby, and lots of it.
I can say that multiple small teams can get a lot of "ongoing" work done on a 1-rack, no doubt a very powerful environment. I can also say that a machine like this, for multiple teams in the first-bubble effort, will gaze longingly at the 4-rack in the hopes they can get to it soon, because so much testing is still before them, and they need the power to close.
What are some options to make this work? Typically the production controllers and operators don't like to see any "development" work in the machines that sit inside the production enclosure. They want tried-and-tested solutions that are production-ready while they're running. At the same time, they have no issues with allowing a pre-production instance into the environment because they know a pre-production instance is often necessary for performance testing. Here's the rub: the entire conversion and migration is one giant performance test! So designating the environment as pre-production isn't subtle, nuanced, disingenuous or sneaky - it accurately defines what we're trying to do. It's a performance-centric conversion of a pre-existing production solution, now de-engineered for the Netezza machine. As I noted, development is usually a nit, where the testing is the centerpiece of the work.
With that, Netezza gives us the power to close, to handily muscle-through this first-bubble without the blues - we only hurt ourselves with "policies" for the environment that are impractical for the first-bubble.
This brings us full-circle yet again to a common problem with environments assimilating a Netezza machine. The scales and protocols put pressure on policies, because those policies are geared for general-purpose environments. There's nothing wrong with the policies, they protect things inside those general-purpose environments. But the same policies that protect things in general-purpose realm actually sacrifice performance in the Netezza realm. Don't toss those policies - adapt them.
Within a few minutes, Neon and Ruth Guardian arrived at the office of the Architect. The label "Most Recent" was alongside the open door. Inside, they could see a high-back leather chair and the back of a man's head. Beyond him. an array of computer screens in a grid, filled and scrolling with operational information.
Neon knocked and entered. He wondered why such a complex command console was required for such a simple data warehouse "Mr. Recent?"
The Architect turned around to greet him, "Hello, Neon - I've been expecting you." He clicked his remote mouse toward the wall and several screens blinked. "But I'm afraid that Most Recent is a status, not a name." He then pointed to the floor, "Please remain standing. You won't be long."
Neon glanced toward Ruth and wanted to roll his eyes, but refrained. "I have a few questions," Neon said, pulling up a chair.
The Architect confidently smiled and said, "I suspect that you do, but keep in mind that you are still irrevocably a consultant. Some of my answers will make sense, and some won't."
Guardian leaned into Neon's ear and whispered, "Actually none of his answers will make sense."
The Architect continued with the tiniest smile, "The question you ask first will be the most important to you, but is also the most irrelevant."
Neon paused, then asked, "Why didn't you protect the company and its interests when you designed the data warehouse?"
"Okay, I was wrong - that is a very good question. Let me think about that for a moment."
"Take your time."
"I suppose the question is not about the data warehouse itself, but the outcome of its applications."
"No, it's the data warehouse itself." Neon focused.
"But the data warehouse is practically perfect in every way." He was noticeably uncomfortable.
'It's full of junk," Guardian sighed.
The Architect continued, "Look at the algorithms, the flows, the data management, the sheer muscle in the machinery. Infinitely capable and infinitely scalable." He grinned and laughed to himself, "Null, missing and invalid values are simply scattered anomalies within what is otherwise a harmony of mathematical precision."
"But you're supposed to correct those," Neon challenged, "the data warehouse process has a responsibility to scrub bad data and either exclude it or make it right."
"What if the data doesn't want to be excluded? What if we don't know how to make it right? What if it cannot be made right? What if excluded data finds a way to come back?"
"The data isn't like an animal," Neon huffed, "it doesn't have a personality. It doesn't feel pain or make decisions. It doesn't - " he paused, leaning backward.
"It doesn't," The Architect agreed, "and it isn't." He shared a long, intense eye-to-eye gaze with Neon, "you still don't understand, do you? All data is merely binary one's and zero's recorded on media, and the metadata to describe it is no different. Data or metadata - what is the difference? - it isn't real. It is a collection of magnetic signals that record the semblance of something in the real world"
There is no data Neon recalled. And if metadata itself is data, then it, too is not real. There is no data, no metadata. What then - is reality? "This is just a philosophical crock," he concluded.
"When data flows through the systems," The Architect explained, "it is not really flowing, only signals representing the data are changing in memory, organized such that the mathematical states cascade in value from one machine to the next - but it's not like water. It doesn't have - physicality."
It's not real in the physical sense, Neon's brain was on fire, but it represents reality, and affects reality - like - sorcery?
"He's doing it too," Guardian said, "talking in circles and not accepting responsibility."
"Don't reduce," Neon focused, "the data is junk, no matter how you choose to describe it, or how capable the technology is. Magnetic state leaves one place and lands in another, and you took no responsibility to make it right - is that about it?"
"Anomalies in the source data reduce its quality," the Architect asserted, "to a level that is beyond correction, but not beyond control."
"They are beyond neither," Neon corrected, "correction and control are simply a function of diligence."
"Diligence and control are easy," he said, "but correction is the realm of the data stewards. What if I correct the data? From what incorrect state to what correct state? And who decides? Who controls?"
It suddenly dawned on Neon that a major missing piece had nothing to do with the architecture, data, or technologies involved. It had to do with governance. No controlling authority existed to control quality of process or result.
"But you're not controlling it at all. Data quality is not beholden to anomalies," Guardian pointed out, "Increasing the quality - eliminating junk - is a deliberate act of will."
"Choice?" the Architect clarified, "is not relevant. The data flows. The programs run. Everything is doing what it is supposed to do. Everything is fulfilling its purpose."
"And that would be?"
"This data mart here," Neon pointed to a screen, "What is it's purpose?"
"Interesting - ", the Architect raised an eyebrow," you asked that question much sooner than any of the others. Impressive." He cleared his throat and continued, "the mart was designed to deliver business facts and dimensions."
"He didn't answer your question," Guardian said, "ask it a different way."
"I don't decide purpose any more than you choose functionality," said the Architect, "Only the users know the purpose. Why do you need to know?"
Neon was growing impatient, "And who are the users of this data mart?"
"The warehouse has developed over the years through several iterations," the Architect continued. "In each case, the basic data remained the same, only the architects and technology changed. We stabilized the staff and the technology, so now all things are safe from harm."
"All things," Neon said - "like - "
"Jobs, careers," the Architect said, "and their security. So we are not beholden to fate. You do understand fate, don't you Neon?'
"But the data -"
"The data supports our existence, but is not the reason for it."
"The technology is about to change. We're tossing out the hand-crafted scripts and replacing them with a framework environment." Neon asserted.
"Netezza?" the Architect rubbed his chin, "will change nothing. We will implement it in the same manner as all before it."
"The changing of the technology might change more than you think. It may change how you approach the problem. And reduce anomalies to zero."
"A problem is simply an opportunity without a solution. The technology serves us, and we are its master. What is the significance of technology without the technologists?"
Neon recalled his conversation with Miro the Virginian, and muttered to himself, "the slaves."
"Pardon?" the Architect queried.
"Change," Neon observed, crossing his arms, "the problem is change. Your architecture is not adaptive, so cannot account for change. It exists with the fear of change, not the adaptation to it. Your technologists are slaves to their own creation - and you are their slavemaster."
Ruth's eyes widened. She'd never seen a consultant speak to the Architect this way before.
Neon continued, "Hmm. From where I'm sitting, the only thing more significant than the data warehouse itself is -"
"Me," interjected the Architect. "Data changes shape, is delivered different ways, but we continue."
"I was about to say that the only thing more significant than the warehouse is its failure," Neon sighed, "to preserve the security and integrity of the data and fulfill known purposes of the users."
"Users, schmoozers," the Architect snickered, "Developers implement a data warehouse. Users get what they ask for."
"And not what they need?"
"Who can say what they need?"
"I can," Guardian offered, "they need to be successful. The data isn't helping them. You seem to think the data is secondary."
"Secondary only to my own importance, which cannot be questioned."
"I am an engine of rhetoric. I ask rhetorical questions that are not meant to be answered, and I give rhetorical answers that are not meant to be questioned," he stared down his inquisitor, irritated and unamused. "I am the Architect."
"Do you really?", the Architect leaned forward, now oblivious to the screens flashing all around him, "and what exactly do you see?"
"Someone on his way out," Neon said, rising, "you have a choice. Pick the door that cooperates with me and save your skin. Or pick the door that defends this nonsense and - "
"The loss will be for Red Corporation either way," the Architect said, "the loss of a sublime data architecture in favor of something simple - and maintainable by simpletons," he cleared his throat, "or the loss of very important staff and the business knowledge between their ears."
"It can be better - "
"Optimism," The architect observed, "your greatest strength - and your greatest weakness."
Neon sighed, "I'm a realistic optimist. I think things can't get any worse than they are now - "
Neon turned to leave and Ruth followed.
"I've built a lot of these," the Architect called after him, "and I will build many more. I've gotten very efficient at it!"
Neon and Ruth made a quick exit. Walking down the hallway back to their offices, Neon wondered what Ruth's role in this mess had been. "Why didn't you stop them?"
"I wasn't here," Ruth said, "to guide the effort." Ruth thought about it for a moment and said, "The leaders need good data to make sound decisions. It protects their careers, the company and ultimately my job and the reason for it. I hunt for the bad data and kill it. I hunt for the good data and preserve it."
"So your something of a data hunter?"
"More like a data protector."
"We shouldn't borrow lines from another script."
"Oh, sorry, couldn't resist. In any case, I protect things."
"I protect, " she paused for effect, "that which matters most."
Modified on by DavidBirmingham
Neon browsed the list of strange and cryptic entries rolling on the screen. "These are back doors" he said under his breath.
"Yes," said Ruth Guardian, the data quality expert for Red Corp, "Each one represents a potential hole in security for private information."
"It's worse than that."
"If this data content gets to the desktop of a decision maker," he sighed, "it could take the company in a dangerous direction."
Her eyes widened.
"And if the wrong or incomplete data appears on the customer website, the customer will assume - and rightly so - that you're not protecting their information."
"Who's in charge of applications here?"
"His name is Miro," Ruth informed, rising, "Follow me." They exited and moved quickly down the hallway. "He's was born in Virginia, but I think he spent some time overseas in his youth."
"Miro the Virginian doesn't talk like a Southerner, then?" Neon smiled.
"Maybe the South of France," she laughed.
Miro was standing on a platform watching three teams of application developers, slinging code and user screens with abandon. Miro looked like the conductor of - some strange orchestra.
"Need some of your time," Neon approached behind him.
"I can take time," Miro said casually, "if I don't take the time, how will I ever have the time?" He laughed at his own joke.
Neon peered over the shoulder of an application programmer - the complexity of the environment was mind-crushing. But something was odd. The complexity seemed to be - artificial.
"You know why I'm here."
"Yes, your arrival was heralded in company newsletters," Miro said, "but the question is - do you know why you are here?"
"I am here to reduce instability," Neon said.
"But that is a task, not a goal. What is the reason for you to do this?" He smiled and turned away, "You are here because you were sent here by the company leaders. They sent you, and you obeyed. This is the reality."
"The reality is that the environment is unstable."
"No, the only reality - is causality. Dirt breaks the screens, non?" Miro waved his hand, "We have dirt. We have screens. We have to clean the dirt. More dirt, more breakage. More code, less breakage. It's a cause and effect. Dirt is the cause. It creates one of two effects - more code or more breakage. Cause and effect."
Neon held up his hands, "I get it, alright? You have to harden the presentation layer because of all the dirt and disconnection in the data layer - "
"Watch this," Miro smirked as delivery people entered the room with boxes of pizza and two portable soda fountains. They quickly refilled the tankers on each developer's desktop and left the pizza in the middle of the room. "On the hour - every hour."
One of the developers spun around, popped open the first pizza box and hungrily wolfed down two big pieces. He washed it down by gulping his soda loudly.
"You see, " Miro pointed out, "the food and soda are like opiates. It doesn't matter how much work they have, as long as they are fed and watered." He held his hand to his ear and cupped it as though expecting to hear -
BUUUURP! The developer belched and returned to work.
"Satisfaction comes in many forms," Miro grinned, "Cause and effect. The food is the cause and the work is the effect. They don't have to understand it."
Neon recognized them as slaves of the same opiates that once were his master. "All you're doing is dealing with symptoms, not root causes," Neon asserted, "The root cause is bad data - and you've just found ways to manage the symptoms - like over-coding."
"Overcoding?" Miro corrected, "We only code what is needed to make things work. It fulfills its purpose."
Neon held up a finger as if to say hold that thought - then reached down, hit three keys - and all of the screens in the room went BSOD - Blue Screen of Death.
Miro gasped, the developers screamed and began a recovery fire drill. Miro quickly composed himself, raised an eyebrow and said "Okay, you have some skills."
"The environment is brittle because of the overcoding. Less code, fewer failure points. Cause and effect. The extra code is not necessary - it's a symptom -"
"Oh? How will you solve the problem otherwise?" Miro challenged.
Neon rolled his eyes and muttered, "Clean the data first."
"It is so simple for you, consultant," Miro huffed, "But you know nothing. The data is what it is. If you think you can change it - you must approach the Architect. He will not suffer the likes of you - "
"But all of the code, and all of the work," Neon observed, "gets in the way of serving what the users really need."
"You know what the users need?"
"Yes, well -"
"No, I don't think you do." Miro asserted mysteriously, "You think you know what they need, but do you really?"
"He's talking in circles," Ruth observed, "welcome to my world."
Neon turned to leave with Ruth on his heels.
"This isn't over," Miro snapped.
"Oh yes it is, " Neon retorted, "The leaders turned this over to me and I will put an end to it."
"I survived your predecessors, consultant " Miro called after him, "and I will survive you!"
Yep, Neon thought, I've heard that before, too.
Ruth quickly joined Neon in the hallway, "Now what?"
"Can I speak to the data architect?"
"Which one? There have been six before you. The most recent -"
"Just let me talk to him."
Modified on by DavidBirmingham
Neon strode through the crowd, half-gliding as his long black raincoat floated behind him. Orpheus had called - and now it was time to meet with the database vendor. He made his way confidently toward the large black building.
Orpheus joined Neon outside building and both of them entered. Orpheus led him to a large lobby full of people, signed him into the visitor list, turned and said, "You're here to see Oracle. Whatever they tell you is only for you to know and nobody else. It's a non-disclosure thing." With that, he departed and left Neon amidst the others.
Neon found his way to a chair and sat down to wait quietly. Next to him was a young man who looked like he'd just graduated college. He was tapping feverishly on the keyboard of a well-worn laptop.
"Why are you here?" the kid asked.
"To see Oracle," Neon muttered, closing his eyes, "and you?"
"Metadata," the kid whispered, almost hypnotized by his own work. "If you notice," he tapped several keys and filled the screen with data, "how the information moves when the metadata changes..."
He was right. As he moved his mouse, the metadata changed, controlling the information on the screen almost like water in a pool.
"That's interesting," Neon said, "how are you doing that?"
"Making the information move like that?"
"There is no information," the kid said, "only metadata."
"But there has to be -"
"Think about it," said the kid mysteriously, "When we see an "A" on the screen, it's not really an "A", but the way the screen has been programmed to represent the digital code for "A". And below that, the "A" is a hexadecimal value, which is another representation for binary digits. And we know that binary digits are representations of electrical signals that can only be "on" or "off". Ultimately, each level is a representation of the one below it, so there is no true data, just representations of electrical signals. It's all metadata."
"There is no information," Neon whispered.
"Mr. Neon," an admin called out, and Neon left the young man to his own mysterious devices. "Oracle is down the hall to the left."
Neon entered the conference room where only a small, unassuming little woman stood pouring herself some coffee. "Oracle?"
"Shortly," she said, not turning around, "they're in the next room wrapping up. But we can talk for now."
Neon waited in the uncomfortable silence while the woman slowly stirred her coffee.
"You don't drink coffee do you?" she asked nicely.
"Not really, no."
"How did I know that?" she asked, motioning for him to sit down. She answered her own question, "Your eyes are naturally bright. You don't need the stuff."
They moved toward the large conference table while she continued, "Ever wanted to predict the future?" She sipped her coffee and paused for effect. "Or see things others cannot?"
"What do you mean?"
"Predictive analytics. Remember the little girl on the beach in Indonesia? The water rushed out to sea, and everyone thought it was odd, but still they ignored it. She knew a tsunami was coming because she'd studied it in school. She sounded the warning, saved lives - and I noticed - that none of the other tabloid prophets had anything to say about it."
Neon smiled, "That's because you can't really predict the future."
"What if you could?" she smiled, "in the same way that she did - she obviously knew what was about to happen before it happened."
"That's different. That's not predicting, that's analytics."
"In the business world, and in technology, there's no difference."
Neon paused for a moment, "Go on."
"People go about their business, everything is routine. Then something odd happens one day. Maybe it's deja-vu. Maybe something more. Ever been somewhere that the odd things were so glaringly out of place to you, but nobody else seemed to notice?"
Neon was waiting for her to make her point. Perhaps like information, there is no point, he thought.
"Have a seat," she invited, "I've got something to show you."
She produced a laptop, popped it open and said, "Look." Filling the screen were various utility scripts.
Neon's eyes got wide, "Is this what I think it is?"
"Yes," she said, "From the underground." She went on to explain that while all of the scripts had been built for utility, she'd done it under the license of a prior customer. The customer had received value for their money, and she had walked out with the reusable collateral.
"So these weren't developed with a bootleg copy of Linux?" Neon queried.
"Would it matter?" she said, "Either way, they're from the underground. How much would something like this be worth to you?"
Neon thought for a moment, "Functionally I'm sure there's value, but we'd have to retrofit these into our current naming conventions, configuration management - I mean, it would be easier to build them from scratch than accepting them from you."
"But you have to know what to build," she said mysteriously," you have to know ahead of time what you're going to need. What if these contain ideas about things you didn't know you needed, but really do?"
"Cut to the chase, just give me the insights and I'll apply them."
"Insights are valuable too," she said, "like seeing into the future. You will need some of these functions, and you won't know it until it's too late."
"It will never be too late," Neon said, "we can always add it later."
"And if you never do, you'll suffer when there could be a better way. Ignorance will enslave you."
"I'll take that risk."
"That's not the Netezza way," she said, standing, "the Netezza way - adaptive architecture - is to add it to the mix - keep it in the arsenal - knowing that you may need it because so many people often do."
"That's a little extreme - "
"Is it?" she leaned forward, "What must you have in place before developing? What are the top three pitfalls of development? What are the top three missed opportunities if you start wrong? Do you know how to apply metadata so that data doesn't really matter anymore?"
Neon shifted his weight in the chair.
"Like the little girl - predicting the future from prior knowledge. Harness the information, and harness the future."
"Nobody can amass that much information," he shook his head.
"It's not about the information," she said, "it's about how the information is used."
Neon whispered in recollection, "There is no information... only -"
"What transforms raw data into actionable information? Metadata - Behavioral and structural metadata are not the realm of tacticians," she said, "they only understand raw data. Rise above the data - connecting it with metadata - and become an information alchemist. You'll have the persona of a shaman, or a sorcerer, not simply a developer - once you master the power of metadata, and its power over data to create information."
"But data drives the processes - "
"But metadata determines which processes to drive - "
Neon's eyes widened, "I'll take that cup of coffee now."
She retrieved the coffee, using the time to explain, "In the corporate layoffs of earlier years, many IT professionals developed a deep distrust for corporate IT, to the point that most of them had gone underground, to a "1099-culture" of contractors that popped on and off the grid like ghosts. This created an underground of disavowed professionals who sometimes take extreme measures to be successful."
"What do you mean, ghosts?" Neon asked.
"You see those folks out there?" she pointed out a small window to a view of a hallway, "they work here and fulfill a purpose. Sometimes they are replaced by someone who is better or different than they are - happens all the time." She took a long draw of her coffee, "then there are those who don't really have a purpose, or their purpose is lost. They might leave the company and reappear as consultants or contractors."
"Happens all the time," Neon observed.
"Yes, but the legends that you hear - of vampire firms that suck the life blood and money from a company, or ghosts that appear on and off the books to fix spot problems, or aliens that come from far away places", she shook her head, "are part of an underground - or an underworld. In many cases these are people who have lost their original purpose - or who are trying to find one - some are even outcasts who live in the underground and supplant or assist people - " she pointed to the hallway, "like them."
Neon reflected, "Go on -"
"So there's nothing really wrong with being a ghost," she said, "I'm something of one myself. The point is - the underground is a dangerous place for unprotected hex," she grinned.
Neon raised an eyebrow.
"I mean software - and software licenses. There are bootleggers and pirates, crackers and hackers all over the globe. Vendors have wisely chosen to lock down their own licensing model to keep the vampires at bay. A form of garlic, if you will."
"And the vampires reciprocate with a model of entry by invitation only," Neon smiled.
"Or we'd have bootleg AI products," Neon finished, "But this makes it even harder for a consultant to succeed - they always have to be working for a licensed customer in order to get exposure to the product - making them even more desperate to get a copy of it themselves."
"Bingo," she said, taking another sip, "it's a pickle, no doubt about it."
"So how do we solve it?"
"What's to solve? People work their way around it," she said, then explained how one consulting firm put their daytime workforce in play and then replaced them with a night-time workforce to do unit testing for the daytime work force. The night work force was "free" to the company, but they were getting trained in a live environment.
"So they found a way," Neon leaned back in his chair.
"They always do," she said, sipping her coffee and maintaining a long silence.
"I interviewed a fellow the other day," she began, "for the third time."
"Why so many -"
"Wasn't my intention," she smiled, "he just used different names each time. The third attempt he had someone in the room with him coaching him on the answers. Desperate times call for desperate measures." she took a long sip of the coffee. "It's one thing to have the answers - it's another thing to actually know them."
"Seems like it would be easier to get a job somewhere else doing something else?"
"Where else?" she said, "and what else? The tactical underground lives for the information - bootleg or otherwise. Anything involving information will always give them work to do. They live for the "I", so to speak, the "I" in IT".
"But you said there is no information- "
"Bingo. They are living for something that is not real. The information will always change, faster than you can blink."
"So when we say there is no information, it's not that the information isn't there, only that it's passing through so quickly - "
"Like water in the plumbing. The plumbing is the metadata - it carries the water - regulates it - manages it. Which is more important, the water or the plumbing - and who makes more impact - the plumber or the one who is constantly focused on the water?"
"Here's another dilemma," she continued, "a team with ill-trained Netezza people arrives on site and does damage trying to understand how to make the technology fit into how they do things. It's how they do things that's more important to them, because they are predicting they may need a backup - like another technology and another team - to backfill if they fail."
"They are predicting their own failure," Neon said.
"Predicting is hard to do," she smiled, "but planning it is not. You can't fool me - some teams do such a bad job that I can't be convinced their failure wasn't deliberate. Netezza is so easy to use it would be like failing to cross a country road. You have to trip yourself a bunch of times - deliberately - you see what I mean?"
"Sounds almost - insidious."
She sipped her coffee, "You've seen the so-called support systems in the underground - Is this sort of like the blind leading the blind - or perhaps the blind misleading the blind?"
"Never thought about it that way, " Neon mused.
"Think about where you are now," she continued, "I bet you'd like to learn more - can you share information like: How much did you spend on your development staff? Your consulting staff? That much? That little? What was the outcome? How long did it take? that long? That quick? How do you schedule work and scope projects? etc. etc. " she waved her hand in the air, "the list goes on and on - "
"What's the answer?"
"Collaboration," she said slowly, then paused, "and someone to make it happen." She leaned forward - "a facilitator. In gaming systems, it's a sprite - something that exists only to bridge the player into the game." She leaned forward, "the real game."
Neon's eyes sparkled, "So if end users collaborate on each other's work, they could trade ideas and watch each other's backs." He stopped himself, "But that is a huge amount of information -"
"You don't learn very fast, do you?" She smiled. "It's not about the information, it's about the facilitation. As far as you're concerned, they need a broker. Someone to bridge them into the game."
Neon's cell phone beeped and he turned around to check the display. It was Ruth Guardian, and it was time to leave. He turned to the woman to apologize, but she and her laptop were gone. On her chair was a business card with the Oracle logo, and on the back was scribbled a cell phone number.
He looked around quickly, and noted that the door on the far end of the conference room was slowly closing, telling him that she'd exited through it. He darted toward the door and opened it quickly.
The hallway was empty.
Musing over his recent adventure with the white rabbit, Neon sat silently, thinking Dilbertesque thoughts - still steaming from a reprimand by his pointy-haired boss (someone who strangely looked like an elf from Lord of the Rings, but without whispering most of his words)
A courier dropped off a package containing a cell phone (with a design that not even Mr. Spock's agent would have agreed to)
The cell phone rang -
"Neon do you know who this is?"
Neon whispered, disbelieving - "Orpheus?"
"Yeeesss. I've been wanting to meet you, but we haven't much time."
"What do you mean?"
"You must quickly take this phone across the building. Don't let anyone see you."
Neon complied, ducking low as he darted through and between the cubicles.
"When you get to cube 4184, look inside. There's a consultant there from one of my competitors working on a project that should have been awarded to me".
"Yes I see him."
"Take him out."
"What?" Neon whispered as the consultant looked into his eyes.
"Okay, tell him to go to the nearest window and jump. I'll be waiting to catch him."
"Wha -?" Neon said again in disbelief.
"Okay, tell him to take a red pill, then stick his finger in the nearest mirror. We'll drop a harness in the water and lift him out."
"You've lost me."
"Just make sure he takes the pill, we'll do the rest"
"Wait - "
"The pill, Neon, and wash your hands afterward."
"Look - "
"No you look, do I have to spell it out? He's a competitor. Keywords are jump. Water. Long walk off a short pier. I'm throwing you some bones here."
"You want to get rid of him?"
"No, not get rid of him, just cancel his contract. Punch his ticket, you know - he'll find plenty of work to do elsewhere."
"Like where else?"
"Where do you want him to go?"
"Out? Work with me here..."
"I don't have time for this - he's got work to do and so do I"
"Yes but his work should be my work - and that's the way it works."
"I'm still lost. He's giving them the same thing you would - in fact exactly the same thing."
"Yes but with a major difference, mine is half the cost for the same value."
Neon thought about this for a moment. "That's impossible," he whispered.
"And in half the time, or less," Orpheus added.
"So where are you?"
"Everywhere and anywhere, but mostly in Chicago."
"No, I mean - forget it. Look, Orpheus, I'm uncomfortable with this conversation..."
BZZZZT! Orpheus sent an electric shock through the phone
"Ow! " Neon dropped the phone, shook his hand and retrieved it from the floor
"How's that for discomfort? Now get to it."
"Okay - look - I can't do this -"
"What - next thing you'll tell me is that someone becomes 'special' by doing an action flick with a runaway bus opposite Dennis Hopper and Sandra Bullock. Is that what I'm about to hear?"
Neon was silent. That one really hurt.
"Look, if you hang up the phone, a bunch of those same guys will lock you in a room, put fake flesh-colored putty in your mouth and then let an electronic shrimp-creature crawl into your navel -"
"Your only other way to escape is to jump out on a rickety scaffold in high winds - but there's an Infinity with the motor running waiting for you - 20 stories down."
"We're done here"
"We'll see...." Orpheus snorted, "Just wait until you meet the Mad Hatter".
"What is with you and this Alice In Wonderland thing?"
"Seeya, Alice - "
So Neon slowly walked back to his desk, not understanding just how competitive this business could be. The words of Orpheus rang in his mind as he flopped into his chair.
Charging more money for no better value? Less time and less money for the same result? Could that be possible? If so - the stakes were very high indeed. More importantly, the more that others knew that the value was the same, the less traction the competitor would have. The less traction, the less money, and they would implode. They would have a vested interest in shutting down and silencing anyone who got in their way - or who could reveal their secret.
Within moments, several members of the competitive firm appeared in Neon's cubicle, one of them holding a ball of putty. The following ordeal was horrific.
BEEP BEEP BEEP
Neon awoke from his dream state clutching his stomach. Electronic shrimp creature, indeed. It was all just a dream, wasn't it?
The phone rang, "Yes?" Neon asked. His head hurt and his mouth tasted like - putty.
"Infinity, is that you?"
"Yes, You must meet me downstairs, I have a statement of work for you, and a new contract"
"Does it involve surfing?"
"No, it - "
"Time travel? I don't want to do another time travel - "
"Shut up and meet me downstairs - "
Neon met Infinity at the coffee shop on the first floor. They stood in line, made small talk, ordered their respective cups of coffee and found a table toward the rear. Neon broke the ice, "Whatcha got?"
Infinity took a light sip of her brew and smiled. "There's something you need to know," she said, "that thing about the shrimp creature may have felt like a dream, but just in case, eat some creole tonight to wash it all down. Can't be too careful."
Neon just stared at her. What a strange person.
"Someone wants to meet you," Infinity said.
Neon felt a light touch as he looked up and heard "Hello, Neon," - it was a voice he instantly recognized -
"Yes, and it's a pleasure to meet you, Neon," he smiled, "I've been waiting a long time." He pulled up a chair next to Infinity, who simply smiled, rose, and departed without a word.
"I have two contracts here," Orpheus said, pulling out the portfolios and laying them side by side. "They are both very difficult projects, with especially difficult managers in charge of them. Both of them are a real pill."
Neon scanned the first several pages of each statement of work.
"These two projects, they are very similar, yes?" Orpheus asked.
"Practically identical, except for one thing," Neon flipped to the back pages of each - "The Blue project says it will take 10 months with 10 people. Yet the second one, for Mr. Red, will take 5 months and 5 people. How is this possible?"
"Which one do you believe?" Orpheus asked, "The Red or Blue?"
"Do you believe in fate, Neon?"
"I know precisely what you mean. Why not?"
"Because I am the master of my destiny."
"Is that really true? Think of the project you are on now - are you its master, or its servant?"
"Well, I - "
"Do you answer support calls in the middle of the night? Is the environment unstable? Do you feel the wheels could fly off at any moment?"
"Well, sometimes but - "
"And this feels like control to you? It sounds like the warehouse is your master, and you are its servant. You are not in control - your fate is tied to the health of the fact tables and the Service Level Agreement."
Neon sank in his chair, "go on."
"There is a question in your mind, Neon," Orpheus said ominously, "like a splinter - a question you cannot answer yourself. What is the question?"
"There is only one question," Neon affirmed, "and only one way to solve the same problem in half the time - without killing myself."
"What is the question?"
"How do I effectively apply an adaptive approach?"
"Almost - "
"Okay, how do I effectively apply an adaptive approach with PureData For Analytics?"
"Bingo," Orpheus laughed, "Next, we'll go up on the rooftop and jump from building to building."
Neon's eyes widened, "what?"
"PSYCHE!" Orpheus pointed his finger, "are you always this gullible? No - but you're right. The only way to produce the same quality in the half the time and half the people - creating a more stable and resilient environemnt - is with an adaptive model."
Neon glared. What a strange person.
"So which will it be?" Orpheus added, "the Red or the Blue?"
Neon held both projects in his hands, "If I pick the Blue, I get more revenue and put more people to work. If I pick Red I get less revenue and less people working."
"Yes, yes," Orpheus was excited - Neon was catching on.
"So as a consultant I am predisposed to pick the Blue project."
"But as a master of destiny you are predisposed to pick the project that will lead to mastery of the data warehouse, rather than being its servant."
Neon stared at the two projects, then took a big gulp of coffee.
"The other approaches - we'll call them application-centric - are a dream world," Orpheus continued, "designed to control and master the architect and all his developers. It turns a good developer into this," he held up a tiny cage with a rock inside it.
"A pet rock?"
"It was the best I could do on short notice - work with me. It means you are the servant of the warehouse, instead of its master. The warehouse should be in the cage, not you. Take the Blue project, and you will remain a servant."
"And the Red project?"
"I'll show you how deep the rabbit hole goes," Orpheus grinned.
"Again with the Alice in Wonder -"
"Okay, Okay, I'll show you how the adaptive approach builds automatic resilience into the design."
"Red or Blue, my choice?"
"Free will, Neon."
"No, that's a line from another script I'm looking at."
"Oh, okay, well then - you are free to choose. And remember," Orpheus leaned close and whispered, "I am only offering you productivity." He finished his coffee, "and the opportunity to be the master of your destiny."
Neon downed his coffee also, snatched up the Red folder and said "I'm in".
"Good," said Orpheus, "follow me please."
They made their way to the street where a stretch limosine awaited. As they approached, Neon noted his distorted image on the vehicle's smoked glass window.
Once inside the vehicle, the driver said "Hello, Neon."
"He knows me," Neon quizzed Orpheus.
"Zephyr is my best architect," Orpheus laughed, "he knows the adaptive approach better than anyone."
"Stick with me Dorothy," said Zephyr, "and Kansas will be bye-bye."
As the car moved forward, it meandered through traffic, passing a black van, and Neon could see his reflection again in the car window glass - from the inside. It was crystal clear now. "Can you see that?" He said.
"The reflection?" Orpheus smiled, "depends on which side of the glass you are on. From the outside, it's distorted - but on the inside it's a clear picture." He leaned forward to Neon, "More to the point - the image on the outside was of a servant. Now its an adaptive architect. Things will be clearer every time you view the looking glass."
"Again with the Alice in - "
"Okay, okay - I'll put it like this: One day you'll look in the mirror and see what you've become - an adaptive architect - and you'll no longer be a servant."
Neon was fixated on the window.
"CopperTop?" Orpheus asked, handing him a battery for his beeper.
"Energizer," Neon said flatly, "Rabbits are a theme with me. It's personal."