The long-awaited "SQL" to Netezza Underground has hit Amazon.Com. Netezza Transformation
This book tackles some deeper issues around transforming our data warehouse, our approaches and even our thinking to align with the arrival of our brand-new appliance
No, it's not the appliance we'll transform, but that the appliance will transform us. Once again, a little tongue-in-cheek irony
The book offers working examples of the stuff people ask me about most often, like ELT/SQL-Transforms, checkpointing, exception handling (primary/foreign key), windowing (SCDs and deduplication) as well as a cookbook on more details to watch for in a migration project.
And of course, is replete with Case Study Short and a whole chapter on Case Studies. There's also an appendix at the end to offer up some simple scripting jump-start routines that can make bash so much easier.
At far back as the 2007 Enzee Universe conference in Boston, people have scratched their head on the notion of a large-scale database that actually boasts about the absence of index structures. After all, indexes are the mainstay of relational databases and we simply can't get by without them, right? This is a simple example of how the power and architecture of the technology frees us to think about data loading and storage, data retrieval and in-the-box processing in a completely different way.
Firstly, it's no rumor that Netezza has no indexes. And for those of us who can't stand dealing with them, this is a huge plus. One of the Enzees at the 2011 conference asked point-blank - "What does Netezza have that Oracle does not?" and the clear answers will arise in this and later blog entries, but for now the absence of maintenance of index structures, will do just fine. What does Netezza have?
Or rather, the vendor has spent enormous thought labor into simplification of complex matters so that we don't have to deal with them. We can get to the business of - well - the business. The application data and information that we want to spend more time with if only we weren't dealing with the technical administration of the database engine. This is one of the greatest weaknesses of the traditional relational database when it comes to data warehousing in general and bulk processing in particular. It is also one of their greatest weaknesses in the space of analytics. After all, we want to apply the analytics by selecting subject areas and data sets to analyze, but if we have to stand up lots of infrastructure to support this mission we are regressing toward functional hardening rather than functional flexibility. Brittleness ensues and someone asks for a refactor, a re-architecture or whatever. When this request arrives in relation to a complex, engineered, multi-terabyte (or say tens-of-terabytes) data store, many will see it as a mind-numbing proposal. For most infrastructures, the weight and complexity of the installation has overwhelmed the logistical capacity of the humans to ever hope reeling it back in. There's not enough power in the machine to help. This could take a very long time to remediate.
Not so with an appliance. Such big-data issues are its bread-and-butter zone. If we have issues with a particular data model or information construct, the machine has the power (and then some) to get us out of the bad direction and into the new one. How did the direction get "bad" in the first place? The business changed its information priorities since the inception of the original model, and now the original no longer serves it well. The business has moved, because that's what businesses do. The analysts found opportunities in the nuggets of information-gold, and now want direct access to that gold rather than having to navigate to it or salivate while waiting for it.
One of the core features of the Netezza platform is how the data is distributed on the disk drives. Because it is an MPP, each of the disk drives is its own self-contained, shared-nothing hardware structure. Contrast this to the common SMP-based platform where the CPUs exist in one duck pond and the disk drives exist in another duck pond. The data is then couriered between duck ponds through throttled estuaries. If we issue a query, all of the data has to be pulled through this estuary and presented to the CPU ducks so that the data can be manipulated and examined in software. It is a software engine on a general-purpose platform.
However, imagine the Netezza platform where each Intel CPU is mated with special hardware (an FPGA), a bounty of RAM to manipulate and cache data, a self-contained Linux operating system, and its own dedicated disk drive. Imagine also that the disk drive is itself divided into multiple sections, where the inner sections are used for internal processes like RAID, temp space and system data but the outermost ring is where user data is stored, offering the benefit of the fastest disk speed for the oft-accessed information. All of these little attention-to-detail aspect cause each of the CPUs to run at a much more powerful factor than their common SMP counterparts. Why? Because their data is co-located with the CPU, where with the SMP we have to drag the data out of one duck pond to get it close to the CPUs that may operate on it. And with SMP there is no such thing as dedicated CPU-to-disk access. The SMP CPUs are shared-everything, but also the disk drives - shared-everything at the hardware level but shared-nothing at the functional-logical level. Netezza's CPUs are shared-nothing at the functional-logical level and shared-nothing at the CPU/disk level.
In Netezza, let's imagine putting 100 of these CPU/disk combinations to work. When we form a table on the host, the table exists logically in the catalog in one place. But it exists physically in 100 places. If we load 100 million records into the machine distributed randomly, then each of the CPUs will "own" 1 million of those records. If we issue a query to the table, each of the CPUs will operate only on the million records it can see. So for any given query, we are not serially scanning 100M records. We are scanning 1M records 100 times.
Now some may object, that we're still scanning the data end-to-end, and for plain-vanilla queries, this is true. However, I recall the first time I performed a join on two tables that were 100M records each doing a one-to-one join on unique keys, on a machine with 24 of these CPUs and the answer returned in 13 seconds. This was a plain vanilla test, so your actual mileage may vary. However, I did the same test on the same machine with 1B records joined to 1B records and the answer returned in less than 300 seconds. Imagine attempting this kind of join on a traditional relational database and expecting a return in any time to actually use the result. (and no, I did not use co-location for this test - a matter for another blog entry)
We have an additional "for free" aspect of the machine called a zone map. If data is laid down in contiguous form (the way transactions arrive in a retail point-of-sale model for example) then the machine will automatically track these blocks and keep their locations in a "zone map". If we then query the database with a given date, the machine will ignore the places where it knows the data is not, and use the zone maps to locate the range of data where it needs to search.
As as example of this, we know of a 150-billion-row table that is over 25 terabytes in size, distributed and zoned such that over 95 percent of its queries return in 5 seconds or less. For a traditional RDBMS, it would take more time than that just to get its index scans squared away so that it could even approach the storage table. This is also why the Netezza machine itself can be scaled into the petabyte zone while maintaining the same performance. No indexes are in the way to load it. No indexes are necessary to search it. Imagine now: the two most daunting aspects of data warehousing and analytics - loading the data and keeping up with the user base - have been washed away with the elimination of indexes. (Don't we turn indexes off in a traditional SMP database so it will load faster, and don't we chase a rainbow trying to index and re-index the tables to keep up with user needs?) Not with Netezza. Without indexes on the columns, all columns are fair game for searching. Without indexes in the way for loading, we can deliver information into the machine, a reprocess it while there, with no penalty from the use of indexes.
By this measure, this is an anti-indexing strategy because Netezza operates on the basic principle of where-not-to-look. In other words, if we can tell it where the data is not, it can zone-in and find the data. When we think about it, this is how a common brick-and-mortar warehouse works. If we showed up and asked for a box of nails, the attendant knows that he doesn't have to look in the parts of the warehouse that carry lawn equipment, hammers, saws or window draperies. He knows where the nails don't exist.
Contrast this to the common SMP-based indexed database, which uses exactly the opposite approach. The indexes are searched for the specific key and then this key is applied to the primary data store. This is why indexed structures in general cannot scale in the same manner as an anti-indexed structure. Keep in mind, with a Netezza platform it won't matter how much data we put on that 25-terabyte table. We could double, triple it or more - and it would always provide the answer in a consistent time frame. This is because from query-to-query - it's still not-looking in the same places, no matter how big those places get. It will still continue to ignore them because the data's not there and it already knows that.
I've had people tell me that there is no real difference in the SMP versus the MPP. The SMP, "properly configured" (they claim) is just as good as any old MPP. However, there is no "proper" way to configure a general purpose hardware architecture so that it will scale. The only way we could hope to coordinate these CPUs is in software, and the only way we can get data into the CPUs is by accessing the shared disk drives in another duck pond on completely different hardware. The SMP configuration is by definition the wrong configuration for scalable bulk processing of any kind, so there is no way to "properly configure" something that already the inappropriate storage mechanism. This would be no different than claiming that a "properly configured" VW Bug (Herbie notwithstanding) could be just as fast as a stock car. The VW Bug is not the wrong platform for general purpose transportation. But it's the wrong platform for model requiring high-scale and high-performance, just an an SMP-based RDBMS cannot scale with the same factor (for set-based bulk processing) as a machine (Netezza) specifically built for this mission. Only a purpose-built machine can ultimately harness this level of data, and only an appliance can remove the otherwise mind-numbing complexity of keeping it scalable.
In the next weeks leading up to IOD (where I will be speaking on most-things Netezza) I'll offer up some additional insights on the internals of the architecture and how it differs from the traditional platforms.
When processing in bulk, leading us to scale (that is, tens of billions of rows) we can trust Netezza to pull this off rather handily. With hardware and architecture in the bag for us, is there anything we can do wrong, or perhaps inefficiently, that would deny us entry to the soaring heights of performance so elusive on other platforms? After all, if the stratosphere is within reach, we want to go there as quickly as possible.
As it turns out, there are some things that Netezza doesn't do well. And all of them are transactional in nature. Not to worry, as we really don't want Netezza to foray into those realms. We can do it faster and simpler in set-based form. In fact, we find that many newbies have to shake off some transactional thinking that in later days they refer to as cobwebs. The trappings of transactional thinking that are artificial constraints, necessary evils to our existence on an SMP-based RDBMS (but are in our rear-view-mirror and rapidly fading once we cross over to Netezza)
Now don't get me wrong, many people will deploy an SMP-based RDBMS for their warehouses and have a smashing time of it for many years, perhaps far into the future with no scalability issues at all. We could even venture to assume that over half of all warehouses rolled out this way will likely never see a capacity hiccup, simply because they are rolled out firmly in the center of the "bell curve". (A little statistics lingo there)
Recall many years ago, in the prob&stat class we may have slept through, that the bell curve gives us the 80-20 rule or something akin to it. It also gives us a another rule, that twenty percent (or so) of warehouses will be at the small end, twenty percent at the top end, and the rest in the middle. So guess what? This puts eighty percent in the middle-of-the-road and not really on the Netezza radar for the moment. But perish the notion that Netezza intends to ignore that market. It plays very well in that zone too. In fact, Netezza can move-and-shake along the entire continuum of the solution spectrum.
But when it enters the upper twenty percent zone, the air starts to get thin. Nitrogen narcolepsy befalls the heroes-of-the-eighty-percent, and they start to fall away very quickly. When we get into the zone of searching, analyzing, crunching and even just summarizing data on the orders of billions, tens of billions, and hundreds of billions of rows, we have a number of additional rules that will impose themselves on our existence. These are no different than the laws of aerodynamics we would use to defeat the laws of gravitation. One set of rules has sway, while another set of rules is used to overcome the effect of the first. The rules we have to deal with are those concerned with scale. If we don't pay attention to scale, down to the lowest level possible, no amount of efficiency in software will bail us out. When it comes to scale, salvation is in the hardware. More importantly, the architecture of the hardware.
I often adjure prospective Enzees that when evaluating the TwinFin against other platforms, avoid the kool-aid softball questions that anyone can answer. When a proof-of-concept tournament is underway, the easy questions all float around in the eighty-percent zone where any of the technologies play in one form or another. Rather take a tough problem, one that only exists in the twenty-percent zone, and drop it in the lap of the tournament players. Take a look at three things: (1) what was the raw performance number, (2) what was the difficulty or complexity of the solution, (3) How many vendor engineers did it require and as a bonus (4) how long did it take the vendor engineers to give you an answer? Minutes, hours, days? How long they took to formulate an answer, and how many of them were required, is second only to the performance of the solution itself. After all, anyone can test-or-evaluate in the eighty-percent zone where everyone is a hero. Get the problem domain into the twenty-percent zone (where we'll be on day-one after the technology is installed) and get those answers now. Wouldn't it be a bit awkward to ask these questions when it's too late to do anything about it?
I have mentioned in other essays that the Netezza architecture, and its deep attention to details-of-scale are what come alongside us as our ally when the other heroes-of-eighty-percent simply gasp for air. One of these is the architecture of the disk drives themselves. The drives are of course dedicated to their own CPU, RAM, FPGA and all that, but the layout of the disk platter itself is also intriguing. As an aside, each Netezza disk shares information with another sister drive so that if it should crash, it can be hot-swapped and rebuilt from its sister drive very quickly, easily and automatically. So where is this drive data stored on the sister-drive? As the design would have it, on the inner rim of the sister drive's platter. Otherwise, each drive's outer rim holds user data and the middle ring holds temp space and system data. But we can see the wisdom in this model, that the outer rim is spinning fastest, so delivers the data to the user much faster than the inner rims would.
Does our eighty-percent-hero do this for us? Well of course not. The SMP-based model of Netezza's competitors share their drives like any other SMP device. The odd state of affairs is that they are shared-nothing at the functional level, but shared-everything at the hardware level. This dissipates the strength of the machine. Conversely for Netezza, it's shared-nothing all-the-way-down. This bodes well for query support, because it means every query gets the full-strength and undivided attention from the machine, even for the fractions of a second necessary to complete it, and thus each query is returned on a wave of strength, not a river of dissipation.
For those who want to roll out the eighty-percent models and be happy with it, hey I don't disparage anyone from making a living. Functionally speaking, those warehouses have some of the most advanced features on the planet - in the eighty-percent zone. Netezza's customers by definition have already breached the eighty-percent mark and will never revisit it again.
But true to form, when dealing with a non-scalable platform, our enemies rapidly reduce to size and complexity. That is, the larger the size of the solution, the less functionality we can derive from it. And the larger the complexity of the solution, its required storage capacity is stunted from growth. Guess what happens when we migrate either of these solutions to a Netezza machine? When the complex functionality is ported, it suddenly finds a breath of fresh air and data volumes start to grow, sometimes exponentially. If the data sizes are already too unwieldy and we move to Netezza, the additional capacity allows us to build more functionality which in turn leads to - you guessed it - even more growth of data volumes. The short answer is, when migrating or upgrading either of these animals, the new platform had better be able to immediately scale in orders of magnitude, not just incremental percentages. This is why so many migrations onto the eighty-percent-platforms sputter and fail. The initial migration is successful, but they quickly find themselves out of capacity in short order. This was never expected, so the migration is a bust.
But with Netezza, it has the capacity to take on the migrated functionality and grow with it - without any particular hiccups or worrisome meetings about its abilities. It just hums along, keeps all its promises and never complains.
Okay, so I'm a big fan of the technology. Mostly because it makes a migration easy and makes its customers happy. Who couldn't use a little more of that?
But for those heroes-of-the-eighty-percent, there's a place for them should they choose to don the necessary equipment for high-altitude flight. Of course, there aren't any heroes in the twenty percent zone, considering that the only true hero in that space is the TwinFin. And when flying above the clouds (with any technology), humility dictates that we set aside the hero-way - and we really have to trust the hardware, don't we?
Modified on by DavidBirmingham
In this case, a tortoise-brained hare is an animal that is capable of going fast but wonders why he can't.
Over the last six months or so I have seen a "trending" situation with those who use the Organize-On. This is some pretty cool functionality so it's important that we get it right. Generally speaking, folks who understand how zone-maps work will have a splendid time with Organize. Others, not so much. The machine is supposed to be fast, so why are all my queries so slow?
All I have to do is add the Organize On keys and I'm good to go. Answer: No, you have to groom first, as this actually applies the Organize to the physical data. Response: Really? when we did this it took forever! Answer: Yes, the first groom may take a bit of time but every groom after that will be painless.
We CTAS a table every night for maintenance and the groom is always running slow. Answer: if you are using Organize On, don't use a CTAS for regular maintenance. With an Organize, the table is physically broken apart and spread across multiple additional extent-pages to provide for easier groom maintenance. A CTAS re-compresses it. So another groom is required to blow it back out again.
We like to use Materialized Views to optimize our tables but now the Organize doesn't let us. How can we get this back? Answer: You don't, because the Organize essentially replaces the Materialized View and does it so much better. Asking this question means you might not understand zone maps as well as you think you do! Just sayin'
We have included the distribution key in the Organize, along with some other join keys. Answer: (Sigh) -Remove them. Join-only keys do not belong in the Organize. The Organize is for filter attributes. Uh, perhaps you don't understand zone maps? Just sayin'
We have some lookup tables where we applied an Organize. But this didn't seem to matter. Answer- No it won't because if the table is not big enough the system won't use zone maps anyhow.
We were told to turn off Page-Level zone maps but this caused a 5x reduction in performance. Answer: Page-levels radically reduce disk I/O which is our number-one enemy. Turn them back on and go home happy.
It is clear that our outlying cast of folks who have never been introduced to zone maps think that the Organize is either a clustered-index or some other kind of indexing scheme. Fair enough, is this because of the deceptive naming? Like the Materialized View? (okay, don't get me started). The Organize keys are not keys in the same sense as indexes and are certainly not used as join keys.
They are filter-attributes. By this we mean that we fully intend to use these columns with constant values, lists or selected-lists as constraints.
Zone Map Primer
Here I will use my favorite example because it is time-tested and describes the capability. If I pay a visit to Wal-Mart and I'm looking for batteries, I might visit the customer kiosk and the lady tells me that the batteries are "on the end-cap, Aisle Three". This is an indexing model because she told me exactly where to look. This does not scale because when we get to billions of rows we spend more time searching the index than retrieving records.
In a Netezza model however, she would say "It's on an end-cap but I'm not sure which one." So I go to Aisle 1, then 2, then 3 and jackpot, I buy the batteries and go home. The most important part of her answer however, was in what she "did not say". She did not say that the batteries were in Men's or Women's clothing, Automotive or Electronics. In constraining the search to a particular location, she also told me "where not to look". This aspect of "where not to look" is critical to understanding zone maps. It is also critical to stratospheric scalability. Clearly if this Wal-Mart were to quintuple in size, my battery-buying-duration experience would not change in the least.
Now for the geeky part:
Each Netezza disk has 120K physical extents. Each extent has 3MB of space, divided across 24 pages(blocks) of 128k each. If we are running prior to OS 7.x, we will be zone-mapping at the extent-level. If we are 7.x or higher, we will use the page-level. Run don't walk to 7.x and page-level zone maps! They will radically reduce I/O on the extent and this is critically important. A set of records found only on a single page can return in 1/24th of the time than if it has to scan the full extent (96 percent faster). In a real-world experiment, the zone maps on a Twinfin-18 were changed from page-level to extent-level and the same battery of queries executed against it. It performed 5x to 10x slower than with the page-level turned on. Do not underestimating the influence of disk I/O on query turnaround. Netezza is a very physical machine.
Here's another internal example: Compression in Netezza is a nominal 4x. I have seen it much higher. Let's say we have an uncompressed record that is 10,000 bytes in size. When we read it, we will read all 10,000 bytes. If the data is compressed however, we will read only 2500 bytes, a 75% reduction in disk I/O. Netezza is the only platform where compression boosts the power in both reading and writing data because it reduces disk I/O tremendously.
Zone maps are a table that Netezza holds in memory for supporting a table. Each table has its own set, describing the contents of the extents/pages allocated to the table on a given dataslice. The contents of the zone maps are built automatically when we load up our table. It will collect the information from the records stored on the given extent/page for all integers, dates and timestamps. It will store the high and low value of each, representing the table's record "ranges" for that extent.
Thus when we execute a query using one of these columns as a filter-attribute, it will go to the zone maps first and cherry-pick only those extents containing the data we want, excluding all others. this means that the data on those other zone maps won't even see the light of day for the query-in-play. If we use an Organize, we can use a wider range of data types and be more deliberate with zone-map management.
Let's say we want to search-on the transaction-date on our fact table. If we have not physically organized the data, the same transaction-date value may appear across many hundreds or thousands of extents, affecting the high/low range of many zone maps. If however, we were to physically co-locate the records with common dates into tighter physical groups, the records will physically appear inside fewer zone maps. These are the extents/pages that the machine will cherry-pick for scanning and will completely scan each one. We want it to scan as few of these as possible.
In times past, we had three primary ways of doing this.
A brute-force sort, which is pretty egregious when the table gets very large.
A software program that separates the key's distinct values and executes a block-select from the original table to the new, selecting only once for each distinct value. This physically co-locates like-keys. (the data does not have to be sorted, only like-keys co-located)
A materialized view, which would manufacture a virtual zone map (which is why it's not allowed with the Organize)
Let's say that our table's data takes up 400 extents on each data slice. If our transaction-date appears in 300 of these (it is poorly organized) then when a query runs like so:
Select count(*) from mytable where transaction_date = '2014-01-01';
All 300 extents will be searched for this information. However if we Organize on this transaction-date and then groom - the data will be physically shuffled around to co-locate the records with the same transaction-date on as few extents as possible. Let's further say that once this happens, the given date appears in only one extent. What we have just done is optimized the table 300x. We have eliminated 299 other locations to look for data. This is important because scanning a 3MB extent is a lot of work. If we are scanning 299 additional extents for each dataslice, we're really doing a lot of extra work for nothing. If we translate this to a page-level problem, we may have 24x300 pages originally containing the keys. If with Organize we reduce this to a single page, we have further reduced our scanning load by another 96 percent of the extent containing the page.
The important factor in the example is the "out-in-the-open" value of "2014-01-01". It is a "filter attribute". It is not being joined to a table with this value, Doing it this way means that the FPGA/CPU will discover which zone maps have a high/low boundary that contains this value and will retrieve a candidate list of them, If there is only one as opposed to 300, we have radically reduced out workload. Netezza will literally exclude those extents from being examined at all. We have told the machine where-not-to-look. If we join this value however, such as using a time dimension, applying the date value to time dimensions and joining the time dimension to our fact table, we will require the system to fetch the record in order to examine it, at which time it will determine whether to keep it or toss it. this sort of thing can initiate a full table scan, nullifying the zone map entirely.
We don't want this to happen. We want the data to stay on disk and never see the light of day if it's not participating in the query.
To show how dramatic this can be, we were at one site hosting over 100 billion rows in the primary fact table. A full scan of this table took 8 minutes (which is not too shabby in itself, just sayin' ). The reporting users knew that if a query ever exceeded a few minutes in duration, it was probably ignoring the zone maps. This is because once-Organized, this table would return the average query in sub-second response. Think about that,100 billion rows in sub-second response.
This is why paying attention to zone maps is such a big deal. Optimizing distribution can get us boost in the single-digits (2x, 3x etc) on a given query. Optimizing zone maps can get us 1000x boost and higher.
The Organize-On accepts one or more keys which will be applied to physically co-locate records of like-valued keys, then it will update the zone maps. Here is a test to see if we understand the application:
Take one of your largest tables. CTAS the table to another database, order by the distribution key, or by the hidden "rowid" column to make sure that the given filter key is not ordered. This could take a bit of time, of course. Then perform a query using one of your date parameters as in the example above, and time the query. Now perform an
Alter Table tabname organize on (date column name here).
Then perform groom. Once completed, execute the same query and get a timing. We can see that a several-minute query can go sub-second very easily.
What's more the additional keys in the Organize are (more or less) independently organized. They will all enjoy a much faster turnaround than not using the Organize. If records arrive on the table out-of-order, no worries. Run the groom again. Subsequent runs of groom will always be shorter in duration than the first one.
Clearly however, if the zone-map is intended to apply filter-attributes to guarantee the exclusion of extents/pages completely, we cannot use a join-key. Or at least, not a join-only key. This also means that the distribution key is out (unless we plan to call-up an individual record based on the distribution key). Also, generally do not mix high cardinality keys with low cardinality keys. Netezza finds its strength somewhere in the middle. We will find that it disfavors the low-cardinality keys when high-cardinality ones are in-the-mix.
A good way to tell which of the filter attributes for a table are "high-traffic" is to turn on query history and then examine the table/view associated with "column access statistics - $vhist_column_access_stats. This will provide the number of times the column participated in a query and with which table(s) the base table interacted with. Perform a descending sort on the NUM_WHERE column and this will reveal all. In this short list we will see filter-attributes that are most useful. Don't use any of the join-only columns or the distribution key. These may adversely affect the multi-key algorithm's output and might not optimize any zone maps for this table.
At one site, we noted several inappropriate keys in the Organize, and simply by removing them and "grooming" again, the table experienced a 100x boost. The inappropriate keys were washing out the effectiveness of the other keys.
Many of us have seen this skyline, with buildings of various heights stabbing toward the sky. Compare this to the distribution graph that is part of the Netezza Administration GUI application. Normally this should be very flat (but a jagged-edge is usually okay). Clearly a Manhattan skyline is forbidden.
Or is it? if we have a table that is very skewed (like a Manhattan skyline) but the data can be easily "horizontally" sliced with zone maps, our round-trip time for a "tall" data-slice is no different than a "short" one. All we need to look out for is process-skew (too many horizontal slices on one dataslice)
Measuring the madness
Okay, David this is all very interesting but how can I know which extents or pages or whatever is being used by the keys? Well, there are a couple of handy hidden columns on each row that can help tell-the-tale. One is the _PAGEID and one is the _EXTENTID.
select count(*) , datasliceid dsid, _pageid pid from fact_customer group by datasliceid, _pageid
will tell us how many distinct pages are being used for each data slice.
select count(*), datasliceid, _pageid from fact_customer where transaction_date = '2013-01-01' group by datasliceid, _pageid order by datasliceid, _pageid;
In the above, the "count" should be reasonably even for each dataslice.
If the total count is radically more than "1", then let's organize on transaction_id and then groom. Now try the metrics again to see if it did not reduce the total pages.
I am sure with these two columns in-hand you can think of a variety of creative ways to use them. The conclusion of it all is to get the records with common key values packed as closely as we can so they take up as few extents / pages as possible.
It's a wrap
So now that the Organize seems a little better, you know, organized, maybe this will provide a bit of guidance on how to set up your own Organize and zone maps.
And don't forget to groom when changing the Organize keys.
We won't need to groom every time we do an operation. I would suggest a groom on both a schedule and a threshold. Pick a threshold ( a lot of folks like five percent). When the total deleted rows gets above five percent of the total non-deleted rows, or the total pages per unique data point gets above an unacceptable threshold, it's time to groom. But grooming on every operation is expensive, has marginal value and actually may throw away records we wanted to keep (in case of emergency rollback).
Modified on by DavidBirmingham
Sometimes the average Netezza user gets a bit tripped-up on how an MPP works and how co-located joining operates. They see the "distribute on" phrase and immediately translate "partition" or "index" when Netezza has neither. In fact, those concepts and practices don't even have an equivalent in Netezza. This confusion is simply borne on the notion that Netezza-is-like-other-databases-so-fill-in-the-blank. And this mistake won't lead to functional problems. They will still get the right answer, and get it pretty fast. But it could be soooo much faster.
As an example, we might have a traditional star-schema for our reporting users. We might have a fact table that records customer transactions, along with dimensions of a customer table, a vendor table, a product table etc. If we look at the size of the tables, we find that the product and vendor tables are relatively small compared to the customer, and the fact table dwarfs them all. A typical default would be to distribute each of these tables on their own integer ID, such as customer_id, vendor_id etc. and then putting in a transaction fact record id (transaction_id) that is separate from the others, even though the transaction record contains the ID fields from the other tables.
Then the users will attempt to join the customer and the transaction fact using the customer_id. Functionally this will deliver the correct answer but let's take a look under-the-covers what the performance characteristics will be. As a note, the machine is filled with SBlades, each containing 8 CPUs. For example, if we have a TwinFin-12, this is 12 SBlades with 8 CPUs, or 96 CPUs. They are interconnected with a proprietary, high-speed Ethernet configured to optimize inter-CPU cross-talk.
Also whenever we put a table into the machine, it logically exists in one place, the catalog, but physically exists on disks assigned to the CPUs. A simplistic explanation would be that if we have 100 CPU/disk combinations and load 100,000 rows to a table that is distributed on "random", each of the disks would receive exactly 1000 records. When we query the table, the same query is sent to all 100 CPUs and they only operate on their local portion of the data. So in essence, every table is co-located with every other table on the machine. This does not mean however, that they will act in co-location on the CPU. The way we get them to act in co-location (that is, joining them local to the CPU) is to distribute them on the same key.
But because our noted tables are not distributed on the same key, they cannot co-locate the join. This means that the requested data from the customer table will be shipped to the fact table. What does this look like? Because the customer table has no connection to the transaction_id, the machine must ship all customer records to all blades (redistribution) so that the CPUs there can attempt to join on the body of the customer table. We can see how inefficient this is. This is not a drawback of the Netezza machine. It is a misapplication of the machine's capabilities.
Symptoms: One query might run "fine". But two of them run slow. Several of them even slower. Results are inconsistent when other activities are running on the machine. We can see why this is the case, because the processing is competing for the fabric. Why is this important to understand? The inter-CPU fabric is a fairly finite resource and if we allow data to fly over it in an inefficient manner, it will quickly saturate the fabric. All the queries start fighting over it.
Taking a step back, let's try something else. We distribute the transaction_fact on the customer_id, not the transaction_id. Keep in mind that the transaction_id only exists on the transaction table so using it for distribution will never engage co-location. Once we have both tables distributed on the customer_id, let's look at the results now:
When the query initiates, the host will recognize that the data is co-located and the data will start to join without ever leaving the CPU where the two table portions are co-located. The join result is all that rises from the CPU, and no data is shipped around the machine to affect the answer. This is the most efficient and scalable way to deal with big-data in the box.
Now another question arises: If the vendor and product dimensions are not co-located with the transaction_fact, how then will we avoid this redistribution of data? The answer is simple: they are small tables so their impact is negligible. Keep in mind that we want to co-locate the big-ticket-or-most-active tables. I say that because we have sites that are similar in nature where the customer is as large as two of the other dimensions, but is not the most active dimension. We want to center our performance model on the most-active datasets.
This effect can rear its head in counter-intuitive ways. Take for example the two tables - fact_order_header and fact_order_detail. These two tables are both quite monstrous even though the detail table is somewhat larger. Fact_order_header is distributed on the order_header_id and the fact_order_detail is distributed on the order_detail_id. The fact_order_detail also contains the order_header_id, however.
In the above examples, the order header was being joined to the detail, along with a number of other keys. This achieved the correct functional answer, but because they were not using the same distribution key, the join was not co-located. So we suggested putting the order_detail table on the same distribution as the order-header (order_header_id). Since the tables were already being joined on this column, this was a perfect fit. The join received an instant boost and was scalable, no longer saturating the inter-CPU fabric.
The problem was in how the data architects thought about the distribution keys. They were using key-based thinking (like primary and foreign keys) and not MPP-based thinking. In key-based thinking, functionality flows from parent-to-child, but in MPP-based thinking, there is no overriding functional flow of keys - it's all about physics. This is not to say that "function doesn't matter" but we cannot put together the tables on a highly physical machine and expect it to behave at highest performance unless we regard the physics and protect the physics as an asset. Addressing the functionality alone might provide the right functional answer, but not the most scalable performance.
As with all analysis of implementations, please accept the following as a composite commentary (much like the Case Studies in Netezza Transformation). The names have been changed largely to protect the guilty. The innocent have already been punished.
So for those of you who may recognize shadows of your own environments in the discussion below, you now have plenty of time to get them cleaned up before anyone finds out about it! But honestly, don't admit to anyone that you are "doing it this way". Just fix it. What happens in the underground, stays in the underground!
I cannot (today) count how many on-site assessments I have executed or the variety of their outcomes. I have to say that on balance, most technology folks are pretty sharp and have things on track. I can usually advise them on how to make things better. This is of course exactly what they are hoping for. What manager wants to hear that they've done most it of it wrong? Or that their investment in the technology and the people, are a bust? No managers I know take their responsibilities so lightly. Some, however, inherit a mess from their predecessor and are flummoxed as to how to unravel it. They don't want me to "put lipstick on the pig", so to speak, but to provide a roadmap on how to dig out of the ditch (or hole, or rat's nest) and move things forward in a healthy direction.
Working with a 10400 Mustang, pre-TwinFin Era, one of our recently arrived data warehouse aficionados took our leadership aside and said, "What they are describing is a reporting system, like a data mart. But we aren't using any technologies to help them with this. We need to have a talk with them about standing up Microsoft SQLServer so we can put a data mart on it and..."
Stop right there. Yikes. He was so full of passion! It was really, really hard to talk him back from the ledge. So I finally said, "If you mention this plan to the client, even once, we will have to remove you from the project." And his eyes went wide like he'd been hit with a two-by-four across the forehead. "Why?" was his impassioned plea. Time to educate him on what Netezza does, right?
Netezza is a data warehouse appliance. It circumscribes and simplifies the data warehouse disciplines. It also makes some strong assumptions about the potential users of the appliance, not the least of which is what-problems-it-solves-well and what-problems-it-does-not-solve-at-all. (World Peace, Global Warming, Time Travel, Cloning of IT Staff Members, and getting the Dallas Cowboys to the Superbowl, to name a few).
Example: What if you were going about doing some-regular-task manually-and-tediously, and someone then showed you a device that would automate it? You might count your blessings and move forward with a skip in your step. But when you share the device's features with someone unfamiliar with the manual, tedious nature of an existence without it, they scratch their heads and say "I don't get it."
I am reminded of a joke where a lumberjack is in the market for a new saw. A powered-chain-saw salesman asks him how many trees he cuts down in a day with his manual saws, and the man says "30". Ahh, says the salesman, with one of these you could cut 100 or more in a single day. The lumberjack doesn't believe him, so the salesman tells him, Look, take this one for a test drive. Use it all day tomorrow and if it doesn't at least double your output, bring it back, no harm done, no questions asked. The lumberjack agrees but returns two days later, clearly disgruntled about the chain-saw's performance. "I was only able to cut down 10 trees with this lousy thing." To which the salesman balked, and wondered if it might not be defective. So he beckoned the lumberjack to follow him outside to their testing area, where he threw a log across two sawhorses and pulled the starter cord on the chain saw. When it roared to life, the lumberjack took a step back and shouted over the sound of the motor - "WHAT'S THAT NOISE?"
Clearly a little product-orientation was in order, no?
A CTO once lamented to me, "Well, we did the best we could with what we had." - Well sure. Don't we all? I don't know of anyone who borrows or rents help to do it poorly. Nor do they take their best people to deliberately make something sub-standard. The problem is, without a baseline knowledge of what the machine can do, how it is typically deployed, and what to embrace or avoid about it, then it's really no different than the lumberjack's problem. He did the best with what he had, didn't he?
What were the outcomes? Poor perception of the product by the user. An objective lack of productivity. General grousing about something that is not well-understood. Where have we seen this before? Give a call to practically any help desk of any product, especially a technology product, and they will bend your ear with "howling" examples of users who mis-applied the product - and some would say - were just plain stupid about it.
Underwriters Laboratory (UL) has a standard policy of quickly adjudicating claims against them no matter how frivolous. Seems that just having the "UL" on the product makes them a lightning rod for litigation. One man took his name-brand lawn mower, which also sported the UL sticker, picked it up while it was still running, and attempted to use it as a hedge-trimmer. He slipped, the lawnmower fell on him, and he sued the maker of the lawnmower and UL. Three boys found a giant bullfrog and decided to kill it by setting it on fire. They grabbed a gas can from the shed, doused the bullfrog with it and tossed a lighted match on the hapless creature. Someone should have told the lads about the volatility of gasoline fumes, because the flames climbed the can's fume-trail directly into its mouth and detonated the can's contents, seriously wounding and severely burning all three boys. They sued the makers of the gas can and UL, which also had a label on the can. Perhaps someone should send this one in to A Thousand Ways to Die. For bullfrogs.
But this is not a dissertation on A Thousand Ways to Fail with Netezza because frankly, it's really hard to fail with a machine this powerful. This is why I say that when we encounter howlers like the guy with the lawnmower or frog-immolation, we're clearly off the beaten path. Why is it then, that the "beaten path" pops up more often than it should? Or for that matter, pops up at all? Aren't data warehousing folks a little smarter than that?
Of course they are. In fact, I don't recall encountering any experienced data warehousing folks who have had a bad experience - quite the opposite. However, as for the folks who have never built a data warehouse but have a lot of experience in "applications" - well in this zone it can get a little choppy.
The point is, across the fruited plain we have exceptions to every rule. My sincere hope is that your project is not inadvertently caught in the "crosshairs of fate".
Rat's Nest Number One.
Upon arrival on site I knew something was wrong. People were squeezed into their cubes, boxes were stacked against walls in every room. The whole office just felt so crowded. And then they introduce me to "the machine". In this case, their production machine was the lowest-powered machine that Netezza had to offer, just short of a Skimmer. The admin at the desk barks at me for parking in the street and not in the garage underground. It's my first day, and nobody said anything about parking. The difference in cost was exactly $1, and if you're like I am and travel a lot, this kind of difference is not worth discussing. Except for here. It's tongue-lashing time. All of these things added up to some significant red flags, moving in a direction of a road lined with red flags.
Case in point, this is not a company doing things expediently or frugally. They are cheap. They will do things according to the lowest denominator of cost and skill, not because they have balanced priorities. They would rather save a few dollars on training or even a rent-an-architect, and allow the least-of-their-staff to painfully slog through the nuances of data warehousing on an immutable deadline. It's the immutable deadline I can't fathom. Here's why:
In a solution implementation, we have cost, duration and quality. Pick two. Whichever two you pick will shortchange the third. Every time it's tried. Well, these folks were shortchanging all three in the blind naivete that it was valid and workable. Without time or resources, quality is always the first, most expedient of the three to fall on its face. Doing it on-the-cheap? Well, what does this say of their readiness for prime-time? Data warehouses have an ongoing cost-of-ownership. It's not trivial. Those who want to play cheap should find another profession, one that does not cherish quality.
I was told by the client that their current back data processing environment used Netezza stored procedures. Another big red flag. Stored procs invite black-boxed code and we cannot capture data lineage through them. Netezza stored procs are ideal for the front-end. Never for the back-end. They are hand-crafted and rather ugly to maintain (this would be true on any platform).
On this particular platform, they had decided not to use monolithic stored procs (a proc with a lot of serialized operations in it) but use modular ones. So modular in fact, that each inbound data stream had its own dedicated "receivor" stored proc, followed by three more role-based stored procs plus another two - one to validate and one to push the data into the final target table. All 150 incoming record descriptions/filestreams had these 6 stored procedures assigned to them. With one catch - they were the same "role" of stored proc, but all of them were different. That's right, to intake 150 tables we saw 6 stored procs each, for a grand total of 900 stored procedures, and this was for just one of several data sources!
Many of you OO aficionados see something screaming out at you, that this should have been one general-purpose loader with six phases of operation, serving all 150 streams. Adding another source and another 100 streams, no problem, they go through the same loader and phases. Need more phases of operation? No problem, just add them to the loader and everyone benefits. Forever. It's a beautiful thing.
Of course, this means a deliberate instantiation of some reusable infrastructure. Many app-developer folks are not familiar with how to do this. After all, with 150 tables incoming,we could expect those definitions to remain pretty stable. But if the same stored procedures are facing the internal data model(s) (and they must), then we have worse than a cut-and-paste rat's nest, we have a hand-crafted rat's nest. If the data model must change, we may effectively invalidate most if not all of the stored procedures. Can you even imagine having to review - and re-review 900 stored procedures so -- oh never mind.
It therefore did not surprise me to learn that they had "frozen" the data model so that the stored procedures could have higher durability. We know this isn't realistic either, because the business will start to drive more requirements into the solution and the model must change to accommodate it, even if it's just attribution of existing tables. How do we keep these things from impacting the existing code? We have no choice but to freeze the data model. But this isn't really a choice, it's more like a un-necessary evil. Their stored procedure implementation only guarantees one thing: their functional code base will be in flux and unstable for the duration of the solution's lifetime.
I made a valiant attempt to explain the rather problematic issues concerning their implementation. (Problematic here, is a polite and professional term for rat's nest without having to say so). I have to admit however, that "rat's nest" may have done disservice to the rats. They also wanted me to "jot down" a list of "enhancements" that would make their solution better, stronger, faster - all that stuff. I could not think of a profesisional term for "burn it to the ground and start over".
Perhaps I could have told them the bullfrog story.
Rat's Nest Number Two
In keeping on our theme with stored procedures - recall - stored procedures in Netezza were originally concieved for supporting the front end BI tools. Not back-end data processing. In fact, pushing the back-end data processing under higher programmatic control is something new - even to ETL tools. That Netezza supports it very well is a bonus. Actually, Netezza does it better than any of the other databases, because the ease of manufacturing an intermediate table, using it and tossing it, is amazingly simple and easy to manage. Other machines, not so much.
But when I say "ELT" or back-end processing, it's still a SQL statement. We have options, like hand-crafting the SQL in script. Been there, done that. Or hand-crafting a stored procedure. Not really interested in doing that again. And then we have generated-SQL from a template or metadata-driven framework.
ETL tools have pushdown, but it's still pretty weak. At least, too weak for power-users like moi. I have no doubt that they will step up, eventually.
In this example, we have the opposite problem from the first. Just as much of a rat's nest, it is a monolithic stored procedure rather than a gaggle of modular ones. The monolithic stored procedure often runs for an hour or more, executes hundreds of SQL statements along the way, and has a lot of detailed steering logic embedded amongst them. It is a veritable nightmare to code and debug, and even worse to troubleshoot. I hear that some developers have been invited to padded cells afterwards, but I think those are just exaggerations. It can't be that bad, can it?
Given choice between only the two, I would choose the gaggle-of-modular over the monolithic. I mean, if you were implementing it and not me. I always have the choice to say no. I don't work for your company, after all. You may not have a good choice. Your uber-architects and their hired guns have told you it's stored-procedures-or-nothing. So it's time to pick a poison I suppose. I'll take hemlock for $400, Alex.
As these stored-proc programmers stared back at me with hollow eyes, I thought I had entered some macabre Tim Burton flick and all we needed was some spooky music, fog-machines and strange howling in the distance to make it complete. They spoke in muted, muffled tones and their questions seemed to drift. Had they slept in like, the last 48 hours? They all looked sooo tired. This is what a monolithic stored procedured does to your staff. Now watch it drain the lifeblood from your operations staff. It is the virtual/technical equivalent of leeches, and you thought we'd left those behind in the Middle Ages (for technology, that would be the 1990's)
Stored procedures don't have single-step capability. When we add another function to it, we have to test all of the functions at once, because it has to run end-to-end. We can creatively work around this in the beginning, but eventually we have to integrate it. When one test takes over an hour, or two, and the answer is buried in the mountain of carefully crafted NZPL-SQL code, at some point we have to wonder what we signed up for. (that would be, we signed up to do it wrong). Ouch.
Stored procedures cannot be parallelized (unlike their more modular counterparts) and as such is a glaringly missed opportunity. They are doomed to be serialized forever.
Now, our framework (that we consult and use as a problem-solving platform for Netezza - nzDIF) handled 100 percent of all data lineage no matter how many intermediate tables, databases or machines are involved in the overall flows and handoff of work. You won't get this with any other product, nor with anythng a stored procedure has to offer. This would be true of any stored procedure on any platform.
This is because on a transactional platform, procs are meant to handle multiple operations on singleton entities so data lineage simply is not an issue. On a Netezza platform, procs are meant to serve the BI platform, not the back end, so likewise data lineage is not much of an issue. Stored procs for the front end are largely summary/filters for pre-existing datasets. We want the lineage on those datasets, not the on-demand operations that consume them. "Could" we expect data lineage from stored procedures in Netezza? Why? The only reason would be to support back-end processing, and stored procedures are not for back-end processing. It's sort of a Catch-22.
Rule #10 in play here
And let's not forget Rule #10, shall we? Recently emblazoned in glowing letters on the catacomb walls of the Underground, Rule #10 is very simple: Never do bulk data processing in a general-purpose RDBMS engine running on a general-purpose platform.
Now, I just had to get Rule #10 into the forefront because this underscores the primary reason why stored procedures are bad for back end data processing. If we have a rule in place against using SQL for bulk processing on a general-purpose platform/engine, then any experience we may have with bulk processing through stored procedures on such a platform is itself a violation and not a marketable skill in the Enzee Universe. More importantly, it institutionalizes the violation and makes it so much worse. We could "maybe" dig ourselves from a ditch if we're using hand-crafted out-in-the-open SQL (also not recommended) but when ensconced behind the fortress of stored procedures, we have to first storm the fortress before we can loot it. Easier said that done.
All that said, folks who have instantiated stored-procedure-based data processing on general purpose platforms have already been doing it the wrong way, so why bring those practices into the Netezza machine? Just sayin'
Rat's Nest Number Three
Ahh, you thought we were done and coming into the home stretch, eh? Well, we're almost there.
This particular rat's nest only appears in places where people have churned a lot of contractors, consultants and other aficionados and hired guns through the company's various revolving doors. And as the person who inherits it rightly recognizes if as a rat's nest, or a hairball, something comes to mind that rushes through their brains like a river of water "Wow, and I deliberately signed up for this. What could I possibly have been thinking?"
Ahh, not to worry, this syndrome is rare, and shall pass. Breathing deeply will override the spooky breathing cadence of the Dark Lord of Expectations, and shall give you extraordinary confidence on how to resolve this problem.
This condition is entirely severable from the technology itself. Any shop that allows the contractors to establish their own standards without oversight is just signin' up for a world-of-hurt somewhere down the line. Fortunately for us, the Netezza machine is like a monster truck with gumbo-mudder tires. No matter how mired-in-clay it may be, we need only fire the engines and punch the accelerator to regain control and be underway in no time.
The first step, like any 12 step program, is to recognize that a problem exists. Bandaging a hemmorraging wound will not heal it. This will only forestall the inevitability of bleeding out. If we are to be proper stewards, bandaging has its benefits while we treat the larger wounds.
First and foremost, commit to some form of data management logistics. And this is not by purchasing a data backup tool. This is a committment to flow-based, insert-only architecture as the rule, with updates and deletes as the exceptions. After all, if we were using an ETL tool, we wouldn't be able to update or delete a flow of work. We can only integrate and filter the data while it's on its way elsewhere, but that elsewhere will always be an insert-only target - because it's a file set and and not a database. Only when we reach the book-end of the database can we perform updates and deletes, and these are largely to support things like constraints and slowly-changing dimensions. We just need to avoid invoking an insert/delete/update protocol for all tables at all times. Center on a theme and accommodate the exceptions. We must have rules, and this is one of them.
Commit to some form of rules-driven architecture. That is, when we encounter a new condition or potential fork in the logic, consider shaping it with a rule (one that we can switch on/off or modify from afar) rather than hard-wired SQL or hard-coded solutions. Is this easy to do? Of course not. Nothing is easy about data warehousing or large-scale flow mechanics, silly rabbit.
Netezza has simplified the harder, tedious and repeatable parts so that we can actually address the issues we never had time for before. The "next level" was never in view, or even on the radar because we were always immersed in the operational weeds of the implementation. With a Netezza machine, lots of that is behind us, but before us stands the new challenge. It goes something like this:
If I were to give you a 400-slice toaster, your problem is no longer toasting bread, but bread management. Keeping the toaster busy has now become a daunting problem of bread logistics, not machine capacity. The problem domain has shifted into a zone that lots of folks don't have any experience with. Time to step up.
Modified on by DavidBirmingham
About a year ago I engaged to assess a Netezza-centric data processing environment. They had used stored procedures to build-out their business processing inside the machine using SQL-transforms. As you know, I'm a big fan of the SQL-transforms approach, but I'm not a big fan of how they implemented it. Stored procedure for back-end processing are a bad idea on any platform. But even if they had done it without stored procs, the implementation was a "total miss". I mean, it could not have been more "off" if they'd done it outside the machine entirely.
I received word some months ago that while their shop remains a strong Netezza environment, for this particular application they intended to go in a different direction, with a different technology. This was unfortunate since I had told them exactly what the problem was - not the hardware but the way it had been deployed. But they were in denial! Forklifting their application onto the new machine, they attempted to tweak and tune it. They actually received marginally more lift at the outset, but then it rapidly degraded when more data started arriving. Now I'm in dialog with them to discuss "what went wrong".
What went wrong began, quite literally, many years prior.
It's like this: If rust starts to build in a water pipe, we won't know it until the water pressure starts to slowly degrade. Eventually it becomes a drip and then one day it's closed off altogether. We could attempt tracing it to a single cause, but it would only be another straw on the proverbial camel's back. What "really" happened was that we treated the machine in a sloppy manner. Or rather, we saw that it had incredible power but we weren't particularly good stewards of it. Netezza makes a an ugly model look great, a good model look stellar, a marginal model look like a superstar, and can make the sloppiest query look like the most eligible bachelor in town. Power tends to make people starry-eyed.
Time and again we coach people on a migration. They say "Wow, we just went from a five-hour query on that Oracle machine to a five-minute query on the Netezza machine. Sold!" and they move everything over "as-is" from the other machine. Never mind that those old data structures were optimized for a different technology entirely and never mind that the data processes running against them were likewise optimized with the older structures in mind. They were both optimized in context of a machine that could not handle the load to begin with. They just didn't know it yet. Now standing in the Netezza machine's shadow, it's painfully obvious what shortcomings the old machine had. Not the least of which being a load-balancing transactional engine, which is always the wrong technology for anything using a SQL-transform.
The bottom line: what if just a little tuning of that five-minute query could make it a five-second query? What if we received a 10x boost moving data "as is" from the old machine, but if we had engaged a little data-tuning we could have received 100x? In short, how many "x" have we left on the cutting-room floor? Enzees have learned (some the hard way!) that performance in a Netezza machine is found in the hardware. This hardware has to actually arrive in massively parallel form, not marginally parallel form. So we know that that expecting production performance from the Emulator or the TwinFin-3 is a quixotic existence. This ultimately leads to two universal maxims:
We don't tune queries, we configure data structures. The data structures unlock hardware power.
We use queries to activate the data structures. In Netezza, "query-tuning" is a lot like using a steering wheel to make the car go faster. It just doesn't work that way.
This "additional boost" or "leftover power" is an important question, especially for the aforementioned Netezza-centric application. Even if we had kept the entire application in stored procedures, their implementation could not have been more wrong. They had of course, outsourced the whole development project to a firm in a distant country, who had given them a marginal development team at best. This team proceeded to treat the Netezza machine like "any other database" and completely missed the performance optimizations that make one of these machines a source of legend.
What that team did, was pull two hundred million records from a master store and use this body as a working data set even though only twenty columns were being processed at any given point. Dragging over two-hundred columns (90 percent dead weight) through every processing query (many dozens of them), and without regarding distribution to manage co-located reads and co-located writes, turned a twenty-minute operation into a fifty-five hour operation. We showed them with a simple prototype how a given fifteen hour process could be reduced to four minutes. The point is, they were ridiculously inefficient in the use of the machine. Nobody in the leadership of the company would accept that they were as much as 20x inefficient.
A major "miss" is in believing that Netezza is a traditional database. It is not. It is an Asymmetric Massively Parallel Processor (AMPP) with a database "façade" on the front end. Anything resembling a "database" is purely for purposes of interacting with the outside world, "adapting" the MPP layer for common utility. This is why it is firmly positioned as an "appliance". If the internal workins' of the machine were directly exposed, it would cause more trouble than not. Interfacing to the MPP "as a database" is where the resemblance ends. This is the first mistake made by so many new users. They plug-in their favorite tools and such, all of which interface (superficially) just fine. Then they wonder why the machine doesn't do what they wanted. Or that they are experiencing the legendary performance.
When an inefficient model is first deployed, we could imagine that even in taking up 10x more machine resources than necessary, it still runs with extraordinary speed. But let's say we have 100 points of power available to us. The application requires at least five points of power but we are running at 50 points of power (10x inefficient). We still haven't breached any limits. As data grows and as we add functionality, this five points of power rises to 8 points, where we are now using 80 points of power in the machine. Wow, that 100 points of power is getting eaten up fast. We're not all that far away from breaching the max. But data always grows and functionality always rises, and one day we breach the 10th point of power. And at 10x inefficiency, we have finally hit the ceiling of the machine. It spontaneously starts running slow. All the time. Nobody can explain it.
The odd thing, is that the Netezza machine is a purpose-built appliance. Why then did we allow our people to migrate a schema optimized for a general-purpose machine into a purpose-built machine? Moreover, why did we continue to maintain that the so-called purpose-built model in the old machine was really a general-purpose model in disguise? Did we use general-purpose techniques? Why?
Did we load data into one set of structures expecting them to be the one-stop-shop for all consumers? A common-mistake in Netezza-centric implementations is that one data structure can serve many disparate constituents. The larger the data structure, the more we will need to configure the structure for a constituent's utility, and may need redundant data stores to serve disparate needs. Is this reprehensible? Which is better? Just declare another identical table with a different distribution/organize and then load the two tables simultaneously? If we go the summary-table route, the cost is in maintaining the special rules to build them along with the latency penalty for their construction. It seems counter-intuitive to just re-manufacture a table, but the only cost is disk space. On these larger platforms, preserving disk space while the users are at-the-gates with spears and torches, doesn't seem to be a good tradeoff.
The point: Don't waste an opportunity to build exactly the data model you need to support the user base. Don't settle for a contrived, purist, general-purpose model. If the modelers say "we don't do that", this is a sure sign that we're leaving something very special on the cutting-room-floor. It's a purpose-built machine, so create a purpose-built model, and like the Enzees say, "give the machine some love."
When capacity seems to be topping-off, with a few, simply-applied techniques we can easily recover that capacity. It's just very annoying that it's caught up with us and nobody can seem to explain why. It's because they are looking in the wrong place. If we had concerned ourselves with the mechanics of the machine and its primary power-brokers, the distribution and the zone-maps, and avoided the most significant sources of drain, like nested views and back-end stored procedures, we might be closer to resolution. If not, it may be that resolution would require rework or retrofit. In the above 50+ hour operation, the only answer would be to overhaul the working queries end-to-end. We wouldn't need to do much to the functional mechanics, just streamline the way the queries perform them.
What does this streamlining look like? Well, if we already knew that, we wouldn't be having any problems. We would have been streamlining all along and most of our capacity would still be well-preserved. People do it all the time.
Symptoms of a PDA system under stress:
The thing is slowing down and we haven't changed anything. It's slowing down and we're doing what we've always done.
It's the last application/table/load we implemented. Things went south from there. But then we backed out the application and it's still bad. I think we broke something. It must be the hardware.
More users are querying the machine than ever before. I think we've reached a hardware limit.
We could not get the base tables to return fast enough, so we built summary tables, but these are pain to maintain and we don't like the latency in their readiness. We were told that we would never have to use these.
It's sort of funny how deterministic the machine is. "We've been doing the same thing all along" and now it's not working right? Perhaps we weren't doing the right thing all along, regardless of how consistent our "wrongness" was? This is how we know it's not a hardware problem. In fact, if our folks are blaming the hardware first, it's the first sign of denial that the implementation itself is flawed. If our people build contrived structures like summaries, it's a sure sign that the data model is flawed. It's also a sure sign that we're trying to implement a general-purpose schema rather than a purpose-built one. If our people spend a lot of time swarming around query-tuning, it's a sure sign that our data structures aren't ready for consumption. Nested views never have a good ending in Netezza.
At one site, the users had to link together eight or nine different tables to get a very common and repeatable result. If the users must have the deep-knowledge of these tables and their results are repeatable, we need to take their work and manufacture a set of core tables that require less joining and are more consumption-ready. Consolidation, denormalization and shrinking the total tables in the model are actually performance boosters. Why is that? The more tables are involved in the mix, the more we have to deliberately touch the disk through joining. If we have fewer tables and more data on larger tables, zone maps allow us to exclude whole swaths of the tables altogether. We stay off the disks because the zone maps are optimized for it.
It sure "seems" right to put everything into a third-normal-form and make the model "look like the business", but nobody is reporting on the business. They're analyzing, not organizing. We should be the ones organizing the data for analysis, not requiring the analysts to organize the data "on demand" through piling tables together in their queries.
I fell asleep dreaming, even as I hit the pillow, glad to be free of that whining analyst on the third floor. Every time he sends me a report, it turns out to be worse than wrong. It's too wrong to be worth the paper it's printed on, yet his supervisors say he's the next Einstein. When I ask him personally about the paper's details, he just scratches his head. When I ask about his conclusions, he scratches it again. Whenever other people ask him for things, he just bobs his head like everything is under control. So between bobbing his head and scratching it, I nickname him Bob Scratchit. Many can appreciate the irony.
But I had acted on his analysis. Woe came to me in waves for many weeks. Nobody wanted to sit with me in the lunch room, return my phone calls or give me access to the data necessary to see what-went-wrong. I became a scourge and a pariah. I blame Scratchit. He started it. My partner had also listened to and acted on Scratchit's advice. He was practically perp-walked out the door and doesn't look particularly good in an orange jumpsuit. I need to watch myself, because Scratchit always has an agenda, and even if his conclusions are wrong, I get the impression that he's toying with me. And that's kinda scary.
Later that evening I startled awake as my laptop screen flashed from the other side of the room. I rose to inquire, and on the screen was the smiling face of my partner Jake, his carefully coiffed Bob Marley locks draping his shoulders. He hadn't changed much, but looked a little tired, and I wondered how he was able to find me and impose himself on my evening, like a ghost from my workplace.
"Knock, knock!" Jake said with a grin, "Anyone home?"
I touched the camera-On button, "Hi Jake, yes I'm here."
"Ahh, good to see you," he said, "Knees'er shakin'?"
"I can tell your knees-er shakin'. Don't be scared."
"Not scared. Just annoyed. It's late, can't this wait?"
"You haven't changed. Still keep puttin' things off 'till later. That's gonna change, my friend."
"I'm sending you three WebEx appointments."
"What are they about?"
"You'll see," Jake said mysteriously. "Come here every night by 8pm for the appointments." He
seemed to pause, pain written on his face, "Don't put it off, and don't take my word for it." he beeped out.
In that moment, the clock chimed for 8pm and startled me. The WebEx session fired up to reveal
the smiling face of-
"Jack Blaines," I said before he could get a word out.
"That's right, mate," the native Australian grinned broadly.
"But, I recall you from what, twenty years ago, when data warehousing was barely a concept and -"
"That's right mate, we were all part of the inception."
"You haven't aged a day - "
"Nope, and where I am, data warehousing hasn't moved an inch either. Here, we had to make the most of what we had."
"The most of data warehousing past, you mean."
"That's right, do more with less. In this time, we don't have any network bandwidth because, well, we don't have any networks. But we still have to do bulk processing and analytics."
"The hard way."
"Here's a takeaway - software only guides the hardware, but the hardware does the work. I'll bet that never changes even for you."
"Well, if I'm honest - "
Just then a series of pictures started flashing across the screen, of machines and technologies that had long ago entered the fossil graveyard of computing history.
"Look - gotta go - hard stop at midnight. see-ya."
"Wait, I have some more quest - "
But the WebEx evaporated leaving only a screen with the usual icons. Why would Jake connect me with Blains? Data warehousing of yesteryear wasn't going to help me.
As I drifted off to sleep, the screen fired-up again and another WebEx started. This time it was the smiling image of Hunter Hardwick, an aficionado of all-things-warehousing.
"Hell0 my friend," he called to me, "I'm so happy we could be together for this celebration." he then raised a glass filled with a bubbly drink.
"What are we celebrating?"
"A toast!" Hardwick ignored me, "A toast to the present-state of data warehousing."
"Hasn't data warehousing sort of - you know - been eclipsed by Big Data?"
"Oh not hardly! The Big Data technologies are still a valuable part of the warehouse, but the warehouse itself isn't going away."
"Wait a second, you're speaking of the warehouse as something larger, not just a data store."
"Well of course, the warehouse is a living environment, not unlike a brick-and-mortar warehouse, with workers, forklifts, loading docks, storage carrels and operational protocols to tie it all together."
"Okay, I see that, but why are you making a toast to data warehousing's present-state? Wouldn't you want to toast to something more profound?"
"What's not to like? Data warehousing appliances, fluid queries, various options for large-scale storage, big-ticket network bandwidth - the world is my oyster. Glasses up!"
Something overcame me and my eyes grew heavy. As much as I wanted to celebrate the present, my body could not go on. I must have slept for hours before the computer screen fired-up again and started flashing. The third WebEx foretold by my former partner Jake, now attempted to materialize on the display.
"Are you there?" I heard my own voice, am I still dreaming? "Is this thing on?"
"Yes I'm here," I said groggily, wondering if this was a recording of me.
"That's what the bad comedians say in Vegas, you know," my voice jested.
"Is this thing on?" my voice chuckled, "Hey, I'm calling you from your future," but the signal was breaking up severely, like a bad cell-phone connection, and the image on the screen was full of static. I could see a silhouette of what might be an image of me, but too fuzzy to make out.
"You mean, I'm calling me from my future," I corrected, trying to gather my senses.
"Whatever," my voice laughed, "You're gonna love it here."
"In the future?"
"Sure, you have a lot to look forward to."
"Tell me more," I sat up.
"Well, networks here are all wireless, with practically unlimited bandwidth."
"Uh, no kidding -?"
"Absolutely, With wireless networks, everything is infinitely elastic. I can transfer terabytes in micro-seconds. Wires are pretty much a thing of the past."
"Well, except for power I suppose."
"Nope, wireless power too. Tesla finally won the argument. Changed the entire architecture of data centers. Now we don't have raised floors, we have raised machines. We can use the horizontal and vertical space in the building. IBM's Websphere machine really is a sphere now, most machines now use their entire outer-cabinet real-estate for something. No more boxes and racks."
"And storage is all in the cloud, nobody has real data centers hosting storage anymore."
"So centralized data centers all over the world?"
"No, as storage became denser and faster, we can put a petabyte on a flash stick. No big deal."
"To your point though, with the use of quantum entanglement and basically tossing out all of Einstein's constraining math, we can transfer data instantly. Storage isn't really a priority anymore."
"Pretty cool," I grinned, "But unlimited power, how -"
"No gas-powered cars anymore either. They all run on electromagnetic quantum engines. Basically generate their own fuel by capturing particles that leap-in-and-out of existence."
"That's - well, incredible."
"Yeah, when I go to work, I fly-by-wireless in my hovercar. No need to refuel - and it gets all its power wirelessly too," my voice yawned, "So yesterday."
"How far away is work?"
"I still live in Texas and can be at our Colorado center within minutes."
"But that's - "
"And I visit our site at Maria Tranquility within the hour."
"Funny, naming a site after a place on the Moon."
"You named a site after a location on the Moon."
"That's because it's actually on the Moon."
"Oh, no way!"
"YES way, dude. Dozens of shuttles and private craft leave every hour. Our biggest problem now is traffic control to the lunar surface."
"And data warehousing, well, it has undergone several facelifts. What was impossible in your time is literally yesterday's news."
"That's a pretty big claim."
"The boast of data warehousing's future was for real," my voice said, "actually, the folks in your time thought they were being visionary, but actually underplayed, or should I say, underestimated the innovators."
"What are your saying?"
"Their predictions came true," my voice said, "but fell short. Their boast of data warehousing's future fell radically short of what actually came to be."
"Like I said, you have a lot to look forward to. But when you think about it, even with predictive analytics, if you can see the future, or even accurately predict it, you've already changed it."
"If you know the outcome, doesn't it affect your behavior? If a corporate exec is very sure of the outcome, won't this affect how he markets his products, or would be just market them the same-old-way because its always been that way? And once he makes this change, it can radically affect the outcome of the prediction, with one problem - the future's not here yet so we won't know if we were right. And if it turns out wrong, we can't go back and replay it because we were the ones who changed the future by acting on it."
"So you're saying that if I can see the future, it's already in my past?"
"Something like that."
"I'd like to think - "
"Stop just thinking about it and start knowing it. We know that if someone tells us that a certain lottery number is a winner, won't we go out and buy a ticket when we wouldn't before? And when that corporate exec unleashes his campaign upon the masses, won't it affect the customer behavior? His predictive model cannot tell anyone what will happen when he acts on the predictions."
"You're right, it's not a problem here. Not yet anyhow."
"Look," my voice said, holding a square object against his silhouette, "Can you see this?"
"No, it's too fuzzy."
"It's a book," the voice said, "Changes a lot of things."
"Who's the author?"
"Someone you know," my voice said mysteriously.
"When will it arrive on bookshelves?" I implored, "And what's the title? I'd like to get a copy."
"That's kind of up to you, But there are people nipping at your heels even now. Those ideas you had for the future really work. Don't let someone else publish them before you do."
"Wait, what are you saying? Is that my book?"
"It could have been. You trained the author on everything in it. You just delayed publication because you thought you had more time."
"But I don't, you're saying."
"These ideas are already in your head, and this author was over a decade late in delivering them," my voice paused, "Why don't you just deliver them now?"
"Where do I start?"
"Start where you would start," my voice was breaking up even more, "Good luck."
The image faded and my voice trailed off. Now the screen was filled with static. I stared at it for long moments, trying to assimilate what I had just experienced.
The most of data warehousing's past, the toast of it's present, and the boast of its future, were these just a dream, or should I -
The phone rang.
"Hello, this is Jim Stone, how are you today?" said the pleasant voice of the robo-caller.
I know that if I say anything except "fine, how are you?" the robot will politely say "Goodbye" and hang up.
So I scream into the receiver, "They have me trapped in a box filled with scorpions! You have to help me!"
"Goodbye," it said pleasantly, and disconnected.
BEEP BEEP BEEP
The alarm clock woke me up.
Modified on by DavidBirmingham
I recently did a Virtual Enzee presentation and listed the Top Ten requirements for scalable bulk data processing inside a Netezza machine.
I'll come back periodically and elaborate on them
1.Platforms easily scale for increasing stress
We have a Netezza machine, so what could go wrong? I was asked a desperate question by an Enzee as to how to get more power out of their machine. After nearly two days of struggling with them I finally asked how big their machine was. It was a TwinFin-3. The answer I gave them, they clearly did not like and even sought solace on the shoulder of another. Who told them the same thing. Get a bigger box. TwinFin-3 is a dev box, not a production box.
Stress comes in many forms. Constantly changing requirements. The need for functional and physical agility. As these things increase, we need a platform that will work with us, not against us.
2.Human intervention eliminated wherever possible (no eyeball-based actions)
This means ALL aspects, not just operational ones. Everything from table maintenance to application development. AUTOMATE!
It is humorous to hear testers offer up their methods, with naive blurbs like "open the application and examine the contents". No, with billions of rows there is no such thing. We must use statistical checking that operates on sets, such as summaries, counts-of etc. No longer can be "eyeball" the data.
Likewise with runtime processes. Define a table with 200 columns and try to put an ELT query against it. 200 columns in the insert phrase, 200 entries in the select phrase, and to maintain it we have to keep them in sync with "eyeballs". No, this doesn't scale.
3.Architecture-centric platforms express applications with patterns
Oddly application developers, like those who develop using stored procedures, whip out a bunch of application-centric 'code" and when the smoke clears, they see repeatable patterns all over it. Unfortunately, they can't take the patterns anywhere because they are hard-wired.
The more architectural approach is to harness the patterns as capabilities and allow our applications to express from them. The application is then an expression of the capabilitis not the center of gravity.
4.Deliberately simple to leverage and operate
Large-scale systems can have mind-numbing characteristics for the un-initiated. It is incumbent upon us to deliberately simplify their interface points to it. Simple utilities, fewer keystrokes to achieve mundane goals, automation for rote tasks..
5.Built for administrative recovery, not reactionary recovery
This can be as simple as, when data arrives and has errors, we don't come to a full stop. We cordon off the error records into an adminstrative /logical status and report them for later remediation. In systems of scale, we cannot halt the processing of tens of millons/billions of records just because a few stragglers are misbehaving. The time it will take to process the data is the problem. If we are 20 minutes away from the process being complete, then we are always 20 minutes away if we have fully stopped the flow for the sake of a few records. If we allow the process to proceed with error-capture, we will close the 20 minutes and then the admins have more breathng room to fix the problem without the scrutiny or pressure of the clock.
6.Data and metadata-driven
The environment can no longer be driven by application code. It has to be driven by an architectural harness that responds and adapts to data and metadata. This is a non-trivial endeavor, of course, but entirely possible to achieve.
What does this look like? The data model is arguable the most volatile component of the solution. Changes in it can destablize a solution. We need ways and utilities to buffer ourselves from the impact of change all-the-while enabling the change. It won't do to tell the users that the data model is frozen for 6 months because we fear impact to our tightly-woven application code (e.g. stored procs)
7.Blended/hybrid approaches quickly adapt and scale
One doesn't have to make an exclusive choice between ETL and ELT. People really want to leverage the power inside the machine but feel constrained that doing so may obviate the ETL tool. Not so - both of these technologies have a major role to play and we should balance them for the best-of-breed solution
8.Template-driven applications: SQL is an artifact, not the center-of-gravity
In the VIrtual Enzee I offered several examples of templates for SQL transforms (insert-into-select-from), views (to avoid nesting) and stored procedures (to build from a template rather than editing them in a SQL tool)
Why do this? The developer puts application logic into the template. At run time, or installation time in case of the SP or View, we formulate the product from the template. This allows us to automatically include non-optional aspects like operational controls, inline status reporting and other elements that we don't want the developer to worry about, much less hand-craft on their own.
Need another bit of operational control? Add it to the template factory and don't worry about the application logic
More importantly, we can generate a template from the catalog and by definition it is tied to the catalog. It is therefore easy to compare the already-deployed templates to changes in the data model. Since 90 percent of all new columns are invariably pass-through columns, running an impact analysis like this captures over 90 percent of the issues in one shot.
9.Inefficiencies are our number one enemy
One of our clients had a TwinFin 48 they were planning to use for their development phase and then cutover internally to production. I asked them to dial back the developers so that it had the effective power of a TwinFin 12. They were a bit stunned at my request until I noted: The TwinFin 12 has a lot of power for development, but a TwinFin 48 will hide bad data models and sloppy code. Lots of power can make any lousy code/data model look spectacular.
Many cases of Netezza machine under stress, upon review we find that many of their inefficient practices have been going on for years, some since the box arrived. But the machine was so powerful it masks the inefficiency, like allowing the box to eat itself from the inside out
Preserve capacity at the processing level, not by guarding the data storage level. Do not be afraid to spin off replica data structures (even large ones) just for a different distribution, if it means that the machine can close its queries faster.
10.Operational integrity drives functional integrity
We understand this as a matter of quality control. Hamburgers from a national chain should taste the same no matter where we buy them. This is not accidental. The end user data is only as good as the processes that are delivering it.
If we make it so the operators have a difficult time handling it, or the admins don't understand it, or the troubleshooters can't get things done, they will start to grouse about the quality of their existence.
On the flip side, I know folks that we radically simplfied things for, and when we showed them the various utilities they would need to keep things in order, they balked. "Do we have to know all this stuff? Why is there so much stuff to know?" And yet, we have reduced a thousand things down to one, but they cannot grasp how much more complex it would be without our having simplified it.
We know that Netezza embraces simplicity. We just have to be mindful to maintain this spirit when we build things around it.
At the functional/capability level,we need to drive operational integrity into the data itself, outfitting the tables and rows with additional columns for the sole purposes of operational control. Otherwise the functional model is pretty much out-in-the-open and we won't have a way to manage the tables in a consistent, harnessed, repeatable form.
After review of a "high performance" ELT platform (that's ELT, database-transform-in-the-box) - I started asking hard questions about things they had not considered. It's a high-performance platform, isn't it?
Well, yes and no. It supports a "continuous" model, but the performance is all in the query and the data, right? Well, we'd like to think that for purposes of long-cycle queries anyhow. Here's the expectation:
In a general-purpose RDBMS, transformation-in-the-box is expensive. Each query can take minutes or even hours to complete. At this point, nobody really cares about the overhead to launch and shepherd the query, or to report its status when completed. All of these infrastructure issues are eclipsed by the duration of the query itself. If only one percent of the operation's duration is in the overhead, who cares if we spend time optimizing or minimizing it?
So in one scenario, the product would launch its queries end-to-end using a scheduler. Each of the queries would be packaged into its own little run-time, then the scheduler would kick off each one and wait for its closure, only then kicking off the next, etc. Some would call this reasonable, others sophisticated. After all, if the duration of each query is protracted, why do we care about inter-transform latency?
Contrast this to a Netezza-centric series of transforms. A general-purpose database, recall, requires us to dogpile lots of logic into each transform, protracting the duration as a matter of necessity. In a Netezza-centric scenario, we will see those ten-or-so general-purpose queries chopped apart into more efficient, tactical form, with each query building upon the last towards a final outcome (in a fraction of the time of the general-purpose equivalent).
Apart from the mechanics of how Netezza makes this happen, look at how the mechanics of the operation changes dramatically. I'll use a known working example of 42 Netezza transforms. When first we had ten-or-so-general-purpose queries running for an hour or more, we now have over forty MPP-queries, each of which runs in less than a second of duration (with some exceptions). So all 42 queries, if we could kick them off one-after-another, will execute in around 45 seconds. If in this case, the users (or process) kicking off the sequence wants less than one-minute turnaround, now we have to deal with squeezing out inter-transform latency.
In plain-vanilla terms, the mechanism using the scheduler noted above, put six-seconds of latency between each transform. What does this mean? With each transform running in one second, and six-seconds of overhead, what could run in less than a minute now runs in six minutes - we have dramaticallly breached our one-minute SLA!
Now David, get serious - who on earth wants to shave seconds off the inter-transform latency? Six seconds of delay between each transform seems perfectly reasonable! Sure, if the transform itself will take twenty or thirty minutes. But if it will take less than a second, our overhead for it is now the glaring culprit with a smoking gun, red-hands and all that. And for those Netezza users who want to eliminate this latency, don't pooh-pooh their needs. The fat is our problem to solve.
And what does this say for ETL tools that perform push-down to the machine? They will also have transitional latency as they hand-off control across components. Geared for the big-fat, long-duration queries of the general-purpose world, nobody has ever cared about the smidgeon of latency between them. Only now, the smidgeon is not so much, and looks a lot like a boulder next to a basketball.
Consider the following breakdown of a standard transform's parts. They look a lot like a CPU's fetch-decode-execute cycle (yeah, that's a little geeky, I know)
Startup Overhead Execution Shutdown
So imagine that we have a tool with controls (like a scheduler) with several seconds of latency in startup, formulating the query (recall, we want query-generation, not hand-coded queries) leading up to Execution, and then some minor overhead to accept the status and transition to the next operation. We'll call it at four (4) seconds of latency.
If we scale the above timeline with several inline / serialized transforms- we would see an effect like this:
|---x--x--| |----x--x--| |---x--x--| ----
See the "DDD" (for dead-time). We lose that time in the additional latency for transform management. This is akin to the startup/shutdown cost of a launcher, scheduler, or ETL tool shepherding a "component" to activate the underpinning SQL statement
Either way, take a look at the total time "between the x's " that is the actual execution time. If this time is several hours, the dead-time is a nit. If the execution time is proportional to the line-drawing above, we have far too much overhead.
So for 42 of these operations, we would have 42x4 seconds of latency plus the 1 second of execution, for a grand total of 210 seconds, or 3.5 minutes. When we seriously consider squeezing the fat from this operation, our primary problem is in the non-optional overhead.
Now take a look at the scenario below. I have kicked off five transforms asynchronously, with their execution times between-the-x's -
See how the first one hands off to the second one, like a baton in a relay race? Notice how each of the transforms has already incurred its overhead in parallel to the first transform, and are now merely waiting (see the "W") for their predecessor to trigger their execution.
What is the start-to-finish time of these five transforms if executed like "boxcars" in serialized mode, accepting the penalty for intertransform latency? From start to finish is around 25 seconds. But with the above example, the inter-transform latency has been practically removed except for the simple handshake as they hand-off. The first one launches and we incur the initial four-seconds of latency, a necessary penalty. Then each subsequent transform requires one second, for a grand total of 10 seconds of runtime. We have effectively moved from a model with 25 seconds of runtime to 10 seconds of runtime.
While this is around 36% of the original run time, it does not seem as dramatic as when we compare it to the original model of 42 transforms. That is, four seconds of overhead plus 42 seconds is 46 seconds of runtime. Add to this 1/10th second for the handoff delay (another 4 seconds), for 50 seconds of grand total run time.
Its original time was 42*5 seconds, or 210 seconds. This new time of 50 seconds is 24% of our original run time. That's a 300% improvement in run-time and of course, is well within the boundaries of the one-minute SLA.
Offsetting the transform run-times like this, so that the overhead is essentially invisible, is a common hardware-stablization approach for micro-electronics. Here's an example:
Many years ago I worked for a company that was repackaging a design into much smaller form. Essentually we were reducing a rack of hardware into a 1-foot cube. All of the large wire-wrapped boards had been designed-down to much smaller form. This is when instabilities were first detected.
In one particular case, the engineer described it to me as an engineering-101 mistake on the part of the original designer. Every hardware circuit is driven by signals that pass through conditioners and gates (transistors) so that the final outcome is a signal of some kind on a particular part of the board's interface(s). The mistake was in that the timing signal, a pulse sent to the hardware 60 times a second, was being applied at the first input of logic. So a general signal would "sit" on a given input location, the timing signal would "fire", opening the gate, and the signal would then traverse through the many other components and paths to reach the output path of the hardware. But here was the problem: the signal took too long to make it from point-A to point-B before all of the other signals had already left it behind. The solution to this: Put the timing-trigger signal on the output side of the board. This way, when the original signal first arrived, it would make its way across all the necessary components and paths and then present its signal to a final gate, which was triggered by the timer. So when the trigger hit, the signal was already present and passed through instantly without a problem.
This sort of "triggered-steering-logic" is the theme of the noted inter-transform handoff scenario above. The transforms navigate their overhead to a stopping point and wait for the signal. In this case, they are waiting for a "done" signal from the transform preceding them. When this signal hits, the next transform is queued and ready to immediately execute its query without further delay. The subsequent transforms fall like dominos.
But that's really only part of the story. These 42 transforms don't just run in a serialized stream. Some of them, after all, have no dependencies whatsoever. Others have dependencies in a discrete chain. Why serialize them when their dependencies are relatively few? Here's an example:
DEF JKL MNO VWX
Transforms A,B and C are dependent upon each other, so will serialize. In the meantime, the D,E,F and G,H,I transforms are independent of A,B,C, so can run side-by-side. J,K,L transforms are dependent on the outcomes of C,F and I, so will wait on them to complete, then finish the K,L transforms as serialized. But then another set-of-three transforms takes off. In ETL tools, we recognize this as branching or "component-parallelism". However, in an ETL tool the branches must be wired together and follow each other by design of the graphic on the canvas. ELT however, is much more dynamic than that.
Look at the effect: We're saving 6 seconds of time by not executing the A-I transforms serialized. Likewise the P-U transforms will now take 3 seconds, not 9 seconds, saving another 6 seconds. If we can find the otherwise independent transforms and move them to the front of the chain(s), the total time for the run is the longest chain of dependent transforms. In this case, we shrank 42 transforms into the same time frame as 20 serialized transforms. This shrank the total time from 50 seconds down to less than 28 seconds.
But David, you mean we have to carefully weave these transforms together so that we can squeeze the fat from the timeline? Actually no. We have a sniffer that examines the "filter clause" for the dependent tables, you know, what the given transform will actually join against. This allows the transforms to self-discover their own optimum path without having to deal with painful weaving. If we happen to add another transform, or change the filter phrase in a transform to include/exclude another table in the join, it will automatically realign the priorities based on the intrinsic dependencies of the transforms. And your ETL tool won't do this dynamically.
Then we have the intake protocol, that of transferring data from one machine to another, or loading from files. We have a couple of options, that of loading the data completely before taking actions in the transforms, or we can load the data asynchronously to the transforms. Just as the transforms will "wait" on a prior table in the chain-of-transforms, we can also make them "wait" for the arrival of a particular intake table. Rather than waiting for all of the intake tables to arrive, we can initiate a transform chain when its particular data set has arrived (or for that matter, not at all, in case its data never arrives).
An in our second example, the total time allocated for loading the data, executing it and ensconcing it to the proper target tables was less than three minutes. Since serialized, all-or-nothing loading easily absorbed over half of this time, we found it important to squeeze the fat from this process. By causing the entire chain to run asynchronously, when the given intake table is ready, its transform-chain would automatically launch. As fortune would have it, the longest processing chain also had small tables to load. So we could kick off those transforms very early. Some of the shortest chains also had the largest load files, so by the time saved by starting the loads as-early-as-possible.
In the end, what started out as a five-minute-plus operation shrank to 160 total seconds of time, well-inside the 3-minute SLA with some room for recovery. Here's how it looked:
Load Time = L
Transform Time = T
Original - all transforms launch when the all loads are complete
LLLL TTT TTTTT
Async form, transforms begin when their source tables are ready:
One may ask, why wouldn't we do it in the second form anyhow? It seems that this is the optimum way to process data. Well, one answer is simply : checkpointing. If we launch the transforms prior to all of the loads finishing, it is much harder to recover in case of load-failure. We would have to cancel the in-flight transforms and otherwise bring the processing to a halt. If we load it all first, then proceed, any errors can stop the processing immediately. No wasted cycles. Efficiency is important with these kinds of transform-chains in a database like Netezza. Still, if we need to shrink the total job time, the recovery-shutdown protocol (in case of load failure) is a necessary capability. Besides, our protocol does not itself finalize anything to the target database until the last transform is complete, lest the data start to arrive intermittently and out-of-context. So as long as the load time is balanced with overall transform time, shutting down before finalization is usually doable.
This of course invites the obvious question - are any of the ETL tools (or ELT offerings) attempting to squeeze out the fat in a similar manner? Methinks not, and for one reason: They, like the general-purpose databases, likewise consider themselves to be general-purpose tools. They are not philosophically committed to squeezing out latency because Netezza is the only platform that can benefit from it. The uptake in setting up and coordinating this sort of baton-like handoff, while not particularly difficult, is also non-trivial and represents a significant effort for a product tool to embrace. When compared to interfacing with the general-purpose-platforms - Netezza is not a large-enough user base to justify ramping-up this high-performance, zero-latency capability. In short, the product's features are market-driven.
The Netezza user base is growing, however, and with IBM pushing the hardware, it will grow more and faster.
Sitting at the top-end of this food chain are the big-ticket TwinFin 24/48/96 machines that do massive amounts of processing-in-the-box and want to move toward continuous models, processing data as-the-world turns. The age of the nightly batch cycle is waning and Netezza is stepping up to the vast opportunities within the "continuous" world. Latency in this world is like poison. Just because it appears to be acceptable to the general-purpose world, this is an illusion. If the general-purpose queries ever dropped into subsecond-duration, the tools facing them would need to re-tool. Actually they need to re-tool now - they just don't feel the pressure yet.
Modified on by DavidBirmingham
Many years ago we encountered an environment where the client wanted the old system refactored into the new. The "new" here being the Netezza platform and the "old" here being an overwhelmed RDBMS that couldn't hope to keep up with the workload. So the team landed on the ground with all hopes high. The client had purchased the equivalent of a 4-rack striper for production and a 1-rack Striper for development. Oddly, the same thing happened here as happens in many places. The 4-rack was dispatched to the protected production enclave and the 1-rack was dropped into the local data center with the developers salivating to get started. And get started they did.
The first team inherited about half a terabyte of raw data from the old system and started crunching on it. The second team, starting a week later, began testing on the work of the first team. A third team entered the fray, building out test cases and a wide array of number-crunching exercises. While these three teams dogpiled onto and hammered the 1-rack, the 4-rack sat elsewhere, humming with nothing to do.
We know that in any environment we encounter, with any technology we can name, the development machines are underpowered compared to the production environment. And while the production environment has a lot of growing priorities for ongoing projects, we don't have this scenario for our first project, do we? Our first project has a primary, overarching theme: it is a huge bubble of work that we need to muscle-through with as much power as possible. That "as much as possible" in our case, was the 4-rack sitting behind the smoked glass, mocking us.
And this is the irony - for a first project we have a huge "first-bubble" of work before us that will never appear again. the bubble includes all the data movement, management and backfilling of structures that we will execute only once, right? Really? I've been in places where these processes have to be executed dozens if not hundreds of times in a development or integration environment as a means to boil out any latent bugs prior to its maiden - and only - conversion voyage. But is this a maiden-and-only voyage? Hardly - typically the production guys will want to make several dry runs of the stuff too. We can multiply their need for dry runs with ours, because we have no intention of invoking such a large-scale movement of data without extensive testing.
And yet, we're doing it on the smaller machine. No doubt the 1-rack has some stuff - but I've seen cases where it might take us two weeks to wrap up a particularly heavy-lifting piece of logic. If we'd done this on the larger 4-rack, we would ahve finished it in days or less. Double the power, half the time-to-deliver (when the time is deliver is governed by testing)
In practically every case of a data warehouse conversion, the actual 'coding' and development itself is a nit compered to the timeline required for testing. I've noted this in a number of places and forms, in that the testing load for a data warehouse conversion is the largest and most protracted part of the effort. And if testing (as in our case) is largely loading, crunching and presenting the data, we need the strongest possible hardware to get past the first bubble. A data conversion project is a "testing" project more so than a "development" project, and with the volumes we'll ultimately crunch, hardware is king.
But I've had this conversation with more people than I can count. Why can't you deploy the production environment with all its power, for use in getting past the first bubble, then scratch the system and deploy for production? What is the danger here? I know plenty of people, some of them vendor product engineers, who would be happy to validate such a 'scratch' so that the production system arrives with nothing but its originally deployed default environment.
Yet another philosophy is that we would pre-configure the machine for production deployment, but nobody likes developers doing this kind of thing in a vacuum. They would rather see deployment/implementation scripts that "promote" the implementation. I'm a big fan of that, too, for the first and every following deployment. That's why I would prefer we used the production-destined system to get past the first-bubble-blues, then scratch it, and get the original environment standing up straight, and only then treat it as an operational production asset.
Most projects like this have a very short runway in their time-to-market, and we do a disservice to the hard-working folks who are doing their best to stand up this environment, They need all the power they can get, especially when they enter the testing cycle.
And for this, it's an 80/20 rule for every technical work product we will ever produce. Take a look sometime at what it takes to roll out a simple Java Bean, or a C# application, or a web site. Part of the time is spent in raw development, and part of it in testing. If I include the total number of minutes spent by the developer in unit testing, and then by hardcore testers in a UAT or QA environment, and it is clear that the total wall-clock hours spent in producing quality technology breaks into the 80/20 rule - 20 percent of the time is spent in development, and 80 percent in testing.
And if the majority of the time is spent in testing, what are we testing on Enzee space? The machine's ability to load, internally crunch and then publish the data. On a Netezza machine, this last operation is largely a function of the first two. But we have to test all the loading don't we? And when testing the full processing cycle we have to load-and-crunch in the same stream, no? What does it take to do this? Hardware, baby, and lots of it.
I can say that multiple small teams can get a lot of "ongoing" work done on a 1-rack, no doubt a very powerful environment. I can also say that a machine like this, for multiple teams in the first-bubble effort, will gaze longingly at the 4-rack in the hopes they can get to it soon, because so much testing is still before them, and they need the power to close.
What are some options to make this work? Typically the production controllers and operators don't like to see any "development" work in the machines that sit inside the production enclosure. They want tried-and-tested solutions that are production-ready while they're running. At the same time, they have no issues with allowing a pre-production instance into the environment because they know a pre-production instance is often necessary for performance testing. Here's the rub: the entire conversion and migration is one giant performance test! So designating the environment as pre-production isn't subtle, nuanced, disingenuous or sneaky - it accurately defines what we're trying to do. It's a performance-centric conversion of a pre-existing production solution, now de-engineered for the Netezza machine. As I noted, development is usually a nit, where the testing is the centerpiece of the work.
With that, Netezza gives us the power to close, to handily muscle-through this first-bubble without the blues - we only hurt ourselves with "policies" for the environment that are impractical for the first-bubble.
This brings us full-circle yet again to a common problem with environments assimilating a Netezza machine. The scales and protocols put pressure on policies, because those policies are geared for general-purpose environments. There's nothing wrong with the policies, they protect things inside those general-purpose environments. But the same policies that protect things in general-purpose realm actually sacrifice performance in the Netezza realm. Don't toss those policies - adapt them.
So I published this book last Spring ('11) on how the Netezza machine is a change-agent. It initiates transformation upon people or products that happen to intersect with it. Most of the time this transformation makes the subject better. Sort of like how heavy-lifting of weights will make the body stronger. Or the pressure can crush the subject. Stress works that way. We could imagine the Netezza machine as the change-agent entering the environment. Everything brushing against it or interacting with it will have to step-up, beef-up or adapt. I sometimes hear the new players say things like "But if the Netezza machine could only.." That's like a Buck Private saying of his drill sargeant, "If he could only ..." No, the subject must consider that the Netezza machine is never the object of transformation but rather is the initiator of it. But it's not a harsh existence by any means. Products that can adapt are far-and-away better than before. Those that cannot adapt now, will eventually, or remain in their current tier.
Having been directly or indirectly alongside these sorts of product integrations and proof-of-concepts (POCs) numerous times, it's always an interesting ride. The vendor shows up ready-to-go with visions-of-sugarplums in their head. And the suits who show up with them, are salivating for the ink on the license agreement. In less than an hour into the POC, all of them have a very different opinion of their product than when they arrived. Their bravado is reduced to a shy, sort of sheepish spin. Throw them a bone, not everyone walks out of this ring intact. Some of them shake their fist at the Netezza machine. It is unimpressed. Others shake their fist at their own product. Alas, it is but virtual, inanimate matter. What is transforming now? The person in the seat.
So I have watched them scramble to make the product hit-the-mark. Patches? We don't need no stinkin' patches. Except for today, when they will be on the phone in high-intensity conversations with their "engine-room" begging for special releases while on-site. Alas such malaise could have been avoided if only they had connected their product - at least once - to a Netezza machine. In so many cases, they will claim that they have Netezza machines in-the-shop so they are prepared-and-all-that. It is revealed, sometimes within the first hour, that the product has never been connected to a Netezza machine. It doesn't even do the basics, or address the machine correctly. It is especially humorous to hear them speak in terms of scalability as though a terabyte is a high-water mark for them. One may well ask, why are we wasting our time with underpowered technology? Well, in point of fact, when placed next to the Netezza machine it's all underpowered, so really it's just a matter of degree.
Case in point, Enzees know that in order to copy data from one database to another, we have to connect to the target database (we can only write to the database we are connected to). And then use a fully-qualifed database/tablename to grab data from elsewhere - in the "select" phrase. Forsooth, their product wants to do it like "all the others do" and connect to the source, pushing data to the target. Staring numb at the white board in realization of this fundamental flaw, they mutter "If only Netezza could....". But that's not the point. They arrived on site, product CD in hand, without ever having performed even one test on real Netezza machine, or this issue (and others) would have hit them on the first operation. They would have pulled up a chair in their labs, started the process of integration and perhaps call the potential customer "Can we push the POC off until next week? We have some issues (insert fabricated storyline here) and need to do this later."
Cue swarming engineers. Transformation ensues.
Another case in point, many enterprise products are built to standards that are optimized for the target runtime server. That is, they fully intend to bring the data out of the machine, process it and send it back. One of my colleagues made a joke about Jim Carrey's "The Grinch" and the mayor's lament for a "Grinch-less" Christmas. Well, didn't the Grinch tell Cindy-Lou Who that in order to fix a light on the tree, he would take the tree, fix it and bring it back? Seems like a lot of hassle for one light? Why can't you fix it here and not take it anywhere? Enzees see the analogy unfolding. No, we don't want to take the data out, process it and put it back. We want "Grinch-less" processing, too. Fix the data where it already is.
Why do this? Well, in 6.0 version of the NPS Host, the compression engine could easily give us up to 32x compression on the disk. Or even a nominal 16x compression, meaning that our 80 terabytes is now 5TB of storage. And while we may have to de-compress it on the inside of the machine to process it, the machine is well-suited to moving these quantities around internally. Woe unto the light-of- heart who would pull the data out into-the-open, blooming it to its full un-compressed glory, on the network, CPU, the network again - just to process it and put it back.
Unprepared for the largesse of such data stores, our vendor contender's product centers on common scalar data types. Integer, character, varchar, date. No big deal. Connect to the Netezza machine and find out that the "common" database size is in the many billions and tens of billions of rows. A chocolate-and-vanilla software product without regard to a BigInt (8 byte) data type, cannot exceed the ceiling of 2 billion (that's the biggest a simple integer can hold). This does not bode well for integrating to a database with a minmum of ten billion records and that's just the smallest table. Having integers peppered throughout the software architecture by default - would require a sweeping overhaul to remediate. As the day wears on, we see them struggle with singleton inserts (a big No-No in Netezza circles) and lack of process control over the Netezza return states and status. These are not exotic or odd, but no two databases behave the same way. The moment that Netezza returned the row-count that it had successfully copied four billion rows, we watched the product crash because it could not store the row-count anywhere - the product had standardized on integers, not big integers, so the internal variable overflowed and tossed everything overboard. Quite unfortunately, this was a data-transfer product and performed destructive operations on the data (copy over there, delete the original source over here). So any hiccup meant that we could lose data, and lots of it.
Cue announceer: "And the not-ready-for-prime-time-players..."
Oh, and that "lose data and lots of it" needs to be underscored. In a database holding tens of billions of rows (hundreds of terabytes) of structured data, that is, each record in inventory, with fiducial, legal, contractual, perhaps even regulatory wrappers around it, and we're way, way past the coffin zone. Some of you recall the "coffin zone" is the point-of-no-return for an extreme rock-face climber. Cross that boundary and you can't climb down. But we're not climbing a rock face are we? The principle is the same. Lose that data and we'll get a visit from the grim reaper. He doesn't hold a sickle, just a pink slip in one hand and a cardboard box in the other (just big enough for empty a desk-full of personal belongings).
One test after another either fails or reveals another product flaw. When the smoke clears, the "rock solid offering" complete with sales-slicks and slick-salesmen, is beaten and battered and ready for the showers. The product engineers must now overhaul their technology (transform it) and fortify it for Netezza, or remain in their tier. The Netezza machine has spoken, reset itself into a resting-stance, presses a fist into a palm, graciously bows, and with a terse, gutteral challenge of a sensei master, says: "Your Kung Fu is not strong!"
Now it's transformation-fu.
Superficially, this can look like a common product-integration firefight. But this kind of scramble tells a larger tale: They weren't really ready for the POC. This would be similar to an "expert" big-city fireman, supremely trained and battle-hardened in the art of firefighting and all its risks, joining Red Adair's oil-well -fire-fighting team ( a niche to be sure) and finding that none of the equipment or procedures he is familiar with apply any longer. He will have to unlearn what he knows in order to be effective on a radically larger scale. He might have been a superhero back home, faster than a speeding bullet, able to leap tall (burning) buildings in a single bound, but when he shows up at Red Adair's place, they will tell him to exchange his clothing for a fireproof form and get rid of the cape. Nobody's a hero around an oil-field fire. Heroes leave the site horizontally, feet-first. No exceptions.
Enzees have experienced a similar transformation (with a different kind of fire). The most-oft-asked questions at conferences are just that flavor: How do we bring newbees into the fold? How do we get them from thinking in row-based/transactional solutions into set-based solutions? How do we help them understand how to use sweeping-query-scans to process billions of rows? Or use one-rule-multiple-row approaches versus cursor-based multiple-rule-per-row? How do we get testers into a model of testing with key-based summaries instead of eyeballs-on-rows (when rows are in billions)?
We were dealing with a backup problem at one site because of a lack of external disk space. Commodity tools often use external disk space for this purpose, until they are connected to a Netezza machine and their admin tool complains that they need to add "another hundred terabytes" of workspace. We gulp, realizing that the workspace is only today a grand total of ten terabytes in size. And you need another hundred! Yeesh, you big-data-people!
Most of the universe outside the Enzee universe will never have to address problems on this scale. It is not the machine itself that is the niche. It is the problem/solution domain. Most of the commodity products that are stepping up are doing so only because it's clear that Netezza is here to stay and they need to step into Netezza's domain. I suppose at some point they expected Netezza to give them a call to start the integration process, but the Netezza Enzee Universe already had all that under control. It's amazing how lots-of-power can simplify hard tasks to the end of ignoring commodity products entirely.
Another case in point, a product vendor "popped over" with a couple of his newbee product guys and spent two weeks trying to get their product to play in-scale with Netezza. Before throwing in the towel, they offered up the common litany of observations. "No indexes? What the?" and "Netezza needs to change X", or the favorite "Nobody stores this much data in one place." The short version is, you brought a knife to a gun fight, as Sean Connery would assert, or perhaps, you brought a pick-axe and a rope to scale Mt. Everest. What were you thinking? You see, most people who have never heard of Netezza (I know, there really are folks out there who don't know about it, strange as is seems) do not understand the scale of data inside its enclosure. Billions of records? Tens of billions of records? A half-trillion records? Is that all you got?
We will watch a switch flip over in their brains as they assess what they are trying to bite off. A small group will embrace the problem and work toward harnessing the Netezza machine in every way possible. Another group will provide a bolt-on adapter for Netezza to interface to their core product engine. While another, larger group will assess the expense of such things, the marketplace they currently address, and conclude that they will for now remain in their current tier. This is like a 180-lb fighter climbing into the ring with a heavyweight, and walking away realiizing that they need to add some muscle, some speed, and some toughness or just stay in their own weight class and be successful there.
Another case-in-point is the need for high-intensity data processing in-the-box in a continuous form, coupled with the need for the reporting environment to share the data once-processed, likewise coupled with the need for backup/restore/archive and perhaps a hot-swap failover protocol. We can do these things with smaller machines and their supporting vendor software products. But what about Netezza, with such daunting data sizes, adding the complexity of data processing?
At one site we had a TwinFin 48 (384 processors) and two TwinFin 24's (192 processors) with the '48 doing the heavy-lifting for both production roles. When it came time to get more hardware, the architects decided to get another '48 and split the roles, so that one of the machines would do hard-data-processing and simply replicate-final-results to the second '48, limiting its processing impact for any given movement. This was not the only part of their plan. They then set up replicators to make "hot" versions of each of these databases on the other server. This allowed them to store all of the data on both, providing a hot DR live/live configuration, but it would only cost them storage, not CPU power. Configured correctly, neither of the live databases would know the difference. Our replicators (nzDIF modules) seamlessly operated this using the Netezza compressed format to achieve an effective 6TB/hour inter-machine transfer rate, plenty of power for incremental/trickle feeds across the machines.
Some say "I want an enterprise product that I can use for all of my databases". Well, this is the problem isn't it? Netezza is not like "all of our other databases". Products that have a smashing time with the lower-volume environments start to think that a "big" version of one of those environments somehow qualifies their product to step-up. I am fond of noting that Ab Initio, at one site loading a commodity SMP RDBMS, was achieving fifteen million rows in two hours. Ab Initio can load data a lot faster than that (and is on record as the only technology that can feed Netezza's load capacity). So what was the problem? The choice of database hardware? Software? Disk space? Actually it was the mistaken belief that any of those can scale to the same heights as Netezza. I could not imagine, for example, that if fifteen million rows would take two hours, what about a billion rows (1300 hours? ). Netezza's cruising-speed is over a million rows a second from one stream, and can load multiple streams-at-a-time.
Many very popular enteprise products have not bothered to integrate with a Netezza machine, and many of those who have, provide some form of bolt-on adapter for it. It usually works, but because the problem domain is a niche, it's not on their "product radar". It's not "integrated as-one". What does this mean? Netezza's ecosystem, and now assimilated by IBM, through IBM's product genius and sheer integration muscle, will ultimately have a powerful stack for enterprise computing such that none of the other players will be able to catch up. If those vendors have not integrated by now, the goal-line to achieve it is even now racing ahead of them toward the horizon. Perhaps they won't catch up. Perhaps they won't keep up. Some products (e.g. nzDIF) are at the front-edge, but nzDIF is not a shrink-wrapped or download-and-go kind of toolkit. We use it to accelerate our clients and differentiate our approaches. It's a development platform, an operational environment and expert system (our best and brightest capture Netezza best practices directly into the core). This has certainly been a year where we've gotten the most requests for it. But there's only one way to get a copy.
Cue Red Adair.
"No capes!" - Edna Mode, clothing-designer-for-the-gods, Disney/Pixar's The Incredibles
Within a few minutes, Neon and Ruth Guardian arrived at the office of the Architect. The label "Most Recent" was alongside the open door. Inside, they could see a high-back leather chair and the back of a man's head. Beyond him. an array of computer screens in a grid, filled and scrolling with operational information.
Neon knocked and entered. He wondered why such a complex command console was required for such a simple data warehouse "Mr. Recent?"
The Architect turned around to greet him, "Hello, Neon - I've been expecting you." He clicked his remote mouse toward the wall and several screens blinked. "But I'm afraid that Most Recent is a status, not a name." He then pointed to the floor, "Please remain standing. You won't be long."
Neon glanced toward Ruth and wanted to roll his eyes, but refrained. "I have a few questions," Neon said, pulling up a chair.
The Architect confidently smiled and said, "I suspect that you do, but keep in mind that you are still irrevocably a consultant. Some of my answers will make sense, and some won't."
Guardian leaned into Neon's ear and whispered, "Actually none of his answers will make sense."
The Architect continued with the tiniest smile, "The question you ask first will be the most important to you, but is also the most irrelevant."
Neon paused, then asked, "Why didn't you protect the company and its interests when you designed the data warehouse?"
"Okay, I was wrong - that is a very good question. Let me think about that for a moment."
"Take your time."
"I suppose the question is not about the data warehouse itself, but the outcome of its applications."
"No, it's the data warehouse itself." Neon focused.
"But the data warehouse is practically perfect in every way." He was noticeably uncomfortable.
'It's full of junk," Guardian sighed.
The Architect continued, "Look at the algorithms, the flows, the data management, the sheer muscle in the machinery. Infinitely capable and infinitely scalable." He grinned and laughed to himself, "Null, missing and invalid values are simply scattered anomalies within what is otherwise a harmony of mathematical precision."
"But you're supposed to correct those," Neon challenged, "the data warehouse process has a responsibility to scrub bad data and either exclude it or make it right."
"What if the data doesn't want to be excluded? What if we don't know how to make it right? What if it cannot be made right? What if excluded data finds a way to come back?"
"The data isn't like an animal," Neon huffed, "it doesn't have a personality. It doesn't feel pain or make decisions. It doesn't - " he paused, leaning backward.
"It doesn't," The Architect agreed, "and it isn't." He shared a long, intense eye-to-eye gaze with Neon, "you still don't understand, do you? All data is merely binary one's and zero's recorded on media, and the metadata to describe it is no different. Data or metadata - what is the difference? - it isn't real. It is a collection of magnetic signals that record the semblance of something in the real world"
There is no data Neon recalled. And if metadata itself is data, then it, too is not real. There is no data, no metadata. What then - is reality? "This is just a philosophical crock," he concluded.
"When data flows through the systems," The Architect explained, "it is not really flowing, only signals representing the data are changing in memory, organized such that the mathematical states cascade in value from one machine to the next - but it's not like water. It doesn't have - physicality."
It's not real in the physical sense, Neon's brain was on fire, but it represents reality, and affects reality - like - sorcery?
"He's doing it too," Guardian said, "talking in circles and not accepting responsibility."
"Don't reduce," Neon focused, "the data is junk, no matter how you choose to describe it, or how capable the technology is. Magnetic state leaves one place and lands in another, and you took no responsibility to make it right - is that about it?"
"Anomalies in the source data reduce its quality," the Architect asserted, "to a level that is beyond correction, but not beyond control."
"They are beyond neither," Neon corrected, "correction and control are simply a function of diligence."
"Diligence and control are easy," he said, "but correction is the realm of the data stewards. What if I correct the data? From what incorrect state to what correct state? And who decides? Who controls?"
It suddenly dawned on Neon that a major missing piece had nothing to do with the architecture, data, or technologies involved. It had to do with governance. No controlling authority existed to control quality of process or result.
"But you're not controlling it at all. Data quality is not beholden to anomalies," Guardian pointed out, "Increasing the quality - eliminating junk - is a deliberate act of will."
"Choice?" the Architect clarified, "is not relevant. The data flows. The programs run. Everything is doing what it is supposed to do. Everything is fulfilling its purpose."
"And that would be?"
"This data mart here," Neon pointed to a screen, "What is it's purpose?"
"Interesting - ", the Architect raised an eyebrow," you asked that question much sooner than any of the others. Impressive." He cleared his throat and continued, "the mart was designed to deliver business facts and dimensions."
"He didn't answer your question," Guardian said, "ask it a different way."
"I don't decide purpose any more than you choose functionality," said the Architect, "Only the users know the purpose. Why do you need to know?"
Neon was growing impatient, "And who are the users of this data mart?"
"The warehouse has developed over the years through several iterations," the Architect continued. "In each case, the basic data remained the same, only the architects and technology changed. We stabilized the staff and the technology, so now all things are safe from harm."
"All things," Neon said - "like - "
"Jobs, careers," the Architect said, "and their security. So we are not beholden to fate. You do understand fate, don't you Neon?'
"But the data -"
"The data supports our existence, but is not the reason for it."
"The technology is about to change. We're tossing out the hand-crafted scripts and replacing them with a framework environment." Neon asserted.
"Netezza?" the Architect rubbed his chin, "will change nothing. We will implement it in the same manner as all before it."
"The changing of the technology might change more than you think. It may change how you approach the problem. And reduce anomalies to zero."
"A problem is simply an opportunity without a solution. The technology serves us, and we are its master. What is the significance of technology without the technologists?"
Neon recalled his conversation with Miro the Virginian, and muttered to himself, "the slaves."
"Pardon?" the Architect queried.
"Change," Neon observed, crossing his arms, "the problem is change. Your architecture is not adaptive, so cannot account for change. It exists with the fear of change, not the adaptation to it. Your technologists are slaves to their own creation - and you are their slavemaster."
Ruth's eyes widened. She'd never seen a consultant speak to the Architect this way before.
Neon continued, "Hmm. From where I'm sitting, the only thing more significant than the data warehouse itself is -"
"Me," interjected the Architect. "Data changes shape, is delivered different ways, but we continue."
"I was about to say that the only thing more significant than the warehouse is its failure," Neon sighed, "to preserve the security and integrity of the data and fulfill known purposes of the users."
"Users, schmoozers," the Architect snickered, "Developers implement a data warehouse. Users get what they ask for."
"And not what they need?"
"Who can say what they need?"
"I can," Guardian offered, "they need to be successful. The data isn't helping them. You seem to think the data is secondary."
"Secondary only to my own importance, which cannot be questioned."
"I am an engine of rhetoric. I ask rhetorical questions that are not meant to be answered, and I give rhetorical answers that are not meant to be questioned," he stared down his inquisitor, irritated and unamused. "I am the Architect."
"Do you really?", the Architect leaned forward, now oblivious to the screens flashing all around him, "and what exactly do you see?"
"Someone on his way out," Neon said, rising, "you have a choice. Pick the door that cooperates with me and save your skin. Or pick the door that defends this nonsense and - "
"The loss will be for Red Corporation either way," the Architect said, "the loss of a sublime data architecture in favor of something simple - and maintainable by simpletons," he cleared his throat, "or the loss of very important staff and the business knowledge between their ears."
"It can be better - "
"Optimism," The architect observed, "your greatest strength - and your greatest weakness."
Neon sighed, "I'm a realistic optimist. I think things can't get any worse than they are now - "
Neon turned to leave and Ruth followed.
"I've built a lot of these," the Architect called after him, "and I will build many more. I've gotten very efficient at it!"
Neon and Ruth made a quick exit. Walking down the hallway back to their offices, Neon wondered what Ruth's role in this mess had been. "Why didn't you stop them?"
"I wasn't here," Ruth said, "to guide the effort." Ruth thought about it for a moment and said, "The leaders need good data to make sound decisions. It protects their careers, the company and ultimately my job and the reason for it. I hunt for the bad data and kill it. I hunt for the good data and preserve it."
"So your something of a data hunter?"
"More like a data protector."
"We shouldn't borrow lines from another script."
"Oh, sorry, couldn't resist. In any case, I protect things."
"I protect, " she paused for effect, "that which matters most."
Modified on by DavidBirmingham
Neon strode through the crowd, half-gliding as his long black raincoat floated behind him. Orpheus had called - and now it was time to meet with the database vendor. He made his way confidently toward the large black building.
Orpheus joined Neon outside building and both of them entered. Orpheus led him to a large lobby full of people, signed him into the visitor list, turned and said, "You're here to see Oracle. Whatever they tell you is only for you to know and nobody else. It's a non-disclosure thing." With that, he departed and left Neon amidst the others.
Neon found his way to a chair and sat down to wait quietly. Next to him was a young man who looked like he'd just graduated college. He was tapping feverishly on the keyboard of a well-worn laptop.
"Why are you here?" the kid asked.
"To see Oracle," Neon muttered, closing his eyes, "and you?"
"Metadata," the kid whispered, almost hypnotized by his own work. "If you notice," he tapped several keys and filled the screen with data, "how the information moves when the metadata changes..."
He was right. As he moved his mouse, the metadata changed, controlling the information on the screen almost like water in a pool.
"That's interesting," Neon said, "how are you doing that?"
"Making the information move like that?"
"There is no information," the kid said, "only metadata."
"But there has to be -"
"Think about it," said the kid mysteriously, "When we see an "A" on the screen, it's not really an "A", but the way the screen has been programmed to represent the digital code for "A". And below that, the "A" is a hexadecimal value, which is another representation for binary digits. And we know that binary digits are representations of electrical signals that can only be "on" or "off". Ultimately, each level is a representation of the one below it, so there is no true data, just representations of electrical signals. It's all metadata."
"There is no information," Neon whispered.
"Mr. Neon," an admin called out, and Neon left the young man to his own mysterious devices. "Oracle is down the hall to the left."
Neon entered the conference room where only a small, unassuming little woman stood pouring herself some coffee. "Oracle?"
"Shortly," she said, not turning around, "they're in the next room wrapping up. But we can talk for now."
Neon waited in the uncomfortable silence while the woman slowly stirred her coffee.
"You don't drink coffee do you?" she asked nicely.
"Not really, no."
"How did I know that?" she asked, motioning for him to sit down. She answered her own question, "Your eyes are naturally bright. You don't need the stuff."
They moved toward the large conference table while she continued, "Ever wanted to predict the future?" She sipped her coffee and paused for effect. "Or see things others cannot?"
"What do you mean?"
"Predictive analytics. Remember the little girl on the beach in Indonesia? The water rushed out to sea, and everyone thought it was odd, but still they ignored it. She knew a tsunami was coming because she'd studied it in school. She sounded the warning, saved lives - and I noticed - that none of the other tabloid prophets had anything to say about it."
Neon smiled, "That's because you can't really predict the future."
"What if you could?" she smiled, "in the same way that she did - she obviously knew what was about to happen before it happened."
"That's different. That's not predicting, that's analytics."
"In the business world, and in technology, there's no difference."
Neon paused for a moment, "Go on."
"People go about their business, everything is routine. Then something odd happens one day. Maybe it's deja-vu. Maybe something more. Ever been somewhere that the odd things were so glaringly out of place to you, but nobody else seemed to notice?"
Neon was waiting for her to make her point. Perhaps like information, there is no point, he thought.
"Have a seat," she invited, "I've got something to show you."
She produced a laptop, popped it open and said, "Look." Filling the screen were various utility scripts.
Neon's eyes got wide, "Is this what I think it is?"
"Yes," she said, "From the underground." She went on to explain that while all of the scripts had been built for utility, she'd done it under the license of a prior customer. The customer had received value for their money, and she had walked out with the reusable collateral.
"So these weren't developed with a bootleg copy of Linux?" Neon queried.
"Would it matter?" she said, "Either way, they're from the underground. How much would something like this be worth to you?"
Neon thought for a moment, "Functionally I'm sure there's value, but we'd have to retrofit these into our current naming conventions, configuration management - I mean, it would be easier to build them from scratch than accepting them from you."
"But you have to know what to build," she said mysteriously," you have to know ahead of time what you're going to need. What if these contain ideas about things you didn't know you needed, but really do?"
"Cut to the chase, just give me the insights and I'll apply them."
"Insights are valuable too," she said, "like seeing into the future. You will need some of these functions, and you won't know it until it's too late."
"It will never be too late," Neon said, "we can always add it later."
"And if you never do, you'll suffer when there could be a better way. Ignorance will enslave you."
"I'll take that risk."
"That's not the Netezza way," she said, standing, "the Netezza way - adaptive architecture - is to add it to the mix - keep it in the arsenal - knowing that you may need it because so many people often do."
"That's a little extreme - "
"Is it?" she leaned forward, "What must you have in place before developing? What are the top three pitfalls of development? What are the top three missed opportunities if you start wrong? Do you know how to apply metadata so that data doesn't really matter anymore?"
Neon shifted his weight in the chair.
"Like the little girl - predicting the future from prior knowledge. Harness the information, and harness the future."
"Nobody can amass that much information," he shook his head.
"It's not about the information," she said, "it's about how the information is used."
Neon whispered in recollection, "There is no information... only -"
"What transforms raw data into actionable information? Metadata - Behavioral and structural metadata are not the realm of tacticians," she said, "they only understand raw data. Rise above the data - connecting it with metadata - and become an information alchemist. You'll have the persona of a shaman, or a sorcerer, not simply a developer - once you master the power of metadata, and its power over data to create information."
"But data drives the processes - "
"But metadata determines which processes to drive - "
Neon's eyes widened, "I'll take that cup of coffee now."
She retrieved the coffee, using the time to explain, "In the corporate layoffs of earlier years, many IT professionals developed a deep distrust for corporate IT, to the point that most of them had gone underground, to a "1099-culture" of contractors that popped on and off the grid like ghosts. This created an underground of disavowed professionals who sometimes take extreme measures to be successful."
"What do you mean, ghosts?" Neon asked.
"You see those folks out there?" she pointed out a small window to a view of a hallway, "they work here and fulfill a purpose. Sometimes they are replaced by someone who is better or different than they are - happens all the time." She took a long draw of her coffee, "then there are those who don't really have a purpose, or their purpose is lost. They might leave the company and reappear as consultants or contractors."
"Happens all the time," Neon observed.
"Yes, but the legends that you hear - of vampire firms that suck the life blood and money from a company, or ghosts that appear on and off the books to fix spot problems, or aliens that come from far away places", she shook her head, "are part of an underground - or an underworld. In many cases these are people who have lost their original purpose - or who are trying to find one - some are even outcasts who live in the underground and supplant or assist people - " she pointed to the hallway, "like them."
Neon reflected, "Go on -"
"So there's nothing really wrong with being a ghost," she said, "I'm something of one myself. The point is - the underground is a dangerous place for unprotected hex," she grinned.
Neon raised an eyebrow.
"I mean software - and software licenses. There are bootleggers and pirates, crackers and hackers all over the globe. Vendors have wisely chosen to lock down their own licensing model to keep the vampires at bay. A form of garlic, if you will."
"And the vampires reciprocate with a model of entry by invitation only," Neon smiled.
"Or we'd have bootleg AI products," Neon finished, "But this makes it even harder for a consultant to succeed - they always have to be working for a licensed customer in order to get exposure to the product - making them even more desperate to get a copy of it themselves."
"Bingo," she said, taking another sip, "it's a pickle, no doubt about it."
"So how do we solve it?"
"What's to solve? People work their way around it," she said, then explained how one consulting firm put their daytime workforce in play and then replaced them with a night-time workforce to do unit testing for the daytime work force. The night work force was "free" to the company, but they were getting trained in a live environment.
"So they found a way," Neon leaned back in his chair.
"They always do," she said, sipping her coffee and maintaining a long silence.
"I interviewed a fellow the other day," she began, "for the third time."
"Why so many -"
"Wasn't my intention," she smiled, "he just used different names each time. The third attempt he had someone in the room with him coaching him on the answers. Desperate times call for desperate measures." she took a long sip of the coffee. "It's one thing to have the answers - it's another thing to actually know them."
"Seems like it would be easier to get a job somewhere else doing something else?"
"Where else?" she said, "and what else? The tactical underground lives for the information - bootleg or otherwise. Anything involving information will always give them work to do. They live for the "I", so to speak, the "I" in IT".
"But you said there is no information- "
"Bingo. They are living for something that is not real. The information will always change, faster than you can blink."
"So when we say there is no information, it's not that the information isn't there, only that it's passing through so quickly - "
"Like water in the plumbing. The plumbing is the metadata - it carries the water - regulates it - manages it. Which is more important, the water or the plumbing - and who makes more impact - the plumber or the one who is constantly focused on the water?"
"Here's another dilemma," she continued, "a team with ill-trained Netezza people arrives on site and does damage trying to understand how to make the technology fit into how they do things. It's how they do things that's more important to them, because they are predicting they may need a backup - like another technology and another team - to backfill if they fail."
"They are predicting their own failure," Neon said.
"Predicting is hard to do," she smiled, "but planning it is not. You can't fool me - some teams do such a bad job that I can't be convinced their failure wasn't deliberate. Netezza is so easy to use it would be like failing to cross a country road. You have to trip yourself a bunch of times - deliberately - you see what I mean?"
"Sounds almost - insidious."
She sipped her coffee, "You've seen the so-called support systems in the underground - Is this sort of like the blind leading the blind - or perhaps the blind misleading the blind?"
"Never thought about it that way, " Neon mused.
"Think about where you are now," she continued, "I bet you'd like to learn more - can you share information like: How much did you spend on your development staff? Your consulting staff? That much? That little? What was the outcome? How long did it take? that long? That quick? How do you schedule work and scope projects? etc. etc. " she waved her hand in the air, "the list goes on and on - "
"What's the answer?"
"Collaboration," she said slowly, then paused, "and someone to make it happen." She leaned forward - "a facilitator. In gaming systems, it's a sprite - something that exists only to bridge the player into the game." She leaned forward, "the real game."
Neon's eyes sparkled, "So if end users collaborate on each other's work, they could trade ideas and watch each other's backs." He stopped himself, "But that is a huge amount of information -"
"You don't learn very fast, do you?" She smiled. "It's not about the information, it's about the facilitation. As far as you're concerned, they need a broker. Someone to bridge them into the game."
Neon's cell phone beeped and he turned around to check the display. It was Ruth Guardian, and it was time to leave. He turned to the woman to apologize, but she and her laptop were gone. On her chair was a business card with the Oracle logo, and on the back was scribbled a cell phone number.
He looked around quickly, and noted that the door on the far end of the conference room was slowly closing, telling him that she'd exited through it. He darted toward the door and opened it quickly.
The hallway was empty.
Modified on by DavidBirmingham
Neon browsed the list of strange and cryptic entries rolling on the screen. "These are back doors" he said under his breath.
"Yes," said Ruth Guardian, the data quality expert for Red Corp, "Each one represents a potential hole in security for private information."
"It's worse than that."
"If this data content gets to the desktop of a decision maker," he sighed, "it could take the company in a dangerous direction."
Her eyes widened.
"And if the wrong or incomplete data appears on the customer website, the customer will assume - and rightly so - that you're not protecting their information."
"Who's in charge of applications here?"
"His name is Miro," Ruth informed, rising, "Follow me." They exited and moved quickly down the hallway. "He's was born in Virginia, but I think he spent some time overseas in his youth."
"Miro the Virginian doesn't talk like a Southerner, then?" Neon smiled.
"Maybe the South of France," she laughed.
Miro was standing on a platform watching three teams of application developers, slinging code and user screens with abandon. Miro looked like the conductor of - some strange orchestra.
"Need some of your time," Neon approached behind him.
"I can take time," Miro said casually, "if I don't take the time, how will I ever have the time?" He laughed at his own joke.
Neon peered over the shoulder of an application programmer - the complexity of the environment was mind-crushing. But something was odd. The complexity seemed to be - artificial.
"You know why I'm here."
"Yes, your arrival was heralded in company newsletters," Miro said, "but the question is - do you know why you are here?"
"I am here to reduce instability," Neon said.
"But that is a task, not a goal. What is the reason for you to do this?" He smiled and turned away, "You are here because you were sent here by the company leaders. They sent you, and you obeyed. This is the reality."
"The reality is that the environment is unstable."
"No, the only reality - is causality. Dirt breaks the screens, non?" Miro waved his hand, "We have dirt. We have screens. We have to clean the dirt. More dirt, more breakage. More code, less breakage. It's a cause and effect. Dirt is the cause. It creates one of two effects - more code or more breakage. Cause and effect."
Neon held up his hands, "I get it, alright? You have to harden the presentation layer because of all the dirt and disconnection in the data layer - "
"Watch this," Miro smirked as delivery people entered the room with boxes of pizza and two portable soda fountains. They quickly refilled the tankers on each developer's desktop and left the pizza in the middle of the room. "On the hour - every hour."
One of the developers spun around, popped open the first pizza box and hungrily wolfed down two big pieces. He washed it down by gulping his soda loudly.
"You see, " Miro pointed out, "the food and soda are like opiates. It doesn't matter how much work they have, as long as they are fed and watered." He held his hand to his ear and cupped it as though expecting to hear -
BUUUURP! The developer belched and returned to work.
"Satisfaction comes in many forms," Miro grinned, "Cause and effect. The food is the cause and the work is the effect. They don't have to understand it."
Neon recognized them as slaves of the same opiates that once were his master. "All you're doing is dealing with symptoms, not root causes," Neon asserted, "The root cause is bad data - and you've just found ways to manage the symptoms - like over-coding."
"Overcoding?" Miro corrected, "We only code what is needed to make things work. It fulfills its purpose."
Neon held up a finger as if to say hold that thought - then reached down, hit three keys - and all of the screens in the room went BSOD - Blue Screen of Death.
Miro gasped, the developers screamed and began a recovery fire drill. Miro quickly composed himself, raised an eyebrow and said "Okay, you have some skills."
"The environment is brittle because of the overcoding. Less code, fewer failure points. Cause and effect. The extra code is not necessary - it's a symptom -"
"Oh? How will you solve the problem otherwise?" Miro challenged.
Neon rolled his eyes and muttered, "Clean the data first."
"It is so simple for you, consultant," Miro huffed, "But you know nothing. The data is what it is. If you think you can change it - you must approach the Architect. He will not suffer the likes of you - "
"But all of the code, and all of the work," Neon observed, "gets in the way of serving what the users really need."
"You know what the users need?"
"Yes, well -"
"No, I don't think you do." Miro asserted mysteriously, "You think you know what they need, but do you really?"
"He's talking in circles," Ruth observed, "welcome to my world."
Neon turned to leave with Ruth on his heels.
"This isn't over," Miro snapped.
"Oh yes it is, " Neon retorted, "The leaders turned this over to me and I will put an end to it."
"I survived your predecessors, consultant " Miro called after him, "and I will survive you!"
Yep, Neon thought, I've heard that before, too.
Ruth quickly joined Neon in the hallway, "Now what?"
"Can I speak to the data architect?"
"Which one? There have been six before you. The most recent -"
"Just let me talk to him."