At far back as the 2007 Enzee Universe conference in Boston, people have scratched their head on the notion of a large-scale database that actually boasts about the absence of index structures. After all, indexes are the mainstay of relational databases and we simply can't get by without them, right? This is a simple example of how the power and architecture of the technology frees us to think about data loading and storage, data retrieval and in-the-box processing in a completely different way.
Firstly, it's no rumor that Netezza has no indexes. And for those of us who can't stand dealing with them, this is a huge plus. One of the Enzees at the 2011 conference asked point-blank - "What does Netezza have that Oracle does not?" and the clear answers will arise in this and later blog entries, but for now the absence of maintenance of index structures, will do just fine. What does Netezza have?
Or rather, the vendor has spent enormous thought labor into simplification of complex matters so that we don't have to deal with them. We can get to the business of - well - the business. The application data and information that we want to spend more time with if only we weren't dealing with the technical administration of the database engine. This is one of the greatest weaknesses of the traditional relational database when it comes to data warehousing in general and bulk processing in particular. It is also one of their greatest weaknesses in the space of analytics. After all, we want to apply the analytics by selecting subject areas and data sets to analyze, but if we have to stand up lots of infrastructure to support this mission we are regressing toward functional hardening rather than functional flexibility. Brittleness ensues and someone asks for a refactor, a re-architecture or whatever. When this request arrives in relation to a complex, engineered, multi-terabyte (or say tens-of-terabytes) data store, many will see it as a mind-numbing proposal. For most infrastructures, the weight and complexity of the installation has overwhelmed the logistical capacity of the humans to ever hope reeling it back in. There's not enough power in the machine to help. This could take a very long time to remediate.
Not so with an appliance. Such big-data issues are its bread-and-butter zone. If we have issues with a particular data model or information construct, the machine has the power (and then some) to get us out of the bad direction and into the new one. How did the direction get "bad" in the first place? The business changed its information priorities since the inception of the original model, and now the original no longer serves it well. The business has moved, because that's what businesses do. The analysts found opportunities in the nuggets of information-gold, and now want direct access to that gold rather than having to navigate to it or salivate while waiting for it.
One of the core features of the Netezza platform is how the data is distributed on the disk drives. Because it is an MPP, each of the disk drives is its own self-contained, shared-nothing hardware structure. Contrast this to the common SMP-based platform where the CPUs exist in one duck pond and the disk drives exist in another duck pond. The data is then couriered between duck ponds through throttled estuaries. If we issue a query, all of the data has to be pulled through this estuary and presented to the CPU ducks so that the data can be manipulated and examined in software. It is a software engine on a general-purpose platform.
However, imagine the Netezza platform where each Intel CPU is mated with special hardware (an FPGA), a bounty of RAM to manipulate and cache data, a self-contained Linux operating system, and its own dedicated disk drive. Imagine also that the disk drive is itself divided into multiple sections, where the inner sections are used for internal processes like RAID, temp space and system data but the outermost ring is where user data is stored, offering the benefit of the fastest disk speed for the oft-accessed information. All of these little attention-to-detail aspect cause each of the CPUs to run at a much more powerful factor than their common SMP counterparts. Why? Because their data is co-located with the CPU, where with the SMP we have to drag the data out of one duck pond to get it close to the CPUs that may operate on it. And with SMP there is no such thing as dedicated CPU-to-disk access. The SMP CPUs are shared-everything, but also the disk drives - shared-everything at the hardware level but shared-nothing at the functional-logical level. Netezza's CPUs are shared-nothing at the functional-logical level and shared-nothing at the CPU/disk level.
In Netezza, let's imagine putting 100 of these CPU/disk combinations to work. When we form a table on the host, the table exists logically in the catalog in one place. But it exists physically in 100 places. If we load 100 million records into the machine distributed randomly, then each of the CPUs will "own" 1 million of those records. If we issue a query to the table, each of the CPUs will operate only on the million records it can see. So for any given query, we are not serially scanning 100M records. We are scanning 1M records 100 times.
Now some may object, that we're still scanning the data end-to-end, and for plain-vanilla queries, this is true. However, I recall the first time I performed a join on two tables that were 100M records each doing a one-to-one join on unique keys, on a machine with 24 of these CPUs and the answer returned in 13 seconds. This was a plain vanilla test, so your actual mileage may vary. However, I did the same test on the same machine with 1B records joined to 1B records and the answer returned in less than 300 seconds. Imagine attempting this kind of join on a traditional relational database and expecting a return in any time to actually use the result. (and no, I did not use co-location for this test - a matter for another blog entry)
We have an additional "for free" aspect of the machine called a zone map. If data is laid down in contiguous form (the way transactions arrive in a retail point-of-sale model for example) then the machine will automatically track these blocks and keep their locations in a "zone map". If we then query the database with a given date, the machine will ignore the places where it knows the data is not, and use the zone maps to locate the range of data where it needs to search.
As as example of this, we know of a 150-billion-row table that is over 25 terabytes in size, distributed and zoned such that over 95 percent of its queries return in 5 seconds or less. For a traditional RDBMS, it would take more time than that just to get its index scans squared away so that it could even approach the storage table. This is also why the Netezza machine itself can be scaled into the petabyte zone while maintaining the same performance. No indexes are in the way to load it. No indexes are necessary to search it. Imagine now: the two most daunting aspects of data warehousing and analytics - loading the data and keeping up with the user base - have been washed away with the elimination of indexes. (Don't we turn indexes off in a traditional SMP database so it will load faster, and don't we chase a rainbow trying to index and re-index the tables to keep up with user needs?) Not with Netezza. Without indexes on the columns, all columns are fair game for searching. Without indexes in the way for loading, we can deliver information into the machine, a reprocess it while there, with no penalty from the use of indexes.
By this measure, this is an anti-indexing strategy because Netezza operates on the basic principle of where-not-to-look. In other words, if we can tell it where the data is not, it can zone-in and find the data. When we think about it, this is how a common brick-and-mortar warehouse works. If we showed up and asked for a box of nails, the attendant knows that he doesn't have to look in the parts of the warehouse that carry lawn equipment, hammers, saws or window draperies. He knows where the nails don't exist.
Contrast this to the common SMP-based indexed database, which uses exactly the opposite approach. The indexes are searched for the specific key and then this key is applied to the primary data store. This is why indexed structures in general cannot scale in the same manner as an anti-indexed structure. Keep in mind, with a Netezza platform it won't matter how much data we put on that 25-terabyte table. We could double, triple it or more - and it would always provide the answer in a consistent time frame. This is because from query-to-query - it's still not-looking in the same places, no matter how big those places get. It will still continue to ignore them because the data's not there and it already knows that.
I've had people tell me that there is no real difference in the SMP versus the MPP. The SMP, "properly configured" (they claim) is just as good as any old MPP. However, there is no "proper" way to configure a general purpose hardware architecture so that it will scale. The only way we could hope to coordinate these CPUs is in software, and the only way we can get data into the CPUs is by accessing the shared disk drives in another duck pond on completely different hardware. The SMP configuration is by definition the wrong configuration for scalable bulk processing of any kind, so there is no way to "properly configure" something that already the inappropriate storage mechanism. This would be no different than claiming that a "properly configured" VW Bug (Herbie notwithstanding) could be just as fast as a stock car. The VW Bug is not the wrong platform for general purpose transportation. But it's the wrong platform for model requiring high-scale and high-performance, just an an SMP-based RDBMS cannot scale with the same factor (for set-based bulk processing) as a machine (Netezza) specifically built for this mission. Only a purpose-built machine can ultimately harness this level of data, and only an appliance can remove the otherwise mind-numbing complexity of keeping it scalable.
In the next weeks leading up to IOD (where I will be speaking on most-things Netezza) I'll offer up some additional insights on the internals of the architecture and how it differs from the traditional platforms.