When introducing the Netezza platform to a new environment, or even trying to leverage existing technologies to support it, the infrastructure admins will very often have a lot of questions, especially concerning backups and disaster recovery - not the least of which are "how much," "how often," and the like. More often than not, every one of our responses will be met with a common pattern, a sentence starting with the same two words:
Case in point, when we had a casual conversation with some overseers of backup technology as a precursor to "the big meeting" we almost - quite accidentally - shut down the conversation entirely. Just the mention of "billions" of rows, or speaking of the database in "Carl Sagan" scaled terms, caused them to want to scramble for budget and market surveys of technologies that were more scalable than their paltry nightly tape-backup routines. In this particular conversation, we were talking about nightly backups that were larger than the monthly backups of all of their other systems combined. Clearly we were about to pop the seams of their systems and they wanted a little runway to head off the problem in the right way.
But what is the "right way" to perform a backup where one of the tables is over 20 terabytes in size, the entire database is over 40 terabytes in size and the backup systems require two or even three weeks to extract and store this information for just one backup cycle? Quipped one admin "It takes so long to back up the system that we need to start the cycle again before the first cycle is even partially done." and another "Forget the backup. What about the restore? Will it take us three weeks to restore the data too? This seems unreasonable."
Yes it does seem unreasonable precisely because it is - quite unreasonable. As many of you may have already discovered, the Netezza platform is a change-agent, and will either transform or crush the environment where it is installed, so voracious are its processing needs and so mighty is its power to store mind-numbing quantities of information.
The aforementioned admins simply plugged their backup system into the Netezza machine, closed their eyes, flipped the switch, and then helplessly watched the malaise unfold. It doesn't have to be so painful. These are very-large-scale systems that we are attempting to interface with smaller-scale systems. We might think that the backup system is the largest-scale system in our entire enclosure, but put a Netezza machine next to it and watch it scream like a little girl.
So here's the deal: no environment of this size should be handled in a manner that is logistically unworkable for the infrastructure hosting it. We can say all day that these lower-scaled technologies should work better, or that Netezza should pony up something to bridge the difference, but we all know that it's not that easy. Netezza has simplified a lot of things, but simplification of things outside the Netezza machine - aren't we asking a bit much of one vendor?
To avoid pain and injury, think about the things we need to accomplish that are daunting us, and solve the problem. The problem is not in the technology but in the sheer size of the information. We would have the same problem on our home computers if we had a terabyte of data to back up onto a common 50-gig tape drive - we would need twenty tapes to store it. The backup/restore technology works perfectly fine and reasonably well for a variety of large-scale purposes. We simply need to be creative about adapting it to the Netezza machine. Don't plug it in and hope for the best. Don't do monolithic backups. The data did not arrive in the machine in a monolithic manner, so why are we trying to preserve it that way? Leave large-scale storage and retrieval to the Netezza machine and don't crush the supporting technologies with a mission they were never designed for.
Several equally viable schools of thought are in play here. What we are looking for is the most reliable one. Which one will instill the highest confidence with the least complexity? The more complex a backup/restore solution becomes, the less operational confidence we have in it. If it cannot back up and restore in a reasonable time frame, we exist on a rather anxious frontier, wondering when the time will come that a restoration is actually required. We put our faith in the notion that it either won't come, or that when it does, all of the other collateral operational issues will eclipse the importance of the restoration. In other words, future circumstance will get us off the hook. There is a better way: a deterministic and testable means to truly back up and restore the system with high reliability and confidence.
On deck is the simplest form of the solution - another Netezza machine. Many of you already have a Disaster-Recovery machine in play. Trust me when I tell you that this should be fleshed out as a fully functional capability (discussed next) and then the need for a commodity backup/restore technology evaporates. Using another Netezza machine, especially when leveraging the Netezza-compressed information form, allows us to replicate terabytes of information in a matter of minutes. I don't have to point out that none of our secondary technologies can compete with this.
A second strategy requires a bit more thought, but it actually does leverage our current backup/restore technology in a manner that won't choke it. It won't change the fact that the restoration, while reliable, may be slow simply because moving many terabytes in and out of one of these secondary environments is inherently slow already.
A third strategy is a hybrid of the second, in that enormous SAN/NAS resources are deployed as the active storage mechanism for the data that is not on (or no longer on) the Netezza machine. This can be a very expensive proposition on its own. We know of sites that keep the data on SAN in load-ready form, and then load data on-demand to the Netezza machine, query the just-loaded data, return the result to the user and then drop the table. You may not have on-demand needs of this scale, but this shows that Netezza is ready to scale into it.
A fourth strategy is a hybrid of the first; that is, we still use a Netezza machine to back up our other Netezza machine, but we use the more cost-effective Netezza High Capacity server, which is less expensive than the common TwinFin (fewer CPUs, more disk drives) but otherwise behaves identically in every way to its more high-powered brethren. And honestly, in an apples-to-apples comparison between the cost of a big SAN plant to store these archives and the High Capacity server, the server wins hands-down. It's cheaper, simpler to operate, requires no special adaptation, and we can replicate data in terms guided by catalog metadata rather than adapting one technology to another.
So let's take these from the least viable to the most viable, compare and contrast them in context, and let the computer chips fall where they may.
Commodity backup/restore technology
If we want to leverage this, we need to understand that it cannot be used to perform monolithic operations. These are unmanageable for a lot of reasons:
To mitigate the above, we need to adapt the large-scale database to the backup technology by decoupling and downsizing the operation into manageable chunks. This is a direct application of the themes surrounding protracted data movement in any environment. The larger the data set, the greater the need for checkpointed operations, so that the overall event is an aggregation of smaller, manageable events. If any single event fails, it is independently restartable without having to start over. Case in point: if I have 100 steps to complete a task and they are all dependent upon one another, and the series dies on step 71, the 29 steps that follow would likely complete without incident, but I cannot run them without first completing step 71. This is what a monolithic backup buys us - an all-or-nothing dependency that is not manageable and, I would argue, is entirely artificial.
To continue this analogy, let's say that any one of these 100 steps takes only one minute. I arrive at 6 a.m. to find that step 71 died, and now I have to restart from step 1, which will cost me another 100 minutes. Even if I could restart at step 71, I am still 30 minutes away from completion.
Contrast the above to a checkpointed, independent model. If we have 100 independent steps and step 71 should die, the remaining 29 steps will still continue. We arrive at 6 a.m. to find that only one of the 100 steps died, and we are only one minute away from full completion. The difference between the two models is very dramatic:
Monolithic means we are operationally reactive when failure occurs. The clock is ticking and we have to get things back on track and keep moving. Checkpointed means we are administratively responsive when failure occurs. We don't have to scramble to keep things going. In fact, in the above example, if step 71 should fail and the operator is notified, doesn't the operator have at least half an hour to initiate and close step 71 independently of the remaining 29 steps? Operators need breathing room, not an anxious existence.
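To make that contrast concrete, here is a minimal sketch (in Python, with hypothetical step names and a hypothetical checkpoint file) of a checkpointed controller: each step records its own completion, so a rerun skips the finished work and retries only what failed.

```python
import json
import os
import subprocess

CHECKPOINT = "backup_checkpoint.json"   # hypothetical manifest of completed steps

def load_done():
    if not os.path.exists(CHECKPOINT):
        return set()
    with open(CHECKPOINT) as f:
        return set(json.load(f))

def mark_done(step, done):
    done.add(step)
    with open(CHECKPOINT, "w") as f:
        json.dump(sorted(done), f)

def run_steps(steps):
    """Run independent steps; one failure does not block the rest."""
    done, failed = load_done(), []
    for name, cmd in steps:
        if name in done:
            continue                      # checkpointed: skip work already finished
        if subprocess.call(cmd, shell=True) == 0:
            mark_done(name, done)
        else:
            failed.append(name)           # note the failure and keep going
    return failed

# Hypothetical example: 100 independent one-minute extract steps.
steps = [(f"extract_chunk_{i:03d}", f"echo extracting chunk {i}") for i in range(1, 101)]
print("failed steps:", run_steps(steps))
```

Rerunning the script after fixing the one failed step reruns only that step, which is precisely the "administratively responsive" posture described below.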
Monolithic methods are what the backup/restore technology supports de facto. If we want to perform a checkpointed operation, we have to adapt the backup/restore process to the physical or logical content of the information. We don't want to mate the backup technology directly to the database itself, so we need to adapt it.
Logical Content Boundaries
This means we have to define logical content boundaries in the data. What's that? You don't have any logical content boundaries, not even for operational purposes? Well, per my constant harping on enhancing our solution data structures with operational content, such as job/audit IDs and other quantities, perhaps we need to take a step back and underscore the value of these things, because they exist for a variety of reasons. One of them is now upon us - the operational support of content-bounded backups. It is required for scalability and adaptability, and it is not particularly hard to apply or maintain.
A more important aspect of content boundaries is the ability to identify old versus new data. If the data is carved out in manageable chunks, some will always be least-recent and some more-recent. Invariably the least-recent chunks will be identical in content no matter how many times we extract them for backup. This means we can extract and back up these chunks only once and then focus our attention on the most active data. In a monolithic model, there is no distinction between least and most active, least and most recent. In large-scale databases the least-recent data is the majority, so the monolithic backup is painfully redundant when it need not be.
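As a minimal sketch of that distinction (assuming an nzsql client on the path and a hypothetical month_key boundary column), a control script can ask the database which boundary values exist, compare them against a manifest of chunks already captured, and extract only the new or still-active ones:

```python
import subprocess

def boundary_values(db, table, column="month_key"):
    """List the distinct content-boundary values in a table (hypothetical column name)."""
    sql = f"SELECT DISTINCT {column} FROM {table} ORDER BY 1"
    out = subprocess.check_output(["nzsql", "-d", db, "-A", "-t", "-c", sql], text=True)
    return [line.strip() for line in out.splitlines() if line.strip()]

def chunks_to_extract(all_chunks, already_backed_up, active_window=6):
    """Least-recent chunks already captured never change, so skip them;
    always re-extract the most recent (active) chunks."""
    stable = set(all_chunks[:-active_window])
    return [c for c in all_chunks if c not in already_backed_up or c not in stable]
```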
Do we absolutely need content-bounded backups for all of our tables to function correctly? Of course not. But applying this as a universal theme allows us to treat all tables as part of a backup framework where all of them behave the same way. So part of this is in the capture of the data, but a larger part is the operational consistency of the solution.
Many reference tables, such as lookups, will never grow larger, and we know this. In fact, their data may remain static for many years. The tables that are tied to the application and grow or change every day we will call solution tables. They are typically fed by an upstream source, are modified on a regular basis, and any of them can grow out of control. The reference data, then, represents a very low operational risk. Why then would we not simply fold the reference data into the larger body and treat all the tables the same? There is no operational penalty for it, but enormous benefit in being able to treat all tables the same way inside a common solution.
At this point, the backup/restore framework will address all of the tables the same way, but now we have the ability to leverage rules and conditions within the framework so that special handling is available if necessary. This is a common theme in large-scale processing: handle everything as though it will grow, but accommodate exceptions with configuration rules. I'll forego this aspect for now; let's take a look at what we need in basic terms:
- An intermediate landing zone: some off-machine storage location that can hold the data while in transit - e.g., a NAS or SAN volume. This will not be the same size as the database, but some fraction of it. It is a workspace for intermediate files, not permanent storage.
- Content boundaries: a means to define and manage content boundaries; these are tags or quantities in the data itself. They need to be consistent for all tables so that the backup and restore operate the same way. Of course, if we don't have these, we need to apply them.
- A restoration database: a separate database that will be used for the restoration process, because we won't be restoring the data directly into the table from whence it came. Why not? There is an increasing likelihood that the data structures have changed since the last backup. If the structure has changed, we cannot reload the data into it. We need a way to shepherd the data back into the machine.
- Processes: processes that capture, back up, and restore the data using content boundaries and common utilities. These are typically control scripts that are easily configured, deployed, and maintained.
- Archival Database (next): Physical content replicas of the masters that hold all of the original information in a form useful for backup and restore. Whether they keep the data active/online is immaterial. Their role is to interface with the backup/restore mechanisms so that any data once made offline can easily be made online through this database.
Set up formal archival databases. Whether these reside on the same server or on a DR / High Capacity server (below) is immaterial. The point is that the data in the master tables will be actively rolled into these tables, which will form the backbone for all backup and restoration operations. We therefore replicate the masters (with streaming replication) into the archival stores and then, at some interval, perform the backup of the archives, not the masters.
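Whether the roll happens through streaming replication or a periodic copy is an operational choice. As a minimal sketch of the periodic form - assuming the archival tables mirror the masters' structure, both databases live on the same machine, and a hypothetical month_key boundary column:

```python
import subprocess

def roll_chunk_to_archive(master_db, archive_db, table, chunk, column="month_key"):
    """Copy one content-bounded chunk from the master into the archival database.
    Assumes the table already exists in the archive with the same structure."""
    sql = (f"INSERT INTO {table} "
           f"SELECT * FROM {master_db}..{table} WHERE {column} = {chunk}")
    subprocess.check_call(["nzsql", "-d", archive_db, "-c", sql])

# Hypothetical usage:
# roll_chunk_to_archive("PROD", "PROD_ARCHIVE", "FACT_CALL_DETAIL", 201006)
```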
Foreshorten the Master Tables
Now that we have a means to define content boundaries (you did apply those boundaries, right?), we can look at the database holistically for optimizations based on active data.
At one site we have ninety billion records spread across three different fact tables spanning over ten years of information. However, the end-users and principals claimed (and we verified) that the most active data on any given day is the most recent six months. Anything prior to that they would query perhaps once or twice a year for investigative purposes, but they had not tied any reports to it.
So now we have an opportunity to get agreement from the users to shorten the master tables. This is especially necessary for those fact tables. In the end, the fact tables (above) were shortened to less than four billion rows each, and they are kept trimmed on a regular basis. The original long-term data is held in an archival database that is the foundation for the backups and restores.
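A sketch of the trim itself, with hypothetical names, and with the precaution that a chunk is deleted from the master only after its row count in the archive matches:

```python
import subprocess

def row_count(db, table, predicate):
    out = subprocess.check_output(
        ["nzsql", "-d", db, "-A", "-t", "-c",
         f"SELECT COUNT(*) FROM {table} WHERE {predicate}"], text=True)
    return int(out.strip())

def trim_master_chunk(master_db, archive_db, table, chunk, column="month_key"):
    """Delete an aged chunk from the master only if the archive already holds it."""
    pred = f"{column} = {chunk}"
    if row_count(master_db, table, pred) == row_count(archive_db, table, pred):
        subprocess.check_call(
            ["nzsql", "-d", master_db, "-c", f"DELETE FROM {table} WHERE {pred}"])
    else:
        raise RuntimeError(f"archive is missing rows for {table} where {pred}; not trimming")
```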
- On a per-table basis, use the content boundaries to manufacture/compile a set of checkpointed instructions for each table to support extraction and restoration
- Execute the extractions to create flat files
- Each extraction will pull data to a Netezza-compressed flat file and also supply a companion file with the metadata instructions necessary to restore the file's contents (a sketch of such a companion file follows the list below). On a per-file basis, these instructions will include:
  - The original table definition, used to construct a template of the table in the restoration database
  - File load instructions, used to get the content back into the machine
  - A transformation command, used to move data from the restoration database into its original home
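A minimal sketch of one such companion file follows; the layout, key names, database names and nzload options are purely illustrative - the real content is whatever the restore scripts need.

```python
import json

def write_companion(meta_path, table, original_ddl, data_file, boundary_predicate):
    """Write the per-file restore instructions next to the extracted data file.
    Database names, keys, and the nzload options shown are illustrative."""
    companion = {
        "table": table,                              # original home of the data
        "original_ddl": original_ddl,                # used to rebuild an intermediate table
        "intermediate_table": f"{table}_RESTORE",
        "load_command": ["nzload", "-db", "RESTORE_DB", "-t", f"{table}_RESTORE",
                         "-df", data_file, "-delim", "|"],
        "transform_sql": (f"INSERT INTO {table} "
                          f"SELECT * FROM RESTORE_DB..{table}_RESTORE "
                          f"WHERE {boundary_predicate}"),
    }
    with open(meta_path, "w") as f:
        json.dump(companion, f, indent=2)
```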
- The files will be extracted (using content boundaries) onto the storage landing zone
- The smaller extractions can be performed in parallel, are independent and form their own checkpointed process
- The backup technology will begin its backup process once the second (metadata companion) file arrives for the set. The metadata companion file behaves as a trigger.
- Once the backup for the file-set is complete, the backup/restore deletes the file and its companion
- This decouples the backup/restore processing from the database (completely) so that it can focus solely on file-based backup and restore
- For restoration, both the content file(s) and companion metadata file(s) will be retrieved and placed on the storage landing zone (the restoration flow is sketched after this list)
- The metadata will be used to reconstruct an empty intermediate version of the original table in the Restoration Database
- Load the table using nzload
- Then the metadata will be used (even a preformulated SQL statement) to copy the data into the original target
- Throw away the intermediate load table, delete the data file and its companion metadata file
- The restoration process can be run with as many parallel streams as nzload supports
- The restoration process can also be surgical, with the more agile ability to restore data in smaller segments
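For the restoration side, here is a minimal sketch that replays one file-set using a companion layout like the one sketched earlier; the database names and the naive DDL rewrite are illustrative only.

```python
import json
import subprocess

def restore_fileset(meta_path, restore_db="RESTORE_DB", target_db="PROD"):
    """Replay one extracted file-set: intermediate table, nzload, transform, cleanup."""
    with open(meta_path) as f:
        companion = json.load(f)
    intermediate = companion["intermediate_table"]

    # 1. Build an empty intermediate version of the original table in the restoration
    #    database (a naive name substitution on the original DDL, for illustration).
    create_sql = companion["original_ddl"].replace(companion["table"], intermediate)
    subprocess.check_call(["nzsql", "-d", restore_db, "-c", create_sql])

    # 2. Load the flat file with nzload, using the instructions captured at backup time.
    subprocess.check_call(companion["load_command"])

    # 3. Copy the rows back into their original home.
    subprocess.check_call(["nzsql", "-d", target_db, "-c", companion["transform_sql"]])

    # 4. Throw away the intermediate table; the caller then deletes the data file
    #    and its companion metadata file.
    subprocess.check_call(["nzsql", "-d", restore_db, "-c", f"DROP TABLE {intermediate}"])
```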
The above protocol, while likely easy to pull off and administer, still has a number of moving parts that the Netezza-based equivalent (later) will not have.
The only difference between the above protocol and one using a SAN-based storage mechanism is the absence of a formal backup/restore technology. Rather, the SAN is the long-term storage location and we perform the incremental extractions onto it. Rather than delete the files, we keep them.
This has significant implications for the cost of the SAN. After all, if we intend to interface this to the Netezza machine, we would not want common NAS storage because it is too slow, and the vendors actively disclaim their technology as viable for data warehousing. The primary reason is that the network and CPUs are set up for load-balancing, like a transactional database, not for bulk onload/offload of data.
Not only will we need enough SAN to back up the environment, but also enough to carry fathers and grandfathers if need be (this is a policy decision). With checkpointed extracts, the father/grandfather issue is largely moot: once a checkpointed extract of older data is pulled and stored, it won't be changing, and capturing another one just like it has no value.
In this approach we leverage another Netezza machine, such as a DR server, as our backup, archival and restore foundation. It can easily hold the quantity of information involved. The difference here is in price, of course, since a fully functional TwinFin is more expensive than most common SAN installations. However, the High Capacity server (below) mitigates this pricing problem while delivering a consistent data experience.
One primary benefit of performing backups into the DR server is that it can automatically serve the role of a hot-swap server in case of failure in the primary server.
For this scenario to work, however, we would want streaming replication between the active databases and the DR server so that the data is reconciled as it is processed. This gives us a fully functional hot-swap if the primary crashes, and we can continue uninterrupted while the primary is serviced. Word to the wise on this kind of scenario, however: bringing the primary back means that it is out of sync, since the secondary took over for a span of time. So we need to be able to reverse the streaming replication to make it whole again.
Scenarios like this often embrace the practice of operationally swapping the two machines on defined boundaries, like once a month or once a quarter, where they actually switch roles each time. This allows the operations staff to gain confidence in the two machines as redundant to each other in every way. I have seen cases like this where the primary machine went down, the secondary machine kicked in seamlessly and all was well. I have also seen cases where the principals kept the DR server up to date, but when it came time to operationally switch, some important piece (usually in the infrastructure between the devices) was missing, causing the failover itself to fail. It is good to have a plan in place, but it is better to have tested the plan and verified that it actually works.
- On a per-table basis, use the content boundaries to manufacture/compile a set of checkpointed instructions for each table to support extraction and restoration
- Using a variation of the nz_migrate utility, perform the table-level extractions
- Extract data in Netezza-compressed form to a flat file (see the sketch after this list)
- Load the flat file into a restoration database in the second machine
- Perform the transform-execution to copy data from the restoration database into the target table
- The extractions, restorations and transformations can all leverage simple scripts and catalog metadata, rather than depending on deeply hand-crafted code
- The intermediate (SAN) landing zone requires less space because the data is transferred in Netezza-compressed format and is cleaned up on the fly
- Transfer is quicker because Netezza-compressed data is written to the data slices directly instead of coming through the host.
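Here is a sketch of one such table-level move using compressed, internal-format transient external tables. The USING options shown are approximate (verify them against your NPS release), the target table is assumed to already exist on the second machine, and nz_migrate packages this same pattern with far more rigor.

```python
import subprocess

def unload_chunk(source_db, table, predicate, path):
    """Write one content-bounded chunk to a compressed, internal-format file on the
    landing zone via a transient external table."""
    sql = (f"CREATE EXTERNAL TABLE '{path}' "
           f"USING (FORMAT 'internal' COMPRESS true) "
           f"AS SELECT * FROM {table} WHERE {predicate}")
    subprocess.check_call(["nzsql", "-d", source_db, "-c", sql])

def reload_chunk(target_host, target_db, table, path):
    """Reload the compressed file into the same-shaped table on the second machine."""
    sql = (f"INSERT INTO {table} "
           f"SELECT * FROM EXTERNAL '{path}' "
           f"USING (FORMAT 'internal' COMPRESS true)")
    subprocess.check_call(["nzsql", "-host", target_host, "-d", target_db, "-c", sql])
```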
A word on Netezza-compressed transfer. I wrote about this in Netezza Transformation, but it is important to highlight here. We performed an experiment moving half a terabyte scattered across a hundred or more tables from its original home to a database in another machine. The first method used a simple SQL extract into an nzload component. This process took over an hour. The second method used transient external tables with compression, coupled with an nzload in compressed mode. The entire transfer took less than six minutes, because the compressed form of the data was already 14x smaller.
In another experiment, using over 20x compression, we were able to transfer ten terabytes in less than an hour. This kind of data transfer speaks well for the streaming replication necessary for DR server operation (above), and it underscores the fact that even when transferring between Netezza machines, it's as though we haven't left the machine at all.
Netezza-based High-Capacity Server
This option is simply a form of the Netezza-based hybrid (above) but on a dedicated server designed to support backup and recovery.
The better part about this server is that it has more disk drives and fewer CPUs, making it far more cost-effective for storage than common SAN devices. Couple this with the minimal overhead required for transferring data between machines, and the ability to surgically control the content with content boundaries and catalog metadata, and we get the best of all worlds with this device.
Not only that, but it is also scalable to support storage for all of the other Netezza devices in the shop, as well as any non-Netezza device where we simply want to capture structured information for archival purposes. The High Capacity server is also queryable, meaning that even the ad-hoc folks will find value in keeping the data online and available.
Lastly, in Spring 2010, as part of the safe-harbor presentation one of the principals at IBM Netezza announced plans for a replication server. I can only imagine that this device will deliver us from any additional hiccups associated with streaming replication that we might now be doing in script or other utility control language.
At Brightlight, our data integration framework (nzDIF) has the nz_migrate techniques built directly into the flow substrate of the processing controller, as well as the enforcement and maintenance of the aforementioned content boundaries. We are actively acquiring and applying the best, most scalable and simplified approaches as a solution framework firmly lashed to a purpose-built machine. I am a big proponent of encouraging Enzees to take on these things themselves, or at least to let us coach you on how to make it happen. The solutions are simple because the Netezza platform itself is simplified in its operation. Stand on the shoulders of genius - the air is good up here.