There was a time when the term "24x7" applied more to convenience stores than to databases and transactional systems. Data management professionals still needed to ensure that their systems were reliable and responsive, but there was always a window for batch programs, ETL jobs, and ongoing maintenance. Not anymore. The Internet has put off-line off-limits, especially for the mission-critical databases whose availability global companies have come to rely on around the clock.
Database monitoring tools have evolved over the same period, but they're often designed more for reactive tuning and troubleshooting than for proactively preventing outages. Monitoring availability, on the other hand, means placing your production systems under the unrelenting gaze of intelligent agents that can spot problems while they're still minor enough to be fixed without incurring an outage.
Here's the exciting news: if you have even a moderately sized IT infrastructure, you probably already have the tools to augment your database monitoring strategy with an early warning system that’s flexible, lightweight, and surprisingly affordable.
The Internet giveth
Even while the web ran off with your maintenance window and left you with a high-volume, transactional headache, it was also kind enough to lay the groundwork for the solution. The emergence of the web posed new and immediate challenges, even to large, mature IT shops that were already managing other 24x7 applications. Early generations of HTTP server software were notoriously fragile and insecure, while other less-popular programs had even shakier reputations. Caretakers at even a modest site could expect curveballs, and woe betide those who caught one at a critical moment in the day, quarter, or holiday shopping season.
IT shops of all sizes responded to the challenge by adopting network management software such as Nagios, IBM Tivoli NetView, Big Brother, and others. These applications helped them understand what "normal" looked like on their 24x7 systems. Out of the box (or .tar file), the products focused primarily on the status of network devices and a handful of basic server resources, which was hardly complete coverage. However, the programs offered custom monitoring APIs that enabled savvy administrators to extend the reach of their network management software well beyond Layer 3, ultimately polling and measuring thousands of service points.
Keeping everything running was motivation enough for comprehensive monitoring during the wild and woolly days of Web 1.0. But today's age of regulatory compliance, service level agreements, and rightsized virtual environments has made proactive monitoring an absolute must-have technology. In response, vendors have created a wealth of compelling IT monitoring solutions, each with different strengths, specialties, and licensing options. By now, your operations team has most likely chosen a product and configured it to poll your company's production servers and network devices. The tools are waiting for you—all you need to do is walk over and pick them up.
Wiring up your databases into a network monitoring console is not as difficult as it sounds. Some platforms streamline the process by offering dedicated monitoring agents for IBM database servers, either in the base product or as a separate plug-in. Even then, you'll probably want to create a few custom monitors to watch for specific conditions. Extending an otherwise generic monitoring platform to watch over your databases may sound like a daunting task, but the learning curve isn't that steep for DBAs, and the additional coverage is bound to deliver immediate savings.
How immediate? Think days, not months. Within a week of deploying a custom service check consisting of barely a dozen lines of UNIX shell script code, operations technicians at a large international organization were able to detect an infrequent but serious anomaly in one of their databases and avert an outage that would have disrupted the company's operations for several hours across three continents.
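A custom service check of that size is little more than a measurement and a threshold comparison. Here's a minimal sketch in POSIX shell following the common plugin convention (one status line, exit 0 for OK, 1 for WARNING, 2 for CRITICAL); the metric value is stubbed in, since the real DBA command that would capture it depends on your platform:

```shell
#!/bin/sh
# Minimal service check in the common plugin convention:
# print one status line, exit 0 (OK), 1 (WARNING), or 2 (CRITICAL).

check_gauge() {
    # usage: check_gauge VALUE WARN CRIT LABEL
    value=$1; warn=$2; crit=$3; label=$4
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - $label is $value (>= $crit)"; return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - $label is $value (>= $warn)"; return 1
    fi
    echo "OK - $label is $value"; return 0
}

# Hypothetical stand-in for the real measurement command
# (for example, output parsed from a DB2 snapshot)
CURRENT_CONNECTIONS=42

check_gauge "$CURRENT_CONNECTIONS" 80 95 "connections"
STATUS=$?
```

The thresholds (80, 95) and the connection count are illustrative; the structure—measure, compare, report—is what every check in your collection will share.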
Preparing your data systems
As a data management professional, your knowledge of what to look for in each database server and its associated middleware is an essential part of an effective monitoring plan. Without your insight, the default monitoring checks available may not reveal much beyond the status of the server’s network connection and TCP service port.
Before assessing which of your service checks will or won't require custom programming, draw up a wish list of all the indicators that would warrant dedicated attention in an ideal monitoring environment, along with the DBA commands necessary to capture them. Also consider OS indicators, such as the size of the diagnostic message log and the presence of recently created core-dump files, as well as the following issues.
Timing. With your currently deployed systems in mind, estimate the alert boundaries for gauge-type indicators such as current connections and storage available. Specify acceptable per-minute rates for performance counters that are always increasing, such as lock-wait time, rollbacks, and cache overflows. Decide how often each service check will run; it could be every minute or every 5 or 10 minutes, depending on the importance of the resource and how taxing the service check is. Rank and prioritize your monitoring wish list items by their potential to disrupt production if left unchecked. Regardless of how your service checks are implemented, have a detailed plan of attack ready before jumping into the monitoring suite.
Active or passive. Next, determine if your monitoring checks need to be active or passive (from the monitoring server’s perspective). An active monitor executes on the central monitoring server and checks remote resources by polling them across the network. (For more information, see the sidebar, "Active polling without passwords.") Passive monitors run directly on the production servers and transmit their results to the monitoring server. Monitoring platforms traditionally favor active service checks, but many also support passive monitoring over a variety of network protocols.
It's quite common to mix active and passive service checks when monitoring databases and sophisticated business applications. Some implementations embrace passive monitoring for databases because it can be easier to write service checks that execute locally on the database server without needing to authenticate. It can also be easier because the database server won’t need to allow inbound connections from the monitoring server, which is often located in a less-secure network zone. You may have a preference toward one approach or the other, but don't be disappointed if that decision has already been made by your network manager and operations team.
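To make the passive model concrete, here's a sketch of a check result in the Nagios external-command format. The result line is only printed here; on a production server it would be appended to the Nagios command file or relayed with send_nsca, and the host and service names below are hypothetical:

```shell
# Build a passive service check result in Nagios external-command format.

HOST="db-prod-01"
SERVICE="DB2 Connections"
CODE=0                          # 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
OUTPUT="OK - 42 connections"
NOW=$(date +%s)

RESULT="[$NOW] PROCESS_SERVICE_CHECK_RESULT;$HOST;$SERVICE;$CODE;$OUTPUT"
echo "$RESULT"
# Production deployment (path varies by installation):
#   echo "$RESULT" >> /usr/local/nagios/var/rw/nagios.cmd
```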
Simplicity. When it comes to developing custom service checks, simpler is better. After all, each one is just a wrapper around one or two database commands, followed by some text formatting and possibly a bit of math. Stick with programming languages that are bundled with the base OS, such as bash, ksh, or Perl on Linux and UNIX; and Microsoft Windows PowerShell or Microsoft VBScript on Windows servers. Once you've written a few scripts, look for opportunities to make them even smaller and tighter by relocating commonly duplicated code routines to a reusable library.
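That reusable library can be as small as a handful of shared helpers that every check sources. A self-contained sketch follows—the library is written to a temporary file so the demo runs anywhere, but in practice it would live at a fixed path (the path in the comment is hypothetical):

```shell
# Factor duplicated routines into a library that every check sources.
LIB=$(mktemp)
cat > "$LIB" <<'EOF'
# checklib.sh - status helpers shared by all custom service checks
ok()       { echo "OK - $*"; return 0; }
warning()  { echo "WARNING - $*"; return 1; }
critical() { echo "CRITICAL - $*"; return 2; }
EOF

. "$LIB"    # each check would instead do: . /usr/local/lib/checklib.sh
MSG=$(ok "tablespace usage at 40%")
echo "$MSG"
rm -f "$LIB"
```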
Comments. Although it's always worthwhile to write code with some unknown maintenance programmer in mind, it becomes even more important when writing service checks, since they'll require immediate adjustment when something goes wrong (for example, a vendor software update suddenly breaks a check's testing logic). Your company's collection of custom monitoring scripts may be small, but they're still important and well worth the modest overhead of managing them with a version control system such as Apache Subversion, IBM Rational ClearCase, or Git. If you're monitoring any resources with passive checks, the checkout and update features provided by version control programs offer a streamlined, consistent deployment process across multiple servers.
Hierarchies. If the monitoring platform supports the concept of dependencies between resources, take the time to accurately define these hierarchies in order to reduce the amount of downstream chatter that follows a significant problem. For example, if a primary network switch fails, the servers connected to it will also be affected, but you probably won't want to wade through dozens of alerts confirming that those systems and applications are indeed unreachable while the switch is down.
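In Nagios, for example, this kind of hierarchy is expressed with the parents directive in the host definition, so the monitor suppresses alerts for hosts that are unreachable only because an upstream device is down. A sketch, with hypothetical host names and addresses:

```
define host {
    use        linux-server
    host_name  db-prod-01
    address    10.0.4.15
    parents    core-switch-01
}
```

With this in place, a failure of core-switch-01 produces one actionable alert rather than a flood of secondary ones for every host behind it.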
Achieving a culture of availability
"Rudy's Rutabaga Rule: Once you eliminate your number one problem, number two gets a promotion." –Gerald M. Weinberg
Although reducing the frequency and duration of unplanned outages through IT monitoring is an admirable outcome, your efforts don’t have to stop there. After a few rounds of defining monitors to address your most pressing problems and gaining additional insight into your system, move on to capture key performance indicators from your servers and applications. Over time, measurements of throughput, response time, and resource utilization will reveal trends that can be used to plan upgrades, drive consolidation efforts, and minimize costly over-licensing.
On the business side, many application databases can be monitored with simple, inexpensive SQL statements that provide recent counts of customer sign-ups, received orders, gross revenue, and other essential metrics to produce a veritable dashboard of compelling information. Create role-specific web accounts on your monitoring platform, compose a custom view of status indicators and performance trends that are relevant to each role, and you may find formerly contentious departments acting more like a user community once they share a common repository of availability data.
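A business-metric check follows the same shape as the infrastructure checks, just wrapped around a SQL statement. In this sketch the query result is stubbed so the logic runs anywhere; the table and column names and the floor value are hypothetical, and a real check against DB2 might capture the count with something like ORDERS=$(db2 -x "$SQL"):

```shell
# Business-metric check built from a single inexpensive SQL statement.

SQL="SELECT COUNT(*) FROM orders WHERE created_at > CURRENT TIMESTAMP - 1 HOUR"
ORDERS=57          # stubbed query result
MIN_EXPECTED=10    # alert when hourly order volume drops below this floor

if [ "$ORDERS" -lt "$MIN_EXPECTED" ]; then
    MSG="WARNING - only $ORDERS orders in the last hour"
else
    MSG="OK - $ORDERS orders in the last hour"
fi
echo "$MSG"
```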
All this may sound overly optimistic, but there is no denying that a web-accessible source providing timely, accurate statistics of your company's mission-critical systems—as well as improved availability overall—will attract more advocates to the work you do.