Vasanth Bala, Manager of Scalable Datacenter Analytics
Editor’s note: This article is by Vasanth Bala, a staff scientist at IBM’s Thomas J. Watson Research Center.
It’s inevitable. Servers crash. Applications misbehave. Even if you troubleshoot and figure out the problem, the process of problem diagnosis will likely involve numerous investigative actions to examine the configurations of one or more systems—all of which would be difficult to describe in any meaningful way. And every time you encounter a similar problem, you could end up repeating the same complex process of problem diagnosis and remediation.
In my role as manager of the Scalable Datacenter Analytics Department at IBM Research, I deal with just such scenarios, and my team and I realized we needed a way to “fingerprint” known bad configuration states of systems. This way, we could reduce problem diagnosis time by relying on fingerprint recognition techniques to narrow the search space.
CMU and other academic organizations now manage a virtual image library. For their work, CMU’s Mahadev Satyanarayanan and Gloriana St. Clair received a two-year grant from the Sloan Foundation “to support the technical development of a platform for archiving executable content and the environment in which it runs, as well as a plan for the institutionalization and ongoing sustainability of work for such an archive.”
Project Origami was thus born from this desire to develop an easier-to-use problem diagnosis system to troubleshoot misconfiguration problems in the data center. Origami, today a collaboration between IBM Open Collaborative Research, Carnegie Mellon University, the University of Toronto, and the University of California at San Diego, is a collection of tools for fingerprinting, discovering, and mining configuration information on a data center-wide scale. It uses Olive, a public-domain virtual image library and an idea created under this Open Collaborative Research program a few years ago.
Origami is also easy to use, as there is no rule language for users to learn. Instead, users give Origami an example of what they deem to be a bad configuration, which Origami fingerprints and adds to its knowledge base. Origami then continuously crawls systems in the data center, monitoring the environment for configuration patterns that match known bad fingerprints in its knowledge base. A match triggers deeper analytics that then examine those systems for problematic configuration settings.
How Origami works
Together with Carnegie Mellon University and the University of Toronto, we developed agent-less system crawlers that are able to continuously scan the configuration state of virtual servers – without requiring any scanning agents to be installed inside them. Think about these crawlers as analogous to web crawlers that silently and non-intrusively scan the contents of web documents to build a central index that can then be searched or mined for insight.
This crawling approach improves usability and security because: there is no scanning agent to install and maintain on tens of thousands of systems; and there is no agent for malware present within these systems to attack. We are now developing advanced fingerprinting technologies that use a concept called “search by example,” where the user provides an example of a problematic configuration, rather than using a complex rule language to declaratively define the details of the problem.
Such a “search by example” can also be created by first crawling a system; making some change to it that represents a configuration adjustment; then re-crawling the system, and finally asking Origami to compute the difference between the two crawled states of the system. This technique allows users to provide arbitrary system changes as examples.
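The crawl, change, re-crawl, and diff sequence described above can be sketched roughly as follows. This is a minimal illustration, not Origami's actual implementation: it assumes a crawled state can be modeled as a flat mapping from a configuration key to its observed value, and the names `Snapshot`, `ConfigDiff`, and `diff_snapshots` are hypothetical.

```python
from typing import Dict, NamedTuple, Tuple

# Assumption: a crawled state is a flat map of config key -> observed value.
Snapshot = Dict[str, str]


class ConfigDiff(NamedTuple):
    """Hypothetical container for the difference between two crawls."""
    added: Dict[str, str]                 # keys present only after the change
    removed: Dict[str, str]               # keys present only before the change
    changed: Dict[str, Tuple[str, str]]   # key -> (old value, new value)


def diff_snapshots(before: Snapshot, after: Snapshot) -> ConfigDiff:
    """Compute what differs between two crawled states of one system."""
    added = {k: v for k, v in after.items() if k not in before}
    removed = {k: v for k, v in before.items() if k not in after}
    changed = {
        k: (before[k], after[k])
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return ConfigDiff(added, removed, changed)


if __name__ == "__main__":
    # Illustrative keys and values only; not taken from a real crawl.
    before = {"sshd.PermitRootLogin": "no", "port.22": "open"}
    after = {"sshd.PermitRootLogin": "yes", "port.22": "open", "port.23": "open"}
    print(diff_snapshots(before, after))
```

In this sketch, the resulting difference (rather than the full system state) is what would be handed over as the example to fingerprint.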
What’s happening inside Origami during all of these processes? It internally computes a fingerprint of the example and stores it in a fingerprint knowledge base. A fingerprint is a collection of hashes that summarize different dimensions of the configuration data for very fast recognition. Various heuristics then adjust the relative weights of the different features comprising the fingerprint so that important features (e.g. a network port being opened) are distinguished from less important ones (e.g. a log file being modified). These heuristics reduce false alarms, so that bad configuration patterns can be distinguished from very similar patterns that are actually benign.
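To make the idea of a weighted, hash-based fingerprint more concrete, here is a minimal sketch under stated assumptions. It is not Origami's algorithm: the dimension names, the weights in `WEIGHTS`, the use of truncated SHA-256, and the overlap-based `similarity` score are all illustrative choices made for this example.

```python
import hashlib
from typing import Dict, List, Set

# Assumption: a fingerprint maps each configuration dimension to a set of hashes.
Fingerprint = Dict[str, Set[str]]

# Illustrative weights only: network changes count more than log changes.
WEIGHTS = {"network": 5.0, "packages": 2.0, "logs": 0.5}


def _hash(item: str) -> str:
    """Short, stable hash of one configuration item."""
    return hashlib.sha256(item.encode("utf-8")).hexdigest()[:16]


def fingerprint(config: Dict[str, List[str]]) -> Fingerprint:
    """Summarize each dimension of a configuration as a set of hashes."""
    return {dim: {_hash(x) for x in items} for dim, items in config.items()}


def similarity(example: Fingerprint, candidate: Fingerprint,
               weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted fraction of the example's hashes found in the candidate (0 to 1)."""
    score = 0.0
    total = 0.0
    for dim, hashes in example.items():
        if not hashes:
            continue  # an empty dimension carries no evidence either way
        w = weights.get(dim, 1.0)  # unknown dimensions get a neutral weight
        overlap = len(hashes & candidate.get(dim, set())) / len(hashes)
        score += w * overlap
        total += w
    return score / total if total else 0.0


if __name__ == "__main__":
    bad = fingerprint({"network": ["tcp/23 open"], "logs": ["/var/log/app.log modified"]})
    seen = fingerprint({"network": ["tcp/23 open"], "logs": ["/var/log/other.log modified"]})
    print(similarity(bad, seen))
```

With weights like these, a system that shares only the opened port with the bad example scores much higher than one that shares only the modified log file, which is the kind of distinction the heuristics are meant to draw.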
What’s next for Origami?
The overriding question for us is how Olive and Origami together can lead to the production of commercially viable technologies. The above-mentioned problem diagnosis of misconfiguration-related outages is one clear use for the search-by-example technology that we have developed with our OCR partners.
Another technology using Origami, under development with the University of California at San Diego, would mine many identically configured systems in the data center to automatically learn to distinguish configuration patterns that tend to produce problems from those that tend to operate well.