As more users switch from running applications on physical computer systems to running them on virtual machines, the number of virtual machine instances and virtual images in a typical IT infrastructure is growing rapidly. With large numbers of virtual images, tracking the contents and configuration of each one becomes a significant issue. When users can't easily find the virtual image they need, they create new virtual images, which makes the problem worse.
As shown in Figures 1 and 2, standardization is the key to keeping the growth of deployed virtual machines from leading to a matching growth in maintenance costs.
Figure 1. An overview of virtualization without standardization
Figure 2. An overview of virtualization with standardization
But standardization can only realistically be achieved by careful management of the library of virtual images based on a thorough understanding of what is in each virtual image.
When you know what is in each virtual image, you can:
- Find clusters of similar images and compare them to determine the differences and decide if the variations are worth keeping.
- Guide users to the image that meets their needs.
- Know with certainty that the image they need does not yet exist and needs to be created.
When reference images are established as standards, it's important that there be a place where they can be safely stored without the possibility they will be modified in an uncontrolled fashion. Giving reference images version numbers makes it possible to identify and understand how standard images change over time.
This article describes the architecture and underlying technology of the IBM Virtual Image Library and then gives in-depth details of four common image-management scenarios that the Virtual Image Library was designed to address.
The IBM Virtual Image Library is an important component of the IBM SmartCloud Provisioning product that was added as part of the recent SmartCloud Provisioning version 1.2 release. The library was developed to help users understand and manage the virtual images and deployed virtual machines in their infrastructure. The most important ways that the library helps are by providing capabilities to:
- Search for virtual images across the entire virtual infrastructure, using criteria that include name, description, operating system, installed software, and last modification time.
- Identify the "drift" between a deployed virtual machine and the virtual image that was used to create it.
- Find clusters of similar virtual images that can be consolidated into a small number of standard images.
- Define and employ a version reference for virtual images and track the history of their distribution and deployment.
Each of these key user scenarios is described in more detail later.
Structure of the Virtual Image Library
The IBM Virtual Image Library is essentially a browser-based GUI that communicates with a web application that runs within WebSphere® Application Server.
Within the library server system (see Figure 3), there are several key components:
- Graphical user interface
- REST API
- Image metadata store
- Reference repository
- Knowledge base
- Indexer
- Analytics engine
Figure 3. Virtual Image Library server system components
Let's look at each in more detail.
The GUI for the Virtual Image Library is browser-based and built using a combination of Dojo, JavaScript, and HTML elements. All user commands that are initiated from the GUI, and all the data that is displayed in the GUI, pass through the REST API of the library.
All user interaction with the Virtual Image Library is done using the REST API including the commands initiated from the GUI. Other programs and tools are able to programmatically access the library by using this interface.
To submit commands through the REST API, users must first authenticate using Lightweight Third-Party Authentication (LTPA) to obtain an LTPA token. This token is then sent with each REST API command to identify the user and verify that the user has a role recognized by the library.
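As a rough illustration of how a tool might present the token, the sketch below builds (but does not send) a REST request. It assumes the token travels in the `LtpaToken2` cookie that WebSphere Application Server conventionally uses. The host name, the `/images` path, and the `name` query parameter are hypothetical placeholders, not documented Virtual Image Library endpoints.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical base URL; the real library host and REST paths will differ.
BASE_URL = "https://vil.example.com/rest"


def build_request(ltpa_token: str, path: str, **params: str) -> Request:
    """Build a GET request that presents an LTPA token as a cookie."""
    query = f"?{urlencode(params)}" if params else ""
    request = Request(f"{BASE_URL}{path}{query}", method="GET")
    # Assumption: the token is carried in WebSphere's LtpaToken2 cookie.
    request.add_header("Cookie", f"LtpaToken2={ltpa_token}")
    request.add_header("Accept", "application/json")
    return request


req = build_request("token-from-login", "/images", name="rhel")
print(req.full_url)
```

Because the GUI goes through the same REST API, a script built this way would see the same data the GUI displays, subject to the role attached to the token.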
The image metadata store is the location where the library keeps information about the registered hypervisor managers (operational repositories) and corresponding virtual images and deployed virtual machines. This includes all metadata that the hypervisor manager can provide about each virtual image and deployed virtual machine.
The reference repository is the authoritative store for "golden master" images and guarantees that images stored in it cannot be modified.
The reference repository plays a role similar to the one a source code control server plays in software development. It provides versioning for virtual images, a point of collaboration in the development and evolution of virtual images, and a guarantee that what is checked in will not be modified.
A virtual image can enter the reference repository either by an import of a new external image from the Virtual Image Library file system or by performing a check-in operation on a virtual image in one of the operational repositories. The reference repository uses sector-based repository technology from the Tivoli® Provisioning Manager for Images product which provides disk format conversion and deduplication capabilities to minimize storage requirements.
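The article does not describe how the Tivoli technology implements deduplication, so the following is only a generic sketch of the idea: split each image into fixed-size blocks, store each distinct block once under its hash, and refuse to overwrite anything already checked in. The class name, block size, and image names are assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


class BlockStore:
    """Content-addressed store: identical blocks are kept only once."""

    def __init__(self) -> None:
        self._blocks: dict[str, bytes] = {}
        self._images: dict[str, tuple[str, ...]] = {}

    def check_in(self, name: str, data: bytes) -> None:
        if name in self._images:
            # Stored images are immutable; a change needs a new version.
            raise ValueError(f"{name} is already stored and cannot be modified")
        digests = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self._blocks.setdefault(digest, block)  # store each block once
            digests.append(digest)
        self._images[name] = tuple(digests)

    def read(self, name: str) -> bytes:
        return b"".join(self._blocks[d] for d in self._images[name])

    def stored_bytes(self) -> int:
        return sum(len(block) for block in self._blocks.values())


store = BlockStore()
base = b"A" * BLOCK_SIZE * 3
patched = base + b"B" * BLOCK_SIZE
store.check_in("A.1", base)
store.check_in("A.2", patched)
print(store.stored_bytes())
```

In this toy example the two images together hold seven blocks of data, but only two distinct blocks are stored, which is the kind of saving deduplication aims for when successive versions of an image share most of their content.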
The indexer connects to virtual images either in the reference repository or operational repositories, accesses the file systems, and gathers as much information as possible to then populate the knowledge base. The type of information gathered includes:
- Disk-level information such as partitions and file system types and sizes.
- OS information including OS type, distribution, version, and patch level.
- Product information including installed products and patch level.
- File-level information including file listings and summaries of content.
The guiding principle of the Virtual Image Library is to record what is actually present or installed in the image rather than trust user-created annotations in the virtual image metadata, which become incorrect almost as soon as they are created.
The knowledge base stores all information gathered by the indexer about the virtual images visible to the Virtual Image Library. The knowledge base also stores any information derived or inferred by the analytics engine whether that is about individual virtual images or relationships between virtual images.
The analytics engine operates on information in the knowledge base to derive additional properties about images and the relationships between them.
One example of the information derived by the analytics engine is the image-similarity metric, which quantifies how similar an individual virtual image is to any other virtual image visible in the Virtual Image Library.
The analytics engine also provides the mechanism that makes it possible to access or query information in the knowledge base about virtual images.
These functions are the basis for the core capabilities of the Virtual Image Library: deep search, comparison, and similarity analysis.
Deploying the Virtual Image Library
The deployment of the Virtual Image Library is completely non-intrusive to the virtualized environment for which it is to act as a virtual image library. The library only requires credentials to access the VMware or SmartCloud Provisioning environment to start acting as that environment's library.
The library is deployed in its own virtual machine as a web application running in WebSphere Application Server. This virtual machine requires network access to the hypervisor managers that hold the virtual images and deployed virtual machines that the library will list and analyze.
Figure 4 shows how a deployed Virtual Image Library interacts with the production environment.
Figure 4. Conceptual Virtual Image Library deployment
When a connection to a hypervisor manager such as a SmartCloud Provisioning service region is added, the first action is to discover all the virtual images and deployed virtual machines that it manages. These lists are stored in the Virtual Image Library and then all the items on the lists are available for the user to view and manipulate.
Gathering analytics information
The Virtual Image Library uses the indexer to gather information about individual virtual images. No agent is required to be installed on the target system in order to be able to do this. Instead, the virtual image is accessed so that it appears as a large, single disk. If the virtual image is in an operational repository, the disk is accessed by mounting it remotely without copying the image across the network. Indexing images remotely, without having to copy the images to the Virtual Image Library, minimizes network bandwidth requirements.
Once the virtual image is mounted and available, the indexer then traverses the file system of this disk using inventory tools and methods just as it would on a running system. Because no agent is needed, the information collection is not intrusive.
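To make the traversal concrete, here is a minimal sketch that walks a directory tree and records a size and content digest for each file. It assumes the image's file system is already mounted at a local path; the function name and the record layout are illustrative and are not the indexer's actual inventory tools. A temporary directory stands in for the mounted image.

```python
import hashlib
import os
import tempfile


def index_files(mount_point: str) -> dict[str, dict[str, object]]:
    """Walk a mounted image and record size and a content digest per file."""
    records: dict[str, dict[str, object]] = {}
    for root, _dirs, files in os.walk(mount_point):
        for filename in files:
            path = os.path.join(root, filename)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks, sockets, and other special files
            with open(path, "rb") as handle:
                # Reads the whole file at once; fine for a sketch.
                digest = hashlib.sha256(handle.read()).hexdigest()
            relative = os.path.relpath(path, mount_point)
            records[relative] = {"size": os.path.getsize(path), "sha256": digest}
    return records


# Stand-in for a mounted image: a temporary directory with one file.
with tempfile.TemporaryDirectory() as mount:
    os.makedirs(os.path.join(mount, "etc"))
    with open(os.path.join(mount, "etc", "hostname"), "wb") as f:
        f.write(b"vm01\n")
    listing = index_files(mount)
    print(sorted(listing))
```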
Gathering detailed information about every virtual image may not be required. Recognizing this fact, the Virtual Image Library splits the gathering of information into two different levels: basic indexing and full indexing.
Basic indexing gathers:
- Disk-level information such as partitions and file system types and sizes.
- OS information including OS type, distribution, version, and patch level.
- Product information including installed products and patch level.
Full indexing gathers:
- All the information gathered by basic indexing.
- File-level information including file listings and summaries of content.
By default, the Virtual Image Library does basic indexing on every virtual image it learns about. For virtual images in the operational repositories, the image is accessed remotely to accomplish the basic indexing — the virtual image remains in its original location.
Users of the image library can also request that a deployed VM be indexed, but this isn't done automatically because of the changing nature of a deployed VM. When an image is checked into the reference repository, full indexing is automatically performed.
However, full indexing isn't limited to the reference repository. Any virtual image or deployed VM in an operational repository can also be fully indexed, at the user's request.
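The indexing rules described above can be summarized as a small decision function. This is a sketch of the policy as the article states it; the enum, the function names, and the `"image"`/`"vm"` and `"reference"`/`"operational"` labels are hypothetical.

```python
from enum import Enum


class IndexLevel(Enum):
    NONE = 0
    BASIC = 1
    FULL = 2


def default_index_level(kind: str, location: str) -> IndexLevel:
    """Return the indexing performed automatically.

    kind is "image" or "vm"; location is "reference" or "operational".
    """
    if location == "reference":
        return IndexLevel.FULL   # check-in triggers full indexing
    if kind == "vm":
        return IndexLevel.NONE   # deployed VMs are indexed only on request
    return IndexLevel.BASIC      # every discovered virtual image


def effective_level(kind: str, location: str,
                    requested: IndexLevel = IndexLevel.NONE) -> IndexLevel:
    """Combine the automatic level with an explicit user request."""
    default = default_index_level(kind, location)
    return max(default, requested, key=lambda level: level.value)


print(effective_level("vm", "operational", IndexLevel.FULL).name)
```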
The analytics engine computes a locality-sensitive hash (LSH) based on the information it learns about the image. You can think of the LSH as a much-condensed version of the image contents. The powerful property of these hashes is that they can be used as a proxy in the comparison of the virtual images which they represent. The percentage similarity of any two of these hashes is an approximation of the similarity of the two images they represent. This provides a scalable way to provide similarity metrics, both by products installed, as well as image file contents, across all images in the Virtual Image Library.
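The article does not say which locality-sensitive hashing scheme the library uses. As an illustration of the general idea, the sketch below uses MinHash, a well-known LSH family in which the fraction of matching signature positions approximates the Jaccard similarity of the underlying sets. The feature names and the signature length are assumptions.

```python
import hashlib

NUM_HASHES = 256  # signature length; larger values give better estimates


def _hash(seed: int, item: str) -> int:
    """One member of a family of hash functions, selected by seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items: set[str]) -> list[int]:
    """Condense a set of features into a fixed-length MinHash signature."""
    if not items:
        raise ValueError("cannot sign an empty feature set")
    return [min(_hash(seed, item) for item in items) for seed in range(NUM_HASHES)]


def estimated_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Two images sharing 60 features out of 100 distinct ones (Jaccard = 0.6).
common = {f"pkg-{i}" for i in range(60)}
image_a = common | {f"a-only-{i}" for i in range(20)}
image_b = common | {f"b-only-{i}" for i in range(20)}
sig_a = minhash_signature(image_a)
sig_b = minhash_signature(image_b)
print(round(estimated_similarity(sig_a, sig_b), 2))
```

The scalability comes from comparing short, fixed-length signatures rather than full inventories: each image is signed once, and any pair can then be compared in time proportional to the signature length.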
Virtual Image Library user scenarios
There are four key user scenarios for which the Virtual Image Library has been specifically designed and developed:
- Deep image search
- Drift analysis
- Controlling image sprawl
- Version control and tracking of reference images
Each of these scenarios is described in detail.
Deep image search scenario
In addition to the traditional approach to virtual image search (in which a user can search for a virtual image based on image metadata such as name, time created, and user-created metadata), the Virtual Image Library provides deep image search.
Deep image search allows a user to search for a virtual image based on its actual contents, not just metadata created by users for the virtual image, ensuring that search results are accurate and not susceptible to the errors common with user-provided metadata.
Figure 5 shows how a user can search for an image or images that match requirements.
Figure 5. IBM Virtual Image Library search screens
In the Virtual Image Library a user can search for images based on:
- Location: Where the virtual image is located. Search can be restricted to return only virtual images in certain locations, allowing a user to search only for reference images or only for images in a certain operational repository or set of operational repositories.
- Image type: Search can return only virtual images or only deployed virtual machines or both.
- Name: The user-defined name for the image, or a regular expression that finds images whose names match the specified pattern.
- Description: The user-defined description for the image, or a regular expression that finds images whose descriptions match the specified pattern.
- OS type and version: The operating system type and version discovered by the indexer when the virtual image is analyzed, or a regular expression that finds images whose OS names match the specified pattern.
- Installed software, software version, and target architecture: This is based on information discovered by the indexer when the virtual image was analyzed so it represents what is actually in the image, not what is thought to be in the image. The Virtual Image Library provides a list of all installed software known in any virtual image so a user can select the software desired from the list. The search can be configured to return images with "any" of the selected software, "all" of the selected software, or "none" of the selected software.
- Last time modified: The last time the virtual image or deployed virtual machine was modified. This is a useful feature in filtering out seldom-used images.
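The way these criteria combine can be sketched as a simple filter over indexed records. The record type, field names, and sample catalog below are invented for illustration; only the regular-expression matching on names and the "any", "all", and "none" software modes come from the description above.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ImageRecord:
    name: str
    os_name: str
    software: set[str] = field(default_factory=set)


def software_matches(installed: set[str], selected: set[str], mode: str) -> bool:
    """Apply the "any" / "all" / "none" rule to the selected software."""
    if mode == "any":
        return bool(installed & selected)
    if mode == "all":
        return selected <= installed
    if mode == "none":
        return not (installed & selected)
    raise ValueError(f"unknown mode: {mode}")


def search(images, name_pattern=".*", selected=frozenset(), mode="any"):
    """Return names of images matching the name pattern and software rule."""
    pattern = re.compile(name_pattern)
    return [
        image.name
        for image in images
        if pattern.search(image.name)
        and (not selected or software_matches(image.software, set(selected), mode))
    ]


catalog = [
    ImageRecord("rhel6-web", "RHEL 6.2", {"httpd", "php"}),
    ImageRecord("rhel6-db", "RHEL 6.2", {"db2"}),
    ImageRecord("win2008-web", "Windows 2008 R2", {"iis"}),
]
print(search(catalog, name_pattern=r"^rhel", selected={"httpd", "db2"}, mode="any"))
```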
Drift analysis scenario
As soon as a virtual machine is deployed using a virtual image, the content of the virtual machine will change and "drift" from the original content of the virtual image. Most of this drift is normal, but changes such as applying patches, adding or removing software, or upgrading the level of existing software may cause the system to no longer function correctly.
When a system no longer functions correctly, the first question you should ask is, "What changed?" Drift analysis answers that question.
The diagram in Figure 6 indicates the flow.
Figure 6. Overview of drift analysis flow
Drift analysis begins by identifying the failing virtual machine and finding the virtual image from which it was deployed. This is done using information from the Family Tree feature of the virtual machine (more information about this feature is provided in the section on version control and image tracking that follows).
The virtual image should already be indexed, so the next step is to index the virtual machine. Depending upon the virtualization technology, it may be necessary to first stop or at least suspend the virtual machine before starting an indexing operation.
When the indexing completes, the virtual machine can resume while the user runs a comparison of the virtual image and the deployed virtual machine from the Virtual Image Library. Differences between the two images are listed so that the user can review the lists and decide what differences might be causing the failure.
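At its core, the comparison step is a structured diff of two inventories. The sketch below assumes each inventory is a simple mapping of product name to version; the function name and the sample data are hypothetical, and the library's real comparison covers more than installed software.

```python
def software_drift(image: dict[str, str], vm: dict[str, str]) -> dict[str, dict]:
    """Compare {product: version} inventories of an image and its deployed VM."""
    added = {p: v for p, v in vm.items() if p not in image}
    removed = {p: v for p, v in image.items() if p not in vm}
    changed = {
        p: (image[p], vm[p])  # (version in image, version in VM)
        for p in image.keys() & vm.keys()
        if image[p] != vm[p]
    }
    return {"added": added, "removed": removed, "changed": changed}


image_inventory = {"openssl": "1.0.0", "httpd": "2.2.15", "php": "5.3.3"}
vm_inventory = {"openssl": "1.0.1", "httpd": "2.2.15", "mysql": "5.1.61"}
drift = software_drift(image_inventory, vm_inventory)
print(drift)
```

Grouping the differences into added, removed, and changed items mirrors the question the user is trying to answer: which of these changes could explain the failure?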
Controlling image sprawl scenario
Image sprawl is caused by having a large number of images that are similar but not quite the same. Typically, this happens because all users may start with a common image, but they may apply different patches, an application might be installed or upgraded, or an application may be removed.
Each user captures the resulting changes and saves them as a new virtual image.
To counteract this sprawl, the Virtual Image Library supports searching for images that are similar to a specific image. By taking a standard image that has been published for general use and searching for all images that are similar to it, the image librarian can identify near-duplicates that are candidates for consolidation.
The screen capture in Figure 7 shows the results of such a search.
Figure 7. Results of a search by Similarity
All the virtual images that have been indexed are ordered from highest to lowest in similarity. The virtual images at the top of the list are the best candidates for being withdrawn and replaced.
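The ordering itself is straightforward once a similarity score is available. For readability this sketch scores candidates with exact Jaccard similarity over sets of installed products rather than with hashes; the image names and product sets are made up.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def rank_by_similarity(standard: set[str], candidates: dict[str, set[str]]):
    """Return (name, similarity) pairs, most similar first."""
    scored = [(name, jaccard(standard, items)) for name, items in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


standard_image = {"rhel6", "httpd", "php"}
candidates = {
    "dev-copy": {"rhel6", "httpd", "php", "vim"},
    "db-server": {"rhel6", "db2"},
    "exact-clone": {"rhel6", "httpd", "php"},
}
for name, score in rank_by_similarity(standard_image, candidates):
    print(f"{name}: {score:.2f}")
```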
Version control/tracking reference images scenario
As virtual images become important assets within an IT infrastructure, it becomes very important to store the virtual images in a safe place and to control modifications to the images. Each modification should be identifiable using a version number and it should be possible to retrieve each numbered version of the image.
This ability is essentially like a source code control system for images.
The Virtual Image Library provides this capability with its reference repository. Any image that is checked in to the reference repository has a copy made and stored within the repository. The image is either added to an existing image version chain or a new chain is created and then the image is assigned a version number unique within the chain.
The flow for modifying a reference image is shown in Figure 8.
Figure 8. Flow for updating a reference image for creating a new version
The box labeled A.1 represents a virtual image that was previously checked into the reference repository but which now needs updating.
So A.1 is checked out to an operational repository and booted to create a new virtual machine. The new virtual machine is modified, for example, by applying patches or installing new applications, to create a new image in the operational repository. This new image is then checked in to the reference repository to create the reference virtual image A.2 that is placed by default into the same version chain as the original image, A.1.
Version chains are defined and managed by the users of the Virtual Image Library. Although the default is to put the new A.2 image into the same version chain as A.1, that is not mandatory. The user could decide to start a new version chain with A.2 as its root or could decide to add A.2 to the end of any other version chain in the reference repository. And if a user determines that a virtual image is in the wrong version chain, the virtual image can be moved to the correct version chain.
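A toy model can make the chain operations concrete. In the sketch below, each chain hands out version numbers that are unique within that chain, and an image filed in the wrong chain can be moved to the end of another. The class, the method names, and the decision never to reuse a version number are assumptions, not the library's documented behavior.

```python
class ReferenceRepository:
    """Toy model of version chains; names and numbering are illustrative."""

    def __init__(self) -> None:
        self._chains: dict[str, dict[int, str]] = {}  # chain -> {version: image id}
        self._next: dict[str, int] = {}               # chain -> next free version

    def check_in(self, image_id: str, chain: str) -> str:
        """Add an image to the end of a chain, creating the chain if needed."""
        version = self._next.get(chain, 1)
        self._next[chain] = version + 1  # assumption: numbers are never reused
        self._chains.setdefault(chain, {})[version] = image_id
        return f"{chain}.{version}"

    def move(self, chain: str, version: int, target: str) -> str:
        """Move an image from one chain to the end of another."""
        image_id = self._chains[chain].pop(version)
        return self.check_in(image_id, target)

    def versions(self, chain: str) -> list[str]:
        return [f"{chain}.{v}" for v in sorted(self._chains.get(chain, {}))]


repo = ReferenceRepository()
first = repo.check_in("img-001", "A")   # default: same chain
second = repo.check_in("img-002", "A")
moved = repo.move("A", 2, "B")          # user decides img-002 belongs in chain B
print(first, second, moved, repo.versions("A"))
```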
Having a reference repository that supports version chains of images is the foundation for understanding and controlling important images within an IT infrastructure.
But images in the reference repository have to be checked out to an operational repository in order to be deployed and put to use. For this reason it is important to be able to record when a reference image is checked out to an operational repository and then track that image as it is used to deploy images or is copied to create other virtual images. The Virtual Image Library collects this information and displays it as the Family Tree of an image.
An example version chain and family tree are shown in Figure 9.
Figure 9. Version chain and Family Tree example
The version chain (left) contains two reference images that are identified by the icons with the gold tabs. Each time that a reference image is checked out to an operational repository, a branch is created and given a version number of the form 1.x.y, where 1.x is the version number of the reference image the branch comes from.
Every time an image is used to create a new image or used to deploy a virtual machine, a new branch is created and a version number is assigned. The Family Tree (right) shows all the connections between the images as they were copied to make new images or deployed to create new virtual machines.
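One way to picture the Family Tree is as a tree whose nodes extend their parent's version number. The article specifies only the 1.x.y form for a check-out from reference image 1.x; numbering deeper levels by appending a further component, as this sketch does, is an assumption, as are the class and function names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    version: str
    kind: str  # "reference", "image", or "vm"
    children: list["Node"] = field(default_factory=list)

    def derive(self, kind: str) -> "Node":
        """Record a check-out, copy, or deployment as a new branch."""
        child = Node(f"{self.version}.{len(self.children) + 1}", kind)
        self.children.append(child)
        return child


def render(node: Node, depth: int = 0) -> list[str]:
    """Return the tree as indented lines, one per image or VM."""
    lines = [f"{'  ' * depth}{node.version} ({node.kind})"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


reference = Node("1.2", "reference")
checked_out = reference.derive("image")  # check-out to an operational repository
vm = checked_out.derive("vm")            # deployment from the checked-out image
copy = checked_out.derive("image")       # copy of the checked-out image
second = reference.derive("image")       # a second check-out
print("\n".join(render(reference)))
```

Because every node keeps a link to the node it was derived from, walking up the tree from a failing virtual machine leads back to the image it was deployed from, which is the lookup that drift analysis relies on.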
The IBM Virtual Image Library provides sophisticated image-management capabilities that customers can use to tackle the difficult issues of understanding and controlling the contents of their virtual infrastructure. The combination of image repository management, image analytics, and a reference repository with version control makes it possible to solve virtual infrastructure problems by providing capabilities that support four key user scenarios:
- Deep image search
- Drift analysis
- Controlling image sprawl
- Version control and tracking of reference images
These features are commonly needed in organizations today.
And because the information needed to support these scenarios is collected without installing any agent in the virtual images or copying images to a central location, the IBM Virtual Image Library can be added to an organization's IT infrastructure simply and easily.
Joe Wigglesworth is a Senior Technical Staff Member at IBM and was previously the architect of the IBM Virtual Image Library component of IBM SmartCloud Provisioning. Before his work on the Virtual Image Library, he was a product architect on the Tivoli Provisioning Manager (TPM) development team, focused on virtualization technology and image management features. Joe has also been the Manager of IBM's Centre for Advanced Studies at the Toronto Lab, responsible for encouraging and facilitating joint research projects between IBM development teams and academic researchers throughout the world. He is co-author of the textbook: "Java Programming: Advanced Topics" and a recipient of the University of Toronto School of Continuing Studies' Excellence in Teaching Award.
Darrell Reimer is a Senior Technical Staff Member at IBM Research and has worked in the area of advanced virtual image management for the past four years. Darrell is the architect for the research project which through a great collaboration with Tivoli resulted in the IBM Virtual Image Library. Darrell has led a wide range of projects at IBM Research including application virtualization, automated software defect detection, and performance analysis.




