You can use logical partitions to host different operating systems, and you can create them by allocating hardware resources that meet the requirements of the operating system and applications in that partition. Virtualization in System p™ provides strict isolation between different logical partitions. An important feature of virtualization is the dynamic configuration and reconfiguration of the logical partitions without the need to reboot the entire physical machine or disturb other partitions. Support for Micro-Partitioning™ and virtual I/O are new features specific to POWER5™.
Virtualization requires support from the hardware, the firmware, and the operating system. Hardware support for virtualization includes the introduction of new registers in POWER5 and the design of the interrupt control hardware. The POWER Hypervisor is the firmware component for virtualization in POWER5. This article addresses the role of the POWER Hypervisor and how it facilitates performance in a partitioned environment. Simultaneous Multi-Threading is a feature in POWER5 that enables independent instruction streams or threads to execute simultaneously on the same physical processor. The AIX® operating system offers features to support virtualization, including partition performance monitoring commands. This article also discusses the hardware, firmware, and operating system support for virtualization in System p.
What is virtualization?
In the computing world, virtualization is a mechanism of abstracting physical resources. It provides a logical way of viewing physical resources, independent of their organization, implementation, or geographic location.
Introduction to para-virtualization
Para-virtualization, a type of virtualization implemented in the System p, defines a new layer called the hypervisor. This layer resides on top of the hardware and operates with the hardware through a set of low-level routines called hypervisor calls. The operating system interfaces with the hypervisor through these hypervisor calls.
Para-virtualization in the System p is realized through logical partitioning. This is different from physical partitioning, where resources and hardware are divided along physical boundaries to create partitions. In physical partitioning, each partition might run the same operating system or a different version of the same operating system. I/O resources, processors, and memory are not shared between partitions and, hence, there is complete isolation.
In logical partitioning, a logical partition is created with some physical processors, memory, and I/O devices. There is no rule for the quantity of resources to be allocated to one partition, though there are minimum resources required to make a partition. Thus, physical processors, memory, and I/O are divided among one or more partitions. Resources can be allocated in any proportion to each other. Some resources can be shared across all partitions, such as the power supply. The total number of partitions that can be created depends on the system processor model and resources available.
A single partition, also referred to as an LPAR, can communicate with other partitions as if each were a separate machine. Each partition can be activated and restarted independently of other partitions.
Figure 1. Logical partitioning
Each logical partition can run a separate operating system or a separate version of the same operating system, but the software is isolated between the different partitions. If the operating system fails in one partition, there is no impact on the operating system running in other partitions. Similarly, applications running in one partition have no impact on applications running in other partitions.
All partitions are aware of the entire physical memory space. However, each partition has its own memory space, and there is no performance impact. Address isolation is maintained by separate address mapping mechanisms.
Resources can be dynamically assigned and reassigned among different partitions, without disturbing other partitions or requiring a reboot of the partitions. Changes to physical resources and reassignments can also be transparent to software. This transparency is significant when a physical CPU allocated to a logical partition fails in the midst of an executing application. The replacement of the ailing physical CPU is transparent to the application running on the CPU.
Introduction to the POWER Hypervisor
The POWER Hypervisor, also referred to as phyp from POWER5, is an important component for implementation of para-virtualization in System p. It is a global firmware image residing outside the partition memory in the first physical memory block at physical address zero. It takes control when the system is powered on and gathers information of how much memory, I/O, and other resources are present in the system. The POWER Hypervisor owns and controls the resources that are global to the system, and it is responsible for setting up logical partitions and defining partition boundaries. It tracks the resources allocated to partitions and provides isolation between partitions.
The POWER Hypervisor performs virtual memory management using a global partition page table, and manages any attempt by a partition to access memory outside its allocated limit. We will discuss in more detail the capabilities of the POWER Hypervisor, including dispatching and memory management in subsequent sections.
Partition-specific firmware support
The partition-specific firmware instance takes care of firmware activities specific to the partition. It locates the operating system image in the partition, loads the boot image into memory, and transfers control for booting. It also generates the device tree specific to the partition. Thus, the operating system is aware of the devices it possesses and uses these devices. Each device tree has only devices assigned to that partition.
The partition-specific firmware provides a set of services specific to AIX and System p called Run-Time Abstraction Services (RTAS). The firmware abstracts specific attributes of the hardware through these services. The operating system calls these services rather than manipulating hardware directly, which reduces the need to modify the operating system for each platform as hardware changes. Future changes to the hardware affect only the RTAS services.
Dedicated and shared partitions
You can create two types of partitions in the System p using the logical partitioning concept. They are dedicated and shared partitions.
As indicated in the section on logical partitioning, physical processors are logically divided in para-virtualization. This results in three types of processors based on the type of division: dedicated, shared, and virtual. This section also presents the role of the hypervisor in processor execution.
When a whole physical processor is dedicated to a single logical partition, it is called a dedicated processor. A physical processor shared across more than one logical partition is called a shared processor. In other words, only a partial physical processor is allocated to a logical partition in shared processor mode. This partial physical processor that is allocated is called a virtual processor.
Each virtual processor has capacity ranging from 10 percent of the physical processor up to the entire physical processor. Processing power is defined in capacity increments of 0.01 processing units. The power of 1.00 processing unit is equal to one physical processor.
A dedicated partition consists of dedicated physical processors, whereas in shared partitions, the processing power of the physical processors is shared among a set of partitions.
When sharing processors, you must decide how many processing units to assign to each partition and across how many partitions the processing power is to be distributed. A system can have multiple partitions sharing the same set of processors, dividing the processing capacity among them.
Consider the following example to understand dedicated and shared processors. Assume that there are four physical processors:
P1 and P2 are dedicated processors, and P3 and P4 are shared processors.
The processor pairs P1 and P2, and P3 and P4, have 2.0 processing units each. You can allocate P1 and P2 to the LPAR1 partition, and you can share P3 and P4 among the LPAR2 and LPAR3 partitions.
Between 0.1 and 2.0 processing units are available to LPAR2 and LPAR3. The combined power of P3 and P4 is said to belong to a shared pool. That is, the total processing cycles of LPAR2 and LPAR3 belong to the shared pool. The maximum capacity that can be shared is 2.0. The minimum capacity that has to be allocated per partition is 0.1.
LPAR2 is allocated one virtual processor with 0.5 processing units. The remaining 1.5 units are allocated to LPAR3, which has two virtual processors. Each virtual processor in LPAR3 has 0.75 processing units.
Figure 2 depicts this example.
Figure 2. Virtual processor
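The arithmetic of this example can be checked with a short sketch. The class and function names below are hypothetical and exist only for illustration; the limits of 0.1 and 1.0 processing units per virtual processor are taken from the description above.

```python
from dataclasses import dataclass
from typing import List

MIN_UNITS_PER_VP = 0.10   # smallest capacity of one virtual processor (from the text)
MAX_UNITS_PER_VP = 1.00   # one whole physical processor (from the text)
EPSILON = 1e-9            # tolerance for floating point comparisons


@dataclass
class SharedPartition:
    """Hypothetical record of a shared partition's allocation."""
    name: str
    entitled_units: float      # processing units assigned to the partition
    virtual_processors: int    # number of virtual processors in the partition

    def units_per_vp(self) -> float:
        """Capacity of each virtual processor, assuming an even split."""
        return self.entitled_units / self.virtual_processors


def validate_pool(pool_units: float, partitions: List[SharedPartition]) -> None:
    """Raise ValueError if the allocation breaks the limits described above."""
    total = sum(p.entitled_units for p in partitions)
    if total > pool_units + EPSILON:
        raise ValueError(f"pool over-committed: {total:.2f} > {pool_units:.2f}")
    for p in partitions:
        per_vp = p.units_per_vp()
        if per_vp < MIN_UNITS_PER_VP - EPSILON or per_vp > MAX_UNITS_PER_VP + EPSILON:
            raise ValueError(f"{p.name}: {per_vp:.2f} units per virtual processor")


# The shared pool built from P3 and P4 holds 2.0 processing units.
lpar2 = SharedPartition("LPAR2", entitled_units=0.5, virtual_processors=1)
lpar3 = SharedPartition("LPAR3", entitled_units=1.5, virtual_processors=2)
validate_pool(2.0, [lpar2, lpar3])
```

Here `lpar3.units_per_vp()` evaluates to 0.75, matching the figure quoted for LPAR3.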
The processing power to be given to a partition is based on the purpose of the partition's applications.
For shared partition types, the allocated physical processor capacity is called entitled capacity.
There are two types of partition modes—capped and uncapped. If a shared partition has consumed all its allocated capacity, then it can consume idle or unused cycles from the shared pool through hypervisor calls. This is possible if the shared partition is configured as uncapped.
If a shared partition is configured as capped, it cannot use any idle or unused cycles from the shared pool. All dedicated partitions are capped by default.
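The difference between the two modes can be sketched as a small function. This is a simplified model under our own assumptions (a single accounting interval and a hypothetical `pool_idle` figure); it does not describe how the hypervisor actually arbitrates between competing uncapped partitions.

```python
def consumable_capacity(entitled: float, demand: float,
                        pool_idle: float, uncapped: bool) -> float:
    """Processing units a shared partition can consume in one interval.

    entitled  -- entitled capacity of the partition
    demand    -- capacity the partition's workload would like to use
    pool_idle -- idle capacity currently left in the shared pool (assumed known)
    uncapped  -- True if the partition is configured as uncapped
    """
    if demand <= entitled:
        return demand                 # the entitlement covers the workload
    if not uncapped:
        return entitled               # capped: never beyond the entitlement
    extra = min(demand - entitled, pool_idle)
    return entitled + extra           # uncapped: borrow idle pool cycles
```

For instance, a partition entitled to 0.5 units that wants 1.0 units while 0.25 units are idle in the pool gets 0.75 units when uncapped and 0.5 units when capped.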
The hypervisor stores the entire processing cycles of a virtual processor in the shared pool. It is responsible for dispatching virtual processors on specific physical processors. The two hypervisor calls that are used during dispatching are hcede and hconfer.
If the dispatched virtual processor completes its work before the allocated cycles are over, the operating system calls the hcede hypervisor call for handing over the remaining cycles to the hypervisor. The hypervisor uses these cycles for some of its own tasks, such as dispatching or memory management. In case the virtual processor gets new work in the same cycle, the hypervisor gives the unused cycles back.
hconfer is a hypervisor call that is used by the operating system in a shared partition to confer certain processor cycles of one virtual processor to another specific virtual processor in the same partition—it recognizes that the second virtual processor is in need of the excess cycles that the first virtual processor has. For example, assume one virtual processor is holding a lock and does not have enough cycles to release that lock. If another virtual processor needs that lock and has excess cycles, it confers those to the first processor through this call.
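The bookkeeping behind these two calls can be pictured with a toy model. Everything in the sketch below (the class, its methods, and the cycle counts) is hypothetical and only mirrors the behavior described in the text; it is not the real hypervisor call interface.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyDispatcher:
    """Toy cycle bookkeeping; not the real POWER Hypervisor interface."""
    partition_of: Dict[str, str] = field(default_factory=dict)
    remaining: Dict[str, int] = field(default_factory=dict)
    returned_to_hypervisor: int = 0

    def dispatch(self, vp: str, partition: str, cycles: int) -> None:
        """Give a virtual processor its cycles for the current window."""
        self.partition_of[vp] = partition
        self.remaining[vp] = cycles

    def cede(self, vp: str) -> int:
        """The virtual processor has no work left: hand back unused cycles."""
        unused = self.remaining[vp]
        self.remaining[vp] = 0
        self.returned_to_hypervisor += unused
        return unused

    def confer(self, donor: str, target: str) -> int:
        """Pass the donor's unused cycles to a virtual processor in the same partition."""
        if self.partition_of[donor] != self.partition_of[target]:
            raise ValueError("cycles can only be conferred within one partition")
        given = self.remaining[donor]
        self.remaining[donor] = 0
        self.remaining[target] += given
        return given
```

In the lock scenario above, the waiting virtual processor would be the `donor` and the lock holder the `target`.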
Dedicated partitions that have unused processor capacity can donate it to the shared pool, which improves performance and CPU utilization on systems with dedicated partitions. Dedicated partitions have an attribute and donation flag that can be set and used to determine whether the partition can donate its unused cycles to the shared pool.
From POWER5 onwards, the hypervisor might forcibly steal cycles from a dedicated partition to do hypervisor work. Though the hypervisor generally steals cycles when the processors in the partition are idle, it might also steal them when the processors in the partition are waiting for some hypervisor activity to complete. In cases where stealing idle cycles is not sufficient for hypervisor work, it can also borrow cycles when a processor is busy. This stealing of cycles is completely independent of whether donation is enabled on the processor and of any partition settings.
The type or number of processors and their power alone do not decide the performance of a system. Effective utilization of the available hardware resources plays a key role in the performance of a system.
This emphasis was recognized and Simultaneous Multi-Threading was introduced in POWER5 systems.
To discuss Simultaneous Multi-Threading, you need to understand the normal execution of a single thread in a processor. See Figure 3 below.
Figure 3. Processor execution
Here FX, FP, and BRX are various hardware execution units. A physical processor is organized as different execution units at the hardware level, for example, fixed point and floating point operation units. A single thread executes through any one of these execution units.
In Simultaneous Multi-Threading, two independent instruction streams (threads) from the same partition are made to execute simultaneously on the same physical processor in different hardware units. This is achieved through pipelining at the hardware level.
Simultaneous Multi-Threading ensures that all different units are utilized simultaneously. In POWER5, you can have a maximum of two Simultaneous Multi-Threading threads per processor. The physical processor takes care of synchronization issues between the two threads.
Figure 4 shows a sample execution of processor cycles in a Simultaneous Multi-Threading environment.
Figure 4. Processor cycles in a Simultaneous Multi-Threading environment
More hardware changes
Apart from the changes described so far, the additional hardware changes for implementing logical partitioning are detailed in this section.
Interrupt controller hardware
Figure 5 indicates interrupt controller hardware in a symmetric multiprocessing system (SMP system).
Figure 5. Interrupt controller hardware in an SMP system
The interrupt controller hardware sends interrupts to any CPU. It chooses which CPU to send interrupts to based on factors such as optimal performance.
In a partitioned system, the interrupt cannot be sent to just any processor. The interrupt controller hardware needs to recognize the source of the interrupt and which partition should receive that interrupt. Figure 6 shows an example of how an interrupt is handled in a partitioned environment.
Figure 6. Interrupt in a partitioned environment
This is implemented by having one interrupt queue for each partition. In the POWER5 system, there are 16 interrupt queues. The firmware associates CPUs with interrupt queues during partition setup. During adapter setup, queues specific to the adapter are specified. Thus, specific adapters are associated to specific logical partitions.
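The association between adapters, interrupt queues, and partitions can be sketched as a pair of lookup tables. The class below is a toy model with hypothetical names; only the figure of 16 queues and the two setup phases come from the description above.

```python
from typing import Dict, List, Tuple


class InterruptRouter:
    """Toy model: one interrupt queue per partition, adapters bound to a queue."""

    def __init__(self, num_queues: int = 16) -> None:
        self.num_queues = num_queues
        self.queue_of_partition: Dict[str, int] = {}
        self.cpus_of_queue: Dict[int, List[int]] = {}
        self.queue_of_adapter: Dict[str, int] = {}

    def setup_partition(self, partition: str, cpus: List[int]) -> int:
        """Partition setup: reserve a queue and associate the partition's CPUs with it."""
        if len(self.queue_of_partition) >= self.num_queues:
            raise RuntimeError("no free interrupt queue")
        queue = len(self.queue_of_partition)
        self.queue_of_partition[partition] = queue
        self.cpus_of_queue[queue] = list(cpus)
        return queue

    def setup_adapter(self, adapter: str, partition: str) -> None:
        """Adapter setup: bind the adapter to the queue of its owning partition."""
        self.queue_of_adapter[adapter] = self.queue_of_partition[partition]

    def route(self, adapter: str) -> Tuple[int, List[int]]:
        """Return the queue and the candidate CPUs for an interrupt from this adapter."""
        queue = self.queue_of_adapter[adapter]
        return queue, self.cpus_of_queue[queue]
```

An interrupt raised by an adapter is therefore only ever delivered to CPUs that belong to the partition owning that adapter.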
The following new hardware registers were introduced from POWER4 for implementing logical partitioning.
Real mode offset—RMO register
The RMO register is introduced for every partition to reference logical zero in the partitioned memory space.
In a partitioned environment, the entire physical memory is shared among all partitions and the hypervisor resides in physical zero. The real address is the address generated when virtual to physical address translation is disabled. Real addressing mode is indicated by bits in the Machine Status Register (MSR) and is used by system startup code that runs before Virtual Memory Manager (VMM) is configured and is also used when handling interrupts.
When a partition accesses memory in real mode, it needs to reference offset zero. The RMO register lets each partition reference address zero and still reach a valid and unique address in physical memory. It contains the offset of the partition's memory slots from physical zero. Whenever memory is accessed in real mode, the physical processor adds the value of the RMO register to the partition's specific real address so that it references a true address in the physical memory. The operating system is not aware of this translation. All processors in the same partition have the same value in the RMO register.
Real mode limit—RML register
This register is used to limit the amount of memory that a partition can access in real mode. All processors in the same partition have the same value in the RML register.
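The combined effect of the RMO and RML registers on a real-mode access can be sketched as follows. The function name and the exception are hypothetical, and treating the RML value as a size in bytes is our simplifying assumption; the text only states that it limits the amount of memory accessible in real mode.

```python
class RealModeAccessError(Exception):
    """Raised when a real-mode address falls outside the partition's limit."""


def real_to_physical(real_addr: int, rmo: int, rml: int) -> int:
    """Translate a partition's real-mode address to a physical address.

    rmo -- offset of the partition's memory from physical zero
    rml -- amount of memory reachable in real mode (assumed to be in bytes)
    """
    if real_addr < 0 or real_addr >= rml:
        raise RealModeAccessError(f"address {real_addr:#x} exceeds real mode limit")
    return rmo + real_addr   # the processor adds the offset transparently


# Hypothetical partition whose memory starts 256 MB above physical zero.
RMO = 0x1000_0000
RML = 0x0800_0000            # 128 MB reachable in real mode
partition_zero = real_to_physical(0x0, RMO, RML)
```

The partition's address zero lands at physical address `0x10000000`, while the hypervisor keeps physical zero for itself.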
Logical partition identity—LPI register
This register contains a value that indicates the partition to which the processor is assigned. All processors in the same partition have the same value in the LPI register.
Processor utilization resource—PUR register
A new register called PUR is introduced for accurate processor utilization measurements.
Because of donation or sharing, the cycles allocated to a virtual processor might differ from the cycles actually consumed. Measuring physical processor utilization based on the cycles allocated to the CPU would therefore not be accurate. Hence, the PUR register, introduced in POWER5, stores the actual cycles consumed by each virtual processor. When two threads are using the same processor, as in the case of Simultaneous Multi-Threading, this register is used to calculate how much physical processor each thread actually utilizes over the same time base. The AIX performance monitoring commands use the PUR register to calculate processor utilization.
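A utilization figure can be derived from two samples of the register. The sketch below assumes that the PUR value and the time base are read at the start and end of an interval and that both count in the same units; the function name and the sample values are hypothetical.

```python
def physical_utilization(pur_start: int, pur_end: int,
                         tb_start: int, tb_end: int) -> float:
    """Fraction of a physical processor consumed during the sampling interval."""
    elapsed = tb_end - tb_start
    if elapsed <= 0:
        raise ValueError("time base must advance between samples")
    return (pur_end - pur_start) / elapsed


# Two Simultaneous Multi-Threading threads sampled over the same time base interval.
thread0 = physical_utilization(pur_start=1_000, pur_end=1_600, tb_start=0, tb_end=1_000)
thread1 = physical_utilization(pur_start=2_000, pur_end=2_400, tb_start=0, tb_end=1_000)
```

With these sample values, the first thread accounts for 0.6 of the physical processor and the second for 0.4.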
The hypervisor decrementer, a hardware facility that was introduced in POWER5, provides a timed interrupt to the hypervisor. This is independent of any partition and is used by the hypervisor for dispatching purposes.
I/O in logical partitions should be handled with a limited number of physical I/O adapters. This is possible through I/O virtualization. POWER5 systems support partitions with either physical devices, virtual I/O devices, or a mixture of both. Slot-level partitioning of physical devices is also supported. Peripheral component interconnect (PCI) slots in the system can be individually assigned to a logical partition. The hypervisor ensures that each logical partition can access only the PCI slots assigned to it and not other PCI devices, even if they are on the same bus.
Virtual adapters are implemented in software through a Virtual I/O Server. A Virtual I/O Server is an explicitly created logical partition. It connects to physical devices and distributes the resources to logical partitions. In this case, the hypervisor does not own I/O devices, and it is not affected by the introduction of new I/O devices or changes to existing devices.
The Virtual I/O Server hosts logical I/O devices, such as virtual SCSI disks and virtual Ethernet, for the partitions. The virtual SCSI disks can be entire physical disks or only a portion of them, and they look like actual disks to the client partitions. Similarly, Ethernet adapters are shared, and a virtual Ethernet is provided by the Virtual I/O Server to different partitions. It is similar to a high-bandwidth Ethernet connection, supports multiple protocols, and is used for communication between partitions.
Figure 7 shows a POWER system with partitions and Virtual I/O Server.
Figure 7. Power system with partitions and VIOS
Starting with POWER5, two user interface tools, the Hardware Management Console (HMC) and the Integrated Virtualization Manager (IVM), are available for operational management of logical partitions.
The HMC is a server machine that provides a graphical user interface tool to manage several POWER systems. A server system that is physically attached to and managed by the HMC is called a managed system.
The HMC manages systems through messages to the hypervisor and the operating system. It performs tasks that affect the entire managed system, such as powering the system on and off, and helps to configure and activate partitions. The HMC creates and stores logical partition profiles that define physical processors, memory, and I/O resources allocated to an individual partition. It facilitates start, stop, and reset of a partition by selecting the corresponding logical partition profile. It displays a virtual operator panel of the contents and status for the system and the partitions. The HMC is also the control point for dynamic reconfiguration of partitions.
There is no physical console on every partition. The HMC provides a virtual console to each partition. The virtual console is used by the operating system as a console for various processes, such as installations, recovery from crashes, and performing boot-related operations. It is useful for debugging purposes by letting the debugger break into operating system code and examining the hardware registers.
There are many smaller environments where smaller partitions are designed, either for testing purposes or for specific requirements. In such setups, the complex functionalities offered by the HMC are not required. The IVM is a simplified and low-cost solution that inherits some of the HMC features. IVM manages only one server system, and there is no need for an independent machine like HMC.
The IVM is an extension of the Virtual I/O Server. When a system is to be managed by the IVM, it is not partitioned. The Virtual I/O Server is installed as the first operating system in the system. It is then used for configuring logical partitions, startup and shutdown functions for partitions, management of Ethernet and storage adapters, and basic system management functions. The Virtual I/O Server is configured to own all the physical I/O resources and provides virtualization capabilities to the other logical partitions. A browser interface is used for management activities.
AIX optimizations for logical partitioning
This section describes the special features that AIX provides to support virtualization on System p systems.
Virtual processor area
AIX maintains an area for each virtual processor called the virtual processor area (VPA). The VPA is a two-way communication area in which the operating system and the hypervisor exchange information about the virtual processor. For example, the VPA contains an idle flag that the operating system sets when it is idle, to indicate its status to the hypervisor.
When a virtual processor context switch occurs, for example when a virtual processor confers its cycles to another virtual processor, the entire program-visible processor state normally has to be saved. If the virtual processor is not using all of its resources, saving all of them on every context switch is unnecessary work. To minimize the cost of a virtual processor context switch, the operating system indicates to the hypervisor which resources are in use. The operating system sets the fields corresponding to the used resources in the VPA and maintains a shallow copy of them in the VPA.
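The idea of saving only the resources that are marked as in use can be sketched with a set of flags. The flag names and register groups below are hypothetical and chosen purely for illustration; the actual VPA layout is not described in this article.

```python
from enum import Flag, auto
from typing import List


class InUse(Flag):
    """Hypothetical 'resource in use' flags the operating system might set."""
    NONE = 0
    FLOATING_POINT = auto()
    VECTOR = auto()


def registers_to_save(in_use: InUse) -> List[str]:
    """Register groups a context switch must save, given the in-use flags."""
    groups = ["general-purpose"]          # assumed to be saved on every switch
    if in_use & InUse.FLOATING_POINT:
        groups.append("floating-point")
    if in_use & InUse.VECTOR:
        groups.append("vector")
    return groups
```

A virtual processor that has set no flags costs only the minimal save, while one that uses every resource pays for the full state.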
The hypervisor calls are executable only in hypervisor mode, just like system calls are executable only in kernel mode. The AIX operating system requests the hypervisor mode through hypervisor calls (hcalls).
The processor transitions from kernel mode to hypervisor mode using the HV bit in the MSR. The HV bit, along with the Problem State bit, indicates whether the processor is in hypervisor mode.
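How the two bits combine can be sketched as a small decoder. The bit positions used here (HV at bit 60 and Problem State at bit 14, counted from the least significant bit of a 64-bit MSR) are our assumption and should be checked against the architecture documentation; the function name is hypothetical.

```python
MSR_HV = 1 << 60   # hypervisor state bit (assumed position)
MSR_PR = 1 << 14   # problem state bit (assumed position)


def processor_mode(msr: int) -> str:
    """Classify the privilege mode encoded in an MSR value."""
    if msr & MSR_PR:
        return "problem"      # user mode, regardless of the HV bit
    if msr & MSR_HV:
        return "hypervisor"   # HV set and Problem State clear
    return "kernel"           # privileged, but not hypervisor
```

Only the combination of HV set and Problem State clear yields hypervisor mode, which is why both bits have to be examined together.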
Virtual Memory Manager changes
The Virtual Memory Manager (VMM) undergoes major changes for a partitioned environment. Memory is no longer a single contiguous space, as it is in a traditional non-partitioned environment.
The whole physical memory is divided into blocks called physical memory blocks (PMB). The logical memory is divided into logical memory blocks (LMB). The sizes of PMB and LMB are variable in POWER5. PMBs assigned to a partition need not be contiguous. Some PMBs are used for special purposes of the hypervisor and are not allocated to any partition. PMBs are mapped to LMBs, as shown in Figure 8 below.
Figure 8. PMB mapping
The hypervisor has access to the entire memory space, and maintains the memory allocated to partitions through a global partition page table. It ensures that partitions do not access the memory of another partition. The global partition page table consists of the mapping of PMBs to the LMBs of different partitions. The operating system cannot access this hypervisor resource directly, and uses hypervisor calls to read or write a new entry to the global page table.
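The role of this table can be sketched as a mapping from a partition's LMBs to PMBs. The class, its methods, and the 16 MB block size in the example are hypothetical; the real table format and block sizes are not given here.

```python
from typing import Dict, Tuple


class MemoryAccessError(Exception):
    """Raised when a partition touches memory that is not mapped to it."""


class GlobalPartitionTable:
    """Toy mapping of (partition, LMB index) to PMB index."""

    def __init__(self, block_size: int) -> None:
        self.block_size = block_size
        self._lmb_to_pmb: Dict[Tuple[str, int], int] = {}
        self._pmb_owner: Dict[int, str] = {}

    def map_block(self, partition: str, lmb: int, pmb: int) -> None:
        """Assign a physical block to one logical block of one partition."""
        if pmb in self._pmb_owner:
            raise MemoryAccessError(f"PMB {pmb} already belongs to {self._pmb_owner[pmb]}")
        self._pmb_owner[pmb] = partition
        self._lmb_to_pmb[(partition, lmb)] = pmb

    def logical_to_physical(self, partition: str, logical_addr: int) -> int:
        """Resolve a partition's logical address to a system-wide physical address."""
        lmb, offset = divmod(logical_addr, self.block_size)
        if (partition, lmb) not in self._lmb_to_pmb:
            raise MemoryAccessError(f"{partition} has no LMB {lmb}")
        return self._lmb_to_pmb[(partition, lmb)] * self.block_size + offset


BLOCK = 16 * 1024 * 1024                 # assumed block size for the example
table = GlobalPartitionTable(BLOCK)
table.map_block("LPAR1", lmb=0, pmb=5)   # PMBs need not be contiguous
table.map_block("LPAR1", lmb=1, pmb=2)
```

Because each PMB has exactly one owner in the table, a lookup can never resolve into the memory of another partition.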
When address translation is turned off, the partition accesses memory in real mode.
When address translation is turned on, the VMM requests the hypervisor to convert a virtual address to a correct logical address. The hypervisor converts a virtual address to a system-wide physical address using the global partition page tables. The operating system is not aware of the address translation within the hypervisor. After obtaining the system-wide physical address (PMB), the hypervisor translates this to the corresponding valid logical address (LMB) for the partition.
Performance monitoring commands
AIX provides memory, I/O, and partition performance monitoring commands. The following is a brief note on each of these commands:
This command displays logical partition-related parameters and hypervisor information. It also lists utilization statistics of each logical partition, which includes the percentage of entitled capacity in different modes (user, system, idle, and wait). For shared processors, the command displays information such as virtual context switches, available physical processors in the pool, percentage of entitled capacity received, and capped or uncapped mode.
This command displays statistics of every logical processor, such as number of online logical processors, page faults, thread dispatching statistics, physical processor utilization in different modes (user, system, idle, and wait), logical CPU switches, and percentage of entitled capacity consumed by the CPU.
Virtual memory statistics are reported through this command. It reports statistics about kernel threads, virtual memory, disks, traps, and CPU activity.
This command is used for monitoring I/O device loading by observing the time the physical disks are active in relation to their average transfer rates. These statistics are useful to adapt system configuration to balance the I/O between physical disks and adapters.
The sar command is an accounting system that provides statistics of the entire system activity. It monitors major system resources. Some critical information provided by the sar command includes the following:
- Use of file access system routines (specifying how many times per second several of the system file access routines have been called)
- Buffer activity for transfers, accesses, and cache (kernel block buffer cache) hit ratios
- Activity for each block device (not for tape drives)
- Kernel process activity
- Message and semaphore activities
- Per-processor statistics of selected processors
- Status of processes, kernel-thread, i-node, and file tables
- System switching activity
- Terminal device activity
Another command reports selected statistics about the activity on the local system. It uses the curses library to display its output in a user-friendly format in an 80 x 25 character-based display.
The logical partitioning in System p is shared or dedicated with resources allocated to partitions based on the purpose of the partition and the needs of the application.
We discussed the details of the hypervisor's role in implementation of virtualization. The role of the hypervisor includes resource allocation, partition management, processor dispatching, and virtual memory management.
We also discussed the new registers introduced in System p and their functions. I/O for all partitions must be managed with the limited number of devices that are present. This is achieved through virtual adapters implemented through a Virtual I/O Server.
Tools are required for the operational management of partitions. The two tools discussed are the HMC and the IVM.
AIX provides support for virtualization through changes in the memory management layer (VMM). The operating system maintains an area for each virtual processor called the virtual processor area (VPA). It has also introduced new performance monitoring commands that help in monitoring and administering partitions.
We are indebted to Saravanan Devendran for taking time to review this article in detail and providing very valuable comments and Arun Anbalagan for his constant support and encouragement.