POWER5 Hypervisor
Added by Nicolette, last edited by Nicolette on Sep 18, 2009

A major feature of the new POWER5 machines is a new, active Hypervisor that represents a convergence with iSeries systems.

iSeries and pSeries machines now have a common Hypervisor and common functionality, which will mean reduced development effort and faster time to market for new functions. However, each brand will retain a unique value proposition.

New functions provided for pSeries are Shared Processor Partitions and Virtual I/O. Both of these have been available for iSeries on POWER4 systems, so pSeries gets the benefit of using tried and tested microcode to implement these functions on POWER5.

iSeries benefits from the POWER Hypervisor convergence as well and gains the ability to run AIX in an LPAR (rather than the more limited PACE environment available today). There are some restrictions for the AIX environment on iSeries (for example, device support) and the primary reason for offering this function is to broaden the range of software applications available to iSeries customers.


TECHNOLOGY PPT, Page 39

This is a simplified diagram showing the sourcing of different elements in the converged POWER Hypervisor.
The blue boxes show functions that have been sourced either directly from the existing pSeries POWER4 Hypervisor or from the pSeries architecture. Purple boxes (lighter shading) show those sourced directly from the iSeries SLIC (System Licensed Internal Code) - which is part of OS/400.

Some boxes are gradated, and these represent functions that combine elements of the pSeries and iSeries implementation models.


TECHNOLOGY PPT, Page 40

  • Same functions as POWER4 Hypervisor.
    • Dynamic LPAR
    • Capacity Upgrade on Demand
  • New, active functions.
    • Dynamic Micro-Partitioning
    • Shared processor pool
    • Virtual I/O
    • Virtual LAN
  • Machine is always in LPAR mode.
    • Even with all resources dedicated to one OS

The POWER Hypervisor provides the same basic functions as the POWER4 Hypervisor, plus some new functions designed for shared processor LPARs and virtual I/O.

Combined with features designed into the POWER5 processor, the POWER Hypervisor delivers functions that enable other system technologies, including micro-partitioning, virtualized processors, IEEE VLAN compatible virtual switch, virtual SCSI adapters, and virtual consoles.

The POWER Hypervisor is a component of the system's firmware that will always be installed and activated, regardless of system configuration. It operates as a hidden partition, with no entitled capacity assigned to it.
Newly architected Hypervisor calls (hcalls) provide a means for the operating system to communicate with the POWER Hypervisor, allowing more efficient usage of physical processor capacity by supporting the scheduling heuristic of minimizing idle time.

The POWER Hypervisor is a key component to the functions shown in the chart. It performs the following tasks:

  • Provides an abstraction layer between the physical hardware resources and the logical partitions using them
  • Enforces partition integrity by providing a security layer between logical partitions
  • Controls the dispatch of virtual processors to physical processors
  • Saves and restores all processor state information during a logical processor context switch
  • Controls the hardware I/O interrupt management facilities for logical partitions

TECHNOLOGY PPT, Page 41

Power Hypervisor implementation

Design enhancements to the previous POWER4 implementation enable the sharing of processors by multiple partitions:

  • Hypervisor decrementer (HDECR)
  • New Processor Utilization Resource Register (PURR)
  • Refine virtual processor objects
    • Does not include physical characteristics of the processor
  • New Hypervisor calls

The POWER4 processor introduced support for logical partitioning with a new privileged processor state called Hypervisor mode. It is accessed via a Hypervisor call function, which is generated by the operating system kernel running in a partition. Hypervisor mode allows for a secure mode of operation that is required for various system functions where logical partition integrity and security are required. The Hypervisor validates that the partition has ownership of the resources it is attempting to access, such as processor, memory, and I/O, then completes the function. This mechanism allows for complete isolation of partition resources.

In the POWER5 processor, further design enhancements are introduced that enable the sharing of processors by multiple partitions. The Hypervisor decrementer (HDECR) is a new hardware facility in the POWER5 design that provides the POWER Hypervisor with a timed interrupt independent of partition activity. HDECR interrupts are routed directly to the POWER Hypervisor, and use only POWER Hypervisor resources to capture state information from the partition. The HDECR is used for fine grained dispatching of multiple partitions on shared processors. It also provides a means for the POWER Hypervisor to dispatch physical processor resources for its own execution.

With the addition of shared partitions and SMT, a mechanism was required to track physical processor resource utilization at a processor thread level. System architecture for POWER5 introduces a new register called the processor utilization resource register (PURR) to accomplish this. It provides the partition with an accurate cycle count to measure activity during timeslices dispatched on a physical processor. The PURR is a POWER Hypervisor resource, assigned one per processor thread, that is incremented at a fixed rate whenever the thread running on a virtual processor is dispatched on a physical processor.

TECHNOLOGY PPT, Page 42

POWER Hypervisor processor dispatch

  • Manage a set of processors on the machine (shared processor pool).
  • POWER5 generates a 10 ms dispatch window.
    • Minimum allocation is 1 ms per physical processor.
  • Each virtual processor is guaranteed to get its entitled share of processor cycles during each 10 ms dispatch window.
    • ms/VP = CE * 10 / VPs
  • The partition entitlement is evenly distributed among the online virtual processors.
  • Once a capped partition has received its CE within a dispatch interval, it becomes not-runnable.
  • A VP dispatched within 1 ms of the end of the dispatch interval will receive half its CE at the start of the next dispatch interval.

Multiple logical partitions configured to run with a pool of shared physical processors require a robust mechanism to guarantee the distribution of available processing cycles. The POWER Hypervisor manages this task in the POWER5 processor based servers.

Each Micro-Partition is configured with a specific processor entitlement, based on a quantity of processing units, which is referred to as the partition's entitled capacity or capacity entitlement (CE). The entitled capacity, along with a defined number of virtual processors, defines the physical processor resource that will be allotted to the partition. The POWER Hypervisor uses the POWER5 HDECR, which is programmed to generate an interrupt every 10 ms, as a timing mechanism for controlling the dispatch of physical processors to system partitions. Each virtual processor is guaranteed to get its entitled share of processor cycles during each 10 ms dispatch window. The minimum amount of resource that the POWER Hypervisor will allocate to a virtual processor, within a dispatch cycle, is 1 ms of execution time per VP. This gives rise to the current restriction of 10 Micro-Partitions per physical processor. The POWER Hypervisor calculates the amount of time each VP will execute by reference to the CE (as shown on the slide). Note that the calculation for uncapped partitions is more complicated: it involves their capacity weight and depends on there being unused capacity available.
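The slide formula (ms/VP = CE * 10 / VPs) and the 1 ms minimum can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not Hypervisor code; the function and constant names are assumptions chosen for this example, and only capped partitions are covered.

```python
DISPATCH_WINDOW_MS = 10.0  # length of one Hypervisor dispatch window
MIN_SLICE_MS = 1.0         # minimum execution time per VP per window


def ms_per_vp(entitled_capacity: float, virtual_processors: int) -> float:
    """Guaranteed run time (ms) of each VP in one dispatch window.

    Implements the slide formula: ms/VP = CE * 10 / VPs.
    """
    if virtual_processors < 1:
        raise ValueError("a partition needs at least one virtual processor")
    slice_ms = entitled_capacity * DISPATCH_WINDOW_MS / virtual_processors
    # The text states a 1 ms minimum per VP; a small tolerance absorbs
    # floating-point rounding.
    if slice_ms < MIN_SLICE_MS - 1e-9:
        raise ValueError("each VP must receive at least 1 ms per window")
    return slice_ms


def max_partitions_per_processor() -> int:
    """How many minimum-size (1 ms) slices fit into one 10 ms window."""
    return int(DISPATCH_WINDOW_MS // MIN_SLICE_MS)
```

For example, `ms_per_vp(0.8, 2)` gives 4 ms per virtual processor, and `max_partitions_per_processor()` reproduces the limit of 10 Micro-Partitions per physical processor.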

The amount of time that a virtual processor runs before it is timesliced is based on the partition entitlement, which is specified indirectly by the system administrator. The partition entitlement is evenly distributed amongst the online virtual processors, so the number of online virtual processors impacts the length of each virtual processor's dispatch cycle. The POWER Hypervisor uses the architectural metaphor of a "dispatch wheel" with a fixed rotation period of 10 milliseconds to guarantee that each virtual processor receives its share of the entitlement in a timely fashion. Virtual processors are time sliced through the use of the hardware decrementer, much like the operating system time slices threads.

In general, the POWER Hypervisor uses a very simple scheduling model. The basic idea is that processor entitlement is distributed with each turn of the POWER Hypervisor's dispatch wheel, so each partition is guaranteed a relatively constant stream of service.

TECHNOLOGY PPT, Page 43

Dispatching and interrupt latencies

  • Virtual processors have dispatch latency.
  • Dispatch latency is the time between a virtual processor becoming runnable and being actually dispatched.
  • Timers have latency issues also.
  • External interrupts have latency issues also.

Virtual processors have dispatch latency, since they are scheduled. When a virtual processor is made runnable, it is placed on a run queue by the POWER Hypervisor, where it sits until it is dispatched. The time between these two events is referred to as dispatch latency.

The dispatch latency of a virtual processor is a function of the partition entitlement and the number of virtual processors that are online in the partition. Entitlement is equally divided among these online virtual processors, so the number of online virtual processors impacts the length of each virtual processor's dispatch cycle. The smaller the dispatch cycle, the greater the dispatch latency.

Timers have latency issues also. The hardware decrementer is virtualized by the POWER Hypervisor at the virtual processor level, so that timers will interrupt the initiating virtual processor at the designated time. If a virtual processor is not running, then the timer interrupt has to be queued with the virtual processor, since it is delivered in the context of the running virtual processor.

External interrupts have latency issues also. External interrupts are routed directly to a partition. When the operating system makes the accept-pending-interrupt Hypervisor call, the POWER Hypervisor, if necessary, dispatches a virtual processor of the target partition to process the interrupt. The POWER Hypervisor provides a mechanism for queuing up external interrupts that is also associated with virtual processors. Whenever this queuing mechanism is used, latencies are introduced.

These latency issues are not expected to cause functional problems, but they may present performance problems for real-time applications. To quantify matters, the worst case virtual processor dispatch latency is 18 milliseconds, since the minimum dispatch cycle that is supported at the virtual processor level is one millisecond. This figure is based on the minimum partition entitlement of 1/10 of a physical processor and the 10 millisecond rotation period of the Hypervisor's dispatch wheel. It can be easily visualized by imagining that a virtual processor is scheduled in the first and last portions of two 10 millisecond intervals. In general, if these latencies are too great, then clients may increase entitlement, minimize the number of online virtual processors without reducing entitlement, or use dedicated processor partitions.
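The 18 ms figure follows from simple arithmetic: a virtual processor that runs at the very start of one 10 ms window and at the very end of the next waits for everything in between. A minimal sketch, with a hypothetical function name:

```python
def worst_case_dispatch_latency_ms(slice_ms: float,
                                   window_ms: float = 10.0) -> float:
    """Longest gap between two dispatches of the same virtual processor.

    Assumes the VP runs in the first slice_ms of one window and in the
    last slice_ms of the following window.
    """
    if not 0 < slice_ms <= window_ms:
        raise ValueError("slice must be positive and fit in the window")
    # Remainder of the first window plus the wait in the second window.
    return (window_ms - slice_ms) + (window_ms - slice_ms)
```

With the minimum 1 ms dispatch cycle this gives 18 ms; a virtual processor entitled to 2 ms per window would wait at most 16 ms.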

TECHNOLOGY PPT, Page 44

Shared processor pool

  • Processors not associated with dedicated processor partitions.
  • No fixed relationship between virtual processors and physical processors.
  • The POWER Hypervisor attempts to use the same physical processor.
  • Affinity scheduling
  • Home node

The POWER Hypervisor schedules shared processor partitions from a set of physical processors that is called the shared processor pool. By definition, these processors are not associated with dedicated partitions.

In shared partitions, there is not a fixed relationship between virtual processors and the physical processors that actualize them. The POWER Hypervisor may use any physical processor in the shared processor pool when it schedules the virtual processor. By default, it attempts to use the same physical processor, but this cannot always be guaranteed. The POWER Hypervisor employs the notion of a home node for virtual processors, enabling it to select the best available physical processor from a memory affinity perspective for the virtual processor that is to be scheduled.

TECHNOLOGY PPT, Page 45

Affinity scheduling

  • When dispatching a VP, the POWER Hypervisor attempts to preserve affinity by using:
    • Same physical processor as before, or
    • Same chip, or
    • Same MCM
  • When a physical processor becomes idle, the POWER Hypervisor looks for a runnable VP that:
    • Has affinity for it, or
    • Has affinity to no-one, or
    • Is uncapped
  • Similar to AIX affinity scheduling

Affinity scheduling is designed to preserve the content of memory caches, so that the working data set of a job can be read or written in the shortest time period possible. Affinity is actively managed by the POWER Hypervisor, since each partition has a completely different context. Currently, there is one shared processor pool, so all virtual processors are implicitly associated with the same pool.

The POWER Hypervisor attempts to dispatch work in a way that maximizes processor, cache, and memory affinity. When the POWER Hypervisor is dispatching a VP (for example, at the start of a dispatch interval) it will attempt to use the same physical CPU as this VP was previously dispatched on, or a processor on the same chip, or on the same MCM (or in the same node).
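The preference order (same processor, then same chip, then same MCM) can be pictured as a ranking over the available physical processors. The sketch below is only an illustration of that ordering: the `PhysicalProcessor` type, its fields, and the assumption that chip ids are unique across the system are all hypothetical, and the real Hypervisor logic is not described at this level in the text.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PhysicalProcessor:
    cpu_id: int
    chip_id: int  # assumed unique system-wide for this sketch
    mcm_id: int


def pick_processor(last: PhysicalProcessor,
                   available: Sequence[PhysicalProcessor]
                   ) -> Optional[PhysicalProcessor]:
    """Choose the available processor closest to where the VP last ran."""

    def distance(p: PhysicalProcessor) -> int:
        if p.cpu_id == last.cpu_id:
            return 0  # same physical processor
        if p.chip_id == last.chip_id:
            return 1  # other processor on the same chip
        if p.mcm_id == last.mcm_id:
            return 2  # another chip on the same MCM
        return 3      # anywhere else in the pool

    return min(available, key=distance, default=None)
```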

If a CPU becomes idle, the POWER Hypervisor will look for work for that processor. Priority will be given to runnable VPs that have an affinity for that processor. If none can be found, then the POWER Hypervisor will select a VP that has affinity to no real processor (for example, because previous affinity has expired) and, finally, will select a VP that is uncapped.
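The three-step search for work on an idle processor can be sketched as follows. This is a simplified model under stated assumptions: the `VirtualProcessor` type and its fields are hypothetical, affinity is reduced to a single processor id, and any tie-breaking within a step is simply list order.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class VirtualProcessor:
    name: str
    affinity: Optional[int]  # physical processor id, or None if expired
    uncapped: bool = False


def pick_vp_for_idle_processor(cpu_id: int,
                               runnable: Sequence[VirtualProcessor]
                               ) -> Optional[VirtualProcessor]:
    """Select work for an idle processor in the priority order of the text."""
    # 1. A runnable VP that has affinity for this processor.
    for vp in runnable:
        if vp.affinity == cpu_id:
            return vp
    # 2. A runnable VP that has affinity to no real processor.
    for vp in runnable:
        if vp.affinity is None:
            return vp
    # 3. Finally, a runnable VP that is uncapped.
    for vp in runnable:
        if vp.uncapped:
            return vp
    return None
```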

The objective of this strategy is to try to improve system scalability by minimizing inter-cache communication.

TECHNOLOGY PPT, Page 46

  • Micro-Partitioning capable operating systems need to be modified to cede a virtual processor when they have no runnable work
    • Failure to do this results in wasted CPU resources
      • For example, a partition spends its CE waiting for I/O
    • Results in better utilization of the pool
  • May confer the remainder of their timeslice to another VP
    • For example, a VP holding a lock
  • Can be redispatched if they become runnable again during the same dispatch interval

In general, operating systems and applications running in shared partitions need not be aware that they are sharing processors. However, overall system performance can be significantly improved by minor operating system changes. The main problem here is that the POWER Hypervisor cannot distinguish between the OS doing useful work and, for example, spinning on a lock. The result is that the OS may waste much of its CE doing nothing of value. AIX 5L provides support for optimizing overall system performance of shared processor partitions.

An OS therefore needs to be modified so that it can signal to the POWER Hypervisor when it is no longer able to schedule work, so that it can give up the remainder of its time. This results in better utilization of the real processors in the shared processor pool.

The dispatch mechanism utilizes hcalls to communicate between the operating system and the POWER Hypervisor.

When a virtual processor is active on a physical processor and the operating system detects an inability to utilize processor cycles, it may cede or confer its cycles back to the POWER Hypervisor, enabling it to schedule another virtual processor on the physical processor for the remainder of the dispatch cycle. Reasons for a cede or confer may include the virtual processor running out of work and becoming idle, entering a spin loop to wait for a resource to free, or waiting for a long latency access to complete. There is no concept of credit for cycles that are ceded or conferred. Entitled cycles not used during a dispatch interval are lost.

A virtual processor that has ceded cycles back to the POWER Hypervisor can be reactivated using a prod Hypervisor call. If the operating system running on another virtual processor within the logical partition detects that work is available for one of its idle processors, it can use the prod Hypervisor call to signal the POWER Hypervisor to make the prodded virtual processor runnable again. Once dispatched, this virtual processor would resume execution at the return from the cede Hypervisor call.
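The cede and prod interaction can be viewed as a small state machine for each virtual processor. The toy model below illustrates only the state transitions described above; the class and method names are hypothetical and do not represent the actual hcall interface.

```python
from enum import Enum


class VPState(Enum):
    RUNNABLE = "runnable"  # waiting to be dispatched by the Hypervisor
    RUNNING = "running"    # dispatched on a physical processor
    CEDED = "ceded"        # gave its cycles back; idle until prodded


class ToyVirtualProcessor:
    """Toy model of one virtual processor's dispatch state."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = VPState.RUNNABLE

    def dispatch(self) -> None:
        """Hypervisor places the VP on a physical processor."""
        if self.state is not VPState.RUNNABLE:
            raise RuntimeError(f"{self.name} is not runnable")
        self.state = VPState.RUNNING

    def cede(self) -> None:
        """OS on this VP has no work and returns its cycles."""
        if self.state is not VPState.RUNNING:
            raise RuntimeError(f"{self.name} is not running")
        self.state = VPState.CEDED

    def prod(self) -> None:
        """Another VP in the partition signals that work is available."""
        # In this toy model, prodding a VP that has not ceded does nothing.
        if self.state is VPState.CEDED:
            self.state = VPState.RUNNABLE
```

A typical sequence is `dispatch()`, `cede()`, `prod()`, `dispatch()`: after the second dispatch the virtual processor is running again, which corresponds to resuming at the return from the cede call.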

The "payback" for the OS is that the POWER Hypervisor will redispatch it if it becomes runnable again during the same dispatch interval - allocating it the remainder of its CE if possible. While not required, the use of these primitives is highly desirable for performance reasons, because they improve locking and minimize idle time.

Response time and throughput should be improved, if these primitives are used. Their use is not required, because the POWER Hypervisor time slices virtual processors, which enables it to sequence through each virtual processor in a continuous fashion. Forward progress is thus assured without the use of the primitives.

TECHNOLOGY PPT, Page 47

Example

In this example, there are three logical partitions defined, sharing the processor cycles from two physical processors, spanning two 10 ms Hypervisor dispatch intervals.

Logical partition 1 is defined with an entitled capacity of 0.8 processing units, with two virtual processors. This allows the partition 80% of one physical processor for each 10 ms dispatch window for the shared processor pool. The workload is shown to use 40% of each physical processor during each dispatch interval. It is possible for a virtual processor to be dispatched more than one time during a dispatch interval. Note that in the first dispatch interval, the workload executing on virtual processor 1 is not a continuous utilization of physical processor resource. This can happen if the operating system confers cycles, and is reactivated by a prod Hypervisor call.

Logical partition 2 is configured with one virtual processor and a capacity of 0.2 processing units, entitling it to 20% usage of a physical processor during each dispatch interval. In this example, a worst case dispatch latency is shown for this virtual processor, where the 2 ms are used in the beginning of dispatch interval 1 and the last 2 ms of dispatch interval 2, leaving 16 ms between processor allocation.

Logical partition 3 contains three virtual processors, with an entitled capacity of 0.6 processing units. Each of the partition's three virtual processors consumes 20% of a physical processor in each dispatch interval, but in the case of virtual processor 0 and 2, the physical processor they run on changes between dispatch intervals. The POWER Hypervisor does attempt to maintain physical processor affinity when dispatching virtual processors. It will always first try to dispatch the virtual processor on the same physical processor as it last ran on, and depending on resource utilization, will broaden its search out to the other processor on the POWER5 chip, then to another chip on the same MCM, then to a chip on another MCM.
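The figures in this example can be checked with a short calculation. The sketch below uses the three partitions as described; the names and the check that total entitlement must fit within the pool are assumptions for illustration (the latter is implied by the guarantee that every virtual processor receives its share in each window).

```python
WINDOW_MS = 10
PHYSICAL_PROCESSORS = 2

# partition name -> (entitled capacity in processing units, online VPs)
PARTITIONS = {
    "LPAR 1": (0.8, 2),
    "LPAR 2": (0.2, 1),
    "LPAR 3": (0.6, 3),
}


def per_vp_ms(partitions, window_ms=WINDOW_MS):
    """Run time per virtual processor in each dispatch window."""
    return {name: ce * window_ms / vps
            for name, (ce, vps) in partitions.items()}


def unallocated_units(partitions, physical_processors=PHYSICAL_PROCESSORS):
    """Processing units in the pool not covered by any entitlement."""
    total = sum(ce for ce, _ in partitions.values())
    if total > physical_processors + 1e-9:
        raise ValueError("entitlements exceed the shared processor pool")
    return physical_processors - total
```

Here `per_vp_ms(PARTITIONS)` yields 4 ms per VP for partition 1 and 2 ms per VP for partitions 2 and 3, matching the 40% and 20% figures above, and about 0.4 processing units of the pool remain unallocated.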

TECHNOLOGY PPT, Page 48

  • I/O operations without dedicating resources to an individual partition
  • POWER Hypervisor's virtual I/O related operations
    • Provide control and configuration structures for virtual adapter images required by the logical partitions
    • Operations that allow partitions controlled and secure access to physical I/O adapters in a different partition
    • The POWER Hypervisor does not own any physical I/O devices; they are owned by an I/O hosting partition
  • I/O types supported
    • SCSI
    • Ethernet
    • Serial console

This chart introduces POWER Hypervisor involvement in the virtual I/O functions described later.

With the introduction of micro-partitioning, the ability to dedicate physical hardware adapter slots to each partition becomes impractical. Virtualization of I/O devices allows many partitions to communicate with each other, and access networks and storage devices external to the server, without dedicating I/O to an individual partition. Many of the I/O virtualization capabilities introduced with the POWER5 processor based IBM eServer products are accomplished by functions designed into the POWER Hypervisor.

The POWER Hypervisor does not own any physical I/O devices, and it does not provide virtual interfaces to them. All physical I/O devices in the system are owned by logical partitions. Virtual I/O devices are owned by an I/O hosting partition, which provides access to the real hardware that the virtual device is based on.

The POWER Hypervisor implements the following operations required by system partitions to support virtual I/O:

  • Provide control and configuration structures for virtual adapter images required by the logical partitions
  • Operations that allow partitions controlled and secure access to physical I/O adapters in a different partition

Along with the operations listed above, the POWER Hypervisor allows for the virtualization of I/O interrupts. To maintain partition isolation, the POWER Hypervisor controls the hardware interrupt management facilities. Each logical partition is provided controlled access to the interrupt management facilities using hcalls. Virtual I/O adapters and real I/O adapters use the same set of Hypervisor call interfaces.

Virtual I/O adapters are defined by system administrators during logical partition definition. Configuration information for the virtual adapters is presented to the partition operating system by the system firmware.

Virtual TTY console support

Each partition needs to have access to a system console. Tasks such as operating system install, network setup, and some problem analysis activities require a dedicated system console. The POWER Hypervisor provides a virtual console using a virtual TTY or serial adapter and a set of Hypervisor calls to operate on them. Depending on the system configuration, the operating system console can be provided by the Hardware Management Console (HMC) virtual TTY or from a terminal emulator connected to physical serial ports on the system's service processor.

TECHNOLOGY PPT, Page 49

Performance monitoring and accounting

  • CPU utilization is measured against CE.
    • An uncapped partition receiving more than its CE will record 100% but will be using more.
  • SMT
    • Thread priorities compound the variable speed rate.
    • Twice as many logical CPUs.
  • For accounting, interval may be incorrectly allocated.
    • New hardware support is required.
  • Processor utilization register (PURR) records actual clock ticks spent executing a partition.
    • Used by performance commands (for example, new flags) and accounting modules.
    • Third party tools will need to be modified.

Processor utilization is a critical component of metering, performance monitoring, and capacity planning. With POWER5, two new and commonly used technologies combine to make the concept of utilization much more complex: partitioning (specifically, shared processor partitioning) and simultaneous multi-threading. Individually, they add complexity to this concept, but together they multiply the complexity.

Some changes will be required to performance monitoring and accounting tools for support of Micro-Partitioning.

One issue that will need to be addressed is that CPU utilization (using traditional monitoring methods) will be recorded against CE. Clearly, an uncapped partition may exceed its CE and may therefore use more than 100% of its entitlement.
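The difference between what a traditional tool records and what an uncapped partition actually consumes can be shown with a small calculation. Both function names are hypothetical, and the capping at 100% models the behavior described on the slide rather than any particular tool.

```python
def utilization_vs_entitlement(consumed_units: float,
                               entitled_capacity: float) -> float:
    """Actual consumption as a percentage of entitled capacity (CE)."""
    if entitled_capacity <= 0:
        raise ValueError("entitled capacity must be positive")
    return 100.0 * consumed_units / entitled_capacity


def traditional_reading(consumed_units: float,
                        entitled_capacity: float) -> float:
    """What a tool that measures against CE records: at most 100%."""
    return min(100.0,
               utilization_vs_entitlement(consumed_units, entitled_capacity))
```

For an uncapped partition with a CE of 0.5 that consumes 0.75 processing units, the traditional reading is 100% while the actual figure is 150% of entitlement.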

Similarly, accounting tools (which rely on the 10 ms timer interrupt) may incorrectly record resource utilization for partitions that cede part of their dispatch interval (or which have picked up part of another via a confer Hypervisor call).

The POWER5 processor architecture attempts to deal with these complex issues by introducing a new processor register that is intended for measuring utilization. This new register, the Processor Utilization Resource Register (PURR), is used to approximate the time that a virtual processor is actually running on a physical processor. The register advances automatically so that the operating system can always get the current, up-to-date value. The Hypervisor saves and restores the register across virtual processor context switches to simulate a monotonically increasing atomic clock at the virtual processor level.
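Because the PURR behaves like a monotonically increasing clock for each virtual processor, a tool can estimate physical processor consumption from two samples. The sketch below assumes, for illustration only, that the PURR samples and an elapsed-time counter are expressed in the same tick units; the function name and parameters are hypothetical and do not correspond to any real API.

```python
def physical_processor_fraction(purr_start: int,
                                purr_end: int,
                                elapsed_ticks: int) -> float:
    """Estimate the share of a physical processor used over an interval.

    purr_start / purr_end: PURR samples taken at the interval boundaries.
    elapsed_ticks: wall-clock length of the interval, in the same units
    (an assumption of this sketch).
    """
    if elapsed_ticks <= 0:
        raise ValueError("interval must have positive length")
    if purr_end < purr_start:
        raise ValueError("PURR samples must be monotonically increasing")
    return (purr_end - purr_start) / elapsed_ticks
```

For instance, a PURR advance of 400 ticks over an interval of 2000 ticks suggests the virtual processor was running on a physical processor for about 20% of that interval.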


 