Simultaneous multithreading

Simultaneous multithreading is the ability of a single physical processor to simultaneously dispatch instructions from more than one hardware thread context. Because there are two hardware threads per physical processor, additional instructions can run at the same time.

Simultaneous multithreading allows you to take advantage of the superscalar nature of the processor by scheduling two applications at the same time on the same processor. No single application can fully saturate the processor.

Benefitting from Simultaneous Multithreading

Simultaneous multithreading is primarily beneficial in commercial environments where the speed of an individual transaction is not as important as the total number of transactions that are performed. It is expected to increase the throughput of workloads with large or frequently changing working sets, such as database servers and Web servers.

Workloads that see the greatest simultaneous multithreading benefit are those that have a high Cycles Per Instruction (CPI) count. These workloads tend to use processor and memory resources poorly. Large CPIs are usually caused by high cache-miss rates from a large working set. Large commercial workloads typically have this characteristic, although the benefit depends somewhat on whether the two hardware threads share instructions or data or are completely distinct. Workloads that share instructions or data, including those that run extensively in the operating system or within a single application, might see increased benefits from simultaneous multithreading.

Workloads that do not benefit much from simultaneous multithreading are those in which the majority of individual software threads use a large amount of any resource in the processor or memory. For example, workloads that are floating-point intensive are likely to gain little from simultaneous multithreading and are the ones most likely to lose performance. These workloads heavily use either the floating-point units or the memory bandwidth. Workloads with low CPI and low cache-miss rates might see a small benefit.

Measurements taken on a dedicated partition with commercial workloads indicated a 25%-40% increase in throughput. Simultaneous multithreading should also help shared processor partition processing. The extra threads give the partition a boost after it is dispatched because the partition recovers its working set more quickly. Subsequently, the threads perform as they would in a dedicated partition. Although it might be somewhat counterintuitive, simultaneous multithreading performs best when the performance of the cache is at its worst.

Setting the mode with the smtctl command

AIX® allows you to control the mode of the partition for simultaneous multithreading with the smtctl command. With this command, you can turn simultaneous multithreading on or off system-wide, either immediately or the next time the system boots. The simultaneous multithreading mode persists across system boots. By default, AIX enables simultaneous multithreading.

The syntax for the smtctl command is as follows:
smtctl [ -m off | on [ -w boot | now ] ]

For more information, see the smtctl command in the Commands Reference, Volume 5.

Hardware Management Console Configuration for Simultaneous Multithreading

When you configure shared processor partitions at the Hardware Management Console (HMC), you specify the minimum, desired, and maximum number of virtual processors. For dedicated partitions, you specify the same type of parameters, but the terminology is different: the processors are always called processors, not virtual processors.

Both partitioning models require that you specify a range of processors that controls the boot and runtime assignment of processors to the partition. If possible, the desired processor setting is granted when the system starts. If this is not possible, the POWER Hypervisor chooses a different value, greater than or equal to the minimum value, based on the set of available resources.

The number of processors specified at the HMC impacts the number of logical processors that AIX allocates. If the partition is capable of simultaneous multithreading, AIX allocates twice as many logical processors as the maximum processor value because there are two hardware threads per processor and AIX configures each hardware thread as a separate logical processor. This allows AIX to enable or disable simultaneous multithreading without rebooting the partition.
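The allocation rule above can be sketched as a small calculation. This is an illustration of the arithmetic only, not an AIX interface; the function name and the constant are hypothetical and simply encode the two-threads-per-processor figure stated in the text.

```python
# Illustrative model of the logical-processor arithmetic described above.
# Names are hypothetical; this does not query or configure a real partition.

THREADS_PER_PROCESSOR = 2  # from the text: two hardware threads per processor


def allocated_logical_processors(max_processors: int, smt_capable: bool) -> int:
    """Return how many logical processors are allocated for a partition.

    max_processors is the maximum processor value set at the HMC.
    """
    if max_processors < 1:
        raise ValueError("max_processors must be at least 1")
    if smt_capable:
        # One logical processor per hardware thread.
        return max_processors * THREADS_PER_PROCESSOR
    return max_processors


print(allocated_logical_processors(4, smt_capable=True))   # 8
print(allocated_logical_processors(4, smt_capable=False))  # 4
```

Because the count is derived from the maximum value rather than the current one, the logical processors for the second thread already exist when simultaneous multithreading is switched on.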

Dynamic Logical Partitioning for Simultaneous Multithreading

While a partition is running, you can change the number of processors that are assigned to a partition through Dynamic Logical Partitioning (DLPAR) procedures at the HMC. You can add or remove processors within the constraints of the processor range defined for the partition. When a processor is added to a partition that is enabled for simultaneous multithreading, AIX starts both hardware threads and two logical processors are brought online. When a processor is removed from a partition that is enabled for simultaneous multithreading, AIX stops both hardware threads and two logical processors are taken offline.

Two DLPAR events are generated when simultaneous multithreading is enabled. One event is generated for each of the logical processors that is added or removed. The API for DLPAR scripts is based on logical processors, so the number of DLPAR events parallels the addition and removal of logical processors. If simultaneous multithreading is not enabled in the partition, there is only one DLPAR event. AIX automatically translates the DLPAR request that is sent from the HMC into the appropriate number of DLPAR events presented to DLPAR-aware applications.
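As a rough sketch, the number of events a DLPAR-aware application should expect can be computed as follows. The function name is hypothetical, and scaling the per-processor count to several processors is an assumption drawn from the statement that one event is generated per logical processor.

```python
# Illustrative count of DLPAR events seen by DLPAR-aware applications.
# Hypothetical helper; it does not call any AIX or HMC API.

def dlpar_event_count(processors_changed: int, smt_enabled: bool) -> int:
    """Return the number of DLPAR events for adding or removing processors.

    One event is generated per logical processor, and each processor
    contributes two logical processors when SMT is enabled.
    """
    if processors_changed < 0:
        raise ValueError("processors_changed must not be negative")
    logical_per_processor = 2 if smt_enabled else 1
    return processors_changed * logical_per_processor


print(dlpar_event_count(1, smt_enabled=True))   # 2
print(dlpar_event_count(1, smt_enabled=False))  # 1
```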

Micro-Partitioning® and Simultaneous Multithreading

The POWER Hypervisor™ saves and restores all necessary processor states when preempting or dispatching virtual processors. For processors that are enabled for simultaneous multithreading, this means two active thread contexts. Each hardware thread is supported as a separate logical processor by AIX. For this reason, a dedicated partition that is created with one physical processor is configured by AIX as a logical 2-way processor. Because this is independent of the partition type, a shared partition with two virtual processors is configured by AIX as a logical 4-way processor and a shared partition with four virtual processors is configured by AIX as a logical 8-way processor. Paired threads are always scheduled together at the same time in the same partition.

Shared processor capacity is always delivered in terms of whole physical processors. Without simultaneous multithreading, AIX configures a 4-way virtual processor partition with 200 units of processor entitlement as a 4-way logical processor partition where each logical processor has the power of 50% of a physical processor. With simultaneous multithreading, the 4-way logical processor partition becomes an 8-way logical processor partition, where each logical processor has the approximate power of 25% of a physical processor. However, with simultaneous multithreading, latency concerns normally associated with a virtual processor's fractional capacity do not apply linearly to the threads. Because both threads are dispatched together, they are active for the duration of a 50% dispatch window and they share the underlying physical processor to achieve the logical 25%. This means that each of the logical processors is able to field interrupts for twice as long as its individual capacity allows.
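The entitlement arithmetic in this example can be reproduced with a short calculation. This is a sketch under one stated assumption: 100 entitlement units correspond to one physical processor, which is what the figures in the text imply (200 units across 4 virtual processors giving 50% each). The function and constant names are hypothetical.

```python
# Illustrative entitlement arithmetic for a shared processor partition.
# Assumption: 100 entitlement units equal one physical processor.

UNITS_PER_PHYSICAL_PROCESSOR = 100


def logical_processor_capacity(entitlement_units, virtual_processors, smt_enabled):
    """Return (logical processor count, capacity of each logical processor).

    Capacity is expressed as a fraction of one physical processor.
    """
    if virtual_processors < 1:
        raise ValueError("virtual_processors must be at least 1")
    threads = 2 if smt_enabled else 1
    # Share of a physical processor given to each virtual processor.
    per_virtual = entitlement_units / virtual_processors / UNITS_PER_PHYSICAL_PROCESSOR
    # With SMT, the two sibling threads split that share between them.
    return virtual_processors * threads, per_virtual / threads


print(logical_processor_capacity(200, 4, smt_enabled=False))  # (4, 0.5)
print(logical_processor_capacity(200, 4, smt_enabled=True))   # (8, 0.25)
```

Note that the per-virtual-processor share (0.5 here) is also the dispatch window during which both sibling threads are active, which is why each logical processor can field interrupts for longer than its 0.25 share alone would suggest.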

Hardware thread priorities

The processor allows priorities to be assigned to hardware threads. The difference in priority between sibling threads determines the ratio of physical processor decode slots allotted to each thread. More slots provide better thread performance. Normally, AIX maintains sibling threads at the same priority, but raises or lowers thread priorities in key places to optimize performance. For example, AIX lowers thread priorities when the thread is doing nonproductive work spinning in the idle loop or on a kernel lock. Thread priorities are raised when a thread is holding a critical kernel lock. These priority adjustments do not persist in user mode. AIX does not consider a software thread's dispatching priority when it is choosing the hardware thread priority.

Work is distributed across all primary threads before work is dispatched to secondary threads. The performance of a thread is best when its paired thread is idle. Thread affinity is also considered in idle stealing and in periodic run queue load balancing.
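The primary-before-secondary ordering can be pictured with a toy model. This is not the AIX dispatcher: it ignores thread affinity, idle stealing, and run queue load balancing, and the function name and the slot labels are hypothetical.

```python
# Toy model of "primary threads first, then secondary threads".
# It only illustrates the fill order described above.
from typing import List, Tuple


def placement_order(num_processors: int, num_runnable: int) -> List[Tuple[int, str]]:
    """Return the hardware thread slots used, in fill order.

    Each slot is (processor index, "primary" or "secondary").
    Work beyond the available slots is simply left unplaced here.
    """
    if num_processors < 1 or num_runnable < 0:
        raise ValueError("need at least one processor and a non-negative work count")
    # All primary threads come before any secondary thread.
    slots = [(p, "primary") for p in range(num_processors)]
    slots += [(p, "secondary") for p in range(num_processors)]
    return slots[:num_runnable]


print(placement_order(2, 3))
# [(0, 'primary'), (1, 'primary'), (0, 'secondary')]
```

With two processors and two runnable pieces of work, each lands on its own processor and both sibling threads stay idle, which is the case in which a thread performs best.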