The Linux symmetric multiprocessing (SMP) kernel, in both its 2.4 and 2.5 versions, has been made aware of Hyper-Threading, and speed-ups have been observed in multithreaded benchmarks (see Resources later in this article for articles with more details).
This article gives the results of our investigation into the effects of Hyper-Threading (HT) on the Linux SMP kernel. It compares the performance of a Linux SMP kernel that was aware of Hyper-Threading to one that was not. The system under test was a multithreading-enabled, single-CPU Xeon. The benchmarks used in the study covered areas within the kernel that could be affected by Hyper-Threading, such as the scheduler, low-level kernel primitives, the file server, the network, and threaded support.
The results on Linux kernel 2.4.19 show that Hyper-Threading technology could improve the performance of multithreaded applications by as much as 30%. Current work on Linux kernel 2.5.32 may provide a speed-up of as much as 51%.
Intel's Hyper-Threading Technology enables two logical processors on a single physical processor by replicating, partitioning, and sharing the resources within the Intel NetBurst microarchitecture pipeline.
Replicated resources create copies of the resources for the two threads:
- All per-CPU architectural states
- Instruction pointers, renaming logic
- Some smaller resources (such as return stack predictor, ITLB, etc.)
Partitioned resources divide the resources between the executing threads:
- Several buffers (Re-Order Buffer, Load/Store Buffers, queues, etc.)
Shared resources make use of the resources as needed between the two executing threads:
- Out-of-Order execution engine
Typically, each physical processor has a single architectural state on a single processor core to service threads. With HT, each physical processor has two architectural states on a single core, making the physical processor appear as two logical processors to service threads. The system BIOS enumerates each architectural state on the physical processor. Since Hyper-Threading-aware operating systems take advantage of logical processors, those operating systems have twice as many logical processors on which to schedule threads, although the two logical processors on a core share its execution resources.
Hyper-Threading support in the Xeon processor
The Xeon processor is the first to implement Simultaneous Multi-Threading (SMT) in a general-purpose processor. (See Resources for more information on the Xeon family of processors.) To achieve the goal of executing two threads on a single physical processor, the processor simultaneously maintains the context of multiple threads, which allows the scheduler to dispatch two potentially independent threads concurrently.
The operating system (OS) schedules and dispatches threads of code to each logical processor as it would in an SMP system. When a thread is not dispatched, the associated logical processor is kept idle.
When a thread is scheduled and dispatched to a logical processor, LP0, the Hyper-Threading technology utilizes the necessary processor resources to execute the thread.
When a second thread is scheduled and dispatched on the second logical processor, LP1, resources are replicated, divided, or shared as necessary in order to execute the second thread. Each processor makes selections at points in the pipeline to control and process the threads. As each thread finishes, the operating system idles the unused processor, freeing resources for the running processor.
The OS schedules and dispatches threads to each logical processor, just as it would in a dual-processor or multi-processor system. As the system schedules and introduces threads into the pipeline, resources are utilized as necessary to process two threads.
Hyper-Threading support in Linux kernel 2.4
Under the Linux kernel, a Hyper-Threaded processor with two virtual processors is treated as a pair of real physical processors. As a result, the scheduler that handles SMP should be able to handle Hyper-Threading as well. The support for Hyper-Threading in Linux kernel 2.4.x began with 2.4.17 and includes the following enhancements:
- 128-byte lock alignment
- Spin-wait loop optimization
- Non-execution based delay loops
- Detection of Hyper-Threading-enabled processors and starting the logical processors as if the machine were SMP
- Serialization in MTRR and Microcode Update driver as they affect shared state
- Optimization to scheduler when system is idle to prioritize scheduling on a physical processor before scheduling on logical processor
- Offset user stack to avoid 64K aliasing
Kernel performance measurement
To assess the effects of Hyper-Threading on the Linux kernel, we measured the performance of kernel benchmarks on a system containing the Intel Xeon processor with HT. The hardware was a single-CPU, 1.6 GHz Xeon MP processor with SMT, 2.5 GB of RAM, and two 9.2 GB SCSI disk drives. The kernel under measurement was stock version 2.4.19 configured and built with SMP enabled. Hyper-Threading support was controlled at boot time: the boot option noht disables Hyper-Threading. To check whether Hyper-Threading support is active, use the command cat /proc/cpuinfo, which shows two processors, processor 0 and processor 1, when it is. Note the ht flag in Listing 1 for CPUs 0 and 1. Without Hyper-Threading support, data is displayed for processor 0 only.
Listing 1. Output from cat /proc/cpuinfo showing Hyper-Threading support
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 1
model name : Intel(R) Genuine CPU 1.60GHz
stepping : 1
cpu MHz : 1600.382
cache size : 256 KB
. . .
fpu : yes
fpu_exception: yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips : 3191.60

processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 1
model name : Intel(R) Genuine CPU 1.60GHz
stepping : 1
cpu MHz : 1600.382
cache size : 256 KB
. . .
fpu : yes
fpu_exception: yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips : 3198.15
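The check described above can be automated. The following is a minimal Python sketch, not part of the original study: it splits /proc/cpuinfo-style text into one entry per logical processor and applies the article's single-CPU rule of thumb (two processor entries, each carrying the ht flag). The function names parse_cpuinfo and hyper_threading_visible are hypothetical, and the rule only makes sense on a machine with one physical, single-core CPU like the one tested here.

```python
def parse_cpuinfo(text):
    """Split /proc/cpuinfo-style text into one dict per logical processor."""
    processors = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines and elisions carry no field
        key, value = key.strip(), value.strip()
        if key == "processor":
            processors.append({})  # a "processor" line starts a new entry
        if processors:
            processors[-1][key] = value
    return processors


def hyper_threading_visible(processors):
    """Assumption: on a single physical CPU, HT shows up as more than one
    processor entry, each with the ht flag set."""
    return len(processors) > 1 and all(
        "ht" in p.get("flags", "").split() for p in processors
    )


if __name__ == "__main__":
    with open("/proc/cpuinfo") as cpuinfo:
        cpus = parse_cpuinfo(cpuinfo.read())
    print(len(cpus), "logical processor(s); HT visible:",
          hyper_threading_visible(cpus))
```

On a kernel booted with noht, the same script would be expected to report a single logical processor.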
Linux kernel benchmarks
To measure Linux kernel performance, five benchmarks were used: LMbench, AIM Benchmark Suite IX (AIM9), chat, dbench, and tbench. The LMbench benchmark times various Linux application programming interfaces (APIs), such as basic system calls, context switching latency, and memory bandwidth. The AIM9 benchmark provides measurements of user application workload. The chat benchmark is a client-server workload modeled after a chat room. The dbench benchmark is a file server workload, and tbench is a TCP workload. Chat, dbench, and tbench are multithreaded benchmarks, while the others are single-threaded benchmarks.
Effects of Hyper-Threading on Linux APIs
The effects of Hyper-Threading on Linux APIs were measured by LMbench, which is a microbenchmark containing a suite of bandwidth and latency measurements. Among these are cached file read, memory copy (bcopy), memory read/write (and latency), pipe, context switching, networking, filesystem creates and deletes, process creation, signal handling, and processor clock latency. LMbench stresses the following kernel components: scheduler, process management, communication, networking, memory map, and filesystem. The low level kernel primitives provide a good indicator of the underlying hardware capabilities and performance.
To study the effects of Hyper-Threading, we focused on the latency measurements, which capture how fast the system can perform a given operation. The latency numbers are reported in microseconds per operation.
Table 1 shows a partial list of kernel functions tested by LMbench. Each data point is the average of three runs, and the data have been tested for their convergence to assure that they are repeatable when subjected to the same test environment. In general, there is no performance difference between Hyper-Threading and no Hyper-Threading for those functions that are running as a single thread. However, for those tests that require two threads to run, such as the pipe latency test and the three process latency tests, Hyper-Threading seems to degrade their latency times. The configured stock SMP kernel is denoted as 2419s. If the kernel was configured without Hyper-Threading support, it is denoted as 2419s-noht. With Hyper-Threading support, the kernel is listed as 2419s-ht.
Table 1. Effects of Hyper-Threading on Linux APIs
|Kernel function||2419s-noht||2419s-ht||Speed-up|
|Select on 10 fd's||5.41||5.41||0%|
|Select on 10 tcp fd's||5.69||5.70||0%|
|Signal handler installation||1.56||1.55||0%|
|Signal handler overhead||4.29||4.27||0%|
|Process fork+/bin/sh -c||3051.28||3118.08||-2%|
|Note: Data are in microseconds: smaller is better.|
The pipe latency test uses two processes communicating through a UNIX pipe to measure interprocess communication latency. The benchmark passes a token back and forth between the two processes. The degradation is 1%, which is small to the point of being insignificant.
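To make the token-passing idea concrete, here is a minimal Python sketch of a pipe ping-pong between a parent and a forked child. It illustrates the technique only and is not LMbench's code; the function name pipe_round_trip_us and the default repetition count are hypothetical, and interpreter overhead means the absolute numbers are not comparable to those in Table 1.

```python
import os
import time


def pipe_round_trip_us(round_trips=10000):
    """Average microseconds for one token round trip over a pair of pipes."""
    parent_r, child_w = os.pipe()  # carries tokens from child to parent
    child_r, parent_w = os.pipe()  # carries tokens from parent to child
    pid = os.fork()
    if pid == 0:
        # Child: echo every token straight back, then exit without
        # running any of the parent's code.
        try:
            os.close(parent_r)
            os.close(parent_w)
            for _ in range(round_trips):
                os.write(child_w, os.read(child_r, 1))
        finally:
            os._exit(0)
    # Parent: close the child's ends, then time the ping-pong loop.
    os.close(child_r)
    os.close(child_w)
    start = time.perf_counter()
    for _ in range(round_trips):
        os.write(parent_w, b"x")
        os.read(parent_r, 1)
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    os.close(parent_r)
    os.close(parent_w)
    return elapsed / round_trips * 1e6


if __name__ == "__main__":
    print("pipe round trip: %.2f us" % pipe_round_trip_us())
```

Because each round trip forces two context switches between the processes, a test like this is sensitive to how the scheduler places the two processes on logical processors.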
The three process tests involve process creation and execution under Linux. The purpose is to measure the time taken to create a basic thread of control. For the process fork+exit test, the data represents the latency time taken to split a process into two (nearly) identical copies and have one exit. This is how new processes are created -- but it is not very useful since both processes are doing the same thing. In this test, Hyper-Threading causes a 4% degradation.
In the process fork+execve, the data represents the time it takes to create a new process and have that new process run a new program. This is the inner loop of all shells (command interpreters). This test sees 6% degradation due to Hyper-Threading.
In the process fork+/bin/sh -c test, the data represents the time taken to create a new process and have that new process run a new program by asking the system shell to find that program and run it. This is how the C library interface called system is implemented. This call is the most general and the most expensive. Under Hyper-Threading, this test runs 2% slower compared to non-Hyper-Threading.
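The three process tests can be sketched in a few lines of Python. This is an illustration of what each test exercises, not LMbench's implementation: the helper names are hypothetical, /bin/true is assumed to exist as a trivial program to run, and Python's own overhead makes the timings much larger than the C-level figures in Table 1.

```python
import os
import time


def average_us(operation, repetitions=200):
    """Average wall-clock microseconds per call of operation()."""
    start = time.perf_counter()
    for _ in range(repetitions):
        operation()
    return (time.perf_counter() - start) / repetitions * 1e6


def fork_exit():
    """fork+exit: split the process in two and have the child exit at once."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    os.waitpid(pid, 0)


def fork_execve():
    """fork+execve: create a child and have it run a new program."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execve("/bin/true", ["/bin/true"], {})  # assumed path
        finally:
            os._exit(127)  # reached only if execve failed
    os.waitpid(pid, 0)


def fork_sh_c():
    """fork+/bin/sh -c: let the shell find and run the program.
    os.system() wraps the C library's system() call."""
    os.system("/bin/true")


if __name__ == "__main__":
    for test in (fork_exit, fork_execve, fork_sh_c):
        print("%-12s %10.1f us" % (test.__name__, average_us(test)))
```

The ordering of the three costs should mirror the text: fork+exit is cheapest, and going through the shell is the most expensive.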
Effects of Hyper-Threading on Linux single-user application workload
The AIM9 benchmark is a single-user workload designed to measure the performance of hardware and operating systems. The results are shown in Table 2. Most of the tests in the benchmark performed nearly identically with and without Hyper-Threading; the main exceptions were the sync file operations and Integer Sieves. The three operations, Sync Random Disk Writes, Sync Sequential Disk Writes, and Sync Disk Copies, are 35% to 38% slower with Hyper-Threading. On the other hand, Hyper-Threading provided a 60% improvement over non-Hyper-Threading in the case of Integer Sieves.
Table 2. Effects of Hyper-Threading on AIM9 workload
|Test name||Description||2419s-noht||2419s-ht||Speed-up|
|add_double||Thousand Double Precision Additions per second||638361||637724||0%|
|add_float||Thousand Single Precision Additions per second||638400||637762||0%|
|add_long||Thousand Long Integer Additions per second||1479041||1479041||0%|
|add_int||Thousand Integer Additions per second||1483549||1491017||1%|
|add_short||Thousand Short Integer Additions per second||1480800||1478400||0%|
|creat-clo||File Creations and Closes per second||129100||139700||8%|
|page_test||System Allocations & Pages per second||161330||161840||0%|
|brk_test||System Memory Allocations per second||633466||635800||0%|
|jmp_test||Non-local gotos per second||8666900||8694800||0%|
|signal_test||Signal Traps per second||142300||142900||0%|
|exec_test||Program Loads per second||387||387||0%|
|fork_test||Task Creations per second||2365||2447||3%|
|link_test||Link/Unlink Pairs per second||54142||59169||9%|
|disk_rr||Random Disk Reads (K) per second||85758||89510||4%|
|disk_rw||Random Disk Writes (K) per second||76800||78455||2%|
|disk_rd||Sequential Disk Reads (K) per second||351904||356864||1%|
|disk_wrt||Sequential Disk Writes (K) per second||154112||156359||1%|
|disk_cp||Disk Copies (K) per second||104343||106283||2%|
|sync_disk_rw||Sync Random Disk Writes (K) per second||239||155||-35%|
|sync_disk_wrt||Sync Sequential Disk Writes (K) per second||97||60||-38%|
|sync_disk_cp||Sync Disk Copies (K) per second||97||60||-38%|
|disk_src||Directory Searches per second||48915||48195||-1%|
|div_double||Thousand Double Precision Divides per second||37162||37202||0%|
|div_float||Thousand Single Precision Divides per second||37125||37202||0%|
|div_long||Thousand Long Integer Divides per second||27305||27360||0%|
|div_int||Thousand Integer Divides per second||27305||27332||0%|
|div_short||Thousand Short Integer Divides per second||27305||27360||0%|
|fun_cal||Function Calls (no arguments) per second||30331268||30105600||-1%|
|fun_cal1||Function Calls (1 argument) per second||112435200||112844800||0%|
|fun_cal2||Function Calls (2 arguments) per second||97587200||97843200||0%|
|fun_cal15||Function Calls (15 arguments) per second||44748800||44800000||0%|
|sieve||Integer Sieves per second||15||24||60%|
|mul_double||Thousand Double Precision Multiplies per second||456287||456743||0%|
|mul_float||Thousand Single Precision Multiplies per second||456000||456743||0%|
|mul_long||Thousand Long Integer Multiplies per second||167904||168168||0%|
|mul_int||Thousand Integer Multiplies per second||167976||168216||0%|
|mul_short||Thousand Short Integer Multiplies per second||155730||155910||0%|
|num_rtns_1||Numeric Functions per second||92740||92920||0%|
|trig_rtns||Trigonometric Functions per second||404000||405000||0%|
|matrix_rtns||Point Transformations per second||875140||891300||2%|
|array_rtns||Linear Systems Solved per second||579||578||0%|
|string_rtns||String Manipulations per second||2560||2564||0%|
|mem_rtns_1||Dynamic Memory Operations per second||982035||980019||0%|
|mem_rtns_2||Block Memory Operations per second||214590||215390||0%|
|sort_rtns_1||Sort Operations per second||481||472||-2%|
|misc_rtns_1||Auxiliary Loops per second||7916||7864||-1%|
|dir_rtns_1||Directory Operations per second||2002000||2001000||0%|
|shell_rtns_1||Shell Scripts per second||95||97||2%|
|shell_rtns_2||Shell Scripts per second||95||96||1%|
|shell_rtns_3||Shell Scripts per second||95||97||2%|
|series_1||Series Evaluations per second||3165270||3189630||1%|
|shared_memory||Shared Memory Operations per second||174080||174220||0%|
|tcp_test||TCP/IP Messages per second||65835||66231||1%|
|udp_test||UDP/IP DataGrams per second||111880||112150||0%|
|fifo_test||FIFO Messages per second||228920||228900||0%|
|stream_pipe||Stream Pipe Messages per second||170210||171060||0%|
|dgram_pipe||DataGram Pipe Messages per second||168310||170560||1%|
|pipe_cpy||Pipe Messages per second||245090||243440||-1%|
|ram_copy||Memory to Memory Copy per second||490026708||492478668||1%|
Effects of Hyper-Threading on Linux multithreaded application workload
To measure the effects of Hyper-Threading on Linux multithreaded applications, we use the chat benchmark, which is modeled after a chat room. The benchmark includes both a client and a server. The client side of the benchmark will report the number of messages sent per second; the number of chat rooms and messages will control the workload. The workload creates a lot of threads and TCP/IP connections, and sends and receives a lot of messages. It uses the following default parameters:
- Number of chat rooms = 10
- Number of messages = 100
- Message size = 100 bytes
- Number of users = 20
By default, each chat room has 20 users. A total of 10 chat rooms will have 20x10 = 200 users. For each user in the chat room, the client will make a connection to the server. So since we have 200 users, we will have 200 connections to the server. Now, for each user (or connection) in the chat room, a "send" thread and a "receive" thread are created. Thus, a 10-chat-room scenario will create 10x20x2 = 400 client threads and 400 server threads, for a total of 800 threads. But there's more.
Each client "send" thread will send the specified number of messages to the server. For 10 chat rooms and 100 messages, the client will send 10x20x100 = 20,000 messages. The server "receive" thread will receive the corresponding number of messages. The chat room server will echo each of the messages back to the other users in the chat room. Thus, for 10 chat rooms and 100 messages, the server "send" thread will send 10x20x100x19 or 380,000 messages. The client "receive" thread will receive the corresponding number of messages.
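The thread and message arithmetic above can be checked with a short Python sketch. The function name chat_workload and the dictionary keys are hypothetical; the formulas are the ones given in the text.

```python
def chat_workload(rooms=10, users_per_room=20, messages=100):
    """Connection, thread, and message counts for the chat benchmark,
    following the arithmetic described in the article."""
    connections = rooms * users_per_room  # one connection per user
    client_threads = connections * 2      # a send and a receive thread each
    server_threads = connections * 2
    client_sent = connections * messages  # every user sends `messages`
    # The server echoes each message to the other users in the room.
    server_sent = client_sent * (users_per_room - 1)
    return {
        "connections": connections,
        "client_threads": client_threads,
        "server_threads": server_threads,
        "total_threads": client_threads + server_threads,
        "client_sent": client_sent,
        "server_sent": server_sent,
    }


if __name__ == "__main__":
    for name, value in chat_workload().items():
        print("%-15s %d" % (name, value))
```

Calling chat_workload(rooms=20), for example, gives the corresponding counts for a larger run under the same per-room defaults.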
The test starts by starting the chat server in a command-line session and the client in another command-line session. The client simulates the workload and the results represent the number of messages sent by the client. When the client ends its test, the server loops and accepts another start message from the client. In our measurement, we ran the benchmark with 20, 30, 40, and 50 chat rooms. The corresponding number of connections and threads are shown in Table 3.
Table 3. Number of chat rooms and threads tested
Table 4 shows the performance impact of Hyper-Threading on the chat workload. Each data point represents the geometric mean of five runs. The data set indicates that Hyper-Threading could improve the workload throughput by 22% to 28%, depending on the number of chat rooms. Overall, Hyper-Threading boosts chat performance by 24%, based on the geometric mean of the four chat room samples.
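For readers unfamiliar with the statistic, the following Python sketch shows how a geometric mean of repeated runs and a percentage speed-up can be computed. The function names are hypothetical, the throughput values are invented purely for illustration and are not measurements from this study, and the exact way the article derives its overall figure is an assumption.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive values, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def speed_up_percent(noht, ht):
    """Relative change of the HT result over the non-HT result, in percent."""
    return (ht - noht) / noht * 100.0


if __name__ == "__main__":
    # Invented throughputs for five runs of one scenario (not real data).
    noht_runs = [100.0, 102.0, 98.0, 101.0, 99.0]
    ht_runs = [124.0, 126.0, 122.0, 125.0, 123.0]
    noht = geometric_mean(noht_runs)
    ht = geometric_mean(ht_runs)
    print("speed-up: %.1f%%" % speed_up_percent(noht, ht))
```

The geometric mean is less sensitive to a single outlying run than the arithmetic mean, which is one common reason for using it to summarize benchmark results.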
Table 4. Effects of Hyper-Threading on chat throughput
|Number of chat rooms||2419s-noht||2419s-ht||Speed-up|
|Note: Data is the number of messages sent by client: higher is better.|
Figure 1. Effects of Hyper-Threading on the chat workload
Effects of Hyper-Threading on Linux multithreaded file server workload
The effect of Hyper-Threading on the file server was measured with dbench and its companion test, tbench. dbench is similar to the well-known NetBench benchmark from the Ziff-Davis Media benchmark program, which lets you measure the performance of file servers as they handle network file requests from clients. However, while NetBench requires an elaborate setup of actual physical clients, dbench simulates the 90,000 operations typically run by a NetBench client by reading a 4 MB file called client.txt to produce the same workload. The contents of this file are file operation directives such as SMBopenx, SMBclose, SMBwritebraw, SMBgetatr, etc. Those I/O calls correspond to the Server Message Block (SMB) protocol calls that the SMBD server in SAMBA would produce in a NetBench run. The SMB protocol is used by Microsoft Windows 3.11, NT and 95/98 to share disks and printers.
In our tests, a total of 18 different types of I/O calls were used including open file, read, write, lock, unlock, get file attribute, set file attribute, close, get disk free space, get file time, set file time, find open, find next, find close, rename file, delete file, create new file, and flush file buffer.
dbench can simulate any number of clients without going through the expense of a physical setup. dbench produces only the filesystem load, and it does no networking calls. During a run, each client records the number of bytes of data moved and divides this number by the amount of time required to move the data. All client throughput scores are then added up to determine the overall throughput for the server. The overall I/O throughput score represents the number of megabytes per second transferred during the test. This is a measurement of how well the server can handle file requests from clients.
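The scoring rule described above is simple enough to sketch in Python. This is an illustration of the arithmetic, not dbench's own code; the function names are hypothetical, and treating one megabyte as 1024 x 1024 bytes is an assumption.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def client_throughput_mb_s(bytes_moved, seconds):
    """Throughput of one simulated client in MB/sec."""
    return bytes_moved / seconds / BYTES_PER_MB


def server_throughput_mb_s(clients):
    """Overall score: the sum of all client throughputs.
    `clients` is an iterable of (bytes_moved, seconds) pairs."""
    return sum(client_throughput_mb_s(b, s) for b, s in clients)


if __name__ == "__main__":
    # Two hypothetical clients, each moving 21 MB in different times.
    runs = [(21 * BYTES_PER_MB, 3.0), (21 * BYTES_PER_MB, 3.5)]
    print("overall: %.2f MB/sec" % server_throughput_mb_s(runs))
```

Because the score is a sum over clients, it rises with the client count only as long as the server can keep all clients busy.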
dbench is a good test for Hyper-Threading because it creates a high load and activity on the CPU and I/O schedulers. The ability of Hyper-Threading to support multithreaded file serving is severely tested by dbench because many files are created and accessed simultaneously by the clients. Each client has to create about 21 megabytes worth of test data files. For a test run with 20 clients, about 420 megabytes of data are expected. dbench is considered a good test to measure the performance of the elevator algorithm used in the Linux filesystem. dbench is used to test the working correctness of the algorithm, and whether the elevator is aggressive enough. It is also an interesting test for page replacement.
Table 5 shows the impact of HT on the dbench workload. Each data point represents the geometric mean of five runs. The data indicates that Hyper-Threading would improve dbench throughput by as little as 9% to as much as 29%. The overall improvement is 18% based on the geometric mean of the five test scenarios.
Table 5. Effects of Hyper-Threading on dbench throughput
|Number of clients||2419s-noht||2419s-ht||Speed-up|
|Note: Data are throughput in MB/sec: higher is better.|
Figure 2. Effects of Hyper-Threading on the dbench workload
tbench is another file server workload similar to dbench. However, tbench produces only the TCP and process load. tbench does the same socket calls that SMBD would do under a NetBench load, but tbench does no filesystem calls. The idea behind tbench is to eliminate SMBD from the NetBench test, as though the SMBD code could be made fast. The throughput results of tbench tell us how fast a NetBench run could go if we eliminated all filesystem I/O and SMB packet processing. tbench is built as part of the dbench package.
Table 6 depicts the impact of Hyper-Threading on the tbench workload. As before, each data point represents the geometric mean of five runs. Hyper-Threading improves tbench throughput by 22% to 31%. The overall improvement is 27% based on the geometric mean of the five test scenarios.
Table 6. Effects of Hyper-Threading on tbench throughput
|Number of clients||2419s-noht||2419s-ht||Speed-up|
|Note: Data are throughput in MB/sec: higher is better.|
Figure 3. Effects of Hyper-Threading on the tbench workload
Hyper-Threading support in Linux kernel 2.5.x
Linux kernel 2.4.x has been aware of HT since the release of 2.4.17. Kernel 2.4.17 knows about logical processors, and it treats a Hyper-Threaded processor as two physical processors. However, the scheduler used in the stock 2.4.x kernel is still considered naive because it cannot distinguish resource contention between two logical processors from contention between two separate physical processors.
Ingo Molnar has pointed out scenarios in which the current scheduler gets things wrong (see Resources for a link). Consider a system with two physical CPUs, each of which provides two virtual processors. If there are two tasks running, the current scheduler would let them both run on a single physical processor, even though far better performance would result from migrating one process to the other physical CPU. The scheduler also doesn't understand that migrating a process from one virtual processor to its sibling (a logical CPU on the same physical CPU) is cheaper (due to cache loading) than migrating it across physical processors.
The solution is to change the way the run queues work. The 2.5 scheduler maintains one run queue per processor and attempts to avoid moving tasks between queues. The change is to have one run queue per physical processor that is able to feed tasks into all of the virtual processors. Throw in a smarter sense of what makes an idle CPU (all virtual processors must be idle), and the resulting code "magically fulfills" the needs of scheduling on a Hyper-Threading system.
In addition to the run queue change in the 2.5 scheduler, there are other changes needed to give the Linux kernel the ability to leverage HT for optimal performance. Those changes were discussed by Molnar (again, please see Resources for more on that) as follows.
- HT-aware passive load-balancing:
The IRQ-driven balancing has to be per-physical-CPU, not per-logical-CPU. Otherwise, it might happen that one physical CPU runs two tasks while another physical CPU runs no task; the stock scheduler does not recognize this condition as "imbalance." To the scheduler, it appears as if the first two CPUs have 1-1 task running while the second two CPUs have 0-0 tasks running. The stock scheduler does not realize that the two logical CPUs belong to the same physical CPU.
- "Active" load-balancing:
This is when a logical CPU goes idle and causes a physical CPU imbalance. This is a mechanism that simply does not exist in the stock 1:1 scheduler. The imbalance caused by an idle CPU can be solved via the normal load-balancer. In the case of HT, the situation is special because the source physical CPU might have just two tasks running, both runnable. This is a situation that the stock load-balancer is unable to handle, because running tasks are hard to migrate away. This migration is essential -- otherwise a physical CPU can get stuck running two tasks while another physical CPU stays idle.
- HT-aware task pickup:
When the scheduler picks a new task, it should prefer all tasks that share the same physical CPU before trying to pull in tasks from other CPUs. The stock scheduler only picks tasks that were scheduled to that particular logical CPU.
- HT-aware affinity:
Tasks should attempt to "stick" to physical CPUs, not logical CPUs.
- HT-aware wakeup:
The stock scheduler only knows about the "current" CPU; it does not know about any sibling. On HT, if a thread is woken up on a logical CPU that is already executing a task, and if a sibling CPU is idle, then the sibling CPU has to be woken up and has to execute the newly woken-up task immediately.
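The common thread in the items above is that decisions should be made per physical CPU rather than per logical CPU. The following Python sketch illustrates that idea only; it is not Molnar's patch or kernel code. The names physical_loads and pick_cpu, the siblings layout, and the placement order (fully idle physical CPU first, then any idle logical CPU, then the least loaded one) are all assumptions made for illustration.

```python
def physical_loads(siblings, running):
    """Number of running tasks per physical CPU.
    siblings: list of tuples of logical CPU ids, one tuple per physical CPU.
    running:  dict mapping logical CPU id to its number of running tasks."""
    return [sum(running[cpu] for cpu in package) for package in siblings]


def pick_cpu(siblings, running):
    """Choose a logical CPU for a newly runnable task."""
    # 1. Prefer a physical CPU whose logical CPUs are all idle, so the
    #    task does not share execution resources with another task.
    for package in siblings:
        if all(running[cpu] == 0 for cpu in package):
            return package[0]
    # 2. Otherwise take any idle logical CPU (an idle sibling).
    for package in siblings:
        for cpu in package:
            if running[cpu] == 0:
                return cpu
    # 3. Everything is busy: fall back to the least loaded logical CPU.
    return min(running, key=running.get)


if __name__ == "__main__":
    # Two physical CPUs with two logical CPUs each, as in the example
    # scenario: both tasks sit on the first physical CPU.
    siblings = [(0, 1), (2, 3)]
    running = {0: 1, 1: 1, 2: 0, 3: 0}
    print("per-physical load:", physical_loads(siblings, running))
    print("place next task on logical CPU", pick_cpu(siblings, running))
```

Summing load per physical CPU is what exposes the imbalance in the example scenario, where a per-logical-CPU view sees nothing to fix.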
At this writing, Molnar has provided a patch to stock kernel 2.5.32 implementing all the above changes by introducing the concept of a shared runqueue: multiple CPUs can share the same runqueue. A shared, per-physical-CPU runqueue fulfills all of the HT-scheduling needs listed above. Obviously this complicates scheduling and load-balancing, and the effects on the SMP and uniprocessor scheduler are still unknown.
The change in Linux kernel 2.5.32 was designed to affect Xeon systems with more than two CPUs, especially in the load-balancing and thread affinity arenas. Due to hardware resource constraints, we were only able to measure its effects in our one-CPU test environment. Using the same testing process employed in 2.4.19, we ran the three workloads, chat, dbench, and tbench, on 2.5.32. For chat, HT could bring as much as a 60% speed-up in the case of 40 chat rooms. The overall improvement was about 45%. For dbench, 27% was the high speed-up mark, with the overall improvement about 12%. For tbench, the overall improvement was about 35%.
Table 7. Effects of Hyper-Threading on Linux kernel 2.5.32
|Number of chat rooms||2532s-noht||2532s-ht||Speed-up|
|Number of clients||2532s-noht||2532s-ht||Speed-up|
|Number of clients||2532s-noht||2532s-ht||Speed-up|
|Note: chat data is the number of messages sent by the client/sec; dbench and tbench data are in MB/sec.|
Intel Xeon Hyper-Threading has a clear positive impact on the Linux kernel and multithreaded applications. The speed-up from Hyper-Threading could be as high as 30% in stock kernel 2.4.19 and as high as 51% in kernel 2.5.32, thanks to drastic changes in the scheduler's run queue support and Hyper-Threading awareness.
The author would like to thank Intel's Sunil Saxena for invaluable information gleaned from the session "Performance tuning for threaded applications -- with a look at Hyper-Threading" at the LinuxWorld Conference in San Francisco, August 2002.
- You can download the chat benchmark from the Linux Benchmark Suite Homepage.
- The README file from dbench is courtesy of SAMBA.
- More information on LMbench can be found at the LMbench home page.
- The home of the Ziff-Davis NetBench benchmarking test gives more details of their test suite.
- The Linux elevator algorithm is discussed in the November 23, 2000 edition of the Linux Weekly News Kernel Development section.
- An August 2002 note on Hyper-Threading posted by Ingo Molnar to the kernel list is reprinted in the Linux Weekly News.
- Another August 2002 LWN article also discusses the scheduler and Hyper-Threading (among other things).
- Learn about IBM's developer contributions to Linux at the IBM Linux Technology Center.
- Find more resources for Linux developers in the developerWorks Linux zone.