Level: Introductory
Matt Davis (mattdavis@us.ibm.com), Linux Power Technical Consultant, IBM
Chakarat Skawratananond, Linux on POWER technical consultant, IBM
Nikolay Yevik (yevik@us.ibm.com), Linux on POWER Technical Consultant, IBM
26 Jul 2004
In this article, we highlight the differences between the Version 2.4 and 2.6 Linux kernels on POWER.
Main differences between 2.4 and 2.6 Linux kernels on POWER
Module Subsystem, Unified Device Model, and PnP support
The module subsystem has changed significantly in the 2.6 kernel.
Improved Stability
The process for loading and unloading kernel modules was improved so that a module
cannot be used while it is being loaded or unloaded, or at least so that such cases are
far less likely. Using a module during loading or unloading could sometimes crash the
system.
Unified device model
The creation of the Unified Device Model is one of the most important changes in the 2.6 kernel. It
promotes standardization of module interfaces, allowing better control and
management of devices, for example:
- Better determination of system devices
- Power management and power state of a device
- Improved system bus structure management.
Plug-and-Play (PnP) support
Together, the improved module subsystem and the unified device model make Linux running kernel 2.6
a true Plug-and-Play OS. For example, the kernel includes PnP support for ISA PnP extensions, legacy MCA
and EISA buses, and hot-plug PCI devices.
Kernel infrastructure changes
- Kernel modules now have a .ko extension to differentiate them from regular object files,
which have a .o extension.
- A new sysfs filesystem has been created that represents the device tree as the kernel
sees it.
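As a rough illustration of the sysfs device tree, the following Python sketch lists the buses the kernel exposes and the devices registered on each. It assumes sysfs is mounted at /sys and relies on the /sys/bus/<bus>/devices layout; the function name list_sysfs_buses is hypothetical and chosen only for this example.

```python
import os


def list_sysfs_buses(sysfs_root="/sys"):
    """Return {bus name: sorted device names} read from <sysfs_root>/bus."""
    bus_dir = os.path.join(sysfs_root, "bus")
    buses = {}
    if not os.path.isdir(bus_dir):
        return buses  # sysfs is not mounted here (for example, on a 2.4 kernel)
    for bus in sorted(os.listdir(bus_dir)):
        devices_dir = os.path.join(bus_dir, bus, "devices")
        if os.path.isdir(devices_dir):
            buses[bus] = sorted(os.listdir(devices_dir))
    return buses


if __name__ == "__main__":
    for bus, devices in list_sysfs_buses().items():
        print(f"{bus}: {len(devices)} device(s)")
```

On a system without sysfs, the function simply returns an empty dictionary.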
Memory support, NUMA Support
Greater amounts of RAM supported
The 2.6 kernel supports greater amounts of RAM, up to 64GB in paged mode.
NUMA
Support for Non-Uniform Memory Access (NUMA) systems is new in 2.6 kernels.
Threading Models, NPTL
New in version 2.6 is NPTL (Native POSIX Thread Library), which replaces the LinuxThreads implementation used with v2.4.
NPTL brings enterprise-class threading support to Linux, with performance far surpassing that of
LinuxThreads. It is based on a 1:1 ratio between user and kernel threads.
As of October 2003, NPTL support had been merged into the GNU C library, glibc. Red Hat first
implemented NPTL in Red Hat Linux 9 and Red Hat Enterprise Linux, using a customized
v2.4 kernel.
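The 1:1 model can be observed from user space. The sketch below is only an illustration and assumes CPython 3.8 or later on Linux, where Python threads are built on the system's POSIX threads and threading.get_native_id() reports the kernel-assigned thread ID. The helper name native_thread_ids is hypothetical.

```python
import threading


def native_thread_ids(count=4):
    """Start `count` threads and return the kernel thread ID seen by each."""
    ids = []
    lock = threading.Lock()
    # The barrier keeps every thread alive until all have recorded their ID,
    # so the kernel cannot hand a finished thread's ID to a later one.
    barrier = threading.Barrier(count)

    def worker():
        tid = threading.get_native_id()  # ID assigned by the kernel
        with lock:
            ids.append(tid)
        barrier.wait()

    threads = [threading.Thread(target=worker) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ids


if __name__ == "__main__":
    print(native_thread_ids())
```

Each user-level thread reports a distinct kernel thread ID, which is what a 1:1 mapping between user and kernel threads implies.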
Performance Improvements
- New Scheduler Algorithm
- A new O(1) scheduling algorithm has been introduced in the 2.6 Linux kernel. It performs especially
well under high loads. The new scheduler improves performance by distributing
timeslices on a per-CPU basis, thus eliminating the global synchronization and
recalculation loop.
- Kernel Preemption
- The 2.6 kernel is preemptible. This significantly improves the performance of
interactive and multimedia applications.
- I/O Performance Improvements
- Linux’s I/O subsystem has also undergone major changes to make I/O operations
more responsive. The I/O scheduler was changed to ensure that no process is stuck in the
queue for too long waiting to perform an input/output operation.
- Fast User-Space Mutexes
- Responsiveness is also improved by the introduction of "futexes" (Fast User-Space Mutexes),
which let threads serialize access to shared resources and so avoid race conditions. Futexes are
implemented partly in kernel space, which allows waiting tasks to be prioritized on the
basis of contention.
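To make the scheduler change more concrete, here is a heavily simplified Python model of the idea behind the O(1) scheduler. It is a sketch, not kernel code: it assumes the commonly described design of one run queue per CPU holding an "active" and an "expired" priority array, each with 140 priority levels and a bitmap of non-empty levels. The class and method names are hypothetical.

```python
from collections import deque

NUM_PRIORITIES = 140  # assumed number of priority levels; 0 is the highest


class PriorityArray:
    """One FIFO queue per priority plus a bitmap of non-empty queues."""

    def __init__(self):
        self.queues = [deque() for _ in range(NUM_PRIORITIES)]
        self.bitmap = 0  # bit p is set while queues[p] holds a task

    def enqueue(self, task, priority):
        self.queues[priority].append(task)
        self.bitmap |= 1 << priority

    def dequeue_highest(self):
        # Find the lowest set bit. The cost depends on the fixed number of
        # priority levels, not on how many tasks are queued.
        priority = (self.bitmap & -self.bitmap).bit_length() - 1
        task = self.queues[priority].popleft()
        if not self.queues[priority]:
            self.bitmap &= ~(1 << priority)
        return task, priority

    def empty(self):
        return self.bitmap == 0


class RunQueue:
    """Per-CPU run queue: each CPU would own one instance."""

    def __init__(self):
        self.active = PriorityArray()
        self.expired = PriorityArray()

    def add(self, task, priority):
        self.active.enqueue(task, priority)

    def pick_next(self):
        """Return the next (task, priority), or None if nothing is runnable."""
        if self.active.empty():
            # All tasks have used their timeslices: swap the two arrays
            # instead of looping over every task to recalculate timeslices.
            self.active, self.expired = self.expired, self.active
        if self.active.empty():
            return None
        return self.active.dequeue_highest()

    def timeslice_expired(self, task, priority):
        # A task that has used up its timeslice waits in the expired array.
        self.expired.enqueue(task, priority)
```

Because each CPU works on its own run queue and the array swap replaces a walk over all tasks, this model has no global recalculation loop, which is the property the article attributes to the new scheduler.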
Scalability Improvements
- Higher Processor Count
- Linux kernel 2.6 can support up to 64 CPUs.
- Larger Memory Support
- On 32-bit systems, PAE (Physical Address Extension) increases memory support in paged
mode to 64GB.
- Users and Groups
- The number of unique users and groups has been increased from about 65,000 to over 4 billion,
that is, from 16-bit to 32-bit identifiers.
- Number of PIDs
- The maximum number of PIDs was increased from 32,000 to 1 billion.
- Number of Open File Descriptors
- The number of open file descriptors was not increased, but this parameter no longer
has to be set in advance; it scales automatically.
- Greater Number of Devices Supported
- Before the 2.6 kernel, limits within the kernel could constrain
large systems, such as 256 devices per chain. The v2.6 kernel moves well beyond these
limitations, supporting not only more types of devices but also more devices of the same
type. A Linux 2.6 system allows 4095 major device types and more than a
million subdevices per type.
- File System Size
- Linux kernel 2.6 can address file systems of up to 16TB.
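Several of these limits follow directly from field widths. The short Python calculation below is a worked example only. The 16-bit and 32-bit ID widths come from the list above; the 36-bit physical address width for PAE and the 12-bit major / 20-bit minor split of the device number are background assumptions that the article does not state.

```python
# User and group IDs: 16-bit versus 32-bit identifiers
ids_16_bit = 2 ** 16          # 65,536, the "65,000" limit
ids_32_bit = 2 ** 32          # 4,294,967,296, "over 4 billion"

# Assumption: PAE provides 36-bit physical addresses on 32-bit systems
pae_bytes = 2 ** 36
pae_gb = pae_bytes // 2 ** 30  # 64

# Assumption: device numbers use 12 bits for the major and 20 for the minor
MAJOR_BITS, MINOR_BITS = 12, 20
major_values = 2 ** MAJOR_BITS   # 4,096 values, so the highest major is 4095
minor_values = 2 ** MINOR_BITS   # 1,048,576, "more than a million"

if __name__ == "__main__":
    print(f"UIDs/GIDs: {ids_16_bit:,} -> {ids_32_bit:,}")
    print(f"PAE addressable memory: {pae_gb}GB")
    print(f"Device numbers: {major_values:,} majors x {minor_values:,} minors")
```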
File systems
Traditional Linux file systems such as ext2, ext3, and ReiserFS were significantly improved. The most notable improvement is the introduction of extended attributes, or file metadata. Of major
importance is the implementation of POSIX ACLs, an addition to the usual UNIX permissions that allows
more fine-grained user access control.
In addition to improved support for traditional Linux filesystems, the new kernel includes full
support for XFS, a filesystem that is relatively new on Linux.
The Linux 2.6 kernel also features improved support for the NTFS filesystem, which can now be mounted
in read/write mode.
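Extended attributes are name/value pairs stored alongside a file. The following Python sketch shows one way to set and read back such an attribute. It assumes a Linux system where Python provides os.setxattr and os.getxattr, and a filesystem that accepts attributes in the "user." namespace. The attribute name user.comment and the helper xattr_round_trip are hypothetical, chosen only for illustration.

```python
import os
import tempfile


def xattr_round_trip(directory=None, name="user.comment", value=b"reviewed"):
    """Set and read back one extended attribute on a scratch file.

    Returns the stored bytes, or None when the platform or the underlying
    filesystem does not support user extended attributes.
    """
    if not hasattr(os, "setxattr"):
        return None  # these calls are only provided on Linux
    with tempfile.NamedTemporaryFile(dir=directory) as scratch:
        try:
            os.setxattr(scratch.name, name, value)
            return os.getxattr(scratch.name, name)
        except OSError:
            return None  # for example, a filesystem without xattr support


if __name__ == "__main__":
    print(xattr_round_trip())
```

Whether the call succeeds depends on the filesystem and its mount options, which is why the sketch treats failure as a normal outcome.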
New features of Linux Distributions for POWER5
Linux distributions that will run on POWER5-based systems are SUSE LINUX Enterprise Server 9 (SLES
9), and Red Hat Enterprise Linux Advanced Server 3 with the third service pack (RHEL AS 3 Update 3).
Both distributions will be generally available in 2004. SLES 9 is based on Linux kernel 2.6. RHEL AS 3
Update 3 is based on Linux kernel 2.4. Both SLES 9 and RHEL AS 3 Update 3 will run on POWER4
hardware as well. The following table highlights POWER5 features supported in the two distributions.
| Function | SLES 9 | RHEL AS 3 Update 3 |
|---|---|---|
| Dynamic LPAR | | |
| -- Processors | Y | N |
| -- Memory | N | N |
| -- I/O | Y | N |
| -- Max 254 Partitions | Y | Y |
| Sub-Processor partition with 0.1 granularity | Y | Y |
| -- Capped and Uncapped partitions | Y | Y |
| -- Simultaneous multi-threading | Y | Y |
| Storage Options | | |
| Virtual SCSI Server | N | N |
| Virtual SCSI Client | Y (i5 initially; p5 with AIX 5.3) | Y (i5 initially; p5 with AIX 5.3) |
| Communication Options | | |
| Virtual LAN | Y | Y |
| Large Page Support | Y | N |
| PCI Hot Plug | Y | N |
| SUE machine check handling | Y | Y |
The following sections describe these features in detail.
Dynamic logical partitioning (Dynamic LPAR)
Logical partitioning allows multiple operating systems to reside on a hardware platform
simultaneously. System resources are divided so that partitions cannot interfere with each other.
Managing LPARs in the system is made possible by the hardware management console (HMC).
With Dynamic LPAR, resources can be dynamically added and removed without requiring a
partition reboot. When these resources need to be added, administrators can reconfigure the
system to recognize these additional resources. The maximum number of logical partitions
supported depends on the number of processors in the server model and the system limit is 254.
Adoption of Dynamic LPAR ultimately depends on the Linux distributor and on the use of the
2.6 kernel. SLES 9 supports the dynamic movement of processors and I/O. RHEL AS 3
Update 3 will not support Dynamic LPAR.
Sub-Processor Partition
A minimum of 0.10 processing units can be configured for any partition using shared processors.
A group of physical processors that can be shared among multiple logical partitions is called a
shared processing pool. The shared processor function allows you to assign partial processors to a
logical partition.
Consider Figure 1 as an example of an environment using the shared processor pool. This figure
represents a fictional setup of a 4-way machine running either i5/OS or AIX. It also has three
additional logical partitions. Assume the second partition is a transactional server that processes
financial transactions; furthermore, assume that this transaction application interacts with either
AIX or i5/OS to store and retrieve its information in a database. The partition labeled "Report" is
the sister application to the transactional server, and it generates financial reports. For the
purpose of load balancing, the company has separated the transactional and report partitions
because the transactional server is time and response sensitive, while report generation can be
done at off-peak times. The last partition is the company’s development and test partition, which
serves as a development space for its engineers. Notice how the processors have been
divided among the four partitions based on workload.
Figure 1. An example of shared processors
Partitions in the shared processing pool can have a sharing mode of capped or uncapped. A
capped partition indicates that the logical partition will never exceed its assigned processing
capacity. Any unused processing resources will only be used by the uncapped partitions in the
shared processing pool. You can specify whether a partition is capped or uncapped when you
define the partition’s profile. While defining a partition, you can also set minimum and
maximum processor values, expressed as whole processors or fractions of a processor. This fits nicely with the
example discussed earlier. Figure 2 extends the previous example to show the
minimum and maximum values.
Figure 2. Dynamic movement of processor power based on workloads
The advantage of being able to move processor power dynamically based on demand is evident in
the fictitious example. The loads of the Transaction and Report partitions show the need
for dynamic processor allocation. The Transaction server has one and a half processors allocated
(its minimum). It can also consume the second half of the second virtual processor and all
of the third, based on demand. If you assume the reports are run during off-peak
hours, when the system may have more idle time, then the Report partition and its
applications can consume up to two virtual processors but no less than three-quarters of one
processor. The same goes for the Test partition. Suppose the engineers need to compile their
applications. If the compile is done while there is idle processing power, the Test partition can
consume up to three virtual processors, allowing the compiles to complete more quickly.
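The minimum and maximum values in this example can be expressed as a small calculation. The Python sketch below is a simplified model, not the hypervisor's actual algorithm: it assumes a partition always keeps its minimum, never exceeds its maximum, and can grow beyond the minimum only by the amount of idle capacity in the shared pool. The function name granted_capacity and the demand figures are hypothetical; the limits for the Transaction and Report partitions come from the example above.

```python
def granted_capacity(demand, minimum, maximum, idle):
    """Processing units granted to one partition under the simplified model."""
    wanted = max(demand, minimum)           # a partition keeps its minimum
    return min(wanted, maximum, minimum + idle)


# Limits taken from the example: (minimum, maximum) in processing units
TRANSACTION = (1.5, 3.0)
REPORT = (0.75, 2.0)

if __name__ == "__main__":
    # Transaction server wants 2.4 units while 1.0 unit of the pool is idle
    print(granted_capacity(2.4, *TRANSACTION, idle=1.0))
    # Report server wants 1.9 units during off-peak hours with 2.0 units idle
    print(granted_capacity(1.9, *REPORT, idle=2.0))
```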
Most of these examples of processor sharing assume that parts of the system are idle so
that other partitions can use the resources, but there will certainly be times when multiple
partitions ask for more processing power. Consider an example where both the Report
and Transaction servers require more processor power because of peaking workloads. Because
timely response from the Transaction server is critical to your business, you would prefer that the
Transaction server get virtual processing power before the Report partition. This is where setting
weights for processing power becomes important.
Uncapped weight is a number in the range of 0 through 255 that you set for each uncapped
partition in the shared processing pool, with 255 being the highest weight. Any available
unused capacity is distributed to contending logical partitions in proportion
to their uncapped weights. The default uncapped weight value is 128.
Figure 3. Weights determine distribution of unused processors
In the situation where both the Transaction and Report servers are peaking, weights can be set to
determine how processors should be allocated. In Figure 3, the weight for the Transaction server
is set to two and the weight for the Report server is set to one. So for every three processing units that are
available during the peak, the hypervisor will assign two processing units to the Transaction server
and one to the Report server.
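The proportional split described above can be written out as a short calculation. This Python sketch only illustrates the arithmetic and is not the hypervisor's implementation; the function name share_unused_capacity is hypothetical.

```python
def share_unused_capacity(unused_units, weights):
    """Split unused processing units among partitions in proportion to weight."""
    for name, weight in weights.items():
        if not 0 <= weight <= 255:
            raise ValueError(f"uncapped weight for {name} must be 0 through 255")
    total = sum(weights.values())
    if total == 0:
        # No partition has a non-zero weight, so nothing is handed out.
        return {name: 0.0 for name in weights}
    return {name: unused_units * weight / total
            for name, weight in weights.items()}


if __name__ == "__main__":
    # Weights from Figure 3: Transaction = 2, Report = 1, three units available
    print(share_unused_capacity(3.0, {"Transaction": 2, "Report": 1}))
```

With three unused processing units and weights of two and one, the Transaction server receives two units and the Report server one, matching the example.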
Simultaneous multi-threading
The POWER5 architecture features Simultaneous Multi-Threading (SMT) technology. The POWER4
microprocessor collects a group of up to five instructions per clock cycle and can complete one
group of instructions per clock cycle. The POWER5 microprocessor doubles that throughput by
collecting two groups of up to five instructions per clock cycle and completing two groups per
clock cycle. Both SLES 9 and RHEL AS 3 Update 3 support this technology.
Storage options
For storage and I/O, Linux can take advantage of a variety of real and virtual devices. This
flexibility allows for cost-effective setup of Linux partitions. In the case of disks, Linux logical
partitions support three different storage options.
- Internal storage using SCSI adapters and drives dedicated to the partition.
- External storage using SAN adapters dedicated to the partition.
- Virtual storage using a virtual SCSI adapter and storage in a different partition.
Virtual Disk
Virtual storage allows multiple partitions within a POWER5-based system to share storage.
One partition, the I/O server partition, owns the physical adapters and storage (which may be
internal or external). Virtual adapters allow other partitions, I/O client partitions, to use
storage from the I/O server partition. The I/O server partition can run AIX or i5/OS. Both SLES
9 and RHEL AS 3 Update 3 support virtual storage as I/O client partitions.
Figure 4. AIX or i5/OS can provide virtual disk to Linux partitions
Figure 4 shows how a hosting partition can provide virtual disks to Linux
partitions. The benefits of virtual disks include more than saving the expense of disk drives. On smaller
machines, adding disks and controllers may be a challenge and may also require the purchase of
an expansion unit. Also, the virtual disks can be managed, backed up, and quickly replicated by
the hosting system.
CDROM, Tape, and DVD-ROM
You can also share SCSI devices owned by AIX 5.3 or i5/OS with Linux partitions. This works much
like the virtual disk function. If AIX or i5/OS owns a CD-ROM, tape device, or DVD-RAM
drive, Linux can use those devices as if they were physically attached to the Linux partition, provided
that the hosting partition is not actively using the device. The benefits of virtual SCSI devices are
much the same as those of virtual disk: less hardware expense and no need to dedicate a device to each
partition.
Communication options
Linux on POWER5-based systems can establish a TCP/IP connection through either a directly
attached network interface or through a virtual Ethernet interface. Virtual Ethernet provides roughly
the same function as a 1 Gigabit Ethernet adapter. Partitions in POWER5-based servers can
communicate with each other using TCP/IP over the virtual Ethernet communication ports.
You can define up to 4,094 separate virtual Ethernet LANs (VLANs). Each partition can have up to
65,534 virtual Ethernet adapters connected to the virtual switch. Each adapter can be connected to 21
VLANs. The enablement and setup of virtual Ethernet does not require any special hardware or
software. After you enable a specific virtual Ethernet for a partition, a network device named ethXX is
created in the partition. The user can then set up TCP/IP configuration to communicate with other
partitions.
Let’s reconsider the example from the Sub-Processor Partition section. In the description of
the scenario, the Transaction server uses a database in the AIX or i5/OS partition for storing and
retrieving information. This is a very typical use of virtual Ethernet, because the communication is
very fast and no additional hardware is required. Figure 5 depicts the example with the addition of
two virtual Ethernet LANs.
Figure 5. Virtual Ethernet LANs are a fast and cost-effective way for partitions to communicate with each other.
In most cases, however, you will want to allow partitions connected to a virtual Ethernet to also
communicate with the physical network. That requires that at least one partition have both a physical
Ethernet adapter, as well as a virtual Ethernet adapter that is connected to the other partitions. The
partition owning both adapters can route traffic between the physical and virtual Ethernet networks.
Figure 6. A partition that has both a physical and a virtual network connection can route traffic to a physical LAN.
A common way to connect your partitions to a physical network is to run a firewall in one of your
partitions. In the firewall partition, you can have a network interface card that connects directly to the
physical network as seen in Figure 6. The other partitions can then communicate with the physical
network by passing traffic through the virtual LAN and the firewall.
At the time of this writing, Virtual LAN is available for both SLES 9 and RHEL AS 3 Update 3.
Large page support
The 2.6 kernel supports two virtual page sizes: the traditional 4KB page size and the
16MB page size. Large page usage is primarily intended to provide performance improvements to
memory-access-intensive applications. With large page support, applications can run with text
and data segments backed by large pages (16MB) with no changes to the application code. The
performance improvements come from fewer translation lookaside buffer (TLB) misses, because
the TLB can map a larger virtual memory range. Large pages also improve memory
prefetching by eliminating the need to restart prefetch operations on 4KB boundaries. Large page support is
available in SLES 9, but not in RHEL AS 3 Update 3.
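A quick calculation shows why larger pages reduce TLB misses. The Python sketch below compares how much memory a TLB can map, and how many pages a region needs, for the two page sizes. The TLB size of 1,024 entries and the 1GB region are hypothetical figures used only to make the arithmetic concrete; they are not POWER5 specifications.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

SMALL_PAGE = 4 * KB
LARGE_PAGE = 16 * MB
TLB_ENTRIES = 1024  # hypothetical TLB size, for illustration only


def tlb_reach(entries, page_size):
    """Bytes of virtual memory the TLB can map without a miss."""
    return entries * page_size


def pages_needed(region_bytes, page_size):
    """Number of pages (and so translations) needed to cover a region."""
    return -(-region_bytes // page_size)  # ceiling division


if __name__ == "__main__":
    print(tlb_reach(TLB_ENTRIES, SMALL_PAGE) // MB, "MB reach with 4KB pages")
    print(tlb_reach(TLB_ENTRIES, LARGE_PAGE) // GB, "GB reach with 16MB pages")
    print(pages_needed(1 * GB, SMALL_PAGE), "small pages cover 1GB")
    print(pages_needed(1 * GB, LARGE_PAGE), "large pages cover 1GB")
```

Whatever the real TLB size, each 16MB entry covers 4,096 times the memory of a 4KB entry, so the same number of entries maps a much larger range.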
PCI hot plug
With this capability, you can insert a new PCI hot plug adapter into an available PCI slot while the operating system is running. This can be another adapter of the same type that is currently installed or a different type of PCI adapter. New resources become available to the operating system and applications without having to restart. You can also replace a defective PCI hot plug adapter with another of the same type without shutting down the system. When you exchange the adapter, the existing device driver supports the adapter because it is of the same type. Device configuration and configuration information about devices below the adapter are retained for the replacement device. PCI Hot Plug is supported in SLES 9, but not in RHEL AS 3 Update 3.
SUE Machine Check Handling
This capability allows the system to mark Special Uncorrectable Errors (SUE) and kill
any dependent processes that reference the affected resource, while the system continues to run without
requiring a reboot to recover from the error. Both SLES 9 and RHEL AS 3 Update 3 support this feature.
Development toolchain changes
With all the innovation in the Linux 2.6 kernel, accommodations must be made in the libraries and
user space development tools. This section addresses changes in the GNU toolchain, meaning
glibc, binutils, as, ld, and gcc. While this section is far from complete, it should be a good starting
reference, to be supplemented by additional reading in the freely available source change logs.
Glibc
Glibc, the GNU C library, has been renovated in version 2.3 to support new and improved 2.6 kernel
features as well as new and extended functions for the POWER architecture. The changes
have centered primarily on the Native POSIX Thread Library (NPTL) model. Changes have also been made for
internationalization, network interface addressing, and regular expression usage, among other things.
- Internationalization
- Internationalization has been improved by enabling iconv to use the system locale. Also, thread-safe
interfaces to locale.h have been implemented. They are not individually reviewed here, but details are kept with the freely available source code documentation.
- Network interface
- Network interface addressing has been improved with a BSD-compliant implementation.
- Regular expression
- Regular expression matching is now considerably faster after a complete rewrite for POSIX compliance.
- Fexecve
- Fexecve, used to execute a program referenced by a file descriptor, is now enabled on Linux.
- Malloc
- Malloc is now based on Doug Lea’s malloc 2.7.0 to be faster and more compatible.
- Thread-local storage
- Thread-local storage has been implemented to allow for faster collection and storage of void
objects in threads, as this is now handled by the compiler. For more information, see Ulrich
Drepper’s whitepaper at http://people.redhat.com/drepper/tls.pdf
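The idea of thread-local storage, where each thread sees its own copy of a variable, can be illustrated with Python's threading.local. This is only a conceptual analogy: it is not the compiler-supported mechanism in glibc that the whitepaper describes, and the helper names below are hypothetical.

```python
import threading

per_thread = threading.local()  # each thread gets its own set of attributes


def run_workers(names):
    """Have each thread store its name, then read it back after all have written."""
    results = {}
    lock = threading.Lock()
    barrier = threading.Barrier(len(names))

    def worker(name):
        per_thread.name = name      # visible only to the current thread
        barrier.wait()              # wait until every thread has written
        with lock:
            results[name] = per_thread.name  # still this thread's own value

    threads = [threading.Thread(target=worker, args=(n,)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The calling thread never set per_thread.name, so it has no such attribute.
    return results, hasattr(per_thread, "name")


if __name__ == "__main__":
    print(run_workers(["a", "b", "c"]))
```

Although every thread writes to the same name, no thread overwrites another's value, and no lock is needed to protect the variable itself.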
GNU binutils
The GNU binutils include ld, as, and several other minor utilities, such as objcopy and readelf. These
minor binutils have not changed specifically to accommodate the 2.6 kernel or the POWER
architecture in this release, but have changed in less significant ways. For example, the utility readelf
can now be used to display information on files stored in archives. A full changelog is available on
the binutils web site at http://sources.redhat.com/binutils/
AS and LD
AS and LD have changed in several ways for the POWER architecture. Because the POWER
architecture is used with a variety of operating systems, not all of these changes address Linux-specific
needs. Changes that do affect Linux on POWER include better support for POWER opcodes and
additions for VMX extensions (available on PPC970-based Linux offerings). Additionally,
optimization profiles are now available for POWER4 and PPC970 chips. The default optimization
profile is -maltivec, supporting optimization for the VMX extensions found in the PPC970. For
POWER4 optimization, use -mpower4.
GCC
GCC has also changed notably to include support for NPTL, but other changes should not be
overlooked. Numerous improvements for POWER scheduling, optimization, and compliance have been added.
- DFA Scheduler
- The DFA Scheduler for instructions is supported in gcc 3.3.3. Learn more about this project at
http://www.gnu.org/software/gcc/news/dfa.html.
- Directives
- Directives can now be used inside of C macros.
- Includes
- An -I include directory is ignored if it is already in the include path. This avoids
unexpected ordering problems with includes.
- New support for the POWER4 processor
- New support for POWER4 processor-specific optimizations, for example, -mpower4
- Improvements for VMX extensions
- Several functional improvements for VMX extensions in PPC970 chips.
- More ISO C99 compliance.
- For a current status, see this table at http://www.gnu.org/software/gcc/gcc-3.3/c99status.html
About the authors
Matt Davis is a Linux technical consultant in the IBM eServer Solutions Enablement team. As a member of the pSeries Linux project since its inception, he explored and tested emerging technology for pSeries Linux and wrote several reports summarizing his findings. These include Journaling File Systems for Linux for POWER, Parallel Grid Computing with Linux for POWER, Open Source Alternatives to Commercial Software for Linux for POWER, and the Linux Solutions Catalog, to name a few. He came to IBM as an intern during his tenure as a student at the University of Texas at Austin, from which he earned two degrees. He can be reached at mattdavis@us.ibm.com.
Chakarat Skawratananond works as a Technical Consultant in the IBM eServer Solutions Enablement organization. Chakarat assists ISVs in enabling their applications for AIX and Linux on the IBM pSeries platform. You can contact Chakarat at chakarat@us.ibm.com.
Nikolay Yevik, a Linux on POWER consultant in IBM’s Solutions Enablement group, has more than 5 years of experience working on UNIX platforms, performing development work in C, C++ and Java. He has Masters degrees in Petroleum Engineering and Computer Science. Nikolay can be reached at yevik@us.ibm.com.