Migrating from Linux Kernel 2.4 to 2.6 on iSeries and pSeries

Level: Introductory

Matt Davis (mattdavis@us.ibm.com), Linux Power Technical Consultant, IBM 
Chakarat Skawratananond, Linux on POWER technical consultant, IBM 
Nikolay Yevik (yevik@us.ibm.com), Linux on POWER Technical Consultant, IBM 

26 Jul 2004

In this article we highlight the differences between the Version 2.4 and 2.6 Linux kernels on POWER.

Main differences between 2.4 and 2.6 Linux kernels on POWER

Module Subsystem, Unified Device Model, and PnP support

The module subsystem has changed significantly.

Improved Stability

The process of loading and unloading kernel modules was improved so that a module cannot be used while it is being loaded or unloaded, or at least so that such cases are much rarer. Using a module during this window sometimes led to a system crash.

Unified device model

The creation of the unified device model is one of the most important changes in the 2.6 kernel. It promotes standardization of module interfaces, which allows better control and management of devices, for example:

  • Better determination of system devices
  • Power management and power state of a device
  • Improved system bus structure management

Plug-and-Play (PnP) support

Together, the changes described under Improved Stability and Unified device model make Linux running the 2.6 kernel a true Plug-and-Play OS. Examples include support for ISA PnP extensions, legacy MCA and EISA buses, and hot-plug PCI devices.

Kernel infrastructure changes

  • Kernel modules now have the .ko extension to distinguish them from regular object files, which have the .o extension.
  • A new sysfs file system has been created that represents the device tree as the kernel sees it.
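
As a rough illustration of browsing that device tree, the following Python sketch lists the bus types the kernel exposes under sysfs. It assumes sysfs is mounted at /sys, the conventional mount point; the function name list_sysfs_entries and its parameters are hypothetical helpers, not part of any kernel or library interface.

```python
import os


def list_sysfs_entries(root="/sys", subdir="bus"):
    """Return the sorted entry names under <root>/<subdir>.

    Returns an empty list if the directory does not exist, for example
    on a 2.4 kernel or a system where sysfs is not mounted.
    """
    path = os.path.join(root, subdir)
    try:
        return sorted(os.listdir(path))
    except OSError:
        return []


if __name__ == "__main__":
    # Assumption: sysfs is mounted at /sys.
    for bus in list_sysfs_entries():
        print(bus)
```

Passing a different subdir, such as "class" or "devices", shows other views of the same tree.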

Memory support, NUMA Support

Greater amounts of RAM supported

The 2.6 kernel supports larger amounts of RAM, up to 64GB in paged mode.

NUMA

Support for Non-Uniform Memory Access (NUMA) systems is new in the 2.6 kernel.

Threading Models, NPTL

New in version 2.6 is NPTL (Native POSIX Thread Library), which replaces the LinuxThreads implementation used with v2.4. NPTL brings enterprise-class threading support to Linux, far surpassing the performance offered by LinuxThreads. It is based on a 1:1 mapping between user and kernel threads.

As of October 2003, NPTL support was merged into the GNU C library, glibc. Red Hat first implemented NPTL in Red Hat Linux 9 and Red Hat Enterprise Linux using a customized v2.4 kernel.
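
One way to see which threading implementation a system uses is to ask glibc. The sketch below does this from Python through os.confstr with the glibc-specific name CS_GNU_LIBPTHREAD_VERSION; where that name is not available it simply returns None. The helper names are illustrative, and the version strings mentioned in the comments are examples of the format, not output captured from a particular system.

```python
import os


def threading_library():
    """Return glibc's thread library description, or None if unavailable."""
    try:
        return os.confstr("CS_GNU_LIBPTHREAD_VERSION")
    except (ValueError, OSError, AttributeError):
        # The name is glibc-specific; other platforms may not define it.
        return None


def is_nptl(version):
    """True if a version string such as 'NPTL 2.3.4' names NPTL."""
    return version is not None and version.startswith("NPTL")


if __name__ == "__main__":
    version = threading_library()
    print("thread library:", version)
    print("NPTL in use:", is_nptl(version))
```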

Performance Improvements

New Scheduler Algorithm
A new O(1) scheduling algorithm was introduced in the 2.6 kernel. It performs especially well under high load. The new scheduler improves performance by distributing timeslices on a per-CPU basis, which eliminates the global synchronization and recalculation loop.
Kernel Preemption
The 2.6 kernel is preemptible, which significantly improves the responsiveness of interactive and multimedia applications.
I/O Performance Improvements
The Linux I/O subsystem has also undergone major changes. The I/O scheduler was changed to make I/O operations more responsive by ensuring that no process waits in the queue too long to perform an input/output operation.
Fast User-Space Mutexes
Responsiveness is also improved by "futexes" (Fast User-Space Mutexes), which let threads serialize access to shared data and so avoid race conditions. Futexes are implemented partly in user space and partly in kernel space: the uncontended case is handled in user space, and the kernel becomes involved only when there is contention, where it can prioritize the waiting tasks.
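
To make the serialization idea concrete, here is a minimal Python sketch in which several threads increment a shared counter under a lock. Python does not expose futexes directly. The assumption here is that on Linux with glibc the interpreter's lock is built on POSIX primitives that NPTL in turn implements with futexes, so the example shows the pattern futexes accelerate rather than the futex interface itself. The function name count_with_lock is hypothetical.

```python
import threading


def count_with_lock(num_threads=4, increments=10000):
    """Increment a shared counter from several threads, serialized by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            # Only one thread at a time may run the read-modify-write.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter


if __name__ == "__main__":
    print(count_with_lock())  # 40000
```

Because every increment happens while the lock is held, the final count always equals num_threads times increments.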

Scalability Improvements

Higher Processor Count
Linux kernel 2.6 can support up to 64 CPUs.
Larger Memory Support
On 32-bit systems, PAE (Physical Address Extension) increases the memory supported in paged mode to 64GB.
Users and Groups
The number of unique users and groups has been increased from about 65,000 to over 4 billion, that is, from 16-bit to 32-bit identifiers.
Number of PIDs
The maximum number of PIDs was increased from 32,000 to 1 billion.
Number of Open File Descriptors
The number of open file descriptors was not increased, but this parameter no longer needs to be set in advance; it scales automatically.
Greater Number of Devices Supported
Before the 2.6 kernel, limits within the kernel could constrain large systems, such as 256 devices per chain. The 2.6 kernel moves well beyond these limitations, supporting not only more types of devices but also more devices of the same type. A Linux 2.6 system allows 4095 major device types and more than a million subdevices per type.
File System Size
Linux kernel 2.6 allows addressing file system sizes of up to 16TB.
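
Several of these limits follow directly from bit widths, which the Python sketch below works through. The 12-bit major and 20-bit minor split is how the 2.6 kernel packs device numbers internally, as far as we understand it. The explanation of the 16TB figure as a 32-bit page index multiplied by a 4KB page is an assumption that the article itself does not state. The function names are hypothetical.

```python
MAJOR_BITS = 12  # assumption: 2.6 internal device number layout
MINOR_BITS = 20


def distinct_values(bits):
    """Number of distinct values an unsigned field of this width can hold."""
    return 1 << bits


def make_dev(major, minor):
    """Pack a major/minor pair into one device number."""
    if not 0 <= major < (1 << MAJOR_BITS):
        raise ValueError("major out of range")
    if not 0 <= minor < (1 << MINOR_BITS):
        raise ValueError("minor out of range")
    return (major << MINOR_BITS) | minor


def split_dev(dev):
    """Unpack a device number into (major, minor)."""
    return dev >> MINOR_BITS, dev & ((1 << MINOR_BITS) - 1)


if __name__ == "__main__":
    print(distinct_values(16))           # 65536 user IDs with 16 bits
    print(distinct_values(32))           # 4294967296 user IDs with 32 bits
    print(distinct_values(MINOR_BITS))   # 1048576 minors per major
    print(split_dev(make_dev(8, 17)))    # (8, 17)
    print((1 << 36) // 2**30)            # 64 (GB addressable with 36-bit PAE)
    print((1 << 32) * 4096 // 2**40)     # 16 (TB with 2**32 pages of 4KB)
```

Major number 0 is reserved, which is consistent with the 4095 usable major types quoted above.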

File systems

Traditional Linux file systems such as ext2, ext3, and ReiserFS were significantly improved. The most notable improvement is the introduction of extended attributes, or file metadata. Of major importance is the implementation of POSIX ACLs, an addition to the usual UNIX permissions that allows more fine-grained user access control.

In addition to improved support for traditional Linux file systems, the new kernel includes full support for the XFS file system, which is relatively new on Linux.

The 2.6 kernel also features improved support for the NTFS file system, which can now be mounted in read/write mode.


New features of Linux Distributions for POWER5

Linux distributions that will run on POWER5-based systems are SUSE LINUX Enterprise Server 9 (SLES 9), and Red Hat Enterprise Linux Advanced Server 3 with the third service pack (RHEL AS 3 Update 3). Both distributions will be generally available in 2004. SLES 9 is based on Linux kernel 2.6. RHEL AS 3 Update 3 is based on Linux kernel 2.4. Both SLES 9 and RHEL AS 3 Update 3 will run on POWER4 hardware as well. The following table highlights POWER5 features supported in the two distributions.

Function                                       SLES 9                               RHEL AS 3 Update 3
---------------------------------------------  -----------------------------------  -----------------------------------
Dynamic LPAR
-- Processors                                  Y                                    N
-- Memory                                      N                                    N
-- I/O                                         Y                                    N
-- Max 254 partitions                          Y                                    Y
Sub-Processor partition with 0.1 granularity   Y                                    Y
-- Capped and Uncapped partitions              Y                                    Y
-- Simultaneous multi-threading                Y                                    Y
Storage Options
Virtual SCSI Server                            N                                    N
Virtual SCSI Client                            Y (i5 initially; p5 with AIX 5.3)    Y (i5 initially; p5 with AIX 5.3)
Communication Options
Virtual LAN                                    Y                                    Y
Large Page Support                             Y                                    N
PCI Hot Plug                                   Y                                    N
SUE machine check handling                     Y                                    Y

The following sections describe these features in detail.

Dynamic logical partitioning (Dynamic LPAR)

Logical partitioning allows multiple operating systems to reside on a hardware platform simultaneously. System resources are divided so that partitions cannot interfere with each other. LPARs in the system are managed through the hardware management console (HMC). With Dynamic LPAR, resources can be added and removed dynamically without requiring a partition reboot. When resources need to be added, administrators can reconfigure the system to recognize them. The maximum number of logical partitions supported depends on the number of processors in the server model, and the system limit is 254. Adoption of Dynamic LPAR is ultimately determined by the Linux distributor and the use of the 2.6 kernel. SLES 9 supports the dynamic movement of processors and I/O. RHEL AS 3 Update 3 will not support Dynamic LPAR.

Sub-Processor Partition

A minimum of 0.10 processing units can be configured for any partition using shared processors. A group of physical processors that can be shared among multiple logical partitions is called a shared processing pool. The shared processor function allows you to assign partial processors to a logical partition.

Consider Figure 1 as an example of an environment using the shared processor pool. This figure represents a fictional setup of a 4-way machine running either i5/OS or AIX. It also has three additional logical partitions. Assume the second partition is a transactional server that processes financial transactions; furthermore, assume that this transaction application interacts with either AIX or i5/OS to store and retrieve its information in a database. The partition labeled "Report" is the sister application to the transactional server, and it generates financial reports. For the purpose of load balancing, the company has separated the transactional and report partitions because the transactional server is sensitive to time and response, while report generation can be done at off-peak times. The last partition is the company’s development and test partition, which serves as a development space for its engineers. Notice how the processors have been divided among the four partitions based on workload.


Figure 1. An example of shared processors

Partitions in the shared processing pool can have a sharing mode of capped or uncapped. A capped partition never exceeds its assigned processing capacity. Any unused processing resources are used only by the uncapped partitions in the shared processing pool. You specify whether a partition is capped or uncapped when you define the partition’s profile. While defining a partition, you can also set minimum and maximum processor values, expressed as whole processors or fractions of a processor. This fits nicely with the example discussed earlier. Figure 2 is an evolution of the previous example, with the minimum and maximum values now represented.


Figure 2. Dynamic movement of processor power based on workloads

The advantage of being able to dynamically move processors based on demand is evident in this fictitious example. The loads of the transaction and report partitions illustrate the need for dynamic processor allocation. The transaction server has one and a half processors allocated (this being the minimum). It can also consume the second half of the second virtual processor and all of the third, based on demand. If you assume the reports are run during off-peak hours, when the system may have more idle time, then the report partition and its applications can consume up to two virtual processors but no less than three-quarters of one processor. The same goes for the test partition. Suppose the engineers need to compile their applications. If the compile is done while there is idle processing power, the Test partition can consume up to three virtual processors, allowing the compiles to complete more quickly.

Most of these instances of processor sharing have relied on parts of the system being idle so that other partitions can use the resources, but there will certainly be times when multiple partitions ask for more processing power. Consider an example where both the Report and Transaction servers require more processor power because of peaking workloads. Because timely response from the Transaction server is critical to your business, you would prefer that the Transaction server get virtual processing power before the Report partition. This is where setting weights for processing power becomes important.

Uncapped weight is a number in the range of 0 through 255 that you set for each uncapped partition in the shared processing pool. By setting the uncapped weight (255 being the highest weight), any available unused capacity is distributed to contending logical partitions in proportion to the established value of the uncapped weight. The default uncapped weight value is 128.


Figure 3. Weights determine distribution of unused processors

In the situation where both the Transaction and Report servers are peaking, weights can be set to determine how processors should be allocated. In Figure 3, the weight for the Transaction server is set to two and the weight for the Report server is set to one. So for every three processing units that are available during the peak, the hypervisor assigns two processing units to the Transaction server and one to the Report server.
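
The proportional split described above can be sketched in a few lines of Python. This shows only the arithmetic. How the hypervisor actually schedules processing units is not described in this article, and the function name distribute_unused is hypothetical.

```python
def distribute_unused(unused_units, weights):
    """Split unused capacity among contending uncapped partitions.

    weights maps a partition name to its uncapped weight (0 through 255).
    Each partition receives a share proportional to its weight.
    """
    for name, weight in weights.items():
        if not 0 <= weight <= 255:
            raise ValueError(f"weight for {name} must be in the range 0..255")
    total = sum(weights.values())
    if total == 0:
        # No partition has a nonzero weight, so nothing is handed out.
        return {name: 0.0 for name in weights}
    return {name: unused_units * weight / total
            for name, weight in weights.items()}


if __name__ == "__main__":
    shares = distribute_unused(3.0, {"Transaction": 2, "Report": 1})
    print(shares)  # {'Transaction': 2.0, 'Report': 1.0}
```

With the default weight of 128 on every partition, the same function divides the unused capacity evenly.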

Simultaneous multi-threading

The POWER5 architecture features simultaneous multi-threading (SMT) technology. The POWER4 microprocessor collects a group of up to five instructions per clock cycle and can complete one group of instructions per clock cycle. The POWER5 microprocessor doubles that throughput by collecting two groups of up to five instructions per clock cycle and completing two groups per clock cycle. Both SLES 9 and RHEL AS 3 Update 3 support this technology.

Storage options

For storage and I/O, Linux can take advantage of a variety of real and virtual devices. This flexibility allows for cost-effective setup of Linux partitions. In the case of disks, Linux logical partitions support three different storage options.

  1. Internal storage using SCSI adapters and drives dedicated to the partition.
  2. External storage using SAN adapters dedicated to the partition.
  3. Virtual storage using a virtual SCSI adapter and storage in a different partition.

Virtual Disk

Virtual storage allows multiple partitions within a POWER5-based system to share storage. One partition, the I/O server partition, owns the physical adapters and storage (which may be internal or external). Virtual adapters allow other partitions, the I/O client partitions, to use storage from the I/O server partition. I/O server partitions can run AIX or i5/OS. Both SLES 9 and RHEL AS 3 Update 3 support this.


Figure 4. AIX or i5/OS can provide virtual disk to Linux partitions

Figure 4 shows how a hosting partition can provide virtual disks to Linux partitions. The benefits of virtual disks include more than saving the expense of disk drives. On smaller machines, adding disks and controllers may be a challenge and may also require the purchase of an expansion unit. Also, the virtual disks can be managed, backed up, and quickly replicated by the hosting system.

CDROM, Tape, and DVD-ROM

You can also share SCSI devices owned by AIX 5.3 or i5/OS with Linux partitions. This works very similarly to the virtual disk function. If AIX or i5/OS owns a CD-ROM, tape device, or DVD-RAM drive, Linux can use those devices as if they were physically attached to the Linux partition, provided that the hosting partition is not actively using the device. The benefits of virtual SCSI devices are much the same as those of virtual disk: less hardware expense and no need to dedicate a device to each partition.


Communication options

Linux on POWER5-based systems can establish a TCP/IP connection through either a directly attached network interface or through a virtual Ethernet interface. Virtual Ethernet provides roughly the same function as a 1 Gigabit Ethernet adapter. Partitions in POWER5-based servers can communicate with each other using TCP/IP over the virtual Ethernet communication ports.

You can define up to 4,094 separate virtual Ethernet LANs (VLANs). Each partition can have up to 65,534 virtual Ethernet adapters connected to the virtual switch. Each adapter can be connected to 21 VLANs. The enablement and setup of virtual Ethernet does not require any special hardware or software. After you enable a specific virtual Ethernet for a partition, a network device named ethXX is created in the partition. The user can then set up TCP/IP configuration to communicate with other partitions.
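
Once the devices exist, a partition can enumerate them like any other network interface. The Python sketch below lists interfaces whose names follow the ethN pattern, using socket.if_nameindex from the standard library. The name alone does not reveal whether an interface is virtual or physical, and the helper names are hypothetical.

```python
import re
import socket

ETH_NAME = re.compile(r"^eth\d+$")


def ethernet_interfaces(names):
    """Return the names that follow the ethN naming pattern, sorted."""
    return sorted(name for name in names if ETH_NAME.match(name))


def system_interface_names():
    """Return the interface names known to the running system."""
    try:
        return [name for _index, name in socket.if_nameindex()]
    except (AttributeError, OSError):
        # Not available on every platform.
        return []


if __name__ == "__main__":
    for name in ethernet_interfaces(system_interface_names()):
        print(name)
```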

Let’s reconsider the example from the Sub-Processor Partition section. In the description of the scenario, the Transaction server uses a database in the AIX or i5/OS partition for storing and retrieving information. This is a very typical use of virtual Ethernet, because the communication is very fast and no additional hardware is required. Figure 5 depicts the example with the addition of two virtual Ethernet LANs.


Figure 5. Virtual Ethernet LANs are fast and cost-effective ways for partitions to communicate with each other.

In most cases, however, you will want partitions connected to a virtual Ethernet to communicate with the physical network as well. That requires at least one partition to have both a physical Ethernet adapter and a virtual Ethernet adapter that is connected to the other partitions. The partition owning both adapters can route traffic between the physical and virtual Ethernets.


Figure 6. A partition that has both a physical and a virtual network connection can route traffic to a physical LAN.

A common way to connect your partitions to a physical network is to run a firewall in one of your partitions. In the firewall partition, you can have a network interface card that connects directly to the physical network as seen in Figure 6. The other partitions can then communicate with the physical network by passing traffic through the virtual LAN and the firewall.

At the time of this writing, Virtual LAN is available for both SLES 9 and RHEL AS 3 Update 3.

Large page support

The 2.6 kernel supports two virtual page sizes: the traditional 4KB page size and the 16MB page size. Large page usage is primarily intended to improve the performance of memory-access-intensive applications. With large page support, applications can run with text and data segments backed by large pages (16MB) with no changes to the application code. The performance improvements come from fewer translation lookaside buffer (TLB) misses, because each TLB entry maps a larger virtual memory range. Large pages also improve memory prefetching by eliminating the need to restart prefetch operations on 4KB boundaries. Large page support is available in SLES 9, but not in RHEL AS 3 Update 3.
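
A little arithmetic shows why the larger page size reduces TLB misses. The Python sketch below compares how much memory a TLB can map with each page size and how many pages a 1GB region needs. The entry count of 1024 is a hypothetical figure chosen for illustration, not a statement about any particular processor, and the function names are assumptions.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

SMALL_PAGE = 4 * KB
LARGE_PAGE = 16 * MB


def tlb_reach(entries, page_size):
    """Bytes of virtual memory mapped when every TLB entry is in use."""
    return entries * page_size


def pages_needed(region_bytes, page_size):
    """Pages required to back a region, rounding up."""
    return -(-region_bytes // page_size)


if __name__ == "__main__":
    entries = 1024  # hypothetical TLB size
    print(tlb_reach(entries, SMALL_PAGE) // MB)  # 4 (MB)
    print(tlb_reach(entries, LARGE_PAGE) // GB)  # 16 (GB)
    print(pages_needed(1 * GB, SMALL_PAGE))      # 262144
    print(pages_needed(1 * GB, LARGE_PAGE))      # 64
```

Each 16MB page covers the same range as 4096 pages of 4KB, so the same number of TLB entries reaches 4096 times as much memory.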

PCI hot plug

With this capability, you can insert a new PCI hot plug adapter into an available PCI slot while the operating system is running. This can be another adapter of the same type that is currently installed or a different type of PCI adapter. New resources become available to the operating system and applications without having to restart. You can also replace a defective PCI hot plug adapter with another of the same type without shutting down the system. When you exchange the adapter, the existing device driver supports the adapter because it is of the same type. Device configuration and configuration information about devices below the adapter are retained for the replacement device. PCI Hot Plug is supported in SLES 9, but not in RHEL AS 3 Update 3.

SUE Machine Check Handling

This capability allows the system to mark Special Uncorrectable Errors (SUE) and kill any dependent processes that reference the affected resource, while the system continues to run without requiring a reboot to recover from the error. Both SLES 9 and RHEL AS 3 Update 3 support this feature.


Development toolchain changes

With all the innovation in the Linux 2.6 kernel, accommodations must be made in the libraries and user-space development tools. This section addresses changes in the GNU toolchain, meaning glibc, binutils, as, ld, and gcc. While this section is far from complete, it should be a good starting reference, to be supplemented by additional reading in the freely available source change logs.

Glibc

Glibc, the GNU C library, has been renovated in version 2.3 to support new and improved 2.6 kernel features as well as new and extended functions for the POWER architecture. The changes center primarily on the Native POSIX Thread Library (NPTL) model. Changes have also been made for internationalization, network interface addressing, and regular expression usage, among other things.

Internationalization
Internationalization has been improved by enabling iconv to use the system locale. Also, thread-safe interfaces to locale.h have been implemented. They are not individually reviewed here, but details are kept with the freely available source code documentation.
Network interface
Network interface addressing has been improved with a BSD-compliant implementation.
Regular expression
Regular expression matching is now considerably faster after a complete rewrite to be POSIX compliant.
Fexecve
fexecve, which executes a program referred to by an open file descriptor, is now enabled on Linux.
Malloc
malloc is now based on Doug Lea’s malloc 2.7.0, making it faster and more compatible.
Thread-local storage
Thread-local storage (TLS) has been implemented, allowing faster access to per-thread data because it is now handled by the compiler. For more information, see Ulrich Drepper’s whitepaper at http://people.redhat.com/drepper/tls.pdf

GNU binutils

The GNU binutils include ld, as, and several other minor utilities, such as objcopy and readelf. These minor binutils have not changed specifically to accommodate the 2.6 kernel or the POWER architecture in this release, but have changed in less significant ways. For example, the utility readelf can now be used to display information on files stored in archives. A full changelog is available on the binutils web site at http://sources.redhat.com/binutils/

AS and LD

AS and LD have changed in several ways for the POWER architecture. Because the POWER architecture is used with a variety of operating systems, not all of these changes address Linux-specific needs. Changes that do affect Linux on POWER include better support for POWER opcodes and additions for the VMX extensions (available on PPC970-based Linux offerings). Additionally, optimization profiles are now available for POWER4 and PPC970 chips. The default optimization profile is -maltivec, supporting optimization for the VMX extensions found in the PPC970. For POWER4 optimization, use -mpower4.

GCC

GCC has also changed notably to include support for NPTL, but other changes should not be overlooked. Numerous improvements for POWER scheduling, optimization, and compliance have been added.

DFA Scheduler
The DFA scheduler for instructions is supported in GCC 3.3.3. Learn more about this project at http://www.gnu.org/software/gcc/news/dfa.html.
Directives
Preprocessor directives can now be used inside C macro arguments.
Includes
A directory given with -I is ignored if it is already in the system include path. This avoids unexpected ordering problems with include directories.
New support for the POWER4 processor
GCC adds POWER4 processor-specific optimizations, for example through -mcpu=power4.
Improvements for VMX extensions
Several functional improvements were made for the VMX extensions in PPC970 chips.
More ISO C99 compliance
For the current status, see the table at http://www.gnu.org/software/gcc/gcc-3.3/c99status.html.


About the authors

Matt Davis is a Linux technical consultant in the IBM eServer Solutions Enablement team. As a member of the pSeries Linux project since its inception, he explored and tested emerging technology for pSeries Linux and wrote several reports summarizing his findings. These include Journaling File Systems for Linux for POWER, Parallel Grid Computing with Linux for POWER, Open Source Alternatives to Commercial Software for Linux for POWER, and the Linux Solutions Catalog, to name a few. He came to IBM as an intern during his tenure as a student at the University of Texas at Austin, from which he earned two degrees. He can be reached at mattdavis@us.ibm.com.


Chakarat Skawratananond works as a Technical Consultant in the IBM eServer Solutions Enablement organization. Chakarat assists ISVs in enabling their applications for AIX and Linux on the IBM pSeries platform. You can contact Chakarat at chakarat@us.ibm.com.


Nikolay Yevik, a Linux on POWER consultant in IBM’s Solutions Enablement group, has more than 5 years of experience working on UNIX platforms, performing development work in C, C++, and Java. He has master’s degrees in Petroleum Engineering and Computer Science. Nikolay can be reached at yevik@us.ibm.com.
