Putting Linux reliability to the test

The Linux Technology Center evaluates the long-term reliability of Linux

This article presents the test results and analysis of long-duration stress testing of the Linux kernel and other core OS components -- everything from libraries and device drivers to file systems and networking -- under fairly adverse conditions. The IBM Linux Technology Center has just completed this comprehensive testing over a period of more than three months and shares the results of its Linux Test Project (LTP) testing with developerWorks readers.


Li Ge, Staff Software Engineer, Linux Technology Center, IBM, Software Group

Li Ge is a Staff Software Engineer in the IBM Linux Technology Center. She graduated from New Mexico State University with an MS in Computer Science in 2001. She has been working on Linux for three years and is currently working on Linux kernel validation and Linux reliability measurement.



Linda Scott (lindajs@us.ibm.com), Senior Software Engineer, Linux Technology Center, IBM, Software Group

Linda Scott is a Senior Software Engineer and has worked at IBM development labs in the state of Texas since graduating from Jackson State University. During her career with IBM, Linda has worked on a variety of Unix and Linux projects and is currently working on the Linux Test Project where over 2000 test cases have been delivered to the open source community. She can be reached at lindajs@us.ibm.com.



Mark VanderWiele (markv@us.ibm.com), Senior Technical Staff Member, Linux Technology Center, IBM, Software Group

Mark VanderWiele is a Senior Technical Staff Member and Architect in the IBM Linux Technology Center. He graduated from Florida State University in 1983 and has spent the majority of his career in various aspects of operating system development. He can be reached at markv@us.ibm.com.



17 December 2003

The IBM Linux Technology Center (LTC) was founded in August 1999 to work directly with the Linux development community with a shared vision of making Linux succeed. Its 200-odd employees make it one of the larger corporate groups of open source developers. They contribute code ranging from patches to structural kernel changes; from file systems and internationalization work to GPL'd drivers. They also work to track Linux-related developments within IBM.

Test results at a glance

The following summary is based on the test results and on observations made over the duration of the runs:

  • The Linux kernel and other core OS components -- including libraries, device drivers, file systems, networking, IPC, and memory management -- operated consistently and completed all runs for their expected durations with zero critical system failures.
  • Every run achieved a high success rate (over 95 percent), with a small number of expected intermittent failures caused by concurrently executing tests that are designed to overload resources.
  • Linux system performance did not degrade over the long durations of the runs.
  • The Linux kernel scaled properly to use hardware resources (CPU, memory, disk) on SMP systems.
  • The Linux system handled continuous full CPU load (over 99 percent) and high memory stress well.
  • The Linux system handled overload conditions correctly.

The tests demonstrate that the Linux kernel and other core OS components are reliable and stable over 30, 60, and 90 days, and can provide a robust, enterprise-level environment for customers over long periods of time.
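The success rates quoted above are simple pass/fail ratios across all executed test cases. A minimal sketch of the arithmetic, using illustrative counts (not the actual tallies from the runs):

```shell
#!/bin/sh
# Illustrative pass/fail counts -- not the real tallies from the LTP runs.
pass=4154
fail=90

# Success rate = passes / (passes + failures), as a percentage.
echo "$pass $fail" | awk '{ printf "%.2f percent\n", 100 * $1 / ($1 + $2) }'
```

With these made-up counts the ratio lands near the 97.88 percent reported for the 30-day run, which gives a feel for how few failures a run of thousands of cases produced.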

Particular areas of interest for the LTC are Linux scalability, serviceability, reliability, and systems management -- all with a view to making Linux ever more enterprise-ready. Enabling Linux to work on the S/390 mainframe and porting the JFS journaling file system to Linux are among their many contributions to the community.

Another of the LTC's core missions is to professionally test Linux in lab settings the way any commercial project is tested. The LTC contributes to the Linux Test Project (LTP), as do SGI, OSDL, Bull, and Wipro Technologies. What follows are the results of a comprehensive set of tests from the LTP suite run against the Linux kernel over an extended period of time. As you may have guessed, Linux held up admirably under the continued stress.

Linux reliability measurement

Objectives

The objective of the Linux reliability effort at the IBM Linux Technology Center is to measure the Linux operating system's stability and reliability over long periods of time, using the LTP test suite, with an emphasis on workloads relevant to Linux customer environments (see Resources for more on the LTP). Identification of defects was not the primary focus.

Test environment overview

This article describes the test results and analysis of 30- and 60-day Linux reliability measurement tests using the LTP test suite. The tests ran SuSE Linux Enterprise Server 8 (SLES 8) as the test kernel on IBM pSeries servers as the test hardware. A specially designed LTP stress-test scenario was used to exercise a wide range of kernel components in parallel with networking and memory management, creating a high-stress workload on the test system. The kernel, TCP, NFS, and I/O test components were targeted with a heavy-stress workload.


The tests

At 30 days

30-day LTP stress execution results for pSeries

  • Machine: p650 LPAR
  • CPU: (2) POWER4+ 1.2 GHz
  • Kernel: Linux 2.4.19-ul1-ppc64-SMP (SLES 8 SP 1)
  • LTP version: 20030514
  • Average CPU utilization: 99.00 percent (user: 48.65 percent, system: 50.35 percent)
  • Average memory utilization: 80.09 percent (of 8GB)

Observations:

  • SLES 8 PPC64 30-day stress run successfully completed on the p650 LPAR
  • ltpstress was the test tool; test cases were executed both in parallel and in sequence
  • Kernel, TCP, NFS, and I/O test components were targeted with heavy-stress workloads
  • Success rate: 97.88 percent
  • Zero critical system failures
Figure 1. 30-day LTP stress execution results for the pSeries

At 60 days

60-day LTP stress execution results: pSeries

  • Machine: B80
  • CPU: (2) POWER3+ 375 MHz
  • Kernel: Linux 2.4.19-ul1-ppc64-SMP (SLES 8 SP 1)
  • LTP version: 20030514
  • Average CPU utilization: 99.96 percent (user: 75.02 percent, system: 24.94 percent)
  • Average memory utilization: 61.69 percent (of 8GB)
  • Average swap utilization: 3.86 percent (of 1GB)

Observations:

  • SLES 8 PPC64 60-day stress run successfully completed on the pSeries B80
  • ltpstress was the test tool; test cases were executed both in parallel and in sequence
  • Kernel, TCP, NFS, and I/O test components were targeted with heavy-stress workloads
  • Success rate: 95.12 percent
  • Zero critical system failures
Figure 2. 60-day LTP stress execution results for the pSeries

Test infrastructure

Hardware and software environment

Table 1 shows the hardware environment.

Table 1. Hardware environment

System                            | Processors              | Memory       | Disk                                                       | Swap partition | Network
pSeries 650 (LPAR) Model 7038-6M2 | 2 x POWER4+(TM) 1.2 GHz | 8GB (8196MB) | 36GB U320 IBM Ultrastar (other disks present, but unused)  | 1GB            | Ethernet controller: AMD PCnet32
pSeries 630 Model 7026-B80        | 2 x POWER3(TM)+ 375 MHz | 8GB (7906MB) | 16GB                                                       | 1GB            | Ethernet controller: AMD PCnet32

The software environment was the same for both the pSeries 630 Model 7026-B80 and the pSeries 650 (LPAR) Model 7038-6M2. Table 2 shows the software environment.

Table 2. Software environment

Component | Version
Linux     | SuSE SLES 8 with Service Pack 1
Kernel    | 2.4.19-ul1-ppc64-SMP
LTP       | 20030514

Methodology

System stability and reliability are generally measured in continuous hours of operation and reliable system uptime.

The runs started with a set of 30-day baseline runs and progressed to 60- and 90-day Linux test runs on xSeries and pSeries servers. Initial emphasis was placed on kernel, networking, and I/O testing.


Test tool

The Linux Test Project (LTP; see Resources for links and more information) is a joint project of SGI, IBM, OSDL, Bull, and Wipro Technologies that delivers test suites to the open source community to test the reliability, robustness, and stability of Linux. The LTP is a collection of tools for testing the Linux kernel and related features; its goal is to help improve the kernel by bringing test automation to the kernel testing effort.

Currently, there are over 2000 test cases within the LTP suite, covering the majority of kernel interfaces such as syscalls, memory, IPC, I/O, filesystems, and networking. The test suite is updated and released monthly and runs on multiple architectures. The suite has been tested on 11 architectures, including i386, ia64, PowerPC, PowerPC 64, S/390, S/390x (64-bit), MIPS, mipsel, cris, AMD Opteron, and embedded architectures. We used LTP version 20030514 -- the latest available at the time -- in our reliability testing.


Test strategy

There were two unique phases in the baseline run: a 24-hour "initial test," followed by the stress reliability run phase, or "stress test."

Passing the initial test was an entry requirement. The initial test consisted of a successful 24-hour run of the LTP test suite on the hardware and operating system that would be used for the reliability runs. The driver script runalltests.sh, which comes with the LTP test suite package, was used to validate the kernel. This script runs a group of packaged tests in sequential order and reports the overall result. It can also launch several instances running in parallel. By default, this script executes:

  • Filesystem stress tests
  • Disk I/O tests
  • Memory management stress tests
  • IPC stress tests
  • Scheduler tests
  • Commands functional verification tests
  • System call functional verification tests
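The entry requirement amounts to a simple gate: run the driver script and proceed to the stress phase only on a clean exit. A minimal sketch, with runalltests.sh stubbed out as a shell function (in a real run, the actual LTP driver script would be executed on the target hardware):

```shell
#!/bin/sh
# Entry-gate sketch for the 24-hour initial test.
# runalltests() is a stand-in stub; it is NOT the real LTP driver,
# which lives in the LTP install directory and takes its own options.
runalltests() {
    echo "running packaged tests in sequential order..."
    return 0                    # exit status 0 = overall result was success
}

if runalltests; then
    echo "initial test passed: system qualifies for the stress phase"
else
    echo "initial test failed: system not eligible for reliability runs" >&2
    exit 1
fi
```

The same pattern works unchanged with the real script: the stress runs start only if the 24-hour baseline exits successfully.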

The stress test verified the robustness of the product under high system usage. In addition to runalltests.sh, a test scenario called ltpstress.sh was specially designed to exercise a wide range of kernel components in parallel with networking and memory management, and to create a high-stress workload on the test system. ltpstress.sh is also part of the LTP test suite. The script runs similar test cases in parallel and different test cases in sequence in order to avoid intermittent failures caused by tests contending for the same resources or interfering with one another. By default, this script executes:

  • NFS stress tests
  • Memory management stress tests
  • Filesystem stress tests
  • Math (floating point) tests
  • pthread stress tests
  • Disk I/O tests
  • IPC (pipeio, semaphore) tests
  • System call functional verification tests
  • Networking stress tests
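The mix of concurrent and sequential execution that ltpstress.sh applies can be sketched with dummy commands (the group names and echoed lines below are placeholders, not real LTP test cases):

```shell
#!/bin/sh
# Some test groups run concurrently while cases inside a group run back to
# back; the mix is chosen so cases do not contend for the same resource at
# the same moment. Dummy echo commands stand in for real test cases.
run_mm_group() { echo "mm case 1"; echo "mm case 2"; }   # sequential pair
run_fs_group() { echo "fs case 1"; echo "fs case 2"; }   # sequential pair

run_mm_group > mm.log &     # groups launched in parallel
run_fs_group > fs.log &
wait                        # block until every background group finishes

cat mm.log fs.log           # collect the (dummy) results
```

The `&`/`wait` pairing is the basic shell idiom for this: independent workloads overlap, while each group's internal ordering is preserved.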

System monitoring

The modified top utility that comes with the LTP test suite was used as the system monitoring tool. top provides an ongoing, real-time look at processor activity. The enhanced version adds functions that save snapshots of top results to a file and produce an average summary of that file, including CPU, memory, and swap space utilization.

In our tests, snapshots of system utilization (the top output files) were taken every 10 seconds and saved to result files. In addition, snapshots of system utilization and LTP test output files were taken daily or weekly to provide data points for determining whether the systems were degrading during the long runs. This function was controlled by cron jobs and scripts.
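Post-processing the snapshot files reduces to averaging columns. A minimal sketch using a made-up snapshot format (the LTP-modified top writes its own format, which may differ):

```shell
#!/bin/sh
# Two fake snapshot lines stand in for thousands of 10-second samples.
# Field layout here is hypothetical: "user <pct> system <pct>".
cat > /tmp/cpu-snapshots.txt <<'EOF'
user 48.2 system 50.9
user 49.1 system 50.1
EOF

# Average the user and system columns across all snapshots.
awk '{ u += $2; s += $4; n++ }
     END { printf "avg user %.2f system %.2f\n", u/n, s/n }' /tmp/cpu-snapshots.txt
```

Run over a month of 10-second samples, a summary like this yields the average CPU, memory, and swap utilization figures reported earlier.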

Before testing
All selected test systems had hardware configured as similarly to each other as possible. Extra hardware was removed to reduce the potential for hardware failures. Minimum-security options were selected during image installation. At least 2 GB of disk space was reserved for storing the top data files and LTP log files.

Note that this is a testing scenario; in real life, users would be well advised to keep security settings at much higher than minimum.

During testing
The system was left undisturbed for the duration of the tests. Occasional access of the system to verify that the test was still executing was acceptable. Verification included using the ps command, checking top data, and checking LTP log data.
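The "still executing" check amounts to confirming that the driver processes are alive. A sketch using a background sleep as a stand-in for a long-running LTP driver process:

```shell
#!/bin/sh
sleep 30 &                      # stand-in for a long-running LTP driver
pid=$!

# The periodic check: is the process still there?
if ps -p "$pid" > /dev/null 2>&1; then
    echo "test process still running"
else
    echo "test process has exited"
fi

kill "$pid" 2>/dev/null         # clean up the stand-in
```

In practice the check would grep the ps listing for the driver script names and also eyeball the latest top data and LTP log entries, as described above.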

After testing
When the test completed, the system monitoring tool top was stopped immediately. All top data files, including daily or weekly snapshots and LTP log files, were saved and processed in order to provide data for analysis.


Conclusions

The findings discussed in this article are based on a solution that was created and tested under laboratory conditions. These findings may not be realized in all environments, and implementation in such environments may require additional steps, configurations, and performance analysis.

However, because most Linux kernel testing efforts have been conducted only over short periods of time, this series of tests provides first-hand data and results from longer runs. It also provides data for heavy-stress workloads on Linux kernel components, as well as on TCP, NFS, and other test components. The tests demonstrate that the Linux system is reliable and stable over long durations and can provide a robust, enterprise-level environment.

Resources
