Five network/system tricks for Linux on System z

Techniques for starting Linux on S/390 systems

Bringing up Linux® on an IBM® System z machine should be fairly easy, but problems can crop up. If you've had problems, try out these workarounds for annoying obstacles to starting Linux on an S/390 system: "route-unknown" messages, bad network service behaviors, file system corruption on shutdown, too-lengthy boot-path-device processes, and Virtual LAN hardware installation. Added bonus: Warnings (and workarounds) for two SUSE bugs.

Anjali Gupta, System Software Engineer, IBM

Anjali Gupta does Java performance testing on the Linux on System z platform and has been a part of the IBM z/OS Java PVT (performance verification testing) team for three years.



11 February 2009


Linux is a very useful operating system for the IBM S/390 platform (also referred to as System z) because it consolidates connectivity among legacy, Linux, and middleware applications such as Web, mail, application servers, firewalls, etc. By working as a "native" operating system, Linux leverages all the hardware capability of the S/390 platform.

This article shares five troubleshooting tips to counter the various problems that can arise when you bring up a Linux system on a System z series machine:

  1. Surviving a Martian invasion: Martians are packets whose source address has no known route.
  2. Disciplining network services on restart: Sometimes when bringing up the Linux image on an LPAR, the service just doesn't behave.
  3. Intercepting corruption to your file system on shutdown: There are three ways shutdown can lead your file system astray.
  4. Making the most of cio_ignore: Or, how to shorten your list of "must sense and analyze" boot path devices.
  5. Keeping Virtual LAN from being a pain in the flash: By looking for the physical installation files.

This article refers to various versions of SUSE Linux, so as a bonus tip, I'll shed some light on a few bugs in different versions of SUSE and give you a workaround for those bugs.

1. Martians, meet our leader

Martians are packets whose source address has no known route. The Linux operating system isn't expecting these packets, especially given the interface they arrive on (such as a packet from an internal host coming in through the external interface).

Martian messages may be due to misconfigurations (like an assignment of an external address to a LAN interface), a mask issue, or address spoofing (in which someone is trying to do something awful to your system). The configuration must be checked and corrected to eliminate these kinds of messages.

Sometimes these messages also originate from the switches, since many machines terminate on the same switch. In such a scenario, Martian messages may not be malicious but can still flood the operating system messages console. To get rid of these messages, the switch needs to be restarted; remember, though, that doing so affects the other machines that terminate on the same switch, and some servers are too important to tamper with in this manner.

Problem: Overwhelmed at the OS messages console

Debugging with the OS messages console can be difficult when you have to wade through a flood of Martian messages (as shown in Figure 1):

Figure 1. A snapshot of Martian messages overloading the messages console

Solution: Stashing the little green men

You can get a cleaner OS console and hide the Martian messages by:

  • Turning off logging of Martian packets by writing 0 to /proc/sys/net/ipv4/conf/eth0/log_martians. For example:
    echo "0" > /proc/sys/net/ipv4/conf/eth0/log_martians
  • Dialing down kernel logging from level 7 to level 4:
    yast -> system -> /etc/sysconfig Editor -> System -> Logging -> KERNEL_LOGLEVEL
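
The kernel logs Martians for an interface if either that interface's log_martians entry or the "all" entry is set, so it can help to clear every entry at once. The following is a minimal sketch rather than a definitive script; the function name and its optional directory argument are my own additions so that it can be tried on a scratch directory first.

```shell
#!/bin/sh
# Sketch: write 0 to every log_martians entry (all, default, eth0, ...).
# The optional first argument is a hypothetical override of the conf
# directory, added here only so the function can be tested safely.
disable_martian_logging() {
    conf_dir="${1:-/proc/sys/net/ipv4/conf}"
    for f in "$conf_dir"/*/log_martians; do
        [ -e "$f" ] || continue   # skip if the glob matched nothing
        echo 0 > "$f"
    done
}

# Usage (as root):
#   disable_martian_logging
```

Values written under /proc are lost at reboot. To keep the setting, the usual approach is an entry such as net.ipv4.conf.all.log_martians = 0 in /etc/sysctl.conf.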

2. Bad network services, stand in the corner

When you bring up the Linux image on an LPAR, you may find that the network service does not behave as expected; even though the IP address has been allocated, the network can be unreachable.

Problem: Ping good, login bad

Pinging other machines from the OS messages console can be successful, and the ifconfig command displays the correct information, but when you try to log in (ssh/telnet) through PuTTY, the session is unsuccessful.

Solution: Network restart to the rescue

The root user can restart the network using service network restart on the OS messages console as shown in Listing 1.

Listing 1. Output of service network restart
j8806_h117 (root): /opt/javabm/data >service network restart
Shutting down network interfaces:
    eth0
    eth0      configuration: qeth-bus-ccw-0.0.0420                    done
Shutting down service network  .  .  .  .  .  .  .  .  .  .  .  .  .  done
Hint: you may set mandatory devices in /etc/sysconfig/network/config
Setting up network interfaces:
    lo
    lo        IP address: 127.0.0.1/8                                 done
    eth0
    eth0      configuration: qeth-bus-ccw-0.0.0420
    eth0      IP address: 9.12.22.25/24                               done
    vlan538
interface  is not available
SIOCGIFFLAGS: No such device
Cannot enable interface vlan538.
interface vlan538 is not up                                           failed
Setting up service network  .  .  .  .  .  .  .  .  .  .  .  .  .  .  failed
SuSEfirewall2: Warning: ip6tables does not support state matching. 
               Extended IPv6 support disabled.
SuSEfirewall2: Setting up rules from /etc/sysconfig/SuSEfirewall2 ...
SuSEfirewall2: batch committing...
SuSEfirewall2: Firewall rules successfully set

This command requires root authority. If you find that the gateway is incorrect or missing, supply the required details and then restart the network:

  1. route add default gw 9.12.44.1
  2. service network restart
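
Before adding a gateway, it can be worth confirming that the default route really is missing. Here is a small sketch, assuming the iproute2 ip command is installed; the helper name has_default_route is hypothetical, and it only recognizes the "default via ..." line format that ip route show prints.

```shell
#!/bin/sh
# Sketch: succeed if the routing table text on stdin contains a default route.
# Expects the output format of `ip route show`.
has_default_route() {
    grep -q '^default '
}

# Usage (as root), with the gateway from the steps above:
#   ip route show | has_default_route || route add default gw 9.12.44.1
```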

The output in Listing 1 shows all the steps involved in shutting down and bringing up the network, including the VLAN configuration and the firewall setup. If the network doesn't come up, the output does not show the IP address. The IP address is defined in the file /etc/sysconfig/network/ifcfg-eth-id-XX:XX:XX:XX:XX. You can use this output to check each step if something goes wrong with the network.
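
When the restart output is long, filtering it down to the steps that report "failed" makes the problem easier to spot. This is a rough sketch that assumes the status word sits at the end of the line, as it does in Listing 1; report_failed_steps is a name I made up for illustration.

```shell
#!/bin/sh
# Sketch: print only the lines of a network restart log that end in "failed".
# Returns non-zero when no failed step is found (grep's usual behavior).
report_failed_steps() {
    grep -E '[[:space:]]failed[[:space:]]*$'
}

# Usage (as root):
#   service network restart 2>&1 | report_failed_steps
```

Run against Listing 1, this would leave just the vlan538 line and the "Setting up service network" line, both of which end in "failed".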

For updated information in any of the network configuration files (/etc/sysconfig/network/ifcfg-eth-id-XX:XX:XX:XX:XX, /etc/sysconfig/hardware/ifcfg-eth-id-XX:XX:XX:XX:XX, etc.) to take effect, you must issue the command service network restart (or /etc/init.d/network restart).


3. Getting a corruption conviction

Sometimes in the process of shutting down the system, the file system gets corrupted for any of these reasons:

  • The Linux image is improperly halted
  • Any file system is improperly unmounted
  • The root file system is completely, 100 percent full

In these cases, it may not be possible to bring up the same image on the next boot attempt.

Problem: To login prompt or not

Let's look at two corruption scenarios:

  1. When the root file system (/dev/dasda1) itself is corrupted. In this case, the login prompt is not provided by the system on the OS messages console.
  2. When file systems other than the root file system get corrupted. In this case, the login prompt is provided and only the clean file systems are mounted.

Solution: Enabling fsck whether it wants to or not

The solution to both scenarios is to run fsck, as follows:

  1. The first scenario can be resolved by bringing the corrupted root DASD online on an existing, working Linux image (by using the command chccwdev -e <DASD address>) and running fsck with the force option. This forces a check of the file system even if it is marked clean. For example: fsck -f /dev/dasdx1 (where x indicates the device letter for the newly added DASD for which fsck is being run).
  2. In the second scenario, the command fsck is executed on the corrupted file system from the OS messages console after successful login. Once fsck is run for all the corrupted DASDs, the system needs to be rebooted, resulting in the image coming up successfully on the subsequent reboot.
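
The first scenario can be sketched as two small helpers run on the working Linux image. Treat this as an outline under assumptions, not a finished tool: the bus ID 0.0.0201 and the helper names are placeholders, and the DRY_RUN switch is a convention of this sketch (not of the tools) that prints each command instead of running it. The device node must be read from the lsdasd output before fsck is run.

```shell
#!/bin/sh
# Sketch: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1: bus ID of the damaged root DASD (placeholder: 0.0.0201)
attach_dasd() {
    run chccwdev -e "$1" || return 1   # bring the DASD online
    run lsdasd                          # shows which /dev/dasd node it received
}

# $1: partition node seen in the lsdasd output, for example /dev/dasdx1
force_check() {
    run fsck -f "$1"                    # check even if marked clean
}

# Usage (as root):
#   attach_dasd 0.0.0201
#   force_check /dev/dasdx1
```

Setting DRY_RUN=1 first is a cheap way to review the exact commands before touching a disk.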

4. Making faster time on a shorter list

The cio_ignore parameter is a kernel parameter that specifies which of the devices attached to the machine the kernel should ignore. When Linux starts booting, it otherwise senses and analyzes all the available devices.

You can use cio_ignore to shorten the list of devices that are to be sensed and analyzed during boot process.

Problem: It's a long line to get in

On System z, it takes considerable time for the image to come up since there are many devices (DASD, network devices, etc.) attached to the machines that are required to load different images. Regardless of the devices used by any image, the system will sense and analyze each device while booting, making for a slow booting process.

Solution: Comment out some or all the waiters

In such cases, define the range of addresses for the devices that are not necessary for the current image in the zipl.conf file. This way, all the devices defined in zipl.conf are ignored, and the system boots in a zippy manner. You can do this with:

  • cio_ignore=all: Specifies that all the devices are to be ignored.
  • cio_ignore=all,!0.0.b100-0.0.b1ff,!0.0.a100: Specifies that all devices but the range 0.0.b100 through 0.0.b1ff and the device 0.0.a100 are to be ignored. (Very customizable!) Note that the value is a single kernel parameter, so it must not contain spaces.
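
Because the value must be one comma-separated string with no spaces, it is easy to mistype by hand. The sketch below assembles it from a list of device IDs or ranges that the image does need; build_cio_ignore is a hypothetical helper name, and the addresses in the usage line are the ones from the example above.

```shell
#!/bin/sh
# Sketch: build a cio_ignore value that ignores everything except the
# device IDs or ranges given as arguments.
build_cio_ignore() {
    param="cio_ignore=all"
    for dev in "$@"; do
        # The '!' is single-quoted so an interactive shell does not
        # treat it as history expansion.
        param="${param},"'!'"${dev}"
    done
    printf '%s\n' "$param"
}

# Usage:
#   build_cio_ignore 0.0.b100-0.0.b1ff 0.0.a100
```

The resulting string goes on the parameters line of the boot section in zipl.conf, after which zipl has to be run again so that the change is written to the boot record. On a running system, cat /proc/cio_ignore shows the ranges currently being ignored.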

5. Finding the physical files for your virtual LAN

A virtual LAN (VLAN) is a group of devices on one or more LANs that are configured (using management software) so that they can communicate as if they were attached to the same physical wire, when in fact they are located on a number of different LAN segments.

Because they are based on logical, not physical, connections, VLANs are extremely flexible. A VLAN frame looks almost the same as an Ethernet frame; the difference is that a VLAN frame has an extra field containing a number that identifies the VLAN. This number is called a VLAN tag.

Problem: Where's the tag in reality?

A fresh installation of Linux (SUSE or Red Hat) on a System z machine that is configured for VLAN is not possible unless the machine has access to a non-VLAN OSA port (an Ethernet card for System z).

When you are doing a new Linux (SUSE) installation on an LPAR, you have to install it through the network, which means connecting the LPAR (z machine) to the server where the installation files are available. The installer asks you for various types of information in order to set up the network before starting the installation, but nothing in these prompts lets you set it up to work on a VLAN-configured LAN segment. If the machine (z server) is already VLAN-tagged, then it looks only for packets carrying a specific VLAN tag.

Solution: Install without, then configure for

Neither the SUSE nor the Red Hat distribution supports setting up VLAN on the network interface during installation, so the only solution is to install the system over a working non-VLAN network first and then configure it for VLAN.

You should have non-VLAN-tagged hardware for a new installation. Hardware should not be configured for VLAN at the time of installation.
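
Once the system is installed over the non-VLAN network, the VLAN interface can be added afterward. The following is a minimal sketch, assuming an iproute2 ip command with VLAN support (older systems may provide only the vconfig tool). It borrows eth0 and VLAN ID 538 from Listing 1; the address 192.0.2.10/24, the helper names, and the DRY_RUN switch are placeholders of mine.

```shell
#!/bin/sh
# Sketch: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1: parent interface, $2: VLAN ID, $3: address with prefix length
add_vlan() {
    parent="$1"; vid="$2"; addr="$3"
    run ip link add link "$parent" name "vlan$vid" type vlan id "$vid" || return 1
    run ip addr add "$addr" dev "vlan$vid" || return 1
    run ip link set dev "vlan$vid" up
}

# Usage (as root):
#   add_vlan eth0 538 192.0.2.10/24
```

Interfaces created this way do not survive a reboot; a permanent setup belongs in the distribution's network configuration files.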


Bonus: Two SUSE bugs highlighted

Finally, here are two SUSE-related bugs you should be aware of.

Bug 1: VLAN tag problem in some SLES9

First up is a bug in the very old GA-level SLES9 kernel: it doesn't come up on VLAN-tagged hardware. It can cause a kernel panic (where the system outputs an error message to the console, dumps an image of kernel memory to disk for post-mortem debugging, and then either waits for the system to be manually rebooted or initiates an automatic reboot). The bug does not occur on the SLES9 SP4 kernel.

Bug 2: An Awk-ward script error in SLES10

Executing an Awk script on SLES10 and SLES10 SP2 results in the following error:

Listing 2. Executing an Awk script on SLES10 and SLES10 SP2
rapdistro7:~ # awk -f sample_test.awk get_build.messages
section 1
*** glibc detected *** awk: double free or corruption (fasttop): 0x0000000080055060 ***
======= Backtrace: =========

......

======= Memory map: ========

......

3ffff952000-3ffff967000 rw-p 3ffff952000 00:00 0 [stack]
Aborted
rapdistro7:~ #

The same script works fine on SLES9. A bug report for this problem has been opened in Bugzilla. The workaround is to set the locale with export LC_ALL=C (no spaces around the equals sign). Versions after SLES10 SP2 fix the problem.
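
If you would rather not change the locale for the whole session, the variable can be set for the awk call alone. Here is a small sketch; the wrapper name awk_c is hypothetical, and the file names in the usage line are the ones from Listing 2.

```shell
#!/bin/sh
# Sketch: run awk with the C locale for this one invocation only,
# leaving the rest of the shell session untouched.
awk_c() {
    LC_ALL=C awk "$@"
}

# Usage:
#   awk_c -f sample_test.awk get_build.messages
```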
